Source: arXiv:2605.10848
Authors: Tz-Huan Hsu (UWaterloo), Jheng-Hong Yang (Stencilzeit), Jimmy Lin (UWaterloo)
Date: May 2026
"Does a lexical retriever suffice as LLMs become more capable in an agentic loop?"
The paper argues that prior BM25 baselines scored low because of poor parameter configuration and shallow retrieval depth, not because lexical retrieval is fundamentally inadequate.
Answer: Yes. A well-configured lexical retriever (BM25 with k1=25, b=1, depth=1000) matches or beats dense-retriever baselines when paired with a capable LLM through a proper tool interface.
| System | Accuracy | Surfaced Recall | Total Cost |
|---|---|---|---|
| Pi-Serini (gpt-5.5) | 83.1% | 94.7% | $291.6 |
| Prior dense baseline (gpt-5 + qwen3) | 73.0% | 79.0% | $360.7 |
| Pi-Serini (deepseek-v4-flash) | 68.1% | 94.5% | $28.9 |
| Pi-Serini (claude-opus-4.7) | 69.8% | — | $246.6 |
Tuning impact (default BM25 → tuned BM25): accuracy +18.0 percentage points (64.0% → 82.0% on a 100-query subset), surfaced recall +11.1%
Depth impact (k=5 → k=1000): surfaced recall +25.3 percentage points (70.5% → 95.8%)
Cost reduction vs. dense baselines: 3.3×–10×
Pi-Serini is a deliberately minimal search agent that isolates the agent–retriever interaction. It has three components:
The main abstraction layer between the LLM agent and the Anserini BM25 backend. It manages:
- A cached, session-local ranking (up to 32 search IDs)
- Paginated access to results
- Spill files for large outputs
Three tools decouple retrieval from context management:
- search(query): issues a BM25 query, retrieves up to 1000 documents, caches the ranking, and exposes only the top 5 excerpts
- read_search_results(search_id, offset, limit): browses the cached ranking without issuing a new backend query
- read_document(docid, offset, limit): reads a document in line-based chunks
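The tool interface described above can be sketched roughly as follows. This is a minimal illustration and not the Pi-Serini implementation: the `SearchSession` class, the `backend` callable (standing in for the Anserini BM25 backend), the `documents` dictionary, the search ID format, and the oldest-first eviction policy are all assumptions. Only the three tool names, the depth of 1000, the preview of 5, and the cap of 32 cached rankings come from the text.

```python
from collections import OrderedDict


class SearchSession:
    """Session-local ranking cache behind three tools (hypothetical sketch)."""

    def __init__(self, backend, documents, depth=1000, preview=5, max_searches=32):
        self.backend = backend            # callable: (query, k) -> [(docid, excerpt), ...]
        self.documents = documents        # docid -> full text
        self.depth = depth                # how many hits to fetch per query
        self.preview = preview            # how many excerpts search() exposes
        self.max_searches = max_searches  # cap on cached rankings
        self.rankings = OrderedDict()     # search_id -> cached ranking
        self._next_id = 0

    def search(self, query):
        """Query the backend once, cache the full ranking, return only a preview."""
        ranking = self.backend(query, self.depth)
        search_id = f"s{self._next_id}"
        self._next_id += 1
        self.rankings[search_id] = ranking
        # Assumed eviction policy: drop the oldest ranking once past the cap.
        while len(self.rankings) > self.max_searches:
            self.rankings.popitem(last=False)
        return {
            "search_id": search_id,
            "total": len(ranking),
            "results": ranking[: self.preview],
        }

    def read_search_results(self, search_id, offset, limit):
        """Page through a cached ranking; the backend is not queried again."""
        return self.rankings[search_id][offset : offset + limit]

    def read_document(self, docid, offset, limit):
        """Return `limit` lines of a document, starting at line `offset`."""
        return self.documents[docid].splitlines()[offset : offset + limit]
```

In this sketch the agent pays context only for the five previewed excerpts, while the remaining hits stay available through `read_search_results` at no extra retrieval cost.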
- Hard timeout: 300 seconds per query
- Submit steer at 0.7T (70% of the time budget T, i.e. 210 seconds of the 300-second timeout): injects a message forcing the agent to stop using tools and answer immediately. All three tools are blocked afterwards.
- Prevents runaway loops — prior work used ~74 tool calls/query; Pi-Serini uses ~15–24
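The time-budget logic in the bullets above can be sketched as a small phase function. This is an illustrative sketch under the assumption that T is the 300-second hard timeout; the function names and the phase labels ("explore", "submit", "timeout") are hypothetical and do not come from the paper.

```python
HARD_TIMEOUT_S = 300.0  # hard timeout per query, from the text
STEER_FRACTION = 0.7    # submit steer fires at 0.7T, from the text


def budget_phase(elapsed_s, timeout_s=HARD_TIMEOUT_S, steer_fraction=STEER_FRACTION):
    """Classify elapsed wall-clock time into a phase of the query budget."""
    if elapsed_s >= timeout_s:
        return "timeout"  # the query is cut off
    if elapsed_s >= steer_fraction * timeout_s:
        return "submit"   # steer message injected; the agent must answer now
    return "explore"      # tools are still available


def tools_allowed(elapsed_s):
    """Tools are usable only before the submit steer fires."""
    return budget_phase(elapsed_s) == "explore"
```

A harness along these lines would check `tools_allowed` before dispatching each tool call, which bounds a query by wall-clock time rather than by a fixed number of iterations.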
To understand agent behavior, Pi-Serini tracks:
- D_surfaced: returned by search
- D_previewed: excerpts shown via read_search_results
- D_opened: full document read via read_document
- D_cited: used in final answer
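One way to turn these four sets into per-stage recall figures is sketched below. It assumes recall is measured as the fraction of a query's relevant documents that reach each stage; the function name, the `gold` argument, and the document IDs in the usage example are hypothetical.

```python
def stage_recall(gold, stages):
    """Fraction of relevant documents that reached each stage.

    gold:   set of relevant docids for one query
    stages: mapping of stage name -> set of docids seen at that stage
    """
    if not gold:
        raise ValueError("gold set must be non-empty")
    return {name: len(docs & gold) / len(gold) for name, docs in stages.items()}


# Hypothetical example: four relevant documents, tracked through the funnel.
example = stage_recall(
    gold={"d1", "d2", "d3", "d4"},
    stages={
        "surfaced": {"d1", "d2", "d3", "x9"},
        "previewed": {"d1", "d2"},
        "opened": {"d1"},
        "cited": {"d1"},
    },
)
```

A drop between the surfaced and previewed values points at the agent's use of the ranking rather than at the retriever.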
The paper's central insight is that default BM25 parameters are tuned for passage-length text (~100 words), not long documents: BrowseComp-Plus documents have a median length of ~2k tokens and a 90th percentile of ~14k tokens.
| Parameter | Default (Anserini) | Tuned (Pi-Serini) |
|---|---|---|
| k1 | 0.9 | 25 |
| b | 0.4 | 1.0 |
A grid search over k1 ∈ [0, 32] and b ∈ [0, 1] shows that Anserini's default (k1=0.9, b=0.4) sits in a low-performing region. A high k1 delays term-frequency saturation, so repeated query terms keep adding to the score, and b=1.0 applies full document-length normalization, which keeps very long documents from winning on raw term counts alone.
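To see how the two parameters act, the term-frequency part of the textbook Okapi BM25 formula can be written out directly. This is a sketch of the standard formula with the IDF factor omitted, not Anserini's exact implementation, which differs in details; the function name and the sample inputs are chosen for illustration only.

```python
def bm25_tf_component(tf, doc_len, avg_doc_len, k1, b):
    """Term-frequency part of the textbook BM25 score (IDF omitted)."""
    # Length normalization: b=0 ignores document length, b=1 normalizes fully.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturation: the value approaches k1 + 1 as tf grows, so a larger k1
    # lets repeated occurrences keep contributing for longer.
    return tf * (k1 + 1.0) / (tf + k1 * norm)


# A term occurring 20 times in an average-length document.
default_score = bm25_tf_component(20, 1000, 1000, k1=0.9, b=0.4)
tuned_score = bm25_tf_component(20, 1000, 1000, k1=25, b=1.0)
```

For an average-length document, a single occurrence scores 1.0 under both settings. Twenty occurrences score about 1.8 with the defaults and about 11.6 with the tuned values, so the tuned configuration keeps rewarding repeated matches where the default has already flattened out.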
Result on a 100-query subset: accuracy jumps from 64.0% (default) to 82.0% (tuned).
| k (depth) | Surfaced Recall |
|---|---|
| 5 | 70.5% |
| 50 | ~89% |
| 1000 | 95.8% |
Previewed recall saturates at k=50 (~74.7%). Deeper rankings offer more opportunity, but agents don't automatically inspect them. The bottleneck shifts from "Can the retriever find it?" to "Can the agent recognize and spend context on evidence it already has?"
A key behavioral difference between models with similar costs:
- GPT-5.5 keeps candidate-specific probes reversible. If a probe fails (e.g., "Warrington", "Vinegar Strokes"), it returns to original clues (town population, spelling history).
- Claude Opus 4.7 tends to commit to a branch early and doesn't revisit alternatives even when initial probes are unproductive.
Since both models run in the same harness, this suggests that the LLM's search strategy matters as much as retriever choice.
Pi-Serini achieves massive cost reduction through two mechanisms:
1. No dense retriever overhead — BM25 is cheap to query
2. Prefix-cache-friendly loop: the same system prompt and repeated context mean that 82–90% of total tokens are served from cache, at 10–20% of the input price depending on the model
| Model | Input price | Output price | Cache read |
|---|---|---|---|
| deepseek-v4-flash | $0.14 | $0.28 | $0.028 |
| gpt-5.5 | $5.00 | $30.00 | $0.50 |
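The effect of prefix caching on input cost can be estimated with a short calculation. This is a simplified sketch: it treats the cached share as a fraction of input tokens and ignores output tokens, and the function name and the 0.85 cached fraction (a value inside the reported 82–90% range) are assumptions for illustration. The prices are the gpt-5.5 row of the table above.

```python
def blended_input_price(input_price, cache_price, cached_fraction):
    """Average price per input token when a share of tokens is read from cache.

    The result is in the same unit as the two prices passed in.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    return (1.0 - cached_fraction) * input_price + cached_fraction * cache_price


# gpt-5.5 prices from the table, with an assumed 85% of input read from cache.
blended = blended_input_price(input_price=5.00, cache_price=0.50, cached_fraction=0.85)
savings_ratio = blended / 5.00  # share of the uncached input price actually paid
```

Under these assumptions the blended input price is 1.175 in the table's price unit, a little under a quarter of the uncached input price.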
- Lexical retrieval is not dead for agentic search. A tuned BM25 with sufficient depth is competitive with dense retrievers at a fraction of the cost.
- The bottleneck has moved up the stack. The agent's ability to use surfaced evidence — not the retriever's ability to find it — is now the primary constraint.
- Default configurations are misleading. Papers comparing against BM25 should confirm they're using appropriate (not default) parameters for their document domain.
- Time budgets beat iteration caps. Steering with a hard timeout prevents runaway tool calls and reduces variance.
- Dataset: BrowseComp-Plus — 830 queries, 100,195 long documents (avg. 5,179 words/doc; 6.1 evidence docs, 2.9 gold docs per query)
- Subjects: Information Retrieval (cs.IR), Artificial Intelligence (cs.AI), Computation and Language (cs.CL)
- Code: github.com/justram/pi-serini
- Pages: 15 pages, 4 figures