Hybrid search: combining BM25 with embeddings

Pure vector search misses exact-match queries. Adding BM25 alongside it can lift recall@10 by ten to thirty percent on corpora heavy on identifiers, and changes the kinds of questions the product can answer at all.

Vector search is the default for retrieval in any new RAG system today, and reasonably so. It handles paraphrase, synonyms, multilingual queries, and most of the long tail of natural-language phrasing better than any lexical method. The trap is that the same property that makes embeddings flexible (their tolerance of surface form) also makes them weak on queries where the surface form is the signal. Issue 005 covered reranking, the precision pass that sits between retrieval and the LLM. This issue is one layer upstream: getting better candidates into the retrieval pool in the first place, by running a lexical retriever alongside the dense one and fusing the results.

Why pure vector search misses queries

Three families of query consistently underperform on dense embedding retrieval. Each has the same root cause and the same fix.

The first family is queries that contain rare or out-of-vocabulary tokens. A user types in a product code like SKU-49301-X or an error string like ECONNREFUSED. The embedding model has rarely or never seen these strings during training, so its tokeniser breaks them into subword pieces (EC, ONN, REF, USED for ECONNREFUSED, roughly) whose individual embeddings carry weak, generic signal. The pooled vector loses the specific identifier, and two documents that share ECONNREFUSED end up nowhere near each other in vector space, even though to a human the shared token is the only thing that matters.

The second family is code and identifier queries. A developer searches for useEffect cleanup function or parse_iso_8601. The vector embedding produces a generic semantic representation of "React hook cleanup" or "ISO 8601 parser", which is correct in spirit but does not necessarily prioritise documents that mention the literal identifier. BM25, by contrast, retrieves the documents with exact identifier matches in milliseconds.

The third family is multi-token exact phrases. A user searches for "minimum viable product" or "principle of least surprise". Dense models are largely insensitive to word order and adjacency, so they lose much of the bigram signal: a paragraph that uses the constituent words separately scores almost as high as a paragraph that uses the exact phrase. For citation-style retrieval, where the user is looking for a specific stable formulation, this is a real loss.

The combined effect is a class of queries on which dense retrieval is meaningfully worse than the lexical approach it was supposed to obsolete. Recall on the full query distribution drops, and the LLM that consumes the retrieval results gets the wrong documents on those queries while looking entirely confident about the wrong answer.

Why BM25 and embeddings complement each other

BM25 and dense embeddings make opposite tradeoffs.

BM25 is a sparse, lexical retriever. It scores documents by the overlap of query terms with document terms, weighted by inverse document frequency so that rare terms count more. It handles exact matches, identifiers, and rare proper nouns natively, but has no concept of meaning. For the query "cancel subscription", a document that says "delete account" instead is invisible to it, because no surface tokens overlap.
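To make the scoring concrete, here is a minimal BM25 sketch in plain Python. It assumes whitespace tokenisation, the common defaults k1 = 1.5 and b = 0.75, and a smoothed IDF of the kind Lucene uses. The function name bm25_scores and the toy corpus are illustrative and not taken from any particular library.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with a textbook BM25 formula."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    avg_len = sum(len(toks) for toks in tokenised) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for toks in tokenised for term in set(toks))

    scores = []
    for toks in tokenised:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # no surface overlap, no contribution
            # Smoothed IDF: terms that occur in fewer documents weigh more.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus, made up for illustration.
docs = [
    "how to cancel subscription from the billing page",
    "delete account permanently in settings",
    "fix ECONNREFUSED when the service cannot reach redis",
]
paraphrase = bm25_scores("cancel subscription", docs)
identifier = bm25_scores("ECONNREFUSED", docs)
```

On this toy corpus the "delete account" document scores exactly zero for the query "cancel subscription", while the ECONNREFUSED query matches only the document that contains the literal token.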

Dense embedding retrieval is the opposite. It compares a query and a document in a learned semantic space where "delete account" and "cancel subscription" sit close. It loses precision on the surface form because the embedding normalises away the literal tokens, which is exactly the property that made it bad on identifier queries above.

Each method covers the other's failure modes. A hybrid retriever that consults both and fuses the results catches the queries each would miss alone. The per-query cost is one extra BM25 call, which runs in parallel with the vector call, plus a few milliseconds of fusion compute; the per-document storage cost is a small sparse-index footprint. On standard retrieval benchmarks (BEIR is the canonical reference set), hybrid configurations typically lift recall@10 over dense-only by zero to thirty percent depending on the corpus and the dense baseline. The lift is largest on identifier-heavy corpora, where dense retrieval misses on surface form, and shrinks toward zero on paraphrase-heavy natural-language corpora where a strong dense model already dominates.

The two main fusion patterns

Hybrid retrieval is a fan-out followed by a merge. Both retrievers run in parallel against the same query, each returns its top-K, and a fusion step combines them into a single ranked list. The two patterns that dominate production are reciprocal rank fusion and linear score combination.

Reciprocal rank fusion (RRF). The merge ignores the underlying scores and uses only the rank position of each document in each list. The score of a document is the sum, across retrievers, of 1 / (k + rank), where rank is the document's position in that retriever's list and k is a smoothing constant typically set to 60. Documents that rank high in either list are boosted; documents that rank in both are boosted further. The pattern requires no score normalisation, no per-query tuning, and works robustly across retrievers whose scores are not comparable. It is the default starting point for hybrid retrieval today.
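Written out, the RRF score described above takes the following form. The notation is chosen here for illustration: R is the set of retrievers, rank_r(d) is the 1-indexed position of document d in retriever r's list, and a document missing from a list contributes nothing for that retriever.

```latex
\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}, \qquad k = 60

% Worked arithmetic with two retrievers and k = 60:
% rank 1 in one list only:  \frac{1}{61} \approx 0.0164
% rank 5 in both lists:     \frac{1}{65} + \frac{1}{65} = \frac{2}{65} \approx 0.0308
```

With these numbers, a document that both retrievers place fifth outranks a document that only one retriever places first, which is the consensus behaviour the paragraph describes.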

Linear score combination. The merge computes a weighted sum of the per-retriever scores, typically alpha * bm25_norm + (1 - alpha) * vector_norm, where each score is first normalised (min-max or z-score) into a common range and alpha is a tunable weight. The pattern is more flexible than RRF and can express preferences such as "weight BM25 more on the first quarter of the candidate list", but it requires score normalisation and per-corpus tuning. Most production teams begin with RRF and graduate to linear combination only when they need fine-grained control.
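A minimal sketch of linear score combination with per-query min-max normalisation follows. It makes two assumptions that a real system would want to revisit: a document missing from one retriever's list gets a normalised score of zero there, and ties after normalisation are broken alphabetically by doc_id. The function names and the toy scores are illustrative.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one retriever's scores for a single query into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates tied: treat them as equally strong.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def linear_fuse(
    bm25: dict[str, float], vector: dict[str, float], alpha: float = 0.5
) -> list[tuple[str, float]]:
    """Weighted sum alpha * bm25_norm + (1 - alpha) * vector_norm, best first."""
    bm25_norm, vector_norm = min_max(bm25), min_max(vector)
    fused = {
        doc_id: alpha * bm25_norm.get(doc_id, 0.0) + (1 - alpha) * vector_norm.get(doc_id, 0.0)
        for doc_id in bm25_norm.keys() | vector_norm.keys()
    }
    return sorted(fused.items(), key=lambda pair: (-pair[1], pair[0]))


bm25_raw = {"doc_a": 12.4, "doc_b": 7.1, "doc_c": 3.0}      # raw BM25 scores (unbounded)
vector_raw = {"doc_b": 0.82, "doc_c": 0.80, "doc_d": 0.61}  # cosine similarities
fused = linear_fuse(bm25_raw, vector_raw, alpha=0.5)
```

Here doc_b, which both retrievers rate well, ends up ahead of doc_a, which has the highest raw BM25 score but does not appear in the vector list at all. Changing alpha shifts that balance, which is the tuning burden RRF avoids.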

The diagram above shows the fan-out-and-merge architecture. The two retrievers run in parallel, each producing a top-K list. The fusion step combines them into a single ranked list whose top items are passed to the LLM, optionally through the reranker described in Issue 005. The pattern composes cleanly: hybrid sits upstream of reranking, and both sit upstream of generation.

A worked example: RRF in twelve lines

The implementation below accepts any number of ranked lists and returns the fused ranking. The BM25 and vector retrievers are substitutable; the merge is retriever-agnostic.

# rrf.py - reciprocal rank fusion across an arbitrary number of ranked lists.
def rrf(rank_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Each rank_list is a list of doc_ids in retriever-specific rank order."""
    scores: dict[str, float] = {}
    for rank_list in rank_lists:
        for rank, doc_id in enumerate(rank_list):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda pair: -pair[1])

# Usage:
bm25_ranked = bm25_retriever.search(query, top_k=50)        # list of doc_ids
vector_ranked = vector_retriever.search(query, top_k=50)
fused = rrf([bm25_ranked, vector_ranked])
top_10 = [doc_id for doc_id, _ in fused[:10]]

The smoothing constant k = 60 is the value introduced in the original Cormack, Clarke, and Buettcher RRF paper and remains the de facto default. A larger k reduces the dominance of the top-ranked documents in each list and is worth tuning only when the candidate lists are noisy or the corpus is small. The pattern adds at most a few milliseconds of compute; the dominant cost is the two retrievers running in parallel, which most retrieval stacks can do without architectural changes.
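The effect of k is easiest to see on a toy case. The sketch below repeats the same fusion logic so that it runs on its own, then fuses two hypothetical three-item lists at k = 0 and at k = 60. The doc_ids are made up.

```python
def rrf(rank_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Same fusion as above, repeated so this snippet runs on its own."""
    scores: dict[str, float] = {}
    for rank_list in rank_lists:
        for rank, doc_id in enumerate(rank_list, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: -pair[1])


# "shared" is third in both lists; every other doc appears in one list only.
bm25_ranked = ["only_bm25", "x", "shared"]
vector_ranked = ["only_vector", "y", "shared"]

winner_small_k = rrf([bm25_ranked, vector_ranked], k=0)[0][0]
winner_default_k = rrf([bm25_ranked, vector_ranked], k=60)[0][0]
```

At k = 0 a first place in a single list (score 1) beats third place in both lists (score 2/3), so one retriever's top hit wins. At k = 60 the per-rank differences flatten out and the document both retrievers agree on moves to the top.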

How production systems implement hybrid

Three operational patterns recur in production hybrid retrieval.

The first pattern is symmetric hybrid by default, where every query consults both retrievers and the fusion step decides the final ranking. This is the simplest setup and the right starting point for a heterogeneous query mix. The cost is the additional retriever call per query. For a tuned BM25 implementation, the call itself is typically under twenty milliseconds, and the RRF fusion adds a few milliseconds on top.
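One way to sketch the symmetric fan-out is with a thread pool, assuming each retriever is a blocking callable that takes a query and a top_k and returns ranked doc_ids. That Retriever interface, the hybrid_search name, and the two stand-in retrievers are assumptions made for illustration; real clients will have their own signatures.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Retriever = Callable[[str, int], list[str]]  # (query, top_k) -> ranked doc_ids


def hybrid_search(
    query: str, retrievers: list[Retriever], top_k: int = 50, final_k: int = 10, k: int = 60
) -> list[str]:
    """Run every retriever in parallel, then fuse the ranked lists with RRF."""
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        futures = [pool.submit(retrieve, query, top_k) for retrieve in retrievers]
        rank_lists = [future.result() for future in futures]

    scores: dict[str, float] = {}
    for rank_list in rank_lists:
        for rank, doc_id in enumerate(rank_list, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ranked = sorted(scores.items(), key=lambda pair: -pair[1])
    return [doc_id for doc_id, _ in ranked[:final_k]]


# Stand-in retrievers returning canned results; real ones would query the indexes.
def fake_bm25(query: str, top_k: int) -> list[str]:
    return ["doc_sku", "doc_faq", "doc_misc"][:top_k]


def fake_vector(query: str, top_k: int) -> list[str]:
    return ["doc_faq", "doc_guide", "doc_sku"][:top_k]


results = hybrid_search("SKU-49301-X return policy", [fake_bm25, fake_vector])
```

Because the calls overlap, the wall-clock cost is roughly that of the slower retriever plus the fusion step, rather than the sum of both.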

The second pattern is query routing, where a lightweight classifier or rule decides which retriever to consult based on the query shape. A query that contains an obvious identifier (a SKU- prefix, an ERR_ token, a UUID-shaped string) routes to BM25 alone; a free-form natural-language question routes to dense alone; ambiguous queries go to both. Routing reduces the per-query cost on the unambiguous cases at the price of an additional decision component that has to be evaluated and tuned. The pattern pays off above a few hundred queries per second, where the cost difference compounds.
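A rule-based router can be as small as a few regular expressions. The sketch below is deliberately crude, and its patterns and word-count thresholds are assumptions chosen for illustration; a production router would be evaluated against logged queries before it is trusted.

```python
import re

# Illustrative rules only: SKU- prefixes, ERR_ tokens, and UUID-shaped strings.
IDENTIFIER_PATTERNS = [
    re.compile(r"\bSKU-[\w-]+", re.IGNORECASE),
    re.compile(r"\bERR_[A-Z0-9_]+"),
    re.compile(
        r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
        re.IGNORECASE,
    ),
]


def route(query: str) -> str:
    """Return 'bm25', 'dense', or 'hybrid' for a query, using crude shape rules."""
    stripped = query.strip()
    has_identifier = any(p.search(stripped) for p in IDENTIFIER_PATTERNS)
    word_count = len(stripped.split())
    if has_identifier and word_count <= 2:
        return "bm25"      # the identifier is essentially the whole query
    if not has_identifier and word_count >= 4:
        return "dense"     # free-form natural-language question
    return "hybrid"        # ambiguous: consult both retrievers
```

Anything the rules cannot classify with confidence falls through to hybrid, so a routing mistake costs latency rather than recall.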

The third pattern is native vendor support. Weaviate's hybrid API, Qdrant's sparse vectors, Pinecone's sparse-dense vectors, and Elasticsearch's rrf retriever all implement variants of hybrid retrieval inside the database. The benefit is that the two retrievers operate against the same index, share the same filter pushdown, and return a single fused result. The cost is a small loss of flexibility on the fusion step and a stronger coupling to the vendor. For most teams, native support is the right choice once the team has validated that hybrid improves their metric.

Common mistakes

Four mistakes recur in early hybrid retrieval implementations.

The first is not measuring the lift on the team's own workload. Hybrid is consistently better on average across the BEIR benchmark set, but the actual gain on any specific corpus and query mix varies from roughly zero to thirty percent. Run the golden-set evaluation from Issue 003 with and without the BM25 leg before committing to hybrid as a permanent component, because the maintenance overhead is non-trivial if the lift is small.
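Measuring the lift needs little more than a recall@k function and two retrieval runs over the same queries. The sketch below assumes the golden set maps each query to a set of relevant doc_ids; that shape, the function names, and the toy data are assumptions and may not match the format used in Issue 003.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("each golden-set query needs at least one relevant doc_id")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall(runs: dict[str, list[str]], golden: dict[str, set[str]], k: int = 10) -> float:
    """Average recall@k over every query in the golden set."""
    return sum(recall_at_k(runs[q], golden[q], k) for q in golden) / len(golden)


# Toy golden set and retrieval runs; queries and doc_ids are made up.
golden = {"ECONNREFUSED redis": {"d1", "d2"}, "cancel my plan": {"d7"}}
dense_only = {"ECONNREFUSED redis": ["d9", "d1", "d4"], "cancel my plan": ["d7", "d3"]}
hybrid = {"ECONNREFUSED redis": ["d1", "d2", "d9"], "cancel my plan": ["d7", "d3"]}

lift = mean_recall(hybrid, golden) - mean_recall(dense_only, golden)
```

Logging recall per query as well as the mean shows which query families account for the lift, which matters later when deciding whether routing is worth the effort.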

The second is linearly combining scores without normalising them. Raw BM25 scores and raw cosine similarity scores live on different scales and have different distributions. Adding them with a constant alpha produces a ranking that is dominated by whichever score happens to have larger magnitude. Either normalise each score per-query (min-max into the zero-to-one range is the simplest approach) before combining, or use RRF, which avoids the problem by ignoring scores entirely.

The third is letting the BM25 index drift out of sync with the vector index. When a document is added, updated, or deleted, both indexes must reflect the change before the next query. A single-write path that updates both inside the same transaction, or a write-once-publish-twice queue, avoids the bug where a document exists in one index and not the other.
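A single write path can be sketched as a thin wrapper that owns both indexes. The upsert and delete interface below is hypothetical, and the in-memory index exists only to exercise the wrapper. The compensation step is also a simplification: on a failed update it removes the document from the lexical index instead of restoring the previous version, so a real system would still want a transaction or a retry queue behind it.

```python
class DualIndexWriter:
    """Single write path that keeps a lexical and a vector index in step.

    Both indexes are assumed to expose upsert(doc_id, payload) and
    delete(doc_id); that interface is hypothetical.
    """

    def __init__(self, lexical_index, vector_index):
        self.lexical_index = lexical_index
        self.vector_index = vector_index

    def upsert(self, doc_id: str, text: str, embedding: list[float]) -> None:
        self.lexical_index.upsert(doc_id, text)
        try:
            self.vector_index.upsert(doc_id, embedding)
        except Exception:
            # Compensate so the document never exists in only one index.
            self.lexical_index.delete(doc_id)
            raise

    def delete(self, doc_id: str) -> None:
        self.lexical_index.delete(doc_id)
        self.vector_index.delete(doc_id)


class InMemoryIndex:
    """Toy stand-in for a real index, used only to exercise the writer."""

    def __init__(self, fail_on_upsert: bool = False):
        self.docs: dict[str, object] = {}
        self.fail_on_upsert = fail_on_upsert

    def upsert(self, doc_id: str, payload: object) -> None:
        if self.fail_on_upsert:
            raise RuntimeError("simulated index outage")
        self.docs[doc_id] = payload

    def delete(self, doc_id: str) -> None:
        self.docs.pop(doc_id, None)
```

Routing every add, update, and delete through one object like this makes drift a bug in a single place instead of a property of the whole ingestion pipeline.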

The fourth is skipping hybrid for queries that "obviously do not need it". A senior reader will recognise the trap: until the query mix is measured, the obviousness is a guess. Run hybrid on every query during the validation phase, log the per-query lift, and only then optimise away the queries where hybrid provides no benefit.

Summary

Pure vector search is good for paraphrase and bad for exact form. Pure lexical search is the opposite. Hybrid retrieval runs both in parallel and fuses the results, recovering the queries each loses alone. The default fusion is reciprocal rank fusion, which requires no score normalisation or per-query tuning and works robustly across heterogeneous retrievers. Production patterns range from symmetric hybrid for any query, through routed hybrid for cost-sensitive workloads, to native vendor support once a team has measured the lift on its own corpus. Combined with the reranking pass from Issue 005, hybrid retrieval gives the LLM the best chance of seeing the right documents at the top of its context.

Production checklist

Add a BM25 retriever in parallel to the existing dense retriever. Self-host using rank_bm25 for a quick start, or use the BM25 implementation built into Elasticsearch, OpenSearch, Qdrant, or Weaviate.
Implement reciprocal rank fusion with k = 60 as the default merge. Reach for linear score combination only after RRF has been validated and a specific tuning need has been identified.
Measure recall@10 with and without the BM25 leg against the golden-set evaluation from Issue 003. Commit to hybrid only if the lift is meaningful on the queries the product actually receives.
Ensure both indexes are written in the same transaction or via a single publish path, so the two retrievers cannot drift out of sync.
Cap each retriever's top-K at fifty for the fan-out, and pass the fused top-ten to the reranker described in Issue 005.
Track per-query latency for each retriever as a separate metric, so a BM25 slowdown does not hide inside the aggregate retrieval latency number.
Consider native vendor hybrid (Weaviate, Qdrant, Pinecone, Elasticsearch) once the lift is validated, because operating one index is cheaper than operating two when the abstraction is right.
Re-evaluate the fusion choice annually. Improvements in BM25 implementations and learning-to-rank fusion methods continue to shift the cost-benefit balance.
