Reranking 101: the 50ms layer that decides whether your RAG works.

This issue covers what reranking is, why retrieval alone is not enough, the two families of reranker that dominate production usage, the cost and latency tradeoff between them, and the decision matrix for picking one. Issue 002 was about selecting the vector database that returns good candidates; this issue is about what to do between those candidates and the LLM call.

Why retrieval alone falls short

A standard RAG retriever is a bi-encoder: a model that embeds the query and every document independently, then ranks documents by cosine similarity in the embedding space. The bi-encoder is fast at scale because the document embeddings are precomputed; the query embedding is the only work done at request time. Hundreds of millions of documents can be searched in tens of milliseconds.
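The ranking step itself is small. The following is a minimal sketch, assuming the document embeddings have already been computed and are held in memory as plain lists; the function names and the toy two-dimensional vectors are illustrative, and a real system would use an approximate nearest-neighbour index rather than scoring every document.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_cosine(query_vec, doc_vecs, top_k=3):
    """Return the indices of the top_k documents, most similar first."""
    scored = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda s: -s[0])
    return [i for _, i in scored[:top_k]]

# Toy "embeddings" (hypothetical values; precomputed offline in a real system).
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = [1.0, 0.1]          # the only embedding computed at request time
top = rank_by_cosine(query_vec, doc_vecs, top_k=2)
```

Note that the query and each document only ever meet as two finished vectors, which is exactly the limitation the next section describes.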

The cost of this speed is precision. The bi-encoder never sees the query and a document together. It optimises for "this document is generally similar to this query", which is close to relevance but not the same thing. A document that uses the same vocabulary as the query, covers a similar topic, or semantically gestures at the right answer will score highly even when it does not actually contain the information the user needs. Bi-encoders also struggle with negation, multi-aspect queries, and queries that mix specific entities with general questions.

The result is a top-K retrieval that contains the right answer somewhere in the list, but not at position one. Recall is often high; precision at the top is not. The LLM that generates from that context then has to either find the relevant document in 50 candidates (expensive, and easily distracted by the other 49) or work from the top three to five candidates (often the wrong ones). Either way, output quality suffers in a way that is hard to debug because the bug is not in the LLM call.

The two-stage retrieval pattern

Reranking solves this by adding a precision pass after the retrieval pass.

The pattern is straightforward. Stage one retrieves a generously large candidate set, typically 25 to 100 documents, using the bi-encoder. Stage two reranks that candidate set with a slower but more accurate model that does see the query and each document together. The top three to five results of stage two are what gets passed to the LLM. The retrieval pass optimises for recall at scale; the rerank pass optimises for precision on a bounded set.

The diagram above shows the two-stage architecture. The bi-encoder retrieval handles the volume, and the reranker handles the precision. Each stage uses a model that is well suited to its scale: a fast, precomputed bi-encoder for the retrieval over the full corpus, and a slower but more accurate model for the rerank over a bounded candidate set. The combined pipeline gives you both high recall over a large corpus and high precision on the top results.

Two families of reranker dominate production usage today, and they make different tradeoffs.

Family one: cross-encoder rerankers

A cross-encoder takes the query and a candidate document as a single concatenated input, runs them through a transformer, and produces a single relevance score. Because the model sees both inputs together, it can attend to whether each part of the query is actually addressed by the document. The result is consistently better ranking than a bi-encoder can deliver, paid for with far more compute per query-document pair at request time.

The cost of seeing both inputs is that ranking N candidates requires N separate forward passes. Practical latency for a 50-candidate rerank is in the range of 50 to 300 milliseconds, depending on the model size, the average document length, and the hardware. Latency stays predictable because the work scales linearly with candidate count.

Cross-encoders come in two flavours: hosted APIs and self-hosted open-source models. Both are well represented in production.

Hosted APIs include Cohere Rerank, Voyage's rerank-2, and Mixedbread's reranker. Cohere's published pricing as of late 2025 is approximately two dollars per one thousand "search units", where a search unit covers up to one hundred documents (longer documents are split into chunks, and each chunk counts as a document). A typical 50-document rerank of short documents is one search unit, so the effective cost is around two dollars per one thousand reranks, or about a fifth of a cent per query. The advantage of a hosted API is that there is no model to operate; the disadvantage is that the per-query cost stays flat regardless of scale.
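The billing arithmetic can be sketched as follows. This is one reading of the search-unit description above, not a vendor-supplied formula; the constants repeat the approximate figures quoted in this issue, and the function names are made up for illustration.

```python
import math

# Figures taken from the description above; confirm against current vendor pricing.
USD_PER_1000_SEARCH_UNITS = 2.00
DOCS_PER_SEARCH_UNIT = 100

def search_units(chunks_per_doc):
    """Search units billed for one rerank call.

    chunks_per_doc[i] is how many billable chunks document i counts as
    (1 for a short document, more for a long one).
    """
    billable_docs = sum(chunks_per_doc)
    return max(1, math.ceil(billable_docs / DOCS_PER_SEARCH_UNIT))

def usd_per_1000_reranks(chunks_per_doc):
    """Cost of 1000 identical rerank calls at the assumed price."""
    return search_units(chunks_per_doc) * USD_PER_1000_SEARCH_UNITS

short_docs = usd_per_1000_reranks([1] * 50)   # 50 short documents: 1 search unit
long_docs = usd_per_1000_reranks([3] * 50)    # 50 documents of 3 chunks each: 2 search units
```

The second case is the one to watch: long documents can double the bill without any change in candidate count.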

Self-hosted options include BAAI/bge-reranker-v2-m3 and the older cross-encoder/ms-marco-MiniLM-L-6-v2. The BGE rerankers are among the strongest open-source rerank models as of late 2025 and the default many teams pick when they need to keep the precision pass on their own infrastructure. A bge-reranker-v2-m3 instance falls within the 50 to 300 millisecond range above, with the exact latency depending on the GPU class and the average document length. The cost reduces to the GPU instance amortised across throughput, which at moderate scale is significantly cheaper per query than any hosted API. Below a few requests per second, however, the GPU sits idle and the hosted API wins on cost.
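A rough way to locate the crossover for your own workload is to compare a fixed hourly cost with a per-query price. The sketch below assumes a single all-in hourly figure for self-hosting; both input values in the example are hypothetical placeholders, not measured or quoted prices.

```python
def break_even_qps(self_hosted_usd_per_hour, hosted_usd_per_rerank):
    """Sustained requests per second at which self-hosting matches the hosted bill."""
    hosted_usd_per_hour_at_1_qps = hosted_usd_per_rerank * 3600
    return self_hosted_usd_per_hour / hosted_usd_per_hour_at_1_qps

# Hypothetical inputs: an all-in self-hosting cost of $7.20 per hour
# (instance, redundancy, operations) and $0.002 per hosted rerank.
qps = break_even_qps(7.20, 0.002)
```

With these placeholder numbers the two options cost the same at about one sustained request per second. The all-in figure matters more than the raw instance price, because redundancy and operational time belong in it.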

Family two: LLM rerankers

The other approach is to use a general-purpose LLM as the reranker. The interface is a prompt: give the model the query and a list of candidates, ask it to return the top K in order, and treat its output as the ranking. The flexibility is that you can specify any rubric (recency, source preference, multi-aspect relevance, "prefer documents that contain a numeric answer") and the model will apply it. A cross-encoder, by contrast, returns a single relevance score that bakes the rubric in at training time.

The cost of that flexibility is latency and money. Even with a small fast model, sending 50 documents at a few hundred tokens each to an LLM and waiting for it to consider them takes hundreds of milliseconds to a few seconds. At typical late-2025 pricing, a 50-document rerank ranges from below a cent per query with the cheapest small models (GPT-4o-mini, Gemini Flash), through around a cent with Claude Haiku, through a few cents with mid-tier models (Claude Sonnet, GPT-4o), to ten cents or more with larger reasoning models. The per-query cost can dominate the rest of the RAG pipeline once traffic scales.
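The per-query figure is mostly input tokens multiplied by price, which makes it easy to estimate before building anything. The sketch below ignores output tokens, since the reply is a short list of identifiers; the per-million-token prices in the example are illustrative placeholders rather than quotes for any particular model.

```python
def llm_rerank_cost_usd(n_docs, tokens_per_doc, usd_per_million_input_tokens,
                        prompt_overhead_tokens=200):
    """Approximate input-token cost of one LLM rerank call."""
    input_tokens = n_docs * tokens_per_doc + prompt_overhead_tokens
    return input_tokens * usd_per_million_input_tokens / 1_000_000

# 50 candidates at roughly 300 tokens each, plus instructions.
small_model = llm_rerank_cost_usd(50, 300, 0.15)   # placeholder price: well under a cent
mid_model = llm_rerank_cost_usd(50, 300, 3.00)     # placeholder price: a few cents
```

Halving the candidate count or the chunk length halves the bill, which is part of why the hybrid pattern described later is attractive.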

LLM rerank also takes more maintenance than a cross-encoder. The prompt is a moving artefact: it has to be tuned, evaluated against the same golden set you built for Issue 003, and re-tuned when a model upgrade shifts the ranking behaviour. A cross-encoder, by comparison, is pick-once-and-stop infrastructure.

The decision matrix

The following table summarises the tradeoff. The numbers are approximate ranges drawn from vendor documentation and community benchmarks as of late 2025; they are starting points, not measured optima for your workload.

Dimension	Cross-encoder	LLM rerank
Latency for 50 candidates	50 to 300 ms	500 ms to 2 s
Cost per 1000 reranks of 50 docs	$1 to $2 hosted; lower per-query if self-hosted at moderate scale	$1 to $10 with small models; $10 to $100 with mid-tier models
Flexibility	Single relevance score, fixed by the model	Arbitrary rubric, custom instructions per query
Maintenance	Pick a model once	Prompt iteration, eval re-runs at each model upgrade
Quality on standard tasks	Strong	Excellent at the high end; variable at the low end

A second diagram captures the practical choice.

The decision tree starts with latency because most production RAG has a hard latency budget that the LLM generation call itself consumes most of. Cross-encoder reranking adds 50 to 300 milliseconds; LLM reranking can add a second or more. If the budget is tight, cross-encoder is the only realistic answer. If the budget is generous and the relevance rubric is genuinely beyond what a single score can capture, LLM rerank earns its cost. The hybrid pattern, in which a cross-encoder narrows 50 candidates to 20 and an LLM reranks those 20 against a custom rubric, captures most of the LLM-rerank quality at a fraction of the cost and is increasingly common at scale.
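The hybrid pattern can be sketched in a few lines. Both models are passed in as plain callables here, so the names `cross_score` and `llm_rank` and their signatures are assumptions for illustration, not the API of any particular library.

```python
def hybrid_rerank(query, docs, cross_score, llm_rank, rubric,
                  narrow_to=20, top_k=5):
    """Cross-encoder narrows the candidates; an LLM orders the shortlist.

    cross_score(query, docs) -> one relevance score per document.
    llm_rank(query, docs, rubric) -> indices into docs, best first.
    """
    scores = cross_score(query, docs)
    order = sorted(range(len(docs)), key=lambda i: -scores[i])[:narrow_to]
    shortlist = [docs[i] for i in order]
    llm_order = llm_rank(query, shortlist, rubric)
    return [shortlist[i] for i in llm_order[:top_k]]

# Stand-in models so the sketch runs without any external service.
fake_cross = lambda q, ds: [len(d) for d in ds]                 # longer text scores higher
fake_llm = lambda q, ds, rubric: list(reversed(range(len(ds)))) # reverses the shortlist

result = hybrid_rerank("q", ["a", "bbb", "cc", "dddd"], fake_cross, fake_llm,
                       rubric="prefer recent sources", narrow_to=3, top_k=2)
```

The LLM only ever sees `narrow_to` documents, so its cost and latency scale with the shortlist rather than with the stage-one candidate count.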

A worked example: cross-encoder rerank in twelve lines

The following example uses sentence-transformers and BAAI/bge-reranker-v2-m3. The retriever and document type are placeholders; substitute your own.

# rerank.py - two-stage retrieval with a cross-encoder reranker.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)

def search(query: str, retriever, top_k: int = 5) -> list:
    candidates = retriever.search(query, top_k=50)        # stage 1: recall
    pairs = [(query, doc.text) for doc in candidates]
    scores = reranker.predict(pairs)                       # stage 2: precision
    ranked = sorted(zip(candidates, scores), key=lambda p: -p[1])
    return [doc for doc, _ in ranked[:top_k]]

The function returns the top five documents by reranker score, having considered fifty. The bi-encoder retriever and the cross-encoder reranker are completely decoupled, so either side can be swapped without changing the other. A production version adds latency tracking per stage, score logging for offline evaluation, and a feature flag so the reranker can be bypassed during a vendor outage or a debugging session.
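A minimal sketch of two of those production additions, per-stage timing and a bypass flag, might look like the following. The retriever and reranker are passed in as callables with assumed signatures, and the flag is a plain argument; in practice it would come from whatever feature-flag mechanism you already run.

```python
import time

def search_with_flag(query, retrieve, rerank_scores, top_k=5,
                     candidates_k=50, rerank_enabled=True):
    """Two-stage search that records per-stage latency and can skip stage two.

    retrieve(query, k) -> list of candidates (assumed signature).
    rerank_scores(query, candidates) -> one score per candidate (assumed signature).
    """
    timings = {}

    start = time.perf_counter()
    candidates = list(retrieve(query, candidates_k))
    timings["retrieve_ms"] = (time.perf_counter() - start) * 1000

    if not rerank_enabled:
        # Bypass: fall back to the retriever's own ordering.
        return candidates[:top_k], timings

    start = time.perf_counter()
    scores = rerank_scores(query, candidates)
    timings["rerank_ms"] = (time.perf_counter() - start) * 1000

    ranked = sorted(zip(candidates, scores), key=lambda p: -p[1])
    return [doc for doc, _ in ranked[:top_k]], timings

# Stand-ins so the sketch runs without a real index or model.
fake_retrieve = lambda q, k: ["a", "b", "c"][:k]
fake_scores = lambda q, docs: [0.1, 0.9, 0.5]

with_rerank, t_on = search_with_flag("q", fake_retrieve, fake_scores, top_k=2)
without_rerank, t_off = search_with_flag("q", fake_retrieve, fake_scores,
                                         top_k=2, rerank_enabled=False)
```

Returning the timings alongside the results keeps the function free of any dependency on a specific metrics client.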

For the LLM-rerank equivalent, the same overall shape applies: the function calls a model with the candidates and asks for an ordered list of identifiers. The prompt becomes the moving artefact and has to be evaluated against the same golden set after every change.
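As a sketch of that shape, the following builds a prompt, calls a model, and parses the reply defensively. The `complete` argument is a hypothetical callable that takes a prompt string and returns the model's text; the prompt wording is a placeholder and is exactly the part that needs evaluation against the golden set.

```python
import re

def build_rerank_prompt(query, docs, top_k, rubric="relevance to the query"):
    """Number each candidate and ask for an ordered list of identifiers."""
    lines = [
        f"Query: {query}",
        f"Rank the documents below by {rubric}.",
        f"Reply with only the identifiers of the best {top_k}, best first, "
        "separated by commas.",
        "",
    ]
    lines += [f"[{i}] {doc}" for i, doc in enumerate(docs)]
    return "\n".join(lines)

def parse_ranking(reply, n_docs, top_k):
    """Extract valid, de-duplicated identifiers in the order the model gave them."""
    seen, order = set(), []
    for token in re.findall(r"\d+", reply):
        i = int(token)
        if i < n_docs and i not in seen:
            seen.add(i)
            order.append(i)
    return order[:top_k]

def llm_rerank(query, docs, complete, top_k=5):
    """complete(prompt) -> str is an assumed interface to whichever LLM you use."""
    reply = complete(build_rerank_prompt(query, docs, top_k))
    return [docs[i] for i in parse_ranking(reply, len(docs), top_k)]

docs = ["pricing page", "changelog", "refund policy"]
reranked = llm_rerank("how do refunds work", docs, complete=lambda p: "2, 0", top_k=2)
```

The parser drops out-of-range and repeated identifiers because model output is not guaranteed to follow the instructions. It will also pick up any stray number in a chatty reply, so a production version would more likely request structured output.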

Common mistakes

Four mistakes recur in early reranking implementations.

The first is reranking the entire corpus rather than a candidate set. A reranker is expensive per pair, so using it on more than a few hundred candidates per query removes the speed advantage of the bi-encoder upstream and pushes latency past most production budgets. The correct pattern is to retrieve generously (25 to 100 candidates) and then rerank a bounded set.

The second is adopting a reranker without measuring whether it helps. Reranking has a non-trivial latency cost and, for some retrieval setups, the precision gain is marginal. Run the golden-set evaluation introduced in Issue 003 with and without the reranker before committing to the additional component. If the lift is below five percent on the metric you actually care about, the maintenance overhead may not be worth carrying.
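A with-and-without comparison needs very little code. The sketch below uses hit rate at k as a stand-in metric and reads "lift" as relative improvement over the baseline; both choices are assumptions, and the query and document identifiers are made up.

```python
def hit_rate_at_k(results, golden, k=5):
    """Fraction of golden queries with at least one relevant document in the top k.

    results: query -> ranked list of document ids.
    golden:  query -> set of relevant document ids.
    """
    hits = sum(
        1 for query, relevant in golden.items()
        if relevant & set(results.get(query, [])[:k])
    )
    return hits / len(golden)

def relative_lift(baseline, candidate):
    """Relative improvement of candidate over baseline."""
    if baseline == 0:
        return float("inf") if candidate > 0 else 0.0
    return (candidate - baseline) / baseline

golden = {"q1": {"d1"}, "q2": {"d7"}, "q3": {"d4"}, "q4": {"d9"}}
retrieval_only = {"q1": ["d1", "d2"], "q2": ["d3", "d7"], "q3": ["d5", "d6"], "q4": ["d8", "d2"]}
with_reranker = {"q1": ["d1", "d2"], "q2": ["d7", "d3"], "q3": ["d5", "d6"], "q4": ["d8", "d2"]}

before = hit_rate_at_k(retrieval_only, golden, k=1)
after = hit_rate_at_k(with_reranker, golden, k=1)
lift = relative_lift(before, after)
```

In this toy data the reranker fixes one query where the relevant document was retrieved at position two, and it cannot help the two queries where the relevant document was never retrieved at all.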

The third is mixing the rerank score with the retrieval score in an ad-hoc way. The two scores live on different scales and are produced by different models, so blending them with a hand-tuned weight tends to introduce subtle ranking bugs. The safer pattern is to treat the rerank score as the final ranking signal and to ignore the retrieval score after stage one.

The fourth is logging rerank inputs without logging the rerank outputs and scores. The most useful artefact for debugging a regression is the per-query record of which candidates were retrieved, what scores the reranker gave them, and which ones were ultimately passed to the LLM. Without that record, every reranking issue becomes a black box, and the debugging path defaults to re-running the query by hand, which is slow under ordinary conditions and unbearable under incident pressure.
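One possible shape for that per-query record is a single JSON line. The field names below are illustrative rather than any standard schema.

```python
import json
import time

def rerank_log_record(query, candidate_ids, scores, selected_ids):
    """Serialise one request's retrieval and rerank outcome as a JSON line."""
    return json.dumps({
        "ts": time.time(),
        "query": query,
        # Candidates in retrieval order, each with the score the reranker gave it.
        "candidates": [
            {"id": cid, "rerank_score": float(score)}
            for cid, score in zip(candidate_ids, scores)
        ],
        # What was actually passed to the LLM.
        "selected": list(selected_ids),
    })

line = rerank_log_record("how do refunds work",
                         ["d3", "d7", "d1"], [0.12, 0.91, 0.48], ["d7", "d1"])
record = json.loads(line)
```

Casting scores with `float` matters if the reranker returns NumPy values, which the standard `json` module cannot serialise directly.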

Summary

Reranking is the precision pass that sits between bi-encoder retrieval and LLM generation. It exists because retrieval optimises for recall while LLMs need precision, and the two are not the same problem. Two families of reranker dominate: cross-encoders that produce a single relevance score in 50 to 300 milliseconds at a sub-cent per-query cost, and LLM rerankers that can apply arbitrary rubrics at anywhere from under a cent to ten cents or more per query and roughly half a second to a few seconds of additional latency. Most production RAG should start with a cross-encoder, measure the lift against a golden set, and graduate to LLM rerank or to a hybrid only when the relevance rubric outgrows what a single score can express.

Production checklist

Add a reranker between bi-encoder retrieval and the LLM call. Start with BAAI/bge-reranker-v2-m3 self-hosted or Cohere Rerank as a hosted alternative.
Retrieve 50 candidates at stage one, rerank to the top five at stage two. Tune both numbers from observed quality and latency, not from defaults.
Measure the rerank lift against the golden-set evaluation from Issue 003 before committing to the reranker as a permanent component.
Track per-query latency and cost for the rerank stage as a separate metric, sliced by tenant and by prompt template.
Log the candidate set, the rerank scores, and the final selection per request, so any regression is reproducible against a known input.
If a custom relevance rubric is needed (recency weighting, source preference, multi-aspect), evaluate LLM rerank or the hybrid pattern (cross-encoder narrows to 20, LLM reranks those) against the same golden set rather than committing on intuition.
Add a feature flag that bypasses the reranker, so the stage can be turned off during a vendor outage or a regression investigation without redeploying the application.
Re-evaluate the choice of reranker quarterly. The space moves quickly, and the best open-source model two quarters ago is rarely the best open-source model today.