The vector DB benchmark lies about your workload. Here is the matrix that doesn't.

You read the benchmark. Pinecone says they are 100x faster. Weaviate says they have the best ranking. Qdrant says they are cheaper. pgvector says you do not need a separate system. You pick one, you build the POC, you launch. Six months later your bill is twice what you modelled, your p99 has crept up to 800ms under real filters, and your cold-start latency is breaking onboarding. None of this surfaced in the benchmark. It never does.

The single most useful framing for picking a vector DB is the opposite of the benchmark frame. Measure the axes that get expensive at production scale. Those are not the same axes the benchmark measures.

Why benchmarks lie

Three structural reasons

They measure unfiltered queries. Almost every public benchmark tests recall and throughput on filterless ANN search. Production is the opposite: 80 to 95 percent of real queries carry at least one filter, usually tenant or ACL, often a date range or content type. Filter pushdown is the single largest source of latency variance between vendors, and the benchmark hides it because the benchmark never turned it on.

They pick the dataset that suits them. SIFT1M, GIST1M, GloVe, MS MARCO. Each vendor publishes its wins on whichever dataset its index structure happens to be best at. Your corpus is unlikely to match any of them, and the differences in dimensionality, distribution, and cardinality are large enough to flip rankings.

They report steady-state, not edge. Benchmarks measure throughput once the index is warm and the system is in equilibrium. Production hits the edge cases: cold start after autoscale, index rebuild during deploy, throttling under burst load, noisy neighbours in a multi-tenant cluster. Benchmarks omit all of them, and they are where production bills get made.

The result is that benchmark wins do not predict production fit. They predict "this vendor's marketing team picked a workload they could win on". Pick the vendor whose worst axis you can live with, not the one whose best axis won the benchmark.

The axes that actually matter

Eight axes for a production decision, in roughly the order they bite.

1. Cost model. Usage-based per read/write unit plus storage (Pinecone serverless), per-instance-hour (Pinecone pods, Qdrant, Weaviate cloud), or "free" because it runs on your existing Postgres (pgvector). Each has a different break-even curve. The free tier never predicts your real bill. Calculate at projected 12-month scale, not at POC scale.

2. Filter pushdown. Does the filter narrow the ANN candidate set during search, or run as a post-filter on retrieved results? Post-filter ANN is the single most expensive surprise people hit in production - if the filter matches 5 percent of the corpus and the index returns top-20, on average only one of those twenty results survives the filter when you wanted twenty. The fix is filter pushdown into the index, which not every vendor handles equally. (See the Production RAG Eval Cheatsheet, entry 13.)

3. p99 latency under realistic filters. Most benchmarks publish p50 unfiltered. The number that bites you is p99 with a 5 percent selectivity filter on a 10M vector corpus, at peak QPS. Vendors can differ by 5 to 10 times on this number, and standard public suites such as ANN-Benchmarks will not show it, because they run unfiltered queries.

4. Cold-start behaviour. Serverless and autoscaling architectures pay a cold-start tax. Pinecone serverless cold start is widely reported in the hundreds of milliseconds to seconds for sparsely-queried namespaces. pgvector with HNSW index loading can stall the first request after a Postgres restart. Qdrant's memory-mapped index has a different shape again. Pick the cold-start behaviour you can live with, and tag cold requests in your traces (see the RAG profiling issue from last week).

5. Multi-tenancy primitives. Three patterns. Namespaces (Pinecone). Filter-based on every query (most others). Per-tenant index or collection (high isolation, high cost). The choice shapes both your ops surface and your cost scaling, and getting it wrong forces a migration when you cross from 50 tenants to 5000.

6. Hybrid search and reranking. BM25 + vector fusion is built in on Weaviate, available on Qdrant via sparse vectors and on Pinecone via its sparse-dense API, and do-it-yourself on Postgres by combining tsvector full-text search with pgvector. For workloads with rare keywords, IDs, or error codes - which is to say almost any real corpus with proper nouns - hybrid wins, and built-in saves meaningful integration time.

7. Ops burden. Fully managed (Pinecone), managed open-source cloud (Weaviate Cloud, Qdrant Cloud), self-hosted (Weaviate, Qdrant, pgvector). Each has a different on-call cost. For a two-engineer team, ops burden is often the dominant decision factor and the one benchmarks never show.

8. Lock-in and portability. Open-source format, standard APIs, export paths. The cost of being wrong about a vector DB is migration, and migration cost compounds with re-embedding if you change models at the same time.
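The post-filter arithmetic in axis 2 is easy to check for your own top_k and selectivity. The sketch below is a toy model, not any vendor's implementation: it assumes the filter is independent of vector similarity, so each ANN hit passes the filter with probability equal to the selectivity. The function names are made up for illustration.

```python
import math
import random

def expected_post_filter_hits(top_k: int, selectivity: float) -> float:
    """Expected number of ANN results that survive a post-filter."""
    return top_k * selectivity

def simulate_post_filter(top_k: int, selectivity: float,
                         trials: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of surviving results per query (toy model)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # Each of the top_k hits independently passes the filter.
        total += sum(1 for _ in range(top_k) if rng.random() < selectivity)
    return total / trials

def overfetch_needed(wanted: int, selectivity: float) -> int:
    """Candidates to fetch so that the expected survivors reach `wanted`."""
    # The small epsilon guards against float noise pushing ceil() up by one.
    return math.ceil(wanted / selectivity - 1e-9)

print(expected_post_filter_hits(20, 0.05))  # expected survivors at top-20, 5% filter
print(overfetch_needed(20, 0.05))           # candidates needed to expect 20 survivors
```

Under this model, getting twenty filtered results out of a post-filtering store at 5 percent selectivity means fetching roughly 400 candidates per query, which is where the latency and cost surprise comes from. Real filters are often correlated with similarity, so treat the output as an order-of-magnitude guide.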

The decision matrix

Numbers below are starting reference points from public benchmarks, vendor documentation, and community reports as of late 2025. They exist to tell you which axes to measure, not to substitute for measurement on your own workload.

Axis	Pinecone (serverless)	Weaviate (cloud)	Qdrant (cloud)	pgvector (Postgres)
Cost model	Per read/write unit, storage	Per-instance-hour	Per-instance-hour or free tier	Your existing Postgres cost
Filter pushdown	Native, strong	Native, strong	Native, strong (payload index)	HNSW + filter has caveats at low selectivity
p99 under 5% filter, 10M vec	~100 to 300ms	~80 to 200ms	~50 to 150ms	~150 to 500ms
Cold start	Hundreds of ms to seconds	Cluster-managed, low	Instance startup, low	Postgres restart / index reload
Multi-tenancy	Namespaces native	Filter-based, multi-tenant collections (v2)	Filter-based, multi-tenant collections	Filter-based, row-level security
Hybrid search	Sparse-dense API	Built-in BM25 + vector	Sparse vectors + payload	`tsvector` + `pgvector` (DIY)
Ops burden	None (fully managed)	Low (managed cluster)	Low to medium	Already-running Postgres
Lock-in	Closed format, API-shaped portability	Open core, portable	Open-source, fully portable	Open-source, fully portable

The p99 row is the row most worth re-measuring yourself. Run the benchmark below against each vendor with your real query distribution, your real filter selectivity, your real top_k, and your real vector dimension. Three runs each: once warm, once after restart, once at peak hour. The numbers you get will not match the public benchmarks, and the ranking may not match either.

# benchmark_vector_store.py - measure p50 and p99 on your real workload.
# Use the same wrapper against each vendor; fill in client-specific query().

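import math
import statistics
import time

def percentile(sorted_ms, pct):
    """Nearest-rank percentile of an ascending, non-empty list (pct is an integer)."""
    rank = max(1, math.ceil(pct * len(sorted_ms) / 100))
    return sorted_ms[rank - 1]

def benchmark(client, queries, *, k=10, filter_fn=None, repeats=3):
    """queries: list of (vec, sample_filter_args). filter_fn(args) -> vendor-specific filter."""
    if not queries or repeats < 1:
        raise ValueError("need at least one query and one repeat")
    latencies_ms = []
    for vec, filter_args in queries:
        # Build the filter outside the timed region so only the query is measured.
        kwargs = {"vector": vec, "top_k": k}
        if filter_fn and filter_args is not None:
            kwargs["filter"] = filter_fn(filter_args)
        # Back-to-back repeats of one query can hit caches and flatter the tail.
        for _ in range(repeats):
            t0 = time.perf_counter()
            client.query(**kwargs)
            latencies_ms.append((time.perf_counter() - t0) * 1000)
    ls = sorted(latencies_ms)
    return {
        "n": len(ls),
        "p50_ms": percentile(ls, 50),
        "p95_ms": percentile(ls, 95),
        "p99_ms": percentile(ls, 99),
        "mean_ms": statistics.mean(ls),
    }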

# Run three times for each vendor:
#   1. Warm: index just loaded, several warmup queries first
#   2. Cold: restart / scale-down / scale-up immediately before run
#   3. Peak: schedule against the time of day your real traffic peaks
# Compare warm-vs-cold (cold-start cost) and warm-vs-peak (queueing cost).

Unlike vendor benchmarks, this harness uses your queries, your filters, and your scale shape. The numbers you get fill the rows of the matrix accurately for your workload, which is the only workload that matters.
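Before pointing the wrapper at a real vendor, it can help to dry-run it against a stand-in. The sketch below is a hypothetical in-memory client exposing the same query(vector, top_k, filter) shape the wrapper expects, plus a small helper that turns the warm, cold, and peak result dicts into the two comparisons named in the comments above. InMemoryClient and compare_runs are illustrative names, not part of any vendor SDK, and the filter here is simply a Python predicate over metadata.

```python
import math

class InMemoryClient:
    """Hypothetical brute-force stand-in for a vendor client (exact search, no index)."""

    def __init__(self, rows):
        # rows: list of (doc_id, vector, metadata_dict)
        self.rows = rows

    def query(self, vector, top_k, filter=None):
        # Apply the filter first, then rank survivors by Euclidean distance.
        candidates = [r for r in self.rows if filter is None or filter(r[2])]
        candidates.sort(key=lambda r: math.dist(vector, r[1]))
        return [r[0] for r in candidates[:top_k]]

def compare_runs(warm, cold, peak):
    """Summarise three benchmark() result dicts by their p99."""
    return {
        "cold_start_cost_ms": cold["p99_ms"] - warm["p99_ms"],
        "queueing_cost_ms": peak["p99_ms"] - warm["p99_ms"],
    }

rows = [
    ("a", [0.0, 0.0], {"tenant": 1}),
    ("b", [1.0, 1.0], {"tenant": 2}),
    ("c", [0.1, 0.0], {"tenant": 1}),
]
client = InMemoryClient(rows)
print(client.query([0.0, 0.0], top_k=2))
print(client.query([0.0, 0.0], top_k=2, filter=lambda m: m["tenant"] == 2))
```

For a real vendor, the adapter's query method would translate these three arguments into that vendor's own call, so the timing loop stays identical across stores.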

Heuristics from real production

Four shortcuts that beat reading more benchmarks.

Under 1M vectors with Postgres already running, use pgvector. The instinct to reach for a "real vector DB" at this scale is almost always wrong. You add an operational surface, a new query path, a new cost line, a new failure mode, and a new vendor relationship to get performance you do not need. pgvector with HNSW handles this scale on a single instance and gives you transactional joins to your other tables for free.

If 90 percent of your queries are filtered, filter performance trumps raw ANN throughput. Most benchmarks invert this. Run a filtered version of any benchmark before you commit. If the vendor cannot run a filtered benchmark at your scale, that itself is the answer.

If your team is under five engineers, ops burden is your biggest hidden cost. Self-hosted Qdrant or Weaviate is often faster per dollar than Pinecone in steady state, but if you cannot afford the on-call to keep it healthy through upgrades, scaling events, and the inevitable 2am page, that math does not hold. Managed costs more per query and less per engineer hour.
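The managed-versus-self-hosted trade-off can be made concrete with a rough monthly cost model. The sketch below is a simplification under stated assumptions: the managed option is billed per million queries plus storage, the self-hosted option per instance-hour plus on-call engineer time. Every number in the example call is a placeholder, not a real vendor price; substitute your own quotes and salary figures.

```python
HOURS_PER_MONTH = 730  # average hours in a calendar month

def managed_monthly_cost(queries_per_month, price_per_million_queries, storage_cost):
    """Usage-based bill: query volume plus a flat storage line."""
    return queries_per_month / 1_000_000 * price_per_million_queries + storage_cost

def self_hosted_monthly_cost(instances, instance_hourly_rate,
                             oncall_hours, engineer_hourly_cost):
    """Instance bill plus the engineer time spent keeping the cluster healthy."""
    infra = instances * instance_hourly_rate * HOURS_PER_MONTH
    people = oncall_hours * engineer_hourly_cost
    return infra + people

def break_even_queries(self_hosted_cost, price_per_million_queries, storage_cost):
    """Monthly query volume above which self-hosting becomes the cheaper option."""
    billable = self_hosted_cost - storage_cost
    return max(0.0, billable / price_per_million_queries * 1_000_000)

# Placeholder inputs only: 2 instances, 10 on-call hours a month.
self_hosted = self_hosted_monthly_cost(2, 0.50, 10, 100.0)
print(self_hosted)
print(break_even_queries(self_hosted, 8.0, 130.0))
```

The point of the on-call term is that it often dominates for small teams: with the placeholder inputs above it is larger than the instance bill itself.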

Plan for the migration you swear you will not do. Vector DB decisions often get revisited within the first two years as scale, query shape, or cost economics shift. Pick a vendor whose data is portable to your second choice, and bake the migration plan into the original decision. The plan does not have to be detailed; it has to exist.

Migration realities

The cost of being wrong about a vector DB:

Re-embedding. If you change embedding models during the migration, which is a common temptation, you pay the embedding token bill for the whole corpus a second time. See Cheatsheet entry 7 on embedding model swaps.
Dual-write window. During cutover you write to both stores. Storage doubles, write cost doubles, and you carry two ops surfaces.
Re-tuning. Scores are not comparable across vendors even with the same embedding model. Your threshold for abstention (Cheatsheet 14), your top_k (Cheatsheet 11), your rerank weights, all need re-validation.
Re-testing. Your eval set runs against the new store before you flip. Production traces from the old store need to be replayable so you can compare.
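One way to compare replayed traces across stores is to look at which documents come back rather than at their scores, since the scores are not comparable across vendors. The sketch below computes overlap at k between the old and new store for each replayed query and flags the ones that diverge. The trace format, the function names, and the 0.8 threshold are assumptions for illustration.

```python
def overlap_at_k(old_ids, new_ids, k):
    """Fraction of the old store's top-k document IDs that the new store also returns."""
    old_top = set(old_ids[:k])
    new_top = set(new_ids[:k])
    if not old_top:
        # Nothing to compare against: agree only if both are empty.
        return 1.0 if not new_top else 0.0
    return len(old_top & new_top) / len(old_top)

def replay_report(traces, k, min_overlap=0.8):
    """traces: list of (query_id, old_ids, new_ids) from replaying production queries."""
    scores = {qid: overlap_at_k(old, new, k) for qid, old, new in traces}
    flagged = sorted(qid for qid, score in scores.items() if score < min_overlap)
    mean = sum(scores.values()) / len(scores) if scores else 0.0
    return {"mean_overlap": mean, "flagged": flagged}

traces = [
    ("q1", ["a", "b"], ["b", "a"]),  # same documents, different order
    ("q2", ["a", "b"], ["c", "d"]),  # completely different results
]
print(replay_report(traces, k=2))
```

Low overlap is not automatically a regression, because the new store may be returning better results. It is a signal for which queries to put in front of the eval set first.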

Six weeks is a reasonable estimate for a non-trivial migration with a real eval gate. Plan for it the day you sign the contract for the first vendor, not the day you decide to switch.

Production checklist

Run the benchmark above on your real workload, with your filters, your dimensions, your peak QPS, and your real top_k.
Measure p99 under filters, not p50 unfiltered.
Test cold start explicitly, after a scale-up or restart.
Calculate cost at 12-month projected scale, with your real read/write ratio.
Account for multi-tenant isolation cost in the cost model.
Verify filter pushdown by checking returned result counts against k.
Run a migration dry-run before you commit, so you know the exit cost.
Re-evaluate annually. The vector DB landscape moves faster than most infra decisions.
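The result-count check from the list above can be scripted. The sketch below assumes you know, from your source of truth, how many documents match the filter, and that you have recorded how many results each filtered query returned. A store that post-filters tends to come back short of k even when plenty of matching documents exist. The function name and return shape are illustrative.

```python
def pushdown_check(result_counts, k, matching_docs):
    """Flag filtered queries that returned fewer results than they should have.

    result_counts: number of results returned by each filtered query
    k:             the top_k that was requested
    matching_docs: documents in the corpus that satisfy the filter
    """
    if not result_counts:
        raise ValueError("need at least one query result count")
    # A correct filtered search can return at most min(k, matching_docs) results.
    expected = min(k, matching_docs)
    short = sum(1 for count in result_counts if count < expected)
    return {
        "expected": expected,
        "short_queries": short,
        "short_fraction": short / len(result_counts),
    }

print(pushdown_check([10, 10, 3, 1], k=10, matching_docs=5000))
```

A non-zero short fraction is a prompt to investigate, not proof of post-filtering: approximate indexes can also return short result sets for other reasons, such as search parameters set too low.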

The takeaway

Benchmarks measure what is easy to measure. Production cost is shaped by what they leave out: filter performance, cold start, multi-tenant overhead, ops burden, lock-in. The matrix is not a verdict. It is a map of the axes you should measure on your own workload. Pick the vendor whose worst axis you can live with, not the one whose best axis won the benchmark.
