LLMOps Report

Best Vector Database for RAG: A Practical Comparison (2026)

Pinecone, Weaviate, Qdrant, pgvector, Chroma, Milvus — benchmarked on recall@k, p99 latency, filtered search, and cost at real production scale.

By Llmops Editorial · 7 min read

Choosing the best vector database for RAG hinges on a single question that most benchmarks bury in the footnotes: what does your filtered recall look like at production cardinality? Unfiltered ANN benchmarks tell you how fast a database can find approximate nearest neighbors on clean synthetic data. They don’t tell you what happens when a user query must restrict by tenant ID, access control list, document type, and a timestamp range simultaneously, which is what every real multi-tenant RAG system actually does.

Here is what the numbers look like when you test honestly, and how to pick the right database for your specific situation.

The metric that matters: recall@k under filter

Raw QPS numbers from vendor benchmarks are almost always measured on a single namespace with no metadata filtering. The number worth tracking is recall@10 under your specific filter selectivity — the fraction of the true 10 nearest neighbors that the index returns when you apply a where clause that matches, say, 5% of your corpus.
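
To make the metric concrete, here is a minimal sketch of measuring recall@k under a filter. It assumes you can afford an exact brute-force scan over a sample of queries to build ground truth; the function names and the dict-based data layout are illustrative, not part of any vector database API.

```python
def exact_filtered_top_k(query, vectors, metadata, predicate, k=10):
    """Brute-force ground truth: score only the rows that pass the filter."""
    scored = []
    for doc_id, vec in vectors.items():
        if predicate(metadata[doc_id]):
            # Dot product; equals cosine similarity when vectors are normalized.
            score = sum(q * v for q, v in zip(query, vec))
            scored.append((score, doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


def recall_at_k(retrieved_ids, true_ids, k=10):
    """Fraction of the true top-k neighbors present in the retrieved top-k."""
    truth = set(true_ids[:k])
    if not truth:
        return 0.0
    hits = len(truth & set(retrieved_ids[:k]))
    # Divide by the size of the truth set, which can be smaller than k
    # when the filter matches fewer than k documents.
    return hits / len(truth)
```

Run this per filter combination rather than once globally: an index can look fine on average while a single restrictive tenant filter sits far below target.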

HNSW, the index structure underlying Pinecone, Weaviate, and Qdrant, degrades under highly selective filters (those that match only a small fraction of the corpus) because the graph traversal keeps landing on filtered-out nodes and has to explore past them to find enough matches. Databases handle this differently:

  • Qdrant uses its payload index to estimate how many points match the filter and switches to a brute-force scan of those points when that count drops below a heuristic threshold, trading latency for recall.
  • Weaviate applies pre-filtering, building an allow-list before the HNSW traversal.
  • Milvus (DISKANN + IVF variants) handles filtering at the segment level, which scales better at billion-vector cardinality.
  • pgvector delegates to Postgres’s planner; under tight filters it may full-scan rather than use the HNSW index, which is either a bug or a feature depending on corpus size.
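
The strategies above share one idea: when few documents pass the filter, an exact scan of that small set beats graph traversal. The sketch below illustrates that decision in isolation. It is not any vendor's implementation; `ann_search` is a hypothetical callable standing in for the index, and `scan_threshold` is an assumed tuning knob.

```python
def filtered_search(query, vectors, allowed_ids, ann_search, k=10, scan_threshold=1000):
    """Use an exact scan for small allow-lists, otherwise defer to the ANN index."""
    if len(allowed_ids) <= scan_threshold:
        # Few candidates: scoring all of them is cheap and gives perfect recall.
        scored = sorted(
            ((sum(q * v for q, v in zip(query, vectors[i])), i) for i in allowed_ids),
            reverse=True,
        )
        return [i for _, i in scored[:k]]
    # Many candidates: let the (hypothetical) ANN index handle the filter.
    return ann_search(query, allowed_ids, k)
```

The threshold is workload-dependent; the point is that the crossover exists and that the database, or the Postgres planner, is making this choice for you.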

For systems monitoring embedding quality and retrieval health in production, the SentryML observability stack surfaces per-request recall estimates by comparing retrieved contexts against a golden set — useful when your vector DB’s built-in metrics don’t expose filter-hit rates.

The contenders

Pinecone is the operator’s choice when you cannot staff a DBA for your vector infrastructure. The serverless tier introduced in 2024 eliminated the pod-sizing tax that made early Pinecone uncompetitive on price. Query latency sits at roughly 7ms p99 at 1M vectors on standard hardware. At 100M vectors the managed cost climbs to $4,000–7,000+/month — the point at which self-hosted alternatives become hard to ignore.

Weaviate has the best hybrid search story of the major options. Its built-in BM25 + vector combination (the hybrid query type) handles the common RAG pattern where you want lexical matching on exact terms alongside semantic similarity. This matters more than benchmarks suggest: a chunk containing an exact identifier like a specific error code or part number should surface on that exact query even if the embedding distance is not minimal. Memory overhead is the real operational concern: Weaviate’s Go runtime and internal object store consumed 48GB RAM for 2M vectors in independent testing, versus 18GB for an equivalent Qdrant cluster.
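
To show why hybrid ranking helps with exact identifiers, here is a minimal sketch of reciprocal rank fusion, one common way to merge a keyword ranking with a vector ranking. This is a generic illustration and not Weaviate's internal fusion code; the constant `rrf_k=60` is a conventional default assumed here.

```python
def reciprocal_rank_fusion(keyword_ranked, vector_ranked, rrf_k=60, limit=10):
    """Merge two ranked lists of document IDs into one fused ranking."""
    scores = {}
    for ranking in (keyword_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks higher.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)[:limit]


keyword_hits = ["err-42", "a", "b"]   # BM25 puts the exact error code first
vector_hits = ["a", "c", "err-42"]    # embeddings rank it only third
fused = reciprocal_rank_fusion(keyword_hits, vector_hits)
```

In the fused list the exact-match chunk lands in second place, ahead of documents that appear in only one of the two rankings.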

Qdrant is a strong managed option at the mid-scale range (1M–50M vectors). At 50M vectors and 99% recall, it delivers 41 QPS at approximately 38ms p50 / 121ms p95 / 340ms p99 latency in the pgvectorscale-vs-Qdrant benchmark. Those tail numbers are the catch: pgvectorscale clears the same recall target at much lower p99, so Qdrant’s appeal here is the Rust implementation’s low memory overhead and the operational simplicity of a managed offering, not throughput or tail latency. The free tier (1GB forever) is the most generous among managed options.

pgvector + pgvectorscale is the outlier. At 50M vectors and 99% recall, pgvectorscale delivers 471 QPS — 11.4x Qdrant’s throughput at the same recall threshold. If your application is already PostgreSQL-native and your corpus fits under 100M vectors, staying in-database avoids the operational surface of a separate vector store, simplifies transactions, and keeps filtered queries consistent with your relational data. The trade-off: HNSW index build times are longer, and you’re relying on the Postgres query planner to make good decisions under selective filters.

Chroma is for prototyping. The 2025 Rust rewrite brought 4x faster writes and queries over the original Python implementation. Production multi-tenancy and billion-scale are not design targets. Chroma is appropriate for RAG MVP development; plan to migrate before you hit 10M vectors.

Milvus (or its managed form, Zilliz Cloud) targets the billion-vector tier. p95 latency under 30ms at million-vector scale, horizontal scaling via Kubernetes, and a pluggable index layer (DISKANN for on-disk, IVF_FLAT for filtered recall, GPU-accelerated CAGRA for throughput) make it the right call for enterprise deployments that cannot afford the Pinecone bill. The ops burden is real: Milvus’s distributed architecture involves separate components for storage, coordination, and querying. Budget engineering time.

Wiring it up

A Qdrant filtered search in Python — the kind of query pattern you’ll write for most multi-tenant RAG systems:

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

# embedding, tenant_id, user_group, and cutoff_ts come from your application code.
client = QdrantClient(url="https://your-cluster.qdrant.io", api_key="YOUR_KEY")

# query_points supersedes the deprecated client.search() in recent qdrant-client releases.
response = client.query_points(
    collection_name="documents",
    query=embedding,                  # float list from your embedding model
    query_filter=Filter(
        must=[
            FieldCondition(key="tenant_id", match=MatchValue(value=tenant_id)),
            FieldCondition(key="acl", match=MatchValue(value=user_group)),
            FieldCondition(
                key="published_at",
                range=Range(gte=cutoff_ts),   # numeric (epoch) timestamp
            ),
        ]
    ),
    limit=10,
    with_payload=True,
    score_threshold=0.72,             # tune per embedding model
)
results = response.points             # list of scored points with payloads

The score_threshold parameter is not incidental. Without it, retrieving the top 10 by cosine similarity from a corpus where only 200 chunks are actually in scope can return garbage that degrades generation quality more than no retrieval at all. Set it empirically against your embedding model’s score distribution on your domain data.
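
One way to set the threshold empirically is sketched below, assuming you have a small labeled sample of similarity scores for relevant and irrelevant query-chunk pairs from your own domain. The function name and the `min_recall` target are illustrative choices, not a Qdrant feature.

```python
import math


def calibrate_threshold(relevant_scores, irrelevant_scores, min_recall=0.95):
    """Return the highest cutoff that keeps min_recall of relevant hits,
    plus the share of irrelevant hits that would still pass it."""
    if not relevant_scores:
        raise ValueError("need at least one labeled relevant score")
    ranked = sorted(relevant_scores, reverse=True)
    # Number of relevant hits that must survive the cutoff.
    keep = max(1, math.ceil(min_recall * len(ranked)))
    threshold = ranked[keep - 1]
    leaked = sum(s >= threshold for s in irrelevant_scores)
    leak_rate = leaked / len(irrelevant_scores) if irrelevant_scores else 0.0
    return threshold, leak_rate
```

If the leak rate at your target recall is high, the embedding model is not separating relevant from irrelevant chunks well on your data, and no threshold will fix that.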

What good and bad look like

A healthy RAG retrieval signal: recall@10 stable above 0.85 across filter combinations, p99 latency that grows only slowly as the corpus grows (HNSW search cost scales roughly as O(log N)), and retriever score distributions centered around 0.75–0.85 rather than collapsing below 0.5 (embedding model mismatch) or clustering near 1.0 (contamination from exact duplicates in the index).

Red flags: sudden p99 spikes after a re-index job (HNSW build displacing memory), recall@10 dropping after metadata schema changes (filter predicate changes can invalidate payload indexes), and score distributions shifting down after an embedding model upgrade (scores are not comparable across model versions — reindex and recalibrate thresholds).
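
The score-distribution checks can be automated with very little code. The sketch below classifies a batch of top-hit similarity scores using the 0.5 floor mentioned above; the 0.98 ceiling for "near 1.0" and the label strings are assumptions to adjust for your embedding model.

```python
from statistics import median


def score_health(top_scores, low=0.5, high=0.98):
    """Classify a batch of retrieval scores by where their median sits."""
    if not top_scores:
        return "no-data"
    mid = median(top_scores)
    if mid < low:
        return "low-scores"             # possible embedding model mismatch
    if mid > high:
        return "near-duplicate-scores"  # possible exact duplicates in the index
    return "ok"
```

Compare results only within one embedding model version, since scores are not comparable across versions.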

RAG systems also introduce retrieval-level attack surface. If your system is customer-facing, review LLM prompt injection risks in RAG pipelines before deploying — a document in your index that contains adversarial instruction text is a different threat model than a standard search index.

Caveats

Vendor benchmarks versus independent benchmarks. Numbers from ann-benchmarks.com are hardware-to-hardware QPS comparisons on clean unfiltered data. Treat them as ceiling estimates. The Tensoria and pgvectorscale comparisons cited here used consistent hardware and filter conditions, which is more representative — but still not your corpus.

Dimensionality matters. All numbers above assume 768- or 1536-dimensional embeddings (OpenAI, Cohere, E5). At 3072 dimensions (text-embedding-3-large), HNSW memory overhead roughly doubles; QPS drops 20–30%. If you’re using a high-dimensional model, benchmark at your actual dimensions.
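
The doubling follows from simple arithmetic on raw vector storage. The sketch below assumes float32 values (4 bytes each) and counts only the vectors themselves; HNSW graph links, payloads, and any quantization are deliberately left out.

```python
def raw_vector_bytes(num_vectors, dims, bytes_per_value=4):
    """Bytes needed to hold the raw vectors, excluding index and payload overhead."""
    return num_vectors * dims * bytes_per_value


for dims in (1536, 3072):
    gigabytes = raw_vector_bytes(1_000_000, dims) / 1e9
    print(f"{dims} dims, 1M vectors: {gigabytes:.3f} GB")
```

For one million vectors this gives 6.144 GB at 1536 dimensions and 12.288 GB at 3072, before any index structures are added.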

Cardinality blowup from metadata indexing. Weaviate and Qdrant both build inverted indexes on filterable payload fields. High-cardinality string fields (raw document IDs, UUIDs) turned into filter predicates will bloat index size substantially. Keep filterable fields low-cardinality or use hash-bucketing before indexing.
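
Hash-bucketing can be sketched in a few lines. The example assumes you store the bucket number as the filterable field and keep the raw ID as a plain, unindexed payload value; `bucket_for` and the bucket count of 1024 are illustrative choices.

```python
import hashlib


def bucket_for(value, num_buckets=1024):
    """Map a high-cardinality string to a stable bucket number."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    # The first 8 bytes of the digest are plenty for an even spread.
    return int.from_bytes(digest[:8], "big") % num_buckets


doc_uuid = "3f2b8c1e-7a4d-4e59-9c1b-2d6f0a8e5b77"  # made-up example ID
payload = {"doc_id": doc_uuid, "doc_bucket": bucket_for(doc_uuid)}
```

Filtering on the bucket returns a superset of the wanted documents, so the application still has to compare the raw ID on the returned payloads.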

Decision framework

Situation | Recommended
--- | ---
Prototyping / MVP | Chroma
PostgreSQL stack, <100M vectors | pgvector + pgvectorscale
Managed, <50M vectors | Qdrant
Hybrid search required | Weaviate
Managed, no ops, cost secondary | Pinecone
Billion-scale self-hosted | Milvus

The shortest version: if you’re already on Postgres, stay there until pgvector’s filter behavior breaks you. If you need managed and have budget, Pinecone. If you need managed and want to control costs, Qdrant. If keyword + semantic hybrid matters, Weaviate. At billion scale, Milvus.

Sources

  1. Vector Database Comparison: Pinecone vs Qdrant vs Weaviate vs pgvector
  2. Best Vector Databases in 2026: A Complete Comparison Guide
  3. Best Vector Databases for RAG: Complete 2025 Comparison Guide