Vector Databases in 2026: Choosing the Right One for RAG

A practical 2026 guide to picking a vector database for RAG and embeddings. Deep comparison of pgvector, Qdrant, Pinecone, Weaviate, Milvus, and Chroma across indexing algorithms (HNSW, IVFFlat, IVF-PQ, DiskANN), distance metrics, filtered and hybrid search, quantization, scale and latency tradeoffs, self-host vs managed, and real cost -- with a decision table by use case, production operations, and runnable client code.

Vector Databases, pgvector, Qdrant, Pinecone, Weaviate, Milvus, Chroma, HNSW, IVFFlat, DiskANN, Quantization, Hybrid Search, Embeddings, ANN, RAG, PostgreSQL

1. Vector Databases: Overview

A vector database stores high-dimensional embedding vectors and answers one question fast: "which stored vectors are closest to this query vector?" That nearest-neighbor search is the retrieval engine behind RAG, semantic search, recommendations, deduplication, and image or audio search. The database is not what makes your results good -- your embedding model sets the quality ceiling -- but it is what makes retrieval fast, filterable, durable, and affordable at scale. For RAG specifically, it is the component that turns a pile of embeddings into sub-100ms grounded context.

A vector database does three jobs: store vectors plus their metadata (payload) durably, index them with an approximate-nearest-neighbor (ANN) structure so search does not scan every row, and search with a distance metric while applying metadata filters. In 2026 the real choice is not "which database has vector search" -- almost all do -- but which tradeoff between recall, latency, filtering power, operational burden, and cost fits your RAG workload. This guide compares the six that matter: pgvector, Qdrant, Pinecone, Weaviate, Milvus, and Chroma.

VECTORS

Embeddings as Vectors

Each chunk of text (or image, or audio) is turned into a fixed-length array of floats -- typically 384 to 3072 dimensions -- by an embedding model. Semantically similar content maps to nearby points in that space. The vector database stores these arrays as a vector(N) column or a named vector field. The dimension is fixed per collection and must match your embedding model exactly; changing models means re-embedding the whole corpus. Store the source text and metadata alongside the vector so retrieval can return usable context, not just IDs.

ANN

Approximate Nearest Neighbor

Exact nearest-neighbor search is O(n) per query -- fine for 10K vectors, hopeless at 10M. ANN indexes (HNSW, IVF, DiskANN) trade a small amount of recall for orders-of-magnitude speedup by searching only a promising subset of the space. You tune a knob (HNSW ef_search, IVF nprobe) that moves along the recall-vs-latency curve. "Recall@10 = 0.98" means the ANN index returned 98% of the true top-10 neighbors. Every production vector database is, at its core, an ANN index with storage and filtering bolted on.

QUERY PATH

The Query Path

A RAG query does: (1) embed the user question with the same model used at ingestion, (2) send the query vector plus any metadata filter to the database, (3) the ANN index returns the top-k nearest vectors with scores and payloads, (4) optionally rerank or fuse with keyword results, (5) hand the retrieved text to the LLM. The database owns steps 2-3. A well-tuned index keeps that at 5-50ms even over millions of vectors, leaving your latency budget for embedding and generation.

RECALL/LATENCY

Recall vs Latency Tradeoff

There is no free lunch: higher recall costs more latency and memory. Pushing HNSW ef_search from 40 to 200 raises recall but multiplies query time; a larger m improves recall but grows the index. The right operating point is workload-specific -- a legal search tool needs recall@10 near 1.0 and can tolerate 100ms, while an autocomplete needs 5ms and tolerates 0.9. Measure recall against a brute-force ground truth on a sample, then tune the knob to your latency SLO rather than guessing.
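As a minimal sketch of that measurement, the snippet below builds a brute-force ground truth and scores a candidate result list against it. The vectors, the query, and the `approx` list (standing in for whatever your ANN index returned) are made-up illustration data, and `exact_top_k` and `recall_at_k` are hypothetical helper names, not part of any database client.

```python
import heapq

def sq_l2(a, b):
    """Squared Euclidean distance; it preserves the L2 ranking."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_top_k(query, vectors, k):
    """Brute-force ground truth: ids of the k nearest vectors, nearest first."""
    return heapq.nsmallest(k, range(len(vectors)),
                           key=lambda i: sq_l2(query, vectors[i]))

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate result contains."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Toy data, purely illustrative.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [5.0, 1.0]]
query = [0.1, 0.1]

truth = exact_top_k(query, vectors, k=3)
# Assumption: pretend an ANN index missed one true neighbor and returned id 3.
approx = [0, 1, 3]
recall = recall_at_k(approx, truth)
print(truth, recall)
```

In practice you would average this over a few hundred sampled queries, and repeat it with your production filters applied.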

PAYLOAD

Metadata & Payloads

Real RAG almost never does pure vector search. You filter: only this tenant, only docs from 2025+, only content the user may see, only this document type. The database stores structured metadata (JSON payload, or typed columns in pgvector) alongside each vector and lets you constrain search by it. How well a database combines filtering with ANN search -- pre-filter vs post-filter, and whether it keeps returning full top-k under strict filters -- is one of the biggest practical differentiators between engines.

RAG FIT

Where It Fits in RAG

In a RAG stack the vector database sits between ingestion (chunk -> embed -> upsert) and generation (retrieve -> rerank -> prompt). It is one dependency among several: an embedding model, an optional reranker, an LLM, and often Redis for caching. Choosing it well means matching its scale ceiling, filtering model, and hosting story to the rest of your architecture -- not chasing the fastest benchmark number. For the full end-to-end picture, see the RAG pipelines guide.

2. Indexing Algorithms (HNSW, IVFFlat, DiskANN)

The index is the single biggest determinant of a vector database's speed, recall, memory use, and build time. Almost every engine defaults to HNSW today, but IVF variants and DiskANN matter for large or memory-constrained corpora. Understanding what each does -- and which knobs move recall vs latency -- lets you tune any database instead of trusting defaults. The families below are the ones you will actually meet in pgvector, Qdrant, Pinecone, Weaviate, and Milvus.

DEFAULT

HNSW (Graph)

Hierarchical Navigable Small World builds a multi-layer graph where each vector links to its nearest neighbors; search greedily walks the graph from a top-layer entry point down to the base layer. It is the default in Qdrant and Weaviate and the usual first choice in Milvus and pgvector (where you create the index explicitly) because it gives the best recall-latency tradeoff for in-memory workloads. Key knobs: m (neighbors per node, 16-64; higher = better recall, more memory), ef_construction (build-time candidate list, 100-500; higher = better graph, slower build), and query-time ef_search (higher = more recall, more latency). Downside: the whole graph lives in RAM alongside the vectors, so memory scales with vectors x dimensions plus the graph links.

CLUSTER

IVFFlat (Inverted File)

IVFFlat partitions the space into lists (Voronoi cells) via k-means, then at query time scans only the nprobe closest cells instead of the whole set. It builds far faster than HNSW and uses less memory, but recall is lower and it needs representative data to build good centroids -- create the index after loading data, not before. In pgvector, start with lists ~= rows/1000 for tables up to about 1M rows and ~sqrt(rows) above that, then raise probes at query time (sqrt(lists) is a common starting point) to trade latency for recall. A pragmatic choice when index build time or memory matters more than peak recall.
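To make the lists/probes mechanics concrete, here is a toy inverted-file index in plain Python. It is a sketch under loud assumptions: centroids are sampled from the data instead of trained with k-means, the data is random, and every function name is hypothetical. `suggested_lists` encodes the starting-point heuristic from the pgvector README (rows/1000 up to 1M rows, sqrt(rows) above).

```python
import math
import random

def suggested_lists(rows):
    """Starting point for the pgvector `lists` parameter (a heuristic, not a rule)."""
    return max(1, rows // 1000) if rows <= 1_000_000 else math.isqrt(rows)

def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, n_lists, rng):
    # Simplification: pick centroids by sampling rows rather than running k-means.
    centroids = [vectors[i] for i in rng.sample(range(len(vectors)), n_lists)]
    lists = [[] for _ in range(n_lists)]
    for idx, vec in enumerate(vectors):
        nearest = min(range(n_lists), key=lambda c: sq_l2(vec, centroids[c]))
        lists[nearest].append(idx)
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    # Scan only the nprobe cells whose centroids are closest to the query.
    probed = sorted(range(len(centroids)),
                    key=lambda c: sq_l2(query, centroids[c]))[:nprobe]
    candidates = [i for c in probed for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_l2(query, vectors[i]))[:k]

def exact_search(query, vectors, k):
    return sorted(range(len(vectors)), key=lambda i: sq_l2(query, vectors[i]))[:k]

rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(2000)]
queries = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]
centroids, lists = build_ivf(vectors, n_lists=16, rng=rng)
truths = [set(exact_search(q, vectors, 10)) for q in queries]

recall_by_nprobe = {}
for nprobe in (1, 4, 16):
    hits = sum(len(truth & set(ivf_search(q, vectors, centroids, lists, 10, nprobe)))
               for q, truth in zip(queries, truths))
    recall_by_nprobe[nprobe] = hits / (10 * len(queries))
print(recall_by_nprobe)
```

Recall climbs as nprobe grows and reaches 1.0 once every list is probed, at which point the search has degenerated into a full scan.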

COMPRESSED

IVF-PQ (Product Quantization)

IVF combined with Product Quantization compresses each vector into a short code (e.g. 64 bytes) by splitting it into sub-vectors and quantizing each against a small codebook. This shrinks memory 10-50x, letting a single node hold hundreds of millions of vectors, at the cost of approximate distances. Standard in Milvus (IVF_PQ) and Faiss for billion-scale sets. Pair it with a re-scoring pass over full-precision vectors for the final top-k to recover most of the lost recall.

ON-DISK

DiskANN (SSD-Resident)

DiskANN (Microsoft's Vamana graph) keeps the bulk of the index on NVMe SSD with a compressed copy in RAM, so you can serve datasets far larger than memory at a fraction of the cost of an all-RAM HNSW. Milvus offers DISKANN; Qdrant supports on-disk HNSW and payloads; the pgvectorscale extension (Timescale), which runs alongside pgvector, adds a DiskANN-style StreamingDiskANN index. The tradeoff is higher tail latency from SSD reads, mitigated by keeping quantized vectors in RAM. The go-to when your corpus outgrows affordable memory but you still need good recall.

EXACT

Flat (Brute Force / Exact)

A flat index does no approximation -- it computes the exact distance to every vector. Recall is a perfect 1.0 and there is nothing to tune, but latency grows linearly with the row count. It is the right choice below roughly 50K-100K vectors (where a scan is already sub-10ms), for correctness-critical use cases, and as the ground truth you measure ANN recall against. pgvector with no index, Qdrant's exact search flag, and Milvus FLAT all provide it. Always benchmark ANN recall versus a flat baseline before shipping.

TUNING

Tuning the Knobs

Whatever the index, the workflow is the same: fix build-time parameters for your memory budget, then move the query-time knob to hit your recall SLO at the lowest latency. HNSW: raise ef_search until recall plateaus. IVF: raise nprobe/probes. Rebuild with larger m/ef_construction or more lists only if the query knob cannot reach target recall. Measure on your own data with your own filters -- published benchmarks use clean, unfiltered datasets that rarely match production, where filtering can collapse HNSW recall if the engine post-filters.

3. Distance Metrics & Similarity

The distance metric defines what "closest" means, and it must match how your embedding model was trained. Pick the wrong one and recall silently collapses -- the index works, but returns the wrong neighbors. Each metric also maps to a specific index operator class (pgvector) or a config field (Qdrant, Milvus, Weaviate, Pinecone). The rule of thumb: read your embedding model's card and use the metric it recommends, then normalize if that lets you use the fastest operator.

DEFAULT

Cosine Similarity

Measures the angle between two vectors, ignoring magnitude, so it captures direction (semantic orientation) rather than length. It is the default for most text embedding models (OpenAI, Cohere, BGE, E5) and the safe choice when unsure. pgvector exposes it as vector_cosine_ops and the <=> operator; Qdrant/Milvus/Weaviate call it Cosine. Note that cosine on normalized vectors is equivalent to dot product -- so many engines normalize internally and run the faster inner-product kernel.

FAST

Dot Product (Inner Product)

The raw inner product of two vectors; higher means more similar. It is the cheapest to compute and the correct choice for models trained with it (many retrieval and recommendation models, and MIPS-style setups). In pgvector it is vector_ip_ops with the <#> operator (which returns the negative inner product so smaller is nearer). If your vectors are L2-normalized, dot product and cosine rank results identically -- so normalize once at ingestion and use inner product for speed.

GEOMETRIC

Euclidean (L2) Distance

Straight-line distance in the embedding space; smaller means more similar. It accounts for magnitude, which matters for some image, audio, and clustering embeddings where vector length carries information. pgvector uses vector_l2_ops and the <-> operator; because squared L2 preserves the ordering, an index can rank candidates on squared distance and skip the sqrt. Use L2 when the model card specifies it -- forcing cosine on an L2-trained model degrades results.

PREP

Normalization

L2-normalizing vectors (scaling each to unit length) is the most common preprocessing step. It makes dot product equal cosine, stabilizes distances, and lets you use the fastest inner-product kernel and binary quantization. Some models already return normalized vectors (check the card); if not, normalize once at ingestion and again for each query vector. Be consistent -- mixing normalized documents with un-normalized queries produces garbage rankings.
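A small sketch of why normalization lets you swap cosine for inner product: after scaling both vectors to unit length, the plain dot product equals the cosine of the originals. The vectors and helper names below are illustrative only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length; a zero vector has no direction to keep."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

doc = [3.0, 4.0, 0.0]     # length 5
query = [1.0, 2.0, 2.0]   # length 3

cos = cosine(doc, query)
ip = dot(l2_normalize(doc), l2_normalize(query))
print(cos, ip)
```

Apply the same `l2_normalize` step to documents at ingestion and to every query vector; normalizing only one side breaks the equivalence.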

CHOICE

Choosing for Your Model

Match the metric to the model, not to intuition: OpenAI text-embedding-3 and most BGE/E5/GTE models -> cosine; some rerank/recommendation models -> dot product; certain vision models -> L2. When in doubt, normalize and use cosine -- it is rarely wrong for text. Set the metric at collection creation time; changing it later means rebuilding the index. A mismatched metric is one of the most common silent causes of poor RAG recall.

BINARY

Hamming & Manhattan

For binary-quantized vectors (each dimension reduced to one bit), similarity is computed with Hamming distance -- a XOR-and-popcount that is extraordinarily fast and cache-friendly, enabling 32x memory savings. Manhattan (L1) distance is offered by some engines (Qdrant, Milvus) for models trained with it. These are specialized: use Hamming as the coarse first pass in a binary-quantization + rescoring pipeline, and L1 only when a model explicitly calls for it.
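The XOR-and-popcount step can be sketched in a few lines. This assumes sign-based binarization (positive component becomes 1) and packs the bits into a Python integer for readability; real engines pack into machine words and use hardware popcount. Both function names are hypothetical.

```python
def binarize(vector):
    """Pack the sign of each component into one integer, one bit per dimension."""
    bits = 0
    for component in vector:
        bits = (bits << 1) | (1 if component > 0 else 0)
    return bits

def hamming(code_a, code_b):
    """XOR marks the differing bits; counting the ones gives the distance."""
    return bin(code_a ^ code_b).count("1")

doc = binarize([0.5, -0.2, 0.1, -0.9])     # bit pattern 1010
query = binarize([-0.3, 0.4, 0.2, -0.1])   # bit pattern 0110
distance = hamming(doc, query)
print(distance)
```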

MULTI-VEC

MaxSim & Multi-Vector

Late-interaction models (ColBERT, ColPali) represent each document as many token-level vectors and score with MaxSim -- summing, over query tokens, the maximum similarity to any document token. This captures fine-grained matches a single pooled vector misses. Qdrant and Weaviate now support multi-vector fields natively, and Milvus supports multiple vector fields per entity. It costs more storage but boosts recall on hard queries; treat it as an advanced option once single-vector retrieval plateaus.
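The MaxSim score described above fits in one expression. The sketch below uses dot product as the token-level similarity and tiny hand-written 2-d "token vectors"; real ColBERT-style embeddings have far more dimensions and tokens.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_tokens, doc_tokens):
    """For each query token take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]   # covers only the first query token

score_a = maxsim(query_tokens, doc_a)
score_b = maxsim(query_tokens, doc_b)
print(score_a, score_b)
```

Document A wins because every query token finds a strong partner, which is exactly the fine-grained matching a single pooled vector would blur.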

4. Filtered & Hybrid Search

Pure vector search is rare in production RAG. You almost always constrain results by metadata (tenant, permissions, date, document type) and often blend semantic similarity with keyword matching. How a database handles filtered and hybrid search -- correctly and without wrecking recall -- is one of the biggest practical differentiators between engines. This is where Qdrant's pre-filtering and Weaviate's native hybrid search earn their reputation, and where naive post-filtering quietly returns too few results.

FILTER

Metadata Filtering

Every serious RAG query attaches a filter: only this tenant_id, only year >= 2025, only docs the user may read. Databases store structured metadata (JSON payload in Qdrant/Weaviate/Pinecone/Milvus, typed columns or JSONB in pgvector) and let you express boolean, range, and set conditions. The subtle part is not the syntax but how the filter interacts with the ANN index -- get that wrong and you either lose recall or scan too much. Design your filterable fields up front and index them.

PRE VS POST

Pre-filter vs Post-filter

Post-filtering runs ANN search first, then discards non-matching results -- fast, but if the filter is selective you may get far fewer than k results (or zero). Pre-filtering restricts the candidate set before/inside the graph walk, guaranteeing a full top-k but requiring the engine to integrate filtering with the index. Qdrant is built around fast pre-filtering via payload indexes; Milvus and Weaviate support filtered search with partitions/indexes; pgvector applies WHERE clauses after the approximate index scan, so a selective filter can return fewer than k rows unless the planner picks a different plan or iterative index scans (pgvector 0.8.0+) are enabled. Strict filters + post-filtering is a classic silent-recall bug.
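The failure mode is easy to reproduce on toy data. In this sketch a sorted scan stands in for the ANN index, the twelve points and the `tenant` field are invented, and both search functions are hypothetical illustrations rather than any engine's API.

```python
def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def post_filter_search(query, points, k, predicate):
    """Take the k nearest first, then drop whatever fails the filter."""
    nearest = sorted(points, key=lambda p: sq_l2(query, p["vector"]))[:k]
    return [p for p in nearest if predicate(p)]

def pre_filter_search(query, points, k, predicate):
    """Restrict to matching points first, then take the k nearest of those."""
    allowed = [p for p in points if predicate(p)]
    return sorted(allowed, key=lambda p: sq_l2(query, p["vector"]))[:k]

# Tenant "a" owns the points near the query; tenant "b" owns the far ones.
points = [{"id": i, "tenant": "a" if i < 8 else "b", "vector": [float(i)]}
          for i in range(12)]
query = [0.0]

def only_tenant_b(point):
    return point["tenant"] == "b"

post = post_filter_search(query, points, 3, only_tenant_b)
pre = pre_filter_search(query, points, 3, only_tenant_b)
print(len(post), [p["id"] for p in pre])
```

Post-filtering comes back empty because the three nearest points all belong to the other tenant, while pre-filtering still returns a full top-3.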

PAYLOAD IDX

Payload & Field Indexes

Filtering fast requires the metadata itself to be indexed, not scanned. Qdrant lets you create payload indexes (keyword, integer, float, geo, datetime, bool) so filters resolve in a bitmap before the vector walk. pgvector filters ride on ordinary PostgreSQL B-tree/GIN indexes on your columns or JSONB fields -- one of its underrated strengths, since you get the full relational indexing toolbox. Milvus builds scalar indexes; Weaviate indexes properties. Always index the fields you filter on, or filtering becomes the bottleneck.

HYBRID

Hybrid Search (Dense + Sparse)

Hybrid search fuses dense vector similarity with sparse/keyword scoring. Vectors nail semantic matches ("cost" ~ "pricing"); keywords nail exact terms vectors miss (error codes, product SKUs, acronyms, rare names). Weaviate, Qdrant, Milvus, and Pinecone all support hybrid natively; pgvector combines vector search with PostgreSQL tsvector full-text search in SQL. For RAG over technical docs, code, or catalogs, hybrid consistently beats pure vector search -- it is the highest-value feature after basic filtering.

SPARSE

Sparse Vectors (BM25 / SPLADE)

The keyword half of hybrid search is a sparse vector -- mostly zeros, with weights on the terms that appear. Classic BM25 weights raw terms; learned sparse models (SPLADE, BGE-M3's sparse output) predict term weights including expansions, capturing some semantics while staying keyword-precise. Qdrant and Milvus store sparse vectors as first-class fields; Weaviate computes BM25 internally; Pinecone accepts sparse-dense records. Sparse vectors are cheap to store and immune to the "different wording" failure of dense-only search.

FUSION

Score Fusion (RRF)

Dense and sparse scores live on different scales, so you cannot just add them. Reciprocal Rank Fusion combines the two ranked lists by rank, not raw score: RRF(d) = sum(1 / (k + rank_i)) with k around 60. It is parameter-light, robust, and supported by the Weaviate, Qdrant, and Milvus hybrid endpoints; Weaviate also offers relative-score fusion with an alpha weight. Prefer RRF unless you have labeled data to tune a weighted or learned fusion.
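The RRF formula translates directly into code. This is a minimal sketch with 1-based ranks and k = 60; the two input rankings are invented document ids, and `rrf` is a hypothetical helper rather than a database call.

```python
def rrf(rankings, k=60):
    """Fuse ranked lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

dense = ["a", "b", "c"]    # ranking from vector search
sparse = ["c", "a", "d"]   # ranking from keyword search

fused = rrf([dense, sparse])
print(fused)
```

Documents that appear in both lists ("a" and "c") rise above those found by only one retriever, without any score normalization.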

| Database | Filtering model | Payload/field indexes | Hybrid (dense+sparse) | Fusion |
|---|---|---|---|---|
| pgvector | SQL WHERE (B-tree/GIN) | Full PostgreSQL indexes | Via tsvector FTS in SQL | Manual / RRF in SQL |
| Qdrant | Pre-filter (payload index) | Yes (typed payload idx) | Native sparse + dense | RRF / DBSF |
| Pinecone | Metadata filter | Metadata indexed | Sparse-dense records | Weighted / RRF |
| Weaviate | Filtered vector search | Property indexes | Native BM25 + vector | RRF / relative score |
| Milvus | Filtered search + partitions | Scalar indexes | Native sparse + dense | RRF / weighted |
| Chroma | Metadata where filter | Basic metadata index | No native (full-text add-on) | Manual |

5. The Databases Compared

These six cover essentially every RAG scenario in 2026, from a side project on an existing Postgres box to a billion-vector search platform. They all offer ANN indexing (HNSW everywhere except Pinecone, whose index is proprietary), metadata filtering, and sub-100ms queries at million-scale -- so the decision comes down to filtering power, hybrid search, scale ceiling, and how much operational surface you want to own. Read each with your own workload in mind; the decision table in section 8 maps use cases to picks.

POSTGRES

pgvector

A PostgreSQL extension that adds a vector type plus HNSW and IVFFlat indexes to the database you already run. Its superpower is not raw speed but zero new infrastructure: embeddings live next to your relational data, so you get ACID transactions, joins, foreign keys, WHERE-clause filtering, and your existing backup/monitoring/replication for free. The pgvectorscale extension (Timescale) adds a DiskANN-style index and label-based filtering that push it well past its old limits. The pragmatic default for teams up to a few million vectors who already have PostgreSQL -- and often the right answer even when a dedicated engine would be marginally faster.

PERFORMANCE

Qdrant

A Rust-native, purpose-built vector database that consistently tops filtered-search benchmarks. Its defining strength is fast pre-filtering: payload indexes resolve metadata conditions before the vector walk, so strict filters still return a full top-k without the recall collapse post-filtering causes. Supports scalar, binary, and product quantization (4-32x memory savings), sparse vectors and multi-vector points for hybrid and late-interaction search, and on-disk storage for large sets. Run it self-hosted via Docker/Kubernetes or on Qdrant Cloud. The performance-first pick when low-latency filtered RAG at scale is the priority.

MANAGED

Pinecone

Fully managed and serverless -- there is no cluster to size, patch, or scale. Create an index, upsert vectors, query; storage and compute scale automatically and you pay per usage. Supports namespaces for multi-tenant isolation, sparse-dense hybrid search, metadata filtering, and integrated inference (embed + rerank without leaving Pinecone). The lowest-ops option: ideal for teams that want production RAG without owning a database. Tradeoffs are vendor lock-in, no self-hosting, and costs that can climb with high query volume -- model your pricing before committing.

HYBRID

Weaviate

Open-source with the strongest out-of-the-box hybrid search: native BM25 + vector fusion (RRF or relative-score) in a single query, so semantic and keyword matching combine without extra plumbing. Optional vectorizer and reranker modules embed raw text for you, and it offers multi-tenancy, RBAC, and named/multi-vector fields. Available self-hosted or as Weaviate Cloud. The natural choice when hybrid search is a first-class requirement -- technical docs, product catalogs, mixed keyword/semantic corpora -- and you want it built in rather than assembled.

SCALE

Milvus

Built for billion-vector scale with a distributed, storage-compute-separated architecture. Offers the widest index menu (HNSW, IVF variants, IVF-PQ, DISKANN, GPU-accelerated CAGRA/GPU_IVF), native sparse+dense hybrid search, and horizontal scaling on Kubernetes. Run it as Milvus Lite (embedded, pip-installable for dev), Standalone (single node), or Distributed (production cluster); Zilliz Cloud is the managed version. The pick for 100M+ vectors, GPU-accelerated indexing, or when you need to scale a single collection far beyond one node.

PROTOTYPE

Chroma

The most developer-friendly option for getting started. pip install chromadb and it runs in-process with persistent storage and zero config; a client-server mode and the managed Chroma Cloud exist for when you outgrow embedded. Built-in embedding functions (OpenAI, Cohere, sentence-transformers) make demos and notebooks trivial. It is excellent for prototyping, local RAG, and single-machine apps, but not aimed at billion-scale production or heavy hybrid/filtering workloads -- graduate to pgvector, Qdrant, or Milvus when you outgrow it. For fully local setups, pair it with local inference.

| Database | Deployment | Indexes | Hybrid search | Practical scale | Best for |
|---|---|---|---|---|---|
| pgvector | PostgreSQL ext (self / any managed PG) | HNSW, IVFFlat (+DiskANN via pgvectorscale) | Via SQL + tsvector | ~1-5M/node (more with pgvectorscale) | Existing Postgres stacks |
| Qdrant | Self-hosted + Qdrant Cloud | HNSW (on-disk), quantization | Native sparse + dense | Billions (sharded) | Low-latency filtered RAG |
| Pinecone | Managed serverless only | Proprietary (managed) | Sparse-dense | Billions | Zero-ops teams |
| Weaviate | Self-hosted + Weaviate Cloud | HNSW, flat, quantization | Native BM25 + vector | Billions | Built-in hybrid + vectorizers |
| Milvus | Self-hosted + Zilliz Cloud | HNSW, IVF*, IVF-PQ, DISKANN, GPU | Native sparse + dense | 10B+ (distributed) | Massive / GPU scale |
| Chroma | Embedded + server + Chroma Cloud | HNSW | No native (add-on) | ~1M (single machine) | Prototyping, local RAG |

6. Quantization & Memory

Memory is the dominant cost of vector search. A million 1536-dim float32 vectors is ~6 GB just for the raw arrays, before the HNSW graph adds its link overhead on top -- and HNSW wants it all in RAM. Quantization compresses vectors to fit more per node and cut cost, trading a little accuracy that a rescoring pass usually recovers. Understanding the memory math and the quantization ladder is how you keep a large RAG index affordable without wrecking recall.

MATH

The Memory Math

Raw storage = vectors x dimensions x bytes-per-component. float32 = 4 bytes, so 1M x 1536 x 4 = ~6.1 GB; HNSW adds roughly m x 8-16 bytes per vector for graph links on top. This is why 10M+ full-precision vectors need tens of GB of RAM. Two levers shrink it: fewer dimensions (Matryoshka truncation) and fewer bytes per dimension (quantization). Compute this number first -- it decides your node size, your index choice, and often your whole architecture.
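The arithmetic above is worth scripting so you can rerun it for your own corpus. This sketch uses decimal gigabytes and assumes m = 16 (the low end of the typical HNSW range) with the 8-16 bytes-per-link estimate from the text; the function names are illustrative.

```python
GB = 1000 ** 3  # decimal gigabytes

def raw_bytes(count, dims, bytes_per_component=4):
    """Raw array storage: vectors x dimensions x bytes per component."""
    return count * dims * bytes_per_component

def hnsw_link_bytes(count, m, bytes_per_link):
    """Rough graph overhead: m links per vector at a given cost per link."""
    return count * m * bytes_per_link

count, dims, m = 1_000_000, 1536, 16   # m = 16 is an assumption

float32_gb = raw_bytes(count, dims) / GB
int8_gb = raw_bytes(count, dims, bytes_per_component=1) / GB
graph_low_gb = hnsw_link_bytes(count, m, 8) / GB
graph_high_gb = hnsw_link_bytes(count, m, 16) / GB

print(f"float32 vectors: {float32_gb:.3f} GB")
print(f"int8 vectors:    {int8_gb:.3f} GB")
print(f"HNSW links:      {graph_low_gb:.3f}-{graph_high_gb:.3f} GB")
```

Payload storage and allocator overhead are not included, so treat the result as a floor rather than a budget.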

4X

Scalar Quantization (int8)

Maps each float32 component to an 8-bit integer using per-dimension min/max ranges, cutting memory ~4x with typically under 1% recall loss -- the safest first step. Qdrant, Milvus, and Weaviate all offer int8 scalar quantization as a config flag; Cohere and Voyage even emit int8 embeddings directly. Keep the original vectors on disk for optional rescoring. If you do one quantization, do this one: it is nearly free accuracy-wise and quarters your RAM bill.
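A minimal sketch of min/max scalar quantization, assuming per-dimension ranges learned from the stored vectors and unsigned 8-bit codes. Engines differ in the details (quantile clipping, signed codes, SIMD kernels), and the helper names and random data here are illustrative only.

```python
import random

def fit_ranges(vectors):
    """Learn the min and max of every dimension from the stored vectors."""
    dims = len(vectors[0])
    lows = [min(v[d] for v in vectors) for d in range(dims)]
    highs = [max(v[d] for v in vectors) for d in range(dims)]
    return lows, highs

def quantize(vector, lows, highs):
    """Map each component onto 0..255 within its dimension's range."""
    codes = []
    for x, lo, hi in zip(vector, lows, highs):
        span = (hi - lo) or 1.0           # guard against a constant dimension
        clipped = min(max(x, lo), hi)     # out-of-range queries are clamped
        codes.append(round((clipped - lo) / span * 255))
    return bytes(codes)                   # one byte per dimension

def dequantize(codes, lows, highs):
    return [lo + c / 255 * ((hi - lo) or 1.0)
            for c, lo, hi in zip(codes, lows, highs)]

rng = random.Random(1)
vectors = [[rng.uniform(-1, 1) for _ in range(16)] for _ in range(200)]
lows, highs = fit_ranges(vectors)

codes = quantize(vectors[0], lows, highs)
restored = dequantize(codes, lows, highs)
max_error = max(abs(a - b) for a, b in zip(vectors[0], restored))
print(len(codes), max_error)
```

Each component costs one byte instead of four, and the reconstruction error is bounded by half a quantization step per dimension.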

32X

Binary Quantization

Reduces each dimension to a single bit (sign), giving a brutal ~32x memory cut and Hamming-distance comparisons that are extremely fast. It works surprisingly well for high-dimensional, normalized embeddings (1024+ dims from OpenAI, Cohere, Voyage) but loses too much on low-dim vectors. Always pair it with oversampling + rescoring: retrieve a wide candidate set with binary distance, then re-rank the top few hundred with full-precision vectors. Qdrant popularized this pattern; Milvus and Weaviate support it too.

PQ

Product Quantization (PQ)

Splits each vector into sub-vectors and encodes each against a learned codebook, compressing to a handful of bytes per vector (10-50x). It underpins Milvus IVF_PQ and Faiss for billion-scale sets where even int8 is too big. PQ distances are more approximate than scalar quantization, so a full-precision rescoring stage is essential for good top-k. Choose PQ when your corpus is so large that fitting it in memory is the binding constraint, not when a simpler scalar quant would do.

MRL

Matryoshka Dimension Reduction

Matryoshka-trained models (OpenAI text-embedding-3, Gemini, Voyage, Nomic) let you simply truncate a vector to fewer dimensions -- 3072 to 1536 to 768 to 256 -- with graceful, not catastrophic, quality loss. Halving dimensions halves storage and speeds distance math, orthogonally to quantization (you can do both). It is the cheapest lever: no index change, just store fewer components. Benchmark the recall drop on your data and keep the smallest dimension that still meets your target.
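Truncation itself is a slice, with one detail worth showing: a truncated vector is no longer unit length, so it should be re-normalized before cosine or inner-product search. The sketch below assumes the embedding came from a Matryoshka-trained model (truncating other models' vectors generally hurts quality much more); the 4-d vector is a stand-in for a real embedding.

```python
import math

def truncate_embedding(vector, dims):
    """Keep the first `dims` components and rescale to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated vector has zero length")
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]          # unit-length toy embedding
short = truncate_embedding(full, 2)
short_norm = math.sqrt(sum(x * x for x in short))
print(short, short_norm)
```

Truncate documents and queries to the same dimension, and create the collection with that dimension.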

RESCORE

Oversampling & Rescoring

The trick that makes aggressive quantization safe: search the compressed index for more candidates than you need (e.g. top-200 with binary), then recompute exact distances on full-precision vectors for just those candidates to produce the final top-10. You keep most of the memory savings while recovering most of the recall. Qdrant, Milvus, and Weaviate expose this as a rescore/refine option -- keep the originals on disk (they are not in the hot path) so rescoring can read them.
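The two-stage pattern can be sketched end to end with sign-based binary codes as the coarse index. Everything here is illustrative: random unit vectors, hypothetical helper names, and a sorted scan in place of a real index. The `oversample` factor controls how many candidates the cheap pass hands to the exact pass.

```python
import math
import random

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def binarize(v):
    bits = 0
    for x in v:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    return bin(a ^ b).count("1")

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exact_top_k(query, vectors, k):
    return sorted(range(len(vectors)),
                  key=lambda i: dot(query, vectors[i]), reverse=True)[:k]

def rescored_top_k(query, vectors, codes, k, oversample):
    # Stage 1: cheap Hamming pass over binary codes, keep k * oversample ids.
    q_code = binarize(query)
    shortlist = sorted(range(len(codes)),
                       key=lambda i: hamming(q_code, codes[i]))[: k * oversample]
    # Stage 2: exact inner product on full-precision vectors, shortlist only.
    return sorted(shortlist, key=lambda i: dot(query, vectors[i]), reverse=True)[:k]

rng = random.Random(7)
vectors = [normalize([rng.gauss(0, 1) for _ in range(64)]) for _ in range(500)]
codes = [binarize(v) for v in vectors]
query = normalize([rng.gauss(0, 1) for _ in range(64)])

truth = set(exact_top_k(query, vectors, 10))
recall = {o: len(truth & set(rescored_top_k(query, vectors, codes, 10, o))) / 10
          for o in (1, 5, 50)}
print(recall)
```

Raising the oversampling factor can only add true neighbors to the shortlist, so recall never drops as it grows; the cost is more full-precision reads per query.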

HALF

float16 & On-Disk Storage

Between float32 and int8 sits float16/bfloat16 -- a simple 2x cut with negligible accuracy loss and no codebook. Beyond in-memory tricks, most engines can keep vectors and/or payloads on disk (Qdrant on-disk HNSW, Milvus MMAP/DISKANN, pgvector on standard table storage with buffer cache) so RAM holds only the graph and hot data. Combine dimension reduction, quantization, and on-disk storage to serve large corpora on modest, affordable hardware.

7. Scale, Latency & Performance

"Fast" is meaningless without saying at what recall, what dataset size, and under what filters. Vendor benchmarks cherry-pick clean, unfiltered datasets at the recall that flatters them. To choose well, learn to read the recall-vs-QPS curve, size memory honestly, and account for build time and cold starts. This section is about measuring performance on your workload rather than trusting a marketing number.

BENCH

How to Read Benchmarks

The neutral references are ANN-Benchmarks (algorithm-level recall vs QPS) and VectorDBBench (end-to-end, includes filtering and build time). Any latency number is only meaningful paired with a recall number -- 5ms at recall 0.80 is not comparable to 20ms at 0.99. Watch for apples-to-oranges: single-thread vs concurrent, in-memory vs on-disk, unfiltered vs filtered. Always reproduce on your own vectors, dimensions, filters, and hardware before believing a ranking.

RECALL

Recall@k Tuning

Recall@k is the fraction of the true top-k neighbors your ANN index actually returns, measured against a flat/brute-force baseline. Build a ground-truth set from a query sample, then raise the query knob (ef_search for HNSW, nprobe for IVF) until recall meets your target -- usually 0.95-0.99 for RAG. Going higher costs latency for shrinking gains. Re-measure after adding filters: post-filtering engines can drop far below their headline recall once a selective filter is applied.

QPS

Throughput & Concurrency

Single-query latency and sustained QPS are different metrics. HNSW search is CPU-bound and parallelizes across cores, so throughput scales with vCPUs until memory bandwidth saturates. Measure p50/p95/p99 under realistic concurrency, not one query at a time -- tail latency is what users feel. Rust engines (Qdrant) and C++ cores (Milvus) tend to hold lower tails under load; pgvector shares CPU with the rest of your database, so isolate or replica-offload vector queries if they compete with OLTP traffic.

SHARD

Sharding & Horizontal Scale

When one node cannot hold the index or serve the QPS, you shard: partition vectors across nodes, fan queries out, and merge top-k. Milvus and Qdrant do this natively; Pinecone hides it entirely behind serverless; pgvector scales reads with replicas, but a single table and its index generally live on one primary. Replicas add read throughput and HA; shards add capacity. Know your growth curve -- retrofitting sharding onto a single-node design is painful.
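The fan-out-and-merge step is simple enough to sketch: each shard returns its local top-k, and the coordinator keeps the k best of the combined hits. Shards are plain dictionaries here and the calls run sequentially; a real deployment would issue them in parallel over the network. All names and data are hypothetical.

```python
import heapq

def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search_shard(shard, query, k):
    """Local top-k on one shard as (distance, id) pairs, nearest first."""
    return heapq.nsmallest(
        k, ((sq_l2(query, vec), doc_id) for doc_id, vec in shard.items()))

def search_all(shards, query, k):
    """Fan out to every shard, then merge the partial results into a global top-k."""
    partials = [search_shard(shard, query, k) for shard in shards]
    merged = heapq.nsmallest(k, (hit for part in partials for hit in part))
    return [doc_id for _, doc_id in merged]

shards = [
    {"a": [0.0], "b": [5.0]},
    {"c": [1.0], "d": [9.0]},
]
top = search_all(shards, [0.4], k=2)
print(top)
```

Each shard must return a full k results, not k divided by the shard count, because all of the true neighbors may live on a single shard.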

SIZING

Memory Sizing per Vector

Budget RAM as: (dimensions x bytes-per-component) + HNSW graph overhead (~m x 8-16 bytes) + payload, times the vector count, plus headroom. A 1536-dim float32 HNSW point costs roughly 6-8 KB all-in; int8 quantization drops it toward ~1.5-2 KB. Multiply by your target count and you know whether it fits one node or forces quantization/DiskANN/sharding. This single calculation drives most architecture decisions -- do it before picking a database.

BUILD

Build Time & Cold Start

Index build is not free. HNSW construction on millions of vectors can take minutes to hours and is CPU-heavy; IVF builds faster but needs a representative training sample for its centroids. pgvector can build HNSW concurrently and in parallel workers; Milvus builds indexes as background jobs. Plan for cold starts too: memory-mapped/on-disk indexes must warm their cache before hitting steady-state latency. Factor build and warm-up time into deploys, re-indexing, and failover -- not just steady-state query speed.

8. Self-Host vs Managed & Cost

The deployment model shapes cost, control, and how much of your team's time the database eats. The spectrum runs from fully managed serverless (someone else's problem, metered pricing) through managed clusters and self-hosted engines to an embedded library inside your app. There is no universally right answer -- only the right fit for your scale, compliance needs, existing infrastructure, and appetite for operations. Model the total cost, not just the sticker price.

SERVERLESS

Managed Serverless

Pinecone is the archetype: no clusters to size, patch, or scale, and you pay for what you use (storage + reads/writes). It is the fastest path to production RAG and the lowest operational burden -- ideal for small teams and spiky, unpredictable traffic. The costs are metered pricing that can surprise you at high query volume, no infrastructure control, and vendor lock-in on a proprietary API. Model a realistic month of reads and writes before committing; serverless is cheapest at low-to-medium, steady volume.

MANAGED CLOUD

Managed Clusters

Qdrant Cloud, Weaviate Cloud, and Zilliz Cloud (managed Milvus) run the open-source engine for you on provisioned nodes you size. You get most of the ops relief of serverless while keeping the engine's full feature set and an escape hatch to self-host the same software. Pricing is typically per-node/per-hour plus storage, which is more predictable than pure usage metering at scale. The middle ground for teams that want the engine's capabilities without running Kubernetes themselves.

SELF-HOST

Self-Hosted

Run Qdrant, Weaviate, or Milvus yourself on Docker or Kubernetes. You get maximum control, data residency, no per-query fees, and the lowest cost per vector at large scale -- in exchange for owning upgrades, backups, monitoring, scaling, and on-call. It pays off when volume is high and steady, when compliance demands data stay in your environment, or when you already operate infrastructure. Budget real engineering time; the "free" software has a very real operational cost.

NO NEW INFRA

pgvector: No New Database

The cheapest deployment is often the one you already run. If you have PostgreSQL -- self-managed or on RDS/Cloud SQL/Supabase/Neon -- pgvector adds vector search with no new service to deploy, secure, back up, or pay for separately. One system to operate, one backup, one set of credentials, transactional consistency between vectors and relational data. For a large share of RAG apps under a few million vectors, this eliminates the entire "which vector database" decision.

COST: MANAGED

Managed Cost Model

Managed pricing has three drivers: stored vectors (GB, a function of count x dimensions x precision), query volume (reads), and writes/updates. Two big levers cut it: quantization and Matryoshka truncation shrink storage; semantic caching cuts reads. Watch replicas (multiplied cost for HA/throughput) and egress. The failure mode is a chatty app doing millions of small queries -- batch, cache, and pre-filter to keep read volume down. Always project a peak month, not an average one.
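The three drivers above can be turned into a rough monthly projection. The sketch below is illustrative only: the function names and every price are placeholder assumptions, not any vendor's real rates, and it assumes replicas multiply only the storage line.

```python
def stored_gb(n_vectors: int, dims: int, bytes_per_component: float = 4) -> float:
    """Raw vector storage in GB: count x dimensions x precision."""
    return n_vectors * dims * bytes_per_component / 1e9


def monthly_cost(n_vectors, dims, reads, writes, *, bytes_per_component=4,
                 usd_per_gb=0.30, usd_per_million_reads=8.0,
                 usd_per_million_writes=2.0, replicas=1):
    """Projected monthly bill in USD. All prices are hypothetical placeholders."""
    storage = stored_gb(n_vectors, dims, bytes_per_component) * usd_per_gb * replicas
    read_cost = reads / 1e6 * usd_per_million_reads
    write_cost = writes / 1e6 * usd_per_million_writes
    return round(storage + read_cost + write_cost, 2)


# Peak month: 5M float32 vectors at 1536 dims, 20M reads, 1M writes
baseline = monthly_cost(5_000_000, 1536, reads=20_000_000, writes=1_000_000)

# Same corpus with int8 + 768-dim truncation, and a cache absorbing half the reads
optimized = monthly_cost(5_000_000, 768, reads=10_000_000, writes=1_000_000,
                         bytes_per_component=1)
print(baseline, optimized)
```

Plug in your provider's published rates and your own peak-month read and write counts; with most usage-metered price shapes the read line dominates long before storage does.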

COST: SELF-HOST

Self-Host TCO

Self-hosted cost is mostly RAM (HNSW wants vectors in memory) plus CPU for query throughput, plus disk for on-disk indexes and originals -- then the hidden line item: engineering time for upgrades, backups, monitoring, and incidents. The memory math from section 6 sizes the box; quantization and DiskANN shrink it. At large, steady scale self-hosting is dramatically cheaper per vector than managed; at small scale the ops overhead usually is not worth it versus serverless or pgvector.

EMBEDDED

Embedded / In-Process

For local apps, edge, tests, and prototypes, an in-process library needs no server at all: Chroma (embedded mode), Milvus Lite, sqlite-vec, LanceDB, and FAISS all run inside your process with data in local files. Zero network hop, zero deployment, trivial to ship in a CLI or desktop app. The ceiling is a single machine and limited concurrency -- perfect to start with, and easy to leave behind for a client-server engine when you need multi-node scale or shared access.

LOCK-IN

Residency, Compliance & Lock-In

Beyond cost, weigh where data may live and how hard it is to leave. Regulated data (health, finance, EU personal data) may forbid a third-party managed cloud, pushing you to self-host or pgvector in your own region. Lock-in varies: open-source engines (Qdrant, Weaviate, Milvus, Chroma, pgvector) let you migrate or self-host the same software; a proprietary managed API is harder to exit. Keep the raw embeddings and source text so you can always re-index into another engine -- your corpus, not the vector store, is the asset.
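One way to keep that exit path open is an engine-neutral export of text, embeddings, and metadata. The sketch below writes one JSON object per line; the record fields (`id`, `text`, `model`, `embedding`, `metadata`) and the function names are assumptions for illustration, not a standard interchange format.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def export_jsonl(records: Iterable[dict], path: Path) -> int:
    """Write one record per line; returns the number of records written."""
    n = 0
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            row = {
                "id": rec["id"],
                "text": rec["text"],            # source text, so you can re-embed
                "model": rec["model"],          # which embedding model produced it
                "embedding": rec["embedding"],  # raw floats, engine-independent
                "metadata": rec.get("metadata", {}),
            }
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            n += 1
    return n


def import_jsonl(path: Path) -> Iterator[dict]:
    """Stream records back, ready to upsert into any engine."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Recording the model name next to each embedding matters: vectors from different models are not comparable, so a migration needs to know whether it can reuse them or must re-embed from the text.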

9. Decision Guide by Use Case

There is no single best vector database -- only the best fit for your constraints. Start from what you already run and how many vectors you have, then let filtering needs, scale, and ops appetite break ties. The default that surprises people: if you already run PostgreSQL and have under a few million vectors, pgvector is very often the right answer, and it removes the whole "which vector database" decision. The scenarios below map common RAG situations to a pick.

DEFAULT

Already on Postgres, under 5M vectors

If you already operate PostgreSQL and your corpus is under a few million vectors, use pgvector. You get vector search with zero new infrastructure, transactional consistency between embeddings and relational rows, SQL filtering with real indexes, and your existing backups and monitoring. It is not the fastest engine at extreme scale, but for the majority of RAG apps it is fast enough and by far the simplest. Add pgvectorscale for DiskANN and higher scale before reaching for a dedicated engine.

SPEED

Fastest filtered search

When low latency under heavy metadata filtering is the priority -- multi-tenant RAG, permission-scoped search, strict date/type constraints -- pick Qdrant. Its Rust core and pre-filtering via payload indexes keep recall and latency strong exactly where post-filtering engines fall apart, and its quantization options keep large indexes affordable. Self-host it or use Qdrant Cloud. The performance-first choice when filtered vector search is your hot path.

ZERO-OPS

Zero ops, serverless

If you have no appetite to run a database and want to ship RAG now, choose Pinecone. Serverless means no sizing, patching, or scaling, and integrated inference can embed and rerank for you. Best for small teams, spiky traffic, and low-to-medium steady query volume. Model your read/write costs first and accept the lock-in tradeoff -- keep your raw embeddings so you can migrate later if volume makes self-hosting cheaper.

HYBRID

Hybrid search first-class

When keyword precision matters as much as semantics -- technical docs, code, product catalogs, acronym-heavy corpora -- Weaviate gives you native BM25 + vector fusion in one query, plus optional built-in vectorizer and reranker modules so you can pass raw text. It scales to billions and runs self-hosted or on Weaviate Cloud. The pick when you want hybrid retrieval built in rather than assembled from parts.

SCALE

Billion-scale or GPU

For 100M+ vectors, GPU-accelerated indexing, or a single collection that must scale far beyond one node, choose Milvus. Its distributed, storage-compute-separated architecture, wide index menu (IVF-PQ, DISKANN, GPU CAGRA), and native hybrid search are built for massive scale. Run it self-hosted on Kubernetes or as managed Zilliz Cloud. Overkill for small corpora -- but the right tool when scale is the defining constraint.

LOCAL

Prototype, local or edge

For notebooks, demos, tests, desktop and edge apps, or fully local RAG, start with Chroma (or Milvus Lite / sqlite-vec / LanceDB). It runs in-process with one pip install, needs no server, and has built-in embedding functions. Prototype fast, then graduate to pgvector, Qdrant, or Milvus when you need multi-node scale, heavy filtering, or shared production access. Great for starting; not a billion-scale production engine.

Your situation | Pick | Why
Already run PostgreSQL, <5M vectors | pgvector | No new infra, SQL filtering, transactional
Low latency under strict filters, multi-tenant | Qdrant | Rust + pre-filtering keep recall & speed
Want zero database operations, ship now | Pinecone | Serverless, no sizing or scaling
Hybrid (keyword + semantic) is essential | Weaviate | Native BM25 + vector fusion built in
100M+ vectors, GPU indexing, sharding | Milvus | Distributed, widest index menu
Prototype, notebook, local/edge app | Chroma | In-process, zero config, pip install

# Quick RAM sizing to decide "one node vs quantize/shard"
def ram_gb(n_vectors, dims, bytes_per_component=4, hnsw_m=16):
    vec_bytes   = n_vectors * dims * bytes_per_component      # raw vectors
    graph_bytes = n_vectors * hnsw_m * 12                     # ~HNSW links
    total = (vec_bytes + graph_bytes) * 1.3                   # +30% headroom
    return round(total / 1e9, 2)

# 5M vectors, 1536 dims, HNSW m=16
print("float32:", ram_gb(5_000_000, 1536), "GB")             # ~41 GB  -> big node
print("int8   :", ram_gb(5_000_000, 1536, bytes_per_component=1), "GB")  # ~11 GB
print("MRL-768 int8:", ram_gb(5_000_000, 768, bytes_per_component=1), "GB")  # ~6 GB

# Rule of thumb:
#   fits comfortably in one node's RAM     -> pgvector / single Qdrant / Chroma
#   too big for RAM but budget matters     -> quantize (int8/binary) or DiskANN
#   still too big or needs > 1 node of QPS -> Milvus / sharded Qdrant / Pinecone
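
The decision table in this section can also be written as a small function, which makes the precedence between rows explicit. This is a sketch only: the parameter names and the order in which constraints are checked are assumptions of this example, since the table itself does not rank the rows against each other.

```python
def pick_engine(n_vectors, *, runs_postgres=False, zero_ops=False,
                hybrid_essential=False, strict_filters=False,
                needs_gpu_or_sharding=False, local_prototype=False):
    """Map the decision table to a pick. Check order is an assumption."""
    if local_prototype:
        return "Chroma"        # in-process, zero config
    if n_vectors >= 100_000_000 or needs_gpu_or_sharding:
        return "Milvus"        # distributed, widest index menu
    if zero_ops:
        return "Pinecone"      # serverless, no sizing or scaling
    if hybrid_essential:
        return "Weaviate"      # native BM25 + vector fusion
    if strict_filters:
        return "Qdrant"        # pre-filtering under heavy metadata filters
    if runs_postgres and n_vectors < 5_000_000:
        return "pgvector"      # no new infrastructure
    return "benchmark candidates on your own workload"


print(pick_engine(2_000_000, runs_postgres=True))
print(pick_engine(30_000_000, strict_filters=True))
```

Hard constraints (local only, extreme scale, no ops capacity) are checked first here so that the pgvector default only applies when nothing else forces the choice.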

10. Production Operations

Getting a vector database into production is more than a working query. You need incremental updates, backups, tenant isolation, monitoring, and a cost story that survives growth. The retrieval algorithm is the easy part; the operational surface -- keeping the index fresh, durable, isolated, and observable -- is what separates a demo from a system you can run at 3am.

UPSERT

Upserts & Incremental Indexing

Never re-embed the whole corpus on every change. Key each vector by a stable document/chunk ID, track content checksums, and re-embed only what changed -- then upsert (insert-or-replace) by ID. Handle deletes explicitly: an orphaned vector for a removed document keeps surfacing in results. For sources like Confluence or Notion, trigger re-ingestion on webhooks. HNSW handles inserts online but heavy deletion fragments the graph over time, so schedule periodic index maintenance/rebuilds.
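The checksum comparison described above can be sketched in a few lines. The function names and the two dictionaries (current source text by chunk ID, and the checksums recorded at the last ingest) are assumptions for illustration; where you persist the checksums is up to you.

```python
import hashlib


def checksum(text: str) -> str:
    """Stable content fingerprint for a chunk."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(source: dict[str, str], indexed: dict[str, str]):
    """Compare current chunks against what is already indexed.

    source:  chunk_id -> current text
    indexed: chunk_id -> checksum stored at the last ingest
    Returns (ids to re-embed and upsert, ids to delete).
    """
    to_upsert = sorted(cid for cid, text in source.items()
                       if indexed.get(cid) != checksum(text))
    to_delete = sorted(cid for cid in indexed if cid not in source)
    return to_upsert, to_delete


source = {"doc1#0": "hello", "doc1#1": "edited text", "doc2#0": "brand new"}
indexed = {"doc1#0": checksum("hello"),
           "doc1#1": checksum("old text"),
           "doc9#0": checksum("removed")}
print(plan_sync(source, indexed))
```

Only the changed and new chunks are sent to the embedding model, and the delete list prevents orphaned vectors from surfacing for documents that no longer exist.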

BACKUP

Backups & Snapshots

Your vectors are expensive to recompute, so back them up. pgvector rides your existing PostgreSQL backups (pg_dump, WAL, PITR) -- a real advantage. Qdrant and Milvus provide snapshot/backup tooling; managed services handle it but verify the RPO/RTO and that you can actually restore. Keep the raw source text and embeddings in cheap object storage as the ultimate recovery path: you can always rebuild any index from them, and it also frees you to migrate engines.

ISOLATION

Multi-Tenancy

Isolate each customer's data to prevent leakage. Options: separate collections per tenant (strongest isolation, more overhead), native multi-tenancy (Weaviate tenants, Pinecone namespaces, Qdrant with a tenant payload index), or a tenant_id filter enforced server-side (pgvector with row-level security). Always apply the tenant filter as a mandatory server-side pre-filter; never trust the client to supply it. Test isolation by querying as tenant A and asserting zero rows from tenant B.
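The isolation test can be sketched against a toy in-memory store. `TenantScopedStore` is a hypothetical stand-in for your retrieval layer, not a real client; the point is the shape of the assertion, which should run against the real engine in CI.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class TenantScopedStore:
    """Toy store where the tenant filter is mandatory, not optional."""

    def __init__(self):
        self._rows = []  # (tenant_id, doc_id, vector)

    def add(self, tenant_id, doc_id, vector):
        self._rows.append((tenant_id, doc_id, vector))

    def search(self, tenant_id, query, k=5):
        if not tenant_id:
            raise ValueError("tenant_id is required")
        # Pre-filter: other tenants' rows never enter the candidate set
        candidates = [r for r in self._rows if r[0] == tenant_id]
        ranked = sorted(candidates, key=lambda r: cosine(query, r[2]), reverse=True)
        return [(t, d) for t, d, _ in ranked[:k]]


store = TenantScopedStore()
store.add("tenant_a", "a1", [1.0, 0.0])
store.add("tenant_a", "a2", [0.7, 0.7])
store.add("tenant_b", "b1", [1.0, 0.0])  # identical to the query on purpose

hits = store.search("tenant_a", [1.0, 0.0])
assert all(tenant == "tenant_a" for tenant, _ in hits)
```

Seeding tenant B with a vector identical to the query is deliberate: it is the best possible match, so it can only be missing from the results if the filter is really applied.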

OBSERVE

Monitoring & Observability

Track query latency (p50/p95/p99), QPS, recall on a labeled sample, memory and disk usage, index build/refresh duration, and error rates. Alert on p99 latency creep (index fragmentation or memory pressure) and on recall drift (a metric or model mismatch after a change). Watch the filtered-query path specifically -- it degrades differently than unfiltered. For pgvector, reuse your existing PostgreSQL monitoring; dedicated engines expose Prometheus metrics.
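Two of those metrics are easy to compute offline from logs. The sketch below assumes you have, for each sampled query, the IDs returned by the ANN index and the IDs from an exact (brute-force) search as ground truth; the helper names are illustrative.

```python
import math


def recall_at_k(ann_results, exact_results, k):
    """Fraction of the exact top-k that the ANN index also returned, averaged over queries."""
    hits = sum(len(set(ann[:k]) & set(exact[:k]))
               for ann, exact in zip(ann_results, exact_results))
    return hits / (k * len(exact_results))


def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


ann = [["a", "b", "c"], ["x", "y", "z"]]
exact = [["a", "b", "d"], ["x", "y", "z"]]
latencies_ms = list(range(1, 101))  # stand-in for logged query latencies

print(recall_at_k(ann, exact, k=3))
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95), percentile(latencies_ms, 99))
```

Run the recall check on filtered and unfiltered queries separately, and re-run it after any change to the index parameters, the distance metric, or the embedding model.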

COST

Cost Optimization

The big levers, in order: quantize (int8 is ~4x smaller than float32, binary+rescore ~32x), truncate dimensions with Matryoshka models, move cold vectors on-disk/DiskANN, and cache repeated queries in Redis. On managed services, batch writes and pre-filter to cut metered reads, and think hard before adding replicas (they multiply cost). Right-size the node from the section 6 memory math rather than over-provisioning RAM you never use.
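To make the binary+rescore lever concrete, here is a minimal pure-Python sketch: one sign bit per dimension, a Hamming-distance shortlist, then exact cosine on the shortlist. It is a toy under stated assumptions (sign-bit quantization, a linear scan, an `oversample` factor chosen arbitrarily); real engines implement this natively and far faster.

```python
import math


def bytes_per_vector(dims, precision):
    """Storage per vector for float32, int8, or 1-bit binary codes."""
    return {"float32": dims * 4, "int8": dims, "binary": dims // 8}[precision]


def to_bits(vec):
    """Binary quantization: 1 bit per dimension (1 if positive, else 0)."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a, b):
    return bin(a ^ b).count("1")


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query, vectors, k=2, oversample=2):
    codes = [to_bits(v) for v in vectors]  # precomputed at ingest in practice
    q = to_bits(query)
    # Stage 1: cheap shortlist by Hamming distance on the binary codes
    shortlist = sorted(range(len(vectors)), key=lambda i: hamming(q, codes[i]))[:k * oversample]
    # Stage 2: rescore only the shortlist with full-precision cosine
    return sorted(shortlist, key=lambda i: cosine(query, vectors[i]), reverse=True)[:k]


vectors = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, -1.0], [-1.0, -1.0, -1.0, -1.0],
           [0.9, 1.0, 1.1, 1.0], [-1.0, -1.0, 1.0, 1.0], [-1.0, 1.0, -1.0, -1.0]]
top = search([1.0, 1.0, 1.0, 1.0], vectors)
print(top, bytes_per_vector(1536, "float32") // bytes_per_vector(1536, "binary"))
```

The rescore stage is what recovers accuracy: the binary codes only need to be good enough to put the true neighbors somewhere in the oversampled shortlist.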

BULK

Bulk Ingestion & Batching

Initial loads of millions of vectors need batching: embed in batches (respect provider rate limits and use parallel workers), then upsert in batches of hundreds to thousands per request. For very large loads, insert data first and build the index afterward -- IVF needs the data to train centroids, and bulk-then-index is far faster than one-at-a-time. In pgvector, use COPY and build HNSW with parallel workers; in Milvus/Qdrant, use their bulk-import paths. Make ingestion idempotent so a retry does not duplicate vectors.
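Batching and idempotency can be sketched independently of any engine. Below, chunk IDs are derived deterministically from the document ID and chunk position, so re-running a failed load overwrites rather than duplicates. The namespace URL, the `chunk_id` scheme, and the dictionary standing in for the vector store are all assumptions of this example.

```python
import uuid
from itertools import islice
from typing import Iterable, Iterator

# Hypothetical namespace; any fixed UUID works as long as it never changes
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/rag-chunks")


def chunk_id(doc_id: str, chunk_index: int) -> str:
    """Same document + position always yields the same ID."""
    return str(uuid.uuid5(NAMESPACE, f"{doc_id}#{chunk_index}"))


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield lists of up to `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def ingest(chunks, store: dict, batch_size: int = 500) -> int:
    """Upsert by deterministic ID; returns the number of batches sent."""
    n_batches = 0
    for batch in batched(chunks, batch_size):
        for doc_id, idx, text in batch:
            store[chunk_id(doc_id, idx)] = text  # insert-or-replace
        n_batches += 1
    return n_batches


store = {}
chunks = [("doc1", i, f"chunk {i}") for i in range(5)]
ingest(chunks, store, batch_size=2)
ingest(chunks, store, batch_size=2)  # simulated retry of the whole load
print(len(store))
```

In a real pipeline the inner loop becomes one batched upsert call per batch, but the deterministic ID is what makes the retry safe.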

SECURITY

Security & Access Control

A vector database can leak sensitive data through retrieval. Store per-document access rules as metadata and enforce them as a server-side pre-filter at query time -- never rely on the LLM to respect boundaries, since anything in the context can surface in the answer. Encrypt at rest and in transit, scope API keys narrowly, and lock down the admin/dashboard port (do not expose Qdrant/Milvus to the public internet). Audit queries that touch restricted data, and remember that embeddings can leak information about their source text.

11. Client Code Examples

The same RAG task -- create a collection, upsert embeddings with metadata, run a filtered similarity search -- across the four engines you are most likely to choose. Each uses the current 2026 client and shows a cosine setup (HNSW where the engine exposes the index type) plus metadata filtering. Embeddings come from any model (OpenAI text-embedding-3 shown); swap in your own. These are the load-bearing lines; wrap them in your ingestion and generation code.

pgvector (Python + SQL, HNSW)

import psycopg2, openai

client = openai.OpenAI()
conn = psycopg2.connect("postgresql://user:pass@localhost:5432/ragdb")

def embed(text: str) -> list[float]:
    r = client.embeddings.create(model="text-embedding-3-small",
                                 input=text, dimensions=1536)
    return r.data[0].embedding

# Setup: extension, table, and an HNSW cosine index
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id        BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
            content   TEXT NOT NULL,
            tenant_id TEXT,
            year      INT,
            embedding vector(1536)
        )""")
    cur.execute("""
        CREATE INDEX IF NOT EXISTS docs_emb_idx ON documents
        USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 200)""")
    # index the columns you filter on
    cur.execute("CREATE INDEX IF NOT EXISTS docs_tenant_idx ON documents (tenant_id, year)")
    conn.commit()

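def upsert(content, tenant_id, year):
    # Plain INSERT: every call creates a new row with a fresh identity id.
    # For a true upsert, add a unique chunk key and use ON CONFLICT ... DO UPDATE.
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO documents (content, tenant_id, year, embedding) "
            "VALUES (%s, %s, %s, %s::vector)",
            (content, tenant_id, year, str(embed(content))))
    conn.commit()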

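def search(query, tenant_id, k=5):
    qv = str(embed(query))
    with conn.cursor() as cur:
        cur.execute("SET LOCAL hnsw.ef_search = 100")          # recall/latency knob
        # With an HNSW scan the WHERE clause is applied after the index walk, so a
        # selective filter can return fewer than k rows; pgvector 0.8+ iterative
        # scans (hnsw.iterative_scan) address this.
        cur.execute("""
            SELECT content, 1 - (embedding <=> %s::vector) AS score
            FROM documents
            WHERE tenant_id = %s AND year >= 2025               -- metadata filter
            ORDER BY embedding <=> %s::vector
            LIMIT %s""", (qv, tenant_id, qv, k))
        rows = cur.fetchall()
    conn.commit()                                              # end the txn opened for SET LOCAL
    return rows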

print(search("how do I rotate API keys?", tenant_id="acme"))

Qdrant (filtered search, Python client)

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter, FieldCondition,
    MatchValue, Range, PayloadSchemaType,
)
import openai

oai = openai.OpenAI()
client = QdrantClient(url="http://localhost:6333")   # or QdrantClient(url=..., api_key=...)

def embed(text: str) -> list[float]:
    return oai.embeddings.create(model="text-embedding-3-small",
                                 input=text, dimensions=1536).data[0].embedding

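# 1. Create the collection once (HNSW is the default index; cosine is set explicitly).
#    recreate_collection is deprecated and would drop existing data on every run.
if not client.collection_exists("docs"):
    client.create_collection(
        collection_name="docs",
        vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    )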
# index the payload fields you filter on -> fast pre-filtering
client.create_payload_index("docs", "tenant_id", PayloadSchemaType.KEYWORD)
client.create_payload_index("docs", "year", PayloadSchemaType.INTEGER)

# 2. Upsert vectors with metadata payloads
def upsert(idx, content, tenant_id, year):
    client.upsert("docs", points=[PointStruct(
        id=idx, vector=embed(content),
        payload={"content": content, "tenant_id": tenant_id, "year": year})])

# 3. Filtered similarity search (filter applied BEFORE the vector walk)
def search(query, tenant_id, k=5):
    hits = client.query_points(
        collection_name="docs",
        query=embed(query),
        query_filter=Filter(must=[
            FieldCondition(key="tenant_id", match=MatchValue(value=tenant_id)),
            FieldCondition(key="year", range=Range(gte=2025)),
        ]),
        limit=k, with_payload=True,
        search_params={"hnsw_ef": 100},          # recall/latency knob
    ).points
    return [(h.payload["content"], h.score) for h in hits]

print(search("how do I rotate API keys?", tenant_id="acme"))

Pinecone (serverless)

from pinecone import Pinecone, ServerlessSpec
import openai

oai = openai.OpenAI()
pc = Pinecone(api_key="your-api-key")

def embed(text: str) -> list[float]:
    return oai.embeddings.create(model="text-embedding-3-small",
                                 input=text, dimensions=1536).data[0].embedding

# 1. Create a serverless index (nothing to size or scale)
if not pc.has_index("docs"):
    pc.create_index(
        name="docs", dimension=1536, metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"))
index = pc.Index("docs")

# 2. Upsert vectors with metadata; namespaces isolate tenants
def upsert(idx, content, tenant_id, year):
    index.upsert(namespace=tenant_id, vectors=[{
        "id": str(idx), "values": embed(content),            # Pinecone IDs must be strings
        "metadata": {"content": content, "year": year}}])

# 3. Filtered query within a tenant namespace
def search(query, tenant_id, k=5):
    res = index.query(
        namespace=tenant_id,
        vector=embed(query),
        top_k=k, include_metadata=True,
        filter={"year": {"$gte": 2025}})
    return [(m["metadata"]["content"], m["score"]) for m in res["matches"]]

print(search("how do I rotate API keys?", tenant_id="acme"))

Weaviate (native hybrid search)

import weaviate
from weaviate.classes.config import Configure, Property, DataType
from weaviate.classes.query import Filter
import openai

oai = openai.OpenAI()
client = weaviate.connect_to_local()          # or connect_to_weaviate_cloud(...)

def embed(text: str) -> list[float]:
    return oai.embeddings.create(model="text-embedding-3-small",
                                 input=text, dimensions=1536).data[0].embedding

# 1. Create a collection; we supply our own vectors
if not client.collections.exists("Docs"):
    client.collections.create(
        "Docs",
        vector_config=Configure.Vectors.self_provided(),
        properties=[
            Property(name="content", data_type=DataType.TEXT),
            Property(name="tenant_id", data_type=DataType.TEXT),
            Property(name="year", data_type=DataType.INT),
        ])
docs = client.collections.get("Docs")

# 2. Insert objects with their vectors
def upsert(content, tenant_id, year):
    docs.data.insert(
        properties={"content": content, "tenant_id": tenant_id, "year": year},
        vector=embed(content))

# 3. Hybrid search: BM25 keywords + vector, fused into one ranking, with a filter
def search(query, tenant_id, k=5):
    res = docs.query.hybrid(
        query=query,                 # keyword side (BM25)
        vector=embed(query),         # semantic side
        alpha=0.5,                   # 0 = keyword only, 1 = vector only
        filters=Filter.by_property("tenant_id").equal(tenant_id)
                & Filter.by_property("year").greater_or_equal(2025),
        # the score is only populated when requested explicitly
        return_metadata=weaviate.classes.query.MetadataQuery(score=True),
        limit=k)
    return [(o.properties["content"], o.metadata.score) for o in res.objects]

print(search("how do I rotate API keys?", tenant_id="acme"))
client.close()

Related Technologies