RAG Pipelines and Vector Databases: Production Retrieval-Augmented Generation

The definitive guide to building production RAG systems -- from document ingestion, chunking strategies, and embedding models to vector database selection (Pinecone, Weaviate, Qdrant, ChromaDB, pgvector, Milvus), retrieval strategies (hybrid search, HyDE, multi-query), reranking (Cohere, ColBERT, cross-encoders), advanced patterns (agentic RAG, graph RAG, corrective RAG), RAGAS evaluation, and production deployment with caching, streaming, and cost optimization.

Tags: RAG, Vector Databases, Embeddings, Pinecone, Weaviate, Qdrant, pgvector, ChromaDB, Milvus, LangChain, LlamaIndex, Cohere, OpenAI, RAGAS, BM25, ColBERT

1. RAG Architecture Overview

Retrieval-Augmented Generation (RAG) is the dominant pattern for grounding LLM responses in external knowledge. Instead of relying solely on a model's parametric memory (which is frozen at training time and prone to hallucination), RAG retrieves relevant documents from a knowledge base at inference time and includes them in the prompt context. This gives the model access to current, domain-specific, and verifiable information -- the difference between an LLM that guesses and one that cites sources.

A production RAG system has three phases: ingestion (processing documents into searchable chunks with embeddings stored in a vector database), retrieval (finding the most relevant chunks for a given query using vector similarity, keyword search, or hybrid approaches), and generation (feeding retrieved context to an LLM to produce a grounded answer). Each phase has its own set of engineering decisions that compound into overall system quality.

INGEST

Ingestion Pipeline

The offline pipeline that processes raw documents into queryable knowledge. Steps: (1) load documents from sources (PDFs, web pages, databases, APIs), (2) extract and clean text with metadata, (3) split text into chunks using a chosen strategy, (4) generate embedding vectors for each chunk, (5) store chunks + embeddings + metadata in a vector database. This pipeline runs on schedule or on document change, not at query time.

RETRIEVE

Retrieval Pipeline

The online pipeline that runs at query time. Steps: (1) embed the user query using the same model used for ingestion, (2) search the vector database for top-k similar chunks, (3) optionally apply hybrid search (combining vector + keyword BM25 scores), (4) rerank results using a cross-encoder or reranking API, (5) filter by metadata (date, source, permissions). Latency budget is typically 200-500ms for the entire retrieval step.

GENERATE

Generation Pipeline

The final step: constructing a prompt with the retrieved context and the user query, then calling the LLM. The prompt template controls how the model uses the context -- whether it should cite sources, refuse to answer when context is insufficient, or synthesize across multiple documents. Post-processing includes citation extraction, answer validation, and hallucination detection.
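
As a minimal sketch of the prompt construction described above, the function below numbers the retrieved chunks so the model can cite them and tells it to decline when the context is insufficient. The function name, the chunk dictionary shape, and the instruction wording are illustrative assumptions, not a prescribed template.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt; each chunk is {"source": str, "text": str} (assumed shape)."""
    # Number the chunks so the model can cite them as [1], [2], ...
    context = "\n\n".join(
        f"[{i}] (source: {c['source']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources as [n]. If the context is insufficient, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The returned string is what gets sent to the LLM; citation extraction can then look for the `[n]` markers in the answer.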

FEEDBACK

Feedback Loop

Production RAG systems need continuous improvement. Track retrieval quality (did the right documents get retrieved?), answer quality (was the response accurate and relevant?), and user satisfaction (thumbs up/down, follow-up questions). Feed this data back into chunking strategy refinement, embedding model selection, and prompt engineering. Without a feedback loop, RAG quality degrades silently as the knowledge base grows.

CONTEXT

Context Window Management

Modern LLMs offer 128K-1M token context windows, but stuffing more context does not always improve answers. Long-context models show a "lost in the middle" effect where information in the center of the context is recalled less reliably. Production RAG systems retrieve 5-20 highly relevant chunks rather than 100 marginally relevant ones, prioritizing precision over recall to keep generation quality high and cost low.

MULTI-INDEX

Multi-Index Architecture

Enterprise RAG often spans multiple knowledge sources: internal docs, Confluence, Slack, JIRA, code repositories, and external sources. Each source gets its own ingestion pipeline and potentially its own index with different chunking and embedding strategies. At query time, a router determines which indices to search, and results are merged and reranked across sources. This prevents one noisy source from drowning out high-quality results from another.

2. Document Processing

The quality of your RAG system is bounded by the quality of your document processing. Garbage in, garbage out applies doubly here: poorly extracted text leads to bad chunks, which lead to irrelevant retrievals, which lead to hallucinated answers. Invest heavily in document processing -- it is the highest-leverage improvement you can make to a RAG pipeline.

PDF

PDF Processing

PDFs are the hardest format to process reliably. Use PyMuPDF (fitz) for fast text extraction with layout preservation. For scanned PDFs, use Tesseract OCR or cloud APIs (Google Document AI, Azure AI Document Intelligence, formerly Form Recognizer). Unstructured handles mixed-content PDFs with tables, images, and multi-column layouts. Always preserve table structure as markdown or HTML -- flattening tables into paragraphs destroys relational information that the LLM needs.

HTML

HTML and Web Content

Use BeautifulSoup4 or trafilatura for web content extraction. Trafilatura is purpose-built for article extraction and handles boilerplate removal, date extraction, and author detection automatically. For JavaScript-rendered pages, use Playwright or Selenium to render before extraction. Strip navigation, footers, ads, and cookie banners -- they add noise to embeddings without adding information value.

MARKDOWN

Markdown and Structured Text

Markdown is the ideal input format for RAG because headers provide natural chunk boundaries and structure is explicit. Use heading-based splitting to create chunks that respect document hierarchy. Preserve code blocks as atomic units. For GitHub repositories, process README files, documentation, and code comments separately with format-specific strategies. MarkItDown by Microsoft converts Office documents (DOCX, PPTX, XLSX) to clean Markdown.

METADATA

Metadata Extraction

Every chunk should carry metadata: source URL, document title, section heading, page number, creation date, author, and document type. Metadata enables filtered retrieval (search only legal docs, only from 2025+, only from the engineering team). Use LLMs to generate synthetic metadata: summaries, key entities, topic classification, and relevance tags. This metadata-enriched approach consistently outperforms raw text chunks in retrieval precision.

TABLE

Table and Structured Data

Tables are information-dense and lose meaning when chunked naively. Extract tables as complete units using Camelot or Tabula for PDFs. Store each table as a single chunk with its caption and surrounding context. For very large tables, chunk by row groups while repeating headers. Consider generating a natural language summary of each table as an additional chunk -- LLMs find narrative descriptions easier to reason over than raw tabular data.
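
The row-group approach for large tables can be sketched as follows. This assumes the table has already been extracted into a header list and row lists; `chunk_table` and its parameters are hypothetical names used only for illustration.

```python
def chunk_table(header: list[str], rows: list[list[str]],
                rows_per_chunk: int = 50, caption: str = "") -> list[str]:
    """Split a table into row groups, repeating the caption and header in every chunk."""
    head = " | ".join(header)
    prefix = f"{caption}\n" if caption else ""
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        body = "\n".join(" | ".join(row) for row in group)
        chunks.append(f"{prefix}{head}\n{body}")
    return chunks
```

Because every chunk carries the header, a retrieved row group stays interpretable without the rest of the table.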

MULTIMODAL

Multi-modal Content

Documents with images, diagrams, and charts require multi-modal processing. Use vision LLMs (GPT-4o, Claude Sonnet 4.5) to generate text descriptions of images and diagrams. Store both the image reference and its generated description as a chunk. For architectural diagrams and flowcharts, extract relationships as structured text. Multi-modal RAG with ColPali embeds entire document pages as images, bypassing text extraction entirely for visually rich documents.

3. Chunking Strategies

Chunking determines the granularity of your knowledge base. Chunks too large dilute the embedding with irrelevant information and waste context window tokens. Chunks too small lose the context needed for the LLM to produce coherent answers. The optimal chunk size depends on your document type, embedding model, and use case -- but most production systems land between 256 and 1024 tokens per chunk.

BASIC

Fixed-Size Chunking

Split text into chunks of a fixed token count with configurable overlap. Simple, predictable, and works as a baseline. Typical settings: 512 tokens with 50-100 token overlap. The overlap ensures that information spanning a chunk boundary appears in at least one chunk. Use tiktoken for accurate token counting with OpenAI models. Drawback: splits mid-sentence and mid-paragraph, breaking semantic coherence.
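
A minimal sketch of fixed-size chunking with overlap. It operates on any list of tokens; the default sizes are illustrative, and for OpenAI models the input would be token IDs produced by tiktoken rather than the whitespace-split words used in the usage note.

```python
def fixed_size_chunks(tokens: list, size: int = 512, overlap: int = 64) -> list[list]:
    """Slide a window of `size` tokens forward by `size - overlap` tokens each step."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks
```

For example, `fixed_size_chunks("some long text ...".split(), size=4, overlap=1)` yields windows of four words in which the last word of one chunk is the first word of the next.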

SEMANTIC

Recursive Character Splitting

LangChain's RecursiveCharacterTextSplitter splits on a hierarchy of separators: first by double newlines (paragraphs), then single newlines, then sentences, then words. This respects natural text boundaries while keeping chunks within the size limit. The most commonly used chunking strategy in production RAG systems. Significantly outperforms fixed-size chunking for narrative documents with clear paragraph structure.
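
The idea behind recursive splitting can be sketched in a few lines. This is a simplified illustration, not LangChain's implementation: it measures size in characters, has no overlap, and the separator list is an assumption.

```python
def recursive_split(text: str, max_chars: int = 1000,
                    separators: tuple = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: hard-cut as a last resort.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks
```

Paragraphs stay intact when they fit; only an oversized paragraph is broken down further by lines, sentences, and finally words.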

ADVANCED

Semantic Chunking

Uses embedding similarity to determine chunk boundaries. Compute embeddings for each sentence, then split where cosine similarity between consecutive sentences drops below a threshold. This creates chunks where all sentences are semantically related. LangChain's SemanticChunker and LlamaIndex's SemanticSplitterNodeParser implement this. Higher quality than recursive splitting but 10-50x slower due to embedding computation per sentence during ingestion.
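
The boundary rule can be illustrated with a toy example. The `embed` function below is a bag-of-words stand-in so the sketch runs without a model; a real pipeline would call an embedding model instead, and the threshold value is an assumption to tune per corpus.

```python
import math
from collections import Counter

def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(sentence.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk wherever similarity between consecutive sentences drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append([sentences[i]])
        else:
            chunks[-1].append(sentences[i])
    return chunks
```

With a real model, the one-embedding-per-sentence cost in `vectors` is what makes this strategy slower than recursive splitting at ingestion time.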

HIERARCHY

Parent-Child (Hierarchical) Chunking

Create two levels of chunks: small child chunks (128-256 tokens) for precise retrieval, and larger parent chunks (1024-2048 tokens) for context. Embed and search over child chunks, but return the parent chunk to the LLM. This gives you the retrieval precision of small chunks with the contextual completeness of large chunks. LlamaIndex's AutoMergingRetriever implements this: when multiple child chunks from the same parent are retrieved, they merge into the parent automatically.
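
A rough sketch of the two-level structure, independent of any framework. The function names and the dictionary layout are hypothetical; the point is only that children carry a `parent_id` so that search hits can be swapped for their parents before generation.

```python
def build_parent_child(tokens: list, parent_size: int = 1024, child_size: int = 256):
    """Return (parents, children); each child records the id of the parent it was cut from."""
    parents, children = {}, []
    for parent_id, p_start in enumerate(range(0, len(tokens), parent_size)):
        parent = tokens[p_start:p_start + parent_size]
        parents[parent_id] = parent
        for c_start in range(0, len(parent), child_size):
            children.append({"parent_id": parent_id,
                             "tokens": parent[c_start:c_start + child_size]})
    return parents, children

def expand_to_parents(hit_children: list[dict], parents: dict) -> list:
    """Replace retrieved children with their unique parents, keeping first-hit order."""
    ordered_ids = []
    for child in hit_children:
        if child["parent_id"] not in ordered_ids:
            ordered_ids.append(child["parent_id"])
    return [parents[pid] for pid in ordered_ids]
```

Only the children are embedded and searched; `expand_to_parents` runs after retrieval, so two hits from the same parent contribute that parent to the prompt once.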

OVERLAP

Sliding Window

A variant of fixed-size chunking with high overlap (50%+). Each chunk overlaps significantly with its neighbors, ensuring that no information falls between cracks. Works well for dense technical content where context from surrounding text is critical. The tradeoff is a larger index (2-3x more chunks) and higher embedding costs. Consider this when retrieval precision matters more than storage and compute costs.
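
The index-size tradeoff follows from simple arithmetic: the window advances by `size - overlap` tokens, so halving the step roughly doubles the chunk count. A small sketch, with illustrative numbers:

```python
import math

def num_chunks(total_tokens: int, size: int, overlap: int) -> int:
    """Number of windows needed to cover `total_tokens` with the given size and overlap."""
    step = size - overlap
    return max(1, math.ceil((total_tokens - size) / step) + 1)

no_overlap = num_chunks(100_000, size=512, overlap=0)      # 196 chunks
half_overlap = num_chunks(100_000, size=512, overlap=256)  # 390 chunks
```

At 50% overlap the count is about 2x the non-overlapping baseline; pushing overlap toward two thirds brings it to about 3x.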

DOCUMENT

Document-Aware Chunking

Split by document structure: headers, sections, subsections, and logical boundaries. Markdown headers, HTML heading tags, and LaTeX sections provide natural boundaries. Each chunk inherits its section hierarchy as metadata (e.g., "Chapter 3 > Section 3.2 > Subsection 3.2.1"). This preserves the document's logical structure and enables hierarchical retrieval where the LLM knows exactly where each piece of information comes from.
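
For Markdown input, header-based splitting with inherited section paths might look like the sketch below. It is deliberately simple: it treats any line starting with `#` characters as a heading, so `#` comments inside fenced code blocks would need extra handling in a real pipeline.

```python
import re

def split_by_headers(markdown: str) -> list[dict]:
    """Split Markdown at headings; each chunk carries its breadcrumb as 'section' metadata."""
    chunks, path, lines = [], [], []

    def flush():
        body = "\n".join(lines).strip()
        if body:
            chunks.append({"section": " > ".join(path), "text": body})
        lines.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()  # close the chunk that belongs to the previous heading
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at the same or a deeper level
            path.append(match.group(2).strip())
        else:
            lines.append(line)
    flush()
    return chunks
```

Each chunk's `section` value has the same shape as the "Chapter 3 > Section 3.2 > Subsection 3.2.1" example above and can be stored as metadata or prepended to the chunk text.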

EXPERIMENTAL

Late Chunking (Contextual Embeddings)

Proposed by Jina AI: embed the entire document first using a long-context embedding model, then pool the token-level output embeddings within each chunk boundary. Each chunk embedding retains context from the full document because attention was computed across the entire document before splitting. This addresses the fundamental limitation of traditional chunking where each chunk is embedded in isolation. Supported by jina-embeddings-v3. Anthropic's contextual retrieval targets the same isolation problem by a different route: it prepends LLM-generated context to each chunk before embedding rather than pooling full-document token embeddings.

4. Embedding Models

Embedding models convert text into dense vectors that capture semantic meaning. The choice of embedding model determines retrieval quality: it sets the ceiling for how well your RAG system can match queries to relevant documents. As of April 2026, the landscape includes both proprietary API models (OpenAI, Cohere, Google, Voyage) and open-source models (BGE, E5, GTE, Jina) that rival or exceed proprietary options on benchmarks like MTEB.

OPENAI

OpenAI text-embedding-3-large

3072 dimensions, 8191 token context. The highest-quality OpenAI embedding model. Supports dimensions parameter for shortening embeddings (e.g., 1536 or 256) with minimal quality loss via Matryoshka Representation Learning. Pricing: $0.13 per million tokens. Use the full 3072 dimensions for maximum quality, or 1536 for a balance of quality and cost. The text-embedding-3-small variant (1536 dims, $0.02/M tokens) is a good default for cost-sensitive applications.
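
The API's dimensions parameter performs the shortening server-side. For vectors that are already stored at full length, the equivalent client-side operation is to keep the leading components and renormalize, sketched below. This only makes sense for models trained with Matryoshka Representation Learning; the function name is illustrative.

```python
import math

def shorten_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale to unit L2 norm."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head
```

Renormalizing matters because cosine and dot-product search assume unit-length vectors, and truncation shrinks the norm.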

COHERE

Cohere embed-v4

Cohere's latest embedding model (2026) with native multi-modal support for text and images. 1024 dimensions, 128K token context. Supports input_type parameter (search_document vs search_query) for asymmetric embeddings -- a crucial feature for RAG where documents and queries have different distributions. Offers int8 and binary quantization for 4-32x storage reduction. Top-3 on MTEB leaderboard. Built-in support for 100+ languages.

GOOGLE

Google text-embedding-005

768 dimensions, 2048 token context. Free via Google AI API (rate-limited) and paid via Vertex AI. Supports task_type parameter to optimize embeddings for retrieval, classification, clustering, or similarity. Strong multilingual performance. Lower dimensionality means smaller indexes and faster similarity search at the cost of some retrieval quality compared to 3072-dimension models. Good choice for GCP-native stacks.

VOYAGE

Voyage AI voyage-3-large

2048 dimensions, 32K token context. Endorsed by Anthropic as the recommended embedding model for use with Claude. Excels at code retrieval (voyage-code-3 variant). Supports Matryoshka shortening and asymmetric query/document encoding. Competitive with OpenAI's large model on MTEB while offering longer context for processing large document chunks without truncation.

OPEN-SOURCE

BGE / E5 / GTE (Open-Source)

BGE-M3 by BAAI: multi-lingual, multi-granularity, multi-functionality model that generates dense, sparse, and ColBERT embeddings simultaneously. 1024 dims, 8192 tokens. E5-mistral-7b-instruct: instruction-tuned, top MTEB scorer, 4096 dims. GTE-Qwen2: 1.5B parameter model with 8192 token context, strong multilingual support. Run locally with sentence-transformers or Ollama. Zero API cost, full data privacy, but requires GPU for production throughput.

JINA

Jina Embeddings v3

1024 dimensions, 8192 token context. Supports task-specific LoRA adapters for retrieval, classification, and similarity. Key innovation: late chunking support where the model processes the full document and produces contextually-aware chunk embeddings. Competitive pricing at $0.02/M tokens. Open-weights variant available for self-hosting. Excellent for multilingual RAG with 89 languages supported.

Model                    Dims   Max Tokens   MTEB Avg   Price / M tokens
text-embedding-3-large   3072   8,191        64.6       $0.13
text-embedding-3-small   1536   8,191        62.3       $0.02
Cohere embed-v4          1024   128,000      66.4       $0.10
voyage-3-large           2048   32,000       65.8       $0.18
text-embedding-005       768    2,048        61.2       Free / $0.0001
BGE-M3                   1024   8,192        65.1       Self-hosted
jina-embeddings-v3       1024   8,192        65.5       $0.02

5. Vector Databases Comparison

Vector databases store embeddings and enable fast similarity search at scale. The choice depends on your deployment model (managed vs self-hosted), scale requirements (thousands vs billions of vectors), query patterns (pure vector vs hybrid search), and existing infrastructure. All production-grade vector databases support HNSW indexing, metadata filtering, and sub-100ms query latency for million-scale collections.

MANAGED

Pinecone

Fully managed, serverless vector database. Zero infrastructure to manage -- create an index and start inserting vectors. Automatic scaling, replication, and backups. Supports namespaces for multi-tenant isolation, sparse-dense hybrid search, and metadata filtering. Serverless pricing: pay per query and storage ($0.04/GB/month storage, $8/1M queries). The lowest-friction option for teams that want RAG without database operations. Limitation: vendor lock-in and no self-hosted option.

FLEXIBLE

Weaviate

Open-source vector database with built-in vectorization modules. Unique feature: integrates embedding model inference directly -- pass raw text and Weaviate handles embedding via configured model providers. Native hybrid search combining BM25 and vector scoring. GraphQL API for expressive queries. Supports multi-tenancy, RBAC, and backup/restore. Available as Weaviate Cloud (managed) or self-hosted via Docker/Kubernetes. Strong choice for teams that want both vector and keyword search in one system.

PERFORMANCE

Qdrant

Written in Rust for maximum performance. Supports scalar, binary, and product quantization for 4-32x memory reduction. Advanced filtering with payload indexes that apply during vector search rather than as a post-filter, so strict filters do not leave you with fewer than k results as long as enough points match. Supports multi-vector points (e.g., title + body embeddings per document). Sparse vector support for hybrid search. Qdrant Cloud (managed) or self-hosted. The performance-first choice for latency-sensitive RAG at scale.
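
The 4-32x range follows from bits per dimension: float32 uses 32 bits, int8 scalar quantization 8 bits, and binary quantization 1 bit. A back-of-the-envelope sketch for raw vector storage only (graph index overhead and payloads excluded; the collection size is an illustrative assumption):

```python
def vectors_size_gb(num_vectors: int, dims: int, bits_per_dim: int) -> float:
    """Raw storage for the vectors alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_vectors * dims * bits_per_dim / 8 / 1e9

full = vectors_size_gb(10_000_000, 1024, 32)    # float32: 40.96 GB
scalar = vectors_size_gb(10_000_000, 1024, 8)   # int8:    10.24 GB (4x smaller)
binary = vectors_size_gb(10_000_000, 1024, 1)   # binary:   1.28 GB (32x smaller)
```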

PROTOTYPING

ChromaDB

Lightweight, developer-friendly vector database designed for AI applications. Runs in-process (embedded mode) with zero configuration -- pip install chromadb and start indexing. Supports persistent storage and client-server mode for production. Built-in embedding functions for OpenAI, Cohere, and sentence-transformers. Ideal for prototyping, notebooks, and single-machine RAG applications. Not designed for billion-scale production workloads -- graduate to Pinecone, Qdrant, or pgvector when you outgrow it.

POSTGRES

pgvector

PostgreSQL extension that adds vector similarity search to your existing database. Store embeddings alongside relational data with full SQL access, ACID transactions, joins, and your existing backup/monitoring stack. Supports HNSW and IVFFlat indexes. pgvecto.rs (by Tensorchord) offers better performance with Rust-based indexing. The pragmatic choice when you already run PostgreSQL and want to avoid adding another database to your stack. Scales to ~10M vectors on a single instance.

SCALE

Milvus

Purpose-built for billion-scale vector search. Distributed architecture with separation of storage and compute. Supports 10+ index types including GPU-accelerated IVF and DiskANN for datasets larger than memory. Native hybrid search with BM25. Run as Milvus Lite (embedded), Milvus Standalone (single node), or Milvus Distributed (Kubernetes). Zilliz Cloud offers a fully managed version. The choice for teams indexing 100M+ vectors who need horizontal scaling with consistent sub-100ms latency.

Database   Deployment             Hybrid Search        Max Scale       Best For
Pinecone   Managed only           Sparse-dense         Billions        Zero-ops teams
Weaviate   Cloud + self-hosted    BM25 + vector        Billions        Built-in vectorization
Qdrant     Cloud + self-hosted    Sparse vectors       Billions        Low-latency, filtering
ChromaDB   Embedded + server      No native            Millions        Prototyping, notebooks
pgvector   Self-hosted (PG ext)   Via SQL + tsvector   ~10M per node   Existing PostgreSQL stack
Milvus     Cloud + self-hosted    BM25 + vector        10B+            Massive scale

6. Retrieval Strategies

Retrieval is where most RAG quality issues originate. A perfect embedding model with a naive top-k retrieval strategy will underperform a decent embedding model with thoughtful retrieval engineering. The strategies below progress from simple to advanced, and production systems typically combine multiple approaches.

BASIC

Vector Similarity Search

The simplest retrieval strategy: embed the query, find the top-k nearest vectors by cosine similarity (or dot product / L2 distance). Works well when query and document language match closely. Typical k values: 5-20. Issues: returns redundant results when multiple chunks cover the same topic, and misses relevant documents when the query uses different terminology than the source documents.
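
Stripped of the database, top-k retrieval is a brute-force scan, sketched here in plain Python. A vector database replaces the linear scan with an approximate index such as HNSW; the in-memory `index` dictionary is an assumption for illustration.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query: list[float], index: dict[str, list[float]], k: int = 5) -> list[tuple[str, float]]:
    """Score every stored vector against the query and return the k best (id, score) pairs."""
    scored = [(chunk_id, cosine(query, vec)) for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```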

DIVERSITY

Maximum Marginal Relevance (MMR)

Balances relevance and diversity in retrieved results. MMR iteratively selects documents that are both similar to the query and dissimilar to already-selected documents. Controlled by a lambda parameter (1.0 = pure relevance, 0.0 = pure diversity). This prevents the common failure mode where 5 out of 5 retrieved chunks say the same thing. Essential for multi-document synthesis tasks. Built into LangChain and LlamaIndex retrievers.
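
A compact sketch of the MMR selection loop. The parameter `lam` plays the role of lambda above; candidates are passed as an id-to-vector dictionary, which is an assumption made to keep the example self-contained.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query: list[float], candidates: dict[str, list[float]],
        k: int = 5, lam: float = 0.7) -> list[str]:
    """Greedily pick the candidate with the best relevance-minus-redundancy score."""
    selected: list[str] = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        def score(cid: str) -> float:
            relevance = cosine(query, remaining[cid])
            # Redundancy is the similarity to the closest already-selected document.
            redundancy = max((cosine(remaining[cid], candidates[s]) for s in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected
```

With `lam=1.0` the loop reduces to plain top-k by similarity; lowering it makes a near-duplicate of an already selected chunk lose out to a less similar but novel one.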

HYBRID

Hybrid Search (BM25 + Vector)

Combines keyword-based BM25 scoring with vector similarity scoring. BM25 excels at exact term matching (product names, error codes, acronyms) where vector search struggles. Vectors excel at semantic matching where BM25 fails (e.g., "cost" matching "pricing"). Hybrid search catches both. Merge strategies: reciprocal rank fusion (RRF), weighted linear combination, or learned score fusion. Weaviate, Qdrant, and Pinecone support hybrid search natively. For pgvector, combine with PostgreSQL's tsvector full-text search.
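
Of the merge strategies listed, the weighted linear combination is sketched below. BM25 scores and cosine similarities live on different scales, so this sketch min-max normalizes each list first; the normalization choice and the `alpha` default are assumptions, not a recommendation from any particular database.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a list with identical scores maps to all ones."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_scores(bm25: dict[str, float], vector: dict[str, float],
                  alpha: float = 0.5) -> dict[str, float]:
    """alpha weights the vector score; (1 - alpha) weights BM25. Missing scores count as 0."""
    b, v = min_max(bm25), min_max(vector)
    return {doc: alpha * v.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0)
            for doc in set(b) | set(v)}
```

A document found by both retrievers accumulates weight from each side, which is why it tends to outrank documents found by only one.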

TRANSFORM

HyDE (Hypothetical Document Embeddings)

Uses the LLM to generate a hypothetical answer to the query, then embeds that answer and searches for similar documents. The intuition: a hypothetical answer is in the same "language" as the documents (detailed, technical, complete), while the query is short and informal. This bridges the query-document distribution gap. Effective for complex questions where the query alone does not contain enough semantic signal. Adds one LLM call of latency. Implemented in LangChain as HypotheticalDocumentEmbedder.

EXPANSION

Multi-Query Retrieval

Uses the LLM to generate 3-5 alternative phrasings of the user query, then retrieves documents for each phrasing and deduplicates results. This increases recall by capturing different aspects of the query that a single embedding might miss. Example: "How do I deploy to production?" generates variants like "production deployment steps", "CI/CD pipeline setup", "release management process". LangChain's MultiQueryRetriever implements this pattern.
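
The retrieve-and-deduplicate half of the pattern can be sketched without an LLM. Here the query variants are passed in (in practice an LLM generates them), and `retrieve` is a hypothetical callable that returns ranked chunk IDs for one query.

```python
from typing import Callable

def multi_query_retrieve(variants: list[str],
                         retrieve: Callable[[str], list[str]],
                         k: int = 5) -> list[str]:
    """Run retrieval once per query variant and merge results, dropping duplicate chunk ids."""
    merged: dict[str, None] = {}  # dicts preserve insertion order, so first-seen order is kept
    for variant in variants:
        for chunk_id in retrieve(variant)[:k]:
            merged.setdefault(chunk_id, None)
    return list(merged)
```

The merged list can grow to `k` times the number of variants, so a reranking step usually follows to cut it back down.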

CONTEXTUAL

Contextual Compression

After retrieving chunks, use an LLM to extract only the relevant portions of each chunk relative to the query. A 1000-token chunk might contain only 100 tokens of relevant information. Contextual compression reduces noise in the generation context and lets you retrieve more chunks within the same context window budget. LangChain's ContextualCompressionRetriever chains a base retriever with a compressor (LLM or cross-encoder based).

ROUTING

Query Routing

Not all queries should search the same index or use the same retrieval strategy. A router (LLM-based or classifier-based) analyzes the query and routes to the appropriate index, collection, or strategy. Example: factual questions use dense retrieval, keyword-heavy queries use BM25, analytical questions use a knowledge graph. LangChain's RouterChain and LlamaIndex's RouterQueryEngine implement this. Critical for multi-index RAG architectures.

7. Reranking

Reranking is the highest-leverage post-retrieval improvement. Initial retrieval (embedding similarity) is a fast but imprecise first pass. Rerankers use more expensive models to re-score and reorder the top-k results based on deep query-document interaction. A bi-encoder (embedding model) processes query and document independently; a cross-encoder processes them jointly, attending to interactions between query and document tokens. This joint processing catches relevance signals that independent encoding misses.

API

Cohere Rerank

The most widely used reranking API. rerank-v3.5 (latest) processes up to 4096 tokens per document and supports 100+ languages. Pass your query and top-k documents; get back relevance scores and reranked order. Pricing: $2 per 1000 search units. Consistently adds 5-15% retrieval precision improvement over embedding-only retrieval. Supports structured document inputs (JSON fields) for multi-field reranking. Drop-in integration with LangChain and LlamaIndex.

MODEL

Cross-Encoder Models

Open-source cross-encoder models for self-hosted reranking. BAAI/bge-reranker-v2-m3 supports multilingual reranking. cross-encoder/ms-marco-MiniLM-L-12-v2 is lightweight and fast. Run via sentence-transformers with GPU for production throughput. A cross-encoder needs one forward pass per query-document pair and nothing can be precomputed, so latency scales linearly with the number of documents. Rerank the top 20-50 documents from initial retrieval, not the entire corpus.

TOKEN-LEVEL

ColBERT (Late Interaction)

ColBERT computes token-level embeddings for both query and document, then scores via MaxSim (maximum similarity between each query token and all document tokens). This late interaction architecture is faster than cross-encoders (document embeddings can be precomputed) while capturing fine-grained relevance signals that bi-encoders miss. ColBERTv2 and RAGatouille provide easy-to-use Python implementations. Ideal for high-throughput reranking where cross-encoder latency is prohibitive.
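
The MaxSim score itself is a short computation once token embeddings exist. The sketch below assumes the token vectors are already L2-normalized, so a dot product equals cosine similarity; producing those vectors is the model's job and is not shown.

```python
def maxsim(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """For each query token take its best dot product against any document token, then sum."""
    return sum(
        max(sum(q * d for q, d in zip(q_vec, d_vec)) for d_vec in doc_tokens)
        for q_vec in query_tokens
    )
```

Because `doc_tokens` does not depend on the query, it can be computed at ingestion time; only the query tokens are embedded at query time.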

FUSION

Reciprocal Rank Fusion (RRF)

A simple, effective method for merging ranked lists from multiple retrieval strategies. For each document, compute RRF(d) = sum(1 / (k + rank_i)) across all lists, where k is a constant (typically 60). This gives more weight to documents that appear high in multiple lists. No model required -- purely rank-based. Use RRF to merge BM25 results with vector results, or to combine results from multiple embedding models. Implemented natively in Elasticsearch 8.x and Weaviate.
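
The formula translates directly into code. In this sketch each ranked list is a list of document IDs ordered best first, and ranks start at 1.

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each appearance of a document adds 1 / (k + rank) to its score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

fused = rrf([["a", "b", "c"],    # e.g. BM25 order
             ["b", "d", "a"]])   # e.g. vector order
```

Here `b` comes out first (ranks 2 and 1) ahead of `a` (ranks 1 and 3), while `d` and `c`, each present in only one list, trail behind.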

LEARNED

FlashRank and Lightweight Rerankers

FlashRank provides sub-50ms reranking on CPU using distilled models (14-86M parameters). mixedbread-ai/mxbai-rerank-large-v1 offers strong quality with fast inference. These lightweight rerankers are practical for latency-constrained environments where Cohere API calls or large cross-encoders add too much latency. Run them on the same machine as your application server -- no GPU required for models under 100M parameters.

LLM

LLM-as-Reranker

Use an LLM to score relevance of each retrieved document to the query. The LLM evaluates semantic relevance, factual alignment, and completeness in ways that embedding similarity cannot. Prompt: "Rate how relevant this passage is to the question on a scale of 1-5." Expensive (one LLM call per document) but highest quality for critical applications. Use with a fast, cheap model (Claude Haiku, GPT-4o-mini) to keep cost manageable. Reserve for high-value queries or as a fallback when other rerankers show low confidence.

8. Advanced RAG Patterns

Standard RAG (retrieve-then-generate) hits a ceiling for complex questions that require multi-step reasoning, self-correction, or heterogeneous data sources. Advanced RAG patterns extend the basic architecture with agentic behavior, graph-based knowledge, multi-modal retrieval, and self-evaluation loops. These patterns increase system complexity but unlock capabilities that simple RAG cannot achieve.

AGENTIC

Agentic RAG

Wraps the RAG pipeline inside an AI agent that can decide when to retrieve, what to retrieve, and whether to retrieve again. The agent formulates search queries, evaluates retrieved results, reformulates if results are poor, and synthesizes answers across multiple retrieval rounds. Built with LangGraph or the Claude Agent SDK, where retrieval is a tool the agent invokes as needed. This is the dominant production pattern for 2025-2026: retrieval as a tool call, not a fixed pipeline step.

GRAPH

Graph RAG

Builds a knowledge graph from documents using LLM-based entity and relationship extraction, then uses graph traversal for retrieval alongside vector search. Excels at multi-hop questions ("What companies were founded by people who studied at Stanford and worked at Google?") where vector search fails because the answer requires connecting multiple facts. Microsoft's GraphRAG library implements community detection and summarization over the graph for thematic retrieval. Combine with Neo4j or a property graph database for storage.

SELF-CORRECT

Corrective RAG (CRAG)

Adds a self-evaluation step after retrieval: an LLM grades each retrieved document as "relevant", "ambiguous", or "irrelevant". If documents are relevant, proceed to generation. If ambiguous, retrieve from additional sources (e.g., web search). If irrelevant, discard the retrieved chunks and fall back to web search or refuse to answer. This prevents the common failure mode where the LLM confidently generates answers from irrelevant retrieved context. Implemented as a LangGraph workflow with conditional routing.
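
The conditional routing can be sketched as a plain function over the per-document grades. How grades are aggregated across documents is an assumption here (any relevant document is enough to generate); the action names are hypothetical labels for graph edges.

```python
def crag_route(grades: list[str]) -> str:
    """Map per-document grades ('relevant' / 'ambiguous' / 'irrelevant') to the next action."""
    if any(g == "relevant" for g in grades):
        return "generate"
    if any(g == "ambiguous" for g in grades):
        return "retrieve_more"        # e.g. supplement with web search
    return "web_search_or_refuse"     # nothing usable was retrieved
```

In a graph-based workflow this function would sit on the conditional edge after the grading node.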

MULTIMODAL

Multi-Modal RAG

Extends RAG to handle images, tables, charts, and diagrams alongside text. Two approaches: (1) Generate text descriptions of visual content and embed them as text chunks. (2) Use multi-modal embedding models (CLIP, Cohere embed-v4, ColPali) to embed images directly alongside text in the same vector space. ColPali embeds entire document pages as images, bypassing text extraction entirely. Use vision LLMs (GPT-4o, Claude) for generation over retrieved visual content.

ADAPTIVE

Self-RAG

A fine-tuned model that decides dynamically whether to retrieve, generates reflection tokens (IsRel, IsSup, IsUse) to evaluate retrieval quality and answer quality at each step, and can regenerate if self-evaluation fails. Unlike standard RAG where retrieval is always triggered, Self-RAG skips retrieval for questions the model can answer from parametric memory and activates retrieval only when needed. Reduces latency for simple questions while maintaining accuracy for knowledge-intensive ones.

STEP-BACK

Step-Back Prompting

Before retrieval, the LLM generates a higher-level, more abstract version of the query. Example: "What is the solubility of calcium hydroxide at 25C?" becomes "What are the chemical properties of calcium hydroxide?" The step-back query retrieves broader context that includes the specific answer. This helps with highly specific questions where the exact phrasing might not match any document. Combine with the original query for a multi-query approach.

RAPTOR

RAPTOR (Tree-Based Retrieval)

Builds a tree of document summaries at multiple levels of abstraction. Leaf nodes are original chunks, intermediate nodes are summaries of chunk clusters, and the root is a summary of the entire corpus. Retrieval searches across all tree levels, matching detail-level questions to leaf chunks and thematic questions to summary nodes. Particularly effective for long documents and book-length content where both specific facts and broad themes need to be retrievable.

CACHE

Contextual Retrieval (Anthropic)

Anthropic's approach: prepend a short context explanation to each chunk before embedding. Use the LLM to generate a 50-100 token situating context: "This chunk is from the Q2 2025 earnings report, specifically the revenue breakdown section." This context is embedded with the chunk and, in hybrid setups, indexed for BM25 as well. In Anthropic's published benchmarks, contextual embeddings combined with contextual BM25 reduced the top-20 retrieval failure rate by 49%, and adding a reranker raised the reduction to 67% compared to standard RAG.

9. Evaluation (RAGAS Framework)

You cannot improve what you do not measure. RAG evaluation is harder than standard ML evaluation because you need to assess retrieval quality and generation quality independently, as well as their interaction. The RAGAS (Retrieval Augmented Generation Assessment) framework is the most widely adopted evaluation toolkit, providing automated LLM-as-judge metrics that correlate well with human judgments. Most of them need no ground-truth labels; context recall is the exception.

METRIC

Faithfulness

Measures whether the generated answer is grounded in the retrieved context. The LLM extracts individual claims from the answer, then verifies each claim against the retrieved documents. Faithfulness = (supported claims) / (total claims). A faithfulness score below 0.8 indicates hallucination -- the model is generating information not present in the context. This is the most critical metric: a RAG system that hallucinates is worse than one that refuses to answer.

METRIC

Answer Relevancy

Measures whether the answer addresses the question. The evaluator LLM generates questions that the answer would address, then computes cosine similarity between these generated questions and the original question. High similarity = relevant answer. Low similarity = the answer went off-topic. This catches the failure mode where the model generates a factually correct but irrelevant response because the retrieved context led it astray.

METRIC

Context Precision

Measures whether the retrieved documents are relevant to the question. For each retrieved chunk, the evaluator determines if it contains information needed to answer the query. In its simplest form, precision = (relevant chunks) / (total retrieved chunks); the RAGAS implementation is additionally rank-aware, so relevant chunks near the top of the list count for more. Low precision means your retrieval is pulling in noise: irrelevant chunks that waste context window and may confuse the model. Target: above 0.7 for production systems.
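The sketch below computes the plain ratio from the text and, for comparison, a rank-weighted variant (average precision over the relevant positions). The second function is an assumption about the general shape of rank-aware scoring, not a copy of any library's code; the relevance flags are illustrative.

```python
def precision(relevance: list[bool]) -> float:
    """Plain ratio: relevant chunks / total retrieved chunks."""
    return sum(relevance) / len(relevance)


def rank_weighted_precision(relevance: list[bool]) -> float:
    """Average of precision@k taken at each rank k that holds a relevant chunk."""
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


top_heavy = [True, True, False, False]      # relevant chunks ranked first
bottom_heavy = [False, False, True, True]   # relevant chunks ranked last

print(precision(top_heavy), precision(bottom_heavy))
print(rank_weighted_precision(top_heavy), rank_weighted_precision(bottom_heavy))
```

Both lists have the same plain ratio, but the rank-weighted score separates them, which matters when only the top few chunks fit in the prompt.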

METRIC

Context Recall

Measures whether the retrieved documents contain all the information needed to answer the question. Unlike precision (which measures noise), recall measures coverage. Requires ground-truth answers for computation: the evaluator checks if each claim in the ground-truth answer is supported by the retrieved context. Low recall means your retrieval is missing relevant documents -- you need better embeddings, more chunks, or query expansion.

TOOL

RAGAS Framework

Open-source evaluation framework (pip install ragas). Computes faithfulness, answer relevancy, context precision, and context recall using LLM-as-judge. Supports test set generation from your documents. Integrates with LangSmith, Weights & Biases, and CI/CD pipelines for continuous evaluation. Use RAGAS to compare chunking strategies, embedding models, and retrieval parameters systematically rather than relying on vibes. Run evaluation on 100-500 diverse test queries for statistically meaningful results.

TOOL

Custom Evaluation Dimensions

Beyond RAGAS metrics, production systems evaluate: latency (end-to-end response time at p95), cost per query (embedding + retrieval + reranking + generation tokens), citation accuracy (do citations point to the correct source?), coverage (what percentage of queries get satisfactory answers?), and freshness (are answers based on the most recent version of documents?). Build a dashboard tracking these metrics over time to detect regressions early.
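Two of these dimensions reduce to simple arithmetic over logged queries. The sketch below assumes you log a latency and a satisfied/unsatisfied flag per query; the nearest-rank percentile definition and the sample values are choices made for illustration.

```python
def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[rank - 1]


def coverage(satisfactory: list[bool]) -> float:
    """Share of queries that received a satisfactory answer."""
    return sum(satisfactory) / len(satisfactory)


latencies_ms = list(range(100, 2100, 100))  # 20 logged queries: 100..2000 ms
satisfied = [True] * 17 + [False] * 3

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(p50, p95, coverage(satisfied))
```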

from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

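# Prepare evaluation dataset (column names follow the older ragas schema;
# newer releases may expect different names, so check your installed version)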
eval_data = Dataset.from_dict({
    "question": questions,
    "answer": generated_answers,
    "contexts": retrieved_contexts,      # list[list[str]]
    "ground_truth": reference_answers     # for context_recall
})

# Run evaluation
results = evaluate(
    dataset=eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=ChatOpenAI(model="gpt-4o"),       # evaluator LLM
    embeddings=OpenAIEmbeddings()
)
print(results)
# Example output (illustrative values):
# {'faithfulness': 0.87, 'answer_relevancy': 0.91,
#  'context_precision': 0.78, 'context_recall': 0.82}

10. Production Deployment

Moving RAG from a notebook to production involves caching, streaming, multi-tenancy, monitoring, and cost optimization. The difference between a demo and a production system is not the retrieval algorithm -- it is the infrastructure that makes the algorithm reliable, fast, and affordable at scale.

CACHE

Semantic Caching

Cache query-response pairs indexed by query embedding similarity. When a new query is semantically similar (cosine similarity > 0.95) to a cached query, return the cached response without hitting the vector database or LLM. Redis with RedisVL or GPTCache keeps the similarity lookup itself very fast, though the incoming query must still be embedded before the cache can be checked. Semantic caching reduces costs by 30-60% for workloads with repetitive queries (customer support, FAQ bots). Set TTL based on knowledge base update frequency to prevent stale answers.
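A minimal in-memory sketch of the lookup logic, assuming queries are already embedded. The `SemanticCache` class, its linear scan, and the toy vectors are illustrative only; a production cache would delegate the similarity search to a vector index such as the Redis-based options above.

```python
import math
import time


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


class SemanticCache:
    """In-memory semantic cache with a similarity threshold and a TTL."""

    def __init__(self, threshold: float = 0.95, ttl_seconds: float = 3600.0):
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self._entries: list[tuple[list[float], str, float]] = []

    def put(self, query_vec: list[float], response: str, now: float = None) -> None:
        stored_at = time.time() if now is None else now
        self._entries.append((query_vec, response, stored_at))

    def get(self, query_vec: list[float], now: float = None):
        now = time.time() if now is None else now
        # Evict expired entries so stale answers are never served.
        self._entries = [e for e in self._entries if now - e[2] < self.ttl_seconds]
        best_response, best_sim = None, self.threshold
        for vec, response, _ in self._entries:
            sim = cosine(query_vec, vec)
            if sim > best_sim:  # strictly above the threshold, best match wins
                best_response, best_sim = response, sim
        return best_response


cache = SemanticCache(threshold=0.95, ttl_seconds=60.0)
cache.put([1.0, 0.0], "Reset your password from the Settings page.", now=0.0)

hit = cache.get([0.99, 0.05], now=10.0)    # near-duplicate query
miss = cache.get([0.0, 1.0], now=10.0)     # unrelated query
expired = cache.get([1.0, 0.0], now=120.0) # same query after the TTL
print(hit, miss, expired)
```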

UX

Streaming Responses

Stream the LLM response token-by-token using Server-Sent Events (SSE) or WebSockets. The user sees the first token within 500ms instead of waiting 3-5 seconds for the complete response. Stream source citations alongside the response so users can verify claims in real-time. LangChain's astream_events and LlamaIndex's stream_chat support streaming with retrieval metadata. Always stream in production -- the perceived latency improvement is dramatic.
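A minimal sketch of the SSE wire format for a streamed answer, independent of any web framework. The event names `sources`, `token`, and `done`, and the choice to send citations before the first token, are assumptions of this sketch.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str], sources: list[str]) -> Iterator[str]:
    """Yield Server-Sent Events frames: sources first, then one frame per token."""
    yield f"event: sources\ndata: {json.dumps(sources)}\n\n"
    for token in tokens:
        # JSON-encode so newlines inside a token cannot break SSE framing.
        yield f"event: token\ndata: {json.dumps(token)}\n\n"
    yield "event: done\ndata: {}\n\n"


frames = list(sse_events(["Fail", "over is\n", "automatic."], ["manual.pdf#p12"]))
for frame in frames:
    print(frame, end="")
```

In a real service the generator would be returned from a streaming HTTP response with the `text/event-stream` content type, and `tokens` would be the LLM's token stream.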

ISOLATION

Multi-Tenant Architecture

Isolate each customer's knowledge base to prevent data leakage. Strategies: (1) separate collections per tenant (strongest isolation, highest cost), (2) namespace/partition within a shared collection (Pinecone namespaces, Weaviate multi-tenancy, Qdrant payload filtering), (3) metadata-based filtering with row-level security. Always apply tenant filters before retrieval, never after. Test isolation by querying as Tenant A and verifying zero results from Tenant B's documents.
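The filter-before-retrieval rule and the isolation test can be sketched in memory. The `search_for_tenant` function, the `tenant_id` field, and the toy vectors are hypothetical; in a real system the filter is pushed down into the vector database query.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search_for_tenant(
    chunks: list[dict], query_vec: list[float], tenant_id: str, top_k: int = 3
) -> list[dict]:
    """Restrict the candidate set to one tenant, then rank by similarity."""
    if not tenant_id:
        raise ValueError("tenant_id is required; refusing an unscoped search")
    scoped = [c for c in chunks if c["tenant_id"] == tenant_id]  # filter first
    ranked = sorted(scoped, key=lambda c: dot(query_vec, c["vector"]), reverse=True)
    return ranked[:top_k]


chunks = [
    {"tenant_id": "tenant_a", "text": "A: refund policy", "vector": [0.9, 0.1]},
    {"tenant_id": "tenant_a", "text": "A: shipping times", "vector": [0.2, 0.8]},
    {"tenant_id": "tenant_b", "text": "B: refund policy", "vector": [1.0, 0.0]},
]

# Tenant B owns the best global match, but it must never reach Tenant A.
results = search_for_tenant(chunks, [1.0, 0.0], "tenant_a")
print([r["text"] for r in results])
```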

OBSERVE

Monitoring and Observability

Track four categories: retrieval metrics (query latency, chunks retrieved, filter hit rates), generation metrics (token usage, LLM latency, error rates), quality metrics (RAGAS scores on a sample of production queries), and business metrics (user satisfaction, escalation rate, answer coverage). Use LangSmith or Langfuse for end-to-end tracing. Set alerts on faithfulness score drops -- they indicate retrieval degradation or knowledge base staleness.
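One way to alert on faithfulness drops is a rolling mean over sampled production scores. This is a sketch under assumptions: the `FaithfulnessMonitor` class, the window size, and the 0.8 floor are illustrative, and a real deployment would emit the alert through its tracing or metrics stack.

```python
from collections import deque


class FaithfulnessMonitor:
    """Flag when the rolling mean of sampled faithfulness scores is too low."""

    def __init__(self, window: int = 50, floor: float = 0.8):
        self.scores = deque(maxlen=window)
        self.floor = floor

    def record(self, score: float) -> bool:
        """Add a score; return True once a full window averages below the floor."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and sum(self.scores) / len(self.scores) < self.floor


monitor = FaithfulnessMonitor(window=3, floor=0.8)
alerts = [monitor.record(s) for s in [0.9, 0.85, 0.9, 0.6, 0.6]]
print(alerts)
```

Waiting for a full window avoids alerting on the first few samples after a deploy.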

COST

Cost Optimization

RAG costs = embedding costs + vector DB costs + reranking costs + LLM generation costs. Optimize each: (1) use smaller embedding models for low-stakes applications (text-embedding-3-small), (2) apply quantization to reduce vector storage 4-8x, (3) use a lightweight reranker (FlashRank) instead of Cohere for cost-sensitive workloads, (4) use smaller LLMs (Claude Haiku, GPT-4o-mini) for straightforward Q&A, (5) cache aggressively. A well-optimized RAG pipeline costs $0.001-0.01 per query. An unoptimized one costs $0.10-0.50.
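The cost sum can be made concrete with a small calculator. All prices below are hypothetical placeholders, not vendor quotes; the sketch also assumes that a semantic-cache hit still pays for the query embedding but skips the vector database, reranker, and LLM.

```python
def cost_per_query(
    embed_tokens: int,
    prompt_tokens: int,
    completion_tokens: int,
    rerank_calls: int,
    prices: dict,
    cache_hit_rate: float = 0.0,
) -> float:
    """Expected dollar cost of one query under the given (hypothetical) prices."""
    embed = embed_tokens / 1_000_000 * prices["embed_per_mtok"]
    generate = (
        prompt_tokens / 1_000_000 * prices["input_per_mtok"]
        + completion_tokens / 1_000_000 * prices["output_per_mtok"]
    )
    rerank = rerank_calls * prices["rerank_per_call"]
    vector_db = prices["vector_db_per_query"]  # monthly bill amortized per query
    # Cache hits skip everything after the embedding step.
    return embed + (1.0 - cache_hit_rate) * (vector_db + rerank + generate)


# Hypothetical prices, for illustration only.
prices = {
    "embed_per_mtok": 0.10,
    "input_per_mtok": 1.00,
    "output_per_mtok": 4.00,
    "rerank_per_call": 0.002,
    "vector_db_per_query": 0.0001,
}

uncached = cost_per_query(50, 3000, 300, 1, prices)
cached = cost_per_query(50, 3000, 300, 1, prices, cache_hit_rate=0.4)
print(f"{uncached:.6f} {cached:.6f}")
```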

INGEST

Incremental Ingestion

Re-embedding your entire knowledge base on every update is expensive and slow. Implement incremental ingestion: track document checksums, only re-process changed documents, and upsert (update or insert) vectors by document ID. Use a document registry (PostgreSQL table or key-value store) to track ingestion state. For frequently updated sources (Confluence, Notion), set up webhooks to trigger re-ingestion on document save. Batch embedding calls to maximize throughput and minimize API costs.
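A minimal sketch of checksum-based change detection. The `plan_ingestion` function and the in-memory `registry` dict are hypothetical stand-ins for the document registry table; detecting deleted documents is an addition of this sketch that the text does not spell out.

```python
import hashlib


def plan_ingestion(
    documents: dict[str, str], registry: dict[str, str]
) -> tuple[dict[str, str], list[str]]:
    """Compare current documents against stored checksums.

    Returns (to_upsert, to_delete): new or changed doc IDs mapped to their new
    checksum, and IDs that are in the registry but gone from the source.
    """
    to_upsert = {}
    for doc_id, text in documents.items():
        checksum = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if registry.get(doc_id) != checksum:
            to_upsert[doc_id] = checksum
    to_delete = sorted(set(registry) - set(documents))
    return to_upsert, to_delete


registry: dict[str, str] = {}

docs_v1 = {"runbook": "Restart the service.", "faq": "Contact support."}
upsert_1, delete_1 = plan_ingestion(docs_v1, registry)
registry.update(upsert_1)  # record only after embedding and upsert succeed

docs_v2 = {"runbook": "Restart the service, then verify health."}
upsert_2, delete_2 = plan_ingestion(docs_v2, registry)
print(sorted(upsert_1), delete_1, sorted(upsert_2), delete_2)
```

Updating the registry only after the vector upsert succeeds means a failed run is retried on the next pass instead of being silently skipped.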

SECURITY

Security and Access Control

RAG systems can leak sensitive information through retrieval. Implement document-level access control: store user/group permissions as metadata on each chunk, and filter by the requesting user's permissions at query time. Never rely on the LLM to respect access boundaries -- it will include restricted content in its response if it appears in the context. Sanitize inputs to prevent prompt injection that could bypass retrieval filters. Audit all queries touching sensitive documents.
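A sketch of building the query-time permission filter. It assumes each chunk carries a hypothetical `allowed_groups` metadata field and that the vector store accepts Mongo-style `$in` and `$and` operators, as Pinecone's metadata filters do; adapt the syntax to your database.

```python
from typing import Optional


def permission_filter(user_groups: list[str], extra: Optional[dict] = None) -> dict:
    """Build a metadata filter that only matches chunks the user may read."""
    if not user_groups:
        # Fail closed: no groups must never translate into an empty (open) filter.
        raise PermissionError("user has no groups; refusing to build an open filter")
    acl = {"allowed_groups": {"$in": sorted(set(user_groups))}}
    if extra:
        return {"$and": [acl, extra]}
    return acl


flt = permission_filter(["sre", "payments"], extra={"source": "engineering-docs"})
print(flt)
```

Because the filter is built server-side from the authenticated user's groups, nothing in the user's prompt can widen it.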

11. Code Examples

Three implementation approaches, from high-level frameworks to direct API calls. LangChain provides the fastest path to a working RAG system with the most abstractions. LlamaIndex is purpose-built for RAG with deeper indexing primitives. Direct API calls give maximum control and minimum dependencies for teams that prefer explicit code over framework magic; this approach is shown twice, once against Pinecone and once against pgvector.

LangChain RAG Pipeline

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Qdrant
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# 1. Load and chunk documents
loader = PyMuPDFLoader("technical_manual.pdf")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512, chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(docs)

# 2. Embed and store in Qdrant
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Qdrant.from_documents(
    chunks, embeddings,
    url="http://localhost:6333",
    collection_name="tech_manual"
)

# 3. Create retrieval chain (MMR retrieval for diverse results; no reranker is used here)
retriever = vectorstore.as_retriever(
    search_type="mmr",          # Maximum Marginal Relevance
    search_kwargs={"k": 10, "fetch_k": 25, "lambda_mult": 0.7}
)

prompt = ChatPromptTemplate.from_messages([
    ("system", """Answer based on the context below. Cite sources.
If the context doesn't contain the answer, say so.

Context: {context}"""),
    ("human", "{input}")
])

llm = ChatOpenAI(model="gpt-4o", temperature=0)
chain = create_retrieval_chain(
    retriever,
    create_stuff_documents_chain(llm, prompt)
)

# 4. Query
result = chain.invoke({"input": "How do I configure failover?"})
print(result["answer"])

LlamaIndex RAG Pipeline

from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.anthropic import Anthropic
from llama_index.vector_stores.pinecone import PineconeVectorStore
from pinecone import Pinecone

# 1. Configure models
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")
Settings.llm = Anthropic(model="claude-sonnet-4-5-20250929", temperature=0)

# 2. Load documents with metadata
documents = SimpleDirectoryReader(
    input_dir="./knowledge_base",
    recursive=True,
    filename_as_id=True
).load_data()

# 3. Parse with sentence-aware splitting
parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = parser.get_nodes_from_documents(documents)

# 4. Connect to an existing Pinecone index (its dimension must match the
#    embedding model: 3072 for text-embedding-3-large by default)
pc = Pinecone(api_key="your-api-key")
pinecone_index = pc.Index("rag-production")
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Index the pre-split nodes; the vector store is attached via the storage context
index = VectorStoreIndex(nodes, storage_context=storage_context)

# 5. Query with reranking
reranker = SentenceTransformerRerank(
    model="BAAI/bge-reranker-v2-m3", top_n=5
)
query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[reranker],
    response_mode="tree_summarize"
)

response = query_engine.query("Explain the backup recovery procedure")
print(response)
for node in response.source_nodes:
    print(f"  [{node.score:.3f}] {node.metadata['file_name']}")

Direct API: Pinecone + OpenAI

import openai
from pinecone import Pinecone

client = openai.OpenAI()
pc = Pinecone(api_key="your-api-key")
index = pc.Index("rag-production")

def embed(text: str) -> list[float]:
    """Generate an embedding for a query or a chunk."""
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        dimensions=1536  # Matryoshka shortening for cost savings
    )
    return response.data[0].embedding

def retrieve(query: str, top_k: int = 10, filters: dict = None) -> list[dict]:
    """Retrieve relevant chunks from Pinecone."""
    query_vector = embed(query)
    results = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True,
        filter=filters  # e.g., {"source": "engineering-docs", "year": {"$gte": 2025}}
    )
    return [
        {"text": m.metadata["text"], "source": m.metadata["source"], "score": m.score}
        for m in results.matches
    ]

def generate(query: str, contexts: list[dict]) -> str:
    """Generate answer grounded in retrieved context."""
    context_str = "\n\n---\n\n".join(
        f"[Source: {c['source']}] {c['text']}" for c in contexts
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": f"""Answer based on the context below.
Cite sources using [Source: ...] format.
If context is insufficient, say "I don't have enough information."

Context:
{context_str}"""},
            {"role": "user", "content": query}
        ],
        stream=True
    )
    chunks = []
    for chunk in response:
        # Some stream events carry no choices (e.g. usage-only chunks); skip them
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            chunks.append(delta)
            print(delta, end="", flush=True)
    return "".join(chunks)

# RAG query
query = "What are the SLA requirements for the payment service?"
contexts = retrieve(query, top_k=10, filters={"team": "payments"})
answer = generate(query, contexts)

pgvector with Python (Direct SQL)

import psycopg2
import openai
import json

client = openai.OpenAI()
conn = psycopg2.connect("postgresql://user:pass@localhost:5432/ragdb")

# Setup: create table with vector column
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id SERIAL PRIMARY KEY,
            content TEXT NOT NULL,
            source VARCHAR(255),
            embedding vector(1536),
            metadata JSONB DEFAULT '{}'::jsonb,
            created_at TIMESTAMPTZ DEFAULT NOW()
        )
    """)
    cur.execute("""
        CREATE INDEX IF NOT EXISTS docs_embedding_idx
        ON documents USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 200)
    """)
    conn.commit()

def ingest(content: str, source: str, metadata: dict = None):
    """Embed and store a document chunk."""
    resp = client.embeddings.create(
        model="text-embedding-3-large", input=content, dimensions=1536
    )
    embedding = resp.data[0].embedding
    with conn.cursor() as cur:
        cur.execute(
            """INSERT INTO documents (content, source, embedding, metadata)
               VALUES (%s, %s, %s::vector, %s::jsonb)""",
            (content, source, str(embedding), json.dumps(metadata or {}))
        )
    conn.commit()

def search(query: str, top_k: int = 5, source_filter: str = None) -> list:
    """Vector similarity search with an optional exact-match source filter."""
    q_emb = client.embeddings.create(
        model="text-embedding-3-large", input=query, dimensions=1536
    ).data[0].embedding

    sql = """
        SELECT content, source,
               1 - (embedding <=> %s::vector) AS similarity
        FROM documents
        WHERE 1=1
    """
    params = [str(q_emb)]
    if source_filter:
        sql += " AND source = %s"
        params.append(source_filter)
    sql += " ORDER BY embedding <=> %s::vector LIMIT %s"
    params.extend([str(q_emb), top_k])

    with conn.cursor() as cur:
        cur.execute(sql, params)
        return [
            {"content": r[0], "source": r[1], "score": r[2]}
            for r in cur.fetchall()
        ]

Related Technologies