Building Production RAG Pipelines with LangChain
Retrieval-Augmented Generation is the most consequential architectural pattern in production LLM systems today. It is also the one most frequently shipped as a demo and never hardened into a system that survives real traffic, real data drift, and real users. A toy RAG pipeline is a notebook with Chroma.from_documents and RetrievalQA. A production RAG pipeline is closer in shape to a search engine, an ETL platform, and an inference service stapled together — each with its own SLOs, failure modes, and evaluation harness.
This post is about the second one.
Why RAG Exists at All
LLMs have two well-known limitations that RAG addresses directly: knowledge cutoff and hallucination. The model knows nothing about your internal documentation, your customer’s contract, or events from last week. Fine-tuning solves part of this but is expensive to iterate, slow to update, and unsuitable for data that changes daily. RAG sidesteps fine-tuning entirely by injecting authoritative context into the prompt at inference time.
The core mechanic is straightforward:
- Embed your corpus into a vector space.
- At query time, embed the user’s question, find the k most similar chunks, and pass them to the LLM as context.
- Constrain the LLM to answer from the retrieved context.
Everything interesting in production RAG happens in the gap between this three-line description and a system that actually returns useful answers.
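To make the three steps above concrete, here is a minimal, library-free sketch. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and every name in it (`retrieve`, `build_grounded_prompt`) is a hypothetical chosen for illustration, not a LangChain API.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase term counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Step 2: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(corpus, key=lambda chunk: cosine(q, embed(chunk)), reverse=True)
    return ranked[:k]


def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Step 3: constrain the model to the retrieved context."""
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

A real pipeline swaps `embed` for a model call and the sorted scan for an approximate nearest-neighbor index, but the control flow keeps this shape.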
The Ingestion Pipeline
Retrieval quality is bounded above by ingestion quality. A great retriever cannot rescue a corpus that was chunked badly or stripped of structure.
Document Loading
LangChain ships document loaders for PDFs, HTML, Notion, Confluence, S3, Slack, and dozens of other sources. In practice you will replace most of them. PyPDFLoader handles simple PDFs; unstructured.io or pdfplumber handle complex layouts; tabular PDFs need table extraction, plus OCR when the pages are scanned (Camelot, AWS Textract, or pdf-table-extract). The right loader is the one that preserves the information your retriever needs (headings, lists, tables, code blocks), not the easiest one to import.
Persist metadata aggressively: source URL, document ID, last-modified timestamp, section path, access-control attributes. Metadata is what makes hybrid filtering possible later (“only documents this user can access, modified in the last 90 days”).
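As a rough illustration of why that metadata pays off, the sketch below stores it on each chunk and applies the example filter from the paragraph above. The `ChunkRecord` class, its field names, and `visible_and_fresh` are assumptions made for this example; in a real system the same predicate would be pushed down into the vector store's filter syntax rather than evaluated in Python.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class ChunkRecord:
    """Hypothetical chunk record carrying the metadata listed above."""
    text: str
    source_url: str
    doc_id: str
    last_modified: datetime
    section_path: str
    allowed_groups: frozenset[str] = field(default_factory=frozenset)


def visible_and_fresh(
    record: ChunkRecord,
    user_groups: set[str],
    now: datetime,
    max_age_days: int = 90,
) -> bool:
    """True if the user may see the chunk and it was modified recently."""
    has_access = bool(record.allowed_groups & user_groups)
    is_fresh = now - record.last_modified <= timedelta(days=max_age_days)
    return has_access and is_fresh
```

Note that a chunk with no `allowed_groups` is visible to nobody in this sketch; failing closed is usually the safer default for access control.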
Chunking
This is where most RAG systems silently regress.
- Fixed-size chunking (e.g., 1000 tokens with 200-token overlap) is the usual starting configuration for RecursiveCharacterTextSplitter. Note that the splitter counts characters, not tokens, unless you build it with a token-based length function (for example via from_tiktoken_encoder). It is fine for prose, terrible for structured content (code, tables, JSON).
- Structure-aware chunking respects markdown headings, code blocks, or HTML sections. Use MarkdownHeaderTextSplitter or HTMLHeaderTextSplitter, then secondary-split anything that still exceeds your target length.
- Semantic chunking uses embeddings to find natural breakpoints. Slow and expensive on a large corpus; reserve for high-value documents.
Aim for chunks small enough to be a focused unit of meaning (typically 200–800 tokens) but large enough that the LLM can answer from a single chunk in the common case. Include the document title and section path in every chunk’s text or metadata — without it, retrieved chunks lose the context that makes them interpretable.
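The structure-aware approach, including the title and section path in every chunk, can be sketched without any library as follows. This is an illustration under simplifying assumptions: word count stands in for token count, the function name `chunk_markdown` is hypothetical, and fenced code blocks are not handled specially (a `#` comment inside one would be misread as a heading).

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(title: str, text: str, max_words: int = 200) -> list[dict]:
    """Split on markdown headings, then secondary-split long sections by words."""
    chunks: list[dict] = []
    path: list[str] = []    # current heading stack, e.g. ["Setup", "Linux"]
    buffer: list[str] = []  # body lines of the section being collected

    def flush() -> None:
        words = " ".join(buffer).split()
        buffer.clear()
        section = " > ".join(path)
        for start in range(0, len(words), max_words):
            body = " ".join(words[start:start + max_words])
            chunks.append({
                # Prefix the text so the chunk stays interpretable on its own.
                "text": f"{title} | {section}\n{body}",
                "metadata": {"title": title, "section_path": section},
            })

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section under its own path
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            buffer.append(line)
    flush()
    return chunks
```

In a LangChain pipeline the heading split would typically come from MarkdownHeaderTextSplitter; the point here is only the shape of the output, where each chunk carries its own breadcrumb.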
Embedding Choice
OpenAI’s text-embedding-3-small and text-embedding-3-large are sensible defaults; voyage-3 and Cohere embed-v3 are strong alternatives. Open-source options (bge-large-en-v1.5, e5-mistral-7b-instruct) eliminate per-call cost and the network hop at the price of running your own GPU inference. Pick once and stick with it: changing embedding models requires re-embedding your entire corpus, and mixing embedding spaces in a single index is incorrect.
Match embedding_dim to the vector DB’s index configuration. A 3072-dim embedding in a 1536-dim index will either be rejected or silently truncated depending on the client — and silent truncation is the kind of bug that takes weeks to diagnose.
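One cheap defense is to fail loudly at write time rather than trust the client. The guard below is a minimal sketch; `check_embedding_dim` is a hypothetical helper, and `index_dim` is assumed to come from your own index configuration.

```python
def check_embedding_dim(vector: list[float], index_dim: int) -> list[float]:
    """Raise instead of letting a mismatched vector reach the index."""
    if len(vector) != index_dim:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, index expects {index_dim}"
        )
    return vector
```

Calling this on every vector before upsert turns a weeks-long silent-truncation hunt into an immediate stack trace.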
The Vector Store Layer
LangChain’s VectorStore interface is intentionally narrow: add_documents, similarity_search, similarity_search_with_score. The store you choose underneath matters enormously in production.
- pgvector. PostgreSQL extension. The right answer when your corpus fits comfortably in a single Postgres instance and you want transactional ingest, joins, and operational simplicity. HNSW indexes are available since pgvector 0.5; tune hnsw.ef_search for the recall/latency trade-off.
- Pinecone. Managed, serverless, scales to billions of vectors with low operational overhead. You pay for that convenience.
- Weaviate / Qdrant / Milvus. Self-hostable, strong filtering, hybrid search support. Qdrant and Weaviate are the most production-friendly of the three in my experience.
- OpenSearch / Elasticsearch. Worth considering when you already run them — they support dense vectors plus the BM25 sparse retrieval you almost certainly want for hybrid search.
A common mistake is treating the vector DB as a black box. Latency depends on index type (HNSW vs IVF), number of vectors, dimension, recall target, and filter cardinality. Benchmark with realistic filters and dataset sizes before committing.
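A small harness is usually enough to start benchmarking. The sketch below assumes you wrap your store in a hypothetical `search` callable that applies the same filters and `k` you use in production; it reports latency percentiles only, so recall has to be measured separately.

```python
import statistics
import time
from typing import Callable, Sequence


def benchmark(
    search: Callable[[str], object],
    queries: Sequence[str],
    warmup: int = 3,
) -> dict[str, float]:
    """Time each query and return latency percentiles in milliseconds."""
    for query in queries[:warmup]:
        search(query)  # warm caches and connections before measuring

    latencies_ms: list[float] = []
    for query in queries:
        start = time.perf_counter()
        search(query)
        latencies_ms.append((time.perf_counter() - start) * 1000)

    # quantiles() needs at least two samples; n=100 yields 99 cut points.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "p50": statistics.median(latencies_ms),
        "p95": cuts[94],
        "p99": cuts[98],
    }
```

Run it against a dataset of production size, with production-like filter selectivity, since both change the numbers far more than the choice of client library does.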
Retrieval Is Not Just Cosine Similarity
Pure semantic search routinely misses exact-match queries — product SKUs, function names, error codes, acronyms. The fix is hybrid search: combine dense retrieval (embeddings) with sparse retrieval (BM25 or SPLADE) and merge results.

    # BM25Retriever requires the rank_bm25 package.
    from langchain.retrievers import EnsembleRetriever
    from langchain_community.retrievers import BM25Retriever
    from langchain_community.vectorstores import Qdrant

    # docs: a list of Documents; embedder: any LangChain Embeddings instance.
    bm25 = BM25Retriever.from_documents(docs, k=10)
    # In-memory Qdrant for illustration only; point this at your cluster in production.
    dense = Qdrant.from_documents(docs, embedder, location=":memory:").as_retriever(
        search_kwargs={"k": 10}
    )

    retriever = EnsembleRetriever(
        retrievers=[bm25, dense],
        weights=[0.4, 0.6],
    )

The next step is reranking. Initial retrieval is optimized for recall (catch the right chunk in the top 50). A cross-encoder reranker (cohere/rerank-english-v3.0, bge-reranker-v2-m3) then sorts those 50 candidates for precision, and you pass the top 5–10 to the LLM. The compute cost is one cross-encoder pass per candidate; this typically adds 50–200 ms of latency and often improves answer quality substantially.
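To make the merge step of hybrid search less of a black box, here is a minimal sketch of weighted reciprocal rank fusion, one common way to combine ranked lists whose raw scores are not comparable (BM25 scores and cosine similarities live on different scales). The function name and the document IDs are illustrative; `c = 60` is the constant commonly used with this method.

```python
from collections import defaultdict


def weighted_rrf(
    ranked_lists: list[list[str]],
    weights: list[float],
    c: int = 60,
) -> list[str]:
    """Fuse ranked ID lists: each list contributes weight / (c + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks are used, a document that appears in both lists tends to beat one that tops a single list, which is the behavior you want before handing candidates to a reranker.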
Query Transformation
Users rarely phrase questions the way documents are written. Three transformations are worth knowing:
- HyDE (Hypothetical Document Embeddings). Ask the LLM to draft a hypothetical answer to the question, embed the answer, and retrieve against that embedding. Works well for short, ambiguous queries.
- Multi-query retrieval. Generate 3–5 paraphrases of the question, retrieve for each, then union and rerank.
- Query decomposition. Break a multi-hop question (“What did the CEO say about Q3 margins and how did that compare to Q2?”) into sub-questions, retrieve independently, and synthesize.
Each adds latency and LLM cost. Apply them selectively based on query characteristics, not unconditionally.
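Multi-query retrieval is the simplest of the three to sketch. The version below takes the paraphraser and the retriever as plain callables, both hypothetical stand-ins for an LLM call and a vector store query, and performs only the union step; the reranking that follows is left to whatever reranker you already run.

```python
from typing import Callable


def multi_query_retrieve(
    question: str,
    paraphrase: Callable[[str], list[str]],
    retrieve: Callable[[str], list[str]],
    limit: int = 20,
) -> list[str]:
    """Retrieve for the question and its paraphrases, then union the results."""
    queries = [question, *paraphrase(question)]
    seen: dict[str, None] = {}  # dict keeps first-seen order and deduplicates
    for query in queries:
        for chunk_id in retrieve(query):
            seen.setdefault(chunk_id, None)
    return list(seen)[:limit]
```

The `limit` caps how many candidates reach the reranker, which is one way to keep the extra cost of this transformation bounded.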
Context Construction
Once you have the top-k chunks, how you assemble them into the prompt matters. A few non-negotiables:
- Order matters. LLMs exhibit a measurable “lost in the middle” effect — content at the start and end of the context window is recalled better than content in the middle. Place the highest-ranked chunks at the boundaries.
- Cite explicitly. Tag each chunk with an identifier ([1], [2]) and instruct the model to cite using those tags. Post-process to surface citations as UI artifacts.
- Stay under the effective context window. Models advertise 128k or 200k tokens but answer quality degrades long before that limit. For retrieval-grounded answers, 4k–8k tokens of context is usually the sweet spot.
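The ordering and budget advice above can be sketched as two small helpers. Both names are hypothetical, and the default token counter is a whitespace split used purely as a placeholder; substitute your model's tokenizer in practice.

```python
from typing import Callable


def fit_budget(
    ranked_chunks: list[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> list[str]:
    """Keep the best-ranked chunks until the token budget would be exceeded."""
    kept: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept


def reorder_for_boundaries(ranked_chunks: list[str]) -> list[str]:
    """Place the best chunks at the start and end, the weakest in the middle."""
    front = ranked_chunks[0::2]        # ranks 1, 3, 5, ...
    back = ranked_chunks[1::2][::-1]   # ..., 6, 4, 2
    return front + back
```

Apply `fit_budget` first, on the rank-ordered list, so that the budget drops the weakest chunks rather than whichever ones happen to land last after reordering.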
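
    from langchain_core.documents import Document

    # Illustrative template; adapt the wording to your own system prompt.
    PROMPT_TEMPLATE = "Context:\n{context}\n\nQuestion: {question}"

    def build_prompt(question: str, chunks: list[Document]) -> str:
        # Number each chunk so the model can cite it as [n].
        context = "\n\n".join(
            f"[{i+1}] (source: {c.metadata['source']})\n{c.page_content}"
            for i, c in enumerate(chunks)
        )
        return PROMPT_TEMPLATE.format(context=context, question=question)

Grounding and Refusal Behavior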
The system prompt should explicitly instruct the model to refuse when the context is insufficient. A common, durable pattern:
Answer the user’s question using only the information in the context below. If the context does not contain the answer, respond exactly: “I don’t have enough information to answer that.” Cite sources as [n] inline.
Add a structured-output guard (Pydantic model or JSON schema) when downstream systems consume the response. langchain_core.output_parsers.PydanticOutputParser makes this straightforward.
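Alongside a schema guard, it helps to check the model's output against the grounding contract itself. The sketch below is a library-free illustration with hypothetical names: it detects the exact refusal string and verifies that every [n] citation points at a chunk that was actually supplied.

```python
import re

# Must match the refusal sentence in your prompt character for character,
# including the apostrophe style (straight here; an assumption of this sketch).
REFUSAL = "I don't have enough information to answer that."
CITATION = re.compile(r"\[(\d+)\]")


def check_answer(answer: str, num_chunks: int) -> dict:
    """Classify an answer as a refusal, a validly cited answer, or invalid."""
    if answer.strip() == REFUSAL:
        return {"refused": True, "citations": [], "valid": True}
    cited = sorted({int(n) for n in CITATION.findall(answer)})
    in_range = all(1 <= n <= num_chunks for n in cited)
    # A non-refusal answer with no citations breaks the contract too.
    return {"refused": False, "citations": cited, "valid": bool(cited) and in_range}
```

An answer that fails this check is a reasonable trigger for a retry or a fallback to the refusal message, and the failure rate is worth tracking as a metric in its own right.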
Evaluation: The Step Everyone Skips
You cannot improve what you do not measure. Production RAG requires an evaluation harness from day one, not retroactively after launch.
Maintain a labeled evaluation set of ~100–500 representative queries with expected source documents and either reference answers or rubric criteria. Run the pipeline against this set on every meaningful change (new embedder, new chunker, new retriever weights) and track:
- Retrieval recall@k. Did the correct source chunk appear in the top-k? This is the upper bound on answer quality and almost always the first thing to fix.
- Faithfulness. Does the generated answer only use information from the retrieved context? Measure with RAGAS, TruLens, or LLM-as-judge with a pinned evaluator model.
- Answer relevance. Does the answer address the question, regardless of correctness?
- Latency and cost per query. Always.
Treat the eval suite as test code. Version it, gate releases on regression, and update it when you discover new failure modes in production.
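Retrieval recall@k and a release gate are small enough to sketch in full. The names, the shape of the evaluation set (pairs of query and expected chunk IDs), and the 0.02 tolerance are all assumptions for illustration. With a single expected chunk per query the metric reduces to a hit-or-miss check, matching the definition above.

```python
from typing import Callable


def recall_at_k(retrieved_ids: list[str], expected_ids: set[str], k: int) -> float:
    """Fraction of expected chunks that appear in the top-k results."""
    hits = expected_ids & set(retrieved_ids[:k])
    return len(hits) / len(expected_ids)


def mean_recall_at_k(
    eval_set: list[tuple[str, set[str]]],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Average recall@k over (query, expected_ids) pairs."""
    scores = [recall_at_k(retrieve(query), expected, k) for query, expected in eval_set]
    return sum(scores) / len(scores)


def passes_gate(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Block a release if recall drops more than `tolerance` below baseline."""
    return current >= baseline - tolerance
```

Wiring `passes_gate` into CI, with the baseline stored next to the evaluation set, is what turns the suite from a report into a gate.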
Production Concerns
A few things that separate a working RAG demo from a system that runs:
- Ingest is a streaming concern, not a batch one. Source documents change. Build an incremental indexer that ingests on webhooks or change-data-capture events and tracks last_indexed_at per source. Re-embedding the full corpus nightly does not scale.
- Access control belongs in the retriever. Filter at the vector DB level by user_id / tenant_id / acl metadata. Never rely on the LLM to “not mention things the user shouldn’t see”; it is not a security boundary.
- Caching is asymmetric. Embedding the same query twice should hit a cache. Caching LLM answers is harder because retrieval freshness changes the right answer; key by (query_hash, retrieved_chunk_ids) if you cache at all.
- Observability. Log retrieved chunk IDs, scores, and reranker decisions per query. When a user reports a wrong answer, the only diagnostic worth anything is the full retrieval trace.
- Cost. Reranking and multi-query retrieval can easily 10x your per-query cost once reranker calls are counted alongside LLM tokens. Track tokens and calls per stage and put a budget on each.
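The answer-cache key suggested in the caching item above can be sketched like this. The helper names are hypothetical, and two assumptions are baked in: chunk IDs change whenever chunk content changes (for example by embedding a content hash or version in the ID), and the order of retrieved chunks is treated as irrelevant to the answer.

```python
import hashlib


def normalize(query: str) -> str:
    """Collapse case and whitespace so trivially different queries share a key."""
    return " ".join(query.lower().split())


def answer_cache_key(query: str, retrieved_chunk_ids: list[str]) -> str:
    """Key an LLM answer by the query and the exact set of chunks it saw."""
    query_hash = hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()
    # Sorting makes the key independent of retrieval order (an assumption).
    chunks_part = ",".join(sorted(retrieved_chunk_ids))
    return hashlib.sha256(f"{query_hash}|{chunks_part}".encode("utf-8")).hexdigest()
```

Because the key includes the chunk IDs, a re-indexed document naturally misses the cache, which handles the freshness problem without a separate invalidation path. If you reorder chunks in the prompt by rank, drop the sort.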
Trade-offs
RAG is not a universal answer to grounding. It struggles with:
- Aggregation questions. “How many customers complained about latency last month?” — retrieval returns a handful of chunks; the answer requires scanning all of them. Push these queries to SQL or analytical engines, not the LLM.
- Temporal reasoning. Vector similarity has no concept of recency. Use metadata filters (updated_at > now() - 30d) and decay scoring rather than hoping the model figures it out.
- Long-tail rare terms. Dense embeddings struggle with proper nouns and rare jargon. Hybrid retrieval with BM25 mitigates this; fine-tuning the embedding model on your domain mitigates more.
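Decay scoring, mentioned under temporal reasoning above, can be as simple as an exponential half-life applied to the similarity score. The function name and the 30-day half-life are assumptions for illustration; the right half-life depends on how quickly your corpus goes stale.

```python
from datetime import datetime, timedelta, timezone


def decayed_score(
    similarity: float,
    updated_at: datetime,
    now: datetime,
    half_life_days: float = 30.0,
) -> float:
    """Halve a chunk's score for every `half_life_days` of age."""
    age_days = max((now - updated_at).total_seconds() / 86400, 0.0)
    return similarity * 0.5 ** (age_days / half_life_days)
```

With these defaults a 30-day-old chunk scoring 0.8 ranks level with a fresh chunk scoring 0.4, so tune the half-life against your evaluation set rather than by feel.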
Closing
A production RAG pipeline looks less like a chat interface with documents bolted on and more like an evolving search system whose final stage happens to be an LLM. The work is in the unglamorous places: chunking that preserves structure, hybrid retrieval, reranking, evaluation that catches regressions, observability per stage, and an ingestion pipeline that keeps up with source-of-truth changes. Get those right and the LLM becomes the easy part — replaceable, swappable, and largely interchangeable as the model market continues to commoditize. Get them wrong and you have built a very expensive autocomplete that confidently makes things up.