Designing Retrieval Pipelines for Vector Databases
Vector search is one of those technologies whose mechanics are deceptively simple — embed your text, store the vectors, find the nearest neighbors at query time — and whose production realization involves an unreasonable number of decisions. Embedding model, chunking strategy, vector dimension, index type, filter handling, hybrid retrieval, reranking, sharding, eviction, refresh cadence — each has its own trade-offs, each interacts with the others, and the wrong combination can produce a system that’s both expensive and unsearchable. This post is a working engineer’s view of building a retrieval pipeline that holds up in production.
What “Retrieval” Actually Is
A retrieval pipeline turns unstructured content into an indexed representation, and a query into a ranked list of matching items. In a modern stack the steps are:
- Ingestion — documents are loaded, parsed, normalized, and chunked.
- Embedding — each chunk is converted to a dense vector via an embedding model.
- Indexing — vectors and metadata are stored in a vector database with an approximate nearest neighbor (ANN) index.
- Query — user input is embedded and the index returns the top-k nearest chunks.
- Reranking — a more expensive model reorders the candidates for precision.
- Use — chunks are returned to the consumer (a RAG application, a search UI, a recommender).
The interesting questions are at every stage, but the highest-leverage ones are chunking, embedding choice, and index configuration.
Embeddings, Briefly
An embedding is a fixed-length vector representation where geometric closeness (cosine similarity, dot product, or L2 distance) approximates semantic similarity. Modern text embedding models (OpenAI’s text-embedding-3-large, Cohere’s embed-v3, BAAI’s bge-large-en-v1.5, Voyage’s voyage-3) produce vectors with roughly 384 to 3072 dimensions.
A few production-relevant properties:
- Embedding space is a function of the model. Vectors from one model are not comparable to vectors from another. Switching models requires re-embedding the corpus.
- Dimensionality is a trade-off. Higher dimensions capture more nuance but cost more storage and slow down search. 1024 is a defensible default; 3072 is overkill for most cases.
- Matryoshka representations (introduced in newer models) let you truncate vectors to lower dimensions with graceful degradation. Useful for hierarchical retrieval (coarse-to-fine).
- Domain adaptation. Off-the-shelf models work well for general-purpose retrieval. For specialized domains (legal, medical, internal-jargon-heavy), fine-tuning or domain-adapted models produce meaningful improvements.
- Asymmetric models. Some embedding models distinguish between document and query embeddings (typically with a different prompt prefix). Use them correctly: embedding queries with the document prefix, so that both sides of the comparison are document-style embeddings, gives noticeably worse results.
Choose the model deliberately, then stick with it. Re-embedding 100M chunks is not a free operation.
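To make the similarity and truncation ideas concrete, here is a minimal sketch using only the standard library. The 8-dimensional vectors are toy values standing in for real embeddings, and `truncate` is a hypothetical helper; truncation like this is only expected to work for models trained with a Matryoshka-style objective.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def truncate(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and re-normalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy vectors (assumption): real embeddings come from the embedding model.
doc = [0.9, 0.1, 0.3, 0.2, 0.05, 0.02, 0.01, 0.03]
query = [0.8, 0.2, 0.4, 0.1, 0.01, 0.04, 0.02, 0.01]

full = cosine(doc, query)                              # full-dimension similarity
coarse = cosine(truncate(doc, 4), truncate(query, 4))  # cheaper, coarser similarity
```

In a coarse-to-fine setup, the truncated vectors would serve the first pass and the full vectors would rescore the survivors.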
Chunking: Where Most Quality Comes From
Retrieval quality is bounded above by chunk quality. A great embedder cannot rescue chunks that were split arbitrarily through structured content.
The chunking strategies that hold up:
- Fixed-size with overlap. 500–1000 tokens per chunk, 100–200 token overlap. The default in most frameworks; adequate for prose, inadequate for structured content.
- Structure-aware. Split on headings, list boundaries, code blocks. Required for Markdown, HTML, technical docs.
- Semantic chunking. Use embeddings to find natural breakpoints between paragraphs. Expensive at ingestion; sometimes worthwhile.
- Late chunking (introduced in 2024). Embed the entire document, then derive per-chunk embeddings from the contextualized token embeddings. Preserves cross-chunk context. Requires a model that supports it.
- Hierarchical chunking. Multiple representations — sentence-level for precision, paragraph-level for context, document-level for routing.
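As a reference point for the simplest strategy in the list, fixed-size chunking with overlap can be sketched as below. The function name and the idea of passing pre-split tokens are assumptions for illustration; in practice the tokens would come from the embedding model's own tokenizer.

```python
def chunk_with_overlap(tokens: list[str], size: int = 500, overlap: int = 100) -> list[list[str]]:
    """Split `tokens` into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

# Tiny example: 12 tokens, windows of 5 with 2 tokens of overlap.
example = chunk_with_overlap([f"t{i}" for i in range(12)], size=5, overlap=2)
```

The early `break` avoids emitting a trailing chunk that is entirely contained in the previous one.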
Practical rules:
- Include the title and section path in each chunk’s text or as separate metadata. Otherwise retrieved chunks are decontextualized.
- Don’t split tables, code blocks, or bullet lists across chunks unless you must. Each is a unit of meaning.
- Aim for chunks that can answer questions on their own. The “I retrieved a chunk but it’s just the middle of a sentence” failure mode is a chunking bug.
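The first two rules can be combined in a small structure-aware splitter. This is a sketch that assumes Markdown input with `#`-style headings; `Chunk` and `split_by_headings` are hypothetical names, and a production parser would also need to handle tables, lists, and size limits.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example readable

@dataclass
class Chunk:
    section_path: list[str]  # document title followed by the heading trail
    text: str

def split_by_headings(markdown: str, title: str) -> list[Chunk]:
    """Split on headings, never inside a fenced code block, and record each chunk's section path."""
    chunks: list[Chunk] = []
    path: list[str] = []   # current heading stack
    body: list[str] = []   # lines collected since the last heading
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk([title, *path], text))
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()                      # close the chunk under the previous heading
            level = len(match.group(1))
            del path[level - 1:]         # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks

doc = "# Setup\nInstall it.\n## Linux\nUse apt.\n# Usage\nRun it."
chunks = split_by_headings(doc, "Guide")
```

Storing `section_path` as metadata (or prepending it to the chunk text before embedding) is what keeps a retrieved chunk from being decontextualized.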
For documents with mixed content (text + tables + images), specialized parsers (unstructured.io, Marker, LlamaParse) preserve structure better than naive PDF text extraction.
Index Types
ANN indexes are how vector databases search billions of vectors in milliseconds. The main families:
- HNSW (Hierarchical Navigable Small World). A multi-layer graph. Excellent recall/latency trade-offs. Memory-hungry — index size is typically 1.5–2x the raw vector size. The standard for production vector search.
- IVF (Inverted File). Partitions vectors into clusters; search probes the nearest clusters. Smaller memory footprint than HNSW. Tunable via nprobe (how many clusters to search) for the recall/latency trade-off.
- IVF-PQ (Product Quantization). Compresses vectors via quantization. Massive memory savings at the cost of recall. Used for very large indexes (billions of vectors).
- DiskANN / Vamana. Optimized for SSDs; allows indexes much larger than RAM. Used by very-large-scale systems.
Index parameters that matter:
For HNSW:
- M (links per node): higher = better recall, more memory. 16–64 is typical.
- ef_construction: build-time exploration. Higher = better index, slower build.
- ef_search: query-time exploration. Higher = better recall, slower query. The most useful query-time knob.
For IVF:
- nlist: number of clusters. Typically around √(total_vectors).
- nprobe: clusters searched per query. Higher = better recall, slower.
The right configuration is workload-specific. Benchmark recall@k vs. latency on your actual data; default parameters are starting points, not endpoints.
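To see what such a benchmark measures, here is a toy IVF-style index in pure Python that reports recall@10 against exact search as nprobe grows. It is a deliberately simplified sketch: centroids are a random sample of the data rather than trained with k-means, the data is synthetic Gaussian noise, and all function names are hypothetical. A real benchmark would use your database's index and your own vectors.

```python
import random

def l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance (ordering is the same as true L2)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, nlist, rng):
    """Assign every vector to its nearest centroid; centroids are sampled, not trained."""
    centroids = rng.sample(vectors, nlist)
    lists = [[] for _ in range(nlist)]
    for idx, vec in enumerate(vectors):
        nearest = min(range(nlist), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(idx)
    return centroids, lists

def search_ivf(query, vectors, centroids, lists, k, nprobe):
    """Scan only the `nprobe` clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [idx for c in probe for idx in lists[c]]
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:k]

def exact_search(query, vectors, k):
    return sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))[:k]

def recall_at_k(approx, exact):
    return len(set(approx) & set(exact)) / len(exact)

rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
queries = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(10)]
nlist, k = 32, 10  # nlist is roughly sqrt(1000)
centroids, lists = build_ivf(vectors, nlist, rng)
exact = [exact_search(q, vectors, k) for q in queries]

mean_recall = {}
for nprobe in (1, 4, 16, nlist):
    per_query = [
        recall_at_k(search_ivf(q, vectors, centroids, lists, k, nprobe), truth)
        for q, truth in zip(queries, exact)
    ]
    mean_recall[nprobe] = sum(per_query) / len(per_query)
    print(f"nprobe={nprobe:>2}  recall@{k}={mean_recall[nprobe]:.2f}")
```

Probing every cluster degenerates into exact search, so recall reaches 1.0 there; the interesting part of a real benchmark is how quickly recall approaches that ceiling relative to the latency each extra probe costs.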
Vector Database Options
The 2026 landscape:
- Pinecone. Managed, serverless, scales to billions of vectors. Pay-per-use. The default for teams that want to outsource operational concerns.
- Qdrant. Self-hostable and managed, written in Rust, strong filtering, hybrid search support. Production-friendly defaults.
- Weaviate. Self-hostable and managed, schema-first, hybrid search, built-in modules for embedders.
- Milvus / Zilliz. Open-source and managed (Zilliz Cloud). Strong horizontal scaling for very large indexes.
- pgvector. PostgreSQL extension. The right answer when your corpus fits in one Postgres instance and you want transactional ingest, joins, and operational simplicity. HNSW indexes since 0.5.
- OpenSearch / Elasticsearch. Vector search on top of existing search infrastructure. Good when you already operate them and want hybrid (dense + sparse) retrieval natively.
- Vespa. Hybrid retrieval at very large scale; the right answer for billion-doc, multi-tier ranking pipelines.
The choice is operational as much as technical. Pinecone removes operational work; pgvector removes infrastructure complexity; Qdrant/Weaviate give control. Pick based on team capacity and corpus size, not based on a feature checklist.
Hybrid Retrieval
Pure dense retrieval has a known weakness: exact-match queries. When the query is a product SKU, a function name, an error code, or any proper noun, embeddings systematically underperform sparse retrieval (BM25, SPLADE).
The fix is hybrid retrieval: run both, merge results.
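
    # Run both retrievers on the same query, then fuse the two ranked lists.
    dense_results = await dense_index.search(query_vec, k=20)
    sparse_results = await bm25_index.search(query_text, k=20)
    # Here k=60 is the RRF smoothing constant, not a result count.
    combined = reciprocal_rank_fusion(dense_results, sparse_results, k=60)

Reciprocal Rank Fusion (RRF) is the standard merge: each candidate’s score is Σ 1 / (k + rank_in_list), summed over the lists in which it appears. Simple, robust, doesn’t require score normalization.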
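For reference, the fusion step itself fits in a few lines. This sketch assumes each input is a list of document IDs ordered best-first (the snippet above passes result objects, so a real implementation would extract IDs first); the function body is an illustration of the formula, not any particular library's implementation.

```python
from collections import defaultdict

def reciprocal_rank_fusion(*ranked_lists: list[str], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank of d in that list)."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

dense = ["a", "b", "c"]   # best-first IDs from the dense index
sparse = ["c", "a", "d"]  # best-first IDs from the sparse index
fused = reciprocal_rank_fusion(dense, sparse)
```

Documents that appear in both lists ("a" and "c" here) accumulate two terms and rise above documents found by only one retriever, which is the behavior that makes the merge robust without any score normalization.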
Most modern vector databases support hybrid retrieval natively (Qdrant, Weaviate, OpenSearch, Vespa). For databases that don’t, run two queries and merge in the application.
The recall improvement from hybrid retrieval is consistently in the 5–15% range across benchmarks. Worth the modest implementation effort.
Filtering: The Operational Reality
Most production retrieval is not “find similar to this query” but “find similar to this query, given these constraints.” Tenant ID, user ACLs, time windows, document type, language.
The naïve approach — retrieve top-k and post-filter — fails when the filter is selective. If only 1% of your documents belong to tenant A and you retrieve top-100, you get on average one tenant-A document.
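The arithmetic behind that failure is worth writing down. The sketch below assumes matching documents are spread uniformly through the ranking, which is a simplification; the helper names are hypothetical.

```python
import math

def expected_survivors(k: int, matching: int, total: int) -> float:
    """Expected number of top-k results that pass a filter matching `matching` of `total` documents."""
    return k * matching / total

def overfetch_k(wanted: int, matching: int, total: int) -> int:
    """How many candidates to retrieve so that `wanted` survive the post-filter in expectation."""
    return math.ceil(wanted * total / matching)

survivors = expected_survivors(100, matching=1, total=100)  # 1% selectivity, top-100
needed = overfetch_k(20, matching=1, total=100)             # candidates needed for 20 results
```

Getting 20 usable results at 1% selectivity means retrieving about 2,000 candidates per query, and even that is only an average, which is why post-filtering does not scale to selective filters.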
Two strategies:
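- Pre-filter then ANN. Apply the filter first, then search only the filtered set. Fast for highly selective filters, where the surviving set is small enough to scan exactly; slow for broad filters, where the surviving set is large and the prebuilt index cannot be applied to it directly.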
- ANN with filter pruning during search. The index navigates and filters simultaneously. Most modern databases support this; the implementation efficiency varies dramatically.
For multi-tenant systems, the filter is non-negotiable. Always include tenant_id in metadata; always filter on it; never trust application-level filtering as the only barrier.
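
    from qdrant_client.models import FieldCondition, Filter, MatchValue, Range

    # Assumes `client` is an AsyncQdrantClient; the tenant filter is applied inside the search.
    results = await client.search(
        collection_name="docs",
        query_vector=query_vec,
        query_filter=Filter(must=[
            FieldCondition(key="tenant_id", match=MatchValue(value=tenant_id)),
            FieldCondition(key="lang", match=MatchValue(value="en")),
            FieldCondition(key="updated_at", range=Range(gte=since_ts)),
        ]),
        limit=20,
    )

Reranking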
The candidates returned by ANN are optimized for recall (catch the relevant items in top-k). They are not optimized for precision (rank the most relevant first). The fix is a more expensive model that scores (query, candidate) pairs.
Cross-encoders (bge-reranker-v2-m3, Cohere’s rerank-english-v3.0, Voyage’s rerank-2) are the standard:

    # Over-fetch candidates from the ANN index, then let the cross-encoder reorder them.
    candidates = await vector_db.search(query_vec, k=50)
    texts = [c.payload["text"] for c in candidates]
    reranked = await reranker.rerank(query, texts, top_n=10)
    # Each reranked item points back at its position in the candidate list.
    final = [candidates[r.index] for r in reranked]

Cost: one cross-encoder pass per candidate (50 in this example). Latency: typically 50–200ms total. Precision improvement: substantial. It is the difference between “the right answer is somewhere in top-10” and “the right answer is in top-3.”
Reranking is the single most cost-effective quality improvement in most retrieval pipelines. Skip it only when latency is genuinely binding (sub-100ms p95) or candidate counts are very small.
Ingestion at Scale
A common pitfall: the ingestion pipeline is built as a one-off batch script. Six months later, the team needs to re-ingest 100M documents because the embedding model changed, and the script can’t possibly finish in time.
Build ingestion as a streaming pipeline:
- Source change events. Whenever a source document is created, updated, or deleted, emit an event.
- Idempotent processing. A consumer can replay events without producing duplicates. Use stable IDs derived from the source.
- Incremental embedding. Embed only changed chunks. Hash the chunk content; if the hash matches a previous version, reuse the vector.
- Concurrent throughput. Embedding APIs have rate limits; parallel workers with bounded concurrency keep throughput close to the limit without exceeding it.
- Backfill as replay. Re-ingesting the corpus is the same code path as steady-state ingestion, just with the source firing all events.
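The idempotency and incremental-embedding points can be sketched with stable IDs and content hashes. Everything here is illustrative: the namespace UUID, the `source_id:position` ID scheme, and the in-memory `stored_hashes` dict (standing in for whatever store holds the hash of each indexed chunk) are assumptions.

```python
import hashlib
import uuid

# Hypothetical fixed namespace so that IDs are reproducible across replays.
NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")

def chunk_id(source_id: str, position: int) -> str:
    """Deterministic ID: replaying the same event yields the same ID, so upserts do not duplicate."""
    return str(uuid.uuid5(NAMESPACE, f"{source_id}:{position}"))

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_embedding(chunks: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Return the IDs of chunks whose content is new or changed; the rest can reuse their vectors."""
    return [cid for cid, text in chunks.items() if stored_hashes.get(cid) != content_hash(text)]

first, second = chunk_id("doc-1", 0), chunk_id("doc-1", 1)
stored_hashes = {first: content_hash("hello")}            # state left by a previous run
to_embed = plan_embedding({first: "hello", second: "world"}, stored_hashes)
```

Because the plan depends only on IDs and hashes, a full backfill and a single change event go through the same function; the backfill simply finds more work to do.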
For large-scale ingestion, batch the embedding calls. Most APIs accept 50–100 inputs per call with similar latency to a single call; the throughput improvement is roughly linear.
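Batching and bounded concurrency compose naturally with asyncio. In this sketch `fake_embed` is a stand-in for a real embedding API call, and the batch size and concurrency limit are placeholder values to tune against your provider's actual limits.

```python
import asyncio

async def fake_embed(batch: list[str]) -> list[list[float]]:
    """Hypothetical embedder: one simulated network round trip per batch."""
    await asyncio.sleep(0.01)
    return [[float(len(text))] for text in batch]

async def embed_all(texts: list[str], batch_size: int = 64, max_concurrency: int = 4) -> list[list[float]]:
    """Embed `texts` in batches, with at most `max_concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

    async def run(batch: list[str]) -> list[list[float]]:
        async with semaphore:  # blocks while the concurrency budget is exhausted
            return await fake_embed(batch)

    # gather returns results in submission order, so vectors line up with `texts`.
    results = await asyncio.gather(*(run(batch) for batch in batches))
    return [vec for batch_vectors in results for vec in batch_vectors]

vectors = asyncio.run(embed_all([f"chunk {i}" for i in range(200)]))
```

A real pipeline would add retry with backoff on rate-limit responses; the semaphore only caps how many requests are in flight at once.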
Index Refresh and Eviction
What happens when a source document is deleted or updated? Unless the pipeline handles it, stale vectors keep coming back as results.
- Deletes. Vector databases support deletes; ensure your ingestion pipeline propagates them. Soft-delete with a tombstone field is the most common pattern.
- Updates. Re-embed the new content; replace the existing vector under the same ID. If chunking is content-derived, an update may produce a different number of chunks — handle this explicitly.
- Stale detection. Periodic sweeps that compare the index against the source of truth catch ingestion-pipeline drift. A quarterly cadence is a reasonable starting point.
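The update case where the chunk count changes is the one that tends to leave orphans behind. A minimal sketch of handling it explicitly, assuming chunk IDs are stable and derived from the source document (the `doc-7:N` scheme and the `plan_update` helper are hypothetical):

```python
def plan_update(old_ids: set[str], new_chunks: dict[str, str]) -> tuple[dict[str, str], set[str]]:
    """For one updated document, return (chunks to upsert, stale chunk IDs to delete)."""
    to_delete = old_ids - set(new_chunks)  # IDs that no longer exist after re-chunking
    return new_chunks, to_delete

# The document previously produced three chunks; after the edit it produces two.
old_ids = {"doc-7:0", "doc-7:1", "doc-7:2"}
new_chunks = {"doc-7:0": "intro (edited)", "doc-7:1": "body"}
to_upsert, to_delete = plan_update(old_ids, new_chunks)
```

Without the delete set, the third chunk would stay in the index and keep surfacing content that no longer exists in the source.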
Performance Tuning
A few patterns that earn their keep:
- Quantization for large indexes. Scalar quantization (float32 → int8) cuts memory by 4x with minor recall loss. Binary quantization (float32 → bit) cuts by 32x with larger recall loss, useful for first-stage retrieval before refining with full-precision vectors.
- Sharding by tenant or partition. Per-tenant collections eliminate cross-tenant filtering entirely. Operational overhead in exchange for simplicity.
- Caching query vectors. Identical queries embed to identical vectors. Cache the embedding (keyed by query hash) and skip the embedder call.
- Caching search results. Identical (query, filter) pairs produce identical results until the index changes. Short TTL caching is cheap and helps for repeated queries.
- Local indexes for small per-tenant data. A tenant with 10K chunks doesn’t need a distributed index. A SQLite + FAISS local index per tenant can outperform a centralized solution.
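The quantization numbers are easy to check with a few lines of arithmetic. The sketch below counts raw vector storage only (no graph or metadata overhead) for 100M vectors at 1024 dimensions, and shows a simple min/max scalar quantizer and a sign-based binary quantizer; both are illustrative, not what any particular database implements.

```python
def vector_bytes(num_vectors: int, dims: int, bits_per_component: int) -> int:
    """Raw storage for the vectors alone, excluding index structures and metadata."""
    return num_vectors * dims * bits_per_component // 8

def quantize_int8(vec: list[float], lo: float, hi: float) -> list[int]:
    """Map floats in [lo, hi] linearly onto [-128, 127]; out-of-range values are clipped."""
    scale = 255.0 / (hi - lo)
    return [max(-128, min(127, round((x - lo) * scale) - 128)) for x in vec]

def quantize_binary(vec: list[float]) -> list[int]:
    """One bit per component: 1 if the component is positive, else 0."""
    return [1 if x > 0 else 0 for x in vec]

def hamming(a: list[int], b: list[int]) -> int:
    """Number of differing bits; the cheap distance used for a binary first pass."""
    return sum(x != y for x, y in zip(a, b))

float32_size = vector_bytes(100_000_000, 1024, 32)  # 409.6 GB
int8_size = vector_bytes(100_000_000, 1024, 8)      # 102.4 GB
binary_size = vector_bytes(100_000_000, 1024, 1)    # 12.8 GB
```

The binary codes are typically used only to shortlist candidates by Hamming distance, with the full-precision vectors rescoring the shortlist.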
Observability
Metrics worth tracking:
- Recall@k on a labeled eval set. Run weekly; the most important quality signal.
- Query latency by stage — embed, search, filter, rerank.
- Index size and memory usage.
- Ingestion lag — time from source change to index update.
- Query-to-result distribution. Track the frequency of low-similarity results; they often indicate a gap in the corpus or bad chunking.
Without a labeled eval set, every quality conversation is anecdotal. Maintain a set of ~100–500 representative (query, expected_result_ids) pairs and track recall over time.
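A harness for this can be very small. The sketch below assumes the eval set is a list of (query, expected_result_ids) pairs and that `search` is any callable returning ranked IDs; `fake_search` and its canned results are hypothetical stand-ins for the real pipeline.

```python
from typing import Callable

def recall_at_k(retrieved: list[str], expected: set[str], k: int) -> float:
    """Fraction of the expected IDs that appear in the top-k retrieved IDs."""
    if not expected:
        raise ValueError("each eval query needs at least one expected ID")
    return len(set(retrieved[:k]) & expected) / len(expected)

def evaluate(eval_set: list[tuple[str, set[str]]],
             search: Callable[[str, int], list[str]],
             k: int = 10) -> float:
    """Mean recall@k over the labeled eval set."""
    scores = [recall_at_k(search(query, k), expected, k) for query, expected in eval_set]
    return sum(scores) / len(scores)

# Hypothetical canned results standing in for the real retrieval pipeline.
canned = {"reset password": ["a", "x", "b"], "refund policy": ["y", "z"]}

def fake_search(query: str, k: int) -> list[str]:
    return canned[query][:k]

eval_set = [("reset password", {"a", "b"}), ("refund policy", {"c"})]
mean_recall_at_10 = evaluate(eval_set, fake_search, k=10)
```

Logging the per-query scores alongside the mean makes regressions easier to diagnose, since a drop is usually concentrated in a handful of queries.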
Closing
A production retrieval pipeline is many small decisions whose aggregate effect dominates the system’s quality. Embed once with a model you’ll keep; chunk with structure-awareness; index with HNSW (or one of its relatives) with parameters tuned to your data; filter by tenant always; combine dense and sparse retrieval; rerank candidates with a cross-encoder; build ingestion as a streaming, idempotent pipeline that can replay; measure recall against a labeled set. The mechanics are well-trodden by now — the published patterns from search-system engineering, adapted to vector representations. The discipline is the same as any data platform: take ingestion seriously, instrument the pipeline, version the schema (which here means embedding model + chunking strategy + index parameters), and treat re-ingestion as a routine operation rather than a crisis. Get that right and the retrieval layer becomes a quiet, well-behaved component of the system. Get it wrong and you find out that “semantic search” was carrying more load than anyone wrote down.