RAG in Production: What the Tutorials Don't Tell You
Jawad
AI Engineer
7 March 2026
10 min read

Retrieval-Augmented Generation looks like magic in a Jupyter notebook. You load some PDFs, chunk them, embed them, and ask a question; the LLM answers correctly and you feel like you've solved AI. Then you deploy it to production, real users start asking questions that don't map cleanly to your document structure, and your retrieval accuracy craters. This post is about the gap between the tutorial and production.
Chunking Strategies: The Foundation of Retrieval Quality
Bad chunking is the root cause of most RAG failures we diagnose. If your chunks break semantic units — splitting a table header from its rows, or cutting a code example in half — no retrieval model will save you.
- Fixed-size chunking (e.g. 512 tokens with 50-token overlap): Simple, fast, and consistently mediocre. Good for prototyping, not for production systems where document structure matters.
- Semantic chunking: Split on natural boundaries — paragraphs, sections, headings. Requires parsing the document structure. Retrieval quality improvement: typically 15–25% on our benchmarks.
- Hierarchical chunking (parent-child): Store both a large context chunk (the section) and small precise chunks (the paragraphs). Retrieve the small chunks, but send the parent context to the LLM. This pattern consistently outperforms others in our evaluations.
- Document-type specific: PDFs need different chunking than Markdown, which needs different chunking than HTML. Build a document-type router early — it pays off.
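The parent-child pattern above is mostly bookkeeping. Here is a minimal sketch, assuming a Markdown input where "## " headings delimit sections; the function names and splitting heuristics are illustrative, not a library API:

```python
def build_parent_child_index(markdown: str):
    """Index a Markdown document: each '## ' section becomes a parent chunk,
    and each paragraph inside it becomes a child chunk that points back to
    its parent."""
    parents: list[str] = []
    children: list[dict] = []
    section_lines: list[str] = []

    def flush():
        if not section_lines:
            return
        parent_id = len(parents)
        parent_text = "\n".join(section_lines).strip()
        parents.append(parent_text)
        # Children are the blank-line-separated paragraphs within the section.
        for para in parent_text.split("\n\n"):
            para = para.strip()
            if para:
                children.append({"text": para, "parent_id": parent_id})
        section_lines.clear()

    for line in markdown.splitlines():
        if line.startswith("## "):
            flush()  # a new heading closes the previous section
        section_lines.append(line)
    flush()
    return parents, children


def context_for(child: dict, parents: list[str]) -> str:
    """Retrieval matched a small child chunk; hand the LLM its parent section."""
    return parents[child["parent_id"]]
```

At query time you embed and search only the children, then deduplicate parent IDs before building the prompt, so the LLM sees each section at most once.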
Hybrid Search: Why Vector Alone Isn't Enough
Pure vector search fails on exact lookups — product codes, version numbers, proper nouns. Pure BM25 keyword search fails on semantic queries. Hybrid search (BM25 + vector similarity, combined with Reciprocal Rank Fusion) outperforms either alone on heterogeneous query types by a consistent margin in our tests.
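Reciprocal Rank Fusion itself is only a few lines: each document scores 1 / (k + rank) in every result list it appears in, and the scores are summed. A minimal Python sketch (the function name is ours; k = 60 is the constant from the original RRF paper):

```python
def rrf_fuse(result_lists: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Fuse several ranked result lists (e.g. BM25 and vector search) by
    summing 1 / (k + rank) for each document across all lists."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Documents appearing high in either list, or moderately in both, win.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

The Postgres query below computes the same score server-side, which avoids pulling both candidate lists into the application.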
-- Hybrid search with pgvector + tsvector (PostgreSQL)
-- $1 = query embedding, $2 = keyword query string
WITH bm25 AS (
    -- Keyword leg: rank only the documents that actually match the query
    SELECT id,
           ROW_NUMBER() OVER (ORDER BY ts_rank(tsv, to_tsquery('english', $2)) DESC) AS r
    FROM documents
    WHERE tsv @@ to_tsquery('english', $2)
),
vec AS (
    -- Vector leg: rank every document by cosine distance to the query embedding
    SELECT id,
           ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS r
    FROM documents
)
SELECT d.id,
       d.content,
       -- Reciprocal Rank Fusion with the conventional k = 60
       COALESCE(1.0 / (60 + bm25.r), 0) + COALESCE(1.0 / (60 + vec.r), 0) AS rrf_score
FROM documents d
LEFT JOIN bm25 USING (id)
LEFT JOIN vec USING (id)
ORDER BY rrf_score DESC
LIMIT 10;
⚠️ If your RAG accuracy is below 70%, here's why
In our experience, the top three causes are: (1) chunks that break semantic units, (2) using a generic embedding model for domain-specific content without fine-tuning, and (3) no re-ranking step after retrieval. A cross-encoder re-ranker applied to the top-20 retrieved chunks before sending to the LLM typically recovers 10–20 accuracy points.
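The re-ranking step itself is mostly glue code. A sketch of the shape, with the scoring function injected so it stays testable; in practice `score_fn` would wrap a cross-encoder such as sentence-transformers' `CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2').predict` (model choice is an example, not a recommendation):

```python
from typing import Callable

def rerank(query: str,
           chunks: list[str],
           score_fn: Callable[[str, str], float],
           keep: int = 5) -> list[str]:
    """Score each (query, chunk) pair with a cross-encoder and keep the
    `keep` highest-scoring chunks for the LLM prompt."""
    # Unlike the bi-encoder used for retrieval, the cross-encoder sees the
    # query and chunk together, so it can judge actual relevance.
    scored = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:keep]
```

Retrieve generously (top 20), re-rank, and send only the survivors: the extra model call costs tens of milliseconds and is usually the cheapest accuracy win available.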
By the numbers, from our benchmarks:
- 87%: average accuracy with hybrid search
- 62%: average accuracy with vector-only search
- 340 ms: median retrieval latency (p50)
- 15–20%: accuracy gain from re-ranking
