GSoft Consulting
AI & Automation

RAG in Production: What the Tutorials Don't Tell You

Jawad

AI Engineer

7 March 2026

10 min read

Retrieval-Augmented Generation looks like magic in a Jupyter notebook. You load some PDFs, chunk them, embed them, and ask a question — the LLM answers correctly and you feel like you've solved AI. Then you deploy it to production, real users start asking questions that don't map cleanly to your document structure, and your retrieval accuracy craters. This post is about the gap between the tutorial and production.

Chunking Strategies: The Foundation of Retrieval Quality

Bad chunking is the root cause of most RAG failures we diagnose. If your chunks break semantic units — splitting a table header from its rows, or cutting a code example in half — no retrieval model will save you.

  • Fixed-size chunking (e.g. 512 tokens with 50-token overlap): Simple, fast, and consistently mediocre. Good for prototyping, not for production systems where document structure matters.
  • Semantic chunking: Split on natural boundaries — paragraphs, sections, headings. Requires parsing the document structure. Retrieval quality improvement: typically 15–25% on our benchmarks.
  • Hierarchical chunking (parent-child): Store both a large context chunk (the section) and small precise chunks (the paragraphs). Retrieve the small chunks, but send the parent context to the LLM. This pattern consistently outperforms others in our evaluations.
  • Document-type specific: PDFs need different chunking than Markdown, which needs different chunking than HTML. Build a document-type router early — it pays off.
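The parent-child pattern above can be sketched in a few lines. This is a minimal illustration, not our production pipeline: it assumes Markdown-style documents where "## " marks a section, and the function name and dict layout are made up for the example.

```python
# Sketch of hierarchical (parent-child) chunking, assuming Markdown-style
# input where "## " starts a section. Child paragraphs get embedded and
# searched; on a hit, the parent section is what goes into the LLM prompt.

def parent_child_chunks(text: str) -> list[dict]:
    """Split a document into parent sections and child paragraphs."""
    chunks = []
    sections = [s for s in text.split("\n## ") if s.strip()]
    for parent_id, section in enumerate(sections):
        paragraphs = [p.strip() for p in section.split("\n\n") if p.strip()]
        for child in paragraphs:
            chunks.append({
                "child": child,         # small, precise chunk: embed and index this
                "parent_id": parent_id,
                "parent": section,      # large context chunk: send this to the LLM
            })
    return chunks
```

A real implementation would use a proper Markdown or PDF parser per document type, but the storage shape — small chunk for retrieval, pointer to the big chunk for generation — is the whole idea.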

Hybrid Search: Why Vector Alone Isn't Enough

Pure vector search fails on exact lookups — product codes, version numbers, proper nouns. Pure BM25 keyword search fails on semantic queries. Hybrid search (BM25 + vector similarity, combined with Reciprocal Rank Fusion) outperforms either alone on heterogeneous query types by a consistent margin in our tests.

SQL
-- Hybrid search with pgvector + tsvector (PostgreSQL),
-- fused with Reciprocal Rank Fusion (k = 60)
SELECT
  id,
  content,
  (1.0 / (60 + rank_bm25)) + (1.0 / (60 + rank_vector)) AS rrf_score
FROM (
  SELECT id, content,
    -- rank every row in both orderings; rows with no keyword
    -- match score ts_rank = 0 and fall to the bottom of rank_bm25
    ROW_NUMBER() OVER (ORDER BY ts_rank(tsv, query) DESC) AS rank_bm25,
    ROW_NUMBER() OVER (ORDER BY embedding <=> $1 ASC) AS rank_vector
  FROM documents,
    to_tsquery('english', $2) AS query
) ranked
ORDER BY rrf_score DESC
LIMIT 10;
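If your keyword and vector results come from separate systems (say, Elasticsearch for BM25 and a dedicated vector store), you can apply the same fusion in application code. A minimal sketch — the function name is ours, and k = 60 is the conventional RRF smoothing constant, matching the SQL above:

```python
# Reciprocal Rank Fusion over several independently ranked ID lists.
# Each document's score is the sum of 1 / (k + rank) across the lists
# it appears in; documents ranked well by either system float to the top.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists into a single best-first list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, you never have to normalize BM25 scores against cosine distances — which is exactly why it is the default fusion method for heterogeneous retrievers.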

⚠️ If your RAG accuracy is below 70%, here's why

In our experience, the top three causes are: (1) chunks that break semantic units, (2) using a generic embedding model for domain-specific content without fine-tuning, and (3) no re-ranking step after retrieval. A cross-encoder re-ranker applied to the top-20 retrieved chunks before sending to the LLM typically recovers 10–20 accuracy points.
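The re-ranking step itself is small. Here is a hedged sketch: the scorer is pluggable, and in practice it would be a cross-encoder such as sentence-transformers' `cross-encoder/ms-marco-MiniLM-L-6-v2` (that model name is one common public choice, not a recommendation specific to our stack):

```python
# Post-retrieval re-ranking: score the top-k retrieved chunks against the
# query with a (query, passage) -> float scorer, keep only the best few
# for the LLM prompt. With sentence-transformers, the scorer would be
# roughly: lambda q, c: cross_encoder.predict([(q, c)])[0]

from typing import Callable

def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], keep: int = 5) -> list[str]:
    """Return the `keep` candidates the scorer rates most relevant."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)[:keep]
```

The latency cost is one cross-encoder forward pass per candidate, which is why you re-rank the top 20 rather than the whole corpus.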

  • 87% — average accuracy with hybrid search
  • 62% — average accuracy with vector-only
  • 340 ms — median (p50) retrieval latency
  • 15–20% — accuracy gain from re-ranking

Tags

AI · RAG · LLM · Vector Databases · OpenAI · Production
