By INI8 Labs · 2026-05-21 · 11 min read
How to Build a Production RAG Pipeline That Doesn't Hallucinate
The demo worked. Retrieval looked relevant. The generated answers sounded authoritative. Then you deployed to production and discovered that your RAG system confidently fabricates policy details, cites documents that don't exist, and blends information from unrelated sources into plausible-sounding nonsense.
This is the RAG hallucination problem, and it's the gap between demo and production that catches every enterprise team.
LLMs hallucinate at baseline rates of 3–20% across mixed tasks, with higher rates on sparse domains or contradictory inputs. RAG is supposed to fix this by grounding generation in retrieved evidence. And it does — when engineered correctly. RAG alone typically reduces hallucinations by 40–71%. RAG combined with guardrails, evaluation, and careful architecture can push hallucination rates below 5% for well-defined enterprise use cases.
The difference between "demo that impresses" and "system that's trustworthy" comes down to engineering decisions at every layer: data quality, chunking, retrieval strategy, reranking, context assembly, generation constraints, and continuous evaluation. This guide covers each one.
Why RAG Systems Hallucinate
Understanding the failure modes is the first step to preventing them.
Retrieval failure. The system retrieves irrelevant or only tangentially related documents. The LLM, instructed to answer from the provided context, builds a plausible response on that irrelevant evidence, which is a hallucination. This failure mode accounts for the majority of RAG hallucinations.
Context dilution. Too many retrieved chunks, most of them only partially relevant, drown the signal in noise. The model picks up fragments from different chunks and synthesizes them into a response that doesn't accurately represent any single source.
Chunk boundary problems. A critical piece of information is split across two chunks. The retriever picks one but not the other, giving the model incomplete evidence. The model fills in the gap from its pre-training knowledge — which may be wrong or outdated.
Conflicting information. Different source documents contain contradictory information (common when documents span different time periods or author perspectives). The model arbitrarily chooses one version or blends them.
Abstention failure. The model generates an answer when the retrieved context doesn't actually contain the information needed. Instead of saying "I don't have enough information," it invents a plausible response.
Layer 1: Data Quality — The Foundation That Determines Everything
RAG systems are only as reliable as the data they retrieve from. Ungoverned data produces 45–60% retrieval accuracy. Well-governed data achieves 85–92%.
Clean the corpus before building the pipeline.
- Remove duplicates and near-duplicates. Two versions of the same policy from different years create contradictions the system can't resolve.
- Remove outdated documents. If your 2023 pricing guide is still in the corpus alongside the 2025 version, the system will sometimes retrieve the wrong one.
- Standardize formatting. Inconsistent document formats (some in PDF, some in HTML, some in Markdown) produce inconsistent chunking quality.
- Validate accuracy. A RAG system that retrieves inaccurate source documents generates inaccurate answers confidently. Garbage in, confident garbage out.
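The deduplication step above can be sketched as follows. This is a minimal illustration, not a prescribed method: it assumes each document is a dict with hypothetical `id`, `updated` (ISO date string), and `text` fields, and it uses Jaccard similarity over word trigrams with an illustrative threshold of 0.8. It compares every pair, so a large corpus would need something like MinHash with locality-sensitive hashing instead.

```python
import re


def shingles(text: str, size: int = 3) -> set:
    """Return the set of word n-grams in lowercased text, ignoring punctuation."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs: list, threshold: float = 0.8) -> list:
    """Keep the most recently updated document from each near-duplicate group."""
    # Visit newest first so that older near-duplicates are the ones dropped.
    ordered = sorted(docs, key=lambda d: d["updated"], reverse=True)
    kept = []  # list of (document, shingle set) pairs
    for doc in ordered:
        sig = shingles(doc["text"])
        if all(jaccard(sig, other) < threshold for _, other in kept):
            kept.append((doc, sig))
    return [doc for doc, _ in kept]


# Hypothetical corpus: the two pricing documents differ only in punctuation.
docs = [
    {"id": "pricing-2023", "updated": "2023-01-10",
     "text": "The standard plan costs 40 dollars per seat per month, billed annually."},
    {"id": "pricing-2025", "updated": "2025-02-01",
     "text": "The standard plan costs 40 dollars per seat per month billed annually."},
    {"id": "remote-work", "updated": "2024-06-15",
     "text": "Employees may work remotely up to three days per week."},
]
unique_docs = deduplicate(docs)
print([d["id"] for d in unique_docs])
```

Note that this only catches textual near-duplicates. Two versions of a policy that state different limits may score below the threshold, so outdated documents still need a separate check on dates or version metadata.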
Add rich metadata to every document: source, author, creation date, last updated date, document type, access permissions, and section headings. This metadata enables filtering at query time — "retrieve only from 2024–2025 financial reports" — and powers access control in multi-tenant environments.
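As a rough sketch of how that metadata can drive query-time filtering and access control, consider the snippet below. The `ChunkMeta` fields, the `is_retrievable` helper, and the sample corpus are all hypothetical; a real deployment would more likely push these conditions down into the vector database's own filter syntax.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ChunkMeta:
    source: str
    doc_type: str
    last_updated: date
    allowed_groups: set = field(default_factory=set)  # groups permitted to read this chunk


def is_retrievable(meta: ChunkMeta, *, doc_type: str, updated_from: date,
                   updated_to: date, user_groups: set) -> bool:
    """True if the chunk matches the query filter and the user may see it."""
    in_window = updated_from <= meta.last_updated <= updated_to
    permitted = bool(meta.allowed_groups & user_groups)
    return meta.doc_type == doc_type and in_window and permitted


corpus = [
    ChunkMeta("FY2023 annual report", "financial_report", date(2023, 3, 1), {"finance"}),
    ChunkMeta("FY2024 annual report", "financial_report", date(2024, 3, 1), {"finance"}),
    ChunkMeta("FY2025 board deck", "financial_report", date(2025, 2, 10), {"executives"}),
    ChunkMeta("Remote work policy", "policy", date(2024, 9, 5), {"finance", "hr"}),
]

# "Retrieve only from 2024-2025 financial reports", as a member of the finance group.
visible = [
    m.source for m in corpus
    if is_retrievable(m, doc_type="financial_report",
                      updated_from=date(2024, 1, 1), updated_to=date(2025, 12, 31),
                      user_groups={"finance"})
]
print(visible)
```

Applying the filter before similarity search, rather than after, keeps restricted or stale chunks from ever reaching the reranker or the prompt.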
Layer 2: Chunking — Split for Meaning, Not for Size
How you split documents into chunks determines whether the retriever can find complete, meaningful evidence.
Fixed-size chunks (the default) split documents into 500–1000 token segments. Simple but often splits context unnaturally. A paragraph about a policy might be cut in half, and neither chunk is sufficient on its own.
Semantic chunking splits at natural boundaries — paragraphs, sections, and headings — preserving logical units of information. Better quality but requires understanding document structure.
Document-aware chunking uses the document's own structure (headings, bullet points, tables, code blocks) to create chunks that map to logical sections. Best quality for structured documents.
The practical approach: Use semantic or document-aware chunking for structured content (policies, documentation, reports). Use recursive chunking with larger initial sizes for unstructured content (emails, chat transcripts). Include overlap between chunks (50–100 tokens) so context at boundaries isn't lost.
Always include the section heading and document title in every chunk. Without this context, a chunk like "The limit is $50,000 per transaction" is ambiguous — which limit? For which product? Adding "Source: Commercial Banking Policy, Section 3.2 — Wire Transfer Limits" makes the chunk self-contained.
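A minimal sketch of overlap plus section context might look like this. It assumes whitespace-separated words as a stand-in for tokens (a real pipeline would count tokens with the embedding model's tokenizer), and the `chunk_section` name and its parameters are illustrative.

```python
def chunk_section(title: str, heading: str, body: str,
                  max_tokens: int = 200, overlap: int = 50) -> list:
    """Split one section into overlapping chunks, each prefixed with its source context."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    words = body.split()  # assumption: one word approximates one token
    header = f"Source: {title}, {heading}"
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + max_tokens]
        chunks.append(f"{header}\n{' '.join(window)}")
        if start + max_tokens >= len(words):
            break  # this window already reached the end of the section
    return chunks


# Tiny sizes so the overlap is easy to see.
body = " ".join(f"w{i}" for i in range(10))
chunks = chunk_section("Commercial Banking Policy",
                       "Section 3.2: Wire Transfer Limits",
                       body, max_tokens=4, overlap=1)
for chunk in chunks:
    print(chunk, end="\n\n")
```

Each chunk repeats the last word of the previous one, and every chunk carries the document title and section heading, so a retrieved fragment stays interpretable on its own.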
Layer 3: Hybrid Retrieval — Don't Rely on Semantic Search Alone
Semantic search (vector similarity) finds chunks with similar meaning to the query. It's excellent for conceptual questions ("what's our policy on remote work?") but poor for exact matches ("what's the limit for account type XR-7?").
Keyword search (BM25) finds chunks containing the exact query terms. Excellent for specific lookups, poor for paraphrased queries.
Hybrid retrieval combines both using reciprocal rank fusion or linear score combination. This is the production standard. Neither method alone provides sufficient recall for enterprise data — hybrid consistently outperforms single-method pipelines.
Implementation: retrieve the top 20–50 candidates from each of semantic and keyword search, merge the two result lists, and pass the merged candidates to the reranking stage.
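Reciprocal rank fusion is simple enough to sketch in a few lines. The version below assumes each retriever returns an ordered list of chunk IDs (the IDs here are made up) and uses k = 60, a commonly used default for the smoothing constant; treat both as starting points rather than tuned values.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60, top_n: int = 50) -> list:
    """Merge several ranked lists of IDs; each appearance adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


semantic_hits = ["c3", "c1", "c7"]  # hypothetical vector-search ranking
keyword_hits = ["c1", "c9", "c3"]   # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Because the fusion uses ranks rather than raw scores, it sidesteps the problem that cosine similarities and BM25 scores live on different scales. Chunks found by both retrievers rise to the top.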
Layer 4: Reranking — The Quality Multiplier Most Teams Skip
This is the single highest-impact quality improvement for most RAG systems.
Initial retrieval, whether semantic, keyword, or hybrid, trades precision for speed: bi-encoders embed the query and each document independently, and BM25 only matches terms. A cross-encoder reranker takes the top 20–50 candidates and re-scores each one by encoding the query and document together, enabling much deeper semantic comparison.
Reranking alone typically improves retrieval quality by 15–35%. Models like Cohere Rerank and BGE-Reranker provide production-ready cross-encoder reranking, and ColBERT offers a late-interaction alternative, with minimal latency overhead (50–100ms for 50 candidates).
After reranking, select the top 3–5 chunks. Passing more chunks to the LLM increases cost and context dilution without proportional quality improvement.
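The rerank-then-truncate step can be sketched with a pluggable scoring function. In the snippet below, `overlap_score` is a deliberately crude stand-in so the example runs without a model; in a real pipeline `score_fn` would wrap a call to a cross-encoder that scores each (query, passage) pair jointly. All names are illustrative.

```python
from typing import Callable


def rerank(query: str, candidates: list,
           score_fn: Callable[[str, str], float], top_k: int = 5) -> list:
    """Score every (query, candidate) pair and keep the top_k highest-scoring candidates."""
    scored = [(score_fn(query, text), text) for text in candidates]
    # Sort by score only; the sort is stable, so ties keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


candidates = [
    "Remote work policy overview",
    "The wire transfer limit is $50,000 per transaction",
    "Transfer requests need manager approval",
]
top_chunks = rerank("wire transfer limit", candidates, overlap_score, top_k=2)
print(top_chunks)
```

Keeping the scorer behind a plain function signature makes it straightforward to swap reranking models and compare them on the same evaluation set.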
Layer 5: Context Assembly and Generation Constraints
How you present retrieved context to the LLM determines whether it generates a grounded response or a hallucination.
Include source attribution instructions. Tell the model to cite which retrieved chunks it used for each claim. This makes hallucinations detectable — if the model claims something not in any cited chunk, it's fabricating.
Handle missing information explicitly. Instruct the model: "If the retrieved context doesn't contain enough information to answer the question, say so. Do not generate an answer from general knowledge." This prevents the most dangerous hallucination mode — confident fabrication when the retriever fails.
Limit the context. Don't pass all 20–50 retrieved candidates to the model. Reranking plus top-K selection (3–5 high-quality chunks) produces better results than flooding the context window. More isn't better; it's noisier.
Handle contradictions. When retrieved chunks contain conflicting information, instruct the model to acknowledge the conflict and cite both sources, rather than arbitrarily choosing one. "According to the 2024 policy, the limit is $50,000. However, the 2025 update increased this to $75,000."
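One way to combine these four instructions into a single prompt is sketched below. The chunk format (dicts with hypothetical `source` and `text` keys) and the exact wording are assumptions; the right phrasing depends on the model and should be checked against your evaluation set.

```python
def build_prompt(question: str, chunks: list) -> str:
    """Assemble a grounded prompt with numbered, attributed context chunks."""
    context = "\n\n".join(
        f"[{i}] ({chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered context below.\n"
        "Cite the chunk number, for example [1], after every claim.\n"
        "If the context doesn't contain enough information to answer the question, "
        "say so. Do not generate an answer from general knowledge.\n"
        "If chunks conflict, acknowledge the conflict and cite both sources.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


chunks = [
    {"source": "Commercial Banking Policy 2024", "text": "The wire transfer limit is $50,000."},
    {"source": "Commercial Banking Policy 2025", "text": "The wire transfer limit is $75,000."},
]
prompt = build_prompt("What is the wire transfer limit?", chunks)
print(prompt)
```

Numbering the chunks gives the model a compact way to cite and gives downstream validators something concrete to check.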
Layer 6: Evaluation — Catch Drift Before Users Do
A RAG system without evaluation degrades silently. Accuracy erodes as source documents change, query distributions shift, and the retrieval index grows stale.
Metrics to track continuously:
- Retrieval accuracy — are the right chunks being retrieved? Measure with a labeled evaluation set.
- Faithfulness — does the generated answer stick to the retrieved context? Automated checks (LLM-as-judge) compare answers against source chunks.
- Answer relevance — does the response actually address the question?
- Abstention rate — how often does the model correctly decline to answer when evidence is insufficient?
- Hallucination rate — what percentage of generated claims aren't supported by retrieved context?
Evaluation frameworks: RAGAS and TruLens automate these measurements. Custom LLM-as-judge pipelines provide domain-specific evaluation. Without automated evaluation, you're relying on user complaints to detect quality problems — by which point trust is already damaged.
Build a golden evaluation set: 100–200 representative questions, each with a known correct answer and the chunks that should be retrieved. Run this evaluation weekly, and investigate any quality drop of more than 3–5%.
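A golden-set check for the retrieval side can be as small as the sketch below. It measures recall@k, which is only one of the metrics listed above, and it reads the 3–5% threshold as an absolute drop in the score, which is an interpretation rather than something the guidance spells out. The `retrieve` callable and the golden-set fields are hypothetical.

```python
def recall_at_k(retrieved: list, expected: list, k: int = 5) -> float:
    """Fraction of the expected chunk IDs that appear in the top k retrieved IDs."""
    if not expected:
        return 1.0
    return len(set(retrieved[:k]) & set(expected)) / len(expected)


def evaluate_retrieval(golden_set: list, retrieve, k: int = 5) -> float:
    """Mean recall@k over a golden set of questions with known expected chunks."""
    scores = [
        recall_at_k(retrieve(item["question"]), item["expected_chunks"], k)
        for item in golden_set
    ]
    return sum(scores) / len(scores)


def needs_investigation(baseline: float, current: float, max_drop: float = 0.03) -> bool:
    """Flag a run whose score fell by more than max_drop (absolute) from the baseline."""
    return (baseline - current) > max_drop


golden_set = [
    {"question": "What is the wire transfer limit?", "expected_chunks": ["a", "b"]},
    {"question": "How many remote days are allowed?", "expected_chunks": ["c"]},
]
# Stand-in retriever: a fixed lookup table instead of a real search call.
fake_results = {
    "What is the wire transfer limit?": ["a", "x", "b"],
    "How many remote days are allowed?": ["y", "z"],
}
score = evaluate_retrieval(golden_set, fake_results.get)
print(score, needs_investigation(baseline=0.9, current=score))
```

Storing each weekly score alongside the index version makes it easier to tie a regression to a specific ingestion run or configuration change.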
Layer 7: Guardrails and Human-in-the-Loop
For high-stakes enterprise applications (medical, legal, financial), automated guardrails add a safety layer.
Output validation. Check generated responses for common hallucination patterns — fabricated citations, statistics not present in context, claims that contradict retrieved evidence. Automated validators can flag or block responses before they reach users.
Confidence scoring. When retrieval quality is low (retriever scores below a threshold), route the query to a human agent instead of generating a potentially unreliable response. This preserves trust — better to say "let me connect you with a specialist" than to generate a hallucinated answer.
Feedback loops. Collect user feedback (thumbs up/down, corrections) and feed it back into evaluation and retrieval optimization. The users who interact with your RAG system daily are your best quality monitors.
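Two of these guardrails, confidence-based routing and a check for fabricated citations, can be sketched as follows. The 0.5 threshold and the `[n]` citation format are assumptions for illustration; the threshold in particular depends entirely on the retriever or reranker's score distribution.

```python
import re

MIN_RETRIEVAL_SCORE = 0.5  # hypothetical threshold; tune it on your own evaluation set


def should_escalate(top_score: float, min_score: float = MIN_RETRIEVAL_SCORE) -> bool:
    """Before generation: hand off to a human when the best retrieval score is too low."""
    return top_score < min_score


def citation_problems(answer: str, num_chunks: int) -> list:
    """After generation: report missing citations or citations of chunks that were never provided."""
    cited = [int(n) for n in re.findall(r"\[(\d+)\]", answer)]
    problems = []
    if not cited:
        problems.append("no citations")
    fabricated = sorted({n for n in cited if not 1 <= n <= num_chunks})
    if fabricated:
        problems.append(f"cites missing chunks: {fabricated}")
    return problems


print(should_escalate(0.2))                                    # low-confidence retrieval
print(citation_problems("The limit is $50,000 [1].", 3))       # valid citation
print(citation_problems("The limit is $75,000 [7].", 3))       # cites a chunk that does not exist
```

A check like this only confirms that citations point at real chunks. Verifying that the cited chunk actually supports the claim still needs a faithfulness check such as an LLM-as-judge pass.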
What This Looks Like End-to-End
A production RAG pipeline that minimizes hallucination:
- Data layer: Clean, deduplicated, metadata-rich corpus with automated refresh pipeline
- Chunking: Semantic or document-aware chunking with overlap and section context
- Indexing: Dual index — vector embeddings for semantic search, inverted index for keyword search
- Retrieval: Hybrid search (semantic + BM25) returning top 50 candidates
- Reranking: Cross-encoder reranker selecting top 3–5 chunks
- Generation: LLM with explicit grounding instructions, source citation requirements, and abstention guidance
- Evaluation: Automated faithfulness, relevance, and hallucination monitoring on production samples
- Guardrails: Output validation, confidence-based routing, and human escalation for low-confidence responses
With this architecture, hallucination rates below 5% are achievable for well-defined enterprise use cases. It's more complex than a demo pipeline, and that complexity is exactly the difference between a prototype and a production generative AI system.
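It can help to pin the pipeline's tunable numbers in one place. The sketch below gathers the ranges suggested in this guide into a single configuration object; the `RagConfig` name, its fields, and the chosen defaults are illustrative picks from within those ranges, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagConfig:
    chunk_overlap_tokens: int = 50      # guide suggests 50-100
    candidates_per_retriever: int = 50  # guide suggests 20-50 from each retriever
    rerank_top_k: int = 5               # guide suggests 3-5 chunks after reranking
    golden_set_size: int = 200          # guide suggests 100-200 questions
    max_quality_drop: float = 0.03      # guide suggests investigating drops above 3-5%

    def __post_init__(self):
        # Basic sanity checks so a bad config fails at startup, not in production.
        if self.rerank_top_k > self.candidates_per_retriever:
            raise ValueError("rerank_top_k cannot exceed candidates_per_retriever")
        if not 0 < self.max_quality_drop < 1:
            raise ValueError("max_quality_drop must be a fraction between 0 and 1")


config = RagConfig(rerank_top_k=3)
print(config)
```

Versioning this object alongside evaluation results makes it possible to tell whether a quality change came from the data or from a parameter change.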
FAQ
What's the single most impactful change to reduce RAG hallucination?
Adding a cross-encoder reranker between retrieval and generation. Most RAG quality problems are retrieval quality problems: the LLM generates reasonable responses based on whatever context it receives, so if the context is irrelevant, the response will be wrong. Reranking pushes the genuinely relevant chunks to the top, which alone typically improves retrieval quality by 15–35%.
How much does a production RAG pipeline cost to operate?
For a mid-scale enterprise deployment (10K queries/day), expect roughly $2K–$8K/month: vector database ($500–$2K), embedding compute ($200–$500), reranker compute ($300–$800), and LLM inference ($1K–$5K depending on model). The largest variable cost is LLM inference, which is where model routing (using smaller models for simple queries) provides the most savings.
Can RAG completely eliminate hallucination?
No. Hallucination can be reduced to very low rates (sub-5%) but not eliminated entirely. LLMs generate probabilistic outputs, not verified facts. Edge cases — ambiguous queries, sparse evidence, contradictory sources — will always produce some unreliable responses. The goal is to minimize hallucination through engineering and catch remaining cases through evaluation and guardrails.
How do we handle RAG when the knowledge base is constantly changing?
Build an automated ingestion pipeline that detects source document changes, re-chunks, re-embeds, and updates the vector index. For slowly changing corpora (weekly updates), batch ingestion is sufficient. For rapidly changing data (support tickets, product databases), consider near-real-time ingestion or combine RAG with live API calls for the most dynamic data sources.
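The change-detection part of such a pipeline can be sketched with content hashes. The snippet assumes the index stores one hash per document ID; `plan_ingestion` and the sample documents are hypothetical, and the actual re-chunking, re-embedding, and index updates are left out.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_ingestion(indexed_hashes: dict, current_docs: dict) -> dict:
    """Compare stored hashes with current sources and decide what to re-embed or delete."""
    current_hashes = {doc_id: content_hash(text) for doc_id, text in current_docs.items()}
    # New or changed documents need re-chunking and re-embedding.
    upsert = sorted(doc_id for doc_id, h in current_hashes.items()
                    if indexed_hashes.get(doc_id) != h)
    # Documents that vanished from the source should leave the index too.
    delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current_hashes)
    return {"upsert": upsert, "delete": delete}


indexed_hashes = {
    "pricing": content_hash("Plan costs $40 per seat."),
    "remote-work": content_hash("Three remote days per week."),
    "legacy-faq": content_hash("Old FAQ content."),
}
current_docs = {
    "pricing": "Plan costs $45 per seat.",         # changed
    "remote-work": "Three remote days per week.",  # unchanged
    "onboarding": "Welcome to the team.",          # new
}
plan = plan_ingestion(indexed_hashes, current_docs)
print(plan)  # {'upsert': ['onboarding', 'pricing'], 'delete': ['legacy-faq']}
```

Handling deletions matters as much as handling updates: a document removed from the source but left in the index is exactly the kind of outdated evidence that produces confident wrong answers.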
A RAG system that hallucinates isn't a tool — it's a liability. INI8 Labs helps enterprise teams build production-grade RAG pipelines with hybrid retrieval, reranking, evaluation, and guardrails — systems that deliver trustworthy, grounded AI responses your teams can rely on.