How to Build a Production-Grade RAG System Without Hallucinations
Production RAG Architecture That Survives Real Traffic
If your RAG stack is returning confident wrong answers, the issue is usually not the LLM. It is weak retrieval, poor context packing, loose citation rules, and missing safety routes when confidence is low.
A production setup needs five layers working together:
- Ingestion and normalization: parse PDFs, docs, tickets, wiki pages, and keep version metadata.
- Indexing: chunking + embeddings + vector and keyword indexes.
- Retrieval and ranking: hybrid search, reranking, context assembly, and token budgeting.
- Generation with evidence: strict citation binding and grounded prompting.
- Evaluation and control plane: offline metrics, online telemetry, confidence scoring, and human escalation.
Without all five, you get demos that look good and production systems that fail under noisy queries.
Reference Architecture (Opinionated)
Data flow
- Connectors pull source data from Confluence, SharePoint, Google Drive, Jira, Notion, and databases.
- Documents are normalized to clean Markdown or JSON with source ID, section headers, ACL, timestamp, and checksum.
- Chunking creates semantically coherent units (not fixed-size slices only).
- Embeddings are generated and stored with metadata in a vector store.
- A BM25 index (OpenSearch/Elasticsearch) stores full text for keyword recall.
- At query time, hybrid retrieval fetches candidates from vector + keyword indexes.
- A cross-encoder reranker scores top 50-200 candidates and keeps top 5-12 chunks.
- A context builder removes duplicates, enforces token budget, and preserves citation IDs.
- LLM generates answer constrained to cited evidence.
- Grounding checker validates claims against provided chunks before response is returned.
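The normalization step above is worth pinning down in code. Below is a minimal sketch of a normalized document record; the class and field names (`NormalizedDoc`, `acl_groups`, and so on) are illustrative assumptions, not a required schema. The checksum lets the ingestion job detect unchanged documents and skip re-embedding them.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class NormalizedDoc:
    """One normalized source document, ready for chunking.
    Field names are illustrative; adapt to your connectors."""
    source_id: str
    title: str
    body_markdown: str
    section_headers: list
    acl_groups: list
    updated_at: str   # ISO-8601 timestamp from the source system
    checksum: str = ""

    def __post_init__(self):
        # Checksum over the body lets ingestion skip unchanged documents,
        # which keeps embedding cost proportional to corpus churn.
        if not self.checksum:
            self.checksum = hashlib.sha256(self.body_markdown.encode()).hexdigest()

doc = NormalizedDoc(
    source_id="confluence:SPACE/12345",
    title="VPN Access Policy",
    body_markdown="# VPN Access Policy\nAll contractors must request access via IT.",
    section_headers=["VPN Access Policy"],
    acl_groups=["it-staff", "contractors"],
    updated_at="2024-05-01T09:00:00Z",
)
```

Storing the checksum alongside the source ID also makes deletion propagation auditable: if a source document disappears upstream, its chunks can be located and purged by `source_id`.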
Recommended starting stack
- Ingestion: Unstructured or LlamaIndex readers + custom cleaners.
- Embeddings: OpenAI text-embedding-3-large for accuracy, or bge-large-en-v1.5 for self-hosted control.
- Vector DB: Pinecone, Weaviate, Qdrant, or pgvector (Postgres) depending on scale and ops budget.
- Keyword retrieval: OpenSearch BM25.
- Reranker: Cohere Rerank v3 or bge-reranker-large.
- Generation: GPT-4.1 / Claude Sonnet class models with citation schema output.
- Guardrails: custom grounding checks + refusal policy when evidence score is low.
Chunking: Where Most Accuracy Loss Starts
Chunking quality drives retrieval quality. Bad chunking creates false negatives: relevant content exists but is never returned.
What works in practice
- Semantic section chunking: split by headings, lists, and paragraph boundaries first.
- Token window target: 300-500 tokens for policy and documentation content.
- Overlap: 10-20% overlap only when narrative continuity matters.
- Special handling: tables and code blocks should stay intact; never split mid-table row or function.
- Metadata: store section title, source URL, page number, and updated_at for each chunk.
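The practices above can be sketched as a small splitter: headings first, then paragraphs packed up to a token target, with no paragraph ever split mid-way. This is a simplified illustration, not a full implementation; the word-based token estimate is an assumption you would replace with your actual tokenizer, and table/code-block handling here relies on them having no internal blank lines.

```python
import re

def approx_tokens(text: str) -> int:
    # Rough heuristic (~0.75 words per token for English); swap in your tokenizer.
    return max(1, int(len(text.split()) / 0.75))

def semantic_chunks(markdown: str, target_tokens: int = 400):
    """Split on headings first, then pack whole paragraphs up to the token
    target. Paragraphs are never split, so tables and code blocks without
    internal blank lines stay intact."""
    chunks = []
    # Keep each heading attached to its section so chunks carry their title.
    sections = re.split(r"\n(?=#{1,6} )", markdown)
    for section in sections:
        current, size = [], 0
        for para in section.split("\n\n"):
            t = approx_tokens(para)
            if current and size + t > target_tokens:
                chunks.append("\n\n".join(current))
                current, size = [], 0
            current.append(para)
            size += t
        if current:
            chunks.append("\n\n".join(current))
    return chunks
```

In practice you would also attach the section title, source URL, and updated_at to each emitted chunk, per the metadata point above.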
Numbers from production rollouts
On enterprise knowledge bases (100k to 2M chunks), teams typically see:
- Naive fixed 1,000-token chunking: 58-67% answer faithfulness on internal eval sets.
- Semantic 400-token chunking + reranker: 74-86% faithfulness.
- Adding table-aware chunking for ops and finance docs: extra 4-9 point gain on factoid questions.
The lift comes from retrieval hit quality, not from changing the generator model.
Embedding Models and Retrieval Quality Trade-offs
Pick embeddings based on corpus language, latency target, and infra constraints.
| Model | Typical Dim | Strength | Latency (per 1k chunks) | Cost Profile | When to Choose |
|---|---|---|---|---|---|
| text-embedding-3-large | 3072 | High recall on mixed enterprise corpora | ~1.5-3.0s API-side batch | API usage cost, low ops overhead | Fastest route to high relevance |
| text-embedding-3-small | 1536 | Lower cost, decent quality | ~1.0-2.2s | Lower API bill | Cost-sensitive workloads |
| bge-large-en-v1.5 | 1024 | Strong open-source retrieval | ~0.8-2.5s on A10G batch | GPU hosting + ops time | Data residency or vendor control needs |
| e5-large-v2 | 1024 | Stable baseline across domains | ~0.9-2.8s on A10G batch | GPU hosting + ops time | Self-hosted baseline model |
For most teams, start with hosted embeddings for 4-8 weeks, measure retrieval quality, then decide if self-hosting is worth the operational load.
Vector Store Selection: Latency, Cost, and Operational Reality
There is no universal winner. The right choice depends on record count, QPS, multi-tenant isolation, and your team’s tolerance for infra work.
| Vector Store | p95 Query Latency (100k-1M vectors) | Monthly Cost Range* | Operational Complexity | Notes |
|---|---|---|---|---|
| Pinecone (serverless/pod) | 40-120 ms | $70-$1,200+ | Low | Fast setup, managed scaling, predictable DX |
| Weaviate Cloud | 50-150 ms | $90-$1,500+ | Medium | Good filtering + hybrid options |
| Qdrant Cloud / self-hosted | 35-130 ms | $40-$900+ (cloud) / infra-based self-host | Medium | Strong performance-cost ratio |
| pgvector (Postgres) | 80-300 ms | $25-$500+ (depends on DB tier) | Medium-High | Best when you already run Postgres and need transactional joins |
*Ranges vary by region, replication, SLA, and query volume. Use load tests with your own embedding dimensions and filter patterns.
Practical recommendation
- < 5M chunks, small team: managed Pinecone or Qdrant Cloud.
- Strict data residency + platform team available: self-hosted Qdrant/Weaviate.
- Already all-in on Postgres and moderate QPS: pgvector can work, but benchmark hard before committing.
Retrieval Ranking Pipeline: Hybrid + Rerank Is Not Optional
Single-step vector search is not enough for enterprise queries. Acronyms, exact policy IDs, and error codes are often keyword-dominant. Semantic-only retrieval misses these.
Minimum retrieval pipeline
- Query rewrite: normalize spelling, expand acronyms where known.
- Hybrid candidate fetch: vector top-100 + BM25 top-100.
- Dedup + filter: ACL checks, freshness constraints, tenant boundary filters.
- Cross-encoder rerank: score top-100, keep top-8 or top-10.
- Context packing: include highest score chunks first; preserve source diversity.
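The hybrid-fetch and rerank steps can be sketched in a few lines. Reciprocal rank fusion (RRF) is one common way to merge vector and BM25 rankings; the stub search and reranker callables here are placeholders for your real retrieval clients, and the `allowed_ids` set stands in for ACL/tenant filtering.

```python
def rrf_fuse(vector_ids, keyword_ids, k: int = 60):
    """Reciprocal rank fusion: merge two rankings without comparing raw scores."""
    scores = {}
    for ranking in (vector_ids, keyword_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def retrieve(query, vector_search, bm25_search, reranker, allowed_ids, top_n=8):
    # 1) Hybrid candidate fetch (top-100 each in production).
    fused = rrf_fuse(vector_search(query), bm25_search(query))
    # 2) Dedup is implicit in the fused dict; apply ACL/tenant filters.
    candidates = [d for d in fused if d in allowed_ids]
    # 3) Cross-encoder rerank, keep top-N for context packing.
    return sorted(candidates, key=lambda d: reranker(query, d), reverse=True)[:top_n]
```

Note that the final ordering comes from the reranker, not the fusion score: fusion only decides which candidates are worth the reranker's latency budget.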
Expected gains
- Hybrid retrieval over vector-only: +8 to +18 points on recall@10 in enterprise corpora.
- Adding reranker: +6 to +14 points on answer faithfulness.
- Net effect on user-visible wrong answers: often 25-45% reduction.
Why Hallucinations Happen in RAG (and How to Stop Each Failure Mode)
1) Retrieval failure
Symptom: model answers confidently from prior knowledge, not your corpus.
Causes: poor chunking, weak embeddings for domain language, bad filters, missing hybrid retrieval.
Fixes:
- Track retrieval recall@k and hit rate by query category.
- Use hybrid retrieval + reranker.
- Add synonym and acronym dictionaries from your real ticket history.
- Run canary tests whenever ingestion pipeline changes.
2) Context window overflow
Symptom: relevant evidence exists but gets dropped during prompt assembly.
Causes: too many chunks, poor token budgeting, verbose system prompts.
Fixes:
- Hard token budget for evidence (for example 6k of 12k total prompt budget).
- Keep only top reranked chunks and remove near-duplicates.
- Use contextual compression before generation.
- Fail closed when no high-score chunk fits budget.
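A hard evidence budget with fail-closed behavior can be sketched as a greedy packer. This assumes chunks arrive as `(score, token_count, text)` tuples from the reranker; the threshold and exception choice are illustrative.

```python
def pack_context(chunks, budget_tokens: int, min_score: float = 0.5):
    """Greedy packing under a hard evidence budget. Fails closed (raises)
    when no sufficiently strong chunk fits, instead of padding with noise.
    `chunks` are (score, token_count, text) tuples."""
    packed, used = [], 0
    for score, tokens, text in sorted(chunks, reverse=True):
        if score < min_score:
            break  # remaining chunks are weaker; stop rather than dilute context
        if used + tokens <= budget_tokens:
            packed.append(text)
            used += tokens
    if not packed:
        raise LookupError("no high-score chunk fits the evidence budget")
    return packed, used
```

The caller catches the exception and routes to a clarifying question or human queue, which is the "fail closed" path above.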
3) Conflicting sources
Symptom: answer mixes old and new policy statements.
Causes: stale documents remain searchable, no temporal weighting.
Fixes:
- Version all source documents and store effective dates.
- Prefer latest approved source by rank boost.
- If top sources conflict, force model to present both and mark uncertainty.
- Add stale-index TTL jobs and deletion propagation checks.
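The "prefer latest approved source" rank boost can be as simple as an exponential decay on document age. The half-life value here is an assumption to tune per corpus; policy documents may warrant a much shorter one than reference manuals.

```python
from datetime import datetime, timezone

def recency_boost(base_score: float, effective_date: str,
                  half_life_days: float = 365.0, now=None) -> float:
    """Multiply the retrieval score by an exponential decay on document age,
    so the latest approved version of a policy outranks stale copies."""
    now = now or datetime.now(timezone.utc)
    age_days = (now - datetime.fromisoformat(effective_date)).days
    return base_score * 0.5 ** (max(age_days, 0) / half_life_days)
```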
4) Model confabulation
Symptom: model invents values, APIs, or policy steps not in context.
Causes: permissive prompting, no citation constraints, no output validator.
Fixes:
- Require every factual sentence to map to citation IDs.
- Run a grounding pass: each claim must have lexical or semantic support from context.
- If unsupported claims exceed threshold, return refusal or clarifying question.
- For high-risk domains (legal, medical, financial), always route low-confidence answers to humans.
Citation Enforcement and Grounding Checks
“Please cite sources” in prompt text is a weak control. Enforce citations through an output schema and validators.
Recommended response schema
- answer_text: final response.
- claims[]: atomic factual statements.
- citations[]: source IDs per claim.
- confidence: 0-1 score from retrieval + grounding layers.
- needs_human_review: boolean gate.
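One way to encode that schema is a pair of dataclasses; the names mirror the fields above, and the coverage helper feeds the confidence layer described later. This is a sketch, not a prescribed contract.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str                       # one atomic factual statement
    citations: list = field(default_factory=list)  # source chunk IDs per claim

@dataclass
class RagResponse:
    answer_text: str
    claims: list                    # list[Claim]
    confidence: float               # 0-1, from retrieval + grounding layers
    needs_human_review: bool = False

    def citation_coverage(self) -> float:
        """Share of claims carrying at least one citation; a direct input
        to the composite confidence score."""
        if not self.claims:
            return 0.0
        return sum(1 for c in self.claims if c.citations) / len(self.claims)
```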
Grounding algorithm (simple and effective)
- Split generated answer into atomic claims.
- For each claim, compute similarity with cited chunk text.
- Run contradiction check against top alternate chunks.
- Mark claim unsupported if similarity below threshold (for example 0.72 cosine) or contradiction high.
- If unsupported claim ratio > 0.15, reject answer and trigger fallback.
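The steps above can be sketched as a gate function. For brevity this uses lexical token overlap as a stand-in for embedding cosine similarity (so the 0.72 threshold here applies to overlap, not cosine); in production you would embed claim and evidence and compare vectors, and add the contradiction check mentioned above.

```python
def support_score(claim: str, evidence: str) -> float:
    """Cheap lexical stand-in for embedding similarity: fraction of claim
    tokens present in the evidence. Replace with cosine over embeddings."""
    a, b = set(claim.lower().split()), set(evidence.lower().split())
    return len(a & b) / len(a) if a else 0.0

def grounding_gate(claims, cited_chunks, support_threshold=0.72,
                   max_unsupported=0.15):
    """Reject the answer when too many claims lack support in their citations.
    `claims` is a list of (claim_text, [chunk_id, ...]) pairs."""
    if not claims:
        return False  # an answer with no extractable claims fails closed
    unsupported = 0
    for claim, chunk_ids in claims:
        best = max((support_score(claim, cited_chunks[c]) for c in chunk_ids),
                   default=0.0)
        if best < support_threshold:
            unsupported += 1
    return (unsupported / len(claims)) <= max_unsupported
```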
Teams using this pattern often cut severe hallucinations by 40-70% after threshold tuning.
Confidence Scoring and Human Fallback Routing
Low confidence should not produce polished guesses. It should trigger escalation.
Composite confidence score
Use weighted signals:
- Retrieval score (top-k mean and margin between top-1 and top-2).
- Reranker score calibration.
- Citation coverage ratio (claims with valid evidence).
- Grounding pass success rate.
- Query risk class (billing, compliance, contract terms, security controls).
Example policy:
- Confidence ≥ 0.82: auto-answer.
- 0.60 to 0.81: answer with uncertainty banner + ask one clarifying question.
- < 0.60 or high-risk class: route to human queue with retrieved evidence attached.
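The example policy above reduces to a small routing function. The route names are illustrative; the thresholds are the ones stated in the policy.

```python
def route(confidence: float, high_risk: bool) -> str:
    """Confidence-based routing per the example policy: high-risk intents
    and low-confidence answers always go to a human."""
    if high_risk or confidence < 0.60:
        return "human_queue"
    if confidence >= 0.82:
        return "auto_answer"
    return "answer_with_uncertainty_banner"
```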
In support environments, this routing model usually improves CSAT while reducing incident risk from wrong automated answers.
Latency and Cost Budgeting for Real Deployments
You need explicit SLOs. Otherwise, retrieval and guardrails creep until response time is unacceptable.
Typical latency budget (p95 target: 2.5-4.0s)
- Query rewriting + policy checks: 40-120 ms
- Hybrid retrieval: 80-250 ms
- Reranking (top-100): 120-450 ms
- Context assembly + dedup: 30-120 ms
- LLM generation (short answer): 900-2,400 ms
- Grounding validator: 100-350 ms
Cost per 1,000 queries (illustrative mid-size setup)
- Embeddings amortized over ingestion cadence: $1-$12 depending on refresh rate and corpus churn.
- Vector + keyword retrieval infra: $8-$60.
- Reranking API/inference: $5-$40.
- LLM generation: $25-$220 based on model and token policy.
- Total typical band: $40-$330 per 1,000 queries.
Main cost drivers are prompt length and model choice, not vector search alone.
Evaluation Framework You Can Run Weekly
Use a two-layer evaluation setup: offline benchmark + online production telemetry.
Offline eval (RAGAS + task-specific metrics)
- Faithfulness: is answer supported by provided context?
- Answer relevance: does it address the user question?
- Context precision: how much retrieved context was actually useful?
- Context recall: did retrieval include the needed evidence?
- Citation accuracy: are citations valid and correctly mapped?
Build a gold dataset of 200-500 real queries split by intent: policy lookup, troubleshooting, how-to, and edge cases. Keep at least 20% adversarial queries with ambiguous wording.
Online eval (production)
- Acceptance rate (user took suggested answer without escalation).
- Human override rate.
- Unsupported claim rate from grounding checker.
- Latency p50/p95/p99.
- Cost per successful answer.
- Incident count from wrong answers in high-risk flows.
Release gate example
- Faithfulness ≥ 0.85
- Citation accuracy ≥ 0.95
- p95 latency ≤ 3.5s
- High-risk unsupported claim rate ≤ 1.0%
If one gate fails, do not ship the retrieval change.
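A release gate like this is easy to enforce in CI. The sketch below encodes the four thresholds above; metric key names are assumptions to match to your telemetry pipeline.

```python
def release_gate(metrics: dict) -> list:
    """Return the list of failed gates; an empty list means the retrieval
    change may ship. Thresholds match the release gate example."""
    gates = {
        "faithfulness":        lambda v: v >= 0.85,
        "citation_accuracy":   lambda v: v >= 0.95,
        "p95_latency_s":       lambda v: v <= 3.5,
        "hr_unsupported_rate": lambda v: v <= 0.01,  # high-risk unsupported claims
    }
    return [name for name, ok in gates.items() if not ok(metrics[name])]
```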
Field Reality: What Fails After Launch
- Index drift: data connectors fail silently and freshness degrades. Fix with ingestion SLIs and alerting.
- Permission leaks: ACL filters are skipped in one path. Fix with centralized authorization middleware.
- Prompt bloat: teams keep adding rules; latency spikes. Fix with prompt budget ownership and periodic cleanup.
- No ownership: model team and platform team split responsibilities but nobody owns faithfulness KPI. Assign one RAG owner with release authority.
Implementation Checklist for CTOs and VP Engineering
- Set measurable targets: faithfulness, citation accuracy, p95 latency, cost/query.
- Ship hybrid retrieval from day one.
- Use reranking before generation.
- Enforce structured citations with validators, not prompt suggestions.
- Introduce confidence-based fallback routing for high-risk intents.
- Run weekly RAGAS-based regression tests on a fixed gold set.
- Instrument everything: retrieval hit/miss, unsupported claims, escalation reasons.
- Version prompts, chunking configs, and indexes so rollbacks are possible.
90-Day Rollout Plan
Days 1-30: establish baseline
- Ship ingestion, semantic chunking, and hybrid retrieval.
- Create the first 300-query gold evaluation set from real user traffic.
- Instrument recall@k, faithfulness, citation coverage, latency, and escalation rate.
Days 31-60: reduce wrong answers
- Add reranking, grounding validator, and confidence-based routing.
- Tune chunk size and overlap by query category rather than one global setting.
- Run weekly regression tests and block releases that miss thresholds.
Days 61-90: scale safely
- Introduce tenant-level SLO dashboards and error budgets.
- Optimize token budgets to lower generation cost without hurting faithfulness.
- Expand automation only in intents where unsupported claim rate is consistently low.
FAQ
How large should top-k retrieval be before reranking?
Start with 100 from vector and 100 from BM25, then rerank to top 8-12 for generation. Smaller candidate pools often miss critical evidence on enterprise corpora.
Can I use only pgvector to reduce stack complexity?
Yes for moderate scale and lower QPS, especially when Postgres is already your operational center. Benchmark p95 latency and recall under real filters before committing.
Is reranking always worth the cost?
For customer-facing or compliance-sensitive flows, yes. In internal low-risk assistants, you can make reranking conditional on ambiguous queries to cut cost.
What is a good first milestone before broad rollout?
Hit faithfulness ≥ 0.85 and citation accuracy ≥ 0.95 on a representative 300-query set, then run a limited launch with escalation always enabled for high-risk intents.
References
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al.)
- RAGAS: Automated Evaluation of Retrieval Augmented Generation
- Pinecone Documentation and Benchmarks
- Qdrant Documentation
- Weaviate Documentation
- pgvector Documentation
- OpenAI Embeddings Guide
- Cohere Rerank Overview
- Elasticsearch Reference (BM25 and Hybrid Search)
Build It So Wrong Answers Are Hard, Not Easy
Production RAG is a systems problem. Retrieval quality, ranking, citation controls, and fallback logic matter more than prompt wording. Teams that treat RAG like a reliability discipline get predictable outcomes. Teams that treat it like a demo layer get confident errors at scale.
AINinza is powered by Aeologic Technologies. If you want a production RAG architecture review with hard metrics and a rollout plan, talk to us: https://aeologic.com/
