1. Why most RAG demos don't survive production
The path from demo to production is mostly engineering discipline. Below is what we've learned shipping RAG to 30+ production deployments — the patterns that work, the ones that don't, and the eval rigor that prevents regressions.
- Hallucinated citations — the LLM cites a doc that doesn't exist, or cites the wrong section of a real doc.
- Missing the obvious answer — the right doc IS in the corpus, but the retrieval ranks it 7th out of 10, and the LLM doesn't see it in the top-K passed in the prompt.
- Costs explode — every query routes to GPT-4 Turbo at 50K tokens of context, and your monthly bill goes from $200 to $14K.
- Latency degrades — embedding + retrieval + LLM = 4–8 second responses. Users abandon.
- Evals don't exist — when something regresses, no one knows. Six months in, the team realizes accuracy on common queries dropped 30% three sprints ago.
2. The production RAG stack (anatomy)
A production RAG system has six layers. Skipping any one of them creates a known failure mode.
| Layer | What it does | Common failure if missing |
|---|---|---|
| Ingestion + chunking | Splits source docs into retrievable passages | Bad chunks → bad retrieval, regardless of model quality |
| Embedding | Maps chunks to vector space for similarity search | Wrong embedder for domain → low retrieval recall |
| Vector store | Stores + queries embeddings (Pinecone, Qdrant, pgvector) | Wrong index type → slow queries or low recall |
| Retrieval | Returns top-K candidates per query | Naive top-K → misses obvious answers |
| Reranking | Reorders candidates with a more powerful model | Without reranking, top-1 accuracy is 30–50% lower |
| Generation | LLM produces answer with retrieved context | Without citation discipline → hallucinated sources |
3. Chunking strategy: more important than the embedder
The biggest lever for retrieval quality isn't the embedding model — it's how you chunked the documents. Three patterns we use, in order of preference:
- Semantic chunking (preferred for technical content). Use a small LLM or an embedding model to find natural section breaks. Chunks are 200–800 tokens, ending at sentence/paragraph boundaries. Each chunk is self-contained and readable in isolation. Tools: LlamaIndex's SemanticSplitterNodeParser, or LangChain's RecursiveCharacterTextSplitter with proper separators.
- Sliding window with overlap (default for unstructured prose). 512-token chunks with 64-token overlap. Overlap prevents losing context at boundaries. Works for blog posts, articles, long-form documentation.
- Structural chunking (best for structured docs). Use the document's own structure (markdown headers, HTML semantic tags, code blocks) as chunk boundaries. Each H2 section becomes a chunk. For code documentation, each function/class becomes a chunk.
```python
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-large")
splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model,
)

# `documents` is a list of LlamaIndex Document objects loaded elsewhere
nodes = splitter.get_nodes_from_documents(documents)

# Each node has:
# - text: the chunk content
# - metadata: source doc, position, section heading
# - relationships: prev/next chunk pointers (for context expansion)
```

Always store source URL, section heading, and timestamp as chunk metadata. At generation time, pass this metadata to the LLM along with the chunk. Citations become more accurate because the LLM has the source identifier inline with the content.
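The sliding-window pattern from the list earlier in this section (512-token chunks, 64-token overlap) can be sketched in a few lines. This is a minimal illustration, not our production chunker: the function name `sliding_window_chunks` is hypothetical, and it operates on a pre-tokenized list, so the tokenizer you pair it with is up to you.

```python
def sliding_window_chunks(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Synthetic tokens stand in for real tokenizer output (an assumption for the demo).
tokens = [f"t{i}" for i in range(1000)]
chunks = sliding_window_chunks(tokens)
print([len(c) for c in chunks])
```

Each chunk starts 448 tokens after the previous one, so the last 64 tokens of one chunk are repeated at the start of the next. That repetition is what keeps a sentence that straddles a boundary retrievable from at least one chunk.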
4. Embedding model selection (in 2026)
We default to text-embedding-3-small for general content, voyage-3-large for technical/legal/medical, and self-hosted (e5-large-v2) for cost-sensitive deployments where the corpus is small.
Tip: never compare embedding models on benchmark scores alone. Run a real eval set from YOUR domain. We've seen voyage-3 outperform OpenAI by 12% on legal docs and underperform by 8% on consumer reviews. Domain matters.
| Model | Dimensions | Cost/1M tokens | When to use |
|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 (truncatable to 256) | $0.02 | Default for general text. Cheap, fast, good enough. |
| text-embedding-3-large (OpenAI) | 3072 (truncatable to 256) | $0.13 | When recall matters more than cost. Higher accuracy on niche domains. |
| voyage-3-large | 1024 default (256, 512, or 2048 selectable) | $0.18 | Best for technical domains (code, legal, medical). Outperforms OpenAI in our benchmarks. |
| Cohere embed-v3 | 1024 | $0.10 | Strong multilingual. Compress mode for cost reduction. |
| sentence-transformers (self-hosted) | 384–768 | Free | Self-hosted; quality is below SOTA but adequate for many use cases. |
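The tip about evaluating embedders on your own domain can be made concrete with a small recall@k harness. The sketch below is an assumption-laden illustration: `recall_at_k`, `toy_embed`, and the tiny corpus are hypothetical, and `toy_embed` is a bag-of-words stand-in for a real embedding call. To compare real models, pass a function that calls each provider and run the same eval set through both.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(embed, eval_set, corpus, k=5):
    """Fraction of queries whose gold chunk id is in the top-k by cosine similarity."""
    chunk_vecs = {cid: embed(text) for cid, text in corpus.items()}
    hits = 0
    for query, gold_id in eval_set:
        q = embed(query)
        ranked = sorted(chunk_vecs, key=lambda cid: cosine(q, chunk_vecs[cid]), reverse=True)
        hits += gold_id in ranked[:k]
    return hits / len(eval_set)


# Toy embedder: word counts over a fixed vocabulary (stand-in for a real model).
VOCAB = ["refund", "policy", "shipping", "time", "password", "reset"]


def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


corpus = {
    "c1": "refund policy details",
    "c2": "shipping time estimates",
    "c3": "password reset steps",
}
eval_set = [
    ("how do I reset my password", "c3"),
    ("what is the refund policy", "c1"),
]
print(recall_at_k(toy_embed, eval_set, corpus, k=1))
```

Running the same `eval_set` through two candidate embedders gives a like-for-like number on your own queries, which is the comparison the benchmark tables cannot give you.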
5. Reranking — the cheap accuracy lever
Vector retrieval gives you semantically similar chunks. It does NOT give you the most relevant ones for a specific query. The two are correlated but not identical. Reranking is the step that closes the gap.
The pattern: retrieve top-50 candidates with vector search (fast), then rerank with a more expensive model to pick the top-5 (accurate). The reranker uses a cross-encoder that scores each (query, candidate) pair directly — slower per pair, but you only run it on 50 candidates, not the full corpus.
- Cohere Rerank 3.5: $1 per 1K queries. Works on 1024-token candidates. Solid quality across domains.
- voyage-rerank-2: similar pricing. Strong on technical content.
- Self-hosted: BAAI/bge-reranker-v2-m3 is competitive and free if you can run a GPU.
- LLM reranker via OpenAI: use a small LLM (GPT-4o-mini) with a structured prompt to score each (query, candidate) pair. Higher latency but full control.
```python
import os

import cohere
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])

# `vector_db` is your vector store client, assumed to expose
# search(embedding, top_k) and return objects with a .text attribute.


def retrieve_with_rerank(query: str, k: int = 5):
    # Stage 1: vector search, top-50 (fast, cheap)
    query_emb = client.embeddings.create(
        input=query,
        model="text-embedding-3-small",
    ).data[0].embedding
    candidates = vector_db.search(query_emb, top_k=50)  # ~50ms

    # Stage 2: rerank top-50 to top-K (slower, more accurate)
    rerank_response = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=[c.text for c in candidates],
        top_n=k,
    )

    # Map back to original candidates with their metadata
    return [candidates[r.index] for r in rerank_response.results]


# Real-world impact:
# Without rerank, top-1 retrieval accuracy: ~62%
# With rerank, top-1 retrieval accuracy: ~89%
# Latency penalty: ~200ms.
# Cost: ~$0.001 per query.
```

6. Citation discipline (and why it matters)
An LLM that cites is an LLM you can trust. An LLM that doesn't cite is an LLM that hallucinates by default. The pattern: structure the prompt so the LLM MUST cite sources by ID, then validate the citations match real chunks before showing the answer.
```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    metadata: dict  # expects "url" and "section" keys


def build_prompt(query: str, chunks: list[Chunk]) -> str:
    # Number each source so the model can cite it as [N]
    sources = "\n".join([
        f"[{i+1}] {c.text}\nSource: {c.metadata['url']}#{c.metadata['section']}"
        for i, c in enumerate(chunks)
    ])
    return f"""You are answering a user question using ONLY the sources below.
Each claim in your answer MUST be followed by a citation in the format [N]
where N is the source number. If the sources don't contain the answer,
say "I don't have enough information"; do not invent.

Sources:
{sources}

User question: {query}

Answer (cite every factual claim):"""


# After generation, validate:
def validate_citations(answer: str, chunks: list[Chunk]) -> bool:
    cited = set(re.findall(r"\[(\d+)\]", answer))
    valid = set(str(i+1) for i in range(len(chunks)))
    if not cited.issubset(valid):
        return False  # cited a source that doesn't exist
    if not cited:
        return False  # cited nothing, likely hallucinating
    return True
```

Reranking gives you the right chunks. Citation discipline forces the LLM to use them. Together they take typical RAG citation accuracy from ~70% to ~94% in our deployments.
7. Eval harness — the discipline that prevents regressions
Evals are not optional. They are the single most important engineering investment in a production RAG system. Without them, every prompt change is a roll of the dice.
Build a golden set: 100–500 (query, expected_answer, source_passages) tuples covering the queries you care about. Each example includes the query a user might ask, the answer you'd accept, and the source passages that should be retrieved.
Run the golden set after every prompt or model change. Score on three dimensions: retrieval recall (did we retrieve the right passages?), answer quality (does the answer match the expected one?), and citation accuracy (are cited sources actually in the corpus?).
| Eval dimension | Metric | Target |
|---|---|---|
| Retrieval recall@5 | % of queries where the gold passage is in top-5 | > 90% |
| Answer correctness | LLM-as-judge score (0-100) vs gold answer | > 85 |
| Citation accuracy | % of citations that point to real, relevant chunks | > 95% |
| Hallucination rate | % of answers with claims not in retrieved chunks | < 3% |
| Latency p95 | End-to-end retrieve + generate | < 4s |
| Cost per query | Average $ per query in production | Track + alert on changes |
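The targets in the table above can be encoded as a regression gate that runs in CI after every prompt or model change. This is a minimal sketch under stated assumptions: the metric key names and the `check_targets` function are hypothetical, percentages are expressed as fractions, and the metrics dict is whatever your eval run produces. Cost per query is left out because the table tracks it rather than thresholding it.

```python
# Thresholds copied from the target table; key names are illustrative assumptions.
TARGETS = {
    "retrieval_recall_at_5": (">", 0.90),
    "answer_correctness": (">", 85),
    "citation_accuracy": (">", 0.95),
    "hallucination_rate": ("<", 0.03),
    "latency_p95_s": ("<", 4.0),
}


def check_targets(metrics, targets=TARGETS):
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, (op, threshold) in targets.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif op == ">" and not value > threshold:
            failures.append(f"{name}: {value} is not > {threshold}")
        elif op == "<" and not value < threshold:
            failures.append(f"{name}: {value} is not < {threshold}")
    return failures


# Example run with made-up numbers, purely to show the output shape.
metrics = {
    "retrieval_recall_at_5": 0.93,
    "answer_correctness": 88,
    "citation_accuracy": 0.97,
    "hallucination_rate": 0.02,
    "latency_p95_s": 3.1,
}
print(check_targets(metrics))
print(check_targets(dict(metrics, retrieval_recall_at_5=0.81)))
```

Failing the build on a non-empty list is what turns "accuracy dropped three sprints ago" into a red check on the pull request that caused it.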
```python
# `llm_judge_score` and `extract_citations` are helpers defined elsewhere:
# the first returns a 0-100 score, the second returns citation objects
# that expose refers_to(chunks).


def evaluate_rag(rag_system, golden_set):
    results = {
        "retrieval_recall": [],
        "answer_correctness": [],
        "citation_accuracy": [],
    }
    for example in golden_set:
        chunks = rag_system.retrieve(example["query"])
        answer = rag_system.generate(example["query"], chunks)

        # Retrieval recall: share of gold chunks found in the top 5
        gold_ids = set(example["expected_chunk_ids"])
        retrieved_ids = set(c.id for c in chunks[:5])
        recall = len(gold_ids & retrieved_ids) / max(1, len(gold_ids))
        results["retrieval_recall"].append(recall)

        # Answer correctness (via LLM-as-judge)
        score = llm_judge_score(
            query=example["query"],
            expected=example["expected_answer"],
            actual=answer,
        )
        results["answer_correctness"].append(score)

        # Citation accuracy: share of citations that point at retrieved chunks
        citations = extract_citations(answer)
        valid = sum(1 for c in citations if c.refers_to(chunks))
        results["citation_accuracy"].append(
            valid / max(1, len(citations))
        )

    # Average each metric over the golden set
    return {k: sum(v) / len(v) for k, v in results.items()}
```

8. Cost-aware model routing: why most queries don't need GPT-4
The router is a small LLM call on a Tier 1 model (tiers are listed below) that classifies each query into a complexity bucket and routes it accordingly. It costs <$0.001 per query and saves 5–10× on the actual generation.
Real numbers: across our 30+ deployments, the median cost reduction from routing is 70%. Some workloads with heavy long-tail simple queries see 85% reduction.
- Tier 1 (cheap, fast): GPT-4o-mini, Claude Haiku, Gemini Flash. ~$0.15-0.25/M input tokens. Use for: simple factual queries, definitions, lookups.
- Tier 2 (balanced): GPT-4o, Claude Sonnet, Gemini Pro. ~$2.5-3/M input tokens. Use for: multi-step reasoning, comparisons, summarization.
- Tier 3 (expensive, high-quality): Claude Opus, GPT-5, Gemini Ultra. ~$15+/M input tokens. Use for: complex reasoning, high-stakes answers, code generation.
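The routing logic in this section can be sketched as a classifier plus a lookup table. Everything named here is an assumption for illustration: `classify_complexity` uses keyword heuristics as a stand-in for the small-LLM classifier described above, and the model labels in `TIER_MODELS` are placeholders, not exact API model identifiers.

```python
# Placeholder labels for each tier; swap in the model ids you actually deploy.
TIER_MODELS = {1: "gpt-4o-mini", 2: "gpt-4o", 3: "claude-opus"}


def classify_complexity(query):
    """Keyword stand-in for the Tier 1 LLM classifier (an assumption, not the real router)."""
    q = query.lower()
    if any(w in q for w in ("implement", "debug", "write code", "prove")):
        return 3
    if any(w in q for w in ("compare", "summarize", "why", "versus")):
        return 2
    return 1


def route(query, classify=classify_complexity):
    """Pick a model for the query; unknown buckets fall back to the balanced tier."""
    tier = classify(query)
    return TIER_MODELS.get(tier, TIER_MODELS[2])


for q in (
    "What is RRF?",
    "Compare pgvector and Qdrant for 10M vectors",
    "Implement a retry wrapper in Go",
):
    print(q, "->", route(q))
```

Keeping `classify` injectable means the heuristic can be replaced by the real LLM call without touching the routing table, and the fallback to Tier 2 keeps a misbehaving classifier from sending everything to the most expensive model.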
9. Common failure modes + fixes
| Symptom | Likely cause | Fix |
|---|---|---|
| Hallucinated answers | No citation forcing in prompt | Add citation discipline pattern |
| Right doc not retrieved | Wrong chunk size or no rerank | Tune chunking + add reranker |
| Slow responses | Top-K too high, model too slow | Cap top-K at 5, route to cheaper model |
| Cost spikes | GPT-4 routing all queries | Add complexity classifier + Tier 1 fallback |
| Stale answers | No retrieval freshness signal | Index with timestamps, filter old chunks |
| Multi-language failures | Embedder is English-only | Switch to Cohere multilingual or text-embedding-3-large |
| Long-tail miss | Vector recall is bad on rare entities | Add BM25 hybrid search + RRF fusion |
Pure vector search misses rare entity names (product SKUs, error codes, jargon). Pure BM25 (keyword) misses semantic queries. The fix is hybrid: run both, fuse with Reciprocal Rank Fusion (RRF). Increases recall by 8–15% in our deployments at near-zero added latency.
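Reciprocal Rank Fusion itself is only a few lines: each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The sketch below is a minimal illustration; the function name and the example document ids are hypothetical, and k=60 is the constant conventionally used for RRF rather than a value tuned for our deployments.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Illustrative result lists from the two retrievers (made-up ids).
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_d", "doc_a", "doc_b"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Because RRF only looks at ranks, it needs no score normalization between BM25 and cosine similarity, which is why it adds so little latency. Documents found by both retrievers (here `doc_a` and `doc_b`) rise above documents found by only one.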