1. Why most RAG demos don't survive production

The path from demo to production is mostly engineering discipline. Below is what works when shipping RAG to production — the patterns that hold up, the ones that don't, and the eval rigor that prevents regressions.

Demos tend to break in the same five ways:

Hallucinated citations: the LLM cites a doc that doesn't exist, or cites the wrong section of a real doc.
Missing the obvious answer: the right doc is in the corpus, but retrieval ranks it 7th out of 10, so it never makes the top-K passed in the prompt.
Costs explode: every query routes to GPT-4 Turbo with 50K tokens of context, and the monthly bill goes from $200 to $14K.
Latency degrades: embedding + retrieval + LLM adds up to 4–8 second responses, and users abandon.
Evals don't exist: when something regresses, no one knows. Six months in, the team realizes accuracy on common queries dropped 30% three sprints ago.

2. The production RAG stack (anatomy)

A production RAG system has six layers. Skipping any one of them creates a known failure mode.

Layer	What it does	Common failure if missing
Ingestion + chunking	Splits source docs into retrievable passages	Bad chunks → bad retrieval, regardless of model quality
Embedding	Maps chunks to vector space for similarity search	Wrong embedder for domain → low retrieval recall
Vector store	Stores + queries embeddings (Pinecone, Qdrant, pgvector)	Wrong index type → slow queries or low recall
Retrieval	Returns top-K candidates per query	Naive top-K → misses obvious answers
Reranking	Reorders candidates with a more powerful model	Without reranking, top-1 accuracy is 30–50% lower
Generation	LLM produces answer with retrieved context	Without citation discipline → hallucinated sources

3. Chunking strategy: more important than the embedder

The biggest lever for retrieval quality isn't the embedding model — it's how you chunked the documents. Three patterns worth knowing, in order of preference:

1. Semantic chunking (preferred for technical content). Use an embedding model (or a small LLM) to find natural topic breaks. Chunks are 200–800 tokens, ending at sentence/paragraph boundaries. Each chunk is self-contained and readable in isolation. Tools: LlamaIndex's SemanticSplitterNodeParser or LangChain's SemanticChunker; RecursiveCharacterTextSplitter with sensible separators is a cheaper, non-semantic fallback.
2. Sliding window with overlap (default for unstructured prose). 512-token chunks with 64-token overlap. Overlap prevents losing context at boundaries. Works for blog posts, articles, long-form documentation.
3. Structural chunking (best for structured docs). Use the document's own structure (markdown headers, HTML semantic tags, code blocks) as chunk boundaries. Each H2 section becomes a chunk. For code documentation, each function/class becomes a chunk.
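
The second and third patterns are simple enough to sketch without a library. The version below is a minimal illustration, not a production chunker: it assumes tokens are whitespace-separated words (swap in your model's tokenizer for real token counts), and the function names sliding_window_chunks and split_by_h2 are made up for this example.

```python
import re


def sliding_window_chunks(tokens, size=512, overlap=64):
    """Return overlapping windows over a list of tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


def split_by_h2(markdown_text):
    """Structural chunking: one chunk per '## ' section of a markdown doc."""
    # Split at the start of every line that begins with '## ' (zero-width match,
    # so the heading stays attached to its section).
    sections = re.split(r"(?m)^(?=## )", markdown_text)
    return [s.strip() for s in sections if s.strip()]


# Whitespace split is an assumption standing in for a real tokenizer.
tokens = ("word " * 1000).split()
windows = sliding_window_chunks(tokens, size=512, overlap=64)

doc = "Intro text\n## Install\npip install ...\n## Usage\nCall the API.\n"
sections = split_by_h2(doc)
```

One known limitation of the header splitter: a line starting with "## " inside a code block is treated as a heading, so real documents usually need a markdown parser instead of a regex.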

Semantic chunking with LlamaIndex

python

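from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

# Load source documents; "./docs" is a placeholder path for your corpus.
documents = SimpleDirectoryReader("./docs").load_data()

embed_model = OpenAIEmbedding(model="text-embedding-3-large")

splitter = SemanticSplitterNodeParser(
    buffer_size=1,  # sentences grouped together when comparing similarity
    breakpoint_percentile_threshold=95,  # split where dissimilarity exceeds this percentile
    embed_model=embed_model,
)

nodes = splitter.get_nodes_from_documents(documents)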

# Each node has:
#   - text: the chunk content
#   - metadata: source doc, position, section heading
#   - relationships: prev/next chunk pointers (for context expansion)

The metadata trick

Always store source URL, section heading, and timestamp as chunk metadata. At generation time, pass this metadata to the LLM along with the chunk. Citations become accurate because the LLM has the source identifier inline with the content.
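
As a rough sketch of what that looks like at ingestion and prompt time: the Chunk dataclass, make_chunk, and render_for_prompt below are hypothetical names for this example, and the URL is a placeholder.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def make_chunk(text, url, section, fetched_at=None):
    """Attach source URL, section heading, and timestamp when the chunk is created."""
    fetched_at = fetched_at or datetime.now(timezone.utc)
    return Chunk(
        text=text,
        metadata={
            "url": url,
            "section": section,
            "timestamp": fetched_at.isoformat(),
        },
    )


def render_for_prompt(chunk, n):
    """Put the source identifier inline with the content the LLM sees."""
    m = chunk.metadata
    return f"[{n}] {chunk.text}\nSource: {m['url']}#{m['section']} (as of {m['timestamp']})"


chunk = make_chunk(
    "Password resets expire after 15 minutes.",
    url="https://example.com/docs/account",  # placeholder URL
    section="password-reset",
)
print(render_for_prompt(chunk, 1))
```

Storing the timestamp as an ISO 8601 string keeps it sortable, which also helps if you later filter out stale chunks at retrieval time.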

4. Embedding model selection (in 2026)

A sensible default: text-embedding-3-small for general content, voyage-3-large for technical/legal/medical, and self-hosted (e5-large-v2) for cost-sensitive deployments where the corpus is small.

Tip: never compare embedding models on benchmark scores alone. Results can swing widely by domain: one model may beat another by ~10% on legal docs yet lose on consumer reviews. Run an eval set built from your own data before committing.
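
A domain eval for embedders can be small. The sketch below computes recall@k for any embedding function you pass in, so the same labeled queries can be run against each candidate model. Everything here is illustrative: recall_at_k and toy_embed are made-up names, and toy_embed is a bag-of-words stand-in for a real embedding API call.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_at_k(embed, corpus, labeled_queries, k=5):
    """corpus: {chunk_id: text}; labeled_queries: [(query, gold_chunk_id), ...]."""
    doc_vecs = {cid: embed(text) for cid, text in corpus.items()}
    hits = 0
    for query, gold_id in labeled_queries:
        q_vec = embed(query)
        ranked = sorted(doc_vecs, key=lambda cid: cosine(q_vec, doc_vecs[cid]), reverse=True)
        hits += gold_id in ranked[:k]
    return hits / len(labeled_queries)


# Toy embedder so the sketch runs offline; replace with a call to each model under test.
VOCAB = ["refund", "invoice", "password", "reset"]


def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


corpus = {"billing": "refund invoice policy", "account": "password reset steps"}
labeled = [("how to reset password", "account"), ("refund for invoice", "billing")]
score = recall_at_k(toy_embed, corpus, labeled, k=1)
```

Run it once per candidate embedder on the same labeled queries and compare the scores; a few hundred labeled queries from real traffic is usually more informative than a public leaderboard.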

Model	Dimensions	Cost/1M tokens	When to use
text-embedding-3-small (OpenAI)	1536 (truncatable to 256)	$0.02	Default for general text. Cheap, fast, good enough.
text-embedding-3-large (OpenAI)	3072 (truncatable to 256)	$0.13	When recall matters more than cost. Higher accuracy on niche domains.
voyage-3-large	1024 default (256, 512, or 2048 selectable)	$0.18	Strong for technical domains (code, legal, medical); often outperforms OpenAI on these in published benchmarks.
Cohere embed-v3	1024	$0.10	Strong multilingual. Supports compressed (int8/binary) embeddings to cut storage cost.
sentence-transformers (self-hosted)	384–1024	Free (you pay for compute)	Self-hosted (e.g. e5-large-v2); quality is below SOTA but adequate for many use cases.
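
On "truncatable to 256": with the OpenAI text-embedding-3 models you can request shorter vectors through the dimensions parameter of the embeddings API. If you instead truncate vectors you have already stored, re-normalize them afterwards so cosine and dot-product scores stay comparable. A minimal sketch (truncate_embedding is a made-up helper name):

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` components and rescale to unit length."""
    if dims > len(vec):
        raise ValueError("dims cannot exceed the original vector length")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


short = truncate_embedding([3.0, 4.0, 12.0], dims=2)
```

Shorter vectors cut storage and speed up search, at some cost in recall; measure that cost on your own eval set before committing to a size.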

5. Reranking — the cheap accuracy lever

Vector retrieval gives you semantically similar chunks. It does NOT give you the most relevant ones for a specific query. The two are correlated but not identical. Reranking is the step that closes the gap.

The pattern: retrieve top-50 candidates with vector search (fast), then rerank with a more expensive model to pick the top-5 (accurate). The reranker uses a cross-encoder that scores each (query, candidate) pair directly — slower per pair, but you only run it on 50 candidates, not the full corpus.

Cohere Rerank 3.5: $1 per 1K queries. Works on 1024-token candidates. Solid quality across domains.
voyage-rerank-2: similar pricing. Strong on technical content.
Self-hosted: BAAI/bge-reranker-v2-m3 is competitive and free if you can run a GPU.
LLM-as-reranker via OpenAI: prompt a small LLM (GPT-4o-mini) to score each (query, candidate) pair with a structured prompt. Higher latency but full control.

Two-stage retrieval: vector + rerank

python

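import os

import cohere
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])

# `vector_db` stands in for your vector store client (Pinecone, Qdrant, pgvector, ...).
# It is assumed to expose search(embedding, top_k) returning objects with a .text attribute.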

def retrieve_with_rerank(query: str, k: int = 5):
    # Stage 1: vector search, top-50 (fast, cheap)
    query_emb = client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    ).data[0].embedding

    candidates = vector_db.search(query_emb, top_k=50)  # ~50ms

    # Stage 2: rerank top-50 to top-K (slower, more accurate)
    rerank_response = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=[c.text for c in candidates],
        top_n=k,
    )

    # Map back to original candidates with their metadata
    return [candidates[r.index] for r in rerank_response.results]

# Real-world impact:
#   Without rerank, top-1 retrieval accuracy: ~62%
#   With rerank, top-1 retrieval accuracy: ~89%
#   Latency penalty: ~200ms.
#   Cost: ~$0.001 per query.

6. Citation discipline (and why it matters)

An LLM that cites is an LLM you can trust. An LLM that doesn't cite is an LLM that hallucinates by default. The pattern: structure the prompt so the LLM MUST cite sources by ID, then validate the citations match real chunks before showing the answer.

Forced-citation prompt pattern

python

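import re
from dataclasses import dataclass, field

@dataclass
class Chunk:
    # Minimal stand-in for your chunk type; metadata must include "url" and "section".
    text: str
    metadata: dict = field(default_factory=dict)

def build_prompt(query: str, chunks: list[Chunk]) -> str: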
    sources = "\n".join([
        f"[{i+1}] {c.text}\nSource: {c.metadata['url']}#{c.metadata['section']}"
        for i, c in enumerate(chunks)
    ])

    return f"""You are answering a user question using ONLY the sources below.
Each claim in your answer MUST be followed by a citation in the format [N]
where N is the source number. If the sources don't contain the answer,
say "I don't have enough information" — do not invent.

Sources:
{sources}

User question: {query}

Answer (cite every factual claim):"""

# After generation, validate:
def validate_citations(answer: str, chunks: list[Chunk]) -> bool:
    cited = set(re.findall(r"\[(\d+)\]", answer))
    valid = set(str(i+1) for i in range(len(chunks)))
    if not cited.issubset(valid):
        return False  # cited a source that doesn't exist
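    if not cited:
        # No citations at all: treat as unsupported, unless the answer is the
        # explicit "I don't have enough information" refusal the prompt allows.
        return "i don't have enough information" in answer.lower()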
    return True
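
Validation only helps if something happens when it fails. One option, sketched below under assumptions, is to retry once and then fall back to the refusal message rather than show an unverified answer. The names citations_ok and answer_with_guard are made up, and generate is whatever function calls your LLM.

```python
import re

REFUSAL = "I don't have enough information"


def citations_ok(answer, n_sources):
    """True if every [N] points at a real source, or the answer is the allowed refusal."""
    if REFUSAL.lower() in answer.lower():
        return True
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return bool(cited) and all(1 <= n <= n_sources for n in cited)


def answer_with_guard(generate, prompt, n_sources, max_attempts=2):
    """Call the LLM, re-asking if the citations do not validate."""
    for _ in range(max_attempts):
        answer = generate(prompt)
        if citations_ok(answer, n_sources):
            return answer
    return REFUSAL + "."  # never show an answer that failed validation


# Fake LLM for illustration: first reply cites a source that does not exist.
replies = iter(["Paris is the capital [7].", "Paris is the capital [1]."])
result = answer_with_guard(lambda prompt: next(replies), "prompt text", n_sources=3)
```

Note that this only checks that cited IDs exist. Checking that the cited chunk actually supports the claim needs a separate step, such as an LLM judge or an entailment model.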

The reranking + citation combo

Rerank gives you the right chunks. Citation discipline forces the LLM to use them. Together they can take a typical RAG system's citation accuracy from roughly 70% to the low 90s.

7. Eval harness — the discipline that prevents regressions

Evals are not optional. They are the single most important engineering investment in a production RAG system. Without them, every prompt change is a roll of the dice.

Build a golden set: 100–500 (query, expected_answer, source_passages) tuples covering the queries you care about. Each example includes the query a user might ask, the answer you'd accept, and the source passages that should be retrieved.
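
One workable storage format is JSON Lines, one example per line, checked into the repo next to the prompts. The sketch below is an assumption about layout, not a standard: the two examples are invented, and load_golden_set is a made-up helper that fails fast on malformed entries.

```python
import json

REQUIRED_FIELDS = {"query", "expected_answer", "expected_chunk_ids"}

# Two invented examples in JSON Lines form; in practice this lives in a file.
GOLDEN_JSONL = """\
{"query": "How do I reset my password?", "expected_answer": "Use the reset link on the sign-in page.", "expected_chunk_ids": ["account-help#reset"]}
{"query": "What is the refund window?", "expected_answer": "30 days from purchase.", "expected_chunk_ids": ["billing-faq#refunds", "terms#returns"]}
"""


def load_golden_set(lines):
    """Parse and validate golden examples, one JSON object per line."""
    examples = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # allow blank lines between examples
        example = json.loads(line)
        missing = REQUIRED_FIELDS - set(example)
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        if not example["expected_chunk_ids"]:
            raise ValueError(f"line {line_no}: needs at least one gold chunk id")
        examples.append(example)
    return examples


golden_set = load_golden_set(GOLDEN_JSONL.splitlines())
```

Gold chunk IDs only stay valid while chunking is stable, so plan to regenerate them (or match on source passage text instead) whenever the chunking strategy changes.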

Run the golden set after every prompt or model change. Score on three dimensions: retrieval recall (did we retrieve the right passages?), answer quality (does the answer match the expected one?), and citation accuracy (are cited sources actually in the corpus?).

Eval dimension	Metric	Target
Retrieval recall@5	% of queries where the gold passage is in top-5	> 90%
Answer correctness	LLM-as-judge score (0-100) vs gold answer	> 85
Citation accuracy	% of citations that point to real, relevant chunks	> 95%
Hallucination rate	% of answers with claims not in retrieved chunks	< 3%
Latency p95	End-to-end retrieve + generate	< 4s
Cost per query	Average $ per query in production	Track + alert on changes

Eval harness skeleton

python

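# Assumed helpers, not defined here: llm_judge_score() returns a 0-100 score,
# extract_citations() parses citations out of an answer, and each citation
# exposes refers_to(chunks) to check that it points at a retrieved chunk.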

def evaluate_rag(rag_system, golden_set):
    results = {
        "retrieval_recall": [],
        "answer_correctness": [],
        "citation_accuracy": [],
    }

    for example in golden_set:
        chunks = rag_system.retrieve(example["query"])
        answer = rag_system.generate(example["query"], chunks)

        # Retrieval recall
        gold_ids = set(example["expected_chunk_ids"])
        retrieved_ids = set(c.id for c in chunks[:5])
        recall = len(gold_ids & retrieved_ids) / max(1, len(gold_ids))
        results["retrieval_recall"].append(recall)

        # Answer correctness (via LLM-as-judge)
        score = llm_judge_score(
            query=example["query"],
            expected=example["expected_answer"],
            actual=answer,
        )
        results["answer_correctness"].append(score)

        # Citation accuracy
        citations = extract_citations(answer)
        valid = sum(1 for c in citations if c.refers_to(chunks))
        results["citation_accuracy"].append(
            valid / max(1, len(citations))
        )

    return {k: sum(v) / len(v) for k, v in results.items()}
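
To make the harness catch regressions rather than just report numbers, compare each run against fixed targets and against the last accepted baseline. The sketch below assumes the metric names returned by the skeleton above, with recall and citation accuracy as fractions and answer correctness on a 0-100 scale. check_regressions and the 2% tolerance are illustrative choices.

```python
# Floors taken from the targets table: recall@5 > 90%, correctness > 85, citations > 95%.
TARGETS = {
    "retrieval_recall": 0.90,
    "answer_correctness": 85.0,
    "citation_accuracy": 0.95,
}


def check_regressions(current, baseline=None, targets=TARGETS, tolerance=0.02):
    """Return a list of human-readable failures; an empty list means the run passes."""
    failures = []
    for metric, floor in targets.items():
        value = current[metric]
        if value < floor:
            failures.append(f"{metric}={value:.3f} is below target {floor}")
        if baseline is not None and metric in baseline:
            # Flag relative drops larger than the tolerance, even above the floor.
            if value < baseline[metric] * (1 - tolerance):
                failures.append(
                    f"{metric} dropped from {baseline[metric]:.3f} to {value:.3f}"
                )
    return failures


baseline = {"retrieval_recall": 0.93, "answer_correctness": 88.0, "citation_accuracy": 0.97}
candidate = {"retrieval_recall": 0.80, "answer_correctness": 88.5, "citation_accuracy": 0.97}

for failure in check_regressions(candidate, baseline=baseline):
    print("FAIL:", failure)
```

Wired into CI, a non-empty failure list blocks the merge, which is what turns "accuracy dropped three sprints ago" into a failed build on the day it happens.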

8. Cost-aware model routing — why most queries don't need GPT-4

A router, itself a small Tier 1 LLM call, classifies each query into a complexity bucket and sends it to the matching tier listed below. The router costs <$0.001 per query and can save 5–10× on the actual generation.

Intelligent routing commonly cuts generation cost by well over half, often by ~70% on workloads with many simple queries. Workloads dominated by long-tail simple queries can see reductions of around 85%.

Tier 1 (cheap, fast): GPT-4o-mini, Claude Haiku, Gemini Flash. ~$0.15-0.25/M input tokens. Use for: simple factual queries, definitions, lookups.
Tier 2 (balanced): GPT-4o, Claude Sonnet, Gemini Pro. ~$2.5-3/M input tokens. Use for: multi-step reasoning, comparisons, summarization.
Tier 3 (expensive, high-quality): Claude Opus, GPT-5, Gemini Ultra. ~$15+/M input tokens. Use for: complex reasoning, high-stakes answers, code generation.
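
A sketch of the routing logic and the cost arithmetic behind it. The assumptions are loud ones: the bucket names, the 70/25/5 traffic mix, and the 4,000-token prompt size are invented for illustration; prices use the low end of each range above; and toy_classify is a word-count heuristic standing in for the small LLM classifier.

```python
# Map complexity buckets to the tiers described above.
TIER_FOR_BUCKET = {"simple": "tier1", "moderate": "tier2", "complex": "tier3"}

# Illustrative $ per 1M input tokens, low end of each tier's range.
PRICE_PER_M_INPUT = {"tier1": 0.15, "tier2": 2.50, "tier3": 15.00}


def route(query, classify):
    """Pick a tier for the query; unknown buckets fall back to the balanced tier."""
    bucket = classify(query)
    return TIER_FOR_BUCKET.get(bucket, "tier2")


def cost_per_query(traffic_mix, input_tokens):
    """Average input cost per query for a {tier: share of traffic} mix."""
    blended = sum(share * PRICE_PER_M_INPUT[tier] for tier, share in traffic_mix.items())
    return blended * input_tokens / 1_000_000


def toy_classify(query):
    # Stand-in for the Tier 1 LLM call: short questions are treated as simple.
    words = len(query.split())
    if words <= 8:
        return "simple"
    if words <= 25:
        return "moderate"
    return "complex"


mix = {"tier1": 0.70, "tier2": 0.25, "tier3": 0.05}  # hypothetical traffic split
routed = cost_per_query(mix, input_tokens=4000)
all_tier3 = cost_per_query({"tier3": 1.0}, input_tokens=4000)
print(f"routed: ${routed:.5f}/query, savings vs all Tier 3: {1 - routed / all_tier3:.0%}")
```

The savings figure depends entirely on the baseline and the mix: the same split saves far less against an all-Tier-2 baseline, so measure your own traffic distribution before quoting a number.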

9. Common failure modes + fixes

Symptom	Likely cause	Fix
Hallucinated answers	No citation forcing in prompt	Add citation discipline pattern
Right doc not retrieved	Wrong chunk size or no rerank	Tune chunking + add reranker
Slow responses	Top-K too high, model too slow	Cap top-K at 5, route to cheaper model
Cost spikes	All queries routed to GPT-4	Add complexity classifier + Tier 1 fallback
Stale answers	No retrieval freshness signal	Index with timestamps, filter old chunks
Multi-language failures	Embedder is English-only	Switch to Cohere multilingual or text-embedding-3-large
Long-tail miss	Vector recall is bad on rare entities	Add BM25 hybrid search + RRF fusion

Hybrid search (vector + BM25)

Pure vector search misses rare entity names (product SKUs, error codes, jargon). Pure BM25 (keyword) misses semantic queries. The fix is hybrid: run both, fuse with Reciprocal Rank Fusion (RRF). Typically increases recall by ~8–15% at near-zero added latency.
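
RRF itself is only a few lines: each document scores the sum of 1 / (k + rank) over every ranking it appears in, so it needs ranks only and never has to reconcile BM25 scores with cosine similarities. The sketch below uses k = 60, the constant commonly used for RRF. The function name rrf_fuse and the chunk IDs are made up.

```python
def rrf_fuse(rankings, k=60, top_n=None):
    """Fuse several ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_n] if top_n else fused


vector_hits = ["a", "b", "c"]  # IDs from vector search, best first
bm25_hits = ["c", "a", "d"]    # IDs from BM25, best first
fused = rrf_fuse([vector_hits, bm25_hits])
# fused == ["a", "c", "b", "d"]: docs found by both retrievers rise to the top
```

Feed the fused list into the reranker in place of the plain vector top-50; the two retrievers can run in parallel, so the added latency is mostly the slower of the two.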
