AI / Engineering · Mar 30, 2026 · 24 min read

Production RAG: lessons from 30+ deployments (chunking, reranking, evals, cost)

The demo works in 200 lines. Production needs eval harnesses, reranking, citation discipline, and cost-aware model routing. Here's what we've learned shipping RAG to real users.

Aman Mathur
Founder, SERP Axis

1. Why most RAG demos don't survive production

The path from demo to production is mostly engineering discipline. Below is what we've learned shipping RAG to 30+ production deployments: the patterns that work, the ones that don't, and the eval rigor that prevents regressions. Demos tend to break in production in five ways:

  • Hallucinated citations — the LLM cites a doc that doesn't exist, or cites the wrong section of a real doc.
  • Missing the obvious answer — the right doc IS in the corpus, but the retrieval ranks it 7th out of 10, and the LLM doesn't see it in the top-K passed in the prompt.
  • Costs explode — every query routes to GPT-4 Turbo at 50K tokens of context, and your monthly bill goes from $200 to $14K.
  • Latency degrades — embedding + retrieval + LLM = 4–8 second responses. Users abandon.
  • Evals don't exist — when something regresses, no one knows. Six months in, the team realizes accuracy on common queries dropped 30% three sprints ago.

2. The production RAG stack (anatomy)

A production RAG system has six layers. Skipping any one of them creates a known failure mode.

| Layer | What it does | Common failure if missing |
| --- | --- | --- |
| Ingestion + chunking | Splits source docs into retrievable passages | Bad chunks → bad retrieval, regardless of model quality |
| Embedding | Maps chunks to vector space for similarity search | Wrong embedder for domain → low retrieval recall |
| Vector store | Stores + queries embeddings (Pinecone, Qdrant, pgvector) | Wrong index type → slow queries or low recall |
| Retrieval | Returns top-K candidates per query | Naive top-K → misses obvious answers |
| Reranking | Reorders candidates with a more powerful model | Without reranking, top-1 accuracy is 30–50% lower |
| Generation | LLM produces answer with retrieved context | Without citation discipline → hallucinated sources |

3. Chunking strategy: more important than the embedder

The biggest lever for retrieval quality isn't the embedding model — it's how you chunked the documents. Three patterns we use, in order of preference:

  1. Semantic chunking (preferred for technical content). Use an embedding model or a small LLM to find natural section breaks. Chunks are 200–800 tokens, ending at sentence or paragraph boundaries. Each chunk is self-contained and readable in isolation. Tools: LlamaIndex's SemanticSplitterNodeParser, or LangChain's RecursiveCharacterTextSplitter with separators that match your document structure.
  2. Sliding window with overlap (default for unstructured prose). 512-token chunks with 64-token overlap. Overlap prevents losing context at boundaries. Works for blog posts, articles, long-form documentation.
  3. Structural chunking (best for structured docs). Use the document's own structure (markdown headers, HTML semantic tags, code blocks) as chunk boundaries. Each H2 section becomes a chunk. For code documentation, each function/class becomes a chunk.
Semantic chunking with LlamaIndex
python
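from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

# Load source documents. "./docs" is a placeholder path; point it at your corpus.
documents = SimpleDirectoryReader("./docs").load_data()

# The embedding model measures similarity between adjacent groups of sentences.
embed_model = OpenAIEmbedding(model="text-embedding-3-large")

splitter = SemanticSplitterNodeParser(
    buffer_size=1,                       # sentences grouped per similarity comparison
    breakpoint_percentile_threshold=95,  # split where dissimilarity exceeds the 95th percentile
    embed_model=embed_model,
)

nodes = splitter.get_nodes_from_documents(documents)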

# Each node has:
#   - text: the chunk content
#   - metadata: source doc, position, section heading
#   - relationships: prev/next chunk pointers (for context expansion)
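The sliding-window pattern from the list above is simple enough to sketch without a library. This is a minimal illustration, not the implementation we ship: the function name `sliding_window_chunks` is hypothetical, and the input is assumed to be a list of tokens already produced by whatever tokenizer matches your embedder (the usage lines fake that with numbered strings).

```python
def sliding_window_chunks(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Stand-in for real tokenizer output (an assumption for the example).
tokens = [str(i) for i in range(1000)]
chunks = sliding_window_chunks(tokens)
```

With the defaults, consecutive chunks repeat the last 64 tokens of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.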
The metadata trick

Always store source URL, section heading, and timestamp as chunk metadata. At generation time, pass this metadata to the LLM along with the chunk. Citations become accurate because the LLM has the source identifier inline with the content.
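Structural chunking makes this metadata nearly free, because the header that starts a chunk is also its section label. The sketch below assumes markdown input and splits at H2 headers only. `Chunk`, `chunk_markdown_by_h2`, `format_source`, and the example.com URL are hypothetical names for illustration, and the splitter does not handle `##` lines inside code fences.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_markdown_by_h2(markdown, url, timestamp):
    """Split a markdown document at H2 headers; one chunk per non-empty section."""
    sections = []        # list of (heading, lines) pairs
    current = ("", [])   # text before the first H2 gets an empty heading
    for line in markdown.splitlines():
        match = re.match(r"^##\s+(.*)", line)  # H2 only; "###" does not match
        if match:
            sections.append(current)
            current = (match.group(1).strip(), [])
        else:
            current[1].append(line)
    sections.append(current)

    chunks = []
    for heading, lines in sections:
        body = "\n".join(lines).strip()
        if body:
            meta = {"url": url, "section": heading, "timestamp": timestamp}
            chunks.append(Chunk(text=body, metadata=meta))
    return chunks


def format_source(index, chunk):
    """Render a chunk with its source identifier inline, ready for the prompt."""
    return f"[{index}] {chunk.text}\nSource: {chunk.metadata['url']}#{chunk.metadata['section']}"


doc = "Intro text\n\n## Install\npip install x\n\n### Sub\nmore\n\n## Usage\nrun it\n"
chunks = chunk_markdown_by_h2(doc, url="https://example.com/docs", timestamp="2026-03-30")
```

H3 subsections stay inside their parent H2 chunk here. If your sections run long, splitting one level deeper or applying a token cap per section is a reasonable variation.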

4. Embedding model selection (in 2026)

We default to text-embedding-3-small for general content, voyage-3-large for technical/legal/medical, and self-hosted (e5-large-v2) for cost-sensitive deployments where the corpus is small.

Tip: never compare embedding models on benchmark scores alone. Run a real eval set from YOUR domain. We've seen voyage-3 outperform OpenAI by 12% on legal docs and underperform by 8% on consumer reviews. Domain matters.

| Model | Dimensions | Cost/1M tokens | When to use |
| --- | --- | --- | --- |
| text-embedding-3-small (OpenAI) | 1536 (truncatable to 256) | $0.02 | Default for general text. Cheap, fast, good enough. |
| text-embedding-3-large (OpenAI) | 3072 (truncatable to 256) | $0.13 | When recall matters more than cost. Higher accuracy on niche domains. |
| voyage-3-large | 2048 | $0.18 | Best for technical domains (code, legal, medical). Outperforms OpenAI in our benchmarks. |
| Cohere embed-v3 | 1024 | $0.10 | Strong multilingual. Compress mode for cost reduction. |
| sentence-transformers (self-hosted) | 384–768 | Free | Self-hosted; quality is below SOTA but adequate for many use cases. |
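Comparing embedders on your own domain, as the tip above recommends, needs only a small harness. The sketch below is one way to do it under stated assumptions: `recall_at_k` and `toy_embed` are hypothetical names, the corpus and queries are made up for illustration, and `toy_embed` is a bag-of-words stand-in where you would wrap a real provider call.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(embed, corpus, eval_set, k=5):
    """Fraction of queries whose gold chunk id is in the top-k by cosine similarity.

    embed:    callable str -> list[float] (wrap your provider's API here)
    corpus:   dict of chunk_id -> chunk text
    eval_set: list of (query, gold_chunk_id) pairs from your own domain
    """
    doc_vecs = {cid: embed(text) for cid, text in corpus.items()}
    hits = 0
    for query, gold_id in eval_set:
        q = embed(query)
        ranked = sorted(doc_vecs, key=lambda cid: cosine(q, doc_vecs[cid]), reverse=True)
        if gold_id in ranked[:k]:
            hits += 1
    return hits / len(eval_set) if eval_set else 0.0


# Toy stand-in embedder: word counts over a fixed vocabulary (illustration only).
VOCAB = ["refund", "policy", "shipping", "times", "warranty", "claims"]


def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


corpus = {"c1": "refund policy", "c2": "shipping times", "c3": "warranty claims"}
eval_set = [("what is the refund policy", "c1"), ("shipping times to europe", "c2")]
score = recall_at_k(toy_embed, corpus, eval_set, k=1)
```

Run the same `corpus` and `eval_set` through each candidate embedder and compare the scores. A brute-force scan like this is fine for a few thousand chunks; beyond that, query your actual vector store instead.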

5. Reranking — the cheap accuracy lever

Vector retrieval gives you semantically similar chunks. It does NOT give you the most relevant ones for a specific query. The two are correlated but not identical. Reranking is the step that closes the gap.

The pattern: retrieve top-50 candidates with vector search (fast), then rerank with a more expensive model to pick the top-5 (accurate). The reranker uses a cross-encoder that scores each (query, candidate) pair directly — slower per pair, but you only run it on 50 candidates, not the full corpus.

  • Cohere Rerank 3.5: $1 per 1K queries. Works on 1024-token candidates. Solid quality across domains.
  • voyage-rerank-2: similar pricing. Strong on technical content.
  • Self-hosted: BAAI/bge-reranker-v2-m3 is competitive and free if you can run a GPU.
  • LLM-as-reranker via OpenAI: use a small LLM (GPT-4o-mini) with a structured prompt that scores each (query, candidate) pair. Higher latency but full control.
Two-stage retrieval: vector + rerank
python
import os

import cohere
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])

# `vector_db` is assumed to be your vector store client. Its search(embedding, top_k)
# method and the .text attribute on results are placeholders for your store's API.

def retrieve_with_rerank(query: str, k: int = 5):
    # Stage 1: vector search, top-50 (fast, cheap)
    query_emb = client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    ).data[0].embedding

    candidates = vector_db.search(query_emb, top_k=50)  # ~50ms
    if not candidates:
        return []  # nothing to rerank

    # Stage 2: rerank top-50 to top-K (slower, more accurate)
    rerank_response = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=[c.text for c in candidates],
        top_n=k,
    )

    # Map back to original candidates with their metadata
    return [candidates[r.index] for r in rerank_response.results]

# Real-world impact:
#   Without rerank, top-1 retrieval accuracy: ~62%
#   With rerank, top-1 retrieval accuracy: ~89%
#   Latency penalty: ~200ms.
#   Cost: ~$0.001 per query.

6. Citation discipline (and why it matters)

An LLM that cites is an LLM you can trust. An LLM that doesn't cite is an LLM that hallucinates by default. The pattern: structure the prompt so the LLM MUST cite sources by ID, then validate that the citations match real chunks before showing the answer.

Forced-citation prompt pattern
python
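import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    # Minimal stand-in for your chunk type: text plus the metadata stored at ingestion.
    text: str
    metadata: dict = field(default_factory=dict)


def build_prompt(query: str, chunks: list[Chunk]) -> str: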
    sources = "\n".join([
        f"[{i+1}] {c.text}\nSource: {c.metadata['url']}#{c.metadata['section']}"
        for i, c in enumerate(chunks)
    ])

    return f"""You are answering a user question using ONLY the sources below.
Each claim in your answer MUST be followed by a citation in the format [N]
where N is the source number. If the sources don't contain the answer,
say "I don't have enough information" — do not invent.

Sources:
{sources}

User question: {query}

Answer (cite every factual claim):"""

# After generation, validate:
def validate_citations(answer: str, chunks: list[Chunk]) -> bool:
    if "I don't have enough information" in answer:
        return True  # explicit refusal, as the prompt instructs; nothing to cite
    cited = set(re.findall(r"\[(\d+)\]", answer))
    valid = set(str(i+1) for i in range(len(chunks)))
    if not cited.issubset(valid):
        return False  # cited a source that doesn't exist
    if not cited:
        return False  # cited nothing: likely hallucinating
    return True
The reranking + citation combo

Rerank gives you the right chunks. Citation discipline forces the LLM to use them. Together they take typical RAG citation accuracy from ~70% to ~94% in our deployments.

7. Eval harness — the discipline that prevents regressions

Evals are not optional. They are the single most important engineering investment in a production RAG system. Without them, every prompt change is a roll of the dice.

Build a golden set: 100–500 (query, expected_answer, source_passages) tuples covering the queries you care about. Each example includes the query a user might ask, the answer you'd accept, and the source passages that should be retrieved.
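A golden set stored as JSONL (one example per line) is easy to review in pull requests and easy to validate on load. The loader below is a sketch under assumptions: the field names `query`, `expected_answer`, and `expected_chunk_ids` are one possible schema, and the two sample records are invented for illustration.

```python
import json

REQUIRED_KEYS = {"query", "expected_answer", "expected_chunk_ids"}


def load_golden_set(jsonl_text):
    """Parse JSONL golden-set text into a list of dicts, rejecting incomplete examples."""
    examples = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # allow blank lines between examples
        record = json.loads(line)
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        if not record["expected_chunk_ids"]:
            raise ValueError(f"line {line_no}: no source passages listed")
        examples.append(record)
    return examples


# Hypothetical sample records; real ones come from queries your users ask.
SAMPLE = "\n".join([
    json.dumps({
        "query": "How do I rotate an API key?",
        "expected_answer": "Use the Rotate button on the API keys settings page.",
        "expected_chunk_ids": ["docs-auth-003"],
    }),
    json.dumps({
        "query": "What is the rate limit for the search endpoint?",
        "expected_answer": "100 requests per minute per key.",
        "expected_chunk_ids": ["docs-limits-001", "docs-limits-002"],
    }),
])

golden_set = load_golden_set(SAMPLE)
```

Failing loudly on an example with no source passages matters: an empty gold list cannot be scored for retrieval recall, and it would otherwise pull that average down without anyone noticing.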

Run the golden set after every prompt or model change. Score on three dimensions: retrieval recall (did we retrieve the right passages?), answer quality (does the answer match the expected one?), and citation accuracy (are cited sources actually in the corpus?).

| Eval dimension | Metric | Target |
| --- | --- | --- |
| Retrieval recall@5 | % of queries where the gold passage is in top-5 | > 90% |
| Answer correctness | LLM-as-judge score (0-100) vs gold answer | > 85 |
| Citation accuracy | % of citations that point to real, relevant chunks | > 95% |
| Hallucination rate | % of answers with claims not in retrieved chunks | < 3% |
| Latency p95 | End-to-end retrieve + generate | < 4s |
| Cost per query | Average $ per query in production | Track + alert on changes |
Eval harness skeleton
python
import re


def evaluate_rag(rag_system, golden_set, llm_judge_score):
    """Run the golden set and return the mean of each metric.

    `rag_system` must expose retrieve(query) and generate(query, chunks).
    `llm_judge_score(query=, expected=, actual=)` is your LLM-as-judge call (0-100).
    """
    results = {
        "retrieval_recall": [],
        "answer_correctness": [],
        "citation_accuracy": [],
    }

    for example in golden_set:
        chunks = rag_system.retrieve(example["query"])
        answer = rag_system.generate(example["query"], chunks)

        # Retrieval recall@5: share of gold chunks found in the top 5
        gold_ids = set(example["expected_chunk_ids"])
        retrieved_ids = set(c.id for c in chunks[:5])
        recall = len(gold_ids & retrieved_ids) / max(1, len(gold_ids))
        results["retrieval_recall"].append(recall)

        # Answer correctness (via LLM-as-judge)
        score = llm_judge_score(
            query=example["query"],
            expected=example["expected_answer"],
            actual=answer,
        )
        results["answer_correctness"].append(score)

        # Citation accuracy: share of [N] citations that point at a retrieved chunk
        citations = [int(n) for n in re.findall(r"\[(\d+)\]", answer)]
        valid = sum(1 for n in citations if 1 <= n <= len(chunks))
        results["citation_accuracy"].append(
            valid / max(1, len(citations))
        )

    # max(1, ...) guards against an empty golden set
    return {k: sum(v) / max(1, len(v)) for k, v in results.items()}
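The harness only prevents regressions if something fails when the numbers drop. Below is a minimal gate, assuming the metrics dict has the three keys used above, with recall and citation accuracy as fractions and answer correctness on the 0-100 judge scale. `find_regressions` and the 2% `max_relative_drop` tolerance are illustrative choices, and the `current` and `previous` numbers are invented.

```python
# Targets from the eval table: recall@5 > 90%, answer correctness > 85,
# citation accuracy > 95%.
TARGETS = {
    "retrieval_recall": 0.90,
    "answer_correctness": 85.0,
    "citation_accuracy": 0.95,
}


def find_regressions(metrics, baseline=None, targets=TARGETS, max_relative_drop=0.02):
    """Return human-readable failures; an empty list means the run passes."""
    failures = []
    for name, target in targets.items():
        value = metrics[name]
        if value <= target:
            failures.append(f"{name}: {value:.3f} does not exceed target {target:.3f}")
        # Optional comparison against the last accepted run.
        if baseline is not None and value < baseline[name] * (1 - max_relative_drop):
            failures.append(f"{name}: {value:.3f} dropped from baseline {baseline[name]:.3f}")
    return failures


current = {"retrieval_recall": 0.93, "answer_correctness": 88.0, "citation_accuracy": 0.91}
previous = {"retrieval_recall": 0.97, "answer_correctness": 88.5, "citation_accuracy": 0.96}
problems = find_regressions(current, baseline=previous)
```

In CI, exit non-zero when `problems` is non-empty and print the list. The baseline comparison catches a metric that is still above target but sliding, which is the kind of slow drop that otherwise goes unnoticed for several sprints.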

8. Cost-aware model routing — why most queries don't need GPT-4

Most queries don't need the most expensive model. Sort your models into three cost tiers (listed below) and send each query to the cheapest tier that can answer it well. The router is a small LLM call (Tier 1) that classifies the query into a complexity bucket, then routes accordingly. It costs <$0.001 per query and saves 5–10× on the actual generation.

Real numbers: across our 30+ deployments, the median cost reduction from routing is 70%. Some workloads dominated by simple long-tail queries see an 85% reduction.

  • Tier 1 (cheap, fast): GPT-4o-mini, Claude Haiku, Gemini Flash. ~$0.15-0.25/M input tokens. Use for: simple factual queries, definitions, lookups.
  • Tier 2 (balanced): GPT-4o, Claude Sonnet, Gemini Pro. ~$2.5-3/M input tokens. Use for: multi-step reasoning, comparisons, summarization.
  • Tier 3 (expensive, high-quality): Claude Opus, GPT-5, Gemini Ultra. ~$15+/M input tokens. Use for: complex reasoning, high-stakes answers, code generation.
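The routing logic around the classifier is small. The sketch below assumes three bucket labels ("simple", "moderate", "complex") and injects the classifier as a callable so a real Tier 1 LLM call can replace the keyword stand-in. `Tier`, `TIERS`, `route`, `input_cost_usd`, and `keyword_classifier` are hypothetical names. The model labels and prices are the approximate figures from the tier list, not API model ids.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    model: str               # illustrative label, not a real API model id
    usd_per_m_input: float   # approximate price per 1M input tokens


TIERS = {
    "simple": Tier("tier1", "GPT-4o-mini", 0.15),
    "moderate": Tier("tier2", "GPT-4o", 2.50),
    "complex": Tier("tier3", "Claude Opus", 15.00),
}
DEFAULT_BUCKET = "moderate"  # used when the classifier returns an unknown label


def route(query, classify):
    """Pick a tier for `query`; `classify` returns a bucket label as a string."""
    label = classify(query).strip().lower()
    return TIERS.get(label, TIERS[DEFAULT_BUCKET])


def input_cost_usd(tier, input_tokens):
    return input_tokens * tier.usd_per_m_input / 1_000_000


def keyword_classifier(query):
    """Stand-in for the Tier 1 LLM call, so the example runs offline."""
    q = query.lower()
    if any(w in q for w in ("debug", "write code", "prove")):
        return "complex"
    if any(w in q for w in ("compare", "summarize", "versus")):
        return "moderate"
    return "simple"


tier = route("What is reciprocal rank fusion?", keyword_classifier)
cost = input_cost_usd(tier, input_tokens=8_000)
```

Falling back to the middle tier on an unrecognized label is a deliberate choice in this sketch: a malformed router response then costs a little extra instead of sending a hard query to the weakest model.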

9. Common failure modes + fixes

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Hallucinated answers | No citation forcing in prompt | Add citation discipline pattern |
| Right doc not retrieved | Wrong chunk size or no rerank | Tune chunking + add reranker |
| Slow responses | Top-K too high, model too slow | Cap top-K at 5, route to cheaper model |
| Cost spikes | GPT-4 routing all queries | Add complexity classifier + Tier 1 fallback |
| Stale answers | No retrieval freshness signal | Index with timestamps, filter old chunks |
| Multi-language failures | Embedder is English-only | Switch to Cohere multilingual or text-embedding-3-large |
| Long-tail miss | Vector recall is bad on rare entities | Add BM25 hybrid search + RRF fusion |
Hybrid search (vector + BM25)

Pure vector search misses rare entity names (product SKUs, error codes, jargon). Pure BM25 (keyword) misses semantic queries. The fix is hybrid: run both, fuse with Reciprocal Rank Fusion (RRF). Increases recall by 8–15% in our deployments at near-zero added latency.
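RRF needs only the rank positions from each retriever, not their scores, which is why it works across vector similarity and BM25 without any normalization. A minimal sketch follows. The function name and document ids are illustrative, and `k=60` is the constant conventionally used with RRF, not a value tuned on our deployments.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=lambda d: scores[d], reverse=True)
    return fused[:top_n] if top_n is not None else fused


vector_hits = ["doc_a", "doc_b", "doc_c"]   # from the vector store, best first
bm25_hits = ["doc_c", "doc_a", "doc_d"]     # from the keyword index, best first
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
# fused == ["doc_a", "doc_c", "doc_b", "doc_d"]
```

Documents that both retrievers rank highly rise to the top, while a document only one retriever finds (such as an exact SKU match from BM25) still makes the candidate list that goes to the reranker.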
