Designing a RAG Pipeline That Doesn't Lie

Retrieval-augmented generation is easy to demo and hard to trust. The first version always looks magical — you ask a question, it answers from your docs. Then you ship it, and the failure modes show up: confident answers from stale chunks, citations that don't support the claim, and silence when the retriever misses.

Here's how I think about the parts that actually matter.

Retrieval is the product

The model is mostly fixed. The retriever is where you have leverage. A few things move the needle more than prompt tweaking ever will:

Chunking that respects structure: split on headings and semantic boundaries, not at a fixed token count.
Hybrid search: dense vectors for meaning, sparse retrieval (BM25) for exact terms. Most real queries need both.
Reranking: a cross-encoder over the top-k candidates usually helps more than switching to a bigger embedding model.
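As a rough illustration of the first two items, here is a minimal sketch. It assumes the source docs use markdown-style "#" headings, and it merges the dense and sparse result lists with reciprocal rank fusion, which is one common way to combine rankings and not something this post prescribes. The names split_on_headings and reciprocal_rank_fusion are made up for the example.

```python
from __future__ import annotations

import re
from collections import defaultdict

HEADING = re.compile(r"^#{1,6}\s+(.*)$")  # assumption: markdown-style "#" headings


def split_on_headings(text: str) -> list[dict]:
    """Return one chunk per heading section, keeping the heading as metadata."""
    chunks: list[dict] = []
    heading, body = "", []

    def flush() -> None:
        # Emit the section collected so far, skipping empty leading sections.
        content = "\n".join(body).strip()
        if heading or content:
            chunks.append({"heading": heading, "text": content})

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc ids; k=60 is a conventional default."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    dense = ["a", "b", "c"]   # hypothetical ids from a vector index
    sparse = ["c", "a", "d"]  # hypothetical ids from a BM25 index
    print(reciprocal_rank_fusion([dense, sparse]))
```

Rank fusion only needs positions, not scores, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales.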

Grounding, then generation

The cheapest guardrail is structural: make the model cite its sources, then verify that each citation points to a retrieved chunk that actually supports the claim.

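def answer(query: str) -> Answer:
    # Cast a wide net first: dense + sparse retrieval, 20 candidates.
    candidates = hybrid_search(query, k=20)
    # Rerank the candidates and keep only the 5 best chunks as context.
    ranked = rerank(query, candidates)[:5]
    response = llm.generate(query, context=ranked, require_citations=True)
    # Drop any claim that is not backed by one of the retrieved chunks.
    return verify_citations(response, ranked)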

If a claim isn't backed by a retrieved chunk, it doesn't ship. That single rule removes most of the hallucinations people complain about.
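To make the rule concrete, here is one hypothetical shape a citation check could take. Everything in it is an assumption for illustration: the Chunk type, the "[1]"-style markers placed before the sentence-ending period, and the word-overlap threshold. Word overlap is a crude stand-in for "supports the claim"; an entailment model would be a stricter check.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Chunk:
    id: int    # hypothetical: the number the model cites as [id]
    text: str


def _tokens(s: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def verify_citations(
    response: str, chunks: list[Chunk], min_overlap: float = 0.5
) -> list[str]:
    """Keep only sentences that cite a retrieved chunk sharing enough words."""
    by_id = {c.id: c for c in chunks}
    kept: list[str] = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        cited = [int(n) for n in re.findall(r"\[(\d+)\]", sentence)]
        claim = _tokens(re.sub(r"\[\d+\]", "", sentence))
        if not cited or not claim:
            continue  # uncited sentences do not ship
        for cid in cited:
            chunk = by_id.get(cid)  # None if the model cited a chunk we never retrieved
            if chunk and len(claim & _tokens(chunk.text)) / len(claim) >= min_overlap:
                kept.append(sentence)
                break
    return kept


if __name__ == "__main__":
    chunks = [Chunk(1, "Refunds are accepted within 30 days of purchase.")]
    draft = "Refunds are accepted within 30 days [1]. Shipping is free [3]."
    print(verify_citations(draft, chunks))
```

The second sentence in the usage example is dropped because it cites a chunk that was never retrieved, which is the "citation exists" half of the check; the overlap threshold covers the "citation supports the claim" half.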

What I'd tell past me

Spend your time on retrieval quality and evaluation, not on the prompt. The prompt is the last 10%.

Build an eval set of real questions with known-good answers before you tune anything. Otherwise you're optimizing vibes.
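A minimal sketch of what such an eval could look like for the retrieval half. It assumes each question has been labeled with the ids of the chunks that contain the known-good answer, which is an extra labeling step beyond what is described above. EvalCase, evaluate, and the retrieve callable are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    relevant_ids: set[str]  # assumption: chunk ids labeled as containing the answer


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    cases: list[EvalCase], retrieve: Callable[[str], list[str]], k: int = 5
) -> dict[str, float]:
    """Average recall@k and MRR over the eval set for a given retriever."""
    if not cases:
        return {"recall_at_k": 0.0, "mrr": 0.0}
    recalls, rranks = [], []
    for case in cases:
        retrieved = retrieve(case.question)
        recalls.append(recall_at_k(retrieved, case.relevant_ids, k))
        rranks.append(reciprocal_rank(retrieved, case.relevant_ids))
    return {
        "recall_at_k": sum(recalls) / len(recalls),
        "mrr": sum(rranks) / len(rranks),
    }


if __name__ == "__main__":
    # Stand-in retriever: a lookup table instead of a real index.
    fake_index = {"How long do refunds take?": ["c7", "c2", "c9"]}
    cases = [EvalCase("How long do refunds take?", {"c2"})]
    print(evaluate(cases, lambda q: fake_index.get(q, []), k=2))
```

Running the same eval set before and after a chunking or reranking change gives you a number to compare instead of a feeling.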