§15.1
Retrieval-Augmented Generation
A general-purpose language model has read a lot of the public internet and very little of your firm. Ask it about your refund policy, your product catalog, your internal SOPs, or your sales playbook, and the answer will either be a polite refusal or a confident hallucination. Retrieval-Augmented Generation (RAG) is the standard architecture for closing that gap. The model still does the writing. The firm provides the facts — at query time, from its own indexed documents, retrieved by semantic similarity to the question.
This article walks through the RAG pipeline end to end. The pieces are familiar from earlier in the part — chunking is a text-preprocessing decision (§13.2), embedding and the vector index are both from §14.3, the LLM is from §21. RAG is what happens when they are wired together into a workflow that answers questions with sources attached.
The Executive Question
How do we let our employees, customers, or applications ask questions in natural language and get answers grounded in our own documents — not the model's training data?
The honest version: how do we keep the model's language ability and replace its factual ability with what's actually true at our firm right now?
The Pipeline
A diagram is the fastest way to see what RAG does. The pieces are simple individually; the engineering is in how they connect.
A retrieval-augmented generation pipeline, end to end
The retrieval step is where most RAG failures live. Bad retrieval → ungrounded answer; missing chunk → confident hallucination.
A walk-through:
- Documents. The firm's authoritative content — policies, manuals, FAQs, pricing decks, support history. Whatever the system is expected to answer from.
- Chunk. Documents are split into chunks of roughly 500–800 tokens with some overlap. Chunks small enough that retrieval can be precise; large enough to carry self-contained meaning.
- Embed. Each chunk is embedded with the same model that will embed queries. The vectors are stored in a vector database.
- Index. The database supports nearest-neighbour search across the chunk vectors. This is the §14.3 infrastructure.
- Query. A user (or another system) submits a question.
- Embed the query. With the same model.
- Retrieve. Top-k chunks nearest to the query in vector space.
- Generate. The retrieved chunks are inserted into the LLM's prompt as context, alongside the original question.
- Answer. The model writes a response that should be grounded in the retrieved chunks, with citations back to the source documents.
The architecture's strength is in the separation of concerns. Language ability stays with the model. Facts stay with the firm's index. Updating the knowledge base is just re-indexing; the model itself never has to be re-trained.
The Two Big Failures
Most RAG failures fall into one of two categories.
Missing context. The right chunk never enters the prompt because the retrieval step missed it. The model, asked for a fact it doesn't have, will either refuse or hallucinate. Causes: chunks too small to be self-contained; chunks too large to be retrievable; semantic mismatch between query language and document language; key documents missing from the index in the first place.
Ungrounded generation. The right chunk did enter the prompt, but the model ignored it and answered from its training data instead. This is the harder failure to catch because the answer often sounds correct. Causes: prompts that don't require the model to cite; conflicting evidence in the retrieved chunks; model temperature too high.
Mitigations layer on top of the basic pipeline:
- Citation-required prompts. The system prompt instructs the model to refuse if no retrieved chunk supports the claim.
- Re-ranking. A second model (a cross-encoder or a small LLM) re-scores the top-k chunks against the query, producing a stronger top-k.
- Hybrid search. Combine vector similarity with keyword filters (date, region, language) so the right documents are eligible to be retrieved.
- Source-visible UI. The user sees the cited chunks and can click through; ungrounded answers become visible.
- Eval suites. Curated question-answer-source triplets that the system is graded against on every release.
Chunking Choices Matter More Than You'd Expect
Chunking is the most under-discussed part of RAG and the most consequential. The trade-offs:
- Small chunks (200–400 tokens). Precise retrieval — the right paragraph gets pulled in cleanly. Cost: chunks may not be self-contained; the model lacks the surrounding context.
- Large chunks (1500–2000 tokens). Self-contained. Cost: retrieval is less precise; irrelevant content fills the context window; the model may attend to the wrong part.
- Overlap. A small overlap (50–100 tokens) between adjacent chunks bridges the seams. Standard practice in most production systems.
- Hierarchical chunking. Sections, paragraphs, sentences indexed separately. Queries hit the right level. More engineering, often worth it for long-document corpora.
- Semantic chunking. Split at meaning boundaries (topic shifts) rather than fixed token counts. Newer; promising; not yet a default.
A practical rule: start with 500-token chunks with 50-token overlap. Test against an eval suite. Adjust based on where retrieval is failing.
A Bean & Basket RAG Example
The use case: an internal Q&A assistant for store managers. The documents — policy handbook, pricing playbook, product training, troubleshooting guides — total a few hundred files.
Three example queries, each illustrating a different RAG behaviour:
| Query | Retrieved chunks | Behaviour |
|---|---|---|
| "What is our refund policy for app users?" | Refund policy SOP, sections 3.1–3.3. | Clean answer with citations. RAG working as intended. |
| "What happens when the espresso machine throws an E04 error?" | Troubleshooting guide intro (chunked too coarsely; the E04 section is in a different chunk that wasn't retrieved). | Model answers from training data — possibly correct, possibly invented. Chunking failure. |
| "Who is the regional manager for the South-East?" | No chunks above similarity threshold. | Citation-required prompt forces "I don't have that information in the knowledge base." Grounding works as intended. |
The second row is the typical RAG failure managers encounter. Everything looks fine until someone notices the answer is invented. The fix is downstream — better chunking, denser indexing, a re-ranker — not bigger models.
Evaluating a RAG System
The four metrics that matter:
- Retrieval recall. Did the top-k contain the right document? Measured on a labelled question-document set.
- Answer faithfulness. Did the answer use the retrieved chunks, or did the model improvise? Measured by checking each claim against the cited chunk.
- Answer relevance. Did the answer address the question? Measured by human or LLM-as-judge.
- Refusal rate on out-of-scope questions. Does the system know what it doesn't know? Measured with a deliberately out-of-scope eval set.
A common mistake is to grade only on the first metric. A system with perfect retrieval and broken grounding still ships nonsense; the four metrics need to be tracked together.
When RAG Is the Right Architecture, and When It Isn't
Use RAG when:
- The answers depend on internal, current, or specialized knowledge.
- The knowledge changes frequently (RAG re-indexes; fine-tuning doesn't).
- You need citations and audit trails.
- The cost of hallucination is non-trivial.
Consider alternatives when:
- The questions are conversational rather than factual (a small custom prompt may be enough).
- The knowledge is small enough to fit entirely in the context window (just include it — no retrieval needed).
- The system needs to act, not just answer (agents — §16.3 — wrap RAG in a workflow with tools).
- The questions are about patterns in data rather than facts (a structured query against a database is better than a RAG search).
RAG is one AI pattern. It is the right one for grounded Q&A. It is the wrong one for almost everything else.