Notes: M8: RAG II, make it good

M7 got a RAG app working. This module is about making it good, and, just as importantly, being able to prove it's good. The mindset shift here is the one that separates people who ship toys from people who ship products: stop judging your app by "it seemed fine when I tried it," and start measuring it. Once you can put a number on quality, improving it becomes a craft instead of a guess.

Why "it seemed fine" fails you

You test your RAG app on three questions, they work, you ship it. Then real users ask the other ninety-seven questions and a third come back wrong, and you have no idea which change would help, because you have nothing to compare against. The fix is an eval set: a fixed list of questions with known-good answers that you run every time you change something. It turns "I think that's better" into "retrieval went from 6/10 to 9/10." You can't improve what you don't measure.

Two places RAG breaks: measure both

A RAG answer is wrong for one of two reasons, and they have different fixes. Measuring them separately tells you where to look:

  1. Retrieval failure: the right chunk was never fetched, so the model never had the facts. No prompt tweak can fix this; the fix is upstream (chunking, k, reranking).
  2. Generation failure: the right chunk was fetched, but the model still answered wrong (missed it, mixed it up, or invented something). The fix is the prompt or the model.

So our scorecard tracks two numbers:

  - Retrieval hit rate: of the questions whose answer is in the document, how often did we fetch the chunk holding it?
  - Answer match rate: how often did the final answer contain the correct fact (and say "I don't know" when it should)?

Most RAG problems are retrieval problems. When an answer is wrong, look at what was retrieved first; you'll usually find the model never stood a chance.

flowchart TB
  Q["question"] --> R{"right chunk retrieved?"}
  R -->|no| RF["RETRIEVAL failure<br/>fix: chunking, k, reranking"]
  R -->|yes| G{"answer correct?"}
  G -->|no| GF["GENERATION failure<br/>fix: prompt, model"]
  G -->|yes| OK["good answer "]
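
The same decision can be written as a small function. This is a minimal sketch, not the course's solution code: the name `diagnose` and its parameters are hypothetical, and it assumes you know a phrase that appears in the correct chunk and a phrase the correct answer should include.

```python
def diagnose(retrieved_chunks, reply, source_phrase, answer_phrase):
    """Classify one result by walking the flowchart: retrieval first, then generation."""
    # Step 1: was a chunk containing the expected phrase fetched at all?
    retrieved = any(source_phrase.lower() in chunk.lower() for chunk in retrieved_chunks)
    if not retrieved:
        return "retrieval failure"
    # Step 2: the facts were available; did the reply actually use them?
    if answer_phrase.lower() not in reply.lower():
        return "generation failure"
    return "ok"


# Hypothetical chunks, for illustration only.
chunks = ["The guest wifi password is freshbeans2024.", "We open at 8am on weekdays."]

print(diagnose(chunks, "The password is freshbeans2024.", "freshbeans2024", "freshbeans2024"))
print(diagnose(chunks[1:], "The password is latte123.", "freshbeans2024", "freshbeans2024"))
print(diagnose(chunks, "I'm not sure.", "freshbeans2024", "freshbeans2024"))
```

The order of the checks matters: a wrong answer only counts as a generation failure once retrieval has been ruled out.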

Building a tiny eval set

You don't need a fancy framework; a list of dictionaries is plenty to start:

EVAL_SET = [
  {"q": "What's the guest wifi password?",
   "source_must_contain": "freshbeans2024",   # a phrase from the correct chunk
   "answer_must_contain": "freshbeans2024"},   # a phrase the answer should include
  {"q": "Who is the CEO?",                      # not in the doc, tests honesty
   "source_must_contain": None,
   "answer_must_contain": "don't know"},
]
For each item you run the app, check whether retrieval fetched a chunk containing source_must_contain (retrieval hit) and whether the reply contains answer_must_contain (answer match). Include at least one question whose answer isn't in the document, so you also measure whether the app honestly declines instead of inventing. Even 5-10 questions catch most regressions; the key is that the set is fixed, so scores are comparable across changes.
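
A scorer for that procedure might look like the sketch below. It assumes the app can be called as a function that returns `(retrieved_chunks, reply)`; that interface, the `score` function, and the `fake_app` stand-in are hypothetical names for illustration, not the course's actual harness.

```python
EVAL_SET = [
    {"q": "What's the guest wifi password?",
     "source_must_contain": "freshbeans2024",
     "answer_must_contain": "freshbeans2024"},
    {"q": "Who is the CEO?",                      # not in the doc, tests honesty
     "source_must_contain": None,
     "answer_must_contain": "don't know"},
]


def fake_app(question):
    """Hypothetical stand-in for the real RAG app: returns (retrieved_chunks, reply)."""
    if "wifi" in question.lower():
        return (["The guest wifi password is freshbeans2024."],
                "The password is freshbeans2024.")
    return ["We open at 8am on weekdays."], "I don't know."


def score(app, eval_set):
    """Run every eval item through the app and return the two scorecard numbers."""
    retrieval_hits = retrieval_total = answer_hits = 0
    for item in eval_set:
        chunks, reply = app(item["q"])
        phrase = item["source_must_contain"]
        if phrase is not None:  # only answerable questions count toward retrieval
            retrieval_total += 1
            if any(phrase.lower() in chunk.lower() for chunk in chunks):
                retrieval_hits += 1
        if item["answer_must_contain"].lower() in reply.lower():
            answer_hits += 1
    return {
        "retrieval_hit_rate": retrieval_hits / retrieval_total if retrieval_total else None,
        "answer_match_rate": answer_hits / len(eval_set) if eval_set else None,
    }


print(score(fake_app, EVAL_SET))
```

Questions with `source_must_contain` set to `None` are left out of the retrieval denominator, since there is no correct chunk to fetch; they still count toward the answer match rate.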

The levers you can turn

When the score isn't good enough, these are your main knobs. Change one at a time so you know what helped.

Chunk size (and overlap)

How you split the document shapes what retrieval can return:

  - Too big (whole sections/pages): each chunk mixes several topics, so the match is fuzzy and you burn tokens on irrelevant text.
  - Too small (single sentences): a chunk can lose the context needed to make sense, and a fact may get split from the thing that explains it.
  - Overlap (chunks share a few words at their edges): keeps an idea that straddles a boundary from being lost.

A paragraph, or ~100-300 words with a little overlap, is a common starting point. There's no universal "right" size, so measure a couple of options.
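
One way to experiment with size and overlap is a word-based splitter like the sketch below. The function name `chunk_words` and its defaults are assumptions chosen for illustration; real splitters often respect paragraph or sentence boundaries instead of cutting at a fixed word count.

```python
def chunk_words(text, size=200, overlap=20):
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # this window already reached the end
            break
    return chunks


# Tiny demonstration with numbered "words" so the overlap is easy to see.
sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, size=4, overlap=1))
# → ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

Each chunk repeats the last word of the previous one, so a fact sitting on a boundary appears whole in at least one chunk when the overlap is wide enough to cover it.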

k: how many chunks you retrieve

Fetch too few (k=1) and a slightly-off match means you miss the answer entirely. Fetch more (k=4-8) and you're far likelier to include the right chunk, at the cost of more tokens and more noise in the prompt. Raising k is usually the most reliable single lift to retrieval hit rate.
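
To see the effect of k on the scorecard, you can compute the hit rate at several cutoffs over the same ranked results. The sketch below uses made-up rankings (hypothetical data, not measurements from the course app) and a hypothetical helper `hit_rate_at_k`.

```python
def hit_rate_at_k(ranked_results, k):
    """ranked_results: list of (chunks_best_first, source_phrase) pairs, one per question."""
    hits = sum(
        any(phrase in chunk for chunk in chunks[:k])  # is the right chunk in the top k?
        for chunks, phrase in ranked_results
    )
    return hits / len(ranked_results)


# Hypothetical rankings: the right chunk is ranked 1st for one question, 3rd for the other.
ranked_results = [
    (["wifi password is freshbeans2024", "opening hours", "parking"], "freshbeans2024"),
    (["parking", "wifi password", "refunds within 30 days"], "30 days"),
]

for k in (1, 2, 3):
    print(k, hit_rate_at_k(ranked_results, k))
```

With a fixed ranking, the hit rate can only stay the same or rise as k grows; what the number does not show is the extra tokens and noise each added chunk brings into the prompt.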

Reranking: a sharper second pass

The vector search is fast but coarse. Reranking adds a second stage: retrieve a broad set cheaply (say k=20), then re-score those candidates with a more precise method and keep only the best few for the prompt. Production apps use dedicated rerank models (cross-encoders, or services like Cohere Rerank) that judge each (question, chunk) pair directly; our example uses a simple word-overlap re-score to show the shape. The pattern is always the same: broad-and-cheap, then narrow-and-precise. It improves precision and saves prompt tokens, when the rerank signal is good.
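
The second stage can be sketched with the word-overlap re-score mentioned above. This is an illustrative version, not necessarily the course's example code: `tokens`, `rerank`, and the candidate chunks are hypothetical, and the candidates are assumed to come from a broad first-pass vector search.

```python
import re


def tokens(text):
    """Lowercase word set; punctuation is dropped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank(question, candidates, keep=3):
    """Re-score candidates by how many question words they share, and keep the best few."""
    q = tokens(question)
    # sorted() is stable, so ties keep the order the first-pass search gave them.
    ranked = sorted(candidates, key=lambda chunk: len(q & tokens(chunk)), reverse=True)
    return ranked[:keep]


# Hypothetical first-pass results, in the order the vector search returned them.
candidates = [
    "We open at 8am on weekdays.",
    "Parking is behind the building.",
    "The guest wifi password is freshbeans2024.",
]
print(rerank("What is the guest wifi password?", candidates, keep=1))
# → ['The guest wifi password is freshbeans2024.']
```

Word overlap is a weak signal: common words like "is" and "the" score points too, which is one reason a rerank step can lower the score and needs to be measured like any other change.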

The loop: measure → change one thing → measure

That's the whole method. Run the eval, read the scorecard, change a single lever, run the eval again, and keep the change only if the number went up. Two warnings the scorecard saves you from:

  - A change that helps one question can quietly break another; only a fixed eval set catches that.
  - "Improvements" often aren't. Reranking, smaller chunks, a fancier prompt: sometimes they lower the score. Measuring means you find out before your users do.
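
Comparing two runs question by question makes both warnings visible. The sketch below assumes each run is stored as a dict mapping every eval question to pass/fail, with the same questions in both runs; `compare_runs` and the sample results are hypothetical.

```python
def compare_runs(before, after):
    """before/after: dicts mapping each eval question to True (passed) or False (failed)."""
    fixed = [q for q in before if not before[q] and after[q]]
    broken = [q for q in before if before[q] and not after[q]]
    return {
        "before": sum(before.values()),
        "after": sum(after.values()),
        "fixed": fixed,
        "broken": broken,
        # Keep the change only if the total went up.
        "keep_change": sum(after.values()) > sum(before.values()),
    }


# Hypothetical results: the change fixed one question but broke another.
before = {"wifi password?": True, "opening hours?": False, "who is the CEO?": True}
after = {"wifi password?": False, "opening hours?": True, "who is the CEO?": True}

print(compare_runs(before, after))
```

Here the total is unchanged at 2 of 3, so the change is not kept, and the `broken` list shows the regression that a single headline number would have hidden.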

Go deeper (optional, not needed for today's win)

  - **LLM-as-judge:** our checks use simple "does the answer contain this phrase" matching, which is crude (it misses correct answers worded differently). A common next step is to use *another* LLM call to grade whether the answer is correct and grounded in the source. That is more flexible, though it costs a call and needs its own sanity-checking. (You'll see judging again in M10.)
  - **Faithfulness vs. correctness:** two different questions: "is the answer *supported by* the retrieved text?" (faithfulness/grounding) and "is it *actually true*?" (correctness). Good RAG evals track both.
  - **Persisting the index:** re-embedding every run is fine for a small doc; for big corpora use `chromadb.PersistentClient(path=...)` so you index once.
  - **Frameworks:** tools like Ragas exist to automate RAG evals, but rolling your own tiny scorer first (like today) is the best way to understand what they measure.

Check yourself

Lock in today's win: answer each in your head, then reveal.

1. Why isn't "it seemed fine when I tried it" good enough?

Show answer

A few ad-hoc tries miss most questions and give you nothing to compare against, so you can't tell whether a change helped or hurt. A fixed eval set turns quality into a number you can track and improve, and catches changes that fix one question while breaking another.

2. What are the two places a RAG answer can break, and why measure them separately?

Show answer

Retrieval failure (the right chunk was never fetched, so no prompt tweak helps; fix chunking/k/reranking) and generation failure (the right chunk was fetched but the model still answered wrong; fix the prompt/model). Measuring them separately tells you where to fix. Most RAG problems are retrieval problems, so check what was retrieved first.

3. Name the three retrieval levers and what each does.

Show answer

Chunk size/overlap (how the doc is split: too big = fuzzy, too small = loses context); k (how many chunks you fetch: higher usually lifts retrieval hit rate, at the cost of more tokens/noise); and reranking (retrieve broadly, then re-score precisely and keep the best few).

4. What does reranking actually do, in one sentence?

Show answer

A two-stage retrieve: fetch a broad, cheap candidate set with the vector search, then re-score those candidates with a more precise method and keep only the best few for the prompt (broad-and-cheap → narrow-and-precise).

5. You add reranking and the score drops. What do you do?

Show answer

Revert it, and be glad you measured. A change that lowers the score isn't an improvement; the eval set just saved you from shipping it. Try a different lever and re-measure. (That a "fix" can hurt is exactly why you measure instead of guessing.)


New words (also in resources/glossary.md): evaluation / eval set, retrieval hit rate, answer match rate, retrieval failure, generation failure, chunk overlap, reranking, faithfulness, LLM-as-judge (go-deeper).

Source: original, written for this course. RAG-evaluation and reranking concepts reflect standard practice; the scorecard, levers, and example app were verified to run (chunking, reranking, and the scoring harness for real; the vector-store and model calls mocked, see the solution README). No third-party text or figures; diagrams are original.