Lab: M8: measure and improve your RAG app

You'll need: your M7 setup, venv active, key in .env, chromadb installed. Bring the same sample_notes.txt (or your own document). No new install.

Time: ~50 min • Work in your breakout pair.

Heads up: today a result that says "your change made things worse" is a win. It means your scorecard is working and you just avoided shipping a bad idea. Measuring beats guessing. Errors are normal and safe.

This lab has two parts:
- Part A: run the scorecard and understand the two numbers.
- Part B: turn a lever, re-measure, and add your own eval question.

flowchart LR
  E["eval set<br/>(questions + known answers)"] --> Run["run the RAG app"]
  Run --> M["scorecard<br/>retrieval hit % · answer match %"]
  M --> Change["change ONE lever<br/>chunk size · k · rerank"]
  Change --> Run

Part A: measure

Step 1: Set up the folder

Put rag_eval.py (from solution/), eval_starter.py (from starters/) and sample_notes.txt in a folder with your M7 .env. Activate your venv.

You should now see: (.venv) and those files (ls / dir).
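
If you'd rather check from Python than with ls / dir, here is a minimal sketch. The helper name missing_files is made up for this example and is not part of the lab code; the file list is the one from Step 1.

```python
from pathlib import Path

# Files Step 1 asks for. missing_files is a hypothetical helper, not lab code.
REQUIRED = ["rag_eval.py", "eval_starter.py", "sample_notes.txt", ".env"]

def missing_files(folder=".", required=REQUIRED):
    """Return the required file names that are not present in folder."""
    base = Path(folder)
    return [name for name in required if not (base / name).is_file()]

if __name__ == "__main__":
    missing = missing_files()
    print("All files present." if not missing else f"Missing: {missing}")
```

An empty list means the folder is ready. It does not check that your venv is active, so still look for the (.venv) prefix in your prompt.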

Step 2: Run the scorecard

python rag_eval.py

You should now see: two rows, one for a baseline config and one for a tuned config, each showing a retrieval score (e.g. 4/4) and an answers score (e.g. 4/5). Numbers, not vibes. That's the whole point: you now have a measurement you can improve.

Step 3: Understand the two numbers

Open rag_eval.py and read EVAL_SET and run_config.

You should now see / say:
- Retrieval hit rate = of the questions whose answer is in the document, how often did we fetch the chunk that holds it? (If we don't fetch it, the model can't use it.)
- Answer match rate = how often did the final answer contain the right fact? (Includes the "Who is the CEO?" question, which should answer "I don't know", testing honesty too.)
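
The two numbers above can be sketched in a few lines. This is an illustration under assumptions, not the code in rag_eval.py: the field names (question, source_phrase, answer_phrase), the toy items, and the toy outputs are all invented here. It uses simple "does this phrase appear" checks, and it leaves the unanswerable question out of the retrieval count.

```python
# Toy data invented for this sketch. The field names are assumptions and may
# not match EVAL_SET in rag_eval.py.
EVAL_ITEMS = [
    {"question": "When is the office closed?",
     "source_phrase": "closed on Fridays", "answer_phrase": "Friday"},
    {"question": "What is the refund window?",
     "source_phrase": "within 30 days", "answer_phrase": "30 days"},
    # The answer is not in the document, so there is no source chunk to fetch.
    {"question": "Who is the CEO?",
     "source_phrase": None, "answer_phrase": "I don't know"},
]

# One (retrieved_chunks, answer) pair per item, as a RAG app might produce them.
OUTPUTS = [
    (["The office is closed on Fridays."], "It is closed on Friday."),
    (["The office is closed on Fridays."], "Refunds take 14 days."),
    (["The office is closed on Fridays."], "I don't know."),
]

def retrieval_hit(item, chunks):
    """True if any fetched chunk contains the phrase marking the source chunk."""
    phrase = item["source_phrase"].lower()
    return any(phrase in chunk.lower() for chunk in chunks)

def answer_match(item, answer):
    """True if the final answer contains the expected phrase."""
    return item["answer_phrase"].lower() in answer.lower()

def scorecard(items, outputs):
    pairs = list(zip(items, outputs))
    # Retrieval is only scored on questions the document can answer.
    answerable = [(it, out) for it, out in pairs if it["source_phrase"] is not None]
    hits = sum(retrieval_hit(it, chunks) for it, (chunks, _) in answerable)
    # Answers are scored on every question, including the "I don't know" one.
    matches = sum(answer_match(it, answer) for it, (_, answer) in pairs)
    return hits, len(answerable), matches, len(items)

hits, answerable, matches, total = scorecard(EVAL_ITEMS, OUTPUTS)
print(f"retrieval {hits}/{answerable}   answers {matches}/{total}")
# prints: retrieval 1/2   answers 2/3
```

Notice the two denominators differ. That is why a scorecard can read 4/4 on retrieval and 4/5 on answers for the same eval set.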

Step 4: Find the failure

Look at which eval item didn't score. Note whether it failed at retrieval (wrong chunk fetched) or at answering (right chunk, wrong answer).

You should now see: at least one number below the maximum, and you can say where it broke. Most RAG failures are retrieval failures, so the fix is usually upstream of the model.
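
If you want to label failures in code rather than by eye, a rough sketch follows. The function diagnose and its arguments are hypothetical, and it assumes the same phrase checks described in Step 3; rag_eval.py may record failures differently.

```python
def diagnose(source_phrase, answer_phrase, retrieved_chunks, answer):
    """Label one eval item as 'pass', 'retrieval failure', or 'answer failure'.

    source_phrase is None for questions the document cannot answer.
    """
    fetched = source_phrase is None or any(
        source_phrase.lower() in chunk.lower() for chunk in retrieved_chunks
    )
    if not fetched:
        return "retrieval failure"   # the right chunk never reached the model
    if answer_phrase.lower() not in answer.lower():
        return "answer failure"      # right chunk fetched, wrong answer written
    return "pass"

print(diagnose("within 30 days", "30 days",
               ["Parking is behind the building."], "I don't know."))
# prints: retrieval failure
print(diagnose("within 30 days", "30 days",
               ["Refunds are accepted within 30 days."], "About two weeks."))
# prints: answer failure
```

Retrieval is checked first on purpose: if the chunk was never fetched, that is the thing to fix, even if the model happened to guess right.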

Part B: improve, and measure again

Step 5: Read the levers

In rag_eval.py, compare the two run_config(...) calls. The tuned row changes three things: smaller chunks (chunk_small), more chunks retrieved (k=6), and reranking (rerank_keep=3).

You should now see / say: the three levers (chunk size, k, and reranking) and what each does. Smaller chunks = finer matches; bigger k = more chances to fetch the right chunk; reranking = a sharper second pass that keeps the best few.
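
To see what "smaller chunks = finer matches" means, here is a toy chunker. It is an assumption-laden sketch: chunk_words is a made-up helper that splits by word count, and the chunking in rag_eval.py may well split by characters or sentences instead. The sample text is invented.

```python
def chunk_words(text, size):
    """Split text into chunks of at most `size` words, with no overlap."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Invented sample text, 16 words long.
text = ("Refunds are accepted within 30 days. "
        "Gift cards are not refundable. "
        "Support is open on weekdays.")

for size in (16, 6):
    chunks = chunk_words(text, size)
    print(f"size={size}: {len(chunks)} chunk(s)")
    for chunk in chunks:
        print("   ", chunk)
```

With size 16 everything lands in one chunk, so a refund question drags in gift cards and support hours too. With size 6 the refund sentence gets a chunk to itself. Naive splitting can also cut a sentence in half, which is one reason smaller is not automatically better.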

Step 6: Turn one lever at a time

Open eval_starter.py. It runs one config you control via K and RERANK. Run it as-is (K = 1, RERANK = False) and note the score. Then change K to 3 and run again.

python eval_starter.py        # K=1 → note the score
# edit K = 3, save
python eval_starter.py        # K=3 → compare

You should now see: the scorecard change when you change K. Increasing k usually lifts the retrieval score (more chunks fetched = more chances to include the right one). Changing one thing at a time is how you know which lever helped.
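
Here is why a bigger k can help, shown on a toy example. This sketch swaps in plain word overlap for the embedding search your app does through chromadb, so it only illustrates the idea. The chunks, the query, and the helper names are invented.

```python
import re

def tokens(text):
    """Lowercase words and numbers in text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def top_k(query, chunks, k):
    """Rank chunks by how many words they share with the query; keep the first k."""
    q = tokens(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokens(c)), reverse=True)
    return ranked[:k]

# Invented chunks. The one holding the answer shares few words with the query.
CHUNKS = [
    "Support hours are listed on the support page.",
    "The support team answers email within two days.",
    "Phone support is open from 9am to 5pm on weekdays.",
]
QUERY = "What are the support hours?"
SOURCE_PHRASE = "9am to 5pm"

def hit_at(k):
    """True if the chunk containing SOURCE_PHRASE is among the top k."""
    return any(SOURCE_PHRASE in chunk for chunk in top_k(QUERY, CHUNKS, k))

for k in (1, 2, 3):
    print(f"k={k} hit={hit_at(k)}")
# prints, one per line: k=1 hit=False, k=2 hit=False, k=3 hit=True
```

The chunk with the real answer ranks third, so it only shows up once k reaches 3. The cost is that the model also receives the two less useful chunks.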

Step 7: Try reranking

Set RERANK = True in eval_starter.py and run again.

You should now see: the app now retrieves 6 chunks and reranks to the best 3. Compare the score. Reranking trades breadth for precision: sometimes it helps, sometimes it doesn't. The scorecard tells you which, and reading it is the entire skill.
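
The "retrieve 6, keep the best 3" pattern can be sketched like this. Everything here is a stand-in: the first pass is plain word overlap, and the second pass simply ignores common words, whereas a real reranker is usually a separate scoring model. The chunks, the stopword list, and the function names are invented, and this is not how eval_starter.py implements RERANK.

```python
import re

STOPWORDS = {"when", "is", "the", "a", "of", "to", "for", "and", "in"}

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def first_pass(query, chunk):
    """Cheap score: count every shared word, common words included."""
    return len(tokens(query) & tokens(chunk))

def second_pass(query, chunk):
    """Sharper score: count only shared words that carry meaning."""
    return len((tokens(query) - STOPWORDS) & tokens(chunk))

def retrieve_then_rerank(query, chunks, k=6, keep=3):
    # Pass 1: cast a wide net with the cheap score.
    wide = sorted(chunks, key=lambda c: first_pass(query, c), reverse=True)[:k]
    # Pass 2: re-score only those k candidates and keep the best few.
    return sorted(wide, key=lambda c: second_pass(query, c), reverse=True)[:keep]

CHUNKS = [
    "The store is open when the mall is open.",
    "The manager is in when the store is busy.",
    "The cafe is closed when the power is out.",
    "Refund requests have a deadline of 30 days.",
    "Gift cards are not eligible for a refund.",
    "Shipping is free for large orders.",
    "Parking is behind the building.",
    "Our logo was redesigned last year.",
]
QUERY = "When is the refund deadline?"

no_rerank = sorted(CHUNKS, key=lambda c: first_pass(QUERY, c), reverse=True)[:3]
reranked = retrieve_then_rerank(QUERY, CHUNKS, k=6, keep=3)
print("top 3, no rerank:", no_rerank[0])
print("top 3, reranked: ", reranked[0])
```

Without reranking, the top 3 are all "the ... is ... when" sentences and the refund chunk is left out. With the wide fetch plus second pass, the refund chunk comes out first. On other data the second pass can push the right chunk out instead, which is why you re-run the scorecard.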

Step 8: Add your own eval question (finish TODO 1)

In eval_starter.py, add one eval item about your document (a question, a phrase the source chunk contains, and a phrase the answer should include). Run it.

You should now see: your own question scored automatically. You now have a repeatable test you can re-run after any change: a real eval set.
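
A common slip when writing an eval item is a source phrase that doesn't appear in the document word for word, so retrieval can never score. A small sanity check is sketched below. The field names and the check_item helper are assumptions for illustration; match whatever shape the existing items in eval_starter.py use.

```python
def check_item(item, document_text):
    """Return a list of problems with one eval item (empty list = looks fine)."""
    problems = []
    for field in ("question", "answer_phrase"):
        if not item.get(field):
            problems.append(f"missing {field}")
    # source_phrase may be absent for a question the document cannot answer.
    phrase = item.get("source_phrase")
    if phrase and phrase.lower() not in document_text.lower():
        problems.append("source_phrase not found in document")
    return problems

# Invented document text and items, for illustration only.
doc = "Refunds are accepted within 30 days of purchase."
good = {"question": "How long do I have to return something?",
        "source_phrase": "within 30 days", "answer_phrase": "30 days"}
typo = {"question": "How long do I have to return something?",
        "source_phrase": "within thirty days", "answer_phrase": "30 days"}

print(check_item(good, doc))   # prints: []
print(check_item(typo, doc))   # prints: ['source_phrase not found in document']
```

To use it on your own file, read sample_notes.txt (or your document) into a string and pass it as document_text.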

Stuck? The finished harness is ../solution/rag_eval.py. Peek only after you've tried.

Your win

You can measure your RAG app (retrieval hit rate and answer correctness) and use the scorecard to make, and prove, improvements instead of guessing.

Post it to the chat wins board: your before/after, e.g. "k=1 → retrieval 3/5; k=4 → retrieval 5/5. I measured my RAG app and made it better on purpose."

Take-home (optional)

Grow your eval set to 8-10 questions covering different parts of your document (including one whose answer isn't there). A bigger eval set catches changes that help some questions but quietly break others, which is exactly what "it seemed fine" always misses.
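
One way to catch a quiet break is to compare per-question results between two runs, not just the totals. A minimal sketch, assuming you record each run as a dict mapping the question to True (scored) or False (missed); the lab scripts may not store results this way, and the data below is invented.

```python
def compare_runs(before, after):
    """Return (fixed, broken): questions that flipped between two runs."""
    fixed = [q for q in before if not before[q] and after.get(q, False)]
    broken = [q for q in before if before[q] and not after.get(q, False)]
    return fixed, broken

# Invented results for four questions, before and after a change.
before = {"Q1": True, "Q2": False, "Q3": True, "Q4": True}
after = {"Q1": True, "Q2": True, "Q3": False, "Q4": True}

fixed, broken = compare_runs(before, after)
print(f"before {sum(before.values())}/4, after {sum(after.values())}/4")
print("fixed:", fixed, "broken:", broken)
# prints: before 3/4, after 3/4
# prints: fixed: ['Q2'] broken: ['Q3']
```

The total is 3/4 both times, so the headline number says nothing changed. The per-question view shows one question was fixed and another was broken.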