M8: RAG II: make it good

Your M7 app answers, but is it right? "Seems fine when I tried it" isn't good enough for anything real. Today you do what separates a demo from a product: you measure your RAG app with a scorecard, find where it fails, turn a few knobs, and prove the score went up. This is the habit that makes you dangerous: you stop guessing and start measuring.

Today's win: you can tell whether your RAG app is correct, using a scorecard over a small eval set, and you've used that scorecard to make a measured improvement.

Today you will

Build a tiny eval set and score two things: retrieval hit rate and answer correctness
Diagnose a bad retrieval (a question the app gets wrong) by looking at what it fetched
Turn the levers (chunk size, how many chunks to retrieve (k), reranking) and re-measure
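
To make the two scores concrete, here is a minimal sketch of a scorecard. It is not the lab's actual script: the eval-set fields (`expected_chunk`, `must_contain`), the `scorecard` function, and the `fake_retrieve` / `fake_answer` stand-ins are all hypothetical names chosen for illustration, and answer correctness is approximated with a simple "does the answer contain this phrase" check.

```python
# Hypothetical eval set: each item names the chunk that should be retrieved
# and a phrase that a correct answer is assumed to contain.
EVAL_SET = [
    {"question": "What is the refund window?",
     "expected_chunk": "policy-2", "must_contain": "30 days"},
    {"question": "Who approves travel?",
     "expected_chunk": "travel-1", "must_contain": "manager"},
]


def scorecard(eval_set, retrieve, answer, k=3):
    """Score retrieval hit rate and answer correctness over an eval set.

    retrieve(question, k) -> list of chunk ids (assumed interface)
    answer(question)      -> answer string      (assumed interface)
    """
    hits = 0
    correct = 0
    for item in eval_set:
        # Retrieval hit: the expected chunk is among the top-k results.
        if item["expected_chunk"] in retrieve(item["question"], k):
            hits += 1
        # Answer correctness (simplified): the key phrase appears in the answer.
        if item["must_contain"].lower() in answer(item["question"]).lower():
            correct += 1
    n = len(eval_set)
    return {"hit_rate": hits / n, "answer_correctness": correct / n}


# Stand-ins for the real app so the sketch runs without Chroma or an API key.
_FAKE_RANKING = {
    "What is the refund window?": ["policy-2", "policy-1", "faq-3"],
    "Who approves travel?": ["faq-3", "policy-1", "travel-1"],
}
_FAKE_ANSWERS = {
    "What is the refund window?": "Refunds are accepted within 30 days.",
    "Who approves travel?": "Travel is approved by HR.",
}


def fake_retrieve(question, k):
    return _FAKE_RANKING[question][:k]


def fake_answer(question):
    return _FAKE_ANSWERS[question]
```

With these stand-ins, `scorecard(EVAL_SET, fake_retrieve, fake_answer, k=3)` gives a hit rate of 1.0 and an answer correctness of 0.5, while `k=1` drops the hit rate to 0.5. In a real run you would pass in your own retrieval and answer functions instead of the fakes.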

Run of show (~60 min)

Time	What we do
0:00	Hook + the win we're chasing
0:05	The one idea: you can't improve what you don't measure (full read in `notes.md`)
0:10	Lab Part A: run the scorecard; read what each number means
0:35	Lab Part B: turn a lever, re-measure, add your own eval question
0:55	Show: post your before/after scorecard to the wins board
1:00	Wrap + take-home

If you get stuck

No new install: reuse M7's Chroma setup and key. (If Chroma still won't import, apply the Python-version fix from M7: use Python 3.12.)
A change that makes the score worse is a success of measuring, not a failure: now you know, and you revert it. Re-read the "You should now see" line. Nothing here can harm your computer.
If retrieval misses, look at the retrieved chunks first; the fix is almost always there (bigger k, smaller chunks, or reranking).
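
One way to "look at the retrieved chunks" is to check where the expected chunk lands in a deeper result list than your usual k. The sketch below assumes you can fetch, say, the top 10 chunk ids for a question. `first_hit_rank` and `suggest_lever` are hypothetical helpers, and the mapping from rank to lever is a rule of thumb, not a guarantee.

```python
def first_hit_rank(retrieved_ids, expected_id):
    """Return the 1-based rank of expected_id in retrieved_ids, or None if absent."""
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id == expected_id:
            return rank
    return None


def suggest_lever(rank, k):
    """Rule-of-thumb hint (an assumption, not a diagnosis) for which lever to try."""
    if rank is None:
        # The right chunk never surfaced, even in the deeper list.
        return "not retrieved: inspect the chunks themselves, try smaller chunks"
    if rank > k:
        # The right chunk exists but sits below the cutoff.
        return "below the cutoff: try a bigger k or reranking"
    return "within top k: retrieval looks fine, check the answer step"
```

For example, if the expected chunk sits at rank 5 and your app uses k=3, `suggest_lever(5, 3)` points at a bigger k or reranking. Change one lever, then re-run the scorecard to confirm it helped.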

Optional challenge

Write an eval question that your app currently gets wrong, then find the single lever that fixes it without breaking the others. Welcome to real RAG engineering: it's all measure, change one thing, measure again.
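
To check "fixes it without breaking the others", it helps to compare per-question results rather than only the overall score. This is a minimal sketch that assumes you record each run as a dict mapping question to pass/fail. The `compare_runs` name and that data shape are hypothetical.

```python
def compare_runs(before, after):
    """Compare two runs, each a dict of question -> bool (True means passed).

    Returns the questions that went from fail to pass ("fixed") and
    from pass to fail ("broken"), so a regression can't hide behind
    an unchanged average.
    """
    fixed = sorted(q for q in after if after[q] and not before.get(q, False))
    broken = sorted(q for q in before if before[q] and not after.get(q, False))
    return {"fixed": fixed, "broken": broken}
```

A lever change is a clean win when `fixed` contains your new question and `broken` is empty. If `broken` is non-empty, revert and try a different lever.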