M8 solution

The expected, fully-commented artifacts for M8's lab. Peek only after you've tried the lab.

File	What it is
`rag_eval.py`	The eval harness: an `EVAL_SET`, a `run_config` that scores retrieval hit rate + answer match rate, and two configs (`baseline` vs `tuned`) to compare. Includes the levers: `chunk_paragraphs`/`chunk_small`, `k`, and `rerank`.
`sample_notes.txt`	The same café document from M7, so the eval set has known answers.

Run it

With your venv active, chromadb installed (M7), and your .env present:

python rag_eval.py          # prints a two-row scorecard

How this was verified

Verified for real (pure Python, no deps): chunk_small produces genuinely finer chunks than chunk_paragraphs (18 vs 8 on the sample), and rerank keeps the candidates with the most word overlap with the question.
Eval-harness mechanics verified with Chroma + Claude mocked: a stand-in store ranked chunks by word overlap and a stand-in model answered from context, confirming run_config computes well-formed retrieval/answer scores and that retrieval hit rate is monotonic in k (more retrieved never reduces hits).

Could not run live Chroma here (sandbox is Python 3.14; no chromadb wheels, same as M7). The code uses Chroma's documented API; pilot the live eval on Python 3.10-3.12 (3.12 recommended). Note: which config scores higher depends on real embeddings and your document, the harness is the deliverable, not a guarantee that tuned always beats baseline (that's the lesson: measure, don't assume). No API key or billed call was used.