
M8 solution

These are the expected, fully commented artifacts for the M8 lab. Peek only after you've tried the lab yourself.

| File | What it is |
| --- | --- |
| rag_eval.py | The eval harness: an EVAL_SET, a run_config that scores retrieval hit rate and answer match rate, and two configs (baseline vs tuned) to compare. Includes the levers: chunk_paragraphs/chunk_small, k, and rerank. |
| sample_notes.txt | The same café document from M7, so the eval set has known answers. |
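
To make the two scores concrete, here is a minimal sketch of how a harness like this might compute them. It is not the code in rag_eval.py: score_config, the retrieve/answer callables, the substring-based matching rule, and the sample questions are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical eval set (made-up values, not taken from sample_notes.txt):
# each item pairs a question with a string the right chunk and the right
# answer are both expected to contain.
EXAMPLE_EVAL_SET = [
    {"question": "When does the cafe open?", "expected": "7am"},
    {"question": "What is the wifi password?", "expected": "latte123"},
]


def score_config(
    eval_set: List[Dict[str, str]],
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, List[str]], str],
) -> Dict[str, float]:
    """Return retrieval hit rate and answer match rate for one config."""
    if not eval_set:
        return {"hit_rate": 0.0, "answer_rate": 0.0}
    hits = 0
    matches = 0
    for item in eval_set:
        expected = item["expected"].lower()
        chunks = retrieve(item["question"])
        # Retrieval hit: at least one retrieved chunk contains the expected string.
        if any(expected in chunk.lower() for chunk in chunks):
            hits += 1
        # Answer match: the generated answer contains the expected string.
        if expected in answer(item["question"], chunks).lower():
            matches += 1
    n = len(eval_set)
    return {"hit_rate": hits / n, "answer_rate": matches / n}
```

Passing retrieve and answer in as callables is what lets the same scoring loop run against either a live store and model or mocked stand-ins, and lets two configs be compared by calling it twice.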

Run it

With your venv active, chromadb installed (M7), and your .env present:

python rag_eval.py          # prints a two-row scorecard

How this was verified

  • Verified for real (pure Python, no deps): chunk_small produces genuinely finer chunks than chunk_paragraphs (18 vs 8 on the sample), and rerank keeps the candidates with the most word overlap with the question.
  • Eval-harness mechanics verified with Chroma + Claude mocked: a stand-in store ranked chunks by word overlap and a stand-in model answered from context, confirming that run_config computes well-formed retrieval/answer scores and that retrieval hit rate is non-decreasing in k (retrieving more chunks never reduces hits).
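
The properties checked above can be reproduced with a pure-Python sketch along these lines. The function names match the levers in rag_eval.py, but the bodies are assumptions: the sentence-level splitting rule, the word tokenizer, the keep parameter, and the hit_at_k helper are illustrative and may differ from the real file.

```python
import re
from typing import List, Set


def chunk_paragraphs(text: str) -> List[str]:
    """One chunk per paragraph, splitting on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def chunk_small(text: str) -> List[str]:
    """Finer chunks: split each paragraph into sentences (assumed strategy)."""
    chunks: List[str] = []
    for para in chunk_paragraphs(text):
        sentences = re.split(r"(?<=[.!?])\s+", para)
        chunks.extend(s.strip() for s in sentences if s.strip())
    return chunks


def _words(s: str) -> Set[str]:
    """Lowercased word set used for the overlap score."""
    return set(re.findall(r"[a-z0-9']+", s.lower()))


def rerank(question: str, candidates: List[str], keep: int) -> List[str]:
    """Keep the `keep` candidates sharing the most words with the question."""
    q = _words(question)
    # sorted() is stable, so tied candidates keep their retrieval order.
    ranked = sorted(candidates, key=lambda c: len(q & _words(c)), reverse=True)
    return ranked[:keep]


def hit_at_k(ranked: List[str], expected: str, k: int) -> bool:
    """True if any of the top-k chunks contains the expected string."""
    return any(expected.lower() in c.lower() for c in ranked[:k])
```

Under these assumptions the monotonicity follows directly: the top-k chunks are a prefix of the top-(k+1) chunks in a fixed ranking, so a hit at k is still a hit at k+1. If the ranking itself changes with k, that argument no longer applies.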

Live Chroma could not be run here: the sandbox is Python 3.14, which has no chromadb wheels (same as M7). The code uses Chroma's documented API; pilot the live eval on Python 3.10-3.12 (3.12 recommended). Note that which config scores higher depends on real embeddings and your document. The harness is the deliverable, not a guarantee that tuned always beats baseline (that's the lesson: measure, don't assume). No API key or billed call was used.