M8 solution
The expected, fully-commented artifacts for M8's lab. Peek only after you've tried the lab.
| File | What it is |
|---|---|
| rag_eval.py | The eval harness: an EVAL_SET, a run_config that scores retrieval hit rate + answer match rate, and two configs (baseline vs tuned) to compare. Includes the levers: chunk_paragraphs/chunk_small, k, and rerank. |
| sample_notes.txt | The same café document from M7, so the eval set has known answers. |
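
To make the two scores concrete, here is a minimal sketch of how a harness of this shape can compute them. It is not the contents of rag_eval.py: `score_config`, `toy_retrieve`, `toy_answer`, the toy chunks, and the toy eval set are all hypothetical names and data invented for illustration, and the substring-based scoring rule is an assumption. The retriever and the model are stand-ins, so nothing here touches Chroma or an API.

```python
import re

# Toy data invented for this sketch; NOT taken from sample_notes.txt.
TOY_CHUNKS = [
    "The cafe opens at 7am on weekdays.",
    "The wifi password is espresso42.",
    "We roast our own beans every Monday.",
]
TOY_EVAL_SET = [
    {"question": "When does the cafe open?", "expected": "7am"},
    {"question": "What is the wifi password?", "expected": "espresso42"},
]


def words(text):
    """Lowercased set of alphanumeric tokens in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_retrieve(question, k):
    """Stand-in store: rank chunks by word overlap with the question, keep top k."""
    q = words(question)
    ranked = sorted(TOY_CHUNKS, key=lambda c: len(q & words(c)), reverse=True)
    return ranked[:k]


def toy_answer(question, chunks):
    """Stand-in model: 'answers' by echoing the top retrieved chunk."""
    return chunks[0] if chunks else ""


def score_config(eval_set, retrieve, answer, k):
    """Return (retrieval_hit_rate, answer_match_rate) for one configuration.

    Assumed scoring rule: a retrieval hit means the expected string appears in
    any retrieved chunk; an answer match means it appears in the answer.
    """
    hits = matches = 0
    for item in eval_set:
        expected = item["expected"].lower()
        chunks = retrieve(item["question"], k)
        if any(expected in c.lower() for c in chunks):
            hits += 1
        if expected in answer(item["question"], chunks).lower():
            matches += 1
    n = len(eval_set)
    return hits / n, matches / n


# Hypothetical configs: the only lever varied here is k.
for name, k in [("baseline", 1), ("tuned", 3)]:
    hit_rate, match_rate = score_config(TOY_EVAL_SET, toy_retrieve, toy_answer, k)
    print(f"{name:<10} k={k}  retrieval_hit={hit_rate:.2f}  answer_match={match_rate:.2f}")
```

Keeping the retriever and the model as plain function arguments is what lets the scoring logic be exercised with stand-ins before any live store or billed call is involved.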
Run it
With your venv active, chromadb installed (M7), and your .env present:
python rag_eval.py # prints a two-row scorecard
How this was verified
- Verified for real (pure Python, no deps): chunk_small produces genuinely finer chunks than chunk_paragraphs (18 vs 8 on the sample), and rerank keeps the candidates with the most word overlap with the question.
- Eval-harness mechanics verified with Chroma + Claude mocked: a stand-in store ranked chunks by word overlap and a stand-in model answered from context, confirming run_config computes well-formed retrieval/answer scores and that retrieval hit rate is monotonic in k (retrieving more chunks never reduces hits).
- Could not run live Chroma here (sandbox is Python 3.14; no chromadb wheels, same as M7). The code uses Chroma's documented API; pilot the live eval on Python 3.10-3.12 (3.12 recommended). Note: which config scores higher depends on real embeddings and your document. The harness is the deliverable, not a guarantee that tuned always beats baseline (that's the lesson: measure, don't assume). No API key or billed call was used.
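
The pure-Python levers checked above can be sketched as follows. This is an illustration under stated assumptions, not the code in rag_eval.py: `split_paragraphs`, `split_sentences`, and `rerank_by_overlap` are hypothetical stand-ins for chunk_paragraphs, chunk_small, and rerank. Sentence splitting is only one plausible way to get finer chunks, and the toy document is invented rather than taken from sample_notes.txt.

```python
import re


def words(text):
    """Lowercased set of alphanumeric tokens in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_paragraphs(text):
    """Coarse chunks: one per blank-line-separated paragraph."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(text):
    """Finer chunks: one per sentence (an assumed strategy for small chunks)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def rerank_by_overlap(question, candidates, keep):
    """Keep the `keep` candidates sharing the most words with the question."""
    q = words(question)
    ranked = sorted(candidates, key=lambda c: len(q & words(c)), reverse=True)
    return ranked[:keep]


# Toy document invented for this sketch.
TOY_DOC = """The cafe opens at 7am. It closes at 6pm on weekdays.

The wifi password is espresso42. Ask staff if it changes."""

coarse = split_paragraphs(TOY_DOC)
fine = split_sentences(TOY_DOC)
print(len(coarse), len(fine))  # 2 4

top = rerank_by_overlap("What is the wifi password?", fine, keep=1)
print(top)  # ['The wifi password is espresso42.']
```

Word overlap is a crude relevance signal, but it is deterministic and dependency-free, which is what makes this part checkable without embeddings.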