Lab M26: make your evals run themselves (and block bad merges)
You'll need: your venv. The core lab needs no API key and costs nothing (the gate is deterministic). Time: about 40 minutes. Work in your breakout pair.
Heads up: this builds straight on M20's evals. The new idea is to AUTOMATE them: a gate that exits non-zero on failure, wired into CI so a regression turns the build red and blocks the merge. Nothing here can harm your computer, and there is nothing to spend.
This lab has two parts:
- Part A: run the gate green, then watch a regression turn it red (and see the exit codes).
- Part B: read the GitHub Actions workflow, and practice "every bug becomes a test".
flowchart LR
PUSH["git push / PR"] --> CI["GitHub Actions"]
CI --> RUN["run_evals.py"]
RUN -->|all pass, exit 0| GREEN["build green: merge allowed"]
RUN -->|regression, exit 1| RED["build red: merge blocked"]
Part A: the gate
Step 1: Set up
Copy the solution/ files into a folder. Activate your venv. No installs, no key.
python -c "print('ready')"
You should now see: ready.
Step 2: Run the gate on the good app
python demo.py
==== GOOD CHANGE (current app) ====
[pass] hours: ...
[pass] reset: ...
[pass] location: ...
3/3 (100%), threshold 100% -> GATE PASSED (merge allowed)
==== REGRESSION (a bug slips in) ====
[FAIL] reset: 'Please contact support.'
2/3 (67%), threshold 100% -> GATE FAILED (merge blocked)
Step 3: See the exit code (what CI actually checks)
python run_evals.py ; echo "exit code: $?"
python run_evals.py --buggy ; echo "exit code: $?"
You should now see: the first run prints exit code: 0 and the buggy run prints exit code: 1.
This is the hinge of the whole module: CI passes a build when the command exits 0 and fails it when
the command exits non-zero. Open run_evals.py and find the sys.exit.
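To make the hinge concrete, here is a minimal sketch of how a gate could turn eval results into an exit code. It is not the code in run_evals.py: the function name gate_exit_code, the THRESHOLD constant, and the sample results are all assumptions made for illustration.

```python
THRESHOLD = 1.0  # assumed: 100%, matching the threshold in the demo output


def gate_exit_code(results, threshold=THRESHOLD):
    """Return 0 if the pass rate meets the threshold, otherwise 1.

    results is a list of (case_name, passed) pairs (a hypothetical shape).
    """
    passed = sum(1 for _name, ok in results if ok)
    rate = passed / len(results) if results else 0.0
    print(f"{passed}/{len(results)} ({rate:.0%}), threshold {threshold:.0%}")
    return 0 if rate >= threshold else 1


# Made-up results standing in for real eval cases.
good = [("hours", True), ("reset", True), ("location", True)]
buggy = [("hours", True), ("reset", False), ("location", True)]

print(gate_exit_code(good))   # 0
print(gate_exit_code(buggy))  # 1
```

A real gate script would finish by passing that value to sys.exit, which is what makes the shell's $? (and therefore CI) see 0 or 1.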
Step 4: See quality tracked over time
Run the gate a couple of times and look at the recorded history:
python run_evals.py >/dev/null ; python run_evals.py --buggy >/dev/null ; cat eval_history.json
You should now see: [1.0, 0.667], the pass rate of each run (the list may be longer if
earlier runs were already recorded). A trend, not just a snapshot. Delete eval_history.json when done.
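If you are curious how a history like this could be kept, the sketch below appends one pass rate per run to a JSON list. It assumes the file holds a plain list of numbers, as the output above suggests; the helper name record_pass_rate is hypothetical and the real run_evals.py may do this differently.

```python
import json
import tempfile
from pathlib import Path


def record_pass_rate(rate, path):
    """Append one pass rate to a JSON list on disk and return the full history."""
    history_file = Path(path)
    # Start a new list on the first run, otherwise load what is already there.
    history = json.loads(history_file.read_text()) if history_file.exists() else []
    history.append(round(rate, 3))
    history_file.write_text(json.dumps(history))
    return history


# Demo in a temporary folder so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "eval_history.json"
    record_pass_rate(3 / 3, demo_file)            # a green run
    history = record_pass_rate(2 / 3, demo_file)  # a run with one regression
    print(history)  # [1.0, 0.667]
```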
Part B: into CI, and the EDD habit
Step 5: Read the GitHub Actions workflow
Open evals-ci.yml. It runs on every push and pull_request, sets up
Python, and runs python run_evals.py. Because that command exits non-zero on a regression, the build
goes red; once the check is marked as required in the repo's branch protection settings, a red build blocks the merge.
You should now see: the workflow needs no API key, because the gate is deterministic. There is
an optional, commented second job for live evals on a schedule, which reads the key from a CI secret
(never a file). To use this in your own project, copy the file to .github/workflows/evals.yml.
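The CI side of the contract is small enough to imitate locally. The sketch below is not how GitHub Actions is implemented; it only mimics the rule a CI step applies, using tiny child processes as stand-ins for a passing and a failing gate. The function name step_status is made up for this example.

```python
import subprocess
import sys


def step_status(command):
    """Run a command the way a CI step would and map its exit code to a status."""
    completed = subprocess.run(command)
    return "green" if completed.returncode == 0 else "red"


# Stand-ins for the gate: child Python processes that exit 0 and 1.
passing = [sys.executable, "-c", "import sys; sys.exit(0)"]
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]

print(step_status(passing))  # green
print(step_status(failing))  # red
```

Notice that the runner never reads the printed report. Only the exit code decides the colour, which is why the gate has to exit non-zero on a regression.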
Step 6: Why CI is deterministic (the design choice)
Read the comment at the top of evals-ci.yml and section 4 of notes.md.
You should now see (in your words): CI runs deterministic checks (mock or recorded responses) so it is fast, free, and repeatable; live model evals run on a schedule, not on every commit. A gate that costs money or flakes randomly gets ignored, so you keep the always-on gate cheap and certain.
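A rough sketch of what "deterministic" means in practice: the checks run against recorded answers instead of a live model, so every run gives the same result. The recorded answers, the case list, and the names fake_model and run_cases are all invented for illustration and are not taken from the lab files.

```python
# Hypothetical recorded responses: question -> answer captured once.
RECORDED = {
    "What are your hours?": "We are open 9am to 5pm, Monday to Friday.",
    "How do I reset my password?": "Use the reset link on the login page.",
}

# Hypothetical cases: (name, question, text the answer must contain).
CASES = [
    ("hours", "What are your hours?", "9am"),
    ("reset", "How do I reset my password?", "reset link"),
]


def fake_model(question):
    """Stand-in for a model call: look the answer up, no network, no key."""
    return RECORDED.get(question, "Please contact support.")


def run_cases(answer_fn, cases):
    """Return (name, passed) for each case using a simple substring check."""
    return [
        (name, expected.lower() in answer_fn(question).lower())
        for name, question, expected in cases
    ]


first = run_cases(fake_model, CASES)
second = run_cases(fake_model, CASES)
print(first)            # [('hours', True), ('reset', True)]
print(first == second)  # True: same input, same verdict, every run
```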
Step 7: Practice "every bug becomes a test"
A user reports the agent cannot answer "Do you offer refunds?". Open
starters/add_eval_case.py:
1. It already adds a refunds case. Run it: python add_eval_case.py. The case FAILS (the bug is real).
2. Open app.py and add a refund answer (a line that returns something containing
"refund" for refund questions).
3. Run python add_eval_case.py again: the case now PASSES.
You should now see: a bug became a failing test, then a fix made it pass, and that test now guards against the bug forever. That loop, fail first then fix, is evaluation-driven development.
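The loop can be shown in miniature as below. This is only an illustration of the shape of the fix, not the contents of app.py or add_eval_case.py: both answer functions, their wording, and the refunds_case helper are hypothetical.

```python
def answer_before(question):
    """Hypothetical app before the fix: no refund branch."""
    q = question.lower()
    if "hours" in q:
        return "We are open 9am to 5pm."
    return "Please contact support."  # catch-all


def answer_after(question):
    """Hypothetical app after the fix: refund branch sits before the catch-all."""
    q = question.lower()
    if "hours" in q:
        return "We are open 9am to 5pm."
    if "refund" in q:
        return "Yes, we offer refunds."  # made-up wording; it contains "refund"
    return "Please contact support."  # catch-all


def refunds_case(answer_fn):
    """The new eval case: the answer must mention refunds."""
    return "refund" in answer_fn("Do you offer refunds?").lower()


print(refunds_case(answer_before))  # False: the test fails first
print(refunds_case(answer_after))   # True: the fix makes it pass
```

Once the case is part of the gate, removing the refund branch later would turn the build red again.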
Step 8: Show it
Post your green gate (exit 0) next to the blocked regression (exit 1), and one sentence on what you would put in CI on every push versus what you would run live on a schedule.
If you get stuck
- ModuleNotFoundError -> run from inside the folder with the solution .py files.
- Exit code is always 0 -> you ran demo.py (which never exits non-zero); use run_evals.py and check $?.
- The refunds case still fails after editing -> make sure your new answer in app.py actually contains the word "refund", and that the refund branch is reached before the catch-all return.
- Want CI to run live evals? -> add the optional job in evals-ci.yml and set ANTHROPIC_API_KEY as a repo secret (Settings -> Secrets). Never commit the key.