Lab M26: make your evals run themselves (and block bad merges)

You'll need: your venv. The core lab needs no API key and costs nothing (the gate is deterministic). Time: about 40 minutes. Work in your breakout pair.

Heads up: this builds straight on M20's evals. The new idea is to AUTOMATE them: a gate that exits non-zero on failure, wired into CI so a regression turns the build red and blocks the merge. Nothing here can harm your computer, and there is nothing to spend.

This lab has two parts:

- Part A: run the gate green, then watch a regression turn it red (and see the exit codes).
- Part B: read the GitHub Actions workflow, and practice "every bug becomes a test".

flowchart LR
  PUSH["git push / PR"] --> CI["GitHub Actions"]
  CI --> RUN["run_evals.py"]
  RUN -->|all pass, exit 0| GREEN["build green: merge allowed"]
  RUN -->|regression, exit 1| RED["build red: merge blocked"]

Part A: the gate

Step 1: Set up

Copy the files from solution/ into a working folder and run every command from inside that folder. Activate your venv. No installs, no key.

python -c "print('ready')"

You should now see: ready.

Step 2: Run the gate on the good app

python demo.py

You should now see the good app pass every case and the gate allow the merge:

==== GOOD CHANGE (current app) ====
  [pass] hours: ...
  [pass] reset: ...
  [pass] location: ...
  3/3 (100%), threshold 100%  ->  GATE PASSED (merge allowed)

Then a regression (a bug in the password answer) turns it red:

==== REGRESSION (a bug slips in) ====
  [FAIL] reset: 'Please contact support.'
  2/3 (67%), threshold 100%  ->  GATE FAILED (merge blocked)

You should now see: the same suite passes a good change and fails a bad one. That is the gate.

Step 3: See the exit code (what CI actually checks)

python run_evals.py ; echo "exit code: $?"
python run_evals.py --buggy ; echo "exit code: $?"

You should now see: the good run prints exit code: 0 and the buggy run prints exit code: 1. This is the hinge of the whole module: CI passes a build when the command exits 0 and fails it when the command exits non-zero. Open run_evals.py and find the sys.exit.
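The exit-code mechanism can be sketched in a few lines. This is a minimal illustration, not the contents of the lab's run_evals.py. The helper names `gate_exit_code` and `run_like_ci` and the hard-coded results are assumptions made up for this example. The child process stands in for the gate script, and the parent plays the role of CI by looking only at the return code.

```python
import subprocess
import sys

def gate_exit_code(results):
    """Return 0 when every case passed, 1 otherwise (a 100% threshold)."""
    return 0 if all(results.values()) else 1

def run_like_ci(exit_code):
    """Start a child Python process that exits with exit_code,
    then judge the build the way CI does: by the return code alone."""
    child = subprocess.run(
        [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]
    )
    return "green" if child.returncode == 0 else "red"

# Hypothetical per-case results; a real gate would compute these by running the suite.
good = {"hours": True, "reset": True, "location": True}
buggy = {"hours": True, "reset": False, "location": True}

print(run_like_ci(gate_exit_code(good)))   # green
print(run_like_ci(gate_exit_code(buggy)))  # red
```

Note that the parent never reads the child's output. That is why a script that prints "FAILED" but still exits 0 would not block a merge.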

Step 4: See quality tracked over time

Run the gate a couple of times and look at the recorded history:

python run_evals.py >/dev/null ; python run_evals.py --buggy >/dev/null ; cat eval_history.json

You should now see: a list like [1.0, 0.667], the pass rate of each run (the list may be longer if earlier runs were already recorded). A trend, not just a snapshot. Delete eval_history.json when done.
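One way such a history could be kept is sketched below. It assumes the file holds a plain JSON list of pass rates, which matches the output above. The function name `record_pass_rate` is hypothetical, and the lab's run_evals.py may do this differently. The demo writes to a temporary folder so it does not touch your real eval_history.json.

```python
import json
import tempfile
from pathlib import Path

def record_pass_rate(history_path, rate):
    """Append one run's pass rate to a JSON list, creating the file if needed."""
    path = Path(history_path)
    history = json.loads(path.read_text()) if path.exists() else []
    history.append(round(rate, 3))
    path.write_text(json.dumps(history))
    return history

# Simulate a good run (3 of 3) followed by a buggy run (2 of 3).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "eval_history.json"
    record_pass_rate(demo_path, 3 / 3)
    history = record_pass_rate(demo_path, 2 / 3)

print(history)  # [1.0, 0.667]
```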

Part B: into CI, and the EDD habit

Step 5: Read the GitHub Actions workflow

Open evals-ci.yml. It runs on every push and pull_request, sets up Python, and runs python run_evals.py. Because that command exits non-zero on a regression, the build goes red and a required check blocks the merge.

You should now see: the workflow needs no API key, because the gate is deterministic. There is an optional, commented second job for live evals on a schedule, which reads the key from a CI secret (never a file). To use this in your own project, copy the file to .github/workflows/evals.yml.

Step 6: Why CI is deterministic (the design choice)

Read the comment at the top of evals-ci.yml and section 4 of notes.md.

You should now see (in your words): CI runs deterministic checks (mock or recorded responses) so it is fast, free, and repeatable; live model evals run on a schedule, not on every commit. A gate that costs money or flakes randomly gets ignored, so you keep the always-on gate cheap and certain.
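To make "recorded responses" concrete, here is a minimal sketch of a deterministic stand-in for a model call. The prompts, answers, and the names `RECORDED`, `recorded_model`, and `run_cases` are invented for illustration and are not taken from the lab files.

```python
# Responses captured once (for example from a live run) and replayed in CI.
RECORDED = {
    "What are your hours?": "We are open 9am to 5pm, Monday to Friday.",
    "How do I reset my password?": "Use the 'Forgot password' link on the sign-in page.",
}

def recorded_model(prompt):
    """Stand-in for a live model call: the same prompt always returns the same text."""
    return RECORDED[prompt]

# Each case pairs a prompt with a substring the answer must contain.
CASES = [
    ("What are your hours?", "9am"),
    ("How do I reset my password?", "Forgot password"),
]

def run_cases(model):
    """Return one pass/fail boolean per case."""
    return [expected in model(prompt) for prompt, expected in CASES]

first = run_cases(recorded_model)
second = run_cases(recorded_model)
print(first, first == second)  # [True, True] True
```

Because nothing here touches the network, two runs on the same commit give the same result. That repeatability is what you want from an always-on gate.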

Step 7: Practice "every bug becomes a test"

A user reports the agent cannot answer "Do you offer refunds?". Open starters/add_eval_case.py:

1. It already adds a refunds case. Run it: python add_eval_case.py. The case FAILS (the bug is real).
2. Open app.py and add a refund answer (a line that returns something containing "refund" for refund questions).
3. Run python add_eval_case.py again: the case now PASSES.

You should now see: a bug became a failing test, then a fix made it pass, and that test now guards against the bug forever. That loop, fail first then fix, is evaluation-driven development.
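The fail-first, then-fix loop can be sketched as below. This is an illustration under assumptions, not the lab's app.py or add_eval_case.py: the function names, the keyword matching, and the answer texts (including the 30-day wording) are all hypothetical.

```python
def answer_before_fix(question):
    """App as reported: no refund branch, so refund questions hit the catch-all."""
    q = question.lower()
    if "hours" in q:
        return "We are open 9am to 5pm."
    return "Please contact support."  # catch-all

def answer_after_fix(question):
    """Same app with a refund branch placed before the catch-all return."""
    q = question.lower()
    if "hours" in q:
        return "We are open 9am to 5pm."
    if "refund" in q:
        return "Yes, we offer refunds within 30 days."  # hypothetical policy text
    return "Please contact support."

def refunds_case(answer):
    """The new eval case: the reply to the user's question must mention 'refund'."""
    return "refund" in answer("Do you offer refunds?").lower()

print(refunds_case(answer_before_fix))  # False: the bug is captured as a failing test
print(refunds_case(answer_after_fix))   # True: the fix makes the case pass
```

Writing the case before the fix matters. Seeing it fail first proves the case really detects this bug.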

Step 8: Show it

Post your green gate (exit 0) next to the blocked regression (exit 1), and one sentence on what you would put in CI on every push versus what you would run live on a schedule.

If you get stuck

ModuleNotFoundError -> run from inside the folder with the solution .py files.
Exit code is always 0 -> you ran demo.py (which never exits non-zero); use run_evals.py and check $?.
The refunds case still fails after editing -> make sure your new answer in app.py actually contains the word "refund", and that the refund branch is reached before the catch-all return.
Want CI to run live evals? -> add the optional job in evals-ci.yml and set ANTHROPIC_API_KEY as a repo secret (Settings -> Secrets). Never commit the key.

Check yourself

What makes CI pass or fail a build, mechanically?

The exit code of the command it runs. `run_evals.py` exits 0 when the gate passes and non-zero when it fails; CI treats non-zero as a failed build and blocks the merge.

Why run deterministic evals in CI instead of live model calls?

CI must be fast, free, and repeatable. Live calls cost tokens and are non-deterministic, so they make the gate slow, expensive, and flaky, which gets it ignored. Gate deterministically (mock or recorded responses) on every push; run a small live subset on a schedule.

What is the "every bug becomes a test" habit?

When you find a bug, first add an eval case that fails because of it, then fix the code until it passes. The case stays in the suite forever, so that bug can never silently return.

How do you choose the pass threshold?

It is a product decision. Use 100% when every case is critical; a lower bar may fit a fuzzier task. But do not set it so strict that flaky or subjective cases make the gate untrustworthy, or people will route around it.
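A small arithmetic sketch shows how much the threshold changes the outcome. The function name `gate_passes` and the 0.6 bar are assumptions chosen only for illustration.

```python
def gate_passes(passed, total, threshold):
    """True when the pass rate meets or exceeds the threshold."""
    return passed / total >= threshold

# The same 2-of-3 run is blocked at a 100% bar but allowed at a 60% bar.
print(gate_passes(2, 3, 1.0))  # False
print(gate_passes(2, 3, 0.6))  # True
print(gate_passes(3, 3, 1.0))  # True
```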