M26: Evaluation-driven development and CI (Part D: Agentic Systems)

In M20 you wrote evals. But evals you have to remember to run are evals you will forget to run, right before the change that breaks everything. Today you make them automatic: a checked-in test set, run on every push by a robot, that turns the build red and blocks the merge the moment quality drops below the threshold. It is test-driven development for agents, and it is the difference between "we test sometimes" and "a regression the eval set covers cannot reach main unnoticed".

Today's win: a green-or-red eval gate that passes a good change and blocks a regression automatically, plus the GitHub Actions workflow that runs it on every push, all demonstrated offline.

Today you will

Keep a versioned eval set in the repo that grows over time (every bug becomes a test)
Build an eval gate: pass/fail against a threshold, expressed as a process exit code
Wire it into CI (GitHub Actions) so it runs on every push and pull request and blocks bad merges
Decide what runs in CI (fast, free, deterministic) vs on a schedule (live, costs tokens)
Track quality over time so you can see trends, not just a single run
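
To make the gate and the trend log from the list above concrete, here is a minimal sketch. It is not the course's `run_evals.py`: the names `EVAL_SET`, `THRESHOLD`, `answer`, `pass_rate`, and `gate`, the substring check, and the history file format are all assumptions made for illustration.

```python
import json
import time

# Hypothetical eval set; in a real project this lives in a checked-in file.
EVAL_SET = [
    {"question": "What are your opening hours?", "must_contain": "9am"},
    {"question": "Where are you located?", "must_contain": "Main Street"},
]

THRESHOLD = 0.9  # assumed minimum pass rate for a green build


def answer(question: str) -> str:
    """Deterministic stand-in for the agent (no API key, no spend)."""
    canned = {
        "What are your opening hours?": "We are open 9am to 5pm.",
        "Where are you located?": "You can find us at 12 Main Street.",
    }
    return canned.get(question, "I don't know.")


def pass_rate(cases, agent) -> float:
    """Fraction of cases whose answer contains the expected substring."""
    if not cases:
        return 0.0  # an empty eval set should never pass the gate
    passed = sum(
        1 for c in cases
        if c["must_contain"].lower() in agent(c["question"]).lower()
    )
    return passed / len(cases)


def gate(cases, agent, threshold=THRESHOLD, history_path=None) -> int:
    """Return 0 if the pass rate meets the threshold, 1 otherwise."""
    rate = pass_rate(cases, agent)
    if history_path is not None:
        # One JSON line per run, so trends can be plotted later.
        with open(history_path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"ts": time.time(), "pass_rate": rate}) + "\n")
    ok = rate >= threshold
    print(f"pass rate {rate:.0%} (threshold {threshold:.0%}): {'PASS' if ok else 'FAIL'}")
    return 0 if ok else 1


# In a real script the return value becomes the process exit code, e.g.:
#     raise SystemExit(gate(EVAL_SET, answer, history_path="eval_history.jsonl"))
```

Returning the exit code from a function, rather than calling `sys.exit` deep inside the logic, keeps the gate easy to test; only the script's entry point turns the number into a process exit status.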

Run of show (about 55 minutes)

Time	What we do
0:00	Hook: the eval you forgot to run
0:05	The one idea: automate the gate, make red block the merge (read `notes.md`)
0:12	Lab Part A: run the gate green, then watch a regression turn it red (exit codes)
0:30	Lab Part B: read the GitHub Actions workflow; practice "every bug becomes a test"
0:48	Show: post your green run next to the blocked regression
0:55	Wrap

If you get stuck

Builds directly on M20 (the eval suite) and uses the deployment mindset from M11. The core lab is deterministic and runs with no API key and no spend.
No new libraries. Nothing here can harm your computer. The workflow file is a sample you copy into your own project, not something that runs in the course repo.
The key idea is the exit code: `run_evals.py` exits non-zero when the gate fails, and CI treats non-zero as a failed build. Read that one line if anything is confusing.
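
If the exit-code idea feels abstract, the following sketch imitates what a CI runner does with each step: start a child process, then look only at its return code. The helper `run_step` is a hypothetical name, and the two one-line programs stand in for a passing and a failing gate.

```python
import subprocess
import sys


def run_step(code: str) -> int:
    """Run a tiny Python program in a child process and return its exit code."""
    result = subprocess.run([sys.executable, "-c", code])
    return result.returncode


# Stand-ins for a gate that passes and a gate that fails.
passing = run_step("import sys; sys.exit(0)")
failing = run_step("import sys; sys.exit(1)")

for name, rc in [("passing gate", passing), ("failing gate", failing)]:
    verdict = "build green" if rc == 0 else "build red"
    print(f"{name}: exit code {rc} -> {verdict}")
```

The runner never reads the eval output to decide; zero means success and anything else means failure, which is why the gate only has to get its exit code right.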

Optional challenge

Open `starters/add_eval_case.py` and practice the core habit: a user reports the agent cannot answer "Do you offer refunds?". Add a failing eval case FIRST, watch the gate go red, then fix `app.py` until it goes green. You have just turned a bug into a permanent regression test.
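
The red-then-green loop from the challenge can be rehearsed in a few lines. This is only a sketch under stated assumptions: `FAQ`, `agent`, and `failures` are hypothetical stand-ins, the refund wording is invented, and the real `app.py` and `starters/add_eval_case.py` may be organized quite differently.

```python
# Hypothetical stand-in for the app: a keyword lookup over canned replies.
FAQ = {"hours": "We are open 9am to 5pm."}


def agent(question: str, faq: dict) -> str:
    """Return the first canned reply whose keyword appears in the question."""
    q = question.lower()
    for keyword, reply in faq.items():
        if keyword in q:
            return reply
    return "Sorry, I don't know."


def failures(cases, faq):
    """List the questions whose answer lacks the expected substring."""
    return [
        c["question"] for c in cases
        if c["must_contain"].lower() not in agent(c["question"], faq).lower()
    ]


cases = [{"question": "What are your hours?", "must_contain": "9am"}]

# Step 1: record the bug as an eval case BEFORE touching the app.
cases.append({"question": "Do you offer refunds?", "must_contain": "refund"})
red = failures(cases, FAQ)
print("before fix:", red)

# Step 2: fix the app (illustrative reply text), then rerun the same cases.
fixed_faq = {**FAQ, "refund": "Yes, we offer refunds within 30 days."}
green = failures(cases, fixed_faq)
print("after fix:", green)
```

Writing the case first proves that it can fail; a case that was green from the start would say nothing about whether the fix worked.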