M26 notes: Evaluation-driven development and CI (the one idea)

The one idea: an eval suite only protects you if it runs automatically. Evaluation-driven development (EDD) makes your M20 evals a checked-in, growing test set that a robot runs on every change and uses to BLOCK any merge that drops quality. The mechanism is humble: a script that exits non-zero when the gate fails, and a CI system that treats non-zero as a failed build. That is it, and it changes everything, because now a regression cannot reach production by being forgotten.

1. Why "run the evals sometimes" fails

In M20 you could run your evals by hand. The problem is human: the day you are in a hurry, or it is late, or the change "obviously" works, is exactly the day you skip them and ship the regression. Manual discipline does not survive contact with deadlines. The fix is to remove the human from the loop: the evals run on every push whether or not anyone remembers, and a failure stops the change automatically.

Analogy. A smoke detector you have to remember to switch on every night is not a smoke detector. EDD wires the detector to the mains so it is always on, and to the door lock so it will not let you leave with the stove on.

2. The pieces

A versioned eval set (evalset.py): the golden cases live in the repo, in version control, next to the code. They are reviewed in pull requests like any other code, and they grow over time.
A gate (gate(...)): pass/fail against a threshold (here, 100% of cases). The threshold is a product decision: a critical agent might demand 100%, a fuzzier one 95%.
A runner (run_evals.py): runs the suite, prints a scorecard, records the result, and sets the process EXIT CODE: 0 if the gate passed, non-zero if it failed.
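CI (evals-ci.yml): a GitHub Actions workflow that runs the runner on every push and pull request. A non-zero exit makes the build red. A red build only blocks the merge if the check is marked as required in the repository's branch protection settings; otherwise it is just a warning people can click past.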

The exit code is the hinge. Everything CI does to "pass or fail a build" comes down to whether the command it ran exited 0 or not. Make your gate exit non-zero on failure and CI does the rest.
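To make the hinge concrete, here is a minimal sketch of a gate and the exit code it produces. It is not the course's run_evals.py: the signature of gate, the helper exit_code_for, and the list of booleans standing in for eval results are all assumptions made for illustration.

```python
def gate(results, threshold=1.0):
    """Return True when the share of passing cases meets the threshold.

    `results` is assumed to be a list of booleans, one per eval case.
    """
    if not results:
        return False  # an empty suite should never pass silently
    pass_rate = sum(results) / len(results)
    return pass_rate >= threshold


def exit_code_for(results, threshold=1.0):
    """Map the gate decision to a process exit code: 0 = pass, 1 = fail."""
    return 0 if gate(results, threshold) else 1


def scorecard(results):
    """Build the one-line summary a runner might print."""
    return f"{sum(results)}/{len(results)} cases passed"


# A runner script would end with something like:
#     import sys
#     print(scorecard(results))
#     sys.exit(exit_code_for(results))
# CI only looks at that final exit code.
```

With threshold=1.0 a single failing case flips the exit code to 1, which is all CI needs in order to turn the build red.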

3. EDD as a habit: every bug becomes a test

Test-driven development for agents looks like this:

A bug is reported (the agent gives a bad answer to some input).
You add an eval case for that input that FAILS because of the bug. Now the gate is red, and it is red for a real reason.
You fix the code until the case passes and the gate is green again.
The case stays in the suite forever, so that exact bug can never silently return.

Over months this is how the eval set becomes valuable: it accumulates one hard-won case per real failure. You do not write a thousand cases up front; you grow them from reality. The lab challenge has you do exactly this for a missing "refunds" answer.
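The four steps above can be sketched in a few lines. Everything here is hypothetical: the case format (an input plus a substring the answer must contain), the two stand-in agents, and the wording of the answers are invented for illustration and are not the lab's actual code.

```python
# Hypothetical eval set: each case pairs an input with a substring
# that a correct answer must contain.
EVAL_CASES = [
    {"id": "hours", "input": "When are you open?", "must_contain": "9am"},
]

# Step 2: the reported bug becomes a permanent case in the suite.
EVAL_CASES.append(
    {"id": "refunds", "input": "How do I get a refund?", "must_contain": "refund"}
)


def buggy_agent(question):
    """Stand-in for the agent before the fix: it has no refunds answer."""
    if "open" in question.lower():
        return "We are open 9am to 5pm."
    return "Sorry, I do not know."


def fixed_agent(question):
    """Stand-in for the agent after the fix (step 3)."""
    if "refund" in question.lower():
        return "You can request a refund within 30 days."
    return buggy_agent(question)


def failing_ids(agent, cases):
    """Return the ids of cases whose answer lacks the required substring."""
    return [
        case["id"]
        for case in cases
        if case["must_contain"] not in agent(case["input"]).lower()
    ]
```

Running failing_ids(buggy_agent, EVAL_CASES) reports the new case as failing, and the same call with fixed_agent reports nothing. The case then stays in EVAL_CASES, so a later change that breaks the refunds answer turns the gate red again.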

4. What runs in CI vs on a schedule (the honest part)

CI must be fast, free, and repeatable: a gate that costs money or flakes randomly will be ignored or disabled. But LLM calls cost tokens and are non-deterministic. So split your evals:

In CI on every push: deterministic checks. Either pure-logic assertions, or your agent run against a mock or recorded responses (fixtures), so the result is the same every time and costs nothing. That is what run_evals.py does here.
On a schedule (nightly) or manually: a small live subset against the real model, with the API key supplied as a CI secret (never committed to a file in the repo). This catches real model-behaviour drift without making every commit slow and expensive. The sample workflow shows this as an optional second job.

This split is the practical heart of EDD for LLM apps: gate cheaply and deterministically on every change, verify expensively and realistically on a cadence.
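One way to implement the split is a single switch that picks between recorded responses and the live model. This is a sketch under assumptions: the environment variable names EVAL_MODE and MODEL_API_KEY, the FIXTURES dictionary, and the function names are all invented here, and live_model is deliberately left as a placeholder because the real call depends on your provider's client.

```python
import os

# Hypothetical recorded responses keyed by prompt. In practice these
# might be loaded from a checked-in JSON file.
FIXTURES = {
    "How do I get a refund?": "You can request a refund within 30 days.",
}


def recorded_model(prompt):
    """Deterministic and free: replay a recorded response."""
    return FIXTURES[prompt]


def live_model(prompt):
    """Placeholder for a real model call; wire up your own client here."""
    raise NotImplementedError("call your model provider here")


def choose_model(env=None):
    """Default to fixtures; go live only when explicitly asked to."""
    env = os.environ if env is None else env
    if env.get("EVAL_MODE") == "live":
        if not env.get("MODEL_API_KEY"):
            raise RuntimeError(
                "live evals need MODEL_API_KEY, supplied as a CI secret"
            )
        return live_model
    return recorded_model
```

Defaulting to fixtures means a push can never hit the paid API by accident. Only the scheduled job, which sets both variables, takes the live path.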

5. Track quality over time

A single pass rate is a snapshot; the trend is the story. run_evals.py appends each run's rate to a small history file, so you can see quality climbing as you add features and dipping when something regresses. In a real setup you would send this to a dashboard (and the observability tools from M20 often host eval results too). Seeing the trend turns evals from a gate into a quality signal you manage.
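A history file can be as simple as one JSON object per line. The sketch below assumes a file name (eval_history.jsonl) and a record shape (timestamp plus pass rate) chosen for illustration; the course's run_evals.py may store its history differently.

```python
import json
import time
from pathlib import Path


def record_run(pass_rate, path="eval_history.jsonl"):
    """Append one run's pass rate to the history file."""
    entry = {"timestamp": time.time(), "pass_rate": pass_rate}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def load_rates(path="eval_history.jsonl"):
    """Return the recorded pass rates, oldest first."""
    history = Path(path)
    if not history.exists():
        return []
    with history.open(encoding="utf-8") as f:
        return [json.loads(line)["pass_rate"] for line in f if line.strip()]
```

Appending rather than overwriting is the point: each run adds a line, so load_rates gives you the trend and not just the latest snapshot.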

6. Thresholds and flakiness

Two cautions. First, set the threshold deliberately: 100% is right when every case is critical, but if a case is inherently fuzzy, a too-strict gate trains people to ignore it. Second, a gate that fails randomly is worse than no gate, because people stop trusting it; keep CI deterministic (section 4) so a red build always means a real problem.

Words you will hear

Evaluation-driven development (EDD), regression test, eval gate / threshold, exit code, continuous integration (CI), GitHub Actions, fixtures / recorded responses, CI secret, quality tracking, flaky test. Full definitions in the glossary.