M26 solution: evaluation-driven development and CI

Turns M20's eval suite into an automatic gate: a checked-in test set, a runner whose exit code passes or fails the build, and a GitHub Actions workflow that runs it on every push and pull request, so a regression is caught before it reaches main. The core is deterministic and runs with no API key and no spend.

Files

File	Role
`app.py`	The system under test, a tiny deterministic FAQ agent. A `buggy=True` flag simulates a regression. (Stands in for your real agent; in CI you would run the real agent against a mock or recorded responses.)
`evalset.py`	The versioned golden `CASES`, `run_suite`, the `THRESHOLD`, and `gate(...)` (pass/fail against the threshold).
`run_evals.py`	The command CI runs: prints a scorecard, records the pass rate to `eval_history.json`, and exits 0 (gate passed) or 1 (gate failed).
`evals-ci.yml`	A sample GitHub Actions workflow. Copy to `.github/workflows/evals.yml` in YOUR project; it runs the gate on every push and pull request.
`demo.py`	Runs the gate on the good app and the buggy app so you can see it flip. Start here.
`../starters/add_eval_case.py`	Practice "every bug becomes a test".
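
The pieces listed for `evalset.py` might fit together roughly as follows. This is a minimal sketch, not the lab's actual code: the case fields (`question`, `expect`), the toy `answer` function standing in for `app.py`, the threshold value of 1.0, and the signatures of `run_suite` and `gate` are all assumptions made for illustration.

```python
# Hypothetical sketch of a golden set and gate. Names, fields and the
# threshold value are assumed, not copied from the lab's evalset.py.

def answer(question: str, buggy: bool = False) -> str:
    """Toy deterministic FAQ agent standing in for app.py."""
    faq = {
        "how do i reset my password": "Use the reset link on the login page.",
        "what are your hours": "We are open 9 to 5, Monday to Friday.",
        "how do i contact support": "Email support@example.com.",
    }
    key = question.lower().strip("?! .")
    if buggy and "reset" in key:
        return "Sorry, I do not know."  # simulated regression
    return faq.get(key, "Sorry, I do not know.")


# Versioned golden cases: each pairs an input with a substring the reply must contain.
CASES = [
    {"question": "How do I reset my password?", "expect": "reset link"},
    {"question": "What are your hours?", "expect": "9 to 5"},
    {"question": "How do I contact support?", "expect": "support@example.com"},
]

THRESHOLD = 1.0  # assumed: every case must pass


def run_suite(buggy: bool = False) -> list:
    """Run every case and return (question, passed) pairs."""
    results = []
    for case in CASES:
        reply = answer(case["question"], buggy=buggy)
        results.append((case["question"], case["expect"].lower() in reply.lower()))
    return results


def gate(results: list, threshold: float = THRESHOLD) -> tuple:
    """Return (gate_passed, pass_rate) for a list of (question, passed) pairs."""
    rate = sum(ok for _, ok in results) / len(results)
    return rate >= threshold, rate


if __name__ == "__main__":
    for buggy in (False, True):
        ok, rate = gate(run_suite(buggy=buggy))
        print(f"buggy={buggy}: pass rate {rate:.3f}, gate {'passed' if ok else 'failed'}")
```

Under these assumptions, adding a case for a newly found bug is one more dictionary appended to `CASES`; the gate picks it up on the next run without any other change.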

Run it

python demo.py                       # see the gate pass, then fail on a regression
python run_evals.py ; echo $?        # the CI command: exit 0 = pass
python run_evals.py --buggy ; echo $?  # exit 1 = fail (CI blocks the merge)

How it works

The exit code is the hinge. CI passes a build when its command exits 0 and fails it on non-zero. run_evals.py exits 0 when gate(...) passes and 1 when it fails, so wiring it into CI needs nothing more than running that command.
The eval set is versioned and grows. Cases live in evalset.py in the repo; you add one for every bug you find (the lab challenge), so the suite accumulates real-world coverage over time.
CI is deterministic on purpose. The gate uses a deterministic app (or, for a real agent, a mock or recorded responses) so it is fast, free, and repeatable. Live model evals belong on a schedule with the key as a CI secret (the optional second job in evals-ci.yml), not on every commit.
Quality is tracked. Each run appends its pass rate to eval_history.json, turning a snapshot into a trend.
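
The exit-code and history points above can be sketched together. This is an illustrative outline under stated assumptions, not the contents of run_evals.py: the helper names `exit_code`, `record_rate` and `main`, the rounding to three decimals, and the plain JSON list layout of the history file are all hypothetical.

```python
import json
from pathlib import Path


def exit_code(results: list, threshold: float) -> tuple:
    """Map per-case booleans to (process exit code, pass rate)."""
    rate = sum(results) / len(results)
    return (0 if rate >= threshold else 1), rate


def record_rate(history_path: Path, rate: float) -> list:
    """Append one pass rate to a JSON list on disk and return the full list."""
    history = json.loads(history_path.read_text()) if history_path.exists() else []
    history.append(round(rate, 3))  # assumed rounding, for a readable trend
    history_path.write_text(json.dumps(history))
    return history


def main(results: list,
         history_path: Path = Path("eval_history.json"),
         threshold: float = 1.0) -> int:
    """Score one run, record it, and return the code a CI step would exit with."""
    code, rate = exit_code(results, threshold)
    record_rate(history_path, rate)
    print(f"pass rate {rate:.3f} -> exit {code}")
    return code

# A real runner would finish with something like:
#     raise SystemExit(main(per_case_results))
# so that the CI step fails whenever the gate fails.
```

Returning the code from `main` rather than exiting inside it keeps the function easy to test; only the final line of the script needs to turn it into the process exit status.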

Verified (offline)

demo.py: the good app passes 3/3 (gate allows merge); the buggy app fails the reset case and scores 2/3 (gate blocks merge).
run_evals.py exits 0 on the good app and 1 on the buggy app, the behaviour CI depends on.
eval_history.json records each run's rate (for example [1.0, 0.667]).
evals-ci.yml is valid YAML.
All .py files compile. No key needed; the optional live path reuses the M4 key via a CI secret.