M26 solution: evaluation-driven development and CI
Turns M20's eval suite into an automatic gate: a checked-in test set, a runner whose exit code passes or fails the build, and a GitHub Actions workflow that runs it on every push so a regression cannot reach main. The core is deterministic and runs with no API key and no spend.
Files
| File | Role |
|---|---|
| `app.py` | The system under test: a tiny deterministic FAQ agent. A `buggy=True` flag simulates a regression. (It stands in for your real agent; in CI you would run the real agent against a mock or recorded responses.) |
| `evalset.py` | The versioned golden `CASES`, `run_suite`, the `THRESHOLD`, and `gate(...)` (pass/fail against the threshold). |
| `run_evals.py` | The command CI runs: prints a scorecard, records the pass rate to `eval_history.json`, and exits 0 (gate passed) or 1 (gate failed). |
| `evals-ci.yml` | A sample GitHub Actions workflow. Copy it to `.github/workflows/evals.yml` in your project; it runs the gate on every push and pull request. |
| `demo.py` | Runs the gate on the good app and the buggy app so you can see it flip. Start here. |
| `../starters/add_eval_case.py` | Practice "every bug becomes a test". |
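
To make the roles of `app.py` and `evalset.py` concrete, here is a minimal sketch of how golden cases, a suite runner, and a threshold gate can fit together. The names `CASES`, `run_suite`, `THRESHOLD`, `gate`, and the `buggy` flag come from the table above; the function `answer`, the case contents, the substring check, and the threshold value of 1.0 are assumptions made for illustration and may differ from the real files.

```python
# Hypothetical sketch: bodies and data are assumptions, not the shipped code.

def answer(question: str, buggy: bool = False) -> str:
    """Stand-in deterministic FAQ agent (assumed behaviour)."""
    faq = {
        "hours": "We are open 9 to 5.",
        "reset": "Use the reset link on the login page.",
        "refund": "Refunds take 5 business days.",
    }
    text = question.lower()
    if buggy and "reset" in text:
        return "I don't know."  # simulated regression on one topic
    for key, reply in faq.items():
        if key in text:
            return reply
    return "I don't know."

# Golden cases: each pairs a question with a substring the reply must contain.
CASES = [
    {"id": "hours", "question": "What are your hours?", "expect": "9 to 5"},
    {"id": "reset", "question": "How do I reset my password?", "expect": "reset link"},
    {"id": "refund", "question": "How long do refunds take?", "expect": "5 business days"},
]

THRESHOLD = 1.0  # assumed: every case must pass

def run_suite(buggy: bool = False) -> list:
    """Run every case and return (case id, passed) pairs."""
    results = []
    for case in CASES:
        reply = answer(case["question"], buggy=buggy)
        results.append((case["id"], case["expect"] in reply))
    return results

def gate(results: list, threshold: float = THRESHOLD) -> tuple:
    """Return (gate passed, pass rate) for a list of (id, passed) pairs."""
    rate = sum(ok for _, ok in results) / len(results)
    return rate >= threshold, rate

good_ok, good_rate = gate(run_suite())
bad_ok, bad_rate = gate(run_suite(buggy=True))
print("good:", good_ok, round(good_rate, 3))
print("buggy:", bad_ok, round(bad_rate, 3))
```

Keeping the cases as plain data means adding a regression test is a one-line change to `CASES`, which is the habit the starter exercise practises.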
Run it
python demo.py # see the gate pass, then fail on a regression
python run_evals.py ; echo $? # the CI command: exit 0 = pass
python run_evals.py --buggy ; echo $? # exit 1 = fail (CI blocks the merge)
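
The `echo $?` in the commands above prints the exit status of the command before it, which is the only signal CI reads. The same check can be sketched in Python with `subprocess`. The two inline scripts below are stand-ins that simply exit 0 or 1; they are assumptions used so the example runs anywhere, and in the project you would pass the `run_evals.py` command instead.

```python
import subprocess
import sys

def run_and_report(args: list) -> int:
    """Run a command and return its exit code, as a CI step would observe it."""
    completed = subprocess.run(args)
    return completed.returncode

# Stand-in commands (assumed): tiny scripts that exit 0 or 1.
passing = [sys.executable, "-c", "import sys; sys.exit(0)"]
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]

pass_code = run_and_report(passing)
fail_code = run_and_report(failing)
print("passing command exit code:", pass_code)
print("failing command exit code:", fail_code)
```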
How it works
- The exit code is the hinge. CI passes a build when its command exits 0 and fails it on non-zero. `run_evals.py` does exactly that based on `gate(...)`, so wiring it into CI needs nothing more.
- The eval set is versioned and grows. Cases live in `evalset.py` in the repo; you add one for every bug you find (the lab challenge), so the suite accumulates real-world coverage over time.
- CI is deterministic on purpose. The gate uses a deterministic app (or, for a real agent, a mock or recorded responses) so it is fast, free, and repeatable. Live model evals belong on a schedule with the key as a CI secret (the optional second job in `evals-ci.yml`), not on every commit.
- Quality is tracked. Each run appends its pass rate to `eval_history.json`, turning a snapshot into a trend.
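
The exit-code and history points above can be sketched together as a small runner. This is an illustration under stated assumptions, not the contents of `run_evals.py`: the helper names `record_rate`, `exit_code_for`, and `main`, the rounding to three decimals, and the flat JSON list format are all hypothetical. The demo writes to a temporary directory so it leaves no files behind.

```python
import json
import tempfile
from pathlib import Path

def record_rate(history_path: Path, rate: float) -> list:
    """Append one pass rate to a JSON list on disk and return the history."""
    if history_path.exists():
        history = json.loads(history_path.read_text())
    else:
        history = []
    history.append(round(rate, 3))  # assumed: rates are stored rounded
    history_path.write_text(json.dumps(history))
    return history

def exit_code_for(rate: float, threshold: float) -> int:
    """Map the gate decision to a process exit code: 0 passes, 1 fails."""
    return 0 if rate >= threshold else 1

def main(results: list, history_path: Path, threshold: float = 1.0) -> int:
    """Score a list of booleans, record the rate, and return the exit code."""
    rate = sum(results) / len(results)
    record_rate(history_path, rate)
    return exit_code_for(rate, threshold)

with tempfile.TemporaryDirectory() as tmp:
    history_file = Path(tmp) / "eval_history.json"
    good_code = main([True, True, True], history_file)   # all cases pass
    bad_code = main([True, False, True], history_file)   # one regression
    history = json.loads(history_file.read_text())

print("exit codes:", good_code, bad_code)
print("history:", history)
# A real runner would end with sys.exit(code) so CI sees the result.
```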
Verified (offline)
- `demo.py`: the good app passes 3/3 (gate allows merge); the buggy app fails the `reset` case, 2/3 (gate blocks merge).
- `run_evals.py` exits 0 on the good app and 1 on the buggy app, the behaviour CI depends on.
- `eval_history.json` records each run's rate (for example `[1.0, 0.667]`).
- `evals-ci.yml` is valid YAML.
- All `.py` files compile. No key needed; the optional live path reuses the M4 key via a CI secret.