M26 solution: evaluation-driven development and CI

Turns M20's eval suite into an automatic gate: a checked-in test set, a runner whose exit code passes or fails the build, and a GitHub Actions workflow that runs it on every push so a regression cannot reach main. The core is deterministic and runs with no API key and no spend.

Files

| File | Role |
| --- | --- |
| app.py | The system under test, a tiny deterministic FAQ agent. A buggy=True flag simulates a regression. (Stands in for your real agent; in CI you would run the real agent against a mock or recorded responses.) |
| evalset.py | The versioned golden CASES, run_suite, the THRESHOLD, and gate(...) (pass/fail against the threshold). |
| run_evals.py | The command CI runs: prints a scorecard, records the pass rate to eval_history.json, and exits 0 (gate passed) or 1 (gate failed). |
| evals-ci.yml | A sample GitHub Actions workflow. Copy to .github/workflows/evals.yml in YOUR project; it runs the gate on every push and pull request. |
| demo.py | Runs the gate on the good app and the buggy app so you can see it flip. Start here. |
| ../starters/add_eval_case.py | Practice "every bug becomes a test". |
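
To make the roles of app.py and evalset.py concrete, here is a minimal sketch of how the pieces could fit together. It is not the lab's actual code: the case contents, the answer() stand-in for the agent, the (passed, rate) return shape of gate, and the 1.0 threshold are all assumptions chosen for illustration. Only the names CASES, run_suite, THRESHOLD, gate and the buggy flag come from the descriptions above.

```python
# Hypothetical sketch; case data, answer(), and the threshold value are assumptions.

THRESHOLD = 1.0  # assumed: every case must pass for the gate to open

# Assumed golden cases: each expects a substring in the agent's answer.
CASES = [
    {"id": "reset", "question": "How do I reset my password?", "expect": "reset link"},
    {"id": "hours", "question": "What are your hours?", "expect": "9am"},
    {"id": "refund", "question": "Can I get a refund?", "expect": "30 days"},
]


def answer(question, buggy=False):
    """Stand-in for app.py: a deterministic keyword lookup."""
    q = question.lower()
    if "reset" in q:
        # buggy=True simulates a regression in the reset answer.
        if buggy:
            return "Please contact support."
        return "Use the reset link on the login page."
    if "hours" in q:
        return "We are open 9am to 5pm."
    if "refund" in q:
        return "Refunds are available within 30 days."
    return "Sorry, I don't know."


def run_suite(cases, buggy=False):
    """Return one (case id, passed) pair per case."""
    return [(c["id"], c["expect"] in answer(c["question"], buggy=buggy)) for c in cases]


def gate(results, threshold=THRESHOLD):
    """Return (gate passed, pass rate) for a list of (id, passed) pairs."""
    rate = sum(ok for _, ok in results) / len(results)
    return rate >= threshold, rate


# "Every bug becomes a test": a newly found bug is pinned down as one more case.
CASES_WITH_NEW = CASES + [
    {"id": "unknown", "question": "Do you sell pets?", "expect": "don't know"},
]
```

Calling gate(run_suite(CASES)) on the good app opens the gate, while gate(run_suite(CASES, buggy=True)) closes it because the reset case no longer matches. Substring matching is only one possible scoring rule; the real suite may grade answers differently.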

Run it

python demo.py                       # see the gate pass, then fail on a regression
python run_evals.py ; echo $?        # the CI command: exit 0 = pass
python run_evals.py --buggy ; echo $?  # exit 1 = fail (CI blocks the merge)

How it works

  • The exit code is the hinge. CI passes a build when its command exits 0 and fails it on any non-zero code. run_evals.py exits 0 or 1 according to gate(...), so wiring it into CI takes nothing more than running that command.
  • The eval set is versioned and grows. Cases live in evalset.py in the repo; you add one for every bug you find (the lab challenge), so the suite accumulates real-world coverage over time.
  • CI is deterministic on purpose. The gate uses a deterministic app (or, for a real agent, a mock or recorded responses) so it is fast, free, and repeatable. Live model evals belong on a schedule with the key as a CI secret (the optional second job in evals-ci.yml), not on every commit.
  • Quality is tracked. Each run appends its pass rate to eval_history.json, turning a snapshot into a trend.
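
The exit-code and history behaviour described above can be sketched in a few lines. This is an illustration under assumptions, not the contents of run_evals.py: the function name record_and_exit_code, its parameters, the rounding to three decimals, and the plain JSON list layout of the history file are all hypothetical.

```python
import json
from pathlib import Path


def record_and_exit_code(passed, total, threshold=1.0, history_path="eval_history.json"):
    """Append this run's pass rate to the history file and return an exit code.

    Hypothetical helper: the name, signature, and file layout are assumptions.
    """
    rate = passed / total
    path = Path(history_path)
    # Load the existing trend, or start a new one on the first run.
    history = json.loads(path.read_text()) if path.exists() else []
    history.append(round(rate, 3))
    path.write_text(json.dumps(history))
    print(f"scorecard: {passed}/{total} passed (threshold {threshold})")
    # 0 tells CI the gate passed; 1 tells CI to fail the build.
    return 0 if rate >= threshold else 1


# In a script, the return value would be handed to the process exit, e.g.:
#     import sys
#     sys.exit(record_and_exit_code(passed=3, total=3))
```

Returning the code from a function and calling sys.exit only at the script's entry point keeps the logic importable and testable without terminating the interpreter.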

Verified (offline)

  • demo.py: the good app passes 3/3 (gate allows merge); the buggy app fails the reset case and scores 2/3 (gate blocks merge).
  • run_evals.py exits 0 on the good app and 1 on the buggy app, the behaviour CI depends on.
  • eval_history.json records each run's rate (for example [1.0, 0.667]).
  • evals-ci.yml is valid YAML.
  • All .py files compile. No key needed; the optional live path reuses the M4 key via a CI secret.
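
The gate needs no key because it never calls a live model. For a real agent, one way to keep that property is to replay recorded responses. The sketch below assumes a hypothetical ReplayModel class, a hypothetical agent_answer function that takes the model as a parameter, and an invented recorded response; none of these names come from the lab files.

```python
# Hypothetical recorded responses, keyed by the exact prompt that produced them.
RECORDED = {
    "How do I reset my password?": "Use the reset link on the login page.",
}


class ReplayModel:
    """Stands in for a live model client: returns stored text, makes no network call."""

    def __init__(self, recorded):
        self._recorded = dict(recorded)
        self.calls = []  # prompts seen, useful when debugging a failing case

    def complete(self, prompt):
        self.calls.append(prompt)
        if prompt not in self._recorded:
            # Fail loudly so a missing recording cannot pass as a real answer.
            raise KeyError(f"no recorded response for prompt: {prompt!r}")
        return self._recorded[prompt]


def agent_answer(question, model):
    """Hypothetical agent entry point: delegates to whichever model it is given."""
    return model.complete(question)
```

Passing the model in as an argument is the design choice that makes this work: the same agent_answer could be given a live client in the scheduled job and a ReplayModel in the per-commit gate.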