M34 solution: Part E capstone, the on-call shift
One integrated on-call drill that runs M31 + M32 + M33 over a single incident, plus an eval gate that scores the whole shift. Offline, deterministic, no API key.
Files
| File | Role |
|---|---|
| `parts.py` | The Part E toolkit in one place: compact, labeled versions of the primitives from M31 (SLO/burn/alert/Incident/postmortem), M32 (triage/route/correlate), and M33 (ReleaseManager: canary/rollback). Each block names its source module. |
| `drill.py` | `run()`: the shift as a numbered story: a bad deploy → storm correlated → sev1 page → linked ticket → rollback → postmortem + regression → canaried fix. Returns a structured result + a trace. |
| `demo.py` | Narrates the shift in sections A–G, each labeled with the owning module. Start here. |
| `evals.py` | The eval gate (M26): seven checks over the shift; exits 0 if all hold, non-zero otherwise. |
| `../starters/extend.py` | Your turn: add a second incident type (e.g. a stale-index outage fixed by reindex) and a matching check. |
Run it
```bash
# the shift, narrated (offline, free, instant):
python demo.py
# the eval gate over the whole shift (exit 0 = good, 1 = a regression):
python evals.py ; echo "exit: $?"
```
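The exit-code contract above (0 when every check holds, 1 on a regression) can be sketched as follows. This is a minimal illustration, not the contents of evals.py: `run_gate`, the `shift` dictionary, and the three checks are hypothetical stand-ins for the seven real checks and the structured result that drill.py returns.

```python
from typing import Callable, Dict

Check = Callable[[Dict], bool]

def run_gate(result: Dict, checks: Dict[str, Check]) -> int:
    """Run every check against the shift result and return a process exit code."""
    passed = 0
    for name, check in checks.items():
        ok = bool(check(result))
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
    print(f"{passed}/{len(checks)} checks passed")
    # 0 only if every check held; any failure is reported as a regression.
    return 0 if passed == len(checks) else 1

# Hypothetical result shape; the real drill's structure is not shown here.
shift = {"alerts": 40, "incidents": 5, "page_severity": "sev1", "final_live": "v3-fix"}

# Hypothetical checks, one per stage of the story.
CHECKS: Dict[str, Check] = {
    "storm correlated": lambda r: r["incidents"] < r["alerts"],
    "sev1 paged": lambda r: r["page_severity"] == "sev1",
    "fix promoted": lambda r: r["final_live"] == "v3-fix",
}

exit_code = run_gate(shift, CHECKS)
# In a script, hand the code to the shell with: raise SystemExit(exit_code)
```

Returning the code from a function, rather than exiting inside it, keeps the gate importable and easy to test.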
The story it tells (which module owns each stage)
- Intake (M32): 40 alerts correlate into 5 incidents; a user ticket is triaged (sev2 → L2) and linked.
- Detect (M31): the live release regressed quality to 80% vs a 99% SLO; the 20x burn pages sev1.
- Mitigate (M31 → M33): the runbook's first play is a rollback to the last-good release; SLI 80% → 100%.
- Learn (M31 → M26): a blameless postmortem becomes a regression eval case for the gate.
- Ship the fix (M33): the corrected release is canaried against live and promoted.
- Verify (M26): the gate scores all of the above and exits non-zero if any step slipped.
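The 20x figure in the Detect stage follows from the usual burn-rate definition: the observed error rate divided by the error budget the SLO allows. With an 80% SLI against a 99% SLO that is 20% / 1% = 20. A small sketch of the arithmetic, assuming that definition; `burn_rate`, `page_for`, and the paging threshold are hypothetical names and values, not those in parts.py:

```python
from fractions import Fraction
from typing import Optional

def burn_rate(sli: Fraction, slo: Fraction) -> Fraction:
    """Observed error rate divided by the error budget the SLO allows."""
    error_budget = 1 - slo
    if error_budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget to burn")
    return (1 - sli) / error_budget

# Hypothetical paging threshold; the drill's actual value is not stated.
PAGE_SEV1_AT = Fraction(10)

def page_for(burn: Fraction) -> Optional[str]:
    """Return the page severity for a burn rate, or None if nobody is paged."""
    return "sev1" if burn >= PAGE_SEV1_AT else None

# Fractions keep the arithmetic exact; floats would give 19.99999... here.
during = burn_rate(Fraction("0.80"), Fraction("0.99"))
after_rollback = burn_rate(Fraction("1.00"), Fraction("0.99"))

print(during, page_for(during))                  # 20 sev1
print(after_rollback, page_for(after_rollback))  # 0 None
```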
Verified (offline)
`demo.py` runs end to end and is deterministic: 40 alerts → 5 incidents; sev1 page at a 20x burn; rollback recovers the SLI 80% → 100%; the fix canary returns `promote`; final live = `v3-fix`. `evals.py` returns 7/7, exit 0. Break the fix on purpose (make `v3_fix` answer refunds wrong) and the canary rejects it, so it is never promoted, and the gate drops to 6/7, exit 1. `parts.py` and `drill.py` are dependency-free and import without a key. Integrates M31–M33; the gate is the M26 pattern; the service under operation stands in for the M27 agent.
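The canary behavior described above (a good fix is promoted, a deliberately broken one is rejected and never goes live) can be illustrated with a short sketch. It assumes a canary that compares pass rates on a fixed set of cases and promotes only if the candidate is at least as good as live. The golden cases, the release functions, and the `reject` verdict name are hypothetical; the real ReleaseManager in parts.py may decide differently.

```python
from typing import Callable, Dict

Release = Callable[[str], str]

# Hypothetical golden cases: question -> expected answer.
GOLDEN: Dict[str, str] = {
    "refund policy?": "30 days",
    "shipping time?": "3-5 days",
}

def pass_rate(release: Release, cases: Dict[str, str]) -> float:
    """Fraction of cases the release answers exactly as expected."""
    hits = sum(release(question) == want for question, want in cases.items())
    return hits / len(cases)

def canary(candidate: Release, live: Release, cases: Dict[str, str]) -> str:
    """Promote only if the candidate does at least as well as live."""
    if pass_rate(candidate, cases) >= pass_rate(live, cases):
        return "promote"
    return "reject"

def last_good(question: str) -> str:
    # Stand-in for the rolled-back release, which answers every case correctly.
    return GOLDEN[question]

def v3_fix(question: str) -> str:
    # Stand-in for the corrected release.
    return GOLDEN[question]

def v3_fix_broken(question: str) -> str:
    # The fix broken on purpose: refund questions get a wrong answer.
    return "no refunds" if "refund" in question else GOLDEN[question]

print(canary(v3_fix, last_good, GOLDEN))         # promote
print(canary(v3_fix_broken, last_good, GOLDEN))  # reject
```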