
M34 solution: Part E capstone, the on-call shift

One integrated on-call drill that runs M31 + M32 + M33 over a single incident, plus an eval gate that scores the whole shift. Offline, deterministic, no API key.

Files

| File | Role |
| --- | --- |
| parts.py | The Part E toolkit in one place: compact, labeled versions of the primitives from M31 (SLO/burn/alert/Incident/postmortem), M32 (triage/route/correlate), and M33 (ReleaseManager: canary/rollback). Each block names its source module. |
| drill.py | run(): the shift as a numbered story: a bad deploy → storm correlated → sev1 page → linked ticket → rollback → postmortem + regression → canaried fix. Returns a structured result + a trace. |
| demo.py | Narrates the shift in sections A–G, each labeled with the owning module. Start here. |
| evals.py | The eval gate (M26): seven checks over the shift; exits 0 if all hold, non-zero otherwise. |
| ../starters/extend.py | Your turn: add a second incident type (e.g. a stale-index outage fixed by reindex) and a matching check. |
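
For orientation, the correlate primitive that parts.py carries over from M32 can be pictured as grouping alerts by a shared fingerprint. This is a minimal sketch, not the code in parts.py: the `Alert` shape, the `(service, symptom)` fingerprint, the service names, and the synthetic storm are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical alert shape: (service, symptom, timestamp_seconds).
Alert = Tuple[str, str, int]
Fingerprint = Tuple[str, str]


def correlate(alerts: List[Alert]) -> Dict[Fingerprint, List[Alert]]:
    """Group alerts that share a (service, symptom) fingerprint into one incident each."""
    incidents: Dict[Fingerprint, List[Alert]] = defaultdict(list)
    for service, symptom, ts in alerts:
        incidents[(service, symptom)].append((service, symptom, ts))
    return dict(incidents)


# Synthetic storm (assumed data): 5 fingerprints firing 8 times each = 40 alerts.
FINGERPRINTS: List[Fingerprint] = [
    ("api", "quality_drop"),
    ("api", "latency"),
    ("search", "error_rate"),
    ("billing", "timeout"),
    ("auth", "error_rate"),
]
storm: List[Alert] = [(svc, sym, t) for t in range(8) for (svc, sym) in FINGERPRINTS]

incidents = correlate(storm)
print(f"{len(storm)} alerts -> {len(incidents)} incidents")
```

A real correlator would likely also bound each group by a time window; that is left out here to keep the grouping idea visible.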

Run it

# the shift, narrated (offline, free, instant):
python demo.py

# the eval gate over the whole shift (exit 0 = good, 1 = a regression):
python evals.py ; echo "exit: $?"
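
The exit-code contract above is what lets a CI job or shell script treat the gate as pass/fail. A minimal sketch of that pattern follows; it is not the contents of evals.py. The `gate` function, the `shift` dictionary, and the three checks are hypothetical stand-ins (the real gate runs seven checks over the output of drill.py).

```python
from typing import Callable, Dict


def gate(checks: Dict[str, Callable[[], bool]]) -> int:
    """Run every check, print one line per check, and return a process exit code."""
    passed = 0
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that crashes counts as a failure
        passed += int(ok)
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
    print(f"{passed}/{len(checks)} checks passed")
    return 0 if passed == len(checks) else 1


# Hypothetical shift result; the real structure returned by run() is not shown here.
shift = {"incidents": 5, "page": "sev1", "live": "v3-fix"}

checks: Dict[str, Callable[[], bool]] = {
    "storm correlated": lambda: shift["incidents"] == 5,
    "sev1 paged": lambda: shift["page"] == "sev1",
    "fix is live": lambda: shift["live"] == "v3-fix",
}

exit_code = gate(checks)  # a script would finish with sys.exit(exit_code)
```

Returning the code from a function, rather than calling `sys.exit` inside it, keeps the gate importable and easy to test.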

The story it tells (which module owns each stage)

  1. Intake (M32): 40 alerts correlate into 5 incidents; a user ticket is triaged (sev2 → L2) and linked.
  2. Detect (M31): the live release regressed quality to 80% vs a 99% SLO; the 20x burn pages sev1.
  3. Mitigate (M31 → M33): the runbook's first play is a rollback to the last-good release; SLI 80% → 100%.
  4. Learn (M31 → M26): a blameless postmortem becomes a regression eval case for the gate.
  5. Ship the fix (M33): the corrected release is canaried against live and promoted.
  6. Verify (M26): the gate scores all of the above and exits non-zero if any step slipped.
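
The 20x figure in step 2 follows from the usual burn-rate definition: observed error rate divided by the error budget the SLO allows. With an 80% SLI against a 99% SLO that is 20% / 1% = 20. A small sketch of the arithmetic, assuming that definition; the function names and the paging threshold of 10 are hypothetical and not taken from parts.py.

```python
def burn_rate(sli: float, slo: float) -> float:
    """Error-budget burn rate: observed error rate over the allowed error rate."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("SLO must be below 100% to leave an error budget")
    return (1.0 - sli) / budget


# Hypothetical paging threshold, chosen only for this example.
SEV1_BURN_THRESHOLD = 10.0


def severity(burn: float) -> str:
    """Map a burn rate to a page severity under the assumed threshold."""
    return "sev1" if burn >= SEV1_BURN_THRESHOLD else "no-page"


burn = burn_rate(sli=0.80, slo=0.99)
print(f"burn ~ {burn:.1f}x -> {severity(burn)}")
```

A burn rate of 1 means the budget is being spent exactly as fast as the SLO permits; anything well above that exhausts the budget early, which is why a 20x burn is treated as page-worthy.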

Verified (offline)

  • demo.py runs end to end and is deterministic: 40 alerts → 5 incidents; sev1 page at a 20x burn; rollback recovers the SLI 80% → 100%; the fix canary returns promote; final live = v3-fix.
  • evals.py returns 7/7, exit 0. Break the fix on purpose (make v3_fix answer refunds wrong) and the canary rejects it, so it is never promoted, and the gate drops to 6/7, exit 1.
  • parts.py and drill.py are dependency-free and import without an API key. The drill integrates M31–M33; the gate follows the M26 pattern; the service under operation stands in for the M27 agent.
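
The promote/reject behavior described above can be sketched as a comparison of the candidate against the live release on a shared set of eval cases. This is an illustration under stated assumptions, not the ReleaseManager from parts.py: the `Release` type, `make_release`, the two eval cases, and the "candidate must not score below live" rule are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical: a release is a function from a question to an answer.
Release = Callable[[str], str]

# Hypothetical eval cases: question -> expected answer.
CASES: Dict[str, str] = {
    "refund window?": "30 days",
    "shipping time?": "3-5 days",
}


def make_release(answers: Dict[str, str]) -> Release:
    """Build a toy release that answers from a fixed lookup table."""
    return lambda question: answers.get(question, "")


def accuracy(release: Release, cases: Dict[str, str]) -> float:
    """Fraction of cases the release answers exactly right."""
    return sum(release(q) == want for q, want in cases.items()) / len(cases)


def canary(candidate: Release, live: Release, cases: Dict[str, str]) -> str:
    """Promote only if the candidate scores at least as well as live."""
    return "promote" if accuracy(candidate, cases) >= accuracy(live, cases) else "reject"


last_good = make_release(dict(CASES))
v3_fix = make_release(dict(CASES))
# Deliberately broken variant: answers the refund question wrong.
v3_fix_broken = make_release({**CASES, "refund window?": "no refunds"})

print(canary(v3_fix, last_good, CASES))
print(canary(v3_fix_broken, last_good, CASES))
```

Because a rejected candidate never replaces live, a gate check such as "the fix is live" then fails as well, which matches the exit 1 behavior described above.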