M34 notes: Part E capstone (the one idea)

The one idea: operations support is not a list of separate tools. It is one loop that runs on every production incident: detect → mitigate → learn → ship the fix, then watch to make sure the fix held. M31, M32, and M33 each owned one arc of that loop; the capstone closes it. The mark of a mature ops team is not that nothing breaks. It is that when something breaks, this loop turns a crisis into a routine and leaves the system a little safer than before.

1. The shift as a pipeline

Read the whole drill as one pipeline, each stage owned by a module you built:

Stage	Owner	What happens
Intake	M32	40 alerts correlate into a few incidents; a user ticket is triaged and linked
Detect	M31	the SLI is below the SLO, the burn rate pages sev1, an incident opens
Mitigate	M31 → M33	the runbook's fastest play is a rollback to the last-good release
Learn	M31 → M26	a blameless postmortem becomes a regression eval case
Ship the fix	M33	the corrected release is canaried against live and promoted
Verify	M26	an eval gate scores the entire shift and exits non-zero if anything slipped

2. Intake: turn floods into a short list (M32)

Two inboxes fill at once. The alert storm (one bad deploy trips dozens of alerts) is collapsed by the AIOps correlate step into a handful of incidents, so you page on the cause rather than on each symptom. The user ticket ("wrong answers since this morning") is triaged and routed and, crucially, linked to the same incident, so the human side and the machine side are one story, not two.
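To make the correlation step concrete, here is a minimal sketch, not the M32 implementation. It assumes each alert carries a deploy_id field naming the deploy it is attributed to; that field, the Alert and CorrelatedIncident classes, and the 38/2 split of the storm are all hypothetical and only illustrate the idea of grouping by cause.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Alert:
    name: str
    service: str
    deploy_id: str  # hypothetical field: the deploy this alert is attributed to


@dataclass
class CorrelatedIncident:
    cause: str
    alerts: list = field(default_factory=list)
    tickets: list = field(default_factory=list)


def correlate(alerts):
    """Group alerts that share a suspected cause into one incident each."""
    by_cause = defaultdict(list)
    for alert in alerts:
        by_cause[alert.deploy_id].append(alert)
    return [CorrelatedIncident(cause=c, alerts=a) for c, a in by_cause.items()]


# Illustrative storm: 38 alerts from one bad deploy, 2 from an unrelated one.
storm = [Alert(f"alert-{i}", "answers", "deploy-v2") for i in range(38)]
storm += [Alert(f"disk-{i}", "batch", "deploy-b7") for i in range(2)]

incidents = correlate(storm)

# Link the user ticket to the incident with the same cause, so the
# human report and the machine alerts share one record.
incidents[0].tickets.append("wrong answers since this morning")

print(len(storm), "alerts ->", len(incidents), "incidents")
```

Grouping on a single key is the simplest possible correlation rule; a real correlator would likely also use time windows and service topology.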

3. Detect: the page is earned, not noisy (M31)

The service is live on a release that regressed quality, so its SLI sits at 80% against a 99% SLO. That is a 20x burn: 20% bad responses against a 1% error budget. The two-window alert fires a sev1 page, and an Incident opens with a timeline. Note the order that follows: you mitigate before you diagnose.
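The 20x figure and the two-window rule can be worked through in a few lines. This is a sketch under assumptions: the function names are hypothetical, and the 14.4 threshold is a commonly cited fast-burn value, not necessarily the one M31 uses.

```python
def burn_rate(sli: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    error_rate = 1.0 - sli      # fraction of bad responses observed
    error_budget = 1.0 - slo    # fraction of bad responses the SLO allows
    return error_rate / error_budget


def should_page(long_window_sli, short_window_sli, slo, threshold=14.4):
    """Two-window rule: page only if BOTH windows burn faster than threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening now. The threshold here is an assumption.
    """
    return (burn_rate(long_window_sli, slo) > threshold
            and burn_rate(short_window_sli, slo) > threshold)


rate = burn_rate(sli=0.80, slo=0.99)
print(f"burn rate: {rate:.0f}x")

# Sustained regression: both windows are bad, so the page fires.
print("page sev1:", should_page(0.80, 0.80, slo=0.99))

# Brief blip: the short window is bad but the long window is healthy, no page.
print("page on a brief blip:", should_page(0.995, 0.80, slo=0.99))
```

Requiring both windows to agree is what makes the page "earned": a short spike alone does not wake anyone up.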

4. Mitigate: rollback is the fastest safe play (M31 → M33)

The runbook for bad_release has one obvious first step: roll back to the last-good version (M33's ReleaseManager.rollback). The moment it runs, the SLI recovers from 80% to ~100%. The root cause investigation can take hours; the users are healthy again in seconds. A rollback you can do in one call is worth more than a forward-fix you are still writing.
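As a rough illustration of why a one-call rollback is so cheap, here is a toy stand-in for the release manager. Only the names ReleaseManager and rollback come from these notes; the internals, the deploy method, the version label v1 for the last-good release, and the per-version SLI table are assumptions made for the sketch.

```python
class ReleaseManager:
    """Toy stand-in for M33's ReleaseManager; the internals are assumed."""

    def __init__(self):
        self.history = []       # versions in the order they went live
        self.known_bad = set()  # versions we have rolled back from

    @property
    def live(self):
        return self.history[-1] if self.history else None

    def deploy(self, version):
        self.history.append(version)

    def rollback(self):
        """Mark the live version bad and return to the most recent good one."""
        self.known_bad.add(self.live)
        for version in reversed(self.history[:-1]):
            if version not in self.known_bad:
                self.history.append(version)
                return version
        raise RuntimeError("no last-good release to roll back to")


# Simulated quality per version (illustrative numbers from the drill).
SLI_BY_VERSION = {"v1": 1.00, "v2": 0.80}

rm = ReleaseManager()
rm.deploy("v1")
rm.deploy("v2")  # the release that regressed quality
print("before:", rm.live, SLI_BY_VERSION[rm.live])

rm.rollback()    # the runbook's first step: one call, no diagnosis needed
print("after: ", rm.live, SLI_BY_VERSION[rm.live])
```

The rollback never needs to know why v2 is bad, which is the point: mitigation is decoupled from diagnosis.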

Analogy. A pilot who loses an engine does not first work out why it failed. They run the checklist that keeps the plane flying (M31's runbook), divert to the nearest airport (M33's rollback), and only then, on the ground, investigate and file the report (the postmortem) that changes the maintenance schedule (the regression test). Operate first, diagnose second, learn always.

5. Learn: the incident becomes a test (M31 → M26)

write_postmortem renders the timeline into a blameless write-up (the root cause here is a system gap, "shipped without a canary", not a person). to_eval_case turns it into a regression case for the gate. This is the step that makes the system get safer over time: the exact bug that hurt users is now something CI will catch before it ships again.
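A minimal sketch of the two steps, assuming simplified signatures: the function names write_postmortem and to_eval_case come from these notes, but the Incident fields, the timeline entries, and the shape of the eval case are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    id: str
    root_cause: str   # a system gap, never a person
    bad_release: str
    timeline: list    # (timestamp, event) pairs


def write_postmortem(incident):
    """Render the timeline into a blameless write-up (plain text)."""
    lines = [
        f"Postmortem {incident.id} (blameless)",
        f"Root cause: {incident.root_cause}",
        "Timeline:",
    ]
    lines += [f"  {ts}  {event}" for ts, event in incident.timeline]
    return "\n".join(lines)


def to_eval_case(incident):
    """Turn the incident into a regression case the eval gate can run."""
    return {
        "name": f"regression-{incident.id}",
        "must_fail_release": incident.bad_release,  # the gate must block this
        "reason": incident.root_cause,
    }


# Illustrative incident; the timestamps are made up for the sketch.
incident = Incident(
    id="inc-001",
    root_cause="shipped without a canary",
    bad_release="v2",
    timeline=[("t+0m", "sev1 page fired"),
              ("t+1m", "rolled back to last-good release"),
              ("t+2m", "SLI recovered")],
)

print(write_postmortem(incident))
case = to_eval_case(incident)
print(case)
```

Note that the write-up names a gap in the process, and the eval case names a release; neither names a person.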

6. Ship the fix: never re-break what you just fixed (M33)

The corrected release is not pushed straight to everyone; that is how the bad release, v2, got out in the first place. It is canaried against the live baseline on the eval set (now including the new regression case). Because it holds quality, it is promoted. Had the "fix" reintroduced the bug, the canary would reject it and live would stay put.
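The canary decision can be sketched as a score comparison on the eval set. This is an assumption-laden toy, not M33's canary: releases are modeled as dicts of answers, and the case names, the v3 label for the fix, and the score and canary helpers are all hypothetical.

```python
def score(release_answers, eval_set):
    """Fraction of eval cases the release answers correctly."""
    passed = sum(1 for case, expected in eval_set.items()
                 if release_answers.get(case) == expected)
    return passed / len(eval_set)


def canary(candidate, baseline, eval_set, tolerance=0.0):
    """Promote only if the candidate scores at least as well as live."""
    return score(candidate, eval_set) >= score(baseline, eval_set) - tolerance


# Eval set, now including the regression case from the postmortem.
eval_set = {
    "case-a": "expected-a",
    "case-b": "expected-b",
    "regression-inc-001": "expected-correct-answer",
}

live_v1 = dict(eval_set)                                   # last-good baseline
fixed_v3 = dict(eval_set)                                  # a fix that holds quality
broken_fix = {**eval_set, "regression-inc-001": "wrong"}   # a "fix" that re-breaks

live = "v1"
if canary(fixed_v3, live_v1, eval_set):
    live = "v3"  # promoted
print("live after real fix:", live)

# Had the candidate reintroduced the bug, the canary would reject it.
print("broken fix promoted?", canary(broken_fix, live_v1, eval_set))
```

The regression case is what gives the canary teeth here: without it, the broken fix would score the same as the baseline.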

7. Verify: the gate scores the whole shift (M26)

evals.py is the difference between "the demo looked good" and "this behaved correctly." It asserts seven things: the storm was reduced, the page was sev1, the ticket was routed, the incident resolved, the SLI recovered within SLO, the regression case actually catches the bad release (proof the gate would now block that bug), and the fix was promoted. It exits non-zero if any of them fails. That exit code is what you would wire into CI (M26).
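A gate of this shape can be sketched as a table of named checks that maps to an exit code. This is not the actual evals.py: the run_gate function and the keys of the shift dict are hypothetical, and the values are illustrative.

```python
def run_gate(shift):
    """Score a finished shift; return 0 if every check passes, else 1."""
    checks = {
        "storm reduced": shift["incidents"] < shift["alerts"],
        "page was sev1": shift["page_severity"] == "sev1",
        "ticket routed": shift["ticket_routed"],
        "incident resolved": shift["incident_resolved"],
        "SLI recovered within SLO": shift["final_sli"] >= shift["slo"],
        "regression case catches bad release": shift["regression_blocks_bad"],
        "fix promoted": shift["fix_promoted"],
    }
    for name, ok in checks.items():
        print("PASS" if ok else "FAIL", name)
    return 0 if all(checks.values()) else 1


# Illustrative summary of a shift; field names are assumptions.
shift = {
    "alerts": 40,
    "incidents": 2,
    "page_severity": "sev1",
    "ticket_routed": True,
    "incident_resolved": True,
    "final_sli": 1.0,
    "slo": 0.99,
    "regression_blocks_bad": True,
    "fix_promoted": True,
}

exit_code = run_gate(shift)
print("exit code:", exit_code)

# In a script run by CI, the last line would be:
#     sys.exit(run_gate(shift))
```

Returning the code from a function, and calling sys.exit only at the script boundary, keeps the gate itself easy to test.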

8. Putting it together

drill.py is the shift; demo.py narrates it; evals.py grades it. None of the pieces are new; you built them all in M31–M33. What is new is seeing them as a single closed loop over one incident, which is exactly what operations support is: not the absence of failure, but a practiced, automated, ever-improving response to it.

Words you will hear

On-call shift, incident pipeline, intake (alerts + tickets), correlate, mitigate before diagnose, rollback, blameless postmortem, regression eval, canary, eval gate / exit code, the ops loop (detect → mitigate → learn → ship). Full definitions in the glossary; the pieces are M31–M33.