M31: Incident response and on-call (Part E: Operations Support)

So far, the course has taught you to build AI systems. Part E is the layer that safeguards them once they are running. It starts here, because the first thing operations support owns is the 2am question: the agent you deployed in M27 is now failing for real users, the model provider is throwing 503s, and your phone is buzzing. Without a plan, that is panic. With one, it is routine: you defined "healthy" as a number, an alert told you the budget was burning too fast, a runbook told you exactly what to try, and a postmortem turned the whole mess into a test so it cannot happen quietly again. Today you run that shift end to end, offline.

Today's win: an on-call drill where a simulated outage burns an error budget, pages you, gets triaged, is mitigated by a runbook until the service is healthy again, and ends in a blameless postmortem and a regression test, all demonstrated offline.

Today you will

Turn "healthy" into a number with an SLO, an SLI, and an error budget
Use a burn rate and a two-window alert to decide when to page a human vs open a ticket
Open and triage an incident, recording a timeline as you go
Follow a runbook of safe, reversible mitigations until the service recovers
Write a blameless postmortem and turn the incident into a regression eval (ties to M26)
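The first two goals above can be sketched in a few lines. This is a minimal illustration, not the lab's `oncall.py`: the names `SLO`, `sli`, `burn_rate`, and `decide` are hypothetical, and the thresholds (14.4 for a page, 1.0 for a ticket) are illustrative defaults that your lab may set differently. A burn rate of 1.0 means the error budget is being spent exactly as fast as the SLO allows; the alert fires only when both the long and the short window agree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical SLO: target is the required success ratio, e.g. 0.99."""
    target: float

    @property
    def error_budget(self) -> float:
        # Fraction of requests that are allowed to fail.
        return 1.0 - self.target


def sli(good: int, total: int) -> float:
    """Measured success ratio; an empty window is treated as healthy."""
    return 1.0 if total == 0 else good / total


def burn_rate(slo: SLO, good: int, total: int) -> float:
    """Observed error rate divided by the error rate the budget allows."""
    return (1.0 - sli(good, total)) / slo.error_budget


def decide(slo, long_window, short_window, page_at=14.4, ticket_at=1.0):
    """Two-window alert: both windows must exceed a threshold to fire.

    Each window is a (good, total) pair. The thresholds are assumptions
    chosen for illustration.
    """
    long_burn = burn_rate(slo, *long_window)
    short_burn = burn_rate(slo, *short_window)
    if long_burn >= page_at and short_burn >= page_at:
        return "page"
    if long_burn >= ticket_at and short_burn >= ticket_at:
        return "ticket"
    return "ok"


slo = SLO(target=0.99)
ongoing = decide(slo, long_window=(800, 1000), short_window=(70, 100))
slow_leak = decide(slo, long_window=(980, 1000), short_window=(97, 100))
recovered = decide(slo, long_window=(800, 1000), short_window=(100, 100))
print(ongoing, slow_leak, recovered)
```

The third case shows why the short window matters: the long window still remembers the outage, but the short window is clean, so nobody is paged for a problem that has already stopped.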

Run of show (about 60 minutes)

Time	What we do
0:00	Hook: failure in production is a when, not an if; operations support owns the when
0:05	The one idea: define healthy, watch the burn, follow the runbook, learn (read `notes.md`)
0:12	Lab Part A: SLO/SLI/error budget and the two-window alert
0:30	Lab Part B: open → triage → run the runbook → resolve
0:48	Show: post your postmortem and the regression case it produced
1:00	Wrap

If you get stuck

This module safeguards what you built: it alerts on the metrics from M20, flips on the reliability patterns from M22, and feeds the regression gate from M26. It also assumes a deployed service like the ones from M27/M29.
The whole lab runs offline and instantly, costs nothing, and needs no key (the outage and recovery are simulated). No new libraries. Nothing here can touch a real system; MockSystem only pretends to be your service.
Each idea is a few lines in oncall.py and runbook.py. Read the one that matches the step you are on.
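If the runbook step is the one you are stuck on, the loop below shows the general shape of "try safe, reversible mitigations until the service recovers". It is a sketch under stated assumptions, not the contents of `runbook.py`: `FakeService` is a hypothetical stand-in (it is not the lab's MockSystem and does not mirror its API), and `Step` and `run_runbook` are illustrative names. Reverting a step that did not help is one possible policy, chosen here so that every mitigation stays reversible.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FakeService:
    """Hypothetical stand-in for a deployed service (not the lab's MockSystem)."""
    provider_up: bool = False   # the model provider is down in this drill
    use_fallback: bool = False  # serve from a fallback model instead
    retries: bool = False       # retry failed provider calls

    def healthy(self) -> bool:
        # Retries cannot help while the provider is fully down.
        return self.provider_up or self.use_fallback


@dataclass
class Step:
    name: str
    apply: Callable[[FakeService], None]
    revert: Callable[[FakeService], None]


def run_runbook(service: FakeService, steps: List[Step]) -> List[str]:
    """Apply steps in order, stop at the first that restores health.

    Returns a timeline of what was tried, which is the raw material
    for the postmortem.
    """
    timeline = []
    for step in steps:
        if service.healthy():
            break
        step.apply(service)
        if service.healthy():
            timeline.append(f"{step.name}: applied, service healthy")
            break
        step.revert(service)  # undo mitigations that did not help
        timeline.append(f"{step.name}: no effect, reverted")
    timeline.append("resolved" if service.healthy() else "escalate")
    return timeline


steps = [
    Step("enable_retries",
         lambda s: setattr(s, "retries", True),
         lambda s: setattr(s, "retries", False)),
    Step("switch_to_fallback",
         lambda s: setattr(s, "use_fallback", True),
         lambda s: setattr(s, "use_fallback", False)),
]
service = FakeService()
timeline = run_runbook(service, steps)
for entry in timeline:
    print(entry)
```

Because each step carries its own `revert`, a mitigation that does nothing leaves no trace on the service, and the timeline still records that it was tried.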

Optional challenge

Open starters/on_call.py and implement an escalation ladder: if a sev1 page is not acknowledged within N minutes it escalates L1 → L2 → L3, and a sev3 ticket never pages at all. It is the policy that makes sure a missed page becomes someone else's page, not a missed outage.
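As a starting point only, the ladder can be expressed as a pure function of severity and time without acknowledgement. This sketch is not the expected solution and does not reflect the contents of the starter file: `LADDER`, `who_to_page`, and the 5-minute default for N are hypothetical, and treating every severity other than sev3 as pageable is an assumption the challenge text does not state.

```python
from typing import Optional

LADDER = ("L1", "L2", "L3")  # escalation order from the challenge text


def who_to_page(severity: str, minutes_unacked: float,
                ack_timeout: float = 5.0) -> Optional[str]:
    """Return the level that should currently hold the page, or None.

    ack_timeout is the challenge's N (minutes); 5.0 is an illustrative default.
    """
    if severity == "sev3":
        return None  # sev3 is a ticket and never pages
    if minutes_unacked < 0:
        raise ValueError("minutes_unacked must be non-negative")
    # One rung per missed acknowledgement window, capped at the top rung.
    rung = min(int(minutes_unacked // ack_timeout), len(LADDER) - 1)
    return LADDER[rung]


for minutes in (0, 5, 12, 60):
    print(minutes, who_to_page("sev1", minutes))
print(who_to_page("sev3", 60))
```

Keeping the function pure (no clocks, no side effects) makes it easy to test every rung offline; a real pager would call it with the elapsed time since the page was sent.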