M31: Incident response and on-call (Part E: Operations Support)
The course so far taught you to build AI systems. Part E is the layer that safeguards them once they are running. It starts here, because the first thing operations support owns is the 2am question: the agent you deployed in M27 is now failing for real users, the model provider is throwing 503s, and your phone is buzzing. Without a plan, that is panic. With one, it is routine: you defined "healthy" as a number, an alert told you the budget was burning too fast, a runbook told you exactly what to try, and a postmortem turned the whole mess into a test so it cannot happen quietly again. Today you run that shift end to end, offline.
Today's win: an on-call drill where a real outage burns an error budget, pages you, gets triaged, is mitigated by a runbook until the service is healthy again, and ends in a blameless postmortem and a regression test, all demonstrated offline.
Today you will
- Turn "healthy" into a number with an SLO, an SLI, and an error budget
- Use a burn rate and a two-window alert to decide when to page a human vs open a ticket
- Open and triage an incident, recording a timeline as you go
- Follow a runbook of safe, reversible mitigations until the service recovers
- Write a blameless postmortem and turn the incident into a regression eval (ties to M26)
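
To make the first two goals concrete, here is a minimal sketch of the arithmetic. It is not the lab's `oncall.py`. The 99.9% target, the function names, and the thresholds (14.4 to page, 6 to ticket, values commonly quoted in SRE guidance) are all assumptions for illustration. It also simplifies by using one long/short window pair for both decisions.

```python
# Sketch only: names, the SLO target, and thresholds are assumptions.
SLO_TARGET = 0.999               # assumed SLO: 99.9% of requests succeed
ERROR_BUDGET = 1.0 - SLO_TARGET  # fraction of requests allowed to fail


def sli(good: int, total: int) -> float:
    """SLI: the measured fraction of good requests in a window."""
    return 1.0 if total == 0 else good / total


def burn_rate(good: int, total: int, slo_target: float = SLO_TARGET) -> float:
    """Observed error rate divided by the allowed error rate.

    1.0 means the budget is being spent exactly on schedule;
    20.0 means it is being spent twenty times too fast.
    """
    if total == 0:
        return 0.0
    error_rate = (total - good) / total
    return error_rate / (1.0 - slo_target)


def decide(long_burn: float, short_burn: float,
           page_at: float = 14.4, ticket_at: float = 6.0) -> str:
    """Two-window rule: act only when BOTH windows agree.

    The long window shows the burn is significant; the short window
    shows it is still happening right now.
    """
    if long_burn >= page_at and short_burn >= page_at:
        return "page"
    if long_burn >= ticket_at and short_burn >= ticket_at:
        return "ticket"
    return "ok"


if __name__ == "__main__":
    long_window = burn_rate(good=980, total=1000)  # about 20x
    short_window = burn_rate(good=95, total=100)   # about 50x
    print(decide(long_window, short_window))
```

Requiring both windows to agree is the point of the design: a long window alone keeps alerting after the outage has ended, and a short window alone pages on every brief blip.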
Run of show (about 60 minutes)
| Time | What we do |
|---|---|
| 0:00 | Hook: failure in production is a when, not an if; operations support owns the when |
| 0:05 | The one idea: define healthy, watch the burn, follow the runbook, learn (read notes.md) |
| 0:12 | Lab Part A: SLO/SLI/error budget and the two-window alert |
| 0:30 | Lab Part B: open → triage → run the runbook → resolve |
| 0:48 | Show: post your postmortem and the regression case it produced |
| 1:00 | Wrap |
If you get stuck
- This module safeguards what you built: it alerts on the metrics from M20, flips on the reliability patterns from M22, and feeds the regression gate from M26. It also assumes a deployed service like the ones from M27/M29.
- The whole lab runs offline, free, with no key, and instantly (the outage and recovery are simulated). No new libraries. Nothing here can touch a real system; `MockSystem` only pretends to be your service.
- Each idea is a few lines in `oncall.py` and `runbook.py`. Read the one that matches the step you are on.
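
If the runbook step feels abstract, the loop behind it can be sketched roughly as follows. This is not the contents of `runbook.py`, and `FakeService` is a hypothetical stand-in, not the lab's `MockSystem`. The step names and flags are invented for illustration.

```python
# Sketch only: FakeService and the step names are hypothetical.
from dataclasses import dataclass


@dataclass
class FakeService:
    """Toy service: healthy if the provider is up or a fallback is on."""
    provider_down: bool = True
    use_fallback: bool = False
    retries: int = 0

    def healthy(self) -> bool:
        return (not self.provider_down) or self.use_fallback


def enable_retries(svc: FakeService) -> None:
    svc.retries = 3          # reversible: set back to 0


def enable_fallback(svc: FakeService) -> None:
    svc.use_fallback = True  # reversible: set back to False


# Ordered from least to most invasive.
RUNBOOK = [
    ("enable retries", enable_retries),
    ("switch to fallback model", enable_fallback),
]


def run_runbook(svc: FakeService, steps) -> list:
    """Apply one step at a time, re-check health, and log a timeline."""
    timeline = []
    for name, apply in steps:
        if svc.healthy():
            break
        apply(svc)
        timeline.append(f"applied: {name} -> healthy={svc.healthy()}")
    timeline.append("resolved" if svc.healthy() else "escalate: runbook exhausted")
    return timeline


if __name__ == "__main__":
    for entry in run_runbook(FakeService(), RUNBOOK):
        print(entry)
```

Checking health after every step keeps the mitigation as small as possible, and the timeline entries double as raw material for the postmortem.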
Optional challenge
Open starters/on_call.py and implement an escalation ladder: if a sev1 page
is not acknowledged within N minutes it escalates L1 → L2 → L3, and a sev3 ticket never pages at all.
It is the policy that makes sure a missed page becomes someone else's page, not a missed outage.
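
After you have tried it yourself, one possible shape for the core rule is sketched below for comparison. The function name, the tier list, and the 5-minute default for N are assumptions, not the starter file's interface. The challenge does not say what a sev2 does, so this sketch rejects anything other than sev1 and sev3 instead of guessing.

```python
# Sketch only: names and the default timeout are assumptions.
from typing import Optional

LADDER = ["L1", "L2", "L3"]


def who_is_paged(severity: str, minutes_unacked: float,
                 ack_timeout_min: float = 5.0) -> Optional[str]:
    """Return the tier currently holding the page, or None if nobody is paged."""
    if severity == "sev3":
        return None  # a sev3 is a ticket: it never pages
    if severity != "sev1":
        raise ValueError(f"no policy defined for {severity!r} in this sketch")
    # Each full timeout without an acknowledgement moves one rung up.
    steps = int(minutes_unacked // ack_timeout_min)
    # Stay on the top rung once the ladder is exhausted.
    return LADDER[min(steps, len(LADDER) - 1)]


if __name__ == "__main__":
    for minutes in (0, 6, 12, 60):
        print(minutes, who_is_paged("sev1", minutes))
    print(who_is_paged("sev3", 60))
```

Capping at the last rung means a long-ignored page stays with L3 instead of falling off the end of the ladder.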