
M31: Incident response and on-call (Part E: Operations Support)

The course so far has taught you to build AI systems. Part E is the layer that safeguards them once they are running. It starts here, because the first thing operations support owns is the 2am question: the agent you deployed in M27 is now failing for real users, the model provider is throwing 503s, and your phone is buzzing. Without a plan, that is panic. With one, it is routine: you defined "healthy" as a number, an alert told you the budget was burning too fast, a runbook told you exactly what to try, and a postmortem turned the whole mess into a test so it cannot happen quietly again. Today you run that shift end to end, offline.

Today's win: an on-call drill where a simulated outage burns an error budget, pages you, gets triaged, is mitigated by a runbook until the service is healthy again, and ends in a blameless postmortem and a regression test, all demonstrated offline.

Today you will

  • Turn "healthy" into a number with an SLO, an SLI, and an error budget
  • Use a burn rate and a two-window alert to decide when to page a human vs open a ticket
  • Open and triage an incident, recording a timeline as you go
  • Follow a runbook of safe, reversible mitigations until the service recovers
  • Write a blameless postmortem and turn the incident into a regression eval (ties to M26)
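
The first two bullets fit in a few lines of arithmetic. The sketch below is an illustration only, not the code in oncall.py: the names Window, burn_rate, and decide are hypothetical, and the thresholds (14.4 to page, 1.0 to ticket) are assumed example values, not numbers taken from the lab. The SLI is the fraction of good requests, the error budget is 1 minus the SLO, and the burn rate is how many times faster than allowed the budget is being spent.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one look-back window (hypothetical type)."""
    total: int
    failed: int


def burn_rate(window: Window, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the budget is spent exactly on schedule; 10.0 means ten times too fast.
    """
    if window.total == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo                    # e.g. 0.01 for a 99% SLO
    error_rate = window.failed / window.total   # this is 1 - SLI
    return error_rate / error_budget


def decide(long_w: Window, short_w: Window, slo: float,
           page_at: float = 14.4, ticket_at: float = 1.0) -> str:
    """Two-window alert: act only when BOTH windows exceed a threshold.

    The long window shows the burn is sustained; the short window shows it is
    still happening now. The threshold defaults are assumptions for this sketch.
    """
    long_burn = burn_rate(long_w, slo)
    short_burn = burn_rate(short_w, slo)
    if long_burn >= page_at and short_burn >= page_at:
        return "page"
    if long_burn >= ticket_at and short_burn >= ticket_at:
        return "ticket"
    return "ok"


if __name__ == "__main__":
    # 99% SLO; 20% of requests failing over the long window, 30% over the short one.
    print(decide(Window(1000, 200), Window(100, 30), slo=0.99))
```

Requiring both windows is the design choice worth noticing: a brief spike that has already ended trips only the short window, and an outage that has already recovered trips only the long one, so neither wakes anyone up.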

Run of show (about 60 minutes)

Time  What we do
0:00  Hook: failure in production is a when, not an if; operations support owns the when
0:05  The one idea: define healthy, watch the burn, follow the runbook, learn (read notes.md)
0:12  Lab Part A: SLO/SLI/error budget and the two-window alert
0:30  Lab Part B: open → triage → run the runbook → resolve
0:48  Show: post your postmortem and the regression case it produced
1:00  Wrap

If you get stuck

  • This module safeguards what you built: it alerts on the metrics from M20, flips on the reliability patterns from M22, and feeds the regression gate from M26. It also assumes a deployed service like the ones from M27/M29.
  • The whole lab runs offline, for free, without an API key, and instantly (the outage and recovery are simulated). There are no new libraries. Nothing here can touch a real system; MockSystem only pretends to be your service.
  • Each idea is a few lines in oncall.py and runbook.py. Read the one that matches the step you are on.

Optional challenge

Open starters/on_call.py and implement an escalation ladder: if a sev1 page is not acknowledged within N minutes, it escalates L1 → L2 → L3, while a sev3 ticket never pages at all. This policy makes sure a missed page becomes someone else's page, not a missed outage.
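
To make the policy concrete before you start, here is one possible shape for the core decision. It is a sketch under assumptions, not the reference solution and not the interface in starters/on_call.py: the function name who_holds_page and its parameters are hypothetical, and because the challenge only specifies sev1 and sev3, the sketch refuses to guess what other severities should do.

```python
from typing import Optional

# Escalation tiers from the challenge text, lowest first.
LADDER = ["L1", "L2", "L3"]


def who_holds_page(severity: str, minutes_unacked: float,
                   ack_timeout: float) -> Optional[str]:
    """Return the tier that should hold the page right now, or None for no page.

    Hypothetical helper: ack_timeout is the "N minutes" from the challenge.
    """
    if ack_timeout <= 0:
        raise ValueError("ack_timeout must be positive")
    if severity == "sev3":
        return None  # a sev3 is a ticket and never pages
    if severity != "sev1":
        # The challenge does not define other severities; do not invent a rule.
        raise ValueError(f"no paging policy defined for {severity!r}")
    # Each full timeout that passes without an acknowledgement moves up one
    # tier, and the page stays at the top tier once it gets there.
    steps = int(minutes_unacked // ack_timeout)
    return LADDER[min(steps, len(LADDER) - 1)]


if __name__ == "__main__":
    for minutes in (0, 5, 10, 60):
        print(minutes, who_holds_page("sev1", minutes, ack_timeout=5))
```

Keeping the decision as a pure function of elapsed time makes it easy to test offline: no clocks or sleeps are needed, you just pass in the minutes.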