M31 notes: Incident response and on-call (the one idea)

The one idea: you cannot keep a system healthy until "healthy" is a number you watch. Operations support is the safeguarding layer over everything the course built: it defines that number (an SLO), measures it (an SLI), gets told the moment it is burning too fast (an alert), follows a known checklist to fix it (a runbook), and learns from it so the same failure cannot return quietly (a postmortem that becomes a test). AI engineering builds the system; operations support keeps it standing.

1. Why this is its own discipline

The build modules optimized for "it works." Production optimizes for "it keeps working, and when it does not, we recover fast." Those are different jobs. A model provider has an outage, a vector store fills up, a deploy goes bad, a prompt-injection wave hits, costs spike. None of that is rare. The safeguarding layer is the set of habits and tools that turn each one from a crisis into a checklist.

2. SLO, SLI, error budget

Three words that turn arguments into math:

Term	Plain meaning	Example
SLI (indicator)	what you measure	% of answers that returned successfully
SLO (objective)	the promise you hold yourself to	"99% succeed over 30 days"
Error budget	the failure the SLO allows	1% — your room to be imperfect

The error budget is the key idea: 100% is the wrong target (impossibly expensive, and it leaves no room to ship). The budget is permission to fail a little. When it is gone, you stop shipping features and fix reliability instead. SLO.error_budget is just 1 - objective.
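To make the arithmetic concrete, here is a minimal sketch of an SLO with its error budget. Only the name SLO.error_budget and the formula 1 - objective come from these notes; the other fields, the sli helper, and the traffic numbers are assumptions for illustration, and the course code may be shaped differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A reliability promise. Field names other than error_budget are assumed."""
    name: str
    objective: float          # target success fraction, e.g. 0.99
    window_days: int = 30     # assumed rolling window

    @property
    def error_budget(self) -> float:
        # The fraction of requests allowed to fail: 1 - objective.
        return 1.0 - self.objective

    def allowed_failures(self, total_requests: int) -> float:
        # The same budget expressed as a request count for a given volume.
        return self.error_budget * total_requests


def sli(successes: int, total: int) -> float:
    # The indicator: fraction of answers that returned successfully.
    # With no traffic there is nothing failing, so report 1.0.
    return successes / total if total else 1.0


slo = SLO(name="answers", objective=0.99)
print(round(slo.error_budget, 4))             # 0.01
print(round(slo.allowed_failures(200_000)))   # 2000
print(sli(successes=9_950, total=10_000))     # 0.995
```

In this example the measured SLI (99.5%) is above the 99% objective, so the service is inside its budget.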

3. Burn rate, and alerting on it

A raw error rate is a bad pager: alert on every blip and you train yourself to ignore it; alert too late and you miss the outage. Burn rate fixes this: it is how fast you are spending the budget, measured against the pace that would use it up exactly at the end of the window. 1x means you will use exactly your month's budget in a month; 50x means a 30-day budget is gone in about 14 hours (30 days / 50 = 0.6 days).
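The burn-rate arithmetic can be sketched in a few lines. The function names and the 30-day default are assumptions made for this example, not the course's API; the idea is only that burn rate is the observed error rate divided by the error budget.

```python
def burn_rate(error_rate: float, error_budget: float) -> float:
    # How many times faster than "exactly on budget" errors are arriving.
    return error_rate / error_budget


def days_until_exhausted(rate: float, window_days: float = 30.0) -> float:
    # At a constant burn rate, a full budget lasts window_days / rate.
    return float("inf") if rate <= 0 else window_days / rate


budget = 0.01                                    # from a 99% SLO
rate = burn_rate(error_rate=0.5, error_budget=budget)
print(round(rate))                               # 50
print(round(days_until_exhausted(rate) * 24, 1)) # 14.4 (hours)
print(days_until_exhausted(1.0))                 # 30.0 (days)
```

A 50% error rate against a 1% budget is a 50x burn, which empties a 30-day budget in roughly 14 hours.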

alert() uses two windows (the real Google-SRE shape, simplified): a short fast window and a long slow window.

- Fast burn (budget gone in hours) → page a human now (sev1).
- Sustained slow burn → still page, but a calmer sev2.
- Mild slow burn → open a ticket, not a page.
- Otherwise → ok.

Two windows stop a one-second blip from paging you, while still catching a slow leak that a single short window would miss.
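One way the two-window decision could look is sketched below. The signature, the return strings, and all three thresholds are assumptions chosen for illustration (the notes do not give the real values), and this version requires the slow window to confirm a fast burn, which is one reasonable reading of how a blip gets filtered out.

```python
# Illustrative thresholds only; tune them to your own SLO window.
FAST_PAGE = 14.4    # fast-window burn rate that suggests an outage
SLOW_PAGE = 6.0     # slow-window burn rate that still deserves a page
SLOW_TICKET = 1.0   # slow-window burn rate that is merely over budget pace


def alert(fast_burn: float, slow_burn: float) -> str:
    """Map burn rates from a short and a long window to a response level."""
    # Fast burn, confirmed by the slow window so a brief blip cannot page.
    if fast_burn >= FAST_PAGE and slow_burn >= SLOW_PAGE:
        return "sev1"
    # Sustained slow burn: the long window alone is enough for a calmer page.
    if slow_burn >= SLOW_PAGE:
        return "sev2"
    # Mild slow burn: spending faster than budget pace, but not urgently.
    if slow_burn >= SLOW_TICKET:
        return "ticket"
    return "ok"


print(alert(fast_burn=50.0, slow_burn=20.0))   # sev1
print(alert(fast_burn=100.0, slow_burn=0.2))   # ok (a blip)
print(alert(fast_burn=3.0, slow_burn=7.0))     # sev2 (a slow leak)
print(alert(fast_burn=1.5, slow_burn=2.0))     # ticket
```

The second and third calls are the two cases the paragraph above describes: the blip that only the short window sees, and the leak that only the long window sees.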

4. On-call: page vs ticket

On-call is a rotation of who carries the pager. The kindest thing you can do for that person is make the pager trustworthy: it fires only when a human genuinely needs to act now. Everything else is a ticket. Severity (sev1/2/3) encodes both how bad the problem is and how fast someone must respond, so the response matches the problem.

Analogy. An error budget is the fuel gauge on a long drive. You do not need a full tank at all times; you need to not run dry before the next station. The burn rate is how hard you are pressing the accelerator; the two-window alert is the difference between the needle dipping over one hill (ignore it) and dropping steadily for twenty miles (find a station now). The runbook is the laminated card in the glovebox that tells any driver what to do.

5. The incident lifecycle

When the page fires, the same small loop runs every time:

Detect — the alert opens an Incident with a severity and a timeline.
Triage — name the likely symptom (here: model_provider_outage) so you can pick a runbook.
Mitigate — stop the bleeding before you find the root cause. Flip to a fallback, turn on caching, shed load. Mitigation first, diagnosis second.
Resolve — once healthy, record the root cause and close it.

The Incident object records each move on a timeline, which is the raw material for the postmortem.
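A rough sketch of such an Incident object follows. The class name comes from these notes, but every field and method name here is an assumption, as are the example mitigation and root-cause strings; the point is only that each lifecycle stage appends one entry to the timeline.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class Incident:
    severity: str
    status: str = "detected"
    symptom: Optional[str] = None
    root_cause: Optional[str] = None
    # Each entry is (UTC timestamp, lifecycle stage, free-text note).
    timeline: List[Tuple[str, str, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Detect: opening the incident is itself the first timeline entry.
        self._log("detect", f"alert opened incident at {self.severity}")

    def _log(self, stage: str, note: str) -> None:
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.timeline.append((stamp, stage, note))

    def triage(self, symptom: str) -> None:
        self.symptom, self.status = symptom, "triaged"
        self._log("triage", f"likely symptom: {symptom}")

    def mitigate(self, action: str) -> None:
        self.status = "mitigating"
        self._log("mitigate", action)

    def resolve(self, root_cause: str) -> None:
        self.root_cause, self.status = root_cause, "resolved"
        self._log("resolve", f"root cause: {root_cause}")


inc = Incident(severity="sev1")
inc.triage("model_provider_outage")
inc.mitigate("flip to fallback provider")          # example mitigation
inc.resolve("primary provider returned errors")    # example root cause
print([stage for _, stage, _ in inc.timeline])
# ['detect', 'triage', 'mitigate', 'resolve']
```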

6. Runbooks

A runbook is a named, ordered checklist of safe, reversible mitigations for a known symptom. It exists so the tired person on-call can act correctly without being the person who built the system. run_runbook applies steps in order against the service and stops the moment the SLI recovers — you do not keep applying mitigations you no longer need. In the demo, the provider-outage runbook takes the service from 50% → 85% → 95% → 100% healthy in three steps, then stops.
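The stop-when-healthy behavior can be sketched as below. run_runbook is named in these notes, but its signature, the MockService class, the step names, and the fourth "never reached" step are all assumptions for illustration; the health values simply mirror the 50% → 85% → 95% → 100% progression described above.

```python
from typing import Callable, List, Tuple


class MockService:
    """Stand-in for a real service; health plays the role of the SLI."""

    def __init__(self, health: float) -> None:
        self.health = health


Step = Tuple[str, Callable[[MockService], None]]


def run_runbook(service: MockService, steps: List[Step],
                target: float) -> List[Tuple[str, float]]:
    """Apply steps in order, stopping as soon as the SLI meets the target."""
    applied = []
    for name, action in steps:
        if service.health >= target:
            break                      # recovered: skip remaining steps
        action(service)
        applied.append((name, service.health))
    return applied


# Hypothetical provider-outage runbook; each action fakes its effect.
provider_outage = [
    ("flip to fallback provider", lambda s: setattr(s, "health", 0.85)),
    ("turn on caching",           lambda s: setattr(s, "health", 0.95)),
    ("shed load",                 lambda s: setattr(s, "health", 1.00)),
    ("extra step (not reached)",  lambda s: setattr(s, "health", 1.00)),
]

service = MockService(health=0.50)
applied = run_runbook(service, provider_outage, target=0.99)
for name, health in applied:
    print(f"{name}: {health:.0%}")
# flip to fallback provider: 85%
# turn on caching: 95%
# shed load: 100%
```

Checking health before each step, rather than after, also means a runbook run against an already healthy service applies nothing.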

7. Blameless postmortems, and the loop back to evals

After resolution you write a blameless postmortem: blame the system, not the person ("the deploy had no canary," never "Sam pushed it"). Blame makes people hide incidents; blameless makes them share so everyone learns. write_postmortem renders one straight from the timeline.

The most important action item is automatic: every incident becomes a regression test. to_eval_case turns the incident into the eval shape M26 consumes, so the gate now fails if that exact failure ever returns. This is the loop that makes a system get more reliable over time instead of less.
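The incident-to-test step might look like the sketch below. to_eval_case is named in these notes, but the eval shape that M26 consumes is not shown here, so every key in this dictionary, the function arguments, and the tiny gate function are placeholders standing in for the real format.

```python
from typing import Dict, List


def to_eval_case(incident_id: str, symptom: str, expected: str) -> Dict:
    """Turn a resolved incident into a regression case (hypothetical shape)."""
    return {
        "id": f"regression-{incident_id}",
        "scenario": symptom,             # the failure to simulate
        "expected": expected,            # behavior the system must show
        "tags": ["regression", "incident"],
    }


def gate(cases: List[Dict], observed: Dict[str, str]) -> bool:
    # Pass only if every case's scenario produced the expected behavior.
    return all(observed.get(c["scenario"]) == c["expected"] for c in cases)


case = to_eval_case(
    incident_id="0001",                          # example identifier
    symptom="model_provider_outage",
    expected="served_from_fallback",             # example expectation
)
print(case["id"])                                                    # regression-0001
print(gate([case], {"model_provider_outage": "served_from_fallback"}))  # True
print(gate([case], {"model_provider_outage": "error_returned"}))        # False
```

The last call is the point of the loop: if the same failure comes back, the gate fails before the change ships.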

8. Putting it together

The on-call shift in demo_mock.py: define the SLO → watch the SLI and burn → a fast burn pages sev1 → open and triage the incident → run the runbook until healthy → resolve → postmortem → regression case. None of it changes what the system does when it works. It changes how fast you recover when it does not, which is the entire job of the safeguarding layer.

Words you will hear

SLI / SLO / error budget, burn rate, two-window alert, on-call / paging, severity (sev1/2/3), incident lifecycle (detect → triage → mitigate → resolve), mitigation vs root cause, runbook / playbook, blameless postmortem, escalation ladder (the challenge), toil. Full definitions in the glossary.