M31 solution: incident response and on-call

A small, dependency-free on-call toolkit, plus a scripted incident that exercises it end to end. Everything runs offline: no API key, no network, deterministic.

Files

File	Role
`oncall.py`	The SRE backbone for an AI service: `SLO`, `sli`, `burn_rate`, `budget_remaining`, the two-window `alert` (page vs ticket), the `Incident` record (timeline, `mitigate`, `resolve`), and the loop-closers `write_postmortem` and `to_eval_case`.
`runbook.py`	`RUNBOOKS` (named mitigation checklists keyed by symptom), a `MockSystem` whose health recovers as mitigations are applied, and `run_runbook` (execute steps until healthy).
`demo_mock.py`	One on-call shift A→F: healthy SLI → outage burns the budget → alert pages sev1 → incident opened/triaged → runbook mitigates → postmortem + regression eval. Start here.
`../starters/on_call.py`	Your turn: add an escalation ladder (auto-escalate when an alert is not acknowledged in time).

Run it

# offline, free, instant, deterministic:
python demo_mock.py

No key and no .env are needed for this module; it is pure operations logic.

The ideas, and when each fires

SLO / SLI / error budget — define "healthy" as a number. The SLO is the promise (99% good), the SLI is the measurement, the error budget (1%) is how much failure you may spend before you must act.
burn rate + two-window alert — how fast you are spending the budget. A fast burn (budget gone in hours) pages a human now; a slow sustained burn opens a ticket. This is what stops both 2am pages for nothing and real outages going unnoticed.
the incident lifecycle — detect → triage → mitigate → resolve, recorded on a timeline as you go.
runbooks — a named checklist of safe, reversible mitigations for a known symptom, so whoever is on-call can act correctly without having built the system. run_runbook stops as soon as it is healthy.
blameless postmortem + regression — the timeline becomes a postmortem (blame the system, not the person), and the incident becomes a regression eval case (M26) so the same failure cannot ship twice.

Verified (offline)

demo_mock.py runs end to end and is deterministic (no randomness): a 99.7% SLI reads healthy (burn 0.30x); a 50%-failing fast window burns 50x and the two-window alert returns page / sev1; the model_provider_outage runbook recovers the MockSystem from 50% → 85% → 95% → 100% SLI in three steps and stops; the postmortem renders from the timeline; to_eval_case emits a regression-INC-001 case.
oncall.py and runbook.py are dependency-free and import without a key. Builds on M20 (the metrics you alert on), M22 (the reliability patterns the runbook flips on), and M26 (the regression gate).