M31 solution: incident response and on-call
A small, dependency-free on-call toolkit, plus a scripted incident that exercises it end to end. Everything runs offline: no API key, no network, deterministic.
Files
| File | Role |
|---|---|
| `oncall.py` | The SRE backbone for an AI service: `SLO`, `sli`, `burn_rate`, `budget_remaining`, the two-window `alert` (page vs ticket), the `Incident` record (timeline, mitigate, resolve), and the loop-closers `write_postmortem` and `to_eval_case`. |
| `runbook.py` | `RUNBOOKS` (named mitigation checklists keyed by symptom), a `MockSystem` whose health recovers as mitigations are applied, and `run_runbook` (execute steps until healthy). |
| `demo_mock.py` | One on-call shift A→F: healthy SLI → outage burns the budget → alert pages sev1 → incident opened/triaged → runbook mitigates → postmortem + regression eval. Start here. |
| `../starters/on_call.py` | Your turn: add an escalation ladder (auto-escalate when an alert is not acknowledged in time). |
Run it
# offline, free, instant, deterministic:
python demo_mock.py
No API key or `.env` file is needed for this module; it is pure operations logic.
The ideas, and when each fires
- SLO / SLI / error budget: define "healthy" as a number. The SLO is the promise (99% good), the SLI is the measurement, the error budget (1%) is how much failure you may spend before you must act.
- burn rate + two-window alert: how fast you are spending the budget. A fast burn (budget gone in hours) pages a human now; a slow sustained burn opens a ticket. This is what stops both 2am pages for nothing and real outages going unnoticed.
- the incident lifecycle: detect → triage → mitigate → resolve, recorded on a timeline as you go.
- runbooks: a named checklist of safe, reversible mitigations for a known symptom, so whoever is on-call can act correctly without having built the system. `run_runbook` stops as soon as the system is healthy.
- blameless postmortem + regression: the timeline becomes a postmortem (blame the system, not the person), and the incident becomes a regression eval case (M26) so the same failure cannot ship twice.
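To make the burn-rate arithmetic concrete, here is a minimal sketch of the first two ideas. It reuses the names `SLO`, `sli`, `burn_rate`, and `alert` from `oncall.py`, but the signatures, the `page_at` and `ticket_at` thresholds, and the `sev3` ticket severity are assumptions made for illustration; the real module may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    target: float = 0.99  # the promise: 99% of requests are good

    @property
    def error_budget(self) -> float:
        return 1.0 - self.target  # the 1% of failure you may spend


def sli(good: int, total: int) -> float:
    """Measured fraction of good requests; an empty window counts as healthy."""
    return good / total if total else 1.0


def burn_rate(slo: SLO, good: int, total: int) -> float:
    """Budget spend speed: 1.0 is exactly on budget, above 1.0 overspends."""
    return (1.0 - sli(good, total)) / slo.error_budget


def alert(slo, fast, slow, page_at=14.4, ticket_at=1.0):
    """fast and slow are (good, total) counts for a short and a long window.

    The thresholds and the ticket severity are illustrative assumptions.
    """
    if burn_rate(slo, *fast) >= page_at:
        return ("page", "sev1")    # budget gone in hours: wake a human
    if burn_rate(slo, *slow) >= ticket_at:
        return ("ticket", "sev3")  # slow sustained burn: fix it in daylight
    return ("ok", None)


slo = SLO()
healthy = burn_rate(slo, good=997, total=1000)  # 99.7% SLI
outage = burn_rate(slo, good=50, total=100)     # 50% of requests failing
print(f"healthy burn {healthy:.2f}x, outage burn {outage:.2f}x")
print(alert(slo, fast=(50, 100), slow=(997, 1000)))    # fast burn: page
print(alert(slo, fast=(990, 1000), slow=(980, 1000)))  # slow burn: ticket
```

Checking the fast window first means an outage pages even when the slow window still looks fine, while a slow leak that never trips the page threshold still produces a ticket.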
Verified (offline)
`demo_mock.py` runs end to end and is deterministic (no randomness): a 99.7% SLI reads healthy (burn 0.30x); a 50%-failing fast window burns 50x and the two-window `alert` returns `page / sev1`; the `model_provider_outage` runbook recovers the `MockSystem` from 50% → 85% → 95% → 100% SLI in three steps and stops; the postmortem renders from the timeline; `to_eval_case` emits a `regression-INC-001` case. `oncall.py` and `runbook.py` are dependency-free and import without a key.

Builds on M20 (the metrics you alert on), M22 (the reliability patterns the runbook flips on), and M26 (the regression gate).
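The runbook recovery described above can be sketched as a loop that applies one mitigation at a time and checks health before each step. The names `RUNBOOKS`, `MockSystem`, and `run_runbook` come from `runbook.py`, but everything else here is an assumption for illustration: the step descriptions, the `(description, SLI after step)` layout, the 0.99 health target, and the unused fourth step are invented, and only the 50% → 85% → 95% → 100% sequence mirrors the verified run.

```python
from dataclasses import dataclass, field

# Hypothetical runbook: each step is (description, SLI once the step is applied).
RUNBOOKS = {
    "model_provider_outage": [
        ("fail over to the backup provider", 0.85),
        ("serve cached responses where possible", 0.95),
        ("shed non-critical traffic", 1.00),
        ("escalate to the vendor", 1.00),  # not reached: system is already healthy
    ],
}


@dataclass
class MockSystem:
    sli: float = 0.50  # starts in the outage state
    applied: list = field(default_factory=list)

    def apply(self, step: str, sli_after: float) -> None:
        self.applied.append(step)
        self.sli = sli_after

    def healthy(self, target: float = 0.99) -> bool:
        return self.sli >= target


def run_runbook(name: str, system: MockSystem, target: float = 0.99) -> list:
    """Apply steps in order, stopping as soon as the system is healthy."""
    trace = [system.sli]
    for step, sli_after in RUNBOOKS[name]:
        if system.healthy(target):
            break  # do not apply mitigations that are no longer needed
        system.apply(step, sli_after)
        trace.append(system.sli)
    return trace


system = MockSystem()
trace = run_runbook("model_provider_outage", system)
print(" -> ".join(f"{value:.0%}" for value in trace))
print(f"steps applied: {len(system.applied)}")
```

Checking health before each step, rather than after the whole list, is what keeps the on-call engineer from applying mitigations the system no longer needs.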