Skip to content

M31 solution: incident response and on-call

A small, dependency-free on-call toolkit, plus a scripted incident that exercises it end to end. Everything runs offline: no API key, no network, deterministic.

Files

File Role
oncall.py The SRE backbone for an AI service: SLO, sli, burn_rate, budget_remaining, the two-window alert (page vs ticket), the Incident record (timeline, mitigate, resolve), and the loop-closers write_postmortem and to_eval_case.
runbook.py RUNBOOKS (named mitigation checklists keyed by symptom), a MockSystem whose health recovers as mitigations are applied, and run_runbook (execute steps until healthy).
demo_mock.py One on-call shift A→F: healthy SLI → outage burns the budget → alert pages sev1 → incident opened/triaged → runbook mitigates → postmortem + regression eval. Start here.
../starters/on_call.py Your turn: add an escalation ladder (auto-escalate when an alert is not acknowledged in time).

Run it

# offline, free, instant, deterministic:
python demo_mock.py
No key and no .env are needed for this module; it is pure operations logic.

The ideas, and when each fires

  • SLO / SLI / error budget — define "healthy" as a number. The SLO is the promise (99% good), the SLI is the measurement, the error budget (1%) is how much failure you may spend before you must act.
  • burn rate + two-window alerthow fast you are spending the budget. A fast burn (budget gone in hours) pages a human now; a slow sustained burn opens a ticket. This is what stops both 2am pages for nothing and real outages going unnoticed.
  • the incident lifecycle — detect → triage → mitigate → resolve, recorded on a timeline as you go.
  • runbooks — a named checklist of safe, reversible mitigations for a known symptom, so whoever is on-call can act correctly without having built the system. run_runbook stops as soon as it is healthy.
  • blameless postmortem + regression — the timeline becomes a postmortem (blame the system, not the person), and the incident becomes a regression eval case (M26) so the same failure cannot ship twice.

Verified (offline)

  • demo_mock.py runs end to end and is deterministic (no randomness): a 99.7% SLI reads healthy (burn 0.30x); a 50%-failing fast window burns 50x and the two-window alert returns page / sev1; the model_provider_outage runbook recovers the MockSystem from 50% → 85% → 95% → 100% SLI in three steps and stops; the postmortem renders from the timeline; to_eval_case emits a regression-INC-001 case.
  • oncall.py and runbook.py are dependency-free and import without a key. Builds on M20 (the metrics you alert on), M22 (the reliability patterns the runbook flips on), and M26 (the regression gate).