Lab M31: run an on-call shift
You'll need: Python and your venv. This lab needs no API key, costs nothing, and runs instantly and deterministically (the outage and the recovery are simulated). Time: about 45 minutes. Work in your breakout pair.
Heads up: there is no real service to break here. MockSystem stands in for the agent you deployed earlier; an outage drags its health down and the runbook brings it back. Nothing leaves your machine.
This lab has two parts:
- Part A: make "healthy" a number, then alert on the burn (SLO, SLI, error budget, two-window alert).
- Part B: work the incident (detect → triage → run the runbook → resolve → postmortem → regression).
flowchart TB
SLI["measure SLI"] --> BR{"burn rate<br/>fast vs slow"}
BR -->|fast burn| PAGE["PAGE (sev1)"]
BR -->|slow burn| TKT["ticket (sev3)"]
BR -->|within budget| OK["ok"]
PAGE --> INC["open incident<br/>+ triage"]
INC --> RB["run runbook<br/>until healthy"]
RB --> RES["resolve"]
RES --> PM["postmortem"]
PM --> EV["regression eval (M26)"]
Part A: make "healthy" a number, then alert on it
Step 1: Set up
Copy the solution/ files into a folder and activate your venv. There are no
dependencies to install.
python -c "import oncall, runbook; print('ops toolkit ok')"
You should now see: ops toolkit ok. (If not, run the command from inside the folder that holds the .py files.)
Step 2: Run the whole shift once
python demo_mock.py
==== A. SLO & SLI: make 'healthy' a number ====
SLO: answer-success objective=99% error budget=1%
SLI now: 99.7% | budget remaining: 70% | burn: 0.30x -> healthy
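The numbers in section A follow from simple arithmetic. The sketch below reproduces them under two assumptions: the SLI is the fraction of successful answers in one window, and the budget consumed in that window equals the burn rate. The function names are illustrative and are not claimed to match the API in oncall.py.

```python
def error_budget(objective: float) -> float:
    """Allowed failure fraction: a 99% objective leaves a 1% budget."""
    return 1.0 - objective


def burn_rate(sli: float, objective: float) -> float:
    """Bad fraction divided by the error budget (1.0x = spending exactly on plan)."""
    return (1.0 - sli) / error_budget(objective)


def budget_remaining(sli: float, objective: float) -> float:
    """Assumption: within a single window, budget used equals the burn rate."""
    return 1.0 - burn_rate(sli, objective)


objective = 0.99  # SLO objective from the demo output
sli = 0.997       # "SLI now" from the demo output

print(f"error budget: {error_budget(objective):.0%}")
print(f"burn: {burn_rate(sli, objective):.2f}x")
print(f"budget remaining: {budget_remaining(sli, objective):.0%}")
```

A burn below 1.0x means failures are arriving more slowly than the budget allows, which is why the demo labels this state healthy.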
Step 3: Watch the budget burn and the alert fire
Look at section B:
==== B. DETECT: an outage burns the budget, the alert fires ====
fast burn: 50x slow burn: 8x
ALERT -> action=PAGE severity=sev1 reason=fast burn 50.0x (error budget gone in hours)
Open oncall.py and read burn_rate and alert.
You should now see: a page needs the fast window hot; a mild slow burn would only open a ticket.
Step 4: Prove the alert is not trigger-happy
In a shell, show that a brief blip does not page:
python -c "import oncall as o; slo=o.SLO('x',0.99); print(o.alert(slo, [True]*98+[False]*2, [True]*999+[False]*1))"
You should now see: ('ok', None, ...). With a 1% error budget, 2% bad in the fast window is only a 2x burn, which is not a fast burn, and the slow window is healthy (0.1% bad), so nobody is woken. A trustworthy pager fires only when a human must act.
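The two-window idea can be sketched in a few lines. This is a simplified illustration, not the lab's alert function: the thresholds FAST_PAGE_AT and SLOW_TICKET_AT are assumed values chosen for the example, and the sketch has only two tiers, so read oncall.py for the real thresholds and severities.

```python
FAST_PAGE_AT = 14.4    # assumed threshold, not taken from oncall.py
SLOW_TICKET_AT = 1.0   # assumed threshold, not taken from oncall.py


def burn_rate(window, objective):
    """bad_fraction / error_budget for a list of True (good) / False (bad)."""
    if not window:
        return 0.0
    bad_fraction = window.count(False) / len(window)
    return bad_fraction / (1.0 - objective)


def classify(objective, fast_window, slow_window):
    """Return (action, severity, reason) from the two burn rates."""
    fast = burn_rate(fast_window, objective)
    slow = burn_rate(slow_window, objective)
    if fast >= FAST_PAGE_AT:
        return ("page", "sev1", f"fast burn {fast:.1f}x")
    if slow >= SLOW_TICKET_AT:
        return ("ticket", "sev3", f"slow burn {slow:.1f}x")
    return ("ok", None, "within budget")


# The brief blip from this step: 2% bad (fast), 0.1% bad (slow).
blip = classify(0.99, [True] * 98 + [False] * 2, [True] * 999 + [False] * 1)
# A hard outage: half of recent requests fail.
outage = classify(0.99, [True] * 50 + [False] * 50, [True] * 92 + [False] * 8)
# A slow leak: the fast window is clean, the slow window is warm.
leak = classify(0.99, [True] * 100, [True] * 95 + [False] * 5)

print(blip)
print(outage)
print(leak)
```

Checking the fast window first keeps the page reserved for failures that would exhaust the budget within hours; anything slower can wait for working hours as a ticket.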
Part B: work the incident
Step 5: See the triage
Look at section C of the demo output:
==== C. TRIAGE: open and classify the incident ====
opened INC-001 severity=sev1 status=open
The demo opened an Incident and named the symptom model_provider_outage. Naming the symptom is what lets you pick the right runbook. Triage is matching the failure to a known play.
Step 6: Follow the runbook to recovery
Look at section D:
==== D. MITIGATE: follow the runbook until healthy ====
- Switch to the fallback model SLI now 85%
- Serve cached answers where possible SLI now 95%
- Shed low-priority traffic SLI now 100%
INC-001 status=resolved after 3 mitigation step(s)
run_runbook stops as soon as the SLI recovers. Open runbook.py and read RUNBOOKS and run_runbook. You stopped the bleeding before diagnosing the root cause; that is the right order.
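The stop-as-soon-as-healthy loop can be sketched as follows. ToySystem, run_until_healthy, the starting SLI, the per-step gains, and the fourth step are all hypothetical, picked so that the first three steps echo the demo output; the lab's MockSystem and run_runbook may be structured differently.

```python
class ToySystem:
    """Hypothetical stand-in for a degraded service; SLI held as a whole percent."""

    def __init__(self, sli_pct):
        self.sli_pct = sli_pct

    def apply(self, gain_pct):
        # Each mitigation recovers some health, capped at 100%.
        self.sli_pct = min(100, self.sli_pct + gain_pct)


# (step name, assumed SLI gain in percentage points)
STEPS = [
    ("Switch to the fallback model", 35),
    ("Serve cached answers where possible", 10),
    ("Shed low-priority traffic", 5),
    ("Escalate to the provider", 0),  # hypothetical step, never reached here
]


def run_until_healthy(system, steps, target_pct):
    """Apply steps in order and stop once the SLI meets the target."""
    applied = []
    for name, gain_pct in steps:
        if system.sli_pct >= target_pct:
            break  # healthy again: do not run further mitigations
        system.apply(gain_pct)
        applied.append(name)
        print(f"- {name}  SLI now {system.sli_pct}%")
    return applied


system = ToySystem(sli_pct=50)
applied = run_until_healthy(system, STEPS, target_pct=99)
print(f"resolved after {len(applied)} mitigation step(s)")
```

The loop checks health before each step rather than after the whole list, so an incident that recovers early never pays for mitigations it did not need.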
Step 7: Read the postmortem and the regression case
Sections E and F print the blameless postmortem (written straight from the timeline) and the regression eval case the incident produced:
id: regression-INC-001 ... expect: service stays within SLO / degrades safely under this condition
Step 8: Make it your own
Change the outage severity and watch the response change. In a shell:
python -c "import oncall as o; slo=o.SLO('x',0.99); print(o.alert(slo, [True]*970+[False]*30, [True]*920+[False]*80))"
You should now see: ('page', 'sev2', ...). Here the fast window burns at 3x and the slow window at 8x: a sustained problem, not a flash fire.
Step 9: Show it
Post your postmortem from section E in the chat, or the one line from section F where your incident became a regression test.
If you get stuck
- ModuleNotFoundError: oncall -> run from inside the folder that has oncall.py and runbook.py.
- The numbers differ from this lab -> the demo is deterministic; if you edited the windows, your burn rates change too. Re-read burn_rate: it is bad_fraction / error_budget.
- "Why did it not page?" -> check the fast window. A page needs a hot fast burn; a healthy fast window with a warm slow window is a ticket, by design.