Lab M31: run an on-call shift

You'll need: Python and your venv. This lab needs no API key, costs nothing, and runs instantly and deterministically (the outage and the recovery are simulated). Time: about 45 minutes. Work in your breakout pair.

Heads up: there is no real service to break here. MockSystem stands in for the agent you deployed earlier; an outage drags its health down and the runbook brings it back. Nothing leaves your machine.

This lab has two parts:

- Part A: make "healthy" a number, then alert on the burn (SLO, SLI, error budget, two-window alert).
- Part B: work the incident (detect → triage → run the runbook → resolve → postmortem → regression).

flowchart TB
  SLI["measure SLI"] --> BR{"burn rate<br/>fast vs slow"}
  BR -->|fast burn| PAGE["PAGE (sev1)"]
  BR -->|slow burn| TKT["ticket (sev3)"]
  BR -->|within budget| OK["ok"]
  PAGE --> INC["open incident<br/>+ triage"]
  INC --> RB["run runbook<br/>until healthy"]
  RB --> RES["resolve"]
  RES --> PM["postmortem"]
  PM --> EV["regression eval (M26)"]

Part A: make "healthy" a number, then alert on it

Step 1: Set up

Copy the solution/ files into a folder and activate your venv. There are no dependencies to install.

python -c "import oncall, runbook; print('ops toolkit ok')"

You should now see: ops toolkit ok. (If not: run it from inside the folder with the .py files.)

Step 2: Run the whole shift once

python demo_mock.py

You should now see six sections, A to F. Look at section A:

==== A. SLO & SLI: make 'healthy' a number ====
SLO: answer-success objective=99%  error budget=1%
SLI now: 99.7%  | budget remaining: 70%  | burn: 0.30x  -> healthy

The SLO is the promise (99%), the error budget is the 1% you are allowed to fail, and right now you have spent only 30% of it (70% remaining). "Healthy" is now a number you can watch.
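You can reproduce the section A numbers by hand. The sketch below is a minimal illustration, not the code in oncall.py: the helper name budget_report is made up for this example, and it assumes the SLI covers the whole budget period, so that "budget spent" equals the burn.

```python
def budget_report(objective: float, sli: float) -> dict:
    """Turn an SLO objective and a measured SLI into error-budget numbers.

    Hypothetical helper for illustration; oncall.py may compute this differently.
    """
    error_budget = 1.0 - objective       # allowed failure fraction (1% for a 99% SLO)
    bad_fraction = 1.0 - sli             # observed failure fraction
    burn = bad_fraction / error_budget   # 1.0x means spending exactly on budget
    return {
        "error_budget": round(error_budget, 4),
        "burn": round(burn, 2),
        "budget_spent": round(min(burn, 1.0), 2),
        "budget_remaining": round(max(0.0, 1.0 - burn), 2),
    }


report = budget_report(objective=0.99, sli=0.997)
print(report)
```

With objective 0.99 and SLI 0.997, the burn comes out at 0.3 and the remaining budget at 0.7, matching the 0.30x and 70% in the demo line.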

Step 3: Watch the budget burn and the alert fire

Look at section B:

==== B. DETECT: an outage burns the budget, the alert fires ====
fast burn: 50x   slow burn: 8x
ALERT -> action=PAGE  severity=sev1  reason=fast burn 50.0x (error budget gone in hours)

The last 100 requests are failing 50% of the time (the provider is throwing 503s). Against a 1% error budget that is a 50x burn (50% / 1%), so the two-window alert pages a human at sev1. Open oncall.py and read burn_rate and alert. You should now see: a page needs the fast window hot; a mild slow burn would only open a ticket.
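The burn-rate arithmetic and the two-window decision described above can be sketched in a few lines. This is a hedged illustration, not the contents of oncall.py: the function name two_window_alert and all three thresholds are assumptions, picked only so that the sketch reproduces the results printed in this lab (including the sev2 page in Step 8). The real alert may use different cut-offs or return extra fields.

```python
# Assumed thresholds (hypothetical; check oncall.py for the real ones).
FAST_PAGE = 14.4    # fast-window burn at or above this pages at sev1
SLOW_PAGE = 6.0     # slow-window burn at or above this pages at sev2
SLOW_TICKET = 1.0   # slow-window burn at or above this opens a sev3 ticket


def burn_rate(window, error_budget):
    """Bad fraction of the window divided by the error budget."""
    if not window:
        return 0.0
    bad_fraction = window.count(False) / len(window)
    return bad_fraction / error_budget


def two_window_alert(objective, fast_window, slow_window):
    """Return (action, severity, reason) from a fast and a slow window."""
    # Round away float noise so 1 - 0.99 is exactly 0.01.
    budget = round(1.0 - objective, 10)
    fast = burn_rate(fast_window, budget)
    slow = burn_rate(slow_window, budget)
    if fast >= FAST_PAGE:
        return ("page", "sev1", f"fast burn {fast:.1f}x")
    if slow >= SLOW_PAGE:
        return ("page", "sev2", f"slow burn {slow:.1f}x")
    if slow >= SLOW_TICKET:
        return ("ticket", "sev3", f"slow burn {slow:.1f}x")
    return ("ok", None, "within budget")


# Section B: half of the last 100 requests fail, slow window at 8% bad.
outage = two_window_alert(0.99, [True] * 50 + [False] * 50, [True] * 920 + [False] * 80)
# A brief blip: 2% bad in the fast window, 0.1% bad in the slow window.
blip = two_window_alert(0.99, [True] * 98 + [False] * 2, [True] * 999 + [False] * 1)
print(outage)
print(blip)
```

The order of the checks is the design choice: the fast window is tested first so a flash fire always wins, and the slow window only decides the outcome when the fast window is calm.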

Step 4: Prove the alert is not trigger-happy

In your terminal, show that a brief blip does not page:

python -c "import oncall as o; slo=o.SLO('x',0.99); print(o.alert(slo, [True]*98+[False]*2, [True]*999+[False]*1))"

You should now see: ('ok', None, ...). 2% bad in the fast window is only a 2x burn, not a fast burn, and the slow window is healthy, so nobody is woken. A trustworthy pager only fires when a human must act.

Part B: work the incident

Step 5: See the triage

Look at section C of the demo output:

==== C. TRIAGE: open and classify the incident ====
opened INC-001  severity=sev1  status=open

The alert opened an Incident and the symptom was named model_provider_outage. Naming the symptom is what lets you pick the right runbook. Triage is matching the failure to a known play.

Step 6: Follow the runbook to recovery

Look at section D:

==== D. MITIGATE: follow the runbook until healthy ====
  - Switch to the fallback model                  SLI now 85%
  - Serve cached answers where possible           SLI now 95%
  - Shed low-priority traffic                     SLI now 100%
INC-001 status=resolved  after 3 mitigation step(s)

Each step is a safe, reversible mitigation; health climbs back after each one; run_runbook stops as soon as the SLI recovers. Open runbook.py and read RUNBOOKS and run_runbook. You stopped the bleeding before diagnosing the root cause, and that is the right order.
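The "run steps until healthy" loop can be sketched as below. Treat it as a toy under stated assumptions, not the lab's run_runbook: ToySystem, TOY_STEPS, and run_toy_runbook are hypothetical names, the 50% starting health and the per-step gains are chosen only to reproduce the 85% / 95% / 100% sequence in section D, and the fourth step is invented to show that the loop stops early.

```python
from dataclasses import dataclass


@dataclass
class ToySystem:
    """Hypothetical stand-in for MockSystem; health is a success rate in percent."""
    sli_pct: int = 50  # assumed health during the outage

    def apply(self, gain_pct: int) -> int:
        """Apply one mitigation and return the new health, capped at 100."""
        self.sli_pct = min(100, self.sli_pct + gain_pct)
        return self.sli_pct


# (step name, assumed health gain in percentage points)
TOY_STEPS = [
    ("Switch to the fallback model", 35),
    ("Serve cached answers where possible", 10),
    ("Shed low-priority traffic", 5),
    ("Escalate to the provider", 0),  # invented step; never reached in this run
]


def run_toy_runbook(system, steps, objective_pct=99):
    """Apply steps in order and stop as soon as health meets the objective."""
    applied = []
    for name, gain in steps:
        if system.sli_pct >= objective_pct:
            break  # healthy again: stop mitigating
        applied.append((name, system.apply(gain)))
    return applied


applied = run_toy_runbook(ToySystem(), TOY_STEPS)
for name, sli in applied:
    print(f"  - {name:<45} SLI now {sli}%")
```

Checking health before each step, rather than after the whole list, is what keeps the runbook from applying mitigations you no longer need.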

Step 7: Read the postmortem and the regression case

Sections E and F print the blameless postmortem (written straight from the timeline) and the regression eval case the incident produced:

id: regression-INC-001 ... expect: service stays within SLO / degrades safely under this condition

You should now see: the incident is now guarded three ways: an alert, a runbook, and a test (M26).
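To make the last step concrete, here is one way an incident could be turned into a regression case. It is a sketch only: the function regression_case and the condition field are hypothetical, and the fields the demo prints between id and expect are not shown in this handout, so they are not reproduced here.

```python
def regression_case(incident_id: str, symptom: str) -> dict:
    """Build a minimal regression case from a resolved incident.

    Hypothetical shape for illustration; the demo's real case may carry more fields.
    """
    return {
        "id": f"regression-{incident_id}",
        "condition": symptom,  # hypothetical field: the failure to replay
        "expect": "service stays within SLO / degrades safely under this condition",
    }


case = regression_case("INC-001", "model_provider_outage")
print(case["id"], "->", case["expect"])
```

The point of the shape is that the case names the condition to replay and the behavior you expect, so the eval suite can fail loudly if the same outage ever hurts users again.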

Step 8: Make it your own

Change the outage severity and watch the response change. In a shell:

python -c "import oncall as o; slo=o.SLO('x',0.99); print(o.alert(slo, [True]*970+[False]*30, [True]*920+[False]*80))"

You should now see: the fast window is burning only 3x (under the page threshold) but the slow window is burning 8x, so you get a calmer ('page', 'sev2', ...). This is a sustained problem, not a flash fire.

Step 9: Show it

Post your postmortem from section E in the chat, or the one line from section F where your incident became a regression test.

If you get stuck

ModuleNotFoundError: oncall -> run from inside the folder that has oncall.py and runbook.py.
The numbers differ from this lab -> the demo is deterministic; if you edited the windows, your burn rates change too. Re-read burn_rate: it is bad_fraction / error_budget.
"Why did it not page?" -> check the fast window. A page needs a hot fast burn; a healthy fast window with a warm slow window is a ticket, by design.

Check yourself

Why is a 100% success target a mistake, and what is the error budget for?

100% is impossibly expensive and leaves no room to ship changes. The error budget is the failure your SLO deliberately allows (1% if the objective is 99%). While budget remains you can ship; when it is spent you stop and fix reliability. It turns "are we reliable enough?" into a number.

Why alert on burn rate over two windows instead of on the raw error rate?

The raw rate pages on every harmless blip (you learn to ignore it) or too late (you miss the outage). Burn rate is how fast you are spending the budget; the fast window catches real fires while the slow window confirms it is sustained, so the pager only fires when a human must act now.

Why mitigate before finding the root cause?

Users are hurting now. Mitigations (fallback model, cache, shed load) are safe, reversible moves that stop the bleeding immediately. Root-cause diagnosis can take hours; you do it after the service is healthy again, not before.

What makes a postmortem "blameless," and why does it matter?

It blames the system and its gaps ("no canary on the deploy"), never a person ("Sam pushed it"). Blame makes people hide incidents; a blameless review makes them share, so the whole team learns and the fix (often a new regression test) actually gets built.