Lab M34: run a full on-call shift (Part E capstone)

You'll need: Python and your venv. No API key, no cost, instant and deterministic (the service health and the releases are simulated). Time: about 45 minutes. Work in your breakout pair.

Heads up: this drill integrates the toolkits you built in M31–M33. The "service" stands in for the M27 support agent; its health depends on which release is live. Nothing real is touched.

This lab has two parts:

- Part A: run the shift end to end and read the trace to see which module owns each step.
- Part B: run the eval gate, then break a release on purpose and watch the gate catch it.

flowchart TB
  ST["alert storm + ticket"] -->|M32| INC["incident opens"]
  INC -->|M31 burn alert| PAGE["page sev1"]
  PAGE -->|M31 runbook → M33| RB["rollback → SLI recovers"]
  RB -->|M31 → M26| PM["postmortem → regression case"]
  PM -->|M33 canary| FIX["fix promoted"]
  FIX -->|M26| GATE["eval gate scores the shift"]

Part A: run the shift

Step 1: Set up

Copy the solution/ files into a folder and activate your venv. Nothing to install.

python -c "import parts, drill; print('capstone ok')"

You should now see: capstone ok. (If not: run it from inside the folder with the .py files.)

Step 2: Run the whole shift

python demo.py

You should now see sections A to G. Look at A and B:

==== A. AIOPS: collapse the alert storm (M32) ====
  40 alerts -> 5 incidents (page on causes, not symptoms)
==== B. DETECT: the on-call page (M31) ====
  paged=True  severity=sev1  reason=fast burn 20x (budget gone in hours)
  SLI was 80% against a 99% SLO

A bad deploy regressed answer quality to 80%; AIOps collapsed the storm, and the burn paged you at sev1. Intake and detection, two modules, one incident.
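The "20x" in section B is consistent with the usual burn-rate arithmetic: the observed error rate divided by the error rate the SLO allows. Here is a minimal sketch of that calculation. The function names and the 30-day budget window are assumptions for illustration, not taken from the lab files.

```python
def burn_rate(sli: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo  # allowed error fraction, e.g. 1% for a 99% SLO
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return (1.0 - sli) / budget


def hours_until_budget_gone(rate: float, window_days: int = 30) -> float:
    """Hours until the budget runs out at a constant burn rate.

    window_days is an assumed budget window, not a value from the lab.
    """
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


rate = burn_rate(sli=0.80, slo=0.99)
print(f"burn rate: {rate:.0f}x")
print(f"budget gone in about {hours_until_budget_gone(rate):.0f} hours")
```

With an 80% SLI against a 99% SLO, the error rate is 20% and the budget is 1%, which gives the 20x burn in the page reason.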

Step 3: Read the mitigation and the fix

Look at sections D and F:

==== D. MITIGATE: runbook -> rollback (M31 + M33) ====
  rolled back v2-regressed -> v1-stable
  SLI recovered 80% -> 100%  (stop the bleeding first)
==== F. SHIP THE FIX: canary the corrected release (M33) ====
  v3-fix canary decision: promote  ->  live is now v3-fix

The runbook's first play was a rollback (instant recovery); the real fix was then canaried and promoted. Mitigate before you diagnose; ship the fix behind a canary.
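The rollback play can be pictured as a pointer swap: the service tracks which release is live and which one was last known to be good. The sketch below is a hypothetical illustration of that idea. The Releases class and its methods are invented here and are not the code in drill.py.

```python
class Releases:
    """Tracks which release is live and the last one known to be good."""

    def __init__(self, live: str) -> None:
        self.live = live
        self.last_good = live

    def deploy(self, name: str) -> None:
        # A deploy changes what is live but does not vouch for it.
        self.live = name

    def mark_good(self) -> None:
        # Called only once the live release has proven healthy.
        self.last_good = self.live

    def rollback(self) -> str:
        # One call, no diagnosis needed: return to the last-good release.
        previous, self.live = self.live, self.last_good
        return f"rolled back {previous} -> {self.live}"


releases = Releases(live="v1-stable")
releases.deploy("v2-regressed")  # the bad deploy, never marked good
print(releases.rollback())
```

Because the rollback needs no knowledge of what went wrong, it can run first, and the diagnosis can wait until the SLI has recovered.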

Step 4: Read the trace, module by module

Look at section G. Each line names the module that owns it (M32 → M31 → M31+M33 → M31→M26 → M33). Open drill.py and match each trace line to the call that produced it. You should now see: the whole shift is the toolkits from M31–M33 in sequence, nothing new.

Part B: the gate, and breaking it on purpose

Step 5: Run the eval gate

python evals.py

You should now see:

  [PASS] AIOps correlated the storm
  ...
  [PASS] fix canaried & promoted to live
7/7 checks passed
EVAL GATE: PASS (exit 0) - the whole shift behaved correctly

The gate scored the entire shift and exited 0. This is what you would run in CI (M26).
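The shape of such a gate can be sketched as a list of named checks whose combined result becomes the exit code. This is an assumption about the general pattern, not the contents of evals.py. The run_gate function and the two demo checks are hypothetical.

```python
from typing import Callable

Check = tuple[str, Callable[[], bool]]


def run_gate(checks: list[Check]) -> int:
    """Print one PASS/FAIL line per check and return a CI-style exit code."""
    passed = 0
    for name, check in checks:
        ok = bool(check())
        passed += ok
        print(f"  [{'PASS' if ok else 'FAIL'}] {name}")
    print(f"{passed}/{len(checks)} checks passed")
    return 0 if passed == len(checks) else 1


# Hypothetical checks; a real gate would inspect the results of the shift.
demo_checks: list[Check] = [
    ("storm collapsed to fewer incidents", lambda: 5 < 40),
    ("SLI meets the SLO after rollback", lambda: 1.00 >= 0.99),
]

exit_code = run_gate(demo_checks)
# A script would finish with sys.exit(exit_code) so CI can read the verdict.
print(f"exit code: {exit_code}")
```

A single failing check is enough to make the return value non-zero, which is the only signal CI needs.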

Step 6: Break the fix and watch the gate catch it

Open drill.py and make the "fix" secretly reintroduce the bug by changing v3_fix so that the refund answer is wrong:

def v3_fix(q):
    return {"refund window": "7 days"}.get(q, ANSWERS[q])   # oops, still broken

Run the gate again:

python evals.py; echo "exit: $?"

You should now see the canary reject the broken fix, so it is never promoted, and the gate fail:

  [FAIL] fix canaried & promoted to live
6/7 checks passed
EVAL GATE: FAIL (exit 1) - a Part E pattern did not hold
exit: 1

The canary refused to ship a release that did not hold quality, and the gate turned red. Undo your change to make it green again. (This is the M26 loop: a regression makes the exit code non-zero.)

Step 7: Show it

Post in the chat your section G trace (all three modules firing on one incident), or the moment in Step 6 where the gate caught the broken fix.

If you get stuck

ModuleNotFoundError -> run from inside the folder with parts.py and drill.py.
The gate is green after Step 6 -> make sure you edited v3_fix (the fix), not v2_regressed, and saved the file. The canary compares the fix to the live baseline on the eval set.
"Which line is which module?" -> every block in parts.py has a header naming its source module (M31, M32, or M33).

Check yourself

Why mitigate (roll back) before finding the root cause?

Users are hurting now. A rollback to the last-good release is a safe, one-call way to restore service immediately. Root-cause analysis can take hours and is done after the service recovers, not before. The drill recovers the SLI from 80% to 100% the instant it rolls back.

How does the regression case (M31 → M26) make the system safer?

The incident's postmortem produces an eval case that the bad release fails and the fix passes. Added to the gate, it means CI will catch that exact bug before it can ship again. Each incident leaves behind a test, so the system gets more reliable over time.
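One way to picture a regression case is as a small reusable check built from the postmortem finding: a question plus the answer the service must give. The sketch below is hypothetical. The release functions, the regression_case helper, and the "30 days" expected answer are placeholders invented for illustration, not values from the lab.

```python
from typing import Callable

Release = Callable[[str], str]

# Placeholder data: the expected answer is invented for this sketch.
EXPECTED = {"refund window": "30 days"}


def fixed_release(q: str) -> str:
    # Stands in for a release that answers correctly.
    return EXPECTED[q]


def bad_release(q: str) -> str:
    # Stands in for the regressed release: one answer is wrong.
    return "7 days" if q == "refund window" else EXPECTED[q]


def regression_case(question: str, expected: str) -> Callable[[Release], bool]:
    """Turn a postmortem finding into a check any release can be run against."""
    def case(release: Release) -> bool:
        return release(question) == expected
    return case


case = regression_case("refund window", EXPECTED["refund window"])
print("bad release passes:", case(bad_release))
print("fixed release passes:", case(fixed_release))
```

The case is useful precisely because it distinguishes the two releases: it fails on the one that caused the incident and passes on the one that fixes it.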

Why canary the fix instead of just deploying it?

The original bug reached users because a release went out with no canary. The canary scores the candidate against the live baseline on the eval set (now including the regression case) and promotes only if quality holds, so a "fix" that re-breaks something is rejected before users see it.
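The canary decision described here can be sketched as a score comparison. This is a minimal illustration under stated assumptions: "quality holds" is taken to mean the candidate scores at least as well as the baseline, and the eval set, the release functions, and the helper names are all hypothetical placeholders, not the lab's code.

```python
from typing import Callable

Release = Callable[[str], str]

# Placeholder eval set: questions and answers are invented for this sketch.
EVAL_SET = {"refund window": "30 days", "support hours": "9 to 5"}


def score(release: Release, eval_set: dict[str, str]) -> float:
    """Fraction of eval questions the release answers correctly."""
    if not eval_set:
        raise ValueError("eval set is empty")
    hits = sum(release(q) == want for q, want in eval_set.items())
    return hits / len(eval_set)


def canary_decision(candidate: Release, baseline: Release,
                    eval_set: dict[str, str]) -> str:
    """Promote only if the candidate is at least as good as the baseline."""
    if score(candidate, eval_set) >= score(baseline, eval_set):
        return "promote"
    return "reject"


def baseline(q: str) -> str:
    return EVAL_SET[q]  # the live release answers everything correctly


def good_fix(q: str) -> str:
    return EVAL_SET[q]


def broken_fix(q: str) -> str:
    # A "fix" that reintroduces the wrong refund answer.
    return "7 days" if q == "refund window" else EVAL_SET[q]


print("good fix:", canary_decision(good_fix, baseline, EVAL_SET))
print("broken fix:", canary_decision(broken_fix, baseline, EVAL_SET))
```

Comparing against the live baseline rather than a fixed threshold means the candidate is judged against what users currently get, so it can never ship a step backwards.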

What does the eval gate's exit code give you?

A machine-checkable verdict on the whole shift. Exit 0 means every Part E pattern held; non-zero means one did not. That is exactly what CI (M26) keys on to block a bad change automatically.