Lab M34: run a full on-call shift (Part E capstone)
You'll need: Python and your venv. No API key, no cost, instant and deterministic (the service health and the releases are simulated). Time: about 45 minutes. Work in your breakout pair.
Heads up: this drill integrates the toolkits you built in M31–M33. The "service" stands in for the M27 support agent; its health depends on which release is live. Nothing real is touched.
This lab has two parts:
- Part A: run the shift end to end and read the trace to see which module owns each step.
- Part B: run the eval gate, then break a release on purpose and watch the gate catch it.
flowchart TB
ST["alert storm + ticket"] -->|M32| INC["incident opens"]
INC -->|M31 burn alert| PAGE["page sev1"]
PAGE -->|M31 runbook → M33| RB["rollback → SLI recovers"]
RB -->|M31 → M26| PM["postmortem → regression case"]
PM -->|M33 canary| FIX["fix promoted"]
FIX -->|M26| GATE["eval gate scores the shift"]
Part A: run the shift
Step 1: Set up
Copy the solution/ files into a folder and activate your venv. Nothing to install.
python -c "import parts, drill; print('capstone ok')"
You should see: capstone ok. If not, run the command from inside the folder that contains the .py files.
Step 2: Run the whole shift
python demo.py
==== A. AIOPS: collapse the alert storm (M32) ====
40 alerts -> 5 incidents (page on causes, not symptoms)
==== B. DETECT: the on-call page (M31) ====
paged=True severity=sev1 reason=fast burn 20x (budget gone in hours)
SLI was 80% against a 99% SLO
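The "fast burn 20x" figure in section B follows from the two numbers on the SLI line. Here is a minimal sketch of that arithmetic, assuming the common burn-rate definition (observed error rate divided by the error budget the SLO allows). The function name `burn_rate` is illustrative and is not taken from the M31 toolkit.

```python
def burn_rate(sli: float, slo: float) -> float:
    """Return how many times faster than allowed the error budget is burning."""
    error_budget = 1.0 - slo   # allowed failure fraction: 1% for a 99% SLO
    error_rate = 1.0 - sli     # observed failure fraction: 20% at an 80% SLI
    return error_rate / error_budget


rate = burn_rate(sli=0.80, slo=0.99)
print(f"burn rate: {rate:.0f}x")  # prints "burn rate: 20x"
```

At 20x, a budget meant to last a full SLO window is spent in one twentieth of that window, which is why the page says the budget is gone in hours.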
Step 3: Read the mitigation and the fix
Look at sections D and F:
==== D. MITIGATE: runbook -> rollback (M31 + M33) ====
rolled back v2-regressed -> v1-stable
SLI recovered 80% -> 100% (stop the bleeding first)
==== F. SHIP THE FIX: canary the corrected release (M33) ====
v3-fix canary decision: promote -> live is now v3-fix
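Section D hinges on one measurement: the SLI of whichever release is live. The sketch below is an assumption about how that could be modelled, not the code in drill.py. It treats the SLI as the share of eval questions a release answers correctly, and treats rollback as swapping the live release back to the last stable one. The question set, the `sli` helper, and the release bodies are hypothetical; only the release names come from the output above.

```python
# Hypothetical eval set: question -> correct answer (not the lab's real data).
ANSWERS = {
    "refund window": "30 days",
    "shipping time": "3-5 days",
    "support hours": "9am-5pm",
    "return label": "emailed",
    "warranty": "1 year",
}


def v1_stable(q):
    return ANSWERS[q]


def v2_regressed(q):
    # One of the five answers is wrong, so the SLI drops to 80%.
    return "7 days" if q == "refund window" else ANSWERS[q]


def sli(release):
    """Fraction of eval questions the release answers correctly."""
    correct = sum(release(q) == a for q, a in ANSWERS.items())
    return correct / len(ANSWERS)


live = v2_regressed
before = sli(live)
live = v1_stable  # the rollback: stop the bleeding first
after = sli(live)
print(f"SLI recovered {before:.0%} -> {after:.0%}")  # SLI recovered 80% -> 100%
```

The rollback does not diagnose anything; it only restores a release known to score 100%, which buys time for the postmortem and the real fix.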
Step 4: Read the trace, module by module
Look at section G. Each line names the module that owns it (M32 → M31 → M31+M33 → M31→M26 → M33).
Open drill.py and match each trace line to the call that produced it.
You should now see: the whole shift is the toolkits from M31–M33 in sequence, nothing new.
Part B: the gate, and breaking it on purpose
Step 5: Run the eval gate
python evals.py
[PASS] AIOps correlated the storm
...
[PASS] fix canaried & promoted to live
7/7 checks passed
EVAL GATE: PASS (exit 0) - the whole shift behaved correctly
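An eval gate of this shape can be pictured as a list of named checks plus an exit code that a shell or CI job can act on. Here is a minimal sketch, assuming each check is a (label, passed) pair. The `run_gate` function and the sample results are hypothetical; they mirror the output format above, not the contents of evals.py.

```python
def run_gate(checks):
    """Print one line per check and return a process exit code (0 = pass)."""
    passed = 0
    for label, ok in checks:
        print(f"[{'PASS' if ok else 'FAIL'}] {label}")
        passed += 1 if ok else 0
    print(f"{passed}/{len(checks)} checks passed")
    exit_code = 0 if passed == len(checks) else 1
    print(f"EVAL GATE: {'PASS' if exit_code == 0 else 'FAIL'} (exit {exit_code})")
    return exit_code


# Placeholder results; a real gate would compute each boolean from the drill.
checks = [
    ("AIOps correlated the storm", True),
    ("fix canaried & promoted to live", True),
]
exit_code = run_gate(checks)
# A script would finish with sys.exit(exit_code) so the shell sees the result.
```

Returning the code instead of exiting inside the function keeps the gate easy to call from tests as well as from the command line.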
Step 6: Break the fix and watch the gate catch it
Open drill.py and make the "fix" secretly reintroduce the bug: change v3_fix so that the refund answer is wrong.
def v3_fix(q):
    return {"refund window": "7 days"}.get(q, ANSWERS[q])  # oops, still broken
python evals.py; echo "exit: $?"
[FAIL] fix canaried & promoted to live
6/7 checks passed
EVAL GATE: FAIL (exit 1) - a Part E pattern did not hold
exit: 1
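Why the gate goes red: the canary compares the fix to the live baseline on the eval set, so a fix that still answers the refund question wrongly cannot be promoted. The following is a sketch under assumptions. The `ANSWERS` contents, the `accuracy` and `canary` helpers, and the promote rule (the candidate must be at least as accurate as the baseline) are hypothetical stand-ins for what parts.py and drill.py do.

```python
# Hypothetical eval set; the lab's real ANSWERS may differ.
ANSWERS = {"refund window": "30 days", "shipping time": "3-5 days"}


def v1_stable(q):
    return ANSWERS[q]


def v3_fix_broken(q):
    # Same shape as the Step 6 edit: the override wins for the refund question.
    return {"refund window": "7 days"}.get(q, ANSWERS[q])


def accuracy(release):
    """Fraction of eval questions the release answers correctly."""
    return sum(release(q) == a for q, a in ANSWERS.items()) / len(ANSWERS)


def canary(candidate, baseline):
    """Promote only if the candidate is at least as accurate as the baseline."""
    return "promote" if accuracy(candidate) >= accuracy(baseline) else "reject"


decision = canary(v3_fix_broken, v1_stable)
print(f"v3-fix canary decision: {decision}")  # v3-fix canary decision: reject
```

Because the broken fix never reaches live, the check "fix canaried & promoted to live" fails, and that single failure is enough to flip the exit code to 1.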
Step 7: Show it
Post one of these in the chat: your section G trace (all three modules firing on one incident), or the moment in Step 6 where the gate caught the broken fix.
If you get stuck
- ModuleNotFoundError -> run from inside the folder with parts.py and drill.py.
- The gate is green after Step 6 -> make sure you edited v3_fix (the fix), not v2_regressed, and saved the file. The canary compares the fix to the live baseline on the eval set.
- "Which line is which module?" -> every block in parts.py has a header naming its source module (M31, M32, or M33).