
M34: Part E capstone, the on-call shift (Part E: Operations Support)

You built the safeguarding layer in three pieces: incident response (M31), the support desk and AIOps (M32), and data and release operations (M33). A real on-call shift is not those pieces sitting in three folders; it is all of them firing on ONE outage. Today you run that shift end to end. A bad deploy ships overnight and an alert storm collapses into incidents. The burning error budget pages you, and a user ticket comes in about the same problem. You roll back to stop the bleeding, write the postmortem that becomes a regression test, and canary the fix back into production, with an eval gate scoring the whole shift. This is the portfolio piece that shows you can operate an AI system.

Today's win: one integrated on-call drill where M31 + M32 + M33 handle a single incident end to end (detect → mitigate → learn → ship the fix), with a green eval gate (M26) over the whole shift.

Today you will

  • Watch an alert storm correlate into incidents (M32), then watch the error-budget burn page you at sev1 (M31)
  • See a user ticket about the same outage triaged and linked (M32)
  • Mitigate by rolling back the bad release (M31 runbook → M33 rollback) and watch the SLI recover
  • Turn the incident into a postmortem and a regression eval (M31 → M26)
  • Canary the fix back to live (M33), and run an eval gate that scores the entire shift (M26)
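To make the first steps of that list concrete, here is a minimal sketch of how an alert storm might collapse into incidents and how an error-budget burn rate might decide whether to page at sev1. This is not the code in drill.py or parts.py. The names `Alert`, `correlate`, `burn_rate`, and `severity`, the 60-second grouping window, the 99.9% SLO, and the 14.4 paging threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str   # hypothetical field: which service raised the alert
    ts: float      # seconds since the start of the shift
    message: str


def correlate(alerts, window_s=60.0):
    """Group alerts per service; an alert within window_s of the previous
    alert for the same service joins that incident (assumed rule)."""
    incidents = []
    open_by_service = {}
    for alert in sorted(alerts, key=lambda a: a.ts):
        incident = open_by_service.get(alert.service)
        if incident is not None and alert.ts - incident[-1].ts <= window_s:
            incident.append(alert)
        else:
            incident = [alert]
            open_by_service[alert.service] = incident
            incidents.append(incident)
    return incidents


def burn_rate(errors, requests, slo=0.999):
    """Observed error rate divided by the error rate the SLO allows."""
    if requests == 0:
        return 0.0
    return (errors / requests) / (1.0 - slo)


def severity(rate, page_threshold=14.4):
    """Page at sev1 when the budget burns faster than the assumed threshold."""
    return "sev1" if rate >= page_threshold else "sev3"


storm = [
    Alert("api", 0, "5xx spike"),
    Alert("search", 5, "latency high"),
    Alert("api", 10, "5xx spike"),
    Alert("api", 20, "timeout"),
    Alert("api", 500, "5xx spike"),
]
incidents = correlate(storm)
page = severity(burn_rate(errors=50, requests=1000))
```

In this sketch the five alerts collapse into three incidents, and a 5% error rate against a 99.9% SLO burns the budget fast enough to page. The real drill may correlate on different signals; the point is only that many alerts become few incidents before anyone is paged.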

Run of show (about 70 minutes)

Time    What we do
0:00    Hook: the safeguards become one on-call shift
0:05    Tour the drill and the map of which line is which module (read notes.md)
0:15    Lab Part A: run the shift end to end and read its trace
0:35    Lab Part B: run the eval gate, break a release on purpose, and watch the gate catch it
0:58    Show: post the trace where all three modules fired on one incident
1:10    Wrap and Part E retrospective
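Lab Part B relies on a gate that scores the shift's trace. As a rough illustration of the idea, and not the contents of evals.py, the sketch below assumes a trace is a list of dicts with a hypothetical `phase` key and an optional `sli` key. It checks that the four phases appear in order and that the SLI recovered. The function names and the 0.99 target are assumptions.

```python
REQUIRED_ORDER = ["detect", "mitigate", "learn", "ship"]


def check_order(trace):
    """Pass if the required phases appear in order; other events may sit between them."""
    position = 0
    for event in trace:
        if position < len(REQUIRED_ORDER) and event["phase"] == REQUIRED_ORDER[position]:
            position += 1
    return position == len(REQUIRED_ORDER)


def check_sli_recovered(trace, target=0.99):
    """Pass if the last recorded SLI is at or above the assumed target."""
    slis = [event["sli"] for event in trace if "sli" in event]
    return bool(slis) and slis[-1] >= target


def run_gate(trace):
    """Return (green, per-check results); the gate is green only if every check passes."""
    results = {
        "phases_in_order": check_order(trace),
        "sli_recovered": check_sli_recovered(trace),
    }
    return all(results.values()), results


good_shift = [
    {"phase": "detect", "sli": 0.91},
    {"phase": "mitigate", "sli": 0.995},
    {"phase": "learn"},
    {"phase": "ship", "sli": 0.997},
]
# A deliberately broken release: nobody rolled back, so the SLI never recovers.
broken_shift = [
    {"phase": "detect", "sli": 0.91},
    {"phase": "learn"},
    {"phase": "ship", "sli": 0.90},
]
good_green, good_results = run_gate(good_shift)
broken_green, broken_results = run_gate(broken_shift)
```

Returning the per-check results alongside the overall verdict is a design choice worth keeping in any real gate: when it goes red, you can see which check caught the break.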

If you get stuck

  • This module integrates M31–M33 and ties to M26 (the gate) and M27 (the agent being operated). Each block in parts.py names the module it came from.
  • The whole capstone runs offline, free, no key (the service health and releases are simulated). No new installs. Nothing here can touch a real system.
  • Read drill.py top to bottom (it is the shift told as a numbered story), then run python evals.py to see the gate score it.

Optional challenge

Open starters/extend.py and turn the drill into your own portfolio piece: add a second incident type (for example, a stale-index outage that the runbook fixes by reindexing instead of rolling back), and add a matching eval check. Follow the M26 rule: add the check first, then make it pass, and keep python solution/evals.py green.
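One possible shape for that extension is sketched below. It does not reflect what starters/extend.py actually contains. The `RUNBOOK` table, the state keys, and the function names are all hypothetical. In check-first order you would write `check_stale_index_reindexed` first, watch it fail with a `KeyError` because the runbook has no `"stale_index"` entry, and then add `reindex` to make it pass.

```python
def rollback(state):
    """Mitigation for a bad deploy: return to the last known-good release."""
    return {**state, "release": state["last_good_release"]}


def reindex(state):
    """Mitigation for a stale index: rebuild it and leave the release alone."""
    return {**state, "index_fresh": True}


# Hypothetical runbook: maps an incident type to its mitigation step.
RUNBOOK = {
    "bad_deploy": rollback,
    "stale_index": reindex,  # the new entry that makes the new check pass
}


def mitigate(incident_type, state):
    """Look up the mitigation for this incident type and apply it."""
    return RUNBOOK[incident_type](state)


def check_stale_index_reindexed():
    """Eval check written first: a stale-index incident must be fixed
    by reindexing, without rolling the release back."""
    before = {"release": "v2", "last_good_release": "v1", "index_fresh": False}
    after = mitigate("stale_index", before)
    return after["index_fresh"] and after["release"] == "v2"


def check_bad_deploy_rolled_back():
    """The existing behaviour must keep working after the extension."""
    before = {"release": "v2", "last_good_release": "v1", "index_fresh": True}
    after = mitigate("bad_deploy", before)
    return after["release"] == "v1"
```

Keeping the old rollback check next to the new one guards against the extension quietly breaking the original incident path.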