M35: Operations Support, going deeper (Part E: Operations Support)

The Part E orientation named five topics that round out the operations picture (structured logging, dashboards & SLIs, online evaluation, capacity & rate limits, and continuous improvement) and presented them as concepts to recognize. This optional module makes each one hands-on with five small, self-contained mini-labs you can run in a minute, each extending a module you already built. None is big enough to be its own module; together they cover the practical corners of operating an AI system that the earlier modules pointed at but did not drill.

Today's win: five runnable operations tools (structured logs you can query, a golden-signals dashboard, an online-eval drift detector, a rate limiter with quotas, and a reliability flywheel), each demonstrated offline in seconds.

Today you will

Emit structured, correlated logs and pull one failing request's whole story (extends M20)
Compute the four golden signals and SLO burn, show them in a tiny dashboard, and see what breaches (M20/M31)
Run online evaluation on sampled live traffic and catch drift the offline gate missed (M26/M30)
Enforce a rate limit, a quota, and a concurrency limit and watch excess load get shed (M25/M29)
Turn incidents into regression guards and watch the reliability flywheel cut repeats (M30/M31)
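As a taste of the first lab, here is a minimal sketch of structured, correlated logging. It is not the lab's actual code: the names `log_event`, `request_story`, and `failing_request_ids`, the field names, and the sample events are all assumptions made for illustration. The idea is that every log line is a JSON record carrying the same `request_id`, so one filter recovers a failing request's whole story.

```python
import json

def log_event(sink, request_id, stage, level, message, **fields):
    """Append one JSON log line that carries the request's correlation ID."""
    record = {
        "request_id": request_id,  # correlation ID shared by every line of a request
        "stage": stage,
        "level": level,
        "message": message,
        **fields,
    }
    sink.append(json.dumps(record, sort_keys=True))

def request_story(sink, request_id):
    """Return the parsed records for one request, in the order they were logged."""
    records = (json.loads(line) for line in sink)
    return [r for r in records if r["request_id"] == request_id]

def failing_request_ids(sink):
    """Return the IDs of requests that logged at least one error, first seen first."""
    seen = []
    for line in sink:
        record = json.loads(line)
        if record["level"] == "error" and record["request_id"] not in seen:
            seen.append(record["request_id"])
    return seen

# Hypothetical sample traffic: two interleaved requests, one of which fails.
logs = []
log_event(logs, "req-1", "retrieve", "info", "fetched documents", doc_count=3)
log_event(logs, "req-2", "retrieve", "info", "fetched documents", doc_count=0)
log_event(logs, "req-1", "generate", "info", "answer produced", latency_ms=120)
log_event(logs, "req-2", "generate", "error", "no context to answer from", latency_ms=45)

for rid in failing_request_ids(logs):
    for rec in request_story(logs, rid):
        print(rid, rec["stage"], rec["level"], rec["message"])
```

Because the lines are interleaved in the sink, the correlation ID is what makes the story recoverable; free-text logs would need fragile pattern matching to do the same.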

Run of show (about 50 minutes, ~10 min per lab)

Time	What we do
0:00	Why these five round out the operations picture (read `notes.md`)
0:05	Labs 1–2: structured logging, then the dashboard
0:20	Labs 3–4: online eval, then rate limits & quotas
0:38	Lab 5: the continuous-improvement flywheel
0:46	Show: post the breach your dashboard flagged, or the drift online eval caught
0:50	Wrap

If you get stuck

Optional / go-deeper. Best after M31–M34; each lab names the module it extends. Read the Part E orientation first for where these fit.
Every lab runs offline, instantly, and deterministically, at no cost and with no API key. No new libraries.
Run `solution/demo.py` to see all five, then open the one script you want to study.

Optional challenge

Open `starters/extend.py` and add a composite alert to the dashboard: page only when two or more golden signals breach at the same time, instead of on any single one. This is the real-world fix for alert fatigue: one breaching signal is often noise, but several at once is an incident.
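One possible shape for the composite alert is sketched below. It does not reflect the actual contents of the starter file: the function names, the signal keys, and the threshold values are assumptions chosen for illustration, and each signal is treated as breaching when its value exceeds its threshold.

```python
def breaching_signals(signals, thresholds):
    """Return the names of signals whose value exceeds its threshold, sorted."""
    return sorted(name for name, value in signals.items() if value > thresholds[name])

def should_page(signals, thresholds, min_breaches=2):
    """Page only when at least `min_breaches` signals breach at the same time."""
    return len(breaching_signals(signals, thresholds)) >= min_breaches

# Hypothetical thresholds for the four golden signals.
THRESHOLDS = {
    "latency_p95_ms": 800,
    "error_rate": 0.02,
    "traffic_rps": 50,
    "saturation": 0.85,
}

# One signal over its limit: likely noise, so no page.
noisy = {"latency_p95_ms": 950, "error_rate": 0.01, "traffic_rps": 20, "saturation": 0.40}

# Three signals over their limits at once: treat as an incident.
incident = {"latency_p95_ms": 950, "error_rate": 0.06, "traffic_rps": 20, "saturation": 0.90}

print(should_page(noisy, THRESHOLDS), breaching_signals(noisy, THRESHOLDS))
print(should_page(incident, THRESHOLDS), breaching_signals(incident, THRESHOLDS))
```

Keeping `min_breaches` as a parameter makes the trade-off explicit: a value of 1 reproduces the single-signal paging the challenge is meant to replace.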