M35: Operations Support, going deeper (Part E: Operations Support)
The Part E orientation named five topics that round out the operations picture (structured logging, dashboards & SLIs, online evaluation, capacity & rate limits, and continuous improvement) and presented them as concepts to recognize. This optional module makes each one hands-on: five small, self-contained mini-labs you can run in a minute, each extending a module you already built. None is big enough to be its own module; together they are the practical corners of operating an AI system that the earlier modules pointed at but did not drill.
Today's win: five runnable operations tools (structured logs you can query, a golden-signals dashboard, an online-eval drift detector, a rate limiter with quotas, and a reliability flywheel), each demonstrated offline in seconds.
Today you will
- Emit structured, correlated logs and pull one failing request's whole story (extends M20)
- Compute the four golden signals + SLO burn into a tiny dashboard and see what breaches (M20/M31)
- Run online evaluation on sampled live traffic and catch drift the offline gate missed (M26/M30)
- Enforce a rate limit, a quota, and a concurrency limit and watch excess load get shed (M25/M29)
- Turn incidents into regression guards and watch the reliability flywheel cut repeats (M30/M31)
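To make the load-shedding goal above concrete, here is a minimal sketch of one common way to enforce a rate limit: a token bucket. It is not the lab's actual script. The class name `TokenBucket`, its fields, and the arrival times are hypothetical, and timestamps are passed in explicitly so the run stays deterministic and offline.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical limiter: refills `rate` tokens per second, up to `capacity`."""

    rate: float        # tokens added per second
    capacity: float    # maximum burst size
    tokens: float = 0.0
    last: float = 0.0  # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the previous call, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0  # spend one token to admit this request
            return True
        return False  # no token left: shed the request


# Assumed scenario: a burst of five requests at t=0, then one more at t=1.
bucket = TokenBucket(rate=1.0, capacity=2.0, tokens=2.0)
arrivals = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
decisions = [bucket.allow(t) for t in arrivals]
print(decisions)  # [True, True, False, False, False, True]
```

The burst uses up the two stored tokens, the next three requests are shed, and the request one second later is admitted again after the refill. A quota or a concurrency limit would need separate counters; this sketch covers only the rate limit.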
Run of show (about 50 minutes, ~10 min per lab)
| Time | What we do |
|---|---|
| 0:00 | Why these five round out the operations picture (read notes.md) |
| 0:05 | Labs 1–2: structured logging, then the dashboard |
| 0:20 | Labs 3–4: online eval, then rate limits & quotas |
| 0:38 | Lab 5: the continuous-improvement flywheel |
| 0:46 | Show: post the breach your dashboard flagged, or the drift online eval caught |
| 0:50 | Wrap |
If you get stuck
- Optional / go-deeper. Best after M31–M34; each lab names the module it extends. Read the Part E orientation first for where these fit.
- Every lab runs offline, free, no key, instantly, and deterministically. No new libraries.
- Run solution/demo.py to see all five, then open the one script you want to study.
Optional challenge
Open starters/extend.py and add a composite alert to the dashboard: page only when two or more golden signals breach at the same time, instead of on any single one. This is a common real-world fix for alert fatigue: one breaching signal is often noise; several at once is an incident.
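One possible shape for the composite rule is sketched below. It does not reflect the contents of starters/extend.py. The signal names, the threshold values, and the helpers `breaching` and `should_page` are all assumptions made for illustration.

```python
# Hypothetical thresholds for the four golden signals; a signal breaches
# when its current value is strictly above its threshold.
THRESHOLDS = {
    "latency_p95_ms": 800.0,
    "traffic_rps": 50.0,
    "error_rate": 0.02,
    "saturation": 0.85,
}


def breaching(signals: dict[str, float]) -> list[str]:
    """Return the names of the signals currently above threshold, sorted."""
    return sorted(name for name, value in signals.items() if value > THRESHOLDS[name])


def should_page(signals: dict[str, float], min_breaches: int = 2) -> bool:
    """Composite alert: page only when at least `min_breaches` signals breach together."""
    return len(breaching(signals)) >= min_breaches


# One signal over threshold: treated as noise, no page.
noisy = {"latency_p95_ms": 950.0, "traffic_rps": 20.0, "error_rate": 0.01, "saturation": 0.40}
# Three signals over threshold at once: treated as an incident, page.
incident = {"latency_p95_ms": 950.0, "traffic_rps": 20.0, "error_rate": 0.05, "saturation": 0.90}

print(breaching(noisy), should_page(noisy))        # ['latency_p95_ms'] False
print(breaching(incident), should_page(incident))  # ['error_rate', 'latency_p95_ms', 'saturation'] True
```

Keeping `min_breaches` as a parameter makes it easy to compare the composite rule against the single-signal rule (`min_breaches=1`) on the same data.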