M35: Operations Support, going deeper (Part E: Operations Support)

The Part E orientation named five topics that round out the operations picture (structured logging, dashboards & SLIs, online evaluation, capacity & rate limits, and continuous improvement) and presented them only as concepts to recognize. This optional module makes each one hands-on with five small, self-contained mini-labs. Each runs in about a minute and extends a module you already built. None is big enough to be its own module; together they cover the practical corners of operating an AI system that the earlier modules pointed at but did not drill.

Today's win: five runnable operations tools (structured logs you can query, a golden-signals dashboard, an online-eval drift detector, a rate limiter with quotas, and a reliability flywheel), each demonstrated offline in seconds.

Today you will

  • Emit structured, correlated logs and pull one failing request's whole story (extends M20)
  • Compute the four golden signals plus SLO burn, show them in a tiny dashboard, and see which ones breach (M20/M31)
  • Run online evaluation on sampled live traffic and catch drift the offline gate missed (M26/M30)
  • Enforce a rate limit, a quota, and a concurrency limit and watch excess load get shed (M25/M29)
  • Turn incidents into regression guards and watch the reliability flywheel cut repeats (M30/M31)
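
To make the rate-limit item above concrete, here is a minimal token-bucket sketch. It is an illustration under stated assumptions, not the lab's actual script: the `TokenBucket` class, the `simulate` helper, and the numbers are all hypothetical, and time is passed in explicitly so the run stays offline and deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Hypothetical token-bucket limiter (illustrative, not the course's code)."""

    rate: float      # tokens refilled per second (the sustained limit)
    capacity: float  # maximum tokens held (the allowed burst)
    tokens: float = field(init=False)
    last: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        # Start full so an initial burst up to `capacity` is admitted.
        self.tokens = self.capacity

    def allow(self, now: float) -> bool:
        """Admit or shed one request arriving at time `now` (in seconds)."""
        elapsed = max(0.0, now - self.last)
        # Refill for the elapsed time, never exceeding the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # not enough tokens: the request is shed


def simulate(arrivals, rate, capacity):
    """Replay a list of arrival times and return one admit/shed decision each."""
    bucket = TokenBucket(rate=rate, capacity=capacity)
    return [bucket.allow(t) for t in arrivals]


# Ten requests arriving 0.125 s apart (8 req/s) against a 2 req/s limit, burst 3.
arrivals = [i * 0.125 for i in range(10)]
decisions = simulate(arrivals, rate=2.0, capacity=3.0)
admitted = sum(decisions)
shed = len(decisions) - admitted
print(f"admitted={admitted} shed={shed}")
```

The first three requests ride the burst allowance; after that, only requests that arrive once a full token has refilled are admitted, and the rest are shed. A per-user quota or a concurrency cap would be separate checks layered alongside this one.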

Run of show (about 50 minutes, ~10 min per lab)

Time   What we do
0:00   Why these five round out the operations picture (read notes.md)
0:05   Labs 1–2: structured logging, then the dashboard
0:20   Labs 3–4: online eval, then rate limits & quotas
0:38   Lab 5: the continuous-improvement flywheel
0:46   Show: post the breach your dashboard flagged, or the drift online eval caught
0:50   Wrap

If you get stuck

  • Optional / go-deeper. Best after M31–M34; each lab names the module it extends. Read the Part E orientation first for where these fit.
  • Every lab runs offline, instantly, and deterministically, at no cost and with no API key. No new libraries.
  • Run solution/demo.py to see all five, then open the one script you want to study.

Optional challenge

Open starters/extend.py and add a composite alert to the dashboard: page only when two or more golden signals breach at the same time, instead of on any single one. This is the real-world fix for alert fatigue: one breaching signal is often noise, but several at once is an incident.
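
If you want a starting point, the composite rule might look like the sketch below. Everything in it is an assumption for illustration: the signal names, the threshold values, and the `breaching` and `should_page` helpers are hypothetical and will not match what starters/extend.py actually exposes.

```python
# Hypothetical thresholds for the four golden signals; the course's dashboard
# may use different names, units, and values.
THRESHOLDS = {
    "latency_p95_ms": 800.0,
    "error_rate": 0.02,
    "traffic_rps": 50.0,
    "saturation": 0.85,
}


def breaching(signals, thresholds=THRESHOLDS):
    """Return the sorted names of signals whose value exceeds its threshold."""
    return sorted(
        name
        for name, limit in thresholds.items()
        if signals.get(name, 0.0) > limit
    )


def should_page(signals, min_breaches=2, thresholds=THRESHOLDS):
    """Page only when at least `min_breaches` signals breach at the same time."""
    return len(breaching(signals, thresholds)) >= min_breaches


# One slow reading on its own: treated as noise, no page.
noisy = {"latency_p95_ms": 950.0, "error_rate": 0.01,
         "traffic_rps": 20.0, "saturation": 0.40}

# Latency, errors, and saturation all breaching together: treated as an incident.
incident = {"latency_p95_ms": 950.0, "error_rate": 0.06,
            "traffic_rps": 20.0, "saturation": 0.90}

print(breaching(noisy), should_page(noisy))
print(breaching(incident), should_page(incident))
```

Keeping `min_breaches` as a parameter lets you compare the single-signal rule (`min_breaches=1`) against the composite one on the same data and count how many pages each would have sent.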