M35 notes: going deeper (the one idea)

The one idea: the headline operations skills are incident response, the support desk, and release safety (M31–M34). Around them sit five smaller, unglamorous tools that make the headline ones actually work: you cannot debug an incident without queryable logs, cannot alert without a dashboard of the right signals, cannot trust production without online evaluation, cannot survive success without capacity limits, and cannot improve without turning each incident into a guard. Each is small enough to hold in your head; this module is one runnable example of each.

The Part E orientation explains the concepts; these notes are about the build: what each mini-lab actually demonstrates and the operations lesson it carries.

1. Structured logging & correlation (extends M20)

Build: a logger that writes records ({request_id, event, tool, latency_ms, error}) instead of prose, plus query() (find every failed retrieval) and correlate() (one request's whole story). Lesson: the first move in any incident is "show me everything that happened to this request." That is only possible if every log line and span shares one id. Free-text logs cannot answer it; structured, correlated logs answer it in one query. A trace (M20) is the picture; logs are the searchable diary.
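
A minimal sketch of the logger described above, to make query() and correlate() concrete. The in-memory list, the class name StructuredLogger, and the `failed` filter are assumptions for illustration; the mini-lab's actual code may differ.

```python
from typing import Any, Dict, List, Optional


class StructuredLogger:
    """In-memory structured logger: every record is a dict, never a sentence."""

    def __init__(self) -> None:
        self.records: List[Dict[str, Any]] = []

    def log(self, request_id: str, event: str, tool: Optional[str] = None,
            latency_ms: Optional[float] = None, error: Optional[str] = None) -> None:
        # Every record carries the same fields, so every record is queryable.
        self.records.append({
            "request_id": request_id, "event": event, "tool": tool,
            "latency_ms": latency_ms, "error": error,
        })

    def query(self, event: Optional[str] = None,
              failed: Optional[bool] = None) -> List[Dict[str, Any]]:
        """Filter by event name and/or by whether an error was recorded."""
        out = []
        for r in self.records:
            if event is not None and r["event"] != event:
                continue
            if failed is not None and (r["error"] is not None) != failed:
                continue
            out.append(r)
        return out

    def correlate(self, request_id: str) -> List[Dict[str, Any]]:
        """One request's whole story, in the order it was logged."""
        return [r for r in self.records if r["request_id"] == request_id]


log = StructuredLogger()
log.log("r1", "retrieval", tool="search", latency_ms=12.0)
log.log("r2", "retrieval", tool="search", latency_ms=900.0, error="timeout")
log.log("r2", "generate", tool="llm", latency_ms=300.0)

failed_retrievals = log.query(event="retrieval", failed=True)  # only r2's retrieval
story = log.correlate("r2")                                    # retrieval, then generate
```

"Find every failed retrieval" is one call, and so is "show me everything that happened to r2"; neither is possible if the same facts are buried in free-text messages.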

2. Dashboards & the four golden signals (extends M20 + M31)

Build: compute latency (p50/p95), traffic, error rate, and saturation from a window of requests, plus SLO burn (M31), and render the few numbers that go on the wall, flagging breaches. Lesson: a dashboard's value is in what you leave off. Two hundred graphs hide a problem as effectively as having no graphs at all. The four golden signals map to user pain; the test for the wall is "would this change what I do right now?" Everything else is a drill-down you open only when a golden signal turns red.
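
The computation above can be sketched in a few functions. This is an illustration under stated assumptions: the record fields (latency_ms, error), the nearest-rank percentile method, saturation measured as in-flight requests over a maximum, and the threshold values are all hypothetical choices, not the mini-lab's exact ones.

```python
import math
from typing import Any, Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile (one common definition among several)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests: List[Dict[str, Any]], window_s: float,
                   in_flight: int, max_in_flight: int) -> Dict[str, float]:
    """Latency, traffic, errors, saturation for one window of requests."""
    if not requests:
        raise ValueError("need at least one request in the window")
    latencies = [r["latency_ms"] for r in requests]
    errors = sum(1 for r in requests if r["error"])
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "saturation": in_flight / max_in_flight,
    }


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Error-budget burn: 1.0 means spending the budget exactly on schedule."""
    return error_rate / (1 - slo_target)


def breaches(signals: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Names of the signals that exceed their (assumed) wall thresholds."""
    return [name for name, limit in thresholds.items() if signals[name] > limit]


# Ten requests over a 5-second window, one of them failed.
window = [{"latency_ms": 100.0 * i, "error": i == 10} for i in range(1, 11)]
signals = golden_signals(window, window_s=5.0, in_flight=8, max_in_flight=10)
red = breaches(signals, {"p95_ms": 800, "error_rate": 0.05, "saturation": 0.9})
```

Only the handful of numbers in `signals` go on the wall; `red` is the list that should change what you do right now.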

3. Online evaluation (extends M26 + M30)

Build: sample a fraction of live interactions, score each with reference-free proxies (refusals, missing citations, stub answers), and flag drift when the rolling score drops. Lesson: the M26 gate is a fixed exam taken before shipping; it cannot know about inputs it never contained. Online eval is the pop quiz on real traffic: it catches the slow drift, the new question type, the quiet regression. You trade ground-truth labels (you have none live) for cheap proxies and a sample. Offline evals stop a known failure; online evals notice an unknown one.
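
A rough sketch of the sample, score, and flag loop described above. The refusal phrases, the "[1]"-style citation pattern, the eight-word stub cutoff, and the baseline and tolerance values are illustrative assumptions; real proxies would be tuned to the product.

```python
import random
import re
from collections import deque
from typing import Optional

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")  # assumed phrases


def proxy_score(answer: str) -> float:
    """Reference-free score in [0, 1]: the fraction of cheap checks passed."""
    text = answer.lower()
    checks = [
        not any(m in text for m in REFUSAL_MARKERS),    # not a refusal
        re.search(r"\[\d+\]", answer) is not None,      # cites a source like [1]
        len(answer.split()) >= 8,                       # not a stub answer
    ]
    return sum(checks) / len(checks)


class OnlineEvaluator:
    def __init__(self, sample_rate: float, window: int, baseline: float,
                 tolerance: float, seed: Optional[int] = None) -> None:
        self.sample_rate = sample_rate
        self.scores = deque(maxlen=window)   # rolling window of sampled scores
        self.baseline = baseline
        self.tolerance = tolerance
        self.rng = random.Random(seed)

    def observe(self, answer: str) -> bool:
        """Score a sampled fraction of traffic; return True if drift is flagged."""
        if self.rng.random() < self.sample_rate:
            self.scores.append(proxy_score(answer))
        return self.drifting()

    def drifting(self) -> bool:
        # Only judge once the window is full, to avoid alarms on tiny samples.
        if len(self.scores) < self.scores.maxlen:
            return False
        rolling = sum(self.scores) / len(self.scores)
        return rolling < self.baseline - self.tolerance


ev = OnlineEvaluator(sample_rate=1.0, window=4, baseline=0.9, tolerance=0.2)
good = "Password resets are handled on the account settings page, see the guide [1]."
bad = "I cannot help with that."
flags = [ev.observe(a) for a in [good] * 4 + [bad] * 2]
```

No labels are needed anywhere in this loop, which is the trade the lesson describes: the proxies are crude, but they run on every sampled request.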

4. Capacity, rate limits & quotas (extends M25 + M29)

Build: a token-bucket rate limiter, a per-tenant daily quota, and a concurrency limiter. These are the three levers that keep one busy client from taking down the service. Lesson: M25 made a request cheap; capacity asks what happens under many requests at once. Rate limits stop a single key from starving the rest; quotas stop runaway cost; concurrency limits tell you when to add a replica (M29). And you live inside your provider's limits too, so a 429 is handled with backoff (M22) and load-shedding, never a retry storm.
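
The three levers above fit in one short sketch. Class names, the injectable clock, and the (tenant, day) quota key are assumptions made for illustration; the code is single-threaded and would need locking in a real service.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Rate limit: tokens refill at rate_per_s up to capacity; a request spends one."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_s
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)   # start full, so short bursts are allowed
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False   # caller would typically answer with a 429


class DailyQuota:
    """Per-tenant daily budget, counted in whatever unit you bill (tokens, calls)."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used: Dict[Tuple[str, str], int] = {}

    def charge(self, tenant: str, day: str, units: int) -> bool:
        key = (tenant, day)
        if self.used.get(key, 0) + units > self.limit:
            return False
        self.used[key] = self.used.get(key, 0) + units
        return True


class ConcurrencyLimiter:
    """Cap on in-flight requests; a refusal here is load shedding."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_acquire(self) -> bool:
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        self.in_flight = max(0, self.in_flight - 1)
```

A request is admitted only if all three say yes. Passing a fake clock to TokenBucket makes the limiter testable without sleeping.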

5. Continuous improvement: the flywheel (extends M30 + M31 + M26)

Build: curate incidents into deduped regression guards, then replay weeks of incidents and watch new ones fall while repeats get prevented by the guards already in place. Lesson: the measure of an operations team is not heroics; it is the slope of the incident graph. Every postmortem (M31) and every down-vote (M30) must leave behind a guard (M26). Do that and the same fire cannot burn twice; skip it and you firefight forever. This is the M30 data flywheel pointed at reliability instead of model quality.
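
The replay described above can be sketched as a loop over weeks. The incident fields (component, failure_mode), the dedup key built from them, and the simplification that a guard lands immediately after an incident's first occurrence are all assumptions for illustration.

```python
from typing import Dict, List, Set, Tuple

Incident = Dict[str, str]


def fingerprint(incident: Incident) -> Tuple[str, str]:
    """Dedup key: two incidents with the same key need only one guard."""
    return (incident["component"], incident["failure_mode"])


def replay(weeks: List[List[Incident]]) -> List[Dict[str, int]]:
    """Replay incidents week by week, turning each new one into a guard."""
    guards: Set[Tuple[str, str]] = set()
    history = []
    for week in weeks:
        new, prevented = 0, 0
        for incident in week:
            key = fingerprint(incident)
            if key in guards:
                prevented += 1      # an existing guard would have caught this
            else:
                new += 1
                guards.add(key)     # the postmortem leaves a guard behind
        history.append({"new": new, "prevented": prevented, "guards": len(guards)})
    return history


timeout = {"component": "retrieval", "failure_mode": "timeout"}
bad_cite = {"component": "llm", "failure_mode": "bad_citation"}
schema = {"component": "tool", "failure_mode": "schema_error"}

history = replay([
    [timeout, bad_cite],
    [timeout, schema],
    [timeout, bad_cite],
])
```

In this toy replay the "new" column goes 2, 1, 0 while "prevented" goes 0, 1, 2: the incident slope the lesson asks you to watch.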

Words you will hear

Structured logging / log correlation, the four golden signals (latency, traffic, errors, saturation), dashboard / SLI, online evaluation / drift, reference-free metrics, token bucket / rate limit / quota / concurrency limit, load shedding, reliability flywheel / guard. Definitions are in the glossary; the concepts are in the Part E orientation.