Operations Support notes: the one idea, and five topics that round it out

The one idea: building a system answers "does it work?" once; operating it answers "is it still working?" forever. Operations support is the standing discipline that keeps a live AI system healthy: it makes "healthy" measurable, notices the instant it slips, responds from a practiced playbook, and feeds every incident back in as a test so the same failure cannot return quietly. The four Part E modules are the hands-on pieces; this page is the why, plus five topics that extend modules you have already built.

Going deeper: five topics that round out operations support

These are the remaining corners of the operations picture, each living in a module you already built. M35 (Operations Support, going deeper) turns each into a short, runnable mini-lab; the summaries below are the why.

1. Structured logging & log correlation (extends M20)

A trace (M20) shows one request's path; logs are the running diary of everything the service does. The trick is to make them structured (JSON lines with fields: request_id, user, tool, latency_ms, error) instead of free-text prose, so you can query them ("show every request where the retrieval tool errored in the last hour"). The second trick is correlation: stamp one request_id (or trace id) onto every log line and span for a request, so when an incident hits you can pull the entire story of a single failing request across the model call, the tools, and the API layer. Unstructured logs are a shoebox of receipts; structured, correlated logs are a searchable ledger.
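A minimal sketch of both tricks, using only the Python standard library. The field names come from the paragraph above; the logger name ops_demo, the helper names, and the sample latency and error values are assumptions made up for illustration, not code from M20.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the id of the request currently being handled (assumed design choice).
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    FIELDS = ("user", "tool", "latency_ms", "error")

    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),  # correlation stamp
        }
        # Copy over any structured fields passed via `extra=`.
        for name in self.FIELDS:
            if hasattr(record, name):
                entry[name] = getattr(record, name)
        return json.dumps(entry)


def build_logger(stream):
    """Return a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger("ops_demo")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, user, fail_retrieval=False):
    """Simulate one request; every line it logs carries the same request_id."""
    request_id = uuid.uuid4().hex
    token = request_id_var.set(request_id)
    try:
        logger.info("request received", extra={"user": user})
        if fail_retrieval:
            logger.error("tool failed", extra={
                "user": user, "tool": "retrieval",
                "latency_ms": 1200, "error": "timeout"})
        else:
            logger.info("tool ok", extra={
                "user": user, "tool": "retrieval", "latency_ms": 85})
    finally:
        request_id_var.reset(token)
    return request_id


def lines_for_request(log_text, request_id):
    """Pull the whole story of one request out of the log."""
    entries = [json.loads(line) for line in log_text.splitlines() if line]
    return [e for e in entries if e["request_id"] == request_id]


def retrieval_errors(log_text):
    """Query: every line where the retrieval tool reported an error."""
    entries = [json.loads(line) for line in log_text.splitlines() if line]
    return [e for e in entries if e.get("tool") == "retrieval" and "error" in e]
```

Because each line is a JSON object, the "retrieval tool errored" question becomes a filter on two fields, and the request_id stamp is what lets lines_for_request reassemble one failing request from an interleaved log.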

2. Dashboards & SLIs: what goes on the wall (extends M20, M31)

An SLI (M31) is a single health number; a dashboard is the handful of them you keep on screen. The discipline is restraint: a wall of 200 graphs hides problems as well as no graphs at all. A good operations dashboard shows the few signals that map to user pain: the four golden signals (latency, traffic, errors, and saturation, meaning how full your capacity is), plus your SLO burn. The test for putting a graph on the wall: would it change what you do right now? If not, it belongs in a drill-down, not the dashboard.
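As a rough sketch, the whole wall can be computed from one window of request records. The names RequestRecord and golden_signals, the choice of p95 for latency, and the 99% SLO target are illustrative assumptions. Burn rate is taken here in its common form, observed error rate divided by the error budget (1 minus the SLO target); M31 may define it differently.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    latency_ms: float
    ok: bool


def golden_signals(requests, window_s, in_flight, max_in_flight, slo_target=0.99):
    """Summarize one time window as the handful of numbers worth a dashboard slot."""
    n = len(requests)
    if n == 0:
        p95 = 0.0
        error_rate = 0.0
    else:
        latencies = sorted(r.latency_ms for r in requests)
        rank = (95 * n + 99) // 100          # nearest-rank p95, integer ceiling
        p95 = latencies[rank - 1]
        error_rate = sum(1 for r in requests if not r.ok) / n
    return {
        "latency_p95_ms": p95,                        # latency
        "traffic_rps": n / window_s,                  # traffic
        "error_rate": error_rate,                     # errors
        "saturation": in_flight / max_in_flight,      # how full capacity is
        "slo_burn_rate": error_rate / (1 - slo_target),  # >1 means budget is burning too fast
    }
```

A burn rate of 1 means the error budget would be used up exactly at the end of the SLO period; a sustained value well above 1 is the kind of number that changes what you do right now.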

3. Online evaluation: score live traffic (extends M26, M30)

The eval gate in M26 runs offline on a fixed golden set before you ship. Online evaluation is the complement: sample a slice of real production traffic and score it continuously, so you catch quality drift that your fixed test set never anticipated (a new kind of question, a slow regression, a shift in who your users are). Because live traffic rarely comes with ground-truth labels, online eval leans on reference-free signals: an LLM-as-judge (M20) on a sample, user feedback (M30), and proxy metrics (refusal rate, citation rate, answer length). Offline evals stop a known regression at the door; online evals notice an unknown one in the wild.
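A sketch of the cheapest part of this: pick a stable slice of traffic and compute the three proxy metrics on it. Everything specific here is an assumption for illustration: sampling by hashing the request id, detecting refusals by phrase match (a crude heuristic), and treating bracketed numbers such as [1] as citations. The LLM-as-judge and user-feedback signals are left out.

```python
import hashlib
import re

# Hypothetical markers; a real system would tune or replace these.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")
CITATION = re.compile(r"\[\d+\]")


def in_sample(request_id, rate):
    """Deterministically select roughly `rate` of requests by hashing the id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform in [0, 2**64)
    return bucket < rate * 2**64


def proxy_metrics(answers):
    """Reference-free health numbers over a batch of sampled answers."""
    n = len(answers)
    if n == 0:
        return {"refusal_rate": 0.0, "citation_rate": 0.0, "mean_words": 0.0}
    refusals = sum(
        1 for a in answers if any(m in a.lower() for m in REFUSAL_MARKERS))
    cited = sum(1 for a in answers if CITATION.search(a))
    words = sum(len(a.split()) for a in answers)
    return {
        "refusal_rate": refusals / n,
        "citation_rate": cited / n,
        "mean_words": words / n,
    }
```

Hashing the id rather than calling a random number generator means the same request is always in or out of the sample, so a sampled request can be found again in the logs. None of these numbers says an answer is right; a sudden move in any of them is a prompt to look closer.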

4. Capacity, rate limits & quotas (extends M25, M29)

M25 made each request cheaper; capacity planning asks whether the system survives many requests at once. Three levers: rate limits (cap requests per user/key per minute, so one client cannot starve the rest), quotas (a budget per tenant per day, so cost cannot run away), and concurrency limits (how many in-flight requests one replica handles before you add more replicas; see M29). You also live inside your provider's rate limits, so production code handles a 429 with backoff (M22) and sheds or queues load rather than hammering the provider with retries. Capacity is where cost (M25), reliability (M22), and deployment (M29) meet.
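Two of these pieces fit in a short sketch: a per-key rate limit and backoff on a 429. The token bucket is one common way to implement a rate limit and is swapped in here as an assumption, as are the RateLimited exception (a stand-in for a provider's HTTP 429) and the default delays. This is not the M22 implementation. The clock, sleep, and rng parameters exist so the behavior can be exercised without real waiting.

```python
import random
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Spend one token if available; otherwise reject (shed) the request."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class RateLimited(Exception):
    """Stand-in for a provider's HTTP 429 response."""


def call_with_backoff(call, max_attempts=5, base_s=0.5, cap_s=8.0,
                      sleep=time.sleep, rng=random.random):
    """Retry `call` on RateLimited with capped exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up; the caller sheds or queues the work
            delay = min(cap_s, base_s * 2 ** attempt)
            sleep(delay * rng())  # jitter spreads retries from many clients
```

In use, a service would keep one TokenBucket per user or API key and return its own 429 when allow() is false, while wrapping outbound provider calls in call_with_backoff. The jitter matters because clients that all retry on the same schedule hit the provider in synchronized waves.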

5. Continuous improvement from postmortems & evals (extends M30, M31, M26)

The point of operations support is not heroics during an incident; it is that the system gets more reliable over time. Every incident postmortem (M31) and every piece of user feedback (M30) should leave behind a concrete change: a new regression eval (M26), a tightened alert, a new runbook step, a fixed class of bug. This is the data flywheel (M30) aimed at reliability instead of model quality: real failures become permanent guards. A team that only firefights faces the same fires forever; a team that closes the loop watches its incident rate fall.
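One way to picture "a new regression eval" is as a function that turns a postmortem record into a permanent test case. The record fields (id, failing_input, expected_substring) and the substring check are hypothetical simplifications; the eval gate in M26 may score answers quite differently.

```python
def incident_to_eval_case(incident):
    """Turn a postmortem record into a regression case for the eval suite."""
    return {
        "id": f"regression-{incident['id']}",
        "input": incident["failing_input"],
        "must_contain": incident["expected_substring"],
    }


def run_regression_suite(cases, answer_fn):
    """Return the ids of cases whose answer is missing the required text."""
    failed = []
    for case in cases:
        answer = answer_fn(case["input"])
        if case["must_contain"].lower() not in answer.lower():
            failed.append(case["id"])
    return failed
```

An empty list from run_regression_suite means every past incident in the suite is still fixed; any id that comes back is an old fire restarting, caught before users see it.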

Why operations support is the right place to end

The course began by insisting the hard part of AI engineering is the engineering, not the AI. Part E is the sharpest form of that lesson: none of the operations safeguards are about machine learning; they are about running software that real people depend on. An AI feature that impresses in a demo but pages someone every night, leaks data through a stale index, or ships a regression to everyone is not finished. Operations support is what turns "it works on my machine" into "it works for everyone, keeps working, and gets safer every week." That is what it means to truly ship.

Words you will hear

Operator vs builder, LLMOps / AgentOps / AIOps, the operations lifecycle (deploy → observe → respond → improve), structured logging / log correlation, dashboard / the four golden signals (latency, traffic, errors, saturation), online evaluation, rate limit / quota / concurrency limit, continuous improvement / the reliability flywheel, toil. The hands-on terms are in M31–M34; full definitions in the glossary.