
Operations Support, explained (Part E orientation)

Read this before M31. Just as M0 opens the course by explaining what AI engineering is, this page opens Part E by explaining what operations support is, and why a course about building AI systems ends with a part about running them.

The one line: AI engineering is the backbone (you build the system); operations support is the safeguarding layer that keeps what you built running, supported, and recoverable in production, watching over the architecture, the databases, and the builds.

No code here; this is the map of the part. The hands-on modules are M31–M34.


1. Operator vs builder: you run what others build

Most of this course trained you as a builder: take a blank file and ship a chatbot, a RAG assistant, an agent. Operations support is a different stance, the operator: the system already exists and real users depend on it right now, and your job is to keep it healthy, notice the moment it is not, and recover fast when it breaks. Builders optimize for "it works"; operators optimize for "it keeps working, and when it doesn't, we recover quickly and learn." Same system, different question.

You do not have to be the person who built a service to operate it; in fact, you usually are not. That is exactly why the safeguards in Part E exist: SLOs make "healthy" a number anyone can read, runbooks turn "what do we do at 2am" into a checklist anyone can follow, and canaries let anyone ship a change without having to trust it blindly.

2. Three lenses: LLMOps, AgentOps, AIOps

"Operations support for AI" gets called three things, depending on which part you are looking through:

Lens | Means | Where in Part E
LLMOps | operating systems built on LLMs: prompts, RAG, cost, evals, deploys, data | M33, and M20/M25/M26/M29 it builds on
AgentOps | operating agents specifically: traces, memory, reliability, tool-use, multi-step failure | M31, and M20/M21/M22 it builds on
AIOps | using AI to operate anything: triaging logs, correlating alerts, reducing noise | M32

They overlap heavily and you do not need to police the boundaries. The point is that "operations support" spans both operating the AI you built and using AI to help you operate.

3. The operations lifecycle: deploy → observe → respond → improve

Everything in Part E is one loop that runs forever around a live system:

        ┌───────────► DEPLOY ──────────┐
        │         (ship it safely)      ▼
     IMPROVE                          OBSERVE
  (learn, harden)                 (watch the signals)
        ▲                              │
        └────────── RESPOND ◄──────────┘
                (fix it fast)
  • Deploy — ship safely: config, probes, canary, rollback (M29, M33).
  • Observe — make "healthy" visible: traces, metrics, SLOs, logs, dashboards (M20, M31).
  • Respond — when it breaks: alert, page, triage, runbook, mitigate (M31, M32).
  • Improve — leave it safer: postmortem → regression test → re-deploy (M31 → M26 → M33).

The capstone (M34) runs this whole loop on one incident.

4. A day in the role

A day in operations support is not heads-down building. It is:

  • a dashboard you glance at (are we within our SLOs? is the error budget burning?),
  • an alert that fires when a signal crosses a line (and a pager that you trust because it only fires when a human must act),
  • a ticket queue of user problems to triage and route under an SLA,
  • the occasional incident that pages you, where the runbook and a rollback matter more than cleverness,
  • and, between fires, improvement: turning the last incident into a test, tightening an alert, writing the runbook you wished you'd had.

The measure of a good operations team is not that nothing breaks. It is that when something breaks, the response is routine, and the system ends up a little safer than before.


The Part E map

Start here, then work the four modules in order:

Module | The arc of the loop it owns
M31 · Incident response & on-call | respond + improve: SLOs, alerts, the incident lifecycle, runbooks, postmortems
M32 · AI support desk & AIOps | respond (intake): triage tickets, route under SLA, correlate alert storms
M33 · Data & release operations | deploy + safeguard the data: reindex, retention, backup, canary, rollback
M34 · Part E capstone | the whole loop on one incident, with an eval gate over it

Going deeper (topics that extend earlier modules)

Part E leans on, and points back to, work you already did. notes.md surveys five topics that round out the operations picture, each living in the module it extends: structured logging & dashboards/SLIs (M20), online evaluation (M26), capacity, rate limits & quotas (M25/M29), and continuous improvement (M30). Each is now a hands-on mini-lab in M35: Operations Support, going deeper.