AI Operations Support — coverage map

Auto-derived from ai-operations-support.md by generate_mindmap.py. Do not hand-edit.

This is the operations-support lens over the existing AI Engineering course (modules M0-M35): a competency map for the safeguarding track that wraps the course, not a rename or replacement. AI engineering is the backbone; operations support keeps what was built running, supported, and recoverable. The table below shows where each module lives on this map; M31-M33 are the new Part E modules, and the remaining [NEW] nodes are proposed gap-fills.

Every existing module → where it now lives

Module Now taught under (operations-support branch · node)
M0 1 What AI Operations Support is
1 What you support: chatbots · RAG assistants · agents · multi-agent systems
1 How an LLM works — enough to reason about failures
2 Models & selection — capabilities, cost, context window
M1 2 Variables, types, logic, data — lists/dicts/JSON
M2 2 Variables, types, logic, data — lists/dicts/JSON
M3 2 Functions, files, libraries, errors, virtualenv
M4 2 First AI app · API keys · request → response
M5 2 Prompt engineering — to diagnose & fix prompt-level issues
M6 2 Messages API · params · streaming · structured JSON
2 Models & selection — capabilities, cost, context window
9 Streaming for perceived latency
M7 2 RAG basics — what a knowledge assistant actually is
10 Vector store / index operations — freshness · re-embedding
M8 2 RAG basics — what a knowledge assistant actually is
7 RAG answer correctness — does it match the source?
M9 2 Agents & tools — the ReAct loop you will operate
3 Single tool-using agent — function calling · ReAct loop
M10 8 OWASP LLM Top 10 as a shipping checklist
8 Prompt injection: direct & indirect
8 Guardrail layer — input / output filtering
8 Content moderation · abuse / anomaly detection
8 Red-team your own app — authorized, synthetic only
M11 4 Wrap as a service: FastAPI endpoints
4 Containerize: Docker — slim base · non-root · pinned deps
5 The signals to watch: cost · latency · error rate · quality
13 Build capstone — ship a complete AI app (RAG or agent)
M12 3 Multimodal systems — vision / audio
M13 2 Models & selection — capabilities, cost, context window
3 Open-source & local models — Ollama · Hugging Face · quantization
M14 8 Content moderation · abuse / anomaly detection
10 PII redaction at write time — privacy first
12 Responsible AI in operations — fairness · transparency · human-in-the-loop
12 Governance frameworks: EU AI Act · NIST AI RMF · ISO/IEC 42001
M15 10 Fine-tune when behaviour (not facts) must change
M16 3 MCP servers & clients — the connector standard
M17 12 Optional internals: build an LM from scratch, so it is not magic
M18 3 Multi-agent orchestration — orchestrator · sub-agents · connectors
4 Wrap as a service: FastAPI endpoints
M19 3 Agent frameworks landscape — LangGraph · CrewAI · AutoGen · SDKs · n8n
M20 5 Tracing: every model & tool call as a span
5 The signals to watch: cost · latency · error rate · quality
5 Token & dollar accounting per request
5 Tool-usage & step-count metrics
5 Structured logging & log correlation
5 Dashboards & SLIs — what goes on the wall
5 Production tooling: LangSmith · Langfuse · Phoenix · OpenTelemetry
7 Golden test set + rule-based scorers
7 Check the answer AND the trace
7 LLM-as-judge for open-ended answers
9 Re-run evals — every optimization is a quality bet
M21 4 Statelessness → many replicas behind a load balancer
10 Short-term memory under a token budget
10 Long-term memory — save & recall across sessions
10 Checkpoint & resume the whole state
10 Stateless service vs persisted state — the trade-off
M22 6 Retry with exponential backoff — transient errors only
6 Timeouts on hung calls
6 Fallback / graceful degradation in an outage
6 Step caps to stop runaway loops
6 Circuit breaker for a dead dependency
6 Human-approval gates for risky, world-changing actions
M23 8 OWASP LLM Top 10 as a shipping checklist
8 Prompt injection: direct & indirect
8 Excessive agency & data exfiltration
8 Defenses: least privilege · allowlists · treat content as data
8 Redact secrets · defense in depth
M24 3 Agentic RAG & research agents — retrieval as a tool · multi-hop · citations
11 Surface citations & cost in the UI
M25 5 Token & dollar accounting per request
9 Estimate $ & latency from token counts
9 Prompt caching — pay once for a stable prefix
9 Model routing — cheap fast model for easy steps
9 Token trimming — and what you cannot cheaply cut
9 Batch API for offline throughput
9 Capacity, rate limits & quota management
9 Re-run evals — every optimization is a quality bet
M26 4 CI/CD for AI services
7 Eval-driven development — every bug becomes a test
7 Eval gate as an exit code — block bad merges
7 Eval gate in CI — GitHub Actions on every push
7 Regression detection & quality tracking over time
7 Online evaluation — sample & score live traffic
M27 13 Complete-agent capstone — RAG + memory + observability + reliability + security behind an API
M28 9 Streaming for perceived latency
11 Agent UX — stream progress & answer live
11 Surface citations & cost in the UI
11 Cancellation — stop iterating, stop the cost
11 Serve over Server-Sent Events
M29 4 Containerize: Docker — slim base · non-root · pinned deps
4 Config from the environment (12-factor) · fail fast on bad config
4 Secrets from the env, never in code
4 Liveness vs readiness probes
4 Graceful startup & shutdown — warm up → drain
4 Statelessness → many replicas behind a load balancer
8 Secrets management & rotation
10 Stateless service vs persisted state — the trade-off
M30 10 The data flywheel — capture interactions + feedback
10 Curate feedback → eval cases + fine-tuning data
10 PII redaction at write time — privacy first
12 Close the loop on a cadence — curation is judgement
12 Beware feedback bias & amplification
12 Continuous improvement from postmortems & evals
M31 6 SLOs · SLIs · error budgets
6 On-call & alerting — paging · thresholds · alert noise
6 Incident lifecycle: detect → triage → mitigate → resolve
6 Runbooks & playbooks for common failures
6 Escalation paths — when to wake a human
6 Blameless postmortems & follow-up actions
M32 11 Ticketing / helpdesk integration — Jira · ServiceNow · Zendesk shape
11 Support tiers L1 / L2 / L3 & SLAs
11 Human escalation & handoff workflows
11 AIOps for IT — AI that triages logs, alerts & anomalies
M33 4 Change management: versioning · canary · rollback
8 Secrets management & rotation
10 Vector store / index operations — freshness · re-embedding
M34 13 Ops-support capstone — run a deployed agent through a full incident + improvement cycle
13 Inject a fault · get paged · run the runbook · mitigate
13 Write the postmortem · add a regression eval · ship the fix
M35 5 Structured logging & log correlation
5 Dashboards & SLIs — what goes on the wall
7 Online evaluation — sample & score live traffic
9 Capacity, rate limits & quota management
12 Continuous improvement from postmortems & evals

Coverage: 36 / 36 existing modules placed in the operations-support map.
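
A coverage figure like the one above can be tallied from rows of the table's shape (an optional module id, a branch number, then the node title). The following is a minimal sketch under that assumption. It is not the logic of generate_mindmap.py, whose internals are not shown on this page. The names `ROW_RE`, `placements`, and `coverage`, and the shortened sample rows, are illustrative only.

```python
import re

# Illustrative sample rows copied from the table above (a small subset).
# Rows without a module id continue the most recent module.
SAMPLE_ROWS = [
    "M0 1 What AI Operations Support is",
    "1 What you support: chatbots · RAG assistants · agents · multi-agent systems",
    "M4 2 First AI app · API keys · request → response",
    "M6 2 Messages API · params · streaming · structured JSON",
    "9 Streaming for perceived latency",
]

# Optional module id (M<number>), then a branch number, then the node title.
ROW_RE = re.compile(r"^(?:(M\d+)\s+)?(\d+)\s+(.+)$")


def placements(rows):
    """Map each module id to its list of (branch, node) placements."""
    placed = {}
    current = None
    for row in rows:
        match = ROW_RE.match(row.strip())
        if not match:
            continue  # skip blank lines, headers, and other non-table text
        module, branch, node = match.groups()
        if module:
            current = module
        if current is None:
            continue  # continuation row seen before any module id
        placed.setdefault(current, []).append((int(branch), node))
    return placed


def coverage(rows, expected):
    """Return (modules with at least one placement, modules expected)."""
    placed = placements(rows)
    covered = [module for module in expected if placed.get(module)]
    return len(covered), len(expected)


if __name__ == "__main__":
    done, total = coverage(SAMPLE_ROWS, ["M0", "M4", "M6"])
    print(f"Coverage: {done} / {total}")
```

Passing the full table and the complete list of module ids as `expected` would give the same kind of `placed / total` pair; any module id missing from the table shows up as a gap between the two numbers.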

Proposed [NEW] gap-fills (do not exist as modules yet)

These are the operations-support topics the current course does not cover, surfaced by the reframe. They are proposals on the map, not built modules.

Branch Proposed node

0 proposed [NEW] nodes across the map.