AI Operations Support — coverage map

Auto-derived from ai-operations-support.md by generate_mindmap.py. Do not hand-edit.

This is the operations-support lens over the existing AI Engineering course (modules M0-M35): a competency map for the safeguarding track that wraps the course, not a rename or replacement. AI engineering is the backbone; operations support keeps what was built running, supported, and recoverable. The table below shows where each module lives on this map; M31-M33 are the new Part E modules, and the remaining [NEW] nodes are proposed gap-fills.

Every existing module → where it now lives

Module	Now taught under (operations-support branch number + node; multiple placements separated by "; ")
M0	1 What AI Operations Support is; 1 What you support: chatbots · RAG assistants · agents · multi-agent systems; 1 How an LLM works — enough to reason about failures; 2 Models & selection — capabilities, cost, context window
M1	2 Variables, types, logic, data — lists/dicts/JSON
M2	2 Variables, types, logic, data — lists/dicts/JSON
M3	2 Functions, files, libraries, errors, virtualenv
M4	2 First AI app · API keys · request → response
M5	2 Prompt engineering — to diagnose & fix prompt-level issues
M6	2 Messages API · params · streaming · structured JSON; 2 Models & selection — capabilities, cost, context window; 9 Streaming for perceived latency
M7	2 RAG basics — what a knowledge assistant actually is; 10 Vector store / index operations — freshness · re-embedding
M8	2 RAG basics — what a knowledge assistant actually is; 7 RAG answer correctness — does it match the source?
M9	2 Agents & tools — the ReAct loop you will operate; 3 Single tool-using agent — function calling · ReAct loop
M10	8 OWASP LLM Top 10 as a shipping checklist; 8 Prompt injection: direct & indirect; 8 Guardrail layer — input / output filtering; 8 Content moderation · abuse / anomaly detection; 8 Red-team your own app — authorized, synthetic only
M11	4 Wrap as a service: FastAPI endpoints; 4 Containerize: Docker — slim base · non-root · pinned deps; 5 The signals to watch: cost · latency · error rate · quality; 13 Build capstone — ship a complete AI app (RAG or agent)
M12	3 Multimodal systems — vision / audio
M13	2 Models & selection — capabilities, cost, context window; 3 Open-source & local models — Ollama · Hugging Face · quantization
M14	8 Content moderation · abuse / anomaly detection; 10 PII redaction at write time — privacy first; 12 Responsible AI in operations — fairness · transparency · human-in-the-loop; 12 Governance frameworks: EU AI Act · NIST AI RMF · ISO/IEC 42001
M15	10 Fine-tune when behaviour (not facts) must change
M16	3 MCP servers & clients — the connector standard
M17	12 Optional internals: build an LM from scratch, so it is not magic
M18	3 Multi-agent orchestration — orchestrator · sub-agents · connectors; 4 Wrap as a service: FastAPI endpoints
M19	3 Agent frameworks landscape — LangGraph · CrewAI · AutoGen · SDKs · n8n
M20	5 Tracing: every model & tool call as a span; 5 The signals to watch: cost · latency · error rate · quality; 5 Token & dollar accounting per request; 5 Tool-usage & step-count metrics; 5 Structured logging & log correlation; 5 Dashboards & SLIs — what goes on the wall; 5 Production tooling: LangSmith · Langfuse · Phoenix · OpenTelemetry; 7 Golden test set + rule-based scorers; 7 Check the answer AND the trace; 7 LLM-as-judge for open-ended answers; 9 Re-run evals — every optimization is a quality bet
M21	4 Statelessness → many replicas behind a load balancer; 10 Short-term memory under a token budget; 10 Long-term memory — save & recall across sessions; 10 Checkpoint & resume the whole state; 10 Stateless service vs persisted state — the trade-off
M22	6 Retry with exponential backoff — transient errors only; 6 Timeouts on hung calls; 6 Fallback / graceful degradation in an outage; 6 Step caps to stop runaway loops; 6 Circuit breaker for a dead dependency; 6 Human-approval gates for risky, world-changing actions
M23	8 OWASP LLM Top 10 as a shipping checklist; 8 Prompt injection: direct & indirect; 8 Excessive agency & data exfiltration; 8 Defenses: least privilege · allowlists · treat content as data; 8 Redact secrets · defense in depth
M24	3 Agentic RAG & research agents — retrieval as a tool · multi-hop · citations; 11 Surface citations & cost in the UI
M25	5 Token & dollar accounting per request; 9 Estimate $ & latency from token counts; 9 Prompt caching — pay once for a stable prefix; 9 Model routing — cheap fast model for easy steps; 9 Token trimming — and what you cannot cheaply cut; 9 Batch API for offline throughput; 9 Capacity, rate limits & quota management; 9 Re-run evals — every optimization is a quality bet
M26	4 CI/CD for AI services; 7 Eval-driven development — every bug becomes a test; 7 Eval gate as an exit code — block bad merges; 7 Eval gate in CI — GitHub Actions on every push; 7 Regression detection & quality tracking over time; 7 Online evaluation — sample & score live traffic
M27	13 Complete-agent capstone — RAG + memory + observability + reliability + security behind an API
M28	9 Streaming for perceived latency; 11 Agent UX — stream progress & answer live; 11 Surface citations & cost in the UI; 11 Cancellation — stop iterating, stop the cost; 11 Serve over Server-Sent Events
M29	4 Containerize: Docker — slim base · non-root · pinned deps; 4 Config from the environment (12-factor) · fail fast on bad config; 4 Secrets from the env, never in code; 4 Liveness vs readiness probes; 4 Graceful startup & shutdown — warm up → drain; 4 Statelessness → many replicas behind a load balancer; 8 Secrets management & rotation; 10 Stateless service vs persisted state — the trade-off
M30	10 The data flywheel — capture interactions + feedback; 10 Curate feedback → eval cases + fine-tuning data; 10 PII redaction at write time — privacy first; 12 Close the loop on a cadence — curation is judgement; 12 Beware feedback bias & amplification; 12 Continuous improvement from postmortems & evals
M31	6 SLOs · SLIs · error budgets; 6 On-call & alerting — paging · thresholds · alert noise; 6 Incident lifecycle: detect → triage → mitigate → resolve; 6 Runbooks & playbooks for common failures; 6 Escalation paths — when to wake a human; 6 Blameless postmortems & follow-up actions
M32	11 Ticketing / helpdesk integration — Jira · ServiceNow · Zendesk shape; 11 Support tiers L1 / L2 / L3 & SLAs; 11 Human escalation & handoff workflows; 11 AIOps for IT — AI that triages logs, alerts & anomalies
M33	4 Change management: versioning · canary · rollback; 8 Secrets management & rotation; 10 Vector store / index operations — freshness · re-embedding
M34	13 Ops-support capstone — run a deployed agent through a full incident + improvement cycle; 13 Inject a fault · get paged · run the runbook · mitigate; 13 Write the postmortem · add a regression eval · ship the fix
M35	5 Structured logging & log correlation; 5 Dashboards & SLIs — what goes on the wall; 7 Online evaluation — sample & score live traffic; 9 Capacity, rate limits & quota management; 12 Continuous improvement from postmortems & evals

Coverage: 36 / 36 existing modules placed in the operations-support map.
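
The coverage figure above counts modules that have at least one placement in the table. A minimal sketch of such a check follows, assuming the tab-separated `Module<TAB>nodes` row layout used above. The `coverage` function and the sample rows are hypothetical illustrations, not the logic of generate_mindmap.py.

```python
def coverage(table_text, expected_modules):
    """Return (placed, total, missing) for a tab-separated module table."""
    placed = set()
    for line in table_text.splitlines():
        module, _, nodes = line.partition("\t")
        # A module counts as placed only if its row lists at least one node.
        if module.strip() and nodes.strip():
            placed.add(module.strip())
    missing = [m for m in expected_modules if m not in placed]
    return len(expected_modules) - len(missing), len(expected_modules), missing


# Hypothetical sample rows; M2 has an empty placement cell on purpose.
sample = (
    "M0\t1 What AI Operations Support is\n"
    "M1\t2 Variables, types, logic, data\n"
    "M2\t"
)
placed, total, missing = coverage(sample, ["M0", "M1", "M2"])
print(f"Coverage: {placed} / {total}; missing: {missing}")
```

A check like this could run after regeneration to flag any module whose row is missing or empty before the coverage line is trusted.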

Proposed `[NEW]` gap-fills (do not exist as modules yet)

These are the operations-support topics the current course does not cover, surfaced by the reframe. They are proposals on the map, not built modules.

Branch	Proposed node

0 proposed [NEW] nodes across the map.
