AI Operations Support — coverage map
Auto-derived from ai-operations-support.md by generate_mindmap.py. Do not hand-edit.
This is the operations-support lens over the existing AI Engineering course (modules M0-M35): a competency map for the safeguarding track that wraps the course, not a rename or replacement. AI engineering is the backbone; operations support keeps what was built running, supported, and recoverable. The table below shows where each module lives on this map; M31-M33 are the new Part E modules, and the remaining [NEW] nodes are proposed gap-fills.
Every existing module → where it now lives
| Module | Now taught under (operations-support branch · node) |
|---|---|
| M0 | 1 What AI Operations Support is<br>1 What you support: chatbots · RAG assistants · agents · multi-agent systems<br>1 How an LLM works — enough to reason about failures<br>2 Models & selection — capabilities, cost, context window |
| M1 | 2 Variables, types, logic, data — lists/dicts/JSON |
| M2 | 2 Variables, types, logic, data — lists/dicts/JSON |
| M3 | 2 Functions, files, libraries, errors, virtualenv |
| M4 | 2 First AI app · API keys · request → response |
| M5 | 2 Prompt engineering — to diagnose & fix prompt-level issues |
| M6 | 2 Messages API · params · streaming · structured JSON<br>2 Models & selection — capabilities, cost, context window<br>9 Streaming for perceived latency |
| M7 | 2 RAG basics — what a knowledge assistant actually is<br>10 Vector store / index operations — freshness · re-embedding |
| M8 | 2 RAG basics — what a knowledge assistant actually is<br>7 RAG answer correctness — does it match the source? |
| M9 | 2 Agents & tools — the ReAct loop you will operate<br>3 Single tool-using agent — function calling · ReAct loop |
| M10 | 8 OWASP LLM Top 10 as a shipping checklist<br>8 Prompt injection: direct & indirect<br>8 Guardrail layer — input / output filtering<br>8 Content moderation · abuse / anomaly detection<br>8 Red-team your own app — authorized, synthetic only |
| M11 | 4 Wrap as a service: FastAPI endpoints<br>4 Containerize: Docker — slim base · non-root · pinned deps<br>5 The signals to watch: cost · latency · error rate · quality<br>13 Build capstone — ship a complete AI app (RAG or agent) |
| M12 | 3 Multimodal systems — vision / audio |
| M13 | 2 Models & selection — capabilities, cost, context window<br>3 Open-source & local models — Ollama · Hugging Face · quantization |
| M14 | 8 Content moderation · abuse / anomaly detection<br>10 PII redaction at write time — privacy first<br>12 Responsible AI in operations — fairness · transparency · human-in-the-loop<br>12 Governance frameworks: EU AI Act · NIST AI RMF · ISO/IEC 42001 |
| M15 | 10 Fine-tune when behaviour (not facts) must change |
| M16 | 3 MCP servers & clients — the connector standard |
| M17 | 12 Optional internals: build an LM from scratch, so it is not magic |
| M18 | 3 Multi-agent orchestration — orchestrator · sub-agents · connectors<br>4 Wrap as a service: FastAPI endpoints |
| M19 | 3 Agent frameworks landscape — LangGraph · CrewAI · AutoGen · SDKs · n8n |
| M20 | 5 Tracing: every model & tool call as a span<br>5 The signals to watch: cost · latency · error rate · quality<br>5 Token & dollar accounting per request<br>5 Tool-usage & step-count metrics<br>5 Structured logging & log correlation<br>5 Dashboards & SLIs — what goes on the wall<br>5 Production tooling: LangSmith · Langfuse · Phoenix · OpenTelemetry<br>7 Golden test set + rule-based scorers<br>7 Check the answer AND the trace<br>7 LLM-as-judge for open-ended answers<br>9 Re-run evals — every optimization is a quality bet |
| M21 | 4 Statelessness → many replicas behind a load balancer<br>10 Short-term memory under a token budget<br>10 Long-term memory — save & recall across sessions<br>10 Checkpoint & resume the whole state<br>10 Stateless service vs persisted state — the trade-off |
| M22 | 6 Retry with exponential backoff — transient errors only<br>6 Timeouts on hung calls<br>6 Fallback / graceful degradation in an outage<br>6 Step caps to stop runaway loops<br>6 Circuit breaker for a dead dependency<br>6 Human-approval gates for risky, world-changing actions |
| M23 | 8 OWASP LLM Top 10 as a shipping checklist<br>8 Prompt injection: direct & indirect<br>8 Excessive agency & data exfiltration<br>8 Defenses: least privilege · allowlists · treat content as data<br>8 Redact secrets · defense in depth |
| M24 | 3 Agentic RAG & research agents — retrieval as a tool · multi-hop · citations<br>11 Surface citations & cost in the UI |
| M25 | 5 Token & dollar accounting per request<br>9 Estimate $ & latency from token counts<br>9 Prompt caching — pay once for a stable prefix<br>9 Model routing — cheap fast model for easy steps<br>9 Token trimming — and what you cannot cheaply cut<br>9 Batch API for offline throughput<br>9 Capacity, rate limits & quota management<br>9 Re-run evals — every optimization is a quality bet |
| M26 | 4 CI/CD for AI services<br>7 Eval-driven development — every bug becomes a test<br>7 Eval gate as an exit code — block bad merges<br>7 Eval gate in CI — GitHub Actions on every push<br>7 Regression detection & quality tracking over time<br>7 Online evaluation — sample & score live traffic |
| M27 | 13 Complete-agent capstone — RAG + memory + observability + reliability + security behind an API |
| M28 | 9 Streaming for perceived latency<br>11 Agent UX — stream progress & answer live<br>11 Surface citations & cost in the UI<br>11 Cancellation — stop iterating, stop the cost<br>11 Serve over Server-Sent Events |
| M29 | 4 Containerize: Docker — slim base · non-root · pinned deps<br>4 Config from the environment (12-factor) · fail fast on bad config<br>4 Secrets from the env, never in code<br>4 Liveness vs readiness probes<br>4 Graceful startup & shutdown — warm up → drain<br>4 Statelessness → many replicas behind a load balancer<br>8 Secrets management & rotation<br>10 Stateless service vs persisted state — the trade-off |
| M30 | 10 The data flywheel — capture interactions + feedback<br>10 Curate feedback → eval cases + fine-tuning data<br>10 PII redaction at write time — privacy first<br>12 Close the loop on a cadence — curation is judgement<br>12 Beware feedback bias & amplification<br>12 Continuous improvement from postmortems & evals |
| M31 | 6 SLOs · SLIs · error budgets<br>6 On-call & alerting — paging · thresholds · alert noise<br>6 Incident lifecycle: detect → triage → mitigate → resolve<br>6 Runbooks & playbooks for common failures<br>6 Escalation paths — when to wake a human<br>6 Blameless postmortems & follow-up actions |
| M32 | 11 Ticketing / helpdesk integration — Jira · ServiceNow · Zendesk shape<br>11 Support tiers L1 / L2 / L3 & SLAs<br>11 Human escalation & handoff workflows<br>11 AIOps for IT — AI that triages logs, alerts & anomalies |
| M33 | 4 Change management: versioning · canary · rollback<br>8 Secrets management & rotation<br>10 Vector store / index operations — freshness · re-embedding |
| M34 | 13 Ops-support capstone — run a deployed agent through a full incident + improvement cycle<br>13 Inject a fault · get paged · run the runbook · mitigate<br>13 Write the postmortem · add a regression eval · ship the fix |
| M35 | 5 Structured logging & log correlation<br>5 Dashboards & SLIs — what goes on the wall<br>7 Online evaluation — sample & score live traffic<br>9 Capacity, rate limits & quota management<br>12 Continuous improvement from postmortems & evals |
Coverage: 36 / 36 existing modules placed in the operations-support map.
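A coverage figure like the one above can be re-checked by hand from the table itself. The sketch below is illustrative only: it is not taken from `generate_mindmap.py`, and the function names, the sample rows, and the `M99` module are hypothetical. It assumes each row has the shape `| M<n> | placements |` and treats a module as placed when its second cell is non-empty.

```python
import re

# Hypothetical helper, not part of generate_mindmap.py.
ROW = re.compile(r"^\|\s*(M\d+)\s*\|\s*(.*?)\s*\|\s*$")


def placed_modules(table_text: str) -> dict[str, str]:
    """Map each module ID to the raw text of its placement cell."""
    placed = {}
    for line in table_text.splitlines():
        match = ROW.match(line)
        # The header and separator rows do not match; empty cells are skipped.
        if match and match.group(2):
            placed[match.group(1)] = match.group(2)
    return placed


def coverage(table_text: str, expected: list[str]) -> tuple[int, int, list[str]]:
    """Return (placed count, expected count, IDs with no placement)."""
    placed = placed_modules(table_text)
    missing = [module for module in expected if module not in placed]
    return len(expected) - len(missing), len(expected), missing


# Two rows copied from the table plus one made-up, unplaced module (M99).
sample = """| Module | Now taught under (operations-support branch · node) |
|---|---|
| M3 | 2 Functions, files, libraries, errors, virtualenv |
| M4 | 2 First AI app · API keys · request → response |
| M99 | |
"""

done, total, missing = coverage(sample, ["M3", "M4", "M99"])
print(f"Coverage: {done} / {total}; unplaced: {missing}")
```

Run against the full table with the expected IDs `M0` through `M35`, a check of this kind should report no unplaced modules if the stated coverage holds.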
Proposed [NEW] gap-fills (do not exist as modules yet)
These are the operations-support topics the current course does not cover, surfaced by the reframe. They are proposals on the map, not built modules.
| Branch | Proposed node |
|---|---|
0 proposed [NEW] nodes across the map.