Lab M32: build an AI support desk (and tame the alert storm)

You'll need: Python and your venv. No API key, no cost, instant and deterministic (tickets and alerts are fixtures; the classifier is keyword rules so your numbers match this guide). Time: about 45 minutes. Work in your breakout pair.

Heads up: the "AI classifier" here is deterministic on purpose, so the lab is reproducible. In production you would drop an LLM call (M5/M9) into triage() and everything around it stays the same.
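
One way to keep everything around triage unchanged is to let it accept any callable that maps ticket text to a (severity, confidence) pair. The sketch below is an assumption about how that seam could look, not the lab's actual triage(). Classifier, keyword_classifier, stub_llm_classifier, and triage_with are hypothetical names, and the LLM stand-in makes no network call.

```python
from typing import Callable, Tuple

# A classifier is any function: ticket text -> (severity, confidence).
Classifier = Callable[[str], Tuple[str, float]]


def keyword_classifier(text: str) -> Tuple[str, float]:
    """Deterministic stand-in, in the spirit of the lab's keyword rules."""
    if "down" in text.lower():
        return "sev1", 1.0
    return "sev3", 0.2  # nothing recognised: low confidence


def stub_llm_classifier(text: str) -> Tuple[str, float]:
    """Placeholder for a real LLM call; it returns a fixed answer here."""
    return "sev2", 0.7


def triage_with(classifier: Classifier, text: str) -> dict:
    """Run whichever classifier is plugged in and clamp confidence to [0, 1]."""
    severity, confidence = classifier(text)
    confidence = max(0.0, min(1.0, confidence))
    return {"severity": severity, "confidence": confidence}


print(triage_with(keyword_classifier, "The whole app is down"))
print(triage_with(stub_llm_classifier, "The whole app is down"))
```

The caller only ever sees a severity and a confidence, so swapping the classifier does not touch the gate, routing, or SLA logic.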

This lab has two parts:

- Part A: triage tickets, gate the uncertain ones to a human, and route the rest under an SLA.
- Part B: escalate an SLA breach, then run AIOps correlation on a 40-alert storm.

flowchart TB
  T["ticket"] --> TR["triage<br/>severity + confidence"]
  TR --> G{"confident?"}
  G -->|no| H["human review"]
  G -->|yes| R["route to tier<br/>+ SLA"]
  R --> S{"SLA missed?"}
  S -->|yes| E["escalate a tier"]
  S -->|no| OK["in progress"]
  AL["40 alerts"] --> C["AIOps correlate"] --> I["a few incidents → M31"]

Part A: triage, the confidence gate, and routing

Step 1: Set up

Copy the solution/ files into a folder and activate your venv. Nothing to install.

python -c "import support_desk, aiops; print('desk ok')"

You should now see: desk ok. (If not: run it from inside the folder with the .py files.)

Step 2: Run the whole desk once

python demo_mock.py

You should now see five sections, A to E. Look at section A:

==== A. TRIAGE: classify each ticket (severity + confidence) ====
T1: severity=sev1  confidence=1.0  | The whole app is down, all users cannot log in!
T4: severity=sev3  confidence=0.2  | Hmm, something just feels a bit off lately.

The clear outage (T1) scores high; the vague message (T4) scores low. Triage gives you both a label and how much to trust it. Open support_desk.py and read triage.
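
Before you open the file, here is a minimal sketch of how a keyword classifier can return both a label and a confidence. It is illustrative only: SKETCH_RULES, TriageResult, and triage_sketch are hypothetical names, and the keywords and scores are assumptions, so the real RULES table in support_desk.py may differ.

```python
from dataclasses import dataclass

# Hypothetical rules: (keyword, severity, confidence), most severe first.
# These are assumptions for illustration, not the lab's real RULES.
SKETCH_RULES = [
    ("down", "sev1", 1.0),
    ("cannot log in", "sev1", 1.0),
    ("slow", "sev2", 0.8),
    ("typo", "sev3", 0.7),
]


@dataclass
class TriageResult:
    severity: str
    confidence: float


def triage_sketch(text: str) -> TriageResult:
    """Return the highest-confidence matching rule, or a low-confidence default."""
    lowered = text.lower()
    matches = [(sev, conf) for kw, sev, conf in SKETCH_RULES if kw in lowered]
    if not matches:
        # No known keyword: pick the mildest severity and admit we are unsure.
        return TriageResult("sev3", 0.2)
    # max() keeps the first of equal scores, so the more severe rule wins ties.
    sev, conf = max(matches, key=lambda m: m[1])
    return TriageResult(sev, conf)


print(triage_sketch("The whole app is down, all users cannot log in!"))
print(triage_sketch("Hmm, something just feels a bit off lately."))
```

The low-confidence default is the important design choice: an unrecognised ticket still gets a label, but the score tells the next stage not to trust it.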

Step 3: See the confidence gate send T4 to a human

Look at section B:

T4: status=human_review tier=human  low confidence 0.2 -> human triage (do not auto-route)

Every other ticket was routed automatically; T4 was not, because triage was unsure. Read route: below the confidence threshold it refuses to guess and hands off to a person. You should now see: an automated decision you are not sure of is one a human should make (M22).
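
The gate described above can be sketched in a few lines. This is not the lab's route: RoutedTicket, route_sketch, and TIER_FOR_SEVERITY are hypothetical names, the default threshold of 0.5 is an assumption, and the sev1 tier is a guess (the guide only shows sev2 at L2 and a vague sev3 ticket at L1).

```python
from dataclasses import dataclass

# Assumed severity -> tier mapping; only sev2 -> L2 and sev3 -> L1 come from the guide.
TIER_FOR_SEVERITY = {"sev1": "L3", "sev2": "L2", "sev3": "L1"}


@dataclass
class RoutedTicket:
    ticket_id: str
    severity: str
    confidence: float
    status: str = "new"
    tier: str = ""


def route_sketch(ticket: RoutedTicket, low_confidence: float = 0.5) -> RoutedTicket:
    """Auto-route confident tickets; send uncertain ones to a human."""
    if ticket.confidence < low_confidence:
        # Below the threshold: refuse to guess and hand off.
        ticket.status, ticket.tier = "human_review", "human"
    else:
        ticket.status, ticket.tier = "routed", TIER_FOR_SEVERITY[ticket.severity]
    return ticket


vague = route_sketch(RoutedTicket("T4", "sev3", 0.2))
print(vague.status, vague.tier)
```

Because the threshold is a parameter rather than a constant, the same function can be made more or less cautious per call.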

Step 4: Prove the gate is yours to set

python -c "import support_desk as s; t=s.route(s.triage(s.Ticket('X','something feels off')), low_confidence=0.1); print(t.status, t.tier)"

You should now see: routed L1. With a looser threshold (0.1), the same vague ticket is auto-routed instead of going to a human. You just tuned how cautious the desk is.

Part B: SLA escalation and AIOps correlation

Step 5: Watch an SLA breach escalate

Look at section C:

T2: status=escalated    tier=L3     SLA breach (90m > 60m) -> escalate to L3

T2 is a sev2 with a 60-minute SLA, but it waited 90 minutes, so it escalated one tier (L2→L3). Read sla_check. Escalation ensures that a ticket which has breached its SLA becomes a more senior tier's problem instead of sitting unnoticed in the queue. Everything else stayed within its SLA.
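
The escalation rule above can be sketched as a simple comparison. This is a guess at the shape of the logic, not the lab's sla_check: sla_check_sketch, SLA_MINUTES, and NEXT_TIER are hypothetical names, only the 60-minute sev2 SLA comes from this guide, and the other limits and the "in_progress" status string are assumptions.

```python
# SLA per severity in minutes. Only sev2 = 60 is from the guide; the rest are assumed.
SLA_MINUTES = {"sev1": 15, "sev2": 60, "sev3": 240}

# Escalation goes up one tier; L3 is assumed to be the top and stays at L3.
NEXT_TIER = {"L1": "L2", "L2": "L3", "L3": "L3"}


def sla_check_sketch(severity: str, tier: str, waited_minutes: int):
    """Return (status, tier, note) after comparing wait time to the SLA."""
    limit = SLA_MINUTES[severity]
    if waited_minutes > limit:
        new_tier = NEXT_TIER[tier]
        note = f"SLA breach ({waited_minutes}m > {limit}m) -> escalate to {new_tier}"
        return "escalated", new_tier, note
    return "in_progress", tier, f"within SLA ({waited_minutes}m <= {limit}m)"


print(sla_check_sketch("sev2", "L2", 90))
print(sla_check_sketch("sev2", "L2", 45))
```

Note the strict comparison: in this sketch a ticket that has waited exactly its SLA has not breached yet.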

Step 6: Collapse the alert storm (AIOps)

Look at section D:

==== D. AIOPS: collapse an alert storm into incidents ====
  INC#1 model-api/5xx: 18 alerts over t=0-5m
  ...
40 alerts -> 5 incidents  | noise reduction: 88%

One outage fired dozens of alerts; correlate grouped them by service+symptom within a time window into a handful of incidents. Open aiops.py and read correlate. You should now see: the two model-api/5xx flares stay separate because they are far apart in time. The later flare is a fresh incident, not a continuation of the old one.
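
Grouping by service+symptom within a time window can be sketched as follows. This is one possible reading, not the lab's correlate: Alert, Incident, and correlate_sketch are hypothetical names, the default window of 10 minutes is an assumption, and the sketch measures the window from the last alert in an incident, which the real code may do differently.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    symptom: str
    t: int  # minutes since the start of the storm


@dataclass
class Incident:
    service: str
    symptom: str
    start: int
    end: int
    count: int = 1


def correlate_sketch(alerts, window=10):
    """Fold alerts with the same (service, symptom) into one incident
    while each new alert arrives within `window` minutes of the last one."""
    incidents = []
    latest = {}  # (service, symptom) -> most recent incident for that key
    for a in sorted(alerts, key=lambda a: a.t):
        key = (a.service, a.symptom)
        inc = latest.get(key)
        if inc is not None and a.t - inc.end <= window:
            inc.end = a.t
            inc.count += 1
        else:
            # First alert for this key, or too long since the last one.
            inc = Incident(a.service, a.symptom, a.t, a.t)
            incidents.append(inc)
            latest[key] = inc
    return incidents


# A tiny made-up storm: an early 5xx flare, one latency alert, a late 5xx flare.
storm = [Alert("model-api", "5xx", t) for t in (0, 1, 5, 50)]
storm.append(Alert("vector-db", "latency", 2))
for inc in correlate_sketch(storm):
    print(f"{inc.service}/{inc.symptom}: {inc.count} alerts over t={inc.start}-{inc.end}m")
```

With the sample data, the alert at t=50 is 45 minutes after the previous 5xx alert, so a 10-minute window opens a second incident for it.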

Step 7: Make the window your own

python -c "import aiops, demo_mock as d; s=d.build_storm(); print('window=60 ->', len(aiops.correlate(s, window=60)), 'incidents')"

You should now see: window=60 -> 4 incidents. With a one-hour window the late 5xx flare merges into the first one. That gives fewer incidents, but you risk merging two genuinely separate problems. The window is a tradeoff you tune.
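
To compare the two settings, it helps to put a number on each. The 88% in section D is consistent with 1 - incidents/alerts (1 - 5/40 = 0.875, rounded), so the sketch below assumes that formula. noise_reduction is a hypothetical helper, not a function from aiops.py.

```python
def noise_reduction(alert_count: int, incident_count: int) -> float:
    """Fraction of alerts that no longer need individual attention."""
    if alert_count <= 0:
        raise ValueError("alert_count must be positive")
    return 1 - incident_count / alert_count


# Incident counts taken from this guide: 5 with the demo's window, 4 with window=60.
for label, incidents in [("demo default", 5), ("window=60", 4)]:
    reduction = noise_reduction(40, incidents)
    print(f"{label}: 40 alerts -> {incidents} incidents, reduction {reduction:.1%}")
```

A higher percentage is not automatically better: the extra reduction at window=60 comes entirely from merging the late 5xx flare into the earlier one.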

Step 8: Show it

Post in the chat your noise-reduction number from section D and which ticket the confidence gate sent to a human (section B).

If you get stuck

ModuleNotFoundError -> run from inside the folder with support_desk.py and aiops.py.
My ticket got the wrong severity -> triage is keyword-based; read RULES. A ticket with no known keywords scores low confidence and goes to a human, which is the safe default.
AIOps merged things I didn't expect -> check the window. A larger window groups more aggressively; too large and separate incidents merge.

Check yourself

Why score a confidence, not just a severity label?

Because acting on a wrong label at scale is worse than not automating. The confidence says how much to trust the label; below a threshold the desk hands off to a human instead of confidently mis-routing. It is the human-in-the-loop gate from M22, applied to classification.

What is the difference between a tier and a severity?

Severity is how urgent the problem is (sev1/2/3). A tier is who handles it (L1 front-line, L2 technical, L3 engineering). Routing maps one to the other, and the SLA sets the promised response time for that severity.

Why does AIOps correlation matter for an AI system specifically?

AI systems have many moving parts (model API, vector store, tools, replicas), so one root cause trips many alerts at once. Correlating them into incidents means the on-call (M31) pages on causes, not symptoms, and is not buried by alert fatigue.

How do M32 and M31 fit together?

M32 turns floods into a short, trustworthy list: triaged tickets and correlated incidents. Each correlated incident is exactly what M31 opens, triages, and runs a runbook against. M32 is intake; M31 is response.