Lab M32: build an AI support desk (and tame the alert storm)
You'll need: Python and your venv. No API key, no cost, instant and deterministic (tickets and alerts are fixtures; the classifier is keyword rules so your numbers match this guide). Time: about 45 minutes. Work in your breakout pair.
Heads up: the "AI classifier" here is deterministic on purpose, so the lab is reproducible. In production you would drop an LLM call (M5/M9) into triage() and everything around it stays the same.
This lab has two parts:
- Part A: triage tickets, gate the uncertain ones to a human, and route the rest under an SLA.
- Part B: escalate an SLA breach, then run AIOps correlation on a 40-alert storm.
flowchart TB
T["ticket"] --> TR["triage<br/>severity + confidence"]
TR --> G{"confident?"}
G -->|no| H["human review"]
G -->|yes| R["route to tier<br/>+ SLA"]
R --> S{"SLA missed?"}
S -->|yes| E["escalate a tier"]
S -->|no| OK["in progress"]
AL["40 alerts"] --> C["AIOps correlate"] --> I["a few incidents → M31"]
Part A: triage, the confidence gate, and routing
Step 1: Set up
Copy the solution/ files into a folder and activate your venv. Nothing to install.
python -c "import support_desk, aiops; print('desk ok')"
You should see: desk ok. (If not, run it from inside the folder with the .py files.)
Step 2: Run the whole desk once
python demo_mock.py
==== A. TRIAGE: classify each ticket (severity + confidence) ====
T1: severity=sev1 confidence=1.0 | The whole app is down, all users cannot log in!
T4: severity=sev3 confidence=0.2 | Hmm, something just feels a bit off lately.
Open support_desk.py and read triage.
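Before you open the file, here is a minimal sketch of what a keyword-rule classifier with a confidence score can look like. It is not the lab's code: the RULES table, the scoring (0.5 per keyword hit, capped at 1.0), the 0.2 floor, and the name triage_sketch are all assumptions made for illustration, chosen so the two sample tickets above come out the same way.

```python
from dataclasses import dataclass

# Hypothetical keyword table; the real RULES in support_desk.py may differ.
RULES = {
    "sev1": ("down", "outage", "cannot log in"),
    "sev2": ("slow", "error", "timeout"),
    "sev3": ("typo", "question", "how do i"),
}
DEFAULT_SEVERITY = "sev3"
NO_MATCH_CONFIDENCE = 0.2  # assumed floor for tickets with no known keywords


@dataclass
class Triage:
    severity: str
    confidence: float


def triage_sketch(text: str) -> Triage:
    """Pick the severity with the most keyword hits; more hits, more confidence."""
    lowered = text.lower()
    best_severity, best_hits = DEFAULT_SEVERITY, 0
    # RULES is ordered most severe first, so a tie keeps the more severe label.
    for severity, keywords in RULES.items():
        hits = sum(1 for kw in keywords if kw in lowered)
        if hits > best_hits:
            best_severity, best_hits = severity, hits
    if best_hits == 0:
        return Triage(DEFAULT_SEVERITY, NO_MATCH_CONFIDENCE)
    # Assumed scoring: one hit gives 0.5, two or more saturate at 1.0.
    return Triage(best_severity, min(1.0, 0.5 * best_hits))


print(triage_sketch("The whole app is down, all users cannot log in!"))
print(triage_sketch("Hmm, something just feels a bit off lately."))
```

The point to notice is that confidence comes from how much evidence matched, not from the severity itself, which is what lets the next step gate on it.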
Step 3: See the confidence gate send T4 to a human
Look at section B:
T4: status=human_review tier=human low confidence 0.2 -> human triage (do not auto-route)
That is route at work: below the confidence threshold, it refuses to guess and hands off to a person.
You should now see: an automated decision you are not sure of is one a human should make (M22).
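The gate itself can be very small. The sketch below shows one way to write it, assuming a strict "less than" comparison, a default threshold of 0.5, and a made-up severity-to-tier table; only the parameter name low_confidence and the status and tier strings come from this lab's output, so treat the rest as hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Routed:
    status: str
    tier: str


# Assumed mapping and default threshold; not copied from support_desk.py.
TIER_FOR_SEVERITY = {"sev1": "L3", "sev2": "L2", "sev3": "L1"}
DEFAULT_LOW_CONFIDENCE = 0.5


def route_sketch(severity: str, confidence: float,
                 low_confidence: float = DEFAULT_LOW_CONFIDENCE) -> Routed:
    """Send low-confidence tickets to a person; auto-route everything else."""
    if confidence < low_confidence:
        # Refuse to guess: no tier is chosen automatically.
        return Routed("human_review", "human")
    return Routed("routed", TIER_FOR_SEVERITY[severity])


print(route_sketch("sev3", 0.2))  # the vague ticket is gated
print(route_sketch("sev1", 1.0))  # the confident ticket is routed
```

Keeping the threshold as a parameter rather than a constant buried in the function is what makes the desk's caution adjustable per call.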
Step 4: Prove the gate is yours to set
python -c "import support_desk as s; t=s.route(s.triage(s.Ticket('X','something feels off')), low_confidence=0.1); print(t.status, t.tier)"
You should see: routed L1. With a looser threshold (0.1), the same vague ticket is now auto-routed instead of going to a human. You just tuned how cautious the desk is.
Part B: SLA escalation and AIOps correlation
Step 5: Watch an SLA breach escalate
Look at section C:
T2: status=escalated tier=L3 SLA breach (90m > 60m) -> escalate to L3
That is sla_check at work. Escalation makes sure a ticket that has blown its SLA becomes someone more senior's problem instead of sitting unnoticed. Everything else stayed within its SLA.
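As a rough picture of the escalation rule, here is a sketch that bumps a ticket one tier when its elapsed time exceeds its SLA. The tier ladder, the "in_progress" status string, the function signature, and the assumption that T2 started at L2 are all illustrative guesses; read sla_check in support_desk.py for the real logic.

```python
# Assumed escalation ladder, lowest tier first.
TIERS = ["L1", "L2", "L3"]


def sla_check_sketch(tier: str, elapsed_min: int, sla_min: int) -> tuple[str, str, str]:
    """Return (status, tier, note) after comparing elapsed time with the SLA."""
    if elapsed_min <= sla_min:
        return "in_progress", tier, "within SLA"
    # Move up one tier, but never past the top of the ladder.
    next_index = min(TIERS.index(tier) + 1, len(TIERS) - 1)
    next_tier = TIERS[next_index]
    note = f"SLA breach ({elapsed_min}m > {sla_min}m) -> escalate to {next_tier}"
    return "escalated", next_tier, note


# A ticket like T2, assuming it was sitting at L2 with a 60-minute SLA.
status, tier, note = sla_check_sketch("L2", elapsed_min=90, sla_min=60)
print(f"status={status} tier={tier} {note}")
```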
Step 6: Collapse the alert storm (AIOps)
Look at section D:
==== D. AIOPS: collapse an alert storm into incidents ====
INC#1 model-api/5xx: 18 alerts over t=0-5m
...
40 alerts -> 5 incidents | noise reduction: 88%
correlate grouped them by service+symptom within a time window
into a handful of incidents. Open aiops.py and read correlate.
You should now see: the two model-api/5xx flares stay separate because they are far apart in time. The later flare is a fresh incident, not a continuation of the old one.
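To make the grouping rule concrete, here is a minimal sketch of time-window correlation. It assumes an alert joins the latest incident for its service+symptom when it arrives within window minutes of that incident's most recent alert; the real correlate in aiops.py may measure the window differently. The Alert class, the default window of 10, and the six-alert storm (including the db/latency alert) are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    t: int          # minutes since the start of the storm
    service: str
    symptom: str


def correlate_sketch(alerts, window=10):
    """Group alerts by (service, symptom); a gap larger than `window` opens a new incident."""
    incidents = []     # each incident: {"key", "start", "end", "count"}
    latest = {}        # most recent incident per (service, symptom)
    for alert in sorted(alerts, key=lambda a: a.t):
        key = (alert.service, alert.symptom)
        incident = latest.get(key)
        if incident is not None and alert.t - incident["end"] <= window:
            incident["end"] = alert.t
            incident["count"] += 1
        else:
            incident = {"key": key, "start": alert.t, "end": alert.t, "count": 1}
            incidents.append(incident)
            latest[key] = incident
    return incidents


# Hypothetical mini-storm: two model-api/5xx flares 43 minutes apart, plus one other alert.
storm = [
    Alert(0, "model-api", "5xx"), Alert(1, "model-api", "5xx"), Alert(2, "model-api", "5xx"),
    Alert(3, "db", "latency"),
    Alert(45, "model-api", "5xx"), Alert(46, "model-api", "5xx"),
]
for number, inc in enumerate(correlate_sketch(storm), start=1):
    service, symptom = inc["key"]
    print(f"INC#{number} {service}/{symptom}: {inc['count']} alerts over t={inc['start']}-{inc['end']}m")
```

With this rule, the second flare starts a new incident only because the 43-minute gap exceeds the window, which is the behaviour the section D output illustrates.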
Step 7: Make the window your own
python -c "import aiops, demo_mock as d; s=d.build_storm(); print('window=60 ->', len(aiops.correlate(s, window=60)), 'incidents')"
You should see: window=60 -> 4 incidents. With a one-hour window, the late 5xx flare merges into the first one. That gives you fewer incidents, but you might also merge two genuinely separate problems. The window is a tradeoff you tune.
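The noise-reduction figure is simple arithmetic on those two counts. The sketch below assumes the demo computes it as the share of alerts that did not become their own incident, which is consistent with the 88% shown in section D; the function name is hypothetical.

```python
def noise_reduction_pct(alerts: int, incidents: int) -> float:
    """Percentage of alerts absorbed into incidents (assumed formula)."""
    return 100 * (alerts - incidents) / alerts


for incidents in (5, 4):
    pct = noise_reduction_pct(40, incidents)
    print(f"40 alerts -> {incidents} incidents | noise reduction: {pct:.0f}%")
```

Under this formula a wider window always looks better on paper, so read the percentage alongside the incident list rather than on its own.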
Step 8: Show it
Post in the chat your noise-reduction number from section D and which ticket the confidence gate sent to a human (section B).
If you get stuck
- ModuleNotFoundError -> run from inside the folder with support_desk.py and aiops.py.
- My ticket got the wrong severity -> triage is keyword-based; read RULES. A ticket with no known keywords scores low confidence and goes to a human, which is the safe default.
- AIOps merged things I didn't expect -> check the window. A larger window groups more aggressively; too large and separate incidents merge.