M32: AI support desk and AIOps (Part E: Operations Support)

M31 covered the night the system breaks. Most days are quieter and noisier at the same time: a queue of user tickets ("it's down!", "how do I…?", "the answer was wrong") and a firehose of machine alerts where one outage trips forty pages. Operations support is the desk that handles both, and it is a natural fit for AI. A classifier reads each ticket and decides how urgent it is and who should get it, an SLA clock makes sure nothing rots in the queue, and an AIOps step collapses the alert storm down to the few real incidents underneath. Today you build that desk, offline, and you make it hand the uncertain cases to a human instead of guessing.

Today's win: a support desk that triages tickets by severity and confidence, routes them to the right tier under an SLA, escalates the ones that miss it, and sends the uncertain ones to a human, plus an AIOps step that turns a 40-alert storm into a handful of incidents. All of it runs offline.

Today you will

  • Triage a ticket: assign a severity and a confidence score from its text
  • Add a confidence gate so uncertain triage goes to a human, not a wrong auto-route (M14/M22)
  • Route by severity to a tier (L1/L2/L3) with an SLA, and escalate when the SLA is missed
  • Run an AIOps correlation that groups an alert storm into incidents (page on causes, not symptoms)
  • See how each correlated incident is exactly what M31 opens and fixes with a runbook
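
The triage, gate, routing, and escalation steps above can be sketched in a few lines. This is an illustrative sketch, not the contents of support_desk.py: the keyword lists, the 0.6 gate, the severity-to-tier mapping, the SLA minutes, and every function name are assumptions made up for the example.

```python
from dataclasses import dataclass

# All values below are hypothetical, chosen only to illustrate the flow.
SEVERITY_KEYWORDS = {
    "sev1": ("down", "outage", "data loss"),
    "sev2": ("wrong", "error", "slow"),
    "sev3": ("how do i", "question", "where is"),
}
TIER_FOR = {"sev1": "L3", "sev2": "L2", "sev3": "L1"}
SLA_MINUTES = {"L1": 240, "L2": 60, "L3": 15}
NEXT_TIER = {"L1": "L2", "L2": "L3", "L3": "L3"}
CONFIDENCE_GATE = 0.6


@dataclass
class Triage:
    severity: str
    confidence: float


def triage(text: str) -> Triage:
    """Count keyword hits per severity; confidence is the winner's share of all hits."""
    lowered = text.lower()
    hits = {
        sev: sum(kw in lowered for kw in kws)
        for sev, kws in SEVERITY_KEYWORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        return Triage("sev3", 0.0)  # nothing matched: zero confidence
    best = max(hits, key=hits.get)
    return Triage(best, hits[best] / total)


def route(text: str) -> str:
    """Return a tier, or "human" when the triage is too uncertain to auto-route."""
    result = triage(text)
    if result.confidence < CONFIDENCE_GATE:
        return "human"
    return TIER_FOR[result.severity]


def escalate_if_late(tier: str, waited_minutes: int) -> str:
    """Bump a ticket one tier up once it has waited past its SLA."""
    if waited_minutes > SLA_MINUTES[tier]:
        return NEXT_TIER[tier]
    return tier
```

With these made-up rules, a ticket that mentions both "wrong" and "down" splits its hits evenly between two severities, falls below the gate, and goes to a human instead of being auto-routed. Substring matching is crude (it would also match "down" inside "download"), which is one reason a production classifier would be an LLM call with a real confidence signal.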

Run of show (about 60 minutes)

Time  What we do
0:00  Hook: the desk handles both the user queue and the machine firehose
0:05  The one idea: classify, route under an SLA, and reduce noise (read notes.md)
0:12  Lab Part A: triage, the confidence gate, and routing
0:30  Lab Part B: SLA escalation and AIOps alert correlation
0:50  Show: post your noise-reduction number and the ticket that went to a human
1:00  Wrap

If you get stuck

  • This module safeguards the users of what you built. In production the triage classifier is an LLM call (M5/M9); the confidence-to-human gate is the approval idea from M22; and the correlated incidents feed M31.
  • The whole lab runs offline, instantly, and free, with no API key (tickets and alerts are fixtures). It needs no new libraries. The "classifier" is deterministic keyword rules so your results match this guide exactly.
  • The desk path is a few lines in support_desk.py; correlation is in aiops.py. Read the one for the step you are on.
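
If the correlation step feels abstract, here is one way the idea of "page on causes, not symptoms" could look. It is a hedged sketch and not the code in aiops.py: the dependency map, the alert shape, and the function names are assumptions, and it ignores time windows entirely.

```python
from collections import defaultdict

# Hypothetical dependency map: service -> the upstream service it depends on.
DEPENDS_ON = {
    "web": "checkout",
    "checkout": "payments",
    "payments": "db",
    "search": "db",
}


def root_cause(service, alerting):
    """Walk upstream while the dependency is also alerting; stop at the last one."""
    seen = {service}
    while True:
        upstream = DEPENDS_ON.get(service)
        if upstream is None or upstream not in alerting or upstream in seen:
            return service
        seen.add(upstream)
        service = upstream


def correlate(alerts):
    """Group alerts into incidents keyed by their likely root-cause service."""
    alerting = {alert["service"] for alert in alerts}
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[root_cause(alert["service"], alerting)].append(alert)
    return dict(incidents)


def noise_reduction_percent(alerts, incidents):
    """Whole-number percentage of pages avoided by paging per incident."""
    return 100 * (len(alerts) - len(incidents)) // len(alerts)


# A made-up 10-alert storm: nine symptoms of one db problem plus one unrelated alert.
storm = [
    {"service": name}
    for name in ["db", "payments", "payments", "payments", "checkout",
                 "checkout", "search", "search", "web", "mailer"]
]
incidents = correlate(storm)
reduction = noise_reduction_percent(storm, incidents)
```

In this toy storm the ten alerts collapse into two incidents, one rooted at db and one at mailer, so the desk pages twice instead of ten times. The lab's own fixtures and grouping rule may differ, so treat the numbers here as an illustration only.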

Optional challenge

Open starters/priority_queue.py and implement an SLA-urgency queue: given the routed tickets and how long each has waited, order them by time remaining to the SLA so the desk always works the most-at-risk ticket next, instead of first-in-first-out. It is the difference between meeting your SLAs and missing them while busy.
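
One possible shape for the ordering, as a rough sketch under stated assumptions: the ticket fields ("tier", "waited"), the SLA minutes, and the function name are hypothetical and are not taken from starters/priority_queue.py.

```python
import heapq

# Hypothetical SLA targets in minutes per tier.
SLA_MINUTES = {"L1": 240, "L2": 60, "L3": 15}


def order_by_sla_risk(tickets):
    """Return tickets most-at-risk first: least time remaining to the SLA leads.

    A negative remaining time means the SLA is already breached. The index is a
    tie-breaker so the heap never has to compare two ticket dicts.
    """
    heap = [
        (SLA_MINUTES[t["tier"]] - t["waited"], index, t)
        for index, t in enumerate(tickets)
    ]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


tickets = [
    {"id": "A", "tier": "L1", "waited": 200},  # 40 minutes left
    {"id": "B", "tier": "L3", "waited": 5},    # 10 minutes left
    {"id": "C", "tier": "L2", "waited": 70},   # 10 minutes over
    {"id": "D", "tier": "L1", "waited": 10},   # 230 minutes left
]
ordered_ids = [t["id"] for t in order_by_sla_risk(tickets)]
```

Here ticket A arrived first, yet it is worked third: C has already breached and B has only ten minutes left, which is exactly the case first-in-first-out gets wrong.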