M32 notes: AI support desk and AIOps (the one idea)
The one idea: operations support is where you both run AI and use AI to keep the lights on. The support desk takes two streams, tickets from humans and alerts from machines, and makes each one manageable: classify and route every ticket under a promise (an SLA), and collapse every alert storm into the few real incidents underneath. The trick that keeps it safe is humility: when the classifier is unsure, a human decides; when one cause fires forty alerts, you page once.
1. Two inboxes, one desk
| Stream | Looks like | What the desk must do |
|---|---|---|
| Tickets (humans) | "it's down", "how do I…", "the answer was wrong" | triage → route → answer within an SLA |
| Alerts (machines) | 40 pages from one outage | correlate into incidents → hand to M31 |
Both are floods. Untriaged, they bury the signal: the sev1 outage sits behind a password question, and the real incident hides inside forty duplicate pages. The desk's job is to impose order so humans spend attention where it matters.
2. Triage = classify + confidence
triage() reads a ticket and assigns a severity (sev1/2/3) and a confidence. In production the
classifier is an LLM (the prompting of M5, the tool/agent of M9); here it is deterministic keyword
rules so the lab is reproducible and testable. The shape is what matters, and it is the same either
way: text in, a label and a confidence out.
The confidence is not decoration. It is the difference between a desk that helps and one that confidently mis-routes. A clear "the whole app is down" scores high; a vague "something feels off" scores low.
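The "text in, a label and a confidence out" shape can be sketched in a few lines. This is a minimal illustration, not the lab's code: the name triage_sketch, the keyword lists, and the confidence numbers are all assumptions chosen so that a clear outage report scores high and a vague one scores low.

```python
from dataclasses import dataclass

# Hypothetical keyword rules, checked in order of severity.
# The lab's actual rules and scores may differ.
RULES = [
    ("sev1", ("down", "outage", "cannot log in")),
    ("sev2", ("error", "wrong", "slow")),
    ("sev3", ("how do i", "password", "question")),
]


@dataclass
class Triage:
    severity: str
    confidence: float


def triage_sketch(text: str) -> Triage:
    """Return a severity label and a confidence for one ticket."""
    lowered = text.lower()
    for severity, keywords in RULES:
        hits = sum(1 for kw in keywords if kw in lowered)
        if hits:
            # More matching keywords means more confidence, capped at 0.95.
            return Triage(severity, min(0.95, 0.6 + 0.2 * hits))
    # No rule matched: fall back to the lowest severity with low confidence,
    # so a downstream gate can send the ticket to a human.
    return Triage("sev3", 0.3)


clear = triage_sketch("The whole app is down")
vague = triage_sketch("something feels off")
print(clear.severity, vague.severity)
```

Swapping the keyword loop for an LLM call would change the body of the function but not its signature, which is the point of keeping the shape fixed.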
3. The confidence gate: when unsure, ask a human
route() checks the confidence first. Below the threshold it refuses to auto-route and sends the
ticket to human triage. This is the same human-in-the-loop principle as the approval gate in M22,
applied to classification: an automated decision you are not sure of is one a person should make. A
desk that auto-acts on shaky labels is worse than no automation, because it is confidently wrong at
scale.
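The gate itself is a single comparison made before anything else happens. A minimal sketch, assuming a threshold of 0.6 and the return labels shown (the name route_sketch, the threshold value, and the label strings are illustrative, not the lab's):

```python
# Assumed threshold; in practice it would be tuned against mis-routing cost.
CONFIDENCE_THRESHOLD = 0.6


def route_sketch(severity: str, confidence: float,
                 threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Check confidence first; only trusted labels are auto-routed."""
    if confidence < threshold:
        # Not sure enough: a person makes the call, whatever the label says.
        return "human_triage"
    return f"auto:{severity}"


print(route_sketch("sev1", 0.9))  # auto:sev1
print(route_sketch("sev3", 0.3))  # human_triage
```

Note that the severity is ignored entirely when confidence is low: a shaky "sev3" might really be a sev1, so the label is not used to decide anything.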
4. Routing, tiers, and SLAs
Once the label is trusted, severity decides two things: which tier responds first and how fast that response is promised.
| Severity | Tier (first response) | SLA (time to first response) |
|---|---|---|
| sev1 | L2 | 15 min |
| sev2 | L2 | 60 min |
| sev3 | L1 | 4 hours |
Tiers are depth of expertise: L1 is front-line (common how-tos), L2 is technical (real bugs), and L3 is engineering/escalation. The SLA is the promise to the user about when, not whether: it turns "we'll get to it" into something you can measure and be held to.
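The table above is small enough to live in code as a lookup. A sketch using the same tiers and SLAs; the names ROUTING and assign are hypothetical:

```python
from datetime import timedelta

# Severity -> (first-response tier, time-to-first-response SLA),
# copied from the table above.
ROUTING = {
    "sev1": ("L2", timedelta(minutes=15)),
    "sev2": ("L2", timedelta(minutes=60)),
    "sev3": ("L1", timedelta(hours=4)),
}


def assign(severity: str) -> dict:
    """Look up the tier and SLA for a trusted severity label."""
    tier, sla = ROUTING[severity]
    return {"tier": tier, "sla": sla}


print(assign("sev1"))
```

Keeping the policy in one table rather than scattered if-statements means a changed promise is a one-line edit.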
5. Escalation: nothing rots in a queue
sla_check() compares how long a ticket has waited to its SLA. Past the deadline it escalates one
tier up the ladder (L1→L2→L3). Escalation is the safety net: it guarantees that a ticket nobody picked
up does not silently age out. Instead, it becomes someone more senior's problem before the user gives up.
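The escalation rule reduces to one comparison and one step up a list. A minimal sketch, assuming the signature shown and assuming that a ticket already at L3 stays at L3 (the notes do not say what happens at the top of the ladder):

```python
from datetime import timedelta

LADDER = ["L1", "L2", "L3"]


def sla_check_sketch(tier: str, waited: timedelta, sla: timedelta) -> str:
    """Return the tier that should own the ticket now."""
    if waited <= sla:
        return tier  # still within the promise: no change
    idx = LADDER.index(tier)
    # Past the deadline: move one tier up, staying at the top tier if
    # there is nowhere higher to go (an assumption of this sketch).
    return LADDER[min(idx + 1, len(LADDER) - 1)]


# A sev3 ticket (4-hour SLA) that has waited 5 hours moves from L1 to L2.
print(sla_check_sketch("L1", timedelta(hours=5), timedelta(hours=4)))  # L2
```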
Analogy. The desk is a hospital emergency room. Triage is the nurse at the door deciding who is critical and who can wait, and saying "I'm not sure, let me get a doctor" when a case is unclear (the confidence gate). The SLA is the target wait time on the wall. Escalation is calling the attending physician when a patient has waited too long. AIOps is realizing that twelve people who arrived together from one car crash are one incident, not twelve.
6. AIOps: page on causes, not symptoms
The machine inbox has the opposite problem: not too vague, too loud. One model-provider outage trips
the error-rate alert, the latency alert, and the health-check alert, on every replica, every minute.
correlate() groups alerts that share a cause signature (same service + symptom) and fall within a
short time window into one incident. In the demo, 40 alerts become 5 incidents, an 88% cut in
pages (35 of 40 avoided, or 87.5%). The two model-api 5xx flares stay separate because they are far
apart in time: the second is a fresh incident, not a continuation of the first.
This is the entire premise of AIOps: use simple correlation (or, fancier, learned similarity) to turn noise into signal. Each correlated incident is exactly what M31 opens, triages, and runs a runbook against: the two modules are the two ends of one pipeline.
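Simple correlation of this kind fits in one pass over time-sorted alerts. The sketch below is an illustration under stated assumptions, not the lab's correlate(): alerts are (timestamp in seconds, service, symptom) tuples, the window is 300 seconds, and the window slides (it is measured from the last alert seen for that signature, not the first).

```python
from dataclasses import dataclass


@dataclass
class Incident:
    service: str
    symptom: str
    first_seen: float
    last_seen: float
    count: int = 1


def correlate_sketch(alerts, window_s: float = 300.0) -> list:
    """Group alerts sharing (service, symptom) that arrive close in time."""
    incidents = []
    latest = {}  # signature -> most recent incident for that signature
    for ts, service, symptom in sorted(alerts):
        sig = (service, symptom)
        inc = latest.get(sig)
        if inc is not None and ts - inc.last_seen <= window_s:
            # Same cause, still inside the window: fold into the incident.
            inc.last_seen = ts
            inc.count += 1
        else:
            # New signature, or the old incident has gone quiet: open a new one.
            inc = Incident(service, symptom, ts, ts)
            incidents.append(inc)
            latest[sig] = inc
    return incidents


alerts = [
    (0, "model-api", "5xx"),
    (30, "model-api", "5xx"),
    (60, "model-api", "latency"),
    (4000, "model-api", "5xx"),  # far apart in time: a fresh incident
]
print(len(correlate_sketch(alerts)))  # 3
```

A sliding window keeps a long-running outage in one incident as long as alerts keep arriving; a fixed window from the first alert would split it, which is a design choice either way.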
7. Putting it together
demo_mock.py: five tickets are triaged (the vague one drops to a human), routed under SLAs, and the
one that waited too long escalates; then a 40-alert storm correlates down to a few incidents. The desk
turns two floods into a short, prioritized, trustworthy list, which is the whole point of operations
support: keep human attention on the things that actually need it.
Words you will hear
Triage, severity (sev1/2/3), confidence / confidence gate, support tier (L1/L2/L3), SLA (service-level agreement), escalation, handoff / human-in-the-loop, AIOps, alert correlation / deduplication, noise reduction, alert fatigue. Full definitions in the glossary.