M32 notes: AI support desk and AIOps (the one idea)

The one idea: operations support is where you both run AI and use AI to keep the lights on. The support desk takes two streams, tickets from humans and alerts from machines, and makes each one manageable: classify and route every ticket under a promise (an SLA), and collapse every alert storm into the few real incidents underneath. The trick that keeps it safe is humility: when the classifier is unsure, a human decides; when one cause fires forty alerts, you page once.

1. Two inboxes, one desk

Stream	Looks like	What the desk must do
Tickets (humans)	"it's down", "how do I…", "the answer was wrong"	triage → route → answer within an SLA
Alerts (machines)	40 pages from one outage	correlate into incidents → hand to M31

Both are floods. Untriaged, they bury the signal: the sev1 outage sits behind a password question, and the real incident hides inside forty duplicate pages. The desk's job is to impose order so humans spend attention where it matters.

2. Triage = classify + confidence

triage() reads a ticket and assigns a severity (sev1/2/3) and a confidence. In production the classifier is an LLM (the prompting of M5, the tool/agent of M9); here it is deterministic keyword rules so the lab is reproducible and testable. The shape is what matters, and it is the same either way: text in, a label and a confidence out.

The confidence is not decoration. It is the difference between a desk that helps and one that confidently mis-routes. A clear "the whole app is down" scores high; a vague "something feels off" scores low.
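A minimal sketch of the keyword-rule shape described above. The rule list, the confidence numbers, and the Triage type are illustrative assumptions, not the lab's actual rule set; only the contract (text in, a label and a confidence out) comes from these notes.

```python
from dataclasses import dataclass

# Hypothetical rules: (phrase, severity, confidence). Values are made up
# for illustration; the real lab rules may differ.
RULES = [
    ("down", "sev1", 0.95),
    ("outage", "sev1", 0.95),
    ("wrong", "sev2", 0.80),
    ("error", "sev2", 0.80),
    ("how do i", "sev3", 0.85),
]


@dataclass
class Triage:
    severity: str
    confidence: float


def triage(text: str) -> Triage:
    """Return the first matching rule's label, else a low-confidence guess."""
    lowered = text.lower()
    for phrase, severity, confidence in RULES:
        if phrase in lowered:
            return Triage(severity, confidence)
    # Nothing matched: still emit a label, but mark it as shaky.
    return Triage("sev3", 0.30)


print(triage("The whole app is down"))   # Triage(severity='sev1', confidence=0.95)
print(triage("something feels off"))     # Triage(severity='sev3', confidence=0.3)
```

Substring matching is deliberately crude (a ticket mentioning "download" would match "down"), which is one more reason the confidence value matters downstream.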

3. The confidence gate: when unsure, ask a human

route() checks the confidence first. Below the threshold it refuses to auto-route and sends the ticket to human triage. This is the same human-in-the-loop principle as the approval gate in M22, applied to classification: an automated decision you are not sure of is one a person should make. A desk that auto-acts on shaky labels is worse than no automation, because it is confidently wrong at scale.
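The gate itself can be sketched in a few lines. The threshold of 0.6 and the returned queue names are assumptions chosen for illustration; the notes only specify that anything below the threshold goes to a human.

```python
CONFIDENCE_THRESHOLD = 0.6  # assumed value; tune per desk


def route(severity: str, confidence: float,
          threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Check confidence first; only trusted labels are auto-routed."""
    if confidence < threshold:
        # The classifier is unsure, so a person makes the call.
        return "human_triage"
    return f"auto:{severity}"


print(route("sev1", 0.95))  # auto:sev1
print(route("sev3", 0.30))  # human_triage
```

The design choice is that the confidence check comes before anything else: a low-confidence sev1 label is not treated as a sev1, it is treated as unknown.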

4. Routing, tiers, and SLAs

Once trusted, severity decides two things:

Severity	Tier (first response)	SLA (time to first response)
sev1	L2	15 min
sev2	L2	60 min
sev3	L1	4 hours

Tiers mark depth of expertise: L1 is front-line (common how-tos), L2 is technical (real bugs), and L3 is engineering/escalation. The SLA is a promise to the user about when, not whether; it turns "we'll get to it" into something you can measure and be held to.
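The severity table maps naturally onto a lookup. The sketch below copies the tiers and SLAs from the table; the function name assign and the returned field names are hypothetical.

```python
from datetime import datetime, timedelta

# Severity -> (first-response tier, SLA), taken from the table in these notes.
ROUTING = {
    "sev1": ("L2", timedelta(minutes=15)),
    "sev2": ("L2", timedelta(minutes=60)),
    "sev3": ("L1", timedelta(hours=4)),
}


def assign(severity: str, opened_at: datetime) -> dict:
    """Pick the tier and compute the first-response deadline."""
    tier, sla = ROUTING[severity]
    return {"tier": tier, "sla": sla, "respond_by": opened_at + sla}


ticket = assign("sev3", datetime(2024, 1, 1, 9, 0))
print(ticket["tier"], ticket["respond_by"])  # L1 2024-01-01 13:00:00
```

Storing the deadline as a timestamp rather than a duration is what makes the promise measurable: any later check only has to compare it to the clock.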

5. Escalation: nothing rots in a queue

sla_check() compares how long a ticket has waited to its SLA. Past the deadline it escalates one tier up the ladder (L1→L2→L3). Escalation is the safety net: it guarantees that a ticket nobody picked up does not silently age out; it becomes someone more senior's problem before the user gives up.
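A sketch of the escalation check, assuming the wait and the SLA are both passed in as durations. The signature is hypothetical, and keeping a ticket at L3 once it reaches the top of the ladder is an assumption; the notes do not say what happens there.

```python
from datetime import timedelta

LADDER = ["L1", "L2", "L3"]


def sla_check(tier: str, waited: timedelta, sla: timedelta) -> str:
    """Return the tier that should own the ticket now."""
    if waited <= sla:
        return tier  # still within the promise, no change
    # Past the deadline: move one rung up, capped at the top tier (assumed).
    idx = LADDER.index(tier)
    return LADDER[min(idx + 1, len(LADDER) - 1)]


print(sla_check("L1", timedelta(hours=5), timedelta(hours=4)))      # L2
print(sla_check("L2", timedelta(minutes=10), timedelta(minutes=15)))  # L2
```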

Analogy. The desk is a hospital emergency room. Triage is the nurse at the door deciding who is critical and who can wait, and saying "I'm not sure, let me get a doctor" when a case is unclear (the confidence gate). The SLA is the target wait time on the wall. Escalation is calling the attending physician when a patient has waited too long. AIOps is realizing that twelve people who arrived together from one car crash are one incident, not twelve.

6. AIOps: page on causes, not symptoms

The machine inbox has the opposite problem: it is not too vague but too loud. One model-provider outage trips the error-rate alert, the latency alert, and the health-check alert, on every replica, every minute. correlate() groups alerts that share a cause signature (same service + symptom) and fall within a short time window into one incident. In the demo, 40 alerts become 5 incidents, cutting pages by about 88% (35 of 40), and the two model-api 5xx flares stay separate because they are far apart in time: the second is a fresh incident, not a continuation of the first.
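One way to sketch the grouping rule. The alert tuple layout, the 300-second window, and measuring the window from the most recent alert in an incident (rather than from its first) are all assumptions; the notes only require a shared signature and a short time window.

```python
def correlate(alerts, window_s=300):
    """Group (timestamp_s, service, symptom) alerts into incidents.

    Alerts join an existing incident when they share its (service, symptom)
    signature and arrive within window_s seconds of its latest alert.
    """
    incidents = []
    latest = {}  # signature -> most recent incident for that signature
    for ts, service, symptom in sorted(alerts):
        sig = (service, symptom)
        inc = latest.get(sig)
        if inc is not None and ts - inc["last"] <= window_s:
            inc["last"] = ts
            inc["count"] += 1
        else:
            # New signature, or same signature after a long quiet gap.
            inc = {"signature": sig, "first": ts, "last": ts, "count": 1}
            incidents.append(inc)
            latest[sig] = inc
    return incidents


alerts = [
    (0, "model-api", "5xx"),
    (30, "model-api", "5xx"),
    (60, "model-api", "latency"),
    (3600, "model-api", "5xx"),  # same signature, but an hour later
]
incidents = correlate(alerts)
print(len(incidents))  # 3
```

In this small example the two early 5xx alerts merge, the latency alert stands alone, and the 5xx alert an hour later opens a fresh incident because it falls outside the window.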

This is the entire premise of AIOps: use simple correlation (or, fancier, learned similarity) to turn noise into signal. Each correlated incident is exactly what M31 opens, triages, and runs a runbook against; the two modules are the two ends of one pipeline.

7. Putting it together

demo_mock.py: five tickets are triaged (the vague one drops to a human), routed under SLAs, and the one that waited too long escalates; then a 40-alert storm correlates down to a few incidents. The desk turned two floods into a short, prioritized, trustworthy list, which is the whole point of operations support: keep human attention on the things that actually need it.

Words you will hear

Triage, severity (sev1/2/3), confidence / confidence gate, support tier (L1/L2/L3), SLA (service-level agreement), escalation, handoff / human-in-the-loop, AIOps, alert correlation / deduplication, noise reduction, alert fatigue. Full definitions in the glossary.