Lab M35: five go-deeper operations tools

You'll need: Python and your venv. No API key, no cost, instant and deterministic. Time: about 45 minutes (≈10 min per lab; do them in any order). Work in your breakout pair.

Optional / go-deeper, best after M31–M34. Each lab is a small, self-contained script that extends a module you already built. The inputs are synthetic, but the operations (querying logs, computing signals, sampling, rate-limiting, curating) are the real ones.

flowchart LR
  OBS["OBSERVE"] --> L1["1 · structured logs (M20)"]
  OBS --> L2["2 · dashboard & golden signals (M20/M31)"]
  RESP["RESPOND"] --> L3["3 · online eval / drift (M26/M30)"]
  DEP["DEPLOY"] --> L4["4 · rate limits & quotas (M25/M29)"]
  IMP["IMPROVE"] --> L5["5 · the flywheel (M30/M31)"]

Step 0: Set up

Copy the solution/ files into a folder and activate your venv. Nothing to install.

python -c "import structured_logging, dashboard, online_eval, rate_limit, improvement; print('go-deeper ok')"

You should now see: go-deeper ok. Run everything at once with python demo.py, or one at a time below.

Lab 1 — Structured logging & correlation (extends M20)

python structured_logging.py

You should now see sections A–C. Section C shows the whole story of the failing request, req-B:

==== C. CORRELATE: the whole story of the failing request (req-B) ====
   {... 'event': 'request' ...}
   {... 'event': 'tool', 'tool': 'retrieve', 'latency_ms': 5000, 'error': True}
   {... 'event': 'response', 'status': 503}

Takeaway: because every line shares request_id, you reconstruct one request's path in a single correlate() call, and query(event='tool', error=True) finds failures across all requests.
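If you want to see the idea without opening the script, here is a minimal sketch of correlating and querying structured logs. It is not the code in structured_logging.py: the log lines, the `ts` field, and the signatures of `correlate` and `query` are assumptions made for illustration.

```python
import json

# Hypothetical log stream: one JSON object per line, two requests interleaved.
LOG_LINES = [
    '{"ts": 1, "request_id": "req-A", "event": "request"}',
    '{"ts": 2, "request_id": "req-B", "event": "request"}',
    '{"ts": 3, "request_id": "req-A", "event": "tool", "tool": "retrieve", "latency_ms": 40, "error": false}',
    '{"ts": 4, "request_id": "req-B", "event": "tool", "tool": "retrieve", "latency_ms": 5000, "error": true}',
    '{"ts": 5, "request_id": "req-A", "event": "response", "status": 200}',
    '{"ts": 6, "request_id": "req-B", "event": "response", "status": 503}',
]


def parse(lines):
    """Turn raw JSON lines into dicts."""
    return [json.loads(line) for line in lines]


def correlate(records, request_id):
    """Return every record for one request, in time order."""
    matching = [r for r in records if r.get("request_id") == request_id]
    return sorted(matching, key=lambda r: r["ts"])


def query(records, **filters):
    """Return records whose fields equal every keyword filter."""
    return [r for r in records
            if all(r.get(key) == value for key, value in filters.items())]


records = parse(LOG_LINES)
story = correlate(records, "req-B")                   # one request's path
failures = query(records, event="tool", error=True)   # failures across all requests

for record in story:
    print(record)
```

Note that neither function looks at message text: both rely only on fields that every line carries, which is the point of logging structured records instead of free-form strings.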

Lab 2 — Dashboard & the four golden signals (extends M20 + M31)

python dashboard.py

You should now see a dashboard where four rows are flagged BREACH:

  latency p95   5000ms    BREACH
  error rate    10.0%     BREACH
  saturation    100%      BREACH
  SLO burn      10.0x     BREACH

Takeaway: the wall shows the few signals that map to user pain. Open dashboard.py and read THRESHOLDS: they decide what counts as "too far." (You will tame this with a composite alert in the challenge.)
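To make the four rows concrete, here is a rough sketch of how such signals can be computed and compared against thresholds. Everything in it is an assumption for illustration: the threshold values, the 1% error budget, the nearest-rank percentile, and the synthetic window are not taken from dashboard.py.

```python
# Hypothetical thresholds; the real values live in dashboard.py's THRESHOLDS.
THRESHOLDS = {
    "latency_p95_ms": 1000,
    "error_rate": 0.01,
    "saturation": 0.9,
    "slo_burn": 1.0,
}
ERROR_BUDGET = 0.01  # assumed: a 99% success SLO leaves a 1% error budget


def p95(values):
    """Nearest-rank 95th percentile, using integer math for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def golden_signals(latencies_ms, errors, in_flight, capacity):
    """Compute the four dashboard rows from one window of requests."""
    error_rate = sum(errors) / len(errors)
    return {
        "latency_p95_ms": p95(latencies_ms),
        "error_rate": error_rate,
        "saturation": in_flight / capacity,
        # Burn rate: how many times faster than allowed the budget is spent.
        "slo_burn": error_rate / ERROR_BUDGET,
    }


# Synthetic window: 20 requests, 2 of them slow and failing, replica full.
latencies = [100] * 18 + [5000] * 2
errors = [False] * 18 + [True] * 2
signals = golden_signals(latencies, errors, in_flight=2, capacity=2)
flags = {name: value > THRESHOLDS[name] for name, value in signals.items()}

for name, value in signals.items():
    print(f"{name:16} {value:>8.2f}  {'BREACH' if flags[name] else 'ok'}")
```

With a 1% budget, a 10% error rate burns the budget ten times faster than the SLO allows, which is one way to arrive at a burn figure like the one in the lab output.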

Lab 3 — Online evaluation (extends M26 + M30)

python online_eval.py

You should now see the healthy window score 1.0, then the full stream score 0.6 with drift detected:

==== B. ONLINE EVAL over the full stream (quality drifted in the 2nd half) ====
   {'sampled': 10, 'of': 50, 'avg_score': 0.6, 'drift_detected': True}

Takeaway: the offline M26 gate never saw these drifting inputs; sampling and scoring live traffic is how you notice a regression the fixed test set could not.
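The sample-then-score loop can be sketched in a few lines. This is an illustration under stated assumptions, not the logic of online_eval.py: the proxy score, the every-5th sampling rule, the baseline of 1.0, the 0.2 tolerance, and the synthetic stream are all hypothetical.

```python
def proxy_score(response):
    """Hypothetical proxy: 1.0 if the answer is non-empty and cites a source."""
    return 1.0 if response["answer"] and response["cited"] else 0.0


def online_eval(stream, sample_every=5, baseline=1.0, tolerance=0.2):
    """Score a deterministic sample of live traffic and flag drift.

    Drift is flagged when the sampled average falls more than
    `tolerance` below the `baseline` measured on a healthy window.
    """
    sampled = stream[::sample_every]  # every Nth request, no randomness
    avg = sum(proxy_score(r) for r in sampled) / len(sampled)
    return {
        "sampled": len(sampled),
        "of": len(stream),
        "avg_score": round(avg, 2),
        "drift_detected": avg < baseline - tolerance,
    }


# Synthetic stream of 50 responses: citations disappear from request 30 on.
stream = [{"answer": f"answer {i}", "cited": i < 30} for i in range(50)]

healthy = online_eval(stream[:25])
full = online_eval(stream)
print(healthy)
print(full)
```

Sampling keeps the scoring cost proportional to the sample rather than to total traffic; the trade-off is that a small sample reacts slowly to a regression that affects only a few requests.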

Lab 4 — Capacity, rate limits & quotas (extends M25 + M29)

python rate_limit.py

You should now see the token bucket allow 5 then reject, the quota stop the 4th request, and concurrency cap at 2:

  8 requests at t=0: ['allow','allow','allow','allow','allow','429','429','429']
  acme: ['allow', 'allow', 'allow', 'over-quota']
  acquire x3: [True, True, False]

Takeaway: three levers (rate limit, quota, concurrency) protect a shared service. A 429 is normal; your own code handles the provider's 429 with backoff (M22), not a retry storm.
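Here is one way the three levers could look in code. Treat it as a sketch, not a copy of rate_limit.py: the class names, method names, and the limits (bucket of 5, quota of 3, 2 concurrent slots) are assumptions chosen to produce the same shape of output as above.

```python
class TokenBucket:
    """Rate limit: a burst of up to `capacity`, refilled steadily over time."""

    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now):
        # Refill for the time elapsed, never beyond capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return "allow"
        return "429"


class Quota:
    """Budget: cap each tenant's total usage, regardless of request rate."""

    def __init__(self, limit):
        self.limit = limit
        self.used = {}

    def charge(self, tenant):
        if self.used.get(tenant, 0) >= self.limit:
            return "over-quota"
        self.used[tenant] = self.used.get(tenant, 0) + 1
        return "allow"


class ConcurrencyLimit:
    """Scaling signal: cap the number of requests in flight on one replica."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def acquire(self):
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight = max(0, self.in_flight - 1)


bucket = TokenBucket(capacity=5, refill_per_sec=1.0)
burst = [bucket.allow(now=0.0) for _ in range(8)]
later = bucket.allow(now=1.0)  # one token has refilled after a second

quota = Quota(limit=3)
acme = [quota.charge("acme") for _ in range(4)]

slots = ConcurrencyLimit(max_in_flight=2)
acquired = [slots.acquire() for _ in range(3)]

print("8 requests at t=0:", burst)
print("acme:", acme)
print("acquire x3:", acquired)
```

The clock is passed in as `now` rather than read from the system, which keeps the sketch deterministic in the same spirit as the labs.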

Lab 5 — Continuous improvement, the flywheel (extends M30 + M31)

python improvement.py

You should now see new incidents fall while repeats get prevented as guards accumulate:

  week  incidents  new  repeats_prevented  total_guards
    1        3       3          0               3
    4        1       0          1               4

Takeaway: every incident becomes a regression guard (M26), so the same failure cannot recur unnoticed. The slope of "new incidents" is the real scoreboard.
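The bookkeeping behind a table like this can be sketched as a set of guarded failure signatures. The signature names and the weekly data below are invented for illustration (including weeks 2 and 3, which the excerpt above does not show), and `run_flywheel` is a hypothetical name, not a function from improvement.py.

```python
def run_flywheel(weeks):
    """Count new incidents vs. prevented repeats as guards accumulate.

    `weeks` is a list of lists of failure signatures seen each week.
    A signature with a guard counts as a prevented repeat; an unguarded
    one counts as a new incident and gets a guard immediately.
    """
    guards = set()
    rows = []
    for week, signatures in enumerate(weeks, start=1):
        new = 0
        repeats_prevented = 0
        for signature in signatures:
            if signature in guards:
                repeats_prevented += 1
            else:
                new += 1
                guards.add(signature)  # the incident becomes a regression guard
        rows.append({
            "week": week,
            "incidents": len(signatures),
            "new": new,
            "repeats_prevented": repeats_prevented,
            "total_guards": len(guards),
        })
    return rows


# Hypothetical four weeks of failure signatures.
WEEKS = [
    ["timeout", "bad-json", "empty-retrieval"],
    ["timeout", "prompt-injection"],
    ["bad-json"],
    ["prompt-injection"],
]

rows = run_flywheel(WEEKS)
print("week  incidents  new  repeats_prevented  total_guards")
for row in rows:
    print(f"{row['week']:>4}  {row['incidents']:>9}  {row['new']:>3}  "
          f"{row['repeats_prevented']:>17}  {row['total_guards']:>12}")
```

The guard set only grows, so the "new" column can only fall as fast as genuinely novel failures stop arriving; that is why its slope, not the raw incident count, is the number to watch.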

Step 6: Show it

Post one result in the chat: the correlated failing request (Lab 1), a dashboard breach (Lab 2), the drift online eval caught (Lab 3), or the falling incident count (Lab 5).

If you get stuck

ModuleNotFoundError -> run from inside the folder with the .py files (or run python demo.py).
My numbers differ -> every lab is deterministic; if you edited the synthetic inputs, the outputs change with them. Re-read the _simulate / _window / _stream / _weeks helper at the bottom of each script.

Check yourself

Why does correlation need a shared id, not just timestamps?

Many requests run at once, so timestamps interleave. A shared `request_id` (or trace id) on every log line and span is what lets you pull exactly one request's lines out of the mixed stream, which is the first thing you do in an incident.

What does online evaluation catch that the M26 gate cannot?

The M26 gate runs on a fixed golden set before shipping, so it only catches failures it already has a test for. Online eval samples real live traffic and scores it with proxies, so it catches drift and new input types the fixed set never anticipated.

Why three different capacity levers instead of one?

They solve different problems: a rate limit stops one key from starving others (fairness), a quota caps a tenant's total cost (budget), and a concurrency limit tells you when a replica is full (scaling, M29). A single knob cannot do all three.

What makes the reliability "flywheel" turn?

Turning every incident into a regression guard (M26). Once a failure signature is guarded, re-seeing it is a prevented repeat rather than a new incident, so the new-incident rate falls over time instead of staying flat.