M22: Agent reliability and ops (Part D: Agentic Systems)

Your agent works on your laptop. Then a real user hits it and the API rate-limits you, a call hangs, a server returns a 503, or the agent gets stuck in a loop and quietly spends $40. None of that is rare; it is Tuesday. Today you harden an agent so it survives the real world: it retries blips, gives up on calls that hang, falls back when things stay broken, caps its own steps so it cannot run away, and asks a human before doing anything risky. You will inject each failure on purpose and watch the agent handle it.

Today's win: a reliable agent that recovers from a flaky API, refuses to loop forever, fails safely during an outage, and blocks a risky action until a human approves it, all demonstrated offline.

Today you will

Add retries with backoff so a transient error (rate limit, 503, timeout) does not kill a request
Add a timeout so a hung call is abandoned instead of freezing the agent
Add a fallback / graceful degrade so an outage returns a safe message, not a crash
Add a step cap so a runaway loop cannot burn unlimited tokens (ties to the cost view in M20)
Add a human-approval gate so risky, world-changing actions need a yes first (human-in-the-loop, M14)
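
To make the first three goals concrete, here is a minimal sketch of retry with exponential backoff plus a graceful fallback. It is an illustration, not the contents of the course's `reliability.py`: the names `TransientError`, `call_with_retry`, `with_fallback`, and `make_flaky` are all hypothetical, and the sketch assumes the underlying client raises `TimeoutError` when its own timeout elapses. The `sleep` parameter defaults to a no-op so the demo runs instantly; pass `time.sleep` for real waits.

```python
class TransientError(Exception):
    """Hypothetical stand-in for a rate limit, a 503, or similar blip."""


def call_with_retry(fn, max_attempts=4, base_delay=0.5, sleep=lambda s: None):
    """Call fn(); on a transient error wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except (TransientError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what happens next
            sleep(base_delay * 2 ** attempt)  # waits of 0.5s, 1s, 2s, ...


def with_fallback(fn, fallback_message):
    """Retry first; if the service stays broken, degrade to a safe message."""
    try:
        return call_with_retry(fn)
    except (TransientError, TimeoutError):
        return fallback_message


def make_flaky(failures_before_success):
    """Build a fake service that fails a set number of times, then succeeds."""
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError("503 Service Unavailable")
        return "ok"

    return flaky, state


blip, _ = make_flaky(2)        # two failures, then recovery
outage, _ = make_flaky(100)    # effectively a full outage
print(with_fallback(blip, "Service is busy, please try again later."))
print(with_fallback(outage, "Service is busy, please try again later."))
```

The first call recovers after two retries and prints `ok`; the second exhausts its attempts and prints the fallback message instead of raising. Only errors known to be transient are retried, so a genuine bug still surfaces immediately.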

Run of show (about 60 minutes)

Time	What we do
0:00	Hook: failure is the normal case in production
0:05	The one idea: assume every call can fail, and plan for it (read `notes.md`)
0:12	Lab Part A: retry, timeout, and graceful degrade
0:32	Lab Part B: step caps and the human-approval gate
0:52	Show: post an agent that survives a fault you injected
1:00	Wrap

If you get stuck

Builds on M11 (deployment), M18 (the orchestrated agent), and M20 (cost and error visibility). Reuse your `.env` key only for the optional live run.
The core lab runs offline and instantly (backoff waits are stubbed out), costs nothing, and needs no API key. No new libraries. Nothing here can harm your computer; the "risky" tool only pretends to send an email.
Each pattern is a few lines in `reliability.py`. Read the one that matches the failure you are looking at.
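
To give a sense of how small these patterns can be, here is a rough sketch of a step cap combined with a human-approval gate. It is not the course's `reliability.py`; `RISKY_TOOLS`, `run_agent`, and the `approve` callback are hypothetical names, and the "plan" is just a sequence of `(tool, argument)` pairs standing in for whatever the agent decides to do next. Nothing is actually executed; the function only records what it would have done.

```python
import itertools

RISKY_TOOLS = {"send_email"}  # hypothetical: tools that change the outside world


def run_agent(plan, approve, max_steps=5):
    """Walk through planned actions, enforcing a step cap and an approval gate.

    plan     -- iterable of (tool_name, argument) pairs (may be endless)
    approve  -- callback(tool_name, argument) -> bool, the human's yes or no
    """
    log = []
    for step, (tool, arg) in enumerate(plan):
        if step >= max_steps:
            log.append("stopped: step cap reached")
            break  # a runaway loop ends here instead of burning tokens
        if tool in RISKY_TOOLS and not approve(tool, arg):
            log.append(f"blocked: {tool}")
            continue  # skip the risky action, keep the rest of the run alive
        log.append(f"ran: {tool}")
    return log


# A stuck agent that wants to search forever is cut off after three steps.
endless = itertools.repeat(("search", "same query"))
print(run_agent(endless, approve=lambda tool, arg: False, max_steps=3))

# A risky action is blocked when the human says no.
print(run_agent([("search", "q"), ("send_email", "hi")],
                approve=lambda tool, arg: False))
```

In an interactive run, `approve` could wrap `input()` and return `True` only on an explicit "yes"; passing a lambda keeps the demo offline and repeatable.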

Optional challenge

Open `starters/add_policy.py` and implement a circuit breaker: after several failures in a row it stops calling a dead service entirely for a cooldown, instead of retrying every request. It is the pattern that keeps a partial outage from turning into a full one.
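
If you want a starting point, one possible shape for a circuit breaker is sketched below. It is only an assumption about how you might structure it, not the expected solution or the contents of the starter file: `CircuitBreaker`, `CircuitOpenError`, and the parameter names are hypothetical. The clock is injectable so the cooldown can be tested without waiting.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the breaker is in cooldown."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures in a row before opening
        self.cooldown = cooldown                    # seconds to stay open
        self.clock = clock                          # injectable for offline tests
        self.failures = 0
        self.opened_at = None                       # None means the breaker is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                # Still cooling down: do not touch the dead service at all.
                raise CircuitOpenError("service skipped during cooldown")
            self.opened_at = None  # cooldown over: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the breaker
            raise
        self.failures = 0  # any success resets the count
        return result
```

Because the failure count is only reset on success, a failed trial call after the cooldown re-opens the breaker immediately, while a successful one returns it to normal operation.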