M22: Agent reliability and ops (Part D: Agentic Systems)
Your agent works on your laptop. Then a real user hits it and the API rate-limits you, a call hangs, a server returns a 503, or the agent gets stuck in a loop and quietly spends 40 dollars. None of that is rare; it is Tuesday. Today you harden an agent so it survives the real world: it retries blips, gives up on calls that hang, falls back when things stay broken, caps its own steps so it cannot run away, and asks a human before doing anything risky. You will inject each failure on purpose and watch the agent handle it.
Today's win: a reliable agent that recovers from a flaky API, refuses to loop forever, fails safely during an outage, and blocks a risky action until a human approves it, all demonstrated offline.
Today you will
- Add retries with backoff so a transient error (rate limit, 503, timeout) does not kill a request
- Add a timeout so a hung call is given up on instead of freezing the agent
- Add a fallback / graceful degrade so an outage returns a safe message, not a crash
- Add a step cap so a runaway loop cannot burn unlimited tokens (ties to the cost view in M20)
- Add a human-approval gate so risky, world-changing actions need a yes first (human-in-the-loop, M14)
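The five patterns above can each be sketched in a few lines. This is a minimal illustration, not the contents of reliability.py: every name here (`TransientError`, `with_retry`, `with_timeout`, `with_fallback`, `run_agent`, `call_tool`, `RISKY_TOOLS`) is hypothetical, and the defaults (3 attempts, 0.5 s base delay, 5 steps) are placeholder assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class TransientError(Exception):
    """Hypothetical stand-in for a rate limit, a 503, or a timeout."""


def with_retry(call, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry `call` on TransientError, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return call()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay * 2 ** attempt)  # pass a no-op sleep to run offline


def with_timeout(call, seconds):
    """Stop waiting on `call` after `seconds` and raise TransientError.

    The worker thread is not killed; we only stop waiting for it.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call)
    try:
        return future.result(timeout=seconds)
    except FutureTimeout:
        raise TransientError("call timed out") from None
    finally:
        pool.shutdown(wait=False)


def with_fallback(call, safe_message="The service is unavailable. Try again later."):
    """Graceful degrade: return a safe message instead of crashing."""
    try:
        return call()
    except TransientError:
        return safe_message


def run_agent(next_action, max_steps=5):
    """Loop until the agent reports 'done' or the step cap is reached."""
    for step in range(1, max_steps + 1):
        if next_action(step) == "done":
            return f"finished in {step} steps"
    return f"stopped at step cap ({max_steps})"


RISKY_TOOLS = {"send_email"}  # assumed list of world-changing tools


def call_tool(name, run, approve):
    """Run a tool, but require approve(name) to be True for risky ones."""
    if name in RISKY_TOOLS and not approve(name):
        return f"blocked: {name} needs human approval"
    return run()
```

The wrappers compose from the inside out: `with_fallback(lambda: with_retry(lambda: with_timeout(api_call, 10)))` would give up on a hung call, retry it, and only then degrade. Passing `sleep` as a parameter is one way to stub out the backoff waits so a demo runs instantly.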
Run of show (about 60 minutes)
| Time | What we do |
|---|---|
| 0:00 | Hook: failure is the normal case in production |
| 0:05 | The one idea: assume every call can fail, and plan for it (read notes.md) |
| 0:12 | Lab Part A: retry, timeout, and graceful degrade |
| 0:32 | Lab Part B: step caps and the human-approval gate |
| 0:52 | Show: post an agent that survives a fault you injected |
| 1:00 | Wrap |
If you get stuck
- Builds on M11 (deployment), M18 (the orchestrated agent), and M20 (cost and error visibility). Reuse your .env key only for the optional live run.
- The core lab runs offline, free, no key, and instantly (backoff waits are stubbed out). No new libraries. Nothing here can harm your computer; the "risky" tool only pretends to send an email.
- Each pattern is a few lines in reliability.py. Read the one that matches the failure you are looking at.
Optional challenge
Open starters/add_policy.py and implement a circuit breaker: after several failures in a row, it stops calling a dead service entirely for a cooldown period instead of retrying on every request. It is the pattern that keeps a partial outage from turning into a full one.
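If you want a starting point, one possible shape for a circuit breaker is sketched below. It is an illustration under assumptions, not the expected solution for starters/add_policy.py: the class name `CircuitBreaker`, the `CircuitOpenError` exception, and the defaults (3 failures, 30-second cooldown) are all hypothetical, and the single trial call after the cooldown is one common design choice among several.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the service at all."""


class CircuitBreaker:
    def __init__(self, max_failures=3, cooldown=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock        # injectable so tests need no real waiting
        self.failures = 0         # consecutive failures seen so far
        self.opened_at = None     # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                # Still cooling down: fail fast without touching the service.
                raise CircuitOpenError("service is cooling down")
            # Cooldown over: allow one trial call. A single failure
            # from here re-opens the circuit immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success resets the count
        return result
```

The difference from a plain retry is that while the circuit is open, `fn` is never invoked, so a dead service gets no traffic until the cooldown passes.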