
M22: Agent reliability and ops (Part D: Agentic Systems)

Your agent works on your laptop. Then a real user hits it and the API rate-limits you, a call hangs, a server returns a 503, or the agent gets stuck in a loop and quietly spends 40 dollars. None of that is rare; it is Tuesday. Today you harden an agent so it survives the real world: it retries blips, gives up on calls that hang, falls back when things stay broken, caps its own steps so it cannot run away, and asks a human before doing anything risky. You will inject each failure on purpose and watch the agent handle it.

Today's win: a reliable agent that recovers from a flaky API, refuses to loop forever, fails safely during an outage, and blocks a risky action until a human approves it, all demonstrated offline.

Today you will

  • Add retries with backoff so a transient error (rate limit, 503, timeout) does not kill a request
  • Add a timeout so the agent abandons a hung call instead of freezing
  • Add a fallback / graceful degrade so an outage returns a safe message, not a crash
  • Add a step cap so a runaway loop cannot burn unlimited tokens (ties to the cost view in M20)
  • Add a human-approval gate so risky, world-changing actions need a yes first (human-in-the-loop, M14)
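The first three patterns in the list above (retry with backoff, timeout, graceful degrade) can be sketched in plain standard-library Python. This is a minimal illustration, not the contents of reliability.py: the names TransientError, call_with_retry, call_with_timeout, call_with_fallback, and make_flaky are all hypothetical, and the defaults (4 attempts, 0.5 s base delay) are assumptions picked for the demo.

```python
import concurrent.futures
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a rate limit, a 503, or a timeout."""


def call_with_timeout(fn, timeout_s):
    """Give up on fn() after timeout_s seconds and report it as transient."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise TransientError(f"call exceeded {timeout_s}s") from None
    finally:
        # Do not wait for the worker: the hung call keeps running in the
        # background, but the agent is free to move on.
        executor.shutdown(wait=False)


def call_with_retry(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry fn() on TransientError, doubling the wait each time (plus jitter)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))


def call_with_fallback(fn, fallback_message, **retry_kwargs):
    """Graceful degrade: return a safe message once retries are exhausted."""
    try:
        return call_with_retry(fn, **retry_kwargs)
    except TransientError:
        return fallback_message


def make_flaky(fail_times, result="ok"):
    """Injected fault: fail the first fail_times calls, then succeed."""
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise TransientError("503 Service Unavailable (injected)")
        return result

    return flaky


def no_wait(seconds):
    """Stubbed backoff wait so the demo runs instantly."""


if __name__ == "__main__":
    # A blip: two failures, then success on the third attempt.
    print(call_with_retry(make_flaky(2), sleep=no_wait))
    # An outage: every attempt fails, so the safe message comes back.
    print(call_with_fallback(make_flaky(99), "Service is busy, try again later.", sleep=no_wait))
```

Passing sleep as a parameter is what lets the waits be stubbed out for an offline run. For a real API client, the library's own timeout setting is usually a better choice than a thread wrapper, because the wrapper cannot cancel the underlying call.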

Run of show (about 60 minutes)

Time   What we do
0:00   Hook: failure is the normal case in production
0:05   The one idea: assume every call can fail, and plan for it (read notes.md)
0:12   Lab Part A: retry, timeout, and graceful degrade
0:32   Lab Part B: step caps and the human-approval gate
0:52   Show: post an agent that survives a fault you injected
1:00   Wrap

If you get stuck

  • Builds on M11 (deployment), M18 (the orchestrated agent), and M20 (cost and error visibility). Reuse your .env key only for the optional live run.
  • The core lab runs offline and instantly, at no cost and with no API key (backoff waits are stubbed out). No new libraries. Nothing here can harm your computer; the "risky" tool only pretends to send an email.
  • Each pattern is a few lines in reliability.py. Read the one that matches the failure you are looking at.
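To give a sense of how small "a few lines" can be, here is a hedged sketch of the two Part B patterns, the step cap and the human-approval gate. It is not the code in reliability.py: run_agent, RISKY_TOOLS, StepLimitExceeded, and the pretend send_email tool are hypothetical names, and the loop assumes a policy function that returns either ("finish", answer) or (tool_name, args).

```python
RISKY_TOOLS = {"send_email"}  # assumption: tools that change the outside world


class StepLimitExceeded(Exception):
    """Raised when the agent uses up its step budget without finishing."""


def deny_all(name, args):
    """Default approver: nothing risky runs until a human says yes."""
    return False


def run_agent(next_action, tools, max_steps=8, approve=deny_all):
    """Run an act loop with a hard step cap and an approval gate."""
    history = []
    for _ in range(max_steps):
        name, args = next_action(history)
        if name == "finish":
            return args
        if name in RISKY_TOOLS and not approve(name, args):
            # Blocked actions are recorded so the agent can see the refusal.
            history.append((name, "BLOCKED: human approval required"))
            continue
        history.append((name, tools[name](**args)))
    raise StepLimitExceeded(f"stopped after {max_steps} steps")


sent = []


def send_email(to, body):
    """Pretend tool: records the recipient instead of sending anything."""
    sent.append(to)
    return f"sent to {to}"


def looping_policy(history):
    """Injected fault: asks for the same action forever and never finishes."""
    return ("send_email", {"to": "a@example.com", "body": "hi"})


if __name__ == "__main__":
    try:
        run_agent(looping_policy, {"send_email": send_email}, max_steps=3)
    except StepLimitExceeded as err:
        print(err, "| emails sent:", len(sent))
```

Because the cap is a plain for loop over max_steps, a runaway agent stops after a bounded number of model calls, which is what bounds the cost. The approver defaults to "no", so forgetting to wire in a human fails safe.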

Optional challenge

Open starters/add_policy.py and implement a circuit breaker: after several consecutive failures, it stops calling a dead service entirely for a cooldown period instead of retrying on every request. It is the pattern that keeps a partial outage from turning into a full one.
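If you want something to compare your attempt against, the sketch below shows one possible shape for a circuit breaker. It assumes nothing about the interface expected by starters/add_policy.py: the CircuitBreaker class, CircuitOpenError, and the threshold and cooldown defaults are all hypothetical. The clock is injectable so the cooldown can be tested without waiting.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means closed: calls are allowed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                # Still cooling down: fail fast without touching the service.
                raise CircuitOpenError("service marked down; skipping call")
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


if __name__ == "__main__":
    fake_now = [0.0]
    breaker = CircuitBreaker(failure_threshold=2, cooldown_s=10.0, clock=lambda: fake_now[0])

    def dead_service():
        raise RuntimeError("503 (injected)")

    for _ in range(3):
        try:
            breaker.call(dead_service)
        except (RuntimeError, CircuitOpenError) as err:
            print(type(err).__name__)

    fake_now[0] = 11.0  # pretend the cooldown has passed
    print(breaker.call(lambda: "recovered"))
```

The third call in the demo never reaches the service, which is the whole point: while the circuit is open, the failing dependency gets no traffic from this agent.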