M22 notes: Agent reliability and ops (the one idea)
The one idea: in production, every call your agent makes can fail, and the agent itself can misbehave (loop, stall, take a costly action). Reliability is not one feature; it is a small set of patterns, one per failure mode, that you wrap around the agent so it recovers when it can and fails safely when it cannot. Demo code assumes success. Production code assumes failure and plans for it.
1. The failures that actually happen
When an agent talks to a model API and to tools, these failures come up constantly:
| Failure | What it looks like | The pattern that handles it |
|---|---|---|
| Transient blip | rate limit (429), 503, brief network error | retry with backoff |
| Hung call | a request that never returns | timeout |
| Real outage | the service is down for minutes | fallback / graceful degrade |
| Runaway loop | the agent keeps calling tools, never finishing | step cap |
| Risky action | the agent wants to send mail, delete data, spend money | human-approval gate |
Each is small on its own. Together they are the difference between a demo and something you let real users touch.
Analogy. Reliability patterns are the safety systems in a car. Retry is trying the ignition again. The timeout is the engine cut-off. Fallback is the spare tire. The step cap is the rev limiter that stops the engine destroying itself. The approval gate is the seatbelt warning that will not let you drive off until you buckle up. You hope to use none of them, and you never remove them.
2. Retry with backoff
Most API errors are transient: wait a moment, try again, and the call succeeds. The retry helper in
reliability.py calls the function and, on a retryable error, waits and tries again, doubling the wait
each time (0.5s, then 1s, then 2s). This exponential backoff matters: hammering a rate-limited service
instantly just gets you rate-limited harder; backing off gives it room to recover.
Two rules: only retry transient errors (a 503, not a "your prompt is invalid" 400, which will fail every time), and cap the attempts so you do not retry forever. In real code you also add a little randomness ("jitter") so many clients do not all retry on the same beat.
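A minimal sketch of the idea, not the actual code in reliability.py. The TransientError class, the parameter names, and the injectable sleep argument are assumptions made for illustration; the real helper may be shaped differently.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (429, 503, network blip)."""


def retry(fn, max_attempts=4, base_delay=0.5, jitter=0.1, sleep=time.sleep):
    """Call fn(); on TransientError wait, then try again with a doubled delay.

    Any other exception (for example a ValueError standing in for a 400)
    propagates immediately, because retrying it would fail the same way.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: hand the error back to the caller
            delay = base_delay * (2 ** attempt)          # 0.5s, 1s, 2s, ...
            delay += random.uniform(0, jitter * delay)   # jitter: small random spread
            sleep(delay)
```

Passing sleep as a parameter is only a convenience so the waits can be observed or skipped in tests; production callers would leave the default.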
3. Timeout
Retrying does not help when a call never returns: the agent just hangs, waiting. call_with_deadline runs the call and
gives up after N seconds, raising a transient error so retry can then try a fresh call. Note the honest
caveat in the code: Python cannot truly kill the background work, so you ALSO set a client-side request
timeout on the SDK; the deadline here models the caller deciding to stop waiting. Without timeouts, one
slow dependency freezes your whole agent.
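One way a deadline like this could be written, assuming a worker thread from the standard library's concurrent.futures. This is a sketch, not the course's call_with_deadline; TransientError and the seconds parameter are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class TransientError(Exception):
    """Hypothetical marker for errors that a retry wrapper may try again."""


def call_with_deadline(fn, seconds):
    """Run fn() in a worker thread and stop waiting after `seconds`."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=seconds)
    except FutureTimeout:
        # The caller gives up; report it as transient so a retry can follow.
        raise TransientError(f"no result after {seconds}s") from None
    finally:
        # Do not block on the worker. Python cannot kill the thread, so the
        # abandoned call keeps running in the background until it finishes.
        pool.shutdown(wait=False)


def slow():
    time.sleep(0.3)  # stands in for a request that is taking too long
    return "late"


try:
    call_with_deadline(slow, seconds=0.05)
except TransientError as exc:
    print("gave up:", exc)
```

Because the abandoned thread still runs to completion, this only bounds how long the caller waits, which is why the client-side request timeout is still needed.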
4. Fallback and graceful degradation
Sometimes retrying does not help because the service is actually down. fallback tries options in order
and returns the first that works: maybe a cheaper or simpler model, a cached answer, or finally a plain
"we are having trouble, try again later". The agent in agent.py does the last kind: if every retry of
the model call fails, it returns a calm message and sets degraded: True instead of throwing a stack
trace at the user. Failing safely is a feature. Degrade, do not crash.
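The pattern can be sketched as a loop over zero-argument callables. This is an illustration under assumptions, not the course's fallback: the three option functions and the shape of the result dictionary are made up for the example.

```python
def fallback(options):
    """Try each zero-argument callable in order; return the first result."""
    last_error = None
    for option in options:
        try:
            return option()
        except Exception as exc:  # this option failed: move on to the next
            last_error = exc
    raise RuntimeError("all fallback options failed") from last_error


def primary_model():
    raise ConnectionError("service down")  # simulated outage


def cached_answer():
    raise KeyError("nothing cached for this question")


def safe_message():
    # Last resort: a calm message, flagged so callers know it is degraded.
    return {"text": "We are having trouble, try again later.", "degraded": True}


result = fallback([primary_model, cached_answer, safe_message])
print(result["degraded"])
```

Putting an option that cannot fail at the end of the list is what guarantees the user sees a message rather than a stack trace.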
5. Step caps: stop the runaway
An agent decides its own steps, so a bad prompt or a confused model can loop: call a tool, call it
again, forever, each call costing tokens. StepLimiter counts steps and stops the run once it passes a
cap. This is a hard money-safety control, and it pairs with the cost and tool-usage visibility you built
in M20: observability shows you a loop happened; the step cap stops it from being expensive.
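A step cap needs very little code. The sketch below assumes tick() raises once the count passes the cap; the exception name and the default cap are hypothetical, and the real StepLimiter may signal the stop differently.

```python
class StepLimitExceeded(Exception):
    """Hypothetical error raised when a run passes its step cap."""


class StepLimiter:
    def __init__(self, max_steps=10):
        self.max_steps = max_steps
        self.steps = 0

    def tick(self):
        """Count one step; raise once the count passes the cap."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise StepLimitExceeded(
                f"stopped after {self.max_steps} steps"
            )


limiter = StepLimiter(max_steps=3)
completed = 0
try:
    while True:           # stands in for an agent that never finishes
        limiter.tick()
        completed += 1    # a model or tool call would happen here
except StepLimitExceeded as exc:
    print(exc, "| steps completed:", completed)
```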
6. Human-in-the-loop: the approval gate
Reading data is safe to automate. Actions that change the world (send an email, delete a record,
spend money, deploy) are not, especially when an agent might be wrong or manipulated. approval_gate
lets safe tools run automatically but requires a human "yes" before a risky tool runs. In agent.py,
multiply is safe and runs freely; send_email is risky, so by default it is blocked and recorded
until an approver says yes. The agent still proposes the action; a human decides. This is the same
human-in-the-loop principle from M14 and M18, enforced in code.
Decide which tools are risky by what they can do if the agent is wrong: anything irreversible or outward-facing should be gated.
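A gate of this kind might look like the following. It is a sketch under assumptions: the RISKY_TOOLS set, the approver callback signature, and the blocked_log list are illustrative names, not the interface of the course's approval_gate.

```python
RISKY_TOOLS = frozenset({"send_email"})  # assumed: tools that change the world


def approval_gate(tool_name, args, approver=None, risky=RISKY_TOOLS,
                  blocked_log=None):
    """Return True if the tool may run now.

    Safe tools pass automatically. A risky tool passes only when an approver
    callback returns True; otherwise the proposal is recorded and blocked.
    """
    if tool_name not in risky:
        return True
    if approver is not None and approver(tool_name, args):
        return True
    if blocked_log is not None:
        blocked_log.append((tool_name, args))  # keep a record for review
    return False


blocked = []
email = {"to": "team@example.com", "body": "hi"}

safe_ok = approval_gate("multiply", {"a": 2, "b": 3}, blocked_log=blocked)
default_ok = approval_gate("send_email", email, blocked_log=blocked)
approved_ok = approval_gate("send_email", email, approver=lambda name, a: True)
print(safe_ok, default_ok, approved_ok, len(blocked))
```

Blocking by default when no approver is supplied is the important design choice: forgetting to wire up the human should fail closed, not open.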
7. Putting it together
The reliable loop in agent.py, each turn:
1. limiter.tick() first, so a runaway loop is stopped before any cost,
2. the model call wrapped in timeout then retry, degrading to a safe message if it still fails,
3. for each tool the model wants, an approval_gate check (risky tools need a yes) before running it.
None of this changes what the agent does when everything works. It changes what happens when things go wrong, which in production is often. Add these once and reuse them around every agent you ship.
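The control flow of such a loop can be sketched as below. This is not the code in agent.py: the reply format ({"tool": ..., "args": ...} or {"final": ...}), the run_agent name, and the result dictionary are assumptions, and call_model is assumed to be already wrapped in the timeout and retry helpers so the sketch stays short.

```python
class StepLimitExceeded(Exception):
    """Hypothetical error raised when a run passes its step cap."""


class StepLimiter:
    def __init__(self, max_steps):
        self.max_steps, self.steps = max_steps, 0

    def tick(self):
        self.steps += 1
        if self.steps > self.max_steps:
            raise StepLimitExceeded(self.steps)


def run_agent(call_model, tools, risky, approver=None, max_steps=5):
    """Drive the agent until it answers, degrades, or hits the step cap."""
    limiter = StepLimiter(max_steps)
    history, blocked = [], []

    def result(text, degraded):
        return {"text": text, "degraded": degraded, "blocked": blocked}

    while True:
        try:
            limiter.tick()               # 1. step cap first, before any cost
        except StepLimitExceeded:
            return result("Stopped: step limit reached.", True)
        try:
            reply = call_model(history)  # 2. assumed wrapped in timeout + retry
        except Exception:
            return result("We are having trouble, try again later.", True)
        if "final" in reply:
            return result(reply["final"], False)
        name, args = reply["tool"], reply["args"]
        # 3. approval gate: risky tools need an explicit yes before running
        if name in risky and not (approver and approver(name, args)):
            blocked.append((name, args))
            history.append((name, "blocked: needs human approval"))
            continue
        history.append((name, tools[name](**args)))
```

A blocked tool is reported back into the history rather than ending the run, so the model can still finish with an answer that does not depend on the risky action.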
Words you will hear
Transient error, retry / exponential backoff / jitter, timeout, fallback / graceful degradation, step cap / runaway loop, human-in-the-loop / approval gate, circuit breaker (the challenge), idempotency (safe to retry). Full definitions in the glossary.