M22 notes: Agent reliability and ops (the one idea)

The one idea: in production, every call your agent makes can fail, and the agent itself can misbehave (loop, stall, take a costly action). Reliability is not one feature; it is a small set of patterns, one per failure mode, that you wrap around the agent so it recovers when it can and fails safely when it cannot. Demo code assumes success. Production code assumes failure and plans for it.

1. The failures that actually happen

When an agent talks to a model API and to tools, these go wrong constantly:

Failure	What it looks like	The pattern that handles it
Transient blip	rate limit (429), 503, brief network error	retry with backoff
Hung call	a request that never returns	timeout
Real outage	the service is down for minutes	fallback / graceful degrade
Runaway loop	the agent keeps calling tools, never finishing	step cap
Risky action	the agent wants to send mail, delete data, spend money	human-approval gate

Each is small on its own. Together they are the difference between a demo and something you let real users touch.

Analogy. Reliability patterns are the safety systems in a car. Retry is trying the ignition again. The timeout is the engine cut-off. Fallback is the spare tire. The step cap is the rev limiter that stops the engine destroying itself. The approval gate is the seatbelt warning that will not let you drive off until you buckle up. You hope to use none of them, and you never remove them.

2. Retry with backoff

Most API errors are transient: wait a moment, try again, and it works. The retry function in reliability.py calls the function you give it and, on a retryable error, waits and tries again, doubling the wait each time (0.5s, then 1s, then 2s). This exponential backoff matters: retrying a rate-limited service immediately just gets you rate-limited harder, while backing off gives it room to recover.

Two rules: only retry transient errors (a 503, not a "your prompt is invalid" 400, which will fail every time), and cap the attempts so you do not retry forever. In real code you also add a little randomness ("jitter") so many clients do not all retry on the same beat.
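A minimal sketch of the two rules plus jitter. It is not the code in reliability.py: the TransientError class, the parameter names, and the injectable sleep argument are assumptions made for illustration.

```python
import random
import time


class TransientError(Exception):
    """Assumed marker for errors worth retrying (429, 503, brief network error)."""


def retry(fn, attempts=4, base_delay=0.5, jitter=0.1, sleep=time.sleep):
    """Call fn(); on TransientError wait base_delay * 2**n (plus jitter) and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # attempts are capped: give up and let the caller degrade
            delay = base_delay * (2 ** attempt)          # 0.5s, 1s, 2s, ...
            delay += random.uniform(0, jitter * delay)   # spread clients apart
            sleep(delay)
```

Anything that is not a TransientError (for example a ValueError standing in for an invalid-prompt 400) is not caught, so it propagates on the first attempt without waiting. Passing sleep as a parameter is only a convenience so the backoff schedule can be checked without real delays.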

3. Timeout

A retry does not help a call that never returns; it just hangs. call_with_deadline runs the call and gives up after N seconds, raising a transient error so retry can then try a fresh call. Note the honest caveat in the code: Python cannot truly kill the background work, so you ALSO set a client-side request timeout on the SDK; the deadline here models the caller deciding to stop waiting. Without timeouts, one slow dependency freezes your whole agent.
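One way the deadline idea could look, using a worker thread from the standard library. This is a sketch under assumptions, not the course's call_with_deadline: the signature and the TransientError class are hypothetical.

```python
import concurrent.futures
import time


class TransientError(Exception):
    """Assumed marker for errors that a retry wrapper may try again."""


def call_with_deadline(fn, seconds):
    """Run fn() in a worker thread and stop waiting after `seconds`."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=seconds)
    except concurrent.futures.TimeoutError:
        # The worker thread is NOT killed; we only stop waiting for it.
        raise TransientError(f"no response within {seconds}s") from None
    finally:
        pool.shutdown(wait=False)  # do not block on the abandoned work
```

Because the abandoned thread keeps running until fn returns, this only models the caller giving up. The SDK's own request timeout is still what actually ends the underlying network call.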

4. Fallback and graceful degradation

Sometimes retrying does not help because the thing is actually down. fallback tries options in order and returns the first that works: maybe a cheaper or simpler model, a cached answer, or finally a plain "we are having trouble, try again later". The agent in agent.py does the last kind: if every retry of the model call fails, it returns a calm message and sets degraded: True instead of throwing a stack trace at the user. Failing safely is a feature. Degrade, do not crash.
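The "try options in order" idea can be sketched as follows. The signature of fallback and the three option functions are assumptions for illustration. Only the degraded flag comes from the notes, and the shape of the reply dict is a guess.

```python
def fallback(options):
    """Try each zero-argument callable in order; return the first result that works."""
    last_error = None
    for option in options:
        try:
            return option()
        except Exception as exc:  # any failure means: move on to the next option
            last_error = exc
    raise RuntimeError("all fallback options failed") from last_error


def primary_model():
    raise ConnectionError("service is down")  # simulated outage


def cached_answer():
    raise KeyError("no cached answer for this question")


def apology():
    # Last resort: a calm message, flagged so callers know it is degraded.
    return {"text": "We are having trouble, try again later.", "degraded": True}


reply = fallback([primary_model, cached_answer, apology])
```

Putting an option that cannot fail at the end of the list is what turns a crash into a degraded answer.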

5. Step caps: stop the runaway

An agent decides its own steps, so a bad prompt or a confused model can loop: call a tool, call it again, forever, each call costing tokens. StepLimiter counts steps and stops the run once it passes a cap. This is a hard money-safety control, and it pairs with the cost and tool-usage visibility you built in M20: observability shows you a loop happened; the step cap stops it from being expensive.
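A step cap needs very little code. The sketch below assumes a tick() method that raises once the cap is passed. The exception name and the default cap are hypothetical, and the real StepLimiter may signal the stop differently.

```python
class StepLimitExceeded(Exception):
    """Raised when a run passes its step cap."""


class StepLimiter:
    def __init__(self, max_steps=8):
        self.max_steps = max_steps
        self.steps = 0

    def tick(self):
        """Count one step; raise once the run has passed the cap."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise StepLimitExceeded(f"run passed {self.max_steps} steps")


limiter = StepLimiter(max_steps=3)
completed = 0
try:
    while True:          # stands in for a confused agent that never finishes
        limiter.tick()
        completed += 1   # the (costly) model or tool call would happen here
except StepLimitExceeded:
    pass                 # the run is stopped instead of looping forever
```

With a cap of 3, the loop body runs three times and the fourth tick stops the run before any further cost.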

6. Human-in-the-loop: the approval gate

Reading data is safe to automate. Actions that change the world (send an email, delete a record, spend money, deploy) are not, especially when an agent might be wrong or manipulated. approval_gate lets safe tools run automatically but requires a human "yes" before a risky tool runs. In agent.py, multiply is safe and runs freely; send_email is risky, so by default it is blocked and recorded until an approver says yes. The agent still proposes the action; a human decides. This is the same human-in-the-loop principle from M14 and M18, enforced in code.

Decide which tools are risky by what they can do if the agent is wrong: anything irreversible or outward-facing should be gated.
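A sketch of the gate, assuming the risky tools are kept in a set and the approver is a callable that returns True for "yes". The RISKY_TOOLS name, the signature, and the blocked_log list are illustrative assumptions, not the code in agent.py.

```python
RISKY_TOOLS = {"send_email"}  # assumed: irreversible or outward-facing tools


def approval_gate(tool_name, args, approver=None, blocked_log=None):
    """Return True if the tool may run now, False if it is blocked."""
    if tool_name not in RISKY_TOOLS:
        return True  # safe tools (e.g. multiply) run automatically
    if approver is not None and approver(tool_name, args):
        return True  # a human said yes
    if blocked_log is not None:
        blocked_log.append((tool_name, args))  # record the proposed action
    return False  # default for risky tools: blocked


blocked = []
email_args = {"to": "user@example.com", "body": "hi"}
safe_ok = approval_gate("multiply", {"a": 6, "b": 7}, blocked_log=blocked)
risky_default = approval_gate("send_email", email_args, blocked_log=blocked)
risky_approved = approval_gate("send_email", email_args, approver=lambda name, args: True)
```

The default matters: with no approver supplied, a risky tool is blocked and logged, so forgetting to wire up the approver fails closed.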

7. Putting it together

On each turn, the reliable loop in agent.py does three things in order: (1) limiter.tick() first, so a runaway loop is stopped before it incurs any cost; (2) the model call, wrapped in a timeout and then in retry so that every attempt gets its own deadline, degrading to a safe message if it still fails; (3) for each tool the model wants to run, an approval_gate check (risky tools need a yes) before running it.
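The three steps can be sketched end to end with compact stand-ins for the helpers. Everything here is an assumption for illustration: the model callable and its reply shape, the return dict, the short retry delays, and the choice to report a step-cap stop as degraded. The real agent.py may differ.

```python
import concurrent.futures
import time


class TransientError(Exception):
    """Assumed marker for retryable failures."""


class StepLimitExceeded(Exception):
    """Raised when the run passes its step cap."""


class StepLimiter:
    def __init__(self, max_steps):
        self.max_steps, self.steps = max_steps, 0

    def tick(self):
        self.steps += 1
        if self.steps > self.max_steps:
            raise StepLimitExceeded(f"run passed {self.max_steps} steps")


def call_with_deadline(fn, seconds):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn).result(timeout=seconds)
    except concurrent.futures.TimeoutError:
        raise TransientError("deadline passed") from None
    finally:
        pool.shutdown(wait=False)


def retry(fn, attempts=3, base_delay=0.01):
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)


TOOLS = {"multiply": lambda a, b: a * b,
         "send_email": lambda to, body: f"sent to {to}"}
RISKY = {"send_email"}


def approval_gate(name, args, approver=None):
    return name not in RISKY or (approver is not None and approver(name, args))


def run_agent(model, task, max_steps=5, approver=None):
    """model(history) returns {"tool_calls": [...]} or {"answer": ...} (assumed shape)."""
    limiter, history, blocked = StepLimiter(max_steps), [task], []
    while True:
        try:
            limiter.tick()  # 1. step cap first, before any cost
        except StepLimitExceeded:
            return {"answer": "Stopped: step limit reached.",
                    "degraded": True, "blocked": blocked}
        try:
            # 2. timeout inside retry: every attempt gets a fresh deadline
            reply = retry(lambda: call_with_deadline(lambda: model(history), 2.0))
        except TransientError:
            return {"answer": "We are having trouble, try again later.",
                    "degraded": True, "blocked": blocked}
        if "answer" in reply:
            return {"answer": reply["answer"], "degraded": False, "blocked": blocked}
        for call in reply["tool_calls"]:
            # 3. risky tools need a yes before they run
            if approval_gate(call["name"], call["args"], approver):
                history.append(TOOLS[call["name"]](**call["args"]))
            else:
                blocked.append(call["name"])
                history.append(f"{call['name']} blocked: needs approval")
```

Nesting the deadline inside retry is the design choice worth noticing: a hung attempt becomes a transient error, so the next attempt starts a fresh call instead of waiting on the old one.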

None of this changes what the agent does when everything works. It changes what happens when things go wrong, which in production is often. Add these once and reuse them around every agent you ship.

Words you will hear

Transient error, retry / exponential backoff / jitter, timeout, fallback / graceful degradation, step cap / runaway loop, human-in-the-loop / approval gate, circuit breaker (the challenge), idempotency (safe to retry). Full definitions in the glossary.