M22 solution: agent reliability and ops

A small, dependency-free toolkit of production reliability patterns, plus a reliable agent that uses them. The demo runs offline with injected faults: no API key, no tokens, and backoff waits are stubbed, so it finishes instantly.

Files

| File | Role |
| --- | --- |
| reliability.py | The patterns: retry (exponential backoff on transient errors), call_with_deadline (timeout), fallback (try options in order), StepLimiter (cap steps), approval_gate (human approval for risky actions), plus the TransientError / StepLimitExceeded / ApprovalDenied exceptions. |
| agent.py | run(...): the ReAct loop hardened with all of the above. Each model call is wrapped in a timeout and a retry; total failure degrades to a safe message; a step cap stops runaway loops; risky tools (send_email) need approval, safe tools (multiply) do not. Returns {answer, steps, blocked, degraded}. Takes an injectable client for fault injection. |
| demo_mock.py | Injects each fault into a fake model (flaky, looping, down, risky) and shows the matching pattern handling it. Start here. |
| ../starters/add_policy.py | Starter exercise: implement a circuit breaker. |
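
To make the shape of the hardened loop concrete, here is a minimal sketch of how a step cap and an approval gate could sit inside a run(...) function that returns {answer, steps, blocked, degraded}. It is not the code in agent.py: the action format, the TOOLS and RISKY_TOOLS tables, the fake models, and every signature below are assumptions made for illustration, and the timeout and retry around the model call are left out to keep it short.

```python
from typing import Callable, Optional


class StepLimitExceeded(Exception):
    """The loop asked for more steps than the cap allows."""


class ApprovalDenied(Exception):
    """A risky tool call was not approved."""


# Assumption: risk is decided by tool name alone.
RISKY_TOOLS = {"send_email"}
TOOLS = {
    "multiply": lambda a, b: a * b,
    "send_email": lambda to, body: f"sent to {to}",  # stand-in, sends nothing
}


class StepLimiter:
    """Counts steps and refuses to go past max_steps."""

    def __init__(self, max_steps: int) -> None:
        self.max_steps = max_steps
        self.steps = 0

    def tick(self) -> None:
        if self.steps >= self.max_steps:
            raise StepLimitExceeded(f"cap of {self.max_steps} steps reached")
        self.steps += 1


def approval_gate(tool: str, approver: Optional[Callable[[str], bool]]) -> None:
    """Safe tools pass; risky tools need an approver that returns True."""
    if tool in RISKY_TOOLS and not (approver and approver(tool)):
        raise ApprovalDenied(tool)


def run(model: Callable[[list], dict], max_steps: int = 5, approver=None) -> dict:
    """model(history) returns {"tool": ..., "args": {...}} or {"answer": ...}."""
    limiter = StepLimiter(max_steps)
    history: list = []
    blocked: list = []

    def result(answer: str, degraded: bool) -> dict:
        return {"answer": answer, "steps": limiter.steps,
                "blocked": blocked, "degraded": degraded}

    while True:
        try:
            limiter.tick()
        except StepLimitExceeded:
            # Degrade instead of raising: the caller gets a calm message.
            return result("Stopped: step limit reached.", True)
        action = model(history)
        if "answer" in action:
            return result(action["answer"], False)
        tool = action["tool"]
        try:
            approval_gate(tool, approver)
            observation = TOOLS[tool](**action["args"])
        except ApprovalDenied:
            blocked.append(tool)
            observation = "blocked: approval required"
        history.append((tool, observation))


# Hypothetical fake models standing in for injected faults.
def multiplying_model(history):
    if not history:
        return {"tool": "multiply", "args": {"a": 17, "b": 23}}
    return {"answer": str(history[-1][1])}


def looping_model(history):
    return {"tool": "multiply", "args": {"a": 1, "b": 1}}  # never answers


def emailing_model(history):
    if not history:
        return {"tool": "send_email", "args": {"to": "a@example.com", "body": "hi"}}
    return {"answer": history[-1][1]}


normal = run(multiplying_model)
runaway = run(looping_model, max_steps=5)
denied = run(emailing_model)
approved = run(emailing_model, approver=lambda tool: True)
```

In this sketch a denied tool call is reported back to the model as an observation rather than aborting the run, which is one possible design; the real agent may handle a denial differently.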

Run it

# offline, free, instant (faults are injected, backoff is stubbed):
python demo_mock.py

# live (optional, costs a few tokens): put your key in .env first
cp ../starters/.env.example .env      # then edit .env and paste your key
python agent.py

The patterns, and when each fires

  • retry + backoff: transient errors (429, 503, brief network blips). Waits 0.5s, 1s, 2s between tries; only retries TransientError, never a permanent 400; caps attempts. Add jitter in production.
  • timeout (call_with_deadline): a hung call. Gives up after N seconds and raises a transient error so retry can try fresh. Honest caveat: Python cannot kill the worker thread, so set a client-side request timeout too; this models the caller giving up.
  • fallback / graceful degrade: a real outage. The agent returns a calm message and degraded: True instead of crashing. Degrade, do not throw.
  • step cap (StepLimiter): a runaway loop. Stops the run past the cap. Pairs with M20 (observability shows the loop, the cap prevents the bill).
  • approval gate (approval_gate): risky, world-changing actions. Safe tools run automatically; risky tools are blocked until an approver says yes (human-in-the-loop, M14).
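
The retry and fallback patterns above can be sketched in a few lines. This is an illustration under stated assumptions, not the contents of reliability.py: the parameter names (attempts, base_delay, jitter, sleep) and the flaky and always_down helpers are hypothetical. The sleep function is injectable, which is one way to stub backoff waits so a demo runs instantly.

```python
import random
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """A failure worth retrying, for example a 429 or a 503."""


def retry(fn: Callable[[], T], attempts: int = 4, base_delay: float = 0.5,
          jitter: float = 0.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying only on TransientError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # attempts exhausted: surface the last error
            delay = base_delay * 2 ** attempt  # 0.5, 1.0, 2.0, ...
            # jitter adds up to (jitter * delay) extra seconds at random.
            sleep(delay + random.uniform(0, jitter * delay))
    raise ValueError("attempts must be at least 1")


def fallback(options: Sequence[Callable[[], T]]) -> T:
    """Try each option in order and return the first that succeeds."""
    last_error: Exception = ValueError("no options given")
    for option in options:
        try:
            return option()
        except Exception as exc:
            last_error = exc
    raise last_error


# Injected fault: fails twice, then succeeds.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("503")
    return "ok"


def always_down() -> str:
    raise TransientError("outage")


delays: list = []
recovered = retry(flaky, sleep=delays.append)  # records waits, never sleeps
backup = fallback([always_down, lambda: "backup"])
```

Any exception other than TransientError (a permanent 400 modeled as ValueError, say) passes straight through retry without a second attempt. Setting jitter to a small positive value such as 0.1 spreads retries from many clients so they do not all arrive at once.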

Verified (offline)

  • Units: retry recovers on attempt 3 with backoff delays [0.5, 1.0] and raises when attempts run out; call_with_deadline raises on a slow call; fallback uses the backup; StepLimiter stops past the cap; approval_gate denies a risky action and passes safe ones.
  • Agent under injected faults: a flaky model (fails twice) still returns 391; a looping model is stopped by the cap (degraded: True); a full outage degrades to a safe message; a risky send_email is blocked by default and runs only when the approver returns yes.
  • All files compile; demo_mock.py runs end to end offline. Live runs reuse the M4 key and cost tokens.
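
The "call_with_deadline raises on a slow call" check can be reproduced with a thread-based sketch like the one below. Whether reliability.py uses a worker thread is an assumption drawn from the caveat about not being able to kill it; the signature and the slow and fast helpers are hypothetical.

```python
import concurrent.futures
import time


class TransientError(Exception):
    """A failure worth retrying."""


def call_with_deadline(fn, seconds: float):
    """Run fn in a worker thread and stop waiting after `seconds`."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=seconds)
    except concurrent.futures.TimeoutError:
        # The caller gives up; the worker thread itself keeps running.
        raise TransientError(f"no result within {seconds}s") from None
    finally:
        pool.shutdown(wait=False)  # do not block on the hung worker


def slow() -> str:
    time.sleep(0.3)  # stands in for a hung model call
    return "too late"


def fast() -> str:
    return "in time"


on_time = call_with_deadline(fast, seconds=1.0)

try:
    call_with_deadline(slow, seconds=0.05)
    timed_out = False
except TransientError:
    timed_out = True
```

Because the worker is abandoned rather than cancelled, the underlying request still consumes resources until it finishes on its own, which is why a client-side request timeout is still worth setting.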