M22 solution: agent reliability and ops
A small, dependency-free toolkit of production reliability patterns, plus a reliable agent that uses them. Everything runs offline with injected faults: no API key, no tokens, and backoff waits are stubbed so it is instant.
Files
| File | Role |
|---|---|
| `reliability.py` | The patterns: `retry` (exponential backoff on transient errors), `call_with_deadline` (timeout), `fallback` (try options in order), `StepLimiter` (cap steps), `approval_gate` (human yes for risky actions), plus the `TransientError` / `StepLimitExceeded` / `ApprovalDenied` exceptions. |
| `agent.py` | `run(...)`: the ReAct loop hardened with all of the above. Each model call is timeout + retry; total failure degrades to a safe message; a step cap stops runaway loops; risky tools (`send_email`) need approval, safe tools (`multiply`) do not. Returns `{answer, steps, blocked, degraded}`. Injectable client for fault injection. |
| `demo_mock.py` | Injects each fault into a fake model (flaky, looping, down, risky) and shows the matching pattern handle it. Start here. |
| `../starters/add_policy.py` | Implement a circuit breaker. |
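
To make "injectable client for fault injection" concrete, here is a minimal sketch of a fake model that fails a fixed number of times and then answers. The `FlakyClient` class, its `complete` method, and the example prompt are hypothetical names chosen for illustration; the real fakes in `demo_mock.py` may be shaped differently.

```python
class TransientError(Exception):
    """Stand-in for the toolkit's retryable error (assumed shape)."""


class FlakyClient:
    """Hypothetical fake model: raises on the first `failures` calls, then answers."""

    def __init__(self, failures, answer):
        self.failures = failures
        self.answer = answer
        self.calls = 0

    def complete(self, prompt):
        self.calls += 1
        if self.calls <= self.failures:
            raise TransientError(f"injected fault on call {self.calls}")
        return self.answer


# Illustrative prompt only; no real model or API key is involved.
client = FlakyClient(failures=2, answer="391")
outcomes = []
for _ in range(3):
    try:
        outcomes.append(client.complete("What is 17 * 23?"))
    except TransientError as exc:
        outcomes.append(f"error: {exc}")
print(outcomes)
```

Because the fault lives in the client rather than in the network, the same agent code can be exercised against each failure mode without tokens or waiting.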
Run it
# offline, free, instant (faults are injected, backoff is stubbed):
python demo_mock.py
# live (optional, costs a few tokens): put your key in .env first
cp ../starters/.env.example .env # then edit .env and paste your key
python agent.py
The patterns, and when each fires
- retry + backoff: transient errors (429, 503, brief network blips). Waits 0.5s, 1s, 2s between tries; only retries `TransientError`, never a permanent 400; caps attempts. Add jitter in production.
- timeout (`call_with_deadline`): a hung call. Gives up after N seconds and raises a transient error so retry can try fresh. Honest caveat: Python cannot kill the worker thread, so set a client-side request timeout too; this models the caller giving up.
- fallback / graceful degrade: a real outage. The agent returns a calm message and `degraded: True` instead of crashing. Degrade, do not throw.
- step cap (`StepLimiter`): a runaway loop. Stops the run past the cap. Pairs with M20 (observability shows the loop, the cap prevents the bill).
- approval gate (`approval_gate`): risky, world-changing actions. Safe tools run automatically; risky tools are blocked until an approver says yes (human-in-the-loop, M14).
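
The first two patterns can be sketched in a few lines. This is not the code in `reliability.py`, only one plausible shape for it: the parameter names (`attempts`, `base_delay`, `sleep`), the default of 4 attempts, and the use of a `ThreadPoolExecutor` for the deadline are all assumptions. The `sleep` parameter is what lets a test stub out the waits.

```python
import concurrent.futures
import time


class TransientError(Exception):
    """Retryable failure (e.g. 429, 503, a timeout). Assumed shape."""


def retry(fn, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn; on TransientError wait base_delay * 2**i, then try again."""
    for i in range(attempts):
        try:
            return fn()
        except TransientError:
            if i == attempts - 1:
                raise  # attempts exhausted: surface the last error
            sleep(base_delay * 2 ** i)  # 0.5, 1.0, 2.0 with the defaults


def call_with_deadline(fn, seconds):
    """Stop waiting on fn after `seconds`; the worker thread keeps running."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=seconds)
    except concurrent.futures.TimeoutError:
        raise TransientError(f"no reply within {seconds}s") from None
    finally:
        pool.shutdown(wait=False)  # do not block on a hung worker


# Flaky call: fails twice, succeeds on attempt 3; waits are recorded, not slept.
delays = []
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("503")
    return "ok"


result = retry(flaky, sleep=delays.append)
print(result, delays)  # ok [0.5, 1.0]


# Slow call: the caller gives up and gets a retryable error.
def slow():
    time.sleep(0.2)
    return "late"


try:
    call_with_deadline(slow, seconds=0.05)
except TransientError as exc:
    timed_out = str(exc)
    print(timed_out)
```

Any exception other than `TransientError` passes straight through `retry`, which is how a permanent 400 avoids being retried. For jitter, one common option is to multiply each delay by a random factor so that many clients do not retry in lockstep.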
Verified (offline)
- Units: `retry` recovers on attempt 3 with backoff delays `[0.5, 1.0]` and raises when attempts run out; `call_with_deadline` raises on a slow call; `fallback` uses the backup; `StepLimiter` stops past the cap; `approval_gate` denies a risky action and passes safe ones.
- Agent under injected faults: a flaky model (fails twice) still returns 391; a looping model is stopped by the cap (`degraded: True`); a full outage degrades to a safe message; a risky `send_email` is blocked by default and runs only when the approver returns yes.
- All files compile; `demo_mock.py` runs end to end offline. Live runs reuse the M4 key and cost tokens.
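
The agent-level checks for the step cap and the approval gate can be reproduced with a stripped-down loop. The sketch below is a simplification under stated assumptions, not the `run(...)` in `agent.py`: the model is a plain callable returning `("tool", name)` or `("final", text)`, tool execution is omitted, `blocked` is assumed to be a list of tool names, and the `tick` method and degraded message are invented for illustration.

```python
class StepLimitExceeded(Exception):
    pass


class ApprovalDenied(Exception):
    pass


class StepLimiter:
    """Counts steps and refuses to go past max_steps."""

    def __init__(self, max_steps):
        self.max_steps = max_steps
        self.steps = 0

    def tick(self):
        if self.steps >= self.max_steps:
            raise StepLimitExceeded(f"cap of {self.max_steps} steps reached")
        self.steps += 1


RISKY_TOOLS = {"send_email"}  # safe tools such as multiply are not listed


def approval_gate(tool, approver):
    """Raise unless the tool is safe or the approver says yes."""
    if tool in RISKY_TOOLS and not approver(tool):
        raise ApprovalDenied(tool)


def run(model, max_steps=5, approver=lambda tool: False):
    """Simplified loop: deny risky tools by default, degrade at the step cap."""
    limiter = StepLimiter(max_steps)
    blocked = []
    try:
        while True:
            limiter.tick()
            kind, value = model()
            if kind == "final":
                return {"answer": value, "steps": limiter.steps,
                        "blocked": blocked, "degraded": False}
            try:
                approval_gate(value, approver)
            except ApprovalDenied:
                blocked.append(value)  # record it and carry on without running it
    except StepLimitExceeded:
        return {"answer": "Sorry, I could not finish this safely.",
                "steps": limiter.steps, "blocked": blocked, "degraded": True}


# Looping model: asks for a tool forever, so the cap ends the run.
looped = run(lambda: ("tool", "multiply"))

# Risky model: asks for send_email once, then finishes.
script = iter([("tool", "send_email"), ("final", "done")])
denied = run(lambda: next(script))

script = iter([("tool", "send_email"), ("final", "done")])
approved = run(lambda: next(script), approver=lambda tool: True)

print(looped["degraded"], denied["blocked"], approved["blocked"])
```

Denying by default means a missing or misconfigured approver fails closed: the risky action is recorded in `blocked` rather than executed.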