M20 solution: agent observability and evaluation toolkit

A tiny, dependency-free toolkit that makes an agent shippable: see every step it takes, and score whether those steps were right. It instruments the multiply agent from M19.

The core runs offline with a mock model (no API key, no tokens). Only the optional live run in `agent.py` and the LLM-as-judge scorer call Claude and cost tokens.

Files

| File | Role |
| --- | --- |
| `tracer.py` | Observability. `Span` (one step) and `Trace` (a run): records kind, name, inputs, output, tokens, duration, and status, and prints a readable tree. This is the same unit that production tools (LangSmith, Langfuse, Arize Phoenix, OpenTelemetry) record. |
| `agent.py` | The M19 multiply ReAct loop, instrumented: every model call and tool call is recorded as a span. `run(task, client=None, trace=None)` returns `(answer, trace)`. It accepts an injected client, so it is testable. |
| `evals.py` | Evaluation. A `CASES` golden set, five rule-based scorers (`answer_contains`, `called_tool`, `tool_args`, `no_errors`, `within_budget`), an optional `score_llm_judge`, and `run_suite(...)`, which prints a scorecard and returns a summary. |
| `demo_mock.py` | Runs Part A (trace) and Part B (eval green, then red after breaking the tool) fully offline. Start here. |
| `../starters/add_scorer.py` | Add your own scorer (a `token_budget` example is included). |
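
To make the `Span` and `Trace` roles in the table concrete, here is a minimal sketch of what such a tracer could look like. It is not the code in `tracer.py`: the field names follow the table, but the method names (`span`, `total_tokens`, `errors`, `render`), the `finish` defaults, and the sample token counts are assumptions made for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Span:
    """One recorded step (hypothetical shape, not the real tracer.py)."""
    kind: str                       # e.g. "model" or "tool"
    name: str
    inputs: Any = None
    output: Any = None
    tokens: int = 0
    status: str = "running"
    started: float = field(default_factory=time.perf_counter)
    duration: Optional[float] = None

    def finish(self, output: Any = None, tokens: int = 0, status: str = "ok") -> "Span":
        # Close the span: store the result and how long the step took.
        self.output = output
        self.tokens = tokens
        self.status = status
        self.duration = time.perf_counter() - self.started
        return self


@dataclass
class Trace:
    """One run: an ordered list of spans (hypothetical shape)."""
    task: str
    spans: list = field(default_factory=list)

    def span(self, kind: str, name: str, inputs: Any = None) -> Span:
        s = Span(kind=kind, name=name, inputs=inputs)
        self.spans.append(s)
        return s

    def total_tokens(self) -> int:
        return sum(s.tokens for s in self.spans)

    def errors(self) -> list:
        return [s for s in self.spans if s.status == "error"]

    def render(self) -> str:
        # A flat, readable listing; the real tool prints a tree.
        lines = [f"trace: {self.task}"]
        for s in self.spans:
            lines.append(f"  [{s.kind}] {s.name} -> {s.output!r} ({s.tokens} tok, {s.status})")
        return "\n".join(lines)


# Sample run with made-up token counts (12 per model call).
trace = Trace("What is 23 * 17?")
trace.span("model", "plan", inputs="What is 23 * 17?").finish("call multiply(23, 17)", tokens=12)
trace.span("tool", "multiply", inputs={"a": 23, "b": 17}).finish(23 * 17)
trace.span("model", "answer", inputs=391).finish("23 * 17 = 391", tokens=12)
print(trace.render())
```

Keeping `status` on the span rather than raising immediately means a failed step can still appear in the rendered trace, which is the behavior described under "How it works".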

Run it

    # offline, free, no key needed (uses a mock model):
    python demo_mock.py

    # live (optional, costs a few tokens): put your key in .env first
    cp ../starters/.env.example .env      # then edit .env and paste your key
    python agent.py
    python evals.py

How it works

Trace. `agent.run` opens a span around each model call and each tool call, then calls `span.finish(output, tokens, status)`. Failures are recorded with `status="error"` instead of being hidden, because the failure is exactly what you need to see.
Scorers read the trace, not just the answer. `called_tool` and `tool_args` confirm the agent reached its answer the right way (by calling `multiply(23, 17)`), which a check on the final string alone cannot do.
Regression demo. `demo_mock.py` breaks the `multiply` tool so that it adds instead of multiplying, then re-runs the suite; the scorecard drops from 100% to 0% and names the failing check. That is a regression caught before a user ever saw it.
Rule-based first, judge second. Deterministic scorers are free, instant, and never drift. The LLM-as-judge (`score_llm_judge`) is for open-ended answers a rule cannot check; it costs tokens.
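
The point about scorers reading the trace can be illustrated with a small sketch. It assumes a trace is a plain list of span dicts and that each scorer returns a boolean; the real signatures in `evals.py` may differ, and `make_trace` and `score` are hypothetical helpers. Two runs produce the same final string, but only one of them actually called the tool.

```python
# Hypothetical trace shape: a list of span dicts. The real Trace in tracer.py may differ.
def make_trace(used_tool):
    spans = [{"kind": "model", "name": "plan", "output": "thinking", "status": "ok"}]
    if used_tool:
        spans.append({"kind": "tool", "name": "multiply",
                      "inputs": {"a": 23, "b": 17}, "output": 391, "status": "ok"})
    spans.append({"kind": "model", "name": "answer", "output": "391", "status": "ok"})
    return spans


# Scorer names echo the README; the signatures are assumptions.
def answer_contains(answer, expected):
    return expected in answer


def called_tool(trace, tool_name):
    return any(s["kind"] == "tool" and s["name"] == tool_name for s in trace)


def tool_args(trace, tool_name, expected_args):
    return any(s["kind"] == "tool" and s["name"] == tool_name
               and s.get("inputs") == expected_args for s in trace)


def no_errors(trace):
    return all(s["status"] != "error" for s in trace)


def score(answer, trace):
    return {
        "answer_contains": answer_contains(answer, "391"),
        "called_tool": called_tool(trace, "multiply"),
        "tool_args": tool_args(trace, "multiply", {"a": 23, "b": 17}),
        "no_errors": no_errors(trace),
    }


# Same final answer, different paths to it.
honest = score("23 * 17 = 391", make_trace(used_tool=True))
guessed = score("23 * 17 = 391", make_trace(used_tool=False))
print(honest)   # every check passes
print(guessed)  # answer_contains passes, called_tool and tool_args fail
```

A check on the final string alone would score both runs identically; only the trace-based checks separate the run that used the tool from the one that guessed.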

Verified

Tracer + traced agent (mock model, real tools): produces 2 model spans, 1 tool span (`multiply(23, 17) = 391`), 24 tokens, 0 errors.
`evals.run_suite` scores a correct agent at 100%, and a deliberately broken agent (`multiply` changed to add) at 0%, with the `answer_contains` and tool-result checks flagged.
`score_llm_judge` verified with a mocked grader (returns PASS).
All four `.py` files compile; `demo_mock.py` runs end to end offline.
Live runs need a real `ANTHROPIC_API_KEY` and cost tokens (pilot); the mock path is free.
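
The green-then-red result above can be reproduced in miniature. The sketch below is not `demo_mock.py`: `run_agent`, `CHECKS`, and `run_mini_suite` are hypothetical stand-ins, and it assumes a case passes only if every check passes, so a single case scores either 100 or 0.

```python
import operator


def run_agent(a, b, tool):
    """Stand-in for the traced agent: calls the tool once and records it (hypothetical)."""
    result = tool(a, b)
    trace = [{"kind": "tool", "name": "multiply", "inputs": (a, b),
              "output": result, "status": "ok"}]
    return f"{a} * {b} = {result}", trace


# Each check returns True on pass. Names echo the README; signatures are assumed.
CHECKS = {
    "answer_contains": lambda answer, trace: "391" in answer,
    "called_tool": lambda answer, trace: any(s["name"] == "multiply" for s in trace),
    "tool_result": lambda answer, trace: any(s["output"] == 391 for s in trace),
}


def run_mini_suite(tool):
    answer, trace = run_agent(23, 17, tool)
    failed = [name for name, check in CHECKS.items() if not check(answer, trace)]
    # Assumption: the case passes only if every check passes.
    score = 0 if failed else 100
    return score, failed


print(run_mini_suite(operator.mul))  # (100, [])
print(run_mini_suite(operator.add))  # (0, ['answer_contains', 'tool_result'])
```

With the tool swapped for addition, the agent still calls `multiply` with the right arguments, so `called_tool` keeps passing; the answer and tool-result checks are the ones that flag the break.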