M20 solution: agent observability and evaluation toolkit
A tiny, dependency-free toolkit that makes an agent shippable: see every step it takes, and score whether those steps were right. It instruments the multiply agent from M19.
The core runs offline with a mock model (no API key, no tokens). Only the optional live run in
agent.py / the LLM-as-judge scorer call Claude and cost tokens.
Files
| File | Role |
|---|---|
tracer.py |
Observability. Span (one step) and Trace (a run): records kind, name, inputs, output, tokens, duration, status, and prints a readable tree. The unit production tools (LangSmith, Langfuse, Arize Phoenix, OpenTelemetry) record. |
agent.py |
The M19 multiply ReAct loop, instrumented: every model call and tool call is recorded as a span. run(task, client=None, trace=None) returns (answer, trace). Accepts an injected client so it is testable. |
evals.py |
Evaluation. A CASES golden set, five rule-based scorers (answer_contains, called_tool, tool_args, no_errors, within_budget), an optional score_llm_judge, and run_suite(...) that prints a scorecard and returns a summary. |
demo_mock.py |
Runs Part A (trace) and Part B (eval green, then red after breaking the tool) fully offline. Start here. |
../starters/add_scorer.py |
Add your own scorer (a token_budget example is included). |
Run it
# offline, free, no key needed (uses a mock model):
python demo_mock.py
# live (optional, costs a few tokens): put your key in .env first
cp ../starters/.env.example .env # then edit .env and paste your key
python agent.py
python evals.py
How it works
- Trace.
agent.runopens a span around each model call and each tool call, then callsspan.finish(output, tokens, status). Failures are recorded withstatus="error"instead of being hidden, because the failure is exactly what you need to see. - Scorers read the trace, not just the answer.
called_toolandtool_argsconfirm the agent reached its answer the right way (by callingmultiply(23, 17)), which a final-string check cannot. - Regression demo.
demo_mock.pychangesmultiplyto add instead of multiply and re-runs the suite; the scorecard drops from 100% to 0% and names the failing check. That is a regression caught before a user ever saw it. - Rule-based first, judge second. Deterministic scorers are free, instant, and never drift. The
LLM-as-judge (
score_llm_judge) is for open-ended answers a rule cannot check; it costs tokens.
Verified
tracer+ tracedagent(mock model, real tools): produces 2 model spans, 1 tool spanmultiply(23, 17) = 391, 24 tokens, 0 errors.evals.run_suitescores a correct agent at 100%, and a deliberately broken agent (multiply changed to add) at 0%, with theanswer_containsand tool-result checks flagged.score_llm_judgeverified with a mocked grader (returns PASS).- All four
.pyfiles compile;demo_mock.pyruns end to end offline. - Live runs need a real
ANTHROPIC_API_KEYand cost tokens (pilot); the mock path is free.