Skip to content

M20 solution: agent observability and evaluation toolkit

A tiny, dependency-free toolkit that makes an agent shippable: see every step it takes, and score whether those steps were right. It instruments the multiply agent from M19.

The core runs offline with a mock model (no API key, no tokens). Only the optional live run in agent.py / the LLM-as-judge scorer call Claude and cost tokens.

Files

File Role
tracer.py Observability. Span (one step) and Trace (a run): records kind, name, inputs, output, tokens, duration, status, and prints a readable tree. The unit production tools (LangSmith, Langfuse, Arize Phoenix, OpenTelemetry) record.
agent.py The M19 multiply ReAct loop, instrumented: every model call and tool call is recorded as a span. run(task, client=None, trace=None) returns (answer, trace). Accepts an injected client so it is testable.
evals.py Evaluation. A CASES golden set, five rule-based scorers (answer_contains, called_tool, tool_args, no_errors, within_budget), an optional score_llm_judge, and run_suite(...) that prints a scorecard and returns a summary.
demo_mock.py Runs Part A (trace) and Part B (eval green, then red after breaking the tool) fully offline. Start here.
../starters/add_scorer.py Add your own scorer (a token_budget example is included).

Run it

# offline, free, no key needed (uses a mock model):
python demo_mock.py

# live (optional, costs a few tokens): put your key in .env first
cp ../starters/.env.example .env      # then edit .env and paste your key
python agent.py
python evals.py

How it works

  • Trace. agent.run opens a span around each model call and each tool call, then calls span.finish(output, tokens, status). Failures are recorded with status="error" instead of being hidden, because the failure is exactly what you need to see.
  • Scorers read the trace, not just the answer. called_tool and tool_args confirm the agent reached its answer the right way (by calling multiply(23, 17)), which a final-string check cannot.
  • Regression demo. demo_mock.py changes multiply to add instead of multiply and re-runs the suite; the scorecard drops from 100% to 0% and names the failing check. That is a regression caught before a user ever saw it.
  • Rule-based first, judge second. Deterministic scorers are free, instant, and never drift. The LLM-as-judge (score_llm_judge) is for open-ended answers a rule cannot check; it costs tokens.

Verified

  • tracer + traced agent (mock model, real tools): produces 2 model spans, 1 tool span multiply(23, 17) = 391, 24 tokens, 0 errors.
  • evals.run_suite scores a correct agent at 100%, and a deliberately broken agent (multiply changed to add) at 0%, with the answer_contains and tool-result checks flagged.
  • score_llm_judge verified with a mocked grader (returns PASS).
  • All four .py files compile; demo_mock.py runs end to end offline.
  • Live runs need a real ANTHROPIC_API_KEY and cost tokens (pilot); the mock path is free.