Lab M20: see your agent (observability) and grade it (evaluation)

You'll need: your venv and the anthropic plus python-dotenv you installed in M4. The core lab needs no API key and costs nothing (it uses a mock model). A live run at the end is optional. Time: about 45 minutes. Work in your breakout pair.

Heads up: an agent picks its own steps, so "it ran without an error" does not mean "it did the right thing". Observability lets you SEE every step; evaluation SCORES whether the steps were right. We reuse the multiply agent from M19. Nothing here can harm your computer.

This lab has two parts: - Part A: instrument the agent and read its trace. - Part B: write an eval suite, then break the agent and watch the scorecard catch it.

flowchart LR
  T["task"] --> AG["traced agent"]
  AG -->|records each step| TR["TRACE: spans<br/>model + tool calls"]
  TR --> EV["EVAL scorers"]
  EV --> SC["SCORECARD<br/>pass / fail"]

Part A: observability (read the trace)

Step 1: Set up

Copy the solution/ files and starters/.env.example into a folder. Activate your venv.

python -c "import anthropic, dotenv; print('deps ok')"

You should now see: deps ok. (If not, run pip install anthropic python-dotenv, the M4 libraries.)

Step 2: Run the offline demo and read the trace

python demo_mock.py

This runs the agent on a fake model, so it is free and needs no key.

You should now see, under PART A, a trace like this:

  1. [model] claude-opus-4-8  (... tok, ok)   out: tool_use
  2. [tool] multiply           in: {'a': 23, 'b': 17}   out: 391
  3. [model] claude-opus-4-8                  out: end_turn
  totals: 2 model call(s), 1 tool call(s), 24 tokens, 0 error(s)

Read it top to bottom: the model asked for the tool, the tool ran and returned 391, then the model gave its final answer. That is your agent thinking, on the record.

Step 3: See where the trace comes from

Open agent.py. Find the two trace.record(...) lines: one wraps the model call, one wraps each tool call, and each is closed with .finish(...). Open tracer.py and read Span and Trace.

You should now see: every field in the printed trace (kind, name, inputs, output, tokens, status) maps to one line in Span. There is no magic; production tools (LangSmith, Langfuse, OpenTelemetry) record the same spans, then add dashboards on top.

Step 4: Make the trace show an error

In tracer.py nothing needs changing; instead, in a Python shell, call the tool with a bad argument to see a failure get recorded:

python -c "from tracer import Trace; t=Trace(); s=t.record('tool','multiply',{'a':1}); s.finish('missing b', status='error'); t.print_tree()"

You should now see: a span with error status and 1 error(s) in the totals. Errors are data too: observability records the failures, which is exactly when you need it most.

Part B: evaluation (grade the agent)

Step 5: Run the eval suite on a correct agent

The same demo_mock.py run already printed PART B. Look at it again (or rerun).

You should now see a green scorecard:

EVAL SCORECARD
  [PASS] basic: 'The answer is 391.'
  [PASS] small: 'The answer is 42.'
  ----
  2/2 cases passed  (100%)

Open evals.py: each case lists what a correct run looks like, and each scorer (answer_contains, called_tool, tool_args, no_errors, within_budget) checks one thing. Note that called_tool and tool_args read the trace, not just the answer.

Step 6: Break the agent and catch the regression

The demo already does this for you at the bottom: it changes multiply to ADD instead of multiply and re-runs the suite.

You should now see a red scorecard:

  [FAIL] basic: 'The answer is 40.'
        miss: answer_contains (want '391' in answer)
  ...
  0/2 cases passed  (0%)

This is the whole point. You changed the tool and the evals caught it instantly and told you which check failed. Run your eval suite after every change to a prompt, tool, model, or framework.

Step 7: Add your own scorer

Open starters/add_scorer.py. It adds a token_budget scorer (fail if a run uses too many tokens). Lower the cap to a small number and run it:

python add_scorer.py

You should now see: the new token_budget check appear, and (if you set the cap below 24) the cases fail on it. You just added a cost guardrail to your evals. Put the cap back when done.

Step 8 (optional, costs a few tokens): grade a real run

Put your key in .env (copy .env.example), then run the agent live and trace it:

cp .env.example .env      # then edit .env and paste your key
python agent.py

You should now see: a real trace (real token counts and timings) ending in an answer containing 391. The trace and the scorers work the same on a live run; only now the numbers are real.

Step 9: Show it

Post in the chat: your PART A trace, and both scorecards (the green 100% and the red 0%). One picture of an agent you can see and measure.

If you get stuck

ModuleNotFoundError: anthropic -> pip install anthropic python-dotenv with your venv active (M4 libraries).
demo_mock.py cannot find agent/evals -> run it from inside the folder that holds all the solution .py files.
The scorecard is green when you expected red -> make sure you actually saved the change to multiply (the demo does it for you; if editing by hand, re-save and rerun).
ANTHROPIC_API_KEY error in Step 8 -> your .env is not named exactly .env, or the key line is wrong. See api-keys.md. Steps 1 to 7 do not need a key.

Check yourself

Why is "did it return 391?" not a good enough eval on its own?

Because an agent could guess 391 without ever calling the tool, or call the wrong tool and still land on 391 by luck. The `called_tool` and `tool_args` scorers read the trace to confirm it got there the right way. Observability is what makes that check possible.

What is the difference between observability and evaluation?

Observability records WHAT the agent did (the trace: every model and tool call, with tokens, timing, errors). Evaluation scores whether that was RIGHT (test cases plus scorers). You need both.

When would you use an LLM-as-judge instead of a rule-based scorer?

When the answer is open-ended (a summary, an explanation) and there is no exact string or tool call to check. The judge grades against a rubric. It costs tokens and can be wrong, so prefer rule-based checks first and use the judge only where rules cannot reach.

Why run evals after every change?

To catch regressions. A new prompt, tool, model, or framework can silently break behaviour that used to work. Evals are unit tests for your agent: green means the change was safe.