Lab M20: see your agent (observability) and grade it (evaluation)
You'll need: your venv, with the anthropic and python-dotenv packages you installed in M4. The core
lab needs no API key and costs nothing (it uses a mock model). A live run at the end is optional.
Time: about 45 minutes. Work in your breakout pair.
Heads up: an agent picks its own steps, so "it ran without an error" does not mean "it did the right thing". Observability lets you SEE every step; evaluation SCORES whether the steps were right. We reuse the multiply agent from M19. Nothing here can harm your computer.
This lab has two parts:
- Part A: instrument the agent and read its trace.
- Part B: write an eval suite, then break the agent and watch the scorecard catch it.
flowchart LR
T["task"] --> AG["traced agent"]
AG -->|records each step| TR["TRACE: spans<br/>model + tool calls"]
TR --> EV["EVAL scorers"]
EV --> SC["SCORECARD<br/>pass / fail"]
Part A: observability (read the trace)
Step 1: Set up
Copy the solution/ files and starters/.env.example into
a folder. Activate your venv.
python -c "import anthropic, dotenv; print('deps ok')"
You should now see: deps ok. (If not, run pip install anthropic python-dotenv, the M4 libraries.)
Step 2: Run the offline demo and read the trace
python demo_mock.py
You should now see, under PART A, a trace like this:
1. [model] claude-opus-4-8 (... tok, ok) out: tool_use
2. [tool] multiply in: {'a': 23, 'b': 17} out: 391
3. [model] claude-opus-4-8 out: end_turn
totals: 2 model call(s), 1 tool call(s), 24 tokens, 0 error(s)
Step 3: See where the trace comes from
Open agent.py. Find the two trace.record(...) lines: one wraps the model
call, one wraps each tool call, and each is closed with .finish(...). Open
tracer.py and read Span and Trace.
You should now see: every field in the printed trace (kind, name, inputs, output, tokens, status)
maps to one line in Span. There is no magic; production tools (LangSmith, Langfuse, OpenTelemetry)
record the same spans, then add dashboards on top.
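To make "no magic" concrete, here is a minimal sketch of what a span recorder along these lines could look like. It is not the lab's tracer.py: the class names MiniSpan and MiniTrace, the summary() method, the "mock-model" name, and the per-call token counts are assumptions for illustration. Only the record(...)/finish(...) call shape and the six field names come from this lab.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class MiniSpan:
    """One recorded step (hypothetical stand-in for the lab's Span)."""
    kind: str             # "model" or "tool"
    name: str             # model id or tool name
    inputs: Any = None    # what went into the call
    output: Any = None    # what came out, filled in by finish()
    tokens: int = 0       # token count; stays 0 for tool calls
    status: str = "open"  # "open" until finish() is called

    def finish(self, output: Any, tokens: int = 0, status: str = "ok") -> "MiniSpan":
        self.output, self.tokens, self.status = output, tokens, status
        return self


@dataclass
class MiniTrace:
    """Ordered spans for one agent run (hypothetical stand-in for Trace)."""
    spans: list = field(default_factory=list)

    def record(self, kind: str, name: str, inputs: Any = None) -> MiniSpan:
        span = MiniSpan(kind, name, inputs)
        self.spans.append(span)
        return span

    def summary(self) -> dict:
        return {
            "model_calls": sum(s.kind == "model" for s in self.spans),
            "tool_calls": sum(s.kind == "tool" for s in self.spans),
            "tokens": sum(s.tokens for s in self.spans),
            "errors": sum(s.status == "error" for s in self.spans),
        }


# Replay the three steps of the multiply run with made-up token counts.
trace = MiniTrace()
trace.record("model", "mock-model", {"task": "23 * 17"}).finish("tool_use", tokens=12)
trace.record("tool", "multiply", {"a": 23, "b": 17}).finish(391)
trace.record("model", "mock-model").finish("end_turn", tokens=12)
print(trace.summary())
```

The totals line in the printed trace is just an aggregation like summary() over the recorded spans.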
Step 4: Make the trace show an error
Nothing in tracer.py needs changing. Instead, from your terminal, record a failed tool call by hand
(a multiply call that is missing b) to see how a failure shows up in the trace:
python -c "from tracer import Trace; t=Trace(); s=t.record('tool','multiply',{'a':1}); s.finish('missing b', status='error'); t.print_tree()"
You should now see: an error status on the span and 1 error(s) in the totals. Errors are data
too: observability records the failures, which is exactly when you need it most.
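In a running agent, the error span is usually recorded by a wrapper around the tool call rather than by hand. A minimal sketch of that pattern, assuming the trace is a plain list of dicts; the traced_tool_call helper and the dict layout are hypothetical and are not taken from agent.py:

```python
def multiply(a, b):
    return a * b


def traced_tool_call(trace, name, func, args):
    """Call func(**args) and append one span dict to trace, even on failure."""
    span = {"kind": "tool", "name": name, "inputs": args}
    try:
        span["output"] = func(**args)
        span["status"] = "ok"
    except Exception as exc:  # a failed call is still a recorded step
        span["output"] = f"{type(exc).__name__}: {exc}"
        span["status"] = "error"
    trace.append(span)
    return span


trace = []
traced_tool_call(trace, "multiply", multiply, {"a": 23, "b": 17})
traced_tool_call(trace, "multiply", multiply, {"a": 1})  # missing b
for span in trace:
    print(span["status"], span["name"], span["inputs"], "->", span["output"])
print(sum(s["status"] == "error" for s in trace), "error(s)")
```

Catching the exception inside the wrapper means the run keeps its full history: the second call fails, but both spans end up in the trace.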
Part B: evaluation (grade the agent)
Step 5: Run the eval suite on a correct agent
The same demo_mock.py run already printed PART B. Look at it again (or rerun).
You should now see a green scorecard:
EVAL SCORECARD
[PASS] basic: 'The answer is 391.'
[PASS] small: 'The answer is 42.'
----
2/2 cases passed (100%)
Open evals.py: each case lists what a correct run looks like, and each
scorer (answer_contains, called_tool, tool_args, no_errors, within_budget) checks one thing.
Note that called_tool and tool_args read the trace, not just the answer.
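A sketch of why reading the trace matters: a scorer that checks only the answer cannot tell a real tool call from a lucky guess. The function signatures and the dict-based run below are assumptions for illustration; the actual scorers in evals.py may be shaped differently.

```python
def answer_contains(run, want):
    """Pass if the expected text appears in the final answer."""
    return want in run["answer"]


def called_tool(run, tool_name):
    """Pass if the trace holds at least one call to the named tool."""
    return any(s["kind"] == "tool" and s["name"] == tool_name for s in run["trace"])


def tool_args(run, tool_name, want_args):
    """Pass if some call to the named tool received exactly these arguments."""
    return any(
        s["kind"] == "tool" and s["name"] == tool_name and s["inputs"] == want_args
        for s in run["trace"]
    )


# One run that used the tool, and one that gave the right answer without it.
used_tool = {
    "answer": "The answer is 391.",
    "trace": [{"kind": "tool", "name": "multiply", "inputs": {"a": 23, "b": 17}}],
}
guessed = {"answer": "The answer is 391.", "trace": []}

for label, run in [("used_tool", used_tool), ("guessed", guessed)]:
    print(
        label,
        answer_contains(run, "391"),
        called_tool(run, "multiply"),
        tool_args(run, "multiply", {"a": 23, "b": 17}),
    )
```

Both runs pass answer_contains, but only the first passes the two trace-based checks.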
Step 6: Break the agent and catch the regression
The demo already does this for you at the bottom: it changes multiply to ADD instead of multiply and
re-runs the suite.
You should now see a red scorecard:
[FAIL] basic: 'The answer is 40.'
miss: answer_contains (want '391' in answer)
...
0/2 cases passed (0%)
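The regression check can be sketched end to end with a toy stand-in for the agent. Everything here is hypothetical (toy_agent, CASES, run_suite, and the 6 and 7 used as inputs for the small case); the lab's own suite lives in evals.py and demo_mock.py.

```python
def multiply(a, b):
    return a * b


def broken_multiply(a, b):
    return a + b  # the injected bug: adds instead of multiplying


def toy_agent(tool, a, b):
    """Stand-in for the agent: call the tool once and phrase the answer."""
    return f"The answer is {tool(a, b)}."


# (case name, inputs, text a correct answer must contain)
CASES = [("basic", (23, 17), "391"), ("small", (6, 7), "42")]


def run_suite(tool):
    """Print one PASS/FAIL line per case and return the number of passes."""
    passed = 0
    for name, (a, b), want in CASES:
        answer = toy_agent(tool, a, b)
        ok = want in answer
        passed += ok
        print(f"[{'PASS' if ok else 'FAIL'}] {name}: {answer!r}")
    print(f"{passed}/{len(CASES)} cases passed ({100 * passed // len(CASES)}%)")
    return passed


run_suite(multiply)         # correct tool
run_suite(broken_multiply)  # same cases, broken tool
```

Neither run raises an exception. The only signal that the second one is wrong is the scorecard, which is the point of keeping the cases fixed while the agent changes.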
Step 7: Add your own scorer
Open starters/add_scorer.py. It adds a token_budget scorer (fail if a
run uses too many tokens). Lower the cap to a small number and run it:
python add_scorer.py
You should now see a token_budget check appear, and (if you set the cap below 24) the
cases fail on it. You just added a cost guardrail to your evals. Put the cap back when done.
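A budget scorer can be as small as a sum and a comparison. This sketch assumes each span is a dict with an optional tokens field and that the scorer returns a (passed, detail) pair; both are illustrative choices, not the contents of starters/add_scorer.py.

```python
def token_budget(trace, cap):
    """Pass if the run's total tokens are at or under cap."""
    used = sum(span.get("tokens", 0) for span in trace)
    return used <= cap, f"used {used} tokens, cap {cap}"


# Made-up per-call token counts that add up to the 24 in the totals line.
trace = [
    {"kind": "model", "tokens": 12},
    {"kind": "tool"},  # tool calls carry no tokens in this sketch
    {"kind": "model", "tokens": 12},
]

for cap in (1000, 24, 10):
    ok, detail = token_budget(trace, cap)
    print("PASS" if ok else "FAIL", detail)
```

With a 24-token run, a cap of 24 or more passes and anything lower fails, which matches the behaviour described for this step.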
Step 8 (optional, costs a few tokens): grade a real run
Put your key in .env (copy .env.example), then run the agent live and trace it:
cp .env.example .env # then edit .env and paste your key
python agent.py
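If the live run complains about the key, a quick preflight check can narrow it down before you spend any tokens. This is a sketch that assumes the usual KEY=value layout for .env files; check_env_file is a hypothetical helper, not part of the lab files, and it never prints the key itself.

```python
from pathlib import Path


def check_env_file(path=".env", key="ANTHROPIC_API_KEY"):
    """Return a short diagnosis string for the given .env file."""
    env_path = Path(path)
    if not env_path.is_file():
        return f"{path} not found (is it named exactly .env?)"
    for line in env_path.read_text().splitlines():
        name, sep, value = line.strip().partition("=")
        if name.strip() == key:
            if sep and value.strip():
                return "key line found"
            return f"{key} line has no value"
    return f"no {key} line in {path}"


print(check_env_file())
```

Run it from the same folder as agent.py, since a relative .env path is resolved against the current working directory.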
Step 9: Show it
Post in the chat: your PART A trace, and both scorecards (the green 100% and the red 0%). One picture
of an agent you can see and measure.
If you get stuck
- ModuleNotFoundError: anthropic -> pip install anthropic python-dotenv with your venv active (M4 libraries).
- demo_mock.py cannot find agent/evals -> run it from inside the folder that holds all the solution .py files.
- The scorecard is green when you expected red -> make sure you actually saved the change to multiply (the demo does it for you; if editing by hand, re-save and rerun).
- ANTHROPIC_API_KEY error in Step 8 -> your .env is not named exactly .env, or the key line is wrong. See api-keys.md. Steps 1 to 7 do not need a key.