M20: Agent observability and evaluation (Part D: Agentic Systems)

You built agents (M9), orchestrated them (M18), and built them in eight frameworks (M19). One question decides whether any of that is safe to ship: how do you KNOW it works? An agent that takes its own steps can quietly call the wrong tool, loop forever, or give a confident wrong answer. Today you build the two tools that make agents shippable: observability (see every step it took) and evaluation (score whether each step was right), and you watch your evals catch a bug the moment it is introduced.

Today's win: a tracer that shows every model call and tool call your agent made, plus an eval scorecard that turns red the instant the agent regresses.

Today you will

Add observability: record every step (model calls, tool calls, tokens, timing, errors) as a trace
Read a trace to debug an agent that misbehaves, instead of guessing
Write an evaluation harness: test cases plus scorers that check the answer AND the trace
Catch a regression: break the agent on purpose and watch the scorecard fail
Know the production tools (LangSmith, Langfuse, Arize Phoenix, OpenTelemetry) and what they record
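
To make the first goal concrete, here is a minimal sketch of a tracer in plain Python. It is not the lab's actual tracer: the names `Step`, `Trace`, `record`, `mock_model`, and `multiply` are assumptions made for illustration, and the token count is passed in by hand rather than read from a real model response.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    kind: str                    # "model" or "tool"
    name: str                    # which model or tool was called
    tokens: int = 0              # supplied by the caller in this sketch
    seconds: float = 0.0         # wall-clock duration of the call
    error: Optional[str] = None  # repr of the exception, if one was raised


@dataclass
class Trace:
    steps: list = field(default_factory=list)

    def record(self, kind, name, fn, *args, tokens=0):
        """Run fn(*args), time it, and append one Step whether it succeeds or fails."""
        step = Step(kind=kind, name=name, tokens=tokens)
        start = time.perf_counter()
        try:
            return fn(*args)
        except Exception as exc:
            step.error = repr(exc)
            raise
        finally:
            step.seconds = time.perf_counter() - start
            self.steps.append(step)


# Hypothetical stand-ins for a real model call and a real tool.
def mock_model(prompt):
    return "call multiply"


def multiply(a, b):
    return a * b


trace = Trace()
trace.record("model", "mock_model", mock_model, "What is 6 times 7?", tokens=12)
result = trace.record("tool", "multiply", multiply, 6, 7)

for number, step in enumerate(trace.steps, start=1):
    status = step.error or "ok"
    print(f"{number}. {step.kind:<5} {step.name:<10} tokens={step.tokens} {status}")
```

Because the step is appended in the `finally` clause, a call that raises still shows up in the trace with its error filled in, which is usually the step you most want to see when debugging.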

Run of show (about 60 minutes)

Time	What we do
0:00	Hook: you cannot ship what you cannot see or measure
0:05	The one idea: observability tells you WHAT happened, evaluation tells you if it was RIGHT (read `notes.md`)
0:12	Lab Part A: instrument the agent and read its trace
0:30	Lab Part B: write an eval suite, then break the agent and catch the regression
0:52	Show: post a trace and a scorecard (one green, one red)
1:00	Wrap
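
Lab Part B in the schedule above pairs an eval suite with a deliberate break. A rough sketch of that idea follows, assuming a mock agent that returns an answer plus a list of the tool names it called. Every name here (`good_agent`, `broken_agent`, `CASES`, `run_suite`, `is_green`) is hypothetical and stands in for whatever the lab files actually define.

```python
def _operands(question):
    """Pull the two integers out of a question like 'What is 6 times 7?'."""
    return [int(word) for word in question.replace("?", "").split() if word.isdigit()]


def good_agent(question):
    """Mock agent: returns (answer, trace), where trace lists the tools it called."""
    a, b = _operands(question)
    return str(a * b), ["multiply"]


def broken_agent(question):
    """Deliberate regression: adds instead of multiplying and skips the tool."""
    a, b = _operands(question)
    return str(a + b), []


CASES = [
    {"question": "What is 6 times 7?", "expected": "42", "tool": "multiply"},
    {"question": "What is 3 times 5?", "expected": "15", "tool": "multiply"},
]


def score_answer(case, answer, trace):
    """Check the final answer."""
    return answer == case["expected"]


def score_tool_used(case, answer, trace):
    """Check the trace: the expected tool must have been called."""
    return case["tool"] in trace


SCORERS = [score_answer, score_tool_used]


def run_suite(agent):
    """Run every case through the agent and apply every scorer."""
    rows = []
    for case in CASES:
        answer, trace = agent(case["question"])
        checks = {scorer.__name__: scorer(case, answer, trace) for scorer in SCORERS}
        rows.append((case["question"], checks))
    return rows


def is_green(rows):
    """The scorecard is green only if every check on every case passed."""
    return all(all(checks.values()) for _, checks in rows)


for label, agent in [("good", good_agent), ("broken", broken_agent)]:
    rows = run_suite(agent)
    print(label, "GREEN" if is_green(rows) else "RED")
    for question, checks in rows:
        print("   ", question, checks)
```

Scoring the trace as well as the answer matters because an agent can reach the right answer the wrong way, for example by guessing instead of calling the tool. An answer-only scorer would let that through.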

If you get stuck

This module builds on M9 (the agent loop), M19 (the multiply agent we reuse), and M10 (evaluation ideas). You need your .env key only for live runs; the tracer and the rule-based evals run against a mock and cost nothing.
The whole toolkit is plain Python, with no new libraries to install. Nothing here can harm your computer.
If a trace looks wrong, that is the point: the trace is how you find the bug. Re-read the "You should now see" line under each step.

Optional challenge

Add a loop detector scorer: fail any run where the same tool is called more than three times in one trace (a classic runaway-agent smell). Then write a test case that triggers it. Catching loops before they cost real money is exactly what observability plus evaluation is for.
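
One possible starting point for the challenge is sketched below. It assumes a trace is a list of dicts with `kind` and `name` keys, which may not match the shape your own tracer produces, and `score_no_loop` and `MAX_CALLS_PER_TOOL` are names made up for this example.

```python
from collections import Counter

MAX_CALLS_PER_TOOL = 3  # more than this many calls to one tool counts as a loop


def score_no_loop(trace, limit=MAX_CALLS_PER_TOOL):
    """Return True (pass) unless any single tool is called more than `limit` times."""
    counts = Counter(step["name"] for step in trace if step["kind"] == "tool")
    return all(calls <= limit for calls in counts.values())


# A healthy run: one model call, then three tool calls (exactly at the limit).
healthy = [{"kind": "model", "name": "mock_model"}] + [
    {"kind": "tool", "name": "multiply"} for _ in range(3)
]

# A runaway run: the same tool is called four times.
runaway = [{"kind": "tool", "name": "multiply"} for _ in range(4)]

print("healthy passes:", score_no_loop(healthy))
print("runaway passes:", score_no_loop(runaway))
```

The scorer counts calls per tool name, so a run that calls three different tools twice each still passes. Only repeated calls to the same tool trip it.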