M20: Agent observability and evaluation (Part D: Agentic Systems)
You built agents (M9), orchestrated them (M18), and built them in eight frameworks (M19). One question decides whether any of that is safe to ship: how do you KNOW it works? An agent that takes its own steps can quietly call the wrong tool, loop forever, or give a confident wrong answer. Today you build the two tools that make agents shippable: observability (see every step it took) and evaluation (score whether each step was right), and you watch your evals catch a bug the moment it is introduced.
Today's win: a tracer that shows every model call and tool call your agent made, plus an eval scorecard that turns red the instant the agent regresses.
Today you will
- Add observability: record every step (model calls, tool calls, tokens, timing, errors) as a trace
- Read a trace to debug an agent that misbehaves, instead of guessing
- Write an evaluation harness: test cases plus scorers that check the answer AND the trace
- Catch a regression: break the agent on purpose and watch the scorecard fail
- Know the production tools (LangSmith, Langfuse, Arize Phoenix, OpenTelemetry) and what they record
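To make the first two goals concrete, here is a minimal sketch of a tracer in plain Python. It is not the lab's actual code: `Step`, `Tracer`, `record`, `mock_model`, and the token count are all hypothetical names and values chosen for illustration, and the model is a mock so nothing is billed.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Step:
    """One recorded step in a trace (hypothetical schema)."""
    kind: str                    # "model" or "tool"
    name: str                    # model name or tool name
    inputs: dict
    output: Any = None
    error: Optional[str] = None
    tokens: int = 0              # made-up count for the mock model
    seconds: float = 0.0


@dataclass
class Tracer:
    steps: list = field(default_factory=list)

    def record(self, kind, name, fn, tokens=0, **inputs):
        """Run fn(**inputs), time it, and append one Step to the trace."""
        step = Step(kind=kind, name=name, inputs=inputs, tokens=tokens)
        start = time.perf_counter()
        try:
            step.output = fn(**inputs)
            return step.output
        except Exception as exc:
            # Keep the failure in the trace, then let it propagate.
            step.error = f"{type(exc).__name__}: {exc}"
            raise
        finally:
            step.seconds = time.perf_counter() - start
            self.steps.append(step)


def mock_model(prompt):
    # Stand-in for a real LLM call: always asks for the multiply tool.
    return {"tool": "multiply", "args": {"a": 6, "b": 7}}


def multiply(a, b):
    return a * b


def run_agent(question, tracer):
    decision = tracer.record("model", "mock-model", mock_model,
                             tokens=12, prompt=question)
    result = tracer.record("tool", decision["tool"], multiply,
                           **decision["args"])
    return str(result)


tracer = Tracer()
answer = run_agent("What is 6 times 7?", tracer)
for i, s in enumerate(tracer.steps, 1):
    print(f"{i}. {s.kind:<5} {s.name:<10} in={s.inputs} "
          f"out={s.output} err={s.error}")
```

Reading the printed steps top to bottom is the debugging workflow in miniature: you see which tool the model chose, with which arguments, and what came back, rather than inferring it from the final answer.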
Run of show (about 60 minutes)
| Time | What we do |
|---|---|
| 0:00 | Hook: you cannot ship what you cannot see or measure |
| 0:05 | The one idea: observability tells you WHAT happened, evaluation tells you if it was RIGHT (read notes.md) |
| 0:12 | Lab Part A: instrument the agent and read its trace |
| 0:30 | Lab Part B: write an eval suite, then break the agent and catch the regression |
| 0:52 | Show: post a trace and a scorecard (one green, one red) |
| 1:00 | Wrap |
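As a rough preview of Lab Part B, the sketch below shows one way an eval suite can check both the answer and the trace, then catch a deliberately introduced bug. It assumes a simplified trace format (a list of dicts) and a mock multiply agent; `run_agent`, `CASES`, the scorer names, and `broken_multiply` are hypothetical and not taken from the lab materials.

```python
def multiply(a, b):
    return a * b


def broken_multiply(a, b):
    return a + b  # deliberate bug, used to simulate a regression


def run_agent(a, b, tool=multiply):
    """Mock agent: makes one tool call and returns (answer, trace)."""
    result = tool(a, b)
    trace = [{"kind": "tool", "name": "multiply",
              "inputs": {"a": a, "b": b}, "output": result}]
    return str(result), trace


# Test cases: inputs plus the answer we expect.
CASES = [
    {"a": 6, "b": 7, "expected": "42"},
    {"a": 3, "b": 5, "expected": "15"},
]


def score_answer(case, answer, trace):
    # Checks the final answer.
    return answer == case["expected"]


def score_used_tool(case, answer, trace):
    # Checks the trace: the agent must have called the multiply tool.
    return any(s["kind"] == "tool" and s["name"] == "multiply"
               for s in trace)


SCORERS = [score_answer, score_used_tool]


def run_suite(tool):
    """Run every case through every scorer; one result row per case."""
    rows = []
    for case in CASES:
        answer, trace = run_agent(case["a"], case["b"], tool=tool)
        rows.append({sc.__name__: sc(case, answer, trace)
                     for sc in SCORERS})
    return rows


def passed(rows):
    return all(all(row.values()) for row in rows)


good = run_suite(multiply)
bad = run_suite(broken_multiply)
print("baseline:", "GREEN" if passed(good) else "RED")
print("after bug:", "GREEN" if passed(bad) else "RED")
```

With these two cases the baseline prints GREEN and the broken run prints RED. Note that the broken run still passes `score_used_tool`: the agent called the right tool, the tool itself was wrong. Keeping scorers separate shows you which kind of failure you are looking at.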
If you get stuck
- Builds on M9 (the agent loop), M19 (the multiply agent we reuse), and M10 (evaluation ideas). Reuse your .env key only for live runs; the tracer and the rule-based evals run with a mock and cost nothing.
- The whole toolkit is plain Python, no new libraries to install. Nothing here can harm your computer.
- If a trace looks wrong, that is the point: the trace is how you find the bug. Re-read the "You should now see" line under each step.
Optional challenge
Add a loop detector scorer: fail any run where the same tool is called more than three times in one trace (a classic runaway-agent smell). Then write a test case that triggers it. Catching loops before they cost real money is exactly what observability plus evaluation is for.
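One possible starting point for the challenge is sketched below. It assumes a trace is a list of dicts with `kind` and `name` keys, which may differ from the format your own tracer produces; `score_no_loop` and `MAX_CALLS_PER_TOOL` are hypothetical names.

```python
from collections import Counter

MAX_CALLS_PER_TOOL = 3  # more than this many calls to one tool fails the run


def score_no_loop(trace, limit=MAX_CALLS_PER_TOOL):
    """Return (passed, offenders) for one trace.

    offenders maps each tool name called more than `limit` times
    to its call count.
    """
    counts = Counter(s["name"] for s in trace if s["kind"] == "tool")
    offenders = {name: n for name, n in counts.items() if n > limit}
    return (not offenders, offenders)


# A healthy run: one model call, one tool call.
healthy = [
    {"kind": "model", "name": "mock-model"},
    {"kind": "tool", "name": "multiply"},
]

# A test case that triggers the detector: multiply is called four times.
runaway = [{"kind": "tool", "name": "multiply"}] * 4 + [
    {"kind": "tool", "name": "search"},
]

print(score_no_loop(healthy))
print(score_no_loop(runaway))
```

Returning the offending tool names alongside the pass/fail flag makes the scorecard more useful: a red row then says which tool ran away, not just that something did.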