M20: Agent observability and evaluation (Part D: Agentic Systems)

You built agents (M9), orchestrated them (M18), and implemented them in eight frameworks (M19). One question decides whether any of that is safe to ship: how do you KNOW it works? An agent that chooses its own steps can quietly call the wrong tool, loop forever, or give a confident wrong answer. Today you build the two tools that make agents shippable: observability (see every step the agent took) and evaluation (score whether each step was right). Then you watch your evals catch a bug the moment it is introduced.

Today's win: a tracer that shows every model call and tool call your agent made, plus an eval scorecard that turns red the instant the agent regresses.

Today you will

  • Add observability: record every step (model calls, tool calls, tokens, timing, errors) as a trace
  • Read a trace to debug an agent that misbehaves, instead of guessing
  • Write an evaluation harness: test cases plus scorers that check the answer AND the trace
  • Catch a regression: break the agent on purpose and watch the scorecard fail
  • Know the production tools (LangSmith, Langfuse, Arize Phoenix, OpenTelemetry) and what they record
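The first goal in the list above, recording every step as a trace, can be sketched in plain Python. This is a minimal illustration, not the lab's actual tracer: the names Span, Tracer, record, mock_model, and run_agent are hypothetical, the "model" is a mock that always picks the multiply tool, and the token count is a rough word count standing in for real usage numbers.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Span:
    """One recorded step: a model call or a tool call."""
    kind: str                      # "model" or "tool"
    name: str
    inputs: Dict[str, Any]
    output: Any = None
    error: Optional[str] = None
    tokens: int = 0                # assumption: caller supplies a token estimate
    duration_ms: float = 0.0


@dataclass
class Tracer:
    spans: List[Span] = field(default_factory=list)

    def record(self, kind: str, name: str, fn: Callable, tokens: int = 0, **inputs):
        """Run fn(**inputs), time it, and append a Span whether it succeeds or fails."""
        span = Span(kind=kind, name=name, inputs=inputs, tokens=tokens)
        start = time.perf_counter()
        try:
            span.output = fn(**inputs)
            return span.output
        except Exception as exc:
            span.error = f"{type(exc).__name__}: {exc}"
            raise
        finally:
            span.duration_ms = (time.perf_counter() - start) * 1000
            self.spans.append(span)


def multiply(a: int, b: int) -> int:
    return a * b


def mock_model(prompt: str) -> dict:
    """Hypothetical stand-in for a model call: always asks for the multiply tool."""
    nums = [int(tok) for tok in prompt.replace("?", " ").split() if tok.isdigit()]
    return {"tool": "multiply", "args": {"a": nums[0], "b": nums[1]}}


def run_agent(question: str, tracer: Tracer) -> str:
    """One model step, then one tool step, both recorded in the trace."""
    decision = tracer.record("model", "mock_model", mock_model,
                             tokens=len(question.split()), prompt=question)
    result = tracer.record("tool", decision["tool"], multiply, **decision["args"])
    return str(result)


def print_trace(tracer: Tracer) -> None:
    for i, s in enumerate(tracer.spans, 1):
        status = f"ERROR {s.error}" if s.error else "ok"
        print(f"{i}. [{s.kind}] {s.name} {s.inputs} -> {s.output!r} "
              f"({s.duration_ms:.2f} ms, {s.tokens} tokens, {status})")


if __name__ == "__main__":
    tracer = Tracer()
    print("answer:", run_agent("What is 6 times 7?", tracer))
    print_trace(tracer)
```

Because the span is appended in the finally block, a step that raises still shows up in the trace with its error filled in, which is what makes the trace useful for debugging.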

Run of show (about 60 minutes)

Time  What we do
0:00  Hook: you cannot ship what you cannot see or measure
0:05  The one idea: observability tells you WHAT happened, evaluation tells you if it was RIGHT (read notes.md)
0:12  Lab Part A: instrument the agent and read its trace
0:30  Lab Part B: write an eval suite, then break the agent and catch the regression
0:52  Show: post a trace and a scorecard (one green, one red)
1:00  Wrap
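Lab Part B in the schedule above (write an eval suite, then break the agent) might look roughly like the sketch below. Everything here is an assumption made for illustration: the trace is a plain list of dicts, good_agent and broken_agent are mock agents that need no API key, and the scorer and case names are hypothetical rather than the lab's own.

```python
from typing import Callable, Dict, List, Tuple

# Assumed trace shape: one dict per step, e.g.
# {"kind": "tool", "name": "multiply", "args": {"a": 6, "b": 7}, "output": 42}
Trace = List[Dict]
Agent = Callable[[str], Tuple[str, Trace]]

CASES = [
    {"question": "What is 6 times 7?", "expected": "42", "expected_tool": "multiply"},
    {"question": "What is 3 times 5?", "expected": "15", "expected_tool": "multiply"},
    {"question": "What is 2 times 2?", "expected": "4", "expected_tool": "multiply"},
]


def _numbers(question: str) -> List[int]:
    return [int(t) for t in question.replace("?", " ").split() if t.isdigit()]


def good_agent(question: str) -> Tuple[str, Trace]:
    a, b = _numbers(question)
    trace = [{"kind": "tool", "name": "multiply", "args": {"a": a, "b": b}, "output": a * b}]
    return str(a * b), trace


def broken_agent(question: str) -> Tuple[str, Trace]:
    """Deliberate regression: calls an 'add' tool instead of 'multiply'."""
    a, b = _numbers(question)
    trace = [{"kind": "tool", "name": "add", "args": {"a": a, "b": b}, "output": a + b}]
    return str(a + b), trace


def score_answer(case: Dict, answer: str, trace: Trace) -> bool:
    """Checks the final answer only."""
    return answer.strip() == case["expected"]


def score_tool_used(case: Dict, answer: str, trace: Trace) -> bool:
    """Checks the trace: was the expected tool actually called?"""
    tools = [s["name"] for s in trace if s["kind"] == "tool"]
    return case["expected_tool"] in tools


SCORERS = [score_answer, score_tool_used]


def run_suite(agent: Agent) -> List[Dict]:
    """Return one row per (case, scorer) pair with a pass/fail flag."""
    rows = []
    for case in CASES:
        answer, trace = agent(case["question"])
        for scorer in SCORERS:
            rows.append({"question": case["question"],
                         "scorer": scorer.__name__,
                         "passed": scorer(case, answer, trace)})
    return rows


def print_scorecard(label: str, rows: List[Dict]) -> None:
    passed = sum(r["passed"] for r in rows)
    print(f"{label}: {passed}/{len(rows)} checks passed")
    for r in rows:
        print(f"  {'PASS' if r['passed'] else 'FAIL'}  {r['scorer']:<16} {r['question']}")


if __name__ == "__main__":
    print_scorecard("good agent", run_suite(good_agent))
    print_scorecard("broken agent", run_suite(broken_agent))
```

The "2 times 2" case is there on purpose: the broken agent still answers 4, so an answer-only check passes and only the trace check catches the wrong tool. That is the reason to score the answer AND the trace.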

If you get stuck

  • Builds on M9 (the agent loop), M19 (the multiply agent we reuse), and M10 (evaluation ideas). Reuse your .env key only for live runs; the tracer and the rule-based evals run with a mock and cost nothing.
  • The whole toolkit is plain Python, with no new libraries to install. Nothing here can harm your computer.
  • If a trace looks wrong, that is the point: the trace is how you find the bug. Re-read the "You should now see" line under each step.

Optional challenge

Add a loop detector scorer: fail any run where the same tool is called more than three times in one trace (a classic runaway-agent smell). Then write a test case that triggers it. Catching loops before they cost real money is exactly what observability plus evaluation is for.
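One possible starting point for the challenge is sketched below. It assumes a trace is a list of dicts with "kind" and "name" keys; that shape, and the names score_no_loop and MAX_CALLS_PER_TOOL, are hypothetical, so adapt them to whatever your own tracer records.

```python
from collections import Counter
from typing import Dict, List, Tuple

MAX_CALLS_PER_TOOL = 3  # from the challenge: more than three calls to one tool fails


def score_no_loop(trace: List[Dict],
                  max_calls: int = MAX_CALLS_PER_TOOL) -> Tuple[bool, Dict[str, int]]:
    """Return (passed, offenders).

    offenders maps each tool called more than max_calls times to its call count.
    Assumes each step is a dict with "kind" and "name" keys.
    """
    counts = Counter(s["name"] for s in trace if s.get("kind") == "tool")
    offenders = {name: n for name, n in counts.items() if n > max_calls}
    return (not offenders, offenders)


if __name__ == "__main__":
    # A test case that triggers the detector: the same tool called four times.
    runaway = [{"kind": "tool", "name": "multiply"} for _ in range(4)]
    passed, offenders = score_no_loop(runaway)
    print("PASS" if passed else f"FAIL: {offenders}")
```

Returning the offending tools alongside the pass/fail flag keeps the scorecard readable: a red row can say which tool looped and how many times.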