M20 notes: Agent observability and evaluation (the one idea)

The one idea: an agent decides its own steps, so you cannot trust it just because it ran without crashing. You need two things. Observability records WHAT the agent did (every model call, every tool call, with inputs, outputs, tokens, timing, errors). Evaluation scores whether that was RIGHT (did it call the correct tool, with the correct arguments, and reach the correct answer, within budget). Observability is your eyes; evaluation is your tests.

1. Why a normal app is easier than an agent

A normal program follows the same code path for the same input, so a handful of unit tests can cover it. An agent is different: the model chooses, on each run, whether to call a tool, which tool, and with what arguments. That freedom is the point of an agent, and it is also why an agent can fail in ways a normal app cannot:

it calls the wrong tool, or the right tool with wrong arguments,
it loops, calling the same tool over and over and burning tokens,
it gives a fluent, confident answer that is simply wrong,
it silently swallows a tool error and keeps going.

None of these raises a Python exception. The program "works". Only observability and evaluation reveal that it did the wrong thing.

Analogy. A trace is the flight recorder (the black box). After any flight, good or bad, you can replay exactly what happened. An eval suite is the pre-flight checklist: a fixed set of checks you run every time before you let the thing fly.

2. Observability: the trace

A trace is the record of one agent run. It is made of spans, one per step. In tracer.py a span records:

Field	What it tells you
kind	"model" (a call to the LLM) or "tool" (a call to one of your functions)
name	which model, or which tool
inputs	what went in (the messages, or the tool arguments)
output	what came back (stop reason, or the tool result)
tokens	how many output tokens the model call used (your main cost signal; providers typically bill input tokens too)
duration	how long the step took (this is your latency)
status	"ok" or "error"
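To make the table concrete, here is a minimal sketch of a span and a tracer. It is not the code in tracer.py: the class names Span and Tracer, the span() context manager, and the divide example are assumptions made for illustration. The field names follow the table above.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Span:
    kind: str              # "model" or "tool"
    name: str              # which model, or which tool
    inputs: Any            # the messages, or the tool arguments
    output: Any = None     # stop reason, or the tool result
    tokens: int = 0        # output tokens (only meaningful for model spans)
    duration: float = 0.0  # seconds
    status: str = "ok"     # "ok" or "error"


@dataclass
class Tracer:
    """Hypothetical tracer: collects the spans of one agent run."""
    spans: list[Span] = field(default_factory=list)

    @contextmanager
    def span(self, kind: str, name: str, inputs: Any):
        s = Span(kind=kind, name=name, inputs=inputs)
        start = time.perf_counter()
        try:
            yield s                      # the caller fills in s.output
        except Exception as exc:
            s.status = "error"           # record the failure on the span...
            s.output = repr(exc)
            raise                        # ...and let the caller decide what to do
        finally:
            s.duration = time.perf_counter() - start
            self.spans.append(s)


def multiply(a: int, b: int) -> int:
    return a * b


tracer = Tracer()

# A successful tool call.
with tracer.span("tool", "multiply", {"a": 23, "b": 17}) as s:
    s.output = multiply(**s.inputs)

# A failing tool call: the exception still propagates, but the span keeps a record.
try:
    with tracer.span("tool", "divide", {"a": 1, "b": 0}) as s:
        s.output = 1 / 0
except ZeroDivisionError:
    pass
```

The design choice worth noticing is the finally block: the span is appended whether the step succeeded or failed, so a swallowed tool error still leaves evidence in the trace.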

Read top to bottom and you can see the agent think: model call returns "tool_use", tool runs and returns a value, model call returns the final answer. When something is wrong, the trace shows you exactly which step, instead of leaving you to guess from a single final string.
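A small sketch of what "read top to bottom" looks like in practice. The spans here are plain dicts with made-up token counts and durations, the model name "example-model" is a placeholder, and "end_turn" is an assumed label for the final stop reason; render_trace is a hypothetical helper, not part of tracer.py.

```python
def render_trace(spans: list[dict]) -> str:
    """Format one line per span, in the order the steps happened."""
    lines = []
    for i, s in enumerate(spans, start=1):
        lines.append(
            f"{i}. [{s['kind']}] {s['name']} -> {s['output']!r} "
            f"({s['tokens']} tok, {s['duration']:.2f}s, {s['status']})"
        )
    return "\n".join(lines)


# Illustrative trace of the three-step run described above (values are invented).
trace = [
    {"kind": "model", "name": "example-model", "output": "tool_use",
     "tokens": 12, "duration": 0.8, "status": "ok"},
    {"kind": "tool", "name": "multiply", "output": 391,
     "tokens": 0, "duration": 0.0, "status": "ok"},
    {"kind": "model", "name": "example-model", "output": "end_turn",
     "tokens": 9, "duration": 0.6, "status": "ok"},
]

print(render_trace(trace))
# 1. [model] example-model -> 'tool_use' (12 tok, 0.80s, ok)
# 2. [tool] multiply -> 391 (0 tok, 0.00s, ok)
# 3. [model] example-model -> 'end_turn' (9 tok, 0.60s, ok)
```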

This is the same kind of record that production tools keep for you: LangSmith, Langfuse, Arize Phoenix, Weights & Biases Weave, and the open standard OpenTelemetry (through its GenAI semantic conventions for spans). They add dashboards, search, and team sharing on top, but the unit is always the same: spans in a trace. Build it once by hand, as we do here, and those tools stop being mysterious.

3. Evaluation: test cases plus scorers

An eval is a test for an agent. You need two pieces:

A dataset of cases. Each case is an input plus what a correct run looks like (sometimes called a "golden" set). In evals.py: the task, the expected substring in the answer, the tool you expect it to call, the arguments you expect, and a step budget.
Scorers. Each scorer looks at the answer and/or the trace and returns pass or fail. Our scorers:
answer_contains: is the right number in the final answer?
called_tool: did it actually use the tool (checked against the trace)?
tool_args: did it call the tool with the right arguments?
no_errors: did every step finish without an error?
within_budget: did it stay within the step budget, i.e. the cap on model calls (catches loops and waste)?
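The five scorers listed above can be sketched as small functions over the answer and the trace. The scorer names come from these notes, but the shared (answer, trace, case) signature, the case field names, and the dict-shaped spans are assumptions for illustration, not the actual evals.py.

```python
def answer_contains(answer: str, trace: list[dict], case: dict) -> bool:
    return case["expected_substring"] in answer


def called_tool(answer: str, trace: list[dict], case: dict) -> bool:
    return any(s["kind"] == "tool" and s["name"] == case["expected_tool"]
               for s in trace)


def tool_args(answer: str, trace: list[dict], case: dict) -> bool:
    return any(s["kind"] == "tool"
               and s["name"] == case["expected_tool"]
               and s["inputs"] == case["expected_args"]
               for s in trace)


def no_errors(answer: str, trace: list[dict], case: dict) -> bool:
    return all(s["status"] == "ok" for s in trace)


def within_budget(answer: str, trace: list[dict], case: dict) -> bool:
    model_calls = sum(1 for s in trace if s["kind"] == "model")
    return model_calls <= case["max_model_calls"]


SCORERS = [answer_contains, called_tool, tool_args, no_errors, within_budget]


def score(answer: str, trace: list[dict], case: dict) -> dict[str, bool]:
    """Run every scorer and return a scorecard keyed by scorer name."""
    return {fn.__name__: fn(answer, trace, case) for fn in SCORERS}


# One hypothetical case from the dataset.
case = {
    "task": "What is 23 times 17?",
    "expected_substring": "391",
    "expected_tool": "multiply",
    "expected_args": {"a": 23, "b": 17},
    "max_model_calls": 3,
}

# A correct run: model asks for the tool, the tool runs, model answers.
good_trace = [
    {"kind": "model", "name": "example-model", "inputs": [], "status": "ok"},
    {"kind": "tool", "name": "multiply", "inputs": {"a": 23, "b": 17}, "status": "ok"},
    {"kind": "model", "name": "example-model", "inputs": [], "status": "ok"},
]

# A run that produced the right number without ever calling the tool.
guessed_trace = [
    {"kind": "model", "name": "example-model", "inputs": [], "status": "ok"},
]

good = score("23 times 17 is 391.", good_trace, case)
guessed = score("23 times 17 is 391.", guessed_trace, case)
```

Both runs produce the same final string, yet the second scorecard fails called_tool and tool_args, which is only visible because the scorers read the trace.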

Notice that scoring uses the trace, not just the final answer. "Did it get 391?" is not enough; "did it get 391 BY calling multiply(23, 17)?" is the real question. An agent that guesses the right answer without using its tool is still broken, and only the trace reveals it.

Rule-based vs LLM-as-judge

Our scorers are rule-based: deterministic, instant, free, and they never drift. Use them whenever the right answer is checkable (a number, a tool call, a JSON field). For open-ended answers (a summary, an explanation) there is no exact string to match, so teams use an LLM-as-judge: a second model call that grades the answer against a rubric and replies PASS or FAIL (score_llm_judge in evals.py). It is powerful but it costs tokens and can itself be wrong, so prefer rule-based checks first and reserve the judge for what rules cannot capture.
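A rough sketch of the judge pattern, under stated assumptions: this is not score_llm_judge from evals.py. The prompt template, judge_answer, parse_verdict, and the ask_model callable are all hypothetical names, and fake_judge is a keyword stand-in used so the example runs without calling any real model.

```python
from typing import Callable

JUDGE_TEMPLATE = """You are grading an answer against a rubric.
Rubric: {rubric}
Answer: {answer}
Reply with exactly one word: PASS or FAIL."""


def parse_verdict(reply: str) -> bool:
    """True only if the first word of the reply is PASS; anything else fails."""
    words = reply.strip().upper().split()
    return bool(words) and words[0].strip(".,:!") == "PASS"


def judge_answer(answer: str, rubric: str, ask_model: Callable[[str], str]) -> bool:
    """Build the grading prompt, send it through ask_model, parse the verdict."""
    prompt = JUDGE_TEMPLATE.format(rubric=rubric, answer=answer)
    return parse_verdict(ask_model(prompt))


def fake_judge(prompt: str) -> str:
    # Stand-in for a real model call: a crude keyword check, for demonstration only.
    return "PASS" if "record of one agent run" in prompt else "FAIL"


rubric = "Explains what a trace is in one sentence."
good = judge_answer("A trace is the record of one agent run, made of spans.",
                    rubric, fake_judge)
bad = judge_answer("Traces are nice.", rubric, fake_judge)
```

Treating any reply that does not start with PASS as a failure is a deliberately conservative choice: a judge that rambles or answers ambiguously should not be able to wave an answer through.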

4. The payoff: catching regressions

The reason to write evals is the same reason you write unit tests: so that a future change does not silently break something that used to work. In the lab you run the suite (all green), then change the multiply tool to ADD instead of multiply, and run the suite again. It turns red immediately, and the scorecard tells you which check failed and why. That is a regression caught before it ever reached a user. Run your eval suite on every change to your prompt, tools, model, or framework.
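The lab exercise can be imitated in a few lines. This sketch assumes a scripted stand-in for the agent (run_agent always calls the multiply tool with 23 and 17) and a two-check suite; none of the names below are taken from the lab code.

```python
def multiply(a: int, b: int) -> int:
    return a * b


def broken_multiply(a: int, b: int) -> int:
    return a + b  # the regression: ADD instead of multiply


def run_agent(task: str, tools: dict) -> tuple[str, list[dict]]:
    """Scripted stand-in for an agent: always calls multiply(23, 17)."""
    args = {"a": 23, "b": 17}
    result = tools["multiply"](**args)
    trace = [
        {"kind": "model", "name": "scripted", "output": "tool_use"},
        {"kind": "tool", "name": "multiply", "inputs": args, "output": result},
        {"kind": "model", "name": "scripted", "output": "end_turn"},
    ]
    return f"The answer is {result}.", trace


CASE = {
    "task": "What is 23 times 17?",
    "expected_substring": "391",
    "expected_tool": "multiply",
}


def run_suite(tools: dict) -> dict[str, bool]:
    """Run the single case and return its scorecard."""
    answer, trace = run_agent(CASE["task"], tools)
    return {
        "answer_contains": CASE["expected_substring"] in answer,
        "called_tool": any(s["kind"] == "tool" and s["name"] == CASE["expected_tool"]
                           for s in trace),
    }


before = run_suite({"multiply": multiply})
after = run_suite({"multiply": broken_multiply})

print(before)  # {'answer_contains': True, 'called_tool': True}
print(after)   # {'answer_contains': False, 'called_tool': True}
```

The second scorecard localizes the fault: the agent still called the right tool, so the problem lies in what the tool returned, not in the agent's decision.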

5. What to measure in production

Beyond pass/fail, the trace already carries the numbers you report on a dashboard:

Cost: total tokens per run (sum of the model spans). Watch for it trending upward.
Latency: total duration. Users feel this.
Error rate: fraction of runs with an errored span.
Tool usage: which tools fire, how often, and whether any run loops.
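These four numbers can be computed directly from stored traces. A minimal sketch, assuming each run is a list of dict-shaped spans with the fields from section 2; summarize and its output keys are hypothetical, and the sample values are invented.

```python
from collections import Counter


def summarize(runs: list[list[dict]]) -> dict:
    """Aggregate cost, latency, error rate, and tool usage over many runs."""
    if not runs:
        raise ValueError("need at least one run to summarize")
    n = len(runs)
    tokens_per_run = [sum(s["tokens"] for s in run if s["kind"] == "model")
                      for run in runs]
    latency_per_run = [sum(s["duration"] for s in run) for run in runs]
    errored_runs = sum(1 for run in runs
                       if any(s["status"] == "error" for s in run))
    tool_counts = Counter(s["name"] for run in runs for s in run
                          if s["kind"] == "tool")
    # A very high tool-call count within one run is a hint that it looped.
    max_tool_calls = max(sum(1 for s in run if s["kind"] == "tool")
                         for run in runs)
    return {
        "avg_tokens": sum(tokens_per_run) / n,
        "avg_latency_s": sum(latency_per_run) / n,
        "error_rate": errored_runs / n,
        "tool_usage": dict(tool_counts),
        "max_tool_calls_in_a_run": max_tool_calls,
    }


# Two invented runs: one clean, one where the tool call errored.
runs = [
    [
        {"kind": "model", "name": "example-model", "tokens": 12, "duration": 0.5, "status": "ok"},
        {"kind": "tool", "name": "multiply", "tokens": 0, "duration": 0.25, "status": "ok"},
        {"kind": "model", "name": "example-model", "tokens": 8, "duration": 0.25, "status": "ok"},
    ],
    [
        {"kind": "model", "name": "example-model", "tokens": 10, "duration": 0.5, "status": "ok"},
        {"kind": "tool", "name": "multiply", "tokens": 0, "duration": 0.5, "status": "error"},
    ],
]

summary = summarize(runs)
```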

Add human-in-the-loop review (M14) for a sample of real traces, because no automated scorer catches everything. Observability plus evaluation plus a human spot-check is the shippable combination.

Words you will hear

Observability, trace, span, evaluation (eval), scorer / metric, golden dataset, regression, LLM-as-judge, cost / latency / error rate, OpenTelemetry. Full definitions in the glossary.