M25 notes: Cost and performance optimization (the one idea)

The one idea: an agent's bill is just tokens times price, summed over every call it makes. So you lower it two ways: send fewer tokens (cache what repeats, trim what you do not need) and pay a lower price per token (route easy steps to a cheaper model). Latency follows the same shape. M20 showed you the numbers; this is how you bring them down, and how to be sure you did not break quality doing it.

1. Where the money actually goes

Per call you pay for input tokens (what you send) and output tokens (what the model generates), at a per-model price. An agent multiplies this by many calls. In the lab's 5-step pipeline, the naive cost is dominated by two things:

the 2,000-token prefix (system instructions plus retrieved context) resent on every call, and
the output tokens of the hard steps (the actual reasoning and the written reply).

Seeing this split is the whole game: it tells you which lever helps. Repeated input is wasteful and very cuttable. Reasoning output is the thing of value and is expensive to cut without hurting quality. Good optimization attacks the waste and leaves the value alone.
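Here is a minimal sketch of that split in Python. Only the 2,000-token prefix and the 5 calls come from these notes. The prices, the step names, and the output sizes are placeholders made up for illustration, and per-step input beyond the shared prefix is ignored to keep the picture simple.

```python
from dataclasses import dataclass

# Placeholder prices in dollars per 1M tokens (assumed for illustration,
# not real rates and not the lab's numbers).
PRICE_IN_PER_M = 5.00
PRICE_OUT_PER_M = 25.00

PREFIX_TOKENS = 2_000  # from the notes: resent on every call


@dataclass
class Call:
    name: str            # hypothetical step name
    output_tokens: int   # hypothetical output size for that step


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one call: tokens times price, input plus output."""
    return (input_tokens * PRICE_IN_PER_M + output_tokens * PRICE_OUT_PER_M) / 1_000_000


def cost_split(calls: list[Call]) -> dict[str, float]:
    """Split the naive pipeline bill into repeated-prefix input and output."""
    prefix = sum(call_cost(PREFIX_TOKENS, 0) for _ in calls)
    output = sum(call_cost(0, c.output_tokens) for c in calls)
    return {"prefix_input": prefix, "output": output, "total": prefix + output}


pipeline = [
    Call("classify_intent", 10),
    Call("detect_language", 10),
    Call("extract_fields", 50),
    Call("reason", 500),
    Call("write_reply", 300),
]

split = cost_split(pipeline)
for part, dollars in split.items():
    print(f"{part:>12}: ${dollars:.5f}")
```

With these placeholder numbers the repeated prefix is the larger share of the bill, and nearly all of the output cost sits in the two hard steps. Swap in your own measured token counts from M20 to see your real split.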

Analogy. A delivery company cuts fuel cost by routing better and not driving empty trucks (the waste), not by leaving the packages behind (the value). Caching and routing are the empty-truck fixes.

2. Lever 1: prompt caching (stop paying for what repeats)

If many calls share a long, identical prefix (instructions, examples, a retrieved document), you should not pay full input price for it every time. Prompt caching writes that prefix once at a small premium (about 1.25x the normal input price) and then reads it on later calls for about a tenth of the normal input price. In the lab, caching the 2,000-token prefix across the 5 calls cuts cost roughly in half on its own, and changes nothing about the output. It is the highest-leverage, lowest-risk optimization for agents, because agents reuse the same system prompt and context constantly. (Caching does little for latency here, though: with a prefix this short, most of the time goes to generating output.)
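The arithmetic behind that claim can be sketched in a few lines. It uses only the figures above (2,000 tokens, 5 calls, about 1.25x to write, about 0.1x to read) and counts in "input-token equivalents" so no price is needed. It assumes every call after the first hits the cache, which in practice requires the prefix to stay byte-identical and the cache entry not to expire between calls.

```python
PREFIX_TOKENS = 2_000     # shared prefix from the lab
CALLS = 5                 # calls in the pipeline
WRITE_MULTIPLIER = 1.25   # first call writes the cache (about 1.25x input price)
READ_MULTIPLIER = 0.10    # later calls read it (about a tenth of input price)


def prefix_cost_units(calls: int, cached: bool) -> float:
    """Prefix cost in input-token equivalents (1 unit = 1 token at normal input price)."""
    if not cached or calls == 0:
        return float(PREFIX_TOKENS * calls)
    write = PREFIX_TOKENS * WRITE_MULTIPLIER
    # Assumption: every call after the first is a cache hit.
    reads = PREFIX_TOKENS * READ_MULTIPLIER * (calls - 1)
    return write + reads


naive = prefix_cost_units(CALLS, cached=False)
cached = prefix_cost_units(CALLS, cached=True)
print(f"naive prefix cost:  {naive:.0f} units")
print(f"cached prefix cost: {cached:.0f} units")
print(f"prefix saving:      {1 - cached / naive:.0%}")
```

The prefix alone drops from 10,000 units to about 3,300, roughly two thirds off. The whole bill falls by less than that because output tokens are untouched. Note also that for a single call, caching costs more than not caching (2,500 units against 2,000): it only pays off when the prefix is actually reused.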

3. Lever 2: model routing (right-size the model per step)

Not every step needs your strongest model. Classifying intent or detecting language is easy; a small fast model (Haiku, about 5x cheaper than Opus on input) does it well. Deep reasoning and the final written reply are hard; keep those on Opus. Routing sends each step to the cheapest model that can do it. In the lab this cuts cost and, because the small model is faster, cuts latency too (the pipeline drops from about 18 seconds to about 11). The danger is obvious: route a HARD step to a weak model and quality falls. So you route by difficulty, conservatively, and you verify (section 6).
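One conservative way to express "route by difficulty" is an allowlist: a step goes to the small model only if it has been explicitly marked easy, and everything else (including steps nobody has reviewed yet) stays on the strong model. This is a sketch, not the lab's router; the step names and model labels are hypothetical placeholders.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small-fast-model"   # placeholder label, e.g. a Haiku-class model
    LARGE = "strong-model"       # placeholder label, e.g. an Opus-class model


# Only steps that have been checked and found easy belong here
# (step names are hypothetical).
EASY_STEPS = {"classify_intent", "detect_language"}


def route(step: str) -> Tier:
    """Send verified-easy steps to the small model; default everything else to the strong one."""
    return Tier.SMALL if step in EASY_STEPS else Tier.LARGE


for step in ["classify_intent", "detect_language", "reason", "write_reply", "new_unreviewed_step"]:
    print(f"{step:>20} -> {route(step).value}")
```

Defaulting to the strong model means a mistake in the routing table costs you money rather than quality, which is the safer direction to be wrong in.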

4. Lever 3: trim tokens you do not need

Every token in the prompt is paid for on every call, so padding compounds. Trim retrieved context to what is relevant (M21's budgeting), drop dead examples, ask for shorter output when you can, and use structured output (M6) instead of verbose prose. Trimming is unglamorous but it stacks with the others.
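As one illustration of trimming retrieved context, the sketch below keeps the most relevant chunks that fit a token budget. It is not necessarily how M21 does its budgeting. The `relevance` score is assumed to come from your retriever, and `estimate_tokens` is a crude stand-in (about 4 characters per token) that you would replace with a real tokenizer count.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    relevance: float  # hypothetical retriever score, higher is better


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: about 4 characters per token."""
    return max(1, len(text) // 4)


def trim_to_budget(chunks: list[Chunk], budget_tokens: int) -> list[Chunk]:
    """Keep the most relevant chunks that fit in the budget, skipping any that do not."""
    kept: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.relevance, reverse=True):
        cost = estimate_tokens(chunk.text)
        if used + cost <= budget_tokens:
            kept.append(chunk)
            used += cost
    return kept


chunks = [
    Chunk("Refund policy: items can be returned within 30 days.", 0.92),
    Chunk("Company history and founding story. " * 10, 0.15),
    Chunk("Refunds are issued to the original payment method.", 0.88),
]
kept = trim_to_budget(chunks, budget_tokens=40)
print([c.relevance for c in kept])
```

Here the long, low-relevance chunk is dropped and the two short, relevant ones are kept. Because the trimmed context is sent on every call, the saving repeats on each step.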

5. What you cannot cheaply cut

The hard steps' OUTPUT tokens are the reasoning you are paying the model to do. Caching does not touch output, and routing them to a weak model is exactly the mistake to avoid. This is why, in the lab, "both levers" is only a little better than "cache only": once you remove the repeated input, what is left is real work. Knowing what is irreducible keeps you from chasing savings that cost you quality.
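To see why the second lever adds so little once the first is in place, here is a toy model of the four combinations. Everything numeric except the 2,000-token prefix, the 5 steps, and the cache multipliers is a placeholder invented for illustration, so the exact figures will differ from the lab's. The sketch also assumes a cache entry is not shared between models, so each model pays its own cache write.

```python
# All prices and output sizes below are placeholders for illustration,
# not real rates and not the lab's numbers.
PRICES = {  # dollars per 1M tokens: (input, output)
    "large": (5.00, 25.00),
    "small": (1.00, 5.00),  # about 5x cheaper, as in section 3
}
PREFIX = 2_000
WRITE, READ = 1.25, 0.10  # cache write / read multipliers on input price

# (step name, is_hard, output tokens); names and sizes are hypothetical
STEPS = [
    ("classify_intent", False, 10),
    ("detect_language", False, 10),
    ("extract_fields", False, 50),
    ("reason", True, 500),
    ("write_reply", True, 300),
]


def pipeline_cost(cache: bool, route: bool) -> float:
    """Total dollars for the 5 steps under a given combination of levers."""
    total = 0.0
    seen_models = set()  # assumption: caches are per model, so each pays one write
    for _name, is_hard, out_tokens in STEPS:
        model = "large" if (is_hard or not route) else "small"
        p_in, p_out = PRICES[model]
        if not cache:
            multiplier = 1.0
        elif model in seen_models:
            multiplier = READ
        else:
            multiplier = WRITE
            seen_models.add(model)
        total += (PREFIX * multiplier * p_in + out_tokens * p_out) / 1_000_000
    return total


def hard_output_floor() -> float:
    """Output cost of the hard steps on the large model: the part neither lever touches."""
    return sum(out * PRICES["large"][1] for _n, hard, out in STEPS if hard) / 1_000_000


for label, cache, route in [
    ("naive", False, False),
    ("route only", False, True),
    ("cache only", True, False),
    ("both levers", True, True),
]:
    print(f"{label:>12}: ${pipeline_cost(cache, route):.5f}")
print(f"{'floor':>12}: ${hard_output_floor():.5f}")
```

With these placeholders, "cache only" already removes most of the waste, "both levers" shaves off only a little more, and the floor (the hard steps' output) ends up as more than half of the optimized bill. That floor is the real work you are paying for.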

6. Optimization is a bet on quality, so measure it

Every change here is a hypothesis: "this will be cheaper AND just as good." The first half you estimate (this module); the second half you must verify, with the eval suite from M20. Route a step to Haiku, then run your evals: if the scorecard stays green, keep it; if it drops, route that step back to Opus. Never ship a cost optimization on faith. Cheaper-but-wrong is not cheaper; it is broken.
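The keep-or-revert rule can be written down as a small gate. In this sketch, `run_evals` is a hypothetical stand-in for the M20 eval suite, and the scores and configuration names are invented so the example runs on its own. A real version would run the pipeline against your eval set.

```python
from typing import Callable

# Invented eval scores per configuration (fraction of eval cases passed),
# standing in for a real run of the M20 eval suite.
FAKE_SCORES = {
    "all_opus": 0.96,
    "classify_on_haiku": 0.96,
    "reply_on_haiku": 0.81,
}


def run_evals(config: str) -> float:
    """Placeholder: a real version would run the pipeline on the eval set."""
    return FAKE_SCORES[config]


def choose_config(
    baseline: str,
    candidate: str,
    evals: Callable[[str], float] = run_evals,
    max_drop: float = 0.0,
) -> str:
    """Keep the cheaper candidate only if its score stays within max_drop of the baseline."""
    if evals(candidate) >= evals(baseline) - max_drop:
        return candidate
    return baseline


print(choose_config("all_opus", "classify_on_haiku"))  # score holds: keep the cheaper routing
print(choose_config("all_opus", "reply_on_haiku"))     # score drops: route back to the baseline
```

The `max_drop` parameter makes the tolerance explicit. Leaving it at 0.0 means any regression sends the step back to the stronger model; if you loosen it, do so deliberately and write down why.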

7. A quick word on latency and perceived speed

Cost and latency are related but not the same. Routing to a faster model cuts real latency. Streaming does not cut total time but dramatically improves perceived speed: the user sees the first words almost immediately instead of staring at a blank screen (you used streaming in M6). For anything user-facing, stream. For non-urgent bulk work where throughput matters more than response time, the Batch API trades latency for cost: results come back asynchronously instead of immediately, at about half the price (the lab challenge).
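The difference between total time and perceived speed is easy to measure: record when the first chunk arrives and when the last one does. The sketch below uses a fake stream (a generator that sleeps between chunks) in place of a real model response, so the timings are simulated and only the measuring pattern carries over.

```python
import time
from typing import Iterable, Iterator


def fake_model_stream(chunks: list[str], delay_s: float) -> Iterator[str]:
    """Stand-in for a streaming model response: one chunk every delay_s seconds."""
    for chunk in chunks:
        time.sleep(delay_s)
        yield chunk


def measure(stream: Iterable[str]) -> tuple[float, float, str]:
    """Return (time to first chunk, total time, full text) for a stream."""
    start = time.perf_counter()
    first = None
    parts = []
    for chunk in stream:
        if first is None:
            first = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    return (first if first is not None else total), total, "".join(parts)


words = ["Thanks ", "for ", "reaching ", "out. ", "Your ", "refund ", "is ", "on ", "its ", "way."]
ttft, total, text = measure(fake_model_stream(words, delay_s=0.02))
print(f"first words after {ttft:.2f}s, full reply after {total:.2f}s")
```

Without streaming the user waits the full `total` before seeing anything. With streaming they wait only the time to the first chunk, even though the last word arrives at the same moment either way.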

Words you will hear

Tokens (input vs output), prompt caching (write vs read), model routing, token trimming, latency vs perceived latency, streaming (M6), Batch API, cost per 1M tokens. Full definitions in the glossary.