Lab M25: cut an agent's cost in half (and prove it)

You'll need: your venv. The core lab needs no API key and costs nothing (it is a pure estimator). A live token-measurement step at the end is optional. Time: about 40 minutes. Work in your breakout pair.

Heads up: we estimate dollars and latency from token counts, then watch caching, routing, and trimming bring them down. No model is called for the core lab, so there is nothing to spend and nothing that can harm your computer. Prices are illustrative; check current pricing for real numbers.

This lab has two parts:

- Part A: measure the naive pipeline, then add prompt caching.
- Part B: add routing and trimming, read the savings table, and learn what you cannot cut.

flowchart LR
  N["naive\nall Opus, no cache"] -->|cache the prefix| C["cache only\n~50% off"]
  N -->|route easy to Haiku| R["route only\n~38% off, faster"]
  C --> B["both\n~52% off, faster"]
  R --> B

Part A: measure, then cache

Step 1: Set up

Copy the solution/ files into a folder. Activate your venv. No installs, no key.

python -c "print('ready')"

You should now see: ready.

Step 2: Run the estimator

python demo_mock.py

You should now see a table like this (numbers are deterministic):

config                               $/run     $/10k  latency  savings
naive (all Opus, no cache)         0.06685       669    18.5s   0%
cache only (all Opus, cached)      0.03335       334    18.5s  50%
route only (Haiku for easy)        0.04177       418    11.5s  38%
both (route + cache)               0.03217       322    11.5s  52%

You should now see: the naive pipeline costs about 669 dollars per 10,000 conversations. That is the number we are going to cut.
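If you want to check the savings column yourself, it is each config's $/run compared against the naive $/run. Here is a minimal sketch using the numbers copied from the table. The dictionary and function names are made up for illustration, and demo_mock.py may compute this differently.

```python
# $/run values copied from the demo table (illustrative prices).
PER_RUN = {
    "naive": 0.06685,
    "cache only": 0.03335,
    "route only": 0.04177,
    "both": 0.03217,
}

def savings_vs_naive(config):
    """Fraction of the naive cost that this config removes."""
    return 1 - PER_RUN[config] / PER_RUN["naive"]

for name in PER_RUN:
    # Prints 0%, 50%, 38%, 52%, matching the savings column.
    print(f"{name:12s} {savings_vs_naive(name):.0%}")
```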

Step 3: Read what caching did

Compare the naive and cache only rows: same models, but caching the 2,000-token prefix roughly halves the cost (669 to 334) and changes nothing about the answers. Open pricing.py and read the dollars function: a cached read costs about a tenth of normal input.

You should now see: in this estimator, caching cuts cost sharply but leaves latency unchanged (still 18.5s). It removes repeated input you were paying for over and over.
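To see why a cached prefix saves so much, here is a rough sketch of the arithmetic. Only the "about a tenth" read price and the 2,000-token prefix come from this lab. The $5 per million input price is inferred from the Step 4 numbers, and the 1.25x cache-write premium and the 5 calls per run are assumptions for illustration, so treat this as a model of the idea and not as what pricing.py actually does.

```python
IN_PER_MTOK = 5.00       # assumed Opus input price, dollars per million tokens
CACHE_WRITE_MULT = 1.25  # assumption: first call pays a premium to write the cache
CACHE_READ_MULT = 0.10   # from the lab: a cached read is about a tenth of normal input

def prefix_dollars(prefix_tokens, calls, cached):
    """Cost of sending the same prefix on every call in one run."""
    full = prefix_tokens / 1e6 * IN_PER_MTOK
    if not cached:
        return full * calls
    # One cache write, then cheap reads on the remaining calls.
    return full * CACHE_WRITE_MULT + full * CACHE_READ_MULT * (calls - 1)

uncached = prefix_dollars(2000, calls=5, cached=False)
cached = prefix_dollars(2000, calls=5, cached=True)
print(round(uncached, 5), round(cached, 5), round(uncached - cached, 5))
# -> 0.05 0.0165 0.0335
```

Under these assumptions the saving is 0.0335 per run, the same as the gap between the naive and cache only rows in the table (0.06685 minus 0.03335).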

Step 4: Find where the money goes

From your terminal, price the most expensive single step:

python -c "import pricing; print(round(pricing.dollars('claude-opus-4-8', in_tokens=2200, out_tokens=300),5))"

You should now see: about 0.0185, the "decide resolution" step. Its 300 OUTPUT tokens cost 300/1e6*25 = 0.0075 that no cache can remove. Output on hard steps is the cost you cannot cheaply cut, because it is the actual reasoning.
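The split between input and output is easier to see when the two halves are priced separately. This is a hypothetical stand-in for pricing.dollars, not the course file: the name dollars_sketch is invented, the output price (25) is the one quoted above, and the input price (5) is inferred from the 0.0185 total.

```python
# Illustrative dollars per million tokens. Output price is quoted in this step;
# input price is inferred from the step's total and is an assumption.
PRICES = {"claude-opus-4-8": {"in": 5.00, "out": 25.00}}

def dollars_sketch(model, in_tokens, out_tokens):
    """Return (input_cost, output_cost) in dollars for one call."""
    p = PRICES[model]
    input_cost = in_tokens / 1e6 * p["in"]
    output_cost = out_tokens / 1e6 * p["out"]
    return input_cost, output_cost

inp, out = dollars_sketch("claude-opus-4-8", in_tokens=2200, out_tokens=300)
print(round(inp, 5), round(out, 5), round(inp + out, 5))
# -> 0.011 0.0075 0.0185
```

Caching can shrink the 0.011 input part, but the 0.0075 output part stays as long as the step runs on the strong model.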

Part B: route, trim, and the quality check

Step 5: Read what routing did

Compare naive and route only: sending the three easy steps to Haiku (about 5x cheaper, and faster) cuts cost 38 percent AND drops latency from 18.5s to 11.5s. Open optimize.py and read route.

You should now see: routing helps both cost and speed, because the small model is cheaper and faster, but only for steps that are genuinely easy.
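A router can be as small as one conditional. The sketch below is not the route function in optimize.py, whose signature is not shown here. The step names and the easy flags are invented for illustration, except "decide resolution", which this lab treats as a hard step.

```python
STRONG = "claude-opus-4-8"
CHEAP = "claude-haiku-4-5"

# Hypothetical step list for this sketch only.
STEPS_SKETCH = [
    {"name": "classify ticket", "easy": True},
    {"name": "extract order id", "easy": True},
    {"name": "decide resolution", "easy": False},
    {"name": "format reply", "easy": True},
]

def route_sketch(step):
    """Send easy steps to the cheap model, everything else to the strong one."""
    return CHEAP if step["easy"] else STRONG

for step in STEPS_SKETCH:
    print(f"{step['name']:18s} -> {route_sketch(step)}")
```

The hard part is not the code, it is deciding which steps deserve easy=True. That decision is what Step 7 checks.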

Step 6: See trimming compound

The demo's trim section shows a 900-token context cut to 200. Because the context rides along on every call, trimming multiplies across the pipeline. Try it:

python -c "import optimize, pricing; print(pricing.approx_tokens(optimize.trim('word '*500, 100)))"

You should now see: about 100 tokens. Every token you remove here is removed from every call.
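If you are curious how a trimmer like this might work, here is one possible sketch. The real approx_tokens and trim live in the course files and may differ. This version assumes roughly 4 characters per token and keeps the most recent text. The 5 calls per run is also an assumption.

```python
def approx_tokens_sketch(text):
    # Assumed heuristic: roughly 4 characters per token.
    return len(text) // 4

def trim_sketch(text, max_tokens):
    """Keep only the most recent text that fits in max_tokens."""
    budget_chars = max_tokens * 4
    if len(text) <= budget_chars:
        return text
    return text[len(text) - budget_chars:]

context = "word " * 500
trimmed = trim_sketch(context, 100)
print(approx_tokens_sketch(context), approx_tokens_sketch(trimmed))
# -> 625 100

# Compounding: the demo trims 900 tokens of context to 200.
calls = 5  # assumed number of calls that carry the context
print((900 - 200) * calls)
# -> 3500 input tokens removed per run
```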

Step 7: The quality check (the important one)

Routing is a bet: "Haiku is good enough for this easy step." You verify that bet with your M20 eval suite. The rule:

You should now see (in your own words): after routing a step to a cheaper model, re-run the M20 evals. Green means keep the saving; red means route that step back to the strong model. Never ship a cost cut on faith. Cheaper-but-wrong is not cheaper.
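The rule can be written down as a tiny gate. This is a sketch only: the function name, the pass rates, and the tolerance parameter are invented, and how you get pass rates out of your M20 suite depends on how you built it.

```python
STRONG = "claude-opus-4-8"
CHEAP = "claude-haiku-4-5"

def choose_model(strong_pass_rate, cheap_pass_rate, tolerance=0.0):
    """Keep the cheap model only if its eval pass rate holds up."""
    if cheap_pass_rate >= strong_pass_rate - tolerance:
        return CHEAP   # green: keep the saving
    return STRONG      # red: route the step back

# Made-up pass rates for two steps.
print(choose_model(strong_pass_rate=0.96, cheap_pass_rate=0.96))  # cheap model holds
print(choose_model(strong_pass_rate=0.96, cheap_pass_rate=0.81))  # cheap model fails
```

A tolerance above zero means you accept a small quality drop in exchange for the saving. Decide that number before you look at the results, not after.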

Step 8 (optional, costs a few tokens): measure real usage

With a key in .env, make one real call and read the true token counts:

cp .env.example .env      # then edit .env and paste your key
python -c "import anthropic; r=anthropic.Anthropic().messages.create(model='claude-haiku-4-5',max_tokens=20,messages=[{'role':'user','content':'Say hi in 3 words.'}]); print(r.usage)"

You should now see: a usage object with input_tokens and output_tokens, the real numbers you would feed into pricing.dollars. Steps 1 to 7 need no key.

Step 9: Show it

Post your before-and-after: the naive $/10k next to the both $/10k, and one sentence on the quality check you would run before shipping the change.

If you get stuck

ModuleNotFoundError -> run from inside the folder with the solution .py files.
Numbers differ from the table -> make sure you did not edit STEPS or PREFIX_TOKENS in optimize.py.
"Is the price real?" -> the prices in pricing.py are the course model table and are illustrative; check current Anthropic pricing for production estimates.
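ANTHROPIC_API_KEY error in Step 8 -> the anthropic client reads the key from the ANTHROPIC_API_KEY environment variable. Check that your file is named exactly .env, that the key line is correct, and that the variable is actually set in the shell you are running from. Steps 1 to 7 need no key.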

Check yourself

Why does caching help cost so much for an agent specifically?

Agents reuse the same system prompt and context across many calls. Without caching you pay full input price for that identical prefix every time; caching writes it once and reads it cheaply after, removing a cost you were paying repeatedly.

Routing cut cost AND latency. Why both?

The smaller model (Haiku) is both cheaper per token and faster to run, so sending easy steps to it saves money and time at once. The catch: it is weaker, so only route steps that are genuinely easy.

Why is "both levers" only a little cheaper than "cache only"?

Once caching removes the repeated input, what is left is mostly the hard steps' output tokens, the real reasoning, which neither caching nor routing can cut without hurting quality. That part is irreducible.

What must you do after any cost optimization, and why?

Re-run your M20 eval suite. Every optimization is a bet that quality holds; if the scorecard drops, the saving broke something. Cheaper-but-wrong is not a saving.