Lab M25: cut an agent's cost in half (and prove it)
You'll need: your venv. The core lab needs no API key and costs nothing (it is a pure estimator). A live token-measurement step at the end is optional. Time: about 40 minutes. Work in your breakout pair.
Heads up: we estimate dollars and latency from token counts, then watch caching, routing, and trimming bring them down. No model is called for the core lab, so there is nothing to spend and nothing that can harm your computer. Prices are illustrative; check current pricing for real numbers.
This lab has two parts:
- Part A: measure the naive pipeline, then add prompt caching.
- Part B: add routing and trimming, read the savings table, and learn what you cannot cut.
flowchart LR
N["naive\nall Opus, no cache"] -->|cache the prefix| C["cache only\n~50% off"]
N -->|route easy to Haiku| R["route only\n~38% off, faster"]
C --> B["both\n~52% off, faster"]
R --> B
Part A: measure, then cache
Step 1: Set up
Copy the solution/ files into a folder. Activate your venv. No installs, no key.
python -c "print('ready')"
Expected output: ready
Step 2: Run the estimator
python demo_mock.py
config                           $/run    $/10k  latency  savings
naive (all Opus, no cache)     0.06685      669    18.5s       0%
cache only (all Opus, cached)  0.03335      334    18.5s      50%
route only (Haiku for easy)    0.04177      418    11.5s      38%
both (route + cache)           0.03217      322    11.5s      52%
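The last two columns can be derived from $/run. A minimal sketch, assuming $/10k is $/run times 10,000 and savings is measured against the naive row, both rounded half-up; `ROWS` and `derive` are illustrative names, not part of demo_mock.py:

```python
from decimal import Decimal, ROUND_HALF_UP

# $/run values copied from the demo output above.
ROWS = {
    "naive": Decimal("0.06685"),
    "cache only": Decimal("0.03335"),
    "route only": Decimal("0.04177"),
    "both": Decimal("0.03217"),
}

def derive(per_run: Decimal, baseline: Decimal) -> tuple[int, int]:
    """Return (dollars per 10k runs, percent saved vs. baseline), rounded half-up."""
    whole = Decimal("1")
    per_10k = (per_run * 10_000).quantize(whole, rounding=ROUND_HALF_UP)
    saved = ((1 - per_run / baseline) * 100).quantize(whole, rounding=ROUND_HALF_UP)
    return int(per_10k), int(saved)

for name, per_run in ROWS.items():
    print(name, derive(per_run, ROWS["naive"]))
```

Decimal is used here only so that values ending in .5 round the same way on every machine; the lab's own code may round differently.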
Step 3: Read what caching did
Compare the naive and cache only rows: same models, but caching the 2,000-token prefix roughly halves the cost ($669 to $334 per 10k runs) and changes nothing about the answers. Open pricing.py and read dollars: a cached read costs about a tenth of normal input.
You should now see: caching helps cost a lot but latency not at all (still 18.5s). It removes repeated input you were paying for over and over.
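To see why a cached prefix moves the bill so much, here is a minimal sketch of one call's cost. It is not the code in pricing.py: the $5 per million input price is inferred from the Step 4 numbers, the 0.10 factor is the "about a tenth" figure, the 200 fresh tokens are a made-up split, and any one-time cost of writing the cache is ignored.

```python
# Illustrative prices in dollars per million tokens (assumptions, see above).
INPUT_PER_MTOK = 5.00
OUTPUT_PER_MTOK = 25.00
CACHED_READ_FACTOR = 0.10  # a cached read costs about a tenth of normal input

def call_cost(prefix_tokens: int, fresh_tokens: int, out_tokens: int, cached: bool) -> float:
    """Dollar cost of one call; only the shared prefix can be read from cache."""
    prefix_rate = INPUT_PER_MTOK * (CACHED_READ_FACTOR if cached else 1.0)
    return (
        prefix_tokens / 1e6 * prefix_rate          # repeated prefix
        + fresh_tokens / 1e6 * INPUT_PER_MTOK      # new input, never cached
        + out_tokens / 1e6 * OUTPUT_PER_MTOK       # output, never cached
    )

# Hypothetical call: 2,000-token prefix, 200 fresh input tokens, 300 output tokens.
uncached = call_cost(2000, 200, 300, cached=False)
cached = call_cost(2000, 200, 300, cached=True)
print(round(uncached, 5), round(cached, 5))
```

Only the prefix term shrinks. The fresh input and the output are billed at full price either way, which is why the saving stops at roughly half.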
Step 4: Find where the money goes
From your terminal, price the most expensive single step:
python -c "import pricing; print(round(pricing.dollars('claude-opus-4-8', in_tokens=2200, out_tokens=300),5))"
Expected: 0.0185, the cost of the "decide resolution" step. Its 300 output tokens cost 300/1e6*25 = $0.0075, which no cache can remove. Output on hard steps is the cost you cannot cheaply cut, because it is the actual reasoning.
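The arithmetic behind that number can be split into its input and output parts. A sketch under stated assumptions: the output price of $25 per million tokens comes from the calculation above, the input price of $5 per million is inferred as (0.0185 - 0.0075) / 2200 * 1e6, and `split_cost` is an illustrative helper, not a function in pricing.py.

```python
INPUT_PER_MTOK = 5.00    # inferred from the step's total, not read from pricing.py
OUTPUT_PER_MTOK = 25.00  # from the 300/1e6*25 arithmetic in this step

def split_cost(in_tokens: int, out_tokens: int) -> tuple[float, float]:
    """Return (input dollars, output dollars) for one call."""
    input_cost = in_tokens / 1e6 * INPUT_PER_MTOK
    output_cost = out_tokens / 1e6 * OUTPUT_PER_MTOK
    return input_cost, output_cost

input_cost, output_cost = split_cost(2200, 300)
total = input_cost + output_cost
print(f"input {input_cost:.4f} + output {output_cost:.4f} = {total:.4f}")
print(f"output share of this step: {output_cost / total:.0%}")
```

Even if caching made every input token free, the output term would remain as the floor for this step.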
Part B: route, trim, and the quality check
Step 5: Read what routing did
Compare naive and route only: sending the three easy steps to Haiku (about 5x cheaper, and faster) cuts cost by 38 percent and drops latency from 18.5s to 11.5s. Open optimize.py and read route.
You should now see: routing helps both cost and speed, because the small model is cheaper and faster, but only for steps that are genuinely easy.
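A routing rule can be as small as a lookup. This is a sketch, not the route function in optimize.py: the model names are the ones used in Steps 4 and 8, and every step name except "decide resolution" is hypothetical.

```python
STRONG = "claude-opus-4-8"   # model name used in Step 4
CHEAP = "claude-haiku-4-5"   # model name used in Step 8

# Hypothetical pipeline: step name -> is it easy enough for the small model?
EASY = {
    "classify": True,
    "extract": True,
    "format": True,
    "decide resolution": False,
}

def route(step: str) -> str:
    """Send known-easy steps to the cheap model; anything else stays on the strong one."""
    return CHEAP if EASY.get(step, False) else STRONG

for step in EASY:
    print(f"{step} -> {route(step)}")
```

Defaulting unknown steps to the strong model is a deliberate choice in this sketch: a step has to earn its way onto the cheap model.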
Step 6: See trimming compound
The demo's trim section shows a 900-token context cut to 200. Because the context rides along on every call, trimming multiplies across the pipeline. Try it:
python -c "import optimize, pricing; print(pricing.approx_tokens(optimize.trim('word '*500, 100)))"
Expected: 100 tokens. Every token you remove here is removed from every call.
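To make the compounding concrete, here is a minimal sketch. The helpers are stand-ins that assume about four characters per token and keep the start of the text; the real approx_tokens and trim in the solution files may work differently. The call count of 4 is also made up.

```python
CHARS_PER_TOKEN = 4  # rough heuristic, an assumption of this sketch

def approx_tokens(text: str) -> int:
    """Estimate tokens from character count."""
    return len(text) // CHARS_PER_TOKEN

def trim(text: str, max_tokens: int) -> str:
    """Keep only the first max_tokens worth of characters."""
    return text[: max_tokens * CHARS_PER_TOKEN]

def tokens_saved(before: int, after: int, n_calls: int) -> int:
    """The context rides along on every call, so the saving scales with call count."""
    return (before - after) * n_calls

context = "word " * 500
print(approx_tokens(context), "->", approx_tokens(trim(context, 100)))

# The demo's 900 -> 200 trim, across a hypothetical 4-call pipeline.
print(tokens_saved(900, 200, n_calls=4))
```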
Step 7: The quality check (the important one)
Routing is a bet: "Haiku is good enough for this easy step." You verify that bet with your M20 eval suite. The rule:
You should now see (in your own words): after routing a step to a cheaper model, re-run the M20 evals. Green means keep the saving; red means route that step back to the strong model. Never ship a cost cut on faith. Cheaper-but-wrong is not cheaper.
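The rule can be written as a gate around the routing decision. This is a sketch only: `choose_model` and `fake_evals` are hypothetical names, and `fake_evals` stands in for your real M20 suite by pretending the cheap model fails one step.

```python
from typing import Callable

def choose_model(
    step: str,
    cheap: str,
    strong: str,
    evals_pass: Callable[[str, str], bool],
) -> str:
    """Keep the cheaper model only if the evals are green for this step."""
    return cheap if evals_pass(step, cheap) else strong

def fake_evals(step: str, model: str) -> bool:
    """Stand-in for the M20 suite: the cheap model fails the hard step."""
    return step != "decide resolution"

for step in ("classify", "decide resolution"):
    model = choose_model(step, "claude-haiku-4-5", "claude-opus-4-8", fake_evals)
    print(f"{step} -> {model}")
```

The point of passing the eval function in is that the cost cut is never applied without a result to justify it.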
Step 8 (optional, costs a few tokens): measure real usage
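With a key in .env, make one real call and read the true token counts. The anthropic client reads ANTHROPIC_API_KEY from the environment, so if your setup does not load .env automatically, export the variable in your shell before running the call: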
cp .env.example .env # then edit .env and paste your key
python -c "import anthropic; r=anthropic.Anthropic().messages.create(model='claude-haiku-4-5',max_tokens=20,messages=[{'role':'user','content':'Say hi in 3 words.'}]); print(r.usage)"
Expected: a usage object with input_tokens and output_tokens, the real numbers you would feed into pricing.dollars. Steps 1 to 7 need no key.
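Turning measured usage into dollars might look like the sketch below. It runs without a key because `Usage` is a stand-in dataclass with the same two fields. The Haiku prices are assumptions (about 5x below the Opus figures used earlier) and the token counts are invented, not a real measurement.

```python
from dataclasses import dataclass

@dataclass
class Usage:
    """Stand-in for the API's usage object, so this runs without a key."""
    input_tokens: int
    output_tokens: int

# Assumed Haiku prices in dollars per million tokens; check pricing.py for the lab's values.
HAIKU_IN_PER_MTOK = 1.00
HAIKU_OUT_PER_MTOK = 5.00

def dollars_from_usage(usage) -> float:
    """Price one call from its measured token counts."""
    return (
        usage.input_tokens / 1e6 * HAIKU_IN_PER_MTOK
        + usage.output_tokens / 1e6 * HAIKU_OUT_PER_MTOK
    )

example = Usage(input_tokens=15, output_tokens=6)  # made-up counts
print(f"{dollars_from_usage(example):.8f}")
```

With a real response, you would pass r.usage in place of `example`, since it exposes the same two attribute names.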
Step 9: Show it
Post your before-and-after: the naive $/10k next to the both $/10k, and one sentence on the quality check you would run before shipping the change.
If you get stuck
- ModuleNotFoundError -> run from inside the folder with the solution .py files.
- Numbers differ from the table -> make sure you did not edit STEPS or PREFIX_TOKENS in optimize.py.
- "Is the price real?" -> the prices in pricing.py are the course model table and are illustrative; check current Anthropic pricing for production estimates.
- ANTHROPIC_API_KEY error in Step 8 -> your .env is not named exactly .env, or the key line is wrong. Steps 1 to 7 need no key.