Skip to content

M25 solution: cost and performance optimization

An offline estimator that prices an agent pipeline and shows what caching, routing, and trimming save. No model is called: it is pure arithmetic over a pricing model, so it runs with no API key and no spend.

Prices are illustrative (the course model table) and cache multipliers approximate Anthropic prompt caching (write about 1.25x input, read about 0.1x). Check current pricing before relying on a number.

Files

File Role
pricing.py The cost and latency model: PRICES per 1M tokens, dollars(...) (with cache write/read pricing), latency(...), approx_tokens(...).
optimize.py The levers (route, trim) and the 5-step STEPS workload, plus estimate(routing=, caching=) and the four CONFIGS.
demo_mock.py Prints the four configs as a cost/latency/savings table, scaled to 10,000 conversations, plus the trim demo and the quality caveat. Start here.
../starters/add_lever.py Add the Batch API discount as a fifth lever.

Run it

python demo_mock.py        # offline, free, no key
python optimize.py         # the bare four-config comparison

Results (deterministic)

config $/run $/10k latency savings
naive (all Opus, no cache) 0.06685 669 18.5s 0%
cache only (all Opus, cached) 0.03335 334 18.5s 50%
route only (Haiku for easy) 0.04177 418 11.5s 38%
both (route + cache) 0.03217 322 11.5s 52%

What it teaches

  • Caching removes the repeated 2,000-token prefix and roughly halves cost; it does little for latency.
  • Routing sends easy steps to a 5x cheaper, faster model, cutting both cost and latency, but only for steps that are genuinely easy.
  • Both is only a little better than caching alone, because what remains is the hard steps' OUTPUT tokens, the real reasoning, which is the cost you cannot cheaply cut.
  • Trimming removes tokens that ride on every call, so it compounds.
  • Quality is the catch: routing is a bet that a cheaper model is good enough. Verify it with the M20 eval suite after every change; keep the saving only if the scorecard stays green.

Verified (offline)

  • dollars matches hand-computed values for a normal call, a cached read, and a cache write.
  • route picks Haiku for easy and Opus for hard; trim reduces tokens to the budget.
  • The four configs: naive about 0.06685/run, cache about 0.03335, route about 0.04177, both about 0.03217; every optimization is cheaper than naive and "both" is the cheapest.
  • All files compile; demo_mock.py runs end to end offline. No key needed; the optional usage-measurement exercise reuses the M4 key.