M25 solution: cost and performance optimization

An offline estimator that prices an agent pipeline and shows what caching, routing, and trimming save. No model is called: it is pure arithmetic over a pricing model, so it runs with no API key and no spend.

Prices are illustrative (the course model table) and cache multipliers approximate Anthropic prompt caching (write about 1.25x input, read about 0.1x). Check current pricing before relying on a number.

Files

File	Role
`pricing.py`	The cost and latency model: `PRICES` per 1M tokens, `dollars(...)` (with cache write/read pricing), `latency(...)`, `approx_tokens(...)`.
`optimize.py`	The levers (`route`, `trim`) and the 5-step `STEPS` workload, plus `estimate(routing=, caching=)` and the four `CONFIGS`.
`demo_mock.py`	Prints the four configs as a cost/latency/savings table, scaled to 10,000 conversations, plus the trim demo and the quality caveat. Start here.
`../starters/add_lever.py`	Add the Batch API discount as a fifth lever.

Run it

python demo_mock.py        # offline, free, no key
python optimize.py         # the bare four-config comparison

Results (deterministic)

config	$/run	$/10k	latency	savings
naive (all Opus, no cache)	0.06685	669	18.5s	0%
cache only (all Opus, cached)	0.03335	334	18.5s	50%
route only (Haiku for easy)	0.04177	418	11.5s	38%
both (route + cache)	0.03217	322	11.5s	52%

What it teaches

Caching removes the repeated 2,000-token prefix and roughly halves cost; it does little for latency.
Routing sends easy steps to a 5x cheaper, faster model, cutting both cost and latency, but only for steps that are genuinely easy.
Both is only a little better than caching alone, because what remains is the hard steps' OUTPUT tokens, the real reasoning, which is the cost you cannot cheaply cut.
Trimming removes tokens that ride on every call, so it compounds.
Quality is the catch: routing is a bet that a cheaper model is good enough. Verify it with the M20 eval suite after every change; keep the saving only if the scorecard stays green.

Verified (offline)

dollars matches hand-computed values for a normal call, a cached read, and a cache write.
route picks Haiku for easy and Opus for hard; trim reduces tokens to the budget.
The four configs: naive about 0.06685/run, cache about 0.03335, route about 0.04177, both about 0.03217; every optimization is cheaper than naive and "both" is the cheapest.
All files compile; demo_mock.py runs end to end offline. No key needed; the optional usage-measurement exercise reuses the M4 key.