M25: Cost and performance optimization (Part D: Agentic Systems)

Agents make many model calls, and every call costs money and time. The pipeline you build today costs about 7 cents a run. That sounds like nothing until you run it 10,000 times: now it is 669 dollars, and slow. M20 taught you to SEE the cost. This module teaches you to cut it, often by half or more, without hurting quality: cache the parts that repeat, send the easy steps to a cheaper model, and stop paying to send tokens you do not need.

Today's win: the same agent pipeline, optimized from 669 dollars per 10,000 runs down to about 322 dollars and noticeably faster, with the trade-offs measured rather than guessed, and everything computed offline.
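To make the headline arithmetic concrete, here is a small sketch that back-computes the per-run cost and the percentage saved from the two totals quoted above. The variable names are illustrative and are not taken from the course files.

```python
# Back-of-envelope check of the headline numbers.
# The two totals come from the text; everything else is derived from them.
RUNS = 10_000
NAIVE_TOTAL = 669.0      # dollars per 10,000 runs
OPTIMIZED_TOTAL = 322.0  # dollars per 10,000 runs (approximate)

naive_per_run = NAIVE_TOTAL / RUNS
optimized_per_run = OPTIMIZED_TOTAL / RUNS
saved = NAIVE_TOTAL - OPTIMIZED_TOTAL
saved_pct = 100 * saved / NAIVE_TOTAL

print(f"naive:     ${naive_per_run:.4f} per run")
print(f"optimized: ${optimized_per_run:.4f} per run")
print(f"saved:     ${saved:.0f} per {RUNS:,} runs ({saved_pct:.1f}%)")
```

The naive figure works out to just under 7 cents a run, and the optimized total is a cut of a little over half.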

Today you will

  • Estimate the dollars and latency of an agent pipeline from token counts (no spend required)
  • Apply prompt caching: pay once for a stable prefix, then read it cheaply on every later call
  • Apply model routing: cheap fast model for easy steps, strong model for hard ones
  • Apply token trimming, and see why output tokens on hard steps are the cost you cannot cheaply cut
  • Learn the rule: every optimization is a bet on quality, so you re-run your M20 evals to check it
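The levers in the list above can be sketched as a tiny offline estimator. This is a minimal illustration under loud assumptions, not the lab's pricing.py: the prices, the cache-read factor, the step names, and the token counts are all hypothetical placeholders, and cache writes are simplified to the normal input price.

```python
from dataclasses import dataclass, replace

# HYPOTHETICAL prices in dollars per million tokens (not the course model table).
PRICES = {
    "strong": {"input": 3.00, "output": 15.00},
    "cheap": {"input": 1.00, "output": 5.00},
}
CACHE_READ_FACTOR = 0.10  # ASSUMPTION: a cached prefix is billed at 10% of the input price
RUNS = 10_000


@dataclass(frozen=True)
class Step:
    name: str
    hard: bool           # hard steps stay on the strong model
    prefix_tokens: int   # stable prefix repeated on every call
    fresh_tokens: int    # input that changes from call to call
    output_tokens: int
    model: str = "strong"
    prefix_cached: bool = False


def step_cost(step):
    """Dollars for one call: tokens times price, with a discount on a cached prefix."""
    price = PRICES[step.model]
    prefix_factor = CACHE_READ_FACTOR if step.prefix_cached else 1.0
    input_tokens = step.prefix_tokens * prefix_factor + step.fresh_tokens
    return (input_tokens * price["input"] + step.output_tokens * price["output"]) / 1_000_000


def run_cost(steps):
    return sum(step_cost(s) for s in steps)


def with_routing(steps):
    """Send easy steps to the cheap model; hard steps keep the strong one."""
    return [s if s.hard else replace(s, model="cheap") for s in steps]


def with_caching(steps):
    """The first call on each model pays full price for the prefix; later calls read it."""
    seen, out = set(), []
    for s in steps:
        out.append(replace(s, prefix_cached=s.model in seen))
        seen.add(s.model)
    return out


def with_trimming(steps, keep=0.5):
    """Keep only a fraction of the per-call input (hypothetical 50% here)."""
    return [replace(s, fresh_tokens=int(s.fresh_tokens * keep)) for s in steps]


def hard_output_cost(steps):
    """Output spend on hard steps: none of the three levers touches it."""
    return sum(s.output_tokens * PRICES[s.model]["output"] for s in steps if s.hard) / 1_000_000


# HYPOTHETICAL four-step pipeline with made-up token counts.
naive = [
    Step("plan", hard=True, prefix_tokens=2000, fresh_tokens=500, output_tokens=400),
    Step("search", hard=False, prefix_tokens=2000, fresh_tokens=1500, output_tokens=200),
    Step("extract", hard=False, prefix_tokens=2000, fresh_tokens=3000, output_tokens=300),
    Step("write", hard=True, prefix_tokens=2000, fresh_tokens=2500, output_tokens=1200),
]
routed = with_routing(naive)
cached = with_caching(routed)
trimmed = with_trimming(cached)

for label, steps in [("naive", naive), ("+ routing", routed),
                     ("+ caching", cached), ("+ trimming", trimmed)]:
    total = run_cost(steps) * RUNS
    floor = hard_output_cost(steps) * RUNS
    print(f"{label:<12} ${total:>8.2f} per {RUNS:,} runs   hard-step output: ${floor:.2f}")
```

In this sketch the total falls with each lever while the hard-step output column stays fixed, which is the point of the trimming bullet: that spend only moves if you accept shorter answers or a weaker model on the hard steps. Routing is applied before caching because the sketch assumes a cached prefix is only reusable by the model that wrote it.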

Run of show (about 60 minutes)

Time   What we do
0:00   Hook: 7 cents becomes 669 dollars
0:05   The one idea: cut repeated input and route by difficulty (read notes.md)
0:12   Lab Part A: measure the naive pipeline, then add caching
0:30   Lab Part B: add routing and trimming; read the savings table
0:50   Show: post your before-and-after cost table
1:00   Wrap

If you get stuck

  • Builds on M20 (seeing cost and latency) and the model table from M0/M6. The core lab is a pure estimator: no API key, no spend, instant.
  • No new libraries. Nothing here can harm your computer. Prices are illustrative (the course model table); always check current pricing before relying on a number.
  • If a number looks off, open pricing.py: every figure traces to one formula.

Optional challenge

Open starters/add_lever.py and add a fifth lever: the Batch API discount (about 50 percent off for non-urgent work). Apply it to the batchable easy steps and see the extra savings, then note the trade-off: batching is cheaper but not instant, so only batch work that can wait.
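One possible shape for the batch lever is sketched below. It does not reflect the actual interface of starters/add_lever.py: the function name, the dictionary layout, and the per-step costs are hypothetical placeholders, and the 50 percent figure is the approximate discount quoted above.

```python
BATCH_DISCOUNT = 0.50  # from the text: about 50 percent off; check current pricing


def apply_batch_discount(step_costs, batchable, discount=BATCH_DISCOUNT):
    """Discount only the steps that can wait; urgent steps keep their price.

    step_costs: dict of step name -> dollars per run (hypothetical layout)
    batchable:  set of step names that are safe to run non-urgently
    """
    return {
        name: cost * (1 - discount) if name in batchable else cost
        for name, cost in step_costs.items()
    }


# PLACEHOLDER per-run costs in dollars; not the lab's real figures.
costs = {"plan": 0.0120, "search": 0.0040, "extract": 0.0030, "write": 0.0220}
batched = apply_batch_discount(costs, batchable={"search", "extract"})

before = sum(costs.values())
after = sum(batched.values())
print(f"before: ${before:.4f} per run, after batching easy steps: ${after:.4f} per run")
```

Only the easy steps shrink here, so the extra saving is bounded by how much of the run is batchable, and every batched step gives up an instant response.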