
Notes: M6: Driving the model from code

M4 got a reply on screen; M5 made the reply good. M6 is about using the API like an engineer: shaping the request with parameters, choosing how the response arrives (all at once or streamed), and, the headline, getting back structured data your program can use, not just prose a human reads. The mental shift: a request is just data you control, and a response is data you parse. Once you see it that way, the model is a component in your software, not a chat window.

The request, in full

Every call is the same messages.create shape from M4, with a few more dials:

response = client.messages.create(
    model="claude-opus-4-8",     # which model
    max_tokens=300,              # cap on reply length (and cost)
    temperature=0.7,             # randomness (some models reject it, see below)
    system="...",                # the standing brief (M5)
    messages=[...],              # the conversation (M2 dictionaries again)
    output_config={...},         # constrain the output shape (below)
)
You don't need all of these every time, but knowing each dial is what "fluent" means.

max_tokens: length and cost

A token is the unit models read and write in, and the unit you're billed in; one token is roughly ¾ of a word. max_tokens is the hard cap on the reply: hit it and the text stops mid-sentence (the response's stop_reason becomes "max_tokens"). Set it high enough for the answer you want, and low when you deliberately want short output (a classification label, a single name) to save tokens. It caps only the output; your input (the whole conversation) costs tokens too.
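To make the ¾-of-a-word rule of thumb concrete, here is a rough sketch that turns a word count into a token estimate. It is only the heuristic from the paragraph above, not a real tokenizer, and the name `rough_token_estimate` is made up for illustration; for real numbers, rely on the counts the API reports.

```python
import math

WORDS_PER_TOKEN = 0.75  # the rough "a token is about 3/4 of a word" rule; an approximation


def rough_token_estimate(text: str) -> int:
    """Estimate tokens from a whitespace word count (heuristic, not a tokenizer)."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


reply = "The quick brown fox jumps over the lazy dog"  # 9 words
estimate = rough_token_estimate(reply)
print(estimate)  # 9 / 0.75 = 12
```

An estimate like this is good enough for picking a sensible max_tokens for a short label versus a long paragraph, nothing more.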

temperature: randomness (and a real model quirk)

temperature controls how random the model's word choices are, on a 0-1 scale:

- Low (0.0-0.3): focused and repeatable. Best for extraction, classification, anything where you want the same answer each time.
- High (0.7-1.0): varied and creative. Best for brainstorming, names, creative writing.

Here's the catch you met in the lab: the newest Opus models (claude-opus-4-8, 4.7) and Fable manage their own randomness and reject a temperature setting; sending one returns a 400 error. Models like claude-haiku-4-5 and claude-sonnet-4-6 still accept it. So part of "driving the API" is knowing which knobs a given model supports and picking the model for the job, which is also a cost decision (Haiku is far cheaper for high-volume or experimental work). When in doubt, check the model's docs.
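One way to keep that knowledge in a single place is a small helper that builds the request and quietly leaves out temperature for models that reject it. This is a sketch, not part of the SDK: `build_request` and `NO_TEMPERATURE_MODELS` are hypothetical names, and the set holds only the one model ID spelled out in these notes, so extend it from the model docs.

```python
# Models that reject `temperature`, per the notes above. This set is hand-maintained
# and illustrative; check the model's docs before relying on it.
NO_TEMPERATURE_MODELS = {"claude-opus-4-8"}


def build_request(model, messages, max_tokens=300, temperature=None):
    """Return keyword arguments for messages.create, omitting unsupported dials."""
    kwargs = {"model": model, "max_tokens": max_tokens, "messages": messages}
    if temperature is not None and model not in NO_TEMPERATURE_MODELS:
        kwargs["temperature"] = temperature
    return kwargs


msgs = [{"role": "user", "content": "Suggest three names for a bakery."}]
haiku_kwargs = build_request("claude-haiku-4-5", msgs, temperature=0.9)
opus_kwargs = build_request("claude-opus-4-8", msgs, temperature=0.9)
print("temperature" in haiku_kwargs)  # True
print("temperature" in opus_kwargs)   # False
# Usage (not run here): client.messages.create(**haiku_kwargs)
```

Because a request is just data, the decision about which dials to send can live in ordinary Python like this, away from the call itself.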

Streaming: don't make the user wait

By default a call blocks: your program waits for the entire reply, then you get it. For a long reply that's an awkward pause followed by a wall of text. Streaming hands you the reply in little chunks as the model writes them, so words appear live (like every chat app you've used):

with client.messages.stream(model=MODEL, max_tokens=400, messages=msgs) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
Same request, better experience. Use it for anything longer than a sentence, or any interactive UI. (The non-streaming call is simpler and perfectly fine for short, behind-the-scenes work like the extractor below.)
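Often you want both: show the words live and keep the full reply for later (to save it, or to append it to the conversation). The sketch below shows that pattern with a stand-in generator in place of stream.text_stream, so it runs without an API call; `fake_text_stream` and `print_and_collect` are made-up names.

```python
def fake_text_stream():
    """Stand-in for stream.text_stream: yields the reply in small chunks."""
    yield from ["Stream", "ing ", "feels ", "fast", "."]


def print_and_collect(chunks):
    """Print each chunk as it arrives and return the full reply at the end."""
    parts = []
    for text in chunks:
        print(text, end="", flush=True)  # show it live
        parts.append(text)               # keep it for later
    print()  # finish the line once the stream ends
    return "".join(parts)


full_reply = print_and_collect(fake_text_stream())
```

With the real SDK you would pass stream.text_stream from inside the with block instead of the fake generator.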

Structured output: the big one

This is the M6 payoff and the bridge to building real apps. In M5 you asked for JSON in the prompt and parsed it defensively, because the model might wrap it in ```json fences or add a stray sentence. Structured output makes valid JSON a guarantee. You hand the API a JSON schema (the exact shape you want, with field names, types, and even enums), and it constrains the model so the reply is always valid JSON in that shape:

import json

EXPENSE_SCHEMA = {
  "type": "object",
  "properties": {
    "item": {"type": "string"},
    "amount": {"type": "number"},
    "category": {"type": "string", "enum": ["food", "transport", "equipment", "other"]},
    "reimbursable": {"type": "boolean"},
  },
  "required": ["item", "amount", "category", "reimbursable"],
  "additionalProperties": False,
}

response = client.messages.create(
    model="claude-opus-4-8", max_tokens=300,
    messages=[{"role": "user", "content": f"Extract the expense:\n\n{messy_text}"}],
    output_config={"format": {"type": "json_schema", "schema": EXPENSE_SCHEMA}},
)
text = next(b.text for b in response.content if b.type == "text")
data = json.loads(text)        # guaranteed valid, no fence-stripping, no try/except gymnastics
Now data is a clean Python dict with exactly your fields, ready to save to a file (M3), put in a database, show in a UI, or pass to the next step. This is how you turn the unstructured world (messy emails, notes, reviews) into structured data software can act on. It's the quiet workhorse behind a huge share of real LLM apps.

flowchart LR
  Messy["messy free text<br/>'lunch w team ~48 quid, work expense'"] --> API["messages.create<br/>+ schema"]
  API --> JSON["guaranteed JSON<br/>{item, amount, category, reimbursable}"]
  JSON --> Use["save · store · display · next step"]
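One case a schema cannot cover is a reply that stops early: if the model runs into max_tokens, the text can end partway through the object. A cautious sketch is to check stop_reason before parsing. The helper name `parse_structured` is hypothetical, and the response here is a hand-built stand-in (via SimpleNamespace) so the example runs without an API call.

```python
import json
from types import SimpleNamespace


def parse_structured(response):
    """Return the reply's JSON as a dict, refusing to parse a truncated reply."""
    if response.stop_reason == "max_tokens":
        raise ValueError("reply hit max_tokens; the JSON may be incomplete")
    text = next(b.text for b in response.content if b.type == "text")
    return json.loads(text)


# Hand-built stand-in for an API response, for illustration only.
fake_response = SimpleNamespace(
    stop_reason="end_turn",
    content=[SimpleNamespace(
        type="text",
        text='{"item": "team lunch", "amount": 48, "category": "food", "reimbursable": true}',
    )],
)
data = parse_structured(fake_response)
print(data["category"], data["amount"])  # food 48
```

In a real app the ValueError is the signal to raise max_tokens and send the request again.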

Handling the response

A response is an object, not a string. The pieces you'll actually touch:

- response.content: a list of blocks. For a normal reply, content[0] is a text block and the text is content[0].text. (Being a list leaves room for other block types, like tool calls in M9.)
- response.stop_reason: why it stopped: "end_turn" (finished naturally), "max_tokens" (hit your cap, so raise it and resend), and a few others. Checking it is how robust apps notice a truncated answer.
- response.usage: token counts (input/output), which is how you track cost.
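To show how usage turns into a cost figure, here is a minimal sketch. It assumes the usage object exposes input_tokens and output_tokens, and it takes prices as arguments because rates differ by model; the prices in the example are placeholders, not real rates, and `estimate_cost` is a made-up helper name.

```python
from types import SimpleNamespace


def estimate_cost(usage, input_price_per_mtok, output_price_per_mtok):
    """Estimate a call's cost from token counts and per-million-token prices."""
    input_cost = usage.input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = usage.output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# Stand-in for response.usage; the prices below are placeholders, not real rates.
fake_usage = SimpleNamespace(input_tokens=1200, output_tokens=300)
cost = estimate_cost(fake_usage, input_price_per_mtok=1.0, output_price_per_mtok=5.0)
print(f"{cost:.4f}")  # 0.0027
```

Input and output are priced separately here because the two counts are reported separately; summing this figure across calls gives a running total for a script or a session.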

Go deeper (optional, not needed for today's win)

- **Why `next(b.text for b in response.content if b.type == "text")`?** It grabs the first *text* block specifically, skipping any non-text blocks. With a plain reply, `content[0].text` works too; the generator is just the robust version.
- **`messages.parse` + a schema class:** the SDK can validate the JSON into a typed object for you (using a library called Pydantic). We used a plain schema dict here to stay in the JSON world you already know; the typed approach is a nice next step.
- **The SDK retries** transient errors (429, 5xx) automatically with backoff, so you usually don't need your own retry loop.
- **Determinism isn't guaranteed even at `temperature=0`**: it's *more* repeatable, not byte-identical every time.
- **Count tokens before you send.** Since you pay per token and the context window is finite, you can measure a request's size up front with the SDK's token-counting call (`client.messages.count_tokens(model=..., messages=...)`), handy to estimate cost or check that a big document fits. (Don't use other tokenizers like `tiktoken`; they're for other models and miscount.)
- **Other providers, same shape.** OpenAI, Google Gemini, Mistral, etc. all expose a similar "messages in → reply out" API; once you can drive Claude's, switching is mostly renaming the client and model. (Hugging Face's Inference SDK and Ollama (M13) follow the pattern too.)
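To give a feel for the "typed object" idea without installing anything, here is a sketch that uses a standard-library dataclass as a stand-in for the Pydantic approach. It is not messages.parse and does far less validation than Pydantic (a dataclass does not check field types); the `Expense` class and the sample JSON are made up to mirror the expense schema in these notes.

```python
import json
from dataclasses import dataclass

ALLOWED_CATEGORIES = {"food", "transport", "equipment", "other"}


@dataclass
class Expense:
    item: str
    amount: float
    category: str
    reimbursable: bool

    def __post_init__(self):
        # Dataclasses don't enforce types; this only checks the enum-like field.
        if self.category not in ALLOWED_CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


text = '{"item": "train ticket", "amount": 23.5, "category": "transport", "reimbursable": true}'
expense = Expense(**json.loads(text))
print(expense.category, expense.amount)  # transport 23.5
```

The payoff is attribute access (expense.amount) and a named type your editor understands, instead of string keys into a dict.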

Check yourself

Lock in today's win: answer each in your head, then reveal.

1. What does max_tokens do, and how do you tell a reply was cut off by it?

Show answer

It's the hard cap on reply length (in tokens, ~¾ word each), and it caps cost. If the reply hits the cap it stops mid-sentence and response.stop_reason is "max_tokens". Raise the cap and resend to get the full answer.

2. When would you use a low vs high temperature, and which models won't accept it?

Show answer

Low (0-0.3) for focused, repeatable tasks (extraction, classification); high (0.7-1.0) for variety (brainstorming, creative writing). The newest Opus models (claude-opus-4-8) and Fable reject temperature (400 error); use a model like claude-haiku-4-5 or claude-sonnet-4-6 when you need the knob.

3. What does streaming change, and when is it worth it?

Show answer

Instead of waiting for the whole reply, you get it in chunks as it's written, so words appear live. Worth it for anything longer than a sentence or any interactive UI. A plain (blocking) call is fine for short, behind-the-scenes work.

4. How is M6's structured output better than M5's "please return JSON"?

Show answer

M5's prompt-only JSON might arrive fenced or malformed, so you parse defensively. M6's structured output hands the API a schema and guarantees the reply is valid JSON in that exact shape, so you json.loads once, confidently, with no fence-stripping or try/except.

5. What two things does response.content[0].text skip over that a real app might check?

Show answer

response.stop_reason (did it finish, or hit max_tokens?) and response.usage (token counts / cost). And content is a list because a reply can contain more than one block, and not every block is text (e.g. tool calls in M9).


New words (also in resources/glossary.md): max_tokens (recap), temperature, streaming, blocking call, output_config / JSON schema, structured output (recap), stop_reason, usage, content block.

Source: original, written for this course. API details (messages.create parameters, the temperature removal on the newest Opus models, output_config.format structured outputs, messages.stream/text_stream, stop_reason/usage) follow Anthropic's official Claude API documentation and were verified against the installed SDK (anthropic 0.109.2); examples are original and were run with the live model call mocked. No third-party text or figures; diagrams are original.