M12: Multimodal AI (breadth module, best after M6)
Every app so far spoke only text. But models can also see, and the world is full of images: receipts, screenshots, whiteboards, charts, photos. Today you give your app eyes: send a picture and a question, get an answer. Then you'll know the wider multimodal landscape, images in, images out, audio, video, and which tools do which.
Today's win: an app that looks at an image you choose and answers questions about it, and you can explain what "multimodal" means and what each kind needs.
Today you will
- Send an image + a question to a multimodal model and get a text answer (vision / image understanding)
- Understand the one new idea: an
imagecontent block alongside your text (same Messages API) - Survey the rest of multimodal, image generation, speech-to-text, text-to-speech, video: and which providers do them
Run of show (~45 min)
| Time | What we do |
|---|---|
| 0:00 | Hook + the win we're chasing |
| 0:05 | The one idea: images are just another content block (full read in notes.md) |
| 0:10 | Lab: describe an image; then ask it to read text / extract fields from a photo |
| 0:35 | Show: post what your app saw |
| 0:40 | Wrap + the multimodal landscape |
If you get stuck
- No new install, reuse M4's
anthropic+ key. The only new thing is theimageblock. - Use a PNG/JPG you own (a receipt, a screenshot, a pet). Big images cost more tokens, a normal photo is fine. Nothing here can harm your computer.
- Image generation and audio aren't in our Claude stack, that's expected; the notes cover which tools add them.
Optional challenge
Combine with M6: photograph a receipt and use output_config (a JSON schema) to extract
{merchant, total, date} as guaranteed JSON: vision + structured output = a real
"snap-a-receipt" feature in ~30 lines.