M12: Multimodal AI (breadth module, best after M6)

Every app so far spoke only text. But models can also see, and the world is full of images: receipts, screenshots, whiteboards, charts, photos. Today you give your app eyes: send a picture and a question, get an answer. Then you'll know the wider multimodal landscape, images in, images out, audio, video, and which tools do which.

Today's win: an app that looks at an image you choose and answers questions about it, and you can explain what "multimodal" means and what each kind needs.

Today you will

Send an image + a question to a multimodal model and get a text answer (vision / image understanding)
Understand the one new idea: an image content block alongside your text (same Messages API)
Survey the rest of multimodal, image generation, speech-to-text, text-to-speech, video: and which providers do them

Run of show (~45 min)

Time	What we do
0:00	Hook + the win we're chasing
0:05	The one idea: images are just another content block (full read in `notes.md`)
0:10	Lab: describe an image; then ask it to read text / extract fields from a photo
0:35	Show: post what your app saw
0:40	Wrap + the multimodal landscape

If you get stuck

No new install, reuse M4's anthropic + key. The only new thing is the image block.
Use a PNG/JPG you own (a receipt, a screenshot, a pet). Big images cost more tokens, a normal photo is fine. Nothing here can harm your computer.
Image generation and audio aren't in our Claude stack, that's expected; the notes cover which tools add them.

Optional challenge

Combine with M6: photograph a receipt and use output_config (a JSON schema) to extract {merchant, total, date} as guaranteed JSON: vision + structured output = a real "snap-a-receipt" feature in ~30 lines.