Skip to content

M12: Multimodal AI (breadth module, best after M6)

Every app so far spoke only text. But models can also see, and the world is full of images: receipts, screenshots, whiteboards, charts, photos. Today you give your app eyes: send a picture and a question, get an answer. Then you'll know the wider multimodal landscape, images in, images out, audio, video, and which tools do which.

Today's win: an app that looks at an image you choose and answers questions about it, and you can explain what "multimodal" means and what each kind needs.

Today you will

  • Send an image + a question to a multimodal model and get a text answer (vision / image understanding)
  • Understand the one new idea: an image content block alongside your text (same Messages API)
  • Survey the rest of multimodal, image generation, speech-to-text, text-to-speech, video: and which providers do them

Run of show (~45 min)

Time What we do
0:00 Hook + the win we're chasing
0:05 The one idea: images are just another content block (full read in notes.md)
0:10 Lab: describe an image; then ask it to read text / extract fields from a photo
0:35 Show: post what your app saw
0:40 Wrap + the multimodal landscape

If you get stuck

  • No new install, reuse M4's anthropic + key. The only new thing is the image block.
  • Use a PNG/JPG you own (a receipt, a screenshot, a pet). Big images cost more tokens, a normal photo is fine. Nothing here can harm your computer.
  • Image generation and audio aren't in our Claude stack, that's expected; the notes cover which tools add them.

Optional challenge

Combine with M6: photograph a receipt and use output_config (a JSON schema) to extract {merchant, total, date} as guaranteed JSON: vision + structured output = a real "snap-a-receipt" feature in ~30 lines.