Lab: M12: give your app eyes

You'll need: your M4 setup (venv, key in .env, anthropic), and an image file you own (a photo of a receipt, a screenshot, a chart, a pet, a PNG or JPG). No new install. Time: ~35 minutes • Work in your breakout pair.

Heads up: images cost more tokens than text, a normal phone photo is fine; you don't need anything huge. Nothing here can harm your computer.

flowchart LR
  Pic["your image"] --> B64["base64 (+ your question)"] --> M["multimodal model"] --> Ans["text answer"]

Step 1: Set up the folder

Put describe_image.py and sample.png (from solution/) and describe_image_starter.py (from starters/) in a folder with your M4 .env. Activate your venv.

You should now see: (.venv) and those files (ls / dir).

Step 2: Describe the sample image

python describe_image.py sample.png "What colour is this image?"

You should now see: an answer that correctly names the colour of sample.png (a steel-blue rectangle). Your program just looked at a picture and answered. (The image rode along as one new content block, open describe_image.py and find the "type": "image" block.)

Step 3: Use your own image

Put a real photo in the folder (a receipt, a screenshot, a chart). Ask a real question:

python describe_image.py my-photo.jpg "What is happening in this picture?"

You should now see: a sensible description of your image. Try a few questions about the same picture, it's a conversation about what the model sees.

Step 4: Read text from an image (OCR-style)

Use a photo or screenshot that contains text (a sign, a receipt, a slide). Ask:

python describe_image.py receipt.jpg "Read all the text in this image, line by line."

You should now see: the text transcribed from the image. Vision models double as a flexible OCR, no special OCR library needed.

Step 5: See the one new idea

Open describe_image.py. Compare its messages to M4's chatbot: the only difference is that content is a list with an image block (base64 + media_type) plus the text block.

You should now see / say: "an image is just another content block, same Messages API." That's the whole trick; everything else (model, max_tokens, reading the reply) you already knew.

Step 6: (Stretch) extract structured data from a photo

Combine with M6: in describe_image_starter.py (TODO 2), add output_config with a JSON schema for {merchant, total, date} and photograph a receipt. Run it.

You should now see: the receipt's fields as guaranteed JSON your code could save, a real "snap-a-receipt" feature. (Vision + structured output is a genuinely useful combo.)

Step 7: Know what Claude doesn't do

Read the survey table in notes.md. Note which multimodal jobs need other tools.

You should now see / say: vision (image→text) is Claude; image generation (DALL·E/Stable Diffusion), speech-to-text (Whisper), and text-to-speech are separate tools you'd add, same "call an API" pattern.

Stuck? The finished app is ../solution/describe_image.py.

Your win

Your app can look at an image and answer questions about it, read text, describe a scene, inspect a chart, and you can explain the wider multimodal landscape.

Post it to the chat wins board: what your app saw, e.g. "Snapped my receipt → it read the total and date correctly. My app has eyes now "

Take-home (optional)

Build a tiny "describe my day": feed it 2-3 photos one after another and ask for a one-paragraph summary across them. Notice you can put multiple image blocks in one message, the model compares and combines them.