Lab: M12: give your app eyes
You'll need: your M4 setup (venv, key in .env, anthropic), and an image file you own
(a photo of a receipt, a screenshot, a chart, a pet, a PNG or JPG). No new install.
Time: ~35 minutes • Work in your breakout pair.
Heads up: images cost more tokens than text, a normal phone photo is fine; you don't need anything huge. Nothing here can harm your computer.
flowchart LR
Pic["your image"] --> B64["base64 (+ your question)"] --> M["multimodal model"] --> Ans["text answer"]
Step 1: Set up the folder
Put describe_image.py and sample.png (from solution/) and
describe_image_starter.py (from starters/) in a folder with your M4 .env.
Activate your venv.
You should now see: (.venv) and those files (ls / dir).
Step 2: Describe the sample image
python describe_image.py sample.png "What colour is this image?"
You should now see: an answer that correctly names the colour of sample.png (a steel-blue
rectangle). Your program just looked at a picture and answered. (The image rode along as one new
content block, open describe_image.py and find the "type": "image" block.)
Step 3: Use your own image
Put a real photo in the folder (a receipt, a screenshot, a chart). Ask a real question:
python describe_image.py my-photo.jpg "What is happening in this picture?"
You should now see: a sensible description of your image. Try a few questions about the same picture, it's a conversation about what the model sees.
Step 4: Read text from an image (OCR-style)
Use a photo or screenshot that contains text (a sign, a receipt, a slide). Ask:
python describe_image.py receipt.jpg "Read all the text in this image, line by line."
You should now see: the text transcribed from the image. Vision models double as a flexible OCR, no special OCR library needed.
Step 5: See the one new idea
Open describe_image.py. Compare its messages to M4's chatbot: the only difference is that
content is a list with an image block (base64 + media_type) plus the text block.
You should now see / say: "an image is just another content block, same Messages API." That's the whole trick; everything else (model, max_tokens, reading the reply) you already knew.
Step 6: (Stretch) extract structured data from a photo
Combine with M6: in describe_image_starter.py (TODO 2), add output_config with a JSON schema
for {merchant, total, date} and photograph a receipt. Run it.
You should now see: the receipt's fields as guaranteed JSON your code could save, a real "snap-a-receipt" feature. (Vision + structured output is a genuinely useful combo.)
Step 7: Know what Claude doesn't do
Read the survey table in notes.md. Note which multimodal jobs need other tools.
You should now see / say: vision (image→text) is Claude; image generation (DALL·E/Stable Diffusion), speech-to-text (Whisper), and text-to-speech are separate tools you'd add, same "call an API" pattern.
Stuck? The finished app is
../solution/describe_image.py.
Your win
Your app can look at an image and answer questions about it, read text, describe a scene, inspect a chart, and you can explain the wider multimodal landscape.
Post it to the chat wins board: what your app saw, e.g. "Snapped my receipt → it read the total and date correctly. My app has eyes now "
Take-home (optional)
Build a tiny "describe my day": feed it 2-3 photos one after another and ask for a one-paragraph summary across them. Notice you can put multiple image blocks in one message, the model compares and combines them.