Notes: M12: Multimodal AI
"Multimodal" just means a model works with more than text (images, audio, even video) as input and/or output. It's less a new skill than a new content type: you've sent text to a model; now you send a picture too. This module builds the most useful, widely supported case in depth, image understanding (vision), then maps the rest so you know what's possible and what each piece needs.
What "multimodal" means
A modality is a type of data: text, image, audio, video. A multimodal model can take more than one as input and/or produce more than one as output. The common combinations:
| You give it | It gives back | Called | Example |
|---|---|---|---|
| image + text | text | vision / image understanding | "what's in this photo?", read a receipt |
| text | image | image generation | "a watercolour fox" |
| audio | text | speech-to-text (STT) | transcribe a meeting |
| text | audio | text-to-speech (TTS) | read an article aloud |
| video + text | text | video understanding | summarize a clip |
This course's model (Claude) is strong at the first row (vision), so that's what you build. The others need different tools (below), but the idea is the same: another modality in or out.
Vision: give the model eyes (the build)
Sending an image uses the same Messages API with one new content block. Instead of a plain
string, content becomes a list with an image block (the picture) and a text block (your
question):
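```python
import base64

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Read the image and base64-encode it so it can travel inside the JSON request.
with open("receipt.jpg", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-8",  # a multimodal model
    max_tokens=500,
    messages=[{"role": "user", "content": [
        # The image block: the picture itself.
        {"type": "image",
         "source": {"type": "base64", "media_type": "image/jpeg", "data": image_data}},
        # The text block: your question about the picture.
        {"type": "text", "text": "What is the total on this receipt?"},
    ]}],
)
```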
media_type must match the file (image/png, image/jpeg, …). Everything else is the call
you already know; the reply is text in response.content[0].text.
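One way to keep media_type in step with the file is to derive it from the file name instead of typing it by hand. The sketch below is a minimal illustration, not part of the course solution: `build_image_block` and `SUPPORTED_MEDIA_TYPES` are hypothetical names, the list of accepted formats is an assumption to re-check against the current Anthropic documentation, and `mimetypes.guess_type` only looks at the extension, not the file contents. It makes no API call.

```python
import base64
import json
import mimetypes

# Assumption: the image formats accepted for vision input at the time of writing.
SUPPORTED_MEDIA_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}


def build_image_block(filename: str, raw_bytes: bytes) -> dict:
    """Hypothetical helper: build an image content block from a name and raw bytes."""
    # Guess the media type from the file extension (e.g. ".png" -> "image/png").
    media_type, _ = mimetypes.guess_type(filename)
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"unsupported or unknown image type for {filename!r}: {media_type}")
    # base64 turns arbitrary bytes into plain ASCII text.
    data = base64.standard_b64encode(raw_bytes).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }


# Stand-in bytes (the PNG file signature), not a real image.
fake_png = b"\x89PNG\r\n\x1a\n"
block = build_image_block("chart.png", fake_png)
content = [block, {"type": "text", "text": "What does this chart show?"}]

# Because the image is now text, the whole message serializes to JSON cleanly.
payload = json.dumps({"role": "user", "content": content})
print(block["source"]["media_type"])
```

The resulting `content` list can be passed as the user message's content in the same `messages.create` call shown above.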
```mermaid
flowchart LR
    Img["image (bytes)"] --> B64["base64-encode"] --> Block["image content block"]
    Q["your question"] --> TBlock["text content block"]
    Block --> API["messages.create (multimodal model)"]
    TBlock --> API
    API --> Ans["text answer"]
```
What vision is great for: describing and captioning images, reading text from images (OCR-style), extracting fields from documents/receipts (combine with M6 structured output for guaranteed JSON!), reading charts/diagrams, screenshot understanding, accessibility. Watch the cost: images cost tokens (more for bigger/higher-res ones), so downsize when you don't need fine detail.
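To get a feel for the cost point, here is a rough back-of-the-envelope sketch. It assumes the rule of thumb from Anthropic's vision documentation that an image which is not resized costs about width × height / 750 tokens; treat that constant as an assumption to re-check, and the function names and the 500-pixel target as purely illustrative.

```python
def fit_within(width: int, height: int, max_edge: int) -> tuple[int, int]:
    """Scale (width, height) down so the longer edge is at most max_edge, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # already small enough, leave unchanged
    scale = max_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def estimate_image_tokens(width: int, height: int) -> int:
    """Rough estimate only. Assumption: about (width * height) / 750 tokens per image."""
    return round(width * height / 750)


original = (1000, 750)
smaller = fit_within(*original, max_edge=500)

print(smaller)                           # (500, 375)
print(estimate_image_tokens(*original))  # 1000
print(estimate_image_tokens(*smaller))   # 250
```

Halving each edge cuts the estimate to a quarter, which is why downsizing pays off quickly when the task does not depend on fine detail. The actual resize would be done with an image library before encoding; this sketch only does the arithmetic.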
The rest of multimodal (survey: what each needs)
Beyond vision, here's the landscape so the buzzwords make sense. These generally need providers/models other than Claude. They are easy to add and follow the same "call an API" pattern you know.
| Capability | What it does | Common tools |
|---|---|---|
| Image generation | text → a new image | OpenAI DALL·E / gpt-image, Stable Diffusion, Google Imagen, Midjourney |
| Speech-to-text | audio → text (transcription) | OpenAI Whisper, Deepgram, AssemblyAI |
| Text-to-speech | text → spoken audio | ElevenLabs, OpenAI TTS, Google/Azure TTS |
| Video understanding | video → text (summary/Q&A) | Gemini (native video), frame-sampling + vision |
| Multimodal RAG | retrieve over images and text | vision embeddings + a vector DB (M7) |
The throughline: pick the model that has the modality you need (it's a model-choice decision, M0), call it like any API, and combine pieces (e.g. Whisper → Claude → TTS for a voice assistant).
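The "combine pieces" idea can be sketched as a small pipeline in which each stage is a swappable function. Everything here is hypothetical scaffolding: `voice_assistant_turn` and the three `fake_*` stand-ins are illustrative names, and in a real app each stand-in would wrap a provider call (a speech-to-text service, Claude, a text-to-speech service).

```python
from typing import Callable


def voice_assistant_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],
    think: Callable[[str], str],
    speak: Callable[[str], bytes],
) -> bytes:
    """One turn: audio -> text -> reply text -> audio. Each step is a swappable callable."""
    user_text = transcribe(audio_in)   # speech-to-text stage
    reply_text = think(user_text)      # language-model stage
    return speak(reply_text)           # text-to-speech stage


# Stand-in implementations so the sketch runs without any API keys.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_think(text: str) -> str:
    return f"You said: {text}"


def fake_speak(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = voice_assistant_turn(b"hello", fake_transcribe, fake_think, fake_speak)
print(reply_audio)  # b'You said: hello'
```

Passing the stages in as functions keeps the model choice separate from the wiring, so swapping one provider for another touches only one stage.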
Responsible multimodal (quick note)
The M10 cautions apply, with extras: images can carry hidden prompt-injection text ("ignore your instructions…" written in a picture, an indirect injection); generated images raise deepfake / consent / copyright issues; and faces/receipts are personal data: handle with the same privacy care as any sensitive input. Eyes and a voice make an app more powerful and raise the stakes.
Go deeper (optional, not needed for today's win)
- **Image input options:** base64 (shown), a public **URL**, or the **Files API** (upload once, reference by id across calls), handy for the same image in many requests.
- **Resolution & cost:** models cap image size and bill by it; very large images are downscaled. Send the smallest image that still shows what you need.
- **Multiple images:** you can put several `image` blocks in one message ("compare these two charts").
- **Vision embeddings** turn images into vectors (like text embeddings, M7), enabling image search and multimodal RAG.

Check yourself
Lock in today's win: answer each in your head, then reveal.
1. What does "multimodal" mean, and which case does this course build?
Show answer
A model that works with more than one modality (type of data: text, image, audio, video) as input and/or output. This course builds vision / image understanding (image + text → text), because Claude is strong at it; other modalities need different tools.
2. How is sending an image different from a normal text call?
Show answer
It's the same Messages API: you just make content a list with an image block
(the picture, base64-encoded, with a matching media_type) alongside your text block. The
reply is still text in response.content[0].text.
3. Why base64-encode the image?
Show answer
To turn the image's raw bytes into safe text that can travel inside a JSON API request. (Alternatives: pass an image URL, or upload via the Files API and reference it by id.)
4. Name three things vision is genuinely useful for.
Show answer
Any of: reading text from images (OCR-style), extracting fields from receipts/forms/IDs (great with M6 structured output → guaranteed JSON), reading charts/diagrams, screenshot understanding, image captioning/description, accessibility.
5. You need your app to create an image and transcribe a voice note. Does Claude do that?
Show answer
No. Those are image generation and speech-to-text, modalities Claude's API doesn't cover. Use other tools (e.g. DALL·E / Stable Diffusion for images, Whisper for transcription), called the same "it's just an API" way, then combine them with your Claude app.
New words (also in resources/glossary.md): multimodal (recap),
modality, vision / image understanding, image content block, base64, media type, image generation,
speech-to-text (STT), text-to-speech (TTS), video understanding, multimodal RAG.
Source: original, written for this course. The Claude vision usage (base64 image content block,
media_type, Messages API) follows Anthropic's official documentation and was verified against the
installed SDK (request structure + base64 confirmed; the model call was mocked, see the solution README).
The image-generation / audio survey names widely-used third-party tools as neutral reference. Diagrams
are original.