Notes: M12: Multimodal AI

"Multimodal" just means a model works with more than text (images, audio, even video) as input and/or output. It's less a new skill than a new content type: you've sent text to a model; now you send a picture too. This module builds the most useful, widely supported case in depth, image understanding (vision), then maps the rest so you know what's possible and what each piece needs.

What "multimodal" means

A modality is a type of data: text, image, audio, video. A multimodal model can take more than one as input and/or produce more than one as output. The common combinations:

You give it	It gives back	Called	Example
image + text	text	vision / image understanding	"what's in this photo?", read a receipt
text	image	image generation	"a watercolour fox"
audio	text	speech-to-text (STT)	transcribe a meeting
text	audio	text-to-speech (TTS)	read an article aloud
video + text	text	video understanding	summarize a clip

This course's model (Claude) is strong at the first row, vision, so that's what you build. The others need different tools (below), but the idea is the same: another modality in or out.

Vision: give the model eyes (the build)

Sending an image is the same Messages API with one new content block. Instead of a plain string, content becomes a list with an image block (the picture) and a text block (your question):

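import base64

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Read the image and base64-encode it so it can travel inside the JSON request.
with open("receipt.jpg", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-8",          # a multimodal model
    max_tokens=500,
    messages=[{"role": "user", "content": [
        # The picture: media_type must match the actual file format.
        {"type": "image",
         "source": {"type": "base64", "media_type": "image/jpeg", "data": image_data}},
        # Your question about the picture.
        {"type": "text", "text": "What is the total on this receipt?"},
    ]}],
)
print(response.content[0].text)       # the reply is ordinary text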

Two things to notice:

- Base64 turns the image's raw bytes into safe text so it can travel in a JSON request. (You can also pass an image URL, or upload via the Files API for reuse, but base64 of a local file is the simplest start.)
- media_type must match the file (image/png, image/jpeg, …).

Everything else is the call you already know: the reply is text in response.content[0].text.
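A mismatched media_type is an easy mistake when file names vary, so here is a minimal sketch of a helper that derives it from the file extension. The name build_image_block and the SUPPORTED_MEDIA_TYPES set are illustrative assumptions, not part of the SDK; check the current documentation for the formats your model accepts.

```python
import base64
import mimetypes
from pathlib import Path

# Assumption: the image formats commonly listed as accepted; verify against current docs.
SUPPORTED_MEDIA_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}


def build_image_block(path):
    """Return a base64 image content block for a local file (hypothetical helper)."""
    # guess_type looks only at the file extension, not at the file's bytes.
    media_type, _ = mimetypes.guess_type(str(path))
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported or unknown image type for {path!r}: {media_type}")
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

The returned dictionary can go straight into the content list next to a text block. Because the guess is extension-based, a PNG saved as photo.jpg would still be mislabelled; inspecting the file's first bytes would be a stricter check.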

flowchart LR
  Img["image (bytes)"] --> B64["base64-encode"] --> Block["image content block"]
  Q["your question"] --> TBlock["text content block"]
  Block --> API["messages.create (multimodal model)"]
  TBlock --> API
  API --> Ans["text answer"]

What vision is great for: describing images and captioning them (alt text), reading text from images (OCR-style), extracting fields from documents/receipts (combine with M6 structured output for guaranteed JSON!), reading charts/diagrams, screenshot understanding, accessibility. Watch the cost: images cost tokens (more for bigger/higher-res ones), so downsize when you don't need fine detail.
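To make "images cost tokens" concrete, here is a rough sketch that estimates the cost of an image from its pixel dimensions and computes a smaller size to resize to. Both constants are assumptions: the divisor of 750 is a rule of thumb Anthropic's vision guide has given (tokens ≈ width × height / 750), and the 1568-pixel long edge is an illustrative cap. Check the current documentation before relying on either. The function names are hypothetical.

```python
import math

TOKENS_PER_PIXEL_DIVISOR = 750  # assumption: vendor rule of thumb, may change
MAX_LONG_EDGE = 1568            # assumption: illustrative cap on the longer side, in pixels


def estimate_image_tokens(width, height):
    """Rough token estimate for an image of the given pixel size."""
    return math.ceil(width * height / TOKENS_PER_PIXEL_DIVISOR)


def fit_within(width, height, max_long_edge=MAX_LONG_EDGE):
    """Return (width, height) scaled down so the longer side fits the cap.

    The aspect ratio is preserved; images already small enough are unchanged.
    """
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height
    scale = max_long_edge / long_edge
    return max(1, round(width * scale)), max(1, round(height * scale))


# Compare a phone-camera-sized photo before and after downsizing.
original = (4000, 3000)
smaller = fit_within(*original)
print(original, estimate_image_tokens(*original))
print(smaller, estimate_image_tokens(*smaller))
```

This only does the arithmetic; the actual resizing would be done with an image library of your choice before base64-encoding the file.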

The rest of multimodal (survey: what each needs)

Beyond vision, here's the landscape so the buzzwords make sense. These generally need providers or models other than Claude; they're easy to add and follow the same "call an API" pattern you know.

Capability	What it does	Common tools
Image generation	text → a new image	OpenAI DALL·E / gpt-image, Stable Diffusion, Google Imagen, Midjourney
Speech-to-text	audio → text (transcription)	OpenAI Whisper, Deepgram, AssemblyAI
Text-to-speech	text → spoken audio	ElevenLabs, OpenAI TTS, Google/Azure TTS
Video understanding	video → text (summary/Q&A)	Gemini (native video), frame-sampling + vision
Multimodal RAG	retrieve over images and text	vision embeddings + a vector DB (M7)

The throughline: pick the model that has the modality you need (it's a model-choice decision, M0), call it like any API, and combine pieces (e.g. Whisper → Claude → TTS for a voice assistant).
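The "combine pieces" idea can be sketched as a small pipeline that takes each stage as a plain function. Everything here is hypothetical scaffolding: voice_assistant_turn, transcribe, ask_model, and synthesize are illustrative names, and the stand-in functions at the bottom replace real calls to a speech-to-text service, a Claude call, and a text-to-speech service.

```python
def voice_assistant_turn(audio_bytes, transcribe, ask_model, synthesize):
    """Run one voice-assistant turn: audio in, audio out.

    transcribe: bytes -> str   (speech-to-text stage)
    ask_model:  str   -> str   (text model stage)
    synthesize: str   -> bytes (text-to-speech stage)
    """
    text = transcribe(audio_bytes)
    if not text.strip():
        raise ValueError("Transcription was empty; nothing to send to the model.")
    reply = ask_model(text)
    return synthesize(reply)


# Stand-ins so the sketch runs without any provider; swap in real API calls.
def fake_transcribe(audio_bytes):
    return "what time is it"


def fake_ask_model(text):
    return f"You asked: {text}"


def fake_synthesize(text):
    return text.encode("utf-8")


audio_out = voice_assistant_turn(b"...", fake_transcribe, fake_ask_model, fake_synthesize)
print(audio_out)
```

Passing the stages in as functions keeps each provider swappable, which matches the model-choice framing: changing the speech-to-text tool means changing one argument, not the pipeline.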

Responsible multimodal (quick note)

The M10 cautions apply, with extras: images can carry hidden prompt-injection text ("ignore your instructions…" written in a picture, an indirect injection); generated images raise deepfake / consent / copyright issues; and faces/receipts are personal data: handle with the same privacy care as any sensitive input. Eyes and a voice make an app more powerful and raise the stakes.

Go deeper (optional, not needed for today's win)

- **Image input options:** base64 (shown), a public **URL**, or the **Files API** (upload once, reference by id across calls); the last is handy for the same image in many requests.
- **Resolution & cost:** models cap image size and bill by it; very large images are downscaled. Send the smallest image that still shows what you need.
- **Multiple images:** you can put several `image` blocks in one message ("compare these two charts").
- **Vision embeddings** turn images into vectors (like text embeddings, M7), enabling image search and multimodal RAG.
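As a sketch of the URL and multiple-image options, the helpers below build a content list with several image blocks, each preceded by a short label so the question can refer to "Image 1" and "Image 2". The helper names and the example.com URLs are hypothetical; the URL source shape follows the Messages API as I understand it, so confirm it against the current reference.

```python
def url_image_block(url):
    """Image content block that points at a publicly reachable URL."""
    return {"type": "image", "source": {"type": "url", "url": url}}


def multi_image_content(image_blocks, question):
    """Build a content list: a label before each image, then the question."""
    content = []
    for index, block in enumerate(image_blocks, start=1):
        content.append({"type": "text", "text": f"Image {index}:"})
        content.append(block)
    content.append({"type": "text", "text": question})
    return content


# Hypothetical URLs, for illustration only.
content = multi_image_content(
    [
        url_image_block("https://example.com/chart-q1.png"),
        url_image_block("https://example.com/chart-q2.png"),
    ],
    "Compare these two charts. What changed between them?",
)
```

The resulting list is what you would pass as the content of a single user message; base64 image blocks can be mixed into the same list.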

Check yourself

Lock in today's win: answer each in your head, then reveal.

1. What does "multimodal" mean, and which case does this course build?

Show answer

A model that works with more than one modality (type of data: text, image, audio, video) as input and/or output. This course builds vision / image understanding (image + text → text), because Claude is strong at it; other modalities need different tools.

2. How is sending an image different from a normal text call?

Show answer

It's the same Messages API: you just make content a list with an image block (the picture, base64-encoded, with a matching media_type) alongside your text block. The reply is still text in response.content[0].text.

3. Why base64-encode the image?

Show answer

To turn the image's raw bytes into safe text that can travel inside a JSON API request. (Alternatives: pass an image URL, or upload via the Files API and reference it by id.)

4. Name three things vision is genuinely useful for.

Show answer

Any of: reading text from images (OCR-style), extracting fields from receipts/forms/IDs (great with M6 structured output → guaranteed JSON), reading charts/diagrams, screenshot understanding, image captioning/description, accessibility.

5. You need your app to create an image and transcribe a voice note. Does Claude do that?

Show answer

No. Those are image generation and speech-to-text, different modalities that Claude's API doesn't cover. Use other tools, e.g. DALL·E / Stable Diffusion for images and Whisper for transcription, called the same "it's just an API" way, then combined with your Claude app.

New words (also in resources/glossary.md): multimodal (recap), modality, vision / image understanding, image content block, base64, media type, image generation, speech-to-text (STT), text-to-speech (TTS), video understanding, multimodal RAG.

Source: original, written for this course. The Claude vision usage (base64 image content block, media_type, Messages API) follows Anthropic's official documentation and was verified against the installed SDK (request structure + base64 confirmed; the model call mocked, see the solution README). The image-generation / audio survey names widely-used third-party tools as neutral reference. Diagrams are original.