Notes: M11: Deployment & productionizing

You've built real things. The gap between "it runs in my terminal" and "other people use it" is deployment, and it's smaller than it sounds: turn your function into a web service, package it so it runs the same anywhere, keep your key safe, and watch a couple of numbers so you know it's healthy and what it costs. That's this module. Then the capstone: put it all together into something that's yours.

From script to service: the web API

Right now your app is a script only you can run. A web API (an Application Programming Interface served over the web) lets anything (a website, a phone app, a teammate's code) send your app a request and get a response back over HTTP. The unit is the endpoint: a URL plus an HTTP method that does one thing, e.g. POST /chat takes a message and returns a reply.

FastAPI turns a Python function into an endpoint with almost no extra code:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):       # the shape of the incoming JSON
    message: str

@app.post("/chat")
def chat(request: ChatRequest):     # FastAPI validates the body for you
    reply = my_ai_function(request.message)
    return {"reply": reply}

Run it with uvicorn (uvicorn app:app --reload, assuming the code is saved as app.py) and you have a live server on http://127.0.0.1:8000. Two free wins: FastAPI validates requests against your BaseModel (a bad body gets an automatic 422 error, no manual checking), and it generates an interactive /docs page so you (and others) can try the API in a browser. A /health endpoint that just returns {"status": "ok"} is conventional: monitors and load balancers ping it to check the service is alive.

flowchart LR
  C["any client<br/>(browser, app, teammate)"] -->|"POST /chat {message}"| F["FastAPI (uvicorn)"]
  F --> Fn["your function → model"]
  Fn -->|"{reply}"| C
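
To make the "any client" box concrete, here is a sketch of a client using only Python's standard library. The function names build_chat_request and send_chat are made up for illustration, and the URL assumes the server is running locally on uvicorn's default port.

```python
import json
import urllib.request

# Assumption: the service is running locally (uvicorn app:app --reload).
CHAT_URL = "http://127.0.0.1:8000/chat"


def build_chat_request(message: str, url: str = CHAT_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /chat request carrying {"message": ...}."""
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(message: str, url: str = CHAT_URL, timeout: float = 30.0) -> str:
    """Send the request and return the "reply" field of the JSON response."""
    request = build_chat_request(message, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["reply"]
```

With the server up, send_chat("Hello!") returns whatever your function replied. A website or phone app does the same thing in its own language: one HTTP POST with a JSON body.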

Packaging it: containers

Your app works on your machine, with your Python version and your installed libraries. On someone else's machine (or a server), those differ, and things break. A container fixes this: it bundles your app plus its exact dependencies and Python version into one image that runs identically anywhere. (You met containers in Course 01.) Docker is the tool that builds and runs them, driven by a Dockerfile, a recipe for building the image:

FROM python:3.12-slim        # a known Python, so it's the same everywhere
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt   # install deps into the image
COPY app.py .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]   # how to start it

docker build -t ai-app . makes the image; docker run -p 8000:8000 ai-app runs it. "It works on my laptop" becomes "it works in the container, everywhere." (Pinning python:3.12-slim also dodges the newest-Python install problems you hit with Chroma/LangGraph.)

You don't strictly need Docker to deploy. Running uvicorn on a server is a real deployment. Containers add portability and reproducibility: the same image runs on your laptop, a teammate's, and a cloud host identically. That's why production uses them.

Secrets at deploy time

The rule from M4 gets sharper when you ship: never put your API key in your code or your image. Anyone who pulls the image could read a baked-in key. Instead, pass it at run time:

docker run -p 8000:8000 --env-file .env ai-app

--env-file .env loads the variables in .env (including ANTHROPIC_API_KEY) into the running container; the key never enters the image. Listing .env in .dockerignore (which works like .gitignore) keeps it out of the build context entirely. In real hosting you'd use the platform's secrets manager for the same reason. Secrets live with the running instance, never in the artifact you share.
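
On the Python side, the app then reads the key from its environment. A minimal sketch, assuming you want the service to fail at startup rather than on the first request when the key is missing (load_api_key is a hypothetical helper name):

```python
import os
from typing import Mapping


def load_api_key(
    env: Mapping[str, str] = os.environ,
    name: str = "ANTHROPIC_API_KEY",
) -> str:
    """Return the key from the environment, or fail loudly if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; supply it at run time "
            "(for example: docker run --env-file .env ...)"
        )
    return key
```

Passing env as a parameter is only there to make the function easy to test with a plain dict; in the running service you would call load_api_key() with no arguments.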

Monitoring: watch latency and cost

Once real traffic hits your app, two numbers matter most:

- Latency: how long a request takes. Slow responses drive users away; track it so you notice regressions. (Time the call: start = time.time() … time.time() - start.)
- Cost: every call spends tokens. response.usage gives input_tokens and output_tokens; log them and you can see (and cap) spend. (Recall M4's spend limit and M6's model choice: a cheaper model or a smaller max_tokens directly lowers this.)
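
Turning token counts into money is plain arithmetic. The sketch below assumes prices are quoted per million tokens, with separate input and output rates; the rates in the usage lines are placeholders, not real prices, so look up your model's current ones.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_usd_per_mtok: float,
    output_usd_per_mtok: float,
) -> float:
    """Estimate the cost of one call from its token counts.

    The *_usd_per_mtok arguments are dollars per one million tokens.
    """
    total = input_tokens * input_usd_per_mtok + output_tokens * output_usd_per_mtok
    return total / 1_000_000


# Placeholder rates for illustration only (not any real model's pricing).
EXAMPLE_INPUT_RATE = 3.0
EXAMPLE_OUTPUT_RATE = 15.0

cost = estimate_cost_usd(2_000, 500, EXAMPLE_INPUT_RATE, EXAMPLE_OUTPUT_RATE)
print(f"estimated cost: ${cost:.4f}")
```

Summing this per request gives a running spend figure you can compare against a budget.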

Even a single log line per request, latency=… in_tokens=… out_tokens=…, is real monitoring: it's how you spot a slow endpoint or a runaway bill before your users (or your card) do. Bigger systems add dashboards and tracing, but the habit starts here.
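
One way that log line might be produced is sketched below. call_with_metrics and format_metrics are hypothetical helper names, and the fake response only mimics the response.usage fields named above. The sketch uses time.perf_counter() instead of time.time() because it is a monotonic clock meant for measuring intervals.

```python
import logging
import time
from types import SimpleNamespace
from typing import Any, Callable

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ai_app")


def format_metrics(latency_s: float, input_tokens: int, output_tokens: int) -> str:
    """Build the one-line-per-request monitoring record."""
    return f"latency={latency_s:.3f}s in_tokens={input_tokens} out_tokens={output_tokens}"


def call_with_metrics(call_model: Callable[[], Any]) -> Any:
    """Run the model call, log its latency and token usage, return the response."""
    start = time.perf_counter()
    response = call_model()
    latency_s = time.perf_counter() - start
    usage = response.usage  # assumed to expose input_tokens and output_tokens
    logger.info(format_metrics(latency_s, usage.input_tokens, usage.output_tokens))
    return response


# Fake response so the sketch runs without an API key or network access.
fake_response = SimpleNamespace(
    usage=SimpleNamespace(input_tokens=12, output_tokens=34)
)
result = call_with_metrics(lambda: fake_response)
```

In the real service you would pass a function that makes the actual model call, for example from inside the /chat endpoint.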

Where it can run

Your container can run in lots of places: your own server, a container host (Cloud Run, Render, Fly, ECS), or a Platform-as-a-Service. The details differ, but the shape is always the same: build an image, run it with the key supplied at run time, expose a port. Learn the shape once (you just did) and the specific host is a manual page away.

Go deeper (optional, not needed for today's win)

- **Async & concurrency:** FastAPI supports `async def` endpoints; for many simultaneous users an async client and multiple uvicorn workers help throughput.
- **Streaming over HTTP:** you can stream the model's reply to the client (Server-Sent Events) so words appear live: M6's streaming, served.
- **Smaller images:** `python:3.12-slim` is already lean; multi-stage builds make images smaller, and pinning exact versions in `requirements.txt` makes builds reproducible.
- **Rate limiting & auth:** a public endpoint needs request limits and an auth check (an API key of *your own* for callers); otherwise anyone can spend your tokens (M10's unbounded-consumption risk).
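
For the last bullet, here is a minimal in-memory sketch of both checks using only the standard library. SlidingWindowLimiter and is_valid_caller_key are hypothetical names, and this version only works for a single process: with several uvicorn workers or several containers you would need shared storage, which is out of scope here.

```python
import hmac
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per caller within the last window_s seconds."""

    def __init__(
        self,
        max_requests: int,
        window_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_requests = max_requests
        self._window_s = window_s
        self._clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, caller_id: str) -> bool:
        now = self._clock()
        hits = self._hits[caller_id]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self._window_s:
            hits.popleft()
        if len(hits) >= self._max_requests:
            return False  # over the limit; the endpoint should reject the call
        hits.append(now)
        return True


def is_valid_caller_key(presented: str, expected: str) -> bool:
    """Compare caller keys in constant time to avoid leaking timing information."""
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

An endpoint would check is_valid_caller_key first, then limiter.allow(caller_id), and only then spend tokens on the model call. The injectable clock exists so the limiter can be tested without sleeping.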

Check yourself

Lock in today's win: answer each in your head, then reveal.

1. What does wrapping your app in a web API (FastAPI) let you do that a script can't?

Show answer

Let anything (a browser, an app, or a teammate's code) call it over HTTP and get a response, instead of only you running it in a terminal. FastAPI also validates requests and gives you an interactive /docs test page for free.

2. What problem do containers solve?

Show answer

"Works on my machine" breaks elsewhere because Python versions and libraries differ. A container bundles your app + its exact dependencies into one image that runs identically anywhere: portability and reproducibility. Docker builds and runs it from a Dockerfile.

3. Where does your API key go when you deploy, and where does it NOT go?

Show answer

It goes into the running instance at run time (docker run --env-file .env, or a host's secrets manager). It does NOT go in your code or the image: a baked-in key leaks to anyone with the image. Listing .env in .dockerignore and .gitignore keeps it out of both.

4. What two numbers should even basic monitoring track, and why?

Show answer

Latency (how long each request takes; slow responses drive users away) and cost (tokens per call, from response.usage, so you can see and cap spend). One log line per request with both is real monitoring.

5. Do you have to use Docker to deploy?

Show answer

No, running uvicorn on a server is a real deployment. Docker adds portability and reproducibility (the same image runs identically everywhere), which is why production prefers it, but the FastAPI service itself is the deployable unit.

New words (also in resources/glossary.md): deployment, web API, endpoint, FastAPI, uvicorn, /health, container, image, Docker, Dockerfile, .dockerignore, latency, monitoring, secrets manager.

Source: original, written for this course. FastAPI/uvicorn and Docker usage follow their official docs; the example service was verified to run (endpoints tested with FastAPI's TestClient, the model call mocked; the Dockerfile structure-checked, see the solution README). Containers callback to Course 01. No third-party text or figures; diagrams are original.