M29: Agent deployment and serving (Part D: Agentic Systems)

"It runs on my laptop" and "it serves thousands of users without falling over" are different skills. M11 got your app into a container; today you make it production-grade: configuration from the environment instead of hardcoded, secrets that never touch the repo, health and readiness probes so the platform knows when to send traffic, graceful startup and shutdown, and statelessness so you can run many copies behind a load balancer. This is the operations layer that makes an agent dependable.

Today's win: a production-shaped agent service with config from the environment, liveness and readiness probes, graceful lifecycle, request logging, and a container that scales, all verified offline.

Today you will

Read all config from the environment (12-factor), and keep secrets out of the code
Add liveness (/healthz) and readiness (/readyz) probes, and see why they differ
Handle startup and shutdown gracefully (warm up, then drain) and fail fast on bad config
Make the service stateless so it scales horizontally behind a load balancer
Containerize it properly: slim image, non-root user, healthcheck, pinned dependencies

Run of show (about 55 minutes)

Time	What we do
0:00	Hook: laptop vs production
0:05	The one idea: config in the env, probes, statelessness (read `notes.md`)
0:12	Lab Part A: config from the environment, and the two probes
0:30	Lab Part B: readiness gating, statelessness, and the production Dockerfile
0:50	Show: post your probes and a stateless run
0:55	Wrap

If you get stuck

Builds on M11 (deployment, Docker) and M21 (why per-process memory blocks scaling). The core lab runs offline, free, no key via FastAPI's TestClient; it needs fastapi plus uvicorn (from M11). Building the Docker image is the only step that needs Docker (optional).
Nothing here can harm your computer. The agent is a stub so the focus stays on serving; you wire the M27 agent into handle for the real thing.
The mental model: liveness = "is the process alive (restart if not)"; readiness = "should it get traffic yet (wait if not)". Two different questions.

Optional challenge

Open starters/add_serving_feature.py and add one real serving feature: a /metrics endpoint, a per-client rate limit returning 429, async concurrency, or real per-session memory loaded from a store (keeping the process stateless so replicas still scale).