M29 notes: Agent deployment and serving (the one idea)
The one idea: serving an agent in production is not about the agent at all; it is about the operational shell around it. To be dependable for real users, the same model call that works on your laptop needs configuration from the environment, secrets kept out of the code, health and readiness signals, graceful startup and shutdown, and statelessness so you can run many copies. Get the shell right and your agent stays up, scales, and is safe to change.
1. Config from the environment (12-factor)
The code should be identical on your laptop, in staging, and in production. The only thing that differs
is configuration, and it comes from environment variables, not from values baked into the code.
config.py reads model, step caps, timeouts, log level, port, and the API key from the environment,
with sane defaults. Benefits: one container image runs anywhere just by changing env vars; no secret
ever lands in the repo; and you can validate() config at startup and fail fast instead of
discovering a bad setting mid-request.
Secrets (the API key) are config too, but special: they come from the platform's secret store or
environment, never a committed file, and you never log their value. redacted() logs only whether the
key is set, not the key itself.
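A minimal sketch of the pattern described above, not the module's actual config.py. The environment variable names (AGENT_MODEL, AGENT_MAX_STEPS, AGENT_TIMEOUT_S, AGENT_API_KEY, LOG_LEVEL, PORT), the defaults, and the load_config helper are assumptions made for illustration; only validate() and redacted() are named in the notes.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Config:
    model: str
    max_steps: int
    timeout_s: float
    log_level: str
    port: int
    api_key: str

    def validate(self) -> None:
        """Raise ValueError at startup instead of failing mid-request."""
        problems = []
        if not self.api_key:
            problems.append("AGENT_API_KEY is not set")
        if self.max_steps < 1:
            problems.append("AGENT_MAX_STEPS must be >= 1")
        if self.timeout_s <= 0:
            problems.append("AGENT_TIMEOUT_S must be > 0")
        if not 1 <= self.port <= 65535:
            problems.append("PORT must be between 1 and 65535")
        if problems:
            raise ValueError("; ".join(problems))

    def redacted(self) -> dict:
        """Loggable view: reports whether the key is set, never its value."""
        return {
            "model": self.model,
            "max_steps": self.max_steps,
            "timeout_s": self.timeout_s,
            "log_level": self.log_level,
            "port": self.port,
            "api_key_set": bool(self.api_key),
        }


def load_config(env: Mapping[str, str] = os.environ) -> Config:
    # Variable names and defaults are hypothetical; the API key has no default.
    return Config(
        model=env.get("AGENT_MODEL", "example-model"),
        max_steps=int(env.get("AGENT_MAX_STEPS", "8")),
        timeout_s=float(env.get("AGENT_TIMEOUT_S", "30")),
        log_level=env.get("LOG_LEVEL", "INFO"),
        port=int(env.get("PORT", "8000")),
        api_key=env.get("AGENT_API_KEY", ""),
    )
```

Passing the environment in as a mapping keeps the loader easy to test: load_config({"AGENT_API_KEY": "x"}) builds a config without touching the real process environment.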
Analogy. The same car ships worldwide; what changes by country is the fuel you put in and the plate you bolt on, supplied locally, not welded in at the factory. Config is the fuel and plate.
2. Two different health questions: liveness vs readiness
Orchestrators (Kubernetes, ECS, Cloud Run, and friends) ask a service two separate questions, and conflating them causes outages:
- Liveness (/healthz): is the process alive? If this fails, the platform restarts the container. It should be cheap and not depend on downstream services (or a slow database makes the platform kill a perfectly good process).
- Readiness (/readyz): should this instance receive traffic right now? It returns 503 while the app is starting up (warming caches, opening connections) or draining during shutdown. The load balancer holds traffic back until ready. This is what prevents requests hitting a half-started replica.
In app.py, /healthz is always 200 once the process runs; /readyz (and /chat) return 503 until
the lifespan marks the app ready. The lab shows the difference directly.
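The difference between the two probes can be sketched without any web framework. This is an illustration under assumptions, not the module's app.py: AppState, the handler signatures, and the response bodies are hypothetical, and a real service would register these functions as routes.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AppState:
    ready: bool = False  # flipped on after warm-up, off again while draining


def healthz(state: AppState) -> tuple[int, dict]:
    # Liveness: 200 whenever the process can run this function at all.
    # Deliberately ignores readiness and any downstream service.
    return 200, {"status": "alive"}


def readyz(state: AppState) -> tuple[int, dict]:
    # Readiness: 503 until startup has finished, and again during shutdown.
    if not state.ready:
        return 503, {"status": "not ready"}
    return 200, {"status": "ready"}


def chat(state: AppState, session_id: str, message: str) -> tuple[int, dict]:
    # Work endpoints refuse traffic for the same reason /readyz does.
    if not state.ready:
        return 503, {"error": "service not ready"}
    return 200, {"session_id": session_id, "reply": f"(stub) you said: {message}"}
```

Before the ready flag is set, healthz answers 200 while readyz and chat answer 503; that gap is the window in which the platform keeps the container alive but sends it no traffic.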
3. Graceful startup and shutdown
A service should warm up before taking traffic and drain before dying. The FastAPI lifespan handles both: on startup it validates config (and refuses to start if it is bad), warms what it needs, then flips readiness on; on shutdown it flips readiness off so the load balancer stops sending new requests while in-flight ones finish. Without graceful shutdown, a deploy or scale-down drops live requests.
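A sketch of that startup and shutdown sequence as an async context manager, which is the shape FastAPI accepts through its lifespan parameter. The validate_config helper, the config dict, and the SimpleNamespace stand-in for the app object are assumptions so the example runs with the standard library alone; the warm-up step is a placeholder.

```python
import asyncio
from contextlib import asynccontextmanager
from types import SimpleNamespace


def validate_config(config: dict) -> None:
    # Hypothetical check: refuse to start rather than fail mid-request.
    if not config.get("api_key"):
        raise RuntimeError("refusing to start: api_key is not set")


@asynccontextmanager
async def lifespan(app):
    validate_config(app.state.config)  # fail fast on bad config
    await asyncio.sleep(0)             # placeholder for warming caches/connections
    app.state.ready = True             # readiness on: traffic may arrive
    try:
        yield                          # the service handles requests here
    finally:
        app.state.ready = False        # readiness off: start draining


async def demo() -> list:
    # Stand-in for the framework's app object, which exposes app.state.
    app = SimpleNamespace(
        state=SimpleNamespace(config={"api_key": "set"}, ready=False)
    )
    seen = []
    async with lifespan(app):
        seen.append(app.state.ready)   # ready while serving
    seen.append(app.state.ready)       # not ready after shutdown
    return seen
```

With FastAPI installed, the same function would be passed as FastAPI(lifespan=lifespan). Flipping the flag off only stops new traffic; letting in-flight requests finish is the job of the server's shutdown grace period.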
4. Statelessness, the key to scaling
To serve more users you run more replicas of the same container behind a load balancer, and a
request might hit any of them. That only works if each replica is stateless: it keeps no
per-process memory that a later request depends on. This is exactly the caution from M21: convenient
in-process memory breaks horizontal scaling, because replica B does not have what the user told replica
A. The fix is to keep state outside the process (a database, cache, or the M21 store), keyed by a
session id the request carries. app.py takes a session_id and keeps no server-side session memory,
so any replica can serve any request.
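To make the replica A / replica B point concrete, here is a small sketch in which two replicas share one external store. ExternalStore and Replica are hypothetical names; the in-memory dict only stands in for a database, cache, or the M21 store, which would live outside every replica's process.

```python
from __future__ import annotations


class ExternalStore:
    """Stand-in for a database or cache shared by all replicas."""

    def __init__(self) -> None:
        self._data: dict[str, list[str]] = {}

    def append(self, session_id: str, message: str) -> None:
        self._data.setdefault(session_id, []).append(message)

    def history(self, session_id: str) -> list[str]:
        return list(self._data.get(session_id, []))


class Replica:
    """Holds no per-session memory, only a handle to the shared store."""

    def __init__(self, name: str, store: ExternalStore) -> None:
        self.name = name
        self.store = store

    def handle(self, session_id: str, message: str) -> dict:
        # All session state is read and written through the store,
        # keyed by the session id the request carries.
        self.store.append(session_id, message)
        return {"served_by": self.name, "history": self.store.history(session_id)}


store = ExternalStore()
replica_a = Replica("A", store)
replica_b = Replica("B", store)

replica_a.handle("s1", "my name is Ada")
result = replica_b.handle("s1", "what is my name?")
```

Replica B sees what the user told replica A because neither replica owns the history. If each Replica kept its own dict instead, the second call would start from an empty session.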
5. Concurrency
One slow request must not freeze the others. Two common models: run several worker processes
(uvicorn --workers N, scaled to CPU cores) so requests run in parallel, and write async handlers
so a worker can interleave I/O-bound work (waiting on the model API) across requests. For agent
services, which spend most of their time waiting on the model, both help; start with a few workers and
measure (M20).
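The async half of this can be sketched with the standard library. call_model is a hypothetical stand-in that sleeps instead of calling a real model API, and the 0.2 second delay is arbitrary; the point is that one worker overlaps the waits rather than serving requests one after another.

```python
import asyncio
import time


async def call_model(prompt: str, delay_s: float = 0.2) -> str:
    # Stands in for waiting on the model API (I/O-bound, not CPU-bound).
    await asyncio.sleep(delay_s)
    return f"reply to {prompt}"


async def handle_many(prompts: list) -> list:
    # While one request awaits the model, the event loop runs the others.
    return await asyncio.gather(*(call_model(p) for p in prompts))


def timed_run(prompts: list) -> tuple:
    start = time.perf_counter()
    replies = asyncio.run(handle_many(prompts))
    return replies, time.perf_counter() - start
```

Calling timed_run(["a", "b", "c"]) should take roughly one delay rather than three, because the waits overlap. CPU-bound work would not benefit in the same way, which is where extra worker processes come in.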
6. Containerization done right
The Dockerfile shows the production basics: a slim base image, dependencies installed in their
own layer (so code changes do not rebuild the world), a non-root user (never run a public service
as root), an EXPOSEd port, a HEALTHCHECK that hits the liveness probe, and pinned
dependencies so every build and every replica is identical. These are small habits that prevent real
incidents.
7. Putting it in front of users
Beyond this module, a real deployment adds: a load balancer or ingress, autoscaling rules (scale on CPU or request count), centralized logs and metrics (M20), the cost controls from M25, the eval gate in CI before deploy (M26), and a rollout strategy (rolling or blue/green) so a bad version can be rolled back. The service shape here is what those tools attach to.
Words you will hear
12-factor config, environment variable, secret store, liveness vs readiness probe, graceful shutdown / draining, fail fast, stateless / horizontal scaling, replica / load balancer, worker vs async concurrency, non-root container, pinned dependencies. Full definitions in the glossary.