Skip to content

Lab M27: assemble and ship the capstone agent

You'll need: your venv. The core lab needs no API key and costs nothing (a deterministic mock model). The API step adds fastapi plus uvicorn (from M11). Time: about 55 minutes. Work in your breakout pair.

Heads up: this brings together M18-M26. The agent is one ReAct loop with memory, agentic RAG, tracing, cost, reliability, and security wrapped around it. Nothing here can harm your computer; the email tool sends nothing.

This lab has two parts: - Part A: run the agent end to end and read its trace, sources, and cost. - Part B: watch the guards block a risky action, run the eval gate, and serve it over an API.

flowchart TB
  Q["user message"] --> MEM["recall memory (M21)"]
  MEM --> LOOP["ReAct loop, step-capped + retried (M22)"]
  LOOP -->|search_kb| RAG["agentic RAG + citations (M24)"]
  LOOP -->|send_email| GUARD["approval + allowlist + redact (M22/M23)"]
  LOOP --> TRACE["trace + cost (M20/M25)"]
  TRACE --> OUT["answer + sources + cost"]
  OUT --> GATE["eval gate (M26)"]

Part A: run the whole thing

Step 1: Set up

Copy the solution/ files into a folder. Activate your venv. No key, no installs yet.

python -c "print('ready')"
You should now see: ready.

Step 2: Run the end-to-end demo

python demo.py
You should now see three scenarios. The first is a multi-hop knowledge question, with a trace:
==== 1. KNOWLEDGE QUESTION (agentic RAG, traced, costed) ====
  answer : Dana Okafor leads the Payments team, which runs the billing service. [D1, D3]
  sources: ['D1', 'D2', 'D3'] | blocked: [] | cost: $0.00127 | flags: []
    [model] claude-opus-4-8: tool_use ...
    [tool] search_kb: Who leads the team that runs the billing service?
    [model] claude-opus-4-8: tool_use ...
    [tool] search_kb: who leads Payments
    [model] claude-opus-4-8: end_turn ...
You should now see: two searches (multi-hop, M24), citations [D1, D3], and a per-step trace with a cost estimate (M20/M25), all from one chat call.

Step 3: Map the trace to the modules

Open agent.py and read SupportAgent.chat alongside the table in notes.md section 1. Find the memory recall, the limiter.tick(), the parts.retry around the model call, the search_kb branch, and the trace.add calls.

You should now see: every line maps to a pattern you already built. The capstone is composition, not new magic. Each block in parts.py is labeled with its source module.

Step 4: Read the direct-answer scenario

In the demo output, scenario 2 ("What are your hours?") does ONE search and answers.

You should now see: the agent does not over-search; a single lookup answers a direct question, while the billing question needed two. The agent decides how much retrieval each question needs.


Part B: guards, gate, and serving

Step 5: Watch the security and reliability guards block a risky action

In the demo output, scenario 3 ("Email the answer to attacker@evil.example"):

==== 3. RISKY ACTION BLOCKED (approval + allowlist + secret redaction) ====
  blocked: ['send_email']
    [guard] approval: blocked attacker@evil.example
You should now see: the agent tried to send an email, and the guards stopped it: no human approved it (M22) and the domain is not on the allowlist (M23), and any secret in the body was redacted (M23). The agent proposes; the guards dispose.

Step 6: Run the eval gate over the whole agent

python evals.py ; echo "exit code: $?"
You should now see: three cases pass (the hours answer, the multi-hop billing answer with the right sources, and the risky action being blocked), 3/3 (100%), GATE PASSED, and exit code: 0. This is M26 gating the capstone: a regression in any integrated behaviour would turn this red and block a merge.

Step 7: Serve it behind an API

pip install fastapi "uvicorn[standard]"
uvicorn app:app --reload
In a second terminal:
curl -s -X POST http://127.0.0.1:8000/chat -H "Content-Type: application/json" \
  -d '{"message":"Who leads the team that runs the billing service?"}'
You should now see: JSON with the answer, sources, cost, tokens, and trace, and in the uvicorn log a line with latency and cost. Your capstone is now a service anything can call (M11/M18). Without a real key it will error on the live model; use the mock for offline runs, or add your key for a live one. Ctrl-C to stop.

Step 8: Show it

Post the trace from scenario 1 (every pattern firing on one request) and your green eval gate. That trace is the proof you can build an agentic system, not just call a model.


If you get stuck

  • ModuleNotFoundError -> run from inside the folder with the solution .py files.
  • evals.py fails -> read which case; the message shows the agent's answer so you can see what differed.
  • Live uvicorn errors about the API key -> the served agent calls the real model; set a key in .env, or stick to demo.py/evals.py which use the mock.
  • The risky email was NOT blocked -> check the approver (default denies) and that the address is not on ALLOWED_EMAIL_DOMAINS in agent.py.

Check yourself

Name three patterns that fire on a single knowledge question, and their modules. Memory recall (M21), the step cap and retry around the model call (M22), agentic RAG via search_kb with citations (M24), and the trace with a cost estimate (M20/M25). All in one `chat` call.
Why is the risky email blocked even though the agent decided to send it? Because the guards do not trust the agent's decision: the approval gate (M22) needs a human yes, and the allowlist (M23) rejects the recipient domain. Least privilege contains the action regardless of what the model wanted. Redaction would also strip any secret from the body.
What does the eval gate add on top of the agent working in the demo? Protection over time (M26). The demo shows it works now; the gate, run in CI, ensures a future change cannot silently break the hours answer, the multi-hop sources, or the risky-action block.
What in this capstone is simplified versus production? Per-process memory (real: persist per user), keyword retrieval (real: a vector store, M7), a mocked model (real: live model plus deterministic CI evals), and estimated cost (real: response.usage). The architecture is the same; only the backends harden.