Lab M28: stream your agent to a UI
You'll need: your venv. The core lab needs no API key and costs nothing (a streaming mock). The
SSE step adds fastapi plus uvicorn (from M11). Time: about 45 minutes. Work in your breakout pair.
Heads up: the new idea is small but high-impact. The agent emits events as it works (it is a generator), and a UI renders them live. You will see progress, the answer streaming token by token, citations, cost, and cancellation. Nothing here can harm your computer.
This lab has two parts:
- Part A: run the streamed agent and watch progress, live tokens, citations, and cost.
- Part B: cancel a run, then serve the stream over Server-Sent Events.
flowchart LR
Q["question"] --> AG["chat_stream (generator)"]
AG -->|status| UI["UI / renderer"]
AG -->|tool + citation| UI
AG -->|token token token| UI
AG -->|cost, done| UI
UI -->|user hits stop| CANCEL["close stream: agent stops, no more cost"]
Part A: watch it stream
Step 1: Set up
Copy the solution/ files into a folder. Activate your venv. No key, no installs yet.
python -c "print('ready')"
ready.
Step 2: Run the streamed demo
python demo.py
==== STREAMED RESPONSE (watch it work, then answer live) ====
... thinking
... searching the knowledge base
[tool] search_kb query='billing Payments team' found=2
... writing the answer
Dana Okafor leads the Payments team, which runs billing. [D1, D3]
[cost $0.00049, 45 tokens]
[sources: D1, D3]
Step 3: See that the answer really streams
Open streaming_agent.py. Find the loop that yields one token event
per chunk. Open events.py and find where render writes each token with
flush().
You should now see: the answer is emitted as many small token events, not one blob. That is what
lets a UI show words appearing live (time-to-first-token), the single biggest perceived-speed win (M6).
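The pattern in Step 3 can be sketched as below. This is a minimal illustration, not the lab's actual code: it assumes events are plain dicts with a "type" key, and fake_token_stream (with its split-on-whitespace chunking) is a hypothetical stand-in for the mock in streaming_agent.py. The real render in events.py may handle more event types.

```python
import sys
from typing import Iterable, Iterator


def fake_token_stream(answer: str) -> Iterator[dict]:
    """Yield one token event per whitespace-delimited chunk.

    Hypothetical stand-in for the streaming mock; the dict shape is an assumption.
    """
    for word in answer.split():
        yield {"type": "token", "text": word + " "}


def render(events: Iterable[dict], out=sys.stdout) -> None:
    """Write each token as soon as it arrives, flushing so it shows immediately."""
    for event in events:
        if event["type"] == "token":
            out.write(event["text"])
            out.flush()  # without this, output may sit in a buffer until the end
    out.write("\n")


if __name__ == "__main__":
    render(fake_token_stream("Dana Okafor leads the Payments team."))
```

Because render pulls events one at a time, the first word can appear before the last one has been generated. That gap is where the time-to-first-token win comes from.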
Step 4: See the event vocabulary
Read the top of events.py: status, tool, citation, token, cost,
done, error.
You should now see: the agent communicates entirely through these small events, and it does not care whether the renderer is a terminal, a web page, or a test. Emit events; let the consumer display them.
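To make the "emit events; let the consumer display them" idea concrete, here is a small sketch under assumptions. The seven type names come from the list above. The event helper, the sample_run emitter, and field names such as text and doc_id are hypothetical and may not match events.py.

```python
from typing import Iterable, Iterator, List

# Event types named in the lab text; everything else here is illustrative.
EVENT_TYPES = {"status", "tool", "citation", "token", "cost", "done", "error"}


def event(kind: str, **fields) -> dict:
    """Build an event dict, rejecting types outside the vocabulary."""
    if kind not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {kind}")
    return {"type": kind, **fields}


def sample_run() -> Iterator[dict]:
    """Hypothetical emitter: it only yields events and knows nothing about display."""
    yield event("status", text="thinking")
    yield event("token", text="Dana ")
    yield event("token", text="Okafor")
    yield event("citation", doc_id="D1")  # doc_id is an assumed field name
    yield event("done")


def collect_answer(events: Iterable[dict]) -> str:
    """A test-style consumer: keep only the token text, ignore everything else."""
    return "".join(e["text"] for e in events if e["type"] == "token")


def collect_sources(events: Iterable[dict]) -> List[str]:
    """Another consumer over the same stream: keep only the cited document ids."""
    return [e["doc_id"] for e in events if e["type"] == "citation"]
```

The same sample_run() stream could feed a terminal renderer, a web page, or these collectors without the emitter changing. That is the design choice this step points at.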
Part B: cancellation and serving
Step 5: Cancel a run
The demo's third section stops after three tokens.
==== CANCELLATION (user hits stop after a few tokens) ====
Dana Okafor leads
[cancelled by user; no further tokens generated, no further cost]
You should now see: the cost and done events never fired. Because
chat_stream is a generator, closing it means the agent does no more work and incurs no more cost.
Cancellation came free from streaming.
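Why closing a generator stops the work can be seen in a short sketch. It assumes a simplified generator (chat_stream_sketch, hypothetical) in which appending to work_done stands in for a paid model call. The real chat_stream is likely more involved.

```python
from typing import Iterable, Iterator, List

work_done: List[str] = []  # records each unit of pretend model work


def chat_stream_sketch(words: Iterable[str]) -> Iterator[dict]:
    """Hypothetical stand-in for chat_stream: work happens lazily, one token per step."""
    count = 0
    try:
        for word in words:
            work_done.append(word)  # stands in for a paid model call
            count += 1
            yield {"type": "token", "text": word}
        yield {"type": "cost", "tokens": count}
        yield {"type": "done"}
    finally:
        # Runs on normal completion and on close(); a place to release resources.
        work_done.append("<cleanup>")


def run_with_stop(words: Iterable[str], stop_after: int) -> List[dict]:
    """Consume events, then simulate the user hitting stop after stop_after events."""
    seen = []
    stream = chat_stream_sketch(words)
    for ev in stream:
        seen.append(ev)
        if len(seen) == stop_after:
            stream.close()  # raises GeneratorExit inside the generator at its paused yield
            break
    return seen
```

Calling run_with_stop(["Dana", "Okafor", "leads", "the", "Payments", "team"], 3) returns three token events. The remaining words are never processed, and the cost and done events are never yielded, because the generator body only advances when the consumer asks for the next event.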
Step 6: Serve it over Server-Sent Events
pip install fastapi "uvicorn[standard]"
uvicorn app:app --reload
curl -N -X POST http://127.0.0.1:8000/chat/stream -H "Content-Type: application/json" \
-d '{"message":"Who leads the team that runs billing?"}'
data: {"type": "status", "text": "thinking"}
data: {"type": "status", "text": "searching the knowledge base"}
...
data: {"type": "token", "text": "Dana "}
...
data: {"type": "done"}
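You should now see: each event arrives as one data: <json> line, sent live as the agent works. A browser's
EventSource reads exactly this frame format and updates the page (EventSource itself only issues GET requests, so a browser would read a POST endpoint like this one with fetch and a stream reader instead). Open app.py: the
StreamingResponse with media type text/event-stream is the whole trick. Press Ctrl-C to stop the server. (curl -N
disables buffering so you see frames arrive live.)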
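The framing itself is simple enough to sketch with the standard library alone. The helper names below (to_sse, sse_frames, parse_sse) are hypothetical and not taken from app.py. The sketch covers only single-line data: frames like the ones shown above, not the full Server-Sent Events format.

```python
import json
from typing import Iterable, Iterator, List


def to_sse(event: dict) -> str:
    """Frame one event as an SSE message: a 'data:' line followed by a blank line."""
    return f"data: {json.dumps(event)}\n\n"


def sse_frames(events: Iterable[dict]) -> Iterator[str]:
    """Lazily turn an event stream into SSE frames, one frame per event."""
    for event in events:
        yield to_sse(event)


def parse_sse(body: str) -> List[dict]:
    """Roughly what a client does: split on blank lines, strip 'data: ', decode JSON."""
    events = []
    for frame in body.split("\n\n"):
        if frame.startswith("data: "):
            events.append(json.loads(frame[len("data: "):]))
    return events
```

A generator like sse_frames(...) is the kind of iterable a StreamingResponse with media type text/event-stream can wrap. Since it stays lazy, each frame is produced only when the server is ready to send it.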
Step 7: Show it
Post your streamed run from Step 2 (progress, then the live answer with sources and cost), and one sentence on why streaming made it feel better even though the answer was the same.
If you get stuck
- ModuleNotFoundError -> run from inside the folder with the solution .py files.
- curl shows everything at once, not streaming -> add -N (no buffering); some terminals still batch. The frames are still sent live.
- ModuleNotFoundError: fastapi -> pip install fastapi "uvicorn[standard]" (from M11). Steps 1 to 5 need no install.
- Citations do not match the answer -> the search query must retrieve the cited docs; read corpus.search and the mock's query.