Lab M28: stream your agent to a UI

You'll need: your venv. The core lab needs no API key and costs nothing (a streaming mock). The SSE step adds fastapi plus uvicorn (from M11). Time: about 45 minutes. Work in your breakout pair.

Heads up: the new idea is small but high-impact. The agent EMITS events as it works (a generator), and a UI renders them live. You will see progress, the answer streaming token by token, citations, cost, and cancellation. Nothing here can harm your computer.

This lab has two parts:
- Part A: run the streamed agent and watch progress, live tokens, citations, and cost.
- Part B: cancel a run, then serve the stream over Server-Sent Events.

flowchart LR
  Q["question"] --> AG["chat_stream (generator)"]
  AG -->|status| UI["UI / renderer"]
  AG -->|tool + citation| UI
  AG -->|token token token| UI
  AG -->|cost, done| UI
  UI -->|user hits stop| CANCEL["close stream: agent stops, no more cost"]

Part A: watch it stream

Step 1: Set up

Copy the solution/ files into a folder. Activate your venv. No key, no installs yet.

python -c "print('ready')"

You should now see: ready.

Step 2: Run the streamed demo

python demo.py

You should now see the agent narrate its work, then stream the answer, then show sources and cost:

==== STREAMED RESPONSE (watch it work, then answer live) ====
  ... thinking
  ... searching the knowledge base
  [tool] search_kb query='billing Payments team' found=2
  ... writing the answer
Dana Okafor leads the Payments team, which runs billing. [D1, D3]
  [cost $0.00049, 45 tokens]
  [sources: D1, D3]

You should now see: progress notes BEFORE the answer (so there is no dead blank screen), then the answer, then its citations and cost. The user always knows what is happening.

Step 3: See that the answer really streams

Open streaming_agent.py. Find the loop that yields one token event per chunk. Open events.py and find where render writes each token with flush().

You should now see: the answer is emitted as many small token events, not one blob. That is what lets a UI show words as they arrive. The first word appears almost immediately (a short time-to-first-token), which is the single biggest perceived-speed win (M6).
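
To make the pattern concrete, here is a minimal sketch of "yield one token event per chunk, then write and flush each one". It is not the contents of streaming_agent.py or events.py: the names stream_tokens and render_token, the word-sized chunks, and the dict event shape are all assumptions for illustration.

```python
import io
import sys


def stream_tokens(answer):
    """Yield one token event per word-sized chunk (hypothetical event shape)."""
    for word in answer.split(" "):
        yield {"type": "token", "text": word + " "}


def render_token(event, out=sys.stdout):
    """Write the token right away and flush, so it shows before the next arrives."""
    out.write(event["text"])
    out.flush()


def demo():
    """Render into a buffer and report how many separate events were written."""
    buf = io.StringIO()
    count = 0
    for event in stream_tokens("Dana Okafor leads the Payments team"):
        render_token(event, out=buf)
        count += 1
    return count, buf.getvalue()
```

Calling demo() shows that one sentence becomes several separate writes. Without the flush, many terminals would hold the text in a buffer and show it all at once.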

Step 4: See the event vocabulary

Read the top of events.py: status, tool, citation, token, cost, done, error.

You should now see: the agent communicates entirely through these small events, and it does not care whether the renderer is a terminal, a web page, or a test. Emit events; let the consumer display them.
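
The "emit events; let the consumer display them" idea can be sketched as one event source read by two unrelated consumers. This is an assumed shape, not the real events.py: fake_agent, answer_text, event_log, and the dict fields are hypothetical; only the seven event type names come from the lab.

```python
EVENT_TYPES = {"status", "tool", "citation", "token", "cost", "done", "error"}


def fake_agent():
    """Stand-in for the agent: it only yields events and never prints."""
    yield {"type": "status", "text": "thinking"}
    yield {"type": "token", "text": "Dana "}
    yield {"type": "token", "text": "Okafor"}
    yield {"type": "citation", "ids": ["D1", "D3"]}
    yield {"type": "done"}


def answer_text(events):
    """Consumer A (what a UI does): join token events into the visible answer."""
    return "".join(ev["text"] for ev in events if ev["type"] == "token")


def event_log(events):
    """Consumer B (what a test does): record the order of event types."""
    types = [ev["type"] for ev in events]
    unknown = set(types) - EVENT_TYPES
    if unknown:
        raise ValueError(f"unknown event types: {sorted(unknown)}")
    return types
```

Both consumers read the same stream, and fake_agent has no idea which one is listening. That separation is what makes the agent easy to test.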

Part B: cancellation and serving

Step 5: Cancel a run

Look at the third section of the demo.py output. It stops the stream after three tokens:

==== CANCELLATION (user hits stop after a few tokens) ====
Dana Okafor leads
  [cancelled by user; no further tokens generated, no further cost]

You should now see: the stream stopped early and the cost and done events never fired. Because chat_stream is a generator, closing it means the agent does no more work and incurs no more cost. Cancellation came free from streaming.
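
Why closing a generator stops the work can be shown with a small sketch. It assumes a stand-in named chat_stream_sketch (not the lab's real chat_stream) that logs every piece of work it actually performs, so you can check what never ran.

```python
def chat_stream_sketch(tokens, work_log):
    """Hypothetical stand-in for chat_stream: logs each token it really generates."""
    try:
        for tok in tokens:
            work_log.append(tok)       # work happens only when the consumer asks
            yield {"type": "token", "text": tok}
        work_log.append("cost")        # reached only if the stream runs to the end
        yield {"type": "cost"}
        yield {"type": "done"}
    finally:
        work_log.append("closed")      # runs on a normal finish and on close()


def run_with_stop(max_tokens):
    """Consume the stream, then stop early as if the user hit stop."""
    work_log, shown = [], []
    answer = ["Dana ", "Okafor ", "leads ", "the ", "Payments ", "team."]
    stream = chat_stream_sketch(answer, work_log)
    for event in stream:
        if event["type"] == "token":
            shown.append(event["text"])
        if len(shown) == max_tokens:
            break                      # the user hits stop
    stream.close()                     # raises GeneratorExit inside the generator
    return shown, work_log
```

With run_with_stop(3), the work log holds three tokens and "closed", but never "cost". The remaining tokens were never generated because nothing asked for them.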

Step 6: Serve it over Server-Sent Events

pip install fastapi "uvicorn[standard]"
uvicorn app:app --reload

In a second terminal:

curl -N -X POST http://127.0.0.1:8000/chat/stream -H "Content-Type: application/json" \
  -d '{"message":"Who leads the team that runs billing?"}'

You should now see a sequence of SSE frames arrive one after another:

data: {"type": "status", "text": "thinking"}

data: {"type": "status", "text": "searching the knowledge base"}
...
data: {"type": "token", "text": "Dana "}
...
data: {"type": "done"}

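You should now see: each event is a data: <json> line followed by a blank line, sent live as the agent works. This is the same frame format a browser's EventSource reads to update a page. (EventSource itself only issues GET requests, so a POST endpoint like this one is usually read in the browser with fetch and a streaming reader.) Open app.py: the StreamingResponse with media type text/event-stream is the whole trick. Press Ctrl-C to stop the server. (curl -N disables curl's output buffering so you see frames arrive live.)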
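
The frame format itself is simple enough to sketch with the standard library. The helpers to_sse and parse_sse below are hypothetical and are not taken from app.py. The parser handles only the single-line data: frames shown above, not the full SSE format.

```python
import json


def to_sse(event):
    """Encode one event dict as an SSE frame: a data: line plus a blank line."""
    return "data: " + json.dumps(event) + "\n\n"


def parse_sse(raw):
    """Decode a run of single-line data: frames back into event dicts."""
    events = []
    for frame in raw.split("\n\n"):
        if frame.startswith("data: "):
            events.append(json.loads(frame[len("data: "):]))
    return events
```

A generator that yields to_sse(event) strings is the kind of object a streaming HTTP response can send chunk by chunk. Because json.dumps escapes newlines inside strings, a token containing a line break cannot split a frame by accident.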

Step 7: Show it

Post your streamed run from Step 2 (progress, then the live answer with sources and cost), and one sentence on why streaming made it feel better even though the answer was the same.

If you get stuck

ModuleNotFoundError -> run from inside the folder with the solution .py files.
curl shows everything at once, not streaming -> add -N (no buffering). Some terminals still batch the output, but the frames are still sent live.
ModuleNotFoundError: fastapi -> pip install fastapi "uvicorn[standard]" (from M11). Steps 1 to 5 need no install.
Citations do not match the answer -> the search query must retrieve the cited docs; read corpus.search and the mock's query.

Check yourself

Why does streaming make an agent feel faster when the total time is unchanged?

Because users experience perceived latency: the time they wait staring at nothing. Streaming shows progress and the first words almost immediately (time-to-first-token), so the wait feels short even though the total is the same.
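
The gap between time-to-first-token and total time can be measured directly. This is a sketch with a simulated model: slow_tokens, measure, and the 10 ms per-token delay are assumptions for illustration, not numbers from the lab.

```python
import time


def slow_tokens(n, delay):
    """Simulated model: each token takes `delay` seconds to produce."""
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i} "


def measure(n=5, delay=0.01):
    """Return (time-to-first-token, total time) in seconds for one streamed run."""
    start = time.perf_counter()
    first = None
    for _ in slow_tokens(n, delay):
        if first is None:
            first = time.perf_counter() - start   # the wait the user actually notices
    total = time.perf_counter() - start
    return first, total
```

A non-streaming UI makes the user wait for the full total before anything appears. A streaming UI shows something after the first interval, while the total stays the same.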

How does the agent communicate with the UI?

Through a stream of small events (status, tool, citation, token, cost, done, error) that it yields as a generator. The UI renders each as it arrives. The agent does not care what the renderer is.

Why does cancellation come "for free" with a generator?

A generator only produces the next event when asked. If the consumer stops iterating (the user hits stop / the SSE connection closes), the agent never generates the rest, so no more tokens and no more cost are produced.

What is SSE and when would you use WebSockets instead?

Server-Sent Events stream one-way from server to client over a kept-open HTTP response (`data: ...` lines), perfect for streaming an answer. Use WebSockets when you also need to stream from client to server, for example live two-way voice.