M28 notes: Agent UX and streaming (the one idea)

The one idea: an agent that goes silent and then dumps a finished answer feels slow and opaque, even when it is fast and correct. The fix is to make the agent EMIT EVENTS as it works, and render them live: a progress note while it thinks and searches, the answer streaming in token by token, the sources it used, and what it cost. Same answer, completely different experience. UX is not paint on top; for an agent it is part of whether people trust and keep using it.

1. Perceived latency is the real latency

Users do not experience your server's wall-clock time; they experience the wait they can see. Two agents that both take six seconds feel totally different: one shows a blank screen for six seconds, the other shows "searching the knowledge base..." at second one and starts printing the answer at second two. The second feels fast. For most LLM apps, the biggest single UX win is to stream the output so the first words appear almost immediately (a low time-to-first-token) instead of making the user wait for the whole answer.
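A minimal sketch of the two numbers in play, time-to-first-token and total time. The fake_model and measure helpers are hypothetical names for illustration, and the delay is an arbitrary stand-in for model latency.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_model(chunks: Iterable[str], delay: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a model: emits each chunk after a short delay."""
    for chunk in chunks:
        time.sleep(delay)
        yield chunk


def measure(stream: Iterator[str]) -> Tuple[float, float, str]:
    """Return (time_to_first_token, total_time, full_text) for a chunk stream."""
    start = time.perf_counter()
    first = None
    parts = []
    for chunk in stream:
        if first is None:
            # The user sees something on screen from this moment on.
            first = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    return (first if first is not None else total), total, "".join(parts)


ttft, total, text = measure(fake_model(["Stream", "ing ", "feels ", "fast."]))
print(f"first token after {ttft:.3f}s, full answer after {total:.3f}s")
```

Without streaming the user waits for the full total; with streaming the visible wait is only the time to the first chunk.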

Analogy. A restaurant that seats you, brings water, and tells you the kitchen is busy feels fine for thirty minutes. A restaurant that leaves you standing at the door in silence feels broken in five. The food took the same time; the experience did not.

2. Make the agent an event stream

Instead of returning one final_answer at the end, the agent YIELDS events as it goes. In events.py the vocabulary is small: status (a progress note), tool (a tool was called), citation (a source used), token (a chunk of the answer), cost (the final bill), done, and error. streaming_agent.py defines the agent as a generator (chat_stream) that yields these in order: status "thinking", status "searching", a tool event, citation events, then the answer one chunk at a time, then a cost event, then done. A UI subscribes and renders each the instant it arrives.
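A minimal sketch of that shape. The Event dataclass, the payload fields, the tool name search_kb, and the numbers are assumptions for illustration; the lab's events.py and streaming_agent.py may be laid out differently. The error event is left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Any, Iterator


@dataclass(frozen=True)
class Event:
    type: str        # one of: status, tool, citation, token, cost, done, error
    data: Any = None


def chat_stream(question: str) -> Iterator[Event]:
    """Mock agent: yields events in the order the notes describe."""
    yield Event("status", "thinking")
    yield Event("status", "searching")
    # Hypothetical tool name and citation payload.
    yield Event("tool", {"name": "search_kb", "query": question})
    yield Event("citation", {"id": "kb-1", "title": "Refund policy"})
    for chunk in ["Refunds ", "take ", "five ", "days."]:
        yield Event("token", chunk)          # the answer, one chunk at a time
    # Made-up numbers; a real agent would compute these from usage data.
    yield Event("cost", {"input_tokens": 120, "output_tokens": 4, "usd": 0.0004})
    yield Event("done")


for event in chat_stream("How long do refunds take?"):
    print(event.type, event.data)
```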

This event model is also clean engineering: the agent does not know or care whether the renderer is a terminal, a web page, or a test. It just emits; the consumer decides how to show it.
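To make the separation concrete, here is a sketch in which one event stream feeds two unrelated consumers: a terminal renderer and a test helper. Events are plain (type, data) tuples here, and demo_events, render_terminal, and collect_answer are hypothetical names.

```python
from typing import Iterable, Iterator, Tuple

Event = Tuple[str, object]


def demo_events() -> Iterator[Event]:
    """Hypothetical stand-in for the agent's event stream."""
    yield ("status", "searching")
    yield ("token", "Hello, ")
    yield ("token", "world.")
    yield ("done", None)


def render_terminal(events: Iterable[Event]) -> None:
    """Consumer 1: print events as a terminal UI would."""
    for kind, data in events:
        if kind == "status":
            print(f"[{data}]")
        elif kind == "token":
            print(data, end="", flush=True)   # words appear as they arrive
        elif kind == "done":
            print()


def collect_answer(events: Iterable[Event]) -> str:
    """Consumer 2: a test only cares about the final text."""
    return "".join(str(data) for kind, data in events if kind == "token")


render_terminal(demo_events())
print(repr(collect_answer(demo_events())))
```

The producer is identical in both calls; only the consumer changes.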

3. Streaming the answer

The answer streams because the model produces it in chunks and the agent forwards each chunk as a token event immediately, rather than buffering the whole thing. Real Claude streaming uses the SDK's streaming API you met in M6: open the stream with "with client.messages.stream(...) as s:" and loop over it with "for text in s.text_stream: ...". Our mock iterates over a fixed list of chunks to model the same flow offline. Either way the UI sees words appear as they are generated.
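A sketch of the forwarding step, run against a mock. The names mock_text_stream and forward_tokens are hypothetical; with the real SDK, the iterable passed in would be the stream's text_stream. The log shows that the first chunk reaches the UI before the model has produced the second one.

```python
from typing import Iterable, Iterator, List, Tuple

log: List[str] = []


def mock_text_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Hypothetical offline stand-in for a model's text stream."""
    for chunk in chunks:
        log.append(f"model produced {chunk!r}")
        yield chunk


def forward_tokens(text_stream: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Forward each chunk as a token event as soon as it exists (no buffering)."""
    for text in text_stream:
        yield ("token", text)


events = forward_tokens(mock_text_stream(["The ", "answer ", "is 42."]))

first = next(events)                       # UI pulls the first event
log.append(f"UI rendered {first[1]!r}")
rest = list(events)                        # then drains the remainder

for line in log:
    print(line)
```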

4. Show the work: progress, citations, cost

Streaming tokens is the headline, but the other events matter too:

Progress / status: "searching the knowledge base" tells the user the agent is working and WHAT it is doing, which builds trust and explains the wait.
Citations: surfacing the sources (M24) lets the user verify the answer and signals it is grounded, not made up.
Cost and latency: showing tokens and dollars (M20/M25), at least in internal tools, keeps usage honest and helps users understand the system. Some products surface this as a "thinking" timer or a token meter.

Each is just another event type the UI renders.
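One way to picture "just another event type" is a small dispatch table on the UI side. This is a sketch with assumed payload shapes (a title for citations, tokens and usd for cost); the lab's events may carry different fields.

```python
from typing import Callable, Dict


def render_status(data: str) -> str:
    return f"... {data}"


def render_citation(data: dict) -> str:
    return f"source: {data['title']}"


def render_cost(data: dict) -> str:
    return f"{data['tokens']} tokens, ${data['usd']:.4f}"


# Supporting a new event type means adding one entry here.
RENDERERS: Dict[str, Callable] = {
    "status": render_status,
    "citation": render_citation,
    "cost": render_cost,
}


def render(event: dict) -> str:
    """Return the display line for an event, or '' for types the UI ignores."""
    handler = RENDERERS.get(event["type"])
    return handler(event["data"]) if handler else ""


print(render({"type": "status", "data": "searching the knowledge base"}))
print(render({"type": "citation", "data": {"title": "Refund policy"}}))
print(render({"type": "cost", "data": {"tokens": 124, "usd": 0.0004}}))
```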

5. Cancellation comes free with generators

Because chat_stream is a generator, the agent only does work when the consumer asks for the next event. If the user hits stop, the consumer stops iterating and closes the generator; the agent never generates the rest of the answer, so no further tokens and no further cost are produced. The lab shows this: stop after three tokens and the cost and done events are never emitted. With a real model the same holds as long as closing the generator also closes the underlying SDK stream, which a with block around the stream takes care of. Cancellation is not a special feature you bolt on; it falls out of streaming done right. (In a web app, a closed SSE connection or an abort signal plays the same role.)
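A sketch of the stop-after-three-tokens behaviour with a mock chat_stream (the chunk list and the work_log are assumptions for illustration, not the lab's code). The finally block stands in for whatever cleanup a real agent needs, such as closing the model stream.

```python
from typing import Iterator, List, Tuple

work_log: List[str] = []


def chat_stream() -> Iterator[Tuple[str, object]]:
    """Mock agent that records every piece of work it actually does."""
    try:
        for chunk in ["one ", "two ", "three ", "four ", "five"]:
            work_log.append(f"generated {chunk.strip()}")
            yield ("token", chunk)
        yield ("cost", {"output_tokens": 5})
        yield ("done", None)
    finally:
        # Runs on normal completion and when the consumer closes the generator.
        work_log.append("cleanup")


seen: List[str] = []
stream = chat_stream()
for kind, data in stream:
    seen.append(kind)
    if len(seen) == 3:      # the user hits stop
        break
stream.close()              # raises GeneratorExit inside chat_stream

print(seen)
print(work_log)
```

Chunks four and five are never generated, and the cost and done events are never emitted.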

6. Serving it: Server-Sent Events

To stream to a browser, the simplest transport is Server-Sent Events (SSE): the HTTP response stays open and, for each event, the server writes a line of the form "data: <json>" followed by a blank line (so each frame ends in \n\n). The browser's EventSource (or any client) reads them as they arrive. app.py wraps the agent's event stream in a FastAPI StreamingResponse with media type text/event-stream. (WebSockets are the alternative when you also need to stream from the client to the server, for example live voice; SSE is enough for one-way output.)
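A sketch of the framing itself, using only the standard library so it runs without a server. format_sse, sse_stream, and parse_sse are hypothetical helper names, and parse_sse handles only the single-line data frames produced here, not the full SSE format.

```python
import json
from typing import Iterable, Iterator, List


def format_sse(event: dict) -> str:
    """Frame one event as an SSE message: a data line, then a blank line."""
    # json.dumps escapes newlines inside strings, so the payload stays on one line.
    return f"data: {json.dumps(event)}\n\n"


def sse_stream(events: Iterable[dict]) -> Iterator[str]:
    """Lazily turn an event stream into SSE frames, one per event."""
    for event in events:
        yield format_sse(event)


def parse_sse(body: str) -> List[dict]:
    """Minimal client-side parse of the frames above."""
    frames = [frame for frame in body.split("\n\n") if frame]
    return [json.loads(frame[len("data: "):]) for frame in frames]


events = [
    {"type": "status", "data": "searching"},
    {"type": "token", "data": "Hi"},
    {"type": "done", "data": None},
]
body = "".join(sse_stream(events))
print(body)
```

In a FastAPI handler, a generator like sse_stream(...) is the kind of thing you would pass to StreamingResponse with media_type="text/event-stream", which is the wrapping the notes describe for app.py.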

7. Honest notes

The model is mocked so this runs offline; a production build wraps the real SDK streaming API behind the same event stream, so the UI code does not change.
Do not stream raw internal reasoning you would not want a user to see; stream a curated "status", not unfiltered chain-of-thought.
Streaming improves perceived speed, not total cost or correctness; pair it with the cost work (M25) and the eval gate (M26).

Words you will hear

Perceived latency, time-to-first-token, streaming, event stream / generator, Server-Sent Events (SSE), EventSource, cancellation / abort, progress indicator, citations (M24). Full definitions in the glossary.