Building Streaming AI Interfaces with OpenAI APIs
A 5-second response from an LLM feels broken. The same 5-second response, with the first words appearing in 300ms and the rest streaming in as they’re generated, feels conversational. The model is doing the same work; only the user experience differs. This is why every serious LLM product streams responses: perceived latency, not actual latency, is what matters to users.
Streaming sounds simple in a demo (stream=True, render tokens as they arrive) and turns out to involve real engineering once you ship it to production — backpressure across an HTTP hop, graceful cancellation, partial-output validation, error handling mid-stream, and serving structured outputs that need to be parsed before display. This post is about getting that right.
Why Streaming Matters
LLMs generate tokens autoregressively — one token at a time, each depending on the prior ones. Total latency for a response is roughly:
    TTLT ≈ TTFT + (output_tokens × time_per_token)

Here TTFT is time to first token and TTLT is time to last token. For GPT-4-class models, TTFT is typically 200ms–1s, and time-per-token is 20–80ms. A 300-token response thus takes roughly 6–25 seconds in total. If you wait for the full response, the user stares at a spinner for that duration. If you stream, the user sees output beginning within ~500ms and reads it as it appears at human reading speed.
The user’s perception of speed is dominated by TTFT, not TTLT. Streaming is therefore a perceived-latency multiplier without any change to the model itself.
For a response with an 8-second TTLT, the streaming case completes at the same wall-clock time as the buffered one, but the user perceives the experience as starting immediately rather than after an 8-second wait.
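As a quick check of the arithmetic above, here is a minimal sketch that plugs the quoted ranges into the formula. The function name estimate_ttlt is illustrative, not part of any SDK.

```python
def estimate_ttlt(ttft_s: float, output_tokens: int, time_per_token_s: float) -> float:
    """Approximate time to last token: TTFT + output_tokens * time_per_token."""
    return ttft_s + output_tokens * time_per_token_s


# Bounds quoted in the text for a 300-token response.
best = estimate_ttlt(ttft_s=0.2, output_tokens=300, time_per_token_s=0.020)
worst = estimate_ttlt(ttft_s=1.0, output_tokens=300, time_per_token_s=0.080)

print(f"best case:  {best:.1f}s")   # 6.2s
print(f"worst case: {worst:.1f}s")  # 25.0s
```

In both cases the user sees the first words after TTFT; only the buffered version makes them wait for the full figure.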
The Protocol
OpenAI’s streaming uses Server-Sent Events (SSE). Each chunk is a JSON object with a delta — the incremental content since the last chunk:
    data: {"id":"...","choices":[{"delta":{"content":"Hello"},"finish_reason":null}]}
    data: {"id":"...","choices":[{"delta":{"content":" world"},"finish_reason":null}]}
    data: {"id":"...","choices":[{"delta":{},"finish_reason":"stop"}]}
    data: [DONE]

SSE is the right transport choice for this: it is unidirectional (server to client), text-based, reconnects automatically, plays well with HTTP/2 multiplexing, and works through proxies (unlike WebSocket in some restrictive environments).
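To make the wire format concrete, the following sketch parses the events shown above by hand and joins the deltas. It assumes events are separated by blank lines and carry a single data: line each. The SDKs do this for you, so treat it as an illustration of the protocol rather than something to ship.

```python
import json
from typing import Optional, Tuple

# The example stream from the text, with the blank-line separators made explicit.
RAW_STREAM = (
    'data: {"id":"...","choices":[{"delta":{"content":"Hello"},"finish_reason":null}]}\n\n'
    'data: {"id":"...","choices":[{"delta":{"content":" world"},"finish_reason":null}]}\n\n'
    'data: {"id":"...","choices":[{"delta":{},"finish_reason":"stop"}]}\n\n'
    "data: [DONE]\n\n"
)


def collect_text(raw: str) -> Tuple[str, Optional[str]]:
    """Concatenate delta.content across events; return (text, finish_reason)."""
    text, finish_reason = "", None
    for event in raw.split("\n\n"):
        if not event.startswith("data: "):
            continue  # skip the empty tail and any non-data lines
        payload = event[len("data: "):]
        if payload == "[DONE]":
            break  # sentinel, not JSON
        choice = json.loads(payload)["choices"][0]
        text += choice["delta"].get("content", "")
        finish_reason = choice["finish_reason"] or finish_reason
    return text, finish_reason


print(collect_text(RAW_STREAM))  # ('Hello world', 'stop')
```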
In Python:
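    from openai import AsyncOpenAI

    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

    # Runs inside an async function; send_to_user is your own delivery hook.
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[...],
        stream=True,
    )
    async for chunk in stream:
        if not chunk.choices:  # e.g. a usage-only chunk
            continue
        delta = chunk.choices[0].delta
        if delta.content:
            await send_to_user(delta.content)

In Node.js: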
    const stream = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [...],
      stream: true,
    });

    for await (const chunk of stream) {
      const content = chunk.choices[0]?.delta?.content;
      if (content) await sendToUser(content);
    }

Both SDKs handle the SSE parsing internally; you receive structured chunks.
End-to-End Streaming, Not Just Server-Side
A common mistake: the server receives a streaming response from OpenAI and then buffers it before responding to the client. This negates the entire point of streaming.
The chain has to stream throughout:
- OpenAI streams chunks to your server.
- Your server streams chunks to the client.
- The client renders chunks as they arrive.
Each hop needs explicit streaming handling. In FastAPI:
    import json

    from fastapi.responses import StreamingResponse

    # app, ChatRequest, and openai_client (an AsyncOpenAI instance) are defined elsewhere.
    @app.post("/chat")
    async def chat(req: ChatRequest):
        async def event_stream():
            stream = await openai_client.chat.completions.create(
                model="gpt-4o",
                messages=req.messages,
                stream=True,
            )
            async for chunk in stream:
                if not chunk.choices:  # skip chunks that carry no choices
                    continue
                delta = chunk.choices[0].delta
                if delta.content:
                    yield f"data: {json.dumps({'content': delta.content})}\n\n"
            yield "data: [DONE]\n\n"

        return StreamingResponse(event_stream(), media_type="text/event-stream")

In Fastify with raw Node streams (the Express version is analogous):
    fastify.post("/chat", async (request, reply) => {
      reply.raw.setHeader("Content-Type", "text/event-stream");
      reply.raw.setHeader("Cache-Control", "no-cache, no-transform");
      reply.raw.setHeader("X-Accel-Buffering", "no");

      const stream = await openai.chat.completions.create({...});
      for await (const chunk of stream) {
        const c = chunk.choices[0]?.delta?.content;
        if (c) reply.raw.write(`data: ${JSON.stringify({ content: c })}\n\n`);
      }
      reply.raw.write("data: [DONE]\n\n");
      reply.raw.end();
    });

A few specifics that bite people:
- The X-Accel-Buffering: no header disables proxy buffering on NGINX (the default is to buffer). Without it, NGINX collects the stream before forwarding to the client.
- Disable compression for SSE responses. Buffering for gzip flush negates streaming.
- HTTP/1.1 chunked transfer encoding is the underlying mechanism; most frameworks handle this correctly when you call write() before end().
Client-Side Rendering
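On the client, either the EventSource API or fetch with a ReadableStream can consume the stream. EventSource only issues GET requests, so a POST endpoint like the one above needs fetch: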
    // streamChat is an illustrative wrapper; appendToUI is your own render hook.
    async function streamChat(req: unknown): Promise<void> {
      const res = await fetch("/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(req),
      });
      const reader = res.body!.getReader();
      const decoder = new TextDecoder();
      let buffer = "";

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });
        const parts = buffer.split("\n\n");
        buffer = parts.pop() ?? ""; // keep the trailing partial message
        for (const part of parts) {
          if (!part.startsWith("data: ")) continue;
          const payload = part.slice(6);
          if (payload === "[DONE]") return;
          const { content } = JSON.parse(payload);
          appendToUI(content);
        }
      }
    }

The \n\n split is important: SSE messages are separated by blank lines, and network chunks can end mid-message. Buffering the trailing partial message until the next read prevents JSON-parse failures.
For React, the rendering pattern that works well: accumulate the streamed content in state, and let React’s batching handle the re-renders. For very long streams, throttle UI updates to ~60Hz to avoid pegging the main thread.
Cancellation: The Part Everyone Forgets
When a user types a long question, hits submit, and then immediately changes the question, the in-flight request should be cancelled. Without cancellation, the server continues generating tokens (and you continue paying for them) while the user has moved on.
End-to-end cancellation requires:
- AbortSignal on the client. Pass fetch(url, { signal: controller.signal }); calling controller.abort() aborts the request and closes the connection.
- Detect client disconnection on the server. FastAPI/Starlette expose await request.is_disconnected(); Node exposes request.socket.destroyed and request.on('close', ...).
- Cancel the OpenAI request. Pass an AbortSignal in the request options (Node SDK), or close the stream as the handler below does (Python SDK).
    async def event_stream():
        try:
            async for chunk in stream:
                if await request.is_disconnected():
                    await stream.close()
                    return
                ...
        except asyncio.CancelledError:
            await stream.close()
            raise

Without this, a noisy frontend (users navigating away, retrying) can double or triple your OpenAI bill.
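The propagation itself can be seen without any network. The sketch below uses a fake upstream generator in place of the OpenAI stream (fake_model_stream, relay, and demo are all hypothetical names) and shows that cancelling the relaying task also shuts down the upstream, which is the behaviour you want from the real handler.

```python
import asyncio


async def fake_model_stream(log: list):
    """Stand-in for the upstream stream; records when it is shut down."""
    try:
        for i in range(1000):
            await asyncio.sleep(0.01)  # simulated time per token
            yield f"tok{i} "
    finally:
        log.append("upstream closed")


async def relay(log: list, sent: list) -> None:
    """Stand-in for the request handler that forwards tokens to the client."""
    stream = fake_model_stream(log)
    try:
        async for token in stream:
            sent.append(token)
    finally:
        # Runs on normal completion and on cancellation alike.
        await stream.aclose()


async def demo():
    log, sent = [], []
    task = asyncio.create_task(relay(log, sent))
    await asyncio.sleep(0.05)  # the "user" gives up after a few tokens
    task.cancel()              # what a client disconnect triggers server-side
    try:
        await task
    except asyncio.CancelledError:
        pass
    return log, sent


if __name__ == "__main__":
    log, sent = asyncio.run(demo())
    print(log, len(sent))
```

Only a handful of the 1000 simulated tokens are produced before the upstream stops, which is the saving cancellation buys you.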
Backpressure
A client reading slowly should slow down generation. Without backpressure, the server buffers indefinitely.
The mechanism is built into the transport: TCP flow control propagates backpressure through each hop, so you do not need to add your own buffer. Don’t fire off asyncio.create_task(send(chunk)) without awaiting the send; await each write and let the natural flow control work.
The danger zone is when you introduce a queue or buffer between the OpenAI stream and your response — say, to broadcast the response to multiple consumers, or to persist while streaming. The queue removes backpressure. Be deliberate about whether you want that.
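If a queue is unavoidable, bounding it is one way to keep backpressure intact. This is a minimal sketch with stand-in producer and consumer coroutines (all names are hypothetical): with a maxsize set, put() suspends the producer whenever the slow consumer falls behind, so the queue depth never exceeds the bound.

```python
import asyncio


async def produce(queue: asyncio.Queue, n_chunks: int) -> None:
    """Stand-in for the model stream: put() suspends while the queue is full."""
    for i in range(n_chunks):
        await queue.put(f"chunk-{i}")
    await queue.put(None)  # sentinel: end of stream


async def consume_slowly(queue: asyncio.Queue, depths: list) -> int:
    """Stand-in for a slow client; records the queue depth at each read."""
    received = 0
    while (item := await queue.get()) is not None:
        depths.append(queue.qsize())
        received += 1
        await asyncio.sleep(0.001)  # simulate a slow network write
    return received


async def run_demo(maxsize: int = 2, n_chunks: int = 20):
    queue = asyncio.Queue(maxsize=maxsize)
    depths = []
    _, received = await asyncio.gather(
        produce(queue, n_chunks), consume_slowly(queue, depths)
    )
    return received, max(depths)


if __name__ == "__main__":
    received, peak_depth = asyncio.run(run_demo())
    print(received, peak_depth)
```

With asyncio.Queue() and no maxsize, the producer would never wait and the queue would grow to hold the whole response, which is the lost-backpressure case described above.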
Mid-Stream Errors
Streams can fail mid-flight: the API returns an error after 100 tokens, the connection drops, or a content filter triggers at token 200. The client needs a way to know.
The protocol decision: communicate errors as data events with a distinct type, not by closing the connection.
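    from openai import APIError

    try:
        async for chunk in stream:
            ...
    except APIError as e:
        # Report the failure in-band so the client can tell it from a normal stop.
        yield f"data: {json.dumps({'error': str(e)})}\n\n"
    # Send the terminator after the try block rather than in finally:
    # a yield inside finally misbehaves when the generator is closed or cancelled.
    yield "data: [DONE]\n\n"

The client checks for an error field and displays it appropriately. A silently-truncated stream is the worst UX: the user thinks the model decided to stop early.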
OpenAI also emits finish_reason in the final chunk: stop, length, content_filter, or tool_calls. Surface these distinctly:
- length: truncated due to max_tokens. Offer a “continue” affordance.
- content_filter: blocked. Explain what happened.
- stop: the model finished. Normal.
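One way to keep these cases from collapsing into a single "done" state is an explicit lookup. The action names below are hypothetical placeholders for whatever your UI layer does; only the finish_reason values come from the API.

```python
from typing import Optional

# Hypothetical UI actions, keyed by the finish_reason values listed above.
FINISH_ACTIONS = {
    "stop": "done",
    "length": "offer_continue",
    "content_filter": "show_blocked_notice",
    "tool_calls": "run_tools",
}


def ui_action(finish_reason: Optional[str]) -> str:
    """Map a chunk's finish_reason to a UI action name."""
    if finish_reason is None:
        return "keep_streaming"  # not the final chunk yet
    # Unknown reasons are surfaced rather than silently treated as success.
    return FINISH_ACTIONS.get(finish_reason, "show_unknown_stop")


print(ui_action("length"))  # offer_continue
```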
Streaming Structured Outputs
What if your application consumes structured JSON, not free text? The challenge: streaming partial JSON is not parseable until the closing brace.
Three approaches:
- Wait for completion before parsing. Defeats streaming. Use only if structure must be valid before any UI feedback.
- Stream the tokens but render only the textual parts. Many UIs care about a “message” field but also include metadata; you can render the message stream while collecting metadata at the end.
- Partial-JSON parsing libraries. partial-json-parser (JS), ijson (Python). Parse “as much as is valid” at each step; render the visible parts.
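To show what "as much as is valid" means in practice, here is a hand-rolled sketch of the third approach. It is not the API of either library named above: it simply closes any open string, object, or array and tries json.loads, returning None when the prefix cannot be completed yet (for example, mid-key). The function name parse_partial_json is an assumption for illustration.

```python
import json


def parse_partial_json(text: str):
    """Best-effort parse of a JSON prefix; returns None if not yet parseable."""
    closers = []        # closing brackets still owed, innermost last
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()

    candidate = text
    if in_string:
        if escaped:
            candidate = candidate[:-1]  # drop a dangling backslash
        candidate += '"'
    candidate += "".join(reversed(closers))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None  # caller keeps showing the last good value


# Usage: render the "message" field as it grows.
buffer = ""
for token in ['{"mess', 'age": "Hel', 'lo wor', 'ld", "sour', 'ces": [1, 2]}']:
    buffer += token
    parsed = parse_partial_json(buffer)
    if parsed and "message" in parsed:
        print(parsed["message"])  # prints: Hel, Hello wor, Hello world
```

Re-parsing the whole buffer on every token is quadratic in the response length, so for long outputs a real incremental parser is the better choice.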
For tool calls specifically, OpenAI streams tool arguments incrementally as a delta:
    {"tool_calls": [{"index": 0, "function": {"arguments": "{\""}}]}
    {"tool_calls": [{"index": 0, "function": {"arguments": "name\":"}}]}
    {"tool_calls": [{"index": 0, "function": {"arguments": " \"Akhil\"}"}}]}

Accumulate the arguments string per-tool-call index and parse only when complete. Showing a partial argument to the user is rarely useful; showing “calling tool: get_weather” with a spinner is.
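The accumulation step can be sketched as follows, using plain dicts shaped like the deltas above. With the SDK you would read the same fields as attributes on the chunk objects; the function name is illustrative.

```python
import json
from collections import defaultdict


def accumulate_tool_calls(deltas):
    """Join streamed argument fragments per tool-call index, then parse each."""
    fragments = defaultdict(list)
    for delta in deltas:
        for call in delta.get("tool_calls", []):
            piece = call.get("function", {}).get("arguments") or ""
            fragments[call["index"]].append(piece)
    # Parse only once the stream is complete; the partial strings are not valid JSON.
    return {index: json.loads("".join(parts)) for index, parts in fragments.items()}


deltas = [
    {"tool_calls": [{"index": 0, "function": {"arguments": "{\""}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": "name\":"}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": " \"Akhil\"}"}}]},
]
print(accumulate_tool_calls(deltas))  # {0: {'name': 'Akhil'}}
```

Keying by index matters once the model emits several tool calls in one response, since their fragments can arrive interleaved.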
Tool Calls and Reasoning
Modern LLM responses interleave reasoning, tool calls, and final answers. The UI must handle each phase:
- Reasoning content (in newer models) — render as a collapsed “thinking” section. Users want the option to expand; most don’t.
- Tool calls — show “Calling [tool] with [args]…” and the result. The model often emits these mid-stream before continuing with the answer.
- Final answer — the streaming text the user reads.
Frameworks that handle this well (Vercel AI SDK, Mastra, LangChain.js) abstract the multi-phase nature. Building from scratch is reasonable; just plan for the state machine.
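For a from-scratch build, the state machine can start as small as a reducer over typed events. The event kinds below ("reasoning", "tool_call", "text") are hypothetical labels your server would assign. They are not an OpenAI wire format, and the sketch leaves out tool results and error states.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MessageView:
    """What the UI shows for one assistant turn."""
    reasoning: str = ""                                   # rendered collapsed
    tool_calls: List[str] = field(default_factory=list)   # "Calling X..." rows
    answer: str = ""                                      # text the user reads
    phase: str = "idle"                                   # most recent event kind


def apply_event(view: MessageView, kind: str, payload: str) -> MessageView:
    """Advance the view by one stream event (kinds are hypothetical)."""
    if kind == "reasoning":
        view.reasoning += payload
    elif kind == "tool_call":
        view.tool_calls.append(payload)
    elif kind == "text":
        view.answer += payload
    else:
        raise ValueError(f"unknown event kind: {kind}")
    view.phase = kind
    return view


events = [
    ("reasoning", "Check the forecast. "),
    ("tool_call", "get_weather"),
    ("text", "It is "),
    ("text", "sunny."),
]
view = MessageView()
for kind, payload in events:
    apply_event(view, kind, payload)
print(view.phase, view.tool_calls, view.answer)
```

Keeping the phase explicit lets the renderer decide what to show (a collapsed block, a spinner, or streaming text) without inspecting raw chunks.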
Observability for Streaming
Streaming adds observability complexity. You want metrics for both phases, the wait for the first token and the generation that follows:
- TTFT. The most important UX metric. Histogram by model and prompt class.
- TTLT. End-to-end duration.
- Tokens emitted. Output token count per response. Drives cost.
- Cancellation rate. Fraction of streams cancelled by client. High rates indicate UI bugs or user dissatisfaction.
- Error rate by finish_reason. length, content_filter, transient API errors.
- Time-per-token. Histogram. Anomalies indicate model degradation or routing issues.
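The timing metrics in this list can be captured with a thin pass-through wrapper around the stream. The sketch below is synchronous for brevity and counts chunks rather than tokens, since a chunk is not guaranteed to be exactly one token. StreamMetrics and measure are hypothetical names, and the injectable clock exists only to make the wrapper testable.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class StreamMetrics:
    ttft_s: Optional[float] = None   # time to first chunk
    ttlt_s: Optional[float] = None   # time to last chunk seen so far
    chunks: int = 0

    @property
    def time_per_chunk_s(self) -> Optional[float]:
        """Mean gap between chunks after the first one."""
        if self.chunks < 2 or self.ttft_s is None or self.ttlt_s is None:
            return None
        return (self.ttlt_s - self.ttft_s) / (self.chunks - 1)


def measure(
    stream: Iterable[str],
    metrics: StreamMetrics,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Pass chunks through unchanged while recording timing into metrics."""
    start = clock()
    for chunk in stream:
        elapsed = clock() - start
        if metrics.ttft_s is None:
            metrics.ttft_s = elapsed
        metrics.ttlt_s = elapsed
        metrics.chunks += 1
        yield chunk


metrics = StreamMetrics()
text = "".join(measure(["Hello", " ", "world"], metrics))
print(text, metrics.chunks)  # Hello world 3
```

Because the metrics object is updated as chunks flow, a cancelled stream still leaves behind a usable TTFT and chunk count for the cancellation-rate dashboards.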
For traces, each stream is one span; sub-spans for tool calls and reasoning blocks. The Vercel AI SDK and OpenTelemetry’s GenAI semantic conventions provide a vocabulary for these — adopt the conventions early so dashboards and tooling work out of the box.
Production Pitfalls
A few patterns that bite teams:
- Idle timeouts on load balancers. A long response, or a long pause between chunks, can exceed the idle timeout on many ingress configurations (the ALB default is 60s, and it is sometimes set lower). Configure long timeouts on streaming routes specifically.
- Compression middleware. Compression middleware that buffers the response before flushing kills streaming. Exempt SSE routes.
- CDN/caching layers. A CDN that caches the response defeats streaming and breaks subsequent requests. Set Cache-Control: no-store and ensure your edge respects it.
- Mobile networks dropping connections. Reconnection logic with resumption (start where you left off via prompt continuation) preserves UX on flaky networks.
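- Token usage tracking. Streaming responses don’t include the final usage block by default. Request it explicitly (stream_options={"include_usage": True} in Python) or accept that you need to estimate. When requested, usage arrives in an extra final chunk whose choices list is empty, so guard any chunk.choices[0] access.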
Closing
Streaming is the single largest perceived-latency improvement you can make to an LLM-backed product, and the implementation surface is larger than it looks. The chain from OpenAI to user must stream throughout; cancellation must propagate end-to-end; errors must be communicated as data events, not connection drops; tool calls and structured outputs require their own rendering states; observability needs to handle the multi-phase nature. None of this is hard once you’ve done it once; all of it is easy to miss the first time. Build the streaming path deliberately — end-to-end, with cancellation, with error semantics, with the right transport headers — and the LLM in your product stops feeling like a batch system and starts feeling like a conversation. That difference is most of what users notice.