Building Streaming AI Interfaces with OpenAI APIs

A 5-second response from an LLM feels broken. The same 5-second response, with the first words appearing in 300ms and the rest streaming in as they’re generated, feels conversational. The model is doing the same work; only the user experience differs. This is why every serious LLM product streams responses: perceived latency, not actual latency, is what matters to users.

Streaming sounds simple in a demo (stream=True, render tokens as they arrive) and turns out to involve real engineering once you ship it to production — backpressure across an HTTP hop, graceful cancellation, partial-output validation, error handling mid-stream, and serving structured outputs that need to be parsed before display. This post is about getting that right.

Why Streaming Matters

LLMs generate tokens autoregressively — one token at a time, each depending on the prior ones. Total latency for a response is roughly:

TTLT ≈ TTFT + (output_tokens × time_per_token)

For GPT-4-class models, TTFT (time to first token) is typically 200ms–1s, and time-per-token is 20–80ms. A 300-token response thus takes 6–25 seconds total. If you wait for the full response, the user stares at a spinner for that duration. If you stream, the user sees output beginning within ~500ms and reads it as it appears at human reading speed.

The user’s perception of speed is dominated by TTFT, not TTLT. Streaming therefore cuts perceived latency without any change to the model itself.

The streamed response completes at the same wall-clock time as the buffered one, but the user experiences it as starting almost immediately rather than after a multi-second wait.
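
To make the latency formula concrete, here is a small sketch that plugs in the two ends of the ranges quoted above. The function name and the returned keys are illustrative, not part of any SDK.

```python
def estimate_latency(ttft_s: float, output_tokens: int, time_per_token_s: float) -> dict:
    """Apply TTLT ~= TTFT + (output_tokens * time_per_token)."""
    ttlt_s = ttft_s + output_tokens * time_per_token_s
    return {
        "buffered_wait_s": ttlt_s,  # nothing is shown until the full response exists
        "streamed_wait_s": ttft_s,  # first words appear after TTFT
        "total_s": ttlt_s,          # both cases finish at the same wall-clock time
    }


# 300 output tokens at the fast and slow ends of the quoted ranges.
fast = estimate_latency(ttft_s=0.2, output_tokens=300, time_per_token_s=0.020)
slow = estimate_latency(ttft_s=1.0, output_tokens=300, time_per_token_s=0.080)
```

The total is identical in both modes; only the time before the user sees anything changes.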

The Protocol

OpenAI’s streaming uses Server-Sent Events (SSE). Each chunk is a JSON object with a delta — the incremental content since the last chunk:

data: {"id":"...","choices":[{"delta":{"content":"Hello"},"finish_reason":null}]}
data: {"id":"...","choices":[{"delta":{"content":" world"},"finish_reason":null}]}
data: {"id":"...","choices":[{"delta":{},"finish_reason":"stop"}]}
data: [DONE]

SSE is the right transport choice for this: it is unidirectional server-to-client, text-based, reconnects automatically when consumed through the browser’s EventSource API, plays well with HTTP/2 multiplexing, and works through proxies (unlike WebSocket in some restrictive environments).

In Python:

async for chunk in await client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    stream=True,
):
    delta = chunk.choices[0].delta
    if delta.content:
        await send_to_user(delta.content)

In Node.js:

const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [...],
  stream: true,
});
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) await sendToUser(content);
}

Both SDKs handle the SSE parsing internally; you receive structured chunks.
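
For a sense of what that parsing involves, here is a rough sketch that walks the raw data lines from the protocol example and rebuilds the text. It works on plain strings and dicts rather than SDK objects, and collect_stream is a hypothetical helper name.

```python
import json


def collect_stream(lines):
    """Accumulate delta content from SSE lines; return (text, finish_reason)."""
    text_parts = []
    finish_reason = None
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separators and non-data fields
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel, not JSON
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                text_parts.append(content)
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return "".join(text_parts), finish_reason


sample = [
    'data: {"id":"x","choices":[{"delta":{"content":"Hello"},"finish_reason":null}]}',
    "",
    'data: {"id":"x","choices":[{"delta":{"content":" world"},"finish_reason":null}]}',
    "",
    'data: {"id":"x","choices":[{"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]
text, reason = collect_stream(sample)
```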

End-to-End Streaming, Not Just Server-Side

A common mistake: the server receives a streaming response from OpenAI and then buffers it before responding to the client. This negates the entire point of streaming.

The chain has to stream throughout:

  1. OpenAI streams chunks to your server.
  2. Your server streams chunks to the client.
  3. The client renders chunks as they arrive.

Each hop needs explicit streaming handling. In FastAPI:

import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

# openai_client (an async OpenAI client) and ChatRequest are defined elsewhere.
@app.post("/chat")
async def chat(req: ChatRequest):
    async def event_stream():
        async for chunk in await openai_client.chat.completions.create(
            model="gpt-4o", messages=req.messages, stream=True,
        ):
            delta = chunk.choices[0].delta
            if delta.content:
                yield f"data: {json.dumps({'content': delta.content})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")

In Fastify, writing to the raw Node response stream (the same approach works with Express’s res object):

fastify.post("/chat", async (request, reply) => {
  reply.raw.setHeader("Content-Type", "text/event-stream");
  reply.raw.setHeader("Cache-Control", "no-cache, no-transform");
  reply.raw.setHeader("X-Accel-Buffering", "no");
  const stream = await openai.chat.completions.create({...});
  for await (const chunk of stream) {
    const c = chunk.choices[0]?.delta?.content;
    if (c) reply.raw.write(`data: ${JSON.stringify({ content: c })}\n\n`);
  }
  reply.raw.write("data: [DONE]\n\n");
  reply.raw.end();
});

A few specifics that bite people:

  • X-Accel-Buffering: no disables proxy buffering on NGINX (the default is to buffer). Without it, NGINX collects the stream before forwarding to the client.
  • Disable compression for SSE responses. A gzip layer buffers output until it flushes, which negates streaming.
  • HTTP/1.1 chunked transfer encoding is the underlying mechanism; most frameworks handle this correctly when you write before end.
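
On the Python side, the same headers and framing can be kept in one place. This is a small sketch; SSE_HEADERS and format_sse are illustrative names, not framework APIs.

```python
import json

# Response headers for an SSE route that sits behind proxies.
SSE_HEADERS = {
    "Cache-Control": "no-cache, no-transform",  # no caching, no proxy-side rewriting
    "X-Accel-Buffering": "no",                  # ask NGINX not to buffer this response
}


def format_sse(payload: dict) -> str:
    """Serialize one event in SSE framing: a data line followed by a blank line."""
    return f"data: {json.dumps(payload)}\n\n"
```

In the FastAPI handler, these could be passed as StreamingResponse(event_stream(), media_type="text/event-stream", headers=SSE_HEADERS).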

Client-Side Rendering

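On the client, either the EventSource API or fetch with a ReadableStream can consume SSE. EventSource only issues GET requests, so a POST endpoint like the one above needs fetch: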

const res = await fetch("/chat", { method: "POST", body: JSON.stringify(req) });
const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buffer = "";
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const parts = buffer.split("\n\n");
  buffer = parts.pop() ?? "";
  for (const part of parts) {
    if (!part.startsWith("data: ")) continue;
    const payload = part.slice(6);
    if (payload === "[DONE]") return;
    const { content } = JSON.parse(payload);
    appendToUI(content);
  }
}

The \n\n split is important: SSE messages are separated by blank lines, and chunks can land mid-message. Buffering the trailing partial message until the next read prevents JSON-parse failures.
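
The same buffering logic, sketched in Python for non-browser clients. SSEBuffer is a hypothetical helper; it also uses an incremental decoder so that a multi-byte UTF-8 character split across two reads is not corrupted.

```python
import codecs


class SSEBuffer:
    """Reassemble SSE messages from arbitrarily split byte chunks."""

    def __init__(self):
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._buffer = ""

    def feed(self, data: bytes) -> list:
        """Add raw bytes; return the payloads of all messages completed so far."""
        self._buffer += self._decoder.decode(data)
        # Everything before the last blank-line separator is complete;
        # the remainder is a partial message kept for the next read.
        *complete, self._buffer = self._buffer.split("\n\n")
        prefix = "data: "
        return [m[len(prefix):] for m in complete if m.startswith(prefix)]
```

A caller would pass each network read to feed() and JSON-parse only the payloads it returns.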

For React, the rendering pattern that works well: accumulate the streamed content in state, and let React’s batching handle the re-renders. For very long streams, throttle UI updates to ~60Hz to avoid pegging the main thread.

Cancellation: The Part Everyone Forgets

When a user types a long question, hits submit, and then immediately changes the question, the in-flight request should be cancelled. Without cancellation, the server continues generating tokens (and you continue paying for them) while the user has moved on.

End-to-end cancellation requires:

  • AbortSignal on the client. fetch(url, { signal: controller.signal }). Cancelling aborts the request and closes the connection.
  • Detect client disconnection on the server. FastAPI/Starlette expose await request.is_disconnected(); Node exposes request.socket.destroyed and request.on('close', ...).
  • Cancel the OpenAI request. Pass an AbortSignal in the request options (Node SDK), or close the stream and let task cancellation propagate (Python SDK).

In FastAPI, with stream being the open OpenAI stream:

async def event_stream():
    try:
        async for chunk in stream:
            if await request.is_disconnected():
                await stream.close()
                return
            ...
    except asyncio.CancelledError:
        await stream.close()
        raise

Without this, a noisy frontend (users navigating away, retrying) can double or triple your OpenAI bill.
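
The propagation can be exercised without any network at all. The sketch below uses a fake token generator as a stand-in for the upstream stream; fake_token_stream, relay, and the is_disconnected callback are all hypothetical names for illustration.

```python
import asyncio


async def fake_token_stream(produced: list):
    """Stand-in for an upstream LLM stream; records each token it generates."""
    try:
        for i in range(1000):
            await asyncio.sleep(0)  # yield control, as a network read would
            produced.append(i)
            yield f"tok{i} "
    finally:
        # Runs when the consumer closes the generator: the place to release
        # the upstream connection so generation stops.
        produced.append("closed")


async def relay(is_disconnected, produced: list) -> int:
    """Forward tokens until the client goes away, then close the upstream."""
    sent = 0
    stream = fake_token_stream(produced)
    try:
        async for _token in stream:
            if await is_disconnected():
                break
            sent += 1
    finally:
        await stream.aclose()  # propagate the cancellation upstream
    return sent
```

With a client that disconnects after five tokens, the generator stops after a handful of tokens instead of running all 1000 iterations.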

Backpressure

A client reading slowly should slow down generation. Without backpressure, the server buffers indefinitely.

The mechanism is already built into the transport: TCP flow control propagates backpressure through the HTTP connection. The point is that you do not need to add your own buffer. Don’t fire off asyncio.create_task(send(chunk)) without awaiting the send; let the natural flow control work.

The danger zone is when you introduce a queue or buffer between the OpenAI stream and your response — say, to broadcast the response to multiple consumers, or to persist while streaming. The queue removes backpressure. Be deliberate about whether you want that.
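
If a queue is unavoidable, bounding it keeps backpressure intact. The following sketch compares peak queue depth for a bounded and an unbounded asyncio.Queue with a fast producer and a slow consumer; run_pipeline and its parameters are illustrative.

```python
import asyncio


async def run_pipeline(maxsize: int, n_chunks: int = 50) -> int:
    """Return the peak queue depth seen with a fast producer and a slow consumer."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)  # 0 means unbounded
    peak = 0

    async def producer():
        nonlocal peak
        for i in range(n_chunks):
            await queue.put(i)  # waits here whenever a bounded queue is full
            peak = max(peak, queue.qsize())
        await queue.put(None)  # sentinel: end of stream

    async def consumer():
        while await queue.get() is not None:
            await asyncio.sleep(0.001)  # simulate a slow client

    await asyncio.gather(producer(), consumer())
    return peak
```

With maxsize=4 the producer is held to the consumer’s pace; with maxsize=0 it runs ahead and the whole response piles up in memory.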

Mid-Stream Errors

Streams can fail mid-flight: the API returns an error after 100 tokens, the connection drops, or a content filter triggers at token 200. The client needs a way to know.

The protocol decision: communicate errors as data events with a distinct type, not by closing the connection.

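try:
    async for chunk in stream:
        ...
except APIError as e:
    yield f"data: {json.dumps({'error': str(e)})}\n\n"
# Emit the terminator after the try block rather than in finally: yielding
# from finally while the generator is being closed raises RuntimeError.
yield "data: [DONE]\n\n"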

The client checks for error fields and displays appropriately. A silently-truncated stream is the worst UX — the user thinks the model decided to stop early.
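
Here is a self-contained sketch of that contract, with a fake upstream that fails after two tokens. UpstreamError stands in for whatever exception the SDK raises, and the "type" field is one possible way to mark events, not a fixed schema.

```python
import asyncio
import json


class UpstreamError(Exception):
    """Hypothetical stand-in for an SDK error raised mid-stream."""


async def failing_stream():
    yield "Hello"
    yield " wor"
    raise UpstreamError("connection reset")


def sse(payload: dict) -> str:
    return f"data: {json.dumps(payload)}\n\n"


async def to_sse(stream):
    """Wrap a token stream so failures reach the client as a typed event."""
    try:
        async for token in stream:
            yield sse({"type": "content", "content": token})
    except UpstreamError as exc:
        yield sse({"type": "error", "message": str(exc)})
    yield "data: [DONE]\n\n"


async def collect(agen):
    return [event async for event in agen]


events = asyncio.run(collect(to_sse(failing_stream())))
```

The client receives two content events, one error event, and the terminator, so it can tell a failed stream from one that simply ended.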

OpenAI also emits finish_reason in the final chunk: stop, length, content_filter, tool_calls. Surface these distinctly:

  • length — truncated due to max_tokens. Offer “continue” affordance.
  • content_filter — blocked. Explain.
  • stop — model finished. Normal.
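
One way to keep that mapping explicit is a small lookup on the client or server. The status labels and notice strings below are placeholders, not values defined by the API.

```python
def describe_finish(finish_reason):
    """Map a finish_reason to a hypothetical UI treatment."""
    treatments = {
        "stop": {"status": "complete", "notice": None, "offer_continue": False},
        "length": {
            "status": "truncated",
            "notice": "The response hit the token limit.",
            "offer_continue": True,
        },
        "content_filter": {
            "status": "blocked",
            "notice": "The response was stopped by a content filter.",
            "offer_continue": False,
        },
        "tool_calls": {"status": "awaiting_tool", "notice": None, "offer_continue": False},
    }
    # Unknown or missing reasons are surfaced rather than treated as success.
    fallback = {"status": "unknown", "notice": "The response ended unexpectedly.",
                "offer_continue": False}
    return treatments.get(finish_reason, fallback)
```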

Streaming Structured Outputs

What if your application consumes structured JSON, not free text? The challenge: a standard parser cannot parse partial JSON until the closing brace arrives.

Three approaches:

  • Wait for completion before parsing. Defeats streaming. Use only if structure must be valid before any UI feedback.
  • Stream the tokens but render only the textual parts. Many UIs care about a “message” field but also include metadata; you can render the message stream while collecting metadata at the end.
  • Partial-JSON parsing libraries. partial-json-parser (JS), ijson (Python). Parse “as much as is valid” at each step; render the visible parts.
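
To illustrate the third approach without depending on either library, here is a hand-rolled, best-effort sketch: it closes any open string, array, or object and tries to parse the result. It is not the API of partial-json-parser or ijson, and parse_partial_json is a hypothetical name.

```python
import json


def parse_partial_json(prefix: str):
    """Best-effort parse of a JSON prefix; returns None if it cannot be closed yet."""
    stack = []  # closers owed for currently open objects and arrays
    in_string = False
    escaped = False
    for ch in prefix:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            stack.append("}")
        elif ch == "[":
            stack.append("]")
        elif ch in "}]" and stack:
            stack.pop()
    if escaped:
        return None  # prefix ends mid-escape; wait for more input
    candidate = prefix + ('"' if in_string else "") + "".join(reversed(stack))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None  # e.g. dangling key or trailing comma; wait for more input
```

A caller would keep the last non-None result and re-render from it. One caveat of this sketch: a number may be reported before all of its digits have arrived.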

For tool calls specifically, OpenAI streams tool arguments incrementally as a delta:

{"tool_calls": [{"index": 0, "function": {"arguments": "{\""}}]}
{"tool_calls": [{"index": 0, "function": {"arguments": "name\":"}}]}
{"tool_calls": [{"index": 0, "function": {"arguments": " \"Akhil\"}"}}]}

Accumulate the arguments string per-tool-call index and parse only when complete. Showing a partial argument to the user is rarely useful; showing “calling tool: get_weather” with a spinner is.
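
The accumulation step might look like the sketch below, which works on plain dicts shaped like the deltas above rather than on SDK objects. accumulate_tool_calls is an illustrative helper name.

```python
import json


def accumulate_tool_calls(deltas):
    """Concatenate streamed argument fragments per tool-call index, then parse."""
    buffers = {}
    for delta in deltas:
        for call in delta.get("tool_calls", []):
            fragment = call.get("function", {}).get("arguments", "")
            buffers[call["index"]] = buffers.get(call["index"], "") + fragment
    # Parse only after the stream has finished, when each buffer is complete JSON.
    return {index: json.loads(raw) for index, raw in buffers.items()}


deltas = [
    {"tool_calls": [{"index": 0, "function": {"arguments": "{\""}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": "name\":"}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": " \"Akhil\"}"}}]},
]
parsed = accumulate_tool_calls(deltas)
```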

Tool Calls and Reasoning

Modern LLM responses interleave reasoning, tool calls, and final answers. The UI must handle each phase:

  • Reasoning content (in newer models) — render as a collapsed “thinking” section. Users want the option to expand; most don’t.
  • Tool calls — show “Calling [tool] with [args]…” and the result. The model often emits these mid-stream before continuing with the answer.
  • Final answer — the streaming text the user reads.

Frameworks that handle this well (Vercel AI SDK, Mastra, LangChain.js) abstract the multi-phase nature. Building from scratch is reasonable; just plan for the state machine.
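
If you do build it yourself, the state machine can start small. In this sketch the event kinds ("reasoning_delta", "tool_call_delta", and so on) are hypothetical labels your own server would assign, not names from the OpenAI API.

```python
from enum import Enum


class Phase(Enum):
    IDLE = "idle"
    REASONING = "reasoning"
    TOOL_CALL = "tool_call"
    ANSWER = "answer"
    DONE = "done"


# Hypothetical event kinds mapped to the phase the UI should display.
_PHASE_FOR_EVENT = {
    "reasoning_delta": Phase.REASONING,
    "tool_call_delta": Phase.TOOL_CALL,
    "tool_result": Phase.TOOL_CALL,
    "content_delta": Phase.ANSWER,
    "done": Phase.DONE,
}


def phase_transitions(event_kinds):
    """Return the (from, to) phase changes for a sequence of event kinds."""
    current = Phase.IDLE
    transitions = []
    for kind in event_kinds:
        nxt = _PHASE_FOR_EVENT.get(kind, current)  # unknown events keep the phase
        if nxt is not current:
            transitions.append((current, nxt))
            current = nxt
    return transitions
```

Each transition is a natural place to swap UI components: collapse the thinking section, show the tool spinner, or start rendering answer text.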

Observability for Streaming

Streaming adds observability complexity. You want metrics for both phases, the wait for the first token and the generation that follows:

  • TTFT. The most important UX metric. Histogram by model and prompt class.
  • TTLT. End-to-end duration.
  • Tokens emitted. Output token count per response. Drives cost.
  • Cancellation rate. Fraction of streams cancelled by client. High rates indicate UI bugs or user dissatisfaction.
  • Error rate by finish_reason. length, content_filter, transient API errors.
  • Time-per-token. Histogram. Anomalies indicate model degradation or routing issues.
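
The timing metrics fall out of the arrival timestamps of each chunk. A minimal sketch, assuming you record a monotonic timestamp per received chunk and treat each chunk as roughly one token (an approximation):

```python
from dataclasses import dataclass


@dataclass
class StreamTimings:
    ttft_s: float
    ttlt_s: float
    tokens: int
    time_per_token_s: float


def summarize_stream(request_start: float, token_times: list) -> StreamTimings:
    """Derive TTFT, TTLT, and mean time-per-token from arrival timestamps."""
    if not token_times:
        raise ValueError("no tokens were emitted")
    ttft = token_times[0] - request_start
    ttlt = token_times[-1] - request_start
    gaps = len(token_times) - 1
    # Mean inter-token gap after the first token; zero for a single-token stream.
    per_token = (ttlt - ttft) / gaps if gaps else 0.0
    return StreamTimings(ttft, ttlt, len(token_times), per_token)
```

These values would then be recorded into whatever histogram or span attributes your metrics stack uses.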

For traces, each stream is one span; sub-spans for tool calls and reasoning blocks. The Vercel AI SDK and OpenTelemetry’s GenAI semantic conventions provide a vocabulary for these — adopt the conventions early so dashboards and tooling work out of the box.

Production Pitfalls

A few patterns that bite teams:

  • Idle timeouts on load balancers. Idle timeouts count time with no bytes flowing, so a slow first token or a long pause mid-stream can exceed the idle timeout on many ingress configurations (the ALB default is 60s, and it is sometimes set lower). Configure long timeouts on streaming routes specifically.
  • Compression middleware. Compression middleware that buffers the response before flushing kills streaming. Exempt SSE routes.
  • CDN/caching layers. A CDN that caches the response defeats streaming and breaks subsequent requests. Set Cache-Control: no-store and ensure your edge respects it.
  • Mobile networks dropping connections. Reconnection logic with resumption (start where you left off via prompt continuation) preserves UX on flaky networks.
  • Token usage tracking. Streaming responses don’t include the final usage block by default. Request it explicitly (stream_options={"include_usage": True} in Python) or accept that you need to estimate.
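
When usage is requested, it arrives in an extra final chunk whose choices list is empty, so code that indexes choices[0] unconditionally will fail on it. A sketch of handling that, using plain dicts in place of SDK chunk objects (split_usage and the sample values are illustrative):

```python
def split_usage(chunks):
    """Separate streamed text from the trailing usage chunk, if one was requested."""
    text_parts = []
    usage = None
    for chunk in chunks:
        if chunk.get("usage"):  # only set on the extra final chunk
            usage = chunk["usage"]
        for choice in chunk.get("choices", []):  # empty list on the usage chunk
            content = choice.get("delta", {}).get("content")
            if content:
                text_parts.append(content)
    return "".join(text_parts), usage


sample_chunks = [
    {"choices": [{"delta": {"content": "Hi"}, "finish_reason": None}], "usage": None},
    {"choices": [{"delta": {}, "finish_reason": "stop"}], "usage": None},
    {"choices": [], "usage": {"prompt_tokens": 9, "completion_tokens": 1, "total_tokens": 10}},
]
text_out, usage_out = split_usage(sample_chunks)
```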

Closing

Streaming is the single largest perceived-latency improvement you can make to an LLM-backed product, and the implementation surface is larger than it looks. The chain from OpenAI to user must stream throughout; cancellation must propagate end-to-end; errors must be communicated as data events, not connection drops; tool calls and structured outputs require their own rendering states; observability needs to handle the multi-phase nature. None of this is hard once you’ve done it once; all of it is easy to miss the first time. Build the streaming path deliberately — end-to-end, with cancellation, with error semantics, with the right transport headers — and the LLM in your product stops feeling like a batch system and starts feeling like a conversation. That difference is most of what users notice.