Building Streaming AI Interfaces with OpenAI APIs

A 5-second response from an LLM feels broken. The same 5-second response, with the first words appearing in 300ms and the rest streaming in as they’re generated, feels conversational. The model is doing the same work; only the user experience differs. This is why every serious LLM product streams responses: perceived latency, not actual latency, is what matters to users.

Streaming sounds simple in a demo (stream=True, render tokens as they arrive) and turns out to involve real engineering once you ship it to production — backpressure across an HTTP hop, graceful cancellation, partial-output validation, error handling mid-stream, and serving structured outputs that need to be parsed before display. This post is about getting that right.

Why Streaming Matters

LLMs generate tokens autoregressively — one token at a time, each depending on the prior ones. Total latency for a response is roughly:

TTLT ≈ TTFT + (output_tokens × time_per_token)

For GPT-4-class models, TTFT (time to first token) is typically 200ms–1s, and time-per-token is 20–80ms. A 300-token response thus takes roughly 6–25 seconds to reach its last token (TTLT, time to last token). If you wait for the full response, the user stares at a spinner for that duration. If you stream, the user sees output beginning within ~500ms and reads it as it appears.
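As a quick check on those numbers, the formula can be evaluated directly. This is only a sketch: the function name estimate_ttlt is illustrative, and the inputs are the ranges quoted above rather than measurements.

```python
def estimate_ttlt(ttft_s: float, output_tokens: int, time_per_token_s: float) -> float:
    """Approximate time to last token: TTFT plus per-token generation time."""
    return ttft_s + output_tokens * time_per_token_s


# Best and worst case for a 300-token response, using the ranges quoted above.
best = estimate_ttlt(ttft_s=0.2, output_tokens=300, time_per_token_s=0.020)
worst = estimate_ttlt(ttft_s=1.0, output_tokens=300, time_per_token_s=0.080)
print(f"{best:.1f}s to {worst:.1f}s")
```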

The user’s perception of speed is dominated by TTFT, not TTLT. Streaming therefore improves perceived latency without any change to the model itself.

For a response that takes 8 seconds end to end, the streaming and non-streaming cases complete at the same wall-clock time, but with streaming the user sees output almost immediately instead of after the full 8-second wait.

The Protocol

OpenAI’s streaming uses Server-Sent Events (SSE). Each chunk is a JSON object with a delta — the incremental content since the last chunk:

data: {"id":"...","choices":[{"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"...","choices":[{"delta":{"content":" world"},"finish_reason":null}]}

data: {"id":"...","choices":[{"delta":{},"finish_reason":"stop"}]}

data: [DONE]

SSE is the right transport choice for this: it is unidirectional server-to-client, text-based, reconnects automatically when consumed through the browser's EventSource API, plays well with HTTP/2 multiplexing, and works through proxies (unlike WebSocket in some restrictive environments).
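To make the wire format concrete, here is a minimal sketch that reassembles the text from raw data: lines like the ones above. The function name collect_content is hypothetical, and it assumes each line arrives whole; in practice the SDKs do this parsing for you.

```python
import json
from typing import Iterable, List, Optional, Tuple


def collect_content(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Join delta.content fragments from SSE `data:` lines.

    Returns (text, finish_reason) and stops at the [DONE] sentinel.
    """
    parts: List[str] = []
    finish_reason: Optional[str] = None
    for line in lines:
        if not line.startswith("data: "):
            continue  # blank separator lines and other SSE fields
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        choice = json.loads(payload)["choices"][0]
        parts.append(choice["delta"].get("content") or "")
        finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(parts), finish_reason


raw = [
    'data: {"id":"...","choices":[{"delta":{"content":"Hello"},"finish_reason":null}]}',
    "",
    'data: {"id":"...","choices":[{"delta":{"content":" world"},"finish_reason":null}]}',
    "",
    'data: {"id":"...","choices":[{"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]
text, reason = collect_content(raw)
print(repr(text), reason)
```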

In Python:

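from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

# Run this inside an async function; send_to_user is your own delivery coroutine.
stream = await client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    stream=True,
)
async for chunk in stream:
    if not chunk.choices:  # e.g. the usage-only chunk when include_usage is set
        continue
    delta = chunk.choices[0].delta
    if delta.content:
        await send_to_user(delta.content)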

In Node.js:

const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [...],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) await sendToUser(content);
}

Both SDKs handle the SSE parsing internally; you receive structured chunks.

End-to-End Streaming, Not Just Server-Side

A common mistake: the server receives a streaming response from OpenAI and then buffers it before responding to the client. This negates the entire point of streaming.

The chain has to stream throughout:

OpenAI streams chunks to your server.
Your server streams chunks to the client.
The client renders chunks as they arrive.

Each hop needs explicit streaming handling. In FastAPI:

import json

from fastapi.responses import StreamingResponse

@app.post("/chat")
async def chat(req: ChatRequest):
    async def event_stream():
        async for chunk in await openai_client.chat.completions.create(
            model="gpt-4o", messages=req.messages, stream=True,
        ):
            delta = chunk.choices[0].delta
            if delta.content:
                yield f"data: {json.dumps({'content': delta.content})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")

In Fastify, writing to the raw Node response:

fastify.post("/chat", async (request, reply) => {
  // Take over the response so Fastify does not try to send its own.
  reply.hijack();
  reply.raw.setHeader("Content-Type", "text/event-stream");
  reply.raw.setHeader("Cache-Control", "no-cache, no-transform");
  reply.raw.setHeader("X-Accel-Buffering", "no");

  const stream = await openai.chat.completions.create({...});
  for await (const chunk of stream) {
    const c = chunk.choices[0]?.delta?.content;
    if (c) reply.raw.write(`data: ${JSON.stringify({ content: c })}\n\n`);
  }
  reply.raw.write("data: [DONE]\n\n");
  reply.raw.end();
});

A few specifics that bite people:

X-Accel-Buffering: no disables proxy buffering on NGINX (the default is to buffer). Without it, NGINX collects the stream before forwarding to the client.
Disable compression for SSE responses. Gzip middleware typically holds output until it has enough data to flush, which negates streaming.
On HTTP/1.1, chunked transfer encoding is the underlying mechanism (HTTP/2 uses its own data frames); most frameworks handle this correctly when you write before end.

Client-Side Rendering

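On the client, fetch with a ReadableStream works for a POST endpoint like this one. The EventSource API is simpler but only issues GET requests, so it fits only if the prompt can travel in the URL: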

const res = await fetch("/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(req),
});
const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const parts = buffer.split("\n\n");
  buffer = parts.pop() ?? "";
  for (const part of parts) {
    if (!part.startsWith("data: ")) continue;
    const payload = part.slice(6);
    if (payload === "[DONE]") return;
    const { content } = JSON.parse(payload);
    appendToUI(content);
  }
}

The \n\n split is important: SSE messages are separated by blank lines, and a network chunk can end mid-message. Keeping the trailing partial message in the buffer until the next read prevents JSON-parse failures.
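The same buffering logic, sketched in Python for a non-browser client. SSEBuffer is a hypothetical helper name. It also uses an incremental decoder, for the same reason the JavaScript version passes stream: true to TextDecoder: a multi-byte UTF-8 character can be split across two reads.

```python
import codecs
import json
from typing import List


class SSEBuffer:
    """Reassembles complete SSE messages from arbitrarily split byte chunks."""

    def __init__(self) -> None:
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._buffer = ""

    def feed(self, chunk: bytes) -> List[str]:
        """Return the payload of every `data:` message completed by this chunk."""
        self._buffer += self._decoder.decode(chunk)
        # Everything before the last blank-line separator is complete;
        # the remainder stays buffered until more bytes arrive.
        *complete, self._buffer = self._buffer.split("\n\n")
        return [m[len("data: "):] for m in complete if m.startswith("data: ")]


wire = (
    'data: {"content": "héllo"}\n\n'
    'data: {"content": " world"}\n\n'
    "data: [DONE]\n\n"
).encode("utf-8")

buf = SSEBuffer()
payloads: List[str] = []
# Deliver 7 bytes at a time so messages (and the two-byte "é") get split.
for start in range(0, len(wire), 7):
    payloads.extend(buf.feed(wire[start:start + 7]))

texts = [json.loads(p)["content"] for p in payloads if p != "[DONE]"]
print(texts)
```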

For React, the rendering pattern that works well: accumulate the streamed content in state, and let React’s batching handle the re-renders. For very long streams, throttle UI updates to ~60Hz to avoid pegging the main thread.
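The throttling idea is independent of React: coalesce fragments and flush at most once per interval. Below is a language-neutral sketch in Python; ThrottledRenderer and its parameters are hypothetical names, and in a browser you would more likely schedule the flush with requestAnimationFrame.

```python
import time
from typing import Callable, List


class ThrottledRenderer:
    """Coalesces streamed fragments and renders at most once per interval."""

    def __init__(
        self,
        render: Callable[[str], None],
        hz: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._render = render
        self._interval = 1.0 / hz
        self._clock = clock
        self._pending: List[str] = []
        self._last_flush = float("-inf")  # so the first fragment renders at once

    def push(self, fragment: str) -> None:
        self._pending.append(fragment)
        if self._clock() - self._last_flush >= self._interval:
            self.flush()

    def flush(self) -> None:
        """Render everything pending. Call once more when the stream ends."""
        if self._pending:
            self._render("".join(self._pending))
            self._pending.clear()
        self._last_flush = self._clock()


# Demo with a fake clock so the behaviour is deterministic.
now = [0.0]
rendered: List[str] = []
renderer = ThrottledRenderer(rendered.append, hz=60.0, clock=lambda: now[0])
for t, fragment in [(0.000, "a"), (0.005, "b"), (0.010, "c"), (0.020, "d"), (0.020, "e")]:
    now[0] = t
    renderer.push(fragment)
renderer.flush()  # end of stream
print(rendered)
```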

Cancellation: The Part Everyone Forgets

A user typing a long question, hitting submit, then immediately changing the question, should cancel the in-flight request. Without cancellation, the server continues generating tokens (and you continue paying for them) while the user has moved on.

End-to-end cancellation requires:

AbortSignal on the client. fetch(url, { signal: controller.signal }). Cancelling aborts the request and closes the connection.
Detect client disconnection on the server. FastAPI/Starlette expose await request.is_disconnected(); Node exposes request.socket.destroyed and request.on('close', ...).
Cancel the OpenAI request. Pass an AbortSignal in the request options (Node SDK), or close the stream and let task cancellation propagate (Python SDK).

async def event_stream():
    try:
        async for chunk in stream:
            if await request.is_disconnected():
                await stream.close()
                return
            ...
    except asyncio.CancelledError:
        await stream.close()
        raise

Without this, a noisy frontend (users navigating away, retrying) can inflate your OpenAI bill well beyond the tokens anyone actually reads.
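The pattern can be exercised without a network. The sketch below uses stand-ins (FakeStream and FakeRequest are hypothetical test doubles, not SDK or Starlette classes) and shows that upstream consumption stops shortly after the client goes away. It closes the stream in a finally block so the same cleanup runs on normal completion, disconnect, and task cancellation.

```python
import asyncio


class FakeStream:
    """Stand-in for an SDK stream: async-iterable with an async close()."""

    def __init__(self, tokens):
        self._tokens = iter(tokens)
        self.closed = False
        self.produced = 0

    def __aiter__(self):
        return self

    async def __anext__(self):
        if self.closed:
            raise StopAsyncIteration
        try:
            token = next(self._tokens)
        except StopIteration:
            raise StopAsyncIteration
        await asyncio.sleep(0)  # simulate waiting on the network
        self.produced += 1
        return token

    async def close(self):
        self.closed = True


class FakeRequest:
    """Stand-in for a request that reports a disconnect after N checks."""

    def __init__(self, disconnect_after):
        self._remaining = disconnect_after

    async def is_disconnected(self):
        self._remaining -= 1
        return self._remaining < 0


async def event_stream(request, stream):
    try:
        async for token in stream:
            if await request.is_disconnected():
                return
            yield f"data: {token}\n\n"
    finally:
        await stream.close()  # stop paying for tokens nobody will read


async def main():
    stream = FakeStream(f"t{i}" for i in range(1000))
    request = FakeRequest(disconnect_after=3)
    sent = [event async for event in event_stream(request, stream)]
    return sent, stream


sent, stream = asyncio.run(main())
print(len(sent), stream.produced, stream.closed)
```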

Backpressure

A client that reads slowly should slow down the whole pipeline rather than pile data up on your server. Without backpressure, the server buffers indefinitely.

The mechanism is built into the transport: TCP flow control propagates backpressure through each HTTP hop, so you do not need to add your own buffer. Don’t fire off asyncio.create_task(send(chunk)) without awaiting the send; await each write and let the natural flow control work.

The danger zone is when you introduce a queue or buffer between the OpenAI stream and your response — say, to broadcast the response to multiple consumers, or to persist while streaming. The queue removes backpressure. Be deliberate about whether you want that.
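If you do need a queue, bounding it keeps backpressure intact. The sketch below compares a bounded and an unbounded asyncio.Queue between a fast producer and a slow consumer; the function names and the sizes are illustrative assumptions.

```python
import asyncio


async def producer(queue, n, log):
    for i in range(n):
        await queue.put(i)  # suspends while a bounded queue is full
        log.append(("put", i))
    await queue.put(None)  # sentinel: end of stream


async def slow_consumer(queue, log):
    while (item := await queue.get()) is not None:
        log.append(("got", item))
        await asyncio.sleep(0.001)  # simulate a slow client


async def run(maxsize):
    queue = asyncio.Queue(maxsize=maxsize)  # maxsize=0 means unbounded
    log = []
    await asyncio.gather(producer(queue, 20, log), slow_consumer(queue, log))
    return log


def max_lead(log):
    """Largest number of items produced but not yet consumed."""
    lead = worst = 0
    for kind, _ in log:
        lead += 1 if kind == "put" else -1
        worst = max(worst, lead)
    return worst


bounded = max_lead(asyncio.run(run(maxsize=2)))
unbounded = max_lead(asyncio.run(run(maxsize=0)))
print(bounded, unbounded)
```

With the bounded queue the producer can never run more than maxsize items ahead; with the unbounded one it races ahead of the consumer and memory absorbs the difference.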

Mid-Stream Errors

Streams can fail mid-flight: the API returns an error after 100 tokens, the connection drops, or the content filter triggers at token 200. The client needs a way to know.

The protocol decision: communicate errors as data events with a distinct type, not by closing the connection.

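from openai import APIError

try:
    async for chunk in stream:
        ...
except APIError as e:
    # Report the failure in-band so the client can tell it from a normal end.
    yield f"data: {json.dumps({'error': str(e)})}\n\n"
# Emit the terminator after the try block rather than in `finally`: yielding
# from `finally` raises when the generator is closed early (client disconnect).
yield "data: [DONE]\n\n"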

The client checks for error fields and displays appropriately. A silently-truncated stream is the worst UX — the user thinks the model decided to stop early.
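An end-to-end sketch of that contract, with a fake upstream that fails partway through. UpstreamError, failing_stream, and summarize are hypothetical names used for illustration; the real SDK raises its own error types.

```python
import asyncio
import json


class UpstreamError(Exception):
    """Stand-in for the SDK's API error type."""


async def failing_stream():
    yield "Hello"
    yield " wor"
    raise UpstreamError("connection reset")


async def event_stream(stream):
    try:
        async for text in stream:
            yield f"data: {json.dumps({'content': text})}\n\n"
    except UpstreamError as exc:
        # In-band error event: the client can tell this from a normal end.
        yield f"data: {json.dumps({'error': str(exc)})}\n\n"
    yield "data: [DONE]\n\n"


def summarize(events):
    """Client-side view: collected text plus an error message, if any."""
    text, error = "", None
    for event in events:
        payload = event[len("data: "):].strip()
        if payload == "[DONE]":
            break
        message = json.loads(payload)
        if "error" in message:
            error = message["error"]
        else:
            text += message["content"]
    return text, error


async def main():
    return [event async for event in event_stream(failing_stream())]


events = asyncio.run(main())
text, error = summarize(events)
print(repr(text), error)
```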

OpenAI also emits finish_reason in the final content chunk: stop, length, content_filter, tool_calls. Surface these distinctly:

length: truncated due to max_tokens. Offer a “continue” affordance.
content_filter: blocked. Explain.
stop: model finished. Normal.
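One way to keep this mapping in a single place is a small lookup. The sketch below is illustrative: the function name and the user-facing strings are assumptions, and treating a missing finish_reason as an unexpected end is a design choice rather than an API rule.

```python
from typing import Optional

# finish_reason -> notice to show the user (None means nothing to show).
FINISH_NOTICES = {
    "stop": None,
    "tool_calls": None,
    "length": "The response hit the token limit. Continue?",
    "content_filter": "The response was blocked by the content filter.",
}


def finish_notice(finish_reason: Optional[str]) -> Optional[str]:
    """Return the notice for a finished stream, or None for a normal end."""
    if finish_reason is None:
        # The stream ended without a final chunk: treat it as a truncation.
        return "The response ended unexpectedly."
    return FINISH_NOTICES.get(finish_reason, f"Stopped: {finish_reason}")


for reason in ("stop", "length", "content_filter", None):
    print(reason, "->", finish_notice(reason))
```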

Streaming Structured Outputs

What if your application consumes structured JSON, not free text? The challenge: a standard JSON parser rejects a partial document until the closing brace arrives.

Three approaches:

Wait for completion before parsing. Defeats streaming. Use only if structure must be valid before any UI feedback.
Stream the tokens but render only the textual parts. Many UIs care about a “message” field but also include metadata; you can render the message stream while collecting metadata at the end.
Partial-JSON parsing libraries. partial-json-parser (JS), ijson (Python). Parse “as much as is valid” at each step; render the visible parts.
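To show what “parse as much as is valid” means, here is a deliberately small hand-rolled sketch rather than either library's API. It closes any open string and brackets, tries to parse, and falls back to the last good result when the prefix cannot be completed that simply (for example, when it ends mid-key). The function names are hypothetical. One caveat: a number that is still arriving may parse as a shorter number.

```python
import json


def complete_partial_json(text: str) -> str:
    """Append the closers needed to balance an incomplete JSON prefix."""
    closers = []
    in_string = escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()
    if in_string:
        if escaped:
            text = text[:-1]  # drop a dangling backslash
        text += '"'
    return text + "".join(reversed(closers))


def parse_partial(text: str, last_good=None):
    """Best-effort parse of a JSON prefix; keep the previous value on failure."""
    try:
        return json.loads(complete_partial_json(text))
    except json.JSONDecodeError:
        return last_good


chunks = ['{"mess', 'age": "Hel', 'lo", "meta": {"tok', 'ens": 5}}']
received, state, snapshots = "", None, []
for chunk in chunks:
    received += chunk
    state = parse_partial(received, state)
    snapshots.append(state)
print(snapshots)
```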

For tool calls specifically, OpenAI streams tool arguments incrementally as a delta:

{"tool_calls": [{"index": 0, "function": {"arguments": "{\""}}]}
{"tool_calls": [{"index": 0, "function": {"arguments": "name\":"}}]}
{"tool_calls": [{"index": 0, "function": {"arguments": " \"Akhil\"}"}}]}

Accumulate the arguments string per-tool-call index and parse only when complete. Showing a partial argument to the user is rarely useful; showing “calling tool: get_weather” with a spinner is.
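A sketch of that accumulation, using the three fragments above. It assumes the deltas have already been converted to plain dicts (the SDKs hand you objects with the same fields); accumulate_tool_calls is a hypothetical helper name.

```python
import json
from collections import defaultdict


def accumulate_tool_calls(deltas):
    """Concatenate argument fragments per tool-call index, then parse each."""
    buffers = defaultdict(str)
    for delta in deltas:
        for call in delta.get("tool_calls", []):
            buffers[call["index"]] += call.get("function", {}).get("arguments", "")
    # Parse only once the stream is complete; partial arguments are not valid JSON.
    return {index: json.loads(args) for index, args in buffers.items()}


deltas = [
    {"tool_calls": [{"index": 0, "function": {"arguments": "{\""}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": "name\":"}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": " \"Akhil\"}"}}]},
]
calls = accumulate_tool_calls(deltas)
print(calls)
```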

Tool Calls and Reasoning

Modern LLM responses interleave reasoning, tool calls, and final answers. The UI must handle each phase:

Reasoning content (in newer models) — render as a collapsed “thinking” section. Users want the option to expand; most don’t.
Tool calls — show “Calling [tool] with [args]…” and the result. The model often emits these mid-stream before continuing with the answer.
Final answer — the streaming text the user reads.

Frameworks that handle this well (Vercel AI SDK, Mastra, LangChain.js) abstract the multi-phase nature. Building from scratch is reasonable; just plan for the state machine.
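If you do build it yourself, the state machine can be quite small. The sketch below folds a sequence of events into a view model; the event shapes ("type", "text", "name") and phase names are hypothetical, since each framework and API version names these differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MessageView:
    """What the UI renders for one assistant turn."""
    reasoning: str = ""  # shown as a collapsed "thinking" section
    tool_calls: List[Dict[str, str]] = field(default_factory=list)
    answer: str = ""  # the text the user reads
    phase: str = "idle"


def apply_event(view: MessageView, event: Dict[str, str]) -> MessageView:
    """Fold one stream event into the view."""
    kind = event["type"]
    if kind == "reasoning":
        view.reasoning += event["text"]
        view.phase = "thinking"
    elif kind == "tool_call":
        view.tool_calls.append({"name": event["name"], "status": "running"})
        view.phase = "calling_tool"
    elif kind == "tool_result":
        view.tool_calls[-1]["status"] = "done"
    elif kind == "text":
        view.answer += event["text"]
        view.phase = "answering"
    elif kind == "done":
        view.phase = "finished"
    return view


events = [
    {"type": "reasoning", "text": "Need the forecast."},
    {"type": "tool_call", "name": "get_weather"},
    {"type": "tool_result"},
    {"type": "text", "text": "It is "},
    {"type": "text", "text": "sunny."},
    {"type": "done"},
]
view = MessageView()
for event in events:
    apply_event(view, event)
print(view)
```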

Observability for Streaming

Streaming adds observability complexity. You want metrics that cover both the wait for the first token and the generation that follows:

TTFT. The most important UX metric. Histogram by model and prompt class.
TTLT. End-to-end duration.
Tokens emitted. Output token count per response. Drives cost.
Cancellation rate. Fraction of streams cancelled by client. High rates indicate UI bugs or user dissatisfaction.
Error and truncation rate. Break it down by finish_reason (length, content_filter) and track transient API errors separately.
Time-per-token. Histogram. Anomalies indicate model degradation or routing issues.
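The timing metrics above can all be derived from one timestamp per chunk. Below is a minimal sketch; StreamTimer is a hypothetical helper, and it treats each received chunk as one token, which is only an approximation.

```python
import time
from typing import Callable, List, Optional


class StreamTimer:
    """Records chunk arrival times and derives TTFT, TTLT, and time-per-token."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._started = clock()
        self._arrivals: List[float] = []

    def on_token(self) -> None:
        self._arrivals.append(self._clock())

    @property
    def ttft(self) -> Optional[float]:
        return self._arrivals[0] - self._started if self._arrivals else None

    @property
    def ttlt(self) -> Optional[float]:
        return self._arrivals[-1] - self._started if self._arrivals else None

    @property
    def time_per_token(self) -> Optional[float]:
        """Average gap between consecutive tokens after the first."""
        if len(self._arrivals) < 2:
            return None
        span = self._arrivals[-1] - self._arrivals[0]
        return span / (len(self._arrivals) - 1)


# Demo with a scripted clock: request at t=0, tokens at 0.50, 0.55, 0.60, 0.65.
ticks = iter([0.0, 0.50, 0.55, 0.60, 0.65])
timer = StreamTimer(clock=lambda: next(ticks))
for _ in range(4):
    timer.on_token()
print(timer.ttft, timer.ttlt, timer.time_per_token)
```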

For traces, each stream is one span; sub-spans for tool calls and reasoning blocks. The Vercel AI SDK and OpenTelemetry’s GenAI semantic conventions provide a vocabulary for these — adopt the conventions early so dashboards and tooling work out of the box.

Production Pitfalls

A few patterns that bite teams:

Idle timeouts on load balancers. An idle timeout fires when no bytes flow for the configured period (the ALB default is 60s, and ingress configurations are sometimes set lower), so a stream that pauses during a slow first token or a long tool call can be cut off mid-response. Configure long timeouts on streaming routes specifically.
Compression middleware. Compression middleware that buffers the response before flushing kills streaming. Exempt SSE routes.
CDN/caching layers. A CDN that caches the response defeats streaming and breaks subsequent requests. Set Cache-Control: no-store and ensure your edge respects it.
Mobile networks dropping connections. Reconnection logic with resumption (start where you left off via prompt continuation) preserves UX on flaky networks.
Token usage tracking. Streaming responses don’t include the usage block by default. Request it explicitly (stream_options={"include_usage": True} in Python), which adds one last chunk carrying the usage totals and an empty choices list, or accept that you need to estimate.

Closing

Streaming is the single largest perceived-latency improvement you can make to an LLM-backed product, and the implementation surface is larger than it looks. The chain from OpenAI to user must stream throughout; cancellation must propagate end-to-end; errors must be communicated as data events, not connection drops; tool calls and structured outputs require their own rendering states; observability needs to handle the multi-phase nature. None of this is hard once you’ve done it once; all of it is easy to miss the first time. Build the streaming path deliberately — end-to-end, with cancellation, with error semantics, with the right transport headers — and the LLM in your product stops feeling like a batch system and starts feeling like a conversation. That difference is most of what users notice.