Streaming, batching & async — moving LLM data without making the user wait
The model can only generate so fast. Everything else — what the user perceives, what the connection looks like on the wire, what costs you 50% less if you can afford to wait — is your choice of transport.
The three latencies you actually care about
There is no single "latency" number for an LLM call. There are three, and users weight them very unevenly.
TTFT — time to first token. The interval from request-sent to the first piece of output the user can see. This is the "is anything happening?" signal. Realistic numbers: 200–500ms for small models on a warm cache, 1–3s for large models, 5–10s+ for reasoning models that think before they speak.
Inter-token latency — the steady-state output rate once the stream is running. Usually quoted as tokens per second (60–200 t/s is typical; reasoning models with internal CoT run slower on the visible portion). This determines how fast text "fills in" once it's started.
Total completion time — TTFT plus (output_tokens / rate). The honest end-to-end number. Often the metric on your dashboard. Almost never the one your users feel.
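As a quick worked example of that formula, the sketch below plugs in illustrative numbers: a 400 ms TTFT and 60 t/s, both assumptions picked from the ranges quoted above. The function name is made up for this example.

```javascript
// Total completion time = TTFT + output_tokens / rate.
// ttftMs: time to first token, in milliseconds.
// tokensPerSecond: steady-state output rate once the stream is running.
function totalCompletionMs(ttftMs, outputTokens, tokensPerSecond) {
  if (tokensPerSecond <= 0) {
    throw new RangeError("tokensPerSecond must be positive");
  }
  return ttftMs + (outputTokens / tokensPerSecond) * 1000;
}

// A 600-token answer at 60 t/s with a 400 ms TTFT.
const totalMs = totalCompletionMs(400, 600, 60);
console.log(totalMs); // 10400
```

With streaming, the user starts reading at 400 ms. Without it, they wait the full 10.4 s for the same bytes.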
What streaming actually is
It's not magic, and it's not WebSockets. It's an old, boring HTTP pattern: Server-Sent Events over a long-lived response body, with each chunk written as the model generates it.
The server doesn't return the response in one piece. It holds the connection open, writes a chunk of bytes, flushes, writes another chunk, flushes, and finally closes when the model is done. The wire format is text: each event is a short block of field lines, usually an event: line naming the event type and a data: line carrying the payload (JSON, for most LLM APIs), terminated by a blank line.

```
HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
Transfer-Encoding: chunked

event: message_start
data: {"type":"message_start","message":{"role":"assistant"}}

event: content_block_delta
data: {"type":"content_block_delta","delta":{"type":"text_delta","text":"Hello"}}

event: content_block_delta
data: {"type":"content_block_delta","delta":{"type":"text_delta","text":" world"}}

event: message_stop
data: {"type":"message_stop"}
```
That's it. The browser reads it with EventSource (or, for non-GET requests or requests with custom headers, with fetch and a ReadableStream). Server frameworks expose write-then-flush primitives: res.write() in Node/Express (plus res.flush() when the compression middleware is in the path), StreamingResponse in FastAPI, http.Flusher's Flush() in Go. The bytes show up at the client roughly when they're emitted by the model, modulo any buffering you forgot to disable (more on that below).
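If you read the stream with fetch rather than EventSource, you have to split the text into events yourself. Here is a minimal sketch of an incremental parser, assuming the server uses LF line endings and that you only care about the event and data fields (id and retry are ignored). The name createSseParser is hypothetical, not part of any SDK.

```javascript
// Incremental SSE parser: feed it decoded text as it arrives, and it calls
// onEvent once per complete event. Partial events stay in the buffer.
function createSseParser(onEvent) {
  let buffer = "";
  return {
    feed(text) {
      buffer += text;
      let end;
      // A blank line (two consecutive newlines) terminates an event.
      while ((end = buffer.indexOf("\n\n")) !== -1) {
        const rawEvent = buffer.slice(0, end);
        buffer = buffer.slice(end + 2);

        let eventName = "message"; // default event type when no event: line is sent
        const dataLines = [];
        for (const line of rawEvent.split("\n")) {
          if (line.startsWith(":")) continue; // comment line, often used as a keep-alive
          const colon = line.indexOf(":");
          const field = colon === -1 ? line : line.slice(0, colon);
          let value = colon === -1 ? "" : line.slice(colon + 1);
          if (value.startsWith(" ")) value = value.slice(1); // strip one leading space
          if (field === "event") eventName = value;
          else if (field === "data") dataLines.push(value);
        }
        if (dataLines.length > 0) {
          onEvent({ event: eventName, data: dataLines.join("\n") });
        }
      }
    },
  };
}

// The event is split across two network chunks; nothing fires until it is complete.
const events = [];
const parser = createSseParser((e) => events.push(e));
parser.feed('event: content_block_delta\ndata: {"type":"content_block_delta","delta":{"type":"text_delta","text":"Hel');
parser.feed('lo"}}\n\n');
console.log(JSON.parse(events[0].data).delta.text); // Hello
```

The point of the buffer is that chunk boundaries and event boundaries are unrelated: a chunk can end in the middle of a JSON payload, so parsing per chunk instead of per event fails intermittently.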
The animated version
Watch a single token's life: the model emits it, the server frames it as an SSE event, the browser parses it, the DOM updates. Each hop has a latency cost that's normally invisible but adds up when something goes wrong.
When streaming is worth it (and when it isn't)
Streaming is not a free win. It complicates your server, your client parsing, your error handling, and your buffering. Reach for it when the user benefits — otherwise, the simpler request/response shape pays off.
Worth it: chat interfaces, long-form generation over ~200 tokens, reasoning models with visible thinking, agentic UIs that want to surface intermediate tool calls. Anything where the user is reading the output as it appears.
Not worth it: structured-output extractions where the consumer is a JSON parser — you can't act on half a JSON object, so streaming the chars gains you nothing. Classifications and single-token outputs. Backend pipelines that wait for the full response anyway. Embedding endpoints. Anything where streaming would just be a more expensive way of getting to the same final blob.
A useful gut check: does the next thing in your pipeline need the whole response before it can do anything useful? If yes, don't bother streaming.
Latency simulator
Three response shapes, three rendering modes, same generation rate. The "feels-fast" score is computed mostly from TTFT, the way users actually weight it.
[Interactive simulator: 60 t/s generation rate, TTFT 400 ms]
The backpressure problem
The model emits tokens at 60–200 t/s. If your frontend renders slower than that, the connection buffers somewhere, and the user sees the output appear in lurches — frozen, then a sudden burst, then frozen again.
The usual culprit isn't the network. It's that the per-token handler is doing too much work. Re-running a heavyweight syntax highlighter on the entire response after every token. Re-parsing 8KB of markdown into a virtual DOM on every chunk. Calling React's setState in a way that forces a full reconcile per token. Each of these is fine at 5 t/s; at 80 t/s the main thread is on fire.
The fix is to decouple ingestion from rendering. Stream the tokens into a plain string buffer, and update the visible DOM on a requestAnimationFrame tick (or every ~50ms) instead. The right panel in the simulator above shows this: same generation rate, same TTFT, smoother paint, less janky.
```javascript
// Naive: pays full render cost per token.
stream.on("delta", t => { el.textContent += t; });

// Better: append to buffer, flush on animation frame.
let buf = "", dirty = false;
stream.on("delta", t => { buf += t; dirty = true; });

function tick() {
  if (dirty) { el.textContent = buf; dirty = false; }
  requestAnimationFrame(tick);
}
requestAnimationFrame(tick);
```
Batch API — 50% off, hours of latency
Not every workload is user-facing. If the answer can wait an hour — or a day — every major provider has a batch endpoint at half the per-token price.
The shape is similar across providers: submit many requests at once, each tagged with your own ID, and get back a JSONL file where each line is one result. OpenAI's /v1/batches takes an uploaded JSONL file with one request per line; Anthropic's Message Batches API takes a single JSON body containing a requests array. Both work to a 24-hour window, and requests that have not finished by then expire rather than complete. Typical end-to-end is under an hour for most jobs. Pricing is half of the synchronous input and output rates, no asterisks.

```shell
# requests.json is one JSON object: {"requests": [{"custom_id": "...", "params": {...}}, ...]}
# where each params is an ordinary Messages API request body.
$ curl https://api.anthropic.com/v1/messages/batches \
    -H "x-api-key: $ANTHROPIC_API_KEY" \
    -H "anthropic-version: 2023-06-01" \
    -H "content-type: application/json" \
    -d @requests.json
# Returns a batch object with an id. Poll /v1/messages/batches/{id} until processing_status == "ended".
# Then GET the results_url to download a JSONL of results, one line per custom_id.
# Billing line item: 50% of standard input/output rates.
```
Worth it when: nightly enrichment jobs (classify yesterday's tickets), embedding 100K documents for a new index, retroactive analysis (re-score a year of conversations with the new model), evaluation suites running against thousands of cases. Anything where you'd happily come back tomorrow.
Not worth it when: a user is waiting. There is no SLA on batch beyond the 24-hour window: your p50 might be 20 minutes, but you have to plan for the full 24 hours, and you cannot bargain.
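To put the half-price claim in numbers, here is a small sketch of the arithmetic. The per-million-token prices are made-up placeholders, not any provider's price list, and the function name is hypothetical. Substitute your own rates.

```javascript
// Estimate the cost of a job, optionally at the 50% batch rate.
// Prices are USD per million tokens and are supplied by the caller.
function estimateCostUsd({ inputTokens, outputTokens, inputPricePerMTok, outputPricePerMTok, batch = false }) {
  const syncCost =
    (inputTokens / 1e6) * inputPricePerMTok +
    (outputTokens / 1e6) * outputPricePerMTok;
  return batch ? syncCost * 0.5 : syncCost;
}

// A nightly enrichment job: 200M input tokens, 20M output tokens,
// at placeholder prices of $2 and $8 per million tokens.
const job = {
  inputTokens: 200_000_000,
  outputTokens: 20_000_000,
  inputPricePerMTok: 2,
  outputPricePerMTok: 8,
};
console.log(estimateCostUsd(job));                      // 560
console.log(estimateCostUsd({ ...job, batch: true }));  // 280
```

The saving scales linearly with volume, which is why the trade only gets interesting for jobs large enough that nobody was going to watch them finish anyway.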
Async-job patterns for long-running work
Streaming helps when the model is slow. It does not help when your whole system is slow — a multi-step agent, a long RAG pipeline, a chain of three model calls. The user disconnects, your HTTP timeout fires, your load balancer kills the request. You need a different shape.
Pattern 1: job queue + poll
Client POSTs the work, gets back a job_id, polls GET /jobs/{id} every few seconds until status == "done". Cheap, works through every proxy and firewall, easy to scale because every request is short. Downside: the polling cadence is the floor on perceived latency, and the user sees stepped progress at best.
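A minimal sketch of the client side of this pattern follows. It swaps the fixed "every few seconds" cadence for capped exponential backoff, which is my substitution rather than something the pattern requires. The getStatus callback, the status strings, and both function names are assumptions for illustration; in practice getStatus would wrap a fetch of GET /jobs/{id}.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Capped exponential backoff: base, 2x base, 4x base, ... never above capMs.
function backoffDelayMs(attempt, baseMs = 1000, capMs = 10000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// getStatus: async () => ({ status: "queued" | "running" | "done" | "failed", result? })
// Resolves with the job result, or rejects on failure or after maxAttempts polls.
async function pollUntilDone(getStatus, { maxAttempts = 30, sleep = defaultSleep } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await getStatus();
    if (job.status === "done") return job.result;
    if (job.status === "failed") throw new Error("job failed");
    await sleep(backoffDelayMs(attempt));
  }
  throw new Error("gave up waiting for job");
}
```

Injecting sleep keeps the function testable without real timers. The backoff cap is the number to tune: it is the worst-case gap between the job finishing and the user finding out.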
Pattern 2: WebSocket / SSE streaming progress
Server pushes incremental progress events ("retrieving... 2 of 4 sources", "planning...", "executing tool x") over a persistent connection. Much better UX. Harder to scale — each in-flight request holds a connection — and you need reconnection logic for when the connection drops mid-job.
Pattern 3: durable state + resumable streams
Write the in-flight state to a key-value store (Redis, Cloudflare Durable Objects, DynamoDB). The client streams over a connection that can drop and resume — passing a Last-Event-ID on reconnect so the server picks up where it left off. The pattern most production agent platforms eventually arrive at, because users will close their laptop mid-response.
Concrete example: a four-step research pipeline — retrieve → plan → execute → synthesize. Each step pushes a status event. The user sees "Step 2 of 4: planning…" instead of a 25-second spinner. Total time is the same. Perceived time is dramatically different.
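The resume step in Pattern 3 can be sketched with an append-only event log. This version keeps events in memory as a stand-in for a durable store such as Redis, so it illustrates the replay logic only, not the durability. The class and method names are hypothetical.

```javascript
// Append-only log of progress events for one job. Each event gets a
// monotonically increasing id, which the server sends as the SSE id: field.
class JobEventLog {
  constructor() {
    this.events = []; // events[i] has id i + 1
  }

  // Record an event and return it with its assigned id.
  append(type, data) {
    const event = { id: this.events.length + 1, type, data };
    this.events.push(event);
    return event;
  }

  // Events the client has not seen yet. lastEventId is the value of the
  // Last-Event-ID request header on reconnect (absent on first connect).
  replayAfter(lastEventId) {
    const parsed = Number.parseInt(lastEventId ?? "0", 10);
    const start = Number.isNaN(parsed) ? 0 : Math.max(0, parsed);
    return this.events.slice(start);
  }
}

const log = new JobEventLog();
log.append("status", "Step 1 of 4: retrieving");
log.append("status", "Step 2 of 4: planning");
log.append("status", "Step 3 of 4: executing");
log.append("status", "Step 4 of 4: synthesizing");

// The client saw events 1 and 2, dropped, and reconnects with Last-Event-ID: 2.
const missed = log.replayAfter("2");
console.log(missed.map((e) => e.id)); // [ 3, 4 ]
```

Because the job writes to the log rather than to the socket, the work keeps going while nobody is connected, and a reconnecting client catches up instead of restarting.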
The streaming gotchas
1. The proxy buffer problem
The #1 cause of "streaming works on my laptop, fails in production." Anything sitting between your server and the browser (nginx, Cloudflare, AWS ALB, most CDNs, even compression middleware) can buffer response bodies for performance, and nginx does so for proxied responses by default. A buffering hop collects some or all of the response before delivering it, which is exactly what you don't want for SSE.
The fix depends on which hop is buffering. Send Cache-Control: no-cache and make sure Content-Type is text/event-stream; for nginx, either send X-Accel-Buffering: no from the app or set proxy_buffering off in the config; on Cloudflare Workers, return a streamed body (for example the readable side of a TransformStream) instead of assembling the whole response first. Verify with curl -N against the production URL: if it dumps the whole response at once, something in the path is buffering.
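On the application side, the headers and event framing can live in two small helpers. This is a sketch under the assumption that you are writing the SSE response by hand; both function names are hypothetical.

```javascript
// Response headers for an SSE endpoint. X-Accel-Buffering is an
// nginx-specific hint; other proxies generally ignore it.
function sseResponseHeaders() {
  return {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "X-Accel-Buffering": "no",
  };
}

// Frame one event. Multi-line data becomes one data: line per line of text,
// and the trailing blank line is what terminates the event on the wire.
function formatSseEvent({ event, data, id }) {
  let frame = "";
  if (id !== undefined) frame += `id: ${id}\n`;
  if (event !== undefined) frame += `event: ${event}\n`;
  for (const line of String(data).split("\n")) {
    frame += `data: ${line}\n`;
  }
  return frame + "\n";
}

console.log(JSON.stringify(formatSseEvent({ event: "message_stop", data: '{"type":"message_stop"}' })));
```

With Node's built-in http module, these would be used as res.writeHead(200, sseResponseHeaders()) followed by one res.write(formatSseEvent(...)) per event. None of this helps if a proxy further along the path buffers anyway, which is why the curl -N check against production matters.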
2. UTF-8 boundaries are not token boundaries
A single Japanese character like 食 is three bytes in UTF-8; most emoji are four bytes, and ZWJ sequences chain several codepoints into one visible glyph. Network chunks are cut at arbitrary byte offsets, and a tokenizer may even split a single codepoint across tokens, so a multi-byte character can straddle two chunks. If you decode each chunk independently and concatenate the results, you get mojibake: replacement characters where partial codepoints landed at a chunk edge. Always buffer at the byte level and decode with an incremental UTF-8 decoder (TextDecoder in JS, passing { stream: true } to decode()). The official SDKs do this internally; if you're rolling your own, don't forget.
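The difference is easy to reproduce. The sketch below splits the bytes of a short Japanese string in the middle of its first character and decodes the two chunks both ways. The helper names are made up for this example; TextEncoder and TextDecoder are the standard globals in browsers and Node.

```javascript
// Correct: one decoder for the whole stream, so it can hold back the
// trailing bytes of an incomplete character until the next chunk arrives.
function decodeChunks(chunks) {
  const decoder = new TextDecoder("utf-8");
  let text = "";
  for (const chunk of chunks) {
    text += decoder.decode(chunk, { stream: true });
  }
  text += decoder.decode(); // flush anything still buffered at end of stream
  return text;
}

// Wrong: a fresh decode per chunk treats each chunk as a complete document.
function decodeChunksNaively(chunks) {
  return chunks.map((chunk) => new TextDecoder("utf-8").decode(chunk)).join("");
}

const bytes = new TextEncoder().encode("食べる");
// 食 is three bytes; cut after the second so it straddles the chunk boundary.
const chunks = [bytes.slice(0, 2), bytes.slice(2)];

console.log(decodeChunks(chunks) === "食べる");                 // true
console.log(decodeChunksNaively(chunks).includes("\uFFFD"));   // true
```

The naive version only breaks when a boundary happens to fall inside a character, which is why this bug tends to surface in production with non-English users rather than in local testing.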
3. Streaming + tool use + thinking = three protocols
Once the model can call tools and emit visible reasoning, a single response is no longer a single stream of text deltas. You get message_start, content_block_start (which may declare type: tool_use or type: thinking or type: text), content_block_delta events (with different delta shapes for each block type), content_block_stop, and message_stop. Your handler needs a small state machine, not a flat string append. The SDKs hide this; if you go raw, expect to spend an afternoon on it.
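A minimal sketch of that state machine, keyed by block index, is below. The event names come from the list above; the delta field names (text, thinking, partial_json) follow Anthropic's streaming format as I understand it, so check them against your provider's documentation. The function name and the get_weather tool are hypothetical.

```javascript
// Accumulates content blocks from already-parsed stream events.
function createBlockAccumulator() {
  const blocks = []; // in-progress and finished blocks, indexed by evt.index
  return {
    blocks,
    handle(evt) {
      switch (evt.type) {
        case "content_block_start": {
          const { type, name } = evt.content_block;
          blocks[evt.index] = { type, name, buffer: "", done: false };
          break;
        }
        case "content_block_delta": {
          const block = blocks[evt.index];
          if (!block) break; // delta for a block we never saw start
          const d = evt.delta;
          if (d.type === "text_delta") block.buffer += d.text;
          else if (d.type === "thinking_delta") block.buffer += d.thinking;
          else if (d.type === "input_json_delta") block.buffer += d.partial_json;
          // Other delta types are ignored in this sketch.
          break;
        }
        case "content_block_stop": {
          const block = blocks[evt.index];
          if (!block) break;
          block.done = true;
          // Tool input is only valid JSON once every fragment has arrived.
          if (block.type === "tool_use") block.input = JSON.parse(block.buffer || "{}");
          break;
        }
        default:
          break; // message_start, message_stop, and anything else
      }
    },
  };
}

const acc = createBlockAccumulator();
[
  { type: "content_block_start", index: 0, content_block: { type: "text" } },
  { type: "content_block_delta", index: 0, delta: { type: "text_delta", text: "Checking" } },
  { type: "content_block_stop", index: 0 },
  { type: "content_block_start", index: 1, content_block: { type: "tool_use", name: "get_weather" } },
  { type: "content_block_delta", index: 1, delta: { type: "input_json_delta", partial_json: '{"city":' } },
  { type: "content_block_delta", index: 1, delta: { type: "input_json_delta", partial_json: '"Oslo"}' } },
  { type: "content_block_stop", index: 1 },
].forEach((evt) => acc.handle(evt));

console.log(acc.blocks[1].input); // { city: 'Oslo' }
```

The key design point is that tool input arrives as JSON fragments that are not parseable on their own, so parsing waits for content_block_stop while text and thinking can be rendered as they stream.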
4. Reconnection on dropped connections
Mobile networks drop. WiFi switches APs. The user closes their laptop and reopens it ten minutes later. A naive streaming client shows a half-rendered response and an awkward silence. Either implement resume via Last-Event-ID (server replays from the last sent event), or surface a "regenerate" affordance, or write completed responses to a durable store so the next call can fetch the result. Don't pretend it won't happen.
5. The "frozen" stream that's actually rendering
Symptom: the network tab shows bytes arriving steadily, but the UI hangs. Cause: per-token rendering work blocks the main thread. The connection is fine; the renderer is overloaded. Fix is the buffered-render pattern above.
A small decision flowchart
Three questions, four answers. Run a request through them in order.
The four answers are not exclusive within a system — a single product can do all four. Your chat surface streams. Your title-suggestion endpoint is sync. Your nightly summarizer is batch. Your "deep research" mode is an async job. The choice is per-endpoint, not per-product.
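One possible way to encode the per-endpoint choice in code is sketched below. The three questions here are my own reading of the criteria discussed in this article, not a transcription of the flowchart, and the function and field names are hypothetical.

```javascript
// Ask the questions in order; the first "yes" decides.
//   canWaitHours:        nobody is waiting, and the result is fine tomorrow
//   outlivesOneRequest:  multi-step work likely to exceed HTTP or proxy timeouts
//   readAsItAppears:     a person (or UI) consumes the output incrementally
function chooseTransport({ canWaitHours, outlivesOneRequest, readAsItAppears }) {
  if (canWaitHours) return "batch";
  if (outlivesOneRequest) return "async-job";
  if (readAsItAppears) return "stream";
  return "sync";
}

// The four endpoints of the example product above.
console.log(chooseTransport({ canWaitHours: false, outlivesOneRequest: false, readAsItAppears: true }));  // stream
console.log(chooseTransport({ canWaitHours: false, outlivesOneRequest: false, readAsItAppears: false })); // sync
console.log(chooseTransport({ canWaitHours: true, outlivesOneRequest: false, readAsItAppears: false }));  // batch
console.log(chooseTransport({ canWaitHours: false, outlivesOneRequest: true, readAsItAppears: true }));   // async-job
```

The ordering carries the opinion: cost is checked first and user-facing streaming last, so an endpoint only earns the more complex transport when the cheaper ones have been ruled out.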