Inference internals

Speculative decoding & continuous batching

Two paired tricks underneath every modern LLM serving stack. One cuts cost by packing users onto the same forward pass. One cuts latency by spitting out several tokens at a time. Together they're why a hosted Llama 70B endpoint is roughly 30× cheaper than it would be naively.

Why naive generation is wasteful

A forward pass through Llama 3 70B on an H100 takes roughly 30ms. That pass produces one token. To generate a 200-token answer for one user, the GPU runs 200 forward passes sequentially — six seconds of wall time, ~33 tokens/sec.

Here's the unintuitive part: a forward pass takes nearly the same time whether the GPU is generating for one user or sixty-four. Decode-phase inference is overwhelmingly memory-bandwidth bound: most of those 30ms are spent streaming the model weights from HBM into the compute units. That is 140 GB for 70B parameters at 2 bytes each in FP16, which in practice means sharding across several GPUs or quantizing, since one H100 holds 80 GB. Once the weights are in flight, you could be multiplying them against one activation vector or sixty-four, and the wall-clock cost is roughly the same.

On a $5/hr H100 at 33 tokens/sec for one user, you're paying $0.042 per 1K output tokens. That's ~7× the price hosted endpoints actually charge. The gap is mostly parallelism the naive loop doesn't capture.
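The arithmetic behind those figures is short enough to check directly. This is a minimal sketch that uses only the round numbers quoted above (30ms per pass, $5/hr, a 200-token answer); the variable names are illustrative.

```python
# Back-of-the-envelope cost of naive (batch size 1) decoding.
# All inputs are the round numbers quoted in the text, not measurements.
FORWARD_PASS_S = 0.030     # one decode pass of the 70B model
GPU_USD_PER_HOUR = 5.0     # quoted H100 rental price
ANSWER_TOKENS = 200

tokens_per_second = 1.0 / FORWARD_PASS_S
wall_time_s = ANSWER_TOKENS * FORWARD_PASS_S
usd_per_second = GPU_USD_PER_HOUR / 3600.0
usd_per_1k_tokens = usd_per_second / tokens_per_second * 1000.0

print(f"{tokens_per_second:.1f} tok/s")
print(f"{wall_time_s:.1f} s for {ANSWER_TOKENS} tokens")
print(f"${usd_per_1k_tokens:.4f} per 1K output tokens")
```

Every term is linear in tokens per second, so any technique that multiplies throughput divides the per-token cost by the same factor.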

Static batching — the obvious attempt

The first instinct: group N requests at the start, run them together, finish when all are done. Each forward pass now produces N tokens (one per user) for the same ~30ms, so throughput goes up roughly N×.

The problem is raggedness. Real requests have wildly different output lengths — a quick yes/no answer might take 5 tokens, a code generation 500. Short requests finish quickly but stay parked in the batch (the GPU keeps emitting padding for them); long requests block the batch from accepting new arrivals that show up mid-flight. Realistic GPU utilization with static batching: 30–50%.
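To see where a figure in that range comes from, here is a small sketch that counts how many (slot, step) pairs in a static batch emit a real token. The request lengths are a hypothetical mix chosen for illustration, not measured traffic.

```python
def static_batch_utilization(output_lengths):
    """Fraction of (slot, step) pairs that emit a real token in a static batch.

    The batch runs until its longest request finishes, so every request
    occupies a slot for max(output_lengths) steps.
    """
    steps = max(output_lengths)
    useful = sum(output_lengths)
    return useful / (len(output_lengths) * steps)

# Hypothetical mix: a yes/no answer, two mid-sized replies, one long code generation.
lengths = [5, 60, 150, 500]
print(f"{static_batch_utilization(lengths):.0%}")
```

The longest request sets the denominator for everyone, which is why one long generation drags the whole batch down.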

Continuous batching

The fix, from Yu et al.'s Orca paper and now the default in vLLM, TensorRT-LLM, and SGLang: recompute the batch on every iteration. After each forward pass, check which requests just emitted an EOS — eject them. Check which new requests arrived — splice them in. The "batch" is a moving window, not a fixed cohort.
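A toy simulation makes the difference concrete. The sketch below is not how any particular engine schedules work: it assumes every request is already queued at time zero, that output lengths are known up front, and that a fixed number of slots is available. The function names are made up for the example.

```python
from collections import deque

def static_steps(lengths, slots):
    """Forward passes needed when fixed cohorts of `slots` requests run to completion."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total

def continuous_steps(lengths, slots):
    """Forward passes needed when the batch is rebuilt after every iteration."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Splice waiting requests into any free slots.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        # One forward pass: every in-flight request emits one token.
        steps += 1
        # Eject requests that just emitted their last token.
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [5, 500, 20, 40, 300, 10, 80, 15]  # hypothetical output lengths
print(static_steps(lengths, slots=4), continuous_steps(lengths, slots=4))
```

In the continuous version the short requests free their slots early, so the later arrivals finish in the shadow of the one long request instead of waiting for it.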

Combined with PagedAttention — KV-cache pages allocated in fixed blocks like virtual memory, so requests of arbitrary lengths can coexist without fragmentation — utilization climbs to 70–90%. This was the unlock behind 2023–2024's collapse in hosted-model pricing.
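The virtual-memory analogy can be sketched as a block table per request plus a shared free list. This is a toy allocator for illustration only, not vLLM's implementation; the class name, method names, and the block size of 16 tokens are assumptions.

```python
class PagedKVAllocator:
    """Toy block allocator: each request owns a list of fixed-size KV blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> tokens stored so far

    def append_token(self, request_id):
        """Reserve space for one more token; returns False if memory is exhausted."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:  # last block is full (or none yet)
            if not self.free_blocks:
                return False
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return True

    def release(self, request_id):
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(40):
    alloc.append_token("long")   # 40 tokens need 3 blocks
for _ in range(5):
    alloc.append_token("short")  # 5 tokens need 1 block
alloc.release("short")
print(len(alloc.block_tables["long"]), len(alloc.free_blocks))
```

Because blocks need not be contiguous, a released block can be handed to any other request immediately, and the only waste is the unfilled tail of each request's last block.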

Watch it run

Pick a workload and a serving mode. Each row is a timeline; each colored bar is a token emitted during a forward pass. Naive can only fill one lane at a time. Continuous batching fills as many lanes as you have users. Speculative decoding makes each lane finish faster.

The pedagogical aha: the three modes look identical from above — same number of forward passes, same wall-clock budget. The difference is what each forward pass does. Naive: 1 user × 1 token. Continuous batching: N users × 1 token. +Speculative: N users × ~2.3 tokens. Each technique is a multiplicative factor on the others.

The math of continuous batching

Let T_B be the wall time of one forward pass at batch size B. In the memory-bound regime, T_B ≈ T_1 — adding users to the batch costs almost nothing because the weights are already streaming anyway. Throughput scales linearly with B.

This holds up until you hit the compute-bound knee, the point where the matrix multiplies themselves (not the weight loads) become the bottleneck. For Llama-70B-class models on H100, that knee sits around B ≈ 64–128, depending on prompt length and KV-cache state. Beyond it, T_B rises roughly linearly with B and throughput plateaus.

Batch size	T_B (ms)	Throughput (tok/s)	Regime
1	30	33	memory-bound
8	31	258	memory-bound
32	34	941	memory-bound
64	40	1,600	knee — sweet spot
128	62	2,065	compute-bound
256	118	2,169	diminishing returns

Picking the right batch size is the central tuning problem for a serving stack. Too low: you leave the GPU idle. Too high: per-user latency suffers without much throughput gain. vLLM's scheduler tries to keep you near the knee dynamically as requests come and go.
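The throughput column in the table is just B divided by T_B. A minimal sketch that recomputes it from the (B, T_B) pairs, and shows how much each step up in batch size buys; the names are illustrative.

```python
# (batch size, forward-pass time in ms) pairs copied from the table above.
TABLE = [(1, 30), (8, 31), (32, 34), (64, 40), (128, 62), (256, 118)]

def throughput_tok_s(batch_size, pass_ms):
    """Each pass emits one token per request in the batch."""
    return batch_size / (pass_ms / 1000.0)

prev = None
for batch_size, pass_ms in TABLE:
    tput = throughput_tok_s(batch_size, pass_ms)
    gain = "" if prev is None else f"  x{tput / prev:.2f} vs previous row"
    print(f"B={batch_size:>3}  {pass_ms:>3} ms per token per user  {tput:7.0f} tok/s{gain}")
    prev = tput
```

T_B is also the time each user waits between tokens, so the last two rows trade a large increase in per-user latency for a small increase in total throughput.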

Speculative decoding — the trick

Continuous batching parallelizes across users. Speculative decoding parallelizes across positions within one user's stream — which should be impossible, because token N+1 depends on token N. You can't generate them in parallel.

Unless you guess. The recipe:

Run a small draft model (e.g. a 1B Llama) autoregressively for K steps. It's cheap — maybe 2ms per token.
Feed all K draft tokens to the big model in one forward pass. The big model emits a next-token prediction (a full set of logits) at each of the K positions.
For each draft token, check: did the big model's argmax match? Accept matches up to the first disagreement. At the disagreement, take the big model's pick instead.
Net output: ~2–3 accepted tokens per big-model pass instead of 1.
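The steps above can be sketched with two toy "models" that simply replay fixed sentences. Everything here is hypothetical scaffolding: `draft_next` and `target_next` stand in for argmax decoding, and the per-position loop stands in for what a real engine does in one batched forward pass.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One draft-then-verify round with greedy (argmax) acceptance.

    draft_next and target_next map a token list to that model's next token.
    Returns the tokens appended to the prefix in this round.
    """
    # 1. The draft model proposes k tokens autoregressively.
    drafted = []
    for _ in range(k):
        drafted.append(draft_next(prefix + drafted))

    # 2. The big model scores every drafted position. A real engine does this
    #    in a single forward pass; this toy calls it once per position.
    accepted = []
    for token in drafted:
        target_token = target_next(prefix + accepted)
        if target_token != token:
            # 3. First disagreement: keep the big model's pick and stop.
            accepted.append(target_token)
            break
        accepted.append(token)
    return accepted

def make_model(text):
    # Hypothetical stand-in for a language model: replays a fixed sentence.
    return lambda tokens: text[len(tokens)] if len(tokens) < len(text) else "<eos>"

target = make_model("the quick brown fox jumps over the lazy dog".split())
draft = make_model("the quick brown fox leaps over the lazy dog".split())
print(speculative_step(["the", "quick"], draft, target, k=4))
```

The draft gets "brown" and "fox" right, misses on "leaps", and the round ends with the big model's "jumps". The result is the same sequence the big model would have produced on its own.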

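Quality is identical: speculative decoding is provably exact with the right sampling. With greedy decoding, the argmax check above reproduces exactly what the big model would have produced alone; with temperature sampling, a rejection-sampling acceptance rule takes its place and preserves the big model's output distribution. You can't tell the output came from a sped-up stack.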
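For the sampling case, exactness can be checked numerically. The sketch below uses the standard rejection-sampling rule from the speculative sampling literature (accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized positive part of p − q); the text does not spell this rule out, so treat it as an illustration, and the two distributions as made-up numbers.

```python
def speculative_output_distribution(p, q):
    """Distribution of the token emitted at one position under rejection sampling.

    p: big-model probabilities, q: draft-model probabilities (same vocabulary).
    A drafted token x ~ q is accepted with probability min(1, p[x] / q[x]);
    on rejection a replacement is drawn from normalize(max(0, p - q)).
    """
    # q[x] * min(1, p[x] / q[x]) simplifies to min(p[x], q[x]).
    accept_mass = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1.0 - sum(accept_mass)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    residual_total = sum(residual)
    if residual_total == 0.0:  # p == q: nothing is ever rejected
        return accept_mass
    return [a + reject_prob * r / residual_total
            for a, r in zip(accept_mass, residual)]

p = [0.6, 0.3, 0.1]  # hypothetical big-model distribution
q = [0.2, 0.5, 0.3]  # hypothetical draft distribution
out = speculative_output_distribution(p, q)
print([round(x, 6) for x in out])
```

The emitted distribution matches p regardless of how poor the draft distribution q is; a bad draft only lowers the acceptance rate, not the quality.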

Why it actually works

The big model's forward pass is reading those 140 GB of weights anyway. The marginal cost of also computing K extra positions in that same pass is small — you're already paying for the bandwidth, you're just getting more arithmetic done on the same loaded weights. Adding K extra positions barely changes wall time.

Meanwhile, K serial passes of the 1B draft model cost roughly K × 2ms = 8ms for K=4. If even 2 of those 4 draft tokens get accepted, you've generated 2 big-model tokens in 30 + 8 ≈ 38ms instead of 60ms, a 1.6× speedup. If 3 get accepted, that's 3 tokens in 38ms instead of 90ms, a 2.4× speedup. The count is conservative, since the same pass also yields the big model's own token at the first disagreement. Free latency improvement, same outputs.
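The same accounting as a function, using the text's round numbers as defaults and assuming, as the text does, that the verification pass costs the same 30ms as an ordinary pass. The function name is illustrative.

```python
def speculative_speedup(accepted, k=4, big_ms=30.0, draft_ms=2.0):
    """Speedup from one draft-and-verify round, using the accounting in the text.

    Baseline: `accepted` tokens at one big-model pass each.
    Speculative: k serial draft steps plus one big-model verification pass.
    """
    baseline_ms = accepted * big_ms
    speculative_ms = big_ms + k * draft_ms
    return baseline_ms / speculative_ms

for n in range(1, 5):
    print(f"{n} accepted -> {speculative_speedup(n):.2f}x")
```

With only one token per round the ratio drops below 1, which is the failure mode that a low acceptance rate produces.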

Walk through one verify step

The little model proposes; the big model judges.

Prompt: "The quick brown fox jumps over the lazy ___"

The draft model (1B) proposes 4 tokens autoregressively; the big model (70B) then verifies them in a single forward pass.

When speculative decoding loses

Acceptance rate is everything. The arithmetic flips if the draft agrees with the big model less than about 30% of the time — you've spent draft-model compute and verification overhead for fewer than 1.3 accepted tokens per pass.
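A rough way to locate the break-even point is sketched below. It assumes each draft token is accepted independently with the same probability, which is an idealization, and reuses the 30ms and 2ms timings from earlier; both function names are made up for the example.

```python
def break_even_tokens_per_pass(k=4, big_ms=30.0, draft_ms=2.0):
    """Tokens a speculative round must yield to match one-token-per-pass decoding."""
    return (big_ms + k * draft_ms) / big_ms

def expected_tokens_per_pass(alpha, k=4):
    """Expected tokens per round if each draft token is accepted independently
    with probability alpha (an idealized assumption).

    Counts accepted draft tokens plus the big model's replacement at the first
    disagreement, with no extra token when all k match:
    1 + alpha + ... + alpha**(k - 1).
    """
    return sum(alpha ** i for i in range(k))

print(f"break-even: {break_even_tokens_per_pass():.2f} tokens per pass")
for alpha in (0.1, 0.2, 0.3, 0.5, 0.8):
    print(f"alpha={alpha:.1f} -> {expected_tokens_per_pass(alpha):.2f} tokens per pass")
```

With these timings a round has to yield about 1.27 tokens to break even, and in this idealized model that happens somewhere between 20% and 25% acceptance. Any extra verification overhead pushes the threshold higher, toward the roughly 30% rule of thumb.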

Cases where drafts tend to be bad:

Math & symbolic reasoning: the draft model is too small to track multi-step arithmetic, and one wrong digit pollutes everything downstream.
Code generation: variable names and unfamiliar APIs are high-entropy; the draft model whiffs.
Niche languages or domains: anywhere the draft model's and the big model's next-token distributions diverge.

Modern variants paper over the worst cases:

Self-speculative (Medusa, EAGLE): bolt extra prediction heads onto the same model. No separate draft model to maintain, and the heads learn from the big model's hidden states, so acceptance rates run 60–80%.
Tree speculation: speculate multiple branches in parallel, take the longest accepted prefix. Burns more compute per pass; pays off when acceptance is mediocre.
Lookahead and prompt-lookup decoding: no draft model at all. Mine n-gram patterns from the prompt and the output so far (prompt lookup) or from the model's own parallel decoding attempts (lookahead), and use those as speculative guesses. Cheap; great for templated outputs.
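The n-gram idea in the last item fits in a few lines. This is a minimal sketch of drafting by lookup, not the algorithm of any specific library: it finds the most recent earlier occurrence of the last n tokens and proposes whatever followed it. The function name and the token list are hypothetical.

```python
def ngram_draft(tokens, n=2, k=4):
    """Propose up to k tokens by finding an earlier occurrence of the last n tokens.

    Returns the tokens that followed the most recent earlier match, or [] if none.
    """
    if len(tokens) <= n:
        return []
    key = tokens[-n:]
    # Scan backwards, skipping the trailing n-gram itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == key:
            return tokens[start + n:start + n + k]
    return []

# Hypothetical token stream: a second JSON object is starting to repeat the first.
history = ['{', '"id"', ':', '1', ',', '"tags"', ':', '[', ']', '}', ',', '{', '"id"', ':']
print(ngram_draft(history))
```

The proposal is only a guess (the second id will not be 1), but the big model verifies it like any other draft, so a wrong guess costs one rejected position and a right one is free.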

The two together — the 2026 stack

vLLM, TensorRT-LLM, and SGLang all run the same recipe: continuous batching for throughput, paged attention for memory, speculative decoding (usually EAGLE or Medusa) for per-stream latency. Combined: ~3× the throughput of naive batching at the same latency budget, with GPU utilization in the high-80s.

SGLang adds a specifically-2025 trick: structured-output speculation. If you're forcing the model to emit JSON, the schema tokens ({ " : " ,) are deterministic — you can speculate them for free, no draft model needed, and get acceptance rates near 100% on the structural parts. For agent stacks emitting tool calls, this is a quiet 20–30% additional speedup.

When you read benchmark numbers comparing serving stacks — vLLM at 2,400 tok/s/H100 for Llama-3-70B-FP8, TensorRT-LLM at 2,700, SGLang at 2,900 — most of the gap is in how aggressive the continuous-batching scheduler is, and which speculation strategy is enabled. The fundamentals are the same.

If you're sizing a cost model: the throughput numbers in serving benchmarks already assume this stack. Naive autoregressive numbers are several times more expensive (about 7× in the H100 arithmetic earlier); if you see someone quoting GPU costs that look way too high, check whether they're computing from tokens/sec at B=1 or tokens/sec at the continuous-batching steady state. The difference is the whole reason hosted inference is affordable.

Recap

Decode-phase LLM inference is memory-bandwidth bound: most of the 30ms forward pass is spent streaming the 140 GB of weights from HBM. Once those weights are in flight, multiplying them against one activation or sixty-four takes roughly the same time. Throughput scales nearly linearly with batch size until you hit the compute-bound knee around B≈64–128.

The big model's forward pass is dominated by streaming 140 GB of weights, not by the per-position arithmetic. Verifying K speculative tokens in one pass barely changes wall time, because you're already paying for the bandwidth. Meanwhile the draft model is small enough (~1B params) that K serial draft steps cost a few ms total. If the draft agrees often enough, you get 2–3 accepted tokens per big-model pass for what was previously 1.

Batch size too small: you sit in the memory-bound regime and the GPU is starved of work, so throughput is far below what the hardware can deliver. Too large: you cross the compute-bound knee (B≈64–128 for Llama-70B on H100), per-pass latency rises roughly linearly with B, and throughput plateaus while users experience worse time-per-token. The sweet spot sits right around the knee, and good schedulers like vLLM's adjust it dynamically as the request mix changes.

Speculative decoding only pays off when the draft model agrees with the big model frequently; acceptance rates below ~30% make it net-negative. Math is high-entropy at every step (one wrong digit derails everything), and code generation involves unfamiliar variable names and API tokens that the small draft model doesn't track well. The draft's predictions diverge from the big model's, acceptance collapses, and you've paid for speculation that mostly gets thrown away.

PagedAttention allocates the KV cache in fixed-size blocks like OS virtual memory, so requests of arbitrary and changing lengths can coexist in the same batch without contiguous-allocation fragmentation. Continuous batching constantly ejects finished requests and splices in new ones; without paged storage, you'd waste large fractions of GPU memory on padding or have to compact frequently. Paging is what makes continuous batching practical at GPU utilization in the 80s rather than the 50s.