Inference internals
Speculative decoding & continuous batching
Two paired tricks underneath every modern LLM serving stack. One cuts cost by packing users onto the same forward pass. One cuts latency by spitting out several tokens at a time. Together they're why a hosted Llama 70B endpoint is roughly 30× cheaper than it would be naively.
Why naive generation is wasteful
A forward pass through Llama 3 70B on an H100 takes roughly 30ms. That pass produces one token. To generate a 200-token answer for one user, the GPU runs 200 forward passes sequentially — six seconds of wall time, ~33 tokens/sec.
Here's the unintuitive part: the GPU does nearly the same amount of work whether it's generating for one user or sixty-four. Decode-phase inference is overwhelmingly memory-bandwidth bound — most of those 30ms are spent streaming 140 GB of model weights from HBM into the compute units. Once the weights are in flight, you could be multiplying them against one activation vector or sixty-four, and the cost is roughly the same.
On a $5/hr H100 at 33 tokens/sec for one user, you're paying $0.042 per 1K output tokens. That's ~7× the price hosted endpoints actually charge. The gap is mostly parallelism the naive loop doesn't capture.
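The cost figure is a two-line calculation, sketched here so you can plug in your own numbers. The function name is made up for illustration; the inputs are the ones quoted in the text ($5/hr, 30ms per pass, one token per pass).

```python
def cost_per_1k_tokens(gpu_usd_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost of 1,000 output tokens at a steady aggregate token rate."""
    usd_per_second = gpu_usd_per_hour / 3600.0
    return usd_per_second / tokens_per_second * 1000.0


# Figures from the text: 30 ms per forward pass, one token per pass, $5/hr GPU.
pass_ms = 30.0
single_user_tps = 1000.0 / pass_ms  # about 33 tokens/sec
naive_cost = cost_per_1k_tokens(5.0, single_user_tps)

print(f"{single_user_tps:.1f} tok/s, ${naive_cost:.3f} per 1K tokens")
# prints: 33.3 tok/s, $0.042 per 1K tokens
```

The same function works for a batched server: pass the aggregate tokens/sec across all users instead of the single-user rate.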
Static batching — the obvious attempt
The first instinct: group N requests at the start, run them together, finish when all are done. Each forward pass now produces N tokens (one per user) for the same ~30ms. Throughput goes up roughly N×, as long as the pass stays memory-bound.
The problem is raggedness. Real requests have wildly different output lengths — a quick yes/no answer might take 5 tokens, a code generation 500. Short requests finish quickly but stay parked in the batch (the GPU keeps emitting padding for them); long requests block the batch from accepting new arrivals that show up mid-flight. Realistic GPU utilization with static batching: 30–50%.
Continuous batching
The fix, from Yu et al.'s Orca paper and now the default in vLLM, TensorRT-LLM, and SGLang: recompute the batch on every iteration. After each forward pass, check which requests just emitted an EOS — eject them. Check which new requests arrived — splice them in. The "batch" is a moving window, not a fixed cohort.
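To see the difference between a fixed cohort and a moving window, here is a toy simulation. It is a sketch under loud assumptions: every request is already queued at time zero, every forward pass costs the same regardless of how many lanes are occupied, and prefill and KV-cache limits are ignored. The function names and the workload are invented for illustration and are not taken from Orca or any serving stack.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Forward passes needed when each cohort runs until its longest request ends."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        cohort = lengths[start:start + batch_size]
        steps += max(cohort)
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Forward passes needed when finished requests are replaced every iteration."""
    waiting = deque(lengths)
    running = []  # tokens still to emit for each in-flight request
    steps = 0
    while waiting or running:
        # Splice in new requests while there are free lanes.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One forward pass: every in-flight request emits one token.
        steps += 1
        # Eject requests that just emitted their final token.
        running = [left - 1 for left in running if left - 1 > 0]
    return steps


def utilization(lengths, batch_size, steps):
    """Fraction of lane-steps that produced a useful token."""
    return sum(lengths) / (batch_size * steps)


# Hypothetical ragged workload: output lengths in tokens, assumed positive.
lengths = [5, 500, 20, 300, 10, 400, 50, 200]
batch_size = 4

for name, fn in [("static", static_batching_steps),
                 ("continuous", continuous_batching_steps)]:
    steps = fn(lengths, batch_size)
    print(f"{name:>10}: {steps} passes, "
          f"{utilization(lengths, batch_size, steps):.0%} utilization")
```

On this toy workload the static schedule needs 900 passes (about 41% of lane-steps useful) and the continuous one needs 500 (about 74%). Real schedulers also have to account for arrival times, prefill, and memory, so treat the numbers as illustrative only.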
Combined with PagedAttention — KV-cache pages allocated in fixed blocks like virtual memory, so requests of arbitrary lengths can coexist without fragmentation — utilization climbs to 70–90%. This was the unlock behind 2023–2024's collapse in hosted-model pricing.
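The virtual-memory analogy can be made concrete with a tiny block allocator. This is a minimal sketch of the idea only: the class, its methods, and the block sizes are hypothetical and do not mirror vLLM's actual data structures. Each request gets a block table mapping its logical token positions to physical blocks, and a new block is claimed only when the previous one fills.

```python
class PagedKVAllocator:
    """Toy KV-cache allocator: fixed-size blocks plus a per-request block table."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.num_tokens = {}    # request id -> tokens stored so far

    def append_token(self, request_id):
        """Reserve cache space for one more token; returns (block id, offset)."""
        n = self.num_tokens.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if n % self.block_size == 0:
            # The current block is full (or this is the first token).
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def free(self, request_id):
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("a")  # 3 tokens -> 2 blocks
alloc.append_token("b")      # 1 token  -> 1 block
alloc.free("a")              # blocks go straight back to the pool
print(len(alloc.free_blocks))
```

Because blocks need not be contiguous, a request that finishes early frees memory that any other request can reuse immediately, which is what lets the continuous-batching scheduler admit new arrivals.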
Watch it run
Pick a workload and a serving mode. Each row is a timeline; each colored bar is a token emitted during a forward pass. Naive can only fill one lane at a time. Continuous batching fills as many lanes as you have users. Speculative decoding makes each lane finish faster.
The math of continuous batching
Let T_B be the wall time of one forward pass at batch size B. In the memory-bound regime, T_B ≈ T_1 — adding users to the batch costs almost nothing because the weights are already streaming anyway. Throughput scales linearly with B.
This holds up until you hit the compute-bound knee, the point where the matrix multiplies themselves (not the weight loads) become the bottleneck. For Llama-70B-class models on H100, that knee sits around B ≈ 64–128, depending on prompt length and KV-cache state. Beyond it, T_B rises roughly linearly with B and throughput plateaus.
| Batch size | T_B (ms) | Throughput (tok/s) | Regime |
|---|---|---|---|
| 1 | 30 | 33 | memory-bound |
| 8 | 31 | 258 | memory-bound |
| 32 | 34 | 941 | memory-bound |
| 64 | 40 | 1,600 | knee — sweet spot |
| 128 | 62 | 2,065 | compute-bound |
| 256 | 118 | 2,170 | diminishing returns |
Picking the right batch size is the central tuning problem for a serving stack. Too low: you leave the GPU idle. Too high: per-user latency suffers without much throughput gain. vLLM's scheduler tries to keep you near the knee dynamically as requests come and go.
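The throughput column in the table above is just B divided by T_B. The sketch below recomputes it and picks a batch size under a latency budget. The helper names and the `max_pass_ms` budget are hypothetical; this is not how vLLM's scheduler is configured, only the arithmetic behind the tradeoff.

```python
# Batch size -> forward-pass time in ms, copied from the table above.
PASS_TIME_MS = {1: 30, 8: 31, 32: 34, 64: 40, 128: 62, 256: 118}


def throughput(batch_size: int, pass_ms: float) -> float:
    """Aggregate tokens/sec: one token per request per forward pass."""
    return batch_size * 1000.0 / pass_ms


def best_batch_size(pass_time_ms: dict, max_pass_ms: float):
    """Highest-throughput batch size whose pass time fits the latency budget.

    max_pass_ms is a hypothetical per-token latency budget.
    """
    eligible = {b: t for b, t in pass_time_ms.items() if t <= max_pass_ms}
    if not eligible:
        return None
    return max(eligible, key=lambda b: throughput(b, eligible[b]))


for b, t in PASS_TIME_MS.items():
    print(f"B={b:>3}  T_B={t:>3} ms  {throughput(b, t):>7.0f} tok/s")

print(best_batch_size(PASS_TIME_MS, max_pass_ms=45))  # prints: 64
```

Going from B=64 to B=128 buys about 29% more throughput for 55% more per-token latency; from 128 to 256 it is about 5% more throughput for 90% more latency. That asymmetry is why the knee is the target.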
Speculative decoding — the trick
Continuous batching parallelizes across users. Speculative decoding parallelizes across positions within one user's stream — which should be impossible, because token N+1 depends on token N. You can't generate them in parallel.
Unless you guess. The recipe:
- Run a small draft model (e.g. a 1B Llama) autoregressively for K steps. It's cheap — maybe 2ms per token.
- Feed all K draft tokens to the big model in one forward pass. The big model emits K sets of logits, one next-token distribution per position.
- For each draft token, check: did the big model's argmax match? Accept matches up to the first disagreement. At the disagreement, take the big model's pick instead.
- Net output: ~2–3 accepted tokens per big-model pass instead of 1.
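The accept-until-first-disagreement step can be sketched in a few lines. This assumes greedy decoding and that you already have the big model's argmax at each drafted position; the function name and the token lists are made up for illustration. Many implementations also emit one extra token when every draft is accepted, which is left out here to match the recipe as written.

```python
def verify_greedy(draft_tokens, target_argmax):
    """Return the tokens to emit after one verification pass.

    draft_tokens:  K tokens proposed by the draft model.
    target_argmax: K tokens; entry i is the big model's greedy pick for the
                   position that draft_tokens[i] occupies.
    """
    emitted = []
    for drafted, wanted in zip(draft_tokens, target_argmax):
        if drafted == wanted:
            emitted.append(drafted)  # match: accept the draft token
        else:
            emitted.append(wanted)   # first disagreement: take the big model's pick
            break                    # everything after this point is discarded
    return emitted


draft = ["the", "cat", "sat", "on"]
target = ["the", "cat", "lay", "on"]
print(verify_greedy(draft, target))  # prints: ['the', 'cat', 'lay']
```

Note that the fourth draft token is thrown away even though it matches: it was conditioned on "sat", which the big model rejected.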
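Quality is identical. With greedy decoding, the argmax check above reproduces exactly the tokens the big model would have produced on its own. With temperature sampling, speculative decoding is still provably exact, but only with the right sampling: the argmax check is replaced by a rejection-sampling acceptance rule that preserves the big model's output distribution. Either way, you can't tell the output came from a sped-up stack.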
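For sampled (non-greedy) decoding, exactness comes from an accept-or-resample rule rather than an argmax comparison. Below is a minimal sketch of one step of that rule as it is usually described in the speculative sampling literature: accept the drafted token with probability min(1, p/q), otherwise draw from the leftover mass max(0, p − q). The function name, the dict-based distributions, and the three-token vocabulary are assumptions for illustration.

```python
import random


def accept_or_resample(token, p_target, q_draft, rng):
    """One step of the accept-or-resample rule.

    token:    a token that was sampled from q_draft.
    p_target: dict token -> probability under the big model.
    q_draft:  dict token -> probability under the draft model.
    Returns (accepted, emitted_token).
    """
    p = p_target.get(token, 0.0)
    q = q_draft[token]
    if rng.random() < min(1.0, p / q):
        return True, token
    # Rejected: sample from the residual distribution max(0, p - q).
    # A rejection implies p < q for this token, so some residual mass exists.
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    tokens = list(residual)
    weights = [residual[t] for t in tokens]
    return False, rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)
p = {"a": 0.6, "b": 0.3, "c": 0.1}  # hypothetical big-model distribution
q = {"a": 0.2, "b": 0.5, "c": 0.3}  # hypothetical draft distribution
drafted = rng.choices(list(q), weights=list(q.values()), k=1)[0]
print(accept_or_resample(drafted, p, q, rng))
```

Run this many times and the emitted tokens follow p, not q: for example, "a" is emitted with probability 0.2 (drafted and always accepted) plus 0.4 (all rejections resample to "a"), which is 0.6.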
Why it actually works
The big model's forward pass is reading those 140 GB of weights anyway. The marginal cost of also computing K extra positions in that same pass is small — you're already paying for the bandwidth, you're just getting more arithmetic done on the same loaded weights. Adding K extra positions barely changes wall time.
Meanwhile, K serial passes of the 1B draft model cost roughly K × 2ms = 8ms for K=4. If even 2 of those 4 draft tokens get accepted, you've generated 2 big-model tokens in 30 + 8 = 38ms instead of 60ms, a 1.6× speedup. If 3 get accepted, it's a 2.4× speedup. That count is conservative, since the big model's own pick at the first disagreement is a usable token too. Near-free latency improvement, same outputs.
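The speedup arithmetic, written out so you can vary the inputs. The defaults are the figures used in the text (30ms big-model pass, 2ms draft step, K=4); the function name is hypothetical, and the model ignores any verification overhead beyond the single big-model pass.

```python
def speedup(accepted_tokens: int, big_ms: float = 30.0,
            draft_ms: float = 2.0, k: int = 4) -> float:
    """Latency speedup of one speculative round over plain decoding."""
    baseline_ms = accepted_tokens * big_ms  # one big-model pass per token
    speculative_ms = big_ms + k * draft_ms  # one big pass plus k draft steps
    return baseline_ms / speculative_ms


for accepted in range(1, 5):
    print(f"{accepted} accepted: {speedup(accepted):.2f}x")
# prints:
# 1 accepted: 0.79x
# 2 accepted: 1.58x
# 3 accepted: 2.37x
# 4 accepted: 3.16x
```

The first row is the warning sign: a round that yields only one token is slower than not speculating at all.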
Walk through one verify step
The little model proposes; the big model judges. Step through and see what gets accepted.
When speculative decoding loses
Acceptance rate is everything. The arithmetic flips if the draft agrees with the big model less than about 30% of the time — you've spent draft-model compute and verification overhead for fewer than 1.3 accepted tokens per pass.
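The break-even point falls out of the same timing figures. This sketch computes how many tokens a speculative round has to yield just to match plain decoding, and how that threshold moves with the draft length K. The function name is made up, and the 30ms and 2ms defaults are the ones assumed throughout this article.

```python
def break_even_tokens(big_ms: float = 30.0, draft_ms: float = 2.0, k: int = 4) -> float:
    """Tokens one speculative round must yield to tie with plain decoding."""
    round_ms = big_ms + k * draft_ms
    return round_ms / big_ms


for k in (2, 4, 8):
    print(f"K={k}: need more than {break_even_tokens(k=k):.2f} tokens per round")
# prints:
# K=2: need more than 1.13 tokens per round
# K=4: need more than 1.27 tokens per round
# K=8: need more than 1.53 tokens per round
```

Longer drafts raise the bar, so a low acceptance rate argues for a smaller K rather than a larger one.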
Cases where drafts tend to be bad:
- Math & symbolic reasoning — the draft model is too small to track multi-step arithmetic, and one wrong digit pollutes everything downstream.
- Code generation — variable names and unfamiliar APIs are high-entropy; the draft model whiffs.
- Niche languages or domains — anywhere the draft and big model's vocab distributions diverge.
Modern variants paper over the worst cases:
- Self-speculative (Medusa, EAGLE) — bolt extra prediction heads onto the same model. No draft model to maintain, and the heads learn from the big model's hidden states so acceptance rates run 60–80%.
- Tree speculation — speculate multiple branches in parallel, take the longest accepted prefix. Burns more compute per pass; pays off when acceptance is mediocre.
- Lookahead decoding — no draft model at all. Mine n-gram patterns from the prompt and the so-far output, use those as speculative guesses. Cheap; great for templated outputs.
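The n-gram idea in the last bullet is simple enough to sketch directly: find the most recent earlier occurrence of the last few tokens and propose whatever followed it. This is a bare-bones illustration with invented names and parameters, not the full algorithm of any particular library; the guesses would still go through the usual big-model verification pass.

```python
def ngram_draft(tokens, ngram_size=2, max_draft=4):
    """Propose draft tokens by copying what followed an earlier matching n-gram.

    tokens: prompt plus output so far, as a list.
    Returns up to max_draft proposed tokens, or [] if no earlier match exists.
    """
    if len(tokens) <= ngram_size:
        return []
    suffix = tokens[-ngram_size:]
    # Scan right to left so the most recent earlier match wins.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == suffix:
            follow_from = start + ngram_size
            return tokens[follow_from:follow_from + max_draft]
    return []


history = "the quick brown fox jumps over the quick".split()
print(ngram_draft(history))  # prints: ['brown', 'fox', 'jumps', 'over']
```

This is why templated or repetitive outputs do well: the history already contains the continuation, so guesses cost a list scan instead of a model call.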
The two together — the 2026 stack
vLLM, TensorRT-LLM, and SGLang all run the same recipe: continuous batching for throughput, paged attention for memory, speculative decoding (usually EAGLE or Medusa) for per-stream latency. Combined: ~3× the throughput of naive batching at the same latency budget, with GPU utilization in the high-80s.
SGLang adds a specifically-2025 trick: structured-output speculation. If you're forcing the model to emit JSON, the schema tokens ({ " : " ,) are deterministic — you can speculate them for free, no draft model needed, and get acceptance rates near 100% on the structural parts. For agent stacks emitting tool calls, this is a quiet 20–30% additional speedup.
When you read benchmark numbers comparing serving stacks — vLLM at 2,400 tok/s/H100 for Llama-3-70B-FP8, TensorRT-LLM at 2,700, SGLang at 2,900 — most of the gap is in how aggressive the continuous-batching scheduler is, and which speculation strategy is enabled. The fundamentals are the same.