inference engineering · deep dive

KV cache & paged attention

Why a 70B model that fits comfortably on an H100 OOMs at sequence 4000 with a batch of eight — and how vLLM gave you a 5× throughput jump by stealing an idea from operating-system memory management.

GPU MEMORY · H100 80GB · LLAMA-70B FP16 mode: naive allocation

model weights · ~140GB tensor-parallel · 35% shown

0 GB KV cache pool · 52 GB usable 52 GB

requests served

requests rejected

cache utilization

wasted memory

incoming queue · 25 requests · varying lengths

weights request blocks fragmented / reserved-but-unused shared prefix blocks free

01Why attention needs a cache

Generation is autoregressive: the model produces one token, appends it to the sequence, and runs the forward pass again to produce the next. At step t, the new token computes its own query, key, and value vectors, then dots its query against the keys of every prior token to figure out what to attend to. Multiply, softmax, weight the values, done.

The keys and values of the prior tokens never change. They were computed once when those tokens were first seen, and they will be re-used for every subsequent token in the same sequence. If you don't save them, every generation step has to re-run the full forward pass on the entire history to recompute them — that's O(n²) work per token and O(n³) over the whole sequence. With the cache, each new token is O(n) attention plus a constant MLP, totalling O(n²) across generation. The difference between cache-on and cache-off is the difference between feasible and infeasible. Every production serving stack uses one — the question is only how you store it.

02Sizing the cache

The per-token footprint is the part most engineers under-estimate. For every layer, every attention head, and every position in the sequence, you store two vectors of length dim_per_head: one key, one value. Across a request:

cache_bytes = 2 × seq_len × batch × n_layers × n_heads × dim_per_head × bytes_per_elem

llama-3 70B · FP16 · seq=4096 · batch=8= 2 × 4096 × 8 × 80 × 8 × 128 × 2

(GQA: 8 KV heads, not 64)≈ 10.7 GB

+ weights at FP16≈ 140 GB

Without grouped-query attention you'd be paying for all 64 heads — roughly 86 GB, which is why GQA is now table stakes. Either way the cache is a non-trivial fraction of memory the moment you batch real requests, and it scales linearly with every variable — double the sequence, double the cache; double the batch, double the cache.

config	n_layers	n_kv_heads	dim	seq	batch	cache/req	full pool
Mistral-7B	32	8	128	4K	1	256 MB	2.0 GB
Llama-3 8B	32	8	128	8K	16	512 MB	8.2 GB
Llama-3 70B	80	8	128	4K	8	1.3 GB	10.7 GB
Llama-3 70B	80	8	128	32K	8	10.7 GB	86 GB
DeepSeek-V3 671B	61	MLA	—	128K	8	~4 GB	~32 GB

That last row is the punchline of architectures like Multi-Head Latent Attention: collapse the KV representation through a low-rank projection and the cache shrinks 6–10×. Compressing the cache has become more strategically important than compressing the weights.

03The naive memory layout (and why it wastes everything)

A serving system needs to know, ahead of time, how much memory each request gets. The simple approach: when a request arrives with a declared max_tokens of (say) 4096, you allocate a contiguous 4096-token KV cache slot for it. The slot lives in one piece of GPU memory so the attention kernel can stride through it cleanly.

The problem is that most requests don't actually reach max-tokens. A user asks a question that the model answers in 200 tokens, and 3896 tokens of pre-allocated cache sit empty until the request finishes and the slot is freed. Worse, the slots are heterogeneous (8192 here, 2048 there), so when one frees up, you can't necessarily fit a new request into it — classic internal fragmentation. Production measurements on naive serving stacks consistently land in the 60–90% wasted-memory range. The visualizer above is calibrated to those numbers: in naive allocation mode you'll typically get six to eight concurrent requests before the pool is "full" while half the bar is hatched-amber unused.

04PagedAttention — the breakthrough

Kwon et al. (2023), the vLLM paper. The reframing: stop treating the KV cache as one big contiguous array per request. Instead, divide the entire cache pool into small fixed-size blocks — vLLM's default is 16 tokens per block — and give each request a per-layer list of pointers to whichever physical blocks it's been assigned. The blocks don't have to be adjacent in memory. The attention kernel chases the pointer list, just like a CPU walking a page table.

The structural consequence is the same as virtual memory in an OS: zero internal fragmentation (you only allocate blocks as the sequence grows), no need to pre-commit to a max length, and a much higher concurrent batch size because you're not reserving headroom that may never be used. The empirical consequence reported in the paper, reproduced widely since, is roughly 2–4× throughput improvement on the same hardware versus FasterTransformer-class baselines — sometimes higher on workloads with very variable output lengths. In the visualizer, paged attention mode fits ~20–24 of the same 25 requests in the same memory footprint.

Logical sequence positions go through a per-request page table (a short list of pointers, one per layer) to land on physically scattered 16-token blocks. The attention kernel walks this indirection on every step. Pointer-chasing overhead is real but small — well under the saved memory bandwidth.

Once the cache is paged, two requests with the same first N tokens can share the same physical blocks instead of materialising a second copy. Each block carries a reference count: when a request lands and its prefix matches an existing block, the counter goes up; when a request finishes, it goes down; the block is freed when the count hits zero. From the attention kernel's perspective, two requests pointing at the same block is indistinguishable from one.

This is mechanical, not semantic — the system isn't "understanding" that the requests share a system prompt, it's just observing that the token IDs of the first ~512 positions are byte-identical and hashing them into the same block. Paged + prefix sharing mode in the visualizer shows what happens when every request shares a 1024-token system prefix: those blocks are highlighted green, allocated once, and the throughput jumps further still. This is the underlying mechanism for provider-side prompt caching at Anthropic, OpenAI, Google, and DeepSeek — the cost discount you see is real economics, not marketing.

06Eviction, swap, and the TTL

What happens when the cache fills? Default policy in vLLM and most successors is LRU eviction: the least-recently-used sequence's blocks get unmapped, either deleted or paged out to CPU DRAM. Resuming an evicted request means copying the blocks back over PCIe — call it 10× slower than a resident cache, but still cheaper than recomputing the prefill from scratch. Modern serving stacks add a tiered hierarchy: HBM (fastest, smallest) → CPU DRAM (10× slower, 10× bigger) → local NVMe (1000× slower, effectively unbounded). DeepSeek's serving infrastructure famously pushes this hard for context caching, and most large providers do something similar.

The 5-minute and 1-hour TTLs you see on provider prompt caches map directly to this. Holding a shared-prefix block in HBM costs real GPU memory; the longer you hold it, the more cache space you've pinned away from other requests. Five minutes is a heuristic compromise: long enough that an active chat session keeps reusing the same blocks, short enough that an abandoned session releases them before they crowd out productive workloads. The 1-hour tier costs more to write because the provider is paying to keep your blocks pinned in the warmer tier for longer.

07Quantizing the KV cache

The K and V tensors are separately quantizable from the model weights. A common configuration on modern serving stacks is FP16 weights with INT8 or FP8 KV — saves roughly 50% of cache memory at small quality cost, because KV values turn out to be more rounding-tolerant than the weights themselves (most attention weight in practice is concentrated in a few large values that survive quantization cleanly). Some stacks go further to INT4 KV with per-channel scaling.

This is the only reason 1M-token contexts work at all. At FP16, a 1M-token KV cache for a 70B model is roughly 320 GB; at INT4 it's 40 GB. The quality cost is real but small enough that the engineering case is unambiguous — the alternative is "you can't have million-token context." When you read about a new model offering a million-token window, KV quantization plus some flavor of attention compression (sliding window, sparse, latent) is doing the heavy lifting underneath.

08What this means in practice

If your self-hosted model OOMs at sequence 4000 with batch 8 when it ran fine at 2000 with batch 8, the math is doing exactly what you'd predict: the KV cache doubled. If you're seeing throughput plateau well before the GPU is compute-bound, it's almost certainly cache-allocation overhead, not the matmul. Upgrading to a paged-attention serving stack (vLLM, SGLang, TensorRT-LLM with paged-KV enabled, TGI) is usually the highest-leverage single change you can make for serving — it's the difference between "this model is impractical to serve" and "this model is fine."

And if you're using a hosted API, prompt caching isn't a marketing gimmick — it's the API surface exposed by the same paged-cache machinery your self-hosted stack would use. Putting your stable system prompt and retrieved documents in front of the cache breakpoint isn't optimization; it's writing requests that match the shape of the underlying allocator.