Inference · precision · deployment

Quantization — fitting a 70B model on one GPU

A 70B-parameter model at FP16 is 140 GB. A single A100 has 40 GB (80 GB in the larger variant). The reason anyone can self-host these things at all is a set of techniques that shave bits off every weight in exchange for some accuracy. This is what they do, and where they break.

§ 1 The deployment math

Why anyone cares

A language model is, in storage terms, a giant matrix of numbers, one per parameter. Llama-3-70B has roughly 70 billion of them. In FP16 or BF16 (the 16-bit formats models are usually trained and released in), each parameter takes 2 bytes. That is 140 GB just for the weights, before the KV cache, before activations, before anything you actually want to do.

The KV cache adds another 10–30 GB at production batch sizes and context lengths, and activations another few GB on top. Total working set: up to ~180 GB. Almost no single GPU you can rent holds that. A B200 holds 192 GB, but you'd be running with almost no slack, and a B200 by the hour costs more than most teams' coffee budget.
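The KV cache figure can be sanity-checked with a few lines of arithmetic. This is a rough sketch, not a profiler: the layer count, KV-head count, and head dimension below are assumptions for a Llama-3-70B-like shape, and the batch size and context length are hypothetical.

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """Approximate KV cache size in GB (1 GB = 1e9 bytes)."""
    # Factor of 2: one key vector and one value vector per token, per layer.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * seq_len * batch / 1e9

# Assumed shape: 80 layers, 8 KV heads (grouped-query attention), head dim 128.
# Hypothetical load: 8 concurrent sequences of 8192 tokens, 16-bit cache.
print(f"{kv_cache_gb(80, 8, 128, seq_len=8192, batch=8):.1f} GB")
```

Under those assumptions the cache lands in the low twenties of GB, inside the 10–30 GB range quoted above. Doubling either the batch or the context doubles it.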

The escape hatch is to store each weight in fewer bits. Drop from 16 bits to 4 bits and the same 70B model is 35 GB: it fits on one 40 GB A100 or across two 4090s, somewhere you can actually serve it. The table below is the deployment unlock at each step. All numbers are weights-only and 2026-approximate; real deployments need 20–40% extra headroom for the KV cache, so a 35 GB model on a 40 GB card leaves room for only short contexts and small batches.

Format	Bits/weight	Llama-70B size	Smallest single GPU that fits
FP16 / BF16	16	140 GB	1× B200 192GB (or 2× H100 80GB)
FP8	8	70 GB	1× H100 80GB (needs FP8 hardware)
INT8	8	70 GB	1× H100 80GB
INT4 (GPTQ / AWQ)	4	35 GB	1× A100 40GB (about 5 GB left for KV cache)
Sub-4-bit (e.g. Q3)	~3	~28 GB	none at 24GB; needs a 32GB-class card or partial CPU offload

Each halving of the bit-width halves the weight memory and widens the set of GPUs you can deploy on. Quality goes down as the bits do. The question is how much, and on what.
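The size column is just parameters times bits, divided by eight. A minimal sketch, assuming exactly 70 billion parameters and counting weights only:

```python
def weight_gb(n_params, bits_per_weight):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

for name, bits in [("FP16/BF16", 16), ("FP8/INT8", 8), ("INT4", 4), ("3-bit", 3)]:
    print(f"{name:10s} {weight_gb(70e9, bits):6.1f} GB")
```

Real quantized checkpoints come out somewhat larger than this, because per-group scales and other metadata are stored alongside the weights. That is why the roughly 3-bit row in the table is nearer 28 GB than the raw 26 GB this formula gives.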

§ 2 What "bits per weight" actually means

A weight, progressively rounded

Take a single weight in the model — say the value 0.18374921. Every format represents this same number with a different number of bits, trading dynamic range and precision against memory.

Format	Storage	Stored value	Δ / note
FP32	32 bits / 4 bytes	0.18374921	training only
FP16	16 bits / 2 bytes	0.18371582	Δ ≈ 3e-5
BF16	16 bits, wider range	0.18359375	Δ ≈ 2e-4
FP8 (E4M3)	8 bits	0.1875	Δ ≈ 4e-3
INT8	8 bits fixed	0.1843	Δ ≈ 6e-4
INT4	4 bits, 16 levels	0.200	Δ ≈ 2e-2

FP32 is essentially never used for inference — it's the training scratch space. FP16 and BF16 are the "full precision" baseline. BF16 sacrifices fraction bits for a wider exponent range, which is why it dominated post-2020 training. FP8 arrived with H100 and exists in two flavours — E4M3 (more precision, less range, for weights and activations) and E5M2 (more range, less precision, mostly for gradients). INT8 is fixed-point — uniform spacing between values. INT4 has only sixteen distinct values per weight; the precision loss is enormous on any individual weight but, because billions of them average out, the model still mostly works.
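The floating-point roundings listed above can be reproduced by keeping only a fixed number of mantissa bits. The helper below is an illustrative sketch (its name is made up for this example) and it deliberately ignores exponent limits, subnormals, and saturation, so it is only valid for values comfortably inside each format's range. The INT8 and INT4 values depend on a chosen scale and are not covered here.

```python
import math

def round_to_mantissa(x, mantissa_bits):
    """Round x to the nearest value with the given number of stored mantissa bits.

    Simplification: exponent range, subnormals, and overflow are ignored.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)               # x = m * 2**e with 0.5 <= |m| < 1
    steps = 2 ** (mantissa_bits + 1)   # +1 for the implicit leading bit
    return math.ldexp(round(m * steps) / steps, e)

w = 0.18374921
# Stored mantissa bits: FP16 has 10, BF16 has 7, FP8 E4M3 has 3.
for name, bits in [("FP16", 10), ("BF16", 7), ("FP8 E4M3", 3)]:
    q = round_to_mantissa(w, bits)
    print(f"{name:9s} {q:.8f}  error {abs(w - q):.1e}")
```

Each lost mantissa bit doubles the spacing between neighbouring representable values, which is why the error grows by roughly an order of magnitude between rows.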

The fundamental tradeoff: fewer bits per weight means more rounding error per multiplication. Each layer's output is the input to the next. Small errors compound. Whether the compounded error matters depends entirely on the task.

§ 3 PTQ vs QAT

After-the-fact vs trained-in

There are two basic strategies for getting a quantized model: quantize a model that was trained in full precision, or train the model knowing it will be quantized later.

Post-Training Quantization (PTQ)

Take a finished FP16 model. Run a calibration pass over a few hundred prompts to figure out the dynamic range of each layer's weights and activations. Pick scales and zero-points. Round every weight to the nearest representable value at the target precision. Done: minutes to hours of compute. The overwhelming majority of quantized open-weights models on Hugging Face are PTQ. It is cheap, it is composable with any post-training change, and at INT8 or carefully-done INT4 the quality drop is usually small enough that you ship it.
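The "pick scales and zero-points, then round" step can be sketched in a few lines. This is the naive per-tensor version with hypothetical weights, not what any particular library does: it takes the range straight from the weights and skips the calibration pass that activation quantization would need.

```python
def quantize_affine(weights, bits=8):
    """Naive per-tensor affine quantization: returns (codes, scale, zero_point)."""
    qmin, qmax = 0, 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:                      # all weights identical
        scale = 1.0
    zero_point = round(qmin - lo / scale)
    codes = [max(qmin, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point

def dequantize_affine(codes, scale, zero_point):
    """Map integer codes back to approximate real values."""
    return [(c - zero_point) * scale for c in codes]

weights = [-0.42, -0.05, 0.0, 0.18374921, 0.61]   # hypothetical layer slice
for bits in (8, 4):
    codes, scale, zp = quantize_affine(weights, bits)
    restored = dequantize_affine(codes, scale, zp)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"INT{bits}: scale={scale:.5f} zero_point={zp} worst error={worst:.5f}")
```

The worst-case error is about half the scale, and the scale is set by the full range of the tensor, so one large outlier makes every other weight in the tensor coarser.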

Quantization-Aware Training (QAT)

Bake the quantization into the forward pass during training (or fine-tuning). The model sees the quantized values in its activations and learns weights that are robust to the rounding. Better quality at the same bit-width, particularly at INT4 and below. The downside: you need to run real training, often on the original training data, and frontier labs rarely bother. You see QAT in serious low-bit work (sub-4-bit, edge deployment) and in some Meta and Microsoft research releases.

The intuition for why QAT helps: if a model has never been asked to round during training, the layers downstream of any given weight have no reason to compensate for that weight's rounding. Errors accumulate freely. QAT lets the network develop "shock absorbers" — alternative pathways for information that survive aggressive rounding.
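One common way to train through the rounding is a straight-through estimator: the forward pass uses the rounded weight, and the backward pass pretends the rounding was the identity. The toy below is a one-weight sketch of that idea with made-up numbers, not a recipe from any specific QAT framework.

```python
def fake_quant(w, scale):
    """Round w to the nearest multiple of scale (simulated quantization)."""
    return round(w / scale) * scale

def train_step(w, x, y, scale, lr):
    """One SGD step on (fake_quant(w) * x - y)**2 using a straight-through estimator."""
    w_q = fake_quant(w, scale)      # forward pass sees the rounded weight
    err = w_q * x - y
    grad = 2.0 * err * x            # STE: treat d(w_q)/d(w) as 1
    return w - lr * grad            # update the full-precision "latent" weight

w, scale = 1.0, 0.05                # hypothetical starting weight and grid spacing
for _ in range(200):
    w = train_step(w, x=1.0, y=0.18374921, scale=scale, lr=0.1)

print(f"latent w = {w:.4f}, quantized w = {fake_quant(w, scale):.2f}")
```

With a single weight, all training can do is settle on a grid point next to the target. In a real network the same gradients also reach every other weight, and that is where the compensating "shock absorbers" come from.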

Rule of thumb: at INT8, PTQ is essentially free quality-wise. At INT4, expect PTQ to lose 1–3 points on most benchmarks; QAT can usually recover most of that. Below 4 bits, QAT is more or less required.

§ 4 The clever methods (mostly PTQ)

How modern quants actually work

Naive PTQ — pick a scale per tensor, round to nearest — works fine at INT8 and falls apart at INT4. The real-world INT4 quants you see in production all do something cleverer. The vocabulary you'll encounter:

GPTQ

Quantizes one layer at a time, using the inverse Hessian of the layer's reconstruction error (estimated from calibration activations) to decide how to compensate each weight's rounding error by adjusting the weights not yet quantized. The most widely-used INT4 weight-only method. Works on basically every transformer. Supported by vLLM and Hugging Face Transformers.

AWQ (Activation-aware)

Notices that not all weights are equally important: the ones that interact with high-magnitude activations carry more of the model's signal. AWQ uses activation statistics to find roughly the top 1% of weight channels and, rather than storing them at higher precision, scales them up before quantization (with the inverse scale applied on the activation side) so they lose less to rounding. Comparable or slightly better quality than GPTQ at the same bit-width; faster to produce.

SmoothQuant

Scales activations down and weights up by an equal factor before quantization, redistributing the dynamic range. This is what makes INT8 activation quantization viable (not just weights) — a prerequisite for genuine 2× throughput on INT8 tensor cores.
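The rescaling is mathematically a no-op on the layer output, which is the whole trick. The sketch below uses the per-channel factor s_j = max|X_j|^α / max|W_j|^(1−α) with α = 0.5, as I understand the SmoothQuant formulation; the matrices are hypothetical and the helper names are made up for this example.

```python
def smooth(acts, weights, alpha=0.5):
    """acts: [tokens][channels]; weights: [channels][out_features].

    Returns rescaled (acts, weights, factors). Assumes no channel is all zeros.
    """
    n_ch = len(weights)
    factors = []
    for j in range(n_ch):
        a_max = max(abs(row[j]) for row in acts)
        w_max = max(abs(v) for v in weights[j])
        factors.append(a_max ** alpha / w_max ** (1 - alpha))
    acts_s = [[row[j] / factors[j] for j in range(n_ch)] for row in acts]
    weights_s = [[v * factors[j] for v in weights[j]] for j in range(n_ch)]
    return acts_s, weights_s, factors

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Hypothetical layer: activation channel 1 carries large outliers.
acts = [[0.5, 40.0], [-0.3, -60.0]]
weights = [[0.2, -0.1], [0.05, 0.03]]
acts_s, weights_s, factors = smooth(acts, weights)

print("max |activation| before:", max(abs(v) for row in acts for v in row))
print("max |activation| after: ", max(abs(v) for row in acts_s for v in row))
print("outputs match:", all(
    abs(p - q) < 1e-9
    for r1, r2 in zip(matmul(acts, weights), matmul(acts_s, weights_s))
    for p, q in zip(r1, r2)))
```

The outlier channel's range moves from the activations into the weights, where it is easier to absorb, so both tensors end up with ranges an INT8 grid can cover.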

SqueezeLLM

Newer. Treats quantization as a sparse-plus-dense decomposition: store outlier weights at high precision in a small sparse table, quantize the rest aggressively. Best-in-class quality at INT4 in 2025/26 benchmarks, slower to compile.

GGUF (llama.cpp)

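A file format, not a method. Bundles weights with metadata for llama.cpp's CPU/Metal/CUDA backends. Comes in flavours such as Q4_K_M, Q5_K_S, Q8_0: the number is the nominal bits per weight, the letter is the quantization style (K = block-wise k-quants), and the S/M/L suffix marks the small/medium/large variant, where larger variants keep more tensors at a higher-precision quant type. Actual file sizes run somewhat above the nominal bit count because scales are stored alongside the weights.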

NF4 (bitsandbytes)

"Normal-Float 4" — a 4-bit format with values distributed to match the empirical distribution of LLM weights (roughly Gaussian) rather than uniformly. Used heavily for QLoRA fine-tuning. Trades inference speed for fine-tunability.

In practice you pick the method by what your serving stack accepts. vLLM and TGI consume GPTQ and AWQ checkpoints; llama.cpp expects GGUF; bitsandbytes/NF4 is for fine-tuning, not serving. The quality differences between GPTQ and AWQ at INT4 are usually within noise on a real eval — pick whichever your stack supports natively.
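To make the NF4 idea concrete: if weights are roughly Gaussian, placing the sixteen levels at evenly spaced quantiles of a normal distribution puts more levels near zero, where most weights live. The sketch below builds such a grid. It is an illustration of the principle only and is not the actual bitsandbytes NF4 table, which is constructed differently (for one thing, it includes an exact zero).

```python
from statistics import NormalDist

def normal_quantile_levels(bits=4):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    nd = NormalDist()
    raw = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    biggest = max(abs(v) for v in raw)
    return [v / biggest for v in raw]

def nearest_level(x, levels):
    """Quantize x (already normalised to [-1, 1]) to the closest level."""
    return min(levels, key=lambda v: abs(v - x))

levels = normal_quantile_levels(4)
uniform = [-1 + 2 * i / 15 for i in range(16)]    # plain INT4-style grid

print("gap near zero, quantile grid:", round(levels[8] - levels[7], 3))
print("gap near zero, uniform grid: ", round(uniform[8] - uniform[7], 3))
print("gap at the edge, quantile grid:", round(levels[15] - levels[14], 3))
```

The quantile grid is finer than the uniform one around zero and coarser at the edges, which suits a bell-shaped weight distribution.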

§ 5 Visualizer

See it: same prompt, different bits

Pick a model size and a quantization level. The panel shows whether the weights fit on common GPUs and, more usefully, what happens when you ask the model to do basic arithmetic at each precision. The outputs are hand-crafted to illustrate the kind of degradation you see: the failure modes are real, the exact wording is not.

[Interactive panel: Quantization · size · quality. Toggle model size and precision; the memory bar and sample output update in lockstep. The panel reports weights, estimated KV cache, and total memory per GPU against 1× A100 (40 GB), plus perplexity Δ, needle@50K recall, and a recommended GPU, for the prompt "Compute 247 × 184 step by step."]

The degradation on a single arithmetic prompt is illustrative, but the pattern is real: INT4 70B models routinely drop digits on multi-step arithmetic that the FP16 version handles cleanly. The needle-in-haystack recall number is the most cited quantization regression in 2025 papers; long-context retrieval is the first capability to break.

§ 6 Where quality actually breaks

Quantization is not lossless

Marketing copy on a model card will say "negligible quality loss." Vendor-published perplexity drops of 0.2 will be cited. These are true and almost completely useless, because the average perplexity on a generic web corpus is dominated by tasks where rounding errors don't matter. The places quantization hurts are specific:

Long-context recall

The most reliably degraded capability. A 70B FP16 model can usually find a fact placed 50K tokens deep in a 100K-token context with >95% accuracy. The INT4 version of the same model drops to 60–75% on the same test. The mechanism: attention scores rely on small magnitude differences between many keys; INT4 rounding collapses the difference between "the right key" and "the second-best key" more often than FP16.
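The margin argument can be illustrated with a deliberately crude toy. It rounds the attention scores themselves onto a coarse grid, whereas real INT4 rounds the weights that produce those scores, and all the numbers are hypothetical. The point is only that a small margin between the best and second-best key is easy to erase.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def round_to_grid(scores, step):
    """Crude stand-in for quantization noise: snap each score to a grid."""
    return [round(s / step) * step for s in scores]

# Hypothetical attention scores: key 0 is the needle, key 1 is a near miss.
scores = [4.31, 4.27, 1.0, 0.5]

exact = softmax(scores)
coarse = softmax(round_to_grid(scores, 0.25))

print("exact :", [round(p, 3) for p in exact])
print("coarse:", [round(p, 3) for p in coarse])
```

In the exact case the needle wins by a small margin. After snapping to the grid the two top keys tie, so the model no longer has any basis for preferring the right one.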

Arithmetic and code

Numerical reasoning relies on precise activations. A model that can do 4-digit multiplication in FP16 will routinely flub a digit at INT4. Code generation degrades similarly on tasks that involve careful index tracking, bit manipulation, or precise constant values. Don't ship an INT4 code-generation backend without an eval on real PRs.

Multi-language & rare tokens

The token embeddings and unembeddings for rare tokens — minority languages, code identifiers, niche jargon — are often "outlier" weights with unusually large magnitudes. Aggressive quantization clips them. The model gets noticeably worse at any language that wasn't in the calibration set's top 10.

Tool use & structured output

JSON schema adherence, function-call argument accuracy, format-following — all rely on the model being confident about specific output tokens ({, ", particular field names). INT4 introduces just enough noise into the logits that these strict-format tasks fail at 2–5× the rate of FP16.

What stays stable

Chat, summarisation, simple classification, general world-knowledge Q&A, creative writing, basic instruction-following. Most of the load on a customer-support chatbot. This is why INT4 deployments work fine for most "talk to an LLM" use cases and break dramatically on agentic workloads.

§ 7 The "is the loss worth it?" math

Per-workload, not per-model

Concrete comparison, 2026-approximate rental pricing:

Setup	GPU	Hourly	Quality	Throughput
Llama-70B FP16	2× H100 80GB	$6–8 /hr	baseline	~50 tok/s
Llama-70B FP8	1× H100 80GB	$3–4 /hr	−0.5%	~85 tok/s
Llama-70B INT4 (AWQ)	1× A100 40GB	$1.5–2 /hr	−3 to −8%	~110 tok/s
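Hourly price and throughput combine into a cost per million generated tokens. The sketch below uses the midpoint of each hourly range in the table and takes the tok/s column at face value as sustained throughput, which is an assumption: batching can change real numbers by a large factor.

```python
def usd_per_million_tokens(usd_per_hour, tokens_per_second):
    """Cost to generate one million tokens at a sustained rate."""
    return usd_per_hour / (tokens_per_second * 3600) * 1e6

setups = [
    ("FP16, 2x H100", 7.00, 50),    # midpoint of $6-8 /hr
    ("FP8, 1x H100", 3.50, 85),     # midpoint of $3-4 /hr
    ("INT4, 1x A100", 1.75, 110),   # midpoint of $1.5-2 /hr
]
for name, price, tps in setups:
    print(f"{name:14s} ${usd_per_million_tokens(price, tps):6.2f} per 1M tokens")
```

Because the cheaper setup is also the faster one, the per-token gap is wider than the per-hour gap.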

The INT4 setup costs roughly 4× less per hour than the FP16 setup and generates tokens ~2× faster (decoding is memory-bandwidth-bound, so smaller weights mean fewer bytes read per token, and a single GPU has no GPU-to-GPU sync stalls). Whether that's worth it is workload-dependent:

Chatbot, summarisation, classification: the 3–8% quality drop is mostly invisible at the conversational level. Ship INT4. The cost savings compound across millions of requests.
Code generator, fact-checker, anything with downstream consumers: a 5% degradation often shows up as one wrong line in twenty, which silently breaks builds. Pay for FP16 or FP8.
Long-context retrieval (legal discovery, long-doc Q&A): needle-recall regression alone disqualifies INT4. Use FP8 minimum.
Agentic systems with tool use: the schema-adherence regression is multiplicative — every tool call has a chance of malforming. At ten tool calls per task that compounds. Stay at FP8 or FP16.
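The compounding in the agentic case is simple exponentiation. A minimal sketch, assuming each tool call succeeds independently and using hypothetical per-call failure rates: 1% for the baseline and 2% and 5% for a quantized model failing at 2× and 5× that rate.

```python
def task_success(per_call_success, calls):
    """Probability that every call in a task succeeds, assuming independence."""
    return per_call_success ** calls

for label, p in [("baseline, 1% failures", 0.99),
                 ("2x failures", 0.98),
                 ("5x failures", 0.95)]:
    print(f"{label:22s} 10-call task success: {task_success(p, 10):.1%}")
```

A per-call difference that looks small turns into a large gap in completed tasks once ten calls have to succeed in a row.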

§ 8 FP8 — the new middle ground

If your fleet is H100-only, this is probably the answer

FP8 is the 2024-onward addition to the precision menu. Same 8 bits as INT8, so the memory savings are identical. But it's a floating-point format with an exponent and mantissa, which means it handles the dynamic range of LLM activations much better than the uniform spacing of INT8 — and crucially, it does so without requiring any of the SmoothQuant-style range balancing.

Quality at FP8 sits within ~0.5% of FP16 on virtually every benchmark, including the ones where INT4 falls over (long-context recall, structured output, code). The only cost: you need hardware support. H100, H200, B100, B200, MI300X: yes, and Ada-generation cards such as the L40S and RTX 4090 also have FP8 tensor cores. A100, A10G, and anything Ampere or older: no native FP8 throughput. On an A100 you can store the weights in FP8 to save memory, but the math runs after upcasting to FP16/BF16 in software, which kills the speed advantage.

If your serving fleet is uniformly H100 or newer, FP8 is probably the optimal cost/quality point for production in 2026 for most workloads. The reason you still see INT4 widely is (a) cheaper GPUs in the fleet, and (b) inertia from the open-source ecosystem where INT4 (GPTQ/AWQ/GGUF) has had years of tooling maturity.

§ 9 When not to quantize

Cases where you pay full price

The seductive thing about quantization is that it looks free in casual demos and breaks on rare cases that are exactly the ones you most need to be reliable. Default to full precision (FP16 or BF16) when:

The model already fits at FP16 on your target GPU. An 8B model is 16 GB; that's comfortable on a 24 GB 4090. Don't quantize what doesn't need quantizing.
Rare-failure stakes are high — medical, legal, financial, infrastructure code that ships to production. The cost of one INT4-induced hallucination per 10,000 requests can exceed every cent the quantization saved.
Long-context retrieval is on the critical path — the model needs to scan a 200K-token contract for a specific clause. INT4 will miss it. FP8 minimum, FP16 ideal.
You can't easily eval the workload. If you don't have a good eval set that exercises the failure modes (long-context, math, schema-following), you can't tell whether quantization broke something. Stay at FP16 until you can measure the regression.
The model is part of a research artifact — reproducibility matters and quantization is non-deterministic across hardware (the rounding interacts with kernel implementation).

The general rule: quantize for cost, eval for quality, never quantize without measuring. The model card's "negligible degradation" claim is doing a lot of work; on your workload, "negligible" might be the entire reason the system has value.

The boring failure mode: teams ship INT4 on day one because the demo looked fine, hit production at scale, see the agent's tool-call success rate drop from 92% to 71%, spend two weeks blaming the prompt before someone notices the model is at 4 bits. The fix is not a better prompt. The fix is FP8.


All cost, memory, and quality numbers are 2026-approximate and intended as orientation, not procurement. Real degradation depends on the specific model, the quantization method (GPTQ vs AWQ vs GGUF k-quants vs NF4), the calibration data, and your workload. Always eval on your own prompts before shipping the cheaper bits.