Inference internals · production techniques
Sampling parameters — what the knobs actually do
Temperature, top-p, top-k, presence/frequency penalty, seed. A short list. Endless folk wisdom. Here's the mechanics, the myths, and the four presets that cover almost everything you'll ever ship.
The setup: a distribution, then a choice
At every generation step, the model emits a vector of logits — one number per token in its vocabulary, often ~100,000–200,000 entries long. Softmax converts those into probabilities that sum to 1.0. The model then has to pick one token to actually emit.
Every sampling parameter you've ever set, apart from seed, does exactly one thing: modify that probability distribution before drawing from it (seed only controls the draw itself). Once you internalize that single picture, the rest of this page is just the bookkeeping of how each knob distorts the bars.
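The two-step picture (a distribution, then a choice) can be sketched in a few lines of plain Python. The four tokens and their logits are made up for illustration, and a real vocabulary is vastly larger; treat this as a sketch of the idea, not of any provider's implementation.

```python
import math
import random

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1.0."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-token vocabulary and logits, for illustration only.
tokens = [" Paris", " the", " a", " located"]
logits = [6.0, 2.5, 2.0, 1.0]

# Step 1: a distribution.
probs = softmax(logits)

# Step 2: a choice. The local RNG seed only makes this sketch repeatable.
rng = random.Random(0)
choice = rng.choices(tokens, weights=probs, k=1)[0]

print([round(p, 3) for p in probs], repr(choice))
```

Everything later on this page slots in between step 1 and step 2.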
The visualizer below shows a hand-crafted 10-token distribution — what a real model might produce when asked to complete "The capital of France is". Move the sliders. Watch the bars stretch, sharpen, fade out. Click Sample 100× at the bottom to see what 100 actual draws from the post-filter distribution look like, and notice how often the top token does not win.
[Interactive visualizer: next-token distribution for the prompt "The capital of France is". Panels: probabilities after filtering and renormalization; empirical draws from the post-filter distribution (Sample 100×).]
Temperature: divide the logits, then softmax
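Of temperature, top-p, and top-k, temperature is the only knob that reshapes the whole distribution rather than truncating it (the penalties covered later nudge individual logits instead). Mechanically, before softmax runs, every logit z_i is divided by T:
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$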
Try the slider at three positions. At T=0.2 the top bar dominates — almost every sample lands on " Paris". At T=1.0, you see the model's native distribution. At T=1.8 the bars flatten dramatically and you start drawing tokens the model didn't really mean — " arguably", " France" (yes, models do this), " probably". That's not creativity. That's noise. Sometimes noise reads as creative; often it reads as a hallucination.
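Here is a minimal sketch of the same three slider positions in code. The logits are invented for illustration, and the function name is mine, not any library's.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide every logit by T, then apply softmax."""
    if temperature <= 0:
        # T=0 is conventionally treated as greedy argmax, not a division by zero.
        raise ValueError("temperature must be > 0; handle T=0 as argmax")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate tokens.
logits = [6.0, 2.5, 2.0, 1.0]

for t in (0.2, 1.0, 1.8):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

The ranking of the tokens is identical at every temperature. Only the gaps between them move, which is why a high T can promote unlikely tokens but can never reorder what the model preferred.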
No value of T adds information the model didn't already encode.
Top-p and top-k: truncation, not reshaping
Both of these throw tokens away. They don't change the ratios between the survivors (renormalization scales every survivor up by the same factor, so their relative probabilities are preserved), they just decide who gets to play.
Top-k is the blunt version. Keep the top k tokens by probability. Discard the rest. Renormalize so the survivors sum to 1.0 again. k=1 is greedy decoding. k=40 was the classic GPT-2 default. It's a fixed knife — same cut every step, regardless of whether the model was confident.
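A minimal sketch of top-k on a plain list of probabilities. The five-token distribution is invented, and real implementations typically work on logits or tensors rather than Python lists.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, renormalize."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices of the k largest probabilities (sorted() is stable on ties).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order[:k])
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

# Hypothetical 5-token distribution.
probs = [0.60, 0.20, 0.10, 0.06, 0.04]

filtered = top_k_filter(probs, 3)
greedy = top_k_filter(probs, 1)  # k=1 leaves only the top token

print([round(p, 3) for p in filtered])
print(greedy)
```

The survivors keep their original ratios (the top token is still three times as likely as the second); they are just rescaled to sum to 1.0.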
Top-p (nucleus sampling) is the adaptive version. Sort tokens by probability descending, take the smallest set whose cumulative probability ≥ p, throw the rest away. When the model is sure, the nucleus is tiny (one or two tokens). When the model is genuinely uncertain, the nucleus expands automatically. This is why top-p almost always beats top-k in practice — it adapts to the model's own confidence.
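The adaptive behavior is easiest to see by running the same cutoff over a confident and an uncertain distribution. This is a sketch with invented numbers; the function name is illustrative.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break  # nucleus is complete
    kept_set = set(kept)
    # Renormalize the survivors so they sum to 1.0 again.
    return [probs[i] / cumulative if i in kept_set else 0.0
            for i in range(len(probs))]

def nucleus_size(filtered):
    return sum(1 for x in filtered if x > 0.0)

# Two hypothetical distributions over the same 5 tokens.
confident = [0.90, 0.05, 0.03, 0.01, 0.01]
uncertain = [0.25, 0.22, 0.20, 0.18, 0.15]

print(nucleus_size(top_p_filter(confident, 0.7)))
print(nucleus_size(top_p_filter(uncertain, 0.7)))
```

With the same p=0.7, the confident distribution keeps a single token while the uncertain one keeps four. A fixed top-k would have made the same cut in both cases.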
Look at the visualizer: move top-p down to 0.7. Watch the dim-out cascade up from the bottom — anything past the cumulative-probability cutoff gets faded. Now move top-k down to 3. Notice that with the default prompt, top-p at 0.7 and top-k at 3 cut roughly the same set, because the head of the distribution is sharp. On a flatter distribution, they'd cut very differently.
Stacking a low temperature on top of a low top-p (say T=0.3, top-p=0.6) doesn't make the model "extra precise." It makes it nearly deterministic, often picking the same token over and over until it gets stuck in a loop. If you want determinism, set T=0 and leave top-p at 1.0. If you want diversity, set T=1 and use top-p to clip the long tail. Don't double up.
Presence and frequency penalties
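These are OpenAI-flavored knobs (Anthropic doesn't expose them, and some providers fold them into a single "repetition penalty"). They subtract from the logits of tokens that have already appeared in the generation so far, before softmax and before any of the truncation above. The presence penalty is a flat, one-time subtraction for any token that has appeared at all; the frequency penalty is multiplied by the number of times the token has appeared.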
They help on long generations where the model is starting to circle the same phrase. They hurt on legal, medical, or technical text where exact repetition is correct — "the patient" really should be "the patient" every time, not "the individual," not "they," not "the subject." If your output starts saying things like "the aforementioned patient" on the fifth paragraph, your frequency penalty is too high.
Rule of thumb: leave both at 0 unless you've actually observed looping behavior in your outputs. They're a fix for a specific symptom, not a default to enable.
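As a rough sketch, the additive form that OpenAI documents for these penalties looks like the code below. The function name, the tiny four-entry vocabulary, and the use of list indices as token IDs are all assumptions made for illustration.

```python
from collections import Counter

def apply_penalties(logits, generated_ids,
                    presence_penalty=0.0, frequency_penalty=0.0):
    """Subtract penalties from the logits of tokens already generated.

    presence_penalty: flat subtraction if the token appeared at least once.
    frequency_penalty: subtraction scaled by how many times it appeared.
    """
    counts = Counter(generated_ids)
    adjusted = list(logits)  # leave the caller's list untouched
    for token_id, count in counts.items():
        adjusted[token_id] -= presence_penalty + frequency_penalty * count
    return adjusted

# Hypothetical logits, indexed by token ID.
logits = [2.0, 1.0, 0.5, 0.0]
# Token 0 has been emitted three times, token 2 once.
generated_ids = [0, 0, 0, 2]

adjusted = apply_penalties(logits, generated_ids,
                           presence_penalty=0.5, frequency_penalty=0.2)
print([round(z, 2) for z in adjusted])
```

Tokens that never appeared are untouched, and the thrice-repeated token takes the largest hit. That is exactly the mechanism that pushes "the patient" toward "the individual" when the frequency penalty is set too high.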
Seed and the reproducibility lie
Most LLM APIs accept a seed parameter. The marketing implication is: same prompt + same seed = same output. The reality is more like: same prompt + same seed + same model snapshot + same datacenter + same batch composition = probably the same output, most of the time.
Why it's flaky:
- Floating-point non-associativity in batched inference. When the GPU batches your request with whatever else is in the queue, the order of additions in matrix multiplies changes slightly. Floating-point addition isn't associative; (a+b)+c ≠ a+(b+c) at the bit level. The result is logits that differ in the 5th or 6th decimal: usually invisible, but right at a softmax boundary it flips the chosen token.
- Mid-fleet model swaps. Providers ship silent point updates to "the same model name" all the time. The published version string ("gpt-4.1", "claude-sonnet-4-6") covers many fingerprints over its lifetime. OpenAI returns a system_fingerprint in the response that changes when the underlying snapshot does. Anthropic doesn't expose one as cleanly. Either way, your "reproducible" run from Tuesday might not reproduce on Thursday.
- Hardware variation. Different GPU generations have different rounding behavior. The cluster you land on isn't part of the seed.
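The floating-point point is easy to reproduce on a laptop. The sketch below shows the non-associativity itself, then a contrived near-tie between two logits (values invented for illustration) where a perturbation in the sixth decimal flips the greedy choice.

```python
# 1. Floating-point addition is not associative.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)  # False
print(left, right)

# 2. A tiny difference is enough to flip a near-tie.
def argmax(xs):
    """Index of the largest value (greedy decoding picks this token)."""
    return max(range(len(xs)), key=lambda i: xs[i])

# Hypothetical logits for two tokens that are almost tied.
logits = [3.141592, 3.141593]
# The same logits after a made-up rounding difference in the 6th decimal.
noisy = [logits[0] + 2e-6, logits[1]]

print(argmax(logits), argmax(noisy))
```

Once one token differs, every later step is conditioned on a different prefix, so the two generations can diverge completely from that point on.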
If you need reproducibility, don't rely on seed. Cache the actual outputs and diff against the cache. The seed parameter is useful for reducing variance during prompt iteration ("did my edit change behavior, or did the model just roll different?"); it is not a guarantee of byte-identical replay.
Reasoning models break the rules
Models with extended thinking — OpenAI's o-series, Claude with thinking budget enabled, DeepSeek-R1, the Qwen reasoner family — change the sampling story in ways that aren't widely documented.
For the reasoning portion (the hidden or visible scratchpad before the final answer), most providers either ignore your sampling parameters entirely or force them to model-trained defaults. The argument is that the chain-of-thought was trained at specific sampling settings, and deviating from them degrades the reasoning quality more than it helps with diversity. OpenAI explicitly rejects requests that set temperature or top_p to non-default values on o-series models. Claude with extended thinking is similarly restrictive: Anthropic's documentation lists temperature and top_k changes as incompatible with thinking and allows top_p only in a narrow range near 1.0.
Practical advice: if you're using a reasoning model, check which sampling parameters the provider actually honors before you set any, and assume the thinking trace will sample however the provider trained it to. Don't try to "lower the temperature" of the reasoning to make it more focused; that's not how the knob works on these models.
The four presets that cover 95% of cases
Almost every production sampling configuration I've ever shipped looks like one of these four. Each one is also available in the visualizer's preset row above; click a preset to see its effect.
| Use case | Temperature | Top-p | Penalties | Why |
|---|---|---|---|---|
| Deterministic extraction, classification, structured output | 0 | 1.0 | 0 | The one place argmax is correct. You want the same input to produce the same output. Combine with strict JSON schema or grammar constraints for production extraction. |
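| Default chat / Q&A | 0.7 | 1.0 | 0 | A common chat default, close to what many products ship. Raw API defaults are often higher (OpenAI and Anthropic both document a default temperature of 1.0), so set it explicitly. Don't second-guess this unless you can measure that something else wins on your eval. |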
| Creative writing / brainstorming | 0.9–1.1 | 0.95 | freq 0.2 | A little above default, with top-p clipping the worst of the long tail. The mild frequency penalty stops the model from settling into one phrase across paragraphs. |
| Code generation | 0.2 | 0.95 | 0 | Just enough variation to avoid identical comments across runs and to break out of mid-token loops. Going lower makes the model more likely to repeat its own bugs verbatim. |
That's the entire decision tree. You almost never need to fine-tune these. The number of production bugs I've seen caused by someone tuning top_p=0.87 on a hunch is roughly equal to the number of production bugs I've seen caused by someone tuning a thermostat to 71.4°F.
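If it helps, the table can live in code as a small lookup. This is a sketch: the key names mirror OpenAI-style request fields, the helper is hypothetical, not every provider accepts every field, and the creative preset picks 1.0 from the 0.9–1.1 range with presence penalty assumed to be 0.

```python
# The four presets from the table above, as plain dictionaries.
PRESETS = {
    "extraction": {"temperature": 0.0, "top_p": 1.0,
                   "frequency_penalty": 0.0, "presence_penalty": 0.0},
    "chat":       {"temperature": 0.7, "top_p": 1.0,
                   "frequency_penalty": 0.0, "presence_penalty": 0.0},
    # Midpoint of the 0.9-1.1 range; presence penalty is an assumption.
    "creative":   {"temperature": 1.0, "top_p": 0.95,
                   "frequency_penalty": 0.2, "presence_penalty": 0.0},
    "code":       {"temperature": 0.2, "top_p": 0.95,
                   "frequency_penalty": 0.0, "presence_penalty": 0.0},
}

def sampling_params(use_case, **overrides):
    """Return a copy of the preset for use_case, with optional overrides."""
    if use_case not in PRESETS:
        raise KeyError(f"unknown use case: {use_case!r}")
    params = dict(PRESETS[use_case])  # copy so the preset stays unchanged
    params.update(overrides)
    return params

print(sampling_params("code"))
print(sampling_params("creative", temperature=0.9))
```

Keeping the presets in one place also makes any deviation show up as an explicit override in code review, rather than as a stray number in a request body.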
Sampling is the bottom layer of the inference stack — the last thing that happens before a token leaves the server. The next chapter (structured outputs) moves one level up: what happens when you constrain the distribution itself, mask it down to a grammar, and stop pretending the model is going to emit valid JSON on vibes alone.