Top-k Sampling

Formula

\[ P'(i)= \begin{cases} \frac{P(i)}{\sum_{j\in K}P(j)}, & i\in K\\ 0, & i\notin K \end{cases} \]
\[ K=\text{indices of the top }k\text{ probabilities} \]

Parameters

  • \(P(i)\): probability of token \(i\) under the model's next-token distribution
  • \(k\): number of highest-probability tokens kept

What it means

Top-k sampling restricts sampling to the \(k\) most likely tokens, then renormalizes.

What it's used for

  • Preventing low-probability junk tokens during generation.
  • Balancing diversity and coherence.

Key properties

  • Fixed-size candidate set each step.
  • Often combined with temperature.

Common gotchas

  • A small \(k\) can make output repetitive.
  • A large \(k\) may reintroduce the low-quality tail of the distribution.

Example

With \(k=50\), sampling only occurs among the 50 most likely next tokens.
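For a smaller worked instance (probabilities chosen purely for illustration): with \(k=2\) and a five-token distribution \(P=(0.5,\,0.3,\,0.1,\,0.07,\,0.03)\), only the two most likely tokens are kept and renormalized:

\[ P'(1)=\frac{0.5}{0.5+0.3}=0.625,\qquad P'(2)=\frac{0.3}{0.5+0.3}=0.375,\qquad P'(i)=0\ \text{for } i\notin K \]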

How to Compute (Pseudocode)

Input: next-token probabilities/logits and top-k parameter k
Output: sampled token

compute next-token probabilities (or rank the logits directly)
select the top-k tokens by probability/logit
renormalize probabilities over those k tokens only
sample one token from the truncated distribution
return token
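
The following is a minimal Python sketch of the steps above using NumPy. The function and parameter names (top_k_sample, logits, k, temperature) are illustrative rather than taken from any particular library, and the optional temperature step reflects the common pairing noted under Key properties.

import numpy as np

def top_k_sample(logits, k, temperature=1.0, rng=None):
    """Sample one token id from the top-k truncated next-token distribution.

    logits: 1-D array of unnormalized next-token scores (length = vocab size V)
    k: number of highest-scoring tokens to keep
    temperature: optional softmax temperature applied before truncation
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature

    # Select the indices of the k largest logits (O(V) average with argpartition).
    top_idx = np.argpartition(logits, -k)[-k:]

    # Softmax over the kept logits only; this is the renormalization step.
    kept = logits[top_idx]
    kept = kept - kept.max()              # for numerical stability
    probs = np.exp(kept)
    probs /= probs.sum()

    # Sample one token from the truncated, renormalized distribution.
    return int(rng.choice(top_idx, p=probs))

# Example usage with toy logits over a 6-token vocabulary.
token_id = top_k_sample(logits=[2.0, 1.0, 0.5, -1.0, -3.0, 0.2], k=3)
print(token_id)

Applying the softmax only to the kept logits is mathematically equivalent to computing the full distribution and renormalizing over the top-k tokens, and it avoids exponentiating the whole vocabulary.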

Complexity

  • Time: Depends on the top-k selection method; commonly \(O(V\log k)\) (heap-based partial selection) or \(O(V)\) (partition-based selection) per decoding step, plus renormalization over the \(k\) kept tokens (excluding model forward-pass cost); see the selection sketch after this list
  • Space: \(O(V)\) for logits/probabilities plus \(O(k)\) for the retained candidate set
  • Assumptions: One decoding step shown; practical implementations may fuse temperature/top-k/top-p filtering
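
The two selection strategies mentioned above can be contrasted with a small illustrative sketch using Python's standard library and NumPy; the variable names and the toy 50k-entry "vocabulary" are hypothetical.

import heapq
import numpy as np

logits = np.random.randn(50_000)  # toy vocabulary of 50k scores
k = 50

# O(V log k): keep a size-k heap of the largest logits (heapq.nlargest).
heap_top = heapq.nlargest(k, range(len(logits)), key=lambda i: logits[i])

# O(V) average: partial selection via partitioning (order within the k is arbitrary).
part_top = np.argpartition(logits, -k)[-k:]

# Both return the same index set (assuming no exact ties among logits).
assert set(heap_top) == set(part_top)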

See also