Top-p Sampling (Nucleus)¶
Formula¶
\[
S=\operatorname*{arg\,min}_{A}\left\{\lvert A\rvert:\sum_{i\in A} P(i)\ge p\right\}
\]
Parameters¶
- \(P(i)\): model probability of token \(i\) at the next step
- \(p\in(0,1]\): cumulative probability threshold
- \(S\): smallest set of top tokens whose total probability reaches \(p\)
What it means¶
Top-p sampling keeps the smallest set of highest-probability tokens whose cumulative mass reaches the threshold \(p\), renormalizes their probabilities, and samples from that set.
What it's used for¶
- Adaptive decoding whose candidate pool grows and shrinks with the sharpness of the distribution.
- Common default in LLM generation.
Key properties¶
- Candidate set size varies by step.
- Often more adaptive than fixed top-k.
Common gotchas¶
- Very high \(p\) can include too much tail noise.
- Very low \(p\) can reduce diversity too much.
Example¶
If the distribution is very sharp, top-p with \(p=0.9\) may keep only a handful of tokens.
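This can be checked directly by counting nucleus sizes on a sharp versus a flat distribution (a minimal sketch; the helper name `nucleus_size` is illustrative, not a standard API):

```python
import numpy as np

def nucleus_size(probs, p):
    """Number of tokens in the smallest top-probability set with mass >= p."""
    sorted_probs = np.sort(np.asarray(probs, dtype=np.float64))[::-1]
    # First index where the cumulative mass reaches p, plus one for set size.
    return int(np.searchsorted(np.cumsum(sorted_probs), p)) + 1

# Sharp distribution: one token dominates, so the nucleus stays tiny.
sharp = np.array([0.85, 0.10, 0.03, 0.01, 0.01])
# Flat distribution: mass is spread out, so the nucleus grows.
flat = np.full(5, 0.2)

print(nucleus_size(sharp, 0.9))  # 2 (0.85 + 0.10 = 0.95 >= 0.9)
print(nucleus_size(flat, 0.9))   # 5 (needs all five 0.2 chunks)
```

The same \(p\) thus yields very different candidate-set sizes depending on how concentrated the distribution is.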
How to Compute (Pseudocode)¶
Input: next-token probabilities/logits and nucleus threshold p
Output: sampled token
convert logits to probabilities (softmax)
sort tokens by probability descending
accumulate probability mass until cumulative mass >= p
keep that smallest prefix set, renormalize, and sample one token
return token
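The steps above can be sketched in Python with NumPy, assuming a flat vector of logits; the function name `top_p_sample` is illustrative:

```python
import numpy as np

def top_p_sample(logits, p=0.9, rng=None):
    """Sample one token id via nucleus (top-p) sampling.

    Sketch of the pseudocode above: softmax over logits, sort descending,
    keep the smallest prefix whose cumulative mass reaches p, renormalize,
    and sample from that prefix.
    """
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    # Softmax with max-subtraction for numerical stability.
    z = np.exp(logits - logits.max())
    probs = z / z.sum()
    # Sort token indices by probability, highest first.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    # Smallest prefix whose cumulative mass reaches p (always >= 1 token).
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    nucleus = order[:cutoff]
    # Renormalize within the nucleus and draw one token.
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```

With a very small \(p\) the nucleus collapses to the single most probable token, so the call degenerates to greedy decoding.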
Complexity¶
- Time: Often dominated by sorting/ranking over vocabulary \(V\) per decoding step (commonly \(O(V\log V)\) unless approximated), excluding model forward-pass cost
- Space: \(O(V)\) for probabilities/logits and sorted indices, plus the selected nucleus set
- Assumptions: One decoding step shown; implementations may use partial sorting or fused filters to reduce constants
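One way the partial-sorting optimization mentioned above can look in practice (a hedged sketch, not a specific library's implementation): partition out a guessed number of top candidates in \(O(V)\), sort only those, and double the guess if their mass has not yet reached \(p\). The helper name `nucleus_indices_partial` and the default `k_guess` are illustrative:

```python
import numpy as np

def nucleus_indices_partial(probs, p, k_guess=64):
    """Nucleus token indices using a partial partition instead of a full sort.

    argpartition isolates the k largest probabilities in O(V); only those k
    are fully sorted. If their combined mass falls short of p, the guess was
    too small, so k is doubled and the step repeats.
    """
    probs = np.asarray(probs, dtype=np.float64)
    n = probs.size
    k = min(k_guess, n)
    while True:
        top = np.argpartition(probs, -k)[-k:]      # O(V) partition
        top = top[np.argsort(probs[top])[::-1]]    # sort only k items
        cum = np.cumsum(probs[top])
        if cum[-1] >= p or k == n:
            cutoff = int(np.searchsorted(cum, p)) + 1
            return top[:cutoff]
        k = min(2 * k, n)                          # guess too small: grow it
```

When the distribution is sharp (the common case in practice), the first small guess usually suffices, so the per-step cost stays near \(O(V + k\log k)\) rather than \(O(V\log V)\).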