Next-Token Prediction

Formula

\[ \hat{x}_{t+1}\sim P(\cdot \mid x_{\le t}) \]

Parameters

  • \(x_{\le t}\): the prompt/context tokens up to and including position \(t\)
  • \(P(\cdot \mid x_{\le t})\): the model's predicted probability distribution over the next token \(x_{t+1}\)

What it means

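Given the current context, the model predicts a probability distribution over its vocabulary for the next token. The predicted token \(\hat{x}_{t+1}\) is then sampled or selected from that distribution.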

What it's used for

  • Autoregressive generation.
  • Scoring candidate sequences (by combining per-token probabilities) and completing prompts.
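Scoring can be sketched as summing the log-probability of each token given the tokens before it. The block below is a minimal illustration, not a real model: `BIGRAM`, `next_token_probs`, and `sequence_log_prob` are hypothetical names, and the probabilities are made up for the example.

```python
import math

# Hypothetical toy model: a fixed lookup table standing in for a real language model.
# Its "prediction" depends only on the last token of the context.
BIGRAM = {
    "<s>": {"I": 1.0},
    "I": {"like": 0.7, "likes": 0.3},
    "like": {"tea": 1.0},
    "likes": {"tea": 1.0},
}

def next_token_probs(context):
    """Return the distribution P(. | context) from the toy table."""
    return BIGRAM[context[-1]]

def sequence_log_prob(tokens, start="<s>"):
    """Sum log P(x_t | x_<t) over the sequence, one next-token prediction per token."""
    context = [start]
    total = 0.0
    for tok in tokens:
        total += math.log(next_token_probs(context)[tok])
        context.append(tok)
    return total

score_a = sequence_log_prob(["I", "like", "tea"])
score_b = sequence_log_prob(["I", "likes", "tea"])
```

Under this toy table, `score_a` is higher than `score_b`, so the first candidate would be ranked above the second.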

Key properties

  • Repeating next-token prediction, appending each chosen token to the context, generates a full sequence.
  • Sampling/decoding strategy affects output diversity and quality.
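Both properties can be seen in a short generation loop. This is a sketch under stated assumptions: `TABLE` is a made-up toy model that conditions only on the last token, and `generate` is a hypothetical helper, not any library's API.

```python
import random

# Hypothetical toy model: next-token probabilities keyed by the last context token.
TABLE = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "dog": {"sat": 0.6, "</s>": 0.4},
    "sat": {"</s>": 1.0},
}

def next_token_probs(context):
    return TABLE[context[-1]]

def generate(max_new_tokens=10, greedy=True, rng=None):
    """Repeat next-token prediction until the end token or the length limit."""
    rng = rng or random.Random(0)
    context = ["<s>"]
    for _ in range(max_new_tokens):
        probs = next_token_probs(context)
        if greedy:
            # Always pick the most probable token: deterministic output.
            token = max(probs, key=probs.get)
        else:
            # Sample in proportion to probability: more diverse output.
            tokens, weights = zip(*probs.items())
            token = rng.choices(tokens, weights=weights, k=1)[0]
        if token == "</s>":
            break
        context.append(token)  # the chosen token becomes part of the next context
    return context[1:]

greedy_output = generate(greedy=True)
sampled_output = generate(greedy=False, rng=random.Random(1))
```

With `greedy=True` the loop always returns the same sequence for this table, while sampling can return different sequences for different seeds.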

Common gotchas

  • Choosing the most probable token at every step (greedy decoding) does not always yield the most probable full sequence.
  • Context length limits what the model can condition on.
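The first gotcha can be made concrete with a two-step toy example. The probabilities in `STEP1` and `STEP2` are invented for illustration, and both function names are hypothetical; the point is only that the locally best first token can lead to a less probable sequence overall.

```python
# Hypothetical two-step model: P(first token) and P(second token | first token).
STEP1 = {"the": 0.6, "a": 0.4}
STEP2 = {
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.9, "dog": 0.1},
}

def greedy_sequence():
    """Pick the most probable token at each step."""
    first = max(STEP1, key=STEP1.get)
    second = max(STEP2[first], key=STEP2[first].get)
    return (first, second), STEP1[first] * STEP2[first][second]

def best_sequence():
    """Exhaustively score every two-token sequence and keep the most probable."""
    candidates = {
        (a, b): STEP1[a] * STEP2[a][b]
        for a in STEP1
        for b in STEP2[a]
    }
    best = max(candidates, key=candidates.get)
    return best, candidates[best]

greedy_seq, greedy_prob = greedy_sequence()
best_seq, best_prob = best_sequence()
```

Here greedy decoding commits to "the" (0.6) and ends with a sequence probability of about 0.30, while the sequence starting with "a" reaches about 0.36. Exhaustive search is only feasible for toy cases like this one.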

Example

Given the context "2 + 2 =", a well-trained model assigns high probability to the token "4".
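As a rough numerical illustration of this example, the sketch below converts raw scores (logits) into probabilities with a softmax. The four-token vocabulary and the logit values are made up; a real model would produce logits over its full vocabulary.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and invented logits for the context "2 + 2 =".
vocab = ["3", "4", "5", "four"]
logits = [1.0, 5.0, 0.5, 2.0]

probs = softmax(logits)
next_token_distribution = dict(zip(vocab, probs))
most_likely = max(next_token_distribution, key=next_token_distribution.get)
```

With these invented logits, `most_likely` is "4" and it receives most of the probability mass.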

How to Compute (Pseudocode)

Input: context tokens x_{<=t}, language model, decoding rule
Output: next-token prediction/distribution

run the model on the context to obtain next-token logits/probabilities
apply a decoding rule (greedy, temperature, top-k, top-p, etc.)
return the next-token distribution or sampled/selected token
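The pseudocode above could be sketched in Python roughly as follows. This is one possible reading, not a reference implementation: it assumes the model has already produced a list of logits, and `decode_step` with its `rule`, `temperature`, `top_k`, and `top_p` parameters is a hypothetical interface chosen for illustration.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def decode_step(logits, rule="greedy", temperature=1.0, top_k=None, top_p=None, rng=None):
    """Return (token_id, probs) for a single decoding step."""
    rng = rng or random.Random(0)
    probs = softmax(logits, temperature)
    if rule == "greedy":
        return max(range(len(probs)), key=probs.__getitem__), probs

    # Rank token ids from most to least probable.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most probable tokens
    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept
    weights = [probs[i] for i in ranked]  # random.choices treats these as relative weights
    return rng.choices(ranked, weights=weights, k=1)[0], probs

# Invented logits for a 4-token vocabulary.
example_logits = [1.0, 5.0, 0.5, 2.0]
greedy_id, step_probs = decode_step(example_logits)
top_k_id, _ = decode_step(example_logits, rule="sample", top_k=1)
top_p_id, _ = decode_step(example_logits, rule="sample", top_p=0.5)
```

Returning the full distribution alongside the chosen token id keeps both outputs named in the pseudocode available to the caller.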

Complexity

  • Time: Dominated by the language model's forward pass; decoding-rule postprocessing over the vocabulary is usually cheaper than model inference
  • Space: Depends on model activations/cache and vocabulary logits for the current step
  • Assumptions: One decoding step shown; autoregressive generation repeats this process for each generated token

See also