Next-Token Prediction¶

Formula¶

\[ \hat{x}_{t+1}\sim P(\cdot \mid x_{\le t}) \]

Parameters¶

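\(x_{\le t}\): context tokens (the prompt plus any tokens generated so far) up to and including position \(t\)
\(P(\cdot \mid x_{\le t})\): the model's predicted probability distribution over the vocabulary for the next token
\(\hat{x}_{t+1}\): the next token drawn (or otherwise selected) from that distribution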

What it means¶

The model predicts a probability distribution for the next token given the current context.
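
In most neural language models this distribution is obtained by applying a softmax to a vector of scores (logits), one per vocabulary token. As a sketch, with \(z_v\) denoting the logit for token \(v\) and \(V\) the vocabulary (both symbols are introduced here for illustration and are not part of the formula above):

\[ P(x_{t+1} = v \mid x_{\le t}) = \frac{\exp(z_v)}{\sum_{u \in V} \exp(z_u)} \]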

What it's used for¶

Autoregressive generation.
Scoring text (for example, computing the likelihood or perplexity of a sequence) and completing prompts.
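
To make the scoring use concrete: by the chain rule, the log-probability of a sequence is the sum of the log-probabilities of each token given the tokens before it. The sketch below assumes a hypothetical toy model, `TOY_MODEL`, that conditions only on the previous token (a real language model conditions on the whole context); all probabilities are made up for illustration.

```python
import math

# Hypothetical toy "model": next-token probabilities that depend only on the
# previous token. All numbers are invented for illustration.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.9, "dog": 0.1},
}

def sequence_log_prob(tokens, model=TOY_MODEL, start="<s>"):
    """Sum log P(token | previous token) over the sequence."""
    total = 0.0
    prev = start
    for tok in tokens:
        total += math.log(model[prev][tok])
        prev = tok
    return total

score_the_cat = sequence_log_prob(["the", "cat"])  # log(0.6 * 0.5)
score_a_cat = sequence_log_prob(["a", "cat"])      # log(0.4 * 0.9)
```

Under these made-up numbers, "a cat" scores higher than "the cat" even though "the" is the more probable first token.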

Key properties¶

Repeating next-token prediction, appending each chosen token to the context, generates a full sequence.
Sampling/decoding strategy affects output diversity and quality.
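
The effect of the decoding strategy on diversity can be seen on a small example. The sketch below shows temperature scaling and top-k filtering applied to made-up logits; the helper names `softmax` and `top_k_filter` are illustrative, not taken from any particular library.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize; zero out the rest."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]

logits = [2.0, 1.0, 0.0]           # made-up scores for three tokens
sharp = softmax(logits, 0.5)       # low temperature: more peaked
flat = softmax(logits, 2.0)        # high temperature: closer to uniform
top2 = top_k_filter(softmax(logits), 2)  # least likely token removed
```

Sampling from `sharp` mostly returns the top token (less diversity), while sampling from `flat` spreads choices across tokens (more diversity).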

Common gotchas¶

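Picking the most probable token at each step (greedy decoding) does not always yield the most probable full sequence.
The model's maximum context length limits what it can condition on; tokens that fall outside that window have no influence on the prediction.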
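
The first gotcha can be checked on a tiny example. The sketch below uses an invented two-step model (`FIRST` and `SECOND` are hypothetical tables with made-up probabilities) and compares the greedy choice with an exhaustive search over all two-token sequences.

```python
from itertools import product

# Made-up two-step toy model: probabilities for the first token, then for the
# second token given the first.
FIRST = {"A": 0.6, "B": 0.4}
SECOND = {"A": {"x": 0.55, "y": 0.45}, "B": {"x": 0.9, "y": 0.1}}

def seq_prob(first, second):
    """Probability of the two-token sequence (first, second)."""
    return FIRST[first] * SECOND[first][second]

# Greedy: take the most probable token at each step.
g1 = max(FIRST, key=FIRST.get)
g2 = max(SECOND[g1], key=SECOND[g1].get)
greedy = (g1, g2)

# Exhaustive search over all two-token sequences (only feasible at toy sizes).
best = max(product(FIRST, ["x", "y"]), key=lambda s: seq_prob(*s))
```

Here greedy decoding picks ("A", "x") with probability 0.33, while the most probable sequence overall is ("B", "x") with probability 0.36.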

Example¶

After the context "2 + 2 =", a well-trained model would typically assign high probability to the token "4".

How to Compute (Pseudocode)¶

Input: context tokens x_{<=t}, language model, decoding rule
Output: next-token prediction/distribution

run the model on the context to obtain next-token logits/probabilities
apply a decoding rule (greedy, temperature, top-k, top-p, etc.)
return the next-token distribution or sampled/selected token
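
The steps above could be written in Python roughly as follows. This is a minimal sketch, not a reference implementation: `next_token_step`, `toy_model`, and the convention that a model is a callable returning a `{token: logit}` dict are all assumptions made for illustration, and only greedy and plain temperature sampling are covered.

```python
import math
import random

def next_token_step(context, model, rule="greedy", temperature=1.0, rng=None):
    """One decoding step: return (distribution, chosen token).

    `model` is any callable mapping a token list to a dict {token: logit};
    this interface is a hypothetical stand-in for a real language model.
    """
    logits = model(context)
    tokens = list(logits)
    scaled = [logits[t] / temperature for t in tokens]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = {t: e / total for t, e in zip(tokens, exps)}

    if rule == "greedy":
        choice = max(probs, key=probs.get)  # most probable token
    elif rule == "sample":
        rng = rng or random.Random()
        choice = rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]
    else:
        raise ValueError(f"unknown decoding rule: {rule}")
    return probs, choice

def toy_model(context):
    # Made-up logits that ignore the context, purely for illustration.
    return {"4": 5.0, "5": 1.0, "four": 2.0}

probs, token = next_token_step(["2", "+", "2", "="], toy_model)
```

With these invented logits the greedy rule selects "4"; passing `rule="sample"` draws a token in proportion to `probs` instead.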

Complexity¶

Time: Dominated by the language model forward pass; applying the decoding rule to the vocabulary logits is usually cheaper than model inference
Space: Depends on model activations, any cache kept across steps, and the vocabulary-sized logit vector for the current step
Assumptions: One decoding step shown; autoregressive generation repeats this process for each generated token
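
The repetition mentioned in the assumptions can be sketched as a loop that feeds each chosen token back into the context. The function `generate`, its parameters, and the toy `count_up` model are hypothetical; the `max_context` truncation is one simple way to mimic a fixed context window, not a description of how any specific model handles it.

```python
def generate(prompt, next_token_fn, max_new_tokens, max_context=4, stop="<eos>"):
    """Repeat one-step prediction, feeding each chosen token back in.

    `next_token_fn` is a hypothetical callable: token list -> next token.
    Only the last `max_context` tokens are passed to it, mimicking a
    fixed context window.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        window = tokens[-max_context:]  # older tokens are not seen
        nxt = next_token_fn(window)
        tokens.append(nxt)
        if nxt == stop:
            break
    return tokens

def count_up(window):
    # Toy "model": the next token is the last number plus one; stop after 5.
    last = int(window[-1])
    return "<eos>" if last >= 5 else str(last + 1)

out = generate(["1", "2"], count_up, max_new_tokens=10)
```

Each iteration costs one model call, so generating n tokens this way takes n forward passes.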