Language Model

Formula

\[ P(x_1,\dots,x_T)=\prod_{t=1}^T P(x_t\mid x_{\lt t}) \]

Parameters

  • \(x_t\): token at position \(t\)
  • \(x_{\lt t}\): previous tokens (context)

What it means

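A language model assigns a probability to every token sequence. By the chain rule, that joint probability equals the product of next-token probabilities, each conditioned on the tokens that precede it.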
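
As a worked illustration, the factorization for a three-token sequence expands as shown here. The numerical values are invented for the example and do not come from any real model.

\[ P(x_1,x_2,x_3)=P(x_1)\,P(x_2\mid x_1)\,P(x_3\mid x_1,x_2) \]

Assuming \(P(x_1)=0.2\), \(P(x_2\mid x_1)=0.5\), and \(P(x_3\mid x_1,x_2)=0.1\):

\[ P(x_1,x_2,x_3)=0.2\times 0.5\times 0.1=0.01 \]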

What it's used for

  • Text generation, scoring, and autocomplete.
  • Representation learning and downstream NLP tasks.

Key properties

  • Sequence probability factorizes into conditional probabilities.
  • Training often minimizes negative log-likelihood / cross-entropy.
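
A minimal sketch of the negative log-likelihood objective for a single sequence, assuming the per-token probabilities \(P(x_t\mid x_{\lt t})\) are already available. The function name `mean_nll` and the probability values are hypothetical and chosen only for illustration.

```python
import math

def mean_nll(token_probs):
    """Average negative log-likelihood (in nats) over the tokens of one sequence."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical values of P(x_t | x_<t) for a 3-token sequence.
probs = [0.2, 0.5, 0.1]
loss = mean_nll(probs)
```

Averaging over positions is one common convention; summing instead gives the negative log of the full sequence probability.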

Common gotchas

  • Tokenization strongly affects modeling behavior.
  • High likelihood does not guarantee factual correctness.
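
One concrete way tokenization matters: the same text can map to sequences of different length \(T\), so the product in the formula has a different number of factors. The sketch below uses two deliberately simple, hypothetical tokenizers (whitespace words and single characters) rather than any real tokenizer.

```python
text = "the cat sat"

# Two hypothetical tokenizations of the same string.
word_tokens = text.split()   # whitespace-separated words
char_tokens = list(text)     # individual characters, spaces included

# T differs, so the factorization has a different number of conditionals.
T_words = len(word_tokens)
T_chars = len(char_tokens)
```

Because of this, per-token quantities such as average log-probability are not directly comparable across models that use different tokenizers.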

Example

A language model can score whether one sentence is more likely than another under its learned distribution.
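
The comparison can be sketched with a toy model. The example below assumes a bigram model with add-one smoothing, which approximates \(P(x_t\mid x_{\lt t})\) by \(P(x_t\mid x_{t-1})\); the corpus, the `<s>` start marker, and all function names are made up for illustration.

```python
import math
from collections import Counter

# Tiny made-up corpus; "<s>" marks the start of each sentence.
corpus = [
    "<s> the cat sat".split(),
    "<s> the cat ran".split(),
    "<s> the dog sat".split(),
]

vocab = sorted({tok for sent in corpus for tok in sent})
bigram_counts = Counter()
context_counts = Counter()
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def cond_prob(cur, prev):
    """Add-one smoothed estimate of P(cur | prev)."""
    return (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + len(vocab))

def sentence_log_prob(sentence):
    """Sum of log P(x_t | x_{t-1}) over the sentence, starting from "<s>"."""
    tokens = ["<s>"] + sentence.split()
    return sum(math.log(cond_prob(cur, prev)) for prev, cur in zip(tokens, tokens[1:]))

likely = sentence_log_prob("the cat sat")
unlikely = sentence_log_prob("sat cat the")
```

Under this toy model the word order seen in the corpus receives the higher log-probability, which is the sense in which one sentence is "more likely" than another.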

How to Compute (Pseudocode)

Input: token sequence x_1..x_T and a language model
Output: sequence probability (or log-probability)

logp <- 0
for t from 1 to T:
  obtain P(x_t | x_{<t}) from the model (or the model-specific factorization)
  logp <- logp + log P(x_t | x_{<t})
return logp (or exp(logp) if a probability is needed; this can underflow for long sequences)
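
The pseudocode above can be sketched in Python as follows. The callable `cond_prob(token, context)` is a hypothetical stand-in for whatever interface a real model exposes, and the uniform model with an assumed vocabulary of 50,000 tokens exists only to make the example runnable.

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """Sum log P(x_t | x_<t) over all positions.

    `cond_prob(token, context)` is a hypothetical callable standing in for
    the model; it must return a probability in (0, 1].
    """
    logp = 0.0
    for t, token in enumerate(tokens):
        logp += math.log(cond_prob(token, tokens[:t]))
    return logp

# Stand-in model: uniform over an assumed vocabulary of 50,000 tokens.
VOCAB_SIZE = 50_000

def uniform_model(token, context):
    return 1.0 / VOCAB_SIZE

tokens = list(range(100))                  # 100 arbitrary token ids
logp = sequence_log_prob(tokens, uniform_model)
prob = math.exp(logp)                      # underflows to 0.0 in double precision
```

Working in log space is the usual design choice here: the log-probability stays a finite, moderately sized number, while the raw probability of even a 100-token sequence is too small to represent as a double.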

Complexity

  • Time: Depends on the model architecture and the sequence length \(T\). Evaluating all conditionals typically costs the number of positions times the per-step model cost, or a single batched forward pass when all positions are evaluated together, as in training
  • Space: Depends on model size, sequence length, and whether caches/activations are stored
  • Assumptions: Autoregressive factorization shown; encoder-only or masked objectives use different training/evaluation workflows

See also