Language Model¶
Formula¶
\[
P(x_1,\dots,x_T)=\prod_{t=1}^T P(x_t\mid x_{\lt t})
\]
Parameters¶
- \(x_t\): token at position \(t\)
- \(x_{\lt t}\): previous tokens (context)
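With \(T=3\), for example, the product expands to
\[
P(x_1,x_2,x_3)=P(x_1)\,P(x_2\mid x_1)\,P(x_3\mid x_1,x_2)
\]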
What it means¶
A language model assigns probabilities to token sequences.
What it's used for¶
- Text generation, scoring, and autocomplete.
- Representation learning and downstream NLP tasks.
Key properties¶
- Sequence probability factorizes into conditional probabilities.
- Training often minimizes negative log-likelihood / cross-entropy, as sketched below.
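A minimal sketch of that objective, assuming the per-step conditional probabilities are already available (the values below are invented, not taken from a real model):

import math

# Hypothetical per-step conditionals P(x_t | x_{<t}) for a 4-token sequence.
cond_probs = [0.20, 0.05, 0.60, 0.10]

# Negative log-likelihood of the whole sequence, and the per-token cross-entropy.
nll = -sum(math.log(p) for p in cond_probs)
cross_entropy = nll / len(cond_probs)

print(f"sequence NLL: {nll:.3f} nats")
print(f"per-token cross-entropy: {cross_entropy:.3f} nats")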
Common gotchas¶
- Tokenization strongly affects modeling behavior.
- High likelihood does not guarantee factual correctness.
Example¶
A language model can score two candidate sentences and report which one is more likely under its learned distribution.
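A minimal sketch of that comparison, assuming per-token probabilities for each sentence have already been obtained from some model (the numbers here are invented); the sentence with the higher total log-probability is the one the model considers more likely:

import math

# Hypothetical per-token probabilities P(x_t | x_{<t}) a model assigns to two sentences.
sentence_a = [0.30, 0.40, 0.25]   # e.g. "the cat sat"
sentence_b = [0.30, 0.40, 0.001]  # e.g. "the cat flew"

logp_a = sum(math.log(p) for p in sentence_a)
logp_b = sum(math.log(p) for p in sentence_b)

print("sentence A is more likely" if logp_a > logp_b else "sentence B is more likely")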
How to Compute (Pseudocode)¶
Input: token sequence x_1..x_T and a language model
Output: sequence probability (or log-probability)
logp <- 0
for t from 1 to T:
    obtain P(x_t | x_{<t}) from the model (or the model-specific factorization)
    logp <- logp + log P(x_t | x_{<t})
return exp(logp) or logp
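A runnable Python sketch of the same procedure, assuming the language model can be queried for \(P(x_t \mid x_{\lt t})\); a toy bigram model estimated from a three-sentence corpus stands in for a real model, and `<s>` / `</s>` are assumed start/end markers.

import math
from collections import Counter, defaultdict

# Toy corpus and bigram counts -- a stand-in for a real trained language model.
corpus = [["<s>", "the", "cat", "sat", "</s>"],
          ["<s>", "the", "dog", "sat", "</s>"],
          ["<s>", "the", "cat", "ran", "</s>"]]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    for prev, cur in zip(sentence, sentence[1:]):
        bigram_counts[prev][cur] += 1

def cond_prob(token, context):
    """P(x_t | x_{<t}) under the toy bigram model (only the last context token matters)."""
    prev = context[-1]
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][token] / total if total else 0.0

def sequence_logprob(tokens):
    """Sum of log P(x_t | x_{<t}), following the pseudocode above.
    The leading <s> is treated as fixed context, so scoring starts at position 1."""
    logp = 0.0
    for t in range(1, len(tokens)):
        p = cond_prob(tokens[t], tokens[:t])
        if p == 0.0:
            return float("-inf")  # unseen bigram: zero probability under this toy model
        logp += math.log(p)
    return logp

print("log P(<s> the cat sat </s>) =", sequence_logprob(["<s>", "the", "cat", "sat", "</s>"]))
print("log P(<s> the dog ran </s>) =", sequence_logprob(["<s>", "the", "dog", "ran", "</s>"]))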
Complexity¶
- Time: Depends on the language model architecture and sequence length \(T\); evaluating all conditionals is typically linear in the number of positions times the per-step model cost (during training this is usually done in one batched forward pass)
- Space: Depends on model size, sequence length, and whether caches/activations are stored
- Assumptions: Autoregressive factorization shown; encoder-only or masked objectives use different training/evaluation workflows