Perplexity¶
Formula¶
\[
\operatorname{PP}(P,Q) = \exp\big(H(P,Q)\big)
\]
Plot¶
Plot: \(\exp(H)\) of the binary entropy \(H = -x\ln x - (1-x)\ln(1-x)\) for \(x \in (0,1)\) (natural-log convention); the curve rises from 1 at the endpoints to \(e^{\ln 2} = 2\) at \(x = 0.5\).
Parameters¶
- \(H(P,Q)\): the cross-entropy of the model \(Q\) relative to the true distribution \(P\), \(H(P,Q) = -\sum_x P(x)\log Q(x)\)
- \(\exp\): the exponential function, matching a natural-log cross-entropy (use \(2^{H}\) if the log base is 2)
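For language-model evaluation, \(H\) is typically estimated empirically as the average negative log-probability the model assigns to \(N\) held-out tokens, which gives the usual per-token form (a standard rewriting of the definition above, assuming natural logs):
\[
\operatorname{PP} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log Q(x_i \mid x_{<i})\right)
\]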
What it means¶
The effective number of equally likely choices the model \(Q\) is, on average, deciding between at each step.
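As a quick illustration of this reading, a minimal sketch (the vocabulary size below is illustrative): a model that is uniform over \(k\) outcomes has cross-entropy \(\log k\) and therefore perplexity exactly \(k\).

```python
import math

def perplexity_nats(cross_entropy_nats: float) -> float:
    """Perplexity from a cross-entropy measured in nats."""
    return math.exp(cross_entropy_nats)

# A model that is uniform over k outcomes assigns probability 1/k to every
# outcome, so its cross-entropy is log(k) nats and its perplexity is exactly k.
k = 50_000                        # illustrative vocabulary size
h_uniform = -math.log(1.0 / k)    # = log(k) nats
print(perplexity_nats(h_uniform))  # ~50000.0: k equally likely choices
```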
What it's used for¶
- Evaluating language models (lower is better).
- Interpreting the average branching factor of a model.
Key properties¶
- Lower perplexity means better predictive performance
- \(\operatorname{PP} = e^{H}\) (nats) or \(2^{H}\) (bits)
Common gotchas¶
- The exponent base must match the log base of the cross-entropy (\(e^{H}\) for nats, \(2^{H}\) for bits; see the quick check after this list), and per-token values also depend on the tokenization.
- Not comparable across datasets with different vocabularies.
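A quick check of the base caveat, as a minimal sketch (the cross-entropy value is illustrative): the same quantity gives the same perplexity in nats or bits, but only when the exponent base matches the log base.

```python
import math

h_nats = 1.5                     # illustrative cross-entropy in nats
h_bits = h_nats / math.log(2)    # the same quantity converted to bits

print(math.exp(h_nats))   # ~4.4817  (e^H, H in nats)
print(2 ** h_bits)        # ~4.4817  (2^H, H in bits) -- identical, as it should be
print(math.exp(h_bits))   # ~8.71    -- wrong: exponent base does not match log base
```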
Example¶
If the average cross-entropy is \(H = 2\) bits, the perplexity is \(2^{2} = 4\).
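The same example in the natural-log convention, just changing the base:
\[
H = 2\ \text{bits} = 2\ln 2 \approx 1.386\ \text{nats}, \qquad e^{2\ln 2} = 4.
\]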
How to Compute (Pseudocode)¶
Input: cross-entropy H and log base convention
Output: perplexity
if H is measured in nats:
    return exp(H)
if H is measured in bits:
    return 2^H
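A minimal Python sketch of the full pipeline, assuming you already have the probabilities the model assigned to each observed token (the values below are made up for illustration); it estimates the cross-entropy in nats and exponentiates with the matching base:

```python
import math
from typing import Sequence

def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity from the probabilities a model assigned to each observed token.

    Estimates cross-entropy as the average negative natural log-probability,
    then exponentiates with the matching base (e for nats).
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    h_nats = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(h_nats)

# Illustrative probabilities the model assigned to the tokens it actually saw.
probs = [0.2, 0.5, 0.1, 0.4]
print(perplexity(probs))  # ~3.98 -- lower is better
```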
Complexity¶
- Time: \(O(1)\) once cross-entropy is known
- Space: \(O(1)\)
- Assumptions: The cost of computing cross-entropy itself is excluded; for a discrete support of size \(k\), that step is typically \(O(k)\)