Mutual Information

Formula

\[ I(X;Y) = \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} \]

Parameters

  • \(X,Y\): random variables
  • \(p(x,y)\): joint distribution
  • \(p(x), p(y)\): marginals

What it means

How much knowing \(X\) reduces uncertainty about \(Y\) (and vice versa). The value is in bits when the logarithm is base 2 and in nats for the natural logarithm.

What it's used for

  • Measuring dependence between variables.
  • Feature selection and representation learning.
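
As a concrete feature-selection use, here is a minimal sketch with scikit-learn's MI estimator; the toy data and feature layout are illustrative assumptions, not part of this page.

# A minimal sketch: score features by estimated MI with a class label.
# Assumes scikit-learn and NumPy are installed; the toy data is illustrative.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)               # binary target
informative = y + 0.1 * rng.normal(size=1000)   # correlated with y
noise = rng.normal(size=1000)                   # independent of y
X = np.column_stack([informative, noise])

scores = mutual_info_classif(X, y, random_state=0)  # one MI estimate per feature (in nats)
print(scores)  # the informative feature scores well above the noise feature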

Key properties

  • \(I(X;Y) \ge 0\), with equality iff \(X\) and \(Y\) are independent
  • Symmetric: \(I(X;Y)=I(Y;X)\)
  • Relation to entropy: \(I(X;Y)=H(X)-H(X\mid Y)\)
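
A one-line derivation of the entropy relation, using \(p(x,y)/\bigl(p(x)\,p(y)\bigr) = p(x\mid y)/p(x)\):

\[ I(X;Y) = \sum_{x,y} p(x,y)\,\log\frac{p(x\mid y)}{p(x)} = -\sum_{x} p(x)\,\log p(x) + \sum_{x,y} p(x,y)\,\log p(x\mid y) = H(X) - H(X\mid Y) \]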

Common gotchas

  • Estimating MI from finite samples is biased and nontrivial.
  • For continuous variables, MI uses densities and integrals.
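
To illustrate the first gotcha, a minimal sketch (assuming NumPy; the sample size and alphabet size are arbitrary choices): a plug-in estimate of MI between two independent variables comes out positive even though the true value is zero.

# A minimal sketch of plug-in MI bias (assumes NumPy; sizes are arbitrary).
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 4, size=200)   # independent discrete variables: true MI = 0
y = rng.integers(0, 4, size=200)

# plug-in estimate from the empirical joint distribution
joint = np.histogram2d(x, y, bins=[4, 4])[0] / len(x)
px = joint.sum(axis=1, keepdims=True)
py = joint.sum(axis=0, keepdims=True)
nz = joint > 0                      # skip empty cells
mi_hat = np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz]))
print(mi_hat)  # small but positive despite independence: upward bias of the plug-in estimator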

Example

If \(Y=X\) and \(X\) is a fair bit, then \(I(X;Y)=1\) bit.
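
Concretely, \(p(x,y)=\tfrac12\) for \((0,0)\) and \((1,1)\) and \(0\) otherwise, with uniform marginals, so

\[ I(X;Y) = 2 \cdot \tfrac{1}{2}\,\log_2\frac{1/2}{\tfrac12\cdot\tfrac12} = \log_2 2 = 1 \text{ bit} \]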

How to Compute (Pseudocode)

Input: joint probabilities p_xy[x,y], log base b
Output: mutual_information

compute marginals p_x[x] and p_y[y] from p_xy

total <- 0
for each pair (x, y):
  if p_xy[x,y] == 0:
    continue
  total <- total + p_xy[x,y] * log_base_b( p_xy[x,y] / (p_x[x] * p_y[y]) )

return total
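
A runnable version of the pseudocode above, as a minimal sketch assuming NumPy; the function name mutual_information and the base-2 default are illustrative choices.

# A direct NumPy translation of the pseudocode above (function name is illustrative).
import numpy as np

def mutual_information(p_xy, base=2.0):
    """Mutual information of a dense discrete joint table p_xy[x, y]."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal p(y)
    nz = p_xy > 0                            # skip zero-probability cells
    ratio = p_xy[nz] / (p_x @ p_y)[nz]
    return float(np.sum(p_xy[nz] * np.log(ratio)) / np.log(base))

# Example from above: Y = X with X a fair bit
print(mutual_information([[0.5, 0.0], [0.0, 0.5]], base=2))  # 1.0 bit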

Complexity

  • Time: \(O(k_x k_y)\) for a dense discrete joint table
  • Space: \(O(k_x + k_y)\) additional space for marginals (excluding the joint table)
  • Assumptions: \(k_x\) and \(k_y\) are the support sizes of \(X\) and \(Y\); continuous MI uses density estimation/integration instead

See also