Mutual Information¶
Formula¶
\[
I(X;Y) = \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}
\]
Parameters¶
- \(X,Y\): random variables
- \(p(x,y)\): joint distribution
- \(p(x), p(y)\): marginals
What it means¶
How much knowing \(X\) reduces uncertainty about \(Y\) (and vice versa).
What it's used for¶
- Measuring dependence between variables.
- Feature selection and representation learning.
Key properties¶
- \(I(X;Y) \ge 0\), equality iff \(X\) and \(Y\) are independent
- Symmetric: \(I(X;Y)=I(Y;X)\)
- Relation to entropy: \(I(X;Y)=H(X)-H(X\mid Y)\)
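These properties can be checked numerically on a small joint table. A minimal sketch (the table `p_xy` and the helper names are illustrative, not from the source):

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def mi(p_xy, p_x, p_y):
    """Mutual information in bits of a dense joint table (skips zero cells)."""
    return sum(p_xy[i][j] * math.log2(p_xy[i][j] / (p_x[i] * p_y[j]))
               for i in range(len(p_x)) for j in range(len(p_y))
               if p_xy[i][j] > 0)

# Hypothetical 2x2 joint distribution p(x, y) (rows: x, columns: y).
p_xy = [[0.4, 0.1],
        [0.1, 0.4]]
p_x = [sum(row) for row in p_xy]            # marginal p(x)
p_y = [sum(col) for col in zip(*p_xy)]      # marginal p(y)

i_xy = mi(p_xy, p_x, p_y)

# H(X|Y) via the chain rule: H(X|Y) = H(X,Y) - H(Y)
h_joint = entropy([p for row in p_xy for p in row])
h_x, h_y = entropy(p_x), entropy(p_y)

assert i_xy >= 0                                       # nonnegativity
assert abs(i_xy - (h_x - (h_joint - h_y))) < 1e-12     # I = H(X) - H(X|Y)
```

The third property is checked through the chain-rule identity \(H(X\mid Y)=H(X,Y)-H(Y)\), which avoids writing a separate conditional-entropy routine.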
Common gotchas¶
- Estimating MI from finite samples is nontrivial: the plug-in estimate is biased (typically upward), and correcting for this bias is itself a research topic.
- For continuous variables, MI is defined with densities and integrals, so it must be estimated (e.g. via binning or density estimation) rather than computed from counts directly.
Example¶
If \(Y=X\) and \(X\) is a fair bit, then \(I(X;Y)=1\) bit.
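This case can be checked directly against the formula: the joint puts mass 0.5 on (0,0) and (1,1), the marginals are uniform, and each term contributes \(0.5\log_2 2\).

```python
import math

# Y = X with X a fair bit: joint mass 0.5 on (0, 0) and (1, 1), zero elsewhere.
p_xy = {(0, 0): 0.5, (1, 1): 0.5}
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.5, 1: 0.5}

mi = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
print(mi)  # 1.0 bit
```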
How to Compute (Pseudocode)¶
Input: joint probabilities p_xy[x,y], log base b
Output: mutual_information
compute marginals p_x[x] and p_y[y] from p_xy
total <- 0
for each pair (x, y):
    if p_xy[x,y] == 0:
        continue
    total <- total + p_xy[x,y] * log_base_b( p_xy[x,y] / (p_x[x] * p_y[y]) )
return total
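The pseudocode translates directly to Python; a sketch assuming the joint is given as a dense 2-D NumPy array (the function name and signature are illustrative):

```python
import numpy as np

def mutual_information(p_xy, base=2.0):
    """Mutual information of a dense discrete joint table p_xy[x, y].

    Computes the marginals, then sums p(x,y) * log_b(p(x,y) / (p(x) p(y)))
    over the nonzero entries of the table.
    """
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)          # marginal p(x)
    p_y = p_xy.sum(axis=0)          # marginal p(y)

    total = 0.0
    for x in range(p_xy.shape[0]):
        for y in range(p_xy.shape[1]):
            p = p_xy[x, y]
            if p == 0.0:            # 0 * log 0 is taken as 0
                continue
            total += p * np.log(p / (p_x[x] * p_y[y])) / np.log(base)
    return total

# Independent variables give zero MI, matching the equality condition above.
print(mutual_information(np.full((2, 2), 0.25)))  # 0.0
```

A product-of-marginals table (every cell 0.25) returns exactly 0, and the fair-bit example above returns 1 bit with the default base 2.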
Complexity¶
- Time: \(O(k_x k_y)\) for a dense discrete joint table
- Space: \(O(k_x + k_y)\) additional space for marginals (excluding the joint table)
- Assumptions: \(k_x\) and \(k_y\) are support sizes of \(X\) and \(Y\); continuous MI uses density estimation/integration instead