Mutual Information

Quantifying the information one variable carries about another.

Mutual information measures how much knowing one variable reduces uncertainty about another:

I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)

Equivalently, $I(X; Y) = D_{\mathrm{KL}}(P(X, Y) \| P(X) P(Y))$ — KL between the joint and the product of marginals. $I(X; Y) = 0$ iff $X$ and $Y$ are independent.

Unlike correlation, mutual information captures non-linear dependencies. Two variables can have $\rho = 0$ but high $I$ — e.g. $Y = X^2$ on a symmetric $X$ .

Applications: feature selection (drop features with low $I$ relative to the target), causal-graph learning, signal-processing decoding, and any setting where you suspect non-linear dependence.