Permutation Feature Importance

Formula

\[ I_j = M(f, X, y) - M\big(f, \pi_j(X), y\big) \]

Parameters

  • \(M\): metric (higher-is-better shown here)
  • \(\pi_j(X)\): dataset with feature \(j\) randomly permuted

What it means

Measures importance by how much performance drops when a feature’s information is destroyed.

What it's used for

  • Model-agnostic feature ranking on validation/test sets.
  • Sanity-checking impurity-based tree importances.

Key properties

  • Simple and widely applicable.
  • Works best with a meaningful held-out metric.

Common gotchas

  • Correlated features can hide each other’s importance: when one is permuted, the model can recover much of the same signal from the others, so the measured drop understates the feature’s true contribution.
  • Permutation on training data can overstate importance due to overfitting.
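
The correlation gotcha can be demonstrated with a toy setup (all names and data below are made up for illustration): two nearly duplicate features share one signal, so permuting either one alone costs far less accuracy than permuting both together.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)        # x2 is almost a copy of x1
y = (x1 + x2 > 0).astype(int)

predict = lambda a, b: (a + b > 0).astype(int)   # toy "model" that is perfect by construction
acc = lambda y_pred: float(np.mean(y_pred == y))

baseline = acc(predict(x1, x2))                                        # 1.0
drop_x1 = baseline - acc(predict(rng.permutation(x1), x2))             # modest: x2 covers for x1
drop_both = baseline - acc(predict(rng.permutation(x1),
                                   rng.permutation(x2)))               # large: signal fully destroyed
```

Permuting x1 alone drops accuracy only to roughly 0.75 because x2 still carries most of the signal; permuting both drops it to chance, so each feature’s individual importance understates the pair’s joint contribution.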

Example

If permuting "account_age" sharply drops AUC, the feature is important for the model’s ranking.
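
A sketch of that example on synthetic data (the feature name and data-generating process are hypothetical); AUC is computed via the rank-based Mann-Whitney formula so the snippet needs only NumPy:

```python
import numpy as np

def auc(y_true, scores):
    # Mann-Whitney AUC: probability a random positive outranks a random negative.
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(1)
n = 1000
account_age = rng.exponential(2.0, size=n)                     # hypothetical feature
p = 1 / (1 + np.exp(-(account_age - 2)))                       # label probability rises with age
y = (rng.random(n) < p).astype(int)

base_auc = auc(y, account_age)                  # model score driven by the intact feature
perm_auc = auc(y, rng.permutation(account_age)) # same scores, feature info destroyed
```

The permuted AUC collapses toward 0.5, and the gap `base_auc - perm_auc` is the feature’s importance under the AUC metric.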

How to Compute (Pseudocode)

Input: trained model f, evaluation set (X, y), metric M, repeats R
Output: feature importances I_j

baseline <- M(f, X, y)
for each feature j:
  drops <- empty list
  for r from 1 to R:
    X_perm <- copy of X with column j randomly permuted
    score_perm <- M(f, X_perm, y)
    append (baseline - score_perm) to drops   # for higher-is-better metrics
  I_j <- average(drops)

return {I_j}
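
The pseudocode translates directly into a short NumPy implementation; the model and metric below are toy stand-ins used only to exercise it.

```python
import numpy as np

def permutation_importance(predict, X, y, metric, n_repeats=10, seed=None):
    """Permutation importance per the pseudocode above.

    predict : callable mapping X -> predictions (the trained model f)
    metric  : callable (y_true, y_pred) -> score, higher is better
    """
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, j])            # destroy feature j's information
            drops.append(baseline - metric(y, predict(X_perm)))
        importances[j] = np.mean(drops)          # average the drops over repeats
    return importances

# Toy check: the label depends only on column 0; column 1 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)
model = lambda Z: (Z[:, 0] > 0).astype(int)      # stand-in "trained model"
accuracy = lambda yt, yp: float(np.mean(yt == yp))
I = permutation_importance(model, X, y, accuracy, n_repeats=5, seed=0)
```

Here `I[0]` lands near 0.5 (permuting column 0 roughly halves accuracy from the perfect baseline) while `I[1]` is exactly 0, since the model never reads column 1.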

Complexity

  • Time: \(O(dR\cdot \mathrm{EvalCost})\) for \(d\) features and \(R\) permutations per feature, where \(\mathrm{EvalCost}\) is the model+metric evaluation cost on the held-out set
  • Space: Typically \(O(n d)\) if permuted copies are materialized (less with in-place permutation/restoration)
  • Assumptions: \(n\) examples, \(d\) features; evaluation is performed on held-out data, and repeated permutations are used to reduce variance