Permutation Feature Importance

Formula

\[ I_j = M(f, X, y) - M\big(f, \pi_j(X), y\big) \]

Parameters

  • \(M\): metric (higher-is-better shown here)
  • \(\pi_j(X)\): dataset with feature \(j\) randomly permuted

What it means

Measures importance by how much performance drops when a feature’s information is destroyed.

What it's used for

  • Model-agnostic feature ranking on validation/test sets.
  • Sanity-checking impurity-based tree importances.

Key properties

  • Simple and widely applicable.
  • Works best with a meaningful held-out metric.

Common gotchas

  • Correlated features can hide each other’s importance: when one is permuted, the model can recover much of the same signal from the others, so the measured drop understates the feature’s true contribution.
  • Permutation on training data can overstate importance due to overfitting.
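
The correlation gotcha can be demonstrated with a toy setup (all names and data below are made up for illustration): two nearly duplicate features share one signal, so permuting either one alone costs far less accuracy than permuting both together.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)        # x2 is almost a copy of x1
y = (x1 + x2 > 0).astype(int)

predict = lambda a, b: (a + b > 0).astype(int)   # toy "model" that is perfect by construction
acc = lambda y_pred: float(np.mean(y_pred == y))

baseline = acc(predict(x1, x2))                                        # 1.0
drop_x1 = baseline - acc(predict(rng.permutation(x1), x2))             # modest: x2 covers for x1
drop_both = baseline - acc(predict(rng.permutation(x1),
                                   rng.permutation(x2)))               # large: signal fully destroyed
```

Permuting x1 alone drops accuracy only to roughly 0.75 because x2 still carries most of the signal; permuting both drops it to chance, so each feature’s individual importance understates the pair’s joint contribution.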

Example

If permuting "account_age" sharply drops AUC, the feature is important for the model’s ranking.
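
A sketch of that example on synthetic data (the feature name and data-generating process are hypothetical); AUC is computed via the rank-based Mann-Whitney formula so the snippet needs only NumPy:

```python
import numpy as np

def auc(y_true, scores):
    # Mann-Whitney AUC: probability a random positive outranks a random negative.
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(1)
n = 1000
account_age = rng.exponential(2.0, size=n)                     # hypothetical feature
p = 1 / (1 + np.exp(-(account_age - 2)))                       # label probability rises with age
y = (rng.random(n) < p).astype(int)

base_auc = auc(y, account_age)                  # model score driven by the intact feature
perm_auc = auc(y, rng.permutation(account_age)) # same scores, feature info destroyed
```

The permuted AUC collapses toward 0.5, and the gap `base_auc - perm_auc` is the feature’s importance under the AUC metric.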

How to Compute (Pseudocode)

Input: trained model f, evaluation set (X, y), metric M, repeats R
Output: feature importances I_j

baseline <- M(f, X, y)
for each feature j:
  drops <- empty list
  for r from 1 to R:
    X_perm <- copy of X with column j randomly permuted
    score_perm <- M(f, X_perm, y)
    append (baseline - score_perm) to drops   # for higher-is-better metrics
  I_j <- average(drops)

return {I_j}
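
The pseudocode translates directly into a short NumPy implementation; the model and metric below are toy stand-ins used only to exercise it.

```python
import numpy as np

def permutation_importance(predict, X, y, metric, n_repeats=10, seed=None):
    """Permutation importance per the pseudocode above.

    predict : callable mapping X -> predictions (the trained model f)
    metric  : callable (y_true, y_pred) -> score, higher is better
    """
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, j])            # destroy feature j's information
            drops.append(baseline - metric(y, predict(X_perm)))
        importances[j] = np.mean(drops)          # average the drops over repeats
    return importances

# Toy check: the label depends only on column 0; column 1 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)
model = lambda Z: (Z[:, 0] > 0).astype(int)      # stand-in "trained model"
accuracy = lambda yt, yp: float(np.mean(yt == yp))
I = permutation_importance(model, X, y, accuracy, n_repeats=5, seed=0)
```

Here `I[0]` lands near 0.5 (permuting column 0 roughly halves accuracy from the perfect baseline) while `I[1]` is exactly 0, since the model never reads column 1.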

Complexity

  • Time: \(O(dR\cdot \mathrm{EvalCost})\) for \(d\) features and \(R\) permutations per feature, where \(\mathrm{EvalCost}\) is the model+metric evaluation cost on the held-out set
  • Space: Typically \(O(n d)\) if permuted copies are materialized (less with in-place permutation/restoration)
  • Assumptions: \(n\) examples, \(d\) features; evaluation is performed on held-out data, and repeated permutations are used to reduce variance