
Model Interpretability

Formula

\[ \hat y = f(x) \quad \Rightarrow \quad \text{interpretability asks how } x \text{ drives } \hat y \]

Parameters

  • \(f\): trained model
  • \(x\): input features

What it means

Model interpretability covers the tools and practices used to understand why a model behaves the way it does, both globally (patterns across the whole dataset) and locally (individual predictions).

What it's used for

  • Debugging, stakeholder trust, compliance, and feature audits.
  • Comparing models beyond aggregate metrics.

Key properties

  • Includes intrinsic interpretability (simple, directly readable models) and post-hoc explanations applied to black-box models (see the sketch after this list).
  • Local explanations do not automatically imply causal effects.
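
As a rough illustration of that split, the sketch below (assuming scikit-learn is available, on a synthetic dataset) reads an explanation directly off a logistic regression's coefficients, then explains a gradient-boosted model post hoc with permutation importance.

# Sketch: intrinsic vs. post-hoc interpretability (assumes scikit-learn is installed)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=2000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Intrinsic: the coefficients of a linear model are themselves the explanation.
linear = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("linear coefficients:", np.round(linear.coef_.ravel(), 3))

# Post-hoc: a black-box model explained after the fact via permutation importance.
boosted = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(boosted, X_test, y_test, n_repeats=10, random_state=0)
print("permutation importances:", np.round(result.importances_mean, 3))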

Common gotchas

  • Explanation methods can disagree.
  • Correlated features can make importances unstable or misleading (illustrated in the sketch below).
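
One way to see the correlation problem, again on synthetic data with scikit-learn assumed available: duplicating a column creates a perfectly correlated pair, and a random forest typically splits the credit between the two copies, so neither column alone reflects the strength of the underlying signal.

# Sketch: correlated features dilute impurity-based importances (assumes scikit-learn)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
X_dup = np.hstack([X, X[:, [0]]])  # append an exact copy of feature 0

imp_orig = RandomForestClassifier(random_state=0).fit(X, y).feature_importances_
imp_dup = RandomForestClassifier(random_state=0).fit(X_dup, y).feature_importances_

print("original importances:  ", np.round(imp_orig, 3))
print("with duplicated col 0: ", np.round(imp_dup, 3))
# Feature 0's credit is typically shared with its copy, so each column in the
# duplicated setting looks less important than the signal actually is.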

Example

Use global feature importance plus local examples to review a credit-risk model before launch.

How to Compute (Pseudocode)

Input: trained model, evaluation data, interpretability goals (global/local)
Output: interpretation report/artifacts

choose interpretation methods appropriate for the model and question
  examples: feature importance, PDP/ICE, SHAP, local examples, error slices
compute explanations on held-out or representative data
cross-check explanations against domain constraints and known correlations
summarize global patterns and local examples with caveats
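
A minimal sketch of this workflow, assuming scikit-learn (and its matplotlib dependency) and substituting a synthetic dataset and random forest for the real model: global permutation importance on held-out data, a partial-dependence/ICE plot for the top-ranked feature, and a few local cases pulled for manual review.

# Sketch of the workflow above (assumes scikit-learn and matplotlib are installed).
# A synthetic dataset and random forest stand in for the real model and data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance, PartialDependenceDisplay
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Global: permutation importance computed on held-out data.
perm = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
ranking = np.argsort(perm.importances_mean)[::-1]
for i in ranking:
    print(f"feature {i}: {perm.importances_mean[i]:.3f} +/- {perm.importances_std[i]:.3f}")

# Global: partial dependence with ICE curves for the top-ranked feature.
PartialDependenceDisplay.from_estimator(clf, X_test, [int(ranking[0])], kind="both")

# Local: pull a few individual predictions to review by hand.
probs = clf.predict_proba(X_test)[:, 1]
for idx in np.argsort(probs)[-3:]:
    print("case", int(idx), "score", round(float(probs[idx]), 3),
          "inputs", np.round(X_test[idx], 2))

# The remaining steps (cross-checking against domain constraints, writing up
# global patterns and caveats) are review work rather than computation.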

Complexity

  • Time: Depends on the chosen explanation methods and model evaluation costs (for example, SHAP and permutation methods can be expensive)
  • Space: Depends on stored explanation outputs, sampled datasets, and visualization artifacts
  • Assumptions: Interpretability is a workflow umbrella; downstream methods determine actual runtime/memory complexity