Model Interpretability¶
Formula¶
\[
\hat y = f(x) \quad \Rightarrow \quad \text{interpretability asks how } x \text{ drives } \hat y
\]
Parameters¶
- \(f\): trained model
- \(x\): input features
What it means¶
Model interpretability covers the tools and practices for understanding why a model behaves as it does, both globally (patterns across a dataset) and locally (individual predictions).
What it's used for¶
- Debugging, stakeholder trust, compliance, and feature audits.
- Comparing models beyond aggregate metrics.
Key properties¶
- Includes intrinsic interpretability (simple models) and post-hoc explanations.
- Local explanations do not automatically imply causal effects.
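The intrinsic case can be made concrete with a tiny sketch: a linear model's coefficients are directly readable as per-unit feature effects, no post-hoc method required. The feature names and data here are illustrative, not from any real credit model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data; columns stand in for "income" and "debt" (hypothetical names).
X = rng.normal(size=(500, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=500)

# Ordinary least squares: the fitted weights ARE the explanation.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["income", "debt"], np.round(coef, 2))))
```

A black-box model fit to the same data would need a post-hoc method (permutation importance, SHAP, etc.) to recover comparable effect estimates.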
Common gotchas¶
- Explanation methods can disagree.
- Correlated features can make importances unstable or misleading.
Example¶
Use global feature importance plus local examples to review a credit-risk model before launch.
How to Compute (Pseudocode)¶
Input: trained model, evaluation data, interpretability goals (global/local)
Output: interpretation report/artifacts
choose interpretation methods appropriate for the model and question
examples: feature importance, PDP/ICE, SHAP, local examples, error slices
compute explanations on held-out or representative data
cross-check explanations against domain constraints and known correlations
summarize global patterns and local examples with caveats
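The steps above can be sketched end to end with scikit-learn, assuming a tabular classification task; the synthetic dataset, model choice, and thresholds are illustrative stand-ins, not prescriptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the evaluation data.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Global view: permutation importance computed on held-out data.
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]

# Local view: predicted probabilities for a few individual held-out rows,
# to be reviewed alongside their feature values and domain knowledge.
local_probs = model.predict_proba(X_test[:3])[:, 1]

print("top features by global importance:", ranking[:3])
print("example predicted probabilities:", np.round(local_probs, 3))
```

The cross-checking and summarization steps remain manual: compare the ranked features against known correlations and domain constraints, and report both the global ranking and the local examples with their caveats.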
Complexity¶
- Time: Depends on the chosen explanation methods and model evaluation costs (for example, SHAP and permutation methods can be expensive)
- Space: Depends on stored explanation outputs, sampled datasets, and visualization artifacts
- Assumptions: Interpretability is a workflow umbrella; downstream methods determine actual runtime/memory complexity