Data Leakage¶
Formula¶
\[
P(\text{feature}\mid \text{deployment}) \ne P(\text{feature}\mid \text{training pipeline assumptions})
\]
\[
\text{Leakage} \Rightarrow \widehat{\text{generalization}}\ \text{too optimistic}
\]
Parameters¶
- Leakage: information that is unavailable at prediction time but appears during training or evaluation.
What it means¶
Leakage happens when future or target-derived information enters training features, preprocessing, or validation setup.
What it's used for¶
- Designing safe splits and pipelines.
- Debugging suspiciously high validation scores.
Key properties¶
- Can come from temporal leakage, target leakage, duplicate rows, or preprocessing across splits.
- Often invisible unless you audit data generation timing.
Common gotchas¶
- Standardizing on the full dataset before splitting is leakage.
- Feature definitions like "days since churn" can directly encode the label.
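The first gotcha can be made concrete with a minimal sketch. The numbers below are illustrative, not from the source: fitting a normalization statistic on the full dataset lets a held-out point shift the statistics applied to every training row.

```python
# Minimal sketch: fitting normalization statistics on the full dataset
# leaks test-set information into training. Illustrative values only.
train = [1.0, 2.0, 3.0, 4.0]   # hypothetical training values
test = [100.0]                 # an extreme held-out value

def mean(xs):
    return sum(xs) / len(xs)

# Wrong: statistic computed on train + test before splitting.
leaky_mean = mean(train + test)   # the test point shifts the mean to 22.0

# Right: statistic fit on the training split only.
safe_mean = mean(train)           # 2.5

# Every value standardized with leaky_mean now carries information about
# the held-out point, so validation scores stop reflecting deployment.
```

The same reasoning applies to any fitted preprocessing step (scaling, encoding, imputation): fit it on the training split only, then apply it to validation and test data.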
Example¶
If a fraud model uses chargeback outcomes not available at authorization time, offline results will be inflated.
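A toy version of this target-leakage pattern, with hypothetical column names, shows how a post-outcome field can make offline results look perfect:

```python
# Sketch of target leakage: a feature derived from the outcome itself
# (a hypothetical post-transaction "chargeback" flag) is not available
# at authorization time, yet perfectly predicts the label offline.
rows = [
    {"amount": 20.0,  "chargeback": 0, "is_fraud": 0},
    {"amount": 950.0, "chargeback": 1, "is_fraud": 1},
    {"amount": 15.0,  "chargeback": 0, "is_fraud": 0},
    {"amount": 700.0, "chargeback": 1, "is_fraud": 1},
]

def leaky_model(row):
    # The model "learns" to copy the leaked outcome column.
    return row["chargeback"]

offline_accuracy = sum(
    leaky_model(r) == r["is_fraud"] for r in rows
) / len(rows)
# Offline accuracy is 1.0, but at authorization time the chargeback
# column does not exist yet, so deployed performance collapses.
```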
How to Compute (Pseudocode)¶
Input: dataset, feature definitions, split strategy, pipeline design
Output: leakage audit findings
verify train/val/test split logic (time/group/duplicate safety)
check each feature for availability at prediction time
ensure preprocessing/encoding/imputation is fit inside training folds only
audit target-derived and future-derived columns
flag suspiciously optimistic validation patterns for review
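Two of the audit steps above (duplicate-safe splits, prediction-time availability) can be sketched as runnable checks. The column names and the availability map are illustrative assumptions, not a fixed API:

```python
# Hedged sketch of two audit steps as concrete checks.
def audit_splits(train_ids, test_ids):
    """Flag identical rows (by id) appearing in both splits."""
    return sorted(set(train_ids) & set(test_ids))

def audit_feature_availability(features, available_at_prediction):
    """Flag features not known at prediction time."""
    return [f for f in features if not available_at_prediction.get(f, False)]

# Illustrative inputs: row 4 is duplicated across splits, and
# chargeback_outcome is a future-derived column.
train_ids = [1, 2, 3, 4]
test_ids = [4, 5, 6]
features = ["amount", "merchant", "chargeback_outcome"]
available = {"amount": True, "merchant": True, "chargeback_outcome": False}

findings = {
    "split_overlap": audit_splits(train_ids, test_ids),
    "unavailable_features": audit_feature_availability(features, available),
}
# findings flags row 4 and the chargeback_outcome column for review.
```

In practice the availability map comes from data-lineage metadata or conversations with the owning team; the check itself is simple once that timing information is recorded.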
Complexity¶
- Time: Depends on the scope of the audit; typically dominated by metadata review, targeted data scans, and reruns of the evaluation pipeline
- Space: Depends on audit artifacts and validation outputs
- Assumptions: Leakage detection is a workflow/checklist process, not a single metric computation