Data Leakage¶
Formula¶
\[
P(\text{feature}\mid \text{deployment}) \ne P(\text{feature}\mid \text{training pipeline assumptions})
\]
\[
\text{Leakage} \Rightarrow \widehat{\text{generalization}}\ \text{too optimistic}
\]
Parameters¶
- Leakage: information that is unavailable at prediction time but appears during training or evaluation.
What it means¶
Leakage happens when future or target-derived information enters training features, preprocessing, or validation setup.
What it's used for¶
- Designing safe splits and pipelines.
- Debugging suspiciously high validation scores.
Key properties¶
- Can come from temporal leakage, target leakage, duplicate rows, or preprocessing across splits.
- Often invisible unless you audit data generation timing.
Common gotchas¶
- Standardizing on the full dataset before splitting is leakage.
- Feature definitions like "days since churn" can directly encode the label.
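The first gotcha can be made concrete with a minimal sketch. The numbers below are illustrative, not from the source: fitting a normalization statistic on the full dataset lets a held-out point shift the statistics applied to every training row.

```python
# Minimal sketch: fitting normalization statistics on the full dataset
# leaks test-set information into training. Illustrative values only.
train = [1.0, 2.0, 3.0, 4.0]   # hypothetical training values
test = [100.0]                 # an extreme held-out value

def mean(xs):
    return sum(xs) / len(xs)

# Wrong: statistic computed on train + test before splitting.
leaky_mean = mean(train + test)   # the test point shifts the mean to 22.0

# Right: statistic fit on the training split only.
safe_mean = mean(train)           # 2.5

# Every value standardized with leaky_mean now carries information about
# the held-out point, so validation scores stop reflecting deployment.
```

The same reasoning applies to any fitted preprocessing step (scaling, encoding, imputation): fit it on the training split only, then apply it to validation and test data.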
Example¶
If a fraud model uses chargeback outcomes not available at authorization time, offline results will be inflated.
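A toy version of this target-leakage pattern, with hypothetical column names, shows how a post-outcome field can make offline results look perfect:

```python
# Sketch of target leakage: a feature derived from the outcome itself
# (a hypothetical post-transaction "chargeback" flag) is not available
# at authorization time, yet perfectly predicts the label offline.
rows = [
    {"amount": 20.0,  "chargeback": 0, "is_fraud": 0},
    {"amount": 950.0, "chargeback": 1, "is_fraud": 1},
    {"amount": 15.0,  "chargeback": 0, "is_fraud": 0},
    {"amount": 700.0, "chargeback": 1, "is_fraud": 1},
]

def leaky_model(row):
    # The model "learns" to copy the leaked outcome column.
    return row["chargeback"]

offline_accuracy = sum(
    leaky_model(r) == r["is_fraud"] for r in rows
) / len(rows)
# Offline accuracy is 1.0, but at authorization time the chargeback
# column does not exist yet, so deployed performance collapses.
```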
How to Compute (Pseudocode)¶
Input: dataset, feature definitions, split strategy, pipeline design
Output: leakage audit findings
verify train/val/test split logic (time/group/duplicate safety)
check each feature for availability at prediction time
ensure preprocessing/encoding/imputation is fit inside training folds only
audit target-derived and future-derived columns
flag suspiciously optimistic validation patterns for review
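Two of the audit steps above (duplicate-safe splits, prediction-time availability) can be sketched as runnable checks. The column names and the availability map are illustrative assumptions, not a fixed API:

```python
# Hedged sketch of two audit steps as concrete checks.
def audit_splits(train_ids, test_ids):
    """Flag identical rows (by id) appearing in both splits."""
    return sorted(set(train_ids) & set(test_ids))

def audit_feature_availability(features, available_at_prediction):
    """Flag features not known at prediction time."""
    return [f for f in features if not available_at_prediction.get(f, False)]

# Illustrative inputs: row 4 is duplicated across splits, and
# chargeback_outcome is a future-derived column.
train_ids = [1, 2, 3, 4]
test_ids = [4, 5, 6]
features = ["amount", "merchant", "chargeback_outcome"]
available = {"amount": True, "merchant": True, "chargeback_outcome": False}

findings = {
    "split_overlap": audit_splits(train_ids, test_ids),
    "unavailable_features": audit_feature_availability(features, available),
}
# findings flags row 4 and the chargeback_outcome column for review.
```

In practice the availability map comes from data-lineage metadata or conversations with the owning team; the check itself is simple once that timing information is recorded.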
Complexity¶
- Time: Depends on the scope of the audit; typically dominated by metadata review, targeted data scans, and reruns of the evaluation pipeline
- Space: Depends on audit artifacts and validation outputs
- Assumptions: Leakage detection is a workflow/checklist process, not a single metric computation