
Data Leakage

Formula

\[ P(\text{feature}\mid \text{deployment}) \ne P(\text{feature}\mid \text{training pipeline assumptions}) \]
\[ \text{Leakage} \Rightarrow \widehat{\text{generalization}}\ \text{too optimistic} \]

Parameters

  • Leakage means information that is unavailable at prediction time appears during training or evaluation.

What it means

Leakage happens when future or target-derived information enters training features, preprocessing, or validation setup.
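A minimal sketch of preprocessing leakage (toy numbers, not a real pipeline): computing standardization statistics on the full dataset lets test-set values shift the training features.

```python
# Toy example: scaling with full-dataset statistics leaks test information.
train = [1.0, 2.0, 3.0]
test = [10.0, 11.0]

full_mean = sum(train + test) / len(train + test)  # uses test data: leakage
train_mean = sum(train) / len(train)               # fit on train only: safe

# The leaky version centers training features using test-set information.
leaky_scaled = [x - full_mean for x in train]
safe_scaled = [x - train_mean for x in train]
print(safe_scaled)  # [-1.0, 0.0, 1.0]
```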

What it's used for

  • Designing safe splits and pipelines.
  • Debugging suspiciously high validation scores.

Key properties

  • Can come from temporal leakage, target leakage, duplicate rows, or preprocessing across splits.
  • Often invisible unless you audit data generation timing.
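One of these sources, duplicate rows, is mechanically checkable. A minimal sketch with hypothetical row tuples: any row appearing in both splits silently inflates validation scores.

```python
# Detect rows shared between train and test splits (duplicate-row leakage).
train_rows = [(1, "a"), (2, "b"), (3, "c")]
test_rows = [(3, "c"), (4, "d")]

overlap = set(train_rows) & set(test_rows)
print(overlap)  # {(3, 'c')} -> this row leaks between splits
```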

Common gotchas

  • Standardizing on the full dataset before splitting is leakage.
  • Feature definitions like "days since churn" can directly encode the label.
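The second gotcha can be caught by recording when each feature's value becomes known. A minimal sketch with a hypothetical feature catalog: anything known only after the prediction moment is flagged.

```python
from datetime import date

# Hypothetical feature catalog mapping each feature to the date its value
# becomes known. "days_since_churn" is derived from the label itself, so it
# is only known after the outcome.
feature_known_at = {
    "account_age_days": date(2024, 1, 1),  # known before prediction
    "days_since_churn": date(2024, 6, 1),  # known only after the label
}
prediction_time = date(2024, 3, 1)

leaky = [f for f, known in feature_known_at.items() if known > prediction_time]
print(leaky)  # ['days_since_churn']
```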

Example

If a fraud model uses chargeback outcomes not available at authorization time, offline results will be inflated.
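A toy sketch of why the offline score inflates: a "model" that simply reads the leaked chargeback outcome scores perfectly offline, yet that field is empty at authorization time.

```python
# Offline rows carry the chargeback outcome; the online row does not,
# because the outcome is unknown at authorization time.
offline = [{"chargeback": 1, "fraud": 1}, {"chargeback": 0, "fraud": 0}]
online = [{"chargeback": None, "fraud": 1}]

# A degenerate "model" that just echoes the leaked feature.
predict = lambda row: 1 if row["chargeback"] else 0

offline_acc = sum(predict(r) == r["fraud"] for r in offline) / len(offline)
print(offline_acc)  # 1.0 -> inflated; online the feature is missing
```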

How to Compute (Pseudocode)

Input: dataset, feature definitions, split strategy, pipeline design
Output: leakage audit findings

verify train/val/test split logic (time/group/duplicate safety)
check each feature for availability at prediction time
ensure preprocessing/encoding/imputation is fit inside training folds only
audit target-derived and future-derived columns
flag suspiciously optimistic validation patterns for review
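The first two checklist steps can be sketched as code, assuming rows are dicts with a "ts" timestamp; the function names here are illustrative, not a library API.

```python
def check_temporal_split(train, test):
    """Every test row should be strictly later than every train row."""
    return max(r["ts"] for r in train) < min(r["ts"] for r in test)

def check_no_duplicates(train, test):
    """No identical row should appear in both splits."""
    key = lambda r: tuple(sorted(r.items()))
    return not ({key(r) for r in train} & {key(r) for r in test})

train = [{"ts": 1, "x": 0.5}, {"ts": 2, "x": 0.7}]
test = [{"ts": 3, "x": 0.9}]
print(check_temporal_split(train, test))  # True
print(check_no_duplicates(train, test))   # True
```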

Complexity

  • Time: Depends on dataset audits and validation checks; often dominated by metadata review plus targeted data scans and reruns of evaluation pipelines
  • Space: Depends on audit artifacts and validation outputs
  • Assumptions: Leakage detection is a workflow/checklist process, not a single metric computation

See also