Data Drift¶
Formula¶
\[
P_t(X) \ne P_{t'}(X)
\]
Parameters¶
- \(P_t(X)\): feature distribution at time \(t\)
What it means¶
Data drift means the input feature distribution changes over time relative to training or baseline periods.
What it's used for¶
- Monitoring production pipelines and retraining triggers.
- Investigating drops in model performance.
Key properties¶
- Can occur without label changes.
- Population, covariate, or schema shifts are common subtypes in practice.
Common gotchas¶
- Pure distribution shift metrics do not prove business impact.
- Seasonality can look like drift unless baselines are contextual.
Example¶
A payments model sees a new merchant mix after expansion into a new country, shifting transaction features.
How to Compute (Pseudocode)¶
Input: baseline feature distribution and recent production feature data
Output: drift scores/alerts
for each monitored feature:
compute a drift statistic between baseline and recent windows
(for example PSI, KS statistic, histogram distance)
compare drift statistics to alert thresholds
aggregate alerts and route for investigation/retraining decisions
Complexity¶
- Time: Typically linear in the number of monitored records/features for histogram/statistic updates, plus metric-specific costs
- Space: Depends on retained baselines, histograms, and monitoring windows (often summary-statistics sized)
- Assumptions: Exact complexity depends on drift metric choice, binning strategy, and monitoring frequency