Imputation¶

Formula¶

\[ \tilde{x}_{ij}=\begin{cases} x_{ij}, & x_{ij}\ \text{observed}\\ g(x_{\cdot j}), & x_{ij}\ \text{missing} \end{cases} \]

Parameters¶

\(g\): imputation rule (mean/median/mode/model/KNN/etc.)
\(\tilde{x}_{ij}\): imputed value

What it means¶

Replaces missing values with estimated values so downstream models can train and score.

What it's used for¶

Simple imputation baselines.
Pipeline preprocessing for tabular models.

Key properties¶

Imputation should be fit on the training split only.
Median is more robust than mean for skewed data.

Common gotchas¶

Single imputation understates uncertainty for inference tasks.
Using target information to impute predictors can leak labels.

Example¶

Median-impute income and add a binary "income_missing" indicator column.

How to Compute (Pseudocode)¶

Input: training data with missing values, imputation rule g
Output: fitted imputer and transformed dataset

fit imputer parameters on training data only
  examples: column mean/median/mode, KNN stats, or model-based parameters
for each missing entry:
  replace it using the fitted rule g
optionally add missingness indicator columns
return transformed data and fitted imputer state

Complexity¶

Time: Depends on the imputation method (simple mean/median imputation is often \(O(nd)\); KNN/model-based imputers can be much more expensive)
Space: Depends on the method and data representation; simple imputers store \(O(d)\) fitted statistics
Assumptions: \(n\) samples, \(d\) features; fitting and applying are both performed inside the training/pipeline workflow to avoid leakage