
Imputation

Formula

\[ \tilde{x}_{ij}=\begin{cases} x_{ij}, & x_{ij}\ \text{observed}\\ g(x_{\cdot j}), & x_{ij}\ \text{missing} \end{cases} \]

Parameters

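  • \(x_{ij}\): value of feature \(j\) for sample \(i\)
  • \(g\): imputation rule applied to the observed values of feature \(j\) (mean, median, mode); KNN and model-based rules also draw on other features
  • \(\tilde{x}_{ij}\): value after imputation (the original \(x_{ij}\) if observed, the imputed value otherwise)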
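
As a concrete instance of the formula, here is a minimal NumPy sketch. It assumes missing entries are encoded as NaN and picks the column mean as \(g\); both are illustrative choices, not requirements of the definition.

```python
import numpy as np

# Toy matrix: rows are samples i, columns are features j; NaN marks a missing entry.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

# g(x_{.j}): here the mean of the observed entries in column j (an assumed choice of g).
col_means = np.nanmean(X, axis=0)

# x~_ij = x_ij where observed, g(x_{.j}) where missing.
X_tilde = np.where(np.isnan(X), col_means, X)
print(X_tilde)
```

Observed entries pass through unchanged; only the NaN positions take the column statistic.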

What it means

Replaces missing values with estimated values so downstream models can train and score.

What it's used for

  • Simple imputation baselines.
  • Pipeline preprocessing for tabular models.

Key properties

  • Fit the imputer on the training split only, then apply the fitted statistics unchanged to validation and test data.
  • The median is more robust than the mean for skewed data or data with outliers.
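
To illustrate the robustness point, the sketch below compares the two fill values on a small right-skewed sample. The numbers are made up for illustration.

```python
from statistics import mean, median

# Hypothetical right-skewed incomes (in thousands); one extreme value dominates.
observed = [30, 32, 35, 38, 40, 1000]

fill_mean = mean(observed)      # pulled far above the typical value by the outlier
fill_median = median(observed)  # stays near the bulk of the data

print(f"mean fill:   {fill_mean:.1f}")
print(f"median fill: {fill_median:.1f}")
```

A missing income filled with the mean would land well above five of the six observed values, while the median fill sits in the middle of them.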

Common gotchas

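  • Single imputation treats filled values as if they were observed, so it understates uncertainty for inference tasks (standard errors and confidence intervals come out too narrow); multiple imputation is the usual remedy.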
  • Using target information to impute predictors can leak labels.

Example

Median-impute income and add a binary "income_missing" indicator column.
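
A minimal pandas sketch of this example follows. The data frame and its values are hypothetical, and for brevity the median is computed on the same frame; in a real workflow it would come from the training split only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of five income values are missing.
df = pd.DataFrame({"income": [42_000.0, np.nan, 58_000.0, 31_000.0, np.nan]})

# Record missingness BEFORE filling, otherwise the information is lost.
df["income_missing"] = df["income"].isna().astype(int)

# Series.median() skips NaN by default.
income_median = df["income"].median()
df["income"] = df["income"].fillna(income_median)

print(df)
```

The indicator column lets a downstream model distinguish filled values from observed ones, which can help when missingness itself carries signal.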

How to Compute (Pseudocode)

Input: training data with missing values, imputation rule g
Output: fitted imputer and transformed dataset

fit imputer parameters on training data only
  examples: column mean/median/mode, KNN stats, or model-based parameters
for each missing entry:
  replace it using the fitted rule g
optionally add missingness indicator columns
return transformed data and fitted imputer state
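
The pseudocode above can be sketched in Python roughly as follows. `MedianImputer` is a hypothetical class written for this illustration, not a library API; it assumes NaN-encoded missing values and a median rule.

```python
import numpy as np

class MedianImputer:
    """Hypothetical minimal imputer: column medians plus optional indicators."""

    def __init__(self, add_indicator=True):
        self.add_indicator = add_indicator
        self.medians_ = None

    def fit(self, X_train):
        # Fit parameters on training data only: one median per column.
        self.medians_ = np.nanmedian(X_train, axis=0)
        return self

    def transform(self, X):
        if self.medians_ is None:
            raise RuntimeError("call fit() before transform()")
        missing = np.isnan(X)
        # Replace each missing entry using the fitted rule.
        filled = np.where(missing, self.medians_, X)
        if self.add_indicator:
            # Append one 0/1 indicator column per feature.
            filled = np.hstack([filled, missing.astype(float)])
        return filled

X_train = np.array([[1.0, 100.0], [np.nan, 200.0], [3.0, np.nan], [5.0, 400.0]])
X_test = np.array([[np.nan, 50.0], [2.0, np.nan]])

imputer = MedianImputer().fit(X_train)   # statistics come from X_train only
X_train_out = imputer.transform(X_train)
X_test_out = imputer.transform(X_test)   # test data reuses the training medians
print(X_test_out)
```

Separating `fit` from `transform` is what keeps test-set statistics out of the fitted state. This sketch does not handle a column that is entirely missing in training, where the median is undefined.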

Complexity

  • Time: Depends on the imputation method (simple mean/median imputation is often \(O(nd)\); KNN/model-based imputers can be much more expensive)
  • Space: Depends on the method and data representation; simple imputers store \(O(d)\) fitted statistics
  • Assumptions: \(n\) samples, \(d\) features; the imputer is fit on training data inside the pipeline and then applied as-is to new data, which avoids leakage

See also