Missing Data

Formula

\[ X = X_{obs} \cup X_{mis} \]

Parameters

  • \(X_{obs}\): observed values
  • \(X_{mis}\): missing values

What it means

Missing values are not just blanks: the mechanism that produced them determines how much bias an analysis picks up and which handling methods are valid.

What it's used for

  • Data quality audits.
  • Choosing deletion vs imputation vs model-based handling.

Key properties

  • Common mechanisms: MCAR (missing completely at random), MAR (missing at random), MNAR (missing not at random).
  • Missingness itself can be predictive and worth modeling.
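
The three mechanisms can be stated more formally. The sketch below uses the standard textbook formulation; the indicator \(R\) is notation introduced here and is not part of the formula above. Let \(R_{ij} = 1\) if entry \(X_{ij}\) is missing and \(0\) otherwise.

  • MCAR: \( P(R \mid X_{obs}, X_{mis}) = P(R) \)
  • MAR: \( P(R \mid X_{obs}, X_{mis}) = P(R \mid X_{obs}) \)
  • MNAR: \( P(R \mid X_{obs}, X_{mis}) \) still depends on \(X_{mis}\) after conditioning on \(X_{obs}\)

Since \(X_{mis}\) is by definition unobserved, the data alone generally cannot separate MAR from MNAR; that distinction usually rests on domain knowledge.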

Common gotchas

  • Dropping rows with missing values can shift the target distribution when missingness is related to the target or its predictors.
  • Imputation without a missingness indicator can hide useful signal.
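
To illustrate the second gotcha, here is a minimal sketch of mean imputation that also keeps a missingness flag. The helper names (`is_missing`, `impute_with_indicator`) and the choice to treat `None` and NaN as missing are assumptions made for this example, not part of any particular library.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing (an assumption; adapt to your data)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def impute_with_indicator(values):
    """Mean-impute a numeric column and return a parallel 0/1 missingness flag."""
    observed = [v for v in values if not is_missing(v)]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    fill = sum(observed) / len(observed)
    imputed = [fill if is_missing(v) else v for v in values]
    indicator = [1 if is_missing(v) else 0 for v in values]
    return imputed, indicator

# Hypothetical lab values with two missing entries.
lab_values = [4.0, None, 6.0, float("nan"), 5.0]
imputed, was_missing = impute_with_indicator(lab_values)
print(imputed)      # filled column
print(was_missing)  # flag column a downstream model can use
```

Without the `was_missing` column, the imputed rows would be indistinguishable from rows that genuinely held the mean value.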

Example

A medical lab test that is ordered only for severe patients is missing for everyone else. Because the missingness depends on severity, it is not missing completely at random.
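
A small simulation can make the resulting bias concrete. Everything numeric in this sketch is an assumption chosen for illustration (20% severe cases, lab values centred at 8.0 for severe and 5.0 for other patients); it is synthetic data, not a real study.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

n = 10_000
patients = []
for _ in range(n):
    severe = random.random() < 0.2  # assumed share of severe cases
    # Assumed: severe patients have higher true lab values on average.
    true_value = random.gauss(8.0 if severe else 5.0, 1.0)
    # The test is only ordered for severe patients; otherwise it is missing.
    recorded = true_value if severe else None
    patients.append((severe, true_value, recorded))

true_mean = sum(p[1] for p in patients) / n
observed = [p[2] for p in patients if p[2] is not None]
observed_mean = sum(observed) / len(observed)

print(f"true mean over all patients:  {true_mean:.2f}")
print(f"mean of recorded values only: {observed_mean:.2f}")
```

Under these assumptions the mean of the recorded values sits well above the true population mean, because only the high-value group was ever measured.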

How to Compute (Pseudocode)

Input: dataset with missing values
Output: missingness audit and handling plan

compute missing-value counts/rates per feature and optionally per subgroup
identify likely missingness mechanisms (MCAR/MAR/MNAR assumptions)
choose handling strategy (drop, impute, model-based, indicators)
validate impact on downstream model performance and bias
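
The first step of the pseudocode (missing rates per feature and per subgroup) might look like the following in plain Python. The function names, the row-of-dicts layout, and the sample records are all hypothetical; with a dataframe library the same audit is typically a one-liner.

```python
import math
from collections import defaultdict

def is_missing(value):
    """Treat None and float NaN as missing (an assumption; adapt to your data)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def missing_rates(rows, features):
    """Fraction of rows in which each feature is missing (absent keys count as missing)."""
    n = len(rows)
    if n == 0:
        return {f: 0.0 for f in features}
    return {f: sum(is_missing(r.get(f)) for r in rows) / n for f in features}

def missing_rates_by_group(rows, features, group_key):
    """Missing rates per feature, computed separately for each value of group_key."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[group_key]].append(r)
    return {g: missing_rates(members, features) for g, members in groups.items()}

# Hypothetical records: the lab test is only recorded for severe patients.
rows = [
    {"severity": "mild", "age": 34, "lab": None},
    {"severity": "mild", "age": None, "lab": None},
    {"severity": "severe", "age": 61, "lab": 7.9},
    {"severity": "severe", "age": 58, "lab": 8.4},
]
features = ["age", "lab"]

overall = missing_rates(rows, features)
by_severity = missing_rates_by_group(rows, features, "severity")
print(overall)
print(by_severity)
```

A large gap in missing rate between subgroups, as with `lab` here, is evidence against MCAR and a hint about which variables drive the missingness.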

Complexity

  • Time: Typically \(O(nd)\) for an initial missingness audit over \(n\) samples and \(d\) features, plus the cost of the chosen handling method
  • Space: \(O(d)\) for summary statistics (plus any transformed datasets or imputation state)
  • Assumptions: Audit workflow shown; imputation/modeling costs depend on the selected method and pipeline implementation

See also