Missing Data¶

Formula¶

\[ X = X_{obs} \cup X_{mis} \]

Parameters¶

\(X_{obs}\): observed values
\(X_{mis}\): missing values

What it means¶

Missing values are not just blanks; the mechanism behind missingness determines how much bias a given handling choice introduces and which analyses remain valid.

What it's used for¶

Data quality audits.
Choosing deletion vs imputation vs model-based handling.

Key properties¶

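Common mechanisms: MCAR (missing completely at random), MAR (missing at random, given the observed data), MNAR (missing not at random).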
Missingness itself can be predictive and worth modeling.
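
The three mechanisms can be stated more formally. The sketch below uses the notation from the Formula section plus one symbol that is an assumption of this note and not part of the original: \(R\), a missingness indicator with \(R_{ij} = 1\) when entry \(X_{ij}\) is missing.

\[ \text{MCAR:} \quad P(R \mid X_{obs}, X_{mis}) = P(R) \]

\[ \text{MAR:} \quad P(R \mid X_{obs}, X_{mis}) = P(R \mid X_{obs}) \]

\[ \text{MNAR:} \quad P(R \mid X_{obs}, X_{mis}) \text{ still depends on } X_{mis} \text{ after conditioning on } X_{obs} \]

MCAR and MAR differ in whether the observed data help explain the missingness. MNAR cannot be confirmed from the observed data alone, so in practice it is usually an assumption argued from domain knowledge.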

Common gotchas¶

Dropping rows with missing values can change the target distribution unless the values are missing completely at random (MCAR).
Imputation without a missingness indicator can hide useful signal.
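
One way to keep the missingness signal is to emit an indicator column alongside the imputed values. The following is a minimal sketch using only the Python standard library; the function name `impute_with_indicator`, the use of `None` as the missing marker, and median imputation are illustrative assumptions, not a prescribed method.

```python
from statistics import median

def impute_with_indicator(values):
    """Median-impute None entries and return (filled, indicator).

    Hypothetical helper: assumes at least one observed value.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    # Replace each missing entry with the median of the observed entries.
    filled = [fill if v is None else v for v in values]
    # 1 marks positions that were missing before imputation.
    indicator = [int(v is None) for v in values]
    return filled, indicator

filled, was_missing = impute_with_indicator([3.0, None, 5.0, None, 10.0])
```

A downstream model can then take both `filled` and `was_missing` as features, so rows that were imputed remain distinguishable from rows that were observed.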

Example¶

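A medical lab test that is missing because only severe patients receive it is not missing completely at random: missingness depends on severity. If severity is recorded, the mechanism may be treated as MAR given severity; if it is not recorded, the mechanism is likely MNAR.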
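
The effect of this kind of missingness on a complete-case analysis can be illustrated with a toy dataset. The records below are invented purely for illustration (they are not real clinical data), and the tuple layout `(severe, lab_value, bad_outcome)` is an assumption of this sketch.

```python
# Hypothetical patients: (severe, lab_value, bad_outcome).
# The lab value is only recorded for severe patients.
patients = [
    (True, 8.1, 1),
    (True, 7.4, 1),
    (True, 6.9, 0),
    (False, None, 0),
    (False, None, 0),
    (False, None, 1),
    (False, None, 0),
    (False, None, 0),
]

def outcome_rate(rows):
    """Fraction of rows with a bad outcome."""
    return sum(r[2] for r in rows) / len(rows)

# Complete-case analysis: keep only rows where the lab value is present.
complete_cases = [r for r in patients if r[1] is not None]

full_rate = outcome_rate(patients)
cc_rate = outcome_rate(complete_cases)
```

In this toy data, dropping rows without a lab value keeps only the severe patients, so the bad-outcome rate in the retained rows (`cc_rate`) is higher than in the full cohort (`full_rate`).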

How to Compute (Pseudocode)¶

Input: dataset with missing values
Output: missingness audit and handling plan

compute missing-value counts/rates per feature and optionally per subgroup
identify likely missingness mechanisms (MCAR/MAR/MNAR assumptions)
choose handling strategy (drop, impute, model-based, indicators)
validate impact on downstream model performance and bias
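
The first step of the audit above (missing rates per feature and per subgroup) might be sketched as follows. This is a standard-library illustration under stated assumptions: rows are dictionaries, `None` marks a missing value, the dataset is non-empty, and the helper names `missing_rates` and `missing_rates_by_group` are hypothetical.

```python
from collections import defaultdict

def missing_rates(rows, features):
    """Return the fraction of rows where each feature is None."""
    n = len(rows)  # assumes a non-empty dataset
    return {f: sum(r.get(f) is None for r in rows) / n for f in features}

def missing_rates_by_group(rows, feature, group_key):
    """Return the missing rate of one feature within each subgroup."""
    counts = defaultdict(lambda: [0, 0])  # group -> [missing, total]
    for r in rows:
        c = counts[r[group_key]]
        c[0] += r.get(feature) is None
        c[1] += 1
    return {g: m / t for g, (m, t) in counts.items()}

# Hypothetical toy data for illustration only.
rows = [
    {"age": 34, "lab": 1.2, "severity": "high"},
    {"age": None, "lab": None, "severity": "low"},
    {"age": 51, "lab": None, "severity": "low"},
    {"age": 47, "lab": 0.9, "severity": "high"},
]
rates = missing_rates(rows, ["age", "lab"])
by_group = missing_rates_by_group(rows, "lab", "severity")
```

A large gap between subgroup rates, as with `lab` across `severity` here, is evidence against MCAR and a prompt to look more closely at the mechanism before choosing a handling strategy.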

Complexity¶

Time: Typically \(O(nd)\) for an initial missingness audit over \(n\) samples and \(d\) features, plus the cost of the chosen handling method
Space: \(O(d)\) for summary statistics (plus any transformed datasets or imputation state)
Assumptions: Audit workflow shown; imputation/modeling costs depend on the selected method and pipeline implementation