Class Imbalance

Formula

\[ \pi = P(y=1),\quad \pi \ll 0.5\ \text{(rare positive class)} \]

Parameters

  • \(\pi\): positive class prevalence

What it means

Class imbalance means one class is much rarer than another, which changes training and evaluation behavior.
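The prevalence \(\pi\) defined above is just the positive rate of the labels. A minimal sketch on synthetic labels (the 1% rate mirrors the fraud example later in this page; all names here are illustrative):

```python
import numpy as np

# Hypothetical labels: roughly 1% positives.
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)

pi = y.mean()  # empirical estimate of P(y=1)
print(f"prevalence: {pi:.4f}")
```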

What it's used for

  • Choosing metrics (PR AUC, recall, precision, cost-based metrics).
  • Resampling, class weighting, threshold tuning.
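One reason PR AUC is preferred under imbalance: its random-scorer baseline equals the prevalence, while ROC AUC's baseline stays at 0.5 regardless of class balance, so ROC AUC can look flattering on rare-positive tasks. A hedged sketch with synthetic scores (the score model is made up for illustration):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y = (rng.random(5_000) < 0.02).astype(int)
# Hypothetical weakly informative scores: positives shifted up slightly.
scores = rng.random(5_000) + 0.3 * y

ap = average_precision_score(y, scores)   # PR AUC; baseline ~= prevalence
roc = roc_auc_score(y, scores)            # baseline 0.5 at any prevalence
print(f"PR AUC: {ap:.3f}  ROC AUC: {roc:.3f}")
```

With only ~2% positives, the same scores yield a much lower PR AUC than ROC AUC, which is exactly the gap that makes metric choice matter.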

Key properties

  • Accuracy becomes less informative as imbalance increases.
  • Probability calibration and thresholding matter more than default 0.5 cuts.
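The second bullet can be demonstrated directly: with rare, low-probability positives, the default 0.5 cut may predict no positives at all, while a tuned threshold recovers them. A minimal sketch with made-up, roughly calibrated probabilities:

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
y = (rng.random(5_000) < 0.02).astype(int)
# Hypothetical probabilities: low overall, higher for positives.
p = np.clip(0.01 + 0.2 * y + rng.normal(0, 0.05, 5_000), 0, 1)

# Default 0.5 cut: almost nothing crosses it, so F1 collapses.
default_f1 = f1_score(y, p >= 0.5, zero_division=0)

# Sweep candidate thresholds and keep the best F1.
ts = np.linspace(0.01, 0.5, 50)
best_t = max(ts, key=lambda t: f1_score(y, p >= t, zero_division=0))
best_f1 = f1_score(y, p >= best_t, zero_division=0)
print(f"F1 @ 0.50: {default_f1:.3f}  best F1 @ {best_t:.2f}: {best_f1:.3f}")
```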

Common gotchas

  • Oversampling before splitting can leak duplicates into validation/test.
  • SMOTE-like methods synthesize new positives from neighboring samples; generate them inside each training fold only, after the split, or synthetic points derived from held-out examples will inflate validation scores.

Example

With 1% fraud prevalence, a model predicting all negatives gets 99% accuracy but zero recall.
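This accuracy paradox is easy to verify numerically; a minimal sketch with synthetic 1%-positive labels and a degenerate always-negative predictor:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)  # ~1% "fraud"
pred = np.zeros_like(y)                      # predict all negatives

acc = accuracy_score(y, pred)   # near 0.99: matches the majority class
rec = recall_score(y, pred)     # 0.0: every fraud case is missed
print(f"accuracy={acc:.3f} recall={rec:.3f}")
```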

How to Compute (Pseudocode)

Input: labeled dataset and task objective
Output: imbalance-aware training/evaluation setup

measure class prevalence and baseline class counts
choose metrics aligned with costs (for example precision/recall, PR AUC)
choose mitigation strategy if needed (class weights, resampling, threshold tuning)
validate using leakage-safe splits (often stratified)
report metrics by class and at the chosen threshold
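The five steps above can be sketched end to end. This is one possible instantiation under assumptions: synthetic data with signal in the first feature, class weights as the mitigation (step 3), and the default 0.5 threshold for the per-class report (step 5):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000
X = rng.normal(size=(n, 4))
y = (X[:, 0] + rng.normal(0, 1, n) > 2.3).astype(int)  # rare positives

# Step 1: measure prevalence.  Step 4: leakage-safe stratified split.
print("prevalence:", y.mean())
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Step 3: mitigate via class weights rather than resampling.
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

# Steps 2 and 5: cost-aligned metric plus a per-class report at 0.5.
scores = clf.predict_proba(X_te)[:, 1]
ap = average_precision_score(y_te, scores)
print("PR AUC:", ap)
print(classification_report(y_te, clf.predict(X_te), zero_division=0))
```

`class_weight="balanced"` reweights the loss inversely to class frequency, which avoids the duplicate-leakage risk of oversampling noted under the gotchas.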

Complexity

  • Time: Mostly \(O(n)\) data scans for prevalence/metrics, plus the cost of the chosen training and validation workflow
  • Space: Depends on whether resampled datasets, weights, or per-class reports are materialized
  • Assumptions: Exact cost is dominated by the mitigation/training method rather than imbalance measurement itself

See also