Class Imbalance¶
Formula¶
\[
\pi = P(y=1),\quad \pi \ll 0.5\ \text{(rare positive class)}
\]
Parameters¶
- \(\pi\): positive class prevalence
What it means¶
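Class imbalance means one class is much rarer than the other. This affects both training, where an unweighted loss is dominated by the majority class, and evaluation, where aggregate metrics can hide poor performance on the rare class.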
What it's used for¶
- Choosing metrics (PR AUC, recall, precision, cost-based metrics).
- Resampling, class weighting, threshold tuning.
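As one illustration of class weighting, the sketch below computes inverse-frequency weights with the formula `n_samples / (n_classes * class_count)`, the same heuristic scikit-learn uses for `class_weight="balanced"`. The function name `balanced_class_weights` and the toy labels are assumptions made for this example.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}

# Toy data (assumed): 1% positives out of 1,000 samples.
labels = [1] * 10 + [0] * 990
weights = balanced_class_weights(labels)
print(weights)  # the rare class receives the much larger weight
```

With these weights each class contributes the same total weight to a weighted loss, so the rare class is no longer drowned out by sheer count.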
Key properties¶
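- Accuracy becomes less informative as imbalance increases: always predicting the majority class already scores \(1 - \pi\).
- The default 0.5 decision threshold is rarely appropriate under imbalance; probability calibration and a threshold tuned to the task's costs matter more.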
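To make the thresholding point concrete, here is a minimal sketch that picks the threshold minimizing total misclassification cost on a validation set. The helper names (`total_cost`, `best_threshold`), the scores, and the 1:10 cost ratio are all hypothetical; in practice the scores would come from a fitted model and the threshold would be tuned on validation data, not on the test set.

```python
def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Sum the cost of false positives and false negatives at a threshold."""
    cost = 0.0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 0:
            cost += cost_fp
        elif pred == 0 and y == 1:
            cost += cost_fn
    return cost

def best_threshold(y_true, scores, cost_fp, cost_fn):
    """Try every observed score (plus 'predict nothing') as a threshold."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: total_cost(y_true, scores, t, cost_fp, cost_fn),
    )

# Hypothetical validation labels and model scores.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
scores = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.30, 0.35, 0.60]

t = best_threshold(y_true, scores, cost_fp=1.0, cost_fn=10.0)
print(t, total_cost(y_true, scores, t, 1.0, 10.0))
print(0.5, total_cost(y_true, scores, 0.5, 1.0, 10.0))
```

On this toy data the cost-minimizing threshold is 0.30 (total cost 1.0), while the default 0.5 cut misses one positive and costs 10.0.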
Common gotchas¶
- Oversampling before splitting can leak duplicates into validation/test.
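- SMOTE-like methods must be fit on the training portion only; generating synthetic points before splitting or outside the cross-validation loop leaks information into the validation folds.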
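A leakage-safe ordering can be sketched as follows: split first, then oversample only the training indices. This uses plain random duplication of minority rows rather than SMOTE, and the function name `split_then_oversample`, the seed, and the toy labels are assumptions for illustration.

```python
import random

def split_then_oversample(labels, test_fraction=0.2, seed=0):
    """Split indices first, then duplicate minority rows in train only."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)

    # Sample minority indices with replacement until the classes match.
    extra = []
    if minority:
        extra = [rng.choice(minority)
                 for _ in range(len(majority) - len(minority))]
    return train_idx + extra, test_idx

# Toy labels (assumed): 5% positives out of 200 samples.
labels = [1] * 10 + [0] * 190
train_idx, test_idx = split_then_oversample(labels)

# No test index ever appears in the (oversampled) training set.
print(set(train_idx).isdisjoint(test_idx))
```

Because duplicates are drawn only from training indices, the test set keeps its original prevalence and contains no copies of training rows.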
Example¶
With 1% fraud prevalence, a model predicting all negatives gets 99% accuracy but zero recall.
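The arithmetic can be checked with a few lines of Python. The dataset size of 10,000 is an assumption chosen so that 1% prevalence gives a round count of 100 fraud cases.

```python
# Assumed toy dataset: 10,000 transactions, 1% of them fraud (label 1).
n = 10_000
y_true = [1] * 100 + [0] * 9_900
y_pred = [0] * n  # a "model" that always predicts the negative class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
# → accuracy=0.99 recall=0.00
```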
How to Compute (Pseudocode)¶
Input: labeled dataset and task objective
Output: imbalance-aware training/evaluation setup
measure class prevalence and baseline class counts
choose metrics aligned with costs (for example precision/recall, PR AUC)
choose mitigation strategy if needed (class weights, resampling, threshold tuning)
validate using leakage-safe splits (often stratified)
report metrics by class and at the chosen threshold
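The first, fourth, and fifth steps above might look like the following in plain Python. This is a sketch under stated assumptions, not a reference implementation: `stratified_split` and `per_class_report` are hypothetical helpers, the labels and scores are toy values, and a real workflow would more likely use a library's stratified splitter and metric functions.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so each class keeps roughly the same share in test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

def per_class_report(y_true, scores, threshold):
    """Precision, recall, and support for each class at one threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    report = {}
    for cls in (0, 1):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        report[cls] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "support": tp + fn,
        }
    return report

# Step 1: measure prevalence on toy labels (assumed 5% positives).
labels = [1] * 50 + [0] * 950
prevalence = sum(labels) / len(labels)

# Step 4: the stratified split preserves prevalence in the test set.
train_idx, test_idx = stratified_split(labels)
test_prevalence = sum(labels[i] for i in test_idx) / len(test_idx)
print(prevalence, test_prevalence)

# Step 5: report per-class metrics at a chosen threshold
# (labels and scores here are hypothetical).
report = per_class_report([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8], threshold=0.3)
print(report)
```

Reporting `support` alongside precision and recall makes it clear how few rare-class examples each estimate rests on.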
Complexity¶
- Time: Mostly \(O(n)\) data scans for prevalence/metrics, plus the cost of the chosen training and validation workflow
- Space: Depends on whether resampled datasets, weights, or per-class reports are materialized
- Assumptions: Exact cost is dominated by the mitigation/training method rather than imbalance measurement itself