Probability Calibration¶
Formula¶
\[
P(Y=1\mid \hat p = p) \approx p
\]
Plot¶
Reliability curve vs ideal diagonal: the perfect-calibration line \(y = x\) plotted against an example miscalibrated model \(y = x^{0.8}\), both over \([0, 1]\).
Parameters¶
- \(\hat p\): predicted probability
- \(p\): a fixed probability value (in practice, a bin of predictions and its target rate)
What it means¶
A classifier is calibrated when its predicted probabilities match observed frequencies: among cases where it predicts \(\hat p \approx p\), roughly a fraction \(p\) are actually positive.
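One common way to make this concrete is a binned summary such as expected calibration error (ECE): group predictions into equal-width bins, compare each bin's mean prediction with its empirical positive rate, and average the gaps weighted by bin size. A minimal sketch in pure Python (the bin count and toy data are illustrative assumptions):

```python
def expected_calibration_error(p_hat, y, n_bins=10):
    """Weighted average of |mean prediction - empirical positive rate|
    over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, label in zip(p_hat, y):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the last bin
        bins[idx].append((p, label))
    n = len(p_hat)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        frac_pos = sum(label for _, label in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(mean_p - frac_pos)
    return ece

# Toy set where each bin's empirical rate equals its prediction: ECE is 0.
p_hat = [0.2] * 5 + [0.8] * 5
y     = [0, 0, 0, 0, 1] + [1, 1, 1, 1, 0]
print(expected_calibration_error(p_hat, y))
```

Here the 0.2 bin has 1 positive out of 5 and the 0.8 bin has 4 out of 5, so both gaps vanish; a miscalibrated model would leave nonzero gaps.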
What it's used for¶
- Risk estimation, ranking-to-decision systems, and cost-sensitive thresholding.
- Comparing models beyond discrimination metrics like AUC.
Key properties¶
- A model can have high AUC and poor calibration.
- Calibration can be improved post hoc (Platt scaling, isotonic regression).
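As a sketch of the post-hoc idea, Platt scaling fits a one-dimensional logistic map \(g(s) = \sigma(a s + b)\) on held-out scores. The toy gradient-descent fit below is an illustrative assumption (the synthetic data, learning rate, and step count are not from the source), not a production recipe:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.2, steps=1000):
    """Fit g(s) = sigmoid(a*s + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # gradient of log loss w.r.t. the logit
            ga += err * s / n
            gb += err / n
        a -= lr * ga
        b -= lr * gb
    return lambda s: sigmoid(a * s + b)

# Synthetic raw scores: positives tend to score higher than negatives.
random.seed(0)
scores = [random.gauss(1.0, 1.0) for _ in range(200)] + \
         [random.gauss(-1.0, 1.0) for _ in range(200)]
labels = [1] * 200 + [0] * 200
g = fit_platt(scores, labels)  # maps raw scores into (0, 1), increasing in s
```

Because the slope \(a\) stays positive here, \(g\) is monotone, so it recalibrates probabilities without changing the model's ranking.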
Common gotchas¶
- Calibrating on the test set leaks information.
- Calibration can drift over time as prevalence changes.
Example¶
If, among cases with predictions near 0.8, about 80% are positive in reality, that bin is well calibrated.
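This worked example can be simulated directly: draw labels whose true positive rate is exactly 0.8 and check that the observed rate in that bin lands near 0.8 (the sample size and seed are illustrative choices):

```python
import random

random.seed(42)
n = 10_000
p_hat = 0.8  # every prediction in this toy set falls in the "near 0.8" bin

# Labels drawn from Bernoulli(0.8): the model is well calibrated at this bin.
labels = [1 if random.random() < p_hat else 0 for _ in range(n)]
observed_rate = sum(labels) / n
print(observed_rate)  # close to 0.8, so the bin is well calibrated
```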
How to Compute (Pseudocode)¶
Input: model scores/probabilities on a held-out calibration set, plus labels
Output: calibrated prediction function g(·)
fit a calibration model g on the held-out calibration data
  (examples: Platt scaling via logistic fit, or isotonic regression)
for a new model score p_hat:
  return g(p_hat)
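The pseudocode above can be sketched concretely with the isotonic option, using a small pool-adjacent-violators (PAVA) fit and a step-function g. This is a minimal unit-weight illustration, not a full implementation:

```python
import bisect

def pava(values):
    """Pool adjacent violators: the best nondecreasing least-squares
    fit to `values` (unit weights)."""
    blocks = []  # each block: [mean, count]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while a block sits below its predecessor.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks.pop()
            blocks.append([(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2])
    out = []
    for m, c in blocks:
        out.extend([m] * c)
    return out

def fit_isotonic_calibrator(p_hat, labels):
    """Sort by score, PAVA the labels, return a step function g."""
    order = sorted(range(len(p_hat)), key=lambda i: p_hat[i])
    xs = [p_hat[i] for i in order]
    ys = pava([labels[i] for i in order])
    def g(score):
        # Step lookup: fitted value at the largest calibration score <= score.
        idx = bisect.bisect_right(xs, score) - 1
        return ys[max(idx, 0)]
    return g

# Toy calibration set with one ranking "violation" at score 0.6.
p_hat = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]
g = fit_isotonic_calibrator(p_hat, labels)
```

PAVA pools the 1-then-0 violation at scores 0.3 and 0.6 into a shared value of 0.5, so g is guaranteed nondecreasing in the raw score.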
Complexity¶
- Time: Depends on the calibration method (for example, Platt scaling is typically cheap iterative optimization; isotonic regression is near-linear after sorting)
- Space: \(O(n)\) to store calibration examples or fitted calibration map (method-dependent)
- Assumptions: \(n\) is calibration-set size; calibration is fit on held-out data separate from the final test set