
Probability Calibration

Formula

\[ P(Y=1\mid \hat p = p) \approx p \]

Plot

Reliability curve vs ideal diagonal: the perfect-calibration line \(y = x\) compared with an example model curve \(y = x^{0.8}\), plotted over predicted probabilities in \([0, 1]\).

Parameters

  • \(\hat p\): the model's predicted probability for the positive class
  • \(p\): a probability value in \([0, 1]\) at which calibration is assessed (e.g. a bin target)

What it means

A classifier is calibrated when its predicted probabilities match observed frequencies: among examples that receive predicted probability \(p\), roughly a fraction \(p\) are positive.

What it's used for

  • Risk estimation, ranking-to-decision systems, and cost-sensitive thresholding (see the sketch after this list).
  • Comparing models beyond discrimination metrics like AUC.
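
Cost-sensitive thresholding becomes a one-line decision rule once probabilities are calibrated. A minimal sketch in Python; the cost values COST_FP and COST_FN are hypothetical and not from the source:

import numpy as np

# Hypothetical misclassification costs (illustrative only).
COST_FP = 1.0   # cost of acting on a false positive
COST_FN = 5.0   # cost of missing a true positive

# With calibrated probabilities, expected cost is minimized by predicting
# positive whenever p_hat > COST_FP / (COST_FP + COST_FN).
threshold = COST_FP / (COST_FP + COST_FN)

def decide(p_hat: np.ndarray) -> np.ndarray:
    """Map calibrated probabilities to 0/1 decisions under the costs above."""
    return (p_hat > threshold).astype(int)

print(decide(np.array([0.05, 0.2, 0.9])))  # [0 1 1] with threshold ~0.167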

Key properties

  • A model can have high AUC and poor calibration.
  • Calibration can be improved post hoc (Platt scaling, isotonic regression).
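
A sketch of post hoc calibration with scikit-learn (assumed available; the dataset and base model below are illustrative, not from the source). method="sigmoid" corresponds to Platt scaling and method="isotonic" to isotonic regression.

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Internal cross-validation keeps the calibration fit separate from the
# base-model fit on each fold.
calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=1000),
                                    method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
probs = calibrated.predict_proba(X_test)[:, 1]  # calibrated P(Y=1)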

Common gotchas

  • Calibrating on the test set leaks information.
  • Calibration can drift over time as prevalence changes.

Example

If examples with predictions near 0.8 are positive about 80% of the time, that bin is well calibrated.
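
A small sketch of checking this by binning, assuming NumPy; the synthetic data is illustrative, with labels drawn so the true positive rate follows \(\hat p^{0.8}\), echoing the example model curve in the plot above:

import numpy as np

def reliability_bins(p_hat, y, n_bins=10):
    """Print mean prediction vs. observed positive rate per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(p_hat, edges[1:-1])      # bin index 0..n_bins-1
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            print(f"bin [{edges[b]:.1f}, {edges[b+1]:.1f}): "
                  f"mean p_hat={p_hat[mask].mean():.2f}, "
                  f"observed rate={y[mask].mean():.2f}")

rng = np.random.default_rng(0)
p_hat = rng.uniform(size=5000)
y = (rng.uniform(size=5000) < p_hat ** 0.8).astype(int)  # miscalibrated on purpose
reliability_bins(p_hat, y)  # observed rates exceed mean p_hat, most visibly in low bins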

How to Compute (Pseudocode)

Input: model scores/probabilities on calibration set, labels
Output: calibrated prediction function g(·)

fit a calibration model g using held-out calibration data
  examples: Platt scaling (logistic) or isotonic regression

for a new model score p_hat:
  return g(p_hat)
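
A minimal runnable version of the pseudocode above, assuming scikit-learn and using isotonic regression for g (Platt scaling would instead fit a logistic model on the scores); the names scores_cal and y_cal are illustrative:

import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_calibrator(scores_cal: np.ndarray, y_cal: np.ndarray) -> IsotonicRegression:
    """Fit g on held-out calibration scores and 0/1 labels."""
    g = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    g.fit(scores_cal, y_cal)
    return g

# Usage for a new model score p_hat:
# g = fit_calibrator(scores_cal, y_cal)
# calibrated_p = g.predict(np.asarray([p_hat]))[0]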

Complexity

  • Time: Depends on the calibration method (for example, Platt scaling is a cheap low-dimensional logistic fit; isotonic regression runs in \(O(n)\) after an \(O(n \log n)\) sort)
  • Space: \(O(n)\) to store calibration examples or fitted calibration map (method-dependent)
  • Assumptions: \(n\) is calibration-set size; calibration is fit on held-out data separate from the final test set

See also