Expected Calibration Error (ECE)¶
Formula¶
\[
\operatorname{ECE} = \sum_{m=1}^M \frac{n_m}{N}\,\big|\operatorname{acc}(B_m) - \operatorname{conf}(B_m)\big|
\]
Plot¶
Figure: reliability curve and the ideal diagonal (ECE intuition). The identity line \(y = x\) marks perfect calibration; an example reliability curve (here \(y = x^{0.8}\)) deviates from it, and the per-bin gap between the two is what ECE averages.
Parameters¶
- \(B_m\): probability bin \(m\)
- \(n_m\): samples in bin \(m\)
- \(\operatorname{acc}\): empirical accuracy in bin
- \(\operatorname{conf}\): average predicted confidence in bin
What it means¶
Measures the mismatch between predicted probabilities and observed frequencies: a model is calibrated when, among predictions made with confidence \(p\), a fraction \(p\) are correct.
What it's used for¶
- Checking how well predicted probabilities match frequencies.
- Comparing calibration across models.
Key properties¶
- Lower is better; 0 is perfectly calibrated
- Depends on binning choice
Common gotchas¶
- ECE is sensitive to number of bins and binning strategy.
- Not differentiable; not suited as a direct training loss.
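The binning sensitivity is easy to see numerically. The sketch below (a minimal NumPy version with equal-width bins and synthetic data; the function name and data are illustrative, not from the source) computes ECE for the same predictions under several bin counts:

```python
import numpy as np

def ece_equal_width(conf, correct, m):
    """ECE with m equal-width bins over [0, 1]."""
    # Assign each prediction to a bin; clamp conf == 1.0 into the last bin.
    bins = np.minimum((conf * m).astype(int), m - 1)
    total = 0.0
    for b in range(m):
        mask = bins == b
        if not mask.any():
            continue  # empty bins contribute nothing
        # mask.mean() is n_m / N; the abs gap is |acc_m - conf_m|.
        total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return total

rng = np.random.default_rng(0)
conf = rng.uniform(0.55, 0.95, size=500)
correct = (rng.random(500) < conf).astype(float)  # calibrated on average

# The same predictions typically yield different ECE values per bin count.
for m in (5, 10, 20):
    print(f"M = {m:2d}  ECE = {ece_equal_width(conf, correct, m):.4f}")
```

Reported ECE values should therefore always state the number of bins and the binning strategy used.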
Example¶
If all predictions fall in a single bin with \(\operatorname{conf}(B_1)=0.8\) and \(\operatorname{acc}(B_1)=0.75\), then \(\mathrm{ECE}=\frac{N}{N}\,|0.75-0.8|=0.05\).
How to Compute (Pseudocode)¶
Input: predicted confidences p[1..N], labels y[1..N], number of bins M
Output: ECE
partition predictions into bins B_1..B_M
ECE <- 0
for each bin B_m:
    if B_m is empty:
        continue
    conf_m <- average confidence in B_m
    acc_m  <- empirical accuracy in B_m
    ECE <- ECE + (|B_m| / N) * |acc_m - conf_m|
return ECE
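The pseudocode above translates directly to Python. This is a minimal sketch using NumPy with equal-width bins; the function and argument names are illustrative:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Fixed-width-bin ECE.

    confidences: predicted probability of the predicted class, in [0, 1]
    correct: per-sample 1/0 (or bool), whether the prediction was right
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = confidences.size
    # Equal-width bins over [0, 1]; clamp conf == 1.0 into the last bin.
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for m in range(n_bins):
        in_bin = bin_ids == m
        n_m = in_bin.sum()
        if n_m == 0:
            continue  # empty bins contribute nothing
        acc_m = correct[in_bin].mean()       # empirical accuracy in bin
        conf_m = confidences[in_bin].mean()  # average confidence in bin
        ece += (n_m / n) * abs(acc_m - conf_m)
    return ece

# Reproduces the worked example: one occupied bin, conf = 0.8, acc = 0.75.
print(expected_calibration_error([0.8] * 4, [1, 1, 1, 0]))  # ≈ 0.05
```

A single pass assigns bins in \(O(N)\); the aggregation loop is \(O(M)\), matching the complexity stated below.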
Complexity¶
- Time: \(O(N + M)\) after bin assignment (often \(O(N)\) overall for fixed bins)
- Space: \(O(M)\) for bin aggregates/counters (plus optional stored bin assignments)
- Assumptions: Fixed-bin ECE shown; adaptive binning and multiclass calibration variants use different procedures