Skip to content

Calinski-Harabasz Score

Formula

\[ \mathrm{CH} = \frac{\operatorname{tr}(B_k)/(k-1)}{\operatorname{tr}(W_k)/(n-k)} \]

Parameters

  • \(n\): number of points
  • \(k\): number of clusters
  • \(W_k\): within-cluster dispersion matrix
  • \(B_k\): between-cluster dispersion matrix
  • \(\operatorname{tr}(\cdot)\): matrix trace

What it means

Ratio of between-cluster dispersion to within-cluster dispersion, adjusted for degrees of freedom.

What it's used for

  • Internal clustering validation.
  • Selecting number of clusters \(k\) (often maximize CH).

Key properties

  • Higher is better
  • Common for k-means-like clustering
  • Scale-invariant under uniform scaling of distances

Common gotchas

  • Often favors larger \(k\) on some datasets.
  • Less reliable for highly non-spherical or variable-density clusters.

Example

If clusters are tight (small \(W_k\)) and far apart (large \(B_k\)), CH is large.

How to Compute (Pseudocode)

Input: data points, cluster labels (and metric-specific settings)
Output: Calinski-Harabasz index

compute the per-cluster and/or per-point quantities required by the metric
aggregate them according to the metric definition
return the metric value

Complexity

  • Time: Depends on the metric and implementation; many clustering validation metrics require at least \(O(n)\) scans, and some need pairwise distances (up to \(O(n^2)\))
  • Space: Depends on whether pairwise distances are materialized (from \(O(1)\)/\(O(k)\) summaries up to \(O(n^2)\))
  • Assumptions: \(n\) points and cluster labels are given; exact cost depends on distance computation/caching and metric-specific formulas

See also