Calinski-Harabasz Score¶

Formula¶

\[ \mathrm{CH} = \frac{\operatorname{tr}(B_k)/(k-1)}{\operatorname{tr}(W_k)/(n-k)} \]

Parameters¶

\(n\): number of points
\(k\): number of clusters
\(W_k\): within-cluster dispersion matrix
\(B_k\): between-cluster dispersion matrix
\(\operatorname{tr}(\cdot)\): matrix trace

What it means¶

Ratio of between-cluster dispersion to within-cluster dispersion, adjusted for degrees of freedom.

What it's used for¶

Internal clustering validation.
Selecting number of clusters \(k\) (often maximize CH).

Key properties¶

Higher is better
Common for k-means-like clustering
Scale-invariant under uniform scaling of distances

Common gotchas¶

Often favors larger \(k\) on some datasets.
Less reliable for highly non-spherical or variable-density clusters.

Example¶

If clusters are tight (small \(W_k\)) and far apart (large \(B_k\)), CH is large.

How to Compute (Pseudocode)¶

Input: data points, cluster labels (and metric-specific settings)
Output: Calinski-Harabasz index

compute the per-cluster and/or per-point quantities required by the metric
aggregate them according to the metric definition
return the metric value

Complexity¶

Time: Depends on the metric and implementation; many clustering validation metrics require at least \(O(n)\) scans, and some need pairwise distances (up to \(O(n^2)\))
Space: Depends on whether pairwise distances are materialized (from \(O(1)\)/\(O(k)\) summaries up to \(O(n^2)\))
Assumptions: \(n\) points and cluster labels are given; exact cost depends on distance computation/caching and metric-specific formulas