Calinski-Harabasz Score¶
Formula¶
\[
\mathrm{CH}
=
\frac{\operatorname{tr}(B_k)/(k-1)}{\operatorname{tr}(W_k)/(n-k)}
\]
Parameters¶
- \(n\): number of points
- \(k\): number of clusters
- \(W_k\): within-cluster dispersion matrix
- \(B_k\): between-cluster dispersion matrix
- \(\operatorname{tr}(\cdot)\): matrix trace
What it means¶
Ratio of between-cluster dispersion to within-cluster dispersion, adjusted for degrees of freedom.
What it's used for¶
- Internal clustering validation.
- Selecting number of clusters \(k\) (often maximize CH).
Key properties¶
- Higher is better
- Common for k-means-like clustering
- Scale-invariant under uniform scaling of distances
Common gotchas¶
- Often favors larger \(k\) on some datasets.
- Less reliable for highly non-spherical or variable-density clusters.
Example¶
If clusters are tight (small \(W_k\)) and far apart (large \(B_k\)), CH is large.
How to Compute (Pseudocode)¶
Input: data points, cluster labels (and metric-specific settings)
Output: Calinski-Harabasz index
compute the per-cluster and/or per-point quantities required by the metric
aggregate them according to the metric definition
return the metric value
Complexity¶
- Time: Depends on the metric and implementation; many clustering validation metrics require at least \(O(n)\) scans, and some need pairwise distances (up to \(O(n^2)\))
- Space: Depends on whether pairwise distances are materialized (from \(O(1)\)/\(O(k)\) summaries up to \(O(n^2)\))
- Assumptions: \(n\) points and cluster labels are given; exact cost depends on distance computation/caching and metric-specific formulas