Davies-Bouldin Index¶

Formula¶

\[ \mathrm{DB}=\frac{1}{k}\sum_{i=1}^k \max_{j\ne i}\left(\frac{S_i+S_j}{M_{ij}}\right) \]

Parameters¶

\(k\): number of clusters
\(S_i\): within-cluster scatter for cluster \(i\) (e.g., mean distance to centroid)
\(M_{ij}\): distance between centroids of clusters \(i\) and \(j\)

What it means¶

Compares cluster spread to separation; each cluster is penalized by its most similar other cluster.

What it's used for¶

Internal validation when no ground-truth labels exist.
Choosing \(k\) for centroid-based clustering.

Key properties¶

Lower is better
Uses both compactness and separation
Fast to compute once centroids/scatters are available

Common gotchas¶

Assumes centroid/spread summaries are meaningful.
Can be misleading for non-convex clusters.

Example¶

If each cluster has a nearby overlapping neighbor, the max ratios are large, so DB increases (worse clustering).

How to Compute (Pseudocode)¶

Input: data points, cluster labels (and metric-specific settings)
Output: Davies-Bouldin index

compute the per-cluster and/or per-point quantities required by the metric
aggregate them according to the metric definition
return the metric value

Complexity¶

Time: Depends on the metric and implementation; many clustering validation metrics require at least \(O(n)\) scans, and some need pairwise distances (up to \(O(n^2)\))
Space: Depends on whether pairwise distances are materialized (from \(O(1)\)/\(O(k)\) summaries up to \(O(n^2)\))
Assumptions: \(n\) points and cluster labels are given; exact cost depends on distance computation/caching and metric-specific formulas