Density-Based Clustering Validation (DBCV)¶
Formula¶
\[
\mathrm{DBCV}
=
\sum_{i=1}^k \frac{|C_i|}{n}\,V(C_i)
\]
\[
V(C_i)=\frac{\mathrm{Sep}(C_i)-\mathrm{Spar}(C_i)}
{\max\{\mathrm{Sep}(C_i),\mathrm{Spar}(C_i)\}}
\]
Parameters¶
- \(C_i\): cluster \(i\)
- \(|C_i|\): size of cluster \(i\)
- \(n\): total number of clustered points
- \(\mathrm{Spar}(C_i)\): density-based sparseness (internal density weakness)
- \(\mathrm{Sep}(C_i)\): density-based separation from the nearest other cluster
What it means¶
Internal clustering score designed for density-based clusters (e.g., arbitrary shapes), balancing density separation against internal sparseness.
What it's used for¶
- Evaluating DBSCAN/HDBSCAN-style clustering.
- Comparing density-based clusterings where centroid-based metrics fail.
Key properties¶
- Typically interpreted on \([-1,1]\) with higher better
- Better aligned with non-convex clusters than centroid-based indices
Common gotchas¶
- Multiple DBCV implementations differ in exact density graph details.
- Noise/outlier handling is implementation-specific and can change the score.
Example¶
If clusters are internally dense and well-separated in density space, \(\mathrm{Sep}\gg \mathrm{Spar}\), so DBCV is high.
How to Compute (Pseudocode)¶
Input: data points, cluster labels (and metric-specific settings)
Output: density-based clustering validation metric
compute the per-cluster and/or per-point quantities required by the metric
aggregate them according to the metric definition
return the metric value
Complexity¶
- Time: Depends on the metric and implementation; many clustering validation metrics require at least \(O(n)\) scans, and some need pairwise distances (up to \(O(n^2)\))
- Space: Depends on whether pairwise distances are materialized (from \(O(1)\)/\(O(k)\) summaries up to \(O(n^2)\))
- Assumptions: \(n\) points and cluster labels are given; exact cost depends on distance computation/caching and metric-specific formulas