DBSCAN¶
Formula¶
\[
\text{core point } x_i: |\{x_j: d(x_i,x_j)\le \varepsilon\}| \ge \text{MinPts}
\]
Parameters¶
- \(\varepsilon\): neighborhood radius
- \(\text{MinPts}\): minimum points for a dense region
What it means¶
Density-based clustering groups points in dense regions and labels sparse points as noise.
What it's used for¶
- Finding arbitrarily shaped clusters.
- Outlier-aware clustering without specifying \(K\).
Key properties¶
- Naturally identifies noise points.
- Works poorly when cluster densities differ strongly.
Common gotchas¶
- Feature scaling and distance choice matter a lot.
- Parameter tuning can be difficult in high dimensions.
Example¶
GPS points form route clusters while isolated points remain labeled as noise.
How to Compute (Pseudocode)¶
Input: points x[1..n], radius eps, MinPts
Output: cluster labels (including noise)
mark all points as unvisited
cluster_id <- 0
for each point p:
if p is visited:
continue
mark p visited
N <- neighbors within distance eps of p
if size(N) < MinPts:
label p as noise (tentatively)
continue
cluster_id <- cluster_id + 1
assign p to cluster_id
expand cluster by repeatedly visiting neighbors:
if a neighbor q is unvisited, mark visited and query its eps-neighbors
if q is a core point, add its neighbors to the expansion frontier
assign density-reachable points to cluster_id
return labels
Complexity¶
- Time: Depends on neighborhood-query implementation; typically \(O(n\log n)\) to \(O(n^2)\), with \(O(n^2)\) worst case for naive distance checks
- Space: \(O(n)\) to \(O(n^2)\) depending on whether neighborhoods/distances are materialized
- Assumptions: \(n\) points; indexing structures (KD-tree/ball tree/grid) and data dimensionality strongly affect runtime