K-Means Clustering¶
Formula¶
\[
\min_{\{C_k,\mu_k\}}\sum_{k=1}^{K}\sum_{x_i\in C_k}\|x_i-\mu_k\|_2^2
\]
Parameters¶
- \(K\): number of clusters
- \(\mu_k\): cluster centroid
- \(C_k\): assigned points
What it means¶
Partitions data into \(K\) clusters by alternating assignment and centroid updates.
What it's used for¶
- Unsupervised segmentation and prototype-based summarization.
- Preprocessing for compression or downstream labeling.
Key properties¶
- Sensitive to initialization and feature scaling.
- Assumes roughly spherical clusters under Euclidean distance.
Common gotchas¶
- Empty clusters and local minima can occur.
- Use multiple restarts or k-means++ initialization.
Example¶
Customer vectors are assigned to the nearest centroid, then centroids are recomputed iteratively.
How to Compute (Pseudocode)¶
Input: data points x[1..n] in R^d, number of clusters K, iterations T
Output: centroids mu[1..K], assignments c[1..n]
initialize centroids mu[1..K] (for example, k-means++)
for iter from 1 to T:
# Assignment step
for each point x_i:
c[i] <- index of nearest centroid
# Update step
for each cluster k:
mu[k] <- mean of points assigned to cluster k
if cluster k is empty:
reinitialize mu[k]
if assignments/centroids stop changing enough:
break
return mu, c
Complexity¶
- Time: \(O(TnKd)\) for \(T\) iterations, \(n\) points, \(K\) clusters, and \(d\) features (dominant cost is assignment step)
- Space: \(O(nd + Kd + n)\) including data, centroids, and assignments
- Assumptions: Dense Euclidean k-means with full-batch updates; initialization and restart costs are excluded from the main bound