K-Means Clustering

Formula

\[ \min_{\{C_k,\mu_k\}}\sum_{k=1}^{K}\sum_{x_i\in C_k}\|x_i-\mu_k\|_2^2 \]

Parameters

  • \(K\): number of clusters
  • \(\mu_k\): centroid of cluster \(k\)
  • \(C_k\): set of points assigned to cluster \(k\)
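As a sketch, the objective above (the within-cluster sum of squares) can be evaluated directly for a given partition. The toy points, centroids, and assignments below are illustrative assumptions, not data from the text.

```python
# Within-cluster sum of squares: the k-means objective, in pure Python.
def wcss(points, centroids, assignments):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for x, k in zip(points, assignments):
        mu = centroids[k]
        total += sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return total

# Illustrative 2-D data: two tight groups near (0, 0) and (10, 10).
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
centroids = [(0.5, 0.0), (10.5, 10.0)]
assignments = [0, 0, 1, 1]
print(wcss(points, centroids, assignments))  # -> 1.0
```

Swapping the assignments (sending each point to the far centroid) makes the objective much larger, which is exactly what the minimization penalizes.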

What it means

Partitions data into \(K\) clusters by alternating assignment and centroid updates.

What it's used for

  • Unsupervised segmentation and prototype-based summarization.
  • Preprocessing for compression or downstream labeling.

Key properties

  • Sensitive to initialization and feature scaling.
  • Assumes roughly spherical clusters under Euclidean distance.
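Because Euclidean distance weights every feature equally, features on large scales dominate the clustering. A common remedy is z-score standardization before running k-means; the age/income columns below are hypothetical examples, not data from the text.

```python
import math

def standardize(column):
    """Z-score a feature column so it contributes comparably to Euclidean distance."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(var) or 1.0  # guard against constant columns
    return [(v - mean) / std for v in column]

# Unscaled, income in dollars dwarfs age in years, so distance is driven
# almost entirely by income; after standardization both features have
# mean 0 and unit variance.
ages = [25.0, 35.0, 45.0, 55.0]
incomes = [30000.0, 50000.0, 70000.0, 90000.0]
scaled_ages = standardize(ages)
scaled_incomes = standardize(incomes)
```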

Common gotchas

  • Empty clusters and local minima can occur.
  • Use multiple restarts or k-means++ initialization.
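A minimal sketch of k-means++ seeding: each new centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far, which spreads the seeds out and makes bad local minima less likely. The seed value and toy points are assumptions for reproducibility.

```python
import random

def kmeans_pp_init(points, K, seed=0):
    """k-means++ seeding: sample each new centroid with probability
    proportional to squared distance to the nearest existing centroid."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    centroids = [rng.choice(points)]
    while len(centroids) < K:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(sq_dist(x, c) for c in centroids) for x in points]
        total = sum(d2)
        if total == 0:  # all points coincide with a centroid; pick any
            centroids.append(rng.choice(points))
            continue
        # Weighted sampling: walk the cumulative weights until we pass r.
        r = rng.uniform(0, total)
        acc = 0.0
        for x, w in zip(points, d2):
            acc += w
            if acc >= r:
                centroids.append(x)
                break
    return centroids
```

On data with well-separated groups, the squared-distance weighting makes seeds from different groups far more likely than two seeds from the same group.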

Example

Customer vectors are assigned to the nearest centroid, then centroids are recomputed iteratively.
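The assignment half of that example can be sketched as follows; the customer features (visits per month, average basket size) and centroid values are hypothetical.

```python
def nearest_centroid(x, centroids):
    """Index of the centroid with the smallest squared Euclidean distance to x."""
    dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centroids]
    return dists.index(min(dists))

# Hypothetical customers as (visits per month, average basket size) vectors.
customers = [(2.0, 15.0), (3.0, 18.0), (20.0, 90.0), (22.0, 95.0)]
centroids = [(2.5, 16.0), (21.0, 92.0)]
assignments = [nearest_centroid(x, centroids) for x in customers]
print(assignments)  # -> [0, 0, 1, 1]
```

The update half would then replace each centroid with the mean of its assigned customers, and the two steps repeat until assignments stabilize.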

How to Compute (Pseudocode)

Input: data points x[1..n] in R^d, number of clusters K, iterations T
Output: centroids mu[1..K], assignments c[1..n]

initialize centroids mu[1..K] (for example, k-means++)

for iter from 1 to T:
  # Assignment step
  for each point x_i:
    c[i] <- index of nearest centroid

  # Update step
  for each cluster k:
    mu[k] <- mean of points assigned to cluster k
    if cluster k is empty:
      reinitialize mu[k]

  if assignments are unchanged (or centroids moved less than a tolerance):
    break

return mu, c
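The pseudocode above translates almost line for line into Python. This is a minimal full-batch sketch: it uses simple random sampling for initialization (k-means++ and multiple restarts, recommended above, are omitted for brevity), and the seed parameter is an assumption for reproducibility.

```python
import random

def kmeans(points, K, T=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean updates."""
    rng = random.Random(seed)
    centroids = rng.sample(points, K)  # simple init; k-means++ is usually better
    assignments = None

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    for _ in range(T):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(K), key=lambda k: sq_dist(x, centroids[k])) for x in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        for k in range(K):
            members = [x for x, c in zip(points, new_assignments) if c == k]
            if members:
                d = len(members[0])
                centroids[k] = tuple(
                    sum(m[j] for m in members) / len(members) for j in range(d)
                )
            else:
                centroids[k] = rng.choice(points)  # reseed an empty cluster
        if new_assignments == assignments:
            break  # converged: no assignment changed
        assignments = new_assignments
    return centroids, assignments
```

On two well-separated groups this recovers the grouping regardless of which points the random initialization happens to pick.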

Complexity

  • Time: \(O(TnKd)\) for \(T\) iterations, \(n\) points, \(K\) clusters, and \(d\) features (dominant cost is assignment step)
  • Space: \(O(nd + Kd + n)\) including data, centroids, and assignments
  • Assumptions: Dense Euclidean k-means with full-batch updates; initialization and restart costs are excluded from the main bound