Introduction to K-Means Clustering

Unsupervised Learning & Clustering

In “supervised learning” the data set already has features derived from the data, and the model learns from these features.
In “unsupervised learning” there are no predetermined features. The goal is to derive these features that exists in the data.
Clustering is a unsupervised task where observations are grouped (clusters) so that data points in the same group have a similar feature.
K-means is the clustering algorithm that is used to group the data points.

We will use the built-in iris dataset as a running example throughout this presentation.

K-Means Algorithm

The K-means algorithm assigns \(n\) observations into \(K\) clusters. Each cluster is created using what is know as a centroid (mean vector). The job of the algorithm is to move data points close to the centroid. The algorithm alternates two steps until the cluster assignments stop changing:

Assignment step: assign each data point to the nearest centroid.
Update step: recompute each centroid as the mean of the data points assigned to it.

Euclidean Distance

The idea of the “closest” cluster is determined by a distance metric. For two points \(\mathbf{x} = (x_1, \dots, x_p)\) and \(\mathbf{y} = (y_1, \dots, y_p)\), K-means uses the Euclidean distance formula:

\[ d(\mathbf{x}, \mathbf{y}) = \sqrt{\textstyle\sum_{j=1}^{p} \left( x_j - y_j \right)^2}. \]

A data point is assigned to cluster \(k\) where it’s centroid \(\boldsymbol{\mu}_k\) minimizes \(d(\mathbf{x}, \boldsymbol{\mu}_k)\).

The Objective

The goal: make each cluster as “tight” as possible. K-means tries to make the total distance from data points to their cluster’s centers as small as it can. We measure this total distance with the Within Cluster Sum of Squares (WCSS), written \(J\):

\[ J = \sum_{k=1}^{K} \sum_{\mathbf{x} \in C_k} \left\lVert \mathbf{x} - \boldsymbol{\mu}_k \right\rVert^{2}. \]

For every data point \(\mathbf{x}\), we take the distance to its cluster’s center \(\boldsymbol{\mu}_k\), then add all the distance’s across all points. A smaller \(J\) means a tighter cluster.

The center \(\boldsymbol{\mu}_k\) is the average of the data points in that cluster.

Every iteration of the algorithm makes \(J\) smaller or leaves it the same, so it always converges.

Running the Algorithm

We will choose \(K = 3\), then we can fit the model with kmeans().

X <- scale(iris[, 1:4]) 
km <- kmeans(X, centers = 3, nstart = 25)

km$size            # number of points in each cluster

## [1] 50 53 47

centers = 3 - creates three clusters.
nstart = 25 - tries 25 iterations and returns the one with the lowest WCSS

2D Visualization of Clusters

Using petal length and petal width shows the three clusters separated. Each shaded circle contains the data points belonging to one cluster.

3D Visualization with Plotly

K-means works in any number of dimensions. Here is an interactive 3D scatter plot of three measurements.

Summary

K-means groups data points by minimizing the within-cluster sum of squares \(J\).
It alternates between assignment (nearest centroid by distance) and update (recompute means) until convergence.
Visualize results in 2D and interactive 3D to validate the clustering.