K-Means Clustering

Clustering is an unsupervised learning technique with the goal of grouping sets of objects based on similarity. Consider the plot below:

library(ggplot2)

# Simulate three clusters of points: rnorm() recycles the three means,
# so each of the three clusters receives 10 of the 30 points.
set.seed(25)
y_means <- c(2, 14, 28)
x_means <- c(2, 15, 20)
y_data <- rnorm(30, y_means)
x_data <- rnorm(30, x_means)
fake_data <- data.frame(x = x_data, y = y_data)

g <- ggplot(fake_data, aes(x = x, y = y))
g + geom_point() + theme_minimal()

We can classify points as similar based on “proximity” as illustrated below:

g + geom_point() +
  annotate("rect", xmin = 0, xmax = 5, ymin = 0, ymax = 5, alpha = .2, fill = "red") +
  annotate("rect", xmin = 13, xmax = 17, ymin = 10, ymax = 15, alpha = .2, fill = "red") +
  annotate("rect", xmin = 17, xmax = 22, ymin = 25, ymax = 31, alpha = .2, fill = "red") +
  theme_minimal()

A standard k-means algorithm can achieve this type of grouping for quantitative data in any number of dimensions, as long as a value of k is specified. The algorithm then produces k clusters of best fit as output.

How it works

A standard k-means algorithm (known as naive k-means) uses an iterative refinement technique. Given a set of k means (either specified by the user as a set of starting cluster centroids, or chosen at random by the algorithm when only a positive integer is supplied), the algorithm alternates between two steps:

  1. Assign each data point to the cluster whose mean is nearest, i.e., the one with the smallest squared Euclidean distance.
  2. Recalculate each cluster's mean (centroid) from the observations currently assigned to it.

The algorithm converges when there are no changes in assignments.
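To make these two steps concrete, here is a minimal sketch of the naive algorithm written directly in R. The function name naive_kmeans is only illustrative (it is not part of any package); in practice you would use the built-in kmeans() function shown later.

naive_kmeans <- function(data, k, max_iter = 100) {
  data <- as.matrix(data)
  # Start from k randomly chosen observations as the initial centroids
  centroids <- data[sample(nrow(data), k), , drop = FALSE]
  assignment <- rep(0L, nrow(data))
  for (iter in seq_len(max_iter)) {
    # Step 1: squared Euclidean distance from every point to every centroid (n x k)
    dists <- sapply(seq_len(k), function(j) {
      rowSums((data - matrix(centroids[j, ], nrow(data), ncol(data), byrow = TRUE))^2)
    })
    new_assignment <- max.col(-dists)             # nearest centroid for each point
    if (all(new_assignment == assignment)) break  # converged: no changes in assignments
    assignment <- new_assignment
    # Step 2: recompute each centroid as the mean of its assigned points
    for (j in seq_len(k)) {
      members <- data[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids)
}

# Example: cluster the simulated data into 3 groups
naive_kmeans(fake_data, 3)$cluster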

Key take-aways

  1. A naive k-means algorithm is unable to determine the optimal k value for a given dataset. For example, running k-means on the dataset plotted above with a k value of 2 will yield 2 clusters, even though, to the naked eye, 3 is the better value. Determining an optimal k value programmatically is covered below in the section on choosing a good k value.

  2. The naive k-means algorithm is not guaranteed to produce an optimal set of clusters; it converges to a local optimum that depends on the starting centroids. Apart from the specified k value, cluster quality therefore depends on allowing enough iterations for the algorithm to converge and, more importantly, on trying several different starting configurations (for example via the nstart argument of R's kmeans()).

  3. Re-running k-means on the same dataset several times will not always yield the same clusters if starting cluster centroids are not specified. Since the algorithm chooses starting centroids at random in that situation, data points are likely to be placed in different clusters from run to run. In practice, it can help to “warm-start” the k-means algorithm by specifying starting centroids, as shown in the sketch after this list.

  4. Cluster centroids can be influenced heavily by outliers. Sometimes, outliers get placed in their own cluster instead of (maybe) being ignored. Consider removing or clipping outliers before clustering.

  5. K-means underperforms when the optimal clusters are non-spherical (first image). Other clustering algorithms, like spectral clustering (second image), do better than k-means on non-spherical data.

KMeans on non-spherical data - Source: Reference (4)

Spectral clustering on the same data - Source: Reference (4)

  6. K-Means clustering is a form of hard clustering, where each data point is assigned to exactly one cluster. Other clustering algorithms fall into a category known as soft clustering, where data points can belong to multiple clusters with associated probabilities. A common example of a soft clustering algorithm is the Gaussian Mixture Model (GMM); a soft variant of k-means is the fuzzy k-means algorithm.
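Take-aways 2 and 3 can be illustrated with R's built-in kmeans() function (covered in more detail in the next section): either repeat many random starts with nstart, or warm-start with explicit centroids. The starting values below are assumptions chosen to roughly match the simulated means; they are not the only sensible choice.

set.seed(1)

# Option 1: keep the best of 25 random starts
km_restarts <- kmeans(fake_data, centers = 3, nstart = 25)

# Option 2: warm start with explicit starting centroids (one row per cluster)
start_centroids <- matrix(c( 2,  2,
                            15, 14,
                            20, 28), ncol = 2, byrow = TRUE,
                          dimnames = list(NULL, c("x", "y")))
km_warm <- kmeans(fake_data, centers = start_centroids)

# Compare the total within-cluster sum of squares of the two fits
km_restarts$tot.withinss
km_warm$tot.withinss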

Running a k-means algorithm in R

R has a built-in function, kmeans(), for k-means clustering. The output of this function is an object with the following important properties (not all are listed):

  1. cluster - a cluster assignment label (from 1 to k) for each input data point.
  2. centers - a matrix of cluster centers.
  3. tot.withinss - the total within-cluster sum of squares.

The graph below is a visualization of the cluster output from a k-means model.

# Signature of the built-in function:
# kmeans(x, centers, iter.max = 10, nstart = 1,
#        algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), trace = FALSE)
k_obj <- kmeans(fake_data, centers = 3)

# Attach the cluster labels to the data for plotting
fake_data$cluster <- k_obj$cluster
ggplot(fake_data, aes(x = x, y = y, col = as.factor(cluster))) +
  geom_jitter() +
  labs(col = "Cluster", title = "Cluster output after running k-means") +
  theme_minimal()
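The other properties mentioned above can be inspected directly on the fitted object (exact values will vary from run to run because the starting centroids are random):

k_obj$centers        # matrix of the 3 cluster centroids
k_obj$tot.withinss   # total within-cluster sum of squares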

Choosing a good k value

For more complicated datasets, an optimal k value becomes increasingly subjective. There are more than 30 published methods for identifying the optimal number of clusters. Two such methods, the elbow method and the silhouette method, are discussed here. First, install the following libraries, which provide functions for running these methods:

#install.packages(c("NbClust","factoextra"))

The Elbow Method

One way to measure the quality of a clustering is by minimizing the total intra-cluster sum of squares (tot.withinss in the model output) over several values of k. The intra-cluster sum of squares for a single cluster is computed as follows:

\[\sum^{n}_{i = 1}{(y_i - \bar{y})^2}\] where \(y_i\) is an observation in the cluster, \(\bar{y}\) is the cluster mean, and \(n\) is the number of observations in the cluster. Summing this quantity over all k clusters gives the total within-cluster sum of squares (tot.withinss). In the Elbow method, a k-means model is run with increasing k values and the total within-cluster sum of squares is computed for each k value. A plot of k values against this total resembles an elbow. At the elbow point (in this case, a k value of 3), increasing k further produces only small additional reductions in the sum of squares. Therefore, a k value of 3 is optimal for our dataset.

library(NbClust)
library(factoextra)

# Only the x and y columns are clustered, so the cluster labels added earlier are excluded
fviz_nbclust(fake_data[, c("x", "y")], kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2) +
  labs(subtitle = "Elbow method") + theme_minimal()
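The fviz_nbclust() call above performs this computation automatically. For illustration, here is a manual sketch of the same procedure: run kmeans() for k = 1 to 10 and record tot.withinss for each value.

# Manual elbow computation; nstart = 25 reduces the effect of unlucky random starts
wss <- sapply(1:10, function(k) {
  kmeans(fake_data[, c("x", "y")], centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")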

The Silhouette method

The silhouette method assesses the quality of a clustering by measuring how well each observation lies within its assigned cluster compared with the neighboring clusters. According to [2], the silhouette method works as follows:

  1. Run a clustering algorithm (e.g., k-means) for different values of k, for instance varying k from 1 to 10 clusters.
  2. For each k, calculate the average silhouette width of the observations (see reference [3] for how this measure is computed).
  3. Plot the curve of the average silhouette width against the number of clusters k.
  4. The location of the maximum is taken as the appropriate number of clusters.

fviz_nbclust(fake_data[, c("x", "y")], kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette method") + theme_minimal()
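As a rough check on a single k, the average silhouette width can also be computed directly with the silhouette() function from the cluster package (assumed to be available; it is not loaded elsewhere in this post):

library(cluster)

km3 <- kmeans(fake_data[, c("x", "y")], centers = 3, nstart = 25)
sil <- silhouette(km3$cluster, dist(fake_data[, c("x", "y")]))
mean(sil[, "sil_width"])   # higher average width indicates better-separated clusters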

Common Applications

  1. K-means is used in image compression. An image often contains a very large number of distinct colors (one RGB value per pixel). With k-means it is possible to reduce the number of colors in an image by a large factor. For example, to compress an image to a 30-color image, the pixels are treated as data points and clustered around 30 centroids; each pixel then takes on the RGB value of the centroid of its cluster, effectively compressing the image into one with 30 colors. A sketch of this procedure is shown after this list.

  2. K-means is also used in marketing to segment a customer base according to features such as spending habits, demographics and geography. Based on the resulting clusters, a company can decide how best to serve each group of customers.
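As an illustrative sketch of the image-compression idea, the code below assumes the jpeg package and a hypothetical file photo.jpg; neither is part of this post's data.

library(jpeg)

img <- readJPEG("photo.jpg")        # height x width x 3 array of RGB values in [0, 1]
pixels <- matrix(img, ncol = 3)     # one row per pixel: (R, G, B)

# Cluster the pixel colors around 30 centroids, then replace each pixel
# with the centroid color of its cluster
km <- kmeans(pixels, centers = 30, iter.max = 50)
compressed <- km$centers[km$cluster, ]
compressed_img <- array(compressed, dim = dim(img))

writeJPEG(compressed_img, "photo_30_colors.jpg")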

References

  1. Wikipedia - KMeans: https://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm_(na%C3%AFve_k-means)
  2. Determining The Optimal Number Of Clusters: 3 Must Know Methods, DataNovia
  3. Cluster Validation Statistics: Must Know Methods, DataNovia
  4. K-means Clustering: Algorithm, Applications, Evaluation Methods, and Drawbacks, Imad Dabbura.

Self-grade

This “blog” discusses k-means clustering, a method not discussed in class. It is publication ready, includes reproducible code and graphics, is pitched towards a Stat audience, and contains rich and easily digestible information about K-Means clustering. I believe this project deserves an Excellent grade.