I always tend to look for explanations of topics that involve a minimum of mathematical notation. Mathematical notation is great, but too much dependence on it can sometimes pull us away from the core objective: a proper understanding of the concept. This is especially true when someone brings up a machine learning topic. Here is my effort to demonstrate the K-Means clustering method. I have tried to minimize the use of notation while still covering the basic idea fully. I explain the K-Means algorithm (with the underlying mathematics behind it) in simple language and hand-code it for a specific case. Then I go over the built-in kmeans function in R with an application. I end the write-up with an overview of the elbow method for choosing K.

Acknowledgments:


Introduction


K-Means is a simple unsupervised learning (clustering) method that attaches labels to the observations of a dataset.

K-Means partitions a data set into K distinct, non-overlapping clusters. An important feature of K-Means is that the number of clusters is user defined.

The K-Means algorithm forms \(K\) clusters from the data by solving an optimization problem. It solves the clustering problem in such a way that the resulting clusters satisfy two properties:

  1. Every observation belongs to at least one of the \(K\) clusters.
  2. No observation belongs to more than one cluster; that is, the clusters do not overlap.


Mathematical Description


The key idea of K-Means clustering is that it tries to minimize the within cluster variation. The definition of within cluster variation is subjective and there are multiple ways to define it.

For the time being, let \(C_i, \ i \in \{1,2, \dots, K\}\) represent the clusters and \(W(C_i)\) represent the within cluster variation of \(C_i\). What the K-Means algorithm does is minimize the sum of the \(K\) within cluster variations. In mathematical terms, it solves the following optimization problem:

\[\underset{C_1,\dots,C_K}{\text{minimize}} \left\{ \sum^K_{i=1} W(C_i) \right\}\]

Now, coming back to defining the within cluster variation: there are multiple choices, including the Euclidean distance, \(L_q\) norms, the Canberra distance, and so on. By far the most popular is the Euclidean distance. It can be defined as follows:

\[W(C_k)=\frac{1}{|C_k|} \sum_{i,i' \in C_k}\sum_{j=1}^{p} (x_{ij}-x_{i'j})^2 \]

In the above equation, \({|C_k|}\) is the number of elements in cluster \(C_k\). At first glance this might not seem very intuitive, but a careful look reveals that the double sum evaluates the squared Euclidean distance between observations \(i\) and \(i'\). The squared distance is used instead of the distance itself: since squaring is monotonically increasing on non-negative numbers, comparing squared distances gives the same ordering as comparing distances, and it reduces the computational cost significantly because we avoid a square-root calculation every time.

The inner sum in the above equation runs over the \(p\) dimensions of the data; that is, we are assuming our data has \(p\) dimensions, indexed by \(j\). Note that within the inner sum, \(i\) and \(i'\) are fixed, and both belong to the cluster \(C_k\). So the inner sum simply computes the squared Euclidean distance between observations \(i\) and \(i'\).

The outer sum selects all possible pairs of observations within the cluster. So, if there are \(n_k\) observations in a cluster, there are \(\binom{n_k}{2}=\frac{n_k(n_k-1)}{2}\) distinct pairs for that cluster, each of which appears twice in the sum (once as \((i,i')\) and once as \((i',i)\)), and we sum across all of them.

So here we are defining the within cluster variation as the sum of all pairwise squared Euclidean distances within the cluster, divided by the number of elements in that cluster.
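
To make this definition concrete, here is a minimal R sketch of my own (the function name withinClusterVariation is hypothetical, not from any package) that computes \(W(C_k)\) exactly as written above: it sums the squared Euclidean distance over every ordered pair of observations in a cluster and divides by the cluster size.

withinClusterVariation=function(X){    ## X: one row per observation in the cluster
  n=nrow(X)                            ## |C_k|
  total=0
  for(i in 1:n){
    for(ii in 1:n){                    ## every ordered pair (i, i'); i = i' adds 0
      total=total+sum((X[i,]-X[ii,])^2)
    }
  }
  return(total/n)
}

## e.g. the within cluster variation of the 50 setosa flowers, using sepal length
## and petal length (the two dimensions used in the example below):
withinClusterVariation(as.matrix(iris[1:50, c("Sepal.Length","Petal.Length")]))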


Algorithm to solve the optimization


In the previous section, we formulated the clustering problem and shaped it into an optimization problem. For an exact solution, we would need to iterate through all possible partitions (it is an interesting problem in itself to count in how many ways we can arrange \(N\) points into \(K\) clusters, with each cluster getting at least one point; the answer is given by a Stirling number of the second kind). This number grows extremely fast, so an exhaustive search is computationally infeasible. Fortunately, we have a very simple (literally, super easy!) algorithm that is very efficient. The only concern is that it may end up at a local optimum rather than the global one.

I personally have gone through many versions of the same algorithm. Here I try to summarize it in my own way:

  1. Randomly choose \(K\) centroids in the data
  2. Evaluate the Euclidean distance (the squared distance will do!) of every observation from each of the centroids. Assign each observation to the cluster whose centroid is closest (smallest distance, or equivalently smallest squared distance).
  3. After step 2, we have actual members in \(K\) clusters. Now, from these members, re-evaluate the centroid of each cluster.
  4. Go back to step 2 and re-assign all the observations to clusters based on the new centroids. If the cluster assignment is identical to the previous assignment, stop. We call this the point of convergence.

Why the algorithm works can be explained by the following identity:

\[\frac{1}{|C_k|} \sum_{i,i' \in C_k}\sum_{j=1}^{p} (x_{ij}-x_{i'j})^2=2\sum_{i \in C_k} \sum_{j=1}^p (x_{ij}-\bar{x}_{kj})^2\] where \(\bar{x}_{kj}\) is the \(j\)-th coordinate of the centroid of cluster \(C_k\). Since every observation is re-assigned to its nearest centroid and the centroids are then recomputed as cluster means, each iteration can only decrease (or leave unchanged) the right-hand side of the above equation.
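
As a quick numerical sanity check of this identity, here is a small sketch of my own on an arbitrary toy cluster (not part of the algorithm itself):

set.seed(1)
X=matrix(rnorm(20), ncol=2)   ## a toy "cluster" of 10 two-dimensional points

## left-hand side: squared distances over all ordered pairs, divided by |C_k|
lhs=0
for(i in 1:nrow(X)){
  for(ii in 1:nrow(X)){ lhs=lhs+sum((X[i,]-X[ii,])^2) }
}
lhs=lhs/nrow(X)

## right-hand side: twice the sum of squared deviations from the centroid
rhs=2*sum(sweep(X, 2, colMeans(X))^2)

all.equal(lhs, rhs)           ## TRUE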


An Example


We will demonstrate the K-Means clustering algorithm using the iris data. The iris data contains the sepal length, sepal width, petal length and petal width of \(150\) flowers. There are three species of flower, namely setosa, versicolor and virginica, with \(50\) flowers of each species. Although this is a four-dimensional data set, for simplicity we will use two dimensions: sepal length and petal length. The following plot shows the scatter plot of these two dimensions, color coded by species.

data(iris)
sl=iris[,1];pl=iris[,3];Spec=c(rep(1,50), rep(2,50), rep(3,50))  ## sepal length, petal length, species code
md=as.data.frame(cbind(sl,pl))
plot(sl~pl,col=Spec, pch=16, ylab="Sepal Length", xlab="Petal Length")
legend("topleft", pch=16, col=c(1,2,3), c("Setosa", "Versicolor", "Virginica"), bty = "n")

Now, based on these two dimensions, we will try to make three clusters. So here we are pre-defining \(K=3\), the number of clusters we want from K-Means.

Step 1: Randomly pick three centroids

The first task is to pick three centroids. For this we need three values for the sepal length and three values for the petal length. Let us pick them uniformly from the range of the respective dimension.

set.seed(1256)
sl_points=runif(3,range(sl)[1], range(sl)[2])
set.seed(4408)
pl_points=runif(3,range(pl)[1], range(pl)[2])

centroid=matrix(1:6, nrow = 3)
for(i in 1:3){centroid[i,]=c(pl_points[i],sl_points[i])}
message("Initial centroids (petal length first)")
## Initial centroids (petal length first)
print(centroid)
##          [,1]     [,2]
## [1,] 3.101015 4.988436
## [2,] 5.669583 4.428857
## [3,] 5.282241 7.144083

Step 2: Calculate the distances of all observations and assign cluster

Now we will perform step 2 of our algorithm: calculate the (squared) distances of all observations from the three centroids and assign each observation to a cluster based on the smallest distance.

calcSqDist=function(x, y){return(sum((x-y)^2))}   ## squared Euclidean distance

distances=rep(NA, 3)
assignments = assignments_new = rep(NA, 150)

for(i in 1:150){
  for(j in 1:3){
    ## md stores (sepal length, petal length) while centroid stores
    ## (petal length, sepal length), hence the rev()
    distances[j]=calcSqDist(md[i,],rev(centroid[j,]))
  }
  assignments[i]=which.min(distances)   ## nearest centroid wins
}

plot(sl~pl, pch=16, ylab="Sepal Length", xlab="Petal Length", col=assignments,
     main="Clustering after 1st iteration")
for(i in 1:3){points(centroid[i,1],centroid[i,2] , col=i, pch=17, cex=2)}

The clustering in the above figure is very crude. Since we picked the centroids randomly, the clusters are little better (if at all) than random. The three triangles show the three randomly picked centroids. Now we will refine this clustering.

Step 3: Re-calculate centroids and re-assign clusters

We now have an initial set of assignments. From these assignments, we will re-calculate the centroids and then re-assign every observation to a cluster as before, based on the minimum squared distance, but this time from the new centroids.

## Update the centroids
for(i in 1:3){
  centroid[i,1]=mean(pl[which(assignments==i)])
  centroid[i,2]=mean(sl[which(assignments==i)])
}
message("Updated Centroid (petal length first)")
## Updated Centroid (petal length first)
print(centroid)
##          [,1]     [,2]
## [1,] 2.331169 5.198701
## [2,] 4.800000 5.400000
## [3,] 5.282857 6.571429
## Re assign clusters
for(i in 1:150){
  for(j in 1:3){
    distances[j]=calcSqDist(md[i,],rev(centroid[j,]))
  }
  assignments_new[i]=which.min(distances)
}

plot(sl~pl, pch=16, ylab="Sepal Length", xlab="Petal Length", col=assignments_new,
     main="Clustering after 2nd iteration")
for(i in 1:3){points(centroid[i,1],centroid[i,2] , col=i, pch=17, cex=2)}

The figure above shows the clustering at the second iteration. There is a dramatic change in the cluster assignment even at this second iteration. A careful look at the code reveals that the cluster assignments in this step are saved in a new vector. We did this on purpose, so that we can compare the new assignment with the previous one to detect convergence. Let us check whether our assignments changed at all.

tellAboutChange=function(x){
  if(x==0){message("Assignments Unchanged")}
  else{message("Assignments Changed")}
}
tellAboutChange( sum(abs(assignments-assignments_new)) )
## Assignments Changed

Step 4: Iteration

We can see that the cluster assignments changed between the first two iterations. Our job now is to keep running similar iterations until the cluster assignments no longer change. We will also keep a counter variable to count the number of iterations required.

counter=2 ## since we already have two iterations
identifier=sum(abs(assignments-assignments_new))
repeat{
  assignments=assignments_new
  
  ##recalculate and reassign
  
  ## Update the centroids
  for(i in 1:3){
    centroid[i,1]=mean(pl[which(assignments==i)])
    centroid[i,2]=mean(sl[which(assignments==i)])
  }
  ## Re assign clusters
  for(i in 1:150){
    for(j in 1:3){
      distances[j]=calcSqDist(md[i,],rev(centroid[j,]))
    }
    assignments_new[i]=which.min(distances)
  }
  identifier=sum(abs(assignments-assignments_new))  ## 0 when no assignment changed
  counter=counter+1
  if (identifier==0) {break}  ## converged: same assignment as previous iteration
}


plot(sl~pl, pch=16, ylab="Sepal Length", xlab="Petal Length", col=assignments_new,
     main="Clustering after final iteration")
for(i in 1:3){points(centroid[i,1],centroid[i,2] , col=i, pch=17, cex=2)}
for(i in 1:150){points(y=sl[i],x=pl[i],pch=4, col=Spec[i])}
legend("topleft", c("Original Cluster", "K-Means assignment"),pch=c(4,16),bty="n")

tellAboutChange(identifier)
## Assignments Unchanged
message("Numnber of iteration required: ", counter)
## Numnber of iteration required: 6

So, here we are: the algorithm converged after just 6 iterations. The figure above shows the original species and the K-Means assignments. We see that some of the original virginica flowers are grouped with versicolor. Still, this clustering is not bad at all, considering the simplicity of the algorithm. Also, we used only two dimensions here for clustering, while the original data has four. We would expect an improved cluster assignment if we used all four dimensions.


Some Comments


Run the clustering algorithm many times

In the above demonstration, we did not calculate the objective function directly (it could be done without much extra complexity); rather, we approached convergence indirectly. The problem with the above algorithm is that it finds a local minimum rather than the global one. So, in practice, we should run the algorithm many times with different initial conditions (that is, different starting centroids) and calculate the objective after the final iteration of each run. With different initial conditions we may end up with different clusterings, and we should retain the one with the minimum objective function. A sketch of the objective calculation is given below.
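
As a minimal sketch of this idea (the helper name totalWithinSS is my own, and it works on the two iris dimensions used above): by the identity of the previous sections, the sum of squared deviations of the members from their cluster centroids is, up to a factor of 2, the same objective we formulated, so comparing runs by this quantity is equivalent.

totalWithinSS=function(assignments, K=3){
  total=0
  for(k in 1:K){
    members=which(assignments==k)
    ## squared deviations of the members from their cluster centroid, both dimensions
    total=total+sum((pl[members]-mean(pl[members]))^2)+sum((sl[members]-mean(sl[members]))^2)
  }
  return(total)
}
totalWithinSS(assignments_new)   ## objective of the single run above

One would repeat the whole fitting loop with several different random seeds for the initial centroids, compute totalWithinSS for each final assignment, and keep the run with the smallest value.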

Choosing K

One of the biggest weaknesses of K-Means is that the number of clusters is user-defined. The algorithm is not “intelligent” enough to choose \(K\) by itself.

However, we can choose \(K\) by employing the elbow method, which is discussed in a later section.

K-Means finds spherical clusters

As long as we use the Euclidean distance for our within cluster variation, K-Means tries hard to find \(p\)-dimensional spherical clusters. This was the case in our demonstration: the virginica flowers that ended up grouped with versicolor did so because they form a roughly circular shape together with the versicolor flowers.
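
A small illustration of this behavior, using R's built-in kmeans() function (introduced in the next section) on data of my own making: two long, thin, parallel bands of points. Because K-Means with Euclidean distance looks for roughly spherical groups, it tends to cut across the bands rather than recover them.

set.seed(99)
x=runif(200, 0, 10)                            ## long horizontal spread
grp=rep(1:2, each=100)                         ## true (elongated) group labels
y=ifelse(grp==1, 0, 1.5)+rnorm(200, sd=0.2)    ## two thin horizontal bands
fit=kmeans(cbind(x, y), centers=2, nstart=10)
table(fit$cluster, grp)                        ## the clusters mix the two bands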


R function for K-Means


In the previous section, we wrote our own code for the K-Means algorithm. That piece of code was very specific: it handles only two dimensions and was written for three clusters and 150 observations. Efforts could be made to generalize it (a rough sketch of one such generalization is given below). But thankfully, R already has a powerful built-in function to perform K-Means clustering, so we do not need to develop our own code except for demonstration purposes. The function is called kmeans.
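
For the curious, here is a rough, hedged sketch of what such a generalization might look like (my own sketch, written for this write-up; it is not the code used above and not the built-in function, and it ignores the corner case of a cluster becoming empty):

myKMeans=function(X, K, max_iter=100){         ## X: numeric matrix, K: number of clusters
  n=nrow(X)
  centroids=X[sample(n, K), , drop=FALSE]      ## K randomly chosen observations as starting centroids
  assignments=rep(0, n)
  for(iter in 1:max_iter){
    ## squared Euclidean distance of every observation from every centroid (n x K matrix)
    d=sapply(1:K, function(k) rowSums(sweep(X, 2, centroids[k,])^2))
    new_assignments=max.col(-d)                ## column index of the smallest distance
    if(all(new_assignments==assignments)){break}   ## converged: nothing changed
    assignments=new_assignments
    ## recompute each centroid as the mean of its current members
    for(k in 1:K){centroids[k,]=colMeans(X[assignments==k, , drop=FALSE])}
  }
  return(list(cluster=assignments, centers=centroids, iterations=iter))
}

## e.g. myKMeans(as.matrix(iris[,1:4]), K=3)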

Now we will do K-Means clustering on the same dataset, but using all four dimensions.

md=iris[,-5]
set.seed(10)
KM=kmeans(as.matrix(md), centers = 3, iter.max = 35,nstart = 10)
print(KM)
## K-means clustering with 3 clusters of sizes 38, 50, 62
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     6.850000    3.073684     5.742105    2.071053
## 2     5.006000    3.428000     1.462000    0.246000
## 3     5.901613    2.748387     4.393548    1.433871
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [71] 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1
## [106] 1 3 1 1 1 1 1 1 3 3 1 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1
## [141] 1 1 3 1 1 1 3 1 1 3
## 
## Within cluster sum of squares by cluster:
## [1] 23.87947 15.15100 39.82097
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

The kmeans function in R takes the data as a matrix. Here we specified that we want \(3\) clusters, capped the number of iterations at \(35\), and asked for \(10\) different random starts.

kmeans returns a kmeans object, which was printed above. It contains a number of things:

Cluster means: This gives us the \(K\) centroids, calculated after the final iteration.

Clustering vector: The cluster assignment made by the algorithm. Looking at the above results, it is intuitive that cluster 2 corresponds to setosa, 3 to versicolor and 1 to virginica (a cross-tabulation against the true species, sketched below, makes this mapping explicit).

Available components: The full list of components returned by the function.
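
To poke at a couple of these components directly, and to check the species-to-cluster mapping described above, something like the following can be used (a small sketch; the true species labels are used only for checking here, the algorithm never sees them):

KM$centers                      ## the K centroids after the final iteration
head(KM$cluster)                ## cluster assignments of the first few observations

## cross-tabulate the K-Means assignment against the true species: per the
## clustering vector printed above, all 50 setosa land in cluster 2, while
## versicolor and virginica are split mainly between clusters 3 and 1
table(KM$cluster, iris$Species)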

An analogy with SLR: determining K with the elbow method

The totss, betweenss and tot.withinss components can be thought of as the total sum of squares, regression sum of squares and residual sum of squares in simple linear regression (SLR). In regression, we define the \(R^2\) value as the ratio of the regression sum of squares to the total sum of squares, which indicates how much of the variability in the data is explained by the regression model. In K-Means, the ratio of betweenss to totss has a similar interpretation: it indicates the amount of information retained after clustering. This ratio is sometimes used to determine the number of clusters \(K\), via what is known as the elbow method. We apply the K-Means algorithm for a range of \(K\) values, evaluate this ratio each time, and plot the ratio against \(K\). We retain the \(K\) beyond which we do not see a significant improvement in the ratio.
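
As a quick check of this analogy on the fitted object above, the total sum of squares decomposes exactly into the between-cluster part and the total within-cluster part, and the ratio below is the 88.4% reported in the printout:

all.equal(KM$totss, KM$betweenss+KM$tot.withinss)   ## TRUE: totss = betweenss + tot.withinss
KM$betweenss/KM$totss                               ## information retained by the clustering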

Below is a plot showing the information retained against the number of clusters for the iris data. It is obvious that we do not see a significant change in the information retained as we go beyond \(3\) clusters. In fact, \(2\) seems to be a good choice as well, but that would be more aggressive.

N=10
information=rep(NA,N)
for(i in 1:N){
  KM=kmeans(as.matrix(md), centers = i, iter.max = 35,nstart = 10)
  information[i]=KM$betweenss/KM$totss
}

plot(information~seq(1:N), type="b",pch=16, col=4, ylab="Information Retained",lwd=2,
     xlab="Number of Clusters", main="Selecting K by elbow method")


  1. Thomas Brandenburger, PhD, Associate Professor, Math and Stat, South Dakota State University

  2. Christopher Saunders, PhD, Associate Professor, Math and Stat, South Dakota State University