This analysis was initially published here. I have redone the analysis to understand how k-means works.
K-means is one of the most popular and simplest algorithms for discovering underlying structure in your data. The goal of k-means is simple: split your data into k groups, each represented by its mean. The mean of each group is assumed to be a good summary of the observations in that cluster.
We assume that we want to split the data into k groups, so we need to find and assign k centers.
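Formally, k-means tries to minimise the total within-cluster sum of squares: the sum, over all observations, of the squared Euclidean distance between each point and the centre of its cluster. The small helper below computes that quantity for a given assignment; it is only here for illustration, and the names (wcss, centers, cluster) are mine rather than part of any standard API.
##Total within-cluster sum of squares, the quantity k-means drives down
##(illustrative helper, not part of the algorithm itself)
wcss <- function(data, centers, cluster)
{
total <- 0
for (k in 1:nrow(centers))
{
points_k <- as.matrix(data[cluster == k, , drop = FALSE])
if (nrow(points_k) > 0)
{
##Squared distances of this cluster's points to its centre
total <- total + sum(sweep(points_k, 2, as.numeric(centers[k, ]))^2)
}
}
total
}
Once a clustering has been computed, wcss(data, centroids, cluster) gives the value that the iterations keep reducing.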
Let us now implement k-means from scratch in R. First, we'll create a data frame based on five 2D Gaussian distributions.
##Packages used: MASS for mvrnorm, ggplot2 for the plots
library(MASS)
library(ggplot2)
set.seed(1234)
set1 <- mvrnorm(n = 300, c(-4,10), matrix(c(1.5,1,1,1.5),2))
set2 <- mvrnorm(n = 300, c(5,7), matrix(c(1,2,2,6),2))
set3 <- mvrnorm(n = 300, c(-1,1), matrix(c(4,0,0,4),2))
set4 <- mvrnorm(n = 300, c(10,-10), matrix(c(4,0,0,4),2))
set5 <- mvrnorm(n = 300, c(3,-3), matrix(c(4,0,0,4),2))
DF <- data.frame(rbind(set1,set2,set3,set4,set5), cluster = as.factor(c(rep(1:5, each=300))))
ggplot(DF,aes(x=X1,y=X2,color=cluster)) +
geom_point()
On this dataset, k-means should work well, since each distribution forms a compact, roughly elliptical blob and the groups are well separated.
The initialisation of the centroids is crucial and changes how the algorithm behaves. Here, we will simply take K random points from the data as the initial centroids.
#Centroids initialisation
##The algorithm works on the two numeric columns of DF
data <- DF[, 1:2]
centroids <- data[sample.int(nrow(data), 5), ]
##Stopping criterion initialisation
current_stop_crit <- 10e10
##Vector where the assigned center of each point will be saved
cluster <- rep(0, nrow(data))
##Has the algorithm converged?
converged=F
it=1
At each iteration, every point is assigned to its closest centroid. To do so, the squared Euclidean distance between each point and each centroid is computed, and the centroid for which that distance is lowest is saved.
###Iterating over observations
for (i in 1:nrow(data))
{
##Setting a high minimum distance
min_dist=10e10
##Iterating over centroids
for (centroid in 1:nrow(centroids))
{
##Computing the L2 distance
distance_to_centroid=sum((centroids[centroid,]-data[i,])^2)
##This centroid is the closest centroid to the point
if (distance_to_centroid<=min_dist)
{
##The point is assigned to this centroid/cluster
cluster[i]=centroid
min_dist=distance_to_centroid
}
}
}
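The double loop above is easy to follow but slow in R on large datasets. The same assignment step can also be written in a vectorised way; the snippet below is only an equivalent sketch of that step, not part of the original code.
##Vectorised version of the assignment step (sketch)
##One column of squared distances per centroid, then keep the closest one
dist_to_centroids <- sapply(1:nrow(centroids), function(k)
rowSums(sweep(as.matrix(data), 2, as.numeric(centroids[k, ]))^2))
cluster <- apply(dist_to_centroids, 1, which.min)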
Once each point has been assigned to its closest centroid, the coordinates of each centroid are updated. The new coordinates are the means of the observations which belong to the cluster.
##For each centroid
for (i in 1:nrow(centroids))
{
##The new coordinates are the means of the point in the cluster
centroids[i,]=apply(data[cluster==i,],2,mean)
}
We do not want the algorithm to run indefinitely, hence we need a stopping criterion to stop the algorithm when we are close enough to a minimum. The criterion is simple: when the centroids stop moving (or barely move), the algorithm should stop.
##Threshold below which the centroids are considered to have stopped moving
stop_crit <- 10e-5
while(current_stop_crit >= stop_crit & converged == F)
{
it=it+1
if (current_stop_crit<=stop_crit)
{
converged=T
}
old_centroids <- centroids
###Run the two previous steps here (assignment and centroid update)
####Recompute the stop criterion: mean squared displacement of the centroids
current_stop_crit=mean((old_centroids-centroids)^2)
}
Putting all the pieces together, we get the full k-means function:
kmeans <- function(data,K,stop_crit=10e-5)
{
#Initialisation of the centroids
centroids=data[sample.int(nrow(data),K),]
current_stop_crit=1000
cluster=rep(0,nrow(data))
converged=F
it=1
while(current_stop_crit>=stop_crit & converged==F)
{
it=it+1
if (current_stop_crit<=stop_crit)
{
converged=T
}
old_centroids=centroids
##Assigning each point to a centroid
for (i in 1:nrow(data))
{
min_dist=10e10
for (centroid in 1:nrow(centroids))
{
distance_to_centroid=sum((centroids[centroid,]-data[i,])^2)
if (distance_to_centroid<=min_dist)
{
cluster[i]=centroid
min_dist=distance_to_centroid
}
}
}
##Updating each centroid: its new coordinates are the mean of its assigned points
for (i in 1:nrow(centroids))
{
centroids[i,]=apply(data[cluster==i,],2,mean)
}
current_stop_crit=mean((old_centroids-centroids)^2)
}
return(list(data=data.frame(data,cluster),centroids=centroids))
}
res <- kmeans(DF[1:2],5)
res$centroids$cluster <- 1:5
res$data$isCentroid <- F
res$centroids$isCentroid <- T
data_plot <- rbind(res$centroids,res$data)
ggplot(data_plot, aes(x=X1, y=X2, color=as.factor(cluster), size=isCentroid, alpha=isCentroid)) +
geom_point()
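As a quick sanity check, the centres found by our implementation can be compared with those of the built-in stats::kmeans (the stats:: prefix is needed because our function masks the base one). The exact values and cluster labels will differ between runs because both methods start from random initialisations.
##Comparison with the built-in implementation (illustrative only)
res_base <- stats::kmeans(DF[1:2], centers = 5)
res_base$centers
res$centroids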
Now, instead of nice Gaussian distributions, we will build three rings nested one inside another.
##Building three sets
set1 <- data.frame(r=runif(300,0.1,0.5), theta=runif(300,0,360), set='1')
set2 <- data.frame(r=runif(300,1,1.5), theta=runif(300,0,360), set='2')
set3 <- data.frame(r=runif(300,3,5), theta=runif(300,0,360), set='3')
##Transformation into rings
data_2=rbind(set1,set2,set3)
data_2$x=data_2$r*cos(2*3.14*data_2$theta)
data_2$y=(data_2$r)*sin(2*3.14*data_2$theta)
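Before clustering, we can plot the rings coloured by the set they were generated from; this quick look is added here for convenience and only reuses the columns built above.
##A quick look at the three rings, coloured by their true set
ggplot(data_2, aes(x=x, y=y, color=set)) +
geom_point()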
res <- kmeans(data_2[4:5],3)
res$centroids$cluster <- 1:3
res$data$isCentroid <- F
res$centroids$isCentroid <- T
data_plot <- rbind(res$centroids,res$data)
ggplot(data_plot, aes(x=x, y=y, color=as.factor(cluster), size = isCentroid, alpha = isCentroid)) +
geom_point()
K-means performs very poorly on these new data. The Euclidean distance is not suited to this kind of problem: k-means looks for compact, convex, blob-shaped clusters, whereas the true groups here are concentric rings.
So before using k-means, you should ensure that the data has an appropriate shape; if not, you can transform the data or change the distance used in the algorithm, as shown in the sketch below.
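For instance, on the rings above, the radius sqrt(x^2 + y^2) alone separates the three groups, so clustering on that single derived feature succeeds where the raw coordinates fail. Here is a minimal sketch of that idea; it uses the built-in stats::kmeans because the function written above expects at least two columns.
##Clustering on a transformed feature: the radius instead of (x, y)
data_2$radius <- sqrt(data_2$x^2 + data_2$y^2)
res_radius <- stats::kmeans(data_2$radius, centers = 3, nstart = 10)
##Each cluster should now line up with one of the three rings
table(res_radius$cluster, data_2$set)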
The k-means algorithm only finds a local minimum, which is often not the global optimum. Hence, different initialisations can lead to very different results. The quality of the initialisation can be improved with k-means++, which selects starting points that are less likely to lead to a poor solution.
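For completeness, here is a minimal sketch of the k-means++ seeding idea (written for illustration, not taken from the original post): the first centre is drawn uniformly at random, and each subsequent centre is drawn with probability proportional to its squared distance to the nearest centre already chosen. The resulting seeds could replace the plain sample.int initialisation in the function above.
##k-means++ style seeding (illustrative sketch)
kmeanspp_init <- function(data, K)
{
data <- as.matrix(data)
##First centre: a point drawn uniformly at random
centers <- data[sample.int(nrow(data), 1), , drop = FALSE]
while (nrow(centers) < K)
{
##Squared distance of every point to its closest centre chosen so far
d2 <- apply(data, 1, function(p) min(colSums((t(centers) - p)^2)))
##Draw the next centre with probability proportional to that distance
next_id <- sample.int(nrow(data), 1, prob = d2)
centers <- rbind(centers, data[next_id, , drop = FALSE])
}
centers
}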