This analysis was initially published here. I have redone the analysis to understand how k-means works.
K-means is one of the most popular and simplest algorithms for discovering underlying structure in your data. The goal of k-means is simple: split your data into k groups, each represented by its mean. The mean of each group is assumed to be a good summary of the observations in that cluster.
We assume that we want to split the data into k groups, so we need to find and assign k centers.
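Formally, k-means tries to minimise the total within-cluster sum of squares: the sum, over all observations, of the squared Euclidean distance between each point and the centre of its cluster. The small helper below computes that quantity for a given assignment; it is only here for illustration, and the names (wcss, centers, cluster) are mine rather than part of any standard API.
##Total within-cluster sum of squares, the quantity k-means drives down
##(illustrative helper, not part of the algorithm itself)
wcss <- function(data, centers, cluster)
{
total <- 0
for (k in 1:nrow(centers))
{
points_k <- as.matrix(data[cluster == k, , drop = FALSE])
if (nrow(points_k) > 0)
{
##Squared distances of this cluster's points to its centre
total <- total + sum(sweep(points_k, 2, as.numeric(centers[k, ]))^2)
}
}
total
}
Once a clustering has been computed, wcss(data, centroids, cluster) gives the value that the iterations keep reducing.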
Let us now implement k-means from scratch in R. First, we'll create a data frame based on five 2D Gaussian distributions.
##Packages used: MASS for mvrnorm, ggplot2 for the plots
library(MASS)
library(ggplot2)
set.seed(1234)
set1 <- mvrnorm(n = 300, c(-4,10), matrix(c(1.5,1,1,1.5),2))
set2 <- mvrnorm(n = 300, c(5,7), matrix(c(1,2,2,6),2))
set3 <- mvrnorm(n = 300, c(-1,1), matrix(c(4,0,0,4),2))
set4 <- mvrnorm(n = 300, c(10,-10), matrix(c(4,0,0,4),2))
set5 <- mvrnorm(n = 300, c(3,-3), matrix(c(4,0,0,4),2))
DF <- data.frame(rbind(set1,set2,set3,set4,set5), cluster = as.factor(c(rep(1:5, each=300))))
ggplot(DF,aes(x=X1,y=X2,color=cluster)) +
geom_point()
On this dataset, k-means should work well, since each distribution forms a compact, roughly elliptical blob and the groups are well separated.
The initialisation of the centroids is crucial and changes how the algorithm behaves. Here, we will simply take K random points from the data as the initial centroids.
#Centroids initialisation
##The algorithm works on the two numeric columns of DF
data <- DF[, 1:2]
centroids <- data[sample.int(nrow(data), 5), ]
##Stopping criterion initialisation
current_stop_crit <- 10e10
##Vector where the assigned center of each point will be saved
cluster <- rep(0, nrow(data))
##Has the algorithm converged?
converged=F
it=1
At each iteration, every point is assigned to its closest centroid. To do so, the squared Euclidean distance between each point and each centroid is computed, and the centroid for which that distance is lowest is saved.
###Iterating over observations
for (i in 1:nrow(data))
{
##Setting a high minimum distance
min_dist=10e10
##Iterating over centroids
for (centroid in 1:nrow(centroids))
{
##Computing the L2 distance
distance_to_centroid=sum((centroids[centroid,]-data[i,])^2)
##This centroid is the closest centroid to the point
if (distance_to_centroid<=min_dist)
{
##The point is assigned to this centroid/cluster
cluster[i]=centroid
min_dist=distance_to_centroid
}
}
}
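The double loop above is easy to follow but slow in R on large datasets. The same assignment step can also be written in a vectorised way; the snippet below is only an equivalent sketch of that step, not part of the original code.
##Vectorised version of the assignment step (sketch)
##One column of squared distances per centroid, then keep the closest one
dist_to_centroids <- sapply(1:nrow(centroids), function(k)
rowSums(sweep(as.matrix(data), 2, as.numeric(centroids[k, ]))^2))
cluster <- apply(dist_to_centroids, 1, which.min)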
Once each point has been assigned to its closest centroid, the coordinates of each centroid are updated. The new coordinates are the means of the observations which belong to the cluster.
##For each centroid
for (i in 1:nrow(centroids))
{
##The new coordinates are the means of the point in the cluster
centroids[i,]=apply(data[cluster==i,],2,mean)
}
We do not want the algorithm to run indefinitely, hence we need a stopping criterion to stop the algorithm when we are close enough to a minimum. The criterion is simple: when the centroids stop moving (or barely move), the algorithm should stop.
##Threshold below which the centroids are considered to have stopped moving
stop_crit <- 10e-5
while(current_stop_crit >= stop_crit & converged == F)
{
it=it+1
if (current_stop_crit<=stop_crit)
{
converged=T
}
old_centroids <- centroids
###Run the two previous steps here (assignment and centroid update)
####Recompute the stop criterion: mean squared displacement of the centroids
current_stop_crit=mean((old_centroids-centroids)^2)
}
Putting all the pieces together, we get the full k-means function:
kmeans <- function(data,K,stop_crit=10e-5)
{
#Initialisation of the centroids
centroids=data[sample.int(nrow(data),K),]
current_stop_crit=1000
cluster=rep(0,nrow(data))
converged=F
it=1
while(current_stop_crit>=stop_crit & converged==F)
{
it=it+1
if (current_stop_crit<=stop_crit)
{
converged=T
}
old_centroids=centroids
##Assigning each point to a centroid
for (i in 1:nrow(data))
{
min_dist=10e10
for (centroid in 1:nrow(centroids))
{
distance_to_centroid=sum((centroids[centroid,]-data[i,])^2)
if (distance_to_centroid<=min_dist)
{
cluster[i]=centroid
min_dist=distance_to_centroid
}
}
}
##Updating each centroid: its new coordinates are the mean of its assigned points
for (i in 1:nrow(centroids))
{
centroids[i,]=apply(data[cluster==i,],2,mean)
}
current_stop_crit=mean((old_centroids-centroids)^2)
}
return(list(data=data.frame(data,cluster),centroids=centroids))
}
res <- kmeans(DF[1:2],5)
res$centroids$cluster <- 1:5
res$data$isCentroid <- F
res$centroids$isCentroid <- T
data_plot <- rbind(res$centroids,res$data)
ggplot(data_plot, aes(x=X1, y=X2, color=as.factor(cluster), size=isCentroid, alpha=isCentroid)) +
geom_point()
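As a quick sanity check, the centres found by our implementation can be compared with those of the built-in stats::kmeans (the stats:: prefix is needed because our function masks the base one). The exact values and cluster labels will differ between runs because both methods start from random initialisations.
##Comparison with the built-in implementation (illustrative only)
res_base <- stats::kmeans(DF[1:2], centers = 5)
res_base$centers
res$centroids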
Now, instead of nice Gaussian distributions, we will build three rings nested one inside another.
##Building three sets
set1 <- data.frame(r=runif(300,0.1,0.5), theta=runif(300,0,360), set='1')
set2 <- data.frame(r=runif(300,1,1.5), theta=runif(300,0,360), set='2')
set3 <- data.frame(r=runif(300,3,5), theta=runif(300,0,360), set='3')
##Transformation into rings
data_2=rbind(set1,set2,set3)
data_2$x=data_2$r*cos(2*3.14*data_2$theta)
data_2$y=(data_2$r)*sin(2*3.14*data_2$theta)
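Before clustering, we can plot the rings coloured by the set they were generated from; this quick look is added here for convenience and only reuses the columns built above.
##A quick look at the three rings, coloured by their true set
ggplot(data_2, aes(x=x, y=y, color=set)) +
geom_point()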
res <- kmeans(data_2[4:5],3)
res$centroids$cluster <- 1:3
res$data$isCentroid <- F
res$centroids$isCentroid <- T
data_plot <- rbind(res$centroids,res$data)
ggplot(data_plot, aes(x=x, y=y, color=as.factor(cluster), size = isCentroid, alpha = isCentroid)) +
geom_point()
K-means performs very poorly on these new data. The Euclidean distance is not suited to this kind of problem: k-means looks for compact, convex, blob-shaped clusters, whereas the true groups here are concentric rings.
So before using k-means, you should ensure that the data has an appropriate shape; if not, you can transform the data or change the distance used in the algorithm, as shown in the sketch below.
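For instance, on the rings above, the radius sqrt(x^2 + y^2) alone separates the three groups, so clustering on that single derived feature succeeds where the raw coordinates fail. Here is a minimal sketch of that idea; it uses the built-in stats::kmeans because the function written above expects at least two columns.
##Clustering on a transformed feature: the radius instead of (x, y)
data_2$radius <- sqrt(data_2$x^2 + data_2$y^2)
res_radius <- stats::kmeans(data_2$radius, centers = 3, nstart = 10)
##Each cluster should now line up with one of the three rings
table(res_radius$cluster, data_2$set)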
The k-means algorithm only finds a local minimum, which is often not the global optimum. Hence, different initialisations can lead to very different results. The quality of the initialisation can be improved with k-means++, which selects starting points that are less likely to lead to a poor solution.
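For completeness, here is a minimal sketch of the k-means++ seeding idea (written for illustration, not taken from the original post): the first centre is drawn uniformly at random, and each subsequent centre is drawn with probability proportional to its squared distance to the nearest centre already chosen. The resulting seeds could replace the plain sample.int initialisation in the function above.
##k-means++ style seeding (illustrative sketch)
kmeanspp_init <- function(data, K)
{
data <- as.matrix(data)
##First centre: a point drawn uniformly at random
centers <- data[sample.int(nrow(data), 1), , drop = FALSE]
while (nrow(centers) < K)
{
##Squared distance of every point to its closest centre chosen so far
d2 <- apply(data, 1, function(p) min(colSums((t(centers) - p)^2)))
##Draw the next centre with probability proportional to that distance
next_id <- sample.int(nrow(data), 1, prob = d2)
centers <- rbind(centers, data[next_id, , drop = FALSE])
}
centers
}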