Clustering in R

This article consists the tutorial on how to cluster data in R. Clustering is a unsupervised learning technique in which the dataset has no target variable \(Y\). Clustering mainly aims at finding similarities between the features \(X_i\) using a similarity metric and grouping them together into clusters/groups.

K-Means clustering is a clustering algorithm which aims at clustering continious(numeric) data into \(K\) clusters which are needed to be specified before feeding data to the model.Scaling of features matter in k-means algorithm as it computes the euclidean distance between the cluster centroid and the data points in each iteration, hence we need to standardize the variables if they are skewed or unscaled.

We solve for a objective in k-means i.e we want to minimize the within cluster variance \(WCV\) , which simply implies that the points within a cluster should be as close as possible to the cluster centroid(mean for that cluster). \[minimize {\sum_{k=1}^{K} WCV(C_k) } \] over the clusters \(c_1,c_2,c_3.....c_k\).

The function can be further written as- \[minimize {\sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i \in C_k}\sum_{j=1}^{p}(x_{ij}-\bar x_{kj} )^2}\]

, where \(K\) are the number of clusters and \(|C_k|\) are the number of observations in \(K^{th}\) cluster, \(p\) are the number of variables,and most importantly \(\bar x_{kj}\) is the mean of the \(K^{th}\) cluster i.e the centroid value for that cluster.

The centroid value for a cluster is equal to the mean of the observations in that cluster i.e \[ \bar x_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij} \].

This simply means that we want to partition the observations into \(K\)-clusters such that the total within-cluster variation ,summed over all \(K\)-clusters is as small as possible i.e they are as close to each other as possible.

The K-means algorithm

Randomly assign a number 1 to K,to each of the observations. These serves as initial cluster assignments.
Iterate until the cluster assignments stop changing :

2.a For each of the k-clusters compute the cluster centroid. The \(K^{th}\) cluster centroid is the mean for the observations in the \(K^{th}\) cluster.

2.b Assign each observation to the cluster whose centroid is closest to it by calculating the distance between them.(where closest is defined by the distance metric-Euclidean distance).
Stop:if cluster assignments stop changing , else go to : step 2).

The algorithm is guaranteed to decrease the value of the objective \(WCV(C_k)\). The local minima will be founded by K-means ,however it is not guaranteed that it will give us the global minima.

Local minima is the smallest value of a function within a range.

Global minima is the smallest value of a function over the enrtire domain.

So this means that K-means will land you in a valley, but not necessarily in the lowest/deepest valley because the function is not CONVEX.

Implementing K-means in R

K-means can work in any dimension but for purposes of demonstration I will use it in 2-D. I am going to generate some fake data and try to cluster it.

#setting seed
set.seed(101)
x=matrix(rnorm(100^2),100,2) # a 100 x 2 dim matrix
xmean = matrix(rnorm(8,sd=4),4,2) # a 4 x 2 dim matrix, as we want 4 clusters

which=sample(1:4,100,replace=T) #random sample 
x=x+xmean[which,]

#plotting the data
plot(x,col=which,pch=19)

Now the plot above shows 4 clusters. We know the clusters but now let’s feed the data to k-means and check out its performance and how it clusters the data.

km.out<-kmeans(x,4,nstart=15)
km.out

## K-means clustering with 4 clusters of sizes 20, 20, 32, 28
## 
## Cluster means:
##        [,1]      [,2]
## 1  2.867030  1.432841
## 2  2.785733 -1.695050
## 3 -6.216403  2.257261
## 4 -3.400747  4.507759
## 
## Clustering vector:
##   [1] 3 2 4 3 1 3 1 4 2 1 2 4 1 1 3 3 3 3 4 2 3 2 2 2 3 2 2 3 3 4 2 2 3 2 4
##  [36] 3 3 3 4 3 2 1 1 4 2 4 3 1 1 4 4 1 3 4 4 2 3 4 1 2 1 4 3 4 1 1 3 4 1 3
##  [71] 3 3 3 3 4 3 3 1 4 3 4 2 3 4 4 4 4 2 2 1 1 4 3 3 1 4 1 4 2 4
## 
## Within cluster sum of squares by cluster:
## [1] 27.86071 31.82209 46.20229 56.64773
##  (between_SS / total_SS =  92.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

Let’s plot the clusters made by K-means algorithm.

plot(x,col=km.out$cluster,cex=2,pch=1,lwd=2)
points(x,col=which,pch=19)
points(x,col=c(4,3,2,1)[which],pch=19)

Now the inner circles represent the actual cluster assignments , whereas the outer circles represent the cluster assignments by K-means algorithm. So we can easily notice the mismatches.

Conclusion

So this was a small article and tutorial on how to implement K-means clustering in R. K-means clustering is a nice method to cluster numeric data. The only drawback is we need some domain knowledge to tell the algorithm about the number fo clusters we want a-priori. Secondly, K-means is suited only for data which is normally distributed or either standardized. So scaling of variables actually matter a lot in K-means clustering.