library(ggplot2)
library(factoextra)
We begin by loading the USArrests dataset and viewing a summary of the data.
data("USArrests")
summary(USArrests)
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
The R documentation tells us that the dataset contains statistics on arrests per 100,000 residents for murder, assault and rape in each of the 50 US states in 1973, together with the percentage of the population living in urban areas.
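Before standardising, it can be useful to confirm the structure of the data; the ‘str’ function shows that we have 50 observations (one per state) of 4 numeric variables:
str(USArrests)  # 50 obs. of 4 variables: Murder, Assault, UrbanPop, Rape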
Now we standardise the data and assign the scaled data to a new variable ‘df’:
df <- scale(USArrests)
summary(df)
## Murder Assault UrbanPop Rape
## Min. :-1.6044 Min. :-1.5090 Min. :-2.31714 Min. :-1.4874
## 1st Qu.:-0.8525 1st Qu.:-0.7411 1st Qu.:-0.76271 1st Qu.:-0.6574
## Median :-0.1235 Median :-0.1411 Median : 0.03178 Median :-0.1209
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.7949 3rd Qu.: 0.9388 3rd Qu.: 0.84354 3rd Qu.: 0.5277
## Max. : 2.2069 Max. : 1.9948 Max. : 1.75892 Max. : 2.6444
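The summary confirms that each column now has mean 0; we can also check that each column has standard deviation 1:
apply(df, 2, sd)  # each standardised column should have standard deviation 1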
The ‘head’ function allows us to view the first 6 rows of data:
head(df)
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
The ‘pairs’ function allows us to quickly see the relationships between each pair of variables as a matrix of scatterplots:
pairs(df)
Examining the scatterplots by eye, most of the variables seem to be positively correlated. The strongest positive correlations appear to be between ‘Assault’ and ‘Murder’ and between ‘Assault’ and ‘Rape’. We can double-check using the ‘cor’ function:
round(cor(df),2)
## Murder Assault UrbanPop Rape
## Murder 1.00 0.80 0.07 0.56
## Assault 0.80 1.00 0.26 0.67
## UrbanPop 0.07 0.26 1.00 0.41
## Rape 0.56 0.67 0.41 1.00
As expected, the highest correlation coefficients are between ‘Assault’ and ‘Murder’ (0.80) and between ‘Assault’ and ‘Rape’ (0.67). ‘UrbanPop’, the percentage of the population living in urban areas, is essentially uncorrelated with ‘Murder’, weakly correlated with ‘Assault’ and only moderately correlated with ‘Rape’ (0.41).
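Note that correlation is unaffected by standardisation, so we would get exactly the same matrix from the original data; we can confirm this if we wish:
all.equal(cor(df), cor(USArrests))  # TRUE: scaling does not change correlations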
The ‘fviz_nbclust’ function plots the total within-cluster sum of squares against the number of clusters, k, so that we can look for the ‘elbow’ that suggests the optimal number of clusters:
fviz_nbclust(df, kmeans, method = "wss")
The plot indicates that 4 clusters is optimal.
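The elbow criterion is somewhat subjective, so it can be worth cross-checking against another criterion; ‘fviz_nbclust’ also supports, for example, the average silhouette method (the suggested k may differ from the elbow’s):
fviz_nbclust(df, kmeans, method = "silhouette")  # best k maximises the average silhouette width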
To reproduce the elbow plot manually, we calculate the total within-cluster sum of squares for 2 to 8 clusters, then plot it against the number of clusters:
km2 <- kmeans(df, 2, iter.max = 10, nstart = 10)
km3 <- kmeans(df, 3, iter.max = 10, nstart = 10)
km4 <- kmeans(df, 4, iter.max = 10, nstart = 10)
km5 <- kmeans(df, 5, iter.max = 10, nstart = 10)
km6 <- kmeans(df, 6, iter.max = 10, nstart = 10)
km7 <- kmeans(df, 7, iter.max = 10, nstart = 10)
km8 <- kmeans(df, 8, iter.max = 10, nstart = 10)
wss <- c(km2$tot.withinss, km3$tot.withinss, km4$tot.withinss, km5$tot.withinss,
         km6$tot.withinss, km7$tot.withinss, km8$tot.withinss)
plot(2:8, wss,
     xlab = "No. of clusters",
     ylab = "Total within sum of squares",
     col = "blue")
lines(2:8, wss, col = "blue")
points(4, km4$tot.withinss, pch = 20, col = "red")  # highlight the elbow at k = 4
We can clearly see the ‘elbow’ at 4 clusters. Adding more clusters does not significantly reduce the total within sum of squares, so we take 4 as the optimal number of clusters.
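For reference, the same curve can be produced more compactly with ‘sapply’; since ‘kmeans’ starts from random centroids, we set a seed here for reproducibility:
set.seed(123)  # kmeans uses random starts
wss <- sapply(2:8, function(k) kmeans(df, k, iter.max = 10, nstart = 10)$tot.withinss)
plot(2:8, wss, type = "b", col = "blue",
     xlab = "No. of clusters", ylab = "Total within sum of squares")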
We apply the ‘kmeans’ function to the scaled data with 4 clusters, allowing a maximum of 10 iterations and using 100 random starts to avoid poor local minima:
km.opt <- kmeans(df,4, iter.max = 10, nstart = 100)
To view the cluster assigned to each observation:
km.opt$cluster
## Alabama Alaska Arizona Arkansas California
## 3 4 4 3 4
## Colorado Connecticut Delaware Florida Georgia
## 4 1 1 4 3
## Hawaii Idaho Illinois Indiana Iowa
## 1 2 4 1 2
## Kansas Kentucky Louisiana Maine Maryland
## 1 2 3 2 4
## Massachusetts Michigan Minnesota Mississippi Missouri
## 1 4 2 3 4
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 4 2 1
## New Mexico New York North Carolina North Dakota Ohio
## 4 4 3 2 1
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 1 1 1 1 3
## South Dakota Tennessee Texas Utah Vermont
## 2 3 4 1 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 1 1 2 2 1
To view the cluster centres:
round(km.opt$centers,2)
## Murder Assault UrbanPop Rape
## 1 -0.49 -0.38 0.58 -0.26
## 2 -0.96 -1.11 -0.93 -0.97
## 3 1.41 0.87 -0.81 0.02
## 4 0.70 1.04 0.72 1.28
Without more information, these values seem to indicate the following. Cluster 4, with four positive values, has very high assault and rape arrests, above-average murder arrests, and an above-average percentage of the population living in urban areas. Cluster 3, with the highest murder value, has a lower than average percentage of the population living in urban areas and above-average assault arrests. Cluster 2, with four negative values, seems the safest and is probably the most peaceful! The next safest is cluster 1, where a somewhat larger share of the population than average lives in urban areas, yet all three types of arrests are below average.
To view the within-cluster sum of squares:
km.opt$withinss
## [1] 16.212213 11.952463 8.316061 19.922437
To view the total within-cluster sum of squares (our objective function, which we aim to minimise):
km.opt$tot.withinss
## [1] 56.40317
To view the size of each cluster:
km.opt$size
## [1] 16 13 8 13
To view the number of iterations:
km.opt$iter
## [1] 2
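A useful one-number summary of the fit is the proportion of the total sum of squares explained by the clustering:
round(km.opt$betweenss / km.opt$totss, 3)  # between-cluster SS over total SS
For the scaled data the total sum of squares is (50 - 1) x 4 = 196, so with tot.withinss of about 56.4 this ratio is roughly 0.71.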
Now we can construct the 50 x 4 membership matrix U, which encodes which observation belongs to which cluster, following the decomposition X = U C + E, where X is the 50 x 4 data matrix, C is the 4 x 4 matrix of cluster centroids (km.opt$centers) and E is the matrix of errors. We’ll view only the first 8 rows for brevity:
I <- diag(4)              # 4 x 4 identity matrix
U <- I[km.opt$cluster, ]  # row i is the indicator vector of observation i's cluster
U[1:8,]
## [,1] [,2] [,3] [,4]
## [1,] 0 0 1 0
## [2,] 0 0 0 1
## [3,] 0 0 0 1
## [4,] 0 0 1 0
## [5,] 0 0 0 1
## [6,] 0 0 0 1
## [7,] 1 0 0 0
## [8,] 1 0 0 0
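We can use U to check the decomposition X = U C + E numerically: the sum of the squared errors should equal the total within-cluster sum of squares seen earlier:
E <- df - U %*% km.opt$centers  # residual of each observation from its cluster centroid
sum(E^2)                        # should reproduce km.opt$tot.withinss (56.40317)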
Returning to the original (unscaled) data, we calculate the mean of each variable within each cluster, so that we can better appreciate the actual arrest rates in each category:
round(aggregate(USArrests, by=list(cluster=km.opt$cluster),mean),1)
## cluster Murder Assault UrbanPop Rape
## 1 1 5.7 138.9 73.9 18.8
## 2 2 3.6 78.5 52.1 12.2
## 3 3 13.9 243.6 53.8 21.4
## 4 4 10.8 257.4 76.0 33.2
The ‘cbind’ function allows us to append the cluster assignment as a new column to the original data (here we view only the first 6 rows for brevity). We can therefore compare the arrest rates, state by state, with the means of the cluster they belong to:
dd <- cbind(USArrests, cluster = km.opt$cluster)
head(dd)
## Murder Assault UrbanPop Rape cluster
## Alabama 13.2 236 58 21.2 3
## Alaska 10.0 263 48 44.5 4
## Arizona 8.1 294 80 31.0 4
## Arkansas 8.8 190 50 19.5 3
## California 9.0 276 91 40.6 4
## Colorado 7.9 204 78 38.7 4
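Since ‘dd’ keeps the state names as row names, we can also list the states in any given cluster; for example, cluster 3:
rownames(dd)[dd$cluster == 3]  # states assigned to cluster 3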
Next, we use the ‘fviz_cluster’ function from the factoextra package to visualise the clusters against the first two principal components:
fviz_cluster(km.opt,df)
We can see each cluster relative to the first and second principal components, along with the cluster centres and the percentage of variance explained by each component (total explained variance 86.7%). Using the principal components allows us to visualise the clusters in a 2D space and see how well they are separated, but makes interpretation of the clusters more difficult.
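To verify the percentages shown on the plot axes, we can run the principal component analysis ourselves with ‘prcomp’:
pca <- prcomp(df)  # PCA on the already-scaled data
summary(pca)       # the first two components should explain about 86.7% of the variance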
We can also approach the problem using hierarchical clustering, then compare our solution to the solution obtained using K-means. We begin by calculating the distance matrix using the ‘dist’ function.
dist_mat <- dist(df, method = 'euclidean')
The ‘hclust’ hierarchical clustering function is applied to the distance matrix. We select the ‘ward.D2’ method, which implements Ward’s criterion by squaring the Euclidean distances internally; the older ‘ward.D’ method expects the input distances to be squared already, so it should not be used directly with plain Euclidean distances.
Hier_cl_w <- hclust(dist_mat, method = "ward.D2")
Hier_cl_w
##
## Call:
## hclust(d = dist_mat, method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 50
The ‘plot’ function allows us to visualise the dendrogram. Cutting the dendrogram at the height marked by the green line (h = 6) creates 4 clusters. The ‘rect.hclust’ function draws a rectangle around each cluster.
plot(Hier_cl_w)
abline(h=6, col="green")
rect.hclust(Hier_cl_w, k = 4, border = "red")
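As an alternative to base graphics, factoextra provides ‘fviz_dend’ for a ggplot2-style dendrogram with the four clusters coloured and boxed:
fviz_dend(Hier_cl_w, k = 4, rect = TRUE)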
The 1st cluster contains 7 states (Alabama, Louisiana, etc.), the 2nd 12 states (New York, etc.), the 3rd 19 states (Utah, Washington, etc.) and the 4th 12 states (Iowa, Wisconsin, etc.).
The ‘cutree’ function allows us to see, observation by observation, which cluster each state belongs to. We then use the ‘table’ function to summarise the number of states in each cluster found using the Ward method.
fit_w <- cutree(Hier_cl_w, k = 4)
fit_w
## Alabama Alaska Arizona Arkansas California
## 1 2 2 3 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 4 2 3 4
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 4 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 4 1 3
## Montana Nebraska Nevada New Hampshire New Jersey
## 4 4 2 4 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 1 4 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 4 1 2 3 4
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 4 4 3
table(fit_w)
## fit_w
## 1 2 3 4
## 7 12 19 12
We can then use the ‘table’ function to compare our Ward method solution to the K-means solution.
table(fit_w, km.opt$cluster)
##
## fit_w 1 2 3 4
## 1 0 0 7 0
## 2 0 0 0 12
## 3 16 1 1 1
## 4 0 12 0 0
The Ward method cluster numbers are listed vertically and the K-means cluster numbers horizontally. The table shows that the two solutions are very similar, differing only in the assignment of 3 states (see the fit_w row containing 16, 1, 1, 1). In the Ward solution these 19 states form a single cluster, while K-means keeps 16 of them together and assigns one of the remaining three states to each of the other three clusters.
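If we want to quantify the agreement between the two partitions with a single number, we could compute the adjusted Rand index, for example via the mclust package (an extra dependency, not used elsewhere in this analysis):
# requires the mclust package: install.packages("mclust")
mclust::adjustedRandIndex(fit_w, km.opt$cluster)  # 1 means identical partitions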
Although similar, the solutions differ because hierarchical clustering builds a nested tree from the pairwise distances between data points, while K-means directly minimises the total within-cluster sum of squares starting from a set of initial centroids. Hierarchical clustering is deterministic given the distance matrix and linkage method, whereas K-means is faster and more scalable, although a poor initial choice of centroids can converge to a poor local minimum (hence the use of multiple random starts).