library(ggplot2)
library(factoextra)
We begin by loading the USArrests dataset and viewing a summary of the data.
data("USArrests")
summary(USArrests)
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
The R documentation tells us that the dataset contains statistics on arrests per 100,000 residents for murder, assault and rape in each of the 50 US states in 1973, together with the percentage of the population living in urban areas.
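Before standardising, it can be useful to confirm the structure of the data; the ‘str’ function shows that we have 50 observations (one per state) of 4 numeric variables:
str(USArrests)  # 50 obs. of 4 variables: Murder, Assault, UrbanPop, Rape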
Now we standardise the data and assign the scaled data to a new variable ‘df’:
df <- scale(USArrests)
summary(df)
## Murder Assault UrbanPop Rape
## Min. :-1.6044 Min. :-1.5090 Min. :-2.31714 Min. :-1.4874
## 1st Qu.:-0.8525 1st Qu.:-0.7411 1st Qu.:-0.76271 1st Qu.:-0.6574
## Median :-0.1235 Median :-0.1411 Median : 0.03178 Median :-0.1209
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.7949 3rd Qu.: 0.9388 3rd Qu.: 0.84354 3rd Qu.: 0.5277
## Max. : 2.2069 Max. : 1.9948 Max. : 1.75892 Max. : 2.6444
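The summary confirms that each column now has mean 0; we can also check that each column has standard deviation 1:
apply(df, 2, sd)  # each standardised column should have standard deviation 1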
The ‘head’ function allows us to view the first 6 rows of data:
head(df)
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
The ‘pairs’ function allows us to quickly see the relationships between each pair of variables as a matrix of scatterplots:
pairs(df)
Examining the scatterplots by eye, most of the variables seem to be positively correlated. The strongest positive correlations appear to be between ‘Assault’ and ‘Murder’ and between ‘Assault’ and ‘Rape’. We can double-check using the ‘cor’ function:
round(cor(df),2)
## Murder Assault UrbanPop Rape
## Murder 1.00 0.80 0.07 0.56
## Assault 0.80 1.00 0.26 0.67
## UrbanPop 0.07 0.26 1.00 0.41
## Rape 0.56 0.67 0.41 1.00
As expected, the highest correlation coefficients are between ‘Assault’ and ‘Murder’ (0.80) and between ‘Assault’ and ‘Rape’ (0.67). ‘UrbanPop’, the percentage of the population living in urban areas, is essentially uncorrelated with ‘Murder’, weakly correlated with ‘Assault’ and only moderately correlated with ‘Rape’ (0.41).
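Note that correlation is unaffected by standardisation, so we would get exactly the same matrix from the original data; we can confirm this if we wish:
all.equal(cor(df), cor(USArrests))  # TRUE: scaling does not change correlations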
The ‘fviz_nbclust’ function plots the total within-cluster sum of squares against the number of clusters, k, so that we can look for the ‘elbow’ that suggests the optimal number of clusters:
fviz_nbclust(df, kmeans, method = "wss")
The plot indicates that 4 clusters is optimal.
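The elbow criterion is somewhat subjective, so it can be worth cross-checking against another criterion; ‘fviz_nbclust’ also supports, for example, the average silhouette method (the suggested k may differ from the elbow’s):
fviz_nbclust(df, kmeans, method = "silhouette")  # best k maximises the average silhouette width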
To reproduce the elbow plot manually, we calculate the total within-cluster sum of squares for 2 to 8 clusters, then plot it against the number of clusters:
km2 <- kmeans(df, 2, iter.max = 10, nstart = 10)
km3 <- kmeans(df, 3, iter.max = 10, nstart = 10)
km4 <- kmeans(df, 4, iter.max = 10, nstart = 10)
km5 <- kmeans(df, 5, iter.max = 10, nstart = 10)
km6 <- kmeans(df, 6, iter.max = 10, nstart = 10)
km7 <- kmeans(df, 7, iter.max = 10, nstart = 10)
km8 <- kmeans(df, 8, iter.max = 10, nstart = 10)
wss <- c(km2$tot.withinss, km3$tot.withinss, km4$tot.withinss, km5$tot.withinss,
         km6$tot.withinss, km7$tot.withinss, km8$tot.withinss)
plot(2:8, wss,
     xlab = "No. of clusters",
     ylab = "Total within sum of squares",
     col = "blue")
lines(2:8, wss, col = "blue")
points(4, km4$tot.withinss, pch = 20, col = "red")  # highlight the elbow at k = 4
We can clearly see the ‘elbow’ at 4 clusters. Adding more clusters does not significantly reduce the total within sum of squares, so we take 4 as the optimal number of clusters.
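For reference, the same curve can be produced more compactly with ‘sapply’; since ‘kmeans’ starts from random centroids, we set a seed here for reproducibility:
set.seed(123)  # kmeans uses random starts
wss <- sapply(2:8, function(k) kmeans(df, k, iter.max = 10, nstart = 10)$tot.withinss)
plot(2:8, wss, type = "b", col = "blue",
     xlab = "No. of clusters", ylab = "Total within sum of squares")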
We apply the ‘kmeans’ function to the scaled data with 4 clusters, allowing a maximum of 10 iterations and using 100 random starts to avoid poor local minima:
km.opt <- kmeans(df,4, iter.max = 10, nstart = 100)
To view the cluster assigned to each observation:
km.opt$cluster
## Alabama Alaska Arizona Arkansas California
## 3 4 4 3 4
## Colorado Connecticut Delaware Florida Georgia
## 4 1 1 4 3
## Hawaii Idaho Illinois Indiana Iowa
## 1 2 4 1 2
## Kansas Kentucky Louisiana Maine Maryland
## 1 2 3 2 4
## Massachusetts Michigan Minnesota Mississippi Missouri
## 1 4 2 3 4
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 4 2 1
## New Mexico New York North Carolina North Dakota Ohio
## 4 4 3 2 1
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 1 1 1 1 3
## South Dakota Tennessee Texas Utah Vermont
## 2 3 4 1 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 1 1 2 2 1
To view the cluster centres:
round(km.opt$centers,2)
## Murder Assault UrbanPop Rape
## 1 -0.49 -0.38 0.58 -0.26
## 2 -0.96 -1.11 -0.93 -0.97
## 3 1.41 0.87 -0.81 0.02
## 4 0.70 1.04 0.72 1.28
Without more information, these values seem to indicate the following. Cluster 4, with four positive values, has very high assault and rape arrests, above-average murder arrests, and an above-average percentage of the population living in urban areas. Cluster 3, with the highest murder value, has a lower than average percentage of the population living in urban areas and above-average assault arrests. Cluster 2, with four negative values, seems the safest and is probably the most peaceful! The next safest is cluster 1, where a somewhat larger share of the population than average lives in urban areas, yet all three types of arrests are below average.
To view the within-cluster sum of squares:
km.opt$withinss
## [1] 16.212213 11.952463 8.316061 19.922437
To view the total within-cluster sum of squares (our objective function, which we aim to minimise):
km.opt$tot.withinss
## [1] 56.40317
To view the size of each cluster:
km.opt$size
## [1] 16 13 8 13
To view the number of iterations:
km.opt$iter
## [1] 2
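A useful one-number summary of the fit is the proportion of the total sum of squares explained by the clustering:
round(km.opt$betweenss / km.opt$totss, 3)  # between-cluster SS over total SS
For the scaled data the total sum of squares is (50 - 1) x 4 = 196, so with tot.withinss of about 56.4 this ratio is roughly 0.71.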
Now we can construct the 50 x 4 membership matrix U, which encodes which observation belongs to which cluster, following the decomposition X = U C + E, where X is the 50 x 4 data matrix, C is the 4 x 4 matrix of cluster centroids (km.opt$centers) and E is the matrix of errors. We’ll view only the first 8 rows for brevity:
I <- diag(4)              # 4 x 4 identity matrix
U <- I[km.opt$cluster, ]  # row i is the indicator vector of observation i's cluster
U[1:8,]
## [,1] [,2] [,3] [,4]
## [1,] 0 0 1 0
## [2,] 0 0 0 1
## [3,] 0 0 0 1
## [4,] 0 0 1 0
## [5,] 0 0 0 1
## [6,] 0 0 0 1
## [7,] 1 0 0 0
## [8,] 1 0 0 0
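We can use U to check the decomposition X = U C + E numerically: the sum of the squared errors should equal the total within-cluster sum of squares seen earlier:
E <- df - U %*% km.opt$centers  # residual of each observation from its cluster centroid
sum(E^2)                        # should reproduce km.opt$tot.withinss (56.40317)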
Returning to the original (unscaled) data, we calculate the mean of each variable within each cluster, so that we can better appreciate the actual arrest rates in each category:
round(aggregate(USArrests, by=list(cluster=km.opt$cluster),mean),1)
## cluster Murder Assault UrbanPop Rape
## 1 1 5.7 138.9 73.9 18.8
## 2 2 3.6 78.5 52.1 12.2
## 3 3 13.9 243.6 53.8 21.4
## 4 4 10.8 257.4 76.0 33.2
The ‘cbind’ function allows us to append the cluster assignment as a new column to the original data (here we view only the first 6 rows for brevity). We can therefore compare the arrest rates, state by state, with the means of the cluster they belong to:
dd <- cbind(USArrests, cluster = km.opt$cluster)
head(dd)
## Murder Assault UrbanPop Rape cluster
## Alabama 13.2 236 58 21.2 3
## Alaska 10.0 263 48 44.5 4
## Arizona 8.1 294 80 31.0 4
## Arkansas 8.8 190 50 19.5 3
## California 9.0 276 91 40.6 4
## Colorado 7.9 204 78 38.7 4
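Since ‘dd’ keeps the state names as row names, we can also list the states in any given cluster; for example, cluster 3:
rownames(dd)[dd$cluster == 3]  # states assigned to cluster 3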
Next, we use the ‘fviz_cluster’ function from the factoextra package to visualise the clusters against the first two principal components:
fviz_cluster(km.opt,df)
We can see each cluster relative to the first and second principal components, along with the cluster centres and the percentage of variance explained by each component (total explained variance 86.7%). Using the principal components allows us to visualise the clusters in a 2D space and see how well they are separated, but makes interpretation of the clusters more difficult.
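To verify the percentages shown on the plot axes, we can run the principal component analysis ourselves with ‘prcomp’:
pca <- prcomp(df)  # PCA on the already-scaled data
summary(pca)       # the first two components should explain about 86.7% of the variance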
We can also approach the problem using hierarchical clustering, then compare our solution to the solution obtained using K-means. We begin by calculating the distance matrix using the ‘dist’ function.
dist_mat <- dist(df, method = 'euclidean')
The ‘hclust’ hierarchical clustering function is applied to the distance matrix. We select the ‘ward.D2’ method, which implements Ward’s criterion by squaring the Euclidean distances internally; the older ‘ward.D’ method expects the input distances to be squared already, so it should not be used directly with plain Euclidean distances.
Hier_cl_w <- hclust(dist_mat, method = "ward.D2")
Hier_cl_w
##
## Call:
## hclust(d = dist_mat, method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 50
The ‘plot’ function allows us to visualise the dendrogram. Cutting the dendrogram at the height marked by the green line (h = 6) creates 4 clusters. The ‘rect.hclust’ function draws a rectangle around each cluster.
plot(Hier_cl_w)
abline(h=6, col="green")
rect.hclust(Hier_cl_w, k = 4, border = "red")
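As an alternative to base graphics, factoextra provides ‘fviz_dend’ for a ggplot2-style dendrogram with the four clusters coloured and boxed:
fviz_dend(Hier_cl_w, k = 4, rect = TRUE)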
The 1st cluster contains 7 states (Alabama, Louisiana, etc.), the 2nd 12 states (New York, etc.), the 3rd 19 states (Utah, Washington, etc.) and the 4th 12 states (Iowa, Wisconsin, etc.).
The ‘cutree’ function allows us to see, observation by observation, which cluster each state belongs to. We then use the ‘table’ function to summarise the number of states in each cluster found using the Ward method.
fit_w <- cutree(Hier_cl_w, k = 4)
fit_w
## Alabama Alaska Arizona Arkansas California
## 1 2 2 3 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 4 2 3 4
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 4 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 4 1 3
## Montana Nebraska Nevada New Hampshire New Jersey
## 4 4 2 4 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 1 4 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 4 1 2 3 4
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 4 4 3
table(fit_w)
## fit_w
## 1 2 3 4
## 7 12 19 12
We can then use the ‘table’ function to compare our Ward method solution to the K-means solution.
table(fit_w, km.opt$cluster)
##
## fit_w 1 2 3 4
## 1 0 0 7 0
## 2 0 0 0 12
## 3 16 1 1 1
## 4 0 12 0 0
The Ward method cluster numbers are listed vertically and the K-means cluster numbers horizontally. The table shows that the two solutions are very similar, differing only in the assignment of 3 states (see the fit_w row containing 16, 1, 1, 1). In the Ward solution these 19 states form a single cluster, while K-means keeps 16 of them together and assigns one of the remaining three states to each of the other three clusters.
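If we want to quantify the agreement between the two partitions with a single number, we could compute the adjusted Rand index, for example via the mclust package (an extra dependency, not used elsewhere in this analysis):
# requires the mclust package: install.packages("mclust")
mclust::adjustedRandIndex(fit_w, km.opt$cluster)  # 1 means identical partitions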
Although similar, the solutions differ because hierarchical clustering builds a nested tree from the pairwise distances between data points, while K-means directly minimises the total within-cluster sum of squares starting from a set of initial centroids. Hierarchical clustering is deterministic given the distance matrix and linkage method, whereas K-means is faster and more scalable, although a poor initial choice of centroids can converge to a poor local minimum (hence the use of multiple random starts).