load libraries needed
suppressPackageStartupMessages(library(factoextra))
suppressPackageStartupMessages(library(cluster))
load and prepare dataset
df <- USArrests
remove missing values
df <- na.omit(df)
scale each variable to have mean of 0 and standard deviation of 1
df <- scale(df)
head(df)
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
**Start clustering
Since we don’t know beforehand which method will produce the best clusters, we can write a short function to perform hierarchical clustering using several different methods.**
define linkage methods
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
function to compute agglomerative coefficient
ac <- function(x) {
agnes(df, method = x)$ac
}
calculate agglomerative coefficient for each clustering linkage method
sapply(m, ac)
## average single complete ward
## 0.7379371 0.6276128 0.8531583 0.9346210
We can see that Ward’s minimum variance method produces the highest agglomerative coefficient, thus we’ll use that as the method for our final hierarchical clustering
perform hierarchical clustering using Ward’s minimum variance and produce dendrogram
clust <- agnes(df, method = 'ward')
pltree(clust, cex = 0.6, hang = -1, main = "Dendrogram")
As we move up the dendrogram from the bottom, observations that are similar to each other are fused together into a branch.
Determine the optimal numbers of clusters
calculate the gap statistics for each number of clusters
gap_stat <- clusGap(df, FUN = hcut, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)
Apply Cluster Labels to original dataset
Compute distance matrix
d <- dist(df, method = "euclidean")
perform heirarchical clustering using Ward’s method
final_clust <- hclust(d, method = "ward.D2")
cut the dendrogram into 4 clusters
groups <- cutree(final_clust, k=4)
table(groups)
## groups
## 1 2 3 4
## 7 12 19 12
append clusters labels to original data
final_data <- cbind(USArrests, clusters = groups)
head(final_data)
## Murder Assault UrbanPop Rape clusters
## Alabama 13.2 236 58 21.2 1
## Alaska 10.0 263 48 44.5 2
## Arizona 8.1 294 80 31.0 2
## Arkansas 8.8 190 50 19.5 3
## California 9.0 276 91 40.6 2
## Colorado 7.9 204 78 38.7 2
find the mean values for each clusters
aggregate(final_data, by = list(clusters = final_data$clusters),
mean)
## clusters Murder Assault UrbanPop Rape clusters
## 1 1 14.671429 251.2857 54.28571 21.68571 1
## 2 2 10.966667 264.0000 76.50000 33.60833 2
## 3 3 6.210526 142.0526 71.26316 19.18421 3
## 4 4 3.091667 76.0000 52.08333 11.83333 4
The mean number of murders per 100,100 citizens among the state in cluster 1 is 146.67
The mean number of assaults per 100,000 citizens among the states in cluster 1 is 251.28.
The mean percentage of residents living in an urban area among the states in cluster 1 is 54.28%.
The mean number of rapes per 100,000 citizens among the states in cluster 1 is 21.68. Zach (2020)