load libraries needed

suppressPackageStartupMessages(library(factoextra))
suppressPackageStartupMessages(library(cluster))

load and prepare dataset

df <- USArrests

remove missing values

df <- na.omit(df)

scale each variable to have mean of 0 and standard deviation of 1

df <- scale(df)

head(df)

##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207

**Start clustering

Since we don’t know beforehand which method will produce the best clusters, we can write a short function to perform hierarchical clustering using several different methods.**

define linkage methods

m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

function to compute agglomerative coefficient

ac <- function(x) {
  agnes(df, method = x)$ac
}

calculate agglomerative coefficient for each clustering linkage method

sapply(m, ac)

##   average    single  complete      ward 
## 0.7379371 0.6276128 0.8531583 0.9346210

We can see that Ward’s minimum variance method produces the highest agglomerative coefficient, thus we’ll use that as the method for our final hierarchical clustering

perform hierarchical clustering using Ward’s minimum variance and produce dendrogram

clust <- agnes(df, method = 'ward')
pltree(clust, cex = 0.6, hang = -1, main = "Dendrogram")

As we move up the dendrogram from the bottom, observations that are similar to each other are fused together into a branch.

Determine the optimal numbers of clusters

calculate the gap statistics for each number of clusters

gap_stat <- clusGap(df, FUN = hcut, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)

From the plot we can see that the gap statistic is highest at k = 4 clusters. Thus, we’ll choose to group our observations into 4 distinct clusters

Apply Cluster Labels to original dataset

Compute distance matrix

d <- dist(df, method = "euclidean")

perform heirarchical clustering using Ward’s method

final_clust <- hclust(d, method = "ward.D2")

cut the dendrogram into 4 clusters

groups <- cutree(final_clust, k=4)
table(groups)

## groups
##  1  2  3  4 
##  7 12 19 12

append clusters labels to original data

final_data <- cbind(USArrests, clusters = groups)
head(final_data)

##            Murder Assault UrbanPop Rape clusters
## Alabama      13.2     236       58 21.2        1
## Alaska       10.0     263       48 44.5        2
## Arizona       8.1     294       80 31.0        2
## Arkansas      8.8     190       50 19.5        3
## California    9.0     276       91 40.6        2
## Colorado      7.9     204       78 38.7        2

find the mean values for each clusters

aggregate(final_data, by = list(clusters = final_data$clusters),
          mean)

##   clusters    Murder  Assault UrbanPop     Rape clusters
## 1        1 14.671429 251.2857 54.28571 21.68571        1
## 2        2 10.966667 264.0000 76.50000 33.60833        2
## 3        3  6.210526 142.0526 71.26316 19.18421        3
## 4        4  3.091667  76.0000 52.08333 11.83333        4

The mean number of murders per 100,100 citizens among the state in cluster 1 is 146.67
The mean number of assaults per 100,000 citizens among the states in cluster 1 is 251.28.
The mean percentage of residents living in an urban area among the states in cluster 1 is 54.28%.
The mean number of rapes per 100,000 citizens among the states in cluster 1 is 21.68. Zach (2020)

References

Zach. 2020. “Hierarchical Clustering in r: Step-by-Step Example.” Statology. https://www.statology.org/hierarchical-clustering-in-r/.

Hierarchical Clustering

Jhon Paul Kionesala BS Mathematics

2022-06-18

References