Hierarchical Clustering

load libraries needed

library(factoextra)

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(cluster)

load and prepare dataset

df <- USArrests

remove missing values

df <- na.omit(df)

scale each variable to have mean of 0 and standard deviation of 1

df <- scale(df)

head(df)

##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207

**Start clustering

Since we don’t know beforehand which method will produce the best clusters, we can write a short function to perform hierarchical clustering using several different methods.**

define linkage methods

m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

function to compute agglomerative coefficient

ac <- function(x) {
  agnes(df, method = x)$ac
}

calculate agglomerative coefficient for each clustering linkage method

sapply(m, ac)

##   average    single  complete      ward 
## 0.7379371 0.6276128 0.8531583 0.9346210

We can see that Ward’s minimum variance method produces the highest agglomerative coefficient, thus we’ll use that as the method for our final hierarchical clustering

perform hierarchical clustering using Ward’s minimum variance and produce dendrogram

clust <- agnes(df, method = 'ward')
pltree(clust, cex = 0.6, hang = -1, main = "Dendrogram")

As we move up the dendrogram from the bottom, observations that are similar to each other are fused together into a branch.

Determine the optimal numbers of clusters

calculate the gap statistics for each number of clusters

gap_stat <- clusGap(df, FUN = hcut, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)

From the plot we can see that the gap statistic is highest at k = 4 clusters. Thus, we’ll choose to group our observations into 4 distinct clusters

Apply Cluster Labels to original dataset

Compute distance matrix

d <- dist(df, method = "euclidean")

perform heirarchical clustering using Ward’s method

final_clust <- hclust(d, method = "ward.D2")

cut the dendrogram into 4 clusters

groups <- cutree(final_clust, k=4)
table(groups)

## groups
##  1  2  3  4 
##  7 12 19 12

append clusters labels to original data

final_data <- cbind(USArrests, clusters = groups)
head(final_data)

find the mean values for each clusters

aggregate(final_data, by = list(clusters = final_data$clusters),
          mean)

The mean number of murders per 100,100 citizens among the state in cluster 1 is 146.67
The mean number of assaults per 100,000 citizens among the states in cluster 1 is 251.28.
The mean percentage of residents living in an urban area among the states in cluster 1 is 54.28%.
The mean number of rapes per 100,000 citizens among the states in cluster 1 is 21.68.

Hierarchical Clustering

Duque, Don Gerard H. BS-Math

2022-06-20