load libraries needed

library(factoextra)
library(cluster)

load and prepare dataset

df <- USArrests

remove missing values

df <- na.omit(df)

scale each variable to have mean of 0 and standard deviation of 1

df <- scale(df)

head(df)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207
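
As a quick sanity check on the scaling step, the sketch below confirms that each column of the scaled matrix has a mean of approximately 0 and a standard deviation of 1. The name `df_scaled` is used here only for illustration and is not part of the original walkthrough.

```r
# Re-create the scaled data so this check runs on its own
df_scaled <- scale(na.omit(USArrests))

# Column means should be 0 up to floating-point error
round(colMeans(df_scaled), 10)

# Column standard deviations should all be 1
apply(df_scaled, 2, sd)
```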

Start clustering

Since we don’t know beforehand which linkage method will produce the best clusters, we can write a short function that performs hierarchical clustering with several different methods.

define linkage methods

m <- c("average", "single", "complete", "ward")
names(m) <- m

function to compute agglomerative coefficient

ac <- function(x) {
  agnes(df, method = x)$ac
}

calculate agglomerative coefficient for each clustering linkage method

sapply(m, ac)
##   average    single  complete      ward 
## 0.7379371 0.6276128 0.8531583 0.9346210

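We can see that Ward’s minimum variance method produces the highest agglomerative coefficient (about 0.93), so we’ll use it as the method for our final hierarchical clustering. The coefficient lies between 0 and 1, and values closer to 1 indicate a stronger clustering structure.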
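
To make the agglomerative coefficient less of a black box, the following sketch recomputes it by hand for Ward's method. It assumes the usual definition: for each observation, take the height at which it is first merged, divide by the height of the final merge, and average one minus that ratio over all observations. The names `fit`, `hc`, and `ac_manual` are illustrative.

```r
library(cluster)

df <- scale(na.omit(USArrests))

fit <- agnes(df, method = "ward")
hc  <- as.hclust(fit)  # merge steps, with heights sorted in ascending order

# Each row of hc$merge is one fusion; a negative entry denotes a single
# observation. rowSums(hc$merge < 0) therefore counts the observations that
# join their first cluster at that step, at dissimilarity hc$height.
first_merge_total <- sum(rowSums(hc$merge < 0) * hc$height)

n <- nrow(df)
ac_manual <- 1 - first_merge_total / max(hc$height) / n

# The hand-computed value next to the one reported by agnes()
c(manual = ac_manual, agnes = fit$ac)
```

Because the coefficient is a ratio of heights, it does not depend on the scale of the dissimilarities, only on how early observations merge relative to the final fusion.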

perform hierarchical clustering using Ward’s minimum variance method and produce a dendrogram

clust <- agnes(df, method = 'ward')
pltree(clust, cex = 0.6, hang = -1, main = "Dendrogram")

As we move up the dendrogram from the bottom, observations that are similar to each other are fused together into a branch.

Determine the optimal number of clusters

calculate the gap statistic for each number of clusters

gap_stat <- clusGap(df, FUN = hcut, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)
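
Besides reading the plot, the suggested number of clusters can be extracted numerically. The sketch below is one way to do that under a few assumptions: `hc_fun` is a hypothetical helper that wraps `hclust()` and `cutree()` so the example does not depend on factoextra, the seed value is arbitrary, and `"firstSEmax"` is only one of several selection rules offered by `maxSE()`.

```r
library(cluster)

df <- scale(na.omit(USArrests))

# Hypothetical helper: Ward clustering cut into k groups, returned in the
# list(cluster = ...) form that clusGap() expects from FUN
hc_fun <- function(x, k, ...) {
  list(cluster = cutree(hclust(dist(x), method = "ward.D2"), k = k))
}

set.seed(123)  # the reference data sets are simulated, so fix the seed
gap_stat <- clusGap(df, FUN = hc_fun, K.max = 10, B = 50)

# Choose k from the gap values and their simulation standard errors
k_best <- maxSE(f = gap_stat$Tab[, "gap"],
                SE.f = gap_stat$Tab[, "SE.sim"],
                method = "firstSEmax")
k_best
```

Other rules such as `"globalmax"` or `"Tibs2001SEmax"` may suggest a different k, so it is worth comparing the result against the plot.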

Apply Cluster Labels to original dataset

Compute distance matrix

d <- dist(df, method = "euclidean")

perform hierarchical clustering using Ward’s method

final_clust <- hclust(d, method = "ward.D2")

cut the dendrogram into 4 clusters

groups <- cutree(final_clust, k=4)
table(groups)
## groups
##  1  2  3  4 
##  7 12 19 12
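
The dendrogram earlier was built with `agnes()`, while the final clustering uses `hclust()`. The `hclust()` documentation notes that `agnes(*, method = "ward")` corresponds to `hclust(*, "ward.D2")`, so both routes should give the same partition. The sketch below checks this for k = 4; the object names are illustrative.

```r
library(cluster)

df <- scale(na.omit(USArrests))

# The same Ward clustering through the two interfaces
groups_agnes  <- cutree(as.hclust(agnes(df, method = "ward")), k = 4)
groups_hclust <- cutree(hclust(dist(df), method = "ward.D2"), k = 4)

# Cross-tabulate the labels: if the partitions agree, each row and each
# column has exactly one non-zero cell
agreement <- table(agnes = groups_agnes, hclust = groups_hclust)
agreement
```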

append cluster labels to original data

final_data <- cbind(USArrests, clusters = groups)
head(final_data)

find the mean values for each cluster

aggregate(final_data, by = list(clusters = final_data$clusters),
          mean)
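
The call above also averages the `clusters` column itself, so the result contains that column twice. A possible alternative, sketched here with the illustrative name `cluster_means`, is the formula interface of `aggregate()`, which uses `clusters` only as the grouping variable.

```r
# Rebuild the labelled data so this example runs on its own
df <- scale(na.omit(USArrests))
groups <- cutree(hclust(dist(df), method = "ward.D2"), k = 4)
final_data <- cbind(USArrests, clusters = groups)

# Formula interface: average every other column within each cluster
cluster_means <- aggregate(. ~ clusters, data = final_data, FUN = mean)
cluster_means
```

The means are reported on the original (unscaled) units, which makes the cluster profiles easier to interpret.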