Hierarchical Clustering

# Load the data
data("USArrests")
# Standardize the data
df <- scale(USArrests)
# Show the first 6 rows
head(df, nrow = 6)

##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207

You need to compute the distance matrix

# Compute the dissimilarity matrix
# df = the standardized data
res.dist <- dist(df, method = "euclidean")

Reforming in the form of matrix

as.matrix(res.dist)[1:6, 1:6]

##             Alabama   Alaska  Arizona Arkansas California Colorado
## Alabama    0.000000 2.703754 2.293520 1.289810   3.263110 2.651067
## Alaska     2.703754 0.000000 2.700643 2.826039   3.012541 2.326519
## Arizona    2.293520 2.700643 0.000000 2.717758   1.310484 1.365031
## Arkansas   1.289810 2.826039 2.717758 0.000000   3.763641 2.831051
## California 3.263110 3.012541 1.310484 3.763641   0.000000 1.287619
## Colorado   2.651067 2.326519 1.365031 2.831051   1.287619 0.000000

R base function hclust() can be used to create the hierarchical tree

res.hc <- hclust(d=res.dist,method="ward.D2")

Next we visualize the dendogram

# cex: label size
library("factoextra")

## Warning: package 'factoextra' was built under R version 3.6.2

## Loading required package: ggplot2

## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

fviz_dend(res.hc, cex = 0.5)

The height of the tree is called Cophenetic Distance. Lets verify that the distances in the tree reflect the original distances accurately.

# Compute cophentic distance
res.coph <- cophenetic(res.hc)
# Correlation between cophenetic distance and
# the original distance
cor(res.dist, res.coph)

## [1] 0.6975266

The closer the value of the correlation coe!cient is to 1, the more accurately the clustering solution reflects your data. Values above 0.75 are felt to be good. The “average” linkage method appears to produce high values of this statistic.

Lets use the average method of linkage, and assess

res.hc2 <- hclust(res.dist, method = "average")
cor(res.dist, cophenetic(res.hc2))

## [1] 0.7180382

It represents the original trees slighly better. Make Clusters- or cut the tree into different groups

# Cut tree into 4 groups
grp <- cutree(res.hc, k = 4)
head(grp, n = 7)

##     Alabama      Alaska     Arizona    Arkansas  California    Colorado 
##           1           2           2           3           2           2 
## Connecticut 
##           3

# Number of members in each cluster
table(grp)

## grp
##  1  2  3  4 
##  7 12 19 12

# Get the names for the members of cluster 1
rownames(df)[grp == 1]

## [1] "Alabama"        "Georgia"        "Louisiana"      "Mississippi"   
## [5] "North Carolina" "South Carolina" "Tennessee"

You can visualize the cut

# Cut in 4 groups and color by groups
fviz_dend(res.hc, k = 4, # Cut in four groups
cex = 0.5, # label size
k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
color_labels_by_k = TRUE, # color labels by groups
rect = TRUE # Add rectangle around groups
)

We can draw a cluster plot

fviz_cluster(list(data = df, cluster = grp),
palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
ellipse.type = "convex", # Concentration ellipse
repel = TRUE, # Avoid label overplotting (slow)
show.clust.cent = FALSE, ggtheme = theme_minimal())

Auto Hierarchical Clustering. You can use agnes() and diana() for computing agglomerative and divisive clustering.

library("cluster")

## Warning: package 'cluster' was built under R version 3.6.2

# Agglomerative Nesting (Hierarchical Clustering)
res.agnes <- agnes(x = USArrests, # data matrix
stand = TRUE, # Standardize the data
metric = "euclidean", # metric for distance matrix
method = "ward" # Linkage method
)

You can use diana() also

#DIvisive ANAlysis Clustering
res.diana <- diana(x = USArrests, # data matrix
stand = TRUE, # standardize the data
metric = "euclidean" # metric for distance matrix
)

Hierarchical Clustering

Priyank Goyal

22/03/2020