Hierarchical Clustering tutorial

Background
Introduction
Details:
Examples:
Conclusion:
Reference:

Background

Hierarchical clustering is a clustering method that does not require the number of clusters to be specified in advance.
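As a minimal sketch of this point, assuming the built-in USArrests data set that the examples later in this tutorial use: the call to hclust() takes no cluster count, and a grouping is chosen only afterwards with cutree(). The values k = 3 and h = 150 are arbitrary choices for illustration.

```r
# Build the tree once; no number of clusters is supplied here.
hc <- hclust(dist(USArrests))

# Choose a grouping afterwards, either by count or by cut height.
by_count  <- cutree(hc, k = 3)    # ask for 3 clusters (arbitrary choice)
by_height <- cutree(hc, h = 150)  # cut at height 150 (arbitrary choice)

table(by_count)             # cluster sizes for the 3-cluster cut
length(unique(by_height))   # how many clusters the height cut produces
```

The same tree can be cut again with a different k or h without refitting anything.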

Introduction

Hierarchical clustering produces a tree-based representation of the observations, called a dendrogram. The clusters obtained by cutting the dendrogram at a given height are nested within the clusters obtained by cutting it at any greater height. When the data do not actually have such a nested structure, hierarchical clustering can yield worse (i.e. less accurate) results than K-means clustering for a given number of clusters.
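The nesting property can be checked directly. The sketch below assumes the built-in USArrests data and complete linkage. It cuts one tree into 2 and into 4 clusters (both counts are arbitrary) and cross-tabulates the two labelings.

```r
# Build one tree, then cut it at two different levels.
hc <- hclust(dist(USArrests), method = "complete")

coarse <- cutree(hc, k = 2)  # cut high in the tree: 2 clusters
fine   <- cutree(hc, k = 4)  # cut lower in the tree: 4 clusters

# Cross-tabulate the two labelings. Nesting means that every fine
# cluster (row) has all of its states in exactly one coarse cluster.
tab <- table(fine, coarse)
print(tab)

nested <- all(rowSums(tab > 0) == 1)
print(nested)
```

Each row of the table should have a single non-zero entry, because the lower cut only splits clusters of the higher cut and never mixes them.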

Details:

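Hierarchical algorithm:
1. Start with n observations and a dissimilarity measure, e.g. Euclidean distance. Treat each observation as its own cluster.
2. For i = n, n − 1, . . . , 2:
(a) Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed.
(b) Compute the new pairwise inter-cluster dissimilarities among the i − 1 remaining clusters.

The inter-cluster dissimilarity is defined by the linkage: complete linkage uses the largest pairwise dissimilarity between observations in the two clusters, single linkage uses the smallest, and average linkage uses the mean.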
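The steps above can be sketched in a few lines of R. This is a deliberately naive illustration, not how hclust() is implemented: the function name naive_hclust is hypothetical, it assumes Euclidean distance and complete linkage, and it returns only the fusion heights.

```r
# Naive agglomerative clustering with complete linkage.
# `naive_hclust` is a hypothetical helper written for this tutorial.
naive_hclust <- function(x) {
  # Step 1: each row/column of d stands for one single-observation cluster.
  d <- as.matrix(dist(x))   # pairwise Euclidean distances
  diag(d) <- Inf            # a cluster is never fused with itself
  heights <- numeric(0)

  # Step 2: repeat for i = n, n - 1, ..., 2 clusters.
  while (nrow(d) > 1) {
    # (a) Find the least dissimilar pair of clusters and record the
    #     dissimilarity as the fusion height.
    idx <- which(d == min(d), arr.ind = TRUE)[1, ]
    a <- min(idx)
    b <- max(idx)
    heights <- c(heights, d[a, b])

    # (b) Complete linkage: the dissimilarity from the fused cluster to
    #     any other cluster is the larger of the two old dissimilarities.
    fused <- pmax(d[a, ], d[b, ])
    d[a, ] <- fused
    d[, a] <- fused
    d[a, a] <- Inf
    d <- d[-b, -b, drop = FALSE]   # cluster b is now part of cluster a
  }
  heights
}

# Compare the fusion heights with R's built-in implementation.
ours <- naive_hclust(USArrests)
ref  <- hclust(dist(USArrests), method = "complete")$height
all.equal(ours, ref)
```

The sketch rescans the whole dissimilarity matrix at every step, which is fine for 50 observations but slow for large data sets. Swapping pmax() for pmin() in step (b) would give single linkage instead.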

Examples:

The “USArrests” data set contains arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973, together with the percentage of the population living in urban areas. Three dendrograms are plotted for “USArrests”, using average linkage, complete linkage, and single linkage.

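d <- dist(USArrests)  # Euclidean distances between the 50 states

# Average linkage: mean pairwise dissimilarity between clusters
hc <- hclust(d, method = "average")
plot(hc)

# Complete linkage: largest pairwise dissimilarity between clusters
hc <- hclust(d, method = "complete")
plot(hc)

# Single linkage: smallest pairwise dissimilarity between clusters
hc <- hclust(d, method = "single")
plot(hc)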

Figure: 1973 USArrests in 50 states, hclust “complete”

Figure: 1973 USArrests in 50 states, hclust “average”

Figure: 1973 USArrests in 50 states, hclust “single”

Conclusion:

The complete and average methods yield more balanced trees than the single method does. The single-linkage tree also has the smallest height of the three, because single linkage uses the minimum pairwise dissimilarity between two clusters as the fusion height.
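Both observations can be checked numerically rather than read off the plots. The sketch below is one possible way to do so: it compares the height of the final fusion in each tree and, as a rough proxy for balance, the cluster sizes obtained when each tree is cut into 4 clusters (an arbitrary choice).

```r
d <- dist(USArrests)
methods <- c("complete", "average", "single")

# Fit one tree per linkage method.
trees <- lapply(methods, function(m) hclust(d, method = m))
names(trees) <- methods

# Height of the final fusion, i.e. the top of each dendrogram.
top_height <- sapply(trees, function(hc) max(hc$height))
print(top_height)

# Cluster sizes when each tree is cut into 4 clusters. Very uneven
# sizes suggest an unbalanced tree.
sizes <- lapply(trees, function(hc) table(cutree(hc, k = 4)))
print(sizes)
```

Single linkage tends to add observations to an existing cluster one at a time, so its cuts often contain one large cluster and several very small ones.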

Reference:

@Manual{hclust,
  title = {hclust: Hierarchical Clustering},
  author = {{R Core Team}},
  organization = {Rdocumentation},
  url = {https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/hclust},
}

@Manual{rmarkdown-cheatsheet,
  title = {Dynamic documents with rmarkdown cheat sheet},
  author = {{Rstudio}},
  year = {2021},
  organization = {Rstudio},
  url = {https://www.rstudio.com/resources/cheatsheets/},
}

@Manual{ISLR,
  title = {An Introduction to Statistical Learning with Applications in R},
  author = {Gareth James and Daniela Witten and Trevor Hastie and Robert Tibshirani},
  organization = {Springer},
  address = {New York, U.S.},
  year = {2021},
  url = {https://doi.org/10.1007/978-1-0716-1418-1_1},
}

@Manual{USArrests,
  title = {Violent Crime Rates by US State},
  author = {{Rstudio}},
  url = {https://stat.ethz.ch/R-manual/R-patched/library/datasets/html/USArrests.html},
}