Hierarchical clustering is an unsupervised learning technique used to group observations based on their similarity. Unlike methods that require the number of clusters to be specified in advance, hierarchical clustering builds a nested structure of clusters that can be explored at different levels of granularity.
The result is often represented as a dendrogram, a tree-like diagram that shows how observations are progressively merged (or split) into clusters based on a chosen distance measure and linkage method. Observations that are more similar are joined at lower levels of the tree, while more dissimilar groups are connected higher up.
In the following section, we will learn how to compute hierarchical clustering in R, how to choose appropriate distance and linkage methods, and how to interpret the resulting dendrogram.
The first step in hierarchical clustering is to compute a distance matrix, which quantifies how dissimilar each pair of observations is. For a dataset with n observations, this results in an n × n matrix where each entry represents the distance between two observations.
In R, this is typically done using the dist() function.
The most commonly used distance measure is Euclidean distance, which
reflects the straight-line distance between two observations in
multidimensional space. However, other distance measures can also be
used depending on the nature of the data, such as Manhattan distance or
maximum distance.
The choice of distance measure is important because it directly affects how similarity between observations is defined. In addition, since distance calculations are sensitive to the scale of the variables, it is often necessary to standardize the data beforehand when variables are measured on different scales.
Syntax:
dist(x, method = "euclidean")
Where:
- x: Data matrix or data frame containing the observations (rows) and variables (columns)
- method: Specifies the distance measure to be used. Default is Euclidean.
The output of dist() is a distance object that contains
all pairwise distances between observations. This object can then be
passed directly to clustering functions such as
hclust().
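As a small illustration of the measures described above, the sketch below computes three distance matrices for a tiny made-up data matrix. The object name toy and its values are invented for this example and are not part of the wine data.

```r
# Toy data: three observations (rows) measured on two variables (columns).
# The values are invented for illustration only.
toy <- matrix(c(0, 0,
                3, 4,
                6, 8),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("a", "b", "c"), c("x", "y")))

d_eucl <- dist(toy, method = "euclidean")  # straight-line distance
d_manh <- dist(toy, method = "manhattan")  # sum of absolute differences
d_max  <- dist(toy, method = "maximum")    # largest absolute difference

# A dist object stores only the lower triangle; as.matrix() expands it
# to the full symmetric n x n matrix.
as.matrix(d_eucl)
```

For the pair a and b, the Euclidean distance is 5, the Manhattan distance is 7 and the maximum distance is 4, which shows how the same two observations can look more or less similar depending on the measure.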
Syntax:
hclust(d, method = "complete")
Where:
- d: Distance object (typically created using dist())
- method: Specifies the linkage method used to combine clusters. Default is complete.
The output of hclust() is an object that contains the hierarchical clustering structure, which can be visualized using a dendrogram with the plot() function.
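To see what the hclust object contains, the following minimal sketch clusters four invented points on a line. The data and the name toy are assumptions made for illustration; only hclust(), dist() and plot() come from the text.

```r
# Four invented points on a line: two tight pairs that lie far apart.
toy <- matrix(c(1, 2, 10, 11), ncol = 1,
              dimnames = list(c("a", "b", "c", "d"), "x"))

hc <- hclust(dist(toy), method = "complete")

hc$merge   # which observations or clusters are joined at each step
hc$height  # dissimilarity at which each merge happens
plot(hc, main = "Toy dendrogram", xlab = "")
```

The two pairs merge at height 1, and the final merge happens at height 10, the largest distance between a point in one pair and a point in the other, which is how complete linkage measures the distance between clusters.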
First, we need to read our data into R. Throughout this example we will use the wine data. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
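The attributes are the 13 constituents named in the colnames() call below, together with the Cultivar label.
The wine data are stored as a comma-separated text file, so to read in the data we can use the read.table() function in R with sep=",".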
wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep=",")
colnames(wine) <- c("Cultivar","Alcohol","Malic acid","Ash","Alcalinity of ash","Magnesium","Total phenols","Flavanoids","Nonflavanoid phenols","Proanthocyanins","Color intensity","Hue","OD280/OD315 of diluted wines","Proline")
dim(wine)
#> [1] 178 14
head(wine, 5)
#>   Cultivar Alcohol Malic acid  Ash Alcalinity of ash Magnesium Total phenols
#> 1        1   14.23       1.71 2.43              15.6       127          2.80
#> 2        1   13.20       1.78 2.14              11.2       100          2.65
#> 3        1   13.16       2.36 2.67              18.6       101          2.80
#> 4        1   14.37       1.95 2.50              16.8       113          3.85
#> 5        1   13.24       2.59 2.87              21.0       118          2.80
#>   Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity  Hue
#> 1       3.06                 0.28            2.29            5.64 1.04
#> 2       2.76                 0.26            1.28            4.38 1.05
#> 3       3.24                 0.30            2.81            5.68 1.03
#> 4       3.49                 0.24            2.18            7.80 0.86
#> 5       2.69                 0.39            1.82            4.32 1.04
#>   OD280/OD315 of diluted wines Proline
#> 1                         3.92    1065
#> 2                         3.40    1050
#> 3                         3.17    1185
#> 4                         3.45    1480
#> 5                         2.93     735
The wine dataset contains 178 observations of 14 variables, including the 13 measured quantities of chemicals and the variable Cultivar, which indicates the type of grape from which the wine was produced.
The measured attributes have very different ranges, so the data should be standardized before performing hierarchical clustering to ensure that all variables contribute equally to the analysis.
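The object wine_stand used in the code that follows is not defined in the text shown here. One plausible construction, offered as an assumption rather than as the original code, is to standardize the 13 chemical variables with scale() while leaving the Cultivar label untouched. The helper name standardize_features and the small data frame toy are hypothetical and exist only to make the sketch self-contained.

```r
# Hypothetical helper (not from the original text): z-score every column
# except the label column, so each variable has mean 0 and sd 1.
standardize_features <- function(df, label_col = 1) {
  out <- df
  out[, -label_col] <- scale(df[, -label_col])
  out
}

# Invented stand-in data with a label column and two variables on very
# different scales.
toy <- data.frame(Cultivar = c(1, 1, 2, 2),
                  Alcohol  = c(14.2, 13.2, 12.4, 12.3),
                  Proline  = c(1065, 1050, 520, 680))

toy_stand <- standardize_features(toy)
colMeans(toy_stand[, -1])       # approximately 0 for each variable
apply(toy_stand[, -1], 2, sd)   # 1 for each variable

# Under this assumption the wine data would be prepared with:
# wine_stand <- standardize_features(wine)
```

Keeping the label in column 1 matches the later calls that select columns 2 to 14 for the distance computation.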
We create distance matrices with different distance measures:
dist_eucl <- dist(wine_stand[,2:14],method='euclidean')
dist_manh <- dist(wine_stand[,2:14],method='manhattan')
dist_max <- dist(wine_stand[,2:14],method='maximum')
dist_mink <- dist(wine_stand[,2:14],method='minkowski')
Then we perform hierarchical clustering on the distance matrix using different linkage methods:
clusters_ward <- hclust(dist_eucl, method="ward")
#> The "ward" method has been renamed to "ward.D"; note new "ward.D2"
clusters_complete <- hclust(dist_eucl, method="complete")
clusters_average <- hclust(dist_eucl, method="average")
clusters_single <- hclust(dist_eucl, method="single")
We can visualize the dendrogram using the plot() function:
plot(clusters_ward, main="Ward linkage",xlab="")
plot(clusters_complete, main="Complete linkage",xlab="")
plot(clusters_average, main="Average linkage",xlab="")
plot(clusters_single, main="Single linkage",xlab="")
Different linkage methods can lead to substantially different dendrograms, as they define the distance between clusters in different ways. As a result, the structure of the clusters and the level at which observations are merged can vary considerably, so the choice of linkage method can have a strong impact on the final clustering solution.
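To make the difference between linkage definitions concrete, the sketch below computes the single, complete and average linkage distances between two small invented clusters by hand. The vectors A and B are made up for this illustration; Ward's criterion is left out because it is based on within-cluster variance rather than on pairwise distances alone.

```r
# Two invented clusters of points on a line.
A <- c(0, 1)
B <- c(4, 6, 9)

# All pairwise distances between a member of A and a member of B
# (rows correspond to A, columns to B).
pairwise <- abs(outer(A, B, "-"))

single_link   <- min(pairwise)   # distance between the closest pair
complete_link <- max(pairwise)   # distance between the farthest pair
average_link  <- mean(pairwise)  # mean over all pairs

c(single = single_link, complete = complete_link, average = average_link)
```

The same two clusters are 3 apart under single linkage, 9 apart under complete linkage and about 5.83 apart under average linkage, so the order in which clusters are merged can differ between methods.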
The cutree() function in R is used to convert the
hierarchical structure produced by hclust() into a
practical clustering solution by assigning each observation to a
cluster.
The basic idea is that a dendrogram represents clusters being merged step by step. By “cutting” the tree at a certain level, we decide where to stop merging and thus define the final clusters.
Syntax:
cutree(tree, k = NULL, h = NULL)
Where:
- tree: A tree as produced by hclust
- k: The desired number of groups
- h: The height where the tree should be cut
There are two main ways to use cutree():
- By specifying the number of clusters (k): You choose how many clusters you want, and R cuts the dendrogram at the appropriate level to produce exactly that number of groups.
- By specifying the height (h): You cut the dendrogram at a specific height (i.e., level of dissimilarity). Observations that are merged below this height belong to the same cluster.
The output is a vector of cluster memberships, where each value indicates the cluster assignment of an observation. For example, a result like 1 1 2 2 3 means that the first two observations belong to cluster 1, the next two to cluster 2, and the last one to cluster 3.
In practice, cutree() is often used together with the
dendrogram plot: you visually inspect the dendrogram to decide a
reasonable number of clusters or cut height, and then use
cutree() to extract those clusters for further
analysis.
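Both ways of cutting can be tried on a small example. The sketch below uses five invented points on a line (the name toy and its values are assumptions for illustration) and cuts the resulting tree once by number of clusters and once by height.

```r
# Five invented points on a line: two tight pairs and one distant point.
toy <- matrix(c(1, 2, 10, 11, 30), ncol = 1,
              dimnames = list(c("a", "b", "c", "d", "e"), "x"))
hc <- hclust(dist(toy), method = "complete")

by_k <- cutree(hc, k = 3)  # ask for exactly three clusters
by_h <- cutree(hc, h = 5)  # keep only merges that happen below height 5

by_k
by_h
```

In this example both calls return the memberships 1 1 2 2 3, because the only merges below height 5 are the two tight pairs, which leaves exactly three clusters.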
In this example, we know that the data contain three distinct groups corresponding to the three wine cultivars. Therefore, we choose to cut the dendrogram into three clusters (i.e., set k = 3). This allows us to compare the clustering results with the true group structure and assess how well the hierarchical clustering method recovers the underlying cultivars.
CC_ward <- cutree(clusters_ward, 3)
table(CC_ward, wine$Cultivar)
#>
#> CC_ward  1  2  3
#>       1 58  7  0
#>       2  1 58  0
#>       3  0  6 48
CC_complete <- cutree(clusters_complete, 3)
table(CC_complete, wine$Cultivar)
#>
#> CC_complete  1  2  3
#>           1 51 18  0
#>           2  8 50  0
#>           3  0  3 48
CC_average <- cutree(clusters_average, 3)
table(CC_average, wine$Cultivar)
#>
#> CC_average  1  2  3
#>          1 58 68 48
#>          2  1  2  0
#>          3  0  1  0
CC_single <- cutree(clusters_single, 3)
table(CC_single, wine$Cultivar)
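#>
#> CC_single  1  2  3
#>         1 59 67 48
#>         2  0  3  0
#>         3  0  1  0
From the results, we observe that the Ward linkage method performs best. If each cluster is labeled with its most frequent cultivar, Ward linkage misassigns 14 of the 178 observations (7 + 1 + 6), compared with 29 for complete linkage (18 + 8 + 3). Average and single linkage each place 174 of the 178 observations in a single cluster and therefore fail to separate the cultivars. This suggests that Ward’s method is most effective at recovering the true underlying group structure of the data in this case.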
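The comparison can also be quantified directly from the cross-tabulations. The sketch below re-enters the Ward and complete linkage tables printed above as matrices and counts the observations that fall outside the most frequent cultivar of their cluster. The helper name count_misassigned is hypothetical, and majority labeling is only one of several reasonable ways to score a clustering against known groups.

```r
# Cross-tabulations copied from the output above
# (rows = clusters, columns = cultivars).
tab_ward <- matrix(c(58,  7,  0,
                      1, 58,  0,
                      0,  6, 48), nrow = 3, byrow = TRUE)
tab_complete <- matrix(c(51, 18,  0,
                          8, 50,  0,
                          0,  3, 48), nrow = 3, byrow = TRUE)

# Hypothetical helper: treat the most frequent cultivar in each cluster
# as that cluster's label and count everything else as misassigned.
count_misassigned <- function(tab) sum(tab) - sum(apply(tab, 1, max))

count_misassigned(tab_ward)      # 14
count_misassigned(tab_complete)  # 29
```

With the objects from the example available, the same helper could be applied to table(CC_ward, wine$Cultivar) directly instead of a re-typed matrix.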
The rect.hclust() function in R is used to highlight clusters on a dendrogram by drawing rectangles around them. This is particularly useful for visualizing the result after choosing a specific number of clusters. The border argument lets you choose the colors of the rectangles.
plot(clusters_ward, main="Ward linkage",xlab="")
rect.hclust(clusters_ward, k=3, border=2:5)
plot(clusters_complete, main="Complete linkage",xlab="")
rect.hclust(clusters_complete, k=3, border=2:5)
plot(clusters_average, main="Average linkage",xlab="")
rect.hclust(clusters_average, k=3, border=2:5)
plot(clusters_single, main="Single linkage",xlab="")
rect.hclust(clusters_single, k=3, border=2:5)
The fviz_cluster() function (from the factoextra package) provides a convenient way to visualize clustering results in a low-dimensional space, typically using principal components. The function plots the observations and colors them according to their cluster membership, making it easy to see how well the clusters are separated.
The resulting plot shows the data projected onto the first two principal components, with points colored by cluster and optional ellipses indicating cluster boundaries. This provides a clear visual assessment of how well-separated the clusters are.
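The idea behind such a plot can be sketched with base R alone. The example below is a rough analogue under stated assumptions, not a description of how fviz_cluster() is implemented: it clusters invented data, projects it with prcomp(), and colors the first two principal component scores by cluster. The object names and the simulated values are made up for illustration.

```r
set.seed(1)  # make the invented data reproducible

# Invented data: two groups of 10 points each in four dimensions.
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
             matrix(rnorm(40, mean = 5), ncol = 4))

# Cluster the points and cut the tree into two groups.
groups <- cutree(hclust(dist(toy), method = "complete"), k = 2)

# Project onto principal components and plot the first two,
# coloring each point by its cluster membership.
pca <- prcomp(toy)
plot(pca$x[, 1:2], col = groups, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Clusters in PCA space")
```

Well-separated clusters appear as distinct clouds of points in this projection, while overlapping clouds indicate clusters that are hard to distinguish in the first two components.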
library(factoextra)
wine_clusters <- cutree(clusters_ward, 3)
rownames(wine_stand) <- paste(wine$Cultivar, 1:dim(wine)[1],sep="_")
fviz_cluster(list(data=wine_stand,cluster=wine_clusters))
The ape package (Analysis of Phylogenetics and Evolution) provides tools that can also be used to visualize hierarchical clustering results in a tree format that is often more flexible than the base R dendrogram.
The function as.phylo() is used to convert an
hclust object into a phylogenetic tree object. This allows
us to take the hierarchical clustering result and represent it using
tree structures that can be further customised and visualised.
Once converted, the tree can be plotted using the plot() function from ape, which often produces a cleaner and more flexible dendrogram-style visualization. Additional options are available to rotate the tree, adjust labels, and improve readability.
library(ape)
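# One color per cluster; tips are colored by their cluster membership.
colors <- c("red", "blue", "green")
# Circular ("fan") layout of the Ward tree.
plot(as.phylo(clusters_ward), type = "fan", tip.color = colors[wine_clusters],
     label.offset = 1, cex = 0.7)
# Unrooted layout of the same tree.
plot(as.phylo(clusters_ward), type = "unrooted", cex = 0.6,
     no.margin = TRUE, tip.color = colors[wine_clusters],
     label.offset = 1)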