Hierarchical clustering

dr. Annelies Agten

2026-04-18

Hierarchical clustering is an unsupervised learning technique used to group observations based on their similarity. Unlike methods that require the number of clusters to be specified in advance, hierarchical clustering builds a nested structure of clusters that can be explored at different levels of granularity.

The result is often represented as a dendrogram, a tree-like diagram that shows how observations are progressively merged (or split) into clusters based on a chosen distance measure and linkage method. Observations that are more similar are joined at lower levels of the tree, while more dissimilar groups are connected higher up.

In the following section, we will learn how to compute hierarchical clustering in R, how to choose appropriate distance and linkage methods, and how to interpret the resulting dendrogram.

Hierarchical clustering in R

The first step in hierarchical clustering is to compute a distance matrix, which quantifies how dissimilar each pair of observations is. For a dataset with n observations, this results in an n × n matrix where each entry represents the distance between two observations.

In R, this is typically done using the dist() function. The most commonly used distance measure is Euclidean distance, which reflects the straight-line distance between two observations in multidimensional space. However, other distance measures can also be used depending on the nature of the data, such as Manhattan distance or maximum distance.

The choice of distance measure is important because it directly affects how similarity between observations is defined. In addition, since distance calculations are sensitive to the scale of the variables, it is often necessary to standardize the data beforehand when variables are measured on different scales.
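The effect of scale can be seen in a small sketch. The data frame below is hypothetical (the names toy, height_m and income_eur are made up for illustration) and only serves to show how one large-scale variable can dominate an unstandardized Euclidean distance.

```r
# Hypothetical toy data: two variables measured on very different scales
toy <- data.frame(
  height_m   = c(1.60, 1.85, 1.62),
  income_eur = c(30000, 30500, 45000)
)

# Unscaled: the distance between observations 1 and 2 is essentially the
# income difference (about 500); the height difference barely registers
d_raw <- dist(toy, method = "euclidean")
round(as.matrix(d_raw), 2)

# Standardized: every column has mean 0 and standard deviation 1,
# so both variables contribute on a comparable scale
toy_scaled <- scale(toy)
d_scaled <- dist(toy_scaled, method = "euclidean")
round(as.matrix(d_scaled), 2)
```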

Syntax:
dist(x, method = "euclidean")

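Where:

x: a numeric matrix or data frame with one observation per row
method: the distance measure, for example "euclidean" (the default), "manhattan", "maximum" or "minkowski"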

The output of dist() is a distance object that contains all pairwise distances between observations. This object can then be passed directly to clustering functions such as hclust().
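As a minimal illustration of the distance measures mentioned above, the sketch below uses two hypothetical points (named a and b here purely for illustration) whose distances are easy to verify by hand.

```r
# Two hypothetical observations in two dimensions
pts <- rbind(a = c(0, 0), b = c(3, 4))

d_eucl <- dist(pts, method = "euclidean")  # sqrt(3^2 + 4^2) = 5
d_manh <- dist(pts, method = "manhattan")  # |3| + |4| = 7
d_max  <- dist(pts, method = "maximum")    # max(|3|, |4|) = 4

# A dist object stores each pair only once; as.matrix() expands it
# to the full symmetric matrix with zeros on the diagonal
as.matrix(d_eucl)
```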

Syntax:
hclust(d, method = "complete")

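Where:

d: a dissimilarity object as produced by dist()
method: the linkage method, for example "complete" (the default), "average", "single", "ward.D" or "ward.D2"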

The output of hclust() is an object that contains the hierarchical clustering structure, which can be visualized using a dendrogram with the plot() function.
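To get a feel for what the hclust object contains, the following sketch clusters five hypothetical values on a line (the vector x is made up for illustration) and inspects the main components of the result.

```r
# Hypothetical one-dimensional toy data
x <- c(1, 2, 6, 7, 15)
hc <- hclust(dist(x), method = "complete")

hc$merge   # which observations or clusters are joined at each step
           # (negative entries are single observations, positive entries
           #  refer to clusters formed at an earlier step)
hc$height  # dissimilarity at which each merge takes place
hc$order   # left-to-right order of the observations in the dendrogram
```

With n observations there are always n - 1 merges, so hc$merge has four rows here and hc$height has four entries.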

Read data

First we need to read our data into R. Throughout this example we will use the wine data. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

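The attributes are: alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline.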

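The wine data are stored as a comma-separated text file, so we can read them with the read.table() function in R, setting sep = ",". The file has no header row, so we assign the column names ourselves.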

wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep=",")

colnames(wine) <- c("Cultivar","Alcohol","Malic acid","Ash","Alcalinity of ash","Magnesium","Total phenols","Flavanoids","Nonflavanoid phenols","Proanthocyanins","Color intensity","Hue","OD280/OD315 of diluted wines","Proline")

dim(wine)
#> [1] 178  14

head(wine, 5)
#>   Cultivar Alcohol Malic acid  Ash Alcalinity of ash Magnesium Total phenols
#> 1        1   14.23       1.71 2.43              15.6       127          2.80
#> 2        1   13.20       1.78 2.14              11.2       100          2.65
#> 3        1   13.16       2.36 2.67              18.6       101          2.80
#> 4        1   14.37       1.95 2.50              16.8       113          3.85
#> 5        1   13.24       2.59 2.87              21.0       118          2.80
#>   Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity  Hue
#> 1       3.06                 0.28            2.29            5.64 1.04
#> 2       2.76                 0.26            1.28            4.38 1.05
#> 3       3.24                 0.30            2.81            5.68 1.03
#> 4       3.49                 0.24            2.18            7.80 0.86
#> 5       2.69                 0.39            1.82            4.32 1.04
#>   OD280/OD315 of diluted wines Proline
#> 1                         3.92    1065
#> 2                         3.40    1050
#> 3                         3.17    1185
#> 4                         3.45    1480
#> 5                         2.93     735

The wine dataset contains 178 observations of 14 variables, including the 13 measured quantities of chemicals and the variable Cultivar, which indicates the type of grape from which the wine was produced.

The measured attributes have very different ranges, so the data should be standardized before performing hierarchical clustering to ensure that all variables contribute equally to the analysis.

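# standardize each column by subtracting the mean and dividing by the standard deviation
# (Cultivar is scaled too, but only columns 2:14 are used for the distances below)
wine_stand <- as.data.frame(scale(wine))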

Perform hierarchical clustering

We create distance matrices with different distance measures:

dist_eucl <- dist(wine_stand[, 2:14], method = "euclidean")
dist_manh <- dist(wine_stand[, 2:14], method = "manhattan")
dist_max <- dist(wine_stand[, 2:14], method = "maximum")
# with the default p = 2, Minkowski distance equals Euclidean distance; set p to change it
dist_mink <- dist(wine_stand[, 2:14], method = "minkowski")

Then we perform hierarchical clustering on the distance matrix using different linkage methods:

# "ward" was renamed to "ward.D" in R versions after 3.0.3;
# "ward.D2" is the variant that implements Ward's criterion on unsquared distances
clusters_ward <- hclust(dist_eucl, method="ward.D")
clusters_complete <- hclust(dist_eucl, method="complete")
clusters_average <- hclust(dist_eucl, method="average")
clusters_single <- hclust(dist_eucl, method="single")

We can visualize the dendrogram using the plot() function:

plot(clusters_ward, main="Ward linkage",xlab="")
plot(clusters_complete, main="Complete linkage",xlab="")
plot(clusters_average, main="Average linkage",xlab="")
plot(clusters_single, main="Single linkage",xlab="")

Different linkage methods can lead to substantially different dendrograms, as they define the distance between clusters in different ways. As a result, the structure of the clusters and the level at which observations are merged can vary considerably, so the choice of linkage method can have a strong impact on the final clustering solution.
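How the linkage methods differ can be checked by hand on a tiny example. The sketch below assumes two hypothetical clusters on a line (A and B are made-up names) and computes the between-cluster distance under single, complete and average linkage, then compares these values with the final merge heights reported by hclust().

```r
# Two hypothetical clusters on a line
A <- c(1, 2)
B <- c(6, 7)

# All pairwise distances between members of A and members of B
pairwise <- abs(outer(A, B, "-"))

d_single   <- min(pairwise)   # single linkage: closest pair
d_complete <- max(pairwise)   # complete linkage: farthest pair
d_average  <- mean(pairwise)  # average linkage: mean over all pairs

# hclust() first merges within A and within B (height 1 each);
# the third and last merge joins A and B at the linkage distance
d <- dist(c(A, B))
h_single   <- hclust(d, method = "single")$height[3]
h_complete <- hclust(d, method = "complete")$height[3]
h_average  <- hclust(d, method = "average")$height[3]

c(single = h_single, complete = h_complete, average = h_average)
```

The same four points are thus joined at height 4, 6 or 5 depending on the linkage, which is why the dendrograms above can look so different.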

Cutting the tree

The cutree() function in R is used to convert the hierarchical structure produced by hclust() into a practical clustering solution by assigning each observation to a cluster.

The basic idea is that a dendrogram represents clusters being merged step by step. By “cutting” the tree at a certain level, we decide where to stop merging and thus define the final clusters.

Syntax:
cutree(tree, k = NULL, h = NULL)

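Where:

tree: a hierarchical clustering object as produced by hclust()
k: the desired number of clusters
h: the height at which the tree should be cut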

There are two main ways to use cutree():

By specifying the number of clusters (k):
You choose how many clusters you want, and R will cut the dendrogram at the appropriate level to produce exactly that number of groups.

By specifying the height (h):
You cut the dendrogram at a specific height (i.e., level of dissimilarity). Observations that are merged below this height will belong to the same cluster.

The output is a vector of cluster memberships, where each value indicates the cluster assignment of an observation. For example, a result like 1 1 2 2 3 means that the first two observations belong to cluster 1, the next two to cluster 2, and the last one to cluster 3.

In practice, cutree() is often used together with the dendrogram plot: you visually inspect the dendrogram to decide a reasonable number of clusters or cut height, and then use cutree() to extract those clusters for further analysis.
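The two ways of cutting can be compared on a small example. The sketch below uses five hypothetical values on a line (the vector x is made up for illustration); with complete linkage their merge heights are 1, 1, 6 and 14.

```r
# Hypothetical one-dimensional toy data
x <- c(1, 2, 6, 7, 15)
hc <- hclust(dist(x), method = "complete")  # merge heights: 1, 1, 6, 14

# Cut by number of clusters
by_k2 <- cutree(hc, k = 2)  # {1, 2, 6, 7} and {15}

# Cut by height: 3 lies above the merges at height 1 and below the
# merge at height 6, so three clusters remain
by_h3 <- cutree(hc, h = 3)

by_k2
by_h3

# Any height between 1 and 6 gives the same result as k = 3
all(cutree(hc, k = 3) == by_h3)
```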

In this example, we know that the data contain three distinct groups corresponding to the three wine cultivars. Therefore, we choose to cut the dendrogram into three clusters (i.e., set k = 3). This allows us to compare the clustering results with the true group structure and assess how well the hierarchical clustering method recovers the underlying cultivars.

CC_ward <- cutree(clusters_ward, 3)
table(CC_ward, wine$Cultivar)
#>        
#> CC_ward  1  2  3
#>       1 58  7  0
#>       2  1 58  0
#>       3  0  6 48

CC_complete <- cutree(clusters_complete, 3)
table(CC_complete, wine$Cultivar)
#>            
#> CC_complete  1  2  3
#>           1 51 18  0
#>           2  8 50  0
#>           3  0  3 48

CC_average <- cutree(clusters_average, 3)
table(CC_average, wine$Cultivar)
#>           
#> CC_average  1  2  3
#>          1 58 68 48
#>          2  1  2  0
#>          3  0  1  0

CC_single <- cutree(clusters_single, 3)
table(CC_single, wine$Cultivar)
#>          
#> CC_single  1  2  3
#>         1 59 67 48
#>         2  0  3  0
#>         3  0  1  0

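From the results, we observe that Ward linkage recovers the cultivars best: only 14 of the 178 wines (7 + 1 + 6) fall in a cluster dominated by a different cultivar, compared with 29 (18 + 8 + 3) for complete linkage. Average and single linkage place almost all observations in one large cluster and therefore fail to separate the cultivars. This suggests that Ward’s method is most effective at recovering the true underlying group structure of the data in this case.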
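One simple way to put a number on this comparison is to count, for each cluster, the observations that do not belong to its majority cultivar. The sketch below copies the Ward and complete-linkage tables printed above into matrices; the helper n_mismatch() is a hypothetical name introduced here, not a function from any package.

```r
# Cross-tabulations copied from the output above
# (rows: clusters, columns: cultivars 1 to 3)
tab_ward <- matrix(c(58,  7,  0,
                      1, 58,  0,
                      0,  6, 48), nrow = 3, byrow = TRUE)
tab_complete <- matrix(c(51, 18,  0,
                          8, 50,  0,
                          0,  3, 48), nrow = 3, byrow = TRUE)

# Hypothetical helper: observations outside the majority cultivar of their cluster
n_mismatch <- function(tab) sum(tab) - sum(apply(tab, 1, max))

n_mismatch(tab_ward)
n_mismatch(tab_complete)
```

This count is only a rough guide: it does not penalize a solution in which several clusters are dominated by the same cultivar, so it should be read together with the full table.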

The rect.hclust() function in R is used to highlight clusters on a dendrogram by drawing rectangles around them. This is particularly useful for visualizing the result after choosing a specific number of clusters. The border argument sets the colors of the rectangles.

plot(clusters_ward, main="Ward linkage",xlab="")
rect.hclust(clusters_ward, k=3, border=2:5)

plot(clusters_complete, main="Complete linkage",xlab="")
rect.hclust(clusters_complete, k=3, border=2:5)

plot(clusters_average, main="Average linkage",xlab="")
rect.hclust(clusters_average, k=3, border=2:5)

plot(clusters_single, main="Single linkage",xlab="")
rect.hclust(clusters_single, k=3, border=2:5)

Visualizing the results

The fviz_cluster() function (from the factoextra package) provides a convenient way to visualize clustering results in a low-dimensional space, typically using principal components. The function plots the observations and colors them according to their cluster membership, making it easy to see how well the clusters are separated.

The resulting plot shows the data projected onto the first two principal components, with points colored by cluster and optional ellipses indicating cluster boundaries. This provides a clear visual assessment of how well-separated the clusters are.

library(factoextra) 

wine_clusters <- cutree(clusters_ward, 3)

rownames(wine_stand) <- paste(wine$Cultivar, seq_len(nrow(wine)), sep = "_")

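# use only the 13 chemical measurements, so the Cultivar label does not enter the projection
fviz_cluster(list(data = wine_stand[, 2:14], cluster = wine_clusters))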
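The projection behind such a plot can be approximated with base R alone. The sketch below is an assumption-laden illustration rather than a description of what fviz_cluster() does internally: it uses the built-in USArrests data (not the wine data) so that it runs without a download, computes the first two principal components with prcomp(), and colors the points by cluster.

```r
# USArrests is a built-in dataset, used here only to keep the sketch self-contained
dat <- scale(USArrests)
clusters <- cutree(hclust(dist(dat), method = "complete"), k = 3)

# Principal component scores: the coordinates shown on the two plot axes
pca <- prcomp(dat)
scores <- pca$x[, 1:2]

# Share of the total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:2], 3)

# Scatter plot of the first two components, colored by cluster
plot(scores, col = clusters, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Clusters on the first two PCs")
```

The share of variance explained by the two components indicates how much of the original structure the two-dimensional picture can show; clusters that overlap in the plot may still be separated in the remaining dimensions.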

The ape package (Analyses of Phylogenetics and Evolution) provides tools that can also be used to visualize hierarchical clustering results in a tree format that is often more flexible than the base R dendrogram.

The function as.phylo() is used to convert an hclust object into a phylogenetic tree object. This allows us to take the hierarchical clustering result and represent it using tree structures that can be further customised and visualised.

Once converted, the tree can be plotted with plot(), which for these objects uses the plot.phylo() method from ape and often produces a cleaner and more flexible dendrogram-style visualization. Additional options are available to change the tree layout, adjust labels, and improve readability.

library(ape)

colors <- c("red", "blue", "green")

plot(as.phylo(clusters_ward), type = "fan", tip.color = colors[wine_clusters],
     label.offset = 1, cex = 0.7)

plot(as.phylo(clusters_ward), type = "unrooted", cex = 0.6,
     no.margin = TRUE, tip.color = colors[wine_clusters],
     label.offset = 1)