Have you ever wondered how scientists identify groups of similar genes, how marketers segment customers, or how ecologists classify species? One powerful technique for discovering natural groupings in data is hierarchical clustering. Unlike other clustering methods that require you to specify the number of clusters upfront, hierarchical clustering builds a tree-like structure (called a dendrogram) that reveals relationships at multiple scales simultaneously.
In this tutorial, we’ll explore hierarchical clustering from the ground up. We’ll start with the fundamental concepts, work through concrete examples using synthetic data, and then apply our knowledge to a real-world dataset. By the end, you’ll understand how hierarchical clustering works and be able to apply it to your own data analysis projects.
Hierarchical clustering is an unsupervised learning technique that groups similar observations together based on their features, without requiring labeled training data. The key distinguishing feature of hierarchical clustering is that it produces a hierarchy of clusters, represented visually as a dendrogram (tree diagram).
There are two main approaches:
Agglomerative (Bottom-Up): Start with each observation as its own cluster, then iteratively merge the most similar clusters until only one cluster remains. This is the most common approach and the focus of this tutorial.
Divisive (Top-Down): Start with all observations in one cluster, then recursively split clusters until each observation is in its own cluster.
Hierarchical clustering offers several advantages: it does not require you to choose the number of clusters in advance, the dendrogram lets you examine groupings at multiple scales, and, for a given distance metric and linkage method, the procedure is deterministic, so the same data always yield the same tree.
Before diving into examples, let’s understand three crucial concepts:
To cluster observations, we need to measure how different they are. Common distance metrics include Euclidean distance (the straight-line distance, used throughout this tutorial), Manhattan distance (the sum of absolute coordinate differences), and correlation-based distance (useful when the shape of a profile matters more than its magnitude). A quick comparison of the first two with base R's dist() is sketched below.
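As a minimal sketch using only base R's dist() and two made-up points, here is how Euclidean and Manhattan distance compare:

# Two toy points: (0, 0) and (3, 4)
two_points <- rbind(c(0, 0), c(3, 4))
dist(two_points, method = "euclidean")  # straight-line distance: sqrt(3^2 + 4^2) = 5
dist(two_points, method = "manhattan")  # sum of absolute differences: 3 + 4 = 7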
Once we know distances between individual observations, how do we measure distance between clusters? This is where linkage comes in. The three linkage methods used in this tutorial are:
Complete linkage: the dissimilarity between two clusters is the largest pairwise dissimilarity between an observation in one cluster and an observation in the other.
Single linkage: the dissimilarity between two clusters is the smallest such pairwise dissimilarity.
Average linkage: the dissimilarity between two clusters is the mean of all pairwise dissimilarities between their observations.
The toy calculation below makes these definitions concrete.
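Here is a small hand-rolled sketch of the three linkage definitions; the coordinates are made up for illustration, and in practice hclust() handles these calculations for you:

# Two toy clusters: {p1, p2} and {p3}
toy_points <- rbind(c(1, 1), c(1.5, 1.3), c(5, 4))
# Distances from p1 and p2 to p3
between <- as.matrix(dist(toy_points))[1:2, 3]
max(between)   # complete linkage: largest pairwise distance between the clusters
min(between)   # single linkage: smallest pairwise distance between the clusters
mean(between)  # average linkage: mean of all pairwise distances between the clusters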
A dendrogram is a tree diagram that shows the hierarchical relationship between clusters. The height at which two clusters merge indicates their dissimilarity. Cutting the dendrogram at a particular height yields a specific number of clusters.
Let’s walk through the agglomerative hierarchical clustering algorithm step by step:
Algorithm: Agglomerative Hierarchical Clustering
Begin with \(n\) observations and a measure (such as Euclidean distance) of all \({n \choose 2} = n(n-1)/2\) pairwise dissimilarities. Treat each observation as its own cluster.
For \(i = n, n-1, ..., 2\):
(a) Examine all pairwise inter-cluster dissimilarities among the \(i\) clusters and fuse the two clusters that are most similar. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion occurs.
(b) Compute the new pairwise inter-cluster dissimilarities among the remaining \(i - 1\) clusters, using the chosen linkage method.
The result is a dendrogram showing how clusters merge at each step.
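To make the steps concrete, here is a deliberately naive R sketch of the agglomerative loop with complete linkage. It is for illustration only (it recomputes cluster dissimilarities from scratch at every step); in practice we will use R's built-in hclust(), whose $merge and $height components store the same information far more efficiently.

# Naive agglomerative clustering with complete linkage (illustration only;
# use hclust() in practice)
naive_agglomerative <- function(X) {
  d <- as.matrix(dist(X))                  # all pairwise dissimilarities
  clusters <- as.list(seq_len(nrow(X)))    # each observation starts as its own cluster
  merges <- list()

  for (i in nrow(X):2) {
    # Step (a): find the pair of clusters with the smallest
    # complete-linkage dissimilarity
    best_pair <- c(NA, NA)
    best_dist <- Inf
    for (a in seq_along(clusters)) {
      for (b in seq_along(clusters)) {
        if (a < b) {
          d_ab <- max(d[clusters[[a]], clusters[[b]]])   # complete linkage
          if (d_ab < best_dist) {
            best_dist <- d_ab
            best_pair <- c(a, b)
          }
        }
      }
    }
    # Record the fusion and its dendrogram height, then merge the two clusters
    merges[[length(merges) + 1]] <- list(pair = best_pair, height = best_dist)
    clusters[[best_pair[1]]] <- c(clusters[[best_pair[1]]], clusters[[best_pair[2]]])
    clusters[[best_pair[2]]] <- NULL
  }
  merges
}

# Quick check on five random points: the recorded heights mirror those
# from hclust(dist(X), method = "complete")
set.seed(1)
str(naive_agglomerative(matrix(rnorm(10), ncol = 2)))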
Let’s start with a very simple example to see hierarchical clustering in action. We’ll create a small dataset with clear groups.
# Load required libraries
library(ggplot2)
library(gridExtra)
# Create a simple dataset with 6 observations in 2D space
set.seed(42)
simple_data <- data.frame(
x = c(1, 1.5, 1.2, 5, 5.5, 5.2),
y = c(1, 1.3, 0.9, 4, 4.2, 3.8),
label = c("A", "B", "C", "D", "E", "F")
)
# Visualize the data
ggplot(simple_data, aes(x = x, y = y, label = label)) +
geom_point(size = 4, color = "steelblue") +
geom_text(vjust = -1, size = 5) +
theme_minimal() +
labs(title = "Simple Dataset: 6 Observations in 2D Space",
x = "Feature 1", y = "Feature 2") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Just by looking at the plot, we can see two natural groups: {A, B, C} on the left and {D, E, F} on the right.
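It can also help to look at the raw pairwise distances: within each trio the distances are small, while distances between the two groups are several times larger.

# Pairwise Euclidean distances between the six observations
round(as.matrix(dist(simple_data[, c("x", "y")])), 2)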
Now let’s perform hierarchical clustering with complete linkage:
# Compute distance matrix
dist_matrix <- dist(simple_data[, c("x", "y")], method = "euclidean")
# Perform hierarchical clustering with complete linkage
hc_complete <- hclust(dist_matrix, method = "complete")
# Plot the dendrogram
plot(hc_complete, labels = simple_data$label,
main = "Dendrogram: Complete Linkage",
xlab = "Observation", ylab = "Height (Dissimilarity)",
sub="",
cex = 1.2, cex.main = 1.5)

Interpreting the Dendrogram:
Observations A, B, and C merge at low heights, as do D, E, and F, because each trio lies close together in the feature space.
The two resulting groups are joined only near the top of the tree, at a much greater height, reflecting the large dissimilarity between the left and right groups.
Cutting the tree anywhere between those two levels yields the two natural clusters. The merge heights behind this picture can also be read directly from the fitted object, as shown below.
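If you want to see the numbers behind the dendrogram, the fitted hclust object exposes them directly:

# Merge order and merge heights stored in the hclust object
hc_complete$merge   # which observations/clusters were fused at each step
hc_complete$height  # dissimilarity (height) at which each fusion occurred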
If we want exactly 2 clusters, we can “cut” the dendrogram:
# Cut the dendrogram to get 2 clusters
clusters_2 <- cutree(hc_complete, k = 2)
# Add cluster assignments to data
simple_data$cluster <- as.factor(clusters_2)
# Visualize with cluster colors
ggplot(simple_data, aes(x = x, y = y, label = label, color = cluster)) +
geom_point(size = 4) +
geom_text(vjust = -1, size = 5, show.legend = FALSE) +
scale_color_manual(values = c("steelblue", "coral")) +
theme_minimal() +
labs(title = "Clustering Results: 2 Clusters",
x = "Feature 1", y = "Feature 2", color = "Cluster") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Perfect! The algorithm correctly identified our two natural groups.
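As an aside, cutree() can also cut the tree at a chosen height rather than at a fixed number of clusters; the height of 3 below is simply an illustrative value between the within-group and between-group merge levels in this dendrogram:

# Cut by height instead of by number of clusters
clusters_by_height <- cutree(hc_complete, h = 3)
table(clusters_by_height)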
The choice of linkage method can significantly affect clustering results. Let’s create a synthetic dataset and compare three common linkage methods.
# Create synthetic data with three distinct clusters
set.seed(123)
n_per_cluster <- 20
cluster1 <- data.frame(x = rnorm(n_per_cluster, mean = 0, sd = 0.5),
y = rnorm(n_per_cluster, mean = 0, sd = 0.5))
cluster2 <- data.frame(x = rnorm(n_per_cluster, mean = 4, sd = 0.5),
y = rnorm(n_per_cluster, mean = 0, sd = 0.5))
cluster3 <- data.frame(x = rnorm(n_per_cluster, mean = 2, sd = 0.5),
y = rnorm(n_per_cluster, mean = 3, sd = 0.5))
synth_data <- rbind(cluster1, cluster2, cluster3)
true_labels <- factor(rep(1:3, each = n_per_cluster))
# Visualize the data
ggplot(synth_data, aes(x = x, y = y, color = true_labels)) +
geom_point(size = 2.5, alpha = 0.7) +
scale_color_manual(values = c("steelblue", "coral", "forestgreen"),
name = "True Cluster") +
theme_minimal() +
labs(title = "Synthetic Dataset with Three Clusters",
x = "Feature 1", y = "Feature 2") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Now let's apply hierarchical clustering with three different linkage methods:
# Compute distance matrix
dist_synth <- dist(synth_data, method = "euclidean")
# Perform clustering with different linkage methods
hc_single <- hclust(dist_synth, method = "single")
hc_complete <- hclust(dist_synth, method = "complete")
hc_average <- hclust(dist_synth, method = "average")
# Plot dendrograms
par(mfrow = c(3, 1), mar = c(4, 4, 3, 1))
plot(hc_single, labels = FALSE, main = "Single Linkage",
sub="", xlab = "", ylab = "Height", cex.main = 1.5)
abline(h = 1.5, col = "red", lty = 2, lwd = 2)
plot(hc_complete, labels = FALSE, main = "Complete Linkage",
sub="", xlab = "", ylab = "Height", cex.main = 1.5)
abline(h = 3.5, col = "red", lty = 2, lwd = 2)
plot(hc_average, labels = FALSE, main = "Average Linkage",
sub="", xlab = "Observation", ylab = "Height", cex.main = 1.5)
abline(h = 2.5, col = "red", lty = 2, lwd = 2)

The red dashed lines show where we might cut to obtain 3 clusters. Notice how the dendrogram structures differ: single linkage tends to fuse clusters through chains of nearby points, producing long, straggly branches and merging at much lower heights, while complete and average linkage yield more compact, balanced trees, with average linkage merging at intermediate heights. The sketch below checks how many clusters each of the drawn cut heights actually produces.
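As a quick check, cutree() can report the cluster sizes produced by cutting at each of the heights drawn above; the exact counts depend on the simulated data:

# Cluster sizes produced by cutting each dendrogram at the drawn height
table(cutree(hc_single, h = 1.5))
table(cutree(hc_complete, h = 3.5))
table(cutree(hc_average, h = 2.5))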
Let’s visualize the resulting clusters:
# Cut trees to get 3 clusters
clusters_single <- cutree(hc_single, k = 3)
clusters_complete <- cutree(hc_complete, k = 3)
clusters_average <- cutree(hc_average, k = 3)
# Create plotting data
plot_data <- data.frame(
x = rep(synth_data$x, 3),
y = rep(synth_data$y, 3),
cluster = factor(c(clusters_single, clusters_complete, clusters_average)),
method = factor(rep(c("Single", "Complete", "Average"),
each = nrow(synth_data)),
levels = c("Single", "Complete", "Average"))
)
# Plot
ggplot(plot_data, aes(x = x, y = y, color = cluster)) +
geom_point(size = 2, alpha = 0.7) +
facet_wrap(~ method, ncol = 3) +
scale_color_manual(values = c("steelblue", "coral", "forestgreen")) +
theme_minimal() +
labs(title = "Comparison of Linkage Methods",
x = "Feature 1", y = "Feature 2") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
strip.text = element_text(face = "bold", size = 12))

Key Observations: because the three groups in this synthetic dataset are well separated, all three linkage methods recover very similar clusters. The differences between methods matter much more for noisy or overlapping data, where single linkage in particular is prone to chaining distinct groups together. The cross-tabulations below make the comparison with the true labels explicit.
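To quantify the visual comparison, we can cross-tabulate each solution against the labels used to generate the data (cluster numbers are arbitrary, so only the pattern of each table matters):

# Agreement between each linkage method and the generating labels
table(True = true_labels, Single = clusters_single)
table(True = true_labels, Complete = clusters_complete)
table(True = true_labels, Average = clusters_average)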
Now let’s apply hierarchical clustering to a real dataset. We’ll use
the wine dataset, which contains chemical analyses of wines
grown in the same region in Italy but derived from three different
cultivars.
# Load the wine dataset from UCI Machine Learning Repository
# This dataset contains 178 wines with 13 chemical measurements
wine_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
wine <- read.csv(wine_url, header = FALSE)
colnames(wine) <- c("Cultivar", "Alcohol", "Malic_acid", "Ash",
"Alcalinity_of_ash", "Magnesium", "Total_phenols",
"Flavanoids", "Nonflavanoid_phenols",
"Proanthocyanins", "Color_intensity", "Hue",
"OD280_OD315", "Proline")
# Store true cultivar labels
true_cultivars <- wine$Cultivar
# Remove the cultivar label for clustering (unsupervised learning)
wine_features <- wine[, -1]
# Show dataset structure
cat("Wine Dataset Dimensions:", dim(wine_features), "\n")## Wine Dataset Dimensions: 178 13
## Number of wines per cultivar:
## true_cultivars
## 1 2 3
## 59 71 48
Before clustering, it’s important to standardize the features since they’re measured on different scales (e.g., Alcohol ranges from 11-15% while Proline ranges from 278-1680 mg/L):
# Standardize the features (mean = 0, sd = 1)
wine_scaled <- scale(wine_features)
# Verify standardization
cat("Mean of first feature:", round(mean(wine_scaled[, 1]), 6), "\n")## Mean of first feature: 0
## SD of first feature: 1
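If you want to check every feature rather than just the first, a quick sketch like the following works on the objects created above:

# Ranges of the raw features differ widely (e.g., Ash vs. Proline) ...
round(apply(wine_features, 2, range), 1)
# ... while every scaled feature has mean ~0 and standard deviation 1
round(colMeans(wine_scaled), 6)
round(apply(wine_scaled, 2, sd), 6)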
Now let’s perform hierarchical clustering:
# Compute distance matrix using Euclidean distance
wine_dist <- dist(wine_scaled, method = "euclidean")
# Perform hierarchical clustering with complete linkage
wine_hc <- hclust(wine_dist, method = "complete")
# Plot the dendrogram
plot(wine_hc, labels = FALSE, main = "Wine Dataset Dendrogram (Complete Linkage)",
xlab = "Wine Sample", ylab = "Height (Euclidean Distance)", sub="",
cex.main = 1.5)
abline(h = 10, col = "red", lty = 2, lwd = 2)
text(x = 140, y = 11, labels = "Cut height for 3 clusters", col = "red")

The dendrogram suggests natural groupings in the data. Since we know there are 3 wine cultivars, let's cut the tree at a height that produces 3 clusters:
# Cut tree to obtain 3 clusters
wine_clusters <- cutree(wine_hc, k = 3)
# Compare with true cultivar labels
comparison_table <- table(True_Cultivar = true_cultivars,
Predicted_Cluster = wine_clusters)
print(comparison_table)

##              Predicted_Cluster
## True_Cultivar  1  2  3
##             1 51  8  0
##             2 18 50  3
##             3  0  0 48
# Calculate clustering accuracy by matching each predicted cluster
# to the true cultivar it overlaps with most (best alignment)
accuracy <- sum(apply(comparison_table, 2, max)) / sum(comparison_table)
cat("\nClustering accuracy (best alignment):", round(accuracy * 100, 1), "%\n")

##
## Clustering accuracy (best alignment): 83.7 %
Hierarchical clustering recovered the three wine cultivars well, agreeing with the true labels for roughly 84% of the wines under the best cluster-to-cultivar alignment, using only the chemical measurements (no cultivar labels were used in the clustering). Most of the disagreement involves cultivar 2 wines being placed in cluster 1.
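For a label-free measure of agreement, the adjusted Rand index is a common alternative to best-alignment accuracy; the sketch below assumes the mclust package is installed, which is not otherwise used in this tutorial:

# Optional: adjusted Rand index between the clustering and the true cultivars
if (requireNamespace("mclust", quietly = TRUE)) {
  mclust::adjustedRandIndex(true_cultivars, wine_clusters)
}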
Let’s visualize the results using the first two principal components for dimensionality reduction:
# Perform PCA for visualization
wine_pca <- prcomp(wine_scaled)
pca_data <- data.frame(
PC1 = wine_pca$x[, 1],
PC2 = wine_pca$x[, 2],
True_Cultivar = factor(true_cultivars),
Predicted_Cluster = factor(wine_clusters)
)
# Plot true cultivars vs predicted clusters
p1 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = True_Cultivar)) +
geom_point(size = 2.5, alpha = 0.7) +
scale_color_manual(values = c("steelblue", "coral", "forestgreen")) +
theme_minimal() +
labs(title = "True Wine Cultivars", color = "Cultivar") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
p2 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = Predicted_Cluster)) +
geom_point(size = 2.5, alpha = 0.7) +
scale_color_manual(values = c("steelblue", "coral", "forestgreen")) +
theme_minimal() +
labs(title = "Predicted Clusters", color = "Cluster") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
grid.arrange(p1, p2, ncol = 2,
top = "Wine Classification: True vs Predicted")The close correspondence between the true cultivars and predicted clusters demonstrates that hierarchical clustering successfully identified meaningful structure in the wine chemical data.
Hierarchical clustering works best when:
You do not know the number of clusters in advance and want to explore groupings at several scales.
The dataset is small to moderate in size, so computing and storing all pairwise distances is feasible.
The features are on comparable scales, or have been standardized as we did for the wine data.
A nested hierarchy of groups is itself meaningful for the problem at hand.
Be aware of these limitations:
The full distance matrix grows quadratically with the number of observations, so memory and run time become problematic for large datasets (illustrated in the sketch below).
Merges are greedy and cannot be undone, so an early poor fusion propagates up the tree.
Results can change substantially with the choice of distance metric and linkage method, as we saw in the linkage comparison.
The method is sensitive to feature scaling and to outliers.
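The scalability point is easy to see directly: the sketch below builds a simulated dataset of 5,000 observations (a made-up size) and measures the memory used by its distance matrix, which already approaches 100 MB.

# Memory footprint of the full pairwise distance matrix for n = 5000
big_X <- matrix(rnorm(5000 * 10), ncol = 10)
format(object.size(dist(big_X)), units = "MB")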
Hierarchical clustering is a powerful and intuitive technique for discovering structure in unlabeled data. Its key strengths are that it requires no pre-specified number of clusters, that the dendrogram reveals structure at multiple scales at once, and that its results are straightforward to visualize and communicate.
Through our examples, we’ve seen how hierarchical clustering can:
The methodology is straightforward: compute pairwise dissimilarities, iteratively merge the most similar clusters using a chosen linkage method, and visualize the hierarchy with a dendrogram. While the algorithm is simple, the resulting tool is remarkably effective for exploratory data analysis.
As you continue your data science journey, hierarchical clustering will prove valuable for initial data exploration, identifying natural groupings, and generating hypotheses about data structure. Combined with domain knowledge and other analytical techniques, it’s an essential tool in the modern data analyst’s toolkit.