Have you ever wondered how scientists identify groups of similar genes, how marketers segment customers, or how ecologists classify species? One powerful technique for discovering natural groupings in data is hierarchical clustering. Unlike other clustering methods that require you to specify the number of clusters upfront, hierarchical clustering builds a tree-like structure (called a dendrogram) that reveals relationships at multiple scales simultaneously.
In this tutorial, we’ll explore hierarchical clustering from the ground up. We’ll start with the fundamental concepts, work through concrete examples using synthetic data, and then apply our knowledge to a real-world dataset. By the end, you’ll understand how hierarchical clustering works and be able to apply it to your own data analysis projects.
Hierarchical clustering is an unsupervised learning technique that groups similar observations together based on their features, without requiring labeled training data. The key distinguishing feature of hierarchical clustering is that it produces a hierarchy of clusters, represented visually as a dendrogram (tree diagram).
There are two main approaches:
Agglomerative (Bottom-Up): Start with each observation as its own cluster, then iteratively merge the most similar clusters until only one cluster remains. This is the most common approach and the focus of this tutorial.
Divisive (Top-Down): Start with all observations in one cluster, then recursively split clusters until each observation is in its own cluster.
Hierarchical clustering offers several advantages: it does not require you to choose the number of clusters in advance, the dendrogram lets you examine groupings at multiple scales, and, for a given distance metric and linkage method, the procedure is deterministic, so the same data always yield the same tree.
Before diving into examples, let’s understand three crucial concepts:
To cluster observations, we need to measure how different they are. Common distance metrics include Euclidean distance (the straight-line distance, used throughout this tutorial), Manhattan distance (the sum of absolute coordinate differences), and correlation-based distance (useful when the shape of a profile matters more than its magnitude). A quick comparison of the first two with base R's dist() is sketched below.
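As a minimal sketch using only base R's dist() and two made-up points, here is how Euclidean and Manhattan distance compare:

# Two toy points: (0, 0) and (3, 4)
two_points <- rbind(c(0, 0), c(3, 4))
dist(two_points, method = "euclidean")  # straight-line distance: sqrt(3^2 + 4^2) = 5
dist(two_points, method = "manhattan")  # sum of absolute differences: 3 + 4 = 7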
Once we know distances between individual observations, how do we measure distance between clusters? This is where linkage comes in. The three linkage methods used in this tutorial are:
Complete linkage: the dissimilarity between two clusters is the largest pairwise dissimilarity between an observation in one cluster and an observation in the other.
Single linkage: the dissimilarity between two clusters is the smallest such pairwise dissimilarity.
Average linkage: the dissimilarity between two clusters is the mean of all pairwise dissimilarities between their observations.
The toy calculation below makes these definitions concrete.
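Here is a small hand-rolled sketch of the three linkage definitions; the coordinates are made up for illustration, and in practice hclust() handles these calculations for you:

# Two toy clusters: {p1, p2} and {p3}
toy_points <- rbind(c(1, 1), c(1.5, 1.3), c(5, 4))
# Distances from p1 and p2 to p3
between <- as.matrix(dist(toy_points))[1:2, 3]
max(between)   # complete linkage: largest pairwise distance between the clusters
min(between)   # single linkage: smallest pairwise distance between the clusters
mean(between)  # average linkage: mean of all pairwise distances between the clusters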
A dendrogram is a tree diagram that shows the hierarchical relationship between clusters. The height at which two clusters merge indicates their dissimilarity. Cutting the dendrogram at a particular height yields a specific number of clusters.
Let’s walk through the agglomerative hierarchical clustering algorithm step by step:
Algorithm: Agglomerative Hierarchical Clustering
Begin with \(n\) observations and a measure (such as Euclidean distance) of all \({n \choose 2} = n(n-1)/2\) pairwise dissimilarities. Treat each observation as its own cluster.
For \(i = n, n-1, ..., 2\):
(a) Examine all pairwise inter-cluster dissimilarities among the \(i\) clusters and fuse the two clusters that are most similar. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion occurs.
(b) Compute the new pairwise inter-cluster dissimilarities among the remaining \(i - 1\) clusters, using the chosen linkage method.
The result is a dendrogram showing how clusters merge at each step.
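To make the steps concrete, here is a deliberately naive R sketch of the agglomerative loop with complete linkage. It is for illustration only (it recomputes cluster dissimilarities from scratch at every step); in practice we will use R's built-in hclust(), whose $merge and $height components store the same information far more efficiently.

# Naive agglomerative clustering with complete linkage (illustration only;
# use hclust() in practice)
naive_agglomerative <- function(X) {
  d <- as.matrix(dist(X))                  # all pairwise dissimilarities
  clusters <- as.list(seq_len(nrow(X)))    # each observation starts as its own cluster
  merges <- list()

  for (i in nrow(X):2) {
    # Step (a): find the pair of clusters with the smallest
    # complete-linkage dissimilarity
    best_pair <- c(NA, NA)
    best_dist <- Inf
    for (a in seq_along(clusters)) {
      for (b in seq_along(clusters)) {
        if (a < b) {
          d_ab <- max(d[clusters[[a]], clusters[[b]]])   # complete linkage
          if (d_ab < best_dist) {
            best_dist <- d_ab
            best_pair <- c(a, b)
          }
        }
      }
    }
    # Record the fusion and its dendrogram height, then merge the two clusters
    merges[[length(merges) + 1]] <- list(pair = best_pair, height = best_dist)
    clusters[[best_pair[1]]] <- c(clusters[[best_pair[1]]], clusters[[best_pair[2]]])
    clusters[[best_pair[2]]] <- NULL
  }
  merges
}

# Quick check on five random points: the recorded heights mirror those
# from hclust(dist(X), method = "complete")
set.seed(1)
str(naive_agglomerative(matrix(rnorm(10), ncol = 2)))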
Let’s start with a very simple example to see hierarchical clustering in action. We’ll create a small dataset with clear groups.
# Load required libraries
library(ggplot2)
library(gridExtra)
# Create a simple dataset with 6 observations in 2D space
set.seed(42)
simple_data <- data.frame(
x = c(1, 1.5, 1.2, 5, 5.5, 5.2),
y = c(1, 1.3, 0.9, 4, 4.2, 3.8),
label = c("A", "B", "C", "D", "E", "F")
)
# Visualize the data
ggplot(simple_data, aes(x = x, y = y, label = label)) +
geom_point(size = 4, color = "steelblue") +
geom_text(vjust = -1, size = 5) +
theme_minimal() +
labs(title = "Simple Dataset: 6 Observations in 2D Space",
x = "Feature 1", y = "Feature 2") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Just by looking at the plot, we can see two natural groups: {A, B, C} on the left and {D, E, F} on the right.
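It can also help to look at the raw pairwise distances: within each trio the distances are small, while distances between the two groups are several times larger.

# Pairwise Euclidean distances between the six observations
round(as.matrix(dist(simple_data[, c("x", "y")])), 2)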
Now let’s perform hierarchical clustering with complete linkage:
# Compute distance matrix
dist_matrix <- dist(simple_data[, c("x", "y")], method = "euclidean")
# Perform hierarchical clustering with complete linkage
hc_complete <- hclust(dist_matrix, method = "complete")
# Plot the dendrogram
plot(hc_complete, labels = simple_data$label,
main = "Dendrogram: Complete Linkage",
xlab = "Observation", ylab = "Height (Dissimilarity)",
sub="",
cex = 1.2, cex.main = 1.5)

Interpreting the Dendrogram:
Observations A, B, and C merge at low heights, as do D, E, and F, because each trio lies close together in the feature space.
The two resulting groups are joined only near the top of the tree, at a much greater height, reflecting the large dissimilarity between the left and right groups.
Cutting the tree anywhere between those two levels yields the two natural clusters. The merge heights behind this picture can also be read directly from the fitted object, as shown below.
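If you want to see the numbers behind the dendrogram, the fitted hclust object exposes them directly:

# Merge order and merge heights stored in the hclust object
hc_complete$merge   # which observations/clusters were fused at each step
hc_complete$height  # dissimilarity (height) at which each fusion occurred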
If we want exactly 2 clusters, we can “cut” the dendrogram:
# Cut the dendrogram to get 2 clusters
clusters_2 <- cutree(hc_complete, k = 2)
# Add cluster assignments to data
simple_data$cluster <- as.factor(clusters_2)
# Visualize with cluster colors
ggplot(simple_data, aes(x = x, y = y, label = label, color = cluster)) +
geom_point(size = 4) +
geom_text(vjust = -1, size = 5, show.legend = FALSE) +
scale_color_manual(values = c("steelblue", "coral")) +
theme_minimal() +
labs(title = "Clustering Results: 2 Clusters",
x = "Feature 1", y = "Feature 2", color = "Cluster") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Perfect! The algorithm correctly identified our two natural groups.
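As an aside, cutree() can also cut the tree at a chosen height rather than at a fixed number of clusters; the height of 3 below is simply an illustrative value between the within-group and between-group merge levels in this dendrogram:

# Cut by height instead of by number of clusters
clusters_by_height <- cutree(hc_complete, h = 3)
table(clusters_by_height)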
The choice of linkage method can significantly affect clustering results. Let’s create a synthetic dataset and compare three common linkage methods.
# Create synthetic data with three distinct clusters
set.seed(123)
n_per_cluster <- 20
cluster1 <- data.frame(x = rnorm(n_per_cluster, mean = 0, sd = 0.5),
y = rnorm(n_per_cluster, mean = 0, sd = 0.5))
cluster2 <- data.frame(x = rnorm(n_per_cluster, mean = 4, sd = 0.5),
y = rnorm(n_per_cluster, mean = 0, sd = 0.5))
cluster3 <- data.frame(x = rnorm(n_per_cluster, mean = 2, sd = 0.5),
y = rnorm(n_per_cluster, mean = 3, sd = 0.5))
synth_data <- rbind(cluster1, cluster2, cluster3)
true_labels <- factor(rep(1:3, each = n_per_cluster))
# Visualize the data
ggplot(synth_data, aes(x = x, y = y, color = true_labels)) +
geom_point(size = 2.5, alpha = 0.7) +
scale_color_manual(values = c("steelblue", "coral", "forestgreen"),
name = "True Cluster") +
theme_minimal() +
labs(title = "Synthetic Dataset with Three Clusters",
x = "Feature 1", y = "Feature 2") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Now let's apply hierarchical clustering with three different linkage methods:
# Compute distance matrix
dist_synth <- dist(synth_data, method = "euclidean")
# Perform clustering with different linkage methods
hc_single <- hclust(dist_synth, method = "single")
hc_complete <- hclust(dist_synth, method = "complete")
hc_average <- hclust(dist_synth, method = "average")
# Plot dendrograms
par(mfrow = c(3, 1), mar = c(4, 4, 3, 1))
plot(hc_single, labels = FALSE, main = "Single Linkage",
sub="", xlab = "", ylab = "Height", cex.main = 1.5)
abline(h = 1.5, col = "red", lty = 2, lwd = 2)
plot(hc_complete, labels = FALSE, main = "Complete Linkage",
sub="", xlab = "", ylab = "Height", cex.main = 1.5)
abline(h = 3.5, col = "red", lty = 2, lwd = 2)
plot(hc_average, labels = FALSE, main = "Average Linkage",
sub="", xlab = "Observation", ylab = "Height", cex.main = 1.5)
abline(h = 2.5, col = "red", lty = 2, lwd = 2)

The red dashed lines show where we might cut to obtain 3 clusters. Notice how the dendrogram structures differ: single linkage tends to fuse clusters through chains of nearby points, producing long, straggly branches and merging at much lower heights, while complete and average linkage yield more compact, balanced trees, with average linkage merging at intermediate heights. The sketch below checks how many clusters each of the drawn cut heights actually produces.
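As a quick check, cutree() can report the cluster sizes produced by cutting at each of the heights drawn above; the exact counts depend on the simulated data:

# Cluster sizes produced by cutting each dendrogram at the drawn height
table(cutree(hc_single, h = 1.5))
table(cutree(hc_complete, h = 3.5))
table(cutree(hc_average, h = 2.5))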
Let’s visualize the resulting clusters:
# Cut trees to get 3 clusters
clusters_single <- cutree(hc_single, k = 3)
clusters_complete <- cutree(hc_complete, k = 3)
clusters_average <- cutree(hc_average, k = 3)
# Create plotting data
plot_data <- data.frame(
x = rep(synth_data$x, 3),
y = rep(synth_data$y, 3),
cluster = factor(c(clusters_single, clusters_complete, clusters_average)),
method = factor(rep(c("Single", "Complete", "Average"),
each = nrow(synth_data)),
levels = c("Single", "Complete", "Average"))
)
# Plot
ggplot(plot_data, aes(x = x, y = y, color = cluster)) +
geom_point(size = 2, alpha = 0.7) +
facet_wrap(~ method, ncol = 3) +
scale_color_manual(values = c("steelblue", "coral", "forestgreen")) +
theme_minimal() +
labs(title = "Comparison of Linkage Methods",
x = "Feature 1", y = "Feature 2") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
strip.text = element_text(face = "bold", size = 12))

Key Observations: because the three groups in this synthetic dataset are well separated, all three linkage methods recover very similar clusters. The differences between methods matter much more for noisy or overlapping data, where single linkage in particular is prone to chaining distinct groups together. The cross-tabulations below make the comparison with the true labels explicit.
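To quantify the visual comparison, we can cross-tabulate each solution against the labels used to generate the data (cluster numbers are arbitrary, so only the pattern of each table matters):

# Agreement between each linkage method and the generating labels
table(True = true_labels, Single = clusters_single)
table(True = true_labels, Complete = clusters_complete)
table(True = true_labels, Average = clusters_average)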
Now let’s apply hierarchical clustering to a real dataset. We’ll use
the wine dataset, which contains chemical analyses of wines
grown in the same region in Italy but derived from three different
cultivars.
# Load the wine dataset from UCI Machine Learning Repository
# This dataset contains 178 wines with 13 chemical measurements
wine_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
wine <- read.csv(wine_url, header = FALSE)
colnames(wine) <- c("Cultivar", "Alcohol", "Malic_acid", "Ash",
"Alcalinity_of_ash", "Magnesium", "Total_phenols",
"Flavanoids", "Nonflavanoid_phenols",
"Proanthocyanins", "Color_intensity", "Hue",
"OD280_OD315", "Proline")
# Store true cultivar labels
true_cultivars <- wine$Cultivar
# Remove the cultivar label for clustering (unsupervised learning)
wine_features <- wine[, -1]
# Show dataset structure
cat("Wine Dataset Dimensions:", dim(wine_features), "\n")## Wine Dataset Dimensions: 178 13
## Number of wines per cultivar:
## true_cultivars
## 1 2 3
## 59 71 48
Before clustering, it’s important to standardize the features since they’re measured on different scales (e.g., Alcohol ranges from 11-15% while Proline ranges from 278-1680 mg/L):
# Standardize the features (mean = 0, sd = 1)
wine_scaled <- scale(wine_features)
# Verify standardization
cat("Mean of first feature:", round(mean(wine_scaled[, 1]), 6), "\n")## Mean of first feature: 0
## SD of first feature: 1
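If you want to check every feature rather than just the first, a quick sketch like the following works on the objects created above:

# Ranges of the raw features differ widely (e.g., Ash vs. Proline) ...
round(apply(wine_features, 2, range), 1)
# ... while every scaled feature has mean ~0 and standard deviation 1
round(colMeans(wine_scaled), 6)
round(apply(wine_scaled, 2, sd), 6)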
Now let’s perform hierarchical clustering:
# Compute distance matrix using Euclidean distance
wine_dist <- dist(wine_scaled, method = "euclidean")
# Perform hierarchical clustering with complete linkage
wine_hc <- hclust(wine_dist, method = "complete")
# Plot the dendrogram
plot(wine_hc, labels = FALSE, main = "Wine Dataset Dendrogram (Complete Linkage)",
xlab = "Wine Sample", ylab = "Height (Euclidean Distance)", sub="",
cex.main = 1.5)
abline(h = 10, col = "red", lty = 2, lwd = 2)
text(x = 140, y = 11, labels = "Cut height for 3 clusters", col = "red")

The dendrogram suggests natural groupings in the data. Since we know there are 3 wine cultivars, let's cut the tree at a height that produces 3 clusters:
# Cut tree to obtain 3 clusters
wine_clusters <- cutree(wine_hc, k = 3)
# Compare with true cultivar labels
comparison_table <- table(True_Cultivar = true_cultivars,
Predicted_Cluster = wine_clusters)
print(comparison_table)

##              Predicted_Cluster
## True_Cultivar  1  2  3
##             1 51  8  0
##             2 18 50  3
##             3  0  0 48
# Calculate clustering accuracy by matching each predicted cluster
# to the true cultivar it overlaps with most (best alignment)
accuracy <- sum(apply(comparison_table, 2, max)) / sum(comparison_table)
cat("\nClustering accuracy (best alignment):", round(accuracy * 100, 1), "%\n")

##
## Clustering accuracy (best alignment): 83.7 %
Hierarchical clustering recovered the three wine cultivars well, agreeing with the true labels for roughly 84% of the wines under the best cluster-to-cultivar alignment, using only the chemical measurements (no cultivar labels were used in the clustering). Most of the disagreement involves cultivar 2 wines being placed in cluster 1.
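For a label-free measure of agreement, the adjusted Rand index is a common alternative to best-alignment accuracy; the sketch below assumes the mclust package is installed, which is not otherwise used in this tutorial:

# Optional: adjusted Rand index between the clustering and the true cultivars
if (requireNamespace("mclust", quietly = TRUE)) {
  mclust::adjustedRandIndex(true_cultivars, wine_clusters)
}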
Let’s visualize the results using the first two principal components for dimensionality reduction:
# Perform PCA for visualization
wine_pca <- prcomp(wine_scaled)
pca_data <- data.frame(
PC1 = wine_pca$x[, 1],
PC2 = wine_pca$x[, 2],
True_Cultivar = factor(true_cultivars),
Predicted_Cluster = factor(wine_clusters)
)
# Plot true cultivars vs predicted clusters
p1 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = True_Cultivar)) +
geom_point(size = 2.5, alpha = 0.7) +
scale_color_manual(values = c("steelblue", "coral", "forestgreen")) +
theme_minimal() +
labs(title = "True Wine Cultivars", color = "Cultivar") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
p2 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = Predicted_Cluster)) +
geom_point(size = 2.5, alpha = 0.7) +
scale_color_manual(values = c("steelblue", "coral", "forestgreen")) +
theme_minimal() +
labs(title = "Predicted Clusters", color = "Cluster") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
grid.arrange(p1, p2, ncol = 2,
top = "Wine Classification: True vs Predicted")The close correspondence between the true cultivars and predicted clusters demonstrates that hierarchical clustering successfully identified meaningful structure in the wine chemical data.
Hierarchical clustering works best when:
You do not know the number of clusters in advance and want to explore groupings at several scales.
The dataset is small to moderate in size, so computing and storing all pairwise distances is feasible.
The features are on comparable scales, or have been standardized as we did for the wine data.
A nested hierarchy of groups is itself meaningful for the problem at hand.
Be aware of these limitations:
The full distance matrix grows quadratically with the number of observations, so memory and run time become problematic for large datasets (illustrated in the sketch below).
Merges are greedy and cannot be undone, so an early poor fusion propagates up the tree.
Results can change substantially with the choice of distance metric and linkage method, as we saw in the linkage comparison.
The method is sensitive to feature scaling and to outliers.
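The scalability point is easy to see directly: the sketch below builds a simulated dataset of 5,000 observations (a made-up size) and measures the memory used by its distance matrix, which already approaches 100 MB.

# Memory footprint of the full pairwise distance matrix for n = 5000
big_X <- matrix(rnorm(5000 * 10), ncol = 10)
format(object.size(dist(big_X)), units = "MB")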
Hierarchical clustering is a powerful and intuitive technique for discovering structure in unlabeled data. Its key strengths are that it requires no pre-specified number of clusters, that the dendrogram reveals structure at multiple scales at once, and that its results are straightforward to visualize and communicate.
Through our examples, we’ve seen how hierarchical clustering can:
The methodology is straightforward: compute pairwise dissimilarities, iteratively merge the most similar clusters using a chosen linkage method, and visualize the hierarchy with a dendrogram. While the algorithm is simple, the resulting tool is remarkably effective for exploratory data analysis.
As you continue your data science journey, hierarchical clustering will prove valuable for initial data exploration, identifying natural groupings, and generating hypotheses about data structure. Combined with domain knowledge and other analytical techniques, it’s an essential tool in the modern data analyst’s toolkit.