Hierarchical clustering is a family of methods that build a tree of clusters from data. It is especially handy when you want to see relationships at multiple levels (coarse to fine) or when you don’t want to pick the number of clusters in advance. The classic output is a dendrogram, the branching tree you often see next to heatmaps. This tutorial is written for readers with a first course in statistics and some basic R experience. For more background, see Section 12.4.2 of An Introduction to Statistical Learning (2e) (James et al. 2021).
We will:

- simulate data and see why standardizing variables matters for distance-based clustering,
- build a small distance matrix by hand and trace the first agglomerative merges with the Lance–Williams update,
- compare linkage methods (single, complete, average, Ward),
- cut the tree into groups with cutree(), and
- apply the workflow to the USJudgeRatings data and assess the fit with the cophenetic correlation.

Throughout, we will use base R functions such as dist() and hclust() (R Core Team 2025).
Hierarchical clustering starts from a distance (or dissimilarity) between observations. The most common is Euclidean distance on numeric features. Because distances are sensitive to scale, it’s standard to standardize (z-score) variables first when units differ.
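To make the scale sensitivity concrete, here is a minimal sketch; the variables and units are illustrative and not part of the tutorial's data. A variable measured on a large numeric scale dominates the Euclidean distance until both columns are z-scored.

# Illustrative sketch: height in meters, income in dollars
toy_units <- data.frame(height_m  = c(1.60, 1.95, 1.62),
                        income_usd = c(52000, 51000, 90000))
round(as.matrix(dist(toy_units)), 1)        # distances driven almost entirely by income
round(as.matrix(dist(scale(toy_units))), 2) # after scaling, both variables contribute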
Below we simulate three 2D clusters and compare dendrograms with and without scaling.
# Simulate three compact clusters in 2D
# (no seed is set, so exact numeric output below will vary from run to run)
n <- 60
make_blob <- function(n, center, sd = 0.35) {
  cbind(rnorm(n, center[1], sd), rnorm(n, center[2], sd))
}
X1 <- make_blob(n, c(0, 0))
X2 <- make_blob(n, c(3, 0))
X3 <- make_blob(n, c(1.5, 2.5))
X <- rbind(X1, X2, X3)
colnames(X) <- c("x1", "x2")
X_df <- as.data.frame(X)
# Dendrogram without scaling (Euclidean)
d_raw <- dist(X_df, method = "euclidean")
hc_raw <- hclust(d_raw, method = "complete")
# Dendrogram with scaling
X_scaled <- scale(X_df)
d_scaled <- dist(X_scaled, method = "euclidean")
hc_scaled <- hclust(d_scaled, method = "complete")
par(mfrow = c(1,2))
plot(hc_raw, labels = FALSE, main = "No scaling", xlab = "", sub = "")
plot(hc_scaled, labels = FALSE, main = "With scaling", xlab = "", sub = "")
Figure 1. Complete-linkage dendrograms on synthetic data: left uses raw features, right uses standardized (z-scored) features.
par(mfrow = c(1,1))
toy <- matrix(c(0.00, 0.10,
                0.20, 0.15,
                1.75, 1.80,
                2.00, 2.20,
                1.90, 1.60,
                3.50, 3.60), ncol = 2, byrow = TRUE)
colnames(toy) <- c("x1", "x2")
rownames(toy) <- paste0("P", 1:nrow(toy))

# Helper: compute the full (symmetric) Euclidean distance matrix
dist_mat <- function(M) {
  as.matrix(dist(M, method = "euclidean"))
}
D0 <- dist_mat(toy)
round(D0, 3)
## P1 P2 P3 P4 P5 P6
## P1 0.000 0.206 2.440 2.900 2.421 4.950
## P2 0.206 0.000 2.264 2.728 2.234 4.774
## P3 2.440 2.264 0.000 0.472 0.250 2.510
## P4 2.900 2.728 0.472 0.000 0.608 2.052
## P5 2.421 2.234 0.250 0.608 0.000 2.561
## P6 4.950 4.774 2.510 2.052 2.561 0.000
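As a quick check, the smallest off-diagonal entry, d(P1, P2) = 0.206, can be reproduced directly from the coordinates:

# Euclidean distance between P1 = (0.00, 0.10) and P2 = (0.20, 0.15)
sqrt((0.00 - 0.20)^2 + (0.10 - 0.15)^2)  # sqrt(0.0425), approximately 0.206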
# Implement a single Lance–Williams update step for 'complete', 'single', 'average', 'ward.D2'
lw_update <- function(D, sizes, i, j, method = c("complete", "single", "average", "ward.D2")) {
  method <- match.arg(method)
  n <- nrow(D)
  keep <- setdiff(seq_len(n), c(i, j))
  # Distances from clusters i and j to the remaining clusters
  dik <- D[i, keep]; djk <- D[j, keep]; dij <- D[i, j]
  ni <- sizes[i]; nj <- sizes[j]; nk <- sizes[keep]
  if (method == "complete") {
    alpha_i <- 0.5; alpha_j <- 0.5; beta <- 0; gamma <- 0.5
  } else if (method == "single") {
    alpha_i <- 0.5; alpha_j <- 0.5; beta <- 0; gamma <- -0.5
  } else if (method == "average") {
    alpha_i <- ni / (ni + nj); alpha_j <- nj / (ni + nj); beta <- 0; gamma <- 0
  } else if (method == "ward.D2") {
    # Ward's recurrence operates on squared distances: for 'ward.D2', D must
    # contain squared Euclidean distances and the result is also squared
    alpha_i <- (ni + nk) / (ni + nj + nk)
    alpha_j <- (nj + nk) / (ni + nj + nk)
    beta <- -nk / (ni + nj + nk)
    gamma <- 0
  }
  new_d <- alpha_i * dik + alpha_j * djk + beta * dij + gamma * abs(dik - djk)
  # Build the reduced distance matrix: drop rows/columns i and j, then append
  # the merged cluster (i+j) as the last row and column
  D_new <- D[keep, keep, drop = FALSE]
  D_new <- cbind(D_new, new_d)
  new_col <- c(new_d, 0)
  D_new <- rbind(D_new, new_col)
  rownames(D_new)[nrow(D_new)] <- colnames(D_new)[ncol(D_new)] <-
    paste0("(", rownames(D)[i], "+", rownames(D)[j], ")")
  sizes_new <- c(sizes[keep], ni + nj)
  list(D = D_new, sizes = sizes_new)
}
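As a sanity check (the object chk below is just a throwaway name), the complete-linkage coefficients reduce to an elementwise maximum, so updating D0 after merging P1 and P2 should reproduce the maxima of their distances to the remaining points:

# For complete linkage, 0.5*d_ik + 0.5*d_jk + 0.5*|d_ik - d_jk| = max(d_ik, d_jk).
# lw_update() appends the merged cluster (P1+P2) as the last row/column.
chk <- lw_update(D0, sizes = rep(1, nrow(D0)), i = 1, j = 2, method = "complete")
all.equal(unname(chk$D[nrow(chk$D), 1:4]),
          unname(pmax(D0[1, 3:6], D0[2, 3:6])))  # should be TRUE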
# Work two steps manually with COMPLETE linkage (no scaling) on the toy data
D <- dist_mat(toy)
sizes <- rep(1, nrow(D))
# Step 1: find closest pair
D_tmp <- D + diag(Inf, nrow(D)) # ignore diagonal
idx <- which(D_tmp == min(D_tmp), arr.ind = TRUE)[1,]
merge_pair1 <- sort(idx)
merge_pair1_names <- rownames(D)[merge_pair1]
h1 <- D[merge_pair1[1], merge_pair1[2]]
cat("Step 1 merge:", paste(merge_pair1_names, collapse = " + "), "at height", round(h1, 3), "\n")
## Step 1 merge: P1 + P2 at height 0.206
u1 <- lw_update(D, sizes, merge_pair1[1], merge_pair1[2], method = "complete")
# Step 2: find next closest pair in the updated matrix
D2 <- u1$D; sizes2 <- u1$sizes
D2_tmp <- D2 + diag(Inf, nrow(D2))
idx2 <- which(D2_tmp == min(D2_tmp), arr.ind = TRUE)[1,]
merge_pair2 <- sort(idx2)
merge_pair2_names <- rownames(D2)[merge_pair2]
h2 <- D2[merge_pair2[1], merge_pair2[2]]
cat("Step 2 merge:", paste(merge_pair2_names, collapse = " + "), "at height", round(h2, 3), "\n")
## Step 2 merge: P3 + P5 at height 0.25
# Compare with hclust on full toy dataset
hc_toy <- hclust(dist(toy), method = "complete")
plot(hc_toy, main = "Toy data (complete linkage)", sub = "", xlab = "")
abline(h = h1, col = "gray", lty = 2)
abline(h = h2, col = "gray", lty = 3)
Figure 2. Dendrogram for the toy dataset using complete linkage; we also show the first two merge steps computed manually with the Lance–Williams update.
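If the manual steps are right, the first two merge heights recorded by hclust() should agree with h1 and h2 computed above:

# hclust() stores merge heights in order of the merges
round(hc_toy$height[1:2], 3)  # expected: 0.206 and 0.250
round(c(h1, h2), 3)

Next we compare linkage methods on the scaled synthetic blobs.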
methods <- c("single","complete","average","ward.D2")
par(mfrow = c(2,2))
for (m in methods) {
  hc <- hclust(dist(X_scaled), method = m)
  plot(hc, labels = FALSE, main = m, xlab = "", sub = "")
}
Figure 3. Comparing linkage methods on the synthetic blobs.
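Visual comparison is useful, but on well-separated blobs the linkages should also largely agree once each tree is cut into three groups. A minimal cross-tabulation sketch (cluster labels are arbitrary, so agreement shows up as one large count per row):

# Cross-tabulate 3-cluster assignments from two linkage methods
cut_complete <- cutree(hclust(dist(X_scaled), method = "complete"), k = 3)
cut_single   <- cutree(hclust(dist(X_scaled), method = "single"),   k = 3)
table(complete = cut_complete, single = cut_single)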
par(mfrow = c(1,1))
hc_avg <- hclust(dist(X_scaled), method = "average")
grp <- cutree(hc_avg, k = 3)
table(grp)
## grp
## 1 2 3
## 60 60 60
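Because the blobs were simulated with known membership (60 points each, stacked in the order X1, X2, X3), we can also cross-tabulate the recovered groups against the true labels; with well-separated blobs the table should be close to diagonal, up to a permutation of labels:

# True blob membership in the order the rows were stacked (X1, X2, X3)
true_blob <- rep(1:3, each = n)
table(cluster = grp, truth = true_blob)

Next, we apply the same workflow to the USJudgeRatings data, lawyers' ratings of U.S. Superior Court judges on twelve criteria.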
data(USJudgeRatings, package = "datasets")
Xj <- scale(as.matrix(USJudgeRatings)) # standardize columns
dj <- dist(Xj) # Euclidean distance
hcj <- hclust(dj, method = "average")
plot(hcj, main = "USJudgeRatings (average linkage)", sub = "", xlab = "", cex = 0.7)
rect.hclust(hcj, k = 3, border = "steelblue")
Figure 4. USJudgeRatings dendrogram (average linkage) on standardized variables.
grp_j <- cutree(hcj, k = 3)
cluster_means <- aggregate(as.data.frame(USJudgeRatings), by = list(cluster = grp_j), mean)
knitr::kable(round(cluster_means, 2), caption = "Cluster-wise mean ratings for U.S. judges.")
| cluster | CONT | INTG | DMNR | DILG | CFMG | DECI | PREP | FAMI | ORAL | WRIT | PHYS | RTEN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7.47 | 7.74 | 7.14 | 7.20 | 7.04 | 7.11 | 6.94 | 6.95 | 6.80 | 6.88 | 7.63 | 7.16 |
| 2 | 7.47 | 8.49 | 8.19 | 8.36 | 8.11 | 8.17 | 8.18 | 8.20 | 8.02 | 8.09 | 8.51 | 8.35 |
| 3 | 7.15 | 6.48 | 5.12 | 5.85 | 5.60 | 5.93 | 5.47 | 5.58 | 5.08 | 5.38 | 5.82 | 5.08 |
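To attach some meaning to the table, it helps to look at the cluster sizes and at which judges land in the lowest-rated group (cluster 3 in the table above; note that cutree's numeric labels carry no inherent order):

table(grp_j)                          # how many judges fall in each cluster
rownames(USJudgeRatings)[grp_j == 3]  # judges in the lowest-rated cluster

A heatmap of the standardized ratings gives a compact view of the same structure.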
# Default Rowv/Colv reorder rows and columns by hierarchical clustering and draw both dendrograms
heatmap(Xj, scale = "none", col = heat.colors(256), margins = c(6, 6))
Figure 5. Heatmap of standardized ratings with dendrograms for judges (rows) and criteria (columns).
hc_avg <- hclust(dist(X_scaled), method = "average")
coph <- cophenetic(hc_avg)
cor(as.numeric(dist(X_scaled)), as.numeric(coph))
## [1] 0.9425066
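The cophenetic correlation measures how faithfully the dendrogram's merge heights reproduce the original pairwise distances (values closer to 1 indicate a better fit). A small sketch comparing it across the four linkage methods used above; exact values will vary with the simulated data:

# Cophenetic correlation for each linkage method on the scaled blobs
sapply(c("single", "complete", "average", "ward.D2"), function(m) {
  hc_m <- hclust(dist(X_scaled), method = m)
  cor(as.numeric(dist(X_scaled)), as.numeric(cophenetic(hc_m)))
})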
Hierarchical clustering offers a transparent, flexible way to explore structure in data without presetting the number of clusters. See ISLR2 (James et al. 2021), and classic references on Ward’s method and Lance–Williams updates (Ward 1963; Murtagh and Legendre 2014; Lance and Williams 1967).
sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 26100)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_India.utf8 LC_CTYPE=English_India.utf8
## [3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_India.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.29 R6_2.5.1 jsonlite_1.8.8 evaluate_1.0.5
## [5] highr_0.11 cachem_1.0.6 rlang_1.1.3 cli_3.6.2
## [9] rstudioapi_0.16.0 jquerylib_0.1.4 bslib_0.4.0 rmarkdown_2.26
## [13] tools_4.2.1 xfun_0.43 yaml_2.3.8 fastmap_1.1.0
## [17] compiler_4.2.1 htmltools_0.5.8.1 knitr_1.46 sass_0.4.2