1 Why hierarchical clustering?

Hierarchical clustering is a family of methods that build a tree of clusters from data. It is especially handy when you want to see relationships at multiple levels (coarse to fine) or when you don’t want to pick the number of clusters in advance. The classic output is a dendrogram, the branching tree you often see next to heatmaps. This tutorial is written for readers with a first course in statistics and some basic R experience. For more background, see Section 12.4.2 of An Introduction to Statistical Learning (2e) (James et al. 2021).

We will:

  1. Review distances and scaling (why they matter).
  2. Walk through the agglomerative algorithm step-by-step (with code).
  3. Compare popular linkage choices (single, complete, average, Ward.D2) on synthetic data.
  4. Work a real-data example and interpret the dendrogram.
  5. Offer practical tips and references.

Throughout, we will use base R functions like dist() and hclust() (R Core Team 2025).
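A minimal end-to-end call looks like this (a quick sketch using the built-in mtcars data, standardizing first so no single variable dominates):

hc_sketch <- hclust(dist(scale(mtcars)), method = "complete")
plot(hc_sketch)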

2 Distances and scaling

Hierarchical clustering starts from a distance (or dissimilarity) between observations. The most common is Euclidean distance on numeric features. Because distances are sensitive to scale, it’s standard to standardize (z-score) variables first when units differ.
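To see why, here is a tiny made-up example with two variables on very different scales: without scaling, the income column dominates the distances.

# Hypothetical illustration: income (thousands) swamps height (meters) in raw Euclidean distance
height <- c(1.60, 1.85, 1.62)
income <- c(90, 30, 88)
Z <- cbind(height, income)
round(dist(Z), 2)          # raw distances: driven almost entirely by income
round(dist(scale(Z)), 2)   # after z-scoring: both variables contribute comparably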

Below we simulate three 2D clusters and compare dendrograms with and without scaling.

# Simulate three compact clusters in 2D
n <- 60
make_blob <- function(n, center, sd=0.35) {
  cbind(rnorm(n, center[1], sd), rnorm(n, center[2], sd))
}
X1 <- make_blob(n, c(0, 0))
X2 <- make_blob(n, c(3, 0))
X3 <- make_blob(n, c(1.5, 2.5))
X <- rbind(X1, X2, X3)
colnames(X) <- c("x1", "x2")
X_df <- as.data.frame(X)

# Dendrogram without scaling (Euclidean)
d_raw <- dist(X_df, method = "euclidean")
hc_raw <- hclust(d_raw, method = "complete")

# Dendrogram with scaling
X_scaled <- scale(X_df)
d_scaled <- dist(X_scaled, method = "euclidean")
hc_scaled <- hclust(d_scaled, method = "complete")
par(mfrow = c(1,2))
plot(hc_raw, labels = FALSE, main = "No scaling", xlab = "", sub = "")
plot(hc_scaled, labels = FALSE, main = "With scaling", xlab = "", sub = "")
Figure 1. Complete-linkage dendrograms on synthetic data: left uses raw features, right uses standardized (z-scored) features.

par(mfrow = c(1,1))

3 How agglomerative clustering works (step-by-step)

Agglomerative clustering starts with every observation in its own cluster, then repeatedly merges the two closest clusters and updates the inter-cluster distances according to the chosen linkage. We illustrate the first two merges by hand on a small toy dataset of six points.
toy <- matrix(c(0.00,0.10,
                0.20,0.15,
                1.75,1.80,
                2.00,2.20,
                1.90,1.60,
                3.50,3.60), ncol = 2, byrow = TRUE)
colnames(toy) <- c("x1","x2")
rownames(toy) <- paste0("P", 1:nrow(toy))

# Helper: compute Euclidean distance matrix
dist_mat <- function(M) {
  as.matrix(dist(M, method = "euclidean"))
}

D0 <- dist_mat(toy)
round(D0, 3)
##       P1    P2    P3    P4    P5    P6
## P1 0.000 0.206 2.440 2.900 2.421 4.950
## P2 0.206 0.000 2.264 2.728 2.234 4.774
## P3 2.440 2.264 0.000 0.472 0.250 2.510
## P4 2.900 2.728 0.472 0.000 0.608 2.052
## P5 2.421 2.234 0.250 0.608 0.000 2.561
## P6 4.950 4.774 2.510 2.052 2.561 0.000
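The helper below implements one step of the Lance–Williams recurrence, which expresses the distance from the merged cluster (i ∪ j) to any other cluster k as a weighted combination of the old distances,

d(i ∪ j, k) = alpha_i * d(i, k) + alpha_j * d(j, k) + beta * d(i, j) + gamma * |d(i, k) - d(j, k)|,

where the coefficients (alpha_i, alpha_j, beta, gamma) depend on the linkage and, for some linkages, on the cluster sizes (Lance and Williams 1967).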
# Implement a single Lance–Williams update step for 'complete', 'single', 'average', 'ward.D2'
lw_update <- function(D, sizes, i, j, method = c("complete","single","average","ward.D2")) {
  method <- match.arg(method)
  n <- nrow(D)
  keep <- setdiff(seq_len(n), c(i, j))
  # distances from i and j to remaining clusters
  dik <- D[i, keep]; djk <- D[j, keep]; dij <- D[i, j]
  ni <- sizes[i]; nj <- sizes[j]; nk <- sizes[keep]
  if (method == "complete") {
    alpha_i <- 0.5; alpha_j <- 0.5; beta <- 0; gamma <- 0.5
  } else if (method == "single") {
    alpha_i <- 0.5; alpha_j <- 0.5; beta <- 0; gamma <- -0.5
  } else if (method == "average") {
    alpha_i <- ni / (ni + nj); alpha_j <- nj / (ni + nj); beta <- 0; gamma <- 0
  } else if (method == "ward.D2") {
    # Ward's update is defined on squared distances: to mimic hclust's ward.D2,
    # pass squared distances in D and take the square root of the returned heights
    alpha_i <- (ni + nk) / (ni + nj + nk)
    alpha_j <- (nj + nk) / (ni + nj + nk)
    beta <- - nk / (ni + nj + nk)
    gamma <- 0
  }
  new_d <- alpha_i * dik + alpha_j * djk + beta * dij + gamma * abs(dik - djk)
  # Build the reduced distance matrix: drop rows/cols i and j, append the merged cluster at the end
  D_new <- D[keep, keep, drop = FALSE]
  # append the row/column of distances to the new merged cluster
  D_new <- cbind(D_new, new_d)
  new_col <- c(new_d, 0)
  D_new <- rbind(D_new, new_col)
  rownames(D_new)[nrow(D_new)] <- colnames(D_new)[ncol(D_new)] <- paste0("(", rownames(D)[i], "+", rownames(D)[j], ")")
  sizes_new <- c(sizes[keep], ni + nj)
  list(D = D_new, sizes = sizes_new)
}

# Work two steps manually with COMPLETE linkage (no scaling) on the toy data
D <- dist_mat(toy)
sizes <- rep(1, nrow(D))
# Step 1: find closest pair
D_tmp <- D + diag(Inf, nrow(D)) # ignore diagonal
idx <- which(D_tmp == min(D_tmp), arr.ind = TRUE)[1,]
merge_pair1 <- sort(idx)
merge_pair1_names <- rownames(D)[merge_pair1]
h1 <- D[merge_pair1[1], merge_pair1[2]]
cat("Step 1 merge:", paste(merge_pair1_names, collapse = " + "), "at height", round(h1, 3), "\n")
## Step 1 merge: P1 + P2 at height 0.206
u1 <- lw_update(D, sizes, merge_pair1[1], merge_pair1[2], method = "complete")
# Step 2: find next closest pair in the updated matrix
D2 <- u1$D; sizes2 <- u1$sizes
D2_tmp <- D2 + diag(Inf, nrow(D2))
idx2 <- which(D2_tmp == min(D2_tmp), arr.ind = TRUE)[1,]
merge_pair2 <- sort(idx2)
merge_pair2_names <- rownames(D2)[merge_pair2]
h2 <- D2[merge_pair2[1], merge_pair2[2]]
cat("Step 2 merge:", paste(merge_pair2_names, collapse = " + "), "at height", round(h2, 3), "\n")
## Step 2 merge: P3 + P5 at height 0.25
# Compare with hclust on full toy dataset
hc_toy <- hclust(dist(toy), method = "complete")
plot(hc_toy, main = "Toy data (complete linkage)", sub = "", xlab = "")
abline(h = h1, col = "gray", lty = 2)
abline(h = h2, col = "gray", lty = 3)
Figure 2. Dendrogram for the toy dataset using complete linkage; we also show the first two merge steps computed manually with the Lance–Williams update.
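As a sanity check (a small sketch reusing the lw_update() helper and hc_toy from above), we can iterate the update all the way down and compare the resulting merge heights with those stored by hclust():

# Iterate the manual Lance–Williams update to completion (complete linkage)
D_cur <- dist_mat(toy)
sizes_cur <- rep(1, nrow(D_cur))
manual_heights <- numeric(0)
while (nrow(D_cur) > 2) {
  tmp <- D_cur + diag(Inf, nrow(D_cur))            # mask the diagonal
  ij  <- sort(which(tmp == min(tmp), arr.ind = TRUE)[1, ])
  manual_heights <- c(manual_heights, D_cur[ij[1], ij[2]])
  u <- lw_update(D_cur, sizes_cur, ij[1], ij[2], method = "complete")
  D_cur <- u$D
  sizes_cur <- u$sizes
}
manual_heights <- c(manual_heights, D_cur[1, 2])   # final merge of the last two clusters
all.equal(unname(manual_heights), hc_toy$height)   # should agree for complete linkage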

4 Linkage choices side-by-side

The linkage defines the distance between clusters: single uses the smallest pairwise distance (prone to chaining), complete the largest, average the mean of all pairwise distances, and Ward.D2 the merge that least increases within-cluster variance. Below we compare them on the scaled synthetic blobs and then cut the average-linkage tree into three groups.
methods <- c("single","complete","average","ward.D2")
par(mfrow = c(2,2))
for (m in methods) {
  hc <- hclust(dist(X_scaled), method = m)
  plot(hc, labels = FALSE, main = m, xlab = "", sub = "")
}
Figure 3. Comparing linkage methods on the synthetic blobs.

par(mfrow = c(1,1))
hc_avg <- hclust(dist(X_scaled), method = "average")
grp <- cutree(hc_avg, k = 3)
table(grp)
## grp
##  1  2  3 
## 60 60 60
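Because the blobs were stacked in a known order when we built X (X1, then X2, then X3 via rbind), we can cross-tabulate the recovered clusters against the generating labels; cluster numbering is arbitrary, so look for one dominant cell per row:

true_lab <- rep(c("blob1", "blob2", "blob3"), each = n)  # generating label for each row of X
table(generated = true_lab, recovered = grp)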

5 Real-data example: U.S. judge ratings

The USJudgeRatings dataset (built into R) contains lawyers' ratings of 43 state judges on twelve criteria such as integrity, demeanor, and diligence. We standardize the columns, compute Euclidean distances, and cluster the judges with average linkage.
data(USJudgeRatings, package = "datasets")
Xj <- scale(as.matrix(USJudgeRatings))  # standardize columns
dj <- dist(Xj)                          # Euclidean distance
hcj <- hclust(dj, method = "average")
plot(hcj, main = "USJudgeRatings (average linkage)", sub = "", xlab = "", cex = 0.7)
rect.hclust(hcj, k = 3, border = "steelblue")
Figure 4. USJudgeRatings dendrogram (average linkage) on standardized variables.

grp_j <- cutree(hcj, k = 3)
cluster_means <- aggregate(as.data.frame(USJudgeRatings), by = list(cluster = grp_j), mean)
knitr::kable(round(cluster_means, 2), caption = "Cluster-wise mean ratings for U.S. judges.")
Cluster-wise mean ratings for U.S. judges.

cluster  CONT  INTG  DMNR  DILG  CFMG  DECI  PREP  FAMI  ORAL  WRIT  PHYS  RTEN
      1  7.47  7.74  7.14  7.20  7.04  7.11  6.94  6.95  6.80  6.88  7.63  7.16
      2  7.47  8.49  8.19  8.36  8.11  8.17  8.18  8.20  8.02  8.09  8.51  8.35
      3  7.15  6.48  5.12  5.85  5.60  5.93  5.47  5.58  5.08  5.38  5.82  5.08
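Instead of fixing k, cutree() can also cut the tree at a height read off the dendrogram; the threshold below is an arbitrary illustration and should be chosen by eye from Figure 4.

grp_h <- cutree(hcj, h = 4)  # hypothetical height; adjust after inspecting the dendrogram
table(grp_h)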
heatmap(Xj, scale = "none", col = heat.colors(256), margins = c(6, 6))  # default Rowv/Colv add row and column dendrograms
Figure 5. Heatmap of standardized ratings with dendrograms for judges (rows) and criteria (columns).

6 Practical tips

  • Standardize when units differ (most real datasets).
  • Use Euclidean distance with Ward.D2 or average linkage as a strong default pairing.
  • Watch for outliers: single linkage tends to chain through them, and complete linkage heights can be inflated by them; points that join the tree very late are worth inspecting.
  • To check how well the tree preserves original distances, compute the cophenetic correlation:
hc_avg <- hclust(dist(X_scaled), method = "average")
coph <- cophenetic(hc_avg)
cor(as.numeric(dist(X_scaled)), as.numeric(coph))
## [1] 0.9425066
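The same check can be run for several linkages at once (a small sketch reusing X_scaled; average linkage often scores well on this criterion):

d_sc <- dist(X_scaled)
sapply(c("single", "complete", "average", "ward.D2"), function(m) {
  cor(as.numeric(d_sc), as.numeric(cophenetic(hclust(d_sc, method = m))))
})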

7 Conclusion

Hierarchical clustering offers a transparent, flexible way to explore structure in data without presetting the number of clusters. See ISLR2 (James et al. 2021), and classic references on Ward’s method and Lance–Williams updates (Ward 1963; Murtagh and Legendre 2014; Lance and Williams 1967).

8 Reproducibility

sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 26100)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_India.utf8  LC_CTYPE=English_India.utf8   
## [3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C                  
## [5] LC_TIME=English_India.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.29     R6_2.5.1          jsonlite_1.8.8    evaluate_1.0.5   
##  [5] highr_0.11        cachem_1.0.6      rlang_1.1.3       cli_3.6.2        
##  [9] rstudioapi_0.16.0 jquerylib_0.1.4   bslib_0.4.0       rmarkdown_2.26   
## [13] tools_4.2.1       xfun_0.43         yaml_2.3.8        fastmap_1.1.0    
## [17] compiler_4.2.1    htmltools_0.5.8.1 knitr_1.46        sass_0.4.2
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. 2nd ed. New York: Springer. https://www.statlearning.com.
Lance, G. N., and W. T. Williams. 1967. “A General Theory of Classificatory Sorting Strategies.” The Computer Journal 9 (4): 373–80. https://doi.org/10.1093/comjnl/9.4.373.
Murtagh, Fionn, and Pierre Legendre. 2014. “Ward’s Hierarchical Agglomerative Clustering Method: Which Algorithms Implement Ward’s Criterion?” Journal of Classification 31 (3): 274–95. https://doi.org/10.1007/s00357-014-9161-z.
R Core Team. 2025. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Ward, Joe H. 1963. “Hierarchical Grouping to Optimize an Objective Function.” Journal of the American Statistical Association 58 (301): 236–44. https://doi.org/10.1080/01621459.1963.10500845.