Multivariate Statistics - StatUa

Hierarchical Clustering - Exercise solutions

Decathlon data

Questions:

Use the 10 athletic events from the decathlon dataset.

Why must we use scale() here?

What would happen to the “distance” between two athletes if we kept the 1500m (seconds) and the Shot Put (meters) in their original units?

Run hierarchical clustering twice: once using Single Linkage (nearest neighbor) and once using Complete Linkage (furthest neighbor).

Compare the two dendrograms. Which one looks like a “chain” and which one looks like distinct “branches”?

Now use Ward’s Method. This method aims to minimize the variance within clusters.

Cut the tree into 3 clusters.

Look at the names of the athletes in each cluster. Do the Olympic medalists all end up in the same cluster, or are they spread out based on their specific strengths?

Now, transpose the dataset so that the events (variables) are clustered instead of the athletes.

Which events are grouped together in the dendrogram? Do these groupings correspond to meaningful categories such as running events, jumping events, or throwing events?

If two events (e.g., 100m and long jump) are clustered closely together, what does this tell you about the athletes’ performance in these events? How would you interpret this relationship in terms of underlying physical abilities?


library(FactoMineR)

data(decathlon)

decathlon_cont <- decathlon[,1:10]

decathlon_cont <- data.frame(scale(decathlon_cont))

summary(decathlon_cont)
#>      X100m            Long.jump          Shot.put         High.jump      
#>  Min.   :-2.12167   Min.   :-2.0544   Min.   :-2.1798   Min.   :-1.4258  
#>  1st Qu.:-0.56287   1st Qu.:-0.7269   1st Qu.:-0.7242   1st Qu.:-0.6389  
#>  Median :-0.06862   Median : 0.1264   Median : 0.1127   Median :-0.3016  
#>  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
#>  3rd Qu.: 0.53969   3rd Qu.: 0.6953   3rd Qu.: 0.5979   3rd Qu.: 0.7102  
#>  Max.   : 2.44067   Max.   : 2.2124   Max.   : 2.2839   Max.   : 1.9468  
#>      X400m          X110m.hurdle         Discus           Pole.vault     
#>  Min.   :-2.4330   Min.   :-1.3478   Min.   :-1.89636   Min.   :-2.0232  
#>  1st Qu.:-0.5950   1st Qu.:-0.8390   1st Qu.:-0.71809   1st Qu.:-0.9440  
#>  Median :-0.1876   Median :-0.2668   Median : 0.02498   Median : 0.1351  
#>  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
#>  3rd Qu.: 0.5927   3rd Qu.: 0.7930   3rd Qu.: 0.51642   3rd Qu.: 0.5668  
#>  Max.   : 3.1069   Max.   : 2.2556   Max.   : 2.16836   Max.   : 2.2934  
#>     Javeline             X1500m        
#>  Min.   :-1.658770   Min.   :-1.44989  
#>  1st Qu.:-0.631179   1st Qu.:-0.68575  
#>  Median : 0.008994   Median :-0.08351  
#>  Mean   : 0.000000   Mean   : 0.00000  
#>  3rd Qu.: 0.533149   3rd Qu.: 0.52043  
#>  Max.   : 2.528251   Max.   : 3.25318

dist_mat <- dist(decathlon_cont)

The decathlon variables are measured in very different units. For example, the 1500m is measured in seconds, with values ranging from roughly 260 to 320, while the Shot Put is measured in meters, with values between roughly 12 and 17.

Distance-based methods (like clustering) are highly sensitive to scale. If we do not standardize, variables with larger numerical ranges dominate the distance calculation. As a consequence, the clustering would be driven almost entirely by events like the 1500m, ignoring others like Shot Put or High Jump.

If we compute distances without scaling, a modest difference in 1500m time (e.g., 10 seconds) outweighs a large difference in Shot Put (e.g., 3 meters), because Euclidean distance adds up squared differences in the raw units. Athletes would then be grouped mostly by endurance, and the other performance dimensions would barely register, resulting in misleading clusters.
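The effect can be checked with a small worked example. The sketch below uses four made-up athletes (the values are hypothetical, not rows of the decathlon data) and compares how much each event contributes to the squared Euclidean distance between athletes A and B, before and after scale().

```r
# Hypothetical athletes: values are made up for illustration only
athletes <- matrix(
  c(270, 16.0,
    280, 13.0,
    275, 14.5,
    300, 15.0),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("A", "B", "C", "D"), c("X1500m", "Shot.put"))
)

# Share of each event in the squared distance between rows i and j
contribution <- function(m, i, j) {
  sq_diff <- (m[i, ] - m[j, ])^2
  sq_diff / sum(sq_diff)
}

raw_share    <- contribution(athletes, "A", "B")
scaled_share <- contribution(scale(athletes), "A", "B")

round(raw_share, 2)     # X1500m carries about 0.92 of the raw squared distance
round(scaled_share, 2)  # after scaling, Shot.put carries most of it
```

In raw units the 10-second gap swamps the 3-meter gap. Once both columns have unit variance, the 3-meter gap, which is large relative to the spread in shot put, counts for more.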

hc_single <- hclust(dist_mat, method = "single")

plot(hc_single, main = "Single Linkage Dendrogram")


hc_complete <- hclust(dist_mat, method = "complete")

plot(hc_complete, main = "Complete Linkage Dendrogram")

Comparing Single Linkage (nearest neighbor) with Complete Linkage (furthest neighbor), the two dendrograms look quite different. Single Linkage tends to create a “chain” structure, because two clusters are merged as soon as any pair of their points is close. This means the method:

Is sensitive to noise
Can connect very different observations through intermediate points

Complete Linkage, on the other hand, merges clusters based on the maximum distance between their members, which produces compact, distinct “branches”. This results in:

More balanced cluster shapes
Less chaining effect
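The chaining effect is easy to reproduce on a toy example. The sketch below uses eight hypothetical points on a line with slowly growing gaps (not the decathlon data) and compares the cluster sizes after cutting each tree into two groups.

```r
# Hypothetical 1-D data: gaps between neighbours grow from 1.0 to 1.6
x <- c(0, 1, 2.1, 3.3, 4.6, 6.0, 7.5, 9.1)
d <- dist(x)

hc_s <- hclust(d, method = "single")
hc_c <- hclust(d, method = "complete")

# Cluster sizes when each tree is cut into two groups
sizes_single   <- table(cutree(hc_s, k = 2))
sizes_complete <- table(cutree(hc_c, k = 2))

sizes_single    # one long chain plus a single leftover point
sizes_complete  # two groups of equal size
```

Single linkage adds one neighbour at a time, so the cut only peels off the last point. Complete linkage penalises wide clusters and splits the line in the middle.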


# "ward" was renamed to "ward.D"; the new name gives the same tree without the message
hc_ward <- hclust(dist_mat, method = "ward.D")

plot(hc_ward, main = "Ward's Method Dendrogram", col=decathlon$Rank)

Ward’s method merges, at each step, the two clusters whose fusion gives the smallest increase in within-cluster variance, which often makes it a good default for real datasets. Here, Ward’s method results in:

Clear, interpretable clusters
Balanced group sizes
Two main branches
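In R, hclust() offers two Ward variants, "ward.D" and "ward.D2". According to the hclust() documentation, "ward.D2" implements Ward’s criterion when given ordinary Euclidean distances, while "ward.D" does so only when given squared distances. The sketch below checks this relationship on simulated data (the data and object names are illustrative).

```r
set.seed(42)
# Simulated data: 20 observations on 3 standardized variables
toy <- scale(matrix(rnorm(60), ncol = 3))
d   <- dist(toy)

hc_d2 <- hclust(d,   method = "ward.D2")  # Ward's criterion on Euclidean distances
hc_d  <- hclust(d^2, method = "ward.D")   # same criterion, fed squared distances

# Same groups; heights differ only by a square root
same_groups  <- all(cutree(hc_d2, k = 3) == cutree(hc_d, k = 3))
same_heights <- isTRUE(all.equal(hc_d2$height, sqrt(hc_d$height)))
c(same_groups, same_heights)
```

Calling "ward.D" on unsquared distances, as the old "ward" name does, still gives a valid hierarchical clustering, but it is not exactly the variance-minimizing criterion described above.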

library(tidyverse)
library(factoextra)  # provides fviz_cluster(), used below

clusters <- cutree(hc_ward, k = 3)

When cutting the dendrogram into 3 clusters, one possible reading of the groups is:

Cluster 1: speed-focused athletes
Cluster 2: strength-focused athletes
Cluster 3: balanced/generalists
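Labels like these are easiest to justify by looking at the average standardized score per cluster. The sketch below shows the idea on six hypothetical athletes and two made-up summary scores (sprint and throw are illustrative names, not columns of the decathlon data).

```r
# Hypothetical standardized scores (higher = better), for illustration only
scores <- data.frame(
  sprint = c(-1.2, -0.9, -1.0,  1.1,  0.8,  1.2),
  throw  = c( 1.0,  1.3,  0.8, -1.1, -0.9, -1.1)
)

hc <- hclust(dist(scores), method = "ward.D2")
cl <- cutree(hc, k = 2)

# Mean score per cluster: the sign pattern suggests a label for each group
profile <- aggregate(scores, by = list(cluster = cl), FUN = mean)
profile
```

A cluster with a positive mean on throw and a negative mean on sprint would be labelled strength-focused, and the reverse pattern speed-focused.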

fviz_cluster(list(data = decathlon_cont, cluster = clusters))


plot(hc_ward, main = "Ward's Method Dendrogram")
rect.hclust(hc_ward, k=3, border=2:5)

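The dataset covers two competitions, so each rank occurs twice in the output below: names in capitals are Decastar results and names in mixed case are Olympic Games results. Based on the Rank variable, the Olympic medalists are: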

Sebrle
Clay
Karpov

rownames(decathlon)[decathlon$Rank==1]
#> [1] "SEBRLE" "Sebrle"
rownames(decathlon)[decathlon$Rank==2]
#> [1] "CLAY" "Clay"
rownames(decathlon)[decathlon$Rank==3]
#> [1] "KARPOV" "Karpov"

# Add cluster labels to dataset
decathlon_cont$cluster <- clusters

# Add the names of the athletes as a variable
decathlon_cont <- decathlon_cont %>%
  mutate(Athlete = rownames(decathlon_cont))

# View athletes per cluster
decathlon_cont %>%
  dplyr::select(Athlete, cluster) %>%
  arrange(cluster)
#>                 Athlete cluster
#> SEBRLE           SEBRLE       1
#> CLAY               CLAY       1
#> KARPOV           KARPOV       1
#> BERNARD         BERNARD       1
#> WARNERS         WARNERS       1
#> Warners         Warners       1
#> Nool               Nool       1
#> Schwarzl       Schwarzl       1
#> Pogorelov     Pogorelov       1
#> Schoenbeck   Schoenbeck       1
#> Averyanov     Averyanov       1
#> Ojaniemi       Ojaniemi       1
#> Drews             Drews       1
#> Terek             Terek       1
#> Turi               Turi       1
#> Korkizoglou Korkizoglou       1
#> YURKOV           YURKOV       2
#> ZSIVOCZKY     ZSIVOCZKY       2
#> McMULLEN       McMULLEN       2
#> MARTINEAU     MARTINEAU       2
#> HERNU             HERNU       2
#> BARRAS           BARRAS       2
#> NOOL               NOOL       2
#> BOURGUIGNON BOURGUIGNON       2
#> Macey             Macey       2
#> Zsivoczky     Zsivoczky       2
#> Hernu             Hernu       2
#> Bernard         Bernard       2
#> Barras           Barras       2
#> Smith             Smith       2
#> Smirnov         Smirnov       2
#> Qi                   Qi       2
#> Parkhomenko Parkhomenko       2
#> Gomez             Gomez       2
#> Lorenzo         Lorenzo       2
#> Karlivans     Karlivans       2
#> Uldal             Uldal       2
#> Casarsa         Casarsa       2
#> Sebrle           Sebrle       3
#> Clay               Clay       3
#> Karpov           Karpov       3

The three Olympic medalists (Sebrle, Clay, Karpov) form a cluster of their own, which is a meaningful result. Hierarchical clustering is unsupervised, so it has no prior knowledge of which athletes are top performers; it only groups individuals by similarity in their performance across all events. That the highest-performing athletes end up together suggests that the clustering captures a strong underlying structure in the data, namely overall performance level. Elite decathletes are not simply specialists in one type of event, but tend to perform consistently well across multiple disciplines. In other words, what drives the clustering here is not just whether an athlete is more oriented toward speed or strength, but how strong they are overall. Note that the Decastar performances of the same three athletes (SEBRLE, CLAY, KARPOV) fall in cluster 1, so the clusters group performances rather than people.

data(decathlon)

decathlon_cont <- decathlon[,1:10]

decathlon_cont <- data.frame(scale(decathlon_cont))

decathlon_t <- t(decathlon_cont)

dist_mat_t <- dist(decathlon_t)

hc_t <- hclust(dist_mat_t, method = "ward.D")

plot(hc_t, main = "Clustering of Decathlon Events")
rect.hclust(hc_t, k=4, border=2:5)

When clustering the variables instead of the athletes, we group events by how similarly athletes score on them. The dendrogram reveals meaningful clusters that correspond to underlying physical abilities. For example, sprinting events such as the 100m and 400m cluster together, as they both rely heavily on speed and explosive power. Throwing events such as shot put and discus also cluster together due to their reliance on strength and technique. However, the jumping events, such as the high jump and the long jump, are not clustered together.

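If two events, such as the 100m and long jump, are clustered closely together, their standardized values move together across athletes, which points to a shared underlying ability (in this case, explosive power and speed). One caveat applies: running events are recorded as times, where lower is better, while jumps and throws are recorded as distances, where higher is better. With Euclidean distance on the standardized values, a time event and a distance event only end up close if slow times go with long distances, so the direction of the relationship should be checked before concluding that athletes who perform well in one event also perform well in the other. More generally, clustering variables provides insight into the latent structure of the dataset, showing how different performance measures relate to each other and helping to identify broader dimensions of athletic ability.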
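For standardized variables, the Euclidean distance d between two columns is tied directly to their correlation r: d^2 = 2(n - 1)(1 - r), where n is the number of athletes. The sketch below checks this identity on simulated data (variable names and values are hypothetical). It also shows that a time variable and a distance variable driven by the same ability end up far apart unless the time is sign-flipped.

```r
set.seed(1)
n <- 30
speed <- rnorm(n)  # hypothetical latent ability

toy <- data.frame(
  run_time   = -speed + rnorm(n, sd = 0.5),  # faster athletes have lower times
  jump_dist  =  speed + rnorm(n, sd = 0.5),  # faster athletes jump further
  throw_dist = rnorm(n)                      # unrelated to speed
)

z <- scale(toy)
d <- as.matrix(dist(t(z)))  # distances between variables
r <- cor(toy)

# Identity: squared distance equals 2 * (n - 1) * (1 - correlation)
identity_holds <- isTRUE(all.equal(d^2, 2 * (n - 1) * (1 - r),
                                   check.attributes = FALSE))

# Flip the sign of the time so that "higher = better" for every variable
z_flip <- z
z_flip[, "run_time"] <- -z_flip[, "run_time"]
d_flip <- as.matrix(dist(t(z_flip)))

round(d["run_time", "jump_dist"], 2)       # far apart: negatively correlated
round(d_flip["run_time", "jump_dist"], 2)  # close once the sign is aligned
```

An alternative under the same assumptions is to cluster on a correlation-based dissimilarity such as 1 - abs(r), which ignores the direction of the relationship altogether.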

Penguin data

Questions:

Compute the Euclidean distance matrix for your scaled data.

Visualize the distance matrix as a heatmap. Do you already see “blocks” of similar penguins?

Perform hierarchical clustering using two different linkage methods: Complete Linkage and Ward’s Method.

Plot both dendrograms side-by-side. Which one produces more “balanced” clusters?

Cut the dendrogram into k=3 clusters.

Compare these 3 clusters to the actual species. Which species did the algorithm separate perfectly, and where did it get confused?

If the dendrogram shows two main branches, but you know there are three species, do you trust the math or your “theoretical knowledge” of biology?

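library(palmerpenguins)  # penguins data with the column names used below

data(penguins, package = "palmerpenguins")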

penguin <- na.omit(penguins) 

penguin_num <- penguin %>% 
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)  

penguin_scaled <- scale(penguin_num)

dist_mat_p <- dist(penguin_scaled)

dist_matrix <- as.matrix(dist_mat_p) 

heatmap(dist_matrix, symm = TRUE)

After computing the Euclidean distance matrix on the scaled data, the heatmap already reveals structure in the dataset. Blocks of darker and lighter regions indicate groups of penguins that are more similar to each other. You can see a clear block corresponding to the Gentoo penguins, which are more distinct in terms of body size and flipper length. This shows that even before clustering, the data contains visible group structure.

# Complete linkage 
hc_complete_p <- hclust(dist_mat_p, method = "complete") 
plot(hc_complete_p, main = "Complete Linkage")


# Ward's method 
hc_ward_p <- hclust(dist_mat_p, method = "ward.D")
plot(hc_ward_p, main = "Ward's Method")

When comparing the dendrograms, Complete Linkage and Ward’s Method produce noticeably different structures. Complete Linkage merges on the largest pairwise distance, which makes it sensitive to outliers and can leave clusters of uneven size, while Ward’s Method produces more compact and balanced clusters by minimizing within-cluster variance. As a result, Ward’s Method yields the clearer and more interpretable clustering structure in this dataset.

clusters <- cutree(hc_ward_p, k = 3) 

# Compare with species 
table(clusters, penguin$species)
#>         
#> clusters Adelie Chinstrap Gentoo
#>        1    144         4      0
#>        2      2        64      0
#>        3      0         0    119

When cutting the dendrogram into three clusters, the results can be compared to the true species labels. The clustering algorithm separates the Gentoo penguins perfectly (all 119 in cluster 3), as they are clearly distinct in terms of size-related variables. However, there is some confusion between Adelie and Chinstrap penguins, as these species have overlapping physical characteristics: 4 Chinstrap penguins land in the mostly-Adelie cluster and 2 Adelie penguins in the mostly-Chinstrap cluster, so 6 of the 333 penguins are misplaced. This highlights an important point: clustering reflects similarity in the measured variables, not necessarily the true biological categories.
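The agreement can be quantified directly from the table above. The sketch below re-enters the counts by hand and computes how many penguins fall outside the dominant species of their cluster (the object names are illustrative).

```r
# Counts copied from the cluster-by-species table above
tab <- matrix(
  c(144,  4,   0,
      2, 64,   0,
      0,  0, 119),
  nrow = 3, byrow = TRUE,
  dimnames = list(cluster = c("1", "2", "3"),
                  species = c("Adelie", "Chinstrap", "Gentoo"))
)

matched   <- sum(apply(tab, 1, max))  # penguins in their cluster's dominant species
misplaced <- sum(tab) - matched
purity    <- matched / sum(tab)

c(matched = matched, misplaced = misplaced, purity = round(purity, 3))
```

With these counts, 327 of 333 penguins sit in the cluster dominated by their own species, a purity of about 0.982.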

fviz_cluster(list(data = penguin_scaled, cluster = clusters))

If the dendrogram suggests two main branches while we know there are three species, the clustering algorithm is not “wrong”; it is simply reflecting the structure present in the data based on the variables used. However, as analysts, we should not blindly trust the output. Instead, we must combine statistical results with theoretical knowledge, in this case, biological understanding of species differences.
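A dendrogram can show two main branches and still contain three recoverable groups. The sketch below illustrates this with nine hypothetical 1-D points (not the penguin data): two groups lie close together and a third lies far away, so the last merge is much higher than the one before it.

```r
# Hypothetical data: groups near 0, near 3, and near 20
x  <- c(0, 0.2, 0.5, 3, 3.3, 3.7, 20, 20.4, 20.9)
hc <- hclust(dist(x), method = "complete")

# Heights of the last two merges: a big jump signals two main branches
top_heights <- tail(hc$height, 2)
top_heights

# Cutting lower still recovers the three groups
sizes_k2 <- as.integer(table(cutree(hc, k = 2)))
sizes_k3 <- as.integer(table(cutree(hc, k = 3)))
sizes_k2
sizes_k3
```

The height of the last merge describes how separated the two largest branches are. It does not rule out a meaningful split one level further down, which is where subject-matter knowledge helps decide on k.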

Hierarchical Clustering - exercise solutions

dr. Annelies Agten

2026-04-29
