Questions:
Use the 10 athletic events from the decathlon dataset.
Why must we use scale() here?
What would happen to the “distance” between two athletes if we kept the 1500m (seconds) and the Shot Put (meters) in their original units?
Run hierarchical clustering twice: once using Single Linkage (nearest neighbor) and once using Complete Linkage (furthest neighbor).
Compare the two dendrograms. Which one looks like a “chain” and which one looks like distinct “branches”?
Now use Ward’s Method. This method aims to minimize the variance within clusters.
Cut the tree into 3 clusters.
Look at the names of the athletes in each cluster. Do the Olympic medalists all end up in the same cluster, or are they spread out based on their specific strengths?
Now, transpose the dataset so that the events (variables) are clustered instead of the athletes.
Which events are grouped together in the dendrogram? Do these groupings correspond to meaningful categories such as running events, jumping events, or throwing events?
If two events (e.g., 100m and long jump) are clustered closely together, what does this tell you about the athletes’ performance in these events? How would you interpret this relationship in terms of underlying physical abilities?
library(FactoMineR)
data(decathlon)
decathlon_cont <- decathlon[,1:10]
decathlon_cont <- data.frame(scale(decathlon_cont))
summary(decathlon_cont)
#> X100m Long.jump Shot.put High.jump
#> Min. :-2.12167 Min. :-2.0544 Min. :-2.1798 Min. :-1.4258
#> 1st Qu.:-0.56287 1st Qu.:-0.7269 1st Qu.:-0.7242 1st Qu.:-0.6389
#> Median :-0.06862 Median : 0.1264 Median : 0.1127 Median :-0.3016
#> Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
#> 3rd Qu.: 0.53969 3rd Qu.: 0.6953 3rd Qu.: 0.5979 3rd Qu.: 0.7102
#> Max. : 2.44067 Max. : 2.2124 Max. : 2.2839 Max. : 1.9468
#> X400m X110m.hurdle Discus Pole.vault
#> Min. :-2.4330 Min. :-1.3478 Min. :-1.89636 Min. :-2.0232
#> 1st Qu.:-0.5950 1st Qu.:-0.8390 1st Qu.:-0.71809 1st Qu.:-0.9440
#> Median :-0.1876 Median :-0.2668 Median : 0.02498 Median : 0.1351
#> Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
#> 3rd Qu.: 0.5927 3rd Qu.: 0.7930 3rd Qu.: 0.51642 3rd Qu.: 0.5668
#> Max. : 3.1069 Max. : 2.2556 Max. : 2.16836 Max. : 2.2934
#> Javeline X1500m
#> Min. :-1.658770 Min. :-1.44989
#> 1st Qu.:-0.631179 1st Qu.:-0.68575
#> Median : 0.008994 Median :-0.08351
#> Mean : 0.000000 Mean : 0.00000
#> 3rd Qu.: 0.533149 3rd Qu.: 0.52043
#> Max. : 2.528251 Max. : 3.25318
dist_mat <- dist(decathlon_cont)
The decathlon variables are measured in very different units. For example, the 1500m is measured in seconds, with values ranging from roughly 260 to 320, while the Shot Put is measured in meters, with values between roughly 12 and 17.
Distance-based methods (like clustering) are highly sensitive to scale. If we do not standardize, variables with larger numerical ranges dominate the distance calculation. As a consequence, the clustering would be driven almost entirely by events like the 1500m, ignoring others like Shot Put or High Jump.
If we compute distances without scaling, a small difference in 1500m time (e.g., 10 seconds) will outweigh a large difference in Shot Put (e.g., 3 meters), so athletes end up grouped mostly by endurance. Other performance dimensions would barely register, resulting in misleading clusters.
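The effect can be checked on a small example. The sketch below uses four invented athletes (the values are made up for illustration and are not rows of decathlon), where athletes A and B differ by 10 seconds in the 1500m and by 3 meters in the Shot Put. It compares how much of their squared Euclidean distance comes from the 1500m before and after scale(). The helper share_of_distance() is a hypothetical name introduced here.

```r
# Hypothetical performances, invented for illustration only
toy <- data.frame(
  X1500m   = c(265, 275, 290, 310),      # seconds
  Shot.put = c(16.0, 13.0, 15.0, 14.0),  # meters
  row.names = c("A", "B", "C", "D")
)

# Share of the squared Euclidean distance between rows i and j
# that is contributed by one column
share_of_distance <- function(x, i, j, column) {
  x <- as.matrix(x)
  diff <- x[i, ] - x[j, ]
  unname(diff[column]^2 / sum(diff^2))
}

share_raw    <- share_of_distance(toy, "A", "B", "X1500m")
share_scaled <- share_of_distance(scale(toy), "A", "B", "X1500m")

round(c(raw = share_raw, scaled = share_scaled), 3)
```

In the raw units the 1500m accounts for 100/109, about 92%, of the squared distance between A and B. After standardizing, its share drops below 10% for these made-up values, so the Shot Put difference is no longer drowned out.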
hc_single <- hclust(dist_mat, method = "single")
plot(hc_single, main = "Single Linkage Dendrogram")
hc_complete <- hclust(dist_mat, method = "complete")
plot(hc_complete, main = "Complete Linkage Dendrogram")
If we compare Single Linkage (nearest neighbor) to Complete Linkage (furthest neighbor), the two dendrograms look quite different. Single Linkage tends to create a “chain” structure, because two clusters are merged as soon as any single pair of their points is close. Athletes are therefore often added one at a time to one growing cluster instead of forming separate groups.
On the other hand, Complete Linkage merges clusters based on the maximum distance between their members, which favors compact clusters. This results in a dendrogram with distinct “branches”.
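Chaining can be made visible on a toy example. The sketch below uses six made-up points on a line with slowly growing gaps (an assumption chosen to provoke chaining, not decathlon data) and compares the cluster sizes obtained after cutting each tree into three clusters.

```r
# Six invented points on a line; each gap is slightly larger than the last
x <- c(0, 1, 2.1, 3.3, 4.6, 6)
d <- dist(x)

hc_s <- hclust(d, method = "single")
hc_c <- hclust(d, method = "complete")

# Cluster sizes after cutting each tree into 3 clusters
sizes_single   <- sort(as.vector(table(cutree(hc_s, k = 3))))
sizes_complete <- sort(as.vector(table(cutree(hc_c, k = 3))))

sizes_single    # one large cluster plus two singletons
sizes_complete  # three clusters of two points each
```

Single linkage keeps attaching the next point to the same growing cluster, which is the chain seen in the dendrogram, while complete linkage splits the same points into three pairs.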
# "ward" is the former name of "ward.D"; spelling it out avoids the renaming message
hc_ward <- hclust(dist_mat, method = "ward.D")
plot(hc_ward, main = "Ward's Method Dendrogram", col=decathlon$Rank)
Ward’s method merges, at each step, the pair of clusters whose union gives the smallest increase in within-cluster variance. This tends to produce compact clusters of comparable size, which is why it is a common choice for real datasets.
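hclust() offers two Ward variants, "ward.D" and "ward.D2". According to the hclust() documentation, "ward.D2" squares the dissimilarities before updating clusters and is the variant that implements Ward’s criterion on ordinary Euclidean distances, while "ward.D" matches it only when the distances are squared first. The sketch below checks this relationship on the built-in USArrests data, used as a stand-in so the example runs without extra packages.

```r
# Stand-in data: built-in USArrests, standardized like the decathlon events
d <- dist(scale(USArrests))

hc_d2 <- hclust(d,   method = "ward.D2")  # Ward's criterion on Euclidean distances
hc_d  <- hclust(d^2, method = "ward.D")   # same criterion via squared distances

# Same merge heights once the squared-distance heights are square-rooted
heights_match <- isTRUE(all.equal(hc_d2$height, sqrt(hc_d$height)))
# Same 3-cluster partition
same_partition <- identical(cutree(hc_d2, k = 3), cutree(hc_d, k = 3))

c(heights_match = heights_match, same_partition = same_partition)
```

Applying "ward.D" to unsquared distances, as in this section, is therefore a Ward-like variant. Swapping in "ward.D2" may change the tree and the cluster labels reported here.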
When cutting the dendrogram into 3 clusters, we get:
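The dataset combines two competitions, so the Rank variable identifies the top three of each: upper-case names are performances at the Decastar meeting and mixed-case names are performances at the Olympic Games. The Olympic medalists are therefore Sebrle, Clay and Karpov: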
rownames(decathlon)[decathlon$Rank==1]
#> [1] "SEBRLE" "Sebrle"
rownames(decathlon)[decathlon$Rank==2]
#> [1] "CLAY" "Clay"
rownames(decathlon)[decathlon$Rank==3]
#> [1] "KARPOV" "Karpov"
library(dplyr)
# Cut the Ward dendrogram into 3 clusters
clusters <- cutree(hc_ward, k = 3)
# Add cluster labels to dataset
decathlon_cont$cluster <- clusters
# Add the names of the athletes as a variable
decathlon_cont <- decathlon_cont %>%
  mutate(Athlete = rownames(decathlon_cont))
# View athletes per cluster
decathlon_cont %>%
dplyr::select(Athlete, cluster) %>%
arrange(cluster)
#> Athlete cluster
#> SEBRLE SEBRLE 1
#> CLAY CLAY 1
#> KARPOV KARPOV 1
#> BERNARD BERNARD 1
#> WARNERS WARNERS 1
#> Warners Warners 1
#> Nool Nool 1
#> Schwarzl Schwarzl 1
#> Pogorelov Pogorelov 1
#> Schoenbeck Schoenbeck 1
#> Averyanov Averyanov 1
#> Ojaniemi Ojaniemi 1
#> Drews Drews 1
#> Terek Terek 1
#> Turi Turi 1
#> Korkizoglou Korkizoglou 1
#> YURKOV YURKOV 2
#> ZSIVOCZKY ZSIVOCZKY 2
#> McMULLEN McMULLEN 2
#> MARTINEAU MARTINEAU 2
#> HERNU HERNU 2
#> BARRAS BARRAS 2
#> NOOL NOOL 2
#> BOURGUIGNON BOURGUIGNON 2
#> Macey Macey 2
#> Zsivoczky Zsivoczky 2
#> Hernu Hernu 2
#> Bernard Bernard 2
#> Barras Barras 2
#> Smith Smith 2
#> Smirnov Smirnov 2
#> Qi Qi 2
#> Parkhomenko Parkhomenko 2
#> Gomez Gomez 2
#> Lorenzo Lorenzo 2
#> Karlivans Karlivans 2
#> Uldal Uldal 2
#> Casarsa Casarsa 2
#> Sebrle Sebrle 3
#> Clay Clay 3
#> Karpov Karpov 3
The three Olympic medalists (Sebrle, Clay and Karpov) form cluster 3 on their own, and their Decastar performances all fall in cluster 1. That the medalists are grouped together is a meaningful result. Hierarchical clustering is an unsupervised method, meaning it has no prior knowledge of which athletes are top performers, yet it still groups individuals based on similarity in their performance across all events. The fact that the highest-performing athletes end up in the same cluster suggests that the clustering is capturing a strong underlying structure in the data, namely overall performance level. This indicates that elite decathletes are not simply specialists in one type of event, but tend to perform consistently well across multiple disciplines. In other words, what drives the clustering here is not just whether an athlete is more oriented toward speed or strength, but rather how strong they are overall.
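One way to check whether clusters reflect overall level rather than specialization is to compare the average standardized score per cluster for every variable. The helper below, profile_clusters(), is a hypothetical name introduced for this sketch, and the built-in USArrests data stands in for the decathlon data so that the code runs with base R alone.

```r
# Hypothetical helper: mean of each standardized variable within each cluster
profile_clusters <- function(scaled, clusters) {
  aggregate(as.data.frame(scaled), by = list(cluster = clusters), FUN = mean)
}

# Stand-in data (built-in), clustered the same way as the athletes
scaled_demo   <- scale(USArrests)
clusters_demo <- cutree(hclust(dist(scaled_demo), method = "ward.D"), k = 3)

profile_demo <- profile_clusters(scaled_demo, clusters_demo)
profile_demo
```

With the objects from this section the call would be profile_clusters(decathlon_cont[, 1:10], clusters). When reading the result, keep in mind that lower values are better in the running events and higher values are better in the jumps and throws. A cluster that is better than average in nearly every event points to overall level, while a cluster that is strong in some events and weak in others points to specialization.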
data(decathlon)
decathlon_cont <- decathlon[,1:10]
decathlon_cont <- data.frame(scale(decathlon_cont))
decathlon_t <- t(decathlon_cont)
dist_mat_t <- dist(decathlon_t)
# "ward.D" is the current name of the former "ward" method
hc_t <- hclust(dist_mat_t, method = "ward.D")
plot(hc_t, main = "Clustering of Decathlon Events")
rect.hclust(hc_t, k=4, border=2:5)
When clustering the variables instead of the athletes, we are effectively grouping events based on how similarly athletes perform across them. The dendrogram reveals meaningful clusters that correspond to underlying physical abilities. For example, sprinting events such as the 100m and 400m cluster together, as they both rely heavily on speed and explosive power. Throwing events such as shot put and discus also cluster together due to their reliance on strength and technique. However, the jumping events, such as the high jump and the long jump, are not clustered together.
If two events, such as the 100m and long jump, are clustered closely together, this indicates that athletes who perform well in one of these events also tend to perform well in the other. This suggests a shared underlying ability, in this case explosive power and speed. Note that in a timed event a better performance is a lower value, so the direction of each variable has to be taken into account when interpreting closeness. More generally, clustering variables provides insight into the latent structure of the dataset, showing how different performance measures relate to each other and helping to identify broader dimensions of athletic ability.
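Because the events were standardized with scale(), the Euclidean distance d between two events is a direct function of their correlation r across the n athletes: d^2 = 2(n - 1)(1 - r). Strongly positively correlated events are close, while negatively correlated ones are far apart. This matters for pairs such as the 100m and the long jump, since a good sprint is a low time and a good jump is a long distance. The sketch below verifies the identity on the built-in USArrests data, a stand-in chosen so the code needs no extra packages.

```r
z <- scale(USArrests)   # stand-in for the standardized event columns
n <- nrow(z)

# Euclidean distances between variables (columns), as in dist(t(...))
d_vars <- as.matrix(dist(t(z)))

# The same distances derived from the correlation matrix
r <- cor(USArrests)
d_from_r <- sqrt(2 * (n - 1) * pmax(1 - r, 0))  # pmax guards against tiny negative values

identity_holds <- isTRUE(all.equal(d_vars, d_from_r, check.attributes = FALSE))
identity_holds
```

If events should be grouped by strength of association regardless of direction, a dissimilarity such as as.dist(1 - abs(cor(x))) is a common alternative. This is a suggestion, not what the analysis above does.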
Questions:
Compute the Euclidean distance matrix for your scaled data.
Visualize the distance matrix as a heatmap. Do you already see “blocks” of similar penguins?
Perform hierarchical clustering using two different linkage methods: Complete Linkage and Ward’s Method.
Plot both dendrograms side-by-side. Which one produces more “balanced” clusters?
Cut the dendrogram into k=3 clusters.
Compare these 3 clusters to the actual species. Which species did the algorithm separate perfectly, and where did it get confused?
If the dendrogram shows two main branches, but you know there are three species, do you trust the math or your “theoretical knowledge” of biology?
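# The column names used below are those of the palmerpenguins version of the data
data(penguins, package = "palmerpenguins")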
penguin <- na.omit(penguins)
penguin_num <- penguin %>%
dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)
penguin_scaled <- scale(penguin_num)
dist_mat_p <- dist(penguin_scaled)
dist_matrix <- as.matrix(dist_mat_p)
heatmap(dist_matrix, symm = TRUE)
After computing the Euclidean distance matrix on the scaled data, the heatmap already reveals structure in the dataset. Blocks of darker and lighter regions indicate groups of penguins that are more similar to each other. The clearest block corresponds to the Gentoo penguins, which are more distinct in terms of body size and flipper length. This shows that even before clustering, the data contains visible group structure.
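The visual impression of blocks can be backed by a number: if groups exist, distances within a group should on average be smaller than distances between groups. The sketch below uses the built-in iris data as a stand-in (an assumption made so the example runs without the penguins data).

```r
# Stand-in data: four measurements and a known grouping
x      <- scale(iris[, 1:4])
groups <- as.character(iris$Species)

d <- as.matrix(dist(x))

same_group <- outer(groups, groups, "==")  # TRUE where both rows share a group
off_diag   <- row(d) != col(d)             # drop the zero self-distances

mean_within  <- mean(d[same_group & off_diag])
mean_between <- mean(d[!same_group])

c(within = mean_within, between = mean_between)
```

For the penguins, the same calculation would use dist_matrix and penguin$species in place of d and groups.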
# Put the two dendrograms side by side
par(mfrow = c(1, 2))
# Complete linkage
hc_complete_p <- hclust(dist_mat_p, method = "complete")
plot(hc_complete_p, main = "Complete Linkage")
# Ward's method ("ward.D" is the current name of the former "ward" option)
hc_ward_p <- hclust(dist_mat_p, method = "ward.D")
plot(hc_ward_p, main = "Ward's Method")
par(mfrow = c(1, 1))
When comparing the dendrograms, Complete Linkage and Ward’s Method produce noticeably different structures. Complete Linkage merges on the largest pairwise distance between clusters, which makes it sensitive to outliers and can leave clusters of quite unequal size. Ward’s Method produces more compact and balanced clusters by minimizing the increase in within-cluster variance at each merge. As a result, Ward’s Method yields the clearer and more interpretable clustering structure in this dataset.
clusters <- cutree(hc_ward_p, k = 3)
# Compare with species
table(clusters, penguin$species)
#>
#> clusters Adelie Chinstrap Gentoo
#> 1 144 4 0
#> 2 2 64 0
#> 3 0 0 119
When cutting the dendrogram into three clusters, the results can be compared to the true species labels. The clustering separates the Gentoo penguins perfectly: all 119 are in cluster 3 and no other species joins them, because they are clearly distinct in the size-related variables. However, there is some confusion between Adelie and Chinstrap penguins: 4 Chinstrap fall in the Adelie-dominated cluster 1 and 2 Adelie fall in the Chinstrap-dominated cluster 2, as these species have overlapping physical characteristics. This highlights an important point: clustering reflects similarity in the measured variables, not necessarily the true biological categories.
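The agreement in the table can be summarized in a single figure. The sketch below re-enters the counts printed above and computes the purity, meaning the share of penguins that belong to the majority species of their cluster. Purity is one simple summary among several and is brought in here only for illustration.

```r
# Counts copied from the table of clusters by species
tab <- matrix(
  c(144,  4,   0,
      2, 64,   0,
      0,  0, 119),
  nrow = 3, byrow = TRUE,
  dimnames = list(cluster = c("1", "2", "3"),
                  species = c("Adelie", "Chinstrap", "Gentoo"))
)

majority  <- apply(tab, 1, max)         # largest species count per cluster
purity    <- sum(majority) / sum(tab)   # share in their cluster's majority species
misplaced <- sum(tab) - sum(majority)   # penguins outside the majority species

c(purity = round(purity, 3), misplaced = misplaced)
```

With the objects from this section, tab could be replaced by table(clusters, penguin$species). Only 6 of the 333 penguins sit outside the majority species of their cluster, all of them Adelie or Chinstrap.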
If the dendrogram suggests two main branches while we know there are three species, the clustering algorithm is not “wrong”; it is simply reflecting the structure present in the data based on the variables used. However, as analysts, we should not blindly trust the output. Instead, we must combine statistical results with theoretical knowledge, in this case, biological understanding of species differences.