Questions:
Use the scaled decathlon data. Run K-means with K=3. Look at the “cluster centers” (centroids).
Which cluster represents the “all-rounders” (average scores across all events) versus the “specialists”?
Use fviz_nbclust to calculate the Total Within-Cluster Sum of Squares (WSS) for K=1 to K=10.
Identify the “Elbow” in the plot. Does the math suggest 2, 3, or 4 clusters?
K-means starts with random points. Run the model twice with the same K but different nstart values.
Do you get the exact same clusters?
Why is nstart > 1 a “thoughtful decision” for a researcher?
Plot your K-means clusters on top of your PCA individual map.
Do the clusters follow the dimensions of the PCA? (e.g., does Cluster 1 align with the “Power” axis?)
If the Elbow plot suggests 2 clusters, but you know from sports theory that there are at least 3 distinct types of athletes (Sprinters, Throwers, and Jumpers), which K do you choose for your final report?
Transpose the scaled decathlon dataset so that the events are treated as observations.
Run K-means clustering with K=3. Which events are grouped together?
Do these clusters correspond to meaningful categories such as running, jumping, and throwing events?
Use the Elbow method (WSS) to determine the optimal number of clusters for the variables.
Does the mathematical result support 2, 3, or more clusters? Does this align with your theoretical understanding of event types?
library(FactoMineR)
data(decathlon)
decathlon_cont <- decathlon[,1:10]
decathlon_cont <- data.frame(scale(decathlon_cont))
When running K-means with K=3, the cluster centers (centroids) represent the average performance profile of each group of athletes. Because the data are scaled, a centroid value near 0 means average performance on that event; note that for the timed events (100m, 400m, 110m hurdles, 1500m) a negative value means a faster time. A cluster whose centroid is close to 0 across all events represents the “all-rounders”, who perform consistently but not exceptionally in any single discipline. Clusters with more extreme values on specific subsets of variables correspond to “specialists”. In the output below, cluster 3 lies closest to the mean overall, with somewhat weaker throws and a faster 1500m. Cluster 2 combines faster sprint and hurdle times with above-average long jump, high jump, shot put and discus, while cluster 1 shows slower running times and a weaker long jump.
set.seed(123)
k3 <- kmeans(decathlon_cont, centers = 3, nstart = 25)
# Cluster centers
k3$centers
#>         X100m  Long.jump    Shot.put  High.jump      X400m X110m.hurdle
#> 1  0.94523250 -0.9033876 -0.09955168 -0.2828831  1.0854315   0.85839148
#> 2 -0.76759422  0.6734381  0.70893446  0.8658396 -0.5110044  -0.75100348
#> 3 -0.08525407  0.1303723 -0.50134550 -0.4913323 -0.3988826  -0.03360328
#>        Discus  Pole.vault    Javeline     X1500m
#> 1 -0.05026176  0.01520736 -0.28295068  0.5768280
#> 2  0.86848900 -0.14712442  0.24246557  0.2700889
#> 3 -0.66795099  0.10813307  0.01520973 -0.6520682
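The "closest to the mean" reading of the centers printed above can be made concrete by measuring how far each centroid lies from the origin of the scaled data. The following is a minimal sketch that refits the same model so it runs on its own; `centroid_norm` is a name introduced here for illustration, not part of the original analysis.

```r
library(FactoMineR)  # only needed for the decathlon data

data(decathlon)
x <- data.frame(scale(decathlon[, 1:10]))  # ten events, each with mean 0 and sd 1

set.seed(123)
k3 <- kmeans(x, centers = 3, nstart = 25)

# Euclidean distance of each centroid from the overall mean (the origin)
centroid_norm <- sqrt(rowSums(k3$centers^2))
round(centroid_norm, 2)

# The cluster with the smallest distance has the most "average" profile
names(which.min(centroid_norm))

# The three events on which each centroid deviates most from the mean
apply(k3$centers, 1, function(ctr) names(sort(abs(ctr), decreasing = TRUE))[1:3])
```

A small norm is only a rough indicator of an all-rounder cluster: the signs of the individual centroid values still need to be read event by event.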
# Add clusters to dataset
decathlon_cont$cluster <- k3$cluster
The Elbow plot shows how the Total Within-Cluster Sum of Squares (WSS) decreases as the number of clusters increases. Initially, adding more clusters leads to a large reduction in WSS, but after a certain point, the improvement becomes marginal. This “elbow” appears at 3 clusters. From a purely mathematical perspective, this suggests that a relatively small number of clusters captures most of the structure in the data. However, the exact location of the elbow is not always clear-cut and requires interpretation.
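The WSS curve described above can also be computed by hand, which makes it easier to see what the plot is built from. This is a sketch in base R under the assumption that the same scaled data and `nstart = 25` are used; the object names `x` and `wss` are introduced here for illustration.

```r
library(FactoMineR)  # only needed for the decathlon data

data(decathlon)
x <- scale(decathlon[, 1:10])

# Total within-cluster sum of squares for K = 1, ..., 10
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# For K = 1 the WSS equals the total sum of squares: (n - 1) * p = 40 * 10 = 400
round(wss, 1)

# Drop in WSS gained by each additional cluster; the elbow is where this flattens
round(-diff(wss), 1)

plot(1:10, wss, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")

# With factoextra installed, the call named in the exercise would be:
# factoextra::fviz_nbclust(x, kmeans, method = "wss", k.max = 10)
```

Looking at the successive drops in `-diff(wss)` is one way to judge the elbow numerically rather than by eye, although it still leaves room for interpretation.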
K-means clustering depends on initial starting points, which are chosen randomly. Running the algorithm with a single start (nstart = 1) can lead to suboptimal solutions because the algorithm may converge to a local minimum. When using a larger number of starts (e.g., nstart = 25), the algorithm runs multiple times with different initializations and selects the best solution. This increases stability and reliability. If the cluster assignments differ between runs, this highlights the importance of thoughtful parameter choices and reproducibility in statistical analysis. In our case, the number of starts does not have a big influence on the results, as we produce the same clusters, independent of the number of starts.
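To see how much the starting points matter for these data, the objective value (total WSS) can be compared across several seeds. The sketch below is an illustration under assumed settings (20 seeds, K=3); `wss_single` and `wss_multi` are names introduced here.

```r
library(FactoMineR)  # only needed for the decathlon data

data(decathlon)
x <- scale(decathlon[, 1:10])

# Fit K = 3 under several seeds, once with a single start and once with 25
seeds <- 1:20
wss_single <- sapply(seeds, function(s) {
  set.seed(s)
  kmeans(x, centers = 3, nstart = 1)$tot.withinss
})
wss_multi <- sapply(seeds, function(s) {
  set.seed(s)
  kmeans(x, centers = 3, nstart = 25)$tot.withinss
})

# Spread of the objective: a wider range means more sensitivity to the start
range(wss_single)
range(wss_multi)

# Number of distinct solutions (by objective value) found under each setting
length(unique(round(wss_single, 6)))
length(unique(round(wss_multi, 6)))
```

If the single-start runs show several distinct objective values while the multi-start runs agree, that is direct evidence of local minima and a practical argument for `nstart > 1`.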
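# Use only the ten event columns, so the cluster labels added above are not clustered as well
set.seed(123)
k_low <- kmeans(decathlon_cont[, 1:10], centers = 3, nstart = 1)
set.seed(123)
k_high <- kmeans(decathlon_cont[, 1:10], centers = 3, nstart = 25)
# Compare clusters (labels are arbitrary, so look for one non-zero cell per row)
table(k_low$cluster, k_high$cluster)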
#>    
#>      1  2  3
#>   1 12  0  0
#>   2  0 16  0
#>   3  0  0 13
When plotting the K-means clusters on the PCA map, the clusters roughly align with the main principal components. The first PC mainly separates clusters 1 and 2, while the second PC mainly separates clusters 1 and 2 from cluster 3.
This demonstrates that both PCA and K-means are capturing similar underlying structure in the data, albeit in different ways: PCA provides continuous dimensions, while K-means creates discrete groupings.
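The alignment between clusters and principal components can be quantified as well as plotted. The sketch below uses base R's `prcomp()` as a stand-in for `FactoMineR::PCA()` (the coordinates differ by a constant scaling factor and possibly by sign, which does not affect the comparison); `cluster_means` and `between_share` are names introduced here.

```r
library(FactoMineR)  # only needed for the decathlon data

data(decathlon)
x <- scale(decathlon[, 1:10])

set.seed(123)
k3 <- kmeans(x, centers = 3, nstart = 25)

# PCA on the already standardized events
pc <- prcomp(x, center = FALSE, scale. = FALSE)
scores <- pc$x[, 1:2]

# Mean position of each cluster on PC1 and PC2
cluster_means <- aggregate(scores, by = list(cluster = k3$cluster), FUN = mean)
cluster_means

# Share of the variance on each PC that lies between clusters (0 = none, 1 = all)
between_share <- sapply(1:2, function(j) {
  summary(lm(scores[, j] ~ factor(k3$cluster)))$r.squared
})
round(setNames(between_share, c("PC1", "PC2")), 2)
```

A high between-cluster share on a component indicates that the clusters are largely separated along that axis.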
library(factoextra)  # provides fviz_pca_ind()
pca <- PCA(decathlon_cont[, 1:10], graph = FALSE)  # event columns only, without the cluster labels
fviz_pca_ind(pca, habillage = as.factor(k3$cluster), addEllipses = TRUE, title = "K-means clusters on PCA map")
If the Elbow method suggests two clusters, but domain knowledge indicates that there are at least three meaningful types of athletes (e.g., sprinters, throwers, and jumpers), this creates an important decision point. In such cases, it is often more appropriate to choose the number of clusters based on theoretical understanding rather than relying solely on the mathematical criterion. The goal of clustering is not just to optimize a numerical metric, but to produce meaningful and interpretable groupings.
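When theory and the elbow disagree, it can help to put the candidate solutions side by side before deciding. The following sketch compares K=2, 3 and 4 on explained variance and cluster sizes; it is one possible check under assumed settings, not a rule for choosing K, and `fits` is a name introduced here.

```r
library(FactoMineR)  # only needed for the decathlon data

data(decathlon)
x <- scale(decathlon[, 1:10])

set.seed(123)
fits <- lapply(2:4, function(k) kmeans(x, centers = k, nstart = 25))
names(fits) <- paste0("K=", 2:4)

# Share of total variance explained by the partition (between SS / total SS)
sapply(fits, function(f) round(f$betweenss / f$totss, 3))

# Cluster sizes: very small clusters are hard to defend in a report
lapply(fits, function(f) f$size)

# How the K = 2 groups split up when a third cluster is allowed
table(K2 = fits[["K=2"]]$cluster, K3 = fits[["K=3"]]$cluster)
```

If the third cluster is of reasonable size and corresponds to a recognizable athlete type, that supports reporting K=3 even when the WSS gain is modest.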
data(decathlon)
decathlon_cont <- decathlon[,1:10]
decathlon_cont <- data.frame(scale(decathlon_cont))
decathlon_t <- t(decathlon_cont)
After transposing the dataset, the events are treated as observations and grouped based on how similarly athletes perform across them.
set.seed(123)
k3_t <- kmeans(decathlon_t, centers = 3, nstart = 25)
# Cluster assignments (events)
k3_t$cluster
#>        X100m    Long.jump     Shot.put    High.jump        X400m X110m.hurdle
#>            2            1            1            1            2            2
#>       Discus   Pole.vault     Javeline       X1500m
#>            1            3            1            3
# Cluster centers
k3_t$centers
#> SEBRLE CLAY KARPOV BERNARD YURKOV WARNERS ZSIVOCZKY
#> 1 0.6652300 0.2277266 0.1804349 -0.2219033 0.6692579 -0.2807298 -0.1845074
#> 2 0.1685824 -0.7656004 -0.6968248 0.1008859 1.1631088 -0.3942658 -0.4286506
#> 3 1.0061527 1.2460597 1.1903769 1.0488588 -0.1887606 0.2437679 -1.0881263
#> McMULLEN MARTINEAU HERNU BARRAS NOOL BOURGUIGNON
#> 1 0.1263838 -0.3764098 -0.09666489 -0.59016244 -0.8380996 -1.177974
#> 2 -0.2876797 1.1939055 1.22100746 0.29236671 0.7837398 1.656657
#> 3 -0.3556823 -0.4415598 0.36374260 0.05110412 -0.7883800 1.006153
#> Sebrle Clay Karpov Macey Warners Zsivoczky Hernu
#> 1 1.9111568 1.6267123 1.2732104 1.0696893 0.1322375 0.73746358 0.1175438
#> 2 -0.9434197 -1.1666370 -1.8914341 -0.3561139 -1.3758689 0.06904417 -0.5431101
#> 3 0.4694639 0.3748453 -0.3313437 -1.2346067 0.2056550 -0.51856609 -0.5610130
#> Nool Bernard Schwarzl Pogorelov Schoenbeck Barras
#> 1 -0.1095822 0.4382065 -0.2456987 0.184326050 -0.009742227 0.1396943
#> 2 -0.3468431 -0.8388859 -0.2327794 -0.001402567 -0.114523490 -0.0463709
#> 3 1.0312635 -0.7681555 0.3730474 0.795851254 0.418492673 -0.8033631
#> Smith Averyanov Ojaniemi Smirnov Qi Drews
#> 1 0.1610558 -0.4621816 -0.007824931 -0.3514084 -0.03957384 -1.0140140
#> 2 -0.7087019 -0.6903707 -0.260963365 -0.1672841 0.21127840 -0.9029864
#> 3 -1.2807817 -0.2753171 -0.434142818 -0.7854156 -0.74592463 0.2210326
#> Parkhomenko Terek Gomez Turi Lorenzo Karlivans Korkizoglou
#> 1 0.17021432 -0.3641051 -0.363321 -0.5333829 -0.9768011 -0.5828699 -0.2370809
#> 2 0.78500994 0.2480656 -0.325339 0.4529844 0.5963031 0.9519591 0.5213624
#> 3 0.02108719 1.4523540 -1.051282 0.5380813 -1.1549796 -0.4872134 1.5142873
#> Uldal Casarsa
#> 1 -0.6937340 -0.07277187
#> 2 1.0214309 2.04836396
#> 3 -0.3574295 0.08036591
Running K-means with K=3 results in clusters that partly correspond to meaningful categories of athletic performance. The 100m, 400m and 110m hurdles form one cluster (cluster 2), as they all rely on speed and explosive power. The throwing events (shot put, discus and javelin) also end up together (cluster 1), reflecting strength-based performance, but they share that cluster with the long jump and high jump. Pole vault and the 1500m form the third cluster. The clustering therefore captures underlying physical abilities rather than arbitrary groupings, although it does not reproduce a clean running/jumping/throwing split.
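To check the correspondence with event types more systematically, the cluster assignments can be cross-tabulated against theory-based labels. This is a sketch in which the `theory` vector is an assumption made here for illustration (the usual running/jumping/throwing classification); the data preparation mirrors the steps above.

```r
library(FactoMineR)  # only needed for the decathlon data

data(decathlon)
# data.frame() turns names such as "100m" into "X100m", as in the output above
events <- t(data.frame(scale(decathlon[, 1:10])))  # rows = events, columns = athletes

set.seed(123)
k3_t <- kmeans(events, centers = 3, nstart = 25)

# Events listed per cluster
split(rownames(events), k3_t$cluster)

# Theory-based labels (an assumption made here for illustration)
theory <- c(X100m = "running", Long.jump = "jumping", Shot.put = "throwing",
            High.jump = "jumping", X400m = "running", X110m.hurdle = "running",
            Discus = "throwing", Pole.vault = "jumping", Javeline = "throwing",
            X1500m = "running")

# Cross-tabulate theory against the K-means solution
tab <- table(theory = theory[rownames(events)], cluster = k3_t$cluster)
tab
```

A perfect match would show exactly one non-zero cell per row and column; off-diagonal counts point to events that behave differently from what the classification suggests.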
When clustering variables instead of athletes, the cluster centers (centroids) contain one value per athlete: that athlete's average standardized score over the events in the cluster. Each centroid thus describes a “typical profile” across athletes for a group of events and can be interpreted as a latent performance dimension, such as strength (the cluster containing the throwing events) or speed (the cluster containing the sprinting events). For example, Casarsa's value of about 2.05 in cluster 2 means that this athlete's sprint and hurdle times lie roughly two standard deviations above the field average, i.e. they are slower. Thus, rather than describing individuals, the centroids summarize relationships between variables and help identify underlying constructs in the data.
Using the Elbow method, we examine how the total within-cluster sum of squares (WSS) decreases as the number of clusters increases. In this case the plot does not show a clear ‘elbow’, which makes it hard to choose the number of clusters from the WSS curve alone. With only ten events to cluster, the curve is also based on very few observations. The decision is therefore better based on ‘theoretical knowledge’ of the event types than on the WSS criterion.
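The WSS curve for the events can be computed in the same way as for the athletes, with one difference: there are only ten observations, so the range of K has to stay below ten. The sketch below uses K = 1 to 8 as an assumed range; `events` and `wss_t` are names introduced here.

```r
library(FactoMineR)  # only needed for the decathlon data

data(decathlon)
events <- t(data.frame(scale(decathlon[, 1:10])))  # 10 events x 41 athletes

# Only 10 observations, so K is kept well below 10
k_values <- 1:8
set.seed(123)
wss_t <- sapply(k_values, function(k) {
  kmeans(events, centers = k, nstart = 25)$tot.withinss
})

round(wss_t, 1)
round(-diff(wss_t), 1)  # gain from each additional cluster

plot(k_values, wss_t, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```

If the successive gains decline smoothly without a sharp break, the curve gives no strong argument for any particular K, which is consistent with falling back on the theoretical event categories.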