Multivariate Statistics - StatUa

K-Means Clustering - Exercise solutions

Decathlon Data

Questions:

Use the scaled decathlon data. Run K-means with K=3. Look at the “cluster centers” (centroids).

Which cluster represents the “all-rounders” (average scores across all events) versus the “specialists”?

Use fviz_nbclust to calculate the Total Within-Cluster Sum of Squares (WSS) for K=1 to K=10.

Identify the “Elbow” in the plot. Does the math suggest 2, 3, or 4 clusters?

K-means starts with random points. Run the model twice with the same K but different nstart values.

Do you get the exact same clusters?

Why is nstart > 1 a “thoughtful decision” for a researcher?

Plot your K-means clusters on top of your PCA individual map.

Do the clusters follow the dimensions of the PCA? (e.g., does Cluster 1 align with the “Power” axis?)

If the Elbow plot suggests 2 clusters, but you know from sports theory that there are at least 3 distinct types of athletes (Sprinters, Throwers, and Jumpers), which K do you choose for your final report?

Transpose the scaled decathlon dataset so that the events are treated as observations.

Run K-means clustering with K=3. Which events are grouped together?

Do these clusters correspond to meaningful categories such as running, jumping, and throwing events?

Use the Elbow method (WSS) to determine the optimal number of clusters for the variables.

Does the mathematical result support 2, 3, or more clusters? Does this align with your theoretical understanding of event types?


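library(FactoMineR)   # PCA() and the decathlon data
library(factoextra)   # fviz_nbclust(), fviz_pca_ind(), fviz_cluster()
library(ggplot2)      # ggtitle()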

data(decathlon)

decathlon_cont <- decathlon[,1:10]

decathlon_cont <- data.frame(scale(decathlon_cont))

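When running K-means with K=3, the cluster centers (centroids) represent the average performance profile of each group of athletes. Because the data are scaled, a centroid value of 0 corresponds to the overall mean on that event. For the running events (100m, 400m, 110m hurdles, 1500m) the scores are times, so a negative value means a faster than average performance. In the output below, cluster 3 has the centroid closest to zero overall (no value exceeds about 0.67 in absolute size), which makes it the best candidate for the “all-rounders”: athletes who perform consistently but not exceptionally in any single discipline. The other two clusters show more pronounced profiles, closer to what one would call “specialists.” Cluster 2 is faster than average in the 100m, 400m and 110m hurdles and above average in the long jump, high jump, shot put and discus, combining speed, explosive power and strength. Cluster 1 shows the opposite pattern in the sprints, hurdles and long jump, with slower times and shorter jumps than average. The cluster numbers themselves are arbitrary labels and can change with a different seed.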

set.seed(123) 
k3 <- kmeans(decathlon_cont, centers = 3, nstart = 25) 

# Cluster centers
k3$centers 
#>         X100m  Long.jump    Shot.put  High.jump      X400m X110m.hurdle
#> 1  0.94523250 -0.9033876 -0.09955168 -0.2828831  1.0854315   0.85839148
#> 2 -0.76759422  0.6734381  0.70893446  0.8658396 -0.5110044  -0.75100348
#> 3 -0.08525407  0.1303723 -0.50134550 -0.4913323 -0.3988826  -0.03360328
#>        Discus  Pole.vault    Javeline     X1500m
#> 1 -0.05026176  0.01520736 -0.28295068  0.5768280
#> 2  0.86848900 -0.14712442  0.24246557  0.2700889
#> 3 -0.66795099  0.10813307  0.01520973 -0.6520682

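# Add clusters to a copy of the dataset, so that decathlon_cont keeps only
# the ten event columns for the analyses that follow
decathlon_clustered <- decathlon_cont
decathlon_clustered$cluster <- k3$cluster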
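One way to make "closest to the mean" concrete is to measure how far each centroid lies from the origin, since the scaled data have mean zero on every event. The sketch below does this for the centroid values printed above, rounded to two decimals. The object `centers` and the helper `centroid_distance()` are illustrative names, not part of the original solution.

```r
# Centroids copied from the k3$centers output above (rounded to 2 decimals).
centers <- matrix(
  c( 0.95, -0.90, -0.10, -0.28,  1.09,  0.86, -0.05,  0.02, -0.28,  0.58,
    -0.77,  0.67,  0.71,  0.87, -0.51, -0.75,  0.87, -0.15,  0.24,  0.27,
    -0.09,  0.13, -0.50, -0.49, -0.40, -0.03, -0.67,  0.11,  0.02, -0.65),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1", "2", "3"),
                  c("X100m", "Long.jump", "Shot.put", "High.jump", "X400m",
                    "X110m.hurdle", "Discus", "Pole.vault", "Javeline", "X1500m"))
)

# Euclidean distance of each centroid from the overall mean, which is the
# origin after scaling. Hypothetical helper, named for illustration only.
centroid_distance <- function(centers) sqrt(rowSums(centers^2))

dist_to_mean <- centroid_distance(centers)
round(dist_to_mean, 2)

# The cluster with the smallest distance has the most "average" profile.
all_rounder <- names(which.min(dist_to_mean))
all_rounder
```

With the fitted model available, `centroid_distance(k3$centers)` can be used in the same way on the unrounded centroids.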

The Elbow plot shows how the Total Within-Cluster Sum of Squares (WSS) decreases as the number of clusters increases. Initially, adding more clusters leads to a large reduction in WSS, but after a certain point, the improvement becomes marginal. Here the “elbow” appears at around 3 clusters. From a purely mathematical perspective, this suggests that a relatively small number of clusters captures most of the structure in the data. However, the exact location of the elbow is not always clear-cut and requires interpretation.

fviz_nbclust(decathlon_cont, kmeans, method = "wss") + ggtitle("Elbow Method")
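To see what the elbow curve is made of, the WSS values can also be computed by hand from `kmeans()`. The following is a minimal sketch on simulated data, so that it runs with base R only. The helper `wss_curve()` and the matrix `x` are assumptions for illustration, and the resulting values are not those of the decathlon data.

```r
# Hypothetical helper: total within-cluster sum of squares for K = 1..k_max.
wss_curve <- function(x, k_max = 10, nstart = 25) {
  sapply(seq_len(k_max), function(k) {
    kmeans(x, centers = k, nstart = nstart)$tot.withinss
  })
}

set.seed(123)
# Simulated stand-in for a scaled data set: 3 groups, 2 variables.
x <- rbind(
  matrix(rnorm(60, mean = 0),  ncol = 2),
  matrix(rnorm(60, mean = 5),  ncol = 2),
  matrix(rnorm(60, mean = 10), ncol = 2)
)
x <- scale(x)

wss <- wss_curve(x, k_max = 6)

# For K = 1 the WSS equals the total sum of squares around the column means.
all.equal(wss[1], sum(scale(x, scale = FALSE)^2))

# Reduction in WSS from one K to the next; the elbow is where the
# reductions level off.
round(-diff(wss), 2)

plot(seq_along(wss), wss, type = "b",
     xlab = "Number of clusters K", ylab = "Total within-cluster SS")
```

Looking at the successive reductions rather than at the plot alone can make it easier to argue for one K over another when the bend is not sharp.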

K-means clustering depends on its initial cluster centers, which are chosen randomly. Running the algorithm with a single start (nstart = 1) can lead to a suboptimal solution because the algorithm may converge to a local minimum. With a larger number of starts (e.g., nstart = 25), the algorithm is run from several random initializations and the solution with the lowest total within-cluster sum of squares is kept. This makes the result more stable and reliable, which is why choosing nstart > 1 is a thoughtful decision for a researcher. If the cluster assignments differ between runs, this highlights the importance of deliberate parameter choices and reproducibility in statistical analysis. In our case, the number of starts does not change the result: both runs produce the same clusters.

set.seed(123) 

k_low <- kmeans(decathlon_cont, centers = 3, nstart = 1) 

set.seed(123) 

k_high <- kmeans(decathlon_cont, centers = 3, nstart = 25) 

# Compare clusters 
table(k_low$cluster, k_high$cluster)
#>    
#>      1  2  3
#>   1 12  0  0
#>   2  0 16  0
#>   3  0  0 13
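Because K-means numbers its clusters arbitrarily, two runs can return the same grouping under different labels, in which case the cross-table is a permuted diagonal rather than a diagonal. A small sketch of a label-independent check follows. `same_partition()` is a hypothetical helper, and the simulated matrix `x` only stands in for the scaled data.

```r
# Hypothetical helper: TRUE if two label vectors define the same grouping,
# regardless of how the clusters are numbered.
same_partition <- function(a, b) {
  tab <- table(a, b)
  # Identical partitions give exactly one non-empty cell per row and column.
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

same_partition(c(1, 1, 2, 3), c(2, 2, 3, 1))  # same groups, relabelled
same_partition(c(1, 1, 2, 2), c(1, 2, 2, 2))  # different groups

# Simulated stand-in data (not the decathlon values).
set.seed(1)
x <- scale(matrix(rnorm(41 * 10), ncol = 10))

set.seed(123)
k_one  <- kmeans(x, centers = 3, nstart = 1)
set.seed(123)
k_many <- kmeans(x, centers = 3, nstart = 25)

# kmeans() keeps the start with the lowest tot.withinss, so comparing this
# value shows whether the extra starts found a better solution.
c(single = k_one$tot.withinss, best_of_25 = k_many$tot.withinss)
same_partition(k_one$cluster, k_many$cluster)
```

Comparing `tot.withinss` alongside the partitions is useful because two different partitions can be judged directly: the one with the lower value is the better K-means solution.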

When plotting the K-means clusters on the PCA map, the clusters roughly align with the main principal components. The first PC mainly separates clusters 1 and 2, while the second PC mainly separates clusters 1 and 2 from cluster 3.

This demonstrates that both PCA and K-means are capturing similar underlying structure in the data, albeit in different ways: PCA provides continuous dimensions, while K-means creates discrete groupings.

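# PCA on the ten event columns only
pca <- PCA(decathlon_cont[, 1:10], graph = FALSE) 

# habillage takes a grouping factor, so convert the cluster labels
fviz_pca_ind(pca, habillage = factor(k3$cluster), addEllipses = TRUE, title = "K-means clusters on PCA map")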
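To check numerically how the clusters line up with the principal components, one option is to average the PC scores within each cluster. The sketch uses base R's `prcomp()` on simulated data rather than `PCA()` from FactoMineR, purely so that it is self-contained. The objects `x`, `km` and `pc_means` are illustrative.

```r
set.seed(123)
# Simulated stand-in for scaled data: 3 groups shifted on different variables.
x <- matrix(rnorm(60 * 4), ncol = 4)
x[1:20, 1]  <- x[1:20, 1]  + 4   # first group differs on variable 1
x[21:40, 2] <- x[21:40, 2] + 4   # second group differs on variable 2
x <- scale(x)

km  <- kmeans(x, centers = 3, nstart = 25)
pca <- prcomp(x)                 # x is already centred and scaled

# Mean score on the first two components, per K-means cluster.
pc_means <- aggregate(pca$x[, 1:2],
                      by = list(cluster = km$cluster),
                      FUN = mean)
pc_means
```

A cluster whose mean is far from zero on PC1 but close to zero on PC2 is separated mainly along the first dimension. In a FactoMineR result from `PCA()`, the corresponding individual scores are stored in `$ind$coord`.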

If the Elbow method suggests two clusters, but domain knowledge indicates that there are at least three meaningful types of athletes (e.g., sprinters, throwers, and jumpers), this creates an important decision point. In such cases, it is often more appropriate to choose the number of clusters based on theoretical understanding rather than relying solely on the mathematical criterion. The goal of clustering is not just to optimize a numerical metric, but to produce meaningful and interpretable groupings.

# Reload and rescale the data, then transpose: rows become events, columns athletes
data(decathlon)

decathlon_cont <- decathlon[,1:10]

decathlon_cont <- data.frame(scale(decathlon_cont))

decathlon_t <- t(decathlon_cont)

After transposing the dataset, the events are treated as observations and grouped based on how similarly athletes perform across them.

set.seed(123) 

k3_t <- kmeans(decathlon_t, centers = 3, nstart = 25) 

# Cluster assignments (events) 
k3_t$cluster 
#>        X100m    Long.jump     Shot.put    High.jump        X400m X110m.hurdle 
#>            2            1            1            1            2            2 
#>       Discus   Pole.vault     Javeline       X1500m 
#>            1            3            1            3

# Cluster centers 
k3_t$centers
#>      SEBRLE       CLAY     KARPOV    BERNARD     YURKOV    WARNERS  ZSIVOCZKY
#> 1 0.6652300  0.2277266  0.1804349 -0.2219033  0.6692579 -0.2807298 -0.1845074
#> 2 0.1685824 -0.7656004 -0.6968248  0.1008859  1.1631088 -0.3942658 -0.4286506
#> 3 1.0061527  1.2460597  1.1903769  1.0488588 -0.1887606  0.2437679 -1.0881263
#>     McMULLEN  MARTINEAU       HERNU      BARRAS       NOOL BOURGUIGNON
#> 1  0.1263838 -0.3764098 -0.09666489 -0.59016244 -0.8380996   -1.177974
#> 2 -0.2876797  1.1939055  1.22100746  0.29236671  0.7837398    1.656657
#> 3 -0.3556823 -0.4415598  0.36374260  0.05110412 -0.7883800    1.006153
#>       Sebrle       Clay     Karpov      Macey    Warners   Zsivoczky      Hernu
#> 1  1.9111568  1.6267123  1.2732104  1.0696893  0.1322375  0.73746358  0.1175438
#> 2 -0.9434197 -1.1666370 -1.8914341 -0.3561139 -1.3758689  0.06904417 -0.5431101
#> 3  0.4694639  0.3748453 -0.3313437 -1.2346067  0.2056550 -0.51856609 -0.5610130
#>         Nool    Bernard   Schwarzl    Pogorelov   Schoenbeck     Barras
#> 1 -0.1095822  0.4382065 -0.2456987  0.184326050 -0.009742227  0.1396943
#> 2 -0.3468431 -0.8388859 -0.2327794 -0.001402567 -0.114523490 -0.0463709
#> 3  1.0312635 -0.7681555  0.3730474  0.795851254  0.418492673 -0.8033631
#>        Smith  Averyanov     Ojaniemi    Smirnov          Qi      Drews
#> 1  0.1610558 -0.4621816 -0.007824931 -0.3514084 -0.03957384 -1.0140140
#> 2 -0.7087019 -0.6903707 -0.260963365 -0.1672841  0.21127840 -0.9029864
#> 3 -1.2807817 -0.2753171 -0.434142818 -0.7854156 -0.74592463  0.2210326
#>   Parkhomenko      Terek     Gomez       Turi    Lorenzo  Karlivans Korkizoglou
#> 1  0.17021432 -0.3641051 -0.363321 -0.5333829 -0.9768011 -0.5828699  -0.2370809
#> 2  0.78500994  0.2480656 -0.325339  0.4529844  0.5963031  0.9519591   0.5213624
#> 3  0.02108719  1.4523540 -1.051282  0.5380813 -1.1549796 -0.4872134   1.5142873
#>        Uldal     Casarsa
#> 1 -0.6937340 -0.07277187
#> 2  1.0214309  2.04836396
#> 3 -0.3574295  0.08036591

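Running K-means with K=3 on the events gives clusters that largely correspond to meaningful categories of athletic performance. The 100m, 400m and 110m hurdles form one cluster (cluster 2), as they all rely on speed and explosive power. The throwing events (shot put, discus and javelin) are grouped together in cluster 1, reflecting strength-based performance, although the long jump and high jump fall into the same cluster. The pole vault and 1500m make up cluster 3. The clustering therefore captures underlying physical abilities to a large extent, but it does not reproduce a clean split into running, jumping and throwing events. Keep in mind that running events are recorded as times (lower is better) while jumps and throws are distances or heights (higher is better), so part of the separation may reflect the direction of the scale.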

When clustering variables instead of athletes, the cluster centers (centroids) represent the average pattern of each group of events across all athletes. Each centroid has one value per athlete (41 columns in the output above), namely that athlete's mean scaled score over the events in the cluster. In other words, each centroid describes a “typical profile” of how the athletes perform on the events within that cluster. For example, Sebrle's value of 1.91 for cluster 1 means that his scaled scores on the long jump, shot put, high jump, discus and javelin are well above average. The clusters can be interpreted as latent performance dimensions: a cluster of throwing events reflects strength-based performance, and a cluster of sprinting events reflects speed-based performance. Thus, rather than describing groups of individuals, the solution summarizes relationships between variables and helps identify underlying constructs in the data.
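The meaning of a centroid after transposing can be verified directly: it should equal the average of the event rows assigned to that cluster. The sketch below checks this on simulated data. The objects `scores`, `events` and `km_t` are illustrative stand-ins, not the decathlon objects used in the solution.

```r
set.seed(123)
# Simulated stand-in: 41 "athletes" (rows) by 10 "events" (columns),
# scaled per event.
scores <- scale(matrix(rnorm(41 * 10), nrow = 41,
                       dimnames = list(paste0("athlete", 1:41),
                                       paste0("event", 1:10))))

events <- t(scores)   # rows are now events, columns are athletes
km_t <- kmeans(events, centers = 3, nstart = 25, iter.max = 50)

# Each centroid has one coordinate per athlete ...
dim(km_t$centers)

# ... and equals that athlete's mean score over the events in the cluster.
in_cluster_1  <- km_t$cluster == 1
manual_centre <- colMeans(events[in_cluster_1, , drop = FALSE])
all.equal(unname(km_t$centers[1, ]), unname(manual_centre))
```

The `drop = FALSE` keeps the selection a matrix even when a cluster contains a single event, which can easily happen with only ten rows.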


fviz_cluster(k3_t, data = decathlon_t)

Using the Elbow method, we examine how the total within-cluster sum of squares (WSS) decreases as the number of clusters increases. In this case the plot does not show a clear “elbow”, which makes it hard to read off the number of clusters. The mathematical result therefore does not point clearly to 2, 3 or more clusters, and the decision is better based on theoretical knowledge of the event types than on the WSS curve alone.


fviz_nbclust(decathlon_t, kmeans, method = "wss", k.max = 8) + ggtitle("Elbow Method (Variables)")

K-Means Clustering - exercise solutions

dr. Annelies Agten

2026-04-29
