The best of the best - analyzing athletes’ performance using clustering

Introduction

Clustering is an unsupervised learning method that groups observations into clusters based on their similarity. The main difference between unsupervised and supervised learning is that unsupervised learning works with unlabeled data. The clusters that emerge during segmentation can tell us a lot about the data we are working with.

There are many clustering techniques; among the most popular are k-means, PAM and hierarchical clustering. In this paper k-means and PAM are used to analyze data on Alpine Skiing.

Scraping the data

The data was scraped from Ski-DB, a website that compiles data on Alpine Skiing. The idea was to scrape the overall winners of the women’s World Cup in four disciplines: Giant Slalom (GS), Slalom (SL), Downhill (DH) and Super-G (SG). The overall winner title is awarded to the athlete who collected the most points in a given discipline in a given season. The Ski-DB database provides many statistics, including the number of wins, top3 and top10 finishes and points gathered during the season. It dates back to 1966, when the World Cup was first introduced with three disciplines: GS, SL and DH. Super-G was added later, in 1985.

The scraping was done with the help of online resources. To scrape with RSelenium, Java, a standalone version of Selenium and ChromeDriver need to be installed first and placed in the R project folder.

The next step is to open the command prompt (on Windows, press Windows key + R and type “cmd”). In the command prompt, first navigate to your folder with “cd path/to/your/folder”. The prompt should update to reflect the correct directory. Run the Selenium server command “java -jar selenium-server-standalone-x.x.x.jar -port 4444”, replacing x.x.x with the version of the Selenium standalone server you downloaded. You should now be able to scrape using RSelenium!

# Loading needed libraries

library(RSelenium)
library(rvest)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()         masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag()            masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(gridExtra)

## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

library(clustertend)

## Package `clustertend` is deprecated.  Use package `hopkins` instead.

library(ggfortify)
library(ggplot2)

Since the Ski-DB website is dynamic, we need to scrape the data through a remoteDriver object from RSelenium. This tool allows us to emulate user behaviour on the website.

remote_driver <- remoteDriver(remoteServerAddr = 'localhost',
                           browserName = 'chrome',
                           port=4444L)

Our goal is to scrape four tables from four URLs. All tables are built the same way, which allows us to write one scraping function and pass the URLs through it later. This saves time and makes the code cleaner. The function first opens the remote driver and navigates to the URL. It locates the table on the webpage using its id value. The full HTML content of the table is then retrieved and stored in ‘table_html’ as a string. The next step is parsing the HTML into a list of data frames. Since there is just one table, we extract only the first element of the list and store it in ‘data’.

At this point we can close the remote driver and proceed to modifying the data. The first row can be deleted, as it is empty in all tables. The ‘AGE’ variable contains the athlete’s age in years and months at the end of the season, which we shorten to years only for clarity.

scrape_clean_func <- function(url){
  
  remote_driver$open()
  remote_driver$navigate(url)
  
  table_element <- remote_driver$findElement(using = "id", value = "fd-table-5")
  table_html <- table_element$getElementAttribute("outerHTML")[[1]]
  
  table_parsed <- read_html(table_html) %>% html_table(fill = TRUE)
  data <- table_parsed[[1]]
  
  remote_driver$close()
  
  data <- data[-(1),] %>% 
    mutate(AGE = substr(AGE, 1, 2))
  
  return(data)
}

Now we can pass the URLs through the function and save the resulting tables into CSV files, completing the scraping process.

urls <- list(
  GS = 'https://ski-db.com/db/stats/overall_f_gs.php',
  SL = 'https://ski-db.com/db/stats/overall_f_sl.php',
  DH = 'https://ski-db.com/db/stats/overall_f_dh.php',
  SG = 'https://ski-db.com/db/stats/overall_f_sg.php'
)

results <- lapply(urls, scrape_clean_func)
GS <- results$GS
SL <- results$SL
DH <- results$DH
SG <- results$SG

write_csv(GS, 'GS.csv')
write_csv(SL, 'SL.csv')
write_csv(DH, 'DH.csv')
write_csv(SG, 'SG.csv')

Clustering the data

The next step is to load the data into R and combine the four tables into one data frame, adding a ‘discipline’ column that stores the discipline of each row. Since in some seasons an athlete won the overall title without winning a single race, we replace NA values in the ‘WIN’ column with zeros.

GS <- read.csv('GS.csv', sep = ',')
SL <- read.csv('SL.csv', sep = ',')
DH <- read.csv('DH.csv', sep = ',')
SG <- read.csv('SG.csv', sep = ',')

bestAll <- bind_rows(GS %>% mutate(discipline = 'GS'),
                     SL %>% mutate(discipline = 'SL'),
                     DH %>% mutate(discipline = 'DH'),
                     SG %>% mutate(discipline = 'SG'))

bestAll$WIN[is.na(bestAll$WIN)] <- 0 # replacing NA with 0

As the goal is to cluster athletes based on their performance, we need to select the relevant columns. It makes sense to pick the number of wins, top3 and top10 finishes. It would also be good to include the number of points collected, but instead of ‘PTS’ it is better to use ‘WPTS’. FIS changed the points system over the years, and the ‘WPTS’ column expresses points in the original system, which allows for a fair comparison between seasons.

best_clustering <- bestAll %>% select(WIN, TOP3, TOP10, WPTS)

Since the variables are measured on different scales, we standardize them with the scale() function. This ensures that each variable contributes fairly to the clustering process, without one overshadowing the others. In our case ‘WPTS’ could dominate the process, as it takes values above 100. By default scale() centers each column on its mean and divides it by its standard deviation, so the relative positions of the observations are preserved and only the units change[^1]. An easy way to picture this is to imagine stretching or compressing an axis.

best_normalized <- best_clustering %>% 
  scale()
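To make the effect of scale() concrete, here is a minimal sketch on a small made-up data frame (the values are hypothetical and not taken from Ski-DB). It checks that the default behaviour matches a z-score computed by hand.

```r
# Hypothetical toy data on two very different scales (not Ski-DB data)
toy <- data.frame(WIN = c(2, 4, 7, 10), WPTS = c(90, 130, 187, 290))

# By default scale() subtracts the column mean and divides by the column SD
scaled <- scale(toy)

# The same z-scores computed by hand for one column
manual_wpts <- (toy$WPTS - mean(toy$WPTS)) / sd(toy$WPTS)

print(round(scaled, 2))
print(all.equal(as.numeric(scaled[, "WPTS"]), manual_wpts))
```

After scaling, every column has mean 0 and standard deviation 1, so no single variable dominates a distance computation.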

Optimal number of clusters

Before clustering, it’s best to estimate the optimal number of clusters. Among the many methods available, the silhouette method is one of the most widely used. The optimal number of clusters is the one with the maximum average silhouette width. The code below applies it with four clustering methods: K-means, PAM, CLARA and hierarchical clustering.

km_opt <- fviz_nbclust(best_normalized, FUNcluster = kmeans, method = "silhouette") + theme_classic() 
pam_opt <- fviz_nbclust(best_normalized, FUNcluster = cluster::pam, method = "silhouette") + theme_classic() 
cla_opt <- fviz_nbclust(best_normalized, FUNcluster = cluster::clara, method = "silhouette") + theme_classic() 
hc_opt <- fviz_nbclust(best_normalized, FUNcluster = hcut, method = "silhouette") + theme_classic() 

grid.arrange(km_opt, pam_opt, cla_opt, hc_opt, ncol=2)

The analysis shows that the optimal number of clusters is two for all four methods. However, since the silhouette for three clusters doesn’t differ much from that for two, I think it’s better to use three clusters. Dividing the data into three groups instead of two will give us more information.

K-means, PAM and CLARA are all flat clustering methods. In flat clustering the algorithm needs to be told in advance how many clusters the data should be grouped into. Hierarchical clustering, by contrast, does not need the number of clusters up front: it builds a tree of nested clusters, and the number of groups is decided afterwards by choosing where to cut the tree[^2].
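As a rough illustration of what the silhouette method computes, the sketch below calculates the average silhouette width by hand for several values of k. It uses simulated data with two obvious groups (an assumption made for the example, not the skiing data) and the silhouette() function from the cluster package.

```r
library(cluster)  # provides silhouette()

set.seed(42)
# Hypothetical toy data: two well-separated groups in two dimensions
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2))
d <- dist(toy)

# Average silhouette width of a k-means solution for k = 2..5
avg_sil <- sapply(2:5, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:5

print(round(avg_sil, 2))
best_k <- as.integer(names(which.max(avg_sil)))
print(best_k)
```

The k with the highest average width is the one the silhouette method would suggest; for data with weaker structure the differences between neighbouring values of k can be small, as in the analysis above.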
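For comparison with the flat methods, here is a minimal base R sketch of hierarchical clustering on simulated data (the data and the choice of Ward linkage are assumptions for the example). The tree is built once, and flat partitions for different k are obtained by cutting it.

```r
set.seed(1)
# Hypothetical toy data: three tight groups of 10 points each
toy <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 3, sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 6, sd = 0.3), ncol = 2))

# Build the full tree once; no number of clusters is needed at this stage
tree <- hclust(dist(toy), method = "ward.D2")

# Cut the same tree at different levels to obtain flat partitions
labels_k2 <- cutree(tree, k = 2)
labels_k3 <- cutree(tree, k = 3)

print(table(labels_k3))
```

The same tree serves both partitions, which is why functions such as hcut() can be evaluated for a range of k without refitting from scratch.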

In this project only two flat clustering methods will be used: K-means and PAM. CLARA is a variant of PAM designed for large datasets; since our dataset is on the smaller side, there’s no point in using it.

How well can the data be clustered? Hopkins Statistic

To determine whether the data set we are working on is clusterable, we can use the Hopkins statistic, which assesses the clustering tendency of a data set. The null hypothesis is that the data is uniformly distributed, which means no meaningful clusters can be found. Rejecting the null hypothesis suggests that there are meaningful clusters within the data set.

get_clust_tendency(best_normalized, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"))

## $hopkins_stat
## [1] 0.9660533
## 
## $plot

get_clust_tendency(best_normalized, 3, graph=TRUE, gradient=list(low="red", mid="white", high="blue"))

## $hopkins_stat
## [1] 0.8614701
## 
## $plot

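Note that the second argument of get_clust_tendency() is the number of points sampled to compute the statistic, not the number of clusters. With 2 and 3 sampled points the statistic equals H=0.97 and H=0.86 respectively, both far above the 0.5 threshold, which suggests that the data set is highly clusterable. Because so few points are sampled, these estimates are noisy; a larger sample would give a more stable value.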
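The idea behind the Hopkins statistic can be sketched in a few lines of base R. The function below, hopkins_sketch(), is a hypothetical helper written for this illustration; it is a simplified variant and not the exact factoextra implementation. It compares nearest-neighbour distances of n sampled real points with those of n points drawn uniformly from the bounding box of the data.

```r
# Simplified Hopkins-type statistic (hypothetical helper, not factoextra's code)
hopkins_sketch <- function(x, n, seed = 123) {
  set.seed(seed)
  x <- as.matrix(x)
  # n real observations sampled without replacement
  idx <- sample(nrow(x), n)
  # n artificial points drawn uniformly inside the bounding box of the data
  u <- sapply(seq_len(ncol(x)), function(j) runif(n, min(x[, j]), max(x[, j])))
  u <- matrix(u, nrow = n)
  # Euclidean distance from point p to its nearest row in pts
  nearest <- function(p, pts) min(sqrt(colSums((t(pts) - p)^2)))
  w <- sapply(idx, function(i) nearest(x[i, ], x[-i, , drop = FALSE]))
  q <- sapply(seq_len(n), function(i) nearest(u[i, ], x))
  # Values near 1 indicate clustered data, values near 0.5 uniform data
  sum(q) / (sum(q) + sum(w))
}

set.seed(10)
clustered <- rbind(matrix(rnorm(100, mean = 0, sd = 0.2), ncol = 2),
                   matrix(rnorm(100, mean = 5, sd = 0.2), ncol = 2))
uniform <- matrix(runif(200), ncol = 2)

h_clustered <- hopkins_sketch(clustered, n = 20)
h_uniform <- hopkins_sketch(uniform, n = 20)
print(c(clustered = h_clustered, uniform = h_uniform))
```

In clustered data the artificial points tend to fall in empty regions far from any observation, while real points have close neighbours, which pushes the ratio towards 1.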

Choosing a method

K-means

In K-means the center of each cluster is the mean of the points assigned to it. It’s perhaps the most popular clustering method. The algorithm assigns points to clusters based on their distance to these centers, by default the Euclidean distance. Below, eclust() is called twice, with hc_metric set first to "euclidean" and then to "manhattan".

K-means partitions the data space into a predetermined number of clusters. Data is first assigned to the nearest randomly selected cluster center. Then the centers are recalculated and data points are reassigned. This process is repeated until centroids stop moving, producing final clusters. K-means is a method sensitive to outliers, which is why it’s particularly important to identify and remove them beforehand.
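The assign-and-update loop described above can be sketched in base R. The function lloyd_kmeans() below is a hypothetical teaching helper, not the implementation used by kmeans() or eclust(); it runs on simulated data with two clear groups.

```r
# Minimal sketch of the K-means loop (hypothetical helper for illustration)
lloyd_kmeans <- function(x, k, max_iter = 100, seed = 1) {
  set.seed(seed)
  x <- as.matrix(x)
  # Start from k randomly selected observations as centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every center (n x k)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Assign each point to its nearest center
    cluster <- max.col(-d, ties.method = "first")
    # Recompute each center as the mean of its assigned points
    new_centers <- centers
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        new_centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
    # Stop once the centers no longer move
    if (all(abs(new_centers - centers) < 1e-10)) break
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}

set.seed(3)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2))
fit <- lloyd_kmeans(toy, k = 2)
print(table(fit$cluster))
```

Because each center is a mean, a single extreme observation can pull it away from the bulk of its cluster, which is the source of the sensitivity to outliers mentioned above.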

# Kmeans - euclidean distance

km1<-eclust(best_normalized, "kmeans", hc_metric="euclidean",k=3)

fviz_silhouette(km1)

##   cluster size ave.sil.width
## 1       1   97          0.30
## 2       2   62          0.30
## 3       3   60          0.38

# kmeans - manhattan distance

km2<-eclust(best_normalized, "kmeans", hc_metric="manhattan",k=3)

fviz_silhouette(km2)

##   cluster size ave.sil.width
## 1       1   97          0.30
## 2       2   62          0.30
## 3       3   60          0.38

The two runs give identical results: the average silhouette width is 0.32 in both, and the clusters are the same, with 97 observations in cluster 1, 62 in cluster 2 and 60 in cluster 3. This is expected rather than informative. According to the factoextra documentation, the hc_metric argument of eclust() is used only for hierarchical clustering functions, so for K-means both calls run the same Euclidean-based algorithm.
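The overall value of 0.32 can be recovered from the per-cluster table as a size-weighted average. The short check below uses the rounded values printed above, so the result is approximate.

```r
# Cluster sizes and rounded average silhouette widths from the output above
sizes  <- c(97, 62, 60)
widths <- c(0.30, 0.30, 0.38)

# Overall average silhouette width = size-weighted mean of cluster averages
overall <- sum(sizes * widths) / sum(sizes)
print(round(overall, 2))  # 0.32
```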

PAM

In PAM an existing data point is chosen as the center (medoid) of each cluster, hence the name Partitioning Around Medoids. It is often used with Manhattan distance, but other metrics are possible. Similarly to K-means, the algorithm starts from an initial set of medoids and assigns each point to the nearest one based on the chosen distance metric. It then tries to find better medoids by swapping them with other points whenever a swap lowers the total distance from points to their medoids. PAM is more robust to outliers compared to K-means, as it uses actual data points as centers.
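As a small illustration, the sketch below calls pam() from the cluster package directly on simulated data (an assumption for the example), selecting the distance through its metric argument. It also shows that the medoids are rows of the original data.

```r
library(cluster)  # provides pam()

set.seed(7)
# Hypothetical toy data: two groups of 15 points each
toy <- rbind(matrix(rnorm(30, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(30, mean = 4, sd = 0.5), ncol = 2))

# pam() selects the distance through its own 'metric' argument
pam_euc <- pam(toy, k = 2, metric = "euclidean")
pam_man <- pam(toy, k = 2, metric = "manhattan")

# Medoids are actual observations: id.med holds their row indices
print(pam_man$id.med)
print(pam_man$medoids)
print(toy[pam_man$id.med, ])
```

Calling pam() this way is one option for comparing Manhattan and Euclidean distances explicitly.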

# PAM - manhattan

pam1<-eclust(best_normalized, "pam", hc_metric="manhattan", k=3)

fviz_silhouette(pam1)

##   cluster size ave.sil.width
## 1       1   64          0.31
## 2       2  101          0.30
## 3       3   54          0.39

# PAM - euclidean

pam2<-eclust(best_normalized, "pam", hc_metric = "euclidean", k=3)

fviz_silhouette(pam2)

##   cluster size ave.sil.width
## 1       1   64          0.31
## 2       2  101          0.30
## 3       3   54          0.39

For PAM the two runs are also identical, which is expected because eclust() uses hc_metric only for hierarchical methods. The average silhouette width is about the same as for K-means, roughly 0.32. Cluster sizes did change, however: cluster 2 now groups 101 observations, cluster 1 groups 64 and cluster 3 groups 54.

Which method to choose?

The results show that changing the hc_metric setting made no difference, as the silhouette score stayed the same in all cases. Since eclust() applies this setting only to hierarchical methods, the comparison says little about the metrics themselves. K-means and PAM did however produce clusters of different sizes. Each algorithm and each distance metric has its own advantages and disadvantages.

K-means is advisable if computational efficiency is key, as it’s faster than PAM. PAM is a good choice for a dataset prone to extreme values, as it’s resilient to outliers. As for distances, Manhattan is somewhat cheaper to compute and less sensitive to outliers because differences are not squared, while Euclidean penalizes large deviations more heavily. Both depend on the units of the variables, so features should be scaled in either case.

Given this information the best choice in this particular case would be K-means with Euclidean distance, as the dataset is not large and the features were normalized.

Merging Clusters with Original Data

bestAll$Clusters <- km1$cluster
head(bestAll)

##    Season Races            Winner NAT PTS AGE WIN TOP3 TOP10 WPTS discipline
## 1 2023/24    11  Gut Behrami Lara SUI 771  32   4    7    11  175         GS
## 2 2022/23    10  Shiffrin Mikaela USA 800  28   7    7     9  187         GS
## 3 2021/22     9      Worley Tessa FRA 567  32   2    4     9  129         GS
## 4 2020/21     8     Bassino Marta ITA 546  25   4    5     7  130         GS
## 5 2019/20     6 Brignone Federica ITA 407  29   2    3     6   92         GS
## 6 2018/19     8  Shiffrin Mikaela USA 615  24   4    6     8  149         GS
##   Clusters
## 1        2
## 2        2
## 3        1
## 4        1
## 5        3
## 6        1

Performance analysis

The initial goal of this paper was to group athletes by performance and check who belongs to which category. We can summarise the clusters using the code below. It gives the mean number of points, the median number of wins, and the minimum and maximum number of podium and top 10 finishes for each cluster.

Cluster summary

cluster_summary <- bestAll %>%
  group_by(Clusters) %>%
  summarise(
    WPTS_mean = round(mean(WPTS),2),
    WIN_median = median(WIN),
    TOP3_min = min(TOP3),
    TOP3_max = max(TOP3),
    TOP10_min = min(TOP10),
    TOP10_max = max(TOP10)
  )

print(cluster_summary)

## # A tibble: 3 × 7
##   Clusters WPTS_mean WIN_median TOP3_min TOP3_max TOP10_min TOP10_max
##      <int>     <dbl>      <dbl>    <int>    <int>     <int>     <int>
## 1        1     125.           3        3        7         5         9
## 2        2     175.           5        6       12         6        12
## 3        3      89.9          2        2        5         3         8

Cluster 1 has an average of 125 points and a median of three wins, between 3 and 7 podiums and between 5 and 9 top 10 finishes. In cluster 2 athletes have on average about 175 points, with a median of 5 wins and between 6 and 12 top 3 finishes. Interestingly, the minimum and maximum number of top 10 finishes are also 6 and 12. Cluster 3 is hands down the “worst” one, with almost 90 points on average, a median of 2 wins, between 2 and 5 podiums and between 3 and 8 top 10 finishes.

It’s clear that cluster 2 gathers “best performing” athletes, as values for all variables are significantly higher than in the other clusters. To gain more insight we can explore clusters further by visualizing data. For clarity, it’s important to set cluster colors to ensure that each cluster is always visualized with the same color.

cluster_colors <- c("1" = "red", "2" = "blue", "3" = "green")

bestAll$Clusters <- as.factor(bestAll$Clusters)

autoplot(prcomp(bestAll[, c("WPTS", "WIN", "TOP3", "TOP10")], scale. = TRUE), 
         data = bestAll, 
         colour = 'Clusters', 
         shape = TRUE) +
  scale_color_manual(values = cluster_colors) +
  theme_minimal() +
  labs(title = "PCA Projection of Clusters")

The scatter plot above shows the projection of the clustered data onto a two-dimensional space. The first principal component explains 78.7% of the variance and PC2 adds further differentiation, explaining 16.5%. The clusters are not strongly separated, which is to be expected. The dataset compiles information about the best athletes, so we shouldn’t expect vast differences between data points.
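The percentages shown on the axes of such a plot come from the standard deviations stored in the prcomp object. The sketch below shows the calculation on the built-in iris data, used here only as a stand-in because the scraped tables are not bundled with this text.

```r
# Stand-in data: the four numeric columns of the built-in iris dataset
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Share of total variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

print(round(100 * var_explained, 1))
print(round(100 * cumsum(var_explained), 1))
```

Applied to the prcomp() call above, the first two entries of var_explained would correspond to the PC1 and PC2 percentages quoted in the text.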

Boxplots and outliers

boxplotWPTS <- ggplot(bestAll, aes(x = factor(Clusters), y = WPTS, fill = factor(Clusters))) +
  geom_boxplot() +
  scale_fill_manual(values = cluster_colors) +
  theme_minimal() +
  labs(title = "WPTS Across Clusters")

boxplotWIN <- ggplot(bestAll, aes(x = factor(Clusters), y = WIN, fill = factor(Clusters))) +
  geom_boxplot() +
  scale_fill_manual(values = cluster_colors) +
  theme_minimal() +
  labs(title = "Wins Across Clusters")

boxplotTOP3 <- ggplot(bestAll, aes(x = factor(Clusters), y = TOP3, fill = factor(Clusters))) +
  geom_boxplot() +
  scale_fill_manual(values = cluster_colors) +
  theme_minimal() +
  labs(title = "Top3 Finishes Across Clusters")

boxplotTOP10 <- ggplot(bestAll, aes(x = factor(Clusters), y = TOP10, fill = factor(Clusters))) +
  geom_boxplot() +
  scale_fill_manual(values = cluster_colors) +
  theme_minimal() +
  labs(title = "Top10 Finishes Across Clusters")

grid.arrange(boxplotWPTS, boxplotWIN, boxplotTOP3, boxplotTOP10, ncol=2)

These boxplots show the distribution of key performance metrics across the three clusters: points, wins, podiums and top10 finishes. They confirm what we already knew: cluster 2 is “the best” one, with the highest statistics for all four metrics. Cluster 1 is the middle one and cluster 3 has the lowest values, confirming it’s the weakest group.

Boxplots also reveal outliers, which we can take a closer look at using the code below.

# Outliers for WPTS in Cluster 2

Q1_C2_WPTS <- quantile(bestAll$WPTS[bestAll$Clusters == 2], 0.25, na.rm = TRUE)
Q3_C2_WPTS <- quantile(bestAll$WPTS[bestAll$Clusters == 2], 0.75, na.rm = TRUE)
IQR_C2_WPTS <- Q3_C2_WPTS - Q1_C2_WPTS

lower_C2_WPTS <- Q1_C2_WPTS - 1.5 * IQR_C2_WPTS
upper_C2_WPTS <- Q3_C2_WPTS + 1.5 * IQR_C2_WPTS

outliers_C2_WPTS <- bestAll %>%
  filter(Clusters == 2 & (WPTS < lower_C2_WPTS | WPTS > upper_C2_WPTS))


print(outliers_C2_WPTS)

##    Season Races           Winner NAT  PTS AGE WIN TOP3 TOP10 WPTS discipline
## 1 2022/23    11 Shiffrin Mikaela USA  945  28   6   10    11  233         SL
## 2 2018/19    12 Shiffrin Mikaela USA 1160  24  10   12    12  290         SL
## 3 2017/18    12 Shiffrin Mikaela USA  980  23   9   10    10  245         SL
##   Clusters
## 1        2
## 2        2
## 3        2

Outliers are extreme values that often appear due to errors, but that is not the case here. As we are working with data on the best athletes, very high values point to the highest-performing ones. The “outlier” in cluster 2’s points distribution is Mikaela Shiffrin, an American Alpine Skiing legend. Her exceptional performance in slalom races in three different seasons (2017/18, 2018/19 and 2022/23) made her an outlier, as she gathered more points than anyone else.

Outliers in a performance-based dataset often indicate dominance rather than data errors and can be useful in identifying exceptional individuals.

# Outliers for WIN in Cluster 2

Q1_C2_WIN <- quantile(bestAll$WIN[bestAll$Clusters == 2], 0.25, na.rm = TRUE)
Q3_C2_WIN <- quantile(bestAll$WIN[bestAll$Clusters == 2], 0.75, na.rm = TRUE)
IQR_C2_WIN <- Q3_C2_WIN - Q1_C2_WIN

lower_C2_WIN <- Q1_C2_WIN - 1.5 * IQR_C2_WIN
upper_C2_WIN <- Q3_C2_WIN + 1.5 * IQR_C2_WIN

outliers_C2_WIN <- bestAll %>%
  filter(Clusters == 2 & (WIN < lower_C2_WIN | WIN > upper_C2_WIN))

print(outliers_C2_WIN)

##    Season Races           Winner NAT  PTS AGE WIN TOP3 TOP10 WPTS discipline
## 1 2018/19    12 Shiffrin Mikaela USA 1160  24  10   12    12  290         SL
##   Clusters
## 1        2

Mikaela Shiffrin is also “an outlier” when it comes to cluster 2’s WIN distribution. In season 2018/19 she won 10 slalom races and was on the podium in the other two.

# Outliers for WIN in Cluster 3

Q1_C3_WIN <- quantile(bestAll$WIN[bestAll$Clusters == 3], 0.25, na.rm = TRUE)
Q3_C3_WIN <- quantile(bestAll$WIN[bestAll$Clusters == 3], 0.75, na.rm = TRUE)
IQR_C3_WIN <- Q3_C3_WIN - Q1_C3_WIN

lower_C3_WIN <- Q1_C3_WIN - 1.5 * IQR_C3_WIN
upper_C3_WIN <- Q3_C3_WIN + 1.5 * IQR_C3_WIN

outliers_C3_WIN <- bestAll %>%
  filter(Clusters == 3 & (WIN < lower_C3_WIN | WIN > upper_C3_WIN))

print(outliers_C3_WIN)

##    Season Races          Winner NAT PTS AGE WIN TOP3 TOP10 WPTS discipline
## 1 1968/69     7 Cochran Marilyn USA  60  19   0    5     6  101         GS
##   Clusters
## 1        3

The argument that outliers reveal exceptional athletes cannot be used in the case of cluster 3’s WIN distribution. Marilyn Cochran won the overall giant slalom title in season 1968/69 without winning a single GS race. Cochran won only three World Cup races during her professional career, all in the 1970s. However, her consistently high placements in GS races (5 podiums over 7 races) in the 1968/69 season secured her the top spot in the overall ranking.

# Outliers for TOP3 in Cluster 1

Q1_C1_TOP3 <- quantile(bestAll$TOP3[bestAll$Clusters == 1], 0.25, na.rm = TRUE)
Q3_C1_TOP3 <- quantile(bestAll$TOP3[bestAll$Clusters == 1], 0.75, na.rm = TRUE)
IQR_C1_TOP3 <- Q3_C1_TOP3 - Q1_C1_TOP3

lower_C1_TOP3 <- Q1_C1_TOP3 - 1.5 * IQR_C1_TOP3
upper_C1_TOP3 <- Q3_C1_TOP3 + 1.5 * IQR_C1_TOP3

outliers_C1_TOP3 <- bestAll %>%
  filter(Clusters == 1 & (TOP3 < lower_C1_TOP3 | TOP3 > upper_C1_TOP3))

print(outliers_C1_TOP3)

##    Season Races            Winner NAT PTS AGE WIN TOP3 TOP10 WPTS discipline
## 1 2021/22     9 Brignone Federica ITA 506  31   3    3     8  110         SG
##   Clusters
## 1        1

Federica Brignone won the 2021/22 Super-G title with only three podiums, making her an outlier in cluster 1’s TOP3 distribution. However, all of her top3 finishes were also wins, securing her the top spot.

# Outliers for TOP3 in cluster 2

Q1_C2_TOP3 <- quantile(bestAll$TOP3[bestAll$Clusters == 2], 0.25, na.rm = TRUE)
Q3_C2_TOP3 <- quantile(bestAll$TOP3[bestAll$Clusters == 2], 0.75, na.rm = TRUE)
IQR_C2_TOP3 <- Q3_C2_TOP3 - Q1_C2_TOP3

lower_C2_TOP3 <- Q1_C2_TOP3 - 1.5 * IQR_C2_TOP3
upper_C2_TOP3 <- Q3_C2_TOP3 + 1.5 * IQR_C2_TOP3

outliers_C2_TOP3 <- bestAll %>%
  filter(Clusters == 2 & (TOP3 < lower_C2_TOP3 | TOP3 > upper_C2_TOP3))

print(outliers_C2_TOP3)

##    Season Races           Winner NAT  PTS AGE WIN TOP3 TOP10 WPTS discipline
## 1 2022/23    11 Shiffrin Mikaela USA  945  28   6   10    11  233         SL
## 2 2018/19    12 Shiffrin Mikaela USA 1160  24  10   12    12  290         SL
## 3 2017/18    12 Shiffrin Mikaela USA  980  23   9   10    10  245         SL
##   Clusters
## 1        2
## 2        2
## 3        2

Mikaela Shiffrin emerges as an outlier once again in cluster 2’s TOP3 distribution. Her status as an Alpine Skiing legend cannot be denied: as of January 2025 Shiffrin is chasing her 100th World Cup win. She was close to reaching this milestone in Killington, USA in November 2024, where she unfortunately sustained an injury that took her out of racing for about 60 days.

# Outliers for TOP10 in Cluster 3

Q1_C3_TOP10 <- quantile(bestAll$TOP10[bestAll$Clusters == 3], 0.25, na.rm = TRUE)
Q3_C3_TOP10 <- quantile(bestAll$TOP10[bestAll$Clusters == 3], 0.75, na.rm = TRUE)
IQR_C3_TOP10 <- Q3_C3_TOP10 - Q1_C3_TOP10

lower_C3_TOP10 <- Q1_C3_TOP10 - 1.5 * IQR_C3_TOP10
upper_C3_TOP10 <- Q3_C3_TOP10 + 1.5 * IQR_C3_TOP10

outliers_C3_TOP10 <- bestAll %>%
  filter(Clusters == 3 & (TOP10 < lower_C3_TOP10 | TOP10 > upper_C3_TOP10))

print(outliers_C3_TOP10)

##      Season Races             Winner NAT PTS AGE WIN TOP3 TOP10 WPTS discipline
## 1   1998/99     8       Egger Sabine AUT 425  21   1    2     8   84         SL
## 2   1967/68     6       Mir Isabelle FRA  70  19   2    3     3   70         DH
## 3 1967/68 æ     6          Pall Olga AUT  70  20   2    3     3   70         DH
## 4   1966/67     4 Goitschel Marielle FRA  56  21   2    2     3   56         DH
## 5   1988/89     4       Merle Carole FRA  75  25   3    3     3   75         SG
## 6   1987/88     4     Figini Michela SUI  65  21   2    3     3   65         SG
##   Clusters
## 1        3
## 2        3
## 3        3
## 4        3
## 5        3
## 6        3

The outliers in cluster 3’s TOP10 distribution come mainly from short seasons in the DH and SG disciplines. Several athletes have only 3 top10 finishes, which might seem low, but those seasons were only 4 or 6 races long. Sabine Egger stands out at the other end with 8 top10 finishes, which might have placed her in a different cluster if her other metrics had not been low.
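The 1.5 × IQR rule applied repeatedly in this section could also be wrapped in one function. The helper below, iqr_outliers(), is a hypothetical sketch in base R; it assumes a data frame with a ‘Clusters’ column, as in bestAll, and is shown on made-up values.

```r
# Hypothetical helper: rows of one cluster outside the 1.5 * IQR fences
iqr_outliers <- function(data, cluster, column) {
  in_cluster <- data[data$Clusters == cluster, , drop = FALSE]
  values <- in_cluster[[column]]
  q <- quantile(values, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  is_out <- !is.na(values) & (values < q[1] - fence | values > q[2] + fence)
  in_cluster[is_out, , drop = FALSE]
}

# Made-up example data (not from Ski-DB)
toy <- data.frame(Clusters = c(1, 1, 1, 1, 1, 2, 2, 2),
                  WPTS = c(100, 105, 110, 115, 300, 50, 55, 60))

out_c1 <- iqr_outliers(toy, cluster = 1, column = "WPTS")
out_c2 <- iqr_outliers(toy, cluster = 2, column = "WPTS")
print(out_c1)
```

With such a helper, each block above would reduce to a single call, for example iqr_outliers(bestAll, 2, "WPTS").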

The best of the best

Having established that cluster 2 groups the best athletes, we can now check who belongs to this group.

bestAll %>% filter(Clusters == 2) %>% 
  group_by(Winner) %>% 
  summarise(count=n()) %>% 
  arrange(desc(count))

## # A tibble: 33 × 2
##    Winner               count
##    <chr>                <int>
##  1 Shiffrin Mikaela         9
##  2 Vonn Lindsey             5
##  3 Proell Annemarie         4
##  4 Schild Marlies           4
##  5 Kostelic Janica          3
##  6 Schneider Vreni          3
##  7 Dorfmeister Michaela     2
##  8 Goetschl Renate          2
##  9 Hess Erika               2
## 10 Morerod Lise Marie       2
## # ℹ 23 more rows

The previously mentioned Mikaela Shiffrin takes the top spot with 9 appearances in cluster 2. This suggests she is one of the best athletes, if not the best, in the history of the women’s Alpine Skiing World Cup.

Conclusions

This paper explores the use of clustering, an unsupervised learning technique, to analyze the performance of elite athletes. Athletes were grouped based on several metrics, which made it possible to separate the best from the rest. Clustering can provide a deeper understanding of athletes’ performance and help identify key factors that contribute to success.

[^1]: Information about scale() function from: https://stackoverflow.com/questions/20256028/understanding-scale-in-r

[^2]: Information about flat and hierarchical clustering: https://pythonprogramming.net/flat-clustering-machine-learning-python-scikit-learn/

Clustering

Zuzanna Herniczek

2025-02-09