This study examines how unsupervised learning can group ATP players without relying solely on ranking. Using match data from the 2023 season, players with at least 10 matches and an average ranking of 100 or better are clustered on overall and surface-specific win rates, match volume, and average opponent ranking. A seven-cluster solution is selected using elbow and silhouette analysis, and the result is compared with Ward’s hierarchical clustering to assess consistency. The findings indicate that ranking alone does not fully describe player performance, and that clustering helps reveal differences in how players compete.
Although professional tennis performance is frequently summarized by ranking, this metric does not fully capture the underlying factors contributing to performance. Players with comparable rankings may vary in surface specialization, match volume, or consistency throughout the season. This study employs unsupervised learning techniques to identify performance patterns among ATP players during the 2023 season using season-level indicators and compares K-means and hierarchical clustering methods to assess the consistency of player segmentation.
The ATP match dataset is first loaded into the environment and filtered to the 2023 season. It includes player rankings, match outcomes, dates, and surface types. A fixed random seed, set before the clustering steps, keeps the clustering results reproducible when the analysis is repeated.
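# 0) Load Packages and Data
library(dplyr)      # data manipulation and %>%
library(lubridate)  # year()
library(stringr)    # str_to_title(), str_trim()
library(ggplot2)    # plots
library(cluster)    # silhouette()
df <- read.csv("atp_tennis.csv", stringsAsFactors = FALSE)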
#Check Required Columns Before They Are Used
required_cols <- c("Player_1", "Player_2", "Winner", "Rank_1", "Rank_2", "Surface")
missing_cols <- setdiff(required_cols, names(df))
if (length(missing_cols) > 0) {
  stop(paste("Missing required columns:", paste(missing_cols, collapse = ", ")))
}
#Cleaning and Filter Year 2023
date_col <- intersect(names(df), c("Date", "date", "tourney_date", "match_date"))[1]
if (is.na(date_col)) stop("No date column found. Check column names in df.")
df <- df %>%
  mutate(match_date = as.Date(.data[[date_col]])) %>%
  filter(!is.na(match_date)) %>%
  mutate(year = year(match_date)) %>%
  filter(year == 2023) %>%
  #Surface Cleaning: keep only Hard, Clay, and Grass
  mutate(
    Surface = str_to_title(str_trim(Surface)),
    Surface = ifelse(Surface %in% c("Hard", "Clay", "Grass"), Surface, NA_character_)
  ) %>%
  filter(!is.na(Surface))
#Long Format: 2 Rows per Match
base <- df %>%
  transmute(
    match_date,
    Surface,
    p1 = Player_1,
    p2 = Player_2,
    r1 = suppressWarnings(as.numeric(Rank_1)),
    r2 = suppressWarnings(as.numeric(Rank_2)),
    Winner
  ) %>%
  filter(!is.na(r1), !is.na(r2), r1 > 0, r2 > 0) %>%
  #Existence of Players
  filter(!is.na(p1), !is.na(p2), p1 != "", p2 != "")
long <- dplyr::bind_rows(
  base %>%
    transmute(
      match_date, Surface,
      player = p1, opponent = p2,
      player_rank = r1, opp_rank = r2,
      win = as.integer(Winner == p1)
    ),
  base %>%
    transmute(
      match_date, Surface,
      player = p2, opponent = p1,
      player_rank = r2, opp_rank = r1,
      win = as.integer(Winner == p2)
    )
)
#Minimum Matches Threshold to Avoid Tiny-Sample Win Rates
min_matches <- 10
season <- long %>%
  group_by(player) %>%
  summarise(
    matches = n(),
    wins = sum(win, na.rm = TRUE),
    win_pct = wins / matches,
    avg_rank = mean(player_rank, na.rm = TRUE),
    avg_opp_rank = mean(opp_rank, na.rm = TRUE),
    hard_matches = sum(Surface == "Hard"),
    clay_matches = sum(Surface == "Clay"),
    grass_matches = sum(Surface == "Grass"),
    #Surface win rates are NA when the player has no matches on that surface
    hard_win_pct = ifelse(hard_matches > 0, sum(win[Surface == "Hard"]) / hard_matches, NA_real_),
    clay_win_pct = ifelse(clay_matches > 0, sum(win[Surface == "Clay"]) / clay_matches, NA_real_),
    grass_win_pct = ifelse(grass_matches > 0, sum(win[Surface == "Grass"]) / grass_matches, NA_real_)
  ) %>%
  ungroup() %>%
  #Applying Filters: enough matches, and a season-average ranking of 100 or better
  filter(matches >= min_matches) %>%
  filter(avg_rank <= 100)
The clustering features are overall win percentage, surface-specific win percentages (hard, clay, grass), match volume, and average opponent ranking. Because distance-based clustering is sensitive to scale, all features are standardized to zero mean and unit variance. If a player has no matches on a given surface (rare but possible under filtering), the surface win rate is imputed with the player’s overall win percentage before standardization, which avoids dropping cases and keeps a neutral baseline.
#Features For Clustering
season_features <- season %>%
  mutate(
    hard_win_pct = ifelse(is.na(hard_win_pct), win_pct, hard_win_pct),
    clay_win_pct = ifelse(is.na(clay_win_pct), win_pct, clay_win_pct),
    grass_win_pct = ifelse(is.na(grass_win_pct), win_pct, grass_win_pct)
  ) %>%
  select(player, win_pct, hard_win_pct, clay_win_pct, grass_win_pct, avg_opp_rank, matches)
X <- season_features %>%
  select(-player) %>%
  scale() %>%
  as.matrix()
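To illustrate why the features are standardized before clustering, the sketch below compares nearest neighbours on raw and z-scored values. The four players and their numbers are hypothetical and chosen only for illustration; `toy` and `nearest()` are not part of the analysis.

```r
# Hypothetical toy data: win rate lies in [0, 1], match count is in the tens.
toy <- data.frame(
  win_pct = c(0.70, 0.30, 0.68, 0.35),
  matches = c(40, 42, 50, 20)
)
rownames(toy) <- c("A", "B", "C", "D")

# Raw Euclidean distances are dominated by the match count.
d_raw <- as.matrix(dist(toy))

# After z-scoring each column, both features contribute on a comparable scale.
d_std <- as.matrix(dist(scale(toy)))

# Helper (illustrative): name of the closest other player
nearest <- function(d, who) {
  others <- setdiff(rownames(d), who)
  others[which.min(d[who, others])]
}

nearest(d_raw, "A")  # B: similar match count, very different win rate
nearest(d_std, "A")  # C: similar win rate once scales are comparable
```

On the raw scale, player A is matched with a player whose win rate is 40 points lower simply because their match counts are close, which is the behaviour standardization is meant to prevent.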
- The number of clusters is selected using the elbow and silhouette methods.
- Ward’s hierarchical clustering is applied for comparison.
- The similarity between the two partitions is measured using the Adjusted Rand Index (ARI).
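As a small illustration of what the ARI measures, the sketch below uses hypothetical cluster assignments for eight players (`labels_a`, `labels_b`, and `labels_c` are made up). It shows that the index depends on the grouping itself, not on the cluster numbers, which matters because K-means and `cutree()` number their clusters independently.

```r
library(mclust)  # provides adjustedRandIndex()

# Hypothetical cluster assignments for eight players
labels_a <- c(1, 1, 1, 2, 2, 3, 3, 3)

# Same grouping as labels_a, but with the cluster numbers permuted
labels_b <- c(3, 3, 3, 1, 1, 2, 2, 2)

# A genuinely different grouping
labels_c <- c(1, 2, 3, 1, 2, 3, 1, 2)

# Identical partitions give an ARI of 1 regardless of numbering
ari_same <- adjustedRandIndex(labels_a, labels_b)

# Unrelated partitions give a value well below 1 (around 0 for chance-level agreement)
ari_diff <- adjustedRandIndex(labels_a, labels_c)

c(same = ari_same, different = ari_diff)
```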
#Choose k (Elbow + Silhouette)
set.seed(42)
k_grid <- 2:8
#Elbow
wss <- sapply(k_grid, function(k) {
  kmeans(X, centers = k, nstart = 50, iter.max = 100)$tot.withinss
})
elbow_df <- data.frame(k = k_grid, tot_withinss = wss)
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() + geom_point() +
  labs(title = "Elbow Method (K-means)", x = "k", y = "Total Within-Cluster Sum of Squares")
Based on the graph, the total within-cluster sum of squares (WSS) decreases noticeably from k = 2 to k = 6, after which the rate of decrease slows and the curve begins to flatten. The elbow therefore lies around k = 6–7. Beyond this point, additional clusters yield only a limited further reduction in WSS, so the elbow criterion points to a value of k in this range.
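Reading an elbow off a plot is somewhat subjective. One way to make the judgment more explicit is to tabulate the proportional drop in WSS at each step, as in the sketch below. It runs on simulated data with three well-separated groups (`X_toy` is a hypothetical stand-in for the standardized feature matrix), so the numbers say nothing about the tennis data.

```r
set.seed(1)

# Hypothetical stand-in for the feature matrix: three Gaussian blobs in 2D
X_toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)

k_grid_toy <- 2:7
wss_toy <- sapply(k_grid_toy, function(k) {
  kmeans(X_toy, centers = k, nstart = 20)$tot.withinss
})

# Proportional reduction in WSS when moving from k - 1 to k clusters
rel_drop <- -diff(wss_toy) / head(wss_toy, -1)
data.frame(k = k_grid_toy[-1], rel_drop = round(rel_drop, 3))
```

In this simulated case the largest proportional drop occurs at the step to k = 3, the true number of groups, and later steps add comparatively little. Applied to `elbow_df`, the same table would show where the gains level off.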
#Silhouette
sil_avg <- sapply(k_grid, function(k) {
  km <- kmeans(X, centers = k, nstart = 50, iter.max = 100)
  mean(silhouette(km$cluster, dist(X))[, 3])
})
sil_df <- data.frame(k = k_grid, silhouette = sil_avg)
ggplot(sil_df, aes(x = k, y = silhouette)) +
  geom_line() + geom_point() +
  labs(title = "Average Silhouette (K-means)", x = "k", y = "Avg Silhouette")
Based on the silhouette plot, the average silhouette width reaches its highest value at k = 7 within the tested range (k = 2 to 8), meaning that cluster separation and internal cohesion are strongest there. Since higher silhouette values indicate better-defined clusters, and k = 7 also falls within the range suggested by the elbow plot, the seven-cluster solution is adopted.
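The average silhouette width can hide individual clusters that are poorly separated. A minimal sketch of a per-cluster check follows, again on simulated data (`X_toy` and `km_toy` are hypothetical and unrelated to the player data).

```r
library(cluster)  # provides silhouette()

set.seed(1)

# Hypothetical data: two well-separated Gaussian blobs in 2D
X_toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2)
)

km_toy <- kmeans(X_toy, centers = 2, nstart = 20)
sil <- silhouette(km_toy$cluster, dist(X_toy))

# Average silhouette width within each cluster
sil_by_cluster <- tapply(sil[, "sil_width"], sil[, "cluster"], mean)
sil_by_cluster

# Share of points with negative width, i.e. closer on average to another cluster
share_negative <- mean(sil[, "sil_width"] < 0)
share_negative
```

A cluster whose average width is near zero, or a high share of negative widths, would suggest that some of the seven groups overlap even when the overall average looks acceptable.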
#Pick k
k_best <- sil_df$k[which.max(sil_df$silhouette)]
k_best
## [1] 7
#K-means with Chosen k
km <- kmeans(X, centers = k_best, nstart = 100, iter.max = 200)
season_km <- season_features %>%
  mutate(cluster_kmeans = factor(km$cluster))
#Descriptive labels, assigned after inspecting the cluster summaries.
#K-means cluster numbers are arbitrary, so this mapping holds only for this seed and data.
cluster_labels <- c(
  "1" = "Mid-Tier Balanced",
  "2" = "Low-Performance Group",
  "3" = "Grass-Limited / Mixed",
  "4" = "Elite all-surface Dominant",
  "5" = "Clay-Weak / Participation-Driven",
  "6" = "High-Volume, Mixed Opposition",
  "7" = "Fast-Surface Leaning"
)
season_km <- season_km %>%
  mutate(cluster_label = recode(as.character(cluster_kmeans), !!!cluster_labels))
#Hierarchical Clustering (Ward)
hc <- hclust(dist(X), method = "ward.D2")
plot(hc,
     labels = FALSE,
     main = "Hierarchical Clustering Dendrogram",
     xlab = "",
     sub = "")
hc_cut <- cutree(hc, k = k_best)
season_km <- season_km %>%
  mutate(cluster_hclust = factor(hc_cut))
Based on the hierarchical clustering dendrogram, the data exhibit a clear multi-branch structure, with several groups merging only at relatively high linkage distances. The tree is cut at seven clusters to match the K-means solution, and the agreement between the two partitions is quantified with the ARI rather than read off the dendrogram. The separation between the major branches suggests that the clusters reflect structural differences in season-level performance patterns rather than arbitrary splits.
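Alongside a single summary index, a cross-tabulation shows where two partitions agree and where they split. The sketch below uses simulated data with three groups (`X_toy`, `km_toy`, and `hc_toy` are hypothetical). The `agreement` figure is a rough, purity-style share introduced here for illustration, not a replacement for the ARI.

```r
set.seed(1)

# Hypothetical data: three well-separated Gaussian blobs in 2D
X_toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.4), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.4), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.4), ncol = 2)
)

km_toy <- kmeans(X_toy, centers = 3, nstart = 20)$cluster
hc_toy <- cutree(hclust(dist(X_toy), method = "ward.D2"), k = 3)

# Rows: K-means clusters, columns: Ward clusters.
# Label numbers need not match; agreement shows up as one dominant cell per row.
xtab <- table(kmeans = km_toy, ward = hc_toy)
xtab

# Share of observations that fall in the dominant cell of their K-means row
agreement <- sum(apply(xtab, 1, max)) / sum(xtab)
agreement
```

For the player data, the equivalent call would be `table(season_km$cluster_kmeans, season_km$cluster_hclust)`, which would show which of the seven K-means clusters Ward’s method reproduces and which it divides differently.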
#ARI: agreement between the K-means and Ward partitions
#(1 = identical partitions, values near 0 = chance-level agreement)
ari <- mclust::adjustedRandIndex(as.integer(season_km$cluster_kmeans),
                                 as.integer(season_km$cluster_hclust))
#Average Number of Matches per Surface in Each K-means Cluster
surface_counts_summary <- season_km %>%
  left_join(
    season %>% select(player, hard_matches, clay_matches, grass_matches),
    by = "player"
  ) %>%
  group_by(cluster_kmeans) %>%
  summarise(
    n_players = n(),
    avg_hard_matches = mean(hard_matches),
    avg_clay_matches = mean(clay_matches),
    avg_grass_matches = mean(grass_matches)
  ) %>%
  arrange(cluster_kmeans)
surface_counts_summary
## # A tibble: 7 × 5
## cluster_kmeans n_players avg_hard_matches avg_clay_matches avg_grass_matches
## <fct> <int> <dbl> <dbl> <dbl>
## 1 1 22 21.0 13.3 5.27
## 2 2 18 10.6 14.2 3.06
## 3 3 17 19 11.6 1.94
## 4 4 11 42 17.9 8
## 5 5 9 18.1 5 5.22
## 6 6 9 27.1 20.7 6.44
## 7 7 8 38.4 6.5 9.25
The results indicate clear differences in surface participation across clusters. Clusters 4 and 7 show the highest average numbers of hard-court matches (42.0 and 38.4), and cluster 7 also has the highest grass-court average (9.25) alongside few clay matches (6.5), consistent with a schedule weighted toward faster surfaces. Cluster 3 has the lowest average grass-court participation (1.94 matches) and cluster 5 the lowest clay-court participation (5.0). Cluster 6 has the highest average clay match count (20.7), while cluster 1 sits at the median on all three surfaces.
Overall, these findings suggest that the clustering captures not only overall performance levels but also variation in surface participation, pointing to differences in seasonal playing profiles.
#PCA visualization
pca <- prcomp(X, center = TRUE, scale. = FALSE)
pca_df <- data.frame(
  player = season_km$player,
  PC1 = pca$x[, 1],
  PC2 = pca$x[, 2],
  cluster_kmeans = season_km$cluster_kmeans
) %>%
  dplyr::filter(!is.na(PC1), !is.na(PC2), !is.na(cluster_kmeans))
ggplot(pca_df, aes(x = PC1, y = PC2, color = cluster_kmeans)) +
  geom_point(size = 2) +
  labs(title = "PCA Projection of Season Features",
       x = "PC1", y = "PC2") +
  theme_minimal()
Looking at the PCA plot, the clusters appear to be separated to a noticeable extent along the two principal components. In particular, the groups located on the right-hand side (positive PC1 values) are clearly separated from those positioned on the left, suggesting that PC1 captures an important dimension of performance differences among players.
Cluster 4 (Elite all-surface Dominant) appears more distinct in the upper region of the plot, reflecting stronger overall performance characteristics. Cluster 2 (Low-Performance Group) is mainly concentrated on the right side with moderate dispersion, while Cluster 5 (Clay-Weak / Participation-Driven) and Cluster 3 (Grass-Limited / Mixed) occupy more central regions with partial overlap.
On the left-hand side, Cluster 6 (High-Volume, Mixed Opposition) and Cluster 1 (Mid-Tier Balanced) form relatively compact groupings, indicating internal similarity within these segments. Cluster 7 (Fast-Surface Leaning) is positioned closer to the center but still shows separation along PC2.
Although some overlap exists, particularly among mid-level clusters, the overall structure indicates that the segmentation captures multidimensional differences rather than a single linear ranking pattern.
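How much weight a two-dimensional PCA plot can bear depends on the share of variance the first two components capture and on which features load on them. The sketch below shows one way to inspect both. It uses random data with column names borrowed from the feature set (`X_toy` and `pca_toy` are hypothetical), so the printed values carry no meaning for the actual players. Note also that the sign of each principal component is arbitrary, so left versus right on the plot has no intrinsic direction.

```r
set.seed(1)

# Hypothetical standardized features; column names mirror the clustering features
X_toy <- scale(matrix(
  rnorm(300), ncol = 6,
  dimnames = list(NULL, c("win_pct", "hard_win_pct", "clay_win_pct",
                          "grass_win_pct", "avg_opp_rank", "matches"))
))

pca_toy <- prcomp(X_toy, center = TRUE, scale. = FALSE)

# Proportion of total variance captured by each component
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)

# Cumulative share covered by the two plotted components
round(cumsum(var_explained)[1:2], 3)

# Loadings: which features drive PC1 and PC2
round(pca_toy$rotation[, 1:2], 2)
```

Running the same two summaries on `pca` would indicate how much of the cluster separation is visible in the plot and which features PC1 and PC2 mainly represent.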
#Cluster Summaries
cluster_summary <- season_km %>%
  group_by(cluster_kmeans) %>%
  summarise(
    n_players = n(),
    avg_win_pct = mean(win_pct),
    avg_hard_win = mean(hard_win_pct),
    avg_clay_win = mean(clay_win_pct),
    avg_grass_win = mean(grass_win_pct),
    avg_opp_rank = mean(avg_opp_rank),
    avg_matches = mean(matches)
  ) %>%
  arrange(cluster_kmeans)
print(cluster_summary)
## # A tibble: 7 × 8
## cluster_kmeans n_players avg_win_pct avg_hard_win avg_clay_win avg_grass_win
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 22 0.515 0.465 0.544 0.562
## 2 2 18 0.376 0.206 0.480 0.201
## 3 3 17 0.507 0.521 0.521 0.0412
## 4 4 11 0.729 0.708 0.735 0.740
## 5 5 9 0.348 0.403 0.0457 0.372
## 6 6 9 0.588 0.527 0.615 0.568
## 7 7 8 0.589 0.637 0.250 0.594
## # ℹ 2 more variables: avg_opp_rank <dbl>, avg_matches <dbl>
#Top Players Inside Each Cluster
top_in_cluster <- season_km %>%
  arrange(cluster_kmeans, desc(win_pct)) %>%
  group_by(cluster_kmeans) %>%
  slice_head(n = 8) %>%
  select(cluster_kmeans, player, win_pct, hard_win_pct, clay_win_pct, grass_win_pct, avg_opp_rank, matches)
print(top_in_cluster)
## # A tibble: 56 × 8
## # Groups: cluster_kmeans [7]
## cluster_kmeans player win_pct hard_win_pct clay_win_pct grass_win_pct
## <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 Garin C. 0.559 0.5 0.611 0.559
## 2 1 Davidovich Fo… 0.556 0.594 0.529 0.4
## 3 1 Djere L. 0.556 0.565 0.542 0.571
## 4 1 Karatsev A. 0.55 0.542 0.636 0.4
## 5 1 Lehecka J. 0.549 0.531 0.5 0.714
## 6 1 Wawrinka S. 0.548 0.542 0.533 0.667
## 7 1 Kwon S.W. 0.545 0.545 0.545 0.545
## 8 1 Struff J.L. 0.541 0.375 0.643 0.714
## 9 2 Varillas J. P. 0.5 0 0.556 0.5
## 10 2 Munar J. 0.438 0.125 0.545 0.5
## # ℹ 46 more rows
## # ℹ 2 more variables: avg_opp_rank <dbl>, matches <int>
The cluster summaries highlight clear performance differences across the seven groups. Cluster 4 stands out with the highest overall win percentage (0.729) and consistently strong surface-specific win rates, indicating an elite and well-balanced performance profile.
Cluster 2 has the second-lowest overall win percentage (0.376), driven by weak hard-court (0.206) and grass-court (0.201) results despite a near-even clay record (0.480), suggesting a lower-performance segment that relies on clay.
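Cluster 3 presents moderate overall success (0.507) but an extremely low grass-court win rate (0.0412), consistent with its interpretation as a grass-limited group. Because this cluster also averages fewer than two grass matches per player, that win rate rests on very few matches.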
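Cluster 6 and Cluster 7 display similar, relatively high overall win percentages (0.588 and 0.589) but different surface profiles. Cluster 6 is balanced, with its best results on clay (0.615), whereas Cluster 7 is strong on hard (0.637) and grass (0.594) but weak on clay (0.250). Summing the surface averages gives roughly 54 matches per player in both clusters, so the two groups differ mainly in how their matches are distributed across surfaces rather than in match volume.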
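Cluster 1 represents a mid-tier balanced group, with an overall win rate of 0.515 and surface win rates between 0.465 and 0.562. Cluster 5 has the lowest overall win percentage (0.348) and a near-zero clay win rate (0.0457) on few clay matches (5.0 on average), so its profile is defined by a pronounced clay weakness rather than by moderate success.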
Overall, the summaries demonstrate that the clustering structure meaningfully differentiates players not only by overall success but also by surface specialization and performance consistency.
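Cluster profiles can also be read directly from the K-means centers, which are expressed in z-score units when the input is standardized. The sketch below, on simulated data with two hypothetical features (`toy`, `X_toy`, and `km_toy` are made up), shows the centers on the standardized scale and a back-transformation to the original units using the attributes stored by `scale()`.

```r
set.seed(1)

# Hypothetical players: a stronger and a weaker group on two features
toy <- data.frame(
  win_pct      = c(rnorm(15, 0.70, 0.03), rnorm(15, 0.40, 0.03)),
  clay_win_pct = c(rnorm(15, 0.72, 0.05), rnorm(15, 0.35, 0.05))
)

X_toy <- scale(toy)
km_toy <- kmeans(X_toy, centers = 2, nstart = 20)

# Centers in z-score units: positive means above the sample average
round(km_toy$centers, 2)

# Back-transform to original units: multiply by the column sd, add the column mean
centers_orig <- sweep(km_toy$centers, 2, attr(X_toy, "scaled:scale"), "*")
centers_orig <- sweep(centers_orig, 2, attr(X_toy, "scaled:center"), "+")
round(centers_orig, 3)
```

The back-transformed centers coincide with the per-cluster means of the original features, so for the player data `km$centers` would carry the same information as the summary table, expressed relative to the sample average.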
To sum up, this study grouped ATP players from the 2023 season based on season-level performance measures derived from match data. The seven-cluster structure revealed differentiated player profiles, including an elite group performing strongly across all surfaces, clusters shaped by surface specialization, and segments with comparatively lower overall success. The comparison with Ward’s hierarchical clustering, quantified by the ARI, was used to check that the grouping is stable rather than an artifact of a single algorithm.
Overall, the findings show that tennis performance cannot be fully explained by ranking alone. Instead, it reflects multiple dimensions such as surface performance, match participation, and competitive consistency. The clustering approach therefore helps to highlight meaningful structural differences among players that are not immediately visible through traditional ranking metrics.