Ski jumping performance is influenced by multiple factors beyond jump distance alone, including in-run speed, style points, wind compensation, and gate adjustments. As a result, performance data describing athletes is inherently multidimensional and often highly correlated. This creates challenges for both interpretation and exploratory analysis.
In this article, I use historical ski jumping competition data from a publicly available Kaggle dataset (covering seasons from 2009 onward) to construct aggregated performance profiles for individual athletes. The goal of the analysis is not to rank competitors, but to identify groups of athletes with similar performance characteristics using unsupervised learning techniques.
To support clustering, I explore dimension reduction methods such as Principal Component Analysis (PCA) and Multidimensional Scaling (MDS), comparing their impact on cluster structure and quality. While PCA provides insight into variance and correlation patterns among variables, MDS is used to preserve pairwise similarities between athletes, which proves useful for clustering-based exploration.
The analysis focuses on combining dimension reduction with clustering and on interpreting the resulting groups in terms of original performance features. Emphasis is placed on methodological transparency and on understanding how analytical choices influence the final segmentation of athletes.
The dataset consists of multiple tables describing athletes, competitions, and individual jump results. These files are loaded separately and later combined into a single analytical dataset.
# Packages used throughout: dplyr (wrangling), cluster/fpc (validation indices),
# factoextra (PCA plots), smacof (MDS), hopkins (clustering tendency)
library(dplyr); library(cluster); library(fpc)
library(factoextra); library(smacof); library(hopkins)
names <- read.csv("all_names.csv", sep = ",", dec = ".", stringsAsFactors = TRUE)
competitions <- read.csv("all_comps.csv", sep = ",", dec = ".", stringsAsFactors = TRUE)
results <- read.csv("all_results.csv", sep = ",", dec = ".", stringsAsFactors = TRUE)
# Keep only the jump-level variables needed later
results <- results[, c("speed", "dist", "note_points", "id", "wind_comp", "gate_points", "codex", "round")]
To focus on recent performance patterns and ensure temporal consistency, the analysis is restricted to competitions held from the 2021 season onward. In addition, athlete identifiers are cleaned to remove missing values and duplicated entries.
competitions <- competitions %>%
filter(season >= 2021)
names_clean <- names %>%
filter(!is.na(codex)) %>%
distinct(codex, .keep_all = TRUE)
Next, the competition metadata and athlete information are merged with the jump-level results. This step produces a unified dataset in which each observation corresponds to a single jump, enriched with contextual and athlete-level information.
df <- results %>%
inner_join(competitions[, c("id", "k.point", "season")], by = "id") %>%
inner_join(names_clean, by = "codex")
head(df[, c("id", "season", "codex", "round", "name", "speed", "dist", "k.point", "note_points", "wind_comp", "gate_points")])
## id season codex round name speed dist
## 1 2021JP3001RLQ 2021 5253 qualification eisenbichler markus 104.4 225.5
## 2 2021JP3001RLQ 2021 5288 qualification hayboeck michael 105.3 242.5
## 3 2021JP3001RLQ 2021 6880 qualification granerud halvor egner 103.9 221.5
## 4 2021JP3001RLQ 2021 5567 qualification geiger karl 104.6 228.0
## 5 2021JP3001RLQ 2021 6098 qualification tande daniel andre 104.2 229.5
## 6 2021JP3001RLQ 2021 4321 qualification stoch kamil 103.5 226.0
## k.point note_points wind_comp gate_points
## 1 200 55.0 7.7 9.2
## 2 200 49.5 -4.5 0.0
## 3 200 55.0 5.6 9.2
## 4 200 54.5 -1.9 9.2
## 5 200 55.5 -4.8 9.2
## 6 200 54.5 -2.9 9.2
Since raw performance variables are influenced by competition-specific conditions, several relative and standardized measures are constructed. Jump distances are transformed into within-round z-scores to account for hill and condition effects, gate compensation is centered relative to the round average, and in-run speed is expressed relative to the mean speed observed in a given competition. Additionally, a binary indicator is created to capture whether a jump exceeds the hill’s K-point.
df <- df %>%
  group_by(id, round) %>%
  mutate(
    # Standardize distance within each competition round (hill and condition effects)
    dist_zscore = (dist - mean(dist, na.rm = TRUE)) / sd(dist, na.rm = TRUE),
    # Center gate compensation on the round average
    rel_gate = gate_points - mean(gate_points, na.rm = TRUE)
  ) %>%
  group_by(id) %>%
  mutate(
    # In-run speed relative to the competition mean
    rel_speed = speed / mean(speed, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  mutate(
    # Indicator: did the jump reach the hill's K-point?
    beyond_k = if_else(dist >= k.point, 1, 0)
  )
These transformations allow jump-level outcomes to be compared across different competitions and conditions, forming the basis for subsequent aggregation into athlete-level performance profiles.
After constructing jump-level performance measures, the next step is to aggregate the data at the athlete level. Rather than analyzing individual jumps, the goal is to obtain stable summaries that reflect typical performance characteristics of each athlete across multiple competitions.
To ensure data quality and robustness, observations with missing values in key performance variables are removed. The data is then grouped by athlete identifier and name, and a set of summary statistics is computed. These include average standardized distance performance, relative in-run speed, style points, wind compensation, and gate-related tactical adjustments. In addition, the proportion of jumps exceeding the hill’s K-point is calculated as an indicator of consistency at longer distances.
df_final <- df %>%
filter(!is.na(dist_zscore), !is.na(rel_speed), !is.na(note_points), !is.na(wind_comp), !is.na(gate_points)) %>%
group_by(codex, name) %>%
summarise(
avg_performance = mean(dist_zscore),
avg_speed_rel = mean(rel_speed),
avg_style = mean(note_points),
avg_wind = mean(wind_comp),
avg_gate_tactical = mean(rel_gate),
share_beyond_k = mean(beyond_k),
n_jumps = n(),
.groups = "drop"
) %>%
filter(n_jumps >= 10)
Athletes with fewer than ten recorded jumps are excluded to reduce the influence of outliers and to ensure that the resulting profiles are based on a sufficient number of observations. The final dataset contains one row per athlete and serves as the basis for subsequent exploratory and clustering analyses.
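To verify the shape of the aggregated table before modeling, a quick structural check can be run (a sketch; glimpse is re-exported by dplyr):
# One row per athlete: identifiers, six performance features, and a jump count
glimpse(df_final)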
Before applying dimension reduction and clustering methods, the selected performance features are extracted and standardized. Standardization ensures that all variables contribute equally to distance-based methods and prevents features with larger numerical ranges from dominating the analysis.
df_features <- df_final %>%
select(
avg_performance,
avg_speed_rel,
avg_style,
avg_wind,
avg_gate_tactical,
share_beyond_k
)
df_scaled <- scale(df_features)
head(df_scaled)
## avg_performance avg_speed_rel avg_style avg_wind avg_gate_tactical
## [1,] 0.9197448 0.2574233 0.15860200 0.14080574 1.5360571
## [2,] 0.1611022 0.0505097 -0.02944119 0.37829962 -0.9767621
## [3,] -3.1084342 -0.2422965 -1.94177696 -0.02581361 -0.5399374
## [4,] 0.2356514 0.6883660 0.33886162 0.69277162 0.2242367
## [5,] 0.7773441 -0.1179216 1.44937098 0.90205708 0.7937005
## [6,] 1.1308910 0.6058229 1.14197730 0.23920247 0.1941370
## share_beyond_k
## [1,] -0.7751964
## [2,] -0.8020197
## [3,] -1.0434294
## [4,] 0.8268323
## [5,] 1.6593092
## [6,] 1.6316507
The standardized athlete-level feature matrix constitutes the final input for dimension reduction techniques such as PCA and MDS, as well as for the clustering procedures explored in the next sections.
Before applying dimension reduction and clustering methods, it is useful to check whether the athlete-level data shows any meaningful structure that could justify clustering. If the observations were randomly distributed, clustering results would be unreliable regardless of the method used. For this purpose, the Hopkins statistic is computed on the standardized performance features.
set.seed(123)
# m = number of points sampled when estimating the statistic (here nearly all rows)
hopkins_stat <- hopkins(df_scaled, m = nrow(df_scaled) - 1)
hopkins_stat
## [1] 0.9990957
The Hopkins statistic equals 0.999. Values near 0.5 indicate spatially random data, while values approaching 1 indicate a strong clustering tendency, so this result provides clear evidence that the data contains a pronounced clustering structure and justifies the use of dimension reduction and clustering techniques in the analysis.
The correlation matrix of standardized performance features reveals a clear structure within the data. Strong positive correlations are observed between average performance, style points, and the share of jumps exceeding the K-point, indicating that these variables capture closely related aspects of jump quality and effectiveness.
In contrast, relative in-run speed and wind compensation show weak correlations with most other features, suggesting that they represent more independent or context-driven dimensions of performance. Gate-related adjustments occupy an intermediate position, exhibiting moderate associations with performance outcomes.
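The correlations discussed above can be reproduced directly from the standardized feature matrix; a minimal base-R sketch (a heatmap via a package such as corrplot is an optional extra):
# Pairwise correlations between the athlete-level features
round(cor(df_scaled), 2)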
The presence of strongly correlated performance features indicates redundancy in the original feature space. In addition, distance-based clustering methods are sensitive to correlated variables, which may distort similarity relationships between athletes. Dimension reduction is therefore introduced to summarize the data into a smaller set of latent dimensions while preserving its essential structure.
Principal Component Analysis (PCA) is applied to explore the variance structure of the standardized performance features and to assess the potential for dimensionality reduction. The component loadings indicate that the leading principal components are mainly driven by variables related to jump quality and effectiveness, confirming the presence of redundancy among the original features.
The scree plot shows a rapid decline in explained variance after the first few components, suggesting that most of the information is captured by a small number of them.
pca <- prcomp(df_scaled, center = TRUE, scale. = TRUE)
round(pca$rotation, 3)
## PC1 PC2 PC3 PC4 PC5 PC6
## avg_performance -0.509 0.167 0.095 0.064 -0.833 -0.071
## avg_speed_rel -0.231 -0.146 0.875 -0.346 0.190 -0.060
## avg_style -0.513 -0.094 -0.005 0.389 0.263 0.713
## avg_wind -0.194 -0.853 -0.311 -0.354 -0.113 -0.020
## avg_gate_tactical -0.347 0.463 -0.330 -0.708 0.203 0.120
## share_beyond_k -0.516 0.004 -0.142 0.315 0.383 -0.685
pca_data <- pca$x[, 1:4]
fviz_eig(pca, addlabels = TRUE, barfill = "gray", barcolor = "black")
The cumulative scree plot indicates that approximately 90% of the total variance is explained by the first four principal components. This confirms that the six-dimensional feature space can be effectively summarized using a lower-dimensional representation with minimal information loss.
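This figure can be confirmed from the PCA object itself; a short check using the component standard deviations stored in pca:
# Cumulative proportion of variance explained by successive components
round(cumsum(pca$sdev^2) / sum(pca$sdev^2), 3)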
While PCA summarizes variance in the data, it does not explicitly preserve pairwise distances between observations. Since clustering methods rely on distance relationships, Multidimensional Scaling (MDS) is applied as an alternative low-dimensional representation.
MDS is performed on the Euclidean distance matrix of the standardized performance features and projects athletes into a two-dimensional space while minimizing distortion of inter-athlete distances.
# Distance matrix
dist_raw <- dist(df_scaled, method = "euclidean")
# Metric MDS (2 dimensions)
mds_fit <- mds(dist_raw, ndim = 2)
# Configuration
X_mds <- mds_fit$conf
# Stress value
mds_fit$stress
## [1] 0.1683938
The obtained stress value (≈ 0.17) indicates a moderate but acceptable level of distortion: by Kruskal's rough benchmarks, stress around 0.2 is considered poor and around 0.1 fair, so the two-dimensional configuration provides a reasonable approximation of the original six-dimensional feature space. While some information loss is unavoidable, the MDS map preserves the main distance relationships between athletes and offers an interpretable low-dimensional visualization.
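To see where this distortion arises, a Shepard diagram can be drawn, plotting the fitted configuration distances against the observed dissimilarities; the call below assumes the plot method provided by the smacof package:
# Shepard diagram: observed dissimilarities vs. fitted configuration distances
plot(mds_fit, plot.type = "Shepard", main = "Shepard Diagram (Metric MDS)")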
plot(
X_mds[,1], X_mds[,2],
pch = 19, col = "gray50",
xlab = "MDS Dimension 1",
ylab = "MDS Dimension 2",
main = "Athletes in MDS Space"
)
The absence of clearly separated groups at this stage suggests that any clustering structure is not visually obvious and must be uncovered with formal clustering methods. The MDS space therefore serves as a suitable foundation for the subsequent clustering analysis.
To assess the impact of dimension reduction on clustering quality, k-means clustering was applied in three feature spaces: the original standardized feature space, the PCA-reduced space, and the two-dimensional MDS space. Clustering performance was evaluated using the average silhouette width and the Calinski–Harabasz index.
# Three candidate feature spaces for clustering
X_raw <- df_scaled   # original standardized features
X_pca <- pca_data    # first four principal components
# X_mds, the two-dimensional MDS configuration, was created above
cluster_quality <- function(X, k_range = 2:8) {
  d <- dist(X)
  res <- t(sapply(k_range, function(k) {
    # Fit k-means once per k so both indices describe the same partition
    km <- kmeans(X, centers = k, nstart = 50)
    c(
      sil = mean(silhouette(km$cluster, d)[, 3]),
      CH = calinhara(X, km$cluster)
    )
  }))
  data.frame(k = k_range, silhouette = res[, "sil"], CH = res[, "CH"])
}
raw_results <- cluster_quality(X_raw)
pca_results <- cluster_quality(X_pca)
mds_results <- cluster_quality(X_mds)
comparison_ext <- data.frame(
k = raw_results$k,
sil_raw = raw_results$silhouette,
sil_pca = pca_results$silhouette,
sil_mds = mds_results$silhouette,
CH_raw = raw_results$CH,
CH_pca = pca_results$CH,
CH_mds = mds_results$CH
)
print(comparison_ext)
## k sil_raw sil_pca sil_mds CH_raw CH_pca CH_mds
## 1 2 0.3102306 0.3311604 0.4006431 202.8749 224.5985 314.5117
## 2 3 0.2383430 0.2472587 0.3682627 156.1521 171.6385 266.9319
## 3 4 0.2513684 0.2657572 0.3362056 138.7429 155.9833 262.3063
## 4 5 0.2463712 0.2660749 0.3414546 129.2932 148.2052 260.9390
## 5 6 0.2310756 0.2240178 0.3496848 118.3344 135.8614 255.5226
## 6 7 0.1975816 0.2226186 0.3162200 111.0756 129.4851 254.0792
## 7 8 0.1933792 0.2111580 0.3357214 106.7926 124.9399 259.5679
Across all tested numbers of clusters, clustering in the MDS space consistently achieves the highest silhouette values, with the strongest separation observed for k = 2 (silhouette ≈ 0.40). In contrast, clustering without dimension reduction yields the lowest silhouette values, while PCA provides a moderate improvement but remains clearly below MDS.
A similar pattern is observed for the Calinski–Harabasz index, which attains its highest values in the MDS representation for all values of k. Although silhouette values gradually decrease as the number of clusters increases, the MDS-based clustering remains more stable than the alternatives, indicating better preservation of meaningful distance relationships between athletes.
Overall, these results suggest that Multidimensional Scaling provides a feature space that is more suitable for distance-based clustering than both the original and PCA-reduced representations. Consequently, the MDS space is selected for the final clustering and interpretation of athlete performance profiles.
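The table can also be summarized visually; a small base-R sketch plotting the three silhouette profiles side by side:
# Silhouette width across k for the three candidate feature spaces
matplot(
  comparison_ext$k,
  comparison_ext[, c("sil_raw", "sil_pca", "sil_mds")],
  type = "b", pch = 19, lty = 1, col = 1:3,
  xlab = "Number of clusters", ylab = "Average silhouette width",
  main = "Clustering Quality by Feature Space"
)
legend("topright", legend = c("Raw", "PCA", "MDS"), col = 1:3, pch = 19, lty = 1)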
Based on the previous comparison, the two-dimensional MDS representation was selected as the final feature space for clustering, as it consistently yielded higher internal validity scores than both the raw and PCA-reduced spaces. Hierarchical clustering with Ward’s linkage is applied in this space to identify groups of athletes with similar performance profiles.
Hierarchical clustering is particularly well suited for exploratory analysis, as it does not require fixing the number of clusters in advance and allows the structure of the data to be examined at multiple levels of granularity.
# Hierarchical clustering in MDS space
dist_mds <- dist(X_mds, method = "euclidean")
hc_mds <- hclust(dist_mds, method = "ward.D2")
plot(
hc_mds,
labels = FALSE,
hang = -1,
main = "Hierarchical Clustering Dendrogram (MDS Space)",
xlab = "",
ylab = "Height"
)
To assess the suitability of hierarchical clustering for the final segmentation, average silhouette width was computed for different numbers of clusters.
sil_values <- sapply(2:6, function(k){
cl <- cutree(hc_mds, k = k)
mean(silhouette(cl, dist_mds)[, 3])
})
sil_df <- data.frame(
k = 2:6,
silhouette = sil_values
)
plot(
sil_df$k, sil_df$silhouette,
type = "b", pch = 19,
xlab = "Number of clusters",
ylab = "Average silhouette width",
main = "Silhouette Analysis for Hierarchical Clustering (MDS)"
)
The silhouette coefficient reaches its maximum for k = 2, indicating a strong but very coarse partition that primarily separates athletes by overall performance level. For larger values of k, silhouette values decrease substantially, suggesting limited separation between clusters.
Although hierarchical clustering provides a useful exploratory view of the data, its clustering quality is consistently lower than that achieved by k-means in the same MDS space. This indicates that the underlying structure of the data is better approximated by compact, centroid-based clusters rather than a hierarchical organization.
Consequently, k-means clustering is selected as the final clustering method, while hierarchical clustering is retained as an exploratory step to support understanding of the data structure.
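The comparison underlying this choice can be made explicit by aligning the two silhouette profiles computed in the same MDS space:
# Hierarchical vs. k-means silhouettes in the MDS space (k = 2..6)
data.frame(
  k = sil_df$k,
  sil_hclust = sil_df$silhouette,
  sil_kmeans = comparison_ext$sil_mds[comparison_ext$k %in% sil_df$k]
)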
Based on the previous comparison, k-means clustering is applied in the two-dimensional MDS space, which demonstrated the highest internal validity across all tested configurations. The final number of clusters is selected based on the silhouette criterion and interpretability considerations.
Silhouette analysis indicates that the highest average silhouette width is obtained for k = 2, corresponding to a strong but very coarse partition that primarily separates athletes by overall performance level. While well separated, this solution offers limited insight into qualitative differences between performance profiles.
A local maximum of the silhouette coefficient is observed around k = 6, suggesting a reasonable balance between cluster separation and interpretability. Rather than optimizing silhouette alone, the choice of k = 6 is motivated by the analytical goal of identifying distinct performance archetypes. The six-cluster solution captures meaningful differences across multiple dimensions, including performance quality, consistency, style scores, wind sensitivity, gate effects, and competitive exposure.
set.seed(123)
k_final <- 6
km_mds <- kmeans(X_mds, centers = k_final, nstart = 50)
table(km_mds$cluster)
##
## 1 2 3 4 5 6
## 108 18 78 29 56 91
The resulting clusters are visualized in the MDS space. Each point represents an athlete, and colors indicate cluster membership.
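A minimal base-R version of this visualization (factoextra's fviz_cluster would serve equally well):
# Final k-means clusters plotted in the two-dimensional MDS space
plot(
  X_mds[, 1], X_mds[, 2],
  col = km_mds$cluster, pch = 19,
  xlab = "MDS Dimension 1",
  ylab = "MDS Dimension 2",
  main = "K-means Clusters in MDS Space"
)
points(km_mds$centers, pch = 4, cex = 2, lwd = 2)  # cluster centroids
legend("topright", legend = paste("Cluster", 1:k_final), col = 1:k_final, pch = 19)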
Although the clusters are not perfectly separated in the two-dimensional projection, they form compact regions with distinct centers. The observed overlap reflects the continuous nature of athlete performance profiles rather than a failure of the clustering procedure. In practice, ski jumping performance varies gradually across athletes, and sharp boundaries between groups are not expected.
To interpret the identified clusters, cluster labels are merged back into the athlete-level dataset, and average values of the original performance features are computed for each cluster.
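A sketch of that aggregation, assuming the rows of df_final are in the same order as the MDS configuration used for clustering:
cluster_profiles <- df_final %>%
  mutate(cluster = km_mds$cluster) %>%
  group_by(cluster) %>%
  summarise(
    n_athletes = n(),
    avg_jumps_per_athlete = mean(n_jumps),
    across(c(avg_performance, avg_speed_rel, avg_style,
             avg_wind, avg_gate_tactical, share_beyond_k), mean),
    .groups = "drop"
  )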
| cluster | n_athletes | avg_jumps_per_athlete | avg_performance | avg_speed_rel | avg_style | avg_wind | avg_gate_tactical | share_beyond_k |
|---|---|---|---|---|---|---|---|---|
| 1 | 108 | 73.630 | 0.302 | 1.001 | 52.262 | -0.155 | 0.033 | 0.618 |
| 2 | 18 | 91.556 | 0.955 | 1.000 | 51.861 | -0.379 | 1.498 | 0.681 |
| 3 | 78 | 42.795 | -0.351 | 1.000 | 49.381 | 0.820 | -0.414 | 0.172 |
| 4 | 29 | 19.414 | -0.804 | 0.995 | 45.945 | -5.550 | -0.186 | 0.052 |
| 5 | 56 | 29.411 | -1.065 | 0.997 | 45.712 | -0.655 | -0.417 | 0.013 |
| 6 | 91 | 39.571 | -0.064 | 1.000 | 49.189 | -2.515 | -0.119 | 0.163 |
The six-cluster solution reveals meaningful heterogeneity in ski jumping performance profiles, extending beyond a simple ranking of athletes by overall quality.
Cluster 2 represents a small group of elite athletes. They combine the highest average standardized performance with the largest share of jumps exceeding the K-point and the highest average number of jumps, indicating both exceptional quality and sustained participation at the top competitive level. The strongly positive gate-related adjustments suggest that these athletes often compete under more demanding starting conditions.
Cluster 1 consists of high-performing and highly consistent athletes. While their average performance is lower than that of Cluster 2, they still achieve a high share of jumps beyond the K-point and display strong style scores. This group appears to represent stable top-level competitors slightly below the absolute elite.
Cluster 6 forms a broad group of average performers. Their standardized performance is close to zero, with moderate style scores and a limited share of jumps beyond the K-point. These athletes likely represent the competitive middle of the field, regularly qualifying for competitions but without consistent top results.
Cluster 3 includes athletes with below-average performance but relatively neutral wind conditions. Despite moderate participation, their share of jumps beyond the K-point remains low, suggesting limitations in distance or consistency rather than external conditions.
Clusters 4 and 5 capture lower-performing athletes with the weakest outcomes across most performance dimensions. Cluster 4 is characterized by extremely unfavorable wind compensation and very low participation, while Cluster 5 shows the lowest overall performance and minimal success in exceeding the K-point. These clusters likely correspond to athletes at early career stages or those struggling to maintain competitiveness at the highest level.
Overall, the six-cluster solution highlights a gradual performance continuum rather than sharply separated groups. The observed overlap between clusters is consistent with the continuous nature of athletic performance, while the cluster profiles provide a structured and interpretable segmentation of athletes based on quality, consistency, and competitive exposure.
The clustering results reveal that differences between ski jumpers are not driven by a single dominant factor, but rather by consistent combinations of performance characteristics. The strongest athletes are distinguished not only by high standardized jump distance and style scores, but also by a high share of jumps exceeding the K-point and a large number of attempts in the analyzed period, indicating both effectiveness and long-term competitiveness. In contrast, weaker clusters are characterized by lower distance performance, poorer style evaluations, and a very small proportion of jumps beyond the K-point, reflecting limited ability to consistently achieve long flights.
Importantly, the analysis highlights the role of context-related variables, such as wind compensation and gate-related tactical adjustments. Stronger clusters tend to operate closer to neutral wind conditions and rely less on extreme wind compensation, suggesting that their performance is less dependent on favorable external conditions. Similarly, lower absolute values of gate tactical adjustments among top clusters indicate greater robustness to starting gate changes, while weaker clusters appear more sensitive to gate and wind-related variability.
Overall, this study demonstrates that unsupervised learning methods, combined with dimension reduction, can provide meaningful insights into ski jumping performance beyond simple rankings. Rather than classifying athletes as “good” or “bad,” the proposed framework identifies distinct performance profiles, capturing how athletes succeed or struggle under varying competitive conditions. As such, the analysis offers a useful exploratory perspective that can support performance evaluation, coaching insights, and further sport-specific modeling.
This study has several limitations. First, athlete performance is represented using aggregated averages, which provide stable profiles but may hide short-term fluctuations and differences in form across competitions. Second, the Multidimensional Scaling (MDS) representation reduces the original feature space to two dimensions, which inevitably leads to some information loss. Therefore, the identified clusters should be interpreted as approximate groupings rather than exact or definitive categories.
Future work could extend the analysis by incorporating additional explanatory variables that are not available in the current dataset. In particular, collecting information about athletes’ nationality would allow investigation of whether competitors from certain countries tend to share similar performance profiles, potentially reflecting differences in training systems or coaching approaches. Adding further contextual or training-related variables could also help better explain why athletes fall into specific clusters and strengthen the practical interpretation of the results.