Ski jumping is a sport in which athletes slide down a ramp and aim to jump as far as possible with solid technique. Their points are determined by a combination of the distance they achieve and the technical points awarded by five judges, of which the highest and the lowest marks are excluded.
We collected the data from the publicly available FIS database. It contains all FIS Men’s Individual World Cup races except the Four Hills Tournament, which we left out because its match-up format could compromise the robustness of the analysis.
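The scoring rule for technical points can be illustrated with a short sketch. It assumes that the three remaining marks are summed after the highest and the lowest are dropped; the function name `technical_points_from_judges` and the example marks are made up for illustration and are not taken from the data.

```r
# Hypothetical helper: technical points from the marks of five judges.
# Assumption: the highest and the lowest mark are dropped and the
# remaining three marks are summed.
technical_points_from_judges <- function(marks) {
  stopifnot(length(marks) == 5)
  sum(marks) - max(marks) - min(marks)
}

# Illustrative marks, not real data
marks <- c(18.0, 18.5, 18.5, 19.0, 17.5)

# 19.0 and 17.5 are dropped: 18.0 + 18.5 + 18.5 = 55
technical_points_from_judges(marks)
```

On the actual data, the same rule could in principle be checked against the `technical_points` column using the columns `judge_A` to `judge_E`.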
head(ski_jumping)
## # A tibble: 6 × 23
## race race_location hill_type round name nationality speed distance_meter
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 Nizhny T… RUS LH 2 GEIG… GER 86.1 134.
## 2 Nizhny T… RUS LH 1 GEIG… GER 86.1 133
## 3 Nizhny T… RUS LH 2 KOBA… JPN 85.6 128
## 4 Nizhny T… RUS LH 1 KOBA… JPN 85.4 131
## 5 Nizhny T… RUS LH 2 GRAN… NOR 85.6 133
## 6 Nizhny T… RUS LH 1 GRAN… NOR 85.3 125
## # … with 15 more variables: distance_points <dbl>, judge_A <dbl>,
## # judge_B <dbl>, judge_C <dbl>, judge_D <dbl>, judge_E <dbl>,
## # technical_points <dbl>, gate_number <dbl>, gate_points <dbl>, wind <dbl>,
## # wind_points <dbl>, total_points <dbl>, total_race_points <dbl>, rank <dbl>,
## # overall_race_rank <dbl>
In the following analysis, we often distinguish between large and flying hills. We therefore first split the data into two subsets: large hills (LH) and flying hills (FH). Summary statistics for selected variables are shown below.
LH <- ski_jumping %>%
filter(hill_type == 'LH')
FH <- ski_jumping %>%
filter(hill_type == 'FH')
LH %>%
select(speed, distance_meter, distance_points, technical_points, total_points, total_race_points) %>%
summary
## speed distance_meter distance_points technical_points
## Min. :84.60 Min. : 85.5 Min. :-20.10 Min. :20.50
## 1st Qu.:87.60 1st Qu.:117.5 1st Qu.: 51.90 1st Qu.:51.00
## Median :89.10 Median :124.0 Median : 63.60 Median :52.50
## Mean :89.17 Mean :123.4 Mean : 61.46 Mean :52.54
## 3rd Qu.:90.80 3rd Qu.:130.0 3rd Qu.: 72.60 3rd Qu.:54.00
## Max. :93.50 Max. :152.0 Max. :108.60 Max. :58.50
## total_points total_race_points
## Min. : 15.2 Min. : 15.2
## 1st Qu.:103.0 1st Qu.:105.7
## Median :117.7 Median :129.3
## Mean :114.4 Mean :158.7
## 3rd Qu.:128.7 3rd Qu.:235.4
## Max. :170.1 Max. :324.5
FH %>%
select(speed, distance_meter, distance_points, technical_points, total_points, total_race_points) %>%
summary
## speed distance_meter distance_points technical_points
## Min. :100.0 Min. :148.0 Min. : 57.6 Min. :23.00
## 1st Qu.:102.2 1st Qu.:203.4 1st Qu.:124.0 1st Qu.:52.50
## Median :103.5 Median :215.0 Median :138.0 Median :54.00
## Mean :103.2 Mean :213.7 Mean :136.4 Mean :53.76
## 3rd Qu.:104.0 3rd Qu.:225.0 3rd Qu.:150.0 3rd Qu.:55.50
## Max. :106.9 Max. :246.0 Max. :175.2 Max. :59.00
## total_points total_race_points
## Min. : 85.3 Min. : 85.3
## 1st Qu.:176.7 1st Qu.:188.0
## Median :189.7 Median :230.6
## Mean :190.6 Mean :279.5
## 3rd Qu.:207.8 3rd Qu.:379.0
## Max. :252.0 Max. :468.2
Before going into any analysis, we visualize the relationship between speed and distance (in meters) for large and flying hills, respectively.
ggplot(LH, aes(speed, distance_meter)) +
  geom_point() +
  geom_smooth(method = 'lm', col = 'red', se = FALSE) +
  labs(title = 'Relationship between speed and distance for large hills', x = 'Speed', y = 'Distance') +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'),
        axis.title.x = element_text(face = 'bold'),
        axis.title.y = element_text(face = 'bold'))
## `geom_smooth()` using formula 'y ~ x'
ggplot(FH, aes(speed, distance_meter)) +
  geom_point() +
  geom_smooth(method = 'lm', col = 'red', se = FALSE) +
  labs(title = 'Relationship between speed and distance for flying hills', x = 'Speed', y = 'Distance') +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'),
        axis.title.x = element_text(face = 'bold'),
        axis.title.y = element_text(face = 'bold'))
## `geom_smooth()` using formula 'y ~ x'
We observe a positive linear trend between the speed and the distance for both large and flying hills, even though large outliers occur.
To formally test for the correlation, we first check whether the variables speed and distance are normally distributed for large and flying hills, respectively.
shapiro.test(LH$speed)
##
## Shapiro-Wilk normality test
##
## data: LH$speed
## W = 0.97931, p-value = 3.517e-14
shapiro.test(FH$speed)
##
## Shapiro-Wilk normality test
##
## data: FH$speed
## W = 0.96272, p-value = 2.065e-06
shapiro.test(LH$distance_meter)
##
## Shapiro-Wilk normality test
##
## data: LH$distance_meter
## W = 0.98883, p-value = 1.491e-09
shapiro.test(FH$distance_meter)
##
## Shapiro-Wilk normality test
##
## data: FH$distance_meter
## W = 0.98247, p-value = 0.002235
The Shapiro-Wilk normality test indicates that we cannot assume normality for speed or distance on either large or flying hills. This is plausible, because each group pools races held on hills of different sizes. However, within a single race, normality of speed and distance often cannot be rejected. As an example, we test the first race in Lahti below.
lahti1 <- LH %>%
filter(race == 'Lahti 1')
shapiro.test(lahti1$distance_meter)
##
## Shapiro-Wilk normality test
##
## data: lahti1$distance_meter
## W = 0.97905, p-value = 0.1876
shapiro.test(lahti1$speed)
##
## Shapiro-Wilk normality test
##
## data: lahti1$speed
## W = 0.97583, p-value = 0.115
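To see how often normality holds within races, the single-race check can be repeated for every race. The sketch below uses simulated toy data rather than the FIS data: the data frame `toy`, its race labels, and its values are invented, and only the column names `race` and `distance_meter` follow the data set.

```r
# Toy data: three simulated "races" with different typical distances.
# All values are invented for illustration.
set.seed(1)
toy <- data.frame(
  race = rep(c("Race A", "Race B", "Race C"), each = 50),
  distance_meter = c(rnorm(50, mean = 125, sd = 8),
                     rnorm(50, mean = 130, sd = 6),
                     rnorm(50, mean = 210, sd = 12))
)

# Shapiro-Wilk p-value for each race separately
p_values <- sapply(split(toy$distance_meter, toy$race),
                   function(x) shapiro.test(x)$p.value)

# Share of races for which normality is not rejected at the 0.05 level
share_not_rejected <- mean(p_values > 0.05)

# For comparison: the same test on the pooled data
p_pooled <- shapiro.test(toy$distance_meter)$p.value

print(p_values)
print(share_not_rejected)
print(p_pooled)
```

With the real data, `split(LH$distance_meter, LH$race)` would take the place of the toy split.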
Since speed and distance are not normally distributed, we use a non-parametric test, Kendall's rank correlation test, to test for correlation between speed and distance on large and flying hills, respectively.
cor.test(LH$speed, LH$distance_meter, method = "kendall")
##
## Kendall's rank correlation tau
##
## data: LH$speed and LH$distance_meter
## z = 13.213, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.2269388
cor.test(FH$speed, FH$distance_meter, method = "kendall")
##
## Kendall's rank correlation tau
##
## data: FH$speed and FH$distance_meter
## z = 6.1933, p-value = 5.892e-10
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.2583555
The correlation coefficients are positive but small (tau ≈ 0.23 for large hills and tau ≈ 0.26 for flying hills). However, since both p-values are below the significance level of 0.05, we conclude that speed and distance are significantly correlated on both large and flying hills.
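For readers less familiar with Kendall's tau, the following sketch computes it by hand on a tiny made-up example. In the absence of ties, tau is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. The helper `kendall_tau_manual` and the vectors `x` and `y` are illustrative only.

```r
# Hypothetical helper: Kendall's tau without tie correction
kendall_tau_manual <- function(x, y) {
  idx <- combn(length(x), 2)        # all index pairs (i, j) with i < j
  dx <- x[idx[1, ]] - x[idx[2, ]]
  dy <- y[idx[1, ]] - y[idx[2, ]]
  # sign product is +1 for concordant pairs and -1 for discordant pairs
  sum(sign(dx) * sign(dy)) / ncol(idx)
}

# Made-up example without ties: 8 concordant and 2 discordant pairs
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

tau_manual <- kendall_tau_manual(x, y)        # (8 - 2) / 10 = 0.6
tau_builtin <- cor(x, y, method = "kendall")

all.equal(tau_manual, tau_builtin)
```

Read this way, a tau of roughly 0.23 means that concordant pairs outnumber discordant pairs only modestly, which is compatible with a very small p-value when the sample is large.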
Next, we analyse the relationship between the distance and the technical points. First, we run a correlation test using the Kendall test and the flying hill data:
cor.test(FH$distance_meter, FH$technical_points, method = "kendall")
##
## Kendall's rank correlation tau
##
## data: FH$distance_meter and FH$technical_points
## z = 13.262, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.5630862
We observe a significant, moderate correlation between distance and technical points (tau ≈ 0.56). This is unsurprising, since a better-performing ski jumper tends to achieve both longer distances and cleaner telemark landings. Things get more interesting when we restrict the data to distances of at least 230 meters (which are rare and well above the average) and rerun the correlation test:
large_dist <- FH %>%
filter(distance_meter >= 230)
cor.test(large_dist$distance_meter, large_dist$technical_points, method = "kendall")
## Warning in cor.test.default(large_dist$distance_meter,
## large_dist$technical_points, : Cannot compute exact p-value with ties
##
## Kendall's rank correlation tau
##
## data: large_dist$distance_meter and large_dist$technical_points
## z = -1.2, p-value = 0.2301
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.1259485
For distances of at least 230 meters, we get a small negative correlation between distance and technical points (tau ≈ -0.13), although it is not statistically significant (p = 0.23). The change of sign hints that, above a certain threshold (~230 m), each additional meter may come at the expense of the telemark and thus of technical points. Hence, above that threshold there may be a trade-off between technical points and distance, but this subsample does not provide significant evidence for it.
When we restrict the data to distances of at least 235 meters, the correlation coefficient becomes more negative (tau ≈ -0.23), although it remains non-significant (p = 0.14):
larger_dist <- FH %>%
filter(distance_meter >= 235)
cor.test(larger_dist$distance_meter, larger_dist$technical_points, method = "kendall")
## Warning in cor.test.default(larger_dist$distance_meter,
## larger_dist$technical_points, : Cannot compute exact p-value with ties
##
## Kendall's rank correlation tau
##
## data: larger_dist$distance_meter and larger_dist$technical_points
## z = -1.4738, p-value = 0.1405
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.2316945
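The two cutoffs above can be generalized to a sweep over several thresholds, reporting the subsample size alongside tau and the p-value, since small subsamples have little power. The sketch below is run on simulated toy data: the data frame `toy_fh`, the helper `tau_above`, and the chosen thresholds are assumptions for illustration, so its output says nothing about the real races.

```r
# Simulated flying-hill data, invented for illustration only.
# Values are rounded to 0.5 to mimic the ties in real distances and points.
set.seed(7)
n <- 300
distance <- round(rnorm(n, mean = 214, sd = 15) * 2) / 2
technical <- 54 + 0.03 * (distance - 214) -
  0.15 * pmax(distance - 230, 0) + rnorm(n, sd = 1.5)
toy_fh <- data.frame(distance_meter = distance,
                     technical_points = round(technical * 2) / 2)

# Hypothetical helper: Kendall test on jumps at or above a threshold
tau_above <- function(data, threshold) {
  sub <- data[data$distance_meter >= threshold, ]
  # exact = FALSE requests the normal approximation, which handles ties
  test <- cor.test(sub$distance_meter, sub$technical_points,
                   method = "kendall", exact = FALSE)
  data.frame(threshold = threshold,
             n = nrow(sub),
             tau = unname(test$estimate),
             p_value = test$p.value)
}

thresholds <- c(220, 225, 230, 235)
sweep <- do.call(rbind, lapply(thresholds, tau_above, data = toy_fh))
print(sweep)
```

On the real data, `FH` would be passed instead of `toy_fh`.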
The following boxplots show distance and speed on large hills for each participating country.
ggplot(LH, aes(distance_meter, nationality)) +
  geom_boxplot() +
  labs(x = 'Distance', y = 'Nationality') +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'),
        axis.title.x = element_text(face = 'bold'),
        axis.title.y = element_text(face = 'bold'))
ggplot(LH, aes(speed, nationality)) +
  geom_boxplot() +
  labs(x = 'Speed', y = 'Nationality') +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'),
        axis.title.x = element_text(face = 'bold'),
        axis.title.y = element_text(face = 'bold'))
The last question we are interested in is whether an athlete's technical points vary with the location of the race. In other words, does an athlete get higher points when he competes in his home country?
We first look at Germany:
germans_notin_germany <- ski_jumping %>%
filter(nationality == "GER", race_location != "GER")
germans_in_germany <- ski_jumping %>%
filter(nationality == "GER", race_location == "GER") %>%
filter(name %in% germans_notin_germany$name)
set.seed(83)
sample_germans <- sample_n(germans_notin_germany, 83)
wilcox.test(sample_germans$technical_points, germans_in_germany$technical_points, paired = TRUE, alternative = "less")
##
## Wilcoxon signed rank test with continuity correction
##
## data: sample_germans$technical_points and germans_in_germany$technical_points
## V = 1242, p-value = 0.2169
## alternative hypothesis: true location shift is less than 0
At the 0.05 level, we find no significant evidence that German ski jumpers receive higher technical points in Germany than outside of Germany (p ≈ 0.22).
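The test above matches home jumps to a random sample of away jumps row by row. An alternative design, sketched here on invented toy data, would pair observations by athlete instead, comparing each athlete's mean technical points at home and abroad. The data frame `toy`, the athlete labels, and all values are hypothetical; only the column names follow the data set.

```r
# Toy data: 8 hypothetical German athletes, 2 home and 4 away jumps each
set.seed(42)
toy <- data.frame(
  name = rep(paste0("athlete_", 1:8), each = 6),
  race_location = rep(c("GER", "GER", "AUT", "NOR", "POL", "SLO"), times = 8),
  technical_points = round(rnorm(48, mean = 53, sd = 2) * 2) / 2
)
toy$home <- toy$race_location == "GER"

# Mean technical points per athlete, separately for home and away jumps
means <- aggregate(technical_points ~ name + home, data = toy, FUN = mean)

# One row per athlete, with the home and away means side by side
wide <- merge(means[means$home, ], means[!means$home, ],
              by = "name", suffixes = c("_home", "_away"))

# Paired test: are away points lower than home points for the same athlete?
# exact = FALSE requests the normal approximation, which tolerates ties.
res <- wilcox.test(wide$technical_points_away, wide$technical_points_home,
                   paired = TRUE, alternative = "less", exact = FALSE)
print(res$p.value)
```

Averaging per athlete discards the number of jumps behind each mean, so this is one possible design among several rather than a replacement for the analysis above.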
Next, we take a look at Slovenia:
slovenians_notin_slovenia <- ski_jumping %>%
filter(nationality == "SLO", race_location != "SLO")
slovenians_in_slovenia <- ski_jumping %>%
filter(nationality == "SLO", race_location == "SLO") %>%
filter(name %in% slovenians_notin_slovenia$name)
set.seed(29)
sample_slovenians <- sample_n(slovenians_notin_slovenia, 29)
wilcox.test(sample_slovenians$technical_points, slovenians_in_slovenia$technical_points, paired = TRUE, alternative = "less")
## Warning in wilcox.test.default(sample_slovenians$technical_points,
## slovenians_in_slovenia$technical_points, : cannot compute exact p-value with
## ties
## Warning in wilcox.test.default(sample_slovenians$technical_points,
## slovenians_in_slovenia$technical_points, : cannot compute exact p-value with
## zeroes
##
## Wilcoxon signed rank test with continuity correction
##
## data: sample_slovenians$technical_points and slovenians_in_slovenia$technical_points
## V = 105, p-value = 0.02228
## alternative hypothesis: true location shift is less than 0
The results indicate that the Slovenians received significantly higher technical points in Slovenia than in races held elsewhere (p ≈ 0.022).
We observe similar results for Norway and Austria, whose ski jumpers also received significantly higher points in their home countries. Poland, however, is an interesting case: during the 2021-22 season, Polish ski jumpers received significantly lower points in Poland than in the other races on the calendar.
Ski jumpers may receive higher technical points in their home countries because they actually perform better there (owing to factors such as crowd support or familiarity with the hill), because the judges are biased towards the host country, or (most probably) because of a combination of the two. However, we leave the question of causation for a later study.