A Simple Analysis of the 2021-22 Ski Jumping Season

Ski jumping is a sport in which athletes slide down a ramp and aim to jump as far as possible with solid technique. Their points are determined by a combination of the distance they achieve and the technical points awarded by five judges, with the highest and the lowest score excluded.
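As a minimal sketch of the judging rule described above, the following R function sums the three middle scores out of five. The function name `style_points` and the example scores are illustrative assumptions and do not come from the dataset.

```r
# Technical (style) points for one jump: five judges score the jump,
# the highest and the lowest score are dropped, and the rest are summed.
# `style_points` is a hypothetical helper name used only for illustration.
style_points <- function(scores) {
  stopifnot(length(scores) == 5)
  sum(scores) - max(scores) - min(scores)
}

# Made-up scores from judges A to E
example_scores <- c(18.0, 18.5, 19.0, 17.5, 18.5)
style_points(example_scores)  # 55: the 19.0 and the 17.5 are dropped
```

If the `technical_points` column follows this rule, applying the function row-wise to `judge_A` through `judge_E` should reproduce it.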

Dataset

We collected the data from the publicly available FIS database. It contains all FIS Men’s Individual World Cup races except those of the Four Hills Tournament, which we left out because its match-up format could compromise the robustness of the analysis.

library(tidyverse)  # provides %>%, filter(), select(), sample_n() and ggplot() used below
head(ski_jumping)

## # A tibble: 6 × 23
##   race      race_location hill_type round name  nationality speed distance_meter
##   <chr>     <chr>         <chr>     <dbl> <chr> <chr>       <dbl>          <dbl>
## 1 Nizhny T… RUS           LH            2 GEIG… GER          86.1           134.
## 2 Nizhny T… RUS           LH            1 GEIG… GER          86.1           133 
## 3 Nizhny T… RUS           LH            2 KOBA… JPN          85.6           128 
## 4 Nizhny T… RUS           LH            1 KOBA… JPN          85.4           131 
## 5 Nizhny T… RUS           LH            2 GRAN… NOR          85.6           133 
## 6 Nizhny T… RUS           LH            1 GRAN… NOR          85.3           125 
## # … with 15 more variables: distance_points <dbl>, judge_A <dbl>,
## #   judge_B <dbl>, judge_C <dbl>, judge_D <dbl>, judge_E <dbl>,
## #   technical_points <dbl>, gate_number <dbl>, gate_points <dbl>, wind <dbl>,
## #   wind_points <dbl>, total_points <dbl>, total_race_points <dbl>, rank <dbl>,
## #   overall_race_rank <dbl>

In the following analysis, we often distinguish between large and flying hills. We therefore first split the data into large hills (LH) and flying hills (FH). Summary statistics for selected variables are shown below.

LH <- ski_jumping %>%
  filter(hill_type == 'LH')

FH <- ski_jumping %>%
  filter(hill_type == 'FH')

LH %>%
  select(speed, distance_meter, distance_points, technical_points, total_points, total_race_points) %>%
  summary

##      speed       distance_meter  distance_points  technical_points
##  Min.   :84.60   Min.   : 85.5   Min.   :-20.10   Min.   :20.50   
##  1st Qu.:87.60   1st Qu.:117.5   1st Qu.: 51.90   1st Qu.:51.00   
##  Median :89.10   Median :124.0   Median : 63.60   Median :52.50   
##  Mean   :89.17   Mean   :123.4   Mean   : 61.46   Mean   :52.54   
##  3rd Qu.:90.80   3rd Qu.:130.0   3rd Qu.: 72.60   3rd Qu.:54.00   
##  Max.   :93.50   Max.   :152.0   Max.   :108.60   Max.   :58.50   
##   total_points   total_race_points
##  Min.   : 15.2   Min.   : 15.2    
##  1st Qu.:103.0   1st Qu.:105.7    
##  Median :117.7   Median :129.3    
##  Mean   :114.4   Mean   :158.7    
##  3rd Qu.:128.7   3rd Qu.:235.4    
##  Max.   :170.1   Max.   :324.5

FH %>%
  select(speed, distance_meter, distance_points, technical_points, total_points, total_race_points) %>%
  summary

##      speed       distance_meter  distance_points technical_points
##  Min.   :100.0   Min.   :148.0   Min.   : 57.6   Min.   :23.00   
##  1st Qu.:102.2   1st Qu.:203.4   1st Qu.:124.0   1st Qu.:52.50   
##  Median :103.5   Median :215.0   Median :138.0   Median :54.00   
##  Mean   :103.2   Mean   :213.7   Mean   :136.4   Mean   :53.76   
##  3rd Qu.:104.0   3rd Qu.:225.0   3rd Qu.:150.0   3rd Qu.:55.50   
##  Max.   :106.9   Max.   :246.0   Max.   :175.2   Max.   :59.00   
##   total_points   total_race_points
##  Min.   : 85.3   Min.   : 85.3    
##  1st Qu.:176.7   1st Qu.:188.0    
##  Median :189.7   Median :230.6    
##  Mean   :190.6   Mean   :279.5    
##  3rd Qu.:207.8   3rd Qu.:379.0    
##  Max.   :252.0   Max.   :468.2

Relationship between speed and distance

Before going into any analysis, we visualize the relationship between the speed and the distance in meters for large and flying hills, respectively.

ggplot(LH, aes(speed, distance_meter)) +
  geom_point() +
  geom_smooth(method = 'lm', col = 'red', se = FALSE) +
  labs(title = 'Relationship between speed and distance for large hills', x = 'Speed', y = 'Distance') +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'),
        axis.title.x = element_text(face = 'bold'),
        axis.title.y = element_text(face = 'bold'))

## `geom_smooth()` using formula 'y ~ x'

ggplot(FH, aes(speed, distance_meter)) +
  geom_point() +
  geom_smooth(method = 'lm', col = 'red', se = FALSE) +
  labs(title = 'Relationship between speed and distance for flying hills', x = 'Speed', y = 'Distance') +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'),
        axis.title.x = element_text(face = 'bold'),
        axis.title.y = element_text(face = 'bold'))

## `geom_smooth()` using formula 'y ~ x'

We observe a positive linear trend between speed and distance for both large and flying hills, although there are some large outliers.

To formally test for the correlation, we first check whether the variables speed and distance are normally distributed for large and flying hills, respectively.

shapiro.test(LH$speed)

## 
##  Shapiro-Wilk normality test
## 
## data:  LH$speed
## W = 0.97931, p-value = 3.517e-14

shapiro.test(FH$speed)

## 
##  Shapiro-Wilk normality test
## 
## data:  FH$speed
## W = 0.96272, p-value = 2.065e-06

shapiro.test(LH$distance_meter)

## 
##  Shapiro-Wilk normality test
## 
## data:  LH$distance_meter
## W = 0.98883, p-value = 1.491e-09

shapiro.test(FH$distance_meter)

## 
##  Shapiro-Wilk normality test
## 
## data:  FH$distance_meter
## W = 0.98247, p-value = 0.002235

The Shapiro-Wilk normality test indicates that we cannot assume normality for speed or distance on either large or flying hills. This is plausible, since the data pool races on hills of different sizes. Within individual races, however, speed and distance are often consistent with a normal distribution. Below, we use the first race in Lahti as an example.

lahti1 <- LH %>%
  filter(race == 'Lahti 1')
shapiro.test(lahti1$distance_meter)

## 
##  Shapiro-Wilk normality test
## 
## data:  lahti1$distance_meter
## W = 0.97905, p-value = 0.1876

shapiro.test(lahti1$speed)

## 
##  Shapiro-Wilk normality test
## 
## data:  lahti1$speed
## W = 0.97583, p-value = 0.115
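The Lahti check can be repeated for every race at once. The following is a minimal sketch in base R on simulated data; the data frame `toy`, its three race names and all of its values are assumptions for illustration and are not taken from the FIS data.

```r
# Simulated stand-in for the real data (illustration only):
# three hypothetical races with 50 jumps each, held on hills of different sizes.
set.seed(1)
toy <- data.frame(
  race = rep(c("Race A", "Race B", "Race C"), each = 50),
  distance_meter = c(rnorm(50, 120, 8), rnorm(50, 130, 6), rnorm(50, 140, 7))
)

# Shapiro-Wilk p-value for each race separately
p_by_race <- sapply(
  split(toy$distance_meter, toy$race),
  function(x) shapiro.test(x)$p.value
)
p_by_race

# The same test on the pooled distances, for comparison
shapiro.test(toy$distance_meter)$p.value
```

With the real data, replacing `toy` by `LH` or `FH` would give one p-value per race, which shows how many races are compatible with normality.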

Since speed and distance are not normally distributed in the pooled data, we use a non-parametric test, Kendall's rank correlation test, to test for correlation between speed and distance on large and flying hills, respectively.

cor.test(LH$speed, LH$distance_meter, method = "kendall")

## 
##  Kendall's rank correlation tau
## 
## data:  LH$speed and LH$distance_meter
## z = 13.213, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.2269388

cor.test(FH$speed, FH$distance_meter, method = "kendall")

## 
##  Kendall's rank correlation tau
## 
## data:  FH$speed and FH$distance_meter
## z = 6.1933, p-value = 5.892e-10
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.2583555

The correlation coefficients are positive but small (tau of about 0.23 for large hills and 0.26 for flying hills). However, since both p-values are well below the significance level of 0.05, we conclude that speed and distance are significantly correlated on both large and flying hills.
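To make the tau values more concrete, the sketch below computes Kendall's tau directly from concordant and discordant pairs and compares it with R's built-in estimate. The helper `kendall_tau` and the six speed and distance values are made up for illustration, and the helper assumes there are no ties.

```r
# Kendall's tau as (concordant - discordant) / number of pairs.
# Hypothetical helper; valid only when neither variable contains ties.
kendall_tau <- function(x, y) {
  n <- length(x)
  idx <- combn(n, 2)                         # every pair of observations
  s <- sign(x[idx[1, ]] - x[idx[2, ]]) *
       sign(y[idx[1, ]] - y[idx[2, ]])       # +1 concordant, -1 discordant
  sum(s) / choose(n, 2)
}

speed    <- c(85.3, 85.6, 86.1, 85.4, 86.4, 85.9)  # made-up values
distance <- c(125, 133, 134, 128, 131, 136)        # made-up values

kendall_tau(speed, distance)
cor(speed, distance, method = "kendall")  # agrees with the helper when there are no ties
```

A tau near 0.25 therefore means that concordant pairs (faster and farther) outnumber discordant pairs only moderately.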

Relationship between distance and technical points

Next, we analyse the relationship between distance and technical points. First, we run Kendall's correlation test on the flying hill data:

cor.test(FH$distance_meter, FH$technical_points, method = "kendall")

## 
##  Kendall's rank correlation tau
## 
## data:  FH$distance_meter and FH$technical_points
## z = 13.262, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.5630862

We observe a significant, moderate correlation between distance and technical points. This is unsurprising, since a better-performing ski jumper tends to achieve both longer distances and cleaner telemark landings. Things get more interesting when we restrict the data to jumps of at least 230 meters (which are rare and well above the average) and rerun the correlation test:

large_dist <- FH %>%
  filter(distance_meter >= 230)
cor.test(large_dist$distance_meter, large_dist$technical_points, method = "kendall")

## Warning in cor.test.default(large_dist$distance_meter,
## large_dist$technical_points, : Cannot compute exact p-value with ties

## 
##  Kendall's rank correlation tau
## 
## data:  large_dist$distance_meter and large_dist$technical_points
## z = -1.2, p-value = 0.2301
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau 
## -0.1259485

For jumps of at least 230 meters, the estimated correlation between distance and technical points is small and negative (tau = -0.13), but it is not statistically significant (p = 0.23). If the pattern is real, it would suggest that above a certain threshold (around 230 m), each additional meter costs the ski jumper part of his telemark landing and therefore technical points. In other words, above that threshold there may be a trade-off between technical points and distance.

When we restrict the data to jumps of at least 235 meters, the correlation coefficient becomes more negative (tau = -0.23), although it remains non-significant (p = 0.14):

larger_dist <- FH %>%
  filter(distance_meter >= 235)
cor.test(larger_dist$distance_meter, larger_dist$technical_points, method = "kendall")

## Warning in cor.test.default(larger_dist$distance_meter,
## larger_dist$technical_points, : Cannot compute exact p-value with ties

## 
##  Kendall's rank correlation tau
## 
## data:  larger_dist$distance_meter and larger_dist$technical_points
## z = -1.4738, p-value = 0.1405
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau 
## -0.2316945
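Rather than testing one cut-off at a time, the same comparison can be tabulated across several thresholds together with the shrinking sample size. This is a minimal sketch on simulated data; `toy_fh`, the helper `tau_above` and the chosen thresholds are assumptions for illustration, and the simulated values do not reproduce the results above.

```r
# Simulated flying-hill jumps (illustration only, not the FIS data)
set.seed(42)
n <- 300
toy_fh <- data.frame(distance_meter = round(runif(n, 150, 245) * 2) / 2)
toy_fh$technical_points <-
  round((45 + 0.04 * toy_fh$distance_meter + rnorm(n, 0, 1.5)) * 2) / 2

# Kendall's tau, p-value and sample size for jumps at or above a threshold.
# Hypothetical helper name.
tau_above <- function(data, threshold) {
  sub <- data[data$distance_meter >= threshold, ]
  test <- suppressWarnings(  # with ties, the exact p-value is unavailable
    cor.test(sub$distance_meter, sub$technical_points, method = "kendall")
  )
  data.frame(threshold = threshold, n = nrow(sub),
             tau = unname(test$estimate), p_value = test$p.value)
}

thresholds <- c(200, 210, 220, 230, 235)
tau_scan <- do.call(rbind, lapply(thresholds, tau_above, data = toy_fh))
tau_scan
```

Reporting `n` next to each tau helps judge how much weight a non-significant estimate from a small subset can carry.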

Nationality analysis

The following boxplots show distance and speed on large hills for each participating country.

ggplot(LH, aes(distance_meter, nationality)) +
  geom_boxplot() +
  labs(x = 'Distance', y = 'Nationality') +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'),
        axis.title.x = element_text(face = 'bold'),
        axis.title.y = element_text(face = 'bold'))

ggplot(LH, aes(speed, nationality)) +
  geom_boxplot() +
  labs(x = 'Speed', y = 'Nationality') +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'),
        axis.title.x = element_text(face = 'bold'),
        axis.title.y = element_text(face = 'bold'))

The last question we are interested in is whether an athlete's technical points vary with the location of the race. In other words, does an athlete get higher points when he competes in his home country?

We first look at Germany:

germans_notin_germany <- ski_jumping %>%
  filter(nationality == "GER", race_location != "GER")

germans_in_germany <- ski_jumping %>%
  filter(nationality == "GER", race_location == "GER") %>%
  filter(name %in% germans_notin_germany$name)

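set.seed(83)
# Draw 83 away jumps so both vectors have equal length, as paired = TRUE requires
sample_germans <- sample_n(germans_notin_germany, 83)
# One-sided test: are the away technical points lower than the home technical points?
wilcox.test(sample_germans$technical_points, germans_in_germany$technical_points, paired = TRUE, alternative = "less")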

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  sample_germans$technical_points and germans_in_germany$technical_points
## V = 1242, p-value = 0.2169
## alternative hypothesis: true location shift is less than 0

The test finds no significant evidence that German ski jumpers receive higher technical points in Germany than outside of Germany (p = 0.22).

Next, we take a look at Slovenia:

slovenians_notin_slovenia <- ski_jumping %>%
  filter(nationality == "SLO", race_location != "SLO")

slovenians_in_slovenia <- ski_jumping %>%
  filter(nationality == "SLO", race_location == "SLO") %>%
  filter(name %in% slovenians_notin_slovenia$name)

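set.seed(29)
# Draw 29 away jumps so both vectors have equal length, as paired = TRUE requires
sample_slovenians <- sample_n(slovenians_notin_slovenia, 29)

# One-sided test: are the away technical points lower than the home technical points?
wilcox.test(sample_slovenians$technical_points, slovenians_in_slovenia$technical_points, paired = TRUE, alternative = "less")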

## Warning in wilcox.test.default(sample_slovenians$technical_points,
## slovenians_in_slovenia$technical_points, : cannot compute exact p-value with
## ties

## Warning in wilcox.test.default(sample_slovenians$technical_points,
## slovenians_in_slovenia$technical_points, : cannot compute exact p-value with
## zeroes

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  sample_slovenians$technical_points and slovenians_in_slovenia$technical_points
## V = 105, p-value = 0.02228
## alternative hypothesis: true location shift is less than 0

The results indicate that the Slovenians received significantly higher technical points in Slovenia than in other countries (p = 0.022).
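The tests above pair home jumps with a random sample of away jumps by row position. One possible alternative, sketched here on simulated data, pairs each athlete with himself by comparing his mean technical points at home and away. The data frame `toy`, the athlete names and all values are assumptions for illustration; this is a different technique from the one used above, and its results on the real data may differ.

```r
# Simulated jumps for five hypothetical athletes (illustration only)
set.seed(7)
toy <- data.frame(
  name = rep(paste("Athlete", 1:5), each = 8),
  race_location = rep(c("SLO", "SLO", "GER", "NOR", "AUT", "POL", "FIN", "JPN"),
                      times = 5),
  technical_points = round(rnorm(40, mean = 53, sd = 2) * 2) / 2
)
toy$at_home <- toy$race_location == "SLO"

# Mean technical points per athlete, at home and away
home_mean <- tapply(toy$technical_points[toy$at_home],
                    toy$name[toy$at_home], mean)
away_mean <- tapply(toy$technical_points[!toy$at_home],
                    toy$name[!toy$at_home], mean)

# Keep only athletes who appear in both groups, in the same order
both <- intersect(names(home_mean), names(away_mean))

# Paired one-sided test on per-athlete means: is the away mean lower than the home mean?
res <- wilcox.test(away_mean[both], home_mean[both],
                   paired = TRUE, alternative = "less")
res
```

Each pair then consists of two observations of the same athlete, which is what a signed rank test assumes. The price is a much smaller sample, one pair per athlete.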

We observe similar results for Norway and Austria, whose ski jumpers also received significantly higher points in their home countries. Poland, however, is an interesting exception: during the 2021-22 season, Polish ski jumpers received significantly lower points in Poland than in other races on the calendar.

Ski jumpers may often get higher technical points in their home countries because they perform better there (owing to factors such as crowd support or familiarity with the hill), because the judges are biased towards the host country, or, most probably, because of a combination of the two. However, we leave the question of causation for a later study.

Daniels Süha Özkaya (danielssuhaozkaya@gmail.com) & Lara Deniz Alper (lara.alper@upf.edu)

2022-08-11
