Research goal

I like sports, and I am sure you do too. Watching the ski jumping in Vikersund on TV in February 2016 made me wonder: is there any correlation between the velocity and the distance of a jump? I use data that I collected myself. You can find some data here: http://www.fis-ski.com/ski-jumping/events-and-places/results/, https://en.wikipedia.org/wiki/Ski_jumping.

Data

A ski jump is divided into four parts: in-run, take-off (jump), flight and landing. Velocity is the maximum speed of the skier during take-off (km/h). Distance is the length of the flight between take-off and landing (meters).

load(file="ski.dat")
names(ski) <- c("Velocity", "Distance", "Country")
head(ski)
##   Velocity Distance Country
## 1     99.5    213.5     CZE
## 2    100.0    214.5     USA
## 3     99.7    219.5     AUT
## 4     99.3    203.0     CZE
## 5     99.6    196.5     SLO
## 6    100.1    199.0     POL
attach(ski)

Preliminary results

summary(ski)
##     Velocity         Distance        Country 
##  Min.   : 97.80   Min.   :181.5   NOR    :7  
##  1st Qu.: 98.80   1st Qu.:204.8   SLO    :6  
##  Median : 99.20   Median :216.0   AUT    :5  
##  Mean   : 99.24   Mean   :216.1   GER    :5  
##  3rd Qu.: 99.70   3rd Qu.:225.8   POL    :5  
##  Max.   :101.00   Max.   :249.0   CZE    :4  
##                                   (Other):7
hist(Velocity)

shapiro.test(Velocity)
## 
##  Shapiro-Wilk normality test
## 
## data:  Velocity
## W = 0.9599, p-value = 0.177
hist(Distance)

shapiro.test(Distance)
## 
##  Shapiro-Wilk normality test
## 
## data:  Distance
## W = 0.98704, p-value = 0.926
cor.test(Velocity,Distance)
## 
##  Pearson's product-moment correlation
## 
## data:  Velocity and Distance
## t = 0.64376, df = 37, p-value = 0.5237
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2174940  0.4072393
## sample estimates:
##       cor 
## 0.1052453
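
As a cross-check, the t statistic and p-value reported above can be recomputed from the correlation coefficient alone, using t = r * sqrt(n - 2) / sqrt(1 - r^2). This is a minimal sketch that takes r and df from the printed output (so n = 39 jumps); the variable names are illustrative only.

```r
# Values copied from the cor.test() output above
r  <- 0.1052453
df <- 37                 # n - 2, so n = 39 jumps

# t statistic for H0: the true correlation is 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with df degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = df)

round(c(t = t_stat, p = p_value), 4)
```

With such a small r the statistic stays well below the usual critical value, which is why the confidence interval comfortably includes zero.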

Visualising data

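# One colour code per country (col = Country would recycle per-observation codes)
boxplot(Velocity ~ Country, data = ski, col = seq_len(nlevels(ski$Country)), horizontal = TRUE,
        main = "Ski 2016: Velocity versus Country", xlab = "Velocity, km/h", ylab = "Country")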

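# One colour code per country (col = Country would recycle per-observation codes)
boxplot(Distance ~ Country, data = ski, col = seq_len(nlevels(ski$Country)), horizontal = TRUE,
        main = "Ski 2016: Distance versus Country", xlab = "Distance, m", ylab = "Country")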

ANOVA

fit.1<-aov(Distance~Country)
summary(fit.1)
##             Df Sum Sq Mean Sq F value Pr(>F)
## Country     10   3135   313.5   1.538  0.178
## Residuals   28   5708   203.9
fit.2<-aov(Velocity~Country)
summary(fit.2)
##             Df Sum Sq Mean Sq F value Pr(>F)
## Country     10  4.383  0.4383   0.889  0.555
## Residuals   28 13.807  0.4931
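
To make the ANOVA table easier to read, the F value and p-value for Distance can be rebuilt from the sums of squares and degrees of freedom printed in summary(fit.1). This is a sketch using the rounded table values, so the result may differ from the table in the last digit; the variable names are illustrative.

```r
# Values copied from summary(fit.1) above
ss_country   <- 3135;  df_country   <- 10   # 11 countries - 1
ss_residuals <- 5708;  df_residuals <- 28   # 39 jumps - 11 countries

ms_country   <- ss_country / df_country        # mean square between countries
ms_residuals <- ss_residuals / df_residuals    # mean square within countries

# F is the ratio of between-country to within-country variation
f_value <- ms_country / ms_residuals
p_value <- pf(f_value, df_country, df_residuals, lower.tail = FALSE)

round(c(F = f_value, p = p_value), 3)
```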

Clusters

set.seed(1234)
fit.k<-kmeans(ski[,-3],centers = 4,iter.max = 1000,nstart = 20)
fit.k
## K-means clustering with 4 clusters of sizes 10, 14, 5, 10
## 
## Cluster means:
##   Velocity Distance
## 1 99.08000 195.7500
## 2 99.27857 214.6071
## 3 98.94000 240.0000
## 4 99.48000 226.4500
## 
## Clustering vector:
##  [1] 2 2 2 1 1 1 4 2 4 4 4 2 1 4 1 2 1 2 2 1 1 2 2 2 4 4 3 2 4 4 3 3 4 3 1
## [36] 2 2 3 1
## 
## Within cluster sum of squares by cluster:
## [1] 362.3810 189.0529 108.0320 100.8610
##  (between_SS / total_SS =  91.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
plot((ski[,-3]),col=fit.k$cluster)
points(fit.k$centers, col = 1:4, pch = 8) 
abline(h = 216)
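
One thing worth keeping in mind when reading these clusters: kmeans() works with Euclidean distance on the raw columns, and Distance (in meters) varies far more than Velocity (in km/h). The total sums of squares from the two ANOVA tables give a quick estimate of each variable's share. The second half is only a sketch of a rescaled alternative: it runs on the eleven rows that happen to be printed in this document (not the full data set), and the choice of two centers is arbitrary.

```r
# Total sum of squares per variable = Country + Residuals from the ANOVA tables
ss_velocity <- 4.383 + 13.807
ss_distance <- 3135 + 5708
share_distance <- ss_distance / (ss_velocity + ss_distance)
round(100 * share_distance, 1)   # percent of total variation that comes from Distance

# Sketch: standardise both columns so each contributes equally to the clustering
ski_subset <- data.frame(
  Velocity = c(99.5, 100.0, 99.7, 99.3, 99.6, 100.1, 99.2, 99.9, 98.7, 98.5, 98.4),
  Distance = c(213.5, 214.5, 219.5, 203.0, 196.5, 199.0, 239.5, 249.0, 236.5, 238.0, 237.0)
)
ski_scaled <- scale(ski_subset)          # mean 0, standard deviation 1 per column
set.seed(1234)
fit_scaled <- kmeans(ski_scaled, centers = 2, nstart = 20)
fit_scaled$cluster
```

On the full data the same idea would be kmeans(scale(ski[, -3]), ...); whether that grouping is more useful than the unscaled one is a judgment call.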

Linear model for leaders

ski.2<-data.frame(ski,fit.k$cluster)
ski.4<-ski.2[which(ski.2$fit.k.cluster==3),]
plot(ski.4$Velocity,ski.4$Distance,main="Correlation velocity ~ distance",xlab = "Velocity, km/h", ylab = "Distance, m")

cor.test(ski.4$Velocity,ski.4$Distance)
## 
##  Pearson's product-moment correlation
## 
## data:  ski.4$Velocity and ski.4$Distance
## t = 4.3045, df = 3, p-value = 0.02308
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2503881 0.9953200
## sample estimates:
##       cor 
## 0.9277141
fit.lm<-lm(data=ski.4,Distance~Velocity)
summary(fit.lm)
## 
## Call:
## lm(formula = Distance ~ Velocity, data = ski.4)
## 
## Residuals:
##     27     31     32     34     38 
## -2.511  1.574 -1.644  1.403  1.177 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -525.300    177.792  -2.955   0.0598 .
## Velocity       7.735      1.797   4.305   0.0231 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.224 on 3 degrees of freedom
## Multiple R-squared:  0.8607, Adjusted R-squared:  0.8142 
## F-statistic: 18.53 on 1 and 3 DF,  p-value: 0.02308
abline(fit.lm, col="red")
grid()

ski.4
##    Velocity Distance Country fit.k.cluster
## 27     99.2    239.5     NOR             3
## 31     99.9    249.0     NOR             3
## 32     98.7    236.5     GER             3
## 34     98.5    238.0     AUT             3
## 38     98.4    237.0     SLO             3
predict(fit.lm, data.frame(Velocity=100.5),interval = "confidence")
##        fit      lwr      upr
## 1 252.0666 242.6005 261.5326
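
The fit can be reproduced from the five rows of ski.4 printed above, which also makes it possible to invert the model and ask which velocity corresponds to a distance of 251.5 m. This is a minimal sketch; the names leaders, fit_leaders, new_jump and v_needed are illustrative. Note that 100.5 km/h lies outside the observed range of the cluster (98.4 to 99.9 km/h), so the prediction is an extrapolation from five points.

```r
# The five jumps in cluster 3, copied from the ski.4 listing
leaders <- data.frame(
  Velocity = c(99.2, 99.9, 98.7, 98.5, 98.4),
  Distance = c(239.5, 249.0, 236.5, 238.0, 237.0)
)

fit_leaders <- lm(Distance ~ Velocity, data = leaders)
coef(fit_leaders)

new_jump <- data.frame(Velocity = 100.5)
# Interval for the mean distance at 100.5 km/h
predict(fit_leaders, new_jump, interval = "confidence")
# Wider interval for a single future jump at 100.5 km/h
predict(fit_leaders, new_jump, interval = "prediction")

# Velocity at which the fitted line reaches 251.5 m
b <- coef(fit_leaders)
v_needed <- unname((251.5 - b["(Intercept)"]) / b["Velocity"])
round(v_needed, 2)
```

The confidence interval describes the average jump at that velocity; for one particular jump the prediction interval is the more relevant one.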

Conclusions

  1. The Shapiro-Wilk test does not reject normality for either Velocity (p = 0.177) or Distance (p = 0.926).

  2. No significant correlation was found between Velocity and Distance over the whole sample (r = 0.11, p = 0.524).

  3. ANOVA shows no significant difference between countries in either Distance (p = 0.178) or Velocity (p = 0.555).

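  4. k-means with four centers separates the jumps almost entirely by Distance: the cluster means are 195.8, 214.6, 226.5 and 240.0 m, while the mean Velocity stays between 98.9 and 99.5 km/h (between_SS / total_SS = 91.4 %).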

  5. Cluster 3 is the cluster of leaders and shows a strong correlation between Velocity and Distance (r = 0.93, p = 0.023), although it contains only five jumps.

  6. The correlation in the cluster of leaders may suggest that a jumper who catches the wind has a good chance of winning the competition.

  7. The linear model fit.lm has a significant slope (7.7 m per km/h, p = 0.023, R-squared = 0.86).

  8. To beat the 2015 world record of 251.5 meters, one must catch the wind with a velocity of about 100.5 km/h: at that velocity fit.lm predicts 252.1 m (95% confidence interval 242.6 to 261.5 m).