I like sports, and I am sure you do too. Watching the ski jumping in Vikersund on TV in February 2016 made me wonder: is there any correlation between take-off velocity and the distance of a jump? I use data that I collected myself. Some of the data can be seen here: http://www.fis-ski.com/ski-jumping/events-and-places/results/ and https://en.wikipedia.org/wiki/Ski_jumping.
A ski jump is divided into four parts: in-run, take-off (jump), flight and landing. Velocity is the skier's maximum speed during take-off (km/h). Distance is the length of the flight between take-off and landing (meters).
load(file="ski.dat")
names(ski) <- c("Velocity", "Distance", "Country")
head(ski)
## Velocity Distance Country
## 1 99.5 213.5 CZE
## 2 100.0 214.5 USA
## 3 99.7 219.5 AUT
## 4 99.3 203.0 CZE
## 5 99.6 196.5 SLO
## 6 100.1 199.0 POL
attach(ski)
summary(ski)
## Velocity Distance Country
## Min. : 97.80 Min. :181.5 NOR :7
## 1st Qu.: 98.80 1st Qu.:204.8 SLO :6
## Median : 99.20 Median :216.0 AUT :5
## Mean : 99.24 Mean :216.1 GER :5
## 3rd Qu.: 99.70 3rd Qu.:225.8 POL :5
## Max. :101.00 Max. :249.0 CZE :4
## (Other):7
hist(Velocity)
shapiro.test(Velocity)
##
## Shapiro-Wilk normality test
##
## data: Velocity
## W = 0.9599, p-value = 0.177
hist(Distance)
shapiro.test(Distance)
##
## Shapiro-Wilk normality test
##
## data: Distance
## W = 0.98704, p-value = 0.926
cor.test(Velocity,Distance)
##
## Pearson's product-moment correlation
##
## data: Velocity and Distance
## t = 0.64376, df = 37, p-value = 0.5237
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2174940 0.4072393
## sample estimates:
## cor
## 0.1052453
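As a sanity check, the t statistic reported above can be recovered by hand from the sample correlation and the degrees of freedom, using t = r * sqrt(df) / sqrt(1 - r^2). The sketch below reuses only the numbers printed by cor.test; the names r, dof, t.stat and p.val are illustrative and do not appear in the original analysis.

```r
# Values copied from the cor.test output above
r   <- 0.1052453   # sample correlation
dof <- 37          # degrees of freedom (n - 2, so n = 39 jumps)

# t statistic for H0: the true correlation is 0
t.stat <- r * sqrt(dof) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p.val <- 2 * pt(abs(t.stat), dof, lower.tail = FALSE)

round(t.stat, 5)
round(p.val, 4)
```

Both values should agree with the printed output up to rounding, which confirms that a correlation of about 0.1 on 39 jumps is far from significant.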
boxplot(Velocity ~ Country, data = ski, col = Country, horizontal = TRUE)
title(main = "Ski 2016: Velocity versus Country", xlab = "Velocity, km/h", ylab = "Country")
boxplot(Distance ~ Country, data = ski, col = Country, horizontal = TRUE)
title(main = "Ski 2016: Distance versus Country", xlab = "Distance, m", ylab = "Country")
fit.1<-aov(Distance~Country)
summary(fit.1)
## Df Sum Sq Mean Sq F value Pr(>F)
## Country 10 3135 313.5 1.538 0.178
## Residuals 28 5708 203.9
fit.2<-aov(Velocity~Country)
summary(fit.2)
## Df Sum Sq Mean Sq F value Pr(>F)
## Country 10 4.383 0.4383 0.889 0.555
## Residuals 28 13.807 0.4931
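The F value in a one-way ANOVA table is the ratio of the between-group mean square to the residual mean square. As a minimal sketch, the Distance table above can be reproduced from its own sums of squares and degrees of freedom; the variable names are illustrative, and the inputs are the rounded values printed by summary(fit.1), so the result matches only up to rounding.

```r
# Sums of squares and degrees of freedom copied from summary(fit.1)
ss.between <- 3135; df.between <- 10   # Country: 11 countries - 1
ss.within  <- 5708; df.within  <- 28   # Residuals: 39 jumps - 11 countries

# F = mean square between / mean square within
f.value <- (ss.between / df.between) / (ss.within / df.within)

# Upper-tail probability of the F distribution
p.value <- pf(f.value, df.between, df.within, lower.tail = FALSE)

round(f.value, 3)
round(p.value, 3)
```

With only 39 jumps spread over 11 countries, several groups are very small, so the test has little power to detect differences between countries.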
set.seed(1234)
fit.k<-kmeans(ski[,-3],centers = 4,iter.max = 1000,nstart = 20)
fit.k
## K-means clustering with 4 clusters of sizes 10, 14, 5, 10
##
## Cluster means:
## Velocity Distance
## 1 99.08000 195.7500
## 2 99.27857 214.6071
## 3 98.94000 240.0000
## 4 99.48000 226.4500
##
## Clustering vector:
## [1] 2 2 2 1 1 1 4 2 4 4 4 2 1 4 1 2 1 2 2 1 1 2 2 2 4 4 3 2 4 4 3 3 4 3 1
## [36] 2 2 3 1
##
## Within cluster sum of squares by cluster:
## [1] 362.3810 189.0529 108.0320 100.8610
## (between_SS / total_SS = 91.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
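kmeans works with Euclidean distances on the raw columns, so the variable with the larger spread drives the clustering. A rough sketch, using only the rounded sums of squares printed in the two aov tables and the kmeans output above, shows how much of the total variation comes from Distance; the variable names are illustrative.

```r
# Total sum of squares per variable = Country SS + Residuals SS (aov tables)
ss.distance <- 3135 + 5708
ss.velocity <- 4.383 + 13.807
ss.total    <- ss.distance + ss.velocity

# Share of the total variation carried by Distance
share.distance <- ss.distance / ss.total

# Within-cluster sums of squares printed by kmeans
withinss  <- c(362.3810, 189.0529, 108.0320, 100.8610)
explained <- 1 - sum(withinss) / ss.total   # between_SS / total_SS

round(share.distance, 4)
round(explained, 3)
```

Distance accounts for nearly all of the total sum of squares, so the four clusters are essentially bands of Distance. Standardizing the columns with scale() before clustering would give Velocity equal weight, but that is a different analysis from the one shown here.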
plot((ski[,-3]),col=fit.k$cluster)
points(fit.k$centers, col = 1:4, pch = 8)
abline(h = 216)
ski.2<-data.frame(ski,fit.k$cluster)
ski.4<-ski.2[which(ski.2$fit.k.cluster==3),]
plot(ski.4$Velocity,ski.4$Distance,main="Correlation velocity ~ distance",xlab = "Velocity, km/h", ylab = "Distance, m")
cor.test(ski.4$Velocity,ski.4$Distance)
##
## Pearson's product-moment correlation
##
## data: ski.4$Velocity and ski.4$Distance
## t = 4.3045, df = 3, p-value = 0.02308
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2503881 0.9953200
## sample estimates:
## cor
## 0.9277141
fit.lm<-lm(data=ski.4,Distance~Velocity)
summary(fit.lm)
##
## Call:
## lm(formula = Distance ~ Velocity, data = ski.4)
##
## Residuals:
## 27 31 32 34 38
## -2.511 1.574 -1.644 1.403 1.177
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -525.300 177.792 -2.955 0.0598 .
## Velocity 7.735 1.797 4.305 0.0231 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.224 on 3 degrees of freedom
## Multiple R-squared: 0.8607, Adjusted R-squared: 0.8142
## F-statistic: 18.53 on 1 and 3 DF, p-value: 0.02308
abline(fit.lm, col="red")
grid()
ski.4
## Velocity Distance Country fit.k.cluster
## 27 99.2 239.5 NOR 3
## 31 99.9 249.0 NOR 3
## 32 98.7 236.5 GER 3
## 34 98.5 238.0 AUT 3
## 38 98.4 237.0 SLO 3
predict(fit.lm, data.frame(Velocity=100.5),interval = "confidence")
## fit lwr upr
## 1 252.0666 242.6005 261.5326
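The fitted line can also be read the other way round: instead of predicting Distance for a chosen Velocity, solve for the Velocity at which the line reaches a target Distance. The sketch below rebuilds the five rows of ski.4 exactly as printed above, so it runs without ski.dat; the names leaders, fit, b, record and v.record are illustrative.

```r
# The five rows of ski.4 as printed above
leaders <- data.frame(
  Velocity = c(99.2, 99.9, 98.7, 98.5, 98.4),
  Distance = c(239.5, 249.0, 236.5, 238.0, 237.0)
)

# Same model as fit.lm, refitted on the rebuilt data
fit <- lm(Distance ~ Velocity, data = leaders)
b   <- coef(fit)

# Velocity at which the fitted line reaches the 2015 record of 251.5 m
record   <- 251.5
v.record <- unname((record - b["(Intercept)"]) / b["Velocity"])

round(b, 3)
round(v.record, 2)
```

The line reaches 251.5 m at roughly 100.4 km/h. This lies above every velocity in the cluster (98.4 to 99.9 km/h), so it is an extrapolation from five points and should be treated with caution.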
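Velocity and Distance are both consistent with a normal distribution: the Shapiro-Wilk test does not reject normality for either variable (p = 0.177 and p = 0.926).
No significant correlation was found between Velocity and Distance over the whole sample (r = 0.105, p = 0.524).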
The anova (aov) shows no significant difference between Countries in either Distance (p = 0.178) or Velocity (p = 0.555).
Some interesting facts, produced by kmeans, concern the clusters.
Cluster 3 is the cluster of leaders (mean distance 240 m) and shows a strong correlation between Velocity and Distance (r = 0.93, p = 0.023), although it contains only five jumps.
The correlation in the cluster of leaders may be evidence that a jumper who catches the wind has a good chance of winning the competition.
The linear model fit.lm shows significant results.
To beat the 2015 world record of 251.5 meters, one must catch the wind with a velocity of about 100.5 km/h: at that speed the model predicts 252.1 m (95% confidence interval 242.6 to 261.5 m).