Let’s compare height and weight as we will surely find some correlation there!
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 64263 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 64263 rows containing non-finite values (stat_smooth).
## Warning: Removed 64263 rows containing missing values (geom_point).
I mapped color to sex to see the differences between the sexes. Since the regression will most likely differ significantly between the sexes, I added a regression line for each as well; ‘lm’ is the linear model method. The histograms in the margins give a first look at how close each variable is to a normal distribution.
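The code for this plot is not shown, so here is a minimal sketch of how a plot like it could be built. Everything in it is an assumption: `sim` is simulated stand-in data rather than the real `data`, and the ggExtra package with its `ggMarginal()` function is my guess at a convenient way to get the marginal histograms, not necessarily what was used.

```r
library(ggplot2)
library(ggExtra)  # assumption: the post does not name the package behind the margins

# Simulated stand-in for `data` (hypothetical values, not the real athletes)
set.seed(1)
n <- 500
sim <- data.frame(Sex = sample(c("F", "M"), n, replace = TRUE))
sim$Height <- rnorm(n, mean = ifelse(sim$Sex == "M", 179, 168), sd = 7)
sim$Weight <- -100 + 0.97 * sim$Height + 5 * (sim$Sex == "M") + rnorm(n, sd = 8)

# Scatter plot colored by sex, with one linear regression line per sex
p <- ggplot(sim, aes(x = Height, y = Weight, color = Sex)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", formula = y ~ x)

# Add histograms of Height and Weight in the plot margins
p_marginal <- ggMarginal(p, type = "histogram")
```

Passing `formula = y ~ x` explicitly is what silences the "using formula" message seen in the output.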
I will check the normality in another way. QQ-Plots!
# QQ-plot of Height against a normal distribution
ggplot(data, aes(sample = Height)) +
  stat_qq() +
  stat_qq_line()
## Warning: Removed 60171 rows containing non-finite values (stat_qq).
## Warning: Removed 60171 rows containing non-finite values (stat_qq_line).
# QQ-plot of Weight against a normal distribution
ggplot(data, aes(sample = Weight)) +
  stat_qq() +
  stat_qq_line()
## Warning: Removed 62875 rows containing non-finite values (stat_qq).
## Warning: Removed 62875 rows containing non-finite values (stat_qq_line).
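A QQ-plot compares the quantiles of the sample with the quantiles of a normal distribution: the closer the points sit to the line, the closer the data is to normal. Height fits the line very well, but Weight does not! I wonder if it will do better if I look at each sex separately.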
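To show what a QQ-plot measures, here is a small sketch in base R that computes the same two sets of quantiles by hand and summarises "how straight is the line" as a correlation. The samples are simulated and `qq_correlation` is a hypothetical helper I made up for illustration; it is not a formal normality test.

```r
# Simulated samples (hypothetical, not the post's data)
set.seed(42)
normal_sample <- rnorm(1000, mean = 175, sd = 10)          # bell-shaped
skewed_sample <- rlnorm(1000, meanlog = 4.2, sdlog = 0.5)  # right-skewed

qq_correlation <- function(x) {
  x <- sort(x[is.finite(x)])                # drop NA/Inf, then order the sample
  theoretical <- qnorm(ppoints(length(x)))  # normal quantiles at matching probabilities
  cor(theoretical, x)                       # near 1 when the points hug a straight line
}

qq_correlation(normal_sample)  # very close to 1
qq_correlation(skewed_sample)  # noticeably lower
```

Plotting `theoretical` against the sorted sample gives the same picture as `stat_qq()`; the correlation is just a one-number summary of it.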
# QQ-plot of Weight, female athletes only
ggplot(data[which(data$Sex == 'F'), ], aes(sample = Weight)) +
  stat_qq() +
  stat_qq_line()
## Warning: Removed 7751 rows containing non-finite values (stat_qq).
## Warning: Removed 7751 rows containing non-finite values (stat_qq_line).
# QQ-plot of Weight, male athletes only
ggplot(data[which(data$Sex == 'M'), ], aes(sample = Weight)) +
  stat_qq() +
  stat_qq_line()
## Warning: Removed 55124 rows containing non-finite values (stat_qq).
## Warning: Removed 55124 rows containing non-finite values (stat_qq_line).
Both deviate from the line, which suggests that Weight is not normally distributed even within each sex.
I am going to continue anyway, but I should be wary of my statistics, in particular my \(p\) value!
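One way to make that caution concrete: for a linear model, the \(p\) values depend on the residuals being roughly normal rather than on the raw variables, so a QQ-plot of the residuals is the more direct check. Below is a minimal sketch on simulated data; `toy_fit` and all the numbers are hypothetical and only mimic the shape of the problem.

```r
# Simulated data with deliberately skewed noise (hypothetical values)
set.seed(7)
n <- 2000
height <- rnorm(n, mean = 175, sd = 10)
noise  <- rexp(n, rate = 1 / 8) - 8   # right-skewed, centred at zero
weight <- -100 + 0.97 * height + noise

toy_fit <- lm(weight ~ height)
res <- resid(toy_fit)

# QQ-plot of the residuals: a curve away from the line signals non-normal errors
qqnorm(res)
qqline(res)

# shapiro.test() accepts at most 5000 values, so test a random subsample
sw <- shapiro.test(sample(res, 500))
sw$p.value
```

With a very large sample, the coefficient tests tend to be fairly robust to mild non-normality, so a check like this is a reason for care rather than a reason to stop.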
fit <- lm(Weight ~ Height + Sex, data)
summary(fit)
##
## Call:
## lm(formula = Weight ~ Height + Sex, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -51.719  -4.971  -0.932   3.533 134.281
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.036e+02  3.405e-01  -304.3   <2e-16 ***
## Height       9.748e-01  2.019e-03   482.8   <2e-16 ***
## SexM         4.938e+00  4.554e-02   108.4   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.434 on 206850 degrees of freedom
## (64263 observations deleted due to missingness)
## Multiple R-squared: 0.6536, Adjusted R-squared: 0.6536
## F-statistic: 1.952e+05 on 2 and 206850 DF, p-value: < 2.2e-16
All the variables are significant! At the same height, being male adds about 5 kilos, and each additional unit of Height adds about 0.97 kilos.
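To see what those coefficients mean in practice, here is a small worked example that plugs them into the fitted equation by hand. I am assuming Height is recorded in centimetres and Weight in kilograms; `predict_weight` is a hypothetical helper, not part of the analysis.

```r
# Coefficients copied from the summary output above
intercept <- -103.6   # -1.036e+02
b_height  <- 0.9748   # change in Weight per unit of Height
b_male    <- 4.938    # shift applied when Sex == "M"

predict_weight <- function(height, male) {
  intercept + b_height * height + b_male * as.numeric(male)
}

predict_weight(180, male = FALSE)  # 71.864
predict_weight(180, male = TRUE)   # 76.802
```

The two predictions differ by exactly the SexM coefficient, which is what "being male adds about 5 kilos" means: the lines for the two sexes are parallel and shifted by that amount.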
I don’t know about you but I am interested to see if the year makes a difference. Maybe we are breeding larger athletes? I’ll tack on sport too!
fit <- lm(Weight ~ Height + Sex + Year + Sport, data)
summary(fit)
##
## Call:
## lm(formula = Weight ~ Height + Sex + Year + Sport, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -50.879  -4.286  -0.540   3.381 125.202
##
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)                    -1.076e+02  1.752e+00 -61.397  < 2e-16 ***
## Height                          1.008e+00  2.303e-03 437.638  < 2e-16 ***
## SexM                            3.691e+00  4.676e-02  78.931  < 2e-16 ***
## Year                            1.302e-03  8.901e-04   1.463  0.14349
## SportArchery                   -1.597e+00  1.966e-01  -8.125 4.51e-16 ***
## SportArt Competitions           1.787e+00  1.420e+00   1.259  0.20809
## SportAthletics                 -5.852e+00  1.033e-01 -56.642  < 2e-16 ***
## SportBadminton                 -4.364e+00  2.239e-01 -19.489  < 2e-16 ***
## SportBaseball                   2.952e+00  2.749e-01  10.738  < 2e-16 ***
## SportBasketball                -4.434e+00  1.602e-01 -27.685  < 2e-16 ***
## SportBeach Volleyball          -5.499e+00  3.378e-01 -16.276  < 2e-16 ***
## SportBiathlon                  -6.073e+00  1.451e-01 -41.853  < 2e-16 ***
## SportBobsleigh                  7.640e+00  1.858e-01  41.111  < 2e-16 ***
## SportBoxing                    -7.565e+00  1.479e-01 -51.138  < 2e-16 ***
## SportCanoeing                  -1.331e+00  1.380e-01  -9.644  < 2e-16 ***
## SportCross Country Skiing      -5.952e+00  1.277e-01 -46.628  < 2e-16 ***
## SportCurling                   -2.879e-01  3.819e-01  -0.754  0.45091
## SportCycling                   -5.563e+00  1.269e-01 -43.840  < 2e-16 ***
## SportDiving                    -4.338e+00  1.921e-01 -22.576  < 2e-16 ***
## SportEquestrianism             -5.744e+00  1.456e-01 -39.439  < 2e-16 ***
## SportFencing                   -5.011e+00  1.325e-01 -37.825  < 2e-16 ***
## SportFigure Skating            -6.766e+00  2.146e-01 -31.524  < 2e-16 ***
## SportFootball                  -4.233e+00  1.459e-01 -29.017  < 2e-16 ***
## SportFreestyle Skiing          -2.268e+00  2.655e-01  -8.543  < 2e-16 ***
## SportGolf                      -1.091e+00  7.272e-01  -1.500  0.13360
## SportGymnastics                -4.423e+00  1.118e-01 -39.553  < 2e-16 ***
## SportHandball                  -6.612e-01  1.617e-01  -4.089 4.33e-05 ***
## SportHockey                    -3.100e+00  1.484e-01 -20.890  < 2e-16 ***
## SportIce Hockey                 2.398e+00  1.457e-01  16.459  < 2e-16 ***
## SportJudo                       5.619e+00  1.596e-01  35.194  < 2e-16 ***
## SportLacrosse                  -7.472e+00  5.298e+00  -1.410  0.15841
## SportLuge                       1.461e+00  2.228e-01   6.560 5.41e-11 ***
## SportModern Pentathlon         -7.451e+00  2.313e-01 -32.218  < 2e-16 ***
## SportMotorboating              -4.033e+00  7.491e+00  -0.538  0.59030
## SportNordic Combined           -9.725e+00  2.485e-01 -39.130  < 2e-16 ***
## SportRhythmic Gymnastics       -1.540e+01  3.171e-01 -48.544  < 2e-16 ***
## SportRowing                    -3.253e+00  1.286e-01 -25.291  < 2e-16 ***
## SportRugby                      3.876e+00  1.372e+00   2.824  0.00474 **
## SportRugby Sevens               5.328e+00  4.452e-01  11.969  < 2e-16 ***
## SportSailing                   -1.749e+00  1.431e-01 -12.226  < 2e-16 ***
## SportShooting                   1.261e+00  1.289e-01   9.784  < 2e-16 ***
## SportShort Track Speed Skating -4.036e+00  2.161e-01 -18.675  < 2e-16 ***
## SportSkeleton                  -8.040e-01  5.758e-01  -1.396  0.16258
## SportSki Jumping               -1.147e+01  1.939e-01 -59.146  < 2e-16 ***
## SportSnowboarding              -1.970e+00  2.641e-01  -7.460 8.68e-14 ***
## SportSoftball                   1.670e+00  3.696e-01   4.518 6.25e-06 ***
## SportSpeed Skating             -2.520e+00  1.470e-01 -17.149  < 2e-16 ***
## SportSwimming                  -6.463e+00  1.096e-01 -58.995  < 2e-16 ***
## SportSynchronized Swimming     -8.999e+00  2.747e-01 -32.764  < 2e-16 ***
## SportTable Tennis              -4.640e+00  2.008e-01 -23.105  < 2e-16 ***
## SportTaekwondo                 -6.996e+00  3.213e-01 -21.775  < 2e-16 ***
## SportTennis                    -6.481e+00  1.924e-01 -33.681  < 2e-16 ***
## SportTrampolining              -5.433e+00  6.212e-01  -8.746  < 2e-16 ***
## SportTriathlon                 -1.010e+01  3.402e-01 -29.684  < 2e-16 ***
## SportTug-Of-War                 8.862e+00  1.602e+00   5.534 3.14e-08 ***
## SportVolleyball                -6.586e+00  1.644e-01 -40.069  < 2e-16 ***
## SportWater Polo                 2.619e-02  1.730e-01   0.151  0.87968
## SportWeightlifting              1.226e+01  1.674e-01  73.219  < 2e-16 ***
## SportWrestling                  3.266e+00  1.409e-01  23.179  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.49 on 206794 degrees of freedom
## (64263 observations deleted due to missingness)
## Multiple R-squared: 0.7269, Adjusted R-squared: 0.7268
## F-statistic: 9489 on 58 and 206794 DF, p-value: < 2.2e-16
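Year is not statistically significant (\(p \approx 0.14\)). Interestingly, some sports make a big difference and a few don’t matter! Keep in mind that each sport coefficient is measured relative to the reference sport that R leaves out of the table, so “doesn’t matter” really means “not distinguishable from that baseline sport”. Can you predict which ones will matter and why?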
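Two things are easy to miss when reading a long list of factor coefficients: whether the factor matters as a whole, and which level is the baseline. Here is a minimal sketch of both, on simulated data with three made-up sport effects; `toy`, `sport_effect` and the numbers are hypothetical and not taken from the real dataset.

```r
# Simulated data: three sports with different typical weight offsets (hypothetical)
set.seed(3)
n <- 600
toy <- data.frame(
  Height = rnorm(n, mean = 175, sd = 10),
  Sport  = factor(sample(c("Gymnastics", "Rowing", "Weightlifting"), n, replace = TRUE))
)
sport_effect <- c(Gymnastics = -5, Rowing = 0, Weightlifting = 12)
toy$Weight <- -100 + 0.97 * toy$Height +
  unname(sport_effect[as.character(toy$Sport)]) + rnorm(n, sd = 7)

fit_small <- lm(Weight ~ Height, data = toy)
fit_full  <- lm(Weight ~ Height + Sport, data = toy)

# One F-test for whether Sport as a whole improves the model
cmp <- anova(fit_small, fit_full)
cmp

# Coefficients are differences from the reference level (first level: Gymnastics)
coef(fit_full)

# Changing the reference level changes which sport is left out of the table
toy$Sport <- relevel(toy$Sport, ref = "Rowing")
fit_releveled <- lm(Weight ~ Height + Sport, data = toy)
coef(fit_releveled)
```

The fitted values are the same before and after `relevel()`; only the comparison each coefficient expresses changes, which is why a sport can look "insignificant" against one baseline and significant against another.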