Scatter Plot

Let’s compare height and weight, as we will surely find some correlation there!

## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 64263 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 64263 rows containing non-finite values (stat_smooth).
## Warning: Removed 64263 rows containing missing values (geom_point).

I mapped color to sex to see the differences between the sexes. Since the regression will most likely differ between the sexes, I added a regression line for each; ‘lm’ fits a linear model. The histograms on the margins give a first look at the normality of the data.
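The code for this plot is not shown, so here is a minimal sketch of how such a figure could be built. It is not the original code: `toy` is a small made-up data frame standing in for the real data, and using `ggMarginal()` from the ggExtra package for the marginal histograms is an assumption.

```r
library(ggplot2)

# Made-up stand-in for the real data (hypothetical values, not the real athletes)
set.seed(1)
toy <- data.frame(
  Sex    = rep(c("F", "M"), each = 100),
  Height = c(rnorm(100, 168, 7), rnorm(100, 179, 8))
)
toy$Weight <- -100 + 0.95 * toy$Height + 5 * (toy$Sex == "M") + rnorm(200, 0, 6)

# Scatter plot colored by sex, with one linear fit per sex
p <- ggplot(toy, aes(x = Height, y = Weight, color = Sex)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm")

# Assumption: marginal histograms added with ggExtra
ggExtra::ggMarginal(p, type = "histogram")
```

Because `color = Sex` is set in the top-level `aes()`, `geom_smooth()` fits a separate line for each sex.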

I will check the normality in another way. QQ-Plots!

ggplot(data, aes(sample = Height)) + 
  stat_qq() +
  stat_qq_line()
## Warning: Removed 60171 rows containing non-finite values (stat_qq).
## Warning: Removed 60171 rows containing non-finite values (stat_qq_line).

ggplot(data, aes(sample = Weight)) + 
  stat_qq() +
  stat_qq_line()
## Warning: Removed 62875 rows containing non-finite values (stat_qq).
## Warning: Removed 62875 rows containing non-finite values (stat_qq_line).

A QQ-Plot compares the quantiles of the sample with those of a normal distribution: the closer the points lie to the line, the closer the data are to normal. Height fits the line very well but Weight does not! I wonder if Weight does better when I look at each sex separately.

ggplot(data[which(data$Sex == "F"), ], aes(sample = Weight)) +
  stat_qq() +
  stat_qq_line()
## Warning: Removed 7751 rows containing non-finite values (stat_qq).
## Warning: Removed 7751 rows containing non-finite values (stat_qq_line).

ggplot(data[which(data$Sex == "M"), ], aes(sample = Weight)) +
  stat_qq() +
  stat_qq_line()
## Warning: Removed 55124 rows containing non-finite values (stat_qq).
## Warning: Removed 55124 rows containing non-finite values (stat_qq_line).

Both deviate from the line. That suggests that Weight is not normally distributed, even within each sex.

I am going to continue anyway, but I should be wary of my statistics, in particular my \(p\)-values!
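Strictly speaking, what a linear model assumes to be normal is its residuals, not the raw variables, so a QQ-Plot of the residuals is a useful complementary check. Here is a minimal sketch on made-up data; `toy`, `toy_fit` and `res` are hypothetical names, and the noise is deliberately skewed so the deviation shows up.

```r
library(ggplot2)

# Made-up data with skewed (exponential) noise, so the residuals are non-normal
set.seed(42)
toy <- data.frame(Height = rnorm(500, 175, 10))
toy$Weight <- -100 + toy$Height + rexp(500, rate = 1 / 8)

toy_fit <- lm(Weight ~ Height, data = toy)
res <- resid(toy_fit)

# Same QQ-Plot recipe as before, applied to the residuals
ggplot(data.frame(res = res), aes(sample = res)) +
  stat_qq() +
  stat_qq_line()
```

If the points bend away from the line in the tails, the standard errors and \(p\)-values deserve the extra caution.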

fit <- lm(Weight ~ Height + Sex, data)
summary(fit)
## 
## Call:
## lm(formula = Weight ~ Height + Sex, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.719  -4.971  -0.932   3.533 134.281 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.036e+02  3.405e-01  -304.3   <2e-16 ***
## Height       9.748e-01  2.019e-03   482.8   <2e-16 ***
## SexM         4.938e+00  4.554e-02   108.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.434 on 206850 degrees of freedom
##   (64263 observations deleted due to missingness)
## Multiple R-squared:  0.6536, Adjusted R-squared:  0.6536 
## F-statistic: 1.952e+05 on 2 and 206850 DF,  p-value: < 2.2e-16

All the variables are statistically significant! At a given height, being male adds about 5 kilos, and each extra unit of height adds about 1 kilo!
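To make the coefficients concrete, here is a small worked example that plugs them into the model equation by hand. The coefficient values are copied from the summary output above; the height of 180 is an arbitrary illustrative value, and the variable names are made up for this sketch.

```r
# Coefficients copied from the summary(fit) output
b0       <- -1.036e+02  # (Intercept)
b_height <-  9.748e-01  # Height
b_male   <-  4.938e+00  # SexM

h <- 180  # hypothetical height, in the units of the Height column

# Predicted weight = intercept + slope * height (+ male offset)
w_female <- b0 + b_height * h
w_male   <- w_female + b_male

c(female = w_female, male = w_male)
```

At this height the model predicts roughly 71.9 kilos for a female athlete and 76.8 kilos for a male one; the gap between them is exactly the SexM coefficient.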

I don’t know about you but I am interested to see if the year makes a difference. Maybe we are breeding larger athletes? I’ll tack on sport too!

fit <- lm(Weight ~ Height + Sex + Year + Sport, data)
summary(fit)
## 
## Call:
## lm(formula = Weight ~ Height + Sex + Year + Sport, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.879  -4.286  -0.540   3.381 125.202 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -1.076e+02  1.752e+00 -61.397  < 2e-16 ***
## Height                          1.008e+00  2.303e-03 437.638  < 2e-16 ***
## SexM                            3.691e+00  4.676e-02  78.931  < 2e-16 ***
## Year                            1.302e-03  8.901e-04   1.463  0.14349    
## SportArchery                   -1.597e+00  1.966e-01  -8.125 4.51e-16 ***
## SportArt Competitions           1.787e+00  1.420e+00   1.259  0.20809    
## SportAthletics                 -5.852e+00  1.033e-01 -56.642  < 2e-16 ***
## SportBadminton                 -4.364e+00  2.239e-01 -19.489  < 2e-16 ***
## SportBaseball                   2.952e+00  2.749e-01  10.738  < 2e-16 ***
## SportBasketball                -4.434e+00  1.602e-01 -27.685  < 2e-16 ***
## SportBeach Volleyball          -5.499e+00  3.378e-01 -16.276  < 2e-16 ***
## SportBiathlon                  -6.073e+00  1.451e-01 -41.853  < 2e-16 ***
## SportBobsleigh                  7.640e+00  1.858e-01  41.111  < 2e-16 ***
## SportBoxing                    -7.565e+00  1.479e-01 -51.138  < 2e-16 ***
## SportCanoeing                  -1.331e+00  1.380e-01  -9.644  < 2e-16 ***
## SportCross Country Skiing      -5.952e+00  1.277e-01 -46.628  < 2e-16 ***
## SportCurling                   -2.879e-01  3.819e-01  -0.754  0.45091    
## SportCycling                   -5.563e+00  1.269e-01 -43.840  < 2e-16 ***
## SportDiving                    -4.338e+00  1.921e-01 -22.576  < 2e-16 ***
## SportEquestrianism             -5.744e+00  1.456e-01 -39.439  < 2e-16 ***
## SportFencing                   -5.011e+00  1.325e-01 -37.825  < 2e-16 ***
## SportFigure Skating            -6.766e+00  2.146e-01 -31.524  < 2e-16 ***
## SportFootball                  -4.233e+00  1.459e-01 -29.017  < 2e-16 ***
## SportFreestyle Skiing          -2.268e+00  2.655e-01  -8.543  < 2e-16 ***
## SportGolf                      -1.091e+00  7.272e-01  -1.500  0.13360    
## SportGymnastics                -4.423e+00  1.118e-01 -39.553  < 2e-16 ***
## SportHandball                  -6.612e-01  1.617e-01  -4.089 4.33e-05 ***
## SportHockey                    -3.100e+00  1.484e-01 -20.890  < 2e-16 ***
## SportIce Hockey                 2.398e+00  1.457e-01  16.459  < 2e-16 ***
## SportJudo                       5.619e+00  1.596e-01  35.194  < 2e-16 ***
## SportLacrosse                  -7.472e+00  5.298e+00  -1.410  0.15841    
## SportLuge                       1.461e+00  2.228e-01   6.560 5.41e-11 ***
## SportModern Pentathlon         -7.451e+00  2.313e-01 -32.218  < 2e-16 ***
## SportMotorboating              -4.033e+00  7.491e+00  -0.538  0.59030    
## SportNordic Combined           -9.725e+00  2.485e-01 -39.130  < 2e-16 ***
## SportRhythmic Gymnastics       -1.540e+01  3.171e-01 -48.544  < 2e-16 ***
## SportRowing                    -3.253e+00  1.286e-01 -25.291  < 2e-16 ***
## SportRugby                      3.876e+00  1.372e+00   2.824  0.00474 ** 
## SportRugby Sevens               5.328e+00  4.452e-01  11.969  < 2e-16 ***
## SportSailing                   -1.749e+00  1.431e-01 -12.226  < 2e-16 ***
## SportShooting                   1.261e+00  1.289e-01   9.784  < 2e-16 ***
## SportShort Track Speed Skating -4.036e+00  2.161e-01 -18.675  < 2e-16 ***
## SportSkeleton                  -8.040e-01  5.758e-01  -1.396  0.16258    
## SportSki Jumping               -1.147e+01  1.939e-01 -59.146  < 2e-16 ***
## SportSnowboarding              -1.970e+00  2.641e-01  -7.460 8.68e-14 ***
## SportSoftball                   1.670e+00  3.696e-01   4.518 6.25e-06 ***
## SportSpeed Skating             -2.520e+00  1.470e-01 -17.149  < 2e-16 ***
## SportSwimming                  -6.463e+00  1.096e-01 -58.995  < 2e-16 ***
## SportSynchronized Swimming     -8.999e+00  2.747e-01 -32.764  < 2e-16 ***
## SportTable Tennis              -4.640e+00  2.008e-01 -23.105  < 2e-16 ***
## SportTaekwondo                 -6.996e+00  3.213e-01 -21.775  < 2e-16 ***
## SportTennis                    -6.481e+00  1.924e-01 -33.681  < 2e-16 ***
## SportTrampolining              -5.433e+00  6.212e-01  -8.746  < 2e-16 ***
## SportTriathlon                 -1.010e+01  3.402e-01 -29.684  < 2e-16 ***
## SportTug-Of-War                 8.862e+00  1.602e+00   5.534 3.14e-08 ***
## SportVolleyball                -6.586e+00  1.644e-01 -40.069  < 2e-16 ***
## SportWater Polo                 2.619e-02  1.730e-01   0.151  0.87968    
## SportWeightlifting              1.226e+01  1.674e-01  73.219  < 2e-16 ***
## SportWrestling                  3.266e+00  1.409e-01  23.179  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.49 on 206794 degrees of freedom
##   (64263 observations deleted due to missingness)
## Multiple R-squared:  0.7269, Adjusted R-squared:  0.7268 
## F-statistic:  9489 on 58 and 206794 DF,  p-value: < 2.2e-16

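Year is not statistically significant (\(p \approx 0.14\)). Interestingly, some sports make a big difference relative to the baseline sport, while a few (Curling, Golf and Water Polo, for example) do not differ significantly from it! Can you predict which ones will matter and why?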
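One thing to keep in mind when reading the Sport rows: with R's default treatment contrasts, each coefficient is measured against the factor's reference level, which is the first level and gets no row of its own. The sketch below shows this on made-up data with three sports; `toy`, `shift`, `fit_a` and `fit_b` are hypothetical names, and the effect sizes are invented.

```r
# Made-up data: three sports with invented weight offsets
set.seed(7)
toy <- data.frame(
  Sport  = factor(rep(c("Archery", "Judo", "Rowing"), each = 50)),
  Height = rnorm(150, 175, 8)
)
shift <- c(Archery = 0, Judo = 6, Rowing = -3)
toy$Weight <- -100 + toy$Height +
  unname(shift[as.character(toy$Sport)]) + rnorm(150, 0, 5)

# Default: the first level ("Archery") is the baseline and has no coefficient
fit_a <- lm(Weight ~ Height + Sport, data = toy)
names(coef(fit_a))

# Changing the baseline changes which comparisons the table reports
toy$Sport <- relevel(toy$Sport, ref = "Rowing")
fit_b <- lm(Weight ~ Height + Sport, data = toy)
names(coef(fit_b))
```

Both fits describe the same model and give the same fitted values; only the comparisons shown in the coefficient table change. So a sport that "doesn't matter" is one that is not significantly different from the baseline, not necessarily one with average weights.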