The purpose of this analysis is to demonstrate three concepts for my physiology of exercise class related to the vertical jump test.
It is important to analyze vertical jump in terms of power, not inches or centimeters, because the same jump height corresponds to very different power outputs depending on the subject's body mass.
Scatter plots allow us to visualize bivariate distributions, and a regression model helps us identify sources of that variation - in essence, to “explain variance”.
We come full circle and see that if we start with an equation to calculate a dependent variable (in this case power), and then use the variables from that equation in a regression analysis to predict that dependent variable, we explain all of the variance: R^2 = 1.0, or very near to it, with any shortfall likely due to rounding error. This assumes the equation is linear and you run a linear model.
For this demonstration we are going to look at the NFL Combine Data (Freely available online): http://nflcombineresults.com/nflcombinedata.php
Prior to any of this work I had to scrape the data from the web page (that code is separate from what is shown here, to save space and time). If anyone is interested in that code, just let me know. Here I start with a CSV file created from the XML web data.
For anyone interested, I have left the “echo” option set to “TRUE” so the R code used for this analysis is visible. If you are not interested, you can simply ignore the grey code boxes.
setwd("~/physexlab/NFLcombineData")
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(gtools)
nfldata <- read.csv("data_cleaned.csv")
# Convert to SI units, then compute BMI and both power estimates (watts)
nfldata <- mutate(nfldata,
  Weight_kg   = Weight_lbs / 2.2,       # approximate lb -> kg conversion
  Height_m    = Height_in * 0.0254,     # in -> m
  BMI         = Weight_kg / Height_m^2,
  VertJump_m  = VertLeap_in * 0.0254,   # in -> m
  LewisPower  = sqrt(4.9) * Weight_kg * 9.81 * sqrt(VertJump_m),
  SayersPower = 60.7 * (VertJump_m * 100) + 45.3 * Weight_kg - 2055
)
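To make the two power equations concrete, here is a minimal sketch that applies them to two made-up athletes who jump the same height but differ in body mass. The function names `lewis_power` and `sayers_power` and the example masses are illustrative assumptions, not part of the original analysis; the formulas themselves are the ones used in the `mutate` calls above.

```r
# Hypothetical helper functions (names are illustrative only).
lewis_power <- function(mass_kg, jump_m) {
  # Lewis: sqrt(4.9) * mass (kg) * 9.81 * sqrt(jump height in m)
  sqrt(4.9) * mass_kg * 9.81 * sqrt(jump_m)
}

sayers_power <- function(mass_kg, jump_m) {
  # Sayers: 60.7 * jump height (cm) + 45.3 * mass (kg) - 2055
  60.7 * (jump_m * 100) + 45.3 * mass_kg - 2055
}

# Two made-up athletes with the same 30 in jump but different body mass.
jump_m  <- 30 * 0.0254
mass_kg <- c(light = 85, heavy = 140)

lewis  <- lewis_power(mass_kg, jump_m)
sayers <- sayers_power(mass_kg, jump_m)

# One row per equation, one column per athlete (values in watts).
print(round(rbind(lewis, sayers)))
```

With jump height held fixed, Lewis power scales in direct proportion to mass, while Sayers power rises by 45.3 W for every additional kilogram.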
Note that in both of these scatter plots there is substantial variation: for any particular jump height, players vary widely in power.
qplot(VertLeap_in, SayersPower, data = nfldata)
## Warning: Removed 1053 rows containing missing values (geom_point).
qplot(VertLeap_in, LewisPower, data = nfldata)
## Warning: Removed 1053 rows containing missing values (geom_point).
In this sample of NFL Combine participants (n = 5394 between 1999 - 2015) there is substantial variability in body mass. Also note that the distribution is multimodal (there are several peaks).
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
To see where this variation comes from, we can first color each point in the bivariate distribution by body mass. It is clear from these plots that weight has a major influence (not surprising given the equations and, well, physics).
qplot(VertLeap_in, SayersPower, data = nfldata, color = nfldata$Weight_kg)
## Warning: Removed 1053 rows containing missing values (geom_point).
qplot(VertLeap_in, LewisPower, data = nfldata, color = nfldata$Weight_kg)
## Warning: Removed 1053 rows containing missing values (geom_point).
To make what is happening even clearer, we can create a factor (categorical) variable based on the weight distribution. The code below breaks weight into 4 categories at the quartiles of the distribution.
nfldata$wt_quantiles<-quantcut(nfldata$Weight_kg, q=seq(0,1,by=0.25), na.rm=TRUE)
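For readers without `gtools`, the same idea can be sketched in base R by cutting at the sample quartiles. This is an approximation of what `quantcut` does, not its exact implementation, and the weights below are simulated rather than taken from the Combine data.

```r
set.seed(1)

# Made-up body masses (kg), for illustration only.
weight_kg <- rnorm(200, mean = 110, sd = 20)

# Break points at the 0%, 25%, 50%, 75% and 100% quantiles.
breaks <- quantile(weight_kg, probs = seq(0, 1, by = 0.25))

# include.lowest = TRUE keeps the minimum value inside the first bin.
wt_group <- cut(weight_kg, breaks = breaks, include.lowest = TRUE)

# Each of the four groups should hold about a quarter of the sample.
print(table(wt_group))
```

Because the break points come from the data, the groups have equal counts rather than equal widths, which is why the bins follow the shape of the histogram.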
We can see that this process does a nice job of partitioning the histogram from earlier into 4 discrete categories of weight.
qplot(Weight_kg, data=nfldata, geom="histogram",color=nfldata$wt_quantiles)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Now we can repeat the scatterplots (bivariate distributions) with the points colored by these discrete body mass groups:
qplot(VertLeap_in, SayersPower, data = nfldata, color = nfldata$wt_quantiles)
## Warning: Removed 1053 rows containing missing values (geom_point).
qplot(VertLeap_in, LewisPower, data = nfldata, color = nfldata$wt_quantiles)
## Warning: Removed 1053 rows containing missing values (geom_point).
It is clear from both of the above graphics that body mass has a large impact, and for the most part it simply partitions the data into four separate positive relationships. Fitting a separate line to each weight group makes this explicit:
qplot(y=SayersPower, x=VertLeap_in, col=factor(wt_quantiles), data=nfldata) + stat_smooth(method=lm, formula=y~x)
## Warning: Removed 275 rows containing missing values (stat_smooth).
## Warning: Removed 239 rows containing missing values (stat_smooth).
## Warning: Removed 263 rows containing missing values (stat_smooth).
## Warning: Removed 276 rows containing missing values (stat_smooth).
## Warning: Removed 1053 rows containing missing values (geom_point).
qplot(y=LewisPower, x=VertLeap_in, col=factor(wt_quantiles), data=nfldata) + stat_smooth(method=lm, formula=y~x)
## Warning: Removed 275 rows containing missing values (stat_smooth).
## Warning: Removed 239 rows containing missing values (stat_smooth).
## Warning: Removed 263 rows containing missing values (stat_smooth).
## Warning: Removed 276 rows containing missing values (stat_smooth).
## Warning: Removed 1053 rows containing missing values (geom_point).
This model relates to Figure 1.
fitSayers<-lm(formula = SayersPower ~ VertLeap_in, data = nfldata)
summary(fitSayers)
##
## Call:
## lm(formula = SayersPower ~ VertLeap_in, data = nfldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2082.71 -541.86 -9.71 529.65 2273.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7656.220 85.939 89.089 < 2e-16 ***
## VertLeap_in 12.081 2.593 4.658 3.28e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 718.3 on 4290 degrees of freedom
## (1053 observations deleted due to missingness)
## Multiple R-squared: 0.005033, Adjusted R-squared: 0.004801
## F-statistic: 21.7 on 1 and 4290 DF, p-value: 3.283e-06
This model relates to Figure 7.
fitSayersWeight<-lm(formula = SayersPower ~ VertLeap_in + wt_quantiles, data = nfldata)
summary(fitSayersWeight)
##
## Call:
## lm(formula = SayersPower ~ VertLeap_in + wt_quantiles, data = nfldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -807.22 -167.62 -21.42 152.52 1477.39
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2385.684 39.215 60.84 <2e-16 ***
## VertLeap_in 141.553 1.086 130.40 <2e-16 ***
## wt_quantiles(94.1,107] 527.704 10.004 52.75 <2e-16 ***
## wt_quantiles(107,130] 1234.166 10.241 120.52 <2e-16 ***
## wt_quantiles(130,175] 2333.012 12.638 184.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 230.4 on 4287 degrees of freedom
## (1053 observations deleted due to missingness)
## Multiple R-squared: 0.8977, Adjusted R-squared: 0.8976
## F-statistic: 9404 on 4 and 4287 DF, p-value: < 2.2e-16
This model relates to Figure 2.
fitLewis<-lm(formula = LewisPower ~ VertLeap_in, data = nfldata)
summary(fitLewis)
##
## Call:
## lm(formula = LewisPower ~ VertLeap_in, data = nfldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -857.19 -234.95 -13.89 221.78 970.91
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3037.882 36.829 82.49 <2e-16 ***
## VertLeap_in -25.922 1.111 -23.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 307.8 on 4290 degrees of freedom
## (1053 observations deleted due to missingness)
## Multiple R-squared: 0.1125, Adjusted R-squared: 0.1123
## F-statistic: 544.1 on 1 and 4290 DF, p-value: < 2.2e-16
This model relates to Figure 8.
fitLewisWeight<-lm(formula = LewisPower ~ VertLeap_in + wt_quantiles, data = nfldata)
summary(fitLewisWeight)
##
## Call:
## lm(formula = LewisPower ~ VertLeap_in + wt_quantiles, data = nfldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -358.17 -74.20 -8.18 69.24 579.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 796.5314 16.9507 46.99 <2e-16 ***
## VertLeap_in 28.6907 0.4692 61.14 <2e-16 ***
## wt_quantiles(94.1,107] 245.6157 4.3242 56.80 <2e-16 ***
## wt_quantiles(107,130] 560.4090 4.4266 126.60 <2e-16 ***
## wt_quantiles(130,175] 994.2527 5.4625 182.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 99.6 on 4287 degrees of freedom
## (1053 observations deleted due to missingness)
## Multiple R-squared: 0.9072, Adjusted R-squared: 0.9071
## F-statistic: 1.047e+04 on 4 and 4287 DF, p-value: < 2.2e-16
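Comparing the two Lewis fits above, the VertLeap_in slope changes from -25.922 to 28.6907 and R^2 rises from 0.1125 to 0.9072 once the weight groups are added. One possible reading, offered as an assumption rather than something shown directly here, is that heavier players tend to jump lower, so the pooled slope mixes the effect of jump height with the effect of mass. A minimal sketch on made-up data (the negative mass-jump relationship is built in by hand) reproduces that pattern:

```r
set.seed(7)
n <- 400

sim <- data.frame(Weight_kg = runif(n, 75, 150))  # made-up body masses (kg)

# Assumption for illustration only: heavier players jump somewhat lower.
sim$VertLeap_in <- 45 - 0.12 * sim$Weight_kg + rnorm(n, sd = 3)

# Lewis power computed exactly as in the analysis above.
sim$LewisPower <- sqrt(4.9) * sim$Weight_kg * 9.81 *
  sqrt(sim$VertLeap_in * 0.0254)

# Quartile weight groups, built with base R instead of gtools::quantcut.
sim$wt_group <- cut(sim$Weight_kg,
                    breaks = quantile(sim$Weight_kg, seq(0, 1, by = 0.25)),
                    include.lowest = TRUE)

pooled   <- lm(LewisPower ~ VertLeap_in, data = sim)
adjusted <- lm(LewisPower ~ VertLeap_in + wt_group, data = sim)

# Slope for jump height without and with the weight groups in the model.
slope_pooled   <- coef(pooled)[["VertLeap_in"]]
slope_adjusted <- coef(adjusted)[["VertLeap_in"]]
print(c(pooled = slope_pooled, adjusted = slope_adjusted))

# Share of variance explained by each model.
print(c(pooled = summary(pooled)$r.squared,
        adjusted = summary(adjusted)$r.squared))
```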
Of course, power in the Lewis and Sayers equations is calculated from jump height and weight in kg. So if we simply add weight in kg as a continuous variable, we explain essentially all of the variability, because in the equations we started with, jump height and weight are the only determinants.
This model relates to Figure 4.
Controlling for weight as a continuous variable - R^2 = 1.0
fitSayersWeightKg<-lm(formula = SayersPower ~ VertLeap_in + Weight_kg, data = nfldata)
summary(fitSayersWeightKg)
##
## Call:
## lm(formula = SayersPower ~ VertLeap_in + Weight_kg, data = nfldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.770e-12 -5.800e-13 -1.400e-13 2.400e-13 4.477e-10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.055e+03 2.188e-12 -9.391e+14 <2e-16 ***
## VertLeap_in 1.542e+02 4.317e-14 3.571e+15 <2e-16 ***
## Weight_kg 4.530e+01 8.831e-15 5.130e+15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.172e-12 on 4289 degrees of freedom
## (1053 observations deleted due to missingness)
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.322e+31 on 2 and 4289 DF, p-value: < 2.2e-16
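The coefficients in this output are the Sayers equation itself: the intercept is -2055, the Weight_kg coefficient is 45.3, and the VertLeap_in coefficient is 60.7 per cm converted to inches (60.7 × 2.54 = 154.178). A minimal sketch on simulated data (the ranges for jump height and mass are made up) shows a linear model recovering the constants of a linear equation it was generated from:

```r
set.seed(42)
n <- 500

sim <- data.frame(
  VertLeap_in = runif(n, 20, 45),   # made-up jump heights (in)
  Weight_kg   = runif(n, 70, 160)   # made-up body masses (kg)
)

# Sayers power, with jump height converted from inches to cm.
sim$SayersPower <- 60.7 * (sim$VertLeap_in * 0.0254 * 100) +
  45.3 * sim$Weight_kg - 2055

fit <- lm(SayersPower ~ VertLeap_in + Weight_kg, data = sim)
print(coef(fit))

# R^2 computed by hand from the residuals.
ss_res <- sum(resid(fit)^2)
ss_tot <- sum((sim$SayersPower - mean(sim$SayersPower))^2)
r2 <- 1 - ss_res / ss_tot
print(r2)
```

The residuals are not exactly zero only because of floating-point rounding, which matches the tiny residuals (on the order of 1e-12) reported above.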
This model relates to Figure 5.
Controlling for weight as a continuous variable - R^2 = 0.9916
fitLewisWeightKg<-lm(formula = LewisPower ~ VertLeap_in + Weight_kg, data = nfldata)
summary(fitLewisWeightKg)
##
## Call:
## lm(formula = LewisPower ~ VertLeap_in + Weight_kg, data = nfldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -194.519 -14.979 8.258 21.808 65.368
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.104e+03 7.150e+00 -154.4 <2e-16 ***
## VertLeap_in 3.468e+01 1.411e-01 245.8 <2e-16 ***
## Weight_kg 1.932e+01 2.886e-02 669.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.97 on 4289 degrees of freedom
## (1053 observations deleted due to missingness)
## Multiple R-squared: 0.9916, Adjusted R-squared: 0.9916
## F-statistic: 2.528e+05 on 2 and 4289 DF, p-value: < 2.2e-16
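The R^2 of 0.9916 rather than 1.0 is consistent with the Lewis equation not being linear in its inputs: power is mass multiplied by the square root of jump height, so an additive linear model can only approximate it. Taking logs turns the product into a sum, and a linear model on the log scale should then fit exactly. A minimal sketch on simulated data (the ranges are made up, and the log-log model is an illustration, not part of the original analysis):

```r
set.seed(42)
n <- 500

sim <- data.frame(
  VertLeap_in = runif(n, 20, 45),   # made-up jump heights (in)
  Weight_kg   = runif(n, 70, 160)   # made-up body masses (kg)
)
sim$LewisPower <- sqrt(4.9) * sim$Weight_kg * 9.81 *
  sqrt(sim$VertLeap_in * 0.0254)

# Helper: R^2 from observed values and residuals.
r_squared <- function(y, res) 1 - sum(res^2) / sum((y - mean(y))^2)

# Additive model on the raw scale: close to, but not exactly, 1.
fit_linear <- lm(LewisPower ~ VertLeap_in + Weight_kg, data = sim)
r2_linear  <- r_squared(sim$LewisPower, resid(fit_linear))

# Log-log model: log(P) = const + 0.5 * log(jump) + 1 * log(mass).
fit_log <- lm(log(LewisPower) ~ log(VertLeap_in) + log(Weight_kg), data = sim)
r2_log  <- r_squared(log(sim$LewisPower), resid(fit_log))

print(c(linear = r2_linear, loglog = r2_log))
print(coef(fit_log))
```

On the log scale the slopes are the exponents in the equation: 0.5 for jump height and 1 for body mass.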
It is important to analyze vertical jump in terms of power, not inches or centimeters, because the same jump height corresponds to very different power outputs depending on the subject's body mass.
Scatter plots allow us to visualize bivariate distributions, and a regression model helps us identify sources of that variation - in essence, to “explain variance”.
We come full circle and see that if we start with an equation to calculate a dependent variable (in this case power), and then use the variables from that equation in a regression analysis to predict that dependent variable, we explain all of the variance: R^2 = 1.0, or very near to it, with any shortfall likely due to rounding error. This assumes the equation is linear and you run a linear model.