The purpose of this analysis is to demonstrate three concepts for my physiology of exercise class related to the vertical jump test.

Concepts:

  1. It is important to analyze vertical jump based on power, not inches or centimeters. This is true due to the wide variation in power associated with the subjects' body mass.

  2. Scatter plots allow us to visualize bivariate distributions, and a regression model helps us to try to identify sources of that variation - in essence to “explain variance”.

  3. We come full circle and see that if we start with an equation to calculate a dependent variable (in this case power), and we then use the variables from that equation in a regression analysis to predict that dependent variable, we explain all the variance (R^2 = 1.0, or very near to it because of floating-point rounding). This holds only when the equation is linear and you run a linear model; the Lewis equation multiplies weight by the square root of jump height, so its linear fit falls slightly short (R^2 = 0.9916).

Data

For this demonstration we are going to look at the NFL Combine Data (Freely available online): http://nflcombineresults.com/nflcombinedata.php

Prior to any of this work I had to scrape the data from the web page (that code is separate from what is shown here, to save space and time). If anyone is interested in that code, just let me know. Here I start with a CSV file created from the XML web data.

Analysis code

For anyone interested, I have left the “echo” option set to “TRUE” so that the R code used for this analysis is visible. If you are not interested, you can simply ignore the grey code boxes.

Setting working directory and loading required R packages:

setwd("~/physexlab/NFLcombineData")
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(gtools)

Loading the data and calculating derived variables:

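# Read the cleaned combine data, then derive SI-unit variables
nfldata <- read.csv("data_cleaned.csv")
nfldata <- mutate(nfldata, Weight_kg = Weight_lbs/2.2)       # lb -> kg (approximate factor)
nfldata <- mutate(nfldata, Height_m = Height_in*0.0254)      # in -> m
nfldata <- mutate(nfldata, BMI = Weight_kg / (Height_m^2))
nfldata <- mutate(nfldata, VertJump_m = VertLeap_in*0.0254)  # in -> m
# Lewis: mean power (W) = sqrt(4.9) * mass (kg) * 9.81 * sqrt(jump height (m))
nfldata <- mutate(nfldata, LewisPower = sqrt(4.9)*Weight_kg*9.81*sqrt(VertJump_m))
# Sayers: peak power (W) = 60.7 * jump height (cm) + 45.3 * mass (kg) - 2055
nfldata <- mutate(nfldata, SayersPower = (60.7*(VertJump_m*100))+(45.3*Weight_kg-2055))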
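
As a quick sanity check on the two formulas, here is a minimal sketch in base R that applies them to one hypothetical athlete (220 lb, 30 in jump). These numbers are made up for illustration and are not taken from the combine data, and the helper names `lewis_power` and `sayers_power` are my own, not from any package.

```r
# Hypothetical helpers mirroring the mutate() calls above (not from any package)
lewis_power <- function(weight_kg, jump_m) {
  sqrt(4.9) * weight_kg * 9.81 * sqrt(jump_m)
}
sayers_power <- function(weight_kg, jump_m) {
  60.7 * (jump_m * 100) + 45.3 * weight_kg - 2055
}

# Made-up athlete: 220 lb and a 30 in vertical leap
weight_kg <- 220 / 2.2      # same approximate lb -> kg factor as above
jump_m    <- 30 * 0.0254    # inches -> metres

lewis  <- lewis_power(weight_kg, jump_m)
sayers <- sayers_power(weight_kg, jump_m)
print(c(Lewis = lewis, Sayers = sayers))  # Sayers works out to about 7100.34 W
```

Note that the Lewis estimate is directly proportional to body mass, which is one way to see why jump height alone says little about power.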

Plots of Power (Y axis) and Vertical Leap in inches (X Axis)

Note that in both of these scatter plots there is substantial variation: for any particular jump height, players vary considerably in power.

Figure 1. Sayers Power is an estimate of the Peak Power achieved during the jump

qplot(VertLeap_in, SayersPower, data = nfldata)
## Warning: Removed 1053 rows containing missing values (geom_point).

Figure 2. Lewis Power is an estimate of the Mean Power during the jump

qplot(VertLeap_in, LewisPower, data = nfldata)
## Warning: Removed 1053 rows containing missing values (geom_point).

Why is there so much variation in power when compared to vertical jump height?

In this sample of NFL Combine participants (n = 5394 between 1999 and 2015) there is significant variability in body mass. Also note that the distribution is multimodal (there are several peaks).

Figure 3. Histogram of body mass

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

How does this variation map to the bivariate variation we see in the scatter plots?

To answer this question we can first color each point in the bivariate distribution by body mass. It is clear from these plots that weight has a major influence (not surprising given the equations, and well, physics).

Figure 4. Sayers Power and vertical height inches, color distribution reflects weight

qplot(VertLeap_in, SayersPower, data = nfldata, color = nfldata$Weight_kg)
## Warning: Removed 1053 rows containing missing values (geom_point).

Figure 5. Lewis Power and vertical height inches, color distribution reflects weight

qplot(VertLeap_in, LewisPower, data = nfldata, color = nfldata$Weight_kg)
## Warning: Removed 1053 rows containing missing values (geom_point).

Can this be made clearer?

To make what is happening even clearer, we can create a factor (categorical) variable based on the weight distribution. The code below breaks weight into 4 categories at the quartiles of the distribution.

nfldata$wt_quantiles<-quantcut(nfldata$Weight_kg, q=seq(0,1,by=0.25), na.rm=TRUE)
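
For readers without `gtools`, roughly the same partition can be sketched in base R with `quantile()` and `cut()`. The weights below are simulated purely for illustration (they are not the combine data), and the interval labels may differ slightly from what `quantcut()` prints.

```r
set.seed(1)
# Simulated body masses in kg (illustrative only, not the combine data)
weight_kg <- sample(seq(70, 169, by = 1))   # 100 distinct values, shuffled

# Quartile break points: 0%, 25%, 50%, 75%, 100%
breaks <- quantile(weight_kg, probs = seq(0, 1, by = 0.25), na.rm = TRUE)

# include.lowest = TRUE keeps the minimum value inside the first bin
wt_quartiles <- cut(weight_kg, breaks = breaks, include.lowest = TRUE)

print(table(wt_quartiles))   # each of the four bins holds a quarter of the values
```

Without `include.lowest = TRUE` the lightest player would fall outside the first interval and become `NA`, so that argument is worth keeping.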

Figure 6. Histogram with weight partitioned into quartiles (4 groups each with 25% of the data)

We can see that this process does a nice job of partitioning the histogram from earlier into 4 discrete categories of weight.

qplot(Weight_kg, data=nfldata, geom="histogram",color=nfldata$wt_quantiles)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Scatter plots with weight variance by factor

Now we can repeat the scatter plots (bivariate distributions) with the points colored by these discrete body mass groups:

Figure 7. Sayers Power with weight partitioned into quartiles

qplot(VertLeap_in, SayersPower, data = nfldata, color = nfldata$wt_quantiles)
## Warning: Removed 1053 rows containing missing values (geom_point).

Figure 8. Lewis Power with weight partitioned into quartiles

qplot(VertLeap_in, LewisPower, data = nfldata, color = nfldata$wt_quantiles)
## Warning: Removed 1053 rows containing missing values (geom_point).

Thoughts

It is clear from both of the above graphics that there is a large impact of body mass, and for the most part it simply partitions the data into four separate positive relationships.

Sayers Power with weight partitioned into quartiles with separate OLS lines of best fit

qplot(y=SayersPower, x=VertLeap_in, col=factor(wt_quantiles), data=nfldata) + stat_smooth(method=lm, formula=y~x)
## Warning: Removed 275 rows containing missing values (stat_smooth).
## Warning: Removed 239 rows containing missing values (stat_smooth).
## Warning: Removed 263 rows containing missing values (stat_smooth).
## Warning: Removed 276 rows containing missing values (stat_smooth).
## Warning: Removed 1053 rows containing missing values (geom_point).

Lewis Power with weight partitioned into quartiles with separate OLS lines of best fit

qplot(y=LewisPower, x=VertLeap_in, col=factor(wt_quantiles), data=nfldata) + stat_smooth(method=lm, formula=y~x)
## Warning: Removed 275 rows containing missing values (stat_smooth).
## Warning: Removed 239 rows containing missing values (stat_smooth).
## Warning: Removed 263 rows containing missing values (stat_smooth).
## Warning: Removed 276 rows containing missing values (stat_smooth).
## Warning: Removed 1053 rows containing missing values (geom_point).

From scatter plots to regression:

Sayers regression first without controlling for weight; then controlling for weight via quantiles

Not controlling for weight - R^2 = 0.005033

This model relates to Figure 1.

fitSayers<-lm(formula = SayersPower ~ VertLeap_in, data = nfldata)
summary(fitSayers)
## 
## Call:
## lm(formula = SayersPower ~ VertLeap_in, data = nfldata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2082.71  -541.86    -9.71   529.65  2273.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7656.220     85.939  89.089  < 2e-16 ***
## VertLeap_in   12.081      2.593   4.658 3.28e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 718.3 on 4290 degrees of freedom
##   (1053 observations deleted due to missingness)
## Multiple R-squared:  0.005033,   Adjusted R-squared:  0.004801 
## F-statistic:  21.7 on 1 and 4290 DF,  p-value: 3.283e-06

Controlling for weight - R^2 = 0.8977

This model relates to Figure 7.

fitSayersWeight<-lm(formula = SayersPower ~ VertLeap_in + wt_quantiles, data = nfldata)
summary(fitSayersWeight)
## 
## Call:
## lm(formula = SayersPower ~ VertLeap_in + wt_quantiles, data = nfldata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -807.22 -167.62  -21.42  152.52 1477.39 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2385.684     39.215   60.84   <2e-16 ***
## VertLeap_in             141.553      1.086  130.40   <2e-16 ***
## wt_quantiles(94.1,107]  527.704     10.004   52.75   <2e-16 ***
## wt_quantiles(107,130]  1234.166     10.241  120.52   <2e-16 ***
## wt_quantiles(130,175]  2333.012     12.638  184.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 230.4 on 4287 degrees of freedom
##   (1053 observations deleted due to missingness)
## Multiple R-squared:  0.8977, Adjusted R-squared:  0.8976 
## F-statistic:  9404 on 4 and 4287 DF,  p-value: < 2.2e-16
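
To read the coefficient table, it can help to turn it into predictions. The sketch below plugs the estimates printed above into the model equation for a hypothetical 30 in jump. The lightest quartile has no row of its own because it is the reference level, so its offset is zero and it is labelled "baseline" here. The object names are mine.

```r
# Estimates copied from the summary(fitSayersWeight) output above
intercept <- 2385.684
slope_in  <- 141.553          # watts per inch of vertical leap
offsets   <- c("baseline"   = 0,
               "(94.1,107]" = 527.704,
               "(107,130]"  = 1234.166,
               "(130,175]"  = 2333.012)

jump_in <- 30                 # hypothetical jump height in inches

# Same slope for every group; only the intercept shifts
predicted <- intercept + slope_in * jump_in + offsets
print(round(predicted, 1))
```

This additive model forces the four lines to be parallel, unlike the separate per-group fits drawn with `stat_smooth` earlier.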

Lewis regression first without controlling for weight; then controlling for weight via quantiles

Not controlling for weight - R^2 = 0.1125

This model relates to Figure 2.

fitLewis<-lm(formula = LewisPower ~ VertLeap_in, data = nfldata)
summary(fitLewis)
## 
## Call:
## lm(formula = LewisPower ~ VertLeap_in, data = nfldata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -857.19 -234.95  -13.89  221.78  970.91 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3037.882     36.829   82.49   <2e-16 ***
## VertLeap_in  -25.922      1.111  -23.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 307.8 on 4290 degrees of freedom
##   (1053 observations deleted due to missingness)
## Multiple R-squared:  0.1125, Adjusted R-squared:  0.1123 
## F-statistic: 544.1 on 1 and 4290 DF,  p-value: < 2.2e-16

Controlling for weight - R^2 = 0.9072

This model relates to Figure 8.

fitLewisWeight<-lm(formula = LewisPower ~ VertLeap_in + wt_quantiles, data = nfldata)
summary(fitLewisWeight)
## 
## Call:
## lm(formula = LewisPower ~ VertLeap_in + wt_quantiles, data = nfldata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -358.17  -74.20   -8.18   69.24  579.62 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            796.5314    16.9507   46.99   <2e-16 ***
## VertLeap_in             28.6907     0.4692   61.14   <2e-16 ***
## wt_quantiles(94.1,107] 245.6157     4.3242   56.80   <2e-16 ***
## wt_quantiles(107,130]  560.4090     4.4266  126.60   <2e-16 ***
## wt_quantiles(130,175]  994.2527     5.4625  182.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 99.6 on 4287 degrees of freedom
##   (1053 observations deleted due to missingness)
## Multiple R-squared:  0.9072, Adjusted R-squared:  0.9071 
## F-statistic: 1.047e+04 on 4 and 4287 DF,  p-value: < 2.2e-16
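
One detail in the Lewis output is worth spelling out: the slope for VertLeap_in is negative (-25.9) without weight in the model and positive (28.7) with it. A sign flip like this can happen when heavier players both produce more power and tend to jump lower. The sketch below reproduces the pattern on simulated data; the negative weight-to-jump relationship and all the numbers in it are assumptions chosen for illustration, not estimates from the combine data.

```r
set.seed(123)
n <- 400
# Simulated players: heavier players are ASSUMED to jump lower on average
weight_kg <- runif(n, 80, 160)
jump_in   <- 45 - 0.15 * weight_kg + rnorm(n, sd = 3)
lewis     <- sqrt(4.9) * weight_kg * 9.81 * sqrt(jump_in * 0.0254)

# Weight quartiles, built with base R instead of gtools::quantcut
wt_q <- cut(weight_kg, quantile(weight_kg, seq(0, 1, 0.25)), include.lowest = TRUE)

pooled     <- lm(lewis ~ jump_in)          # ignores weight
controlled <- lm(lewis ~ jump_in + wt_q)   # one intercept per weight quartile

print(coef(pooled)["jump_in"])       # negative in this simulation
print(coef(controlled)["jump_in"])   # positive once weight is held roughly fixed
```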

Back full circle

Of course, the Lewis and Sayers equations calculate power from only two inputs: height jumped and weight in kg. So if we simply add weight in kg as a continuous variable, the regression explains all (Sayers) or nearly all (Lewis) of the variability, because height and weight are the only determinants in the equations we started with.

Sayers regression controlling for weight as a continuous variable

This model relates to Figure 4.

Controlling for weight as a continuous variable - R^2 = 1.0

fitSayersWeightKg<-lm(formula = SayersPower ~ VertLeap_in + Weight_kg, data = nfldata)
summary(fitSayersWeightKg)
## 
## Call:
## lm(formula = SayersPower ~ VertLeap_in + Weight_kg, data = nfldata)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.770e-12 -5.800e-13 -1.400e-13  2.400e-13  4.477e-10 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -2.055e+03  2.188e-12 -9.391e+14   <2e-16 ***
## VertLeap_in  1.542e+02  4.317e-14  3.571e+15   <2e-16 ***
## Weight_kg    4.530e+01  8.831e-15  5.130e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.172e-12 on 4289 degrees of freedom
##   (1053 observations deleted due to missingness)
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.322e+31 on 2 and 4289 DF,  p-value: < 2.2e-16
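
The fitted coefficients can be checked against the Sayers equation itself: the intercept and the weight coefficient are the constants -2055 and 45.3, and the slope for `VertLeap_in` is 60.7 W/cm times 2.54 cm/in, which is about 154.2. A minimal sketch on simulated inputs (random numbers, not the combine data) shows the same behaviour:

```r
set.seed(42)
n <- 200
# Simulated inputs (illustrative ranges, not the combine data)
sim <- data.frame(VertLeap_in = runif(n, 20, 45),
                  Weight_kg   = runif(n, 70, 170))

# Sayers peak power, with jump height converted from inches to centimetres
sim$SayersPower <- 60.7 * (sim$VertLeap_in * 2.54) + 45.3 * sim$Weight_kg - 2055

fit <- lm(SayersPower ~ VertLeap_in + Weight_kg, data = sim)

# The regression recovers the constants of the equation
expected <- c(-2055, 60.7 * 2.54, 45.3)
print(cbind(fitted = unname(coef(fit)), expected = expected))
```

The residuals here are pure floating-point noise, which is the same reason the residuals in the output above are on the order of 1e-12.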

Lewis regression controlling for weight as a continuous variable

This model relates to Figure 5.

Controlling for weight as a continuous variable - R^2 = 0.9916

fitLewisWeightKg<-lm(formula = LewisPower ~ VertLeap_in + Weight_kg, data = nfldata)
summary(fitLewisWeightKg)
## 
## Call:
## lm(formula = LewisPower ~ VertLeap_in + Weight_kg, data = nfldata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -194.519  -14.979    8.258   21.808   65.368 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.104e+03  7.150e+00  -154.4   <2e-16 ***
## VertLeap_in  3.468e+01  1.411e-01   245.8   <2e-16 ***
## Weight_kg    1.932e+01  2.886e-02   669.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.97 on 4289 degrees of freedom
##   (1053 observations deleted due to missingness)
## Multiple R-squared:  0.9916, Adjusted R-squared:  0.9916 
## F-statistic: 2.528e+05 on 2 and 4289 DF,  p-value: < 2.2e-16
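
The Lewis fit stops short of 1.0 because the equation multiplies weight by the square root of jump height, which a model that is additive in the raw variables cannot reproduce exactly. Taking logs makes the equation linear: log(power) = log(9.81 * sqrt(4.9)) + log(weight) + 0.5 * log(height). A log-log regression should therefore recover it exactly. The sketch below uses simulated inputs, not the combine data, and the transformation is my suggestion rather than part of the analysis above.

```r
set.seed(7)
n <- 200
# Simulated inputs (illustrative ranges, not the combine data)
sim <- data.frame(VertJump_m = runif(n, 0.5, 1.1),
                  Weight_kg  = runif(n, 70, 170))
sim$LewisPower <- sqrt(4.9) * sim$Weight_kg * 9.81 * sqrt(sim$VertJump_m)

# Additive model on the raw scale: close to, but below, a perfect fit
fit_raw <- lm(LewisPower ~ VertJump_m + Weight_kg, data = sim)
ss_res  <- sum(resid(fit_raw)^2)
ss_tot  <- sum((sim$LewisPower - mean(sim$LewisPower))^2)
r2_raw  <- 1 - ss_res / ss_tot

# Log-log model: the Lewis equation is exactly linear on this scale
fit_log <- lm(log(LewisPower) ~ log(VertJump_m) + log(Weight_kg), data = sim)

print(r2_raw)          # high, but below 1
print(coef(fit_log))   # slopes of 0.5 (height) and 1 (weight)
```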

Restatement of the Concepts:

  1. It is important to analyze vertical jump based on power, not inches or centimeters. This is true due to the wide variation in power associated with the subjects' body mass.

  2. Scatter plots allow us to visualize bivariate distributions, and a regression model helps us to try to identify sources of that variation - in essence to “explain variance”.

  3. We come full circle and see that if we start with an equation to calculate a dependent variable (in this case power), and we then use the variables from that equation in a regression analysis to predict that dependent variable, we explain all the variance (R^2 = 1.0, or very near to it because of floating-point rounding). This holds only when the equation is linear and you run a linear model; the Lewis equation multiplies weight by the square root of jump height, so its linear fit falls slightly short (R^2 = 0.9916).