Analysis for Brewster’s physiology of exercise lab report

Using the NFL Combine Data (Freely available online): http://nflcombineresults.com/nflcombinedata.php

Decided to use “all40yard” - not just the NFL combined timed sprints

Goal: test the relationship between the 3 cone test, 40 yard dash times and power produced from vertical jump

Will stick with the Sayers peak power from the vertical jump

Analysis approach

  1. Load, select and clean data relevant for this analysis
  2. Descriptive statistics of relevant data
  3. Univariate distributions of relevant data
  4. Bivariate distributions with scatterplots
  5. Linear regression

For the sake of presentation 4 & 5 will be combined so that output releated to the linear regression model will follow the scatterplot of the same variables.

Setting working directory and loading required R packages:

setwd("~/physexlab/NFLcombineData")
library(dplyr) #For data manipulation
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2) #For plotting
library(pastecs) #For descriptive statistics
## Loading required package: boot
## 
## Attaching package: 'pastecs'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(psych) #For descriptive statistics
## 
## Attaching package: 'psych'
## 
## The following object is masked from 'package:boot':
## 
##     logit
## 
## The following object is masked from 'package:ggplot2':
## 
##     %+%
library(gtools)
## 
## Attaching package: 'gtools'
## 
## The following object is masked from 'package:psych':
## 
##     logit
## 
## The following objects are masked from 'package:boot':
## 
##     inv.logit, logit

Load, select and clean data relevant for this analysis

nfldata<-read.csv("data_cleaned.csv")
nfldata<-mutate(nfldata, Weight_kg = Weight_lbs/2.2)
nfldata <- mutate(nfldata, Height_m = Height_in*0.0254)
nfldata <- mutate(nfldata, VertJump_m = VertLeap_in*0.0254)
nfldata <- mutate(nfldata, LewisPower = sqrt(4.9)*Weight_kg*9.81*sqrt(VertJump_m))
nfldata <- mutate(nfldata, SayersPower = (60.7*(VertJump_m*100))+(45.3*Weight_kg-2055))

#Select variables of interest 
nfldata <- select(nfldata, Year, POS, Weight_lbs, X3Cone, all40yard, SayersPower)

#Keep only complete cases to avoid missing values
nfldata <- nfldata[complete.cases(nfldata), ]

Summary of data after the above process

str(nfldata)
## 'data.frame':    3590 obs. of  6 variables:
##  $ Year       : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ POS        : Factor w/ 18 levels "C","CB","DE",..: 15 15 11 18 15 6 3 7 4 15 ...
##  $ Weight_lbs : int  205 221 227 180 221 218 294 243 292 212 ...
##  $ X3Cone     : num  6.79 7.1 7.14 6.64 6.96 7.09 7.2 7.07 7.57 7.13 ...
##  $ all40yard  : num  4.6 4.57 4.55 4.43 4.53 4.56 4.97 4.56 5.1 4.53 ...
##  $ SayersPower: num  8719 8509 8170 6893 7969 ...
summary(nfldata)
##       Year           POS         Weight_lbs        X3Cone     
##  Min.   :1999   WR     : 445   Min.   :155.0   Min.   :6.340  
##  1st Qu.:2003   CB     : 350   1st Qu.:208.0   1st Qu.:6.990  
##  Median :2007   OT     : 309   Median :240.0   Median :7.220  
##  Mean   :2007   DE     : 301   Mean   :247.1   Mean   :7.309  
##  3rd Qu.:2011   RB     : 294   3rd Qu.:294.0   3rd Qu.:7.570  
##  Max.   :2015   DT     : 279   Max.   :386.0   Max.   :9.120  
##                 (Other):1612                                  
##    all40yard      SayersPower   
##  Min.   :4.210   Min.   : 5893  
##  1st Qu.:4.560   1st Qu.: 7553  
##  Median :4.720   Median : 8078  
##  Mean   :4.806   Mean   : 8081  
##  3rd Qu.:5.030   3rd Qu.: 8611  
##  Max.   :6.000   Max.   :10264  
## 

3590 complete cases with the variables of interest, all positions (1999 - 2015)

Descriptive statistics of relevant data

options(scipen=100) #supresses scientific notation of output
options(digits=2) #suggests a limit of 2 significant digits - but it is only a suggestion
stat.desc(nfldata, basic=F) #runs basic stats
##                   Year POS Weight_lbs X3Cone all40yard SayersPower
## median       2007.0000  NA     240.00 7.2200    4.7200    8077.901
## mean         2006.9173  NA     247.14 7.3092    4.8055    8080.920
## SE.mean         0.0835  NA       0.76 0.0071    0.0053      11.936
## CI.mean.0.95    0.1637  NA       1.50 0.0139    0.0103      23.402
## var            25.0232  NA    2097.51 0.1812    0.1000  511441.070
## std.dev         5.0023  NA      45.80 0.4256    0.3162     715.151
## coef.var        0.0025  NA       0.19 0.0582    0.0658       0.088

Univariate distributions of relevant data

qplot(X3Cone, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(all40yard, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(SayersPower, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(Weight_lbs, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

All well normally distributed - except for weight in pounds. There seem to be four distinct groups of weight in the draft. I will explore this more after doing the primary analysis.

Bivariate distributions with scatterplots and Linear regression

X3Cone and 40 yard dash

qplot(y=all40yard, x=X3Cone, data=nfldata) + stat_smooth(method=lm, formula=y~x)

X3Cone_SprintFit <- lm(all40yard ~ X3Cone, data=nfldata) #fit model
summary(X3Cone_SprintFit) # show results 
## 
## Call:
## lm(formula = all40yard ~ X3Cone, data = nfldata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7298 -0.1244 -0.0036  0.1196  0.5615 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  0.37782    0.05256    7.19     0.00000000000079 ***
## X3Cone       0.60577    0.00718   84.39 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.18 on 3588 degrees of freedom
## Multiple R-squared:  0.665,  Adjusted R-squared:  0.665 
## F-statistic: 7.12e+03 on 1 and 3588 DF,  p-value: <0.0000000000000002

Sayers Power (peak power from vertical jump)

qplot(y=all40yard, x=SayersPower, data=nfldata) + stat_smooth(method=lm, formula=y~x)

SayersPower_SprintFit <- lm(all40yard ~ SayersPower, data=nfldata) #fit model
summary(SayersPower_SprintFit) # show results 
## 
## Call:
## lm(formula = all40yard ~ SayersPower, data = nfldata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7377 -0.2028 -0.0487  0.1752  1.2152 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 3.19804618 0.05347653    59.8 <0.0000000000000002 ***
## SayersPower 0.00019892 0.00000659    30.2 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.28 on 3588 degrees of freedom
## Multiple R-squared:  0.202,  Adjusted R-squared:  0.202 
## F-statistic:  911 on 1 and 3588 DF,  p-value: <0.0000000000000002

Sayers Power (peak power from vertical jump) and 40 yard dash

qplot(y=all40yard, x=SayersPower, data=nfldata) + stat_smooth(method=lm, formula=y~x)

SayersPower_SprintFit <- lm(all40yard ~ SayersPower, data=nfldata) #fit model
summary(SayersPower_SprintFit) # show results 
## 
## Call:
## lm(formula = all40yard ~ SayersPower, data = nfldata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7377 -0.2028 -0.0487  0.1752  1.2152 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 3.19804618 0.05347653    59.8 <0.0000000000000002 ***
## SayersPower 0.00019892 0.00000659    30.2 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.28 on 3588 degrees of freedom
## Multiple R-squared:  0.202,  Adjusted R-squared:  0.202 
## F-statistic:  911 on 1 and 3588 DF,  p-value: <0.0000000000000002

Sayers Power (peak power from vertical jump) and X3Cone

qplot(y=X3Cone, x=SayersPower, data=nfldata) + stat_smooth(method=lm, formula=y~x)

SayersPower_X3ConeFit <- lm(X3Cone ~ SayersPower, data=nfldata) #fit model
summary(SayersPower_X3ConeFit) # show results 
## 
## Call:
## lm(formula = X3Cone ~ SayersPower, data = nfldata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0397 -0.2677 -0.0431  0.2316  1.7705 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 5.25439622 0.07287858    72.1 <0.0000000000000002 ***
## SayersPower 0.00025427 0.00000898    28.3 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.39 on 3588 degrees of freedom
## Multiple R-squared:  0.183,  Adjusted R-squared:  0.182 
## F-statistic:  801 on 1 and 3588 DF,  p-value: <0.0000000000000002

Multivariable regression Does combining X3Cone and sayers power predict 40 yard dash?

qplot(y=all40yard, x=SayersPower, data=nfldata, geom=c("point", "smooth"),method="lm",size = X3Cone)

SayersX3Cone_SprintFit <- lm(all40yard ~ SayersPower + X3Cone, data=nfldata) #fit model
summary(SayersX3Cone_SprintFit) # show results 
## 
## Call:
## lm(formula = all40yard ~ SayersPower + X3Cone, data = nfldata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7070 -0.1228 -0.0062  0.1150  0.6469 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 0.22220006 0.05321477    4.18              0.00003 ***
## SayersPower 0.00005491 0.00000464   11.84 < 0.0000000000000002 ***
## X3Cone      0.56635358 0.00778994   72.70 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.18 on 3587 degrees of freedom
## Multiple R-squared:  0.678,  Adjusted R-squared:  0.677 
## F-statistic: 3.77e+03 on 2 and 3587 DF,  p-value: <0.0000000000000002

With both sayers and X3Cone in the regression model they are both significant preditors of sprint; and the overall R^2 is pretty good, but not much better than just the cone test alone.

What about weight?

As pointed out earlier there are two distinct groups of WR weights based on the histogram:

qplot(Weight_lbs, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Does weight contribute to our model of 40 yard sprint time?

qplot(y=all40yard, x=SayersPower, data=nfldata, geom=c("point", "smooth"),method="lm",size = X3Cone, color = Weight_lbs)

SayersX3ConeWeight_SprintFit <- lm(all40yard ~ SayersPower + X3Cone + Weight_lbs, data=nfldata) #fit model
summary(SayersX3ConeWeight_SprintFit) # show results 
## 
## Call:
## lm(formula = all40yard ~ SayersPower + X3Cone + Weight_lbs, data = nfldata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4268 -0.0840 -0.0035  0.0801  0.5977 
## 
## Coefficients:
##                Estimate  Std. Error t value            Pr(>|t|)    
## (Intercept)  2.96025612  0.06049218    48.9 <0.0000000000000002 ***
## SayersPower -0.00012941  0.00000459   -28.2 <0.0000000000000002 ***
## X3Cone       0.19129569  0.00851687    22.5 <0.0000000000000002 ***
## Weight_lbs   0.00604030  0.00010360    58.3 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.13 on 3586 degrees of freedom
## Multiple R-squared:  0.834,  Adjusted R-squared:  0.834 
## F-statistic: 6.03e+03 on 3 and 3586 DF,  p-value: <0.0000000000000002

Knowing body weight increases the overall R^2 significantly; and weight is a significant predictor in the model in the direction you would expect, the more weight the greater the time (greater time means slower)

Can this be made more clear?

Another way to try this is to create a factor (category) variable based on the weight distribution. The code below breaks weight into 2 categories based on the distribution.

nfldata$wt_quantiles<-quantcut(nfldata$Weight_lbs, q=seq(0,1,by=0.25), na.rm=TRUE)

We can see that this process does a nice job of partitioning the histogram from earlier into 4 discrete categories of weight.

qplot(Weight_lbs, data=nfldata, geom="histogram",color=nfldata$wt_quantiles)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Now the scatter plot with the weight grouping:

qplot(y=all40yard, x=SayersPower, data=nfldata, geom=c("point", "smooth"),method="lm",size = X3Cone, color = nfldata$wt_quantiles)

SayersX3ConeWeight2_SprintFit <- lm(all40yard ~ SayersPower + X3Cone + wt_quantiles, data=nfldata) #fit model
summary(SayersX3ConeWeight2_SprintFit) # show results 
## 
## Call:
## lm(formula = all40yard ~ SayersPower + X3Cone + wt_quantiles, 
##     data = nfldata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4659 -0.0908 -0.0051  0.0888  0.7382 
## 
## Coefficients:
##                          Estimate  Std. Error t value            Pr(>|t|)
## (Intercept)            3.21396893  0.07539090    42.6 <0.0000000000000002
## SayersPower           -0.00008038  0.00000469   -17.1 <0.0000000000000002
## X3Cone                 0.27252850  0.00872137    31.2 <0.0000000000000002
## wt_quantiles(208,240]  0.12689950  0.00696697    18.2 <0.0000000000000002
## wt_quantiles(240,294]  0.28992977  0.00866589    33.5 <0.0000000000000002
## wt_quantiles(294,386]  0.58680553  0.01241568    47.3 <0.0000000000000002
##                          
## (Intercept)           ***
## SayersPower           ***
## X3Cone                ***
## wt_quantiles(208,240] ***
## wt_quantiles(240,294] ***
## wt_quantiles(294,386] ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.14 on 3584 degrees of freedom
## Multiple R-squared:  0.803,  Adjusted R-squared:  0.803 
## F-statistic: 2.92e+03 on 5 and 3584 DF,  p-value: <0.0000000000000002

While this factor (as opposed to continuous) approach is better visually with the scatter plot; it is not better mathematically. The factor variable for weight accounts for less of the variability in the data (lower R^2), which makes sense.

To get a sense of positional differences in the variable above

Here are boxplots of each variable and how it varies by position - the black line in the middle of the plot is the overall mean; the upper and lower red lines are the +/- 1 Standard Deviation values. ##### 40 yard dash

ggplot(nfldata, aes(factor(POS), all40yard)) + geom_boxplot() + geom_hline(aes(yintercept=mean(all40yard))) + geom_hline(aes(yintercept=(mean(all40yard)-sd(all40yard))), colour="#BB0000") + geom_hline(aes(yintercept=(mean(all40yard)+sd(all40yard)), colour="#BB0000"))

Sayers Power
ggplot(nfldata, aes(factor(POS), SayersPower)) + geom_boxplot() + geom_hline(aes(yintercept=mean(SayersPower))) + geom_hline(aes(yintercept=(mean(SayersPower)-sd(SayersPower))), colour="#BB0000") + geom_hline(aes(yintercept=(mean(SayersPower)+sd(SayersPower)), colour="#BB0000"))

Cone test
ggplot(nfldata, aes(factor(POS), X3Cone)) + geom_boxplot() + geom_hline(aes(yintercept=mean(X3Cone))) + geom_hline(aes(yintercept=(mean(X3Cone)-sd(X3Cone))), colour="#BB0000") + geom_hline(aes(yintercept=(mean(X3Cone)+sd(X3Cone)), colour="#BB0000"))

Weight (lbs)
ggplot(nfldata, aes(factor(POS), Weight_lbs)) + geom_boxplot() + geom_hline(aes(yintercept=mean(Weight_lbs))) + geom_hline(aes(yintercept=(mean(Weight_lbs)-sd(Weight_lbs))), colour="#BB0000") + geom_hline(aes(yintercept=(mean(Weight_lbs)+sd(Weight_lbs)), colour="#BB0000"))

Final thoughts -

Seems to be some interesting findings to discuss - let me know if you have any questions, or based on the insights of this analysis whether you have any other questions that I might have time to do.