Analysis for Brewster’s physiology of exercise lab report
Using the NFL Combine Data (Freely available online): http://nflcombineresults.com/nflcombinedata.php
Decided to use “all40yard” - not just the NFL combined timed sprints
Goal: test the relationship between the 3 cone test, 40 yard dash times and power produced from vertical jump
Will stick with the Sayers peak power from the vertical jump
For the sake of presentation 4 & 5 will be combined so that output releated to the linear regression model will follow the scatterplot of the same variables.
setwd("~/physexlab/NFLcombineData")
library(dplyr) #For data manipulation
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2) #For plotting
library(pastecs) #For descriptive statistics
## Loading required package: boot
##
## Attaching package: 'pastecs'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
library(psych) #For descriptive statistics
##
## Attaching package: 'psych'
##
## The following object is masked from 'package:boot':
##
## logit
##
## The following object is masked from 'package:ggplot2':
##
## %+%
library(gtools)
##
## Attaching package: 'gtools'
##
## The following object is masked from 'package:psych':
##
## logit
##
## The following objects are masked from 'package:boot':
##
## inv.logit, logit
nfldata<-read.csv("data_cleaned.csv")
nfldata<-mutate(nfldata, Weight_kg = Weight_lbs/2.2)
nfldata <- mutate(nfldata, Height_m = Height_in*0.0254)
nfldata <- mutate(nfldata, VertJump_m = VertLeap_in*0.0254)
nfldata <- mutate(nfldata, LewisPower = sqrt(4.9)*Weight_kg*9.81*sqrt(VertJump_m))
nfldata <- mutate(nfldata, SayersPower = (60.7*(VertJump_m*100))+(45.3*Weight_kg-2055))
#Select variables of interest
nfldata <- select(nfldata, Year, POS, Weight_lbs, X3Cone, all40yard, SayersPower)
#Keep only complete cases to avoid missing values
nfldata <- nfldata[complete.cases(nfldata), ]
str(nfldata)
## 'data.frame': 3590 obs. of 6 variables:
## $ Year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ POS : Factor w/ 18 levels "C","CB","DE",..: 15 15 11 18 15 6 3 7 4 15 ...
## $ Weight_lbs : int 205 221 227 180 221 218 294 243 292 212 ...
## $ X3Cone : num 6.79 7.1 7.14 6.64 6.96 7.09 7.2 7.07 7.57 7.13 ...
## $ all40yard : num 4.6 4.57 4.55 4.43 4.53 4.56 4.97 4.56 5.1 4.53 ...
## $ SayersPower: num 8719 8509 8170 6893 7969 ...
summary(nfldata)
## Year POS Weight_lbs X3Cone
## Min. :1999 WR : 445 Min. :155.0 Min. :6.340
## 1st Qu.:2003 CB : 350 1st Qu.:208.0 1st Qu.:6.990
## Median :2007 OT : 309 Median :240.0 Median :7.220
## Mean :2007 DE : 301 Mean :247.1 Mean :7.309
## 3rd Qu.:2011 RB : 294 3rd Qu.:294.0 3rd Qu.:7.570
## Max. :2015 DT : 279 Max. :386.0 Max. :9.120
## (Other):1612
## all40yard SayersPower
## Min. :4.210 Min. : 5893
## 1st Qu.:4.560 1st Qu.: 7553
## Median :4.720 Median : 8078
## Mean :4.806 Mean : 8081
## 3rd Qu.:5.030 3rd Qu.: 8611
## Max. :6.000 Max. :10264
##
3590 complete cases with the variables of interest, all positions (1999 - 2015)
options(scipen=100) #supresses scientific notation of output
options(digits=2) #suggests a limit of 2 significant digits - but it is only a suggestion
stat.desc(nfldata, basic=F) #runs basic stats
## Year POS Weight_lbs X3Cone all40yard SayersPower
## median 2007.0000 NA 240.00 7.2200 4.7200 8077.901
## mean 2006.9173 NA 247.14 7.3092 4.8055 8080.920
## SE.mean 0.0835 NA 0.76 0.0071 0.0053 11.936
## CI.mean.0.95 0.1637 NA 1.50 0.0139 0.0103 23.402
## var 25.0232 NA 2097.51 0.1812 0.1000 511441.070
## std.dev 5.0023 NA 45.80 0.4256 0.3162 715.151
## coef.var 0.0025 NA 0.19 0.0582 0.0658 0.088
qplot(X3Cone, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(all40yard, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(SayersPower, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(Weight_lbs, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
All well normally distributed - except for weight in pounds. There seem to be four distinct groups of weight in the draft. I will explore this more after doing the primary analysis.
qplot(y=all40yard, x=X3Cone, data=nfldata) + stat_smooth(method=lm, formula=y~x)
X3Cone_SprintFit <- lm(all40yard ~ X3Cone, data=nfldata) #fit model
summary(X3Cone_SprintFit) # show results
##
## Call:
## lm(formula = all40yard ~ X3Cone, data = nfldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7298 -0.1244 -0.0036 0.1196 0.5615
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.37782 0.05256 7.19 0.00000000000079 ***
## X3Cone 0.60577 0.00718 84.39 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.18 on 3588 degrees of freedom
## Multiple R-squared: 0.665, Adjusted R-squared: 0.665
## F-statistic: 7.12e+03 on 1 and 3588 DF, p-value: <0.0000000000000002
qplot(y=all40yard, x=SayersPower, data=nfldata) + stat_smooth(method=lm, formula=y~x)
SayersPower_SprintFit <- lm(all40yard ~ SayersPower, data=nfldata) #fit model
summary(SayersPower_SprintFit) # show results
##
## Call:
## lm(formula = all40yard ~ SayersPower, data = nfldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7377 -0.2028 -0.0487 0.1752 1.2152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.19804618 0.05347653 59.8 <0.0000000000000002 ***
## SayersPower 0.00019892 0.00000659 30.2 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.28 on 3588 degrees of freedom
## Multiple R-squared: 0.202, Adjusted R-squared: 0.202
## F-statistic: 911 on 1 and 3588 DF, p-value: <0.0000000000000002
qplot(y=all40yard, x=SayersPower, data=nfldata) + stat_smooth(method=lm, formula=y~x)
SayersPower_SprintFit <- lm(all40yard ~ SayersPower, data=nfldata) #fit model
summary(SayersPower_SprintFit) # show results
##
## Call:
## lm(formula = all40yard ~ SayersPower, data = nfldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7377 -0.2028 -0.0487 0.1752 1.2152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.19804618 0.05347653 59.8 <0.0000000000000002 ***
## SayersPower 0.00019892 0.00000659 30.2 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.28 on 3588 degrees of freedom
## Multiple R-squared: 0.202, Adjusted R-squared: 0.202
## F-statistic: 911 on 1 and 3588 DF, p-value: <0.0000000000000002
qplot(y=X3Cone, x=SayersPower, data=nfldata) + stat_smooth(method=lm, formula=y~x)
SayersPower_X3ConeFit <- lm(X3Cone ~ SayersPower, data=nfldata) #fit model
summary(SayersPower_X3ConeFit) # show results
##
## Call:
## lm(formula = X3Cone ~ SayersPower, data = nfldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0397 -0.2677 -0.0431 0.2316 1.7705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.25439622 0.07287858 72.1 <0.0000000000000002 ***
## SayersPower 0.00025427 0.00000898 28.3 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.39 on 3588 degrees of freedom
## Multiple R-squared: 0.183, Adjusted R-squared: 0.182
## F-statistic: 801 on 1 and 3588 DF, p-value: <0.0000000000000002
qplot(y=all40yard, x=SayersPower, data=nfldata, geom=c("point", "smooth"),method="lm",size = X3Cone)
SayersX3Cone_SprintFit <- lm(all40yard ~ SayersPower + X3Cone, data=nfldata) #fit model
summary(SayersX3Cone_SprintFit) # show results
##
## Call:
## lm(formula = all40yard ~ SayersPower + X3Cone, data = nfldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7070 -0.1228 -0.0062 0.1150 0.6469
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.22220006 0.05321477 4.18 0.00003 ***
## SayersPower 0.00005491 0.00000464 11.84 < 0.0000000000000002 ***
## X3Cone 0.56635358 0.00778994 72.70 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.18 on 3587 degrees of freedom
## Multiple R-squared: 0.678, Adjusted R-squared: 0.677
## F-statistic: 3.77e+03 on 2 and 3587 DF, p-value: <0.0000000000000002
With both sayers and X3Cone in the regression model they are both significant preditors of sprint; and the overall R^2 is pretty good, but not much better than just the cone test alone.
As pointed out earlier there are two distinct groups of WR weights based on the histogram:
qplot(Weight_lbs, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Does weight contribute to our model of 40 yard sprint time?
qplot(y=all40yard, x=SayersPower, data=nfldata, geom=c("point", "smooth"),method="lm",size = X3Cone, color = Weight_lbs)
SayersX3ConeWeight_SprintFit <- lm(all40yard ~ SayersPower + X3Cone + Weight_lbs, data=nfldata) #fit model
summary(SayersX3ConeWeight_SprintFit) # show results
##
## Call:
## lm(formula = all40yard ~ SayersPower + X3Cone + Weight_lbs, data = nfldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4268 -0.0840 -0.0035 0.0801 0.5977
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.96025612 0.06049218 48.9 <0.0000000000000002 ***
## SayersPower -0.00012941 0.00000459 -28.2 <0.0000000000000002 ***
## X3Cone 0.19129569 0.00851687 22.5 <0.0000000000000002 ***
## Weight_lbs 0.00604030 0.00010360 58.3 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.13 on 3586 degrees of freedom
## Multiple R-squared: 0.834, Adjusted R-squared: 0.834
## F-statistic: 6.03e+03 on 3 and 3586 DF, p-value: <0.0000000000000002
Knowing body weight increases the overall R^2 significantly; and weight is a significant predictor in the model in the direction you would expect, the more weight the greater the time (greater time means slower)
Another way to try this is to create a factor (category) variable based on the weight distribution. The code below breaks weight into 2 categories based on the distribution.
nfldata$wt_quantiles<-quantcut(nfldata$Weight_lbs, q=seq(0,1,by=0.25), na.rm=TRUE)
We can see that this process does a nice job of partitioning the histogram from earlier into 4 discrete categories of weight.
qplot(Weight_lbs, data=nfldata, geom="histogram",color=nfldata$wt_quantiles)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Now the scatter plot with the weight grouping:
qplot(y=all40yard, x=SayersPower, data=nfldata, geom=c("point", "smooth"),method="lm",size = X3Cone, color = nfldata$wt_quantiles)
SayersX3ConeWeight2_SprintFit <- lm(all40yard ~ SayersPower + X3Cone + wt_quantiles, data=nfldata) #fit model
summary(SayersX3ConeWeight2_SprintFit) # show results
##
## Call:
## lm(formula = all40yard ~ SayersPower + X3Cone + wt_quantiles,
## data = nfldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4659 -0.0908 -0.0051 0.0888 0.7382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.21396893 0.07539090 42.6 <0.0000000000000002
## SayersPower -0.00008038 0.00000469 -17.1 <0.0000000000000002
## X3Cone 0.27252850 0.00872137 31.2 <0.0000000000000002
## wt_quantiles(208,240] 0.12689950 0.00696697 18.2 <0.0000000000000002
## wt_quantiles(240,294] 0.28992977 0.00866589 33.5 <0.0000000000000002
## wt_quantiles(294,386] 0.58680553 0.01241568 47.3 <0.0000000000000002
##
## (Intercept) ***
## SayersPower ***
## X3Cone ***
## wt_quantiles(208,240] ***
## wt_quantiles(240,294] ***
## wt_quantiles(294,386] ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.14 on 3584 degrees of freedom
## Multiple R-squared: 0.803, Adjusted R-squared: 0.803
## F-statistic: 2.92e+03 on 5 and 3584 DF, p-value: <0.0000000000000002
While this factor (as opposed to continuous) approach is better visually with the scatter plot; it is not better mathematically. The factor variable for weight accounts for less of the variability in the data (lower R^2), which makes sense.
Here are boxplots of each variable and how it varies by position - the black line in the middle of the plot is the overall mean; the upper and lower red lines are the +/- 1 Standard Deviation values. ##### 40 yard dash
ggplot(nfldata, aes(factor(POS), all40yard)) + geom_boxplot() + geom_hline(aes(yintercept=mean(all40yard))) + geom_hline(aes(yintercept=(mean(all40yard)-sd(all40yard))), colour="#BB0000") + geom_hline(aes(yintercept=(mean(all40yard)+sd(all40yard)), colour="#BB0000"))
ggplot(nfldata, aes(factor(POS), SayersPower)) + geom_boxplot() + geom_hline(aes(yintercept=mean(SayersPower))) + geom_hline(aes(yintercept=(mean(SayersPower)-sd(SayersPower))), colour="#BB0000") + geom_hline(aes(yintercept=(mean(SayersPower)+sd(SayersPower)), colour="#BB0000"))
ggplot(nfldata, aes(factor(POS), X3Cone)) + geom_boxplot() + geom_hline(aes(yintercept=mean(X3Cone))) + geom_hline(aes(yintercept=(mean(X3Cone)-sd(X3Cone))), colour="#BB0000") + geom_hline(aes(yintercept=(mean(X3Cone)+sd(X3Cone)), colour="#BB0000"))
ggplot(nfldata, aes(factor(POS), Weight_lbs)) + geom_boxplot() + geom_hline(aes(yintercept=mean(Weight_lbs))) + geom_hline(aes(yintercept=(mean(Weight_lbs)-sd(Weight_lbs))), colour="#BB0000") + geom_hline(aes(yintercept=(mean(Weight_lbs)+sd(Weight_lbs)), colour="#BB0000"))
Seems to be some interesting findings to discuss - let me know if you have any questions, or based on the insights of this analysis whether you have any other questions that I might have time to do.