Analysis for Currier’s physiology of exercise lab report
Using the NFL Combine Data (Freely available online): http://nflcombineresults.com/nflcombinedata.php
Decided to use “all40yard” - not just the NFL combined timed sprints; most likely includes sprints times at “Pro Day” at the respective college programs (according to a football source)
Goal: “to find out which is a better indicator of a faster 40 yard dash time. Either the shuttle time or vertical jump, among wide receivers”
For the sake of presentation 4 & 5 will be combined so that output releated to the linear regression model will follow the scatterplot of the same variables.
setwd("~/physexlab/NFLcombineData")
library(dplyr) #For data manipulation
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2) #For plotting
library(pastecs) #For descriptive statistics
## Loading required package: boot
##
## Attaching package: 'pastecs'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
library(psych) #For descriptive statistics
##
## Attaching package: 'psych'
##
## The following object is masked from 'package:boot':
##
## logit
##
## The following object is masked from 'package:ggplot2':
##
## %+%
library(gtools)
##
## Attaching package: 'gtools'
##
## The following object is masked from 'package:psych':
##
## logit
##
## The following objects are masked from 'package:boot':
##
## inv.logit, logit
nfldata<-read.csv("data_cleaned.csv")
nfldata<-mutate(nfldata, Weight_kg = Weight_lbs/2.2)
nfldata <- mutate(nfldata, Height_m = Height_in*0.0254)
nfldata <- mutate(nfldata, VertJump_m = VertLeap_in*0.0254)
nfldata <- mutate(nfldata, LewisPower = sqrt(4.9)*Weight_kg*9.81*sqrt(VertJump_m))
nfldata <- mutate(nfldata, SayersPower = (60.7*(VertJump_m*100))+(45.3*Weight_kg-2055))
#Select variables of interest
nfldata <- select(nfldata, Year, POS, Weight_lbs, Shuttle, all40yard, LewisPower, SayersPower)
#Filter based on rows of interest (position = WR)
nfldata<-filter(nfldata, grepl('WR', POS))
#Keep only complete cases to avoid missing values
nfldata <- nfldata[complete.cases(nfldata), ]
str(nfldata)
## 'data.frame': 457 obs. of 7 variables:
## $ Year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ POS : Factor w/ 18 levels "C","CB","DE",..: 18 18 18 18 18 18 18 18 18 18 ...
## $ Weight_lbs : int 180 197 205 195 212 213 211 185 220 216 ...
## $ Shuttle : num 4.07 4.15 4.11 4.26 4.06 4.3 3.98 4.32 4.12 4.18 ...
## $ all40yard : num 4.43 4.42 4.54 4.51 4.43 4.35 4.42 4.56 4.57 4.58 ...
## $ LewisPower : num 1651 1996 1962 1762 2135 ...
## $ SayersPower: num 6893 8400 7871 7048 8632 ...
summary(nfldata)
## Year POS Weight_lbs Shuttle all40yard
## Min. :1999 WR :457 Min. :155 Min. :3.730 Min. :4.21
## 1st Qu.:2004 C : 0 1st Qu.:191 1st Qu.:4.100 1st Qu.:4.45
## Median :2008 CB : 0 Median :201 Median :4.200 Median :4.51
## Mean :2008 DE : 0 Mean :201 Mean :4.205 Mean :4.51
## 3rd Qu.:2012 DT : 0 3rd Qu.:211 3rd Qu.:4.310 3rd Qu.:4.57
## Max. :2015 FB : 0 Max. :240 Max. :4.770 Max. :4.84
## (Other): 0
## LewisPower SayersPower
## Min. :1443 Min. :6035
## 1st Qu.:1770 1st Qu.:7208
## Median :1887 Median :7551
## Mean :1883 Mean :7564
## 3rd Qu.:1987 3rd Qu.:7917
## Max. :2324 Max. :9269
##
457 complete cases with the variables of interest, all WRs across all years (1999 - 2015)
options(scipen=100) #supresses scientific notation of output
options(digits=2) #suggests a limit of 2 significant digits - but it is only a suggestion
stat.desc(nfldata, basic=F) #runs basic stats
## Year POS Weight_lbs Shuttle all40yard LewisPower
## median 2008.0000 NA 201.000 4.200 4.5100 1886.566
## mean 2007.6871 NA 200.993 4.205 4.5104 1883.170
## SE.mean 0.2367 NA 0.679 0.007 0.0049 7.390
## CI.mean.0.95 0.4652 NA 1.335 0.014 0.0096 14.523
## var 25.6058 NA 210.932 0.022 0.0108 24958.330
## std.dev 5.0602 NA 14.523 0.149 0.1039 157.982
## coef.var 0.0025 NA 0.072 0.035 0.0230 0.084
## SayersPower
## median 7551.309
## mean 7563.535
## SE.mean 26.061
## CI.mean.0.95 51.216
## var 310395.163
## std.dev 557.131
## coef.var 0.074
qplot(Shuttle, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(all40yard, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(LewisPower, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(SayersPower, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(Weight_lbs, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
All well normally distributed - except for weight in pounds. There seem to be two distinct groups of WRs in the draft, those that are heavy and those that are light. I will explore this more after doing the primary analysis.
40 yard dash is treated as the Y axis (dependant) variable in all plots and models
qplot(y=all40yard, x=Shuttle, data=nfldata) + stat_smooth(method=lm, formula=y~x)
Shuttle_SprintFit <- lm(all40yard ~ Shuttle, data=nfldata) #fit model
summary(Shuttle_SprintFit) # show results
##
## Call:
## lm(formula = all40yard ~ Shuttle, data = nfldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3391 -0.0629 0.0030 0.0659 0.3176
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.9202 0.1350 29.04 < 0.0000000000000002 ***
## Shuttle 0.1404 0.0321 4.37 0.000015 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1 on 455 degrees of freedom
## Multiple R-squared: 0.0404, Adjusted R-squared: 0.0382
## F-statistic: 19.1 on 1 and 455 DF, p-value: 0.0000151
qplot(y=all40yard, x=LewisPower, data=nfldata) + stat_smooth(method=lm, formula=y~x)
LewisPower_SprintFit <- lm(all40yard ~ LewisPower, data=nfldata) #fit model
summary(LewisPower_SprintFit) # show results
##
## Call:
## lm(formula = all40yard ~ LewisPower, data = nfldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3014 -0.0611 0.0000 0.0607 0.3289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.51970916 0.05827807 77.55 <0.0000000000000002 ***
## LewisPower -0.00000493 0.00003084 -0.16 0.87
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1 on 455 degrees of freedom
## Multiple R-squared: 5.63e-05, Adjusted R-squared: -0.00214
## F-statistic: 0.0256 on 1 and 455 DF, p-value: 0.873
qplot(y=all40yard, x=SayersPower, data=nfldata) + stat_smooth(method=lm, formula=y~x)
SayersPower_SprintFit <- lm(all40yard ~ SayersPower, data=nfldata) #fit model
summary(SayersPower_SprintFit) # show results
##
## Call:
## lm(formula = all40yard ~ SayersPower, data = nfldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3118 -0.0669 0.0010 0.0615 0.3235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7700406 0.0651892 73.17 < 0.0000000000000002 ***
## SayersPower -0.0000343 0.0000086 -3.99 0.000076 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1 on 455 degrees of freedom
## Multiple R-squared: 0.0339, Adjusted R-squared: 0.0317
## F-statistic: 15.9 on 1 and 455 DF, p-value: 0.0000759
qplot(y=all40yard, x=SayersPower, data=nfldata, geom=c("point", "smooth"),method="lm",size = Shuttle)
SayersShuttle_SprintFit <- lm(all40yard ~ SayersPower + Shuttle, data=nfldata) #fit model
summary(SayersShuttle_SprintFit) # show results
##
## Call:
## lm(formula = all40yard ~ SayersPower + Shuttle, data = nfldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3392 -0.0636 0.0029 0.0613 0.3202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.20599522 0.15563056 27.03 < 0.0000000000000002 ***
## SayersPower -0.00003032 0.00000852 -3.56 0.00041 ***
## Shuttle 0.12694152 0.03191136 3.98 0.000081 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1 on 454 degrees of freedom
## Multiple R-squared: 0.0664, Adjusted R-squared: 0.0623
## F-statistic: 16.1 on 2 and 454 DF, p-value: 0.000000168
With both sayers and shuttle in the regression model they are both significant preditors of sprint; BUT the overall R^2 is still fairly low.
Interestingly I think this is a good thing. If these three variables were highly related it would beg the question about why more than one or two tests are necessary. It is their lack of association that makes them each an important piece of infromation for the evaluation of athletic physicality.
As pointed out earlier there are two distinct groups of WR weights based on the histogram:
qplot(Weight_lbs, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Does weight contribute to our model of 40 yard sprint time?
qplot(y=all40yard, x=SayersPower, data=nfldata, geom=c("point", "smooth"),method="lm",size = Shuttle, color = Weight_lbs)
SayersShuttleWeight_SprintFit <- lm(all40yard ~ SayersPower + Shuttle + Weight_lbs, data=nfldata) #fit model
summary(SayersShuttleWeight_SprintFit) # show results
##
## Call:
## lm(formula = all40yard ~ SayersPower + Shuttle + Weight_lbs,
## data = nfldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3067 -0.0624 -0.0015 0.0624 0.3349
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.27154676 0.14871295 28.72 < 0.0000000000000002 ***
## SayersPower -0.00006852 0.00000987 -6.94 0.000000000014 ***
## Shuttle 0.05475561 0.03222465 1.70 0.09 .
## Weight_lbs 0.00262133 0.00038518 6.81 0.000000000032 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.096 on 453 degrees of freedom
## Multiple R-squared: 0.153, Adjusted R-squared: 0.147
## F-statistic: 27.3 on 3 and 453 DF, p-value: 0.000000000000000312
Just a bit - the overall R^2 has increased a bit; and weight is a significant predictor in the model in the direction you would expect, the more weight the greater the time (greater time means slower)
Another way to try this is to create a factor (category) variable based on the weight distribution. The code below breaks weight into 2 categories based on the distribution.
nfldata$wt_quantiles<-quantcut(nfldata$Weight_lbs, q=seq(0,1,by=0.5), na.rm=TRUE)
We can see that this process does a nice job of partitioning the histogram from earlier into 4 discrete categories of weight.
qplot(Weight_lbs, data=nfldata, geom="histogram",color=nfldata$wt_quantiles)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Now the scatter plot with the weight grouping:
qplot(y=all40yard, x=SayersPower, data=nfldata, geom=c("point", "smooth"),method="lm",size = Shuttle, color = nfldata$wt_quantiles)
SayersShuttleWeight2_SprintFit <- lm(all40yard ~ SayersPower + Shuttle + wt_quantiles, data=nfldata) #fit model
summary(SayersShuttleWeight2_SprintFit) # show results
##
## Call:
## lm(formula = all40yard ~ SayersPower + Shuttle + wt_quantiles,
## data = nfldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3209 -0.0602 -0.0003 0.0622 0.3258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.45733425 0.16486838 27.04 < 0.0000000000000002
## SayersPower -0.00004713 0.00000933 -5.05 0.00000063
## Shuttle 0.09248589 0.03248231 2.85 0.0046
## wt_quantiles(201,240] 0.04297454 0.01050192 4.09 0.00005062
##
## (Intercept) ***
## SayersPower ***
## Shuttle **
## wt_quantiles(201,240] ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.099 on 453 degrees of freedom
## Multiple R-squared: 0.0997, Adjusted R-squared: 0.0937
## F-statistic: 16.7 on 3 and 453 DF, p-value: 0.000000000256
While this factor (as opposed to continuous) approach is better visually with the scatter plot; it is not better mathematically. The factor variable for weight accounts for less of the variability in the data (lower R^2), which makes sense.
Seems to be some interesting findings to discuss - let me know if you have any questions, or based on the insights of this analysis whether you have any other questions that I might have time to do. This analysis would all be really easy with all the subjects as well (just have to change a bit of code at the beginning that only selects Wide Recievers). If you really really want to only look at WRs from 2014 I can also try to filter those cases alone. But - from this plot - I think you will see that while sprint time varies from year to year a bit; not by much, and there is no reason to specifically identify 2014 as far as I can see (the vertical black line is the mean of all years combined, and the dashed red lines represent the overall boundary +/- 1 standard deviation). Based on that - the year 2014 falls right at the mean (interestingly).
ggplot(nfldata, aes(factor(Year), all40yard)) + geom_boxplot() + geom_hline(aes(yintercept=4.5)) + geom_hline(aes(yintercept=4.6039), colour="#BB0000", linetype="dashed") + geom_hline(aes(yintercept=4.4061), colour="#BB0000", linetype="dashed")