Analysis for Currier’s physiology of exercise lab report

Using the NFL Combine Data (Freely available online): http://nflcombineresults.com/nflcombinedata.php

Decided to use “all40yard” - not just the NFL combined timed sprints; most likely includes sprints times at “Pro Day” at the respective college programs (according to a football source)

Goal: “to find out which is a better indicator of a faster 40 yard dash time. Either the shuttle time or vertical jump, among wide receivers”

Analysis approach

  1. Load, select and clean data relevant for this analysis
  2. Descriptive statistics of relevant data
  3. Univariate distributions of relevant data
  4. Bivariate distributions with scatterplots
  5. Linear regression

For the sake of presentation 4 & 5 will be combined so that output releated to the linear regression model will follow the scatterplot of the same variables.

Setting working directory and loading required R packages:

setwd("~/physexlab/NFLcombineData")
library(dplyr) #For data manipulation
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2) #For plotting
library(pastecs) #For descriptive statistics
## Loading required package: boot
## 
## Attaching package: 'pastecs'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(psych) #For descriptive statistics
## 
## Attaching package: 'psych'
## 
## The following object is masked from 'package:boot':
## 
##     logit
## 
## The following object is masked from 'package:ggplot2':
## 
##     %+%
library(gtools)
## 
## Attaching package: 'gtools'
## 
## The following object is masked from 'package:psych':
## 
##     logit
## 
## The following objects are masked from 'package:boot':
## 
##     inv.logit, logit

Load, select and clean data relevant for this analysis

nfldata<-read.csv("data_cleaned.csv")
nfldata<-mutate(nfldata, Weight_kg = Weight_lbs/2.2)
nfldata <- mutate(nfldata, Height_m = Height_in*0.0254)
nfldata <- mutate(nfldata, VertJump_m = VertLeap_in*0.0254)
nfldata <- mutate(nfldata, LewisPower = sqrt(4.9)*Weight_kg*9.81*sqrt(VertJump_m))
nfldata <- mutate(nfldata, SayersPower = (60.7*(VertJump_m*100))+(45.3*Weight_kg-2055))

#Select variables of interest 
nfldata <- select(nfldata, Year, POS, Weight_lbs, Shuttle, all40yard, LewisPower, SayersPower)

#Filter based on rows of interest (position = WR)
nfldata<-filter(nfldata, grepl('WR', POS))

#Keep only complete cases to avoid missing values
nfldata <- nfldata[complete.cases(nfldata), ]

Summary of data after the above process

str(nfldata)
## 'data.frame':    457 obs. of  7 variables:
##  $ Year       : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ POS        : Factor w/ 18 levels "C","CB","DE",..: 18 18 18 18 18 18 18 18 18 18 ...
##  $ Weight_lbs : int  180 197 205 195 212 213 211 185 220 216 ...
##  $ Shuttle    : num  4.07 4.15 4.11 4.26 4.06 4.3 3.98 4.32 4.12 4.18 ...
##  $ all40yard  : num  4.43 4.42 4.54 4.51 4.43 4.35 4.42 4.56 4.57 4.58 ...
##  $ LewisPower : num  1651 1996 1962 1762 2135 ...
##  $ SayersPower: num  6893 8400 7871 7048 8632 ...
summary(nfldata)
##       Year           POS        Weight_lbs     Shuttle        all40yard   
##  Min.   :1999   WR     :457   Min.   :155   Min.   :3.730   Min.   :4.21  
##  1st Qu.:2004   C      :  0   1st Qu.:191   1st Qu.:4.100   1st Qu.:4.45  
##  Median :2008   CB     :  0   Median :201   Median :4.200   Median :4.51  
##  Mean   :2008   DE     :  0   Mean   :201   Mean   :4.205   Mean   :4.51  
##  3rd Qu.:2012   DT     :  0   3rd Qu.:211   3rd Qu.:4.310   3rd Qu.:4.57  
##  Max.   :2015   FB     :  0   Max.   :240   Max.   :4.770   Max.   :4.84  
##                 (Other):  0                                               
##    LewisPower    SayersPower  
##  Min.   :1443   Min.   :6035  
##  1st Qu.:1770   1st Qu.:7208  
##  Median :1887   Median :7551  
##  Mean   :1883   Mean   :7564  
##  3rd Qu.:1987   3rd Qu.:7917  
##  Max.   :2324   Max.   :9269  
## 

457 complete cases with the variables of interest, all WRs across all years (1999 - 2015)

Descriptive statistics of relevant data

options(scipen=100) #supresses scientific notation of output
options(digits=2) #suggests a limit of 2 significant digits - but it is only a suggestion
stat.desc(nfldata, basic=F) #runs basic stats
##                   Year POS Weight_lbs Shuttle all40yard LewisPower
## median       2008.0000  NA    201.000   4.200    4.5100   1886.566
## mean         2007.6871  NA    200.993   4.205    4.5104   1883.170
## SE.mean         0.2367  NA      0.679   0.007    0.0049      7.390
## CI.mean.0.95    0.4652  NA      1.335   0.014    0.0096     14.523
## var            25.6058  NA    210.932   0.022    0.0108  24958.330
## std.dev         5.0602  NA     14.523   0.149    0.1039    157.982
## coef.var        0.0025  NA      0.072   0.035    0.0230      0.084
##              SayersPower
## median          7551.309
## mean            7563.535
## SE.mean           26.061
## CI.mean.0.95      51.216
## var           310395.163
## std.dev          557.131
## coef.var           0.074

Univariate distributions of relevant data

qplot(Shuttle, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(all40yard, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(LewisPower, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(SayersPower, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(Weight_lbs, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

All well normally distributed - except for weight in pounds. There seem to be two distinct groups of WRs in the draft, those that are heavy and those that are light. I will explore this more after doing the primary analysis.

Bivariate distributions with scatterplots and Linear regression

40 yard dash is treated as the Y axis (dependant) variable in all plots and models

Shuttle

qplot(y=all40yard, x=Shuttle, data=nfldata) + stat_smooth(method=lm, formula=y~x)

Shuttle_SprintFit <- lm(all40yard ~ Shuttle, data=nfldata) #fit model
summary(Shuttle_SprintFit) # show results 
## 
## Call:
## lm(formula = all40yard ~ Shuttle, data = nfldata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3391 -0.0629  0.0030  0.0659  0.3176 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   3.9202     0.1350   29.04 < 0.0000000000000002 ***
## Shuttle       0.1404     0.0321    4.37             0.000015 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1 on 455 degrees of freedom
## Multiple R-squared:  0.0404, Adjusted R-squared:  0.0382 
## F-statistic: 19.1 on 1 and 455 DF,  p-value: 0.0000151

Lewis Power (average power from vertical jump)

qplot(y=all40yard, x=LewisPower, data=nfldata) + stat_smooth(method=lm, formula=y~x)

LewisPower_SprintFit <- lm(all40yard ~ LewisPower, data=nfldata) #fit model
summary(LewisPower_SprintFit) # show results 
## 
## Call:
## lm(formula = all40yard ~ LewisPower, data = nfldata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3014 -0.0611  0.0000  0.0607  0.3289 
## 
## Coefficients:
##                Estimate  Std. Error t value            Pr(>|t|)    
## (Intercept)  4.51970916  0.05827807   77.55 <0.0000000000000002 ***
## LewisPower  -0.00000493  0.00003084   -0.16                0.87    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1 on 455 degrees of freedom
## Multiple R-squared:  5.63e-05,   Adjusted R-squared:  -0.00214 
## F-statistic: 0.0256 on 1 and 455 DF,  p-value: 0.873

Sayers Power (peak power from vertical jump)

qplot(y=all40yard, x=SayersPower, data=nfldata) + stat_smooth(method=lm, formula=y~x)

SayersPower_SprintFit <- lm(all40yard ~ SayersPower, data=nfldata) #fit model
summary(SayersPower_SprintFit) # show results 
## 
## Call:
## lm(formula = all40yard ~ SayersPower, data = nfldata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3118 -0.0669  0.0010  0.0615  0.3235 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  4.7700406  0.0651892   73.17 < 0.0000000000000002 ***
## SayersPower -0.0000343  0.0000086   -3.99             0.000076 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1 on 455 degrees of freedom
## Multiple R-squared:  0.0339, Adjusted R-squared:  0.0317 
## F-statistic: 15.9 on 1 and 455 DF,  p-value: 0.0000759

Does combining shuttle and sayers power improve the relationship with 40 yard sprint?

qplot(y=all40yard, x=SayersPower, data=nfldata, geom=c("point", "smooth"),method="lm",size = Shuttle)

SayersShuttle_SprintFit <- lm(all40yard ~ SayersPower + Shuttle, data=nfldata) #fit model
summary(SayersShuttle_SprintFit) # show results 
## 
## Call:
## lm(formula = all40yard ~ SayersPower + Shuttle, data = nfldata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3392 -0.0636  0.0029  0.0613  0.3202 
## 
## Coefficients:
##                Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)  4.20599522  0.15563056   27.03 < 0.0000000000000002 ***
## SayersPower -0.00003032  0.00000852   -3.56              0.00041 ***
## Shuttle      0.12694152  0.03191136    3.98             0.000081 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1 on 454 degrees of freedom
## Multiple R-squared:  0.0664, Adjusted R-squared:  0.0623 
## F-statistic: 16.1 on 2 and 454 DF,  p-value: 0.000000168

With both sayers and shuttle in the regression model they are both significant preditors of sprint; BUT the overall R^2 is still fairly low.

Interestingly I think this is a good thing. If these three variables were highly related it would beg the question about why more than one or two tests are necessary. It is their lack of association that makes them each an important piece of infromation for the evaluation of athletic physicality.

What about weight?

As pointed out earlier there are two distinct groups of WR weights based on the histogram:

qplot(Weight_lbs, data=nfldata, geom="histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Does weight contribute to our model of 40 yard sprint time?

qplot(y=all40yard, x=SayersPower, data=nfldata, geom=c("point", "smooth"),method="lm",size = Shuttle, color = Weight_lbs)

SayersShuttleWeight_SprintFit <- lm(all40yard ~ SayersPower + Shuttle + Weight_lbs, data=nfldata) #fit model
summary(SayersShuttleWeight_SprintFit) # show results 
## 
## Call:
## lm(formula = all40yard ~ SayersPower + Shuttle + Weight_lbs, 
##     data = nfldata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3067 -0.0624 -0.0015  0.0624  0.3349 
## 
## Coefficients:
##                Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)  4.27154676  0.14871295   28.72 < 0.0000000000000002 ***
## SayersPower -0.00006852  0.00000987   -6.94       0.000000000014 ***
## Shuttle      0.05475561  0.03222465    1.70                 0.09 .  
## Weight_lbs   0.00262133  0.00038518    6.81       0.000000000032 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.096 on 453 degrees of freedom
## Multiple R-squared:  0.153,  Adjusted R-squared:  0.147 
## F-statistic: 27.3 on 3 and 453 DF,  p-value: 0.000000000000000312

Just a bit - the overall R^2 has increased a bit; and weight is a significant predictor in the model in the direction you would expect, the more weight the greater the time (greater time means slower)

Can this be made more clear?

Another way to try this is to create a factor (category) variable based on the weight distribution. The code below breaks weight into 2 categories based on the distribution.

nfldata$wt_quantiles<-quantcut(nfldata$Weight_lbs, q=seq(0,1,by=0.5), na.rm=TRUE)

We can see that this process does a nice job of partitioning the histogram from earlier into 4 discrete categories of weight.

qplot(Weight_lbs, data=nfldata, geom="histogram",color=nfldata$wt_quantiles)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Now the scatter plot with the weight grouping:

qplot(y=all40yard, x=SayersPower, data=nfldata, geom=c("point", "smooth"),method="lm",size = Shuttle, color = nfldata$wt_quantiles)

SayersShuttleWeight2_SprintFit <- lm(all40yard ~ SayersPower + Shuttle + wt_quantiles, data=nfldata) #fit model
summary(SayersShuttleWeight2_SprintFit) # show results 
## 
## Call:
## lm(formula = all40yard ~ SayersPower + Shuttle + wt_quantiles, 
##     data = nfldata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3209 -0.0602 -0.0003  0.0622  0.3258 
## 
## Coefficients:
##                          Estimate  Std. Error t value             Pr(>|t|)
## (Intercept)            4.45733425  0.16486838   27.04 < 0.0000000000000002
## SayersPower           -0.00004713  0.00000933   -5.05           0.00000063
## Shuttle                0.09248589  0.03248231    2.85               0.0046
## wt_quantiles(201,240]  0.04297454  0.01050192    4.09           0.00005062
##                          
## (Intercept)           ***
## SayersPower           ***
## Shuttle               ** 
## wt_quantiles(201,240] ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.099 on 453 degrees of freedom
## Multiple R-squared:  0.0997, Adjusted R-squared:  0.0937 
## F-statistic: 16.7 on 3 and 453 DF,  p-value: 0.000000000256

While this factor (as opposed to continuous) approach is better visually with the scatter plot; it is not better mathematically. The factor variable for weight accounts for less of the variability in the data (lower R^2), which makes sense.

Final thoughts -

Seems to be some interesting findings to discuss - let me know if you have any questions, or based on the insights of this analysis whether you have any other questions that I might have time to do. This analysis would all be really easy with all the subjects as well (just have to change a bit of code at the beginning that only selects Wide Recievers). If you really really want to only look at WRs from 2014 I can also try to filter those cases alone. But - from this plot - I think you will see that while sprint time varies from year to year a bit; not by much, and there is no reason to specifically identify 2014 as far as I can see (the vertical black line is the mean of all years combined, and the dashed red lines represent the overall boundary +/- 1 standard deviation). Based on that - the year 2014 falls right at the mean (interestingly).

ggplot(nfldata, aes(factor(Year), all40yard)) + geom_boxplot() + geom_hline(aes(yintercept=4.5)) + geom_hline(aes(yintercept=4.6039), colour="#BB0000", linetype="dashed") + geom_hline(aes(yintercept=4.4061), colour="#BB0000", linetype="dashed")