Introduction

I chose to explore data about quarterbacks in the National Football League (NFL). I used a data set of quarterback statistics from the last 5 years. I cleaned the original data by requiring a minimum of 290 passing yards in a season to make the list. I restricted the data set this way because I mainly want to look at starting quarterbacks, or backup quarterbacks who played significant time in a given NFL season. Setting a minimum number of total yards eliminated any wide receivers who threw passes and backups who barely saw the field. The main response variable I wanted to look at was total passing yards in a season, so I built my question around that. The data set included 14 explanatory variables, most of which I left out of my analysis; in the end, I retained only the touchdown passes statistic (TD). My question is how the number of touchdown passes a QB throws relates to the number of passing yards they throw for in a season. My goal was to build a model that predicts the number of passing yards for a quarterback who threw a given number of touchdowns.
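The cleaning step described above can be sketched as follows. This is only an illustration, not the original cleaning code: the `raw` data frame and its three rows are made-up placeholders, and only the column names and the 290-yard cutoff come from the project.

```r
# Hypothetical raw data: a starter, a backup, and a wide receiver who threw one pass
raw <- data.frame(
  Player   = c("Starter A", "Backup B", "Receiver C"),
  Pass.Yds = c(4100, 320, 45),
  TD       = c(28, 1, 1)
)

# Keep only player-seasons with at least 290 passing yards
min_yards <- 290
cleaned <- subset(raw, Pass.Yds >= min_yards)
cleaned
```

With these placeholder rows, the receiver is dropped and the starter and backup remain.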

Data Wrangling

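library(lme4)      # lmer(), ranef(), VarCorr()
library(merTools)  # predictInterval(), REsim(), plotREsim()
library(ggplot2)   # ggplot()
df = read.csv('Project Data.csv')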
head(df, 5)
##               Player Pass.Yds Yds.Att Att Cmp Cmp.. TD INT  Rate X1st X1st.
## 1          Tom Brady     5316     7.4 719 485  67.4 43  12 102.1  269  37.4
## 2    Patrick Mahomes     5250     8.1 648 435  67.1 41  12 105.2  272  42.0
## 3 Ben Roethlisberger     5129     7.6 675 452  67.0 34  16  96.5  248  36.7
## 4     Jameis Winston     5109     8.2 626 380  60.7 33  30  84.3  243  38.8
## 5    Patrick Mahomes     5097     8.8 580 383  66.0 50  12 113.8  237  40.9
##   X20. X40. Lng Sck SckY
## 1   75   10  62  22  144
## 2   73   13  67  26  188
## 3   61   16  97  24  166
## 4   75   13  71  47  282
## 5   75   15  89  26  171
# Random intercepts model: one intercept shift per player, fit by maximum likelihood
m1 <- lmer(Pass.Yds ~ TD + (1 | Player), 
           data = df, REML = FALSE)
summary(m1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: Pass.Yds ~ TD + (1 | Player)
##    Data: df
## 
##      AIC      BIC   logLik deviance df.resid 
##   3878.9   3893.0  -1935.5   3870.9      246 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.5220 -0.6730 -0.1686  0.6807  2.6153 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Player   (Intercept)  37199   192.9   
##  Residual             277935   527.2   
## Number of obs: 250, groups:  Player, 94
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  617.326     63.771    9.68
## TD           118.192      3.278   36.05
## 
## Correlation of Fixed Effects:
##    (Intr)
## TD -0.779
ranef(m1)
## $Player
##                     (Intercept)
## Aaron Rodgers      -134.7618035
## Alex Smith           67.1781541
## Andrew Luck         -74.8155410
## Andy Dalton          34.3668865
## Bailey Zappe        -50.4376912
## Baker Mayfield       91.8744342
## Ben Roethlisberger   28.7407601
## Blaine Gabbert      -54.7826290
## Blake Bortles        66.5973250
## Brandon Allen       -78.1484691
## Brett Rypien        -65.9513939
## Brian Hoyer         -84.7653842
## Brock Osweiler       -9.3815845
## Brock Purdy         -92.0515846
## C.J. Beathard       -89.7780720
## Cam Newton           50.8906421
## Carson Wentz         -7.3114213
## Case Keenum          40.5754101
## Chase Daniel       -104.9251808
## Cody Kessler        -17.0818638
## Colt McCoy         -100.0597030
## Cooper Rush         -74.6643458
## Dak Prescott         45.2368503
## Daniel Jones        178.0226260
## David Blough        -12.5234701
## Davis Mills          68.3121919
## Derek Anderson      -17.9809448
## Derek Carr          318.4463266
## Deshaun Watson       52.2213037
## Desmond Ridder      -17.1999061
## Devlin Hodges       -17.1497503
## Drew Brees         -167.0136860
## Drew Lock            -6.3764484
## Dwayne Haskins       15.9474795
## Eli Manning          96.6233547
## Gardner Minshew     -90.4621431
## Geno Smith          -40.8974830
## Jacoby Brissett      80.6292173
## Jake Luton          -27.1154629
## Jalen Hurts          81.5149220
## Jameis Winston      -53.4997209
## Jared Goff          277.0293274
## Jarrett Stidham     -51.2413587
## Jeff Driskel       -121.1351188
## Jimmy Garoppolo       7.0553678
## Joe Burrow           21.9931264
## Joe Flacco           26.5993803
## Josh Allen          -80.1894709
## Josh Johnson       -100.5313215
## Josh McCown         -23.1974368
## Josh Rosen           20.2762387
## Justin Fields         4.3020178
## Justin Herbert      107.6252206
## Kenny Pickett       113.2417814
## Kirk Cousins        -18.9199960
## Kyle Allen           22.7335686
## Kyler Murray        126.4514317
## Lamar Jackson      -225.7134070
## Luke Falk           -23.7650196
## Mac Jones           138.1450969
## Marcus Mariota       18.8792908
## Mason Rudolph       -96.9777173
## Matt Moore          -50.8872317
## Matt Ryan           289.3126337
## Matt Schaub         -46.2609514
## Matthew Stafford     33.5114949
## Mike Glennon        -71.0297827
## Mike White           -3.7149267
## Mitchell Trubisky    -0.9587801
## Nathan Peterman     -51.8817262
## Nick Foles          -20.5139626
## Nick Mullens         55.3822525
## Patrick Mahomes    -146.1277677
## Philip Rivers       181.7189824
## PJ Walker           -64.2365967
## Russell Wilson     -164.6798883
## Ryan Finley         -44.8218145
## Ryan Fitzpatrick     21.4329830
## Ryan Tannehill      -64.5430875
## Sam Bradford        -53.5569479
## Sam Darnold         117.9225281
## Sam Ehlinger        -47.0872478
## Taylor Heinicke     -14.7356429
## Taysom Hill         -28.9484255
## Teddy Bridgewater    82.4542424
## Tim Boyle           -52.6352380
## Tom Brady            75.3319358
## Trevor Lawrence     226.6004720
## Trevor Siemian      -90.1176493
## Trey Lance          -71.4492283
## Tua Tagovailoa        1.6284962
## Tyler Huntley        -9.1443412
## Tyrod Taylor        -65.7754287
## Zach Wilson         107.1074421
## 
## with conditional variances for "Player"

Methodology

For this data, I used a mixed effects model. My data is longitudinal, meaning that there are multiple observations of the same subject: the same quarterback can appear in several seasons. Mixed effects models contain both fixed and random effects. The fixed effects are the same as in simple linear regression: they are explanatory variables that we believe have an effect on the response variable. Random effects are where this model differs from simple linear regression. They are categorical grouping variables in the data set (here, Player) that we want to control for, even if they are not the effect we are interested in.

The model assumes that the relationship between the predictor and the response is linear and that the residuals are normally distributed with constant variance. Unlike simple linear regression, it does not require all observations to be independent. Seasons from the same quarterback are clearly not independent, which is why I used a mixed effects model to account for the dependence.

The mixed effects model for my data can be written in symbols as \(Pass.Yds=\beta_{intercept}+\beta_{TD}TD+(effect_{player} + \epsilon)\), where \(effect_{player} \sim N(0,\tau^2)\). So the player effects are random and normally distributed with a mean of zero and some estimated standard deviation \(\tau\). The only difference between this mixed model and a standard regression is the player effect, which varies from player to player by an amount that is typically about \(\tau\). We can rewrite this equation as \(Pass.Yds = \beta_{int\_player}+\beta_{TD}TD+\epsilon\), where \(\beta_{int\_player}\sim N(\beta_{intercept},\tau^2)\). Now we see the intercepts as normally distributed with a mean of the overall intercept and a standard deviation of \(\tau\). As such, this is often called a random intercepts model.
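To make the random intercepts idea concrete, here is a small simulation sketch of the data-generating process the model assumes. The player names, the number of seasons, and the Poisson draw for TD are illustrative assumptions; the four numeric parameters are simply the rounded estimates from the `summary(m1)` output, used as plausible values.

```r
set.seed(42)

# Illustrative parameter values (rounded from the summary(m1) output)
beta_intercept <- 617.326  # overall intercept
beta_td        <- 118.192  # yards per touchdown pass
tau            <- 192.9    # SD of the player effects
sigma          <- 527.2    # residual SD

# Hypothetical players: each gets ONE random intercept shift shared by all his seasons
players <- c("QB A", "QB B", "QB C")
player_effect <- setNames(rnorm(length(players), mean = 0, sd = tau), players)

# Four simulated seasons per player
sim <- expand.grid(Player = players, Season = 1:4, stringsAsFactors = FALSE)
sim$TD <- rpois(nrow(sim), lambda = 25)

# Pass.Yds = intercept + slope * TD + player effect + residual noise
sim$Pass.Yds <- beta_intercept + beta_td * sim$TD +
  unname(player_effect[sim$Player]) + rnorm(nrow(sim), mean = 0, sd = sigma)

head(sim)
```

Because every season of a given player shares the same `player_effect`, seasons within a player are correlated, which is exactly the dependence the mixed model is meant to capture.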

Results and Conclusions

print(VarCorr(m1))
##  Groups   Name        Std.Dev.
##  Player   (Intercept) 192.87  
##  Residual             527.20
lm(Pass.Yds~TD, data = df)
## 
## Call:
## lm(formula = Pass.Yds ~ TD, data = df)
## 
## Coefficients:
## (Intercept)           TD  
##       614.0        119.2
pred.interval <- data.frame(predictInterval(m1))   

head(pred.interval, 5)
##        fit      upr      lwr
## 1 5785.990 6474.876 5099.205
## 2 5307.558 6004.930 4640.502
## 3 4696.076 5382.483 3986.220
## 4 4461.024 5188.246 3720.792
## 5 6373.605 7177.813 5665.643
effect.stats <- data.frame(REsim(m1))
head(effect.stats, 5)
##   groupFctr       groupID        term       mean     median       sd
## 1    Player Aaron Rodgers (Intercept) -133.99619 -121.40391 155.1021
## 2    Player    Alex Smith (Intercept)   69.34768   59.95666 174.8217
## 3    Player   Andrew Luck (Intercept)  -78.56671  -93.84877 192.7275
## 4    Player   Andy Dalton (Intercept)   38.60647   37.32613 155.7156
## 5    Player  Bailey Zappe (Intercept)  -53.24755  -40.70298 175.6930
plotREsim(REsim(m1))
## Warning: Ignoring unknown parameters: linewidth

The fitted mixed effects model for my data is \(Pass.Yds=617.33+118.19(TD)+effect_{player}\). The player effect changes for every player, which is why the general fitted model does not have a single player effect. In fact, a mixed effects model generates a fitted line with a different intercept for every player in the data set. For example, the model for Aaron Rodgers would be \(Pass.Yds_{Rodgers}=617.33+118.19(TD)-134.76\). The plot above shows the range of player effects for each player, while the table above shows the mean, median, and standard deviation of the simulated player effects for the first 5 quarterbacks. From the VarCorr output above, we see that the standard deviation of the player effects is 192.87, meaning that about 68% of the player-specific intercepts are expected to fall within 192.87 of the overall intercept of 617.33, assuming they are normally distributed. Because this standard deviation, and therefore the variance, is high in relation to the mean intercept, I concluded that a mixed effects model was needed to model the data. Now that we know the mixed effects model is necessary, we can use it to make a prediction.
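As a rough companion to the standard deviations reported by VarCorr, the share of the leftover variance (after accounting for TD) that is attributable to players can be computed as an intraclass correlation. This is a sketch that just plugs in the two printed standard deviations; the variable names are my own.

```r
# Standard deviations copied from the VarCorr(m1) output
sd_player   <- 192.87
sd_residual <- 527.20

# Intraclass correlation: between-player variance over total (player + residual) variance
icc <- sd_player^2 / (sd_player^2 + sd_residual^2)
round(icc, 3)
```

The result is roughly 0.12, so about 12% of the variance not explained by touchdowns sits between players rather than within them.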

Scenario

Now on to the main purpose of my data analysis: prediction. Suppose a random quarterback in the NFL threw for 30 TDs. How many passing yards can we expect them to have thrown for? Using the general model, without player effects, we predict about 4163 yards. However, this prediction is for a random quarterback, and each quarterback has their own player effect, so we can also predict for individual quarterbacks. Continuing with the Aaron Rodgers model from above, if Rodgers throws 30 TDs, we can expect him to have thrown for about 4028 yards. This is about 135 yards below the expected output for a random quarterback. Next I will create a data frame and subsequent plot showing the fitted models for 5 quarterbacks.

new_data = data.frame(Player = c('Lamar Jackson', 'Tom Brady', 'Taylor Heinicke', 'Aaron Rodgers', 'Derek Carr'), TD = 30)
new_data2 = data.frame(Player = c('Lamar Jackson', 'Tom Brady', 'Taylor Heinicke', 'Aaron Rodgers'), TD = 0)
predict_no_re = predict(m1, new_data, re.form=NA)
predict_no_re
##        1        2        3        4        5 
## 4163.077 4163.077 4163.077 4163.077 4163.077
predict(m1, new_data)
##        1        2        3        4        5 
## 3937.364 4238.409 4148.342 4028.315 4481.524
predictions = data.frame(Player = new_data$Player, Pass.Yds = unname(predict(m1, new_data)), TD = 30)
predictions
##            Player Pass.Yds TD
## 1   Lamar Jackson 3937.364 30
## 2       Tom Brady 4238.409 30
## 3 Taylor Heinicke 4148.342 30
## 4   Aaron Rodgers 4028.315 30
## 5      Derek Carr 4481.524 30
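The predictions above can be checked by hand from the printed estimates. The sketch below uses the rounded coefficients from `summary(m1)` and the Aaron Rodgers effect from `ranef(m1)`; the variable names are my own, and the results agree with `predict()` only up to rounding of the printed coefficients.

```r
# Estimates copied from the summary(m1) and ranef(m1) output
intercept      <- 617.326
slope_td       <- 118.192
effect_rodgers <- -134.7618

td <- 30

# Population-level prediction (no player effect), comparable to re.form = NA
pred_general <- intercept + slope_td * td

# Player-specific prediction: shift the intercept by the player's random effect
pred_rodgers <- pred_general + effect_rodgers

c(general = pred_general, rodgers = pred_rodgers)
```

Both values land within a fraction of a yard of the 4163.077 and 4028.315 reported by `predict()`.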
selected_players <- c('Lamar Jackson', 'Tom Brady', 'Taylor Heinicke', 'Aaron Rodgers', 'Derek Carr')
subset_df <- df[df$Player %in% selected_players, ]
subset_df
##              Player Pass.Yds Yds.Att Att Cmp Cmp.. TD INT  Rate X1st X1st. X20.
## 1         Tom Brady     5316     7.4 719 485  67.4 43  12 102.1  269  37.4   75
## 12       Derek Carr     4804     7.7 626 428  68.4 23  14  94.0  217  34.7   67
## 15        Tom Brady     4694     6.4 733 490  66.8 25   9  90.7  237  32.3   50
## 18        Tom Brady     4633     7.6 610 401  65.7 40  12 102.2  233  38.2   63
## 28    Aaron Rodgers     4442     7.4 597 372  62.3 25   2  97.6  200  33.5   55
## 31        Tom Brady     4355     7.6 570 375  65.8 29  11  97.7  205  36.0   53
## 34    Aaron Rodgers     4299     8.2 526 372  70.7 48   5 121.5  216  41.1   57
## 44    Aaron Rodgers     4115     7.8 531 366  68.9 37   4 111.9  213  40.1   55
## 47       Derek Carr     4103     7.9 517 348  67.3 27   9 101.4  193  37.3   54
## 49        Tom Brady     4057     6.6 613 373  60.8 24   8  88.0  193  31.5   60
## 50       Derek Carr     4054     7.9 513 361  70.4 21   8 100.8  191  37.2   54
## 51       Derek Carr     4049     7.3 553 381  68.9 19  10  93.9  197  35.6   52
## 54    Aaron Rodgers     4002     7.0 569 353  62.0 26   4  95.4  189  33.2   52
## 76    Aaron Rodgers     3695     6.8 542 350  64.6 26  12  91.1  177  32.7   53
## 84       Derek Carr     3522     7.0 502 305  60.8 24  14  86.3  161  32.1   47
## 87  Taylor Heinicke     3419     6.9 494 321  65.0 20  15  85.9  167  33.8   40
## 96    Lamar Jackson     3127     7.8 401 265  66.1 36   6 113.3  161  40.2   42
## 113   Lamar Jackson     2882     7.5 382 246  64.4 16  13  87.0  135  35.3   41
## 117   Lamar Jackson     2757     7.3 376 242  64.4 26   9  99.4  138  36.7   37
## 143   Lamar Jackson     2242     6.9 326 203  62.3 17   7  91.1  105  32.2   24
## 155 Taylor Heinicke     1859     7.2 259 161  62.2 12   6  89.6   93  35.9   20
## 176   Lamar Jackson     1201     7.1 170  99  58.2  6   3  84.5   60  35.3   13
## 248 Taylor Heinicke      320     5.6  57  35  61.4  1   3  60.6   19  33.3    2
##     X40. Lng Sck SckY
## 1     10  62  22  144
## 12    10  61  40  241
## 15     8  63  22  160
## 18    12  50  21  143
## 28    16  75  49  353
## 31     8  63  21  147
## 34    14  78  20  182
## 44    10  75  30  188
## 47    12  85  26  150
## 49     6  59  27  185
## 50    10  75  29  184
## 51     7  66  51  299
## 54    12  74  36  284
## 76     6  58  32  258
## 84     8  60  27  191
## 87     6  73  38  278
## 96     8  83  23  106
## 113    5  49  38  190
## 117    4  47  29  160
## 143    4  75  26  114
## 155    5  61  19  141
## 176    2  74  16   71
## 248    0  33   2   17
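# Pull the fixed effects and per-player intercept shifts from the fitted model
# instead of hard-coding rounded values
b0 <- fixef(m1)[["(Intercept)"]]
b1 <- fixef(m1)[["TD"]]
re <- ranef(m1)$Player
ggplot(subset_df, aes(x = TD, y = Pass.Yds, color = Player)) +
  geom_point() +
  geom_abline(intercept = b0 + re["Aaron Rodgers", "(Intercept)"], slope = b1, linetype = "solid", color = "salmon", alpha = 0.5)+
  geom_abline(intercept = b0 + re["Taylor Heinicke", "(Intercept)"], slope = b1, linetype = "solid", color = "skyblue")+
  geom_abline(intercept = b0 + re["Tom Brady", "(Intercept)"], slope = b1, linetype = "solid", color = "darkorchid1")+
  geom_abline(intercept = b0 + re["Lamar Jackson", "(Intercept)"], slope = b1, linetype = "solid", color = "seagreen", alpha = 0.5)+
  geom_abline(intercept = b0 + re["Derek Carr", "(Intercept)"], slope = b1, linetype = "solid", color = "darkkhaki")+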
  labs(title = "Plot of lines for 5 QBs",
       x = "TD",
       y = "Pass Yards") +
  theme_minimal()

Discussion and Critique

I learned that different quarterbacks have different fitted models when it comes to predicting passing yards based on touchdown passes. What I found most interesting was the interpretation of the player effects that I calculated. First, I want to preface this discussion by noting that a quarterback can be considered good because they have a lot of passing yards, a lot of touchdowns, or both. A large negative player effect indicates that the quarterback has a high touchdown to passing yard ratio. A large positive player effect indicates that the quarterback has a low touchdown to passing yard ratio. Since a touchdown to passing yard ratio is not an inherently good measure of a quarterback, this project should not be used to rank quarterbacks. For example, Aaron Rodgers, who is widely expected to make the Hall of Fame, has a clearly negative player effect (about -135), while Tom Brady, another expected Hall of Fame quarterback, has a positive player effect (about +75). Both are among the top quarterbacks of all time, and yet their player effects are far apart. What the model should be used for is predicting a quarterback’s season based on expected touchdowns.

A major weakness of my analysis is the lack of explanatory variables. Using nested F-tests, one could probably find a much better model for passing yards based on more than one predictor and interaction terms. Another concern is season length. The NFL regular season was 16 games until 2021, when it was expanded to 17 games. At first I thought this might be a problem; however, since my model is linear and doesn’t explicitly take games played into account, I do not expect it to cause problems with my model. The only worry is that an extra game will increase yards far more than touchdowns thrown, because yards are much easier to get than TDs.

Another change that I could make alongside the inclusion of more explanatory variables is fitting a random slopes model instead of a random intercepts model. A random slopes model would need more data than the random intercepts model, because each quarterback would need enough seasons to support their own fitted line and meet the linear regression model assumptions.
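A random slopes model of the kind mentioned above could be fit with the same `lmer()` call by letting the TD slope vary by player. The sketch below is an assumption-laden illustration, not part of the analysis: it runs on simulated stand-in data (40 hypothetical players with 5 seasons each, made-up parameter values) so that it is runnable on its own, and it compares the two models with a likelihood ratio test.

```r
library(lme4)

set.seed(1)

# Simulated stand-in for the project data (hypothetical values, not the real data set)
n_players <- 40
n_seasons <- 5
sim <- data.frame(
  Player = factor(rep(seq_len(n_players), each = n_seasons)),
  TD     = rpois(n_players * n_seasons, lambda = 22)
)
int_dev   <- rnorm(n_players, mean = 0, sd = 200)  # per-player intercept deviation
slope_dev <- rnorm(n_players, mean = 0, sd = 15)   # per-player slope deviation
idx <- as.integer(sim$Player)
sim$Pass.Yds <- 600 + int_dev[idx] + (118 + slope_dev[idx]) * sim$TD +
  rnorm(nrow(sim), mean = 0, sd = 500)

# Random intercepts only (same structure as m1)
m_int <- lmer(Pass.Yds ~ TD + (1 | Player), data = sim, REML = FALSE)

# Random intercepts and random TD slopes for each player
m_slope <- lmer(Pass.Yds ~ TD + (1 + TD | Player), data = sim, REML = FALSE)

# Likelihood ratio test between the nested models
cmp <- anova(m_int, m_slope)
cmp
```

To try this on the project data, `sim` would be replaced by `df`. With 250 observations spread over 94 players, the slope variance may be poorly estimated, and `lmer()` may report a singular fit or convergence warning.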