I chose to explore data about quarterbacks in the National Football League (NFL), using a data set of quarterback statistics from the last 5 years. I cleaned the original data by requiring a minimum of 290 passing yards in a season to make the list. I restricted the data set this way because I mainly want to look at starting quarterbacks, or backup quarterbacks who played significant time in a given NFL season. Setting a minimum number of total yards removes wide receivers who threw a pass or two and backups who barely saw the field. The main response variable I wanted to look at was total passing yards in a season, so I built my question around it. The data set includes 14 other statistics that could serve as explanatory variables, most of which I left out of my analysis. In the end, I retained only touchdown passes (TD). My question is how the number of touchdown passes a quarterback throws in a season relates to the number of passing yards they throw. My goal was to build a model that predicts the passing yards of a quarterback who threw a given number of touchdowns.
df = read.csv('Project Data.csv')
head(df, 5)
## Player Pass.Yds Yds.Att Att Cmp Cmp.. TD INT Rate X1st X1st.
## 1 Tom Brady 5316 7.4 719 485 67.4 43 12 102.1 269 37.4
## 2 Patrick Mahomes 5250 8.1 648 435 67.1 41 12 105.2 272 42.0
## 3 Ben Roethlisberger 5129 7.6 675 452 67.0 34 16 96.5 248 36.7
## 4 Jameis Winston 5109 8.2 626 380 60.7 33 30 84.3 243 38.8
## 5 Patrick Mahomes 5097 8.8 580 383 66.0 50 12 113.8 237 40.9
## X20. X40. Lng Sck SckY
## 1 75 10 62 22 144
## 2 73 13 67 26 188
## 3 61 16 97 24 166
## 4 75 13 71 47 282
## 5 75 15 89 26 171
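library(lme4)  # provides lmer()
# Random intercepts model: one intercept shift per Player, fit by maximum likelihood
m1 <- lmer(Pass.Yds ~ TD + (1 | Player),
data = df, REML = FALSE)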
summary(m1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: Pass.Yds ~ TD + (1 | Player)
## Data: df
##
## AIC BIC logLik deviance df.resid
## 3878.9 3893.0 -1935.5 3870.9 246
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.5220 -0.6730 -0.1686 0.6807 2.6153
##
## Random effects:
## Groups Name Variance Std.Dev.
## Player (Intercept) 37199 192.9
## Residual 277935 527.2
## Number of obs: 250, groups: Player, 94
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 617.326 63.771 9.68
## TD 118.192 3.278 36.05
##
## Correlation of Fixed Effects:
## (Intr)
## TD -0.779
ranef(m1)
## $Player
## (Intercept)
## Aaron Rodgers -134.7618035
## Alex Smith 67.1781541
## Andrew Luck -74.8155410
## Andy Dalton 34.3668865
## Bailey Zappe -50.4376912
## Baker Mayfield 91.8744342
## Ben Roethlisberger 28.7407601
## Blaine Gabbert -54.7826290
## Blake Bortles 66.5973250
## Brandon Allen -78.1484691
## Brett Rypien -65.9513939
## Brian Hoyer -84.7653842
## Brock Osweiler -9.3815845
## Brock Purdy -92.0515846
## C.J. Beathard -89.7780720
## Cam Newton 50.8906421
## Carson Wentz -7.3114213
## Case Keenum 40.5754101
## Chase Daniel -104.9251808
## Cody Kessler -17.0818638
## Colt McCoy -100.0597030
## Cooper Rush -74.6643458
## Dak Prescott 45.2368503
## Daniel Jones 178.0226260
## David Blough -12.5234701
## Davis Mills 68.3121919
## Derek Anderson -17.9809448
## Derek Carr 318.4463266
## Deshaun Watson 52.2213037
## Desmond Ridder -17.1999061
## Devlin Hodges -17.1497503
## Drew Brees -167.0136860
## Drew Lock -6.3764484
## Dwayne Haskins 15.9474795
## Eli Manning 96.6233547
## Gardner Minshew -90.4621431
## Geno Smith -40.8974830
## Jacoby Brissett 80.6292173
## Jake Luton -27.1154629
## Jalen Hurts 81.5149220
## Jameis Winston -53.4997209
## Jared Goff 277.0293274
## Jarrett Stidham -51.2413587
## Jeff Driskel -121.1351188
## Jimmy Garoppolo 7.0553678
## Joe Burrow 21.9931264
## Joe Flacco 26.5993803
## Josh Allen -80.1894709
## Josh Johnson -100.5313215
## Josh McCown -23.1974368
## Josh Rosen 20.2762387
## Justin Fields 4.3020178
## Justin Herbert 107.6252206
## Kenny Pickett 113.2417814
## Kirk Cousins -18.9199960
## Kyle Allen 22.7335686
## Kyler Murray 126.4514317
## Lamar Jackson -225.7134070
## Luke Falk -23.7650196
## Mac Jones 138.1450969
## Marcus Mariota 18.8792908
## Mason Rudolph -96.9777173
## Matt Moore -50.8872317
## Matt Ryan 289.3126337
## Matt Schaub -46.2609514
## Matthew Stafford 33.5114949
## Mike Glennon -71.0297827
## Mike White -3.7149267
## Mitchell Trubisky -0.9587801
## Nathan Peterman -51.8817262
## Nick Foles -20.5139626
## Nick Mullens 55.3822525
## Patrick Mahomes -146.1277677
## Philip Rivers 181.7189824
## PJ Walker -64.2365967
## Russell Wilson -164.6798883
## Ryan Finley -44.8218145
## Ryan Fitzpatrick 21.4329830
## Ryan Tannehill -64.5430875
## Sam Bradford -53.5569479
## Sam Darnold 117.9225281
## Sam Ehlinger -47.0872478
## Taylor Heinicke -14.7356429
## Taysom Hill -28.9484255
## Teddy Bridgewater 82.4542424
## Tim Boyle -52.6352380
## Tom Brady 75.3319358
## Trevor Lawrence 226.6004720
## Trevor Siemian -90.1176493
## Trey Lance -71.4492283
## Tua Tagovailoa 1.6284962
## Tyler Huntley -9.1443412
## Tyrod Taylor -65.7754287
## Zach Wilson 107.1074421
##
## with conditional variances for "Player"
For this data, I used a mixed effects model. My data is longitudinal, meaning that there are multiple observations of the same subject (several seasons for the same quarterback). Mixed effects models contain both fixed and random effects. The fixed effects are the same as in simple linear regression: they are explanatory variables that we believe have an effect on the response variable. Random effects are where this model differs from simple linear regression. They come from categorical grouping variables in the data set (here, Player) that we want to control for, even if they are not the effects we are mainly interested in. The model assumptions are that the response is linearly related to the explanatory variable and that the residuals are normally distributed with constant variance. Unlike simple linear regression, the observations do not all need to be independent. Seasons from the same quarterback are clearly not independent of each other, which is why I used a mixed effects model to account for the dependence.
The mixed effects model for my data can be written in symbols as \(Pass.Yds=\beta_{intercept}+\beta_{TD}TD+(effect_{player} + \epsilon)\), where \({effect}_{player} \sim N(0,\tau)\). So the player effects are random and normally distributed with a mean of zero and some estimated standard deviation (\(\tau\)). The only difference between this mixed model and a standard regression is the player effect, which varies from player to player by an amount that is typically about \(\tau\). We can rewrite this equation as \(Pass.Yds = \beta_{int\_player}+\beta_{TD}TD+\epsilon\), where \(\beta_{int\_player}\sim N(\beta_{intercept},\tau)\). Now we see the intercepts as normally distributed with a mean of the overall intercept and standard deviation \(\tau\). As such, this is often called a random intercepts model.
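To make the random intercepts idea concrete, the sketch below simulates data from a model of this form in base R. The parameter values are assumptions chosen to be roughly the size of the fitted values, and giving every player the same number of seasons is a simplification; this is not the actual data.

```r
# Minimal simulation of a random intercepts model (base R only).
# All numbers are illustrative assumptions, not estimates from the real data.
set.seed(1)
n_players <- 94        # number of players (groups)
n_seasons <- 3         # seasons per player, assumed equal for simplicity
beta_intercept <- 617  # overall intercept
beta_td <- 118         # shared slope for TD
tau <- 193             # standard deviation of the player effects
sigma <- 527           # standard deviation of the residual error

# One random effect per player, drawn from N(0, tau)
player_effect <- rnorm(n_players, mean = 0, sd = tau)

sim <- data.frame(
  Player = factor(rep(seq_len(n_players), each = n_seasons)),
  TD = sample(1:50, n_players * n_seasons, replace = TRUE)
)
idx <- as.integer(sim$Player)  # maps each row to its player's effect
sim$Pass.Yds <- beta_intercept + beta_td * sim$TD +
  player_effect[idx] + rnorm(nrow(sim), mean = 0, sd = sigma)

# Each player's own intercept is the overall intercept plus that player's effect
player_intercept <- beta_intercept + player_effect
head(data.frame(Player = levels(sim$Player), intercept = player_intercept), 3)
```

Every simulated player shares the same slope, so the players' lines are parallel and differ only in where they cross the y-axis.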
print(VarCorr(m1))
## Groups Name Std.Dev.
## Player (Intercept) 192.87
## Residual 527.20
lm(Pass.Yds~TD, data = df)
##
## Call:
## lm(formula = Pass.Yds ~ TD, data = df)
##
## Coefficients:
## (Intercept) TD
## 614.0 119.2
library(merTools)  # provides predictInterval(), REsim(), and plotREsim()
pred.interval <- data.frame(predictInterval(m1))
head(pred.interval, 5)
## fit upr lwr
## 1 5785.990 6474.876 5099.205
## 2 5307.558 6004.930 4640.502
## 3 4696.076 5382.483 3986.220
## 4 4461.024 5188.246 3720.792
## 5 6373.605 7177.813 5665.643
effect.stats <- data.frame(REsim(m1))
head(effect.stats, 5)
## groupFctr groupID term mean median sd
## 1 Player Aaron Rodgers (Intercept) -133.99619 -121.40391 155.1021
## 2 Player Alex Smith (Intercept) 69.34768 59.95666 174.8217
## 3 Player Andrew Luck (Intercept) -78.56671 -93.84877 192.7275
## 4 Player Andy Dalton (Intercept) 38.60647 37.32613 155.7156
## 5 Player Bailey Zappe (Intercept) -53.24755 -40.70298 175.6930
plotREsim(REsim(m1))
## Warning: Ignoring unknown parameters: linewidth
The fitted mixed effects model for my data is \(Pass.Yds=617.33+118.19(TD)+effect_{player}\).
The player effect changes for every player, which is why the general
fitted model does not have a single player effect. In effect, a mixed
effects model generates a line with a different intercept for every
player in the dataset. For example, the model for Aaron Rodgers would be
\(Pass.Yds_{Rodgers}=617.33+118.19(TD)-134.76\).
The plot above shows the range of simulated player effects for each player, while
the table above shows the mean, median, and standard deviation of the
simulated player effects for the first 5 quarterbacks. From the VarCorr output
above, we see that the standard deviation of the player effects is
192.87, meaning that about 68% of the player-specific intercepts are
expected to fall within 192.87 of 617.33. Because this standard deviation
is large in relation to the mean intercept, the intercept differs
meaningfully from player to player, which supports using a mixed
effects model for this data. With that established, we can use the
model to make a prediction.
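As a rough check on the player-to-player variation, the variance components printed in the model summary can be turned into the quantities discussed here. This is a minimal sketch in base R: the numbers are copied by hand from the output above, and the intraclass correlation is an extra summary that the original analysis did not compute.

```r
# Values copied from the summary(m1) output above
var_player <- 37199     # variance of the Player random intercepts
var_resid <- 277935     # residual variance
intercept <- 617.326    # fixed-effect intercept

# Standard deviation of the player effects (VarCorr reports the same quantity)
sd_player <- sqrt(var_player)

# Range that should hold about 68% of player-specific intercepts (mean +/- 1 SD)
range_68 <- intercept + c(-1, 1) * sd_player

# Intraclass correlation: share of the variance left after TD that is between players
icc <- var_player / (var_player + var_resid)

round(sd_player, 2)
round(range_68, 1)
round(icc, 3)
```

Under these numbers, a bit under an eighth of the variation not explained by touchdowns is attributed to differences between players, with the rest being season-to-season noise.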
Now on to the main purpose of my data analysis: prediction. Suppose a random quarterback in the NFL threw 30 touchdown passes. How many passing yards can we expect them to have thrown for? Using the general model, without player effects, we predict 4163 yards. However, this prediction is for a random quarterback, and each quarterback has their own player effect, so we can also predict for individual quarterbacks. Continuing with the Aaron Rodgers model from above, if Rodgers throws 30 TDs, we expect him to have thrown for 4028 yards, which is 135 yards below the prediction for a random quarterback. Next I will create a data frame and a plot showing the fitted models for 5 quarterbacks.
new_data = data.frame(Player = c('Lamar Jackson', 'Tom Brady', 'Taylor Heinicke', 'Aaron Rodgers', 'Derek Carr'), TD = 30)
new_data2 = data.frame(Player = c('Lamar Jackson', 'Tom Brady', 'Taylor Heinicke', 'Aaron Rodgers'), TD = 0)
predict_no_re = predict(m1, new_data, re.form=NA)
predict_no_re
## 1 2 3 4 5
## 4163.077 4163.077 4163.077 4163.077 4163.077
predict(m1, new_data)
## 1 2 3 4 5
## 3937.364 4238.409 4148.342 4028.315 4481.524
# Reuse the model predictions instead of retyping them by hand
predictions = data.frame(Player = new_data$Player, Pass.Yds = unname(predict(m1, new_data)), TD = 30)
predictions
## Player Pass.Yds TD
## 1 Lamar Jackson 3937.364 30
## 2 Tom Brady 4238.409 30
## 3 Taylor Heinicke 4148.342 30
## 4 Aaron Rodgers 4028.315 30
## 5 Derek Carr 4481.524 30
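The predictions above can be reproduced by hand from the fixed effects and the player effects, which shows what predict() is doing in each case. The sketch below uses only base R; the coefficients and player effects are copied from the printed output, so the results may differ from the predict() values in the last decimal places because of rounding.

```r
# Fixed effects copied from summary(m1)
b0 <- 617.326    # intercept
b_td <- 118.192  # slope for TD

# Player effects copied from ranef(m1)
player_effect <- c(
  "Lamar Jackson" = -225.7134070,
  "Tom Brady" = 75.3319358,
  "Taylor Heinicke" = -14.7356429,
  "Aaron Rodgers" = -134.7618035,
  "Derek Carr" = 318.4463266
)

td <- 30

# Prediction for a random quarterback (what re.form = NA gives)
pred_fixed <- b0 + b_td * td

# Player-specific predictions: shift the same line by each player's effect
pred_player <- pred_fixed + player_effect

round(pred_fixed)
round(pred_player)
```

The difference between a player-specific prediction and the general one is exactly that player's effect, regardless of the number of touchdowns.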
selected_players <- c('Lamar Jackson', 'Tom Brady', 'Taylor Heinicke', 'Aaron Rodgers', 'Derek Carr')
subset_df <- df[df$Player %in% selected_players, ]
subset_df
## Player Pass.Yds Yds.Att Att Cmp Cmp.. TD INT Rate X1st X1st. X20.
## 1 Tom Brady 5316 7.4 719 485 67.4 43 12 102.1 269 37.4 75
## 12 Derek Carr 4804 7.7 626 428 68.4 23 14 94.0 217 34.7 67
## 15 Tom Brady 4694 6.4 733 490 66.8 25 9 90.7 237 32.3 50
## 18 Tom Brady 4633 7.6 610 401 65.7 40 12 102.2 233 38.2 63
## 28 Aaron Rodgers 4442 7.4 597 372 62.3 25 2 97.6 200 33.5 55
## 31 Tom Brady 4355 7.6 570 375 65.8 29 11 97.7 205 36.0 53
## 34 Aaron Rodgers 4299 8.2 526 372 70.7 48 5 121.5 216 41.1 57
## 44 Aaron Rodgers 4115 7.8 531 366 68.9 37 4 111.9 213 40.1 55
## 47 Derek Carr 4103 7.9 517 348 67.3 27 9 101.4 193 37.3 54
## 49 Tom Brady 4057 6.6 613 373 60.8 24 8 88.0 193 31.5 60
## 50 Derek Carr 4054 7.9 513 361 70.4 21 8 100.8 191 37.2 54
## 51 Derek Carr 4049 7.3 553 381 68.9 19 10 93.9 197 35.6 52
## 54 Aaron Rodgers 4002 7.0 569 353 62.0 26 4 95.4 189 33.2 52
## 76 Aaron Rodgers 3695 6.8 542 350 64.6 26 12 91.1 177 32.7 53
## 84 Derek Carr 3522 7.0 502 305 60.8 24 14 86.3 161 32.1 47
## 87 Taylor Heinicke 3419 6.9 494 321 65.0 20 15 85.9 167 33.8 40
## 96 Lamar Jackson 3127 7.8 401 265 66.1 36 6 113.3 161 40.2 42
## 113 Lamar Jackson 2882 7.5 382 246 64.4 16 13 87.0 135 35.3 41
## 117 Lamar Jackson 2757 7.3 376 242 64.4 26 9 99.4 138 36.7 37
## 143 Lamar Jackson 2242 6.9 326 203 62.3 17 7 91.1 105 32.2 24
## 155 Taylor Heinicke 1859 7.2 259 161 62.2 12 6 89.6 93 35.9 20
## 176 Lamar Jackson 1201 7.1 170 99 58.2 6 3 84.5 60 35.3 13
## 248 Taylor Heinicke 320 5.6 57 35 61.4 1 3 60.6 19 33.3 2
## X40. Lng Sck SckY
## 1 10 62 22 144
## 12 10 61 40 241
## 15 8 63 22 160
## 18 12 50 21 143
## 28 16 75 49 353
## 31 8 63 21 147
## 34 14 78 20 182
## 44 10 75 30 188
## 47 12 85 26 150
## 49 6 59 27 185
## 50 10 75 29 184
## 51 7 66 51 299
## 54 12 74 36 284
## 76 6 58 32 258
## 84 8 60 27 191
## 87 6 73 38 278
## 96 8 83 23 106
## 113 5 49 38 190
## 117 4 47 29 160
## 143 4 75 26 114
## 155 5 61 19 141
## 176 2 74 16 71
## 248 0 33 2 17
library(ggplot2)
# Each line uses intercept = fixed intercept (617.33) + player effect, with the shared TD slope (118.19)
ggplot(subset_df, aes(x = TD, y = Pass.Yds, color = Player)) +
geom_point() +
geom_abline(intercept = 617.33-134.76, slope = 118.19, linetype = "solid", color = "salmon", alpha = 0.5)+ # Aaron Rodgers
geom_abline(intercept = 617.33-14.74, slope = 118.19, linetype = "solid", color = "skyblue")+ # Taylor Heinicke
geom_abline(intercept = 617.33+75.33, slope = 118.19, linetype = "solid", color = "darkorchid1")+ # Tom Brady
geom_abline(intercept = 617.33-225.71, slope = 118.19, linetype = "solid", color = "seagreen", alpha = 0.5)+ # Lamar Jackson
geom_abline(intercept = 617.33+318.45, slope = 118.19, linetype = "solid", color = "darkkhaki")+ # Derek Carr
labs(title = "Fitted lines for 5 QBs",
x = "TD",
y = "Pass Yards") +
theme_minimal()
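The gap between the lines for Aaron Rodgers and Derek Carr can be checked against the raw seasons listed in subset_df above. The sketch below retypes those ten rows in base R and totals them per player; yards per touchdown is a summary added here for illustration and is not part of the model.

```r
# Seasons for two quarterbacks, copied from the subset_df output above
seasons <- data.frame(
  Player = rep(c("Aaron Rodgers", "Derek Carr"), each = 5),
  Pass.Yds = c(4442, 4299, 4115, 4002, 3695,
               4804, 4103, 4054, 4049, 3522),
  TD = c(25, 48, 37, 26, 26,
         23, 27, 21, 19, 24)
)

# Total yards and touchdowns per player
totals <- aggregate(cbind(Pass.Yds, TD) ~ Player, data = seasons, FUN = sum)

# Yards thrown per touchdown pass
totals$Yds.per.TD <- totals$Pass.Yds / totals$TD
totals
```

The two quarterbacks have almost the same total passing yards across these seasons, but Rodgers has many more touchdowns, which is consistent with his line sitting below Carr's.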
I learned that different quarterbacks do have different fitted models when it comes to predicting passing yards from touchdown passes. What I found most interesting was the interpretation of the player effects. I want to preface this discussion by noting the ways a quarterback is usually thought of as good: they have a lot of passing yards, a lot of touchdowns, or both. A large negative player effect indicates that the quarterback has a high touchdown to passing yard ratio. A large positive player effect indicates that the quarterback has a low touchdown to passing yard ratio. Since a touchdown to passing yard ratio is not an inherently good measure of a quarterback, this project should not be used to rank quarterbacks. For example, Aaron Rodgers, widely regarded as a future Hall of Fame quarterback, has a clearly negative player effect (-134.76), while Tom Brady, another all-time great, has a positive player effect (75.33). Both are among the top quarterbacks of all time, and yet their player effects have opposite signs. What the player effects should be used for is predicting a quarterback’s passing yards in a season based on expected touchdowns.
A major weakness of my analysis is that it uses only one explanatory variable. Using nested model comparisons (for example nested F-tests, or likelihood ratio tests for mixed models), one could probably find a better model for passing yards that uses more than one predictor and interaction terms. Another concern is season length: the NFL regular season was 16 games for many years and has since been expanded to 17 games. At first I thought this might be a problem. However, since my model is linear and doesn’t explicitly take games played into account, an extra game should raise both touchdowns and yards and should not cause major problems with my model. The only worry is that an extra game will increase yards proportionally more than touchdowns, because yards are much easier to get than TDs.
Another change that I could make, alongside including more explanatory variables, is fitting a random slopes model instead of a random intercepts model. A random slopes model would need more data than the random intercepts model, because a separate slope would have to be estimated for each quarterback, and this data set has fewer than three seasons per player on average (250 observations across 94 players).
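As a sketch of what the random slopes version could look like, the example below simulates data in which each player has their own intercept and their own TD slope, then fits the model with lme4 if the package is installed. The simulated values, the number of players, and the six seasons per player are all assumptions for illustration; nothing here is fit to the real data.

```r
# Sketch of a random slopes model on simulated data (all values are assumptions)
set.seed(2)
n_players <- 40
n_seasons <- 6   # more seasons per player than the real data has on average

sim <- data.frame(
  Player = factor(rep(seq_len(n_players), each = n_seasons)),
  TD = sample(5:45, n_players * n_seasons, replace = TRUE)
)

u0 <- rnorm(n_players, mean = 0, sd = 200)  # player-specific intercept shifts
u1 <- rnorm(n_players, mean = 0, sd = 10)   # player-specific slope shifts
idx <- as.integer(sim$Player)

sim$Pass.Yds <- (600 + u0[idx]) + (118 + u1[idx]) * sim$TD +
  rnorm(nrow(sim), mean = 0, sd = 500)

# (1 + TD | Player) asks for a random intercept and a random TD slope per player.
# The fit is skipped when lme4 is not available.
if (requireNamespace("lme4", quietly = TRUE)) {
  m_slopes <- lme4::lmer(Pass.Yds ~ TD + (1 + TD | Player),
                         data = sim, REML = FALSE)
  print(lme4::fixef(m_slopes))
}
```

Compared with the random intercepts formula, the only change is replacing (1 | Player) with (1 + TD | Player), but the model then has to estimate a slope variance and an intercept-slope correlation as well, which is where the extra data requirement comes from.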