Introduction

For this project, the dataset I will be using is NFL quarterback statistical data from 1970 to 2022. The dataset is sourced from kaggle and contains statistics from every quarterback that has made an appearance in an NFL game between the years 1970 to 2022. There are 3,177 observations and 15 variables. There is also a Player column which contains the name of the player which that row of statistics corresponds to and the year in which that player played. I will ignore those columns for the purposes of this statistical analysis. I will be looking at the touchdown variable as a response variable and attempt to create a useful model by which to predict the number of touchdowns a quarterback will throw based on the available statistics in our dataset. I will use the ideas of weighted multiple linear regression to carry out this analysis. As for my research question, wins are the most important thing in the NFL, Quarterbacks are the most important position in the sport, with the number of touchdowns being the way that points are scored, I want to determine a model that can be used to predict the number of touchdowns thrown by a quarterback.

Methods

The model that I will be attempting to create is a weighted least squares regression model. Similar to an ordinary least squares, weighted least squares has a response variable with one or more predictor variables, and each predictor has a beta coefficient factor. In ordinary multiple linear regression, the predictors qll have equal weighting of one, and none of the predictors have more of an impact on the model than other ones. But in weighted multiple linear regression, the observations each have their own weighting corresponding to the residual distance from the fitted model, so observations further from the model fit line have a smaller impact on the model itself. Some key differences between ordinary multiple linear regression and weighted multiple linear regression are the assumptions. In ordinary multiple regression, we have an assumption of homoscedasicity, or equal variance of errors. But in the weighted framework, the model is not dependent on homoscedasicity, and we can have heteroscedastic, or varying error variances.

The theoretical model equation looks like the following:

\[Y = \beta_0 + (\beta_1 \times X_1) + (\beta_2 \times X_2) + ... + \epsilon\] The weights for each observation in my model will be calculated through the following equation \(w_i = \frac{1}{\sigma_i^2}\), the reciprical of the variance of each observation. The weights will then be multiplied through the equation to obtain the fitted values. The weighted least squares model minimizes the sum of the weights multiplied by the squared residuals, creating a model that properly handles the residual variance and does not violate any of the multiple linear regression model assumptions, specifically the homoscedasicity of errors. The other model assumptions we have are linearity, independence of observations, and zero-centered errors.

Results and Conclusions

Here is the data that we read in to carry out our analysis:

nfl_data <- read.csv("~/Downloads/NFL QB Stats.csv")
head(nfl_data)
##   Year          Player Pass.Yds Yds.Att Att Cmp Cmp.. TD INT  Rate X1st X1st.
## 1 2022 Patrick Mahomes     5250     8.1 648 435  67.1 41  12 105.2  272  42.0
## 2 2022  Justin Herbert     4739     6.8 699 477  68.2 25  10  93.2  228  32.6
## 3 2022       Tom Brady     4694     6.4 733 490  66.8 25   9  90.7  237  32.3
## 4 2022    Kirk Cousins     4547     7.1 643 424  65.9 29  14  92.5  230  35.8
## 5 2022      Joe Burrow     4475     7.4 606 414  68.3 35  12 100.8  219  36.1
## 6 2022      Jared Goff     4438     7.6 587 382  65.1 29   7  99.3  227  38.7
##   X20. X40. Lng Sck SckY
## 1   73   13  67  26  188
## 2   50    7  55  38  206
## 3   50    8  63  22  160
## 4   47   10  66  46  329
## 5   53   10  60  41  259
## 6   57   12  81  23  156

I filtered out the data a little bit after examining the pairwise scatter plots and noticed that there were a lot of players who did not play very much, and that was skewing the data in my estimation. So I calculated the average number of quarterbacks who played in each season and found out that roughly 59 quarterbacks play every year, about double the number of teams in the NFL. The median number of pass attempts of all quartbacks is 211, so I filtered for all quarterbacks which had more than 211 pass attempts in an effort to get only the top half of quarterbacks in each year.

mean(table(nfl_data$Year))
## [1] 59.9434
favstats(nfl_data$Att)
##  min Q1 median  Q3 max     mean       sd    n missing
##    2 75    211 400 733 245.3503 184.5515 3177       0
sum(is.na(nfl_data$TD))
## [1] 0
filtered_nfl_data <- nfl_data %>%
  filter(Att > 211) 

filtered_nfl_data$Year <- NULL 
filtered_nfl_data$Player <- NULL


head(filtered_nfl_data)
##   Pass.Yds Yds.Att Att Cmp Cmp.. TD INT  Rate X1st X1st. X20. X40. Lng Sck SckY
## 1     5250     8.1 648 435  67.1 41  12 105.2  272  42.0   73   13  67  26  188
## 2     4739     6.8 699 477  68.2 25  10  93.2  228  32.6   50    7  55  38  206
## 3     4694     6.4 733 490  66.8 25   9  90.7  237  32.3   50    8  63  22  160
## 4     4547     7.1 643 424  65.9 29  14  92.5  230  35.8   47   10  66  46  329
## 5     4475     7.4 606 414  68.3 35  12 100.8  219  36.1   53   10  60  41  259
## 6     4438     7.6 587 382  65.1 29   7  99.3  227  38.7   57   12  81  23  156

Next we have a correlation table of all of the predictor variables in the dataset against Touchdowns. We see that Pass Yards and Cmp are two of the more highly correlated variables with Touchdowns and INT and Sck are the least correlated with touchdowns.

cor_table <- cor(filtered_nfl_data['TD'], filtered_nfl_data[ , !names(filtered_nfl_data) %in% 'TD'], use = "complete.obs")
cor_table
##     Pass.Yds   Yds.Att       Att       Cmp     Cmp..       INT      Rate
## TD 0.8574058 0.5744764 0.7549639 0.7922835 0.5269718 0.1343383 0.7321022
##         X1st     X1st.     X20.     X40.       Lng       Sck      SckY
## TD 0.7292133 0.3636926 0.637886 0.593574 0.3100493 0.1674977 0.1154877

Let’s take a closer look at those highly correlated predictors:

ggplot(filtered_nfl_data, aes(x = Pass.Yds, y = TD)) +
  geom_point(alpha = 0.6, color = 'blue') + 
  labs(title = "TD v. Pass Yards") + theme_classic() 

ggplot(filtered_nfl_data, aes(x = Cmp, y = TD)) +
  geom_point(alpha = 0.6, color = 'skyblue') + 
  labs(title = "TD v. Cmp") + theme_classic() 

Clearly, we see strong positive correlation between each of these predictors and touchdowns, we can also take a look at some of the predictors which do not appear to be strongly correlated with touchdowns according to their correlation coefficients.

ggplot(filtered_nfl_data, aes(x = INT, y = TD)) + 
  geom_point(alpha = 0.6, color = 'blue') + 
  labs(title = "TD v. INT") + theme_classic() 

ggplot(filtered_nfl_data, aes(x = Sck, y = TD)) +
  geom_point(alpha = 0.6, color = 'skyblue') + 
  labs(title = "TD v. Sck") + theme_classic() 

These graphs do not have a clear correlation between touchdowns and interceptions or sacks.

The theoretical model equation for a simple multiple linear regression of our data is:

\[TD = \beta_0 + (\beta_1 \times X_1) + ... + (\beta_i \times X_i) + \epsilon\]

where i is the number of predictors, in our case we have 10. The assumptions of this model are independence of observations,

\(\epsilon \sim N(0,\sigma^2)\)

I am only interested in examining the predictors Pass.Yards, Yds.Att, Att, Cmp, Cmp.. (%), Rate, Lng, Sck, and SckY. So for this model i is 10.

Let’s first create a simple multiple linear regression model which we can then use for the rest of our analysis.

td.lm <- lm(TD ~Pass.Yds + Yds.Att+Att+Cmp+Cmp..+INT+Rate+Lng+Sck+SckY, data = filtered_nfl_data)

summary(td.lm)
## 
## Call:
## lm(formula = TD ~ Pass.Yds + Yds.Att + Att + Cmp + Cmp.. + INT + 
##     Rate + Lng + Sck + SckY, data = filtered_nfl_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2072 -1.0840 -0.1396  0.9315  9.4250 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 54.7673465  1.9505124  28.078  < 2e-16 ***
## Pass.Yds     0.0089058  0.0006305  14.125  < 2e-16 ***
## Yds.Att     -6.5629986  0.2647635 -24.788  < 2e-16 ***
## Att         -0.1538712  0.0050118 -30.702  < 2e-16 ***
## Cmp          0.1720859  0.0089920  19.138  < 2e-16 ***
## Cmp..       -1.4684837  0.0392307 -37.432  < 2e-16 ***
## INT          1.0054090  0.0193625  51.926  < 2e-16 ***
## Rate         0.9657059  0.0136556  70.719  < 2e-16 ***
## Lng          0.0024558  0.0039388   0.623 0.533047    
## Sck         -0.0389331  0.0114489  -3.401 0.000689 ***
## SckY         0.0036911  0.0015676   2.355 0.018660 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.894 on 1573 degrees of freedom
## Multiple R-squared:  0.9478, Adjusted R-squared:  0.9474 
## F-statistic:  2854 on 10 and 1573 DF,  p-value: < 2.2e-16

From our ordinary model summary, we see that 9 of our predictors are currently significant, with p values below our significance level of alpha = 0.05. Let’s check our model conditions now with a Fitted vs. Residuals plot and QQ plot.

plot(td.lm, which = 1)

plot(td.lm, which = 2)

From these fitted vs. residual and QQ plot of our initial multiple linear regression model, we see in our residuals vs. fitted plot we do not have equal variance across all of our errors. In an effort to make this model better, we can create a weighted multiple linear regression model that does not require homoscedasicity of residuals. First we can create a variance model based on the squares of the residuals to calculate our weightings.

residuals <- residuals(td.lm)

residuals_squared <- residuals^2

variance_model <- lm(residuals_squared ~ Pass.Yds + Yds.Att+Att+Cmp+Cmp..+INT+Rate+Sck+SckY+Lng, data = filtered_nfl_data)
predicted_variance <- predict(variance_model)

filtered_nfl_data$predicted_variance <- predicted_variance
summary(variance_model)
## 
## Call:
## lm(formula = residuals_squared ~ Pass.Yds + Yds.Att + Att + Cmp + 
##     Cmp.. + INT + Rate + Sck + SckY + Lng, data = filtered_nfl_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.500  -3.029  -1.841   0.168  81.672 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.2133874  7.2375687   4.589 4.81e-06 ***
## Pass.Yds     0.0012753  0.0023396   0.545  0.58577    
## Yds.Att      1.2660941  0.9824311   1.289  0.19768    
## Att         -0.1099266  0.0185969  -5.911 4.16e-09 ***
## Cmp          0.1835338  0.0333658   5.501 4.41e-08 ***
## Cmp..       -0.4443189  0.1455693  -3.052  0.00231 ** 
## INT         -0.0419369  0.0718465  -0.584  0.55950    
## Rate        -0.1550757  0.0506705  -3.060  0.00225 ** 
## Sck         -0.0521653  0.0424824  -1.228  0.21966    
## SckY         0.0007172  0.0058166   0.123  0.90188    
## Lng         -0.0203716  0.0146152  -1.394  0.16356    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.028 on 1573 degrees of freedom
## Multiple R-squared:  0.06477,    Adjusted R-squared:  0.05883 
## F-statistic: 10.89 on 10 and 1573 DF,  p-value: < 2.2e-16
ggplot(filtered_nfl_data, aes(x = TD, y = residuals_squared)) +
  geom_point(alpha = 0.6) +
  geom_line(aes(y = predicted_variance), color = "blue") +
  labs(title = "Variance Function", y = "Squared Residuals") + theme_classic()

Here we have our Variance function plot of the squared residuals against the response variable, this is what we will use to create the weighted model.

filtered_nfl_data$weights <- 1 / (abs(filtered_nfl_data$TD) + 0.1)

td_weighted.lm <- lm(TD ~Pass.Yds + Yds.Att+Att+Cmp+Cmp..+INT+Rate+Lng+Sck+SckY, data = filtered_nfl_data, weights = weights)
summary(td_weighted.lm)
## 
## Call:
## lm(formula = TD ~ Pass.Yds + Yds.Att + Att + Cmp + Cmp.. + INT + 
##     Rate + Lng + Sck + SckY, data = filtered_nfl_data, weights = weights)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92839 -0.24763  0.00804  0.27099  2.53177 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 49.8728050  1.5596025  31.978   <2e-16 ***
## Pass.Yds     0.0101207  0.0005594  18.091   <2e-16 ***
## Yds.Att     -6.1344740  0.2094644 -29.286   <2e-16 ***
## Att         -0.1563966  0.0046686 -33.499   <2e-16 ***
## Cmp          0.1608407  0.0090040  17.863   <2e-16 ***
## Cmp..       -1.2349233  0.0350111 -35.272   <2e-16 ***
## INT          0.9715852  0.0195839  49.611   <2e-16 ***
## Rate         0.8242451  0.0126255  65.284   <2e-16 ***
## Lng          0.0019284  0.0035618   0.541   0.5883    
## Sck         -0.0179715  0.0101729  -1.767   0.0775 .  
## SckY         0.0009541  0.0014119   0.676   0.4993    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4788 on 1573 degrees of freedom
## Multiple R-squared:  0.9545, Adjusted R-squared:  0.9542 
## F-statistic:  3299 on 10 and 1573 DF,  p-value: < 2.2e-16

With our new fitted model, we see that the number of statistically significant predictors is decreased to 7, as Sck and SckY are no longer statistically significant with regards to our response variable of Touchdowns. Lets look at our model conditions now with the Fitted vs. Residual and QQ Plots.

plot(td_weighted.lm, which = 1)

plot(td_weighted.lm, which = 2)

The QQ Plot for our weighted multiple linear regression model is slightly better than our initial ordinary regression model, but we can see in our Fitted vs Residual plot that our new model has a closer centering to 0 on the left side of our fitted values, and they do appear to be more normally distributed, making this model a better model that satisfies our model conditions to a better extent.

Discussion and Critique

This may not have been the best method for creating the weighted model for touchdowns by predictor variables. There are other ways to do it calculating weights by each individual predictor if I were to have prior knowledge of what statistics are more likely to have impact on touchdowns, but I did not. The weighted model created has potential for multicollinearity which has not been checked for yet, and I may be able to simplify the model by removing variables with heavy multicollinearity. Extensions of this project would be to find a “best model” by starting with a null model and following the forward stepwise process in order to add one predictor at a time, that would likely make the model simpler and also illustrate which variables may be causing the unequal variance and heteroscedasticity of the residuals.