Model Building Process

Start by setting up the libraries and processing data.

library(rio)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(VGAM)
## Loading required package: stats4
## Loading required package: splines
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(arm)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loading required package: lme4
## Registered S3 method overwritten by 'lme4':
##   method           from
##   na.action.merMod car 
## 
## Attaching package: 'lme4'
## 
## The following object is masked from 'package:rio':
## 
##     factorize
## 
## 
## arm (Version 1.15-2, built: 2026-3-22)
## 
## Working directory is /Users/sjhawkins/H510
## 
## 
## Attaching package: 'arm'
## 
## The following object is masked from 'package:car':
## 
##     logit
data <- import("plays.csv") %>%
  mutate(zone = pff_manZone == "Zone", 
         incompleted_pass = passResult == "I", 
         interception = passResult == "IN", 
         dropbackDistance = ifelse(is.na(dropbackDistance), 0, dropbackDistance),
         timeToThrow = ifelse(is.na(timeToThrow), 0, timeToThrow),
         timeInTackleBox = ifelse(is.na(timeInTackleBox), 0, timeInTackleBox), 
         score_differential = abs(preSnapHomeScore - preSnapVisitorScore), 
         penaltyYards = ifelse(is.na(penaltyYards), 0, penaltyYards)
  ) %>%
  filter(passResult %in% c("C", "I", "IN"))

Now we can fit a model including all variables that could possibly influence whether a pass is completed, incompleted or intercepted. We’ve set completion as the reference level, so the model will use the log odds of an incompletion vs a completion and the log odds of an interception vs a completion.

model <- vglm(formula = passResult ~ yardsToGo + quarter + down + passLength + playAction + dropbackDistance + timeToThrow + passTippedAtLine + zone + score_differential, 
              family = multinomial(refLevel = "C"), 
              data = data )
summary(model)
## 
## Call:
## vglm(formula = passResult ~ yardsToGo + quarter + down + passLength + 
##     playAction + dropbackDistance + timeToThrow + passTippedAtLine + 
##     zone + score_differential, family = multinomial(refLevel = "C"), 
##     data = data)
## 
## Coefficients: 
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1          -1.518205   0.141766 -10.709  < 2e-16 ***
## (Intercept):2          -5.622102   0.431005 -13.044  < 2e-16 ***
## yardsToGo:1            -0.034620   0.007119  -4.863 1.16e-06 ***
## yardsToGo:2            -0.019788   0.020338  -0.973   0.3306    
## quarter:1              -0.005909   0.023626  -0.250   0.8025    
## quarter:2               0.148323   0.070231   2.112   0.0347 *  
## down:1                 -0.031315   0.033484  -0.935   0.3497    
## down:2                  0.114849   0.095960   1.197   0.2314    
## passLength:1            0.049469   0.002684  18.429  < 2e-16 ***
## passLength:2            0.079274   0.005973  13.271  < 2e-16 ***
## playActionTRUE:1       -0.326186   0.066520  -4.904 9.41e-07 ***
## playActionTRUE:2       -0.369821   0.206504  -1.791   0.0733 .  
## dropbackDistance:1     -0.002164   0.015496  -0.140   0.8889    
## dropbackDistance:2      0.049144   0.047412   1.037   0.3000    
## timeToThrow:1           0.433147   0.029604  14.631  < 2e-16 ***
## timeToThrow:2           0.328185   0.082102   3.997 6.41e-05 ***
## passTippedAtLineTRUE:1  4.407971   0.328131  13.434  < 2e-16 ***
## passTippedAtLineTRUE:2  5.145965   0.410873  12.524  < 2e-16 ***
## zoneTRUE:1             -0.566654   0.054775 -10.345  < 2e-16 ***
## zoneTRUE:2             -0.129794   0.169891  -0.764   0.4449    
## score_differential:1   -0.001321   0.004042  -0.327   0.7438    
## score_differential:2   -0.005662   0.012009  -0.471   0.6373    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: log(mu[,2]/mu[,1]), log(mu[,3]/mu[,1])
## 
## Residual deviance: 11175.5 on 17388 degrees of freedom
## 
## Log-likelihood: -5587.75 on 17388 degrees of freedom
## 
## Number of Fisher scoring iterations: 7 
## 
## 
## Reference group is level  1  of the response

Because our model has so many variables, the first thing we need to check for is issues with multicolinearity. We do this by first fitting a simple ordinary least squares regression and then can calculate the variance inflation factor from that model.

tem_model <- lm(passResult == "C" ~ yardsToGo + quarter + down + passLength + playAction + dropbackDistance + timeToThrow + passTippedAtLine + zone + score_differential, 
                data = data)
vif(tem_model)
##          yardsToGo            quarter               down         passLength 
##           1.175031           1.145361           1.270405           1.151570 
##         playAction   dropbackDistance        timeToThrow   passTippedAtLine 
##           1.275255           1.420051           1.430717           1.015449 
##               zone score_differential 
##           1.106535           1.140841

Here we can see very slight correlations between the variables, however the vif values are all close to 1, so we don’t need to worry about multicolinearity issues within the model. The next thing we need to check for is issues with the residuals in the model. Since this is a multinomial logistic regression, we can’t just plot the residual vs fitted values, but instead look at a binned residual plot.

probs <- fitted(model)  # Matrix of predicted probabilities for each category
res <- resid(model, type = "pearson") 
binnedplot(probs[, 1], res[, 1], 
           xlab = "Fitted Probabilities (Completion)", 
           main = "Binned Residual Plot")

Here we can see a couple of points falling outside the binned lines, however it’s not terribly prevalent. Coefficents may not be reliable at the most extreme values, however it’s not terribly concerning.

Finally we see if we can simplify the model at all. Here we loop through the covariates to see if we can retain or improve the AIC with a simplified model.

exp_var <- c("yardsToGo",
             "quarter", 
             "down", 
             "passLength", 
             "playAction", 
             "dropbackDistance", 
             "timeToThrow", 
             "passTippedAtLine", 
             "zone", 
             "score_differential"
)
initial_model <- vglm(as.formula(paste("passResult ~ ", paste(exp_var, collapse = "+"))),
                     data = data, 
                     family = multinomial(refLevel = "C"))
initial_AIC <- AIC(initial_model)
new_aic <- initial_AIC -1
while(initial_AIC > new_aic){
  initial_AIC <- new_aic
  AIC_vec <- vector()
  var_left_out <- vector()
  
  for(i in 1:length(exp_var)){
    test_formula <- as.formula(paste("passResult~ ", paste(exp_var[-i], collapse = "+")))
    test_model <- vglm(test_formula,
                      data = data, 
                      family = multinomial(refLevel = "C"))
    AIC_vec[i] <- AIC(test_model)
    var_left_out[i] <- exp_var[i]
    
  }
  print(paste("Dropping", exp_var[AIC_vec == min(AIC_vec)]))
  exp_var <- c(exp_var[!AIC_vec == min(AIC_vec)])
  
  new_model <-  vglm(as.formula(paste("passResult ~ ", paste(exp_var[-i], collapse = "+"))),
                     data = data, 
                     family = multinomial(refLevel = "C"))
  new_aic <- AIC(new_model)
  
}
## Warning in vglm.fitter(x = x, y = y, w = w, offset = offset, Xm2 = Xm2, : some
## quantities such as z, residuals, SEs may be inaccurate due to convergence at a
## half-step
## [1] "Dropping score_differential"
## [1] "Dropping dropbackDistance"
## [1] "Dropping down"
## Warning in vglm.fitter(x = x, y = y, w = w, offset = offset, Xm2 = Xm2, : some
## quantities such as z, residuals, SEs may be inaccurate due to convergence at a
## half-step
## [1] "Dropping quarter"
new_model <-  vglm(as.formula(paste("passResult ~ ", paste(exp_var, collapse = "+"))),
                   data = data, 
                   family = multinomial(refLevel = "C"))
summary(new_model)
## 
## Call:
## vglm(formula = as.formula(paste("passResult ~ ", paste(exp_var, 
##     collapse = "+"))), family = multinomial(refLevel = "C"), 
##     data = data)
## 
## Coefficients: 
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1          -1.628794   0.093517 -17.417  < 2e-16 ***
## (Intercept):2          -4.902795   0.287514 -17.052  < 2e-16 ***
## yardsToGo:1            -0.032399   0.006684  -4.847 1.25e-06 ***
## yardsToGo:2            -0.025319   0.019935  -1.270   0.2041    
## passLength:1            0.049313   0.002678  18.413  < 2e-16 ***
## passLength:2            0.080694   0.005943  13.579  < 2e-16 ***
## playActionTRUE:1       -0.304873   0.059298  -5.141 2.73e-07 ***
## playActionTRUE:2       -0.434309   0.179826  -2.415   0.0157 *  
## timeToThrow:1           0.429498   0.027111  15.842  < 2e-16 ***
## timeToThrow:2           0.370186   0.075239   4.920 8.65e-07 ***
## passTippedAtLineTRUE:1  4.407689   0.328062  13.436  < 2e-16 ***
## passTippedAtLineTRUE:2  5.148987   0.410167  12.553  < 2e-16 ***
## zoneTRUE:1             -0.559721   0.054121 -10.342  < 2e-16 ***
## zoneTRUE:2             -0.174386   0.167125  -1.043   0.2967    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: log(mu[,2]/mu[,1]), log(mu[,3]/mu[,1])
## 
## Residual deviance: 11184.85 on 17396 degrees of freedom
## 
## Log-likelihood: -5592.425 on 17396 degrees of freedom
## 
## Number of Fisher scoring iterations: 7 
## 
## 
## Reference group is level  1  of the response
exp(coefvlm(new_model))
##          (Intercept):1          (Intercept):2            yardsToGo:1 
##           1.961660e-01           7.425801e-03           9.681204e-01 
##            yardsToGo:2           passLength:1           passLength:2 
##           9.749991e-01           1.050549e+00           1.084039e+00 
##       playActionTRUE:1       playActionTRUE:2          timeToThrow:1 
##           7.372171e-01           6.477121e-01           1.536487e+00 
##          timeToThrow:2 passTippedAtLineTRUE:1 passTippedAtLineTRUE:2 
##           1.448003e+00           8.207958e+01           1.722570e+02 
##             zoneTRUE:1             zoneTRUE:2 
##           5.713682e-01           8.399722e-01

This gives a simpler model, using only the six variables, yardsToGo, passLength, playAction, timeToThrow, passTippedAtLine, and zone. We check one more time the residuals vs fitted values

probs <- fitted(new_model)  # Matrix of predicted probabilities for each category
res <- resid(new_model, type = "pearson") 
library(arm)
binnedplot(probs[, 1], res[, 1], 
           xlab = "Fitted Probabilities (Category 1)", 
           main = "Binned Residual Plot")

Here we can see again that the binned residual plot shows a couple of points outside the bin lines, however, it is even fewer points than the more complicated model, so there is still not reason for concern.