Start by setting up the libraries and processing data.
library(rio)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(VGAM)
## Loading required package: stats4
## Loading required package: splines
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(arm)
## Loading required package: MASS
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
##
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loading required package: lme4
## Registered S3 method overwritten by 'lme4':
## method from
## na.action.merMod car
##
## Attaching package: 'lme4'
##
## The following object is masked from 'package:rio':
##
## factorize
##
##
## arm (Version 1.15-2, built: 2026-3-22)
##
## Working directory is /Users/sjhawkins/H510
##
##
## Attaching package: 'arm'
##
## The following object is masked from 'package:car':
##
## logit
data <- import("plays.csv") %>%
mutate(zone = pff_manZone == "Zone",
incompleted_pass = passResult == "I",
interception = passResult == "IN",
dropbackDistance = ifelse(is.na(dropbackDistance), 0, dropbackDistance),
timeToThrow = ifelse(is.na(timeToThrow), 0, timeToThrow),
timeInTackleBox = ifelse(is.na(timeInTackleBox), 0, timeInTackleBox),
score_differential = abs(preSnapHomeScore - preSnapVisitorScore),
penaltyYards = ifelse(is.na(penaltyYards), 0, penaltyYards)
) %>%
filter(passResult %in% c("C", "I", "IN"))
Now we can fit a model including all variables that could possibly influence whether a pass is completed, incompleted or intercepted. We’ve set completion as the reference level, so the model will use the log odds of an incompletion vs a completion and the log odds of an interception vs a completion.
model <- vglm(formula = passResult ~ yardsToGo + quarter + down + passLength + playAction + dropbackDistance + timeToThrow + passTippedAtLine + zone + score_differential,
family = multinomial(refLevel = "C"),
data = data )
summary(model)
##
## Call:
## vglm(formula = passResult ~ yardsToGo + quarter + down + passLength +
## playAction + dropbackDistance + timeToThrow + passTippedAtLine +
## zone + score_differential, family = multinomial(refLevel = "C"),
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -1.518205 0.141766 -10.709 < 2e-16 ***
## (Intercept):2 -5.622102 0.431005 -13.044 < 2e-16 ***
## yardsToGo:1 -0.034620 0.007119 -4.863 1.16e-06 ***
## yardsToGo:2 -0.019788 0.020338 -0.973 0.3306
## quarter:1 -0.005909 0.023626 -0.250 0.8025
## quarter:2 0.148323 0.070231 2.112 0.0347 *
## down:1 -0.031315 0.033484 -0.935 0.3497
## down:2 0.114849 0.095960 1.197 0.2314
## passLength:1 0.049469 0.002684 18.429 < 2e-16 ***
## passLength:2 0.079274 0.005973 13.271 < 2e-16 ***
## playActionTRUE:1 -0.326186 0.066520 -4.904 9.41e-07 ***
## playActionTRUE:2 -0.369821 0.206504 -1.791 0.0733 .
## dropbackDistance:1 -0.002164 0.015496 -0.140 0.8889
## dropbackDistance:2 0.049144 0.047412 1.037 0.3000
## timeToThrow:1 0.433147 0.029604 14.631 < 2e-16 ***
## timeToThrow:2 0.328185 0.082102 3.997 6.41e-05 ***
## passTippedAtLineTRUE:1 4.407971 0.328131 13.434 < 2e-16 ***
## passTippedAtLineTRUE:2 5.145965 0.410873 12.524 < 2e-16 ***
## zoneTRUE:1 -0.566654 0.054775 -10.345 < 2e-16 ***
## zoneTRUE:2 -0.129794 0.169891 -0.764 0.4449
## score_differential:1 -0.001321 0.004042 -0.327 0.7438
## score_differential:2 -0.005662 0.012009 -0.471 0.6373
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: log(mu[,2]/mu[,1]), log(mu[,3]/mu[,1])
##
## Residual deviance: 11175.5 on 17388 degrees of freedom
##
## Log-likelihood: -5587.75 on 17388 degrees of freedom
##
## Number of Fisher scoring iterations: 7
##
##
## Reference group is level 1 of the response
Because our model has so many variables, the first thing we need to check for is issues with multicolinearity. We do this by first fitting a simple ordinary least squares regression and then can calculate the variance inflation factor from that model.
tem_model <- lm(passResult == "C" ~ yardsToGo + quarter + down + passLength + playAction + dropbackDistance + timeToThrow + passTippedAtLine + zone + score_differential,
data = data)
vif(tem_model)
## yardsToGo quarter down passLength
## 1.175031 1.145361 1.270405 1.151570
## playAction dropbackDistance timeToThrow passTippedAtLine
## 1.275255 1.420051 1.430717 1.015449
## zone score_differential
## 1.106535 1.140841
Here we can see very slight correlations between the variables, however the vif values are all close to 1, so we don’t need to worry about multicolinearity issues within the model. The next thing we need to check for is issues with the residuals in the model. Since this is a multinomial logistic regression, we can’t just plot the residual vs fitted values, but instead look at a binned residual plot.
probs <- fitted(model) # Matrix of predicted probabilities for each category
res <- resid(model, type = "pearson")
binnedplot(probs[, 1], res[, 1],
xlab = "Fitted Probabilities (Completion)",
main = "Binned Residual Plot")
Here we can see a couple of points falling outside the binned lines, however it’s not terribly prevalent. Coefficents may not be reliable at the most extreme values, however it’s not terribly concerning.
Finally we see if we can simplify the model at all. Here we loop through the covariates to see if we can retain or improve the AIC with a simplified model.
exp_var <- c("yardsToGo",
"quarter",
"down",
"passLength",
"playAction",
"dropbackDistance",
"timeToThrow",
"passTippedAtLine",
"zone",
"score_differential"
)
initial_model <- vglm(as.formula(paste("passResult ~ ", paste(exp_var, collapse = "+"))),
data = data,
family = multinomial(refLevel = "C"))
initial_AIC <- AIC(initial_model)
new_aic <- initial_AIC -1
while(initial_AIC > new_aic){
initial_AIC <- new_aic
AIC_vec <- vector()
var_left_out <- vector()
for(i in 1:length(exp_var)){
test_formula <- as.formula(paste("passResult~ ", paste(exp_var[-i], collapse = "+")))
test_model <- vglm(test_formula,
data = data,
family = multinomial(refLevel = "C"))
AIC_vec[i] <- AIC(test_model)
var_left_out[i] <- exp_var[i]
}
print(paste("Dropping", exp_var[AIC_vec == min(AIC_vec)]))
exp_var <- c(exp_var[!AIC_vec == min(AIC_vec)])
new_model <- vglm(as.formula(paste("passResult ~ ", paste(exp_var[-i], collapse = "+"))),
data = data,
family = multinomial(refLevel = "C"))
new_aic <- AIC(new_model)
}
## Warning in vglm.fitter(x = x, y = y, w = w, offset = offset, Xm2 = Xm2, : some
## quantities such as z, residuals, SEs may be inaccurate due to convergence at a
## half-step
## [1] "Dropping score_differential"
## [1] "Dropping dropbackDistance"
## [1] "Dropping down"
## Warning in vglm.fitter(x = x, y = y, w = w, offset = offset, Xm2 = Xm2, : some
## quantities such as z, residuals, SEs may be inaccurate due to convergence at a
## half-step
## [1] "Dropping quarter"
new_model <- vglm(as.formula(paste("passResult ~ ", paste(exp_var, collapse = "+"))),
data = data,
family = multinomial(refLevel = "C"))
summary(new_model)
##
## Call:
## vglm(formula = as.formula(paste("passResult ~ ", paste(exp_var,
## collapse = "+"))), family = multinomial(refLevel = "C"),
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -1.628794 0.093517 -17.417 < 2e-16 ***
## (Intercept):2 -4.902795 0.287514 -17.052 < 2e-16 ***
## yardsToGo:1 -0.032399 0.006684 -4.847 1.25e-06 ***
## yardsToGo:2 -0.025319 0.019935 -1.270 0.2041
## passLength:1 0.049313 0.002678 18.413 < 2e-16 ***
## passLength:2 0.080694 0.005943 13.579 < 2e-16 ***
## playActionTRUE:1 -0.304873 0.059298 -5.141 2.73e-07 ***
## playActionTRUE:2 -0.434309 0.179826 -2.415 0.0157 *
## timeToThrow:1 0.429498 0.027111 15.842 < 2e-16 ***
## timeToThrow:2 0.370186 0.075239 4.920 8.65e-07 ***
## passTippedAtLineTRUE:1 4.407689 0.328062 13.436 < 2e-16 ***
## passTippedAtLineTRUE:2 5.148987 0.410167 12.553 < 2e-16 ***
## zoneTRUE:1 -0.559721 0.054121 -10.342 < 2e-16 ***
## zoneTRUE:2 -0.174386 0.167125 -1.043 0.2967
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: log(mu[,2]/mu[,1]), log(mu[,3]/mu[,1])
##
## Residual deviance: 11184.85 on 17396 degrees of freedom
##
## Log-likelihood: -5592.425 on 17396 degrees of freedom
##
## Number of Fisher scoring iterations: 7
##
##
## Reference group is level 1 of the response
exp(coefvlm(new_model))
## (Intercept):1 (Intercept):2 yardsToGo:1
## 1.961660e-01 7.425801e-03 9.681204e-01
## yardsToGo:2 passLength:1 passLength:2
## 9.749991e-01 1.050549e+00 1.084039e+00
## playActionTRUE:1 playActionTRUE:2 timeToThrow:1
## 7.372171e-01 6.477121e-01 1.536487e+00
## timeToThrow:2 passTippedAtLineTRUE:1 passTippedAtLineTRUE:2
## 1.448003e+00 8.207958e+01 1.722570e+02
## zoneTRUE:1 zoneTRUE:2
## 5.713682e-01 8.399722e-01
This gives a simpler model, using only the six variables,
yardsToGo, passLength,
playAction, timeToThrow,
passTippedAtLine, and zone. We check one more
time the residuals vs fitted values
probs <- fitted(new_model) # Matrix of predicted probabilities for each category
res <- resid(new_model, type = "pearson")
library(arm)
binnedplot(probs[, 1], res[, 1],
xlab = "Fitted Probabilities (Category 1)",
main = "Binned Residual Plot")
Here we can see again that the binned residual plot shows a couple of points outside the bin lines, however, it is even fewer points than the more complicated model, so there is still not reason for concern.