If you missed part 1 on the College Football Expected Points model fundamentals, you can check it out here.

There have been some developments since I started talking about college football expected points a couple months ago, so let’s start with some housekeeping.

I have spent the past month or two working with the package creator on improving the capabilities and accuracy of the expected points model as well as the win probability model. Our initial roll-out of the updated package was used for part 1 of the College Football Expected Points model fundamentals series.


In part 1, we left off discussing field position and expected points, including a breakdown by down. What was presented there was mostly top-level results discussion and model definition, without much depth on how we arrived at the model or on how the model’s predictions generate the expected points.

In order to remedy that, we need to cover some model-building methods. This article will show, from a modeling perspective, how we arrived at the model we currently have. The entire purpose of this series of articles is to show you, as transparently as I can, how the Expected Points model works, how well it works, and how the model is limited.

Additionally, and maybe most importantly, this research is reproducible.

The Data

We will be acquiring data from CollegeFootballData.com, courtesy of @CFB_data, using the cfbscrapR package (@cfbscrapR), created by Meyappan Subbaiah with collaborators Saiem Gilani and Parker Fleming.

Warning: I have been advised to alert readers that things are about to get nerdy. This might be true, but the theme of this part is motivation, both personal and intellectual. Do you want to learn and excel? Then continue. Otherwise, just scroll through the visuals and skip the technical math details. (Editor’s note: I jumped right to the pretty pictures; don’t be ashamed to admit you did, too.)

Regression Methods

Regression flashcard by Chris Albon

Regression is a set of techniques for estimating the relationship between a set of predictor variables and a quantitative target variable, and our focus will be on one of the simplest types of relationship: linear.

Linear Regression

Linear Regression definitions
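In symbols, the definition in the flashcard is the standard one (not specific to our data):

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$

where $y_i$ is the quantitative target, the $x_{ij}$ are the independent variables, and the coefficients $\beta_j$ are estimated by ordinary least squares.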

Assumptions

  • Dependent variable is continuous, unbounded, and measured on an interval or ratio scale
  • Model has linear relation between independent and dependent variables
  • No outliers present
  • Independence Assumption: Sample observations are independent
  • Absence of multicollinearity between the predictor variables
  • Constant Variance Assumption (homoscedasticity)
  • Normal Distribution of error terms
  • Little or no auto-correlation in the residuals
library(cfbscrapR)
## Warning: replacing previous import 'mgcv::multinom' by 'nnet::multinom' when
## loading 'cfbscrapR'
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(stringr)
library(tidyverse)
## -- Attaching packages ---------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.1     v purrr   0.3.4
## v tibble  3.0.1     v forcats 0.5.0
## v tidyr   1.1.0
## -- Conflicts ------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(nnet)
library(mgcv)
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-31. For overview type 'help("mgcv-package")'.
## 
## Attaching package: 'mgcv'
## The following object is masked from 'package:nnet':
## 
##     multinom
library(texreg)
## Version:  1.36.23
## Date:     2017-03-03
## Author:   Philip Leifeld (University of Glasgow)
## 
## Please cite the JSS article in your publications -- see citation("texreg").
## 
## Attaching package: 'texreg'
## The following object is masked from 'package:tidyr':
## 
##     extract
library(aod)
## 
## Attaching package: 'aod'
## The following object is masked from 'package:mgcv':
## 
##     negbin
library(xtable)

pbp_full_df <- read.csv("ep_fg_model_data_loso.csv")

Filter out OT games for now. Create the EP model dataset that only includes plays with the seven basic types of next scoring events, along with the following play types:

  • Field Goal
  • No Play
  • Pass
  • Punt
  • Run
  • Sack
  • Spike

Remove these six play types:

  • “Extra Point Missed”
  • “Extra Point Good”
  • “Timeout”
  • “End of Half”
  • “End of Game”
  • “Uncategorized”
remove_plays <-
  c(
    "Extra Point Missed",
    "Extra Point Good",
    "Timeout",
    "End of Half",
    "End of Game",
    "Uncategorized"
  )

pbp_no_OT <- pbp_full_df %>% 
  filter(down > 0, down != 4) %>%          # valid downs only, excluding 4th down
  filter(year >= 2014) %>%   
  filter(!period %in% c(2, 4)) %>%         # keep only the 1st and 3rd quarters
  filter(abs_diff < 10) %>%                # score differential within 10 points
  filter(!play_type %in% remove_plays) %>%
  filter(!is.na(down)) %>%
  filter(!is.na(game_id)) %>%
  filter(log_ydstogo != -Inf) %>%
  filter(game_id != 400603838) %>%         # drop one problematic game
  mutate(
    Next_Score = forcats::fct_relevel(factor(Next_Score), "No_Score"),
    Goal_To_Go = ifelse(
      str_detect(play_type, "Field Goal"),
      distance >= (yards_to_goal - 17),    # FG rows include ~17 extra kick yards
      distance >= yards_to_goal
    ),
    Under_two = TimeSecsRem <= 120,        # under two minutes left in the half
    id_play = as.numeric(id_play)
  ) %>% filter(!is.na(game_id))

Since scores in football only come in increments of 2, 3, and 6 (plus 0, 1, or 2 additional points from the try after a touchdown), the football scoring scheme is not continuous over an interval or ratio scale without transforming the target variable.

While pretending I was unaware of the continuous dependent variable assumption of linear regression, I took a look at producing linear regression models using down, distance, and yards-to-goal as independent variables on a similarly treated college football dataset, excluding all 4th down plays. I tried predicting either the next score in the half or the points scored on the drive; the best of these fits reached an adjusted R-squared of 0.5154, while the others were quite low.

Adjusted R-squared is a measure of the percentage of the dependent variable’s variation explained by the independent variables, adjusted for the number of predictors; in this case, 51.54%. Below are the model summaries of these fittings; the strongest is the no-intercept drive-points model.

Next Score in Half EP linear regression model with down-distance interactions

lm_NSH_epa_model_1 <-lm(NSH ~ yards_to_goal + factor(down) * distance, data = pbp_no_OT)
summary(lm_NSH_epa_model_1)
## 
## Call:
## lm(formula = NSH ~ yards_to_goal + factor(down) * distance, data = pbp_no_OT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.414  -4.917   1.615   4.762  11.024 
## 
## Coefficients:
##                          Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)             6.5412927  0.1075969   60.794  < 2e-16 ***
## yards_to_goal          -0.0643730  0.0006261 -102.819  < 2e-16 ***
## factor(down)2          -0.3489803  0.1199236   -2.910  0.00361 ** 
## factor(down)3          -1.2943644  0.1210966  -10.689  < 2e-16 ***
## distance               -0.0633671  0.0107188   -5.912 3.39e-09 ***
## factor(down)2:distance -0.0339670  0.0121657   -2.792  0.00524 ** 
## factor(down)3:distance -0.0333073  0.0124170   -2.682  0.00731 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.474 on 150719 degrees of freedom
## Multiple R-squared:  0.08315,    Adjusted R-squared:  0.08311 
## F-statistic:  2278 on 6 and 150719 DF,  p-value: < 2.2e-16
par(mfrow=c(2, 2))
plot(lm_NSH_epa_model_1)

Drive Points EP linear regression model with down-distance interactions

lm_drivepts_epa_model_1 <-lm(pts_drive ~ yards_to_goal + factor(down) * distance, data = pbp_no_OT)
summary(lm_drivepts_epa_model_1)
## 
## Call:
## lm(formula = pts_drive ~ yards_to_goal + factor(down) * distance, 
##     data = pbp_no_OT)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.244 -2.289 -1.191  2.808  9.023 
## 
## Coefficients:
##                          Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)             6.3507555  0.0575555  110.341   <2e-16 ***
## yards_to_goal          -0.0452934  0.0003349 -135.244   <2e-16 ***
## factor(down)2          -0.5600635  0.0641492   -8.731   <2e-16 ***
## factor(down)3          -1.3111901  0.0647767  -20.242   <2e-16 ***
## distance               -0.0800461  0.0057337  -13.961   <2e-16 ***
## factor(down)2:distance  0.0003025  0.0065076    0.046    0.963    
## factor(down)3:distance -0.0025589  0.0066420   -0.385    0.700    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.928 on 150719 degrees of freedom
## Multiple R-squared:  0.144,  Adjusted R-squared:  0.144 
## F-statistic:  4227 on 6 and 150719 DF,  p-value: < 2.2e-16
par(mfrow=c(2, 2))
plot(lm_drivepts_epa_model_1)

Next Score in Half EP linear regression model without interactions (no intercept)

lm_NSH_epa_model_2 <-lm(NSH ~0+ yards_to_goal + factor(down) + distance, data = pbp_no_OT)
summary(lm_NSH_epa_model_2)
## 
## Call:
## lm(formula = NSH ~ 0 + yards_to_goal + factor(down) + distance, 
##     data = pbp_no_OT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.661  -4.918   1.625   4.762  10.686 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## yards_to_goal -0.0641357  0.0006207 -103.33   <2e-16 ***
## factor(down)1  6.8177017  0.0519665  131.19   <2e-16 ***
## factor(down)2  6.1411510  0.0476657  128.84   <2e-16 ***
## factor(down)3  5.2044477  0.0497537  104.60   <2e-16 ***
## distance      -0.0923753  0.0041345  -22.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.474 on 150721 degrees of freedom
## Multiple R-squared:  0.185,  Adjusted R-squared:  0.185 
## F-statistic:  6844 on 5 and 150721 DF,  p-value: < 2.2e-16
par(mfrow=c(2, 2))
plot(lm_NSH_epa_model_2)

Drive Points EP linear regression model without interactions (no intercept)

lm_drivepts_epa_model <-lm(pts_drive ~ 
                             0+
                             yards_to_goal + 
                             factor(down) + 
                             distance, 
                           data = pbp_no_OT)
summary(lm_drivepts_epa_model)
## 
## Call:
## lm(formula = pts_drive ~ 0 + yards_to_goal + factor(down) + distance, 
##     data = pbp_no_OT)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.242 -2.289 -1.191  2.809  9.071 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## yards_to_goal -0.045283   0.000332  -136.4   <2e-16 ***
## factor(down)1  6.359000   0.027797   228.8   <2e-16 ***
## factor(down)2  5.799372   0.025496   227.5   <2e-16 ***
## factor(down)3  5.027286   0.026613   188.9   <2e-16 ***
## distance      -0.080933   0.002211   -36.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.928 on 150721 degrees of freedom
## Multiple R-squared:  0.5155, Adjusted R-squared:  0.5154 
## F-statistic: 3.207e+04 on 5 and 150721 DF,  p-value: < 2.2e-16
png("lm_drivepts_epa_model_2.png")
par(mfrow=c(2, 2))
plot(lm_drivepts_epa_model)
dev.off()
## png 
##   2

Figure: Linear Regression model summary | Expected Drive-Points model using Down, Distance, and Field Position as factors

Additionally, the linear model indicates that all three factors are significant at the p < 0.001 level, which is at least some evidence that our variables have a relationship with drive points. Here is a plot of the linear regression model fit.

Figure: Linear Regression model plots | Expected Drive-Points model using Down, Distance, and Field Position as factors

In the plot on the top left, there are 4 distinct scoring types clearly visible, and the model is trying to thread a single line through all of them. The red line in the top-left plot would be relatively flat if the residuals of the model fit had constant variance. While there are two other non-zero scoring types, Field Goals and Opponent Field Goals, the data excluded 4th down plays, so they do not appear on the plot. For clarity, there are in fact 7 next-score types once we include the absence of a score, i.e. “No Score”, which is itself a type of next-score outcome.

Upon viewing these plots, I quickly realized that several assumptions of linear regression are being violated here, at minimum the constant variance assumption and the normal distribution of the error terms (see Figure 4). We need to keep adding to our regression toolbox, so let us now take a look at a type of regression that does not bind us to those assumptions.

Logistic Regression

Suppose we have a binary output variable Y; say Y gives a response of 1 if the next score in the half is a TD for the offense and 0 otherwise. If we wanted to predict the probability that the next score in the half is an offense TD, one of the prime candidate models would be logistic regression.

Logistic Regression flashcard by Chris Albon

Logistic Regression model definition
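In symbols, the standard definition is:

$$P(Y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta^\top x)}}, \qquad \text{equivalently} \qquad \log\frac{P(Y = 1 \mid x)}{1 - P(Y = 1 \mid x)} = \beta_0 + \beta^\top x$$

so the model is linear in the log-odds of the outcome rather than in the outcome itself.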

Assumptions

  • Binary logistic regression requires the dependent variable to be binary (i.e. 0/1)
  • There is a linear relationship between the log-odds of the outcome and each of the predictor variables
  • No extreme outliers present
  • Independence Assumption: Sample observations are independent
  • Absence of multicollinearity between the predictor variables

Note that, unlike linear regression, logistic regression does not assume constant variance or normally distributed error terms, which is exactly what frees us from the violations we just observed.

Now we are attempting to calculate the probability of the next scoring event directly, an essential component of an expected points model. Once we have a model capable of calculating the probability of a scoring event, e.g. the probability of the next score being an offense touchdown, we simply multiply that probability by the point value of the score to get the expected points.
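In symbols (the point values here are the conventional ones, with a touchdown counted as 7 to fold in the expected point after; the production model’s handling may differ):

$$EP = \sum_{s \in \mathcal{S}} v_s \cdot P(s \mid \text{down}, \text{distance}, \text{yards to goal})$$

where the score set $\mathcal{S}$ contains Offense TD (+7), Offense FG (+3), Offense Safety (+2), Opponent TD (-7), Opponent FG (-3), Opponent Safety (-2), and No Score (0). For the single-event touchdown model built below, this collapses to $EP = 7 \cdot P(\text{TD})$, which is exactly the 7*TD quantity computed in the charting code later on.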

Logistic vs Linear Regression flashcard by Chris Albon

Let’s restart with the data load, this time keeping 4th down plays in the dataset.

pbp_no_OT <- pbp_full_df %>% 
  filter(down > 0) %>%                     # valid downs; 4th down stays in this time
  filter(year >= 2014) %>%   
  filter(!play_type %in% remove_plays) %>%
  filter(!is.na(down)) %>%
  filter(!is.na(game_id)) %>%
  filter(log_ydstogo != -Inf) %>%
  filter(game_id != 400603838) %>%
  mutate(
    Next_Score = forcats::fct_relevel(factor(Next_Score), "No_Score"),
    NTDH = ifelse(NSH == 7, 1, 0),         # 1 if next score in half is an offense TD
    Next_TD = ifelse(NTDH == 1, "TD", "No_Score"),
    Next_TD = forcats::fct_relevel(factor(Next_TD), "No_Score"),
    Goal_To_Go = ifelse(
      str_detect(play_type, "Field Goal"),
      distance >= (yards_to_goal - 17),
      distance >= yards_to_goal
    ),
    Under_two = TimeSecsRem <= 120,
    down = as.factor(down),
    id_play = as.numeric(id_play)
  ) %>% filter(!is.na(game_id))

With that background, we can build our model equation with whatever independent variables we choose to include, as shown in the figure below.

Offense Touchdown Logit Regression model equation
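Written out (a reconstruction matching the glm call below, with 1st down as the reference level):

$$\log\frac{P(\text{Next\_TD})}{1 - P(\text{Next\_TD})} = \beta_0 + \beta_1\,\text{yards\_to\_goal} + \sum_{d=2}^{4} \beta_d\, \mathbb{1}[\text{down} = d] + \beta_5\,\text{distance} + \sum_{d=2}^{4} \gamma_d\, \mathbb{1}[\text{down} = d] \cdot \text{distance}$$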

Next Touchdown EP model - Logistic Regression with down and distance interactions

logit_ntdh <- glm(Next_TD ~ 
                    yards_to_goal + 
                    down + 
                    distance +
                    down*distance, 
                  data = pbp_no_OT,
                  family = "binomial")
nullmod <- glm(Next_TD ~1, data = pbp_no_OT,family="binomial")

Below is a model fit to the next score offense touchdown variable using the independent variables yards-to-goal, down, distance, and the interaction between down and distance.

Offense Touchdown Logit Regression model summary

Offense Touchdown Logit Regression model coefficients

Below is how to get the model summary and the confidence intervals on the coefficients, along with a demonstration of the Wald test, first on the down main effects (coefficients 3 through 5) and then on the down-distance interaction terms (coefficients 7 through 9).

summary(logit_ntdh)
## 
## Call:
## glm(formula = Next_TD ~ yards_to_goal + down + distance + down * 
##     distance, family = "binomial", data = pbp_no_OT)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7570  -1.0208  -0.7799   1.2210   3.5356  
## 
## Coefficients:
##                  Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)     1.3921969  0.0249371   55.828   <2e-16 ***
## yards_to_goal  -0.0149114  0.0001272 -117.212   <2e-16 ***
## down2          -0.5409924  0.0274692  -19.695   <2e-16 ***
## down3          -0.9034545  0.0277244  -32.587   <2e-16 ***
## down4          -1.6292365  0.0301042  -54.120   <2e-16 ***
## distance       -0.0741096  0.0024984  -29.663   <2e-16 ***
## down2:distance  0.0274052  0.0027963    9.801   <2e-16 ***
## down3:distance  0.0276541  0.0028794    9.604   <2e-16 ***
## down4:distance  0.0543400  0.0030948   17.558   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 723566  on 534191  degrees of freedom
## Residual deviance: 689829  on 534183  degrees of freedom
## AIC: 689847
## 
## Number of Fisher Scoring iterations: 4
confint(logit_ntdh)
## Waiting for profiling to be done...
##                      2.5 %      97.5 %
## (Intercept)     1.34348738  1.44124120
## yards_to_goal  -0.01516084 -0.01466216
## down2          -0.59496954 -0.48729028
## down3          -0.95793035 -0.84925056
## down4          -1.68836355 -1.57035562
## distance       -0.07902385 -0.06923005
## down2:distance  0.02193760  0.03289893
## down3:distance  0.02202205  0.03330919
## down4:distance  0.04828129  0.06041303
confint.default(logit_ntdh)
##                      2.5 %      97.5 %
## (Intercept)     1.34332116  1.44107261
## yards_to_goal  -0.01516078 -0.01466209
## down2          -0.59483095 -0.48715381
## down3          -0.95779333 -0.84911561
## down4          -1.68823953 -1.57023339
## distance       -0.07900641 -0.06921284
## down2:distance  0.02192462  0.03288574
## down3:distance  0.02201065  0.03329755
## down4:distance  0.04827426  0.06040572
wald.test(b=coef(logit_ntdh),Sigma=vcov(logit_ntdh),Terms=c(3:5))
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 4087.6, df = 3, P(> X2) = 0.0
wald.test(b=coef(logit_ntdh),Sigma=vcov(logit_ntdh),Terms=c(7:9))
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 325.4, df = 3, P(> X2) = 0.0
coefficients_df <- data.frame(exp(cbind(OR = coef(logit_ntdh), confint(logit_ntdh))))
## Waiting for profiling to be done...
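The exp(cbind(...)) call converts the log-odds coefficients and their confidence bounds into odds ratios. As a worked example from the summary above: the yards_to_goal coefficient is -0.0149, and $e^{-0.0149} \approx 0.985$, so each additional yard from the opponent’s end zone multiplies the odds that the next score is an offense TD by about 0.985, roughly a 1.5% drop per yard, holding down and distance fixed.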

Pre-processing for the visual

pbp_no_OT_preds <- cbind(pbp_no_OT,
                         p_logit_td = predict(logit_ntdh,
                                              pbp_no_OT,
                                              type="response"))

pbp_chart_data_viz <- pbp_no_OT_preds %>%
  group_by(yards_to_goal) %>%
  summarize(
    TD = mean(p_logit_td),
    No_Score = mean(1-TD),
    TD_EP = mean(7*TD),
    plays=n()
  )%>% filter(plays>75) %>%
  select(yards_to_goal, TD, No_Score)%>%
  gather(key = Event, value = Probability, -yards_to_goal) %>% ungroup()
## `summarise()` ungrouping output (override with `.groups` argument)

Creating the visual

m <- png::readPNG("Tomahawk_Nation_Full.png")
img <- matrix(rgb(m[,,1],m[,,2],m[,,3], m[,,4] * 0.1), nrow=dim(m)[1]) #0.1 is alpha
rast <- grid::rasterGrob(img, interpolate = T)

ann_text_rz <- data.frame(x = c(10), y = c(0.15), 
                          lab = c("Red\nZone"))

ann_text_sc <- data.frame(x = c(30), y = c(0.15), 
                          lab = c("Score\nOpp."))

ggplot(pbp_chart_data_viz, aes(x=yards_to_goal,y=Probability,color=Event)) +
  geom_smooth(size=0.8,span = 0.10) + theme_bw() + 
  xlab("Yards from Opponent's End Zone") + 
  ylab("Next Score Type Probability") +
  labs(title="Field Position and TD Probability\nLogistic Regression Model",
       subtitle="Data: @CFB_data with #cfbscrapR")+
  scale_x_reverse(limits = c(101,-1),breaks=seq(100,0,-10), expand = c(0.0, 0.0))+
  scale_y_continuous(limits = c(0,1),breaks=seq(0, 1, .1)) + 
  scale_color_manual(values =alpha(c("mediumblue","steelblue2"),.7),
                     name="Next Score Type",
                     labels=c("No Score","TD")) +
  annotation_custom(rast, xmin=-Inf, xmax=Inf, ymin=-Inf, ymax=Inf) +
  annotate("rect", xmin=20, xmax=40, ymin=-Inf, ymax=Inf, alpha=0.1, fill="#FFDB6D",color="#FFDB6D")  +
  geom_text(data = ann_text_sc, aes(x = x, y = y, label = lab), 
            colour = "black", size = 3, family = "serif") +
  annotate("rect", xmin=-1, xmax=20, ymin=-Inf, ymax=Inf, alpha=0.1, fill="red")  +
  geom_text(data = ann_text_rz, aes(x = x, y = y, label = lab), 
            colour = "black", size = 3, family = "serif") +
  geom_hline(yintercept = 0,linetype="solid", color="black",size = 0.2)+
  geom_hline(yintercept = 1,linetype="solid", color="black",size = 0.2)+
  geom_hline(yintercept = 0.25,linetype="dashed", color="gray25",size = 0.2)+
  geom_hline(yintercept = 0.50,linetype="dashed", color="gray25",size = 0.2)+
  geom_hline(yintercept = 0.75,linetype="dashed", color="gray25",size = 0.2)+
  geom_vline(xintercept = 100,linetype="solid", color="gray25",size = 0.2)+
  geom_vline(xintercept = 80,linetype="solid", color="gray25",size = 0.2)+
  geom_vline(xintercept = 60,linetype="solid", color="gray25",size = 0.2)+
  geom_vline(xintercept = 40,linetype="solid", color="gray25",size = 0.2)+
  geom_vline(xintercept = 20,linetype="dashed", color="red",size = 0.2)+
  geom_vline(xintercept = 0,linetype="solid", color="red",size = 0.2)+
  guides(fill=guide_legend(ncol=2)) +
  theme(axis.title.x = element_text(size = 10,margin=margin(0.5,0,0,0,unit=c("mm")), face = "bold", family = "serif"),
        axis.text.x = element_text(size = 8,margin=margin(0,0,0.1,0,unit=c("mm")), family = "serif"),
        axis.title.y = element_text(size = 10,margin=margin(0,0,0.7,0,unit=c("mm")), face = "bold", family = "serif"),
        axis.text.y = element_text(size = 8,margin=margin(0,0.1,0.7,0.3,unit=c("mm")), family = "serif"),
        strip.background = element_rect(fill = "grey85"),
        strip.text = element_text(size=10,colour = "black", face = "bold", family = "serif"),
        legend.title = element_text(size = 8,face = "bold", family = "serif"),
        legend.text = element_text(size = 8, margin=margin(t=0.0,r=0.0,b=0.0,l=0.0,unit=c("mm")), family = "serif"),
        legend.background = element_rect(fill = "grey85"),
        legend.key = element_rect(fill=NA),
        legend.key.width = unit(1.5,"mm"),
        legend.key.size = unit(2.0,"mm"),
        legend.position = c(0.3, 0.83),
        legend.margin=margin(t = 1,b = 1,l=1,r=1,unit=c('mm')),
        legend.direction = "vertical",
        legend.box.background = element_rect(colour = "#500f1b"),
        plot.title = element_text(size = 9, margin=margin(t=0,r=0,b=1,l=0,unit=c("mm")),lineheight=1,face = "bold", family = "serif"),
        plot.subtitle = element_text(size = 7, margin=margin(t=0,r=0,b=1,l=0,unit=c("mm")), lineheight=1, family = "serif"),
        plot.caption = element_text(size = 4, margin=margin(t=0,r=0,b=0,l=0,unit=c("mm")),lineheight=-0.5, family = "serif"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "grey99",color="black"),
        plot.background = element_rect(fill = "grey85",color="black"),
        plot.margin=unit(c(1,2,1,3),"mm"))+
  ggsave("Logit_TD_square.png", height = 77.2, width = 77.2,units=c('mm'),type="cairo")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

You might be asking, “Yeah, that’s great, but you’re still only predicting the probability of one scoring type relative to another. You’d need to do this like 6 times, right?” Well, fair point. How does one do logistic regression on a categorical variable that has more than two classes?

Multinomial Logistic Regression (or Softmax regression)

A multinomial logistic regression model uses the training data to estimate a relationship between the independent (predictor) variables and each class of the target variable.

Multinomial Logistic Regression flashcard by Chris Albon

The primary difference between logistic and multinomial logistic regression is the use of the softmax function, which re-weights the probabilities generated from each of the individual models so that in total they add to one, as seen in Figure 12 below.

Softmax function
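For reference, a standard statement of the function in the figure:

$$P(Y = k \mid x) = \frac{e^{\beta_k^\top x}}{\sum_{j=1}^{K} e^{\beta_j^\top x}}$$

where $K$ is the number of next-score classes and $\beta_k$ is the coefficient vector for class $k$, which guarantees the $K$ class probabilities sum to one.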

More specifically, a multinomial logistic regression model is an extension of the binomial logistic regression model because it is a series of logistic regression models estimated simultaneously with the same reference outcome.
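Equivalently, each component model estimates log-odds relative to the shared reference outcome, in our case “No Score” (a standard formulation, matching the description above):

$$\log\frac{P(Y = k \mid x)}{P(Y = \text{No Score} \mid x)} = \beta_{k0} + \beta_k^\top x, \qquad k = 1, \dots, K - 1$$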

Multinomial Logistic Regression Football Expected Points Model

The college football expected points (EP) model is a multinomial logistic regression model that generates probabilities for the possible types of next-score events within the same half. In our case, we build 6 logistic regression models fit to the next-score types (Offense FG, Offense TD, Offense Safety, Opponent TD, Opponent FG, and Opponent Safety), every class except the one used as the base case (i.e. No Score), since that is accounted for in the intercept, as mentioned in part 1.
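To make that concrete, here is a minimal sketch of fitting such a model with nnet::multinom, which is already loaded above. The class labels in point_values (other than "No_Score") are hypothetical placeholders, and the production cfbscrapR model uses additional situational features and its own label set, so treat this as an illustration of the technique rather than the actual model.

# Minimal sketch, not the production cfbscrapR fit: a 7-class next-score model
# via multinomial logistic regression. "No_Score" was releveled to the factor's
# first level above, so multinom() treats it as the reference outcome.
ep_sketch <- nnet::multinom(
  Next_Score ~ yards_to_goal + down + distance,
  data  = pbp_no_OT,
  maxit = 300  # raise the iteration cap; the default can stop short of convergence
)

# type = "probs" returns a matrix with one probability column per class
score_probs <- predict(ep_sketch, newdata = pbp_no_OT, type = "probs")

# Hypothetical point values keyed by class label; the actual labels in
# Next_Score depend on the upstream data preparation.
point_values <- c(No_Score = 0, TD = 7, FG = 3, Safety = 2,
                  Opp_TD = -7, Opp_FG = -3, Opp_Safety = -2)

# If the labels lined up, EP would be the probability-weighted sum:
# ep <- as.vector(score_probs[, names(point_values)] %*% point_values)

The production model includes more situational context than this three-variable sketch, but the probability-then-weight structure is the same.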

Now that we have a way to calculate the probabilities of scores, we can calculate the expected points. We will talk about this in Part 3.