I am attempting to establish whether analytics rating systems like college football's F+ rankings are predictive of a team's success against the spread (ATS); that is, which teams exceed the expectations of the professionals who set betting lines, thereby earning money for the bettors who pick them. Is a higher F+ rating an indicator that a team is more likely to cover the spread? If we can distill and identify a population of teams that consistently perform in accordance with their F+ rankings, beating lower-F+-ranked teams by more than the spread, then we can create a betting strategy that earns money more than 50% of the time.
In college football, we are limited to a single-season dataset, because teams change so much from one season to the next due to several factors, including graduating players, inconsistent recruiting, and high coaching turnover. In addition, 2020 was such a strange year for college football, with some conferences playing only a few games and many games canceled or rescheduled, that I am better off using the full 2019 dataset to build the model for now.
The overall F+ rating system was created by Football Outsiders and comprises several component rankings, including individual F+ offense and F+ defense ratings, metrics related to a team's efficiency, and a moderately weighted proprietary model called SP+, each of which is in turn built from many variables and weights. An overview of the rating system is available in the Appendix.
Leveraging the existing models to determine a population of teams that we should and should not bet on involves a dichotomous outcome, which suggests that the most appropriate model is a logistic regression Generalized Linear Model (GLM): we cannot count on a normal error distribution for a model like this, and we are estimating the odds/probability of a team covering the spread or not.
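For reference, the model takes the form log(p/(1-p)) = b0 + b1(F+ rating), where p is the probability of covering the spread; equivalently, p = 1/(1 + exp(-(b0 + b1(F+ rating)))). The regressions below estimate b0 and b1.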
For simplicity, I organized the 2019 data into two CSV files. After some data cleaning (mostly aligning the team names between the two datasets), I was able to merge the two sets easily, and the data is ready for analysis.
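As a sketch of what that alignment looked like (the specific name pairs below are hypothetical examples, not the actual mismatches, and ats_raw is a stand-in for the unaligned ATS data), the team names in one file can be recoded to match the other before merging:
# Hypothetical example: harmonize team-name spellings between the two
# files before merging; ats_raw stands in for the unaligned ATS data
ats_raw$team <- dplyr::recode(ats_raw$team,
                              "Miami (FL)" = "Miami",
                              "Ole Miss"   = "Mississippi")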
# Read the two 2019 datasets and merge on team name
ats <- read.csv(file = 'C:/Users/Evan/Desktop/CUNY/606/Data/ATS_Record.csv')
fplus <- read.csv(file = 'C:/Users/Evan/Desktop/CUNY/606/Data/F_Plus_Rankings.csv')
fullset <- merge(ats, fplus, by = "team")
print(head(fullset))
# Quick look at the relationship between F+ and ATS cover percentage
lm(fullset$F_plus ~ fullset$Cover_pct_2019)
plot(jitter(fullset$F_plus, factor = 1.5) ~
       jitter(fullset$Cover_pct_2019, factor = 1.5))
print(cor.test(fullset$F_plus, fullset$Cover_pct_2019))
Right off the bat, we can see some correlation (0.41) between F+ ratings and success against the spread, but not a tremendous amount, and certainly not enough to justify betting real money: the correlation is modest and the confidence interval is wide. Since I am attempting to distill the population down to those teams whose F+ rating(s) suggest they will cover the spread against lower-F+-rated opponents more than 50% of the time (compared against simply flipping a coin to decide which team to pick), 50% is an important figure in my logistic regression. Thus, I've added a column to the dataset that flags teams with a better-than-50% record of covering the spread.
A cover percentage of 0.5 serves as the cutoff for the binary outcome variable fed into the logistic regression.
# GLM - single variable (F+ overall)
# Flag teams that covered the spread in at least half their games
fullset$cover <- ifelse(fullset$Cover_pct_2019 >= .5, 1, 0)
foverall <- fullset$F_plus
cover <- fullset$cover
print(head(fullset, n = 10))
# Logistic regression of the cover flag on the overall F+ rating
lr.out <- glm(cover ~ foverall, data = fullset,
              family = binomial(link = "logit"), maxit = 100)
print(summary(lr.out))
# Plot the fitted logistic curve over the raw points
ggplot(fullset, aes(x = foverall, y = cover)) + geom_point() +
  geom_smooth(formula = 'y ~ x', method = "glm",
              method.args = list(family = "binomial"), se = FALSE)
So the fitted equation for the log-odds is: log(p/(1-p)) = 0.5778 + 0.5952(F+ rating)
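To make that concrete, the log-odds can be converted back into a cover probability with the inverse logit; a minimal sketch using base R's plogis(), for a hypothetical team with an F+ rating of 2.0:
# Predicted probability of covering for a hypothetical F+ rating of 2.0,
# using the fitted coefficients from the summary above
plogis(0.5778 + 0.5952 * 2.0)  # 1/(1 + exp(-x)); roughly 0.85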
This is fine, given the low p-value, a coefficient that is clearly significant, and a model that converged without trouble according to the Fisher scoring iterations, but the reduction in deviance over the null deviance is not terribly impressive. Maybe we can refine and improve this by breaking F+ down into some of its component metrics, including the proprietary SP+ metric, attempting to derive insights from each and isolating the most important variables. We will compare future models' AICs to the current one of 149.17.
# Pull out the component metrics that feed the overall F+ rating
offense <- fullset$OF_plus
defense <- fullset$DF_plus
oefficiency <- fullset$OFEI
defficiency <- fullset$DFEI
sp <- fullset$SP_plus
# Logistic regression on all five components at once
lr.out <- glm(cover ~ offense + defense + oefficiency + defficiency + sp,
              family = binomial(link = "logit"), data = fullset)
print(summary(lr.out))
This didn't add much. The p-values render every component variable non-significant. These variables are likely collinear, so adding more than one of them to the model contributes little. Additionally, the AIC actually went up: the extra complexity is not buying a better fit, even though all of these metrics are baked into the overall F+ ranking. With highly correlated predictors like these, it is reasonable to knock some out to try to improve the model.
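One quick way to probe the collinearity suspicion (a check included here as a sketch, not one run in the analysis above) is the pairwise correlation matrix of the component metrics:
# Pairwise correlations among the F+ component metrics; values near 1 or -1
# would support the collinearity explanation
print(cor(fullset[, c("OF_plus", "DF_plus", "OFEI", "DFEI", "SP_plus")]))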
# Logistic regression on the offensive F+ component alone
lr.out <- glm(cover ~ offense, family = binomial(link = "logit"), data = fullset)
print(summary(lr.out))
Dropping variables based on their p-values improves the model slightly until we're left with only the offensive F+ ranking as significant, although we haven't improved on the null model or meaningfully reduced the AIC relative to the original F+ model; the process is sketched below.
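That backward-elimination process can be illustrated with drop1(), which reports the effect of deleting each term of the full component model in turn (this is an illustration of the procedure, not the exact steps taken; full.out is just a scratch name):
# Refit the full component model, then check which single-term deletion
# lowers the AIC; repeat until no deletion helps
full.out <- glm(cover ~ offense + defense + oefficiency + defficiency + sp,
                family = binomial(link = "logit"), data = fullset)
drop1(full.out, test = "Chisq")
But we might as well check the correlation.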
# Correlation between the offensive F+ component and cover percentage
lm(offense ~ fullset$Cover_pct_2019)
plot(jitter(offense, factor = 1.2) ~ jitter(fullset$Cover_pct_2019, factor = 1.2))
print(cor(offense, fullset$Cover_pct_2019))
In general, these results are not particularly useful (correlation 0.38): the explanatory power of the overall model was only moderate to begin with, and the correlation actually declined. The overall F+ rating remains the strongest indicator that a team will cover the spread.
Still, there is some indication that F+ rankings could serve as an indicator of a team's likelihood of covering the spread. This provides a useful jumping-off point for cutting the data further and distilling the population of teams down to those that behave consistently with their F+ rankings, both those that consistently cover the spread and those that never do; identifying both populations is useful for prospective bettors.
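As a first pass at that distillation (the 0.65 and 0.35 cutoffs below are arbitrary placeholders, not values derived from this analysis), the two populations can be pulled straight from the merged dataset:
# Teams that covered in well over half their games, and teams that rarely
# did; the cutoff values are illustrative only
bet_on <- subset(fullset, Cover_pct_2019 >= 0.65,
                 select = c(team, F_plus, Cover_pct_2019))
bet_away <- subset(fullset, Cover_pct_2019 <= 0.35,
                   select = c(team, F_plus, Cover_pct_2019))
print(bet_on[order(-bet_on$F_plus), ])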