In this article, I’ll go over how the NFL rankings that ultimately get fed into the J.A.R.E.D.G.O.F.F. model for predicting spreads are created. First and foremost I need to thank the Yale University Sports Analytics Group for helping me learn how to do this kind of stuff. This model is based heavily off their NBA Rankings model, which can be found on their website.
In the near future, I will add some code for scraping the 2018 NFL game data from the web. Pro Football Reference contains everything we’ll need, but as the season moves on they change their table formatting from Home/Away columns to Winner/Loser columns. This makes it more challenging for our purposes, since the model is based off of Home and Away variables. For now, I have just been manually entering the information into a .CSV file, which looks like this.
CSV File
Everything in the CSV File should be pretty self-explanatory. PTSA = Points for the Away team, YDSA = Yards for the Away team, TOA = Turnovers for the Away team, apply the same to the Home teams. The PICKSIX column captures all the pick sixes that occurred in a game. For each one, 6 points are removed from that teams total points, so if you noticed that some game scores don’t match what you remember from that game, that would be why. This is done to improve the model performance later, as pick sixes are not replicable.
Now that we have our CSV file completely filled out, we’ll read it into R.
data <- read.csv("Schedule18.csv")
frame <- data
frame$Away.Neutral <- as.character(frame$Away.Neutral)
frame$Home.Neutral <- as.character(frame$Home.Neutral)
frame$PTSA <- as.numeric(as.character(frame[[6]]))
frame$PTSH <- as.numeric(as.character(frame[[10]]))
frame <- frame[!is.na(frame$PTSA),]
The dataframe “frame” now holds the basic game data for the season so far. Let’s have a look at what we’re working with.
head(frame)
## Week Day Date Time Away.Neutral PTSA YDSA TOA
## 1 1 Thu 6-Sep-18 8:20PM Atlanta Falcons 12 299 1
## 2 1 Sun 9-Sep-18 1:00PM San Francisco 49ers 16 327 4
## 3 1 Sun 9-Sep-18 1:00PM Cincinnati Bengals 34 330 2
## 4 1 Sun 9-Sep-18 1:00PM Tampa Bay Buccaneers 48 529 0
## 5 1 Sun 9-Sep-18 1:00PM Jacksonville Jaguars 14 305 1
## 6 1 Sun 9-Sep-18 1:00PM Tennessee Titans 20 336 3
## Home.Neutral PTSH YDSH TOH PICKSIX
## 1 Philadelphia Eagles 18 232 2
## 2 Minnesota Vikings 18 343 1 Minnesota Vikings
## 3 Indianapolis Colts 24 380 2
## 4 New Orleans Saints 40 475 2
## 5 New York Giants 15 324 2 Jacksonville Jaguars
## 6 Miami Dolphins 27 342 2
Next we’re going to create a dataframe called games that just contains the score of each game along with the point differential.
games <- data.frame(frame$Away.Neutral,frame$PTSA,frame$Home.Neutral,frame$PTSH)
games$frame.Away.Neutral <- as.character(games$frame.Away.Neutral)
games$frame.Home.Neutral <- as.character(games$frame.Home.Neutral)
games$pt_dif <- games$frame.PTSH - games$frame.PTSA
head(games)
## frame.Away.Neutral frame.PTSA frame.Home.Neutral frame.PTSH pt_dif
## 1 Atlanta Falcons 12 Philadelphia Eagles 18 6
## 2 San Francisco 49ers 16 Minnesota Vikings 18 2
## 3 Cincinnati Bengals 34 Indianapolis Colts 24 -10
## 4 Tampa Bay Buccaneers 48 New Orleans Saints 40 -8
## 5 Jacksonville Jaguars 14 New York Giants 15 1
## 6 Tennessee Titans 20 Miami Dolphins 27 7
Now we can put this clean data into a final data frame that will contain 4 columns; team, opponent, ptdif and location.
team <- character(length = 2*length(games$pt_dif))
opponent <- character(length = 2*length(games$pt_dif))
location <- character(length = 2*length(games$pt_dif))
ptdif <- vector(mode='numeric',length = 2*length(games$pt_dif))
clean <- data.frame(team,opponent,location,ptdif)
clean$team <- as.character(clean$team)
clean$opponent <- as.character(clean$opponent)
clean$location <- as.character(clean$location)
for(i in 1:length(games$pt_dif)){
clean$team[i] <- games$frame.Home.Neutral[i]
clean$opponent[i] <- games$frame.Away.Neutral[i]
clean$location[i] <- "H"
clean$ptdif[i] <- games$pt_dif[i]
clean$team[i + length(games$pt_dif)] <- games$frame.Away.Neutral[i]
clean$opponent[i + length(games$pt_dif)] <- games$frame.Home.Neutral[i]
clean$location[i + length(games$pt_dif)] <- "A"
clean$ptdif[i + length(games$pt_dif)] <- (-1)*games$pt_dif[i]
}
#Correct for the 3 NFL games this season that were played at neutral locations
#Remember when the Chiefs & Rams game was supposed to be in Mexico?
clean$location[84] <- "N"
clean$location[95] <- "N"
clean$location[109] <- "N"
clean$location[84 + length(games$pt_dif)] <- "N"
clean$location[95 + length(games$pt_dif)] <- "N"
clean$location[109 + length(games$pt_dif)] <- "N"
head(clean)
## team opponent location ptdif
## 1 Philadelphia Eagles Atlanta Falcons H 6
## 2 Minnesota Vikings San Francisco 49ers H 2
## 3 Indianapolis Colts Cincinnati Bengals H -10
## 4 New Orleans Saints Tampa Bay Buccaneers H -8
## 5 New York Giants Jacksonville Jaguars H 1
## 6 Miami Dolphins Tennessee Titans H 7
With this dataframe we can onto the creation of the rankings and model. Here’s a boxplot of the home point differentials for this season.
boxplot(clean$ptdif[1:(.5*length(clean$ptdif))], col = "blue", horizontal = TRUE,
main = "Home Point Differentials for 2018 NFL Season",
xlab = "Home Score - Away Score")
abline( v = 0, col = "red")
You can see on the boxplot that on average the home team will win the game by a few points. Interestingly enough, the Buffalo Bills are responsible for the 2 biggest outliers this season. The 44 point away loss to Baltimore and the 32 point home loss to Chicago.
Our NFL model (btw since you’re probably wondering, J.A.R.E.D.G.O.F.F. stands for Judgmental Algorithm Regressing Empirical Data for Gambling On Football Fixtures) is a linear model that uses team, opponent & location to predict point differential. We fit a coefficient for each team, as well as for home field advantage, such that the formula for a given game’s predicted point differential is R1 - R2 + H where R1 and R2 are the power ranking coefficients for home and away teams respectively. J.A.R.E.D. does this by finding the set of coefficients that minimize the residuals between predicted and actual score over the course of the entire season.
lm.NFLfootball <- lm(ptdif ~ team + opponent + location, data = clean)
#Let's take a look at some of the values
lm.NFLfootball$coefficients[1:7]
## (Intercept) teamAtlanta Falcons teamBaltimore Ravens
## -2.933127 7.529349 17.718959
## teamBuffalo Bills teamCarolina Panthers teamChicago Bears
## 2.112729 10.973096 15.891987
## teamCincinnati Bengals
## 8.660101
Now we can create the rankings using these coefficients. Each value is linearly adjusted so that the mean of all the rankings is 0. This allows us to see how we predict a team would perform against an average NFL team.
rankings <- data.frame("team" = sort(unique(clean$team)),
"bs_coeff" = rep(NA, 32))
scale_factor <- mean(lm.NFLfootball$coefficients[2:32])
rankings$bs_coeff <- c(0, lm.NFLfootball$coefficients[2:32]) - scale_factor
rankings <- rankings[(order(rankings$bs_coeff, decreasing = T)),]
rankings
## team bs_coeff
## 22 New Orleans Saints 9.2871188
## 18 Los Angeles Rams 9.1851058
## 16 Kansas City Chiefs 8.4993957
## 3 Baltimore Ravens 7.1796981
## 17 Los Angeles Chargers 6.3736295
## 29 Seattle Seahawks 6.1817765
## 6 Chicago Bears 5.3527263
## 27 Pittsburgh Steelers 4.4540319
## 21 New England Patriots 4.1135638
## 13 Houston Texans 2.7024486
## 14 Indianapolis Colts 1.4190302
## 10 Denver Broncos 1.3007420
## 9 Dallas Cowboys 1.2358664
## 5 Carolina Panthers 0.4338357
## 31 Tennessee Titans -0.7311400
## 12 Green Bay Packers -0.8130954
## 20 Minnesota Vikings -1.3553974
## 8 Cleveland Browns -1.3783654
## 7 Cincinnati Bengals -1.8791591
## 23 New York Giants -2.1822924
## 26 Philadelphia Eagles -2.2143069
## 2 Atlanta Falcons -3.0099120
## 30 Tampa Bay Buccaneers -3.1657289
## 11 Detroit Lions -3.9086921
## 15 Jacksonville Jaguars -4.4642397
## 32 Washington Redskins -5.4404500
## 19 Miami Dolphins -6.0691152
## 28 San Francisco 49ers -6.3655301
## 24 New York Jets -6.5930722
## 4 Buffalo Bills -8.4265314
## 25 Oakland Raiders -9.7219409
## 1 Arizona Cardinals -10.5392606
Looks like Carolina is about as close to baseline as you’re going to get. So when reading these, we’d expect the Cardinals to lose to the average NFL team by 10.54 points and the Rams to beat the average NFL team by 9.17 points. Here is a stripchart of the rankings.
stripchart(rankings$bs_coeff, pch = 19 , col = "blue",
xlab = "BS Coefficient", main = "BS Coefficients for 2018 Season")
Next we can create another model, based on the first model, that finds the probability that a team will win given their predicted point differential. We must add 2 new columns showing the predicted point differential of a game, as well as whether or not that team actually won. Then we fit a logistic function that takes in 1 parameter (predicted point differential) and outputs the win probability based on this season’s game data. This fucntion has limits such that as x goes to infinity or negative infinity, y will go to 1 or 0.
clean$predscore <- predict(lm.NFLfootball, newdata = clean)
clean$win <- ifelse(clean$ptdif > 0, 1, 0)
glm.pointspread <- glm(win ~ predscore, data = clean, family = "binomial"
,control = list(maxit = 50))
clean$winprob <- predict(glm.pointspread, newdata = clean, type = "response")
#We can plot the win probability vs. predicted score
plot(clean$predscore, clean$winprob, xlab = "Home Predicted Score Differential",
ylab = "Home Win Probability",
main = "Logistic Function for Predicting Games with the G.O.F.F. NFL Model",
pch = 4, col = c("red","green")[(clean$ptdif > 0) + 1])
legend(5,.4,legend=c("Actual Home Win", "Actual Road Win"), fill=c("green","red"))