Fantasy football is a game in which fans of American football select players from current rosters of teams in the National Football League and are awarded points based on their players performances. The basic idea is to predict which (real-life) players will perform well in an upcoming game. For this project, students will be separated into teams and will be tasked with building a statistical model to predict a player’s performance.
The game which will be the target of this project is the between the Kansas City Chiefs and the Oakland Raiders. Why this game? Because this game will be played on Thursday, 19 October which is the next time we will meet in class. Your team will have until 18 October to collect the necessary data and build your model. Teams will present their modeling approaches and their predictions in class on the 19th before game later that evening. On the 20th, we’ll compare each model’s prediction against the game results.
Teams must build a statistical model to predict the number of fantasy points awarded to either quarterback using the simplified scoring rules below:
Teams must then present their modelling approach to the class using an rmarkdown slide format such as either ioslides, reveal.js, slidy, or beamer Teams must include in their presentations: Where the data was obtained, how the data was accessed, their chosen statistical modelling approach, the loss function used by the modelling approach, the factors chosen to include in their model.
Teams will be assessed on the quality of their presentations and the value of their predictions.
In the sections that follow we walk through the initial steps of extracting data for NFL games using the nflscraR package. In getting data we should recognize that we may need to collect different data, depending on how we choose to attack the problem. For example, suppose I choose to predict the number of fantasy points scored by Alex Smith the quarterback for the Kansas City Chiefs. I could collect data on how many fantasy points he’s scored in games over the last several seasons. Or, I could collect data on the number of fantasy points that the Oakland Raiders defense has given up to the quarterbacks in games over the last several seasons. There’s ‘right’ to do this, but in this document I’ve chosen to pull data on how many points the Raiders defense has given up.
As usual, we first need to load some packages. To do this, simply copy and copy the code in the chunk below and paste it into your R console. If you’ve never installed one or more of these packages or if you get an error when running this code you can install the packages using the install.packages() function.
library(ggplot2) # Graphics and data viz
library(plyr) # Data manipulation
library(dplyr) # Data manipulation
library(pheatmap) #
library(RJSONIO) # Interface between R & JSON
library(RCurl) # Interface between R & curl
library(DT) # Interactive tables
library(rprojroot) # Easily get root directories
library(devtools) # Tools for R development
library(readr) # read in data from Excel files
library(data.table)
With these pacakges loaded, we now need to install the nflscrapR package. This package has not yet been published to the CRAN, but can be downloaded and installed from GitHub using the devtools package. The code below downloads and installs the package from the nflscrapR repository which is owned by the GitHub user maksimhorowitz.
devtools::install_github("maksimhorowitz/nflscrapR")
Once installed, we can load the package as normal usign the library() function.
library(nflscrapR)
With the nflscrapR package installed and loaded, we’re ready to start collecting data for serveral NFL games across many seasons. First, we want to know what functions are available for us to use with this package to find out run the function below, this should open up a window listing all of the functions in the package that you can use along with a short description of what they do.
help(package = 'nflscrapR')
As you see there are several functions in the package and some my be of use to your in completing this project. NOTE: some of the functions can take a long time to run. For example, the season_games() function scrapes all of the data from every game in a specified season. We only want a small amount of data from a small number of games.
I’d like to use the simple_boxscore() function to extract data from only the games that a specific team was involved in. The simple_boxscore() function takes two arguments: GameID and home. The first argument GameID is a 10-digit character string or number that matches up with the game ID assigned by the NFL to uniquely identify each game. The second argument home is a logical argument indicating if data is desired for the home team or not. Of course we may not know the game ID’s for the games involving the Raiders but we can use the extracting_gameids() function to get the game ID’s for the 2015, 2016, and 2017 seasons.
id2015 <- extracting_gameids(2015)
id2016 <- extracting_gameids(2016)
id2017 <- extracting_gameids(2017)
Observing the game ID’s for the first 10 games of the 2015 season we see that the identifier includes the game date and a 2-digit number for the game playedd on that data.
id2015[1:10]
## ids
## 1: 2015091000
## 2: 2015091300
## 3: 2015091306
## 4: 2015091305
## 5: 2015091304
## 6: 2015091303
## 7: 2015091302
## 8: 2015091301
## 9: 2015091307
## 10: 2015091308
Now that we have the game ID’s we need to find out which ID’s reference games that involved either the Oakland Raiders or the Kansas City Chiefs. To do this, we’ll first declare a function to extract the teams playing in every game. This function take a vector of game ID’s and return a data.frame with three columns for the game ID, the abbreviation of the home team and the abbreviation of the away team.
raw.games <- function(gameids) {
game_urls <- sapply(gameids, proper_jsonurl_formatting)
games_unf <-
sapply(game_urls, FUN = function(x) {
cbind(sapply(RJSONIO::fromJSON(RCurl::getURL(x))[[1]]$home[2]$abbr, c),
sapply(RJSONIO::fromJSON(RCurl::getURL(x))[[1]]$away[2]$abbr, c))
})
raw.teams <- data.frame(colnames(games_unf),
t(games_unf),
row.names = NULL)
colnames(raw.teams) <- c('Game ID', 'Home Team','Away Team')
return(raw.teams)
}
In the code chunk below, we run the raw.games() function that we just defined on the game ID’s from the 2015, 2016, and 2017 seasons. Then we use the join_all() function from the plyr package to join the three data.frames together. Finally, we use the write.csv() function to save this data to a CSV file. The reason for saving the file is because it can take a while to run the function for each of the game id’s so it’s a good idea to save it for future use. Don’t worry about running this code, now I’ve already done it and save the results to a file in this repository.
raw_teams5 <- raw.games(gameids = id2015)
raw_teams6 <- raw.games(gameids = id2016)
raw_teams7 <- raw.games(gameids = id2017[-c(64,78)])
raw_teams <- plyr::join_all(dfs = list(raw_teams5,
raw_teams6,
raw_teams7),
type = 'full')
root <- find_root(is_git_root)
write.csv(raw_teams,
paste0(root,'/code/ffootball/objs/raw_teams.csv'),
row.names = F)
We can speed things up by loading it directly by accessing the data.frame of results that have already been extracted and saved within this repo. The code in the chunk below reads the data from the CSV file saved in this repository and stores the data under the object named Games. The datatable function in the DT package is then used to generate an interactive table displaying the data.
root <- find_root(is_git_root)
CSV <- paste0(root,'/code/ffootball/objs/raw_teams.csv')
Games <- readr::read_csv(CSV)
DT::datatable(Games)
Now, we want find which games involved the Oakland Raiders. The code in the following chunk identifies which values in Games are the same as the character string ‘OAK’ and then determines which rows in Games include ‘OAK’ in either the Home Team or Away Team columns. This results in a subset of the Games data that only includes the games in which the Oakland Raiders played.
oak <- 'OAK' == Games
keep_games <- Games[rowSums(oak) > 0,]
With this reduced data we can now use the simple_boxscore() function to extract the passing data for each of these games. First let’s take a look at what’s inside of a ‘simple boxscore’. We’ll choose the first game in the keep_games data set. The game ID is in the first column so we want to use the keep_games[1,1] value. Running this game ID return the following list of results.
simple_boxscore(GameID = keep_games[1,1], home = T)
## $HomePassing
## stat date Game.ID home.team.name name att cmp
## 00-0030419 passing 2015-09-13 2015091310 OAK M.McGloin 31 23
## 00-0031280 passing 2015-09-13 2015091310 OAK D.Carr 12 7
## yds tds ints twopta twoptm playerID
## 00-0030419 142 2 1 1 0 00-0030419
## 00-0031280 61 0 0 0 0 00-0031280
##
## $HomeRushing
## stat date Game.ID home.team.name name att yds tds
## 00-0030513 rush 2015-09-13 2015091310 OAK L.Murray 11 44 0
## 00-0031280 rush 2015-09-13 2015091310 OAK D.Carr 1 8 0
## 00-0029492 rush 2015-09-13 2015091310 OAK J.Olawale 1 6 0
## 00-0028063 rush 2015-09-13 2015091310 OAK T.Jones 3 5 0
## lng lngtd twopta twoptm playerID
## 00-0030513 12 0 0 0 00-0030513
## 00-0031280 8 0 0 0 00-0031280
## 00-0029492 6 0 0 0 00-0029492
## 00-0028063 4 0 0 0 00-0028063
##
## $HomeReceiving
## stat date Game.ID home.team.name name rec
## 00-0030513 receiving 2015-09-13 2015091310 OAK L.Murray 7
## 00-0031544 receiving 2015-09-13 2015091310 OAK A.Cooper 5
## 00-0026986 receiving 2015-09-13 2015091310 OAK M.Crabtree 5
## 00-0026393 receiving 2015-09-13 2015091310 OAK M.Reece 3
## 00-0029492 receiving 2015-09-13 2015091310 OAK J.Olawale 3
## 00-0031166 receiving 2015-09-13 2015091310 OAK S.Roberts 3
## 00-0028063 receiving 2015-09-13 2015091310 OAK T.Jones 1
## 00-0029129 receiving 2015-09-13 2015091310 OAK R.Streater 1
## 00-0030514 receiving 2015-09-13 2015091310 OAK M.Rivera 1
## 00-0032256 receiving 2015-09-13 2015091310 OAK C.Walford 1
## yds tds lng lngtd twopta twoptm playerID
## 00-0030513 36 0 11 0 0 0 00-0030513
## 00-0031544 47 0 24 0 0 0 00-0031544
## 00-0026986 37 0 11 0 0 0 00-0026986
## 00-0026393 26 2 11 11 0 0 00-0026393
## 00-0029492 19 0 11 0 0 0 00-0029492
## 00-0031166 12 0 7 0 0 0 00-0031166
## 00-0028063 13 0 13 0 0 0 00-0028063
## 00-0029129 8 0 8 0 1 0 00-0029129
## 00-0030514 4 0 4 0 0 0 00-0030514
## 00-0032256 1 0 1 0 0 0 00-0032256
##
## $HomeDef
## stat date Game.ID home.team.name name tkl
## 00-0027759 defense 2015-09-13 2015091310 OAK L.Asante 7
## 00-0028180 defense 2015-09-13 2015091310 OAK M.Smith 6
## 00-0018227 defense 2015-09-13 2015091310 OAK C.Woodson 5
## 00-0031269 defense 2015-09-13 2015091310 OAK T.Carrie 5
## 00-0023509 defense 2015-09-13 2015091310 OAK J.Tuck 3
## 00-0027877 defense 2015-09-13 2015091310 OAK D.Williams 3
## 00-0031040 defense 2015-09-13 2015091310 OAK K.Mack 3
## 00-0030572 defense 2015-09-13 2015091310 OAK D.Hayden 3
## 00-0030050 defense 2015-09-13 2015091310 OAK R.Armstrong 2
## 00-0027945 defense 2015-09-13 2015091310 OAK A.Smith 1
## 00-0026177 defense 2015-09-13 2015091310 OAK C.Lofton 1
## 00-0027652 defense 2015-09-13 2015091310 OAK N.Allen 0
## ast sk int ffum playerID
## 00-0027759 1 0 0 0 00-0027759
## 00-0028180 1 0 0 0 00-0028180
## 00-0018227 2 0 0 0 00-0018227
## 00-0031269 1 0 0 0 00-0031269
## 00-0023509 2 0 0 0 00-0023509
## 00-0027877 1 0 0 0 00-0027877
## 00-0031040 1 0 0 0 00-0031040
## 00-0030572 1 0 0 0 00-0030572
## 00-0030050 1 0 0 0 00-0030050
## 00-0027945 1 0 0 0 00-0027945
## 00-0026177 1 0 0 0 00-0026177
## 00-0027652 1 0 0 0 00-0027652
##
## $HomeKicking
## stat date Game.ID home.team.name name fgm
## 00-0019646 kicking 2015-09-13 2015091310 OAK S.Janikowski 0
## fga fgyds totpfg xpmade xpmissed xpa xpb xptot playerID
## 00-0019646 0 0 0 1 0 1 0 1 00-0019646
##
## $HomeFumbles
## stat date Game.ID home.team.name name tot rcv
## 00-0031280 fumbles 2015-09-13 2015091310 OAK D.Carr 1 0
## 00-0030419 fumbles 2015-09-13 2015091310 OAK M.McGloin 1 0
## trcv yds lost playerID
## 00-0031280 0 0 0 00-0031280
## 00-0030419 0 0 1 00-0030419
##
## $HomeKR
## stat date Game.ID home.team.name name ret avg
## 00-0028063 kickreturn 2015-09-13 2015091310 OAK T.Jones 2 30
## 00-0028097 kickreturn 2015-09-13 2015091310 OAK L.Smith 1 8
## tds lng lngtd playerID
## 00-0028063 0 33 0 00-0028063
## 00-0028097 0 8 0 00-0028097
##
## $HomePR
## stat date Game.ID home.team.name name ret
## 00-0031269 puntreturn 2015-09-13 2015091310 OAK T.Carrie 3
## avg tds lng lngtd playerID
## 00-0031269 9 0 11 0 00-0031269
We can subset this list by noting the names of the objects stored inside. We see that the first object in this list is ‘HomePassing’ which is passing data about for the home team.
names(simple_boxscore(GameID = keep_games[1,1], home = T))
## [1] "HomePassing" "HomeRushing" "HomeReceiving" "HomeDef"
## [5] "HomeKicking" "HomeFumbles" "HomeKR" "HomePR"
Finally, let’s extract passing data from each game in keep_games. The function below does this by taking as first argument the data.frame of games and as second argument the abbreviation that we are interested in. This function then returns a data.frame with a new column on the fantasy passing points scored by the opposing team’s quarterbacks in each game. Note that in some of the games the opposing team may have played multiple quarterbacks. This function takes this into account and sums up all of the passing stats for anyone playing quarterback in a given game.
pass_stats <- function(df, team) {
df$qb_pts <- rep(0, nrow(df))
for(i in 1:nrow(df)) {
if(df[i,2] == team) {
sbs <- simple_boxscore(df[i,1], home = F)[[1]]
yds <- Reduce('+', sbs$yds)
tds <- Reduce('+', sbs$tds)
ints <- Reduce('+', sbs$ints)
score <- floor(yds/10) + tds * 6 - ints * 4
df$qb_pts[i] <- score
}
if(df[i,3] == team) {
sbs <- simple_boxscore(df[i,1], home = T)[[1]]
yds <- Reduce('+', sbs$yds)
tds <- Reduce('+', sbs$tds)
ints <- Reduce('+', sbs$ints)
score <- floor(yds/30) + tds * 6 - ints * 2
df$qb_pts[i] <- score
}
}
return(df)
}
Finally, lets run this function and look at the scores given up by the Raiders in these games.
scores <- pass_stats(keep_games, 'OAK')
DT::datatable(scores,
rownames = FALSE,
options = list(columnDefs = list(list(className = 'dt-center',
targets = '_all'))))
Add a column to the data set to indicate if the game is a home game for the Raiders
scores$`Home Game` <- (scores$`Home Team`=='OAK') * 1
Add a column to the data set to indicate if the game is a division game for the Raiders
div_teams <- c("SD","KC","DEN")
div <- sapply(X = 1:nrow(scores),
FUN = function(x) sum(scores[x,]%in%div_teams))
scores$`Div Game` <- div
Convert the Home Game and Div Game columns from numeric values to factors
scores$`Div Game` <- as.factor(scores$`Div Game`)
scores$`Home Game` <- as.factor(scores$`Home Game`)
library(gridExtra)
g1 <- ggplot(scores, aes(y = qb_pts, x = `Home Game`)) + geom_point()
g2 <- ggplot(scores, aes(y = qb_pts, x = `Div Game`)) + geom_point()
marrangeGrob(grobs = list(g1,g2), nrow = 1, ncol = 2, top = '')
model1 <- lm(qb_pts ~ `Home Game` + `Div Game`, data = scores) # additive effects
model2 <- lm(qb_pts ~ `Home Game` * `Div Game`, data = scores) # interaction effects
summary(model1)
##
## Call:
## lm(formula = qb_pts ~ `Home Game` + `Div Game`, data = scores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.688 -7.312 0.250 5.688 21.312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.312 2.596 6.670 1.18e-07 ***
## `Home Game`1 14.375 3.258 4.412 9.80e-05 ***
## `Div Game`1 -1.563 3.411 -0.458 0.65
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.899 on 34 degrees of freedom
## Multiple R-squared: 0.3686, Adjusted R-squared: 0.3315
## F-statistic: 9.926 on 2 and 34 DF, p-value: 0.0004026
summary(model2)
##
## Call:
## lm(formula = qb_pts ~ `Home Game` * `Div Game`, data = scores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.0833 -7.9167 0.0833 4.6667 21.9167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.917 2.890 6.200 5.36e-07 ***
## `Home Game`1 13.167 4.087 3.222 0.00286 **
## `Div Game`1 -3.202 4.761 -0.673 0.50584
## `Home Game`1:`Div Game`1 3.452 6.908 0.500 0.62053
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.01 on 33 degrees of freedom
## Multiple R-squared: 0.3734, Adjusted R-squared: 0.3164
## F-statistic: 6.554 on 3 and 33 DF, p-value: 0.00134
newgame <- data.frame(as.factor(1), as.factor(1))
colnames(newgame) <- c('Home Game', 'Div Game')
predict(model1, newdata = newgame)
## 1
## 30.125
predict(model2, newdata = newgame)
## 1
## 31.33333