Daily fantasy football is like traditional fantasy football, only you set a new lineup for whatever slate of games you want. Afternoon games, primetime only, week three: all subsets of games everyone is familiar with. In that sense, the NFL doesn't really have a daily slate of games to set a lineup for, but it's still part of daily fantasy sports (DFS), so we'll run with it. The world makes less and less sense every day. If you are reading this, the first few sentences were likely a complete waste of your time. I should have warned you beforehand. Honestly, a good majority of the DFS players I know can hardly read or formulate complete sentences, but that's ok. Life is full of majestic mysteries.
Fantasy football is mostly about numbers, which isn't really a big secret. Most everyone goes through some sort of routine before drafting a lineup. This write-up showcases a method I've used to gather data from the web. The data comes from two sources, so I'll also be highlighting some methods for joining data. By the end of this, readers should be the Jerry Rice of copying and pasting R code. High five!
To start, you will need R on your computer or none of this code will work. You can definitely copy and paste this code into Notepad or Excel, maybe Word if you splurge and pay for Microsoft Office. You can type this code literally anywhere on your computer and try to get it to run, but the chances of it doing what it's supposed to do are very slim. I'd give it roughly a 0% chance. Trump did get elected, so who knows; maybe you roll the dice, get lucky by pasting this code into Gmail, and something truly magical unravels before your eyes (doubtful).
I recommend downloading R, then immediately downloading RStudio. I strongly suggest googling what those are and how to install them, because I'm not going to cover that here. This is about DFS (NFL). Those are really the only requirements for moving forward. If you don't want to take the half hour to set up R and RStudio, you should probably not read any further unless you're bored or want to look smart by doing something other than looking at Facebook.
One of my favorite things about the R language is that you can pretty much get it to do anything you want. You can retrieve data, clean data, join data to more data, and much more. It's fairly easy to learn, too, which is nice since most folks don't have a ton of time outside of work to learn how to type a bunch of jumbled mess. I've done a lot of the hard stuff so you don't have to work super hard outside of reading this marathon of a novel.
First, head over to http://www.footballdb.com/fantasy-football/index.html. You'll find a weekly breakdown of some basic NFL player stats. There are some options here: position, year, and (in particular) week. This tells us that we can filter to what we need, but it doesn't show us a game-by-game log for each player for an entire season. Now, visit http://rotoguru1.com/cgi-bin/fyday.pl?week=1&game=dk&scsv=1. Here, there is a handy csv-style (semicolon-delimited, hence the scsv in the URL) dataset for each week. Hopefully, you're already guessing where this is going. We're going to take the basic stats and combine them with our DFS salaries, positions, and points. This will give us a lot of what we need to analyze historical data from 2016. We can do matchup-based analysis, player-based analysis, or whatever else we want with this data to help us out. I'm going to limit this write-up to only quarterbacks, so instead of taking 9 hours to get through, maybe it'll only take 3 or 8.
R is open source; anyone can contribute. With that, there are 'packages' that have been written by folks who are pretty slick with the ol' keyboard. This lets normal R users load up the packages and simply execute pre-defined functions as opposed to writing their own. The R community is quite large and mature, so there are packages for pretty much everything: loading Excel files, predictive analytics, and really anything else you can possibly think of. We will be leveraging some data scraping packages here to kick things off. Scraping just means we are having R visit a website and gather data from a table on that website. Usually this happens through parsing HTML, or sometimes more complicated frameworks depending on the website.
This might be overkill, but it does save some time. First we are going to check to see if the packages are installed, and if they aren't, we'll install them.
# Packages we need: rvest for the web scraping, reshape for splitting columns later on
list.of.packages <- c("rvest", "reshape")
# Figure out which of those aren't installed yet, and install them if needed
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
library(rvest)
## Loading required package: xml2
library(reshape)
## Warning: package 'reshape' was built under R version 3.3.3
What follows is a little hairy. We're going to set up a loop that starts with week 1 and goes all the way through week 16. We're going to start with the footballdb site, where each week lives on its own page. The idea is to take each week's data and then combine all 16 weeks into one table. I'm calling i <- 1, which is just saying i = 1, and while i <= 16, we'll run the code below it. At the bottom of the code chunk, i is set to i + 1, which means that after each pass through the code we'll be visiting the next week. For example, the first time through, i = 1 for grabbing week 1's data. When it's finished, i becomes i + 1, which equals 2, and now we go after week 2's data.
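If that counter pattern is new to you, here is the bare-bones version of the loop with nothing football related in it yet:
# Bare-bones sketch: i starts at 1 and gets bumped each pass until it passes 16
i <- 1
while (i <= 16){
  # the week i scraping work will go here
  i <- i + 1
}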
Since we're focused on QB data, when you filter footballdb to week 1 for QBs, the URL of the page clearly reflects the filters. The URL becomes 'http://www.footballdb.com/fantasy-football/index.html?pos=QB&yr=2016&wk=1&rules=1'. You can see pos=QB, yr=2016, and wk=1. To make our loop effective, we break the URL into pieces so that we can parameterize the week with our i variable. Take a glance at the code below; there are two URL variables. We'll call the paste function to paste the two URL slices around the i variable.
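To make the URL surgery concrete, here's what that paste call spits out for week 1:
url.link <- "http://www.footballdb.com/fantasy-football/index.html?pos=QB&yr=2016&wk="
url.link2 <- "&rules=1"
paste(url.link, 1, url.link2, sep = "")
## [1] "http://www.footballdb.com/fantasy-football/index.html?pos=QB&yr=2016&wk=1&rules=1"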
Rvest is a really powerful web scraping package, and to be honest, I am not the person to explain it. However, we will start by giving our table a name, qbstats. We will run a pipeline on qbstats to accomplish our task of retrieving data from the web page and sticking it in qbstats. A pipeline is just an ordered set of tasks, FYI LOL OMG TTYL. The first task is to visit the URL we give it, then provide the specific location on the web page where the table lives. You find that location by selecting part of the table, right clicking, and going to Inspect (if using Chrome). When you hit Inspect, a pane will pop open along the right and display a lot of black magic and wizardry that is beyond my comprehension. As you navigate the HTML, hover over div class="table-responsive", which should highlight the table on the web page. If you right click on that div, you can copy the XPath. This is where the html_nodes argument comes from: xpath='//*[@id="leftcol"]/div[3]/table'. Next up in the pipeline is declaring this an html_table. Again, this is all a part of rvest and the wonderful work those angels did building this package. Finally, we will paste qbstats with our variable i to let us know which week we're on. We'll remove the first two rows and name our columns since the scraper didn't grab the column headers from the web page. Lastly, we'll make sure that each week has its own distinct table name with properly formatted data. You'll also see the i + 1 to keep the loop moving.
i <- 1
while (i <= 16){
  # Build the footballdb URL for week i (QBs, 2016)
  url.link <- "http://www.footballdb.com/fantasy-football/index.html?pos=QB&yr=2016&wk="
  url.link2 <- "&rules=1"
  url <- paste(url.link,i,url.link2,sep ="")
  # Visit the page and pull the stats table out by its XPath
  qbstats <- url %>%
    read_html() %>%
    html_nodes(xpath='//*[@id="leftcol"]/div[3]/table') %>%
    html_table()
  paste0("qbstats",i)   # just a label like "qbstats1"; assign() below does the real naming
  # html_table() hands back a list, so grab the table itself
  qbstats <- (qbstats[[1]])
  # Drop the two header rows the scraper picked up, then name the columns ourselves
  qbstats <- qbstats[-1:-2,]
  names(qbstats) <- c("player","opp","fp","passattempts","passcompl","passyards","passtd","passint","pass2pt",
                      "rushattempts","rushyards","rushTD","rush2pt","receptions","recyards","rectd","rec2pt",
                      "fumbl","fumbltd")
  # Tag every row with the week it came from
  qbstats$week <- rep(i,nrow(qbstats))
  # Save this week's table as qbstats1, qbstats2, ..., qbstats16
  assign(paste("qbstats",i,sep=""), qbstats)
  i <- i + 1
}
qbstats <- NULL
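If you want a quick sanity check before moving on, peek at one of the weekly tables (assuming the loop above finished without complaining):
head(qbstats1)    # first few QB rows from week 1
nrow(qbstats16)   # how many QB rows week 16 produced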
We've scraped some basic stats; time to take advantage of some free DFS data. Rotoguru is a wonderful and delightful resource for DFS data. As mentioned before, guru tees this data up nicely. Just like before, we'll be looping through each week's web page and using some HTML tags to tell R where to find the data.
i <- 1
while (i <= 16){
  # Build the rotoguru URL for week i (DraftKings scoring, semicolon-delimited output)
  url1 <- "http://rotoguru1.com/cgi-bin/fyday.pl?week="
  url2 <- "&game=dk&scsv=1"
  url <- paste(url1,i,url2,sep="")
  webpage <- read_html(url)
  # The data sits inside a <pre> tag: grab the text, split it into lines, and drop the blanks
  roto_guru_html <- webpage %>%
    html_nodes("pre") %>%
    html_text() %>%
    strsplit(split = "\n") %>%
    unlist() %>%
    .[. != ""]
  roto_guru_html <- as.data.frame(roto_guru_html)
  # Each line is semicolon-delimited, so split it out into proper columns
  roto_guru_html = transform(roto_guru_html, roto_guru_html = colsplit(roto_guru_html, split = "\\;",
                             names = c('week', 'year','GID','player','Pos',
                                       'Team','H/A','Oppt','DKP','DKSalary')))
  # The first line is the header row, so drop it
  roto_guru_html <- roto_guru_html[-1,]
  paste0("roto_guru_html",i)   # just a label; assign() below does the real naming
  # Save this week's table as week1DFS, week2DFS, ..., week16DFS
  assign(paste("week",i,"DFS",sep=""), roto_guru_html)
  i <- i + 1
}
roto_guru_html <- NULL
Very rarely, after spending hours figuring out clever ways to retrieve data, will you not have to then spend forever cleaning it up. Luckily, most of this data is acceptable. One of our objectives was to merge basic stats with DFS logs. Joining separate tables in R is pretty straightforward; we can append rows or join columns together. Both the footballdb data and the guru data contain week and player name, so we will match those columns between sources to combine our data. Before we combine the two sources, though, we need to combine the 16 separate tables from each source: all of the qbstats tables get stacked together, and the same goes for week1DFS through week16DFS. 'rbind' takes care of this nicely for us. After running this snippet of code, you should have one big DFS table and one big basic stats table.
qbstats <- rbind(qbstats1,qbstats2,qbstats3,qbstats4,qbstats5,qbstats6,qbstats7,qbstats8,qbstats9,
qbstats10,qbstats11,qbstats12,qbstats13,qbstats14,qbstats15,qbstats16)
dfsLogs <- rbind(week1DFS,week2DFS,week3DFS,week4DFS,week5DFS,week6DFS,week7DFS,week8DFS,week9DFS,week10DFS,week11DFS,
week12DFS,week13DFS,week14DFS,week15DFS,week16DFS)
Now for the fun part. We must clean up a few things. First, the footballdb player column is a jumbled mess (ex: 'Andrew Luck, IndA. Luck, Ind'). Another topic for a separate google / copy / paste session is regular expressions. Regular expressions are a handy way to parse through text data and do whatever you want to it. Look at the player name in dfsLogs (ex: 'Luck, Andrew'). I'm not going to detail what regular expressions do because I don't think anyone truly understands them. We just want our columns to show 'Andrew Luck'. Anywho, let's clean up the player name columns for both tables.
# footballdb: drop everything from the first comma on ("Andrew Luck, IndA. Luck, Ind" becomes "Andrew Luck")
qbstats$player <- gsub(",.*$", "", qbstats$player)
# rotoguru: flip "Last, First" around ("Luck, Andrew" becomes "Andrew Luck")
dfsLogs$player <- sub("(\\w+),\\s(\\w+)","\\2 \\1", dfsLogs$player)
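If those patterns look like hieroglyphics, here's what the two substitutions do to the Andrew Luck examples from above:
gsub(",.*$", "", "Andrew Luck, IndA. Luck, Ind")
## [1] "Andrew Luck"
sub("(\\w+),\\s(\\w+)", "\\2 \\1", "Luck, Andrew")
## [1] "Andrew Luck"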
This next section is going to conclude our data acquisition journey for now. You may have noticed that our dfsLogs table contains more than just QBs, so first we will subset to QBs. Then we'll wrap it up by joining our tables together, matching on player and week.
# Keep only the QB rows and the columns we care about, then join on player + week
qbDfsLogs <- subset(dfsLogs, Pos == "QB", select = c("week","player","Team","Oppt","H.A","DKSalary","DKP"))
qbData <- merge(qbstats,qbDfsLogs, by = c("player","week"))
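Fair warning: columns scraped this way tend to come through as text, so before doing any math on points or salaries you'll probably want to convert the numeric columns. A rough sketch is below; the column names assume the steps above ran exactly as written, and the gsub is only there in case any stray dollar signs or commas snuck into the salary field:
# Convert the obvious numeric columns from text to numbers
numcols <- c("fp","passattempts","passcompl","passyards","passtd","passint",
             "rushattempts","rushyards","DKSalary","DKP")
qbData[numcols] <- lapply(qbData[numcols], function(x) as.numeric(gsub("[$,]", "", as.character(x))))
head(qbData)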
Hopefully this tutorial shed some light on R and how we can use this beautiful language to scrape and munge data. I'll be working on a second part of this write-up, which will review some analysis we can perform on the data and possibly get into machine learning.