There are multiple sources of data at the play, player, and team level that may be useful for future analysis. The goal of this analysis is to document these sources and highlight their important and interesting variables to inform that future work. Everything reviewed here is public, but some very interesting data is not, such as the tracking data the NFL makes available to teams. From a quote in the Washington Post, “teams get the raw, individual, player-level tracking data for every player on and off the field,” said Michael Lopez, director of data and analytics for the NFL. “Their speed, acceleration, direction and orientation are all captured. That is much more refined than anything that is available to the public. A lot of these things are what scouts and front office personnel have always looked at, but now you are able to answer questions with data rather than just watching film.” This tracking data seems to be developed in partnership with AWS.
The public data sets included in this review are:
There are some additional datasets that aren’t discussed further, but that I downloaded so they’re available in the future. These are:
nflscrapR is an R package and GitHub repository that will likely be one of my main sources of information. The repository has data through the 2019 season, but there may be a delay of at least a couple of days between a game and its data being uploaded. That may turn out not to be the case, but given the risk I initially plan to do the scraping myself for more consistent access to the data. The page also has some analysis examples, which could be useful at some point.
The package itself is built to scrape data from the NFL API. The data includes play-by-play data and some individual player data. The play-by-play data is the most interesting part, but if the package includes the same or better player data than other sources, it would be convenient to use it for that as well. The code below is an introduction to what the package can do.
I immediately ran into the problem that this no longer works. The NFL has changed its API and this data can no longer be scraped. Very disappointing. It would have been nice if the package advertised this more clearly. I can still access the data repository for data through 2019, but this won’t work for updated data throughout the 2020 season. There is more discussion of this issue here, here, and here.
library(nflscrapR)
library(tidyverse)
library(nflfastR)
# scrape_game_ids() can generate a list of game IDs by season, week, and team.
# Left commented out because scraping no longer works after the NFL API change.
# week_2_games <- scrape_game_ids(2018, weeks = 2)
nflfastR is a new package that scrapes data from the NFL website, and it is expected to work during the 2020 season. It produces play-by-play data, so it could be a new main source. The package seems largely intended to duplicate the functionality of nflscrapR, but faster. There are two vignettes that provide some information about how to use the package. It’s also important to note that the maintainers keep repositories of old data, so it’s best to use those rather than repeatedly rescraping. Still, I test the scraping here (without running the chunk) to understand it in case new data is not uploaded quickly enough after games.
rams_2019 <- fast_scraper_schedules(2019) %>%
  filter(away_team == 'LA' | home_team == 'LA') %>%
  select(game_id) # this will find the IDs of all Rams games
rams_week1 <- fast_scraper('2019_01_LA_CAR') # scrape play-by-play (pbp) data using a game ID
rams_week1_cleaned <- fast_scraper('2019_01_LA_CAR') %>%
  clean_pbp() # clean the pbp data a bit
This shows how we can scrape data, but for now we will use the repository to avoid putting unnecessary load on the source. Below, I load the play-by-play data for the first Rams game. It loads 190 rows of 340 different variables. Given the large number of variables, it will be helpful to highlight those that may be useful. To allow an exploration of each one, I use an interactive table below.
library(reactable)
rams_2020 <- fast_scraper_schedules(2020) %>%
  filter(away_team == 'LA' | home_team == 'LA') %>%
  select(game_id)
# This reads in 2020 data. Can choose a different year or use a function for multiple ones.
pbp <- readRDS(url("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2020.rds")) %>%
  filter(game_id == '2020_01_DAL_LA')
vars <- data.frame(colnames(pbp))
reactable(vars)
This is a lot to look through, but here are some highlights about the structure of the data and some of its interesting elements. This CRAN documentation gives good definitions for many of the variables.
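The comment in the loading chunk mentions choosing a different year or using a function for several seasons. Here is a minimal sketch of that idea. It assumes every season file follows the same naming pattern as the 2020 URL used above, and that game IDs follow the season_week_away_home pattern seen in '2019_01_LA_CAR' and '2020_01_DAL_LA'. The helper names pbp_url(), make_game_id(), and load_pbp() are made up for illustration and are not part of nflfastR.

```r
# Hypothetical helpers for illustration only (not part of nflfastR).

# Assumption: each season file follows the 2020 URL pattern used in this post.
pbp_url <- function(season) {
  sprintf(
    "https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_%d.rds",
    as.integer(season)
  )
}

# Assumption: game IDs look like season_week_away_home, e.g. "2020_01_DAL_LA".
make_game_id <- function(season, week, away, home) {
  sprintf("%d_%02d_%s_%s", as.integer(season), as.integer(week), away, home)
}

# Read one file per season and stack them. Needs internet access and dplyr.
# bind_rows() is used because seasons may not share exactly the same columns.
load_pbp <- function(seasons) {
  per_season <- lapply(seasons, function(s) readRDS(url(pbp_url(s))))
  dplyr::bind_rows(per_season)
}

make_game_id(2020, 1, "DAL", "LA")
```

With these in place, something like load_pbp(2019:2020) followed by a filter on game_id == make_game_id(2020, 1, "DAL", "LA") should reproduce the single-game load above, provided the files exist at those addresses.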
Lee Sharpe’s nfldata repository is hosted on GitHub and contains a lot of useful information about teams beyond player stats and play-by-play. For instance, it has data about draft picks and their estimated value from several sources; game characteristics such as the type of stadium (roof, surface type, temp, wind), spread line, and referee; team colors and logos; rosters; and trades. At least some of the game characteristics data is also available through nflfastR. All of this data can be joined easily to nflfastR data using the ‘name’ field. Below is a simple use of the draft pick data. It goes back to 2000, and can be filtered by things such as team, position, and pick.
rams_10 <- read_csv("https://raw.githubusercontent.com/leesharpe/nfldata/master/data/draft_picks.csv") %>%
  filter(team == 'LA') %>% # Rams are referred to as STL in earlier seasons
  head(n = 10)
rams_10
## # A tibble: 10 x 10
## season team round pick playerid full_name name side category position
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2016 LA 1 1 GoffJa00 Jared Goff J.Goff O QB QB
## 2 2016 LA 4 110 HigbTy00 Tyler Higbee T.Hig~ O TE TE
## 3 2016 LA 4 117 CoopPh00 Pharoh Coop~ P.Coo~ O WR WR
## 4 2016 LA 6 177 HemiTe00 Temarrick H~ T.Hem~ O TE TE
## 5 2016 LA 6 190 ForrJo00 Josh Forrest J.For~ D LB ILB
## 6 2016 LA 6 206 ThomMi04 Mike Thomas M.Tho~ O WR WR
## 7 2017 LA 2 44 EverGe00 Gerald Ever~ G.Eve~ O TE TE
## 8 2017 LA 3 69 KuppCo00 Cooper Kupp C.Kupp O WR WR
## 9 2017 LA 3 91 JohnJo10 John Johnson J.Joh~ D DB S
## 10 2017 LA 4 117 ReynJo00 Josh Reynol~ J.Rey~ O WR WR
The data above shows the first ten picks of the LA Rams era, along with all of the included variables. We can also use data from this source to examine the value of these picks according to several charts created by analysts and others (below). This could support various analyses, including comparing projected and actual draft value across teams, comparing a player like Goff to similarly selected QBs, or assessing the accuracy of the draft values.
rams_value <- rams_10 %>%
  inner_join(read_csv("https://raw.githubusercontent.com/leesharpe/nfldata/master/data/draft_values.csv"), by = 'pick') %>%
  select(season:pick, name, position, stuart:otc)
rams_value
## # A tibble: 10 x 10
## season team round pick name position stuart johnson hill otc
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2016 LA 1 1 J.Goff QB 34.6 3000 1000 3000
## 2 2016 LA 4 110 T.Higbee TE 4.7 74 28.9 525
## 3 2016 LA 4 117 P.Cooper WR 4.3 60 24.7 492
## 4 2016 LA 6 177 T.Hemingway TE 1.6 21 6.49 274
## 5 2016 LA 6 190 J.Forrest ILB 1.2 15 4.86 237
## 6 2016 LA 6 206 M.Thomas WR 0.7 9 3.4 194
## 7 2017 LA 2 44 G.Everett TE 10.5 460 135. 1007
## 8 2017 LA 3 69 C.Kupp WR 7.6 245 71.4 770
## 9 2017 LA 3 91 J.Johnson S 5.9 136 44.0 625
## 10 2017 LA 4 117 J.Reynolds WR 4.3 60 24.7 492
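The four value charts use very different scales (pick 1 is worth 34.6 on one and 3000 on another), so they are hard to compare directly. One possible approach, sketched below, is to express each pick as a share of that chart's value for pick 1. The numbers are copied from the printed output above (so they carry its rounding), and the rescaling is my own illustration rather than anything the data source prescribes.

```r
# Values copied from the rams_value output above (picks 1, 44, 69, 110).
draft_values <- data.frame(
  pick    = c(1, 44, 69, 110),
  stuart  = c(34.6, 10.5, 7.6, 4.7),
  johnson = c(3000, 460, 245, 74),
  hill    = c(1000, 135, 71.4, 28.9),
  otc     = c(3000, 1007, 770, 525)
)

charts <- c("stuart", "johnson", "hill", "otc")

# Divide each chart by its own value for pick 1, so the first pick equals 1.
relative <- draft_values
for (chart in charts) {
  first_pick_value <- draft_values[[chart]][draft_values$pick == 1]
  relative[[chart]] <- round(draft_values[[chart]] / first_pick_value, 3)
}

relative
```

On this scale, the johnson and hill charts discount later picks much more steeply than the stuart and otc charts do, at least for the four picks shown here.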
Finally, we can quickly use the data to find recent Rams trades. The data below shows the 10 most recent trade records involving the Rams. Each observation is a single item traded (a player or a pick), so there are often multiple observations per trade. A few variables are excluded, but nothing overly important. It would be interesting to see if there is similar data about players who are cut or not re-signed. It’s also easy to use the other data in the repository, which I’m sure I will at some point. Lee Sharpe also maintains a site called NFL Game Data, which has various stats, game scores, and win percentages. It’s likely all of this is in nflfastR, but it may be worth a deeper look.
rams_trades <- read_csv("https://raw.githubusercontent.com/leesharpe/nfldata/master/data/trades.csv") %>%
  filter(gave == 'LA' | received == 'LA') %>%
  tail(n = 10) %>%
  select(season, trade_date, gave, received, pfr_name, pick_round)
rams_trades
## # A tibble: 10 x 6
## season trade_date gave received pfr_name pick_round
## <dbl> <date> <chr> <chr> <chr> <dbl>
## 1 2019 2019-10-29 LA MIA Aqib Talib NA
## 2 2019 2019-10-29 LA MIA Darnell Mooney 5
## 3 2019 2019-10-29 MIA LA <NA> 7
## 4 2020 2020-04-09 HOU LA Van Jefferson 2
## 5 2020 2020-04-09 LA HOU Brandin Cooks NA
## 6 2020 2020-04-09 LA HOU <NA> 4
## 7 2020 2020-04-25 LA HOU Charlie Heck 4
## 8 2020 2020-04-25 HOU LA Brycen Hopkins 4
## 9 2020 2020-04-25 HOU LA Sam Sloman 7
## 10 2020 2020-04-25 HOU LA Tremayne Anchrum 7
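Because each row is one item rather than one trade, it can help to collapse the rows back into trades. The sketch below rebuilds the date and team columns from the output above and groups rows by trade date and team pair. That grouping key is an assumption of mine: it would merge two separate deals made by the same teams on the same day, and if the full file carries its own trade identifier, that would be the better key.

```r
# Date and team columns copied from the rams_trades output above.
trades <- data.frame(
  trade_date = as.Date(c(rep("2019-10-29", 3), rep("2020-04-09", 3), rep("2020-04-25", 4))),
  gave       = c("LA", "LA", "MIA", "HOU", "LA", "LA", "LA", "HOU", "HOU", "HOU"),
  received   = c("MIA", "MIA", "LA", "LA", "HOU", "HOU", "HOU", "LA", "LA", "LA"),
  stringsAsFactors = FALSE
)

# Order each team pair alphabetically so LA -> MIA and MIA -> LA share one key.
partners <- paste(pmin(trades$gave, trades$received),
                  pmax(trades$gave, trades$received),
                  sep = "-")

# Assumed trade key: date plus team pair.
trades$trade_key <- paste(trades$trade_date, partners)

# Number of items (players or picks) that changed hands in each trade.
items_per_trade <- table(trades$trade_key)
items_per_trade
```

For the ten rows shown, this yields three trades: one with Miami and two with Houston.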
This repository contains some player tracking data from a select group of 2017 games. This kind of data is available to NFL teams and, it looks like, to some media members (at ESPN, for instance), but not to the average person. It likely won’t be useful for any projects, but it is still important to know about. There could be a variety of uses, like tracking player speed throughout a season, or using speed and separation as predictors of success for certain routes. I’m sure there are a bunch of other interesting things to explore as well.
This site, airyards.com, is maintained by Josh Hermsmeyer, who writes for 538. I believe it currently draws on nflfastR for its data, although it hasn’t yet produced data for the 2020 season. If it gets back up and running I can take a deeper look. It’s quite easy to read into R, as shown below. It has data for running backs, wide receivers, and tight ends that seems to capture most facets of their performance and efficiency.
library(jsonlite)
df_air <- fromJSON('http://api.airyards.com/2019/weeks')
df_air %>%
  head(n = 5)
## index player_id full_name position team week tar td rush_td rec
## 1 0 00-0027944 Julio Jones WR ATL 15 20 2 0 13
## 2 1 00-0033040 Tyreek Hill WR KC 10 19 1 0 11
## 3 2 00-0030431 Robert Woods WR LA 13 18 0 0 13
## 4 3 00-0032211 Tyler Lockett WR SEA 9 18 2 0 13
## 5 4 00-0030279 Keenan Allen WR LAC 3 17 2 0 13
## rec_yards rush_yards yac air_yards tm_att team_air aypt racr ms_air_yards
## 1 134 0 63 153 39 213 7.7 0.88 0.72
## 2 157 3 69 237 51 422 12.5 0.66 0.56
## 3 172 0 128 80 45 226 4.4 2.15 0.35
## 4 152 0 54 167 43 346 9.3 0.91 0.48
## 5 183 3 74 166 46 471 9.8 1.10 0.35
## target_share wopr
## 1 0.51 1.27
## 2 0.37 0.95
## 3 0.40 0.85
## 4 0.42 0.97
## 5 0.37 0.80
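Several of these columns appear to be derived from the raw counts in the same row. The sketch below recomputes them for the first three players, using raw values copied from the output above. The formulas are my reading of the column names, checked against the displayed numbers, and the WOPR weights (1.5 and 0.7) are the commonly cited definition rather than something stated in this data.

```r
# Raw columns copied from the first three rows of the output above.
air <- data.frame(
  full_name = c("Julio Jones", "Tyreek Hill", "Robert Woods"),
  tar       = c(20, 19, 18),
  rec_yards = c(134, 157, 172),
  air_yards = c(153, 237, 80),
  tm_att    = c(39, 51, 45),
  team_air  = c(213, 422, 226)
)

# Assumed definitions, inferred from the column names and values.
air$aypt         <- air$air_yards / air$tar        # air yards per target
air$racr         <- air$rec_yards / air$air_yards  # receiving yards per air yard
air$target_share <- air$tar / air$tm_att           # share of team pass attempts
air$ms_air_yards <- air$air_yards / air$team_air   # share of team air yards
air$wopr         <- 1.5 * air$target_share + 0.7 * air$ms_air_yards

air[, c("full_name", "aypt", "racr", "target_share", "ms_air_yards", "wopr")]
```

The recomputed values match the displayed ones to within rounding, which suggests the derived columns could be rebuilt from play-by-play data if the site stays offline.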
Next Gen Stats is maintained by the NFL and features a lot of fairly in-depth stats for different players. Scraped data from the site is available via a GitHub repository. It only features the QB data, and includes all QBs with at least 15 percent. The chunk below shows some of the variables that are included. The data is also available for past years and would be useful for comparing performance across seasons.
next_gen_2020 <- read_csv("https://raw.githubusercontent.com/Deryck97/nfl_nextgenstats_data/master/data/nextgen_2020.csv")
names(next_gen_2020)
## [1] "shortName"
## [2] "playerName"
## [3] "aggressiveness"
## [4] "attempts"
## [5] "avgAirDistance"
## [6] "avgAirYardsDifferential"
## [7] "avgAirYardsToSticks"
## [8] "avgCompletedAirYards"
## [9] "avgIntendedAirYards"
## [10] "avgTimeToThrow"
## [11] "completionPercentage"
## [12] "completionPercentageAboveExpectation"
## [13] "completions"
## [14] "expectedCompletionPercentage"
## [15] "interceptions"
## [16] "maxAirDistance"
## [17] "maxCompletedAirDistance"
## [18] "passTouchdowns"
## [19] "passYards"
## [20] "passerRating"
## [21] "season"
## [22] "seasonType"
## [23] "week"
## [24] "teamId"
Many of the variables listed above could contribute to the picture when evaluating a QB. They would be a good complement to the nflfastR data, which could be used to explore success by throw location, down, and other factors. There is a glossary that provides definitions for each variable.
Some of the important ones will be time to throw (tt), various air yards measures, how often a QB throws into tight coverage (aggressiveness, where there is a defender within one yard), air yards to the sticks, and the difference between completion percentage and expected completion percentage.
The website also has stats for receivers and running backs, which could presumably be scraped if they were needed.