NFL Data Analysis

Overview

Several sources of play-, player-, and team-level data may be useful for future analysis. The goal of this review is to document these sources and highlight their important and interesting variables. Everything covered here is public, but some of the most interesting data is not, such as the tracking data the NFL makes available to teams. From a quote in the Washington Post, “teams get the raw, individual, player-level tracking data for every player on and off the field,” said Michael Lopez, director of data and analytics for the NFL. “Their speed, acceleration, direction and orientation are all captured. That is much more refined than anything that is available to the public. A lot of these things are what scouts and front office personnel have always looked at, but now you are able to answer questions with data rather than just watching film.” This tracking data seems to be developed in partnership with AWS.

The public data sets included in this review are:

Play-by-play data from nflscrapR
Play-by-play data from nflfastR
Limited play-by-play data for player field position from Big Data Bowl
NFL Game Data from Lee Sharpe
Player data from Next Gen Stats
Player data from Air Yards
Player data from Pro Football Reference

There are some additional datasets that aren’t discussed further, but that I downloaded so they’re available in the future. These are:

NFL combine and game weather data from nflsavant. More up-to-date versions are likely available elsewhere.

nflscrapR

This is an R package and GitHub repository that will likely be one of my main sources of information. The repository has data through the 2019 season, but there may be a delay of at least a couple of days between a game and its data being uploaded. That may not turn out to be the case, but given the possibility, I initially plan to perform the scraping myself for more consistent access to the data. The page also has some analysis examples, which could be useful at some point.

The package itself is built to scrape data from the NFL API. The data includes play-by-play data and some individual player data. The play-by-play data is the main attraction, but if the player data is as good as or better than other sources, it would be convenient to use this package for both. The code below is an introduction to what the package can do.

I immediately ran into the problem that apparently this no longer works. The NFL has changed its API and this data can no longer be scraped. Very disappointing. It would have been nice if the package advertised this more clearly. I can still access the data repository for data through 2019, but this won’t work for updated data throughout the 2020 season. There is more discussion of this issue here and here and here.

library(nflscrapR)
library(tidyverse)
library(nflfastR)

#can generate a list of game IDs by season, week, and team here
#week_2_games <- scrape_game_ids(2018, weeks = 2)

nflfastR

This is a new package that scrapes data from the NFL website and is expected to work during the 2020 season. It produces play-by-play data, so it could be a new main source. The package seems largely intended to duplicate the functionality of nflscrapR, but faster. There are two vignettes that provide some information about how to use the package. It’s also important to note that the maintainers keep repositories of past data, so it’s best to use those rather than continuously rescraping. Still, I test the scraping here (without running the chunk) to understand it in case new data is not uploaded quickly enough after games.

rams_2019 <- fast_scraper_schedules(2019) %>%
  filter(away_team=='LA'|home_team=='LA') %>%
  select(game_id) #this will find the ids of all Rams games

rams_week1 <- fast_scraper('2019_01_LA_CAR') #scrape play-by-play (pbp) data using a game id

rams_week1_cleaned <- rams_week1 %>%
  clean_pbp() #clean the pbp data a bit, reusing the scrape above instead of scraping twice

This shows how we can scrape data, but for now we will use the repository to avoid unnecessary load. Below, I load the play-by-play data for the first Rams game of 2020. It loads 190 rows of 340 variables. Given the large number of variables, it will be useful to highlight those that matter most. To allow an exploration of each one, I use an interactive table below.

library(reactable)

rams_2020 <- fast_scraper_schedules(2020) %>%
  filter(away_team=='LA'|home_team=='LA') %>%
  select(game_id)
  
#this reads in 2020 data; a different year can be chosen, or a function used to read several years
pbp <- readRDS(url("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2020.rds")) %>%
  filter(game_id == '2020_01_DAL_LA')

vars <- data.frame(colnames(pbp))

reactable(vars)

This is a lot to look through, but here are some highlights about the structure of the data and some of its interesting elements. This CRAN documentation gives good definitions for many of the variables.

It contains all basic information such as field position, time left, team with ball, etc.
It has a variety of variables which could be created manually but will make quick analysis easy. That includes variables that track whether it’s a goal-to-go situation, the drive number, and yards to go.
Description (desc) is a text variable that summarizes the play. All of the info in this variable should be contained in other variables that are easier to use for analysis.
play_type values: run, pass, no_play, extra_point, punt, kickoff, etc.
There are a variety of binary variables about various play characteristics.
- shotgun
- no_huddle
- qb_dropback/spike/kneel/scramble
- pass_length (short, deep) and pass_location (left, right, middle). This would be enough to create six different throw areas. These could be examined further by field position, time left, whether there’s pressure, point differential, down, etc.
- air yards and YAC
- run_location, run_gap (tackle, end, guard, etc.)
Probabilities
- For various outcomes (td, fgs, safety, etc)
- EP and EPA, including these statistics by a variety of play types
- qb_epa
- Win probability (wp, plus wp for home and away) and wpa (win prob added)
- Also has a win probability that accounts for the Vegas line (vegas_wp)
- Probability of complete pass (cp) and over expected (cpoe)
- xyac_epa: expected value of epa gained after the catch, xyac_fd: the probability a play earns a first down based on where it was caught.
A variety of binary variables for particular plays (e.g. first down rush, pass, penalty; fourth down result; punt outcome; td etc.). Some interesting ones include lateral pass, rush, etc.
passer_player_name, receiver_player_name etc. This includes player IDs for a variety of defensive plays as well.
series numbers each set of downs, and series_success tracks whether it was successful (i.e. first down or touchdown). fixed_drive tracks drive number and fixed_drive_result tracks result.
drive_time_of_possession, drive_first_downs, drive_start_yard_line
play_clock tracks what time is on the clock when the ball is snapped
spread_line shows the point spread
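
As a rough sketch of how the pass_length and pass_location variables could be combined into throw areas, the chunk below summarises a small invented data frame. The column names follow the variables listed above, but the rows and values are made up for illustration, and complete_pass is assumed to be a 0/1 indicator.

```r
library(dplyr)

# Toy stand-in for nflfastR play-by-play data. The values are invented;
# only the column names are meant to mirror the real data.
pbp_toy <- tibble::tribble(
  ~play_type, ~pass_length,  ~pass_location, ~epa, ~complete_pass,
  "pass",     "short",       "left",          0.4, 1,
  "pass",     "short",       "left",         -0.6, 0,
  "pass",     "short",       "right",         0.9, 1,
  "pass",     "deep",        "right",         2.1, 1,
  "pass",     "deep",        "right",        -1.1, 0,
  "run",      NA_character_, NA_character_,   0.2, 0
)

# One row per throw area (length x location), with volume and efficiency.
throw_areas <- pbp_toy %>%
  filter(play_type == "pass", !is.na(pass_length), !is.na(pass_location)) %>%
  group_by(pass_length, pass_location) %>%
  summarise(
    attempts  = n(),
    comp_rate = mean(complete_pass),
    mean_epa  = mean(epa),
    .groups   = "drop"
  )

throw_areas
```

The same grouping could be extended with down, score differential, or time remaining by adding those columns to group_by().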

NFL Data from Lee Sharpe

This data is hosted on GitHub and contains a lot of useful information about teams beyond player stats and play-by-play. For instance, it has data about draft picks and their estimated value from several sources; game characteristics such as the stadium (roof, surface type, temperature, wind), spread line, and referee; team colors and logos; rosters; and trades. At least some of the game characteristics data is also available through nflfastR. All of this data can be easily joined to nflfastR data using the ‘name’ field. Below is a simple use of the draft pick data. It goes back to 2000 and can be filtered by things such as team, position, and pick.

rams_10 <- read_csv("https://raw.githubusercontent.com/leesharpe/nfldata/master/data/draft_picks.csv") %>%
  filter(team=='LA') %>% #rams are referred to as STL in earlier seasons
  head(n=10)

rams_10

## # A tibble: 10 x 10
##    season team  round  pick playerid full_name    name   side  category position
##     <dbl> <chr> <dbl> <dbl> <chr>    <chr>        <chr>  <chr> <chr>    <chr>   
##  1   2016 LA        1     1 GoffJa00 Jared Goff   J.Goff O     QB       QB      
##  2   2016 LA        4   110 HigbTy00 Tyler Higbee T.Hig~ O     TE       TE      
##  3   2016 LA        4   117 CoopPh00 Pharoh Coop~ P.Coo~ O     WR       WR      
##  4   2016 LA        6   177 HemiTe00 Temarrick H~ T.Hem~ O     TE       TE      
##  5   2016 LA        6   190 ForrJo00 Josh Forrest J.For~ D     LB       ILB     
##  6   2016 LA        6   206 ThomMi04 Mike Thomas  M.Tho~ O     WR       WR      
##  7   2017 LA        2    44 EverGe00 Gerald Ever~ G.Eve~ O     TE       TE      
##  8   2017 LA        3    69 KuppCo00 Cooper Kupp  C.Kupp O     WR       WR      
##  9   2017 LA        3    91 JohnJo10 John Johnson J.Joh~ D     DB       S       
## 10   2017 LA        4   117 ReynJo00 Josh Reynol~ J.Rey~ O     WR       WR

The data above shows the first ten picks of the LA Rams era, along with all of the included variables. We can also use data from this source to examine the value of these picks according to several draft value charts created by analysts and others (below). This could be used for various analyses, including comparing projected and actual draft value across teams, comparing a player like Goff to similarly selected QBs, or assessing the accuracy of the draft values.

rams_value <- rams_10 %>%
  inner_join(read_csv("https://raw.githubusercontent.com/leesharpe/nfldata/master/data/draft_values.csv"), by= 'pick') %>%
  select(season:pick, name, position, stuart:otc)

rams_value

## # A tibble: 10 x 10
##    season team  round  pick name        position stuart johnson    hill   otc
##     <dbl> <chr> <dbl> <dbl> <chr>       <chr>     <dbl>   <dbl>   <dbl> <dbl>
##  1   2016 LA        1     1 J.Goff      QB         34.6    3000 1000     3000
##  2   2016 LA        4   110 T.Higbee    TE          4.7      74   28.9    525
##  3   2016 LA        4   117 P.Cooper    WR          4.3      60   24.7    492
##  4   2016 LA        6   177 T.Hemingway TE          1.6      21    6.49   274
##  5   2016 LA        6   190 J.Forrest   ILB         1.2      15    4.86   237
##  6   2016 LA        6   206 M.Thomas    WR          0.7       9    3.4    194
##  7   2017 LA        2    44 G.Everett   TE         10.5     460  135.    1007
##  8   2017 LA        3    69 C.Kupp      WR          7.6     245   71.4    770
##  9   2017 LA        3    91 J.Johnson   S           5.9     136   44.0    625
## 10   2017 LA        4   117 J.Reynolds  WR          4.3      60   24.7    492
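
The four value charts are on very different scales, so one way to compare them is to express every pick as a share of the first overall pick. This is only a sketch of that idea: the rows are copied (as displayed, so rounded) from the output above, and draft_values_toy and relative_values are names made up for this example.

```r
library(dplyr)

# Chart values for three picks, copied from the printed output above.
draft_values_toy <- tibble::tribble(
  ~pick, ~stuart, ~johnson, ~hill, ~otc,
  1,     34.6,    3000,     1000,  3000,
  44,    10.5,    460,      135,   1007,
  110,   4.7,     74,       28.9,  525
)

# Divide each chart by its value for pick 1 (the first row after sorting),
# so every column is on a common 0-1 scale.
relative_values <- draft_values_toy %>%
  arrange(pick) %>%
  mutate(across(stuart:otc, ~ .x / .x[1]))

relative_values
```

On this scale the charts disagree noticeably about how quickly value falls off after the first pick, which is the kind of difference that would matter when judging trades.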

Finally, we can quickly use the data to find recent Rams trades. The other data is also easy to use, and I’m sure I will use it at some point. Lee Sharpe also maintains a site called NFL Game Data, which has various stats, game scores, and win percentages. It’s likely all of this is in nflfastR, but it may be worth a deeper look. The data below shows the 10 most recent trade records involving the Rams. Each observation is an asset traded, so there are often multiple observations per trade. A few variables are excluded, but nothing overly important. It would be interesting to see if there is similar data about players who are cut or not re-signed.

rams_trades <- read_csv("https://raw.githubusercontent.com/leesharpe/nfldata/master/data/trades.csv") %>%
  filter(gave=='LA'|received=='LA') %>%
  tail(n=10) %>%
  select(season, trade_date, gave, received, pfr_name, pick_round)

rams_trades

## # A tibble: 10 x 6
##    season trade_date gave  received pfr_name         pick_round
##     <dbl> <date>     <chr> <chr>    <chr>                 <dbl>
##  1   2019 2019-10-29 LA    MIA      Aqib Talib               NA
##  2   2019 2019-10-29 LA    MIA      Darnell Mooney            5
##  3   2019 2019-10-29 MIA   LA       <NA>                      7
##  4   2020 2020-04-09 HOU   LA       Van Jefferson             2
##  5   2020 2020-04-09 LA    HOU      Brandin Cooks            NA
##  6   2020 2020-04-09 LA    HOU      <NA>                      4
##  7   2020 2020-04-25 LA    HOU      Charlie Heck              4
##  8   2020 2020-04-25 HOU   LA       Brycen Hopkins            4
##  9   2020 2020-04-25 HOU   LA       Sam Sloman                7
## 10   2020 2020-04-25 HOU   LA       Tremayne Anchrum          7
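
Because each row is a single asset, a trade has to be reassembled by grouping rows. The sketch below does this on the 2020 rows copied from the output above. It groups by trade_date alone, which assumes a team makes at most one trade per day; that holds for these rows but is my assumption, not something the data guarantees.

```r
library(dplyr)

# The 2020 rows from the output above, with dates kept as plain strings.
trades_toy <- tibble::tribble(
  ~trade_date,  ~gave, ~received, ~pfr_name,          ~pick_round,
  "2020-04-09", "HOU", "LA",      "Van Jefferson",    2,
  "2020-04-09", "LA",  "HOU",     "Brandin Cooks",    NA_real_,
  "2020-04-09", "LA",  "HOU",     NA_character_,      4,
  "2020-04-25", "LA",  "HOU",     "Charlie Heck",     4,
  "2020-04-25", "HOU", "LA",      "Brycen Hopkins",   4,
  "2020-04-25", "HOU", "LA",      "Sam Sloman",       7,
  "2020-04-25", "HOU", "LA",      "Tremayne Anchrum", 7
)

# Collapse asset-level rows into one row per trade from the Rams' side.
trade_summary <- trades_toy %>%
  group_by(trade_date) %>%
  summarise(
    assets      = n(),
    la_gave     = sum(gave == "LA"),
    la_received = sum(received == "LA"),
    .groups     = "drop"
  )

trade_summary
```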

Player Tracking Data from Big Data Bowl

This repository contains player tracking data from a select group of 2017 games. This kind of data is available to NFL teams and, it seems, to some media members (at ESPN, for instance), but not to the average person. It likely won’t be useful for any projects, but it is still important to know about. There could be a variety of uses, like tracking player speed throughout a season, or using speed and separation as predictors of success on certain routes. I’m sure there are a bunch of other interesting things to explore as well.
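
To make the separation idea concrete, here is a minimal sketch that measures the distance from a receiver to the nearest defender in a single frame. I have not checked the column names against the actual files, so player, role, x, and y are hypothetical, and the coordinates are invented (in yards).

```r
library(dplyr)

# One invented frame of tracking data: a receiver and two defenders.
frame_toy <- tibble::tribble(
  ~player, ~role,      ~x, ~y,
  "WR1",   "receiver", 50, 20,
  "CB1",   "defender", 53, 24,
  "S1",    "defender", 60, 20
)

receiver <- filter(frame_toy, role == "receiver")

# Separation = straight-line distance to the closest defender.
separation <- frame_toy %>%
  filter(role == "defender") %>%
  mutate(dist = sqrt((x - receiver$x)^2 + (y - receiver$y)^2)) %>%
  summarise(separation = min(dist)) %>%
  pull(separation)

separation
```

With real data the same calculation would be repeated for every frame of a play, for example at the moment the pass arrives.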

Air Yards

This site is maintained by Josh Hermsmeyer, who writes for FiveThirtyEight. I believe it currently draws on nflfastR for its data, although it hasn’t yet produced data for the 2020 season. If it gets back up and running I can take a deeper look. It’s quite easy to read into R, as shown below. It has data for running backs, wide receivers, and tight ends that seems to capture most facets of their performance and efficiency.

library(jsonlite)
df_air <- fromJSON('http://api.airyards.com/2019/weeks')

df_air %>%
  head(n=5)

##   index  player_id     full_name position team week tar td rush_td rec
## 1     0 00-0027944   Julio Jones       WR  ATL   15  20  2       0  13
## 2     1 00-0033040   Tyreek Hill       WR   KC   10  19  1       0  11
## 3     2 00-0030431  Robert Woods       WR   LA   13  18  0       0  13
## 4     3 00-0032211 Tyler Lockett       WR  SEA    9  18  2       0  13
## 5     4 00-0030279  Keenan Allen       WR  LAC    3  17  2       0  13
##   rec_yards rush_yards yac air_yards tm_att team_air aypt racr ms_air_yards
## 1       134          0  63       153     39      213  7.7 0.88         0.72
## 2       157          3  69       237     51      422 12.5 0.66         0.56
## 3       172          0 128        80     45      226  4.4 2.15         0.35
## 4       152          0  54       167     43      346  9.3 0.91         0.48
## 5       183          3  74       166     46      471  9.8 1.10         0.35
##   target_share wopr
## 1         0.51 1.27
## 2         0.37 0.95
## 3         0.40 0.85
## 4         0.42 0.97
## 5         0.37 0.80
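
Several of the efficiency columns appear to be simple ratios of the raw columns. The sketch below recomputes them for the first three rows of the output above, using the definitions as I understand them: aypt as air yards per target, racr as receiving yards per air yard, and wopr as 1.5 times target share plus 0.7 times air yards share. These formulas are my assumption rather than something documented on this page, although they do reproduce the printed values after rounding.

```r
library(dplyr)

# Raw columns for the first three rows of the output above.
air_toy <- tibble::tribble(
  ~full_name,     ~tar, ~rec_yards, ~air_yards, ~tm_att, ~team_air,
  "Julio Jones",  20,   134,        153,        39,      213,
  "Tyreek Hill",  19,   157,        237,        51,      422,
  "Robert Woods", 18,   172,        80,         45,      226
)

# Rebuild the derived metrics from the raw counts.
air_metrics <- air_toy %>%
  mutate(
    aypt         = air_yards / tar,         # average depth of target
    racr         = rec_yards / air_yards,   # receiver air conversion ratio
    target_share = tar / tm_att,            # share of team pass attempts
    ms_air_yards = air_yards / team_air,    # share of team air yards
    wopr         = 1.5 * target_share + 0.7 * ms_air_yards
  )

air_metrics %>%
  select(full_name, aypt, racr, target_share, ms_air_yards, wopr)
```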

Next Gen Stats

Next Gen Stats is maintained by the NFL and features a lot of pretty in-depth stats for different players. Scraped data from the site is available via a GitHub page. This only features the QB data, and includes all QBs with at least 15 percent. The chunk below shows some of the variables that are included. It is also available for past years and would be useful for comparing performance over seasons.

next_gen_2020 <- read_csv("https://raw.githubusercontent.com/Deryck97/nfl_nextgenstats_data/master/data/nextgen_2020.csv")

names(next_gen_2020)

##  [1] "shortName"                           
##  [2] "playerName"                          
##  [3] "aggressiveness"                      
##  [4] "attempts"                            
##  [5] "avgAirDistance"                      
##  [6] "avgAirYardsDifferential"             
##  [7] "avgAirYardsToSticks"                 
##  [8] "avgCompletedAirYards"                
##  [9] "avgIntendedAirYards"                 
## [10] "avgTimeToThrow"                      
## [11] "completionPercentage"                
## [12] "completionPercentageAboveExpectation"
## [13] "completions"                         
## [14] "expectedCompletionPercentage"        
## [15] "interceptions"                       
## [16] "maxAirDistance"                      
## [17] "maxCompletedAirDistance"             
## [18] "passTouchdowns"                      
## [19] "passYards"                           
## [20] "passerRating"                        
## [21] "season"                              
## [22] "seasonType"                          
## [23] "week"                                
## [24] "teamId"

Many of the variables listed above could contribute to the picture when evaluating a QB. They would be a good complement to some of the data included in nflfastR that could explore success by position of a throw, down, and other factors. There is a glossary that provides definitions for each variable.

Some of the important ones will be time to throw (tt), various air yards measures, the amount of passing attempts into tight coverage (aggressiveness, where there is a defender within one yard), air yards to the sticks, and the difference between completion percentage and expected completion percentage.
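
As a small illustration of the last measure, the chunk below computes the gap between actual and expected completion percentage and ranks QBs by it. The column names come from the names() output above, but the two QBs and all of their numbers are invented.

```r
library(dplyr)

# Invented rows using the real column names from the Next Gen Stats file.
ngs_toy <- tibble::tribble(
  ~playerName, ~completionPercentage, ~expectedCompletionPercentage, ~avgTimeToThrow, ~aggressiveness,
  "QB A",      68.0,                  64.5,                          2.6,             14.0,
  "QB B",      61.0,                  63.0,                          2.9,             20.5
)

# Completion percentage over expected, sorted from best to worst.
ngs_ranked <- ngs_toy %>%
  mutate(cpoe = completionPercentage - expectedCompletionPercentage) %>%
  arrange(desc(cpoe)) %>%
  select(playerName, cpoe, avgTimeToThrow, aggressiveness)

ngs_ranked
```

The real file already carries this difference as completionPercentageAboveExpectation, so in practice the computed column is mostly useful as a sanity check.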

The website also has stats for receivers and running backs, which could presumably be scraped if they were needed.