Introduction

This series of workshops is designed to be able to teach the basics of sports analytics to anyone, especially those who have never coded anything in their life. If that’s you. no worries. If you’re unable to make it feel free to message me on GroupMe or email me () as you troublshoot.

To start today, we are just going to look at an intro to R and RStudio, how to use packages for data analysis.

Any data we use can be found in this folder: https://drive.google.com/drive/folders/17x3TZ0Y2bsk3jwYpH1dQ9lbB3AdHpb18?usp=sharing

Set-up

To start, we just set up our environment. A good first step may be to create a folder somewhere easily accessible. This will be your working directory. The following chunk of code will include…

  • Markdown set-up

  • Setting working directory

  • Updating packages that need updating

knitr::opts_chunk$set(echo = TRUE)
#knitr::opts_knit$set(root.dir = "C:/Users/maxwh/Downloads/SSOUGA")

#setwd("C:/Users/maxwh/Downloads/SSOUGA")

# TODO Run ~once a week to keep up with package dependencies
#update.packages()

Our environment is now pretty much set up. Changing the directory gave us a place where all the data we need to import manually (not from packages) will live and where all our exports will go.

Packages

The foundation of these workshops will be built on packages, specifically one called the “tidyverse”, a collection of packages that aim to make R code more efficient and readable, and thus easier to learn. We also will use data from NFLFastR/NFLReadR. These packages essentially perform functions under the hood to save us time and eliminate technical barriers to entry.

Step 1: Installing Packages

This is a one time situation, you only need to install packages, the first time you use them. After that update.packages() will keep them up to date.

Step 3: Loading Packages

library(nflfastR)
## Warning: package 'nflfastR' was built under R version 4.1.3
library(scales)
## Warning: package 'scales' was built under R version 4.1.3
library(nflreadr)
## Warning: package 'nflreadr' was built under R version 4.1.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## Warning: package 'ggplot2' was built under R version 4.1.3
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'stringr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.3
library(ggthemes)
library(ggtext)
## Warning: package 'ggtext' was built under R version 4.1.3
library(ggimage)
## Warning: package 'ggimage' was built under R version 4.1.3
library(gghighlight)
## Warning: package 'gghighlight' was built under R version 4.1.3
library(gt)
## Warning: package 'gt' was built under R version 4.1.3
library(gtExtras)
## Warning: package 'gtExtras' was built under R version 4.1.3

The library([package_name]) function will load the packages into your environment each time you open RStudio.

There is a way of using functions without loading the whole package seen below.

carData::Mandel
##      x1    x2     y
## 1 16.85  1.46 41.38
## 2 24.81 -4.61 31.01
## 3 18.85 -0.21 37.41
## 4 12.63  4.93 50.05
## 5 21.38 -1.36 39.17
## 6 18.78 -0.08 38.86
## 7 15.58  2.98 46.14
## 8 16.30  1.73 44.47

Here we called a dataframe from the carData package without loading the whole package.

This is useful for a few reasons…

  • Using a function/dataframe/etc from a package without loading the entire package

  • Writing more readable code, so people know what package your functions come from

  • Calling a function that may have the same name in two different packages

Loading Data

So we want to data analysis. The first step is to load our data. There are a few ways to do this. Here we will mostly try and do this by using packages. We can start by using CSVs and XLSX files for now though.

To do this, we use the tidyverse functions read_csv().

Before we use this, it might be helpful to have an idea how. When using a function that you are unfamiliar with or need a quick reminder, go to the help tab in the bottom right of your environment and type in the name of the function. If you have loaded the package, there should be documentation to help you out.

# Remember to make sure this data is in your working directory

#readr::read_csv("combine.csv")

And there it is, we now have some data to work with.

Let’s try again with an XLSX file.

Using the readxl package, we didn’t load it though because we will only use it this once.

# Make sure to clarify the sheet when loading data from xlsx.

#readxl::read_xlsx("hockeyplayerdata2018.xlsx", sheet = 2)

So we have options for loading data, these are a couple of them. For the time being we will primarily be using packages to load data.

Let’s get started with last year’s NFL Data and assign it to an object. R is flexible in the sense you don’t need to initialize anything to save it for those of you coming from a Java-ish background.

We can also use this moment as an introduction to vectors. Vectors can be thought of as a list essentially of similar objects, here we are going to create a vector of numbers to use as the seasons we want to collect data on. To create a vector use c([obj1, obj2, …])

# Vector of one season
szn <- c(2021)
#Vector of select seasons
szns <- c(2011, 2021)
#Vector of series of seasons (all seasons between 2011 & 2021)
decade <- c(2011:2021)

# The assignment operator can be '=' or '<-", I use the latter typically.

# Here we are just going to use last season so we don't actually need a vector
pbp21 <- load_pbp(seasons = 2021)

# This could take a second
pbp_decade <- load_pbp(seasons = decade)

So now we have the data, let’s take a peek.

Use the head() function to look at the top of data, by default head() calls 6 rows.

Use tail() for the converse.

You can change the argument in head to view as many rows as you want, or even to “trim” a data frame to x amount of rows.

# Two ways of calling
head(pbp21)
## -- nflverse play by play data --------------------------------------------------
## i Data updated: 2022-09-27 07:35:02 EDT
## # A tibble: 6 x 372
##   play_id game_id  old_g~1 home_~2 away_~3 seaso~4  week posteam poste~5 defteam
##     <dbl> <chr>    <chr>   <chr>   <chr>   <chr>   <int> <chr>   <chr>   <chr>  
## 1       1 2021_01~ 202109~ TEN     ARI     REG         1 <NA>    <NA>    <NA>   
## 2      40 2021_01~ 202109~ TEN     ARI     REG         1 TEN     home    ARI    
## 3      55 2021_01~ 202109~ TEN     ARI     REG         1 TEN     home    ARI    
## 4      76 2021_01~ 202109~ TEN     ARI     REG         1 TEN     home    ARI    
## 5     100 2021_01~ 202109~ TEN     ARI     REG         1 TEN     home    ARI    
## 6     122 2021_01~ 202109~ TEN     ARI     REG         1 TEN     home    ARI    
## # ... with 362 more variables: side_of_field <chr>, yardline_100 <dbl>,
## #   game_date <chr>, quarter_seconds_remaining <dbl>,
## #   half_seconds_remaining <dbl>, game_seconds_remaining <dbl>,
## #   game_half <chr>, quarter_end <dbl>, drive <dbl>, sp <dbl>, qtr <dbl>,
## #   down <dbl>, goal_to_go <dbl>, time <chr>, yrdln <chr>, ydstogo <dbl>,
## #   ydsnet <dbl>, desc <chr>, play_type <chr>, yards_gained <dbl>,
## #   shotgun <dbl>, no_huddle <dbl>, qb_dropback <dbl>, qb_kneel <dbl>, ...
pbp21 %>% head()
## -- nflverse play by play data --------------------------------------------------
## i Data updated: 2022-09-27 07:35:02 EDT
## # A tibble: 6 x 372
##   play_id game_id  old_g~1 home_~2 away_~3 seaso~4  week posteam poste~5 defteam
##     <dbl> <chr>    <chr>   <chr>   <chr>   <chr>   <int> <chr>   <chr>   <chr>  
## 1       1 2021_01~ 202109~ TEN     ARI     REG         1 <NA>    <NA>    <NA>   
## 2      40 2021_01~ 202109~ TEN     ARI     REG         1 TEN     home    ARI    
## 3      55 2021_01~ 202109~ TEN     ARI     REG         1 TEN     home    ARI    
## 4      76 2021_01~ 202109~ TEN     ARI     REG         1 TEN     home    ARI    
## 5     100 2021_01~ 202109~ TEN     ARI     REG         1 TEN     home    ARI    
## 6     122 2021_01~ 202109~ TEN     ARI     REG         1 TEN     home    ARI    
## # ... with 362 more variables: side_of_field <chr>, yardline_100 <dbl>,
## #   game_date <chr>, quarter_seconds_remaining <dbl>,
## #   half_seconds_remaining <dbl>, game_seconds_remaining <dbl>,
## #   game_half <chr>, quarter_end <dbl>, drive <dbl>, sp <dbl>, qtr <dbl>,
## #   down <dbl>, goal_to_go <dbl>, time <chr>, yrdln <chr>, ydstogo <dbl>,
## #   ydsnet <dbl>, desc <chr>, play_type <chr>, yards_gained <dbl>,
## #   shotgun <dbl>, no_huddle <dbl>, qb_dropback <dbl>, qb_kneel <dbl>, ...

This introduces us to the pipe operator ( %>% ), it’s a very helpful operator. It’s windows shortcut is ctrl+shift+m. The pipe operator takes the output of one function and pushes it into the object that’s going to be operated on in the next object.

Think about how we used the head function above, we took the output from the pbp21 object and then pushed it into the object slot for the head() function.

Data Manipulation using Tidyverse

The filter function

The filter function is used to identify the rows in a dataframe we do or do not need.

Let’s filter by games Philadelphia (my favorite team)

pbp21 %>% 
    filter(home_team == "PHI" | away_team == "PHI")
## -- nflverse play by play data --------------------------------------------------
## i Data updated: 2022-09-27 07:35:02 EDT
## # A tibble: 3,170 x 372
##    play_id game_id old_g~1 home_~2 away_~3 seaso~4  week posteam poste~5 defteam
##      <dbl> <chr>   <chr>   <chr>   <chr>   <chr>   <int> <chr>   <chr>   <chr>  
##  1       1 2021_0~ 202109~ ATL     PHI     REG         1 <NA>    <NA>    <NA>   
##  2      41 2021_0~ 202109~ ATL     PHI     REG         1 ATL     home    PHI    
##  3      56 2021_0~ 202109~ ATL     PHI     REG         1 ATL     home    PHI    
##  4      80 2021_0~ 202109~ ATL     PHI     REG         1 ATL     home    PHI    
##  5     101 2021_0~ 202109~ ATL     PHI     REG         1 ATL     home    PHI    
##  6     122 2021_0~ 202109~ ATL     PHI     REG         1 ATL     home    PHI    
##  7     146 2021_0~ 202109~ ATL     PHI     REG         1 ATL     home    PHI    
##  8     176 2021_0~ 202109~ ATL     PHI     REG         1 ATL     home    PHI    
##  9     210 2021_0~ 202109~ ATL     PHI     REG         1 ATL     home    PHI    
## 10     231 2021_0~ 202109~ ATL     PHI     REG         1 ATL     home    PHI    
## # ... with 3,160 more rows, 362 more variables: side_of_field <chr>,
## #   yardline_100 <dbl>, game_date <chr>, quarter_seconds_remaining <dbl>,
## #   half_seconds_remaining <dbl>, game_seconds_remaining <dbl>,
## #   game_half <chr>, quarter_end <dbl>, drive <dbl>, sp <dbl>, qtr <dbl>,
## #   down <dbl>, goal_to_go <dbl>, time <chr>, yrdln <chr>, ydstogo <dbl>,
## #   ydsnet <dbl>, desc <chr>, play_type <chr>, yards_gained <dbl>,
## #   shotgun <dbl>, no_huddle <dbl>, qb_dropback <dbl>, qb_kneel <dbl>, ...

So what’s the logic here?

The syntax of filter is filter(row_name [operator] target)

So what are the types of operators, below are a few:

  • “==”: This operator checks for equivalency

  • “>’,”<“,”<=“, or”>=“: When comparing quantitative data we can use greater than or less than

  • “|”: This is the “or” operator. When we use this we are saying for each row one of the conditions on each side of the or operator must be true

  • “&”: This is the “and” operator. When we use this we are saying for each row both of the conditions on each side of the and operator must be true

The select function

pbp21 %>% 
    filter(home_team == "PHI" | away_team == "PHI") %>%
    select(home_team, away_team, desc, game_seconds_remaining, home_wp)
## -- nflverse play by play data --------------------------------------------------
## i Data updated: 2022-09-27 07:35:02 EDT
## # A tibble: 3,170 x 5
##    home_team away_team desc                                      game_~1 home_wp
##    <chr>     <chr>     <chr>                                       <dbl>   <dbl>
##  1 ATL       PHI       GAME                                         3600   0.546
##  2 ATL       PHI       4-J.Elliott kicks 65 yards from PHI 35 t~    3600   0.546
##  3 ATL       PHI       (15:00) 2-M.Ryan pass deep middle to 18-~    3600   0.546
##  4 ATL       PHI       (14:34) (No Huddle) 28-M.Davis right gua~    3574   0.573
##  5 ATL       PHI       (14:05) (No Huddle) 28-M.Davis left tack~    3545   0.584
##  6 ATL       PHI       (13:45) (No Huddle) 2-M.Ryan pass short ~    3525   0.569
##  7 ATL       PHI       (13:08) (No Huddle, Shotgun) 2-M.Ryan pa~    3488   0.542
##  8 ATL       PHI       (12:50) (No Huddle) 2-M.Ryan pass incomp~    3470   0.632
##  9 ATL       PHI       (12:44) 28-M.Davis right guard to PHI 28~    3464   0.618
## 10 ATL       PHI       (12:08) (Shotgun) 2-M.Ryan pass short mi~    3428   0.601
## # ... with 3,160 more rows, and abbreviated variable name
## #   1: game_seconds_remaining

The select function is pretty simple. The select function simply selects the columns you want to keep in your dataset. This is helpful for a lot of reasons, but mostly it’s just helpful to clean up data that has a lot of extraneous data.

The mutate function

pbp21 %>% 
    filter(home_team == "PHI" | away_team == "PHI") %>%
    select(home_team, away_team, desc, game_seconds_remaining, home_wp) %>% 
    mutate(away_wp = 1 - home_wp)
## -- nflverse play by play data --------------------------------------------------
## i Data updated: 2022-09-27 07:35:02 EDT
## # A tibble: 3,170 x 6
##    home_team away_team desc                              game_~1 home_wp away_wp
##    <chr>     <chr>     <chr>                               <dbl>   <dbl>   <dbl>
##  1 ATL       PHI       GAME                                 3600   0.546   0.454
##  2 ATL       PHI       4-J.Elliott kicks 65 yards from ~    3600   0.546   0.454
##  3 ATL       PHI       (15:00) 2-M.Ryan pass deep middl~    3600   0.546   0.454
##  4 ATL       PHI       (14:34) (No Huddle) 28-M.Davis r~    3574   0.573   0.427
##  5 ATL       PHI       (14:05) (No Huddle) 28-M.Davis l~    3545   0.584   0.416
##  6 ATL       PHI       (13:45) (No Huddle) 2-M.Ryan pas~    3525   0.569   0.431
##  7 ATL       PHI       (13:08) (No Huddle, Shotgun) 2-M~    3488   0.542   0.458
##  8 ATL       PHI       (12:50) (No Huddle) 2-M.Ryan pas~    3470   0.632   0.368
##  9 ATL       PHI       (12:44) 28-M.Davis right guard t~    3464   0.618   0.382
## 10 ATL       PHI       (12:08) (Shotgun) 2-M.Ryan pass ~    3428   0.601   0.399
## # ... with 3,160 more rows, and abbreviated variable name
## #   1: game_seconds_remaining

Mutate is one of the most useful functions we have. Mutate is super helpful for creating new columns based on really anything, but especially based on data we already have in other columns. So here we created a column that gives us the away_wp based on the measured home_wp.

So let’s combine the 3 here and see if we can create some unique insight. Since I was at this game, I know the eagles won. Let’s see if we can find the play that was the biggest momentum swing.

pbp21 %>% 
    filter(home_team == "ATL" & away_team == "PHI") %>%
    select(home_team, away_team, desc, game_seconds_remaining, home_wp) %>% 
    mutate(away_wp = 1 - home_wp) %>% 
    # This next line is straight from stack overflow, great resource
    mutate(PHI_wp_change = home_wp-lag(away_wp,default=first(away_wp))) %>% 
    select(desc, PHI_wp_change)
## -- nflverse play by play data --------------------------------------------------
## i Data updated: 2022-09-27 07:35:02 EDT
## # A tibble: 191 x 2
##    desc                                                                  PHI_w~1
##    <chr>                                                                   <dbl>
##  1 GAME                                                                   0.0925
##  2 4-J.Elliott kicks 65 yards from PHI 35 to end zone, Touchback.         0.0925
##  3 (15:00) 2-M.Ryan pass deep middle to 18-C.Ridley to ATL 41 for 16 ya~  0.0925
##  4 (14:34) (No Huddle) 28-M.Davis right guard to PHI 47 for 12 yards (5~  0.119 
##  5 (14:05) (No Huddle) 28-M.Davis left tackle to PHI 43 for 4 yards (57~  0.157 
##  6 (13:45) (No Huddle) 2-M.Ryan pass short right to 84-C.Patterson to P~  0.154 
##  7 (13:08) (No Huddle, Shotgun) 2-M.Ryan pass short right to 18-C.Ridle~  0.112 
##  8 (12:50) (No Huddle) 2-M.Ryan pass incomplete short right to 8-K.Pitt~  0.175 
##  9 (12:44) 28-M.Davis right guard to PHI 28 for 3 yards (94-J.Sweat; 58~  0.250 
## 10 (12:08) (Shotgun) 2-M.Ryan pass short middle to 18-C.Ridley to PHI 1~  0.218 
## # ... with 181 more rows, and abbreviated variable name 1: PHI_wp_change

So now we have all the plays from this game as well as how that changed the win probability for Philadelphia.

The arrange function

pbp21 %>% 
    filter(home_team == "ATL" & away_team == "PHI") %>%
    select(home_team, away_team, down, ydstogo, yardline_100, desc, qtr ,time,
           home_wp) %>% 
    mutate(away_wp = 1 - home_wp) %>% 
    # This next line is straight from stack overflow, great resource
    mutate(PHI_wp_change = home_wp-lag(away_wp,default=first(away_wp))) %>% 
    select(desc, down, ydstogo, qtr, time, yardline_100, PHI_wp_change) %>% 
    drop_na(PHI_wp_change) %>% 
    arrange(-PHI_wp_change) %>% 
    head(5)
## # A tibble: 5 x 7
##   desc                                  down ydstogo   qtr time  yardl~1 PHI_w~2
##   <chr>                                <dbl>   <dbl> <dbl> <chr>   <dbl>   <dbl>
## 1 (10:11) (No Huddle) 2-M.Ryan pass i~     2       3     1 10:11       3   0.374
## 2 (10:07) (Shotgun) 2-M.Ryan pass inc~     3       3     1 10:07       3   0.363
## 3 (10:37) (No Huddle) 28-M.Davis left~     1       6     1 10:37       6   0.341
## 4 (11:35) 28-M.Davis right guard to P~     1       5     1 11:35      11   0.335
## 5 (11:03) 2-M.Ryan pass short left to~     2       7     1 11:03      13   0.331
## # ... with abbreviated variable names 1: yardline_100, 2: PHI_wp_change

Now we have the top 5 most influential plays of the game in favor of the eagles.

The arrange function sorts the dataframe in descending order based on the specified column. The “-” sorts in descending order. Without the “-”, the table will be sorted in ascending order.

We also use the drop_na() function to get rid of any NA values our data might have.

The group_by and summarise function

Let’s try and create a table to see who the best RBs in the NFL are

pbp21 %>% 
    filter(rush_attempt == 1) %>% 
    group_by(rusher_player_name, posteam, rusher_id) %>% 
    drop_na(rusher_id) %>% 
    summarise(Rushes = sum(n()), Total_EPA = sum(epa), Yards = sum(yards_gained)) %>% 
    mutate(EPA_rush = Total_EPA/Rushes, Yds_rush = Yards/Rushes) %>% 
    filter(Rushes > 30) %>% 
    arrange(-Yds_rush)
## `summarise()` has grouped output by 'rusher_player_name', 'posteam'. You can
## override using the `.groups` argument.
## # A tibble: 110 x 8
## # Groups:   rusher_player_name, posteam [110]
##    rusher_player_name posteam rusher_id  Rushes Total_EPA Yards EPA_rush Yds_r~1
##    <chr>              <chr>   <chr>       <int>     <dbl> <dbl>    <dbl>   <dbl>
##  1 R.Penny            SEA     00-0034750    119   11.2      749  9.41e-2    6.29
##  2 D.Hilliard         TEN     00-0034253     56    9.06     350  1.62e-1    6.25
##  3 D.Samuel           SF      00-0035719     86   18.7      502  2.18e-1    5.84
##  4 N.Chubb            CLE     00-0034791    228   10.6     1259  4.66e-2    5.52
##  5 T.Pollard          DAL     00-0035261    134   12.4      733  9.25e-2    5.47
##  6 J.Taylor           IND     00-0036223    332   34.5     1811  1.04e-1    5.45
##  7 D.Johnson          CLE     00-0035628    100    6.44     534  6.44e-2    5.34
##  8 M.Sanders          PHI     00-0035243    145   -1.77     772 -1.22e-2    5.32
##  9 T.Williams         BAL     00-0036457     35   -0.730    185 -2.09e-2    5.29
## 10 J.Jackson          LAC     00-0034440     69    0.0122   364  1.77e-4    5.28
## # ... with 100 more rows, and abbreviated variable name 1: Yds_rush

The group_by() function is used to group data, hence the name. It can be hard to see what exactly this looks like in practice until we get into the summarise() function.

The summarise() function takes the grouping items and then applies a function to them as a unit. So here we grouped by player and added a couple other variables that we knew would be the same for each instance of the overarching grouping variable, the player.

Joins

One of the really nice things the NFlFastR package offers is a dataframe we can attach to other dataframes we have built that helps us create custom graphics based on team colors among other things. We will see what that looks like after using the left_join() function.

pbp21 %>% 
    filter(rush_attempt == 1) %>% 
    group_by(rusher_player_name, posteam, rusher_id) %>% 
    drop_na(rusher_id) %>% 
    summarise(Rushes = sum(n()), Total_EPA = sum(epa), Yards = sum(yards_gained)) %>% 
    mutate(EPA_rush = Total_EPA/Rushes, Yds_rush = Yards/Rushes) %>% 
    filter(Rushes > 50) %>% 
    arrange(-Yds_rush) %>% 
    left_join(teams_colors_logos, by =c("posteam" = "team_abbr"))
## `summarise()` has grouped output by 'rusher_player_name', 'posteam'. You can
## override using the `.groups` argument.
## # A tibble: 83 x 22
## # Groups:   rusher_player_name, posteam [83]
##    rushe~1 posteam rushe~2 Rushes Total~3 Yards EPA_rush Yds_r~4 team_~5 team_id
##    <chr>   <chr>   <chr>    <int>   <dbl> <dbl>    <dbl>   <dbl> <chr>   <chr>  
##  1 R.Penny SEA     00-003~    119 11.2      749  9.41e-2    6.29 Seattl~ 4600   
##  2 D.Hill~ TEN     00-003~     56  9.06     350  1.62e-1    6.25 Tennes~ 2100   
##  3 D.Samu~ SF      00-003~     86 18.7      502  2.18e-1    5.84 San Fr~ 4500   
##  4 N.Chubb CLE     00-003~    228 10.6     1259  4.66e-2    5.52 Clevel~ 1050   
##  5 T.Poll~ DAL     00-003~    134 12.4      733  9.25e-2    5.47 Dallas~ 1200   
##  6 J.Tayl~ IND     00-003~    332 34.5     1811  1.04e-1    5.45 Indian~ 2200   
##  7 D.John~ CLE     00-003~    100  6.44     534  6.44e-2    5.34 Clevel~ 1050   
##  8 M.Sand~ PHI     00-003~    145 -1.77     772 -1.22e-2    5.32 Philad~ 3700   
##  9 J.Jack~ LAC     00-003~     69  0.0122   364  1.77e-4    5.28 Los An~ 4400   
## 10 C.Edmo~ ARI     00-003~    124  8.68     620  7.00e-2    5    Arizon~ 3800   
## # ... with 73 more rows, 12 more variables: team_nick <chr>, team_conf <chr>,
## #   team_division <chr>, team_color <chr>, team_color2 <chr>,
## #   team_color3 <chr>, team_color4 <chr>, team_logo_wikipedia <chr>,
## #   team_logo_espn <chr>, team_wordmark <chr>, team_conference_logo <chr>,
## #   team_league_logo <chr>, and abbreviated variable names
## #   1: rusher_player_name, 2: rusher_id, 3: Total_EPA, 4: Yds_rush,
## #   5: team_name

As we can see– this join added a bunch of columns with team information, links to logos, color codes, and more.

Joining different dataframes with left_join() is pretty simple. The syntax is

old_df %>% left_join(new_df, by = c(“old_df_col” = “new_df_col”))

There are more ways to do joins but we will start with this and explore more on a as needed basis. The idea here is that the column in our two columns have similar data. For each instance of an item in the original data frame that has an equivalent in the new dataframe that data will be matched to the original dataframe. This can happen in a one-to-one fashion or a one-to-many fashion. I hope this makes sense.

Applications

To practice these concepts of data manipulation… we will walk through the functions we mentioned above to see if we can contort the data in order to answer our questions.

The functions we have available to use are

  • filter()

  • group_by()

  • summarise()

  • arrange()

  • left_join() (and other joins)

  • mutate()

  • select()

Other useful functions: head(), slice(), ungroup(), drop_na()

Who was the best WR of the last decade? Show the top 10.

Hint: Use the pbp_decade data

## # A tibble: 1,701 x 4
##    receiver_player_name Yards   TDs Total_EPA
##    <chr>                <dbl> <dbl>     <dbl>
##  1 J.Jones              17432    99      720.
##  2 A.Brown              16393   118      654.
##  3 T.Kelce              10337    69      543.
##  4 R.Gronkowski         10066    98      540.
##  5 D.Adams               9051    82      424.
##  6 T.Hill                8137    71      391.
##  7 D.Baldwin             7297    55      391.
##  8 T.Hilton             10476    59      381.
##  9 J.Nelson              9202    81      379.
## 10 M.Evans               9757    80      375.
## # ... with 1,691 more rows

What if we wanted to check the best single seasons?

## `summarise()` has grouped output by 'receiver_player_name'. You can override
## using the `.groups` argument.
## # A tibble: 5,356 x 5
## # Groups:   receiver_player_name [1,701]
##    receiver_player_name season Yards   TDs Total_EPA
##    <chr>                 <int> <dbl> <dbl>     <dbl>
##  1 C.Kupp                 2021  2427    23     139. 
##  2 J.Jones                2012  2192    17     117. 
##  3 J.Jones                2015  2827    16     115. 
##  4 R.Gronkowski           2011  1585    20     114. 
##  5 W.Welker               2011  1737    10     111. 
##  6 S.Diggs                2020  1846    10     103. 
##  7 M.Thomas               2019  1809     9     101. 
##  8 A.Boldin               2013  1406     8     101. 
##  9 V.Cruz                 2011  1805    11      99.9
## 10 A.Brown                2020  1722    19      98.8
## # ... with 5,346 more rows

Which 10 plays created the most momentum in the playoffs over the last decade?

hint: use the ‘wpa’ variable rather than the method we used above

## # A tibble: 10 x 4
##    desc                                                    posteam defteam   wpa
##    <chr>                                                   <chr>   <chr>   <dbl>
##  1 (:10) (Shotgun) 7-C.Keenum pass deep right to 14-S.Dig~ MIN     NO      0.699
##  2 (:26) (Field Goal formation) 3-B.Walsh 27 yard field g~ MIN     SEA     0.588
##  3 (3:59) 4-S.Koch punts 52 yards to DEN 14, Center-46-M.~ BAL     DEN     0.575
##  4 (:26) (Shotgun) 3-R.Wilson pass short right intended f~ SEA     NE      0.573
##  5 (9:42) 5-S.Weatherford punts 37 yards to SF 19, Center~ NYG     SF      0.540
##  6 (:17) (Shotgun) 17-J.Allen pass deep middle to 13-G.Da~ BUF     KC      0.508
##  7 (15:00) (Shotgun) 15-T.Tebow pass deep middle to 88-D.~ DEN     PIT     0.473
##  8 (:14) 11-Alex Smith pass short middle to 85-V.Davis fo~ SF      NO      0.440
##  9 (:10) (Field Goal formation) 1-C.Parkey 43 yard field ~ CHI     PHI     0.414
## 10 (15:00) 3-C.Palmer pass short left to 11-L.Fitzgerald ~ ARI     GB      0.389

A little about the slice function: The slice function is useful for taking one instance that we specify. So here it’s taking one instance where the most wpa was gained or lost out of each group. I’m personally not super familiar with it, so that may be a representation of its true use, but that’s the way I used it here.

Other NFL readR uses

Some more data we can play with…

Full nflReadR functions: https://nflreadr.nflverse.com/reference/index.html

draft <- load_draft_picks()

snaps <- load_snap_counts(seasons = 2021)
head(draft)
## -- nflverse Draft Picks, via Pro Football Reference ----------------------------
## i Data updated: 2022-11-02 01:42:17 EDT
## # A tibble: 6 x 36
##   season round  pick team  gsis_id pfr_p~1 cfb_p~2 pfr_p~3 hof   posit~4 categ~5
##    <int> <int> <int> <chr> <chr>   <chr>   <chr>   <chr>   <lgl> <chr>   <chr>  
## 1   1980     1     1 DET   <NA>    SimsBi~ billy-~ Billy ~ FALSE RB      RB     
## 2   1980     1     2 NYJ   <NA>    JoneLa~ lam-jo~ Lam Jo~ FALSE WR      WR     
## 3   1980     1     3 CIN   <NA>    MunoAn~ <NA>    Anthon~ TRUE  T       OL     
## 4   1980     1     4 GNB   <NA>    ClarBr~ bruce-~ Bruce ~ FALSE DE      DL     
## 5   1980     1     5 BAL   <NA>    DickCu~ curtis~ Curtis~ FALSE RB      RB     
## 6   1980     1     6 STL   <NA>    GreeCu~ curtis~ Curtis~ FALSE DE      DL     
## # ... with 25 more variables: side <chr>, college <chr>, age <dbl>, to <int>,
## #   allpro <dbl>, probowls <dbl>, seasons_started <dbl>, w_av <dbl>,
## #   car_av <dbl>, dr_av <dbl>, games <dbl>, pass_completions <dbl>,
## #   pass_attempts <dbl>, pass_yards <dbl>, pass_tds <dbl>, pass_ints <dbl>,
## #   rush_atts <dbl>, rush_yards <dbl>, rush_tds <dbl>, receptions <dbl>,
## #   rec_yards <dbl>, rec_tds <dbl>, def_solo_tackles <dbl>, def_ints <dbl>,
## #   def_sacks <dbl>, and abbreviated variable names 1: pfr_player_id, ...
draft %>% 
    filter(season == 2021) -> draft21
head(snaps)
## -- nflverse snap counts --------------------------------------------------------
## i Data updated: 2022-07-08 10:06:34 EDT
## # A tibble: 6 x 16
##   game_id      pfr_g~1 season game_~2  week player pfr_p~3 posit~4 team  oppon~5
##   <chr>        <chr>    <int> <chr>   <int> <chr>  <chr>   <chr>   <chr> <chr>  
## 1 2021_01_DAL~ 202109~   2021 REG         1 Trist~ WirfTr~ T       TB    DAL    
## 2 2021_01_DAL~ 202109~   2021 REG         1 Ali M~ MarpAl~ G       TB    DAL    
## 3 2021_01_DAL~ 202109~   2021 REG         1 Alex ~ CappAl~ G       TB    DAL    
## 4 2021_01_DAL~ 202109~   2021 REG         1 Donov~ SmitDo~ T       TB    DAL    
## 5 2021_01_DAL~ 202109~   2021 REG         1 Tom B~ BradTo~ QB      TB    DAL    
## 6 2021_01_DAL~ 202109~   2021 REG         1 Ryan ~ JensRy~ C       TB    DAL    
## # ... with 6 more variables: offense_snaps <dbl>, offense_pct <dbl>,
## #   defense_snaps <dbl>, defense_pct <dbl>, st_snaps <dbl>, st_pct <dbl>, and
## #   abbreviated variable names 1: pfr_game_id, 2: game_type, 3: pfr_player_id,
## #   4: position, 5: opponent

Which team got the most production from their draft class (disregard quality of

production)

We know have the 2020 draft class loaded. Let’s load all snaps from last year.

Join them and summarise…

draft21 %>% 
    left_join(snaps) %>% # No by() bc they have common variables
    group_by(player, team) %>% 
    mutate(Snaps = offense_snaps + defense_snaps) %>% 
    drop_na(Snaps) %>% 
    summarise(TotalSnaps = sum(Snaps)) %>% 
    ungroup() %>% 
    group_by(team) %>% 
    summarise(Snaps = sum(TotalSnaps)) %>% 
    arrange(-Snaps) -> production_draft_21
## Joining, by = c("season", "team", "pfr_player_id", "position")
## `summarise()` has grouped output by 'player'. You can override using the
## `.groups` argument.
production_draft_21
## # A tibble: 24 x 2
##    team  Snaps
##    <chr> <dbl>
##  1 NYJ    2628
##  2 PIT    2039
##  3 WAS    1763
##  4 HOU    1691
##  5 CIN    1668
##  6 CAR    1603
##  7 PHI    1573
##  8 JAX    1557
##  9 ARI    1480
## 10 ATL    1265
## # ... with 14 more rows

Creating Attractive Tables

Using our last dataset

production_draft_21 %>% 
    left_join(teams_colors_logos, by = c("team" = "team_abbr")) %>% 
    select(Team = team_wordmark, Snaps) %>% 
    head(10) %>% 
    gt() %>% 
    tab_header( 
        title = "Most Productive 2021 Draft Classes", 
        subtitle = "Measured in snaps played") %>% 
    cols_label(Snaps = "Rookie Snaps Played") %>% 
    opt_table_outline() %>% # Creating outline
    data_color( # Update cell colors...
        columns = c(Snaps), # ...for mean_len column
        colors = scales::col_numeric(
            palette = c(
                "#FF6962", "white","#5FA777" ), # Overboard colors! 
            domain = c(min(production_draft_21$Snaps), max(Snaps)) 
        )
    ) %>% 
    cols_align(
        align = "center",
        columns = Team # Selecting all columns
    ) %>% 
    gt_img_rows(columns = Team, height = 25)
Most Productive 2021 Draft Classes
Measured in snaps played
Team Rookie Snaps Played
2628
2039
1763
1691
1668
1603
1573
1557
1480
1265

Which running backs (or runners) matter the most so far this year?

Here we will take a lot of what we learned, step-by-step

Loading Data

## Redundant but just for exercise sake...
pbp22 <- load_pbp(seasons = 2022)

Manipulating Data

We can go step by step here to really see what each function is doing

pbp22 %>% 
    filter(rush_attempt == 1) %>%  #Including QB scrambles this time
    group_by(rusher_player_name, posteam) %>% 
    summarise(epa_per_run = mean(epa), yards_per_run = mean(yards_gained), 
              carries = n(), yards = sum(yards_gained), TDs = sum(touchdown)) %>% 
    filter(carries > 12) %>% 
    ungroup() %>% 
    arrange(-epa_per_run) -> runners
## `summarise()` has grouped output by 'rusher_player_name'. You can override
## using the `.groups` argument.
runners
## # A tibble: 110 x 7
##    rusher_player_name posteam epa_per_run yards_per_run carries yards   TDs
##    <chr>              <chr>         <dbl>         <dbl>   <int> <dbl> <dbl>
##  1 T.Hill             NO            0.578          8.64      39   337     5
##  2 J.Brissett         CLE           0.425          4.58      31   142     2
##  3 J.Allen            BUF           0.394          5.77      53   306     2
##  4 L.Jackson          BAL           0.378          7.37      75   553     2
##  5 K.Pickett          PIT           0.335          5.16      19    98     2
##  6 J.Fields           CHI           0.326          5.58      76   424     3
##  7 P.Mahomes          KC            0.310          5.38      21   113     0
##  8 C.Hubbard          CAR           0.308          6.47      15    97     1
##  9 T.Lance            SF            0.304          4.19      16    67     0
## 10 J.Burrow           CIN           0.272          4.26      31   132     3
## # ... with 100 more rows

Now let’s get rid of the non-qbs…

rosters <- load_rosters()

rosters %>% 
    mutate(rusher_player_name = paste(substr(first_name,1,1), last_name, sep = ".")) %>% 
    select(position, team, rusher_player_name, headshot_url) -> roster1

runners %>% 
    left_join(roster1, by = c("rusher_player_name" = "rusher_player_name",
                              "posteam" = "team")) %>% #Both must be the same 
    mutate(position = if_else(rusher_player_name == "T.Hill", "QB", position)) %>% 
    filter(position == "QB") -> gg_runners

gg_runners
## # A tibble: 27 x 9
##    rusher_player_n~1 posteam epa_p~2 yards~3 carries yards   TDs posit~4 heads~5
##    <chr>             <chr>     <dbl>   <dbl>   <int> <dbl> <dbl> <chr>   <chr>  
##  1 T.Hill            NO        0.578    8.64      39   337     5 QB      https:~
##  2 J.Brissett        CLE       0.425    4.58      31   142     2 QB      https:~
##  3 J.Allen           BUF       0.394    5.77      53   306     2 QB      https:~
##  4 L.Jackson         BAL       0.378    7.37      75   553     2 QB      https:~
##  5 K.Pickett         PIT       0.335    5.16      19    98     2 QB      https:~
##  6 J.Fields          CHI       0.326    5.58      76   424     3 QB      https:~
##  7 P.Mahomes         KC        0.310    5.38      21   113     0 QB      https:~
##  8 T.Lance           SF        0.304    4.19      16    67     0 QB      https:~
##  9 J.Burrow          CIN       0.272    4.26      31   132     3 QB      https:~
## 10 D.Jones           NYG       0.267    5.67      64   363     3 QB      https:~
## # ... with 17 more rows, and abbreviated variable names 1: rusher_player_name,
## #   2: epa_per_run, 3: yards_per_run, 4: position, 5: headshot_url

Now we have a reasonable ranking of the most efficient runners

Making this a presentable table

For this, we use the gt package.

gt has a ton of capabilites, very few of them I know off hand but there are loads of resources we can pull from the internet.

Here’s the one I use most often: https://themockup.blog/posts/2020-09-04-10-table-rules-in-r/

To start

gg_runners %>% 
    gt() 
rusher_player_name posteam epa_per_run yards_per_run carries yards TDs position headshot_url
T.Hill NO 0.577800511 8.641026 39 337 5 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/pjp1aahvacigoh0sorpg
J.Brissett CLE 0.425142481 4.580645 31 142 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/z12an2n54jj6ycouwg9h
J.Allen BUF 0.393975384 5.773585 53 306 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/btfnqtymqsqgybnv4u6n
L.Jackson BAL 0.377596557 7.373333 75 553 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/gylnuxkxgm3zd4j0d0ku
K.Pickett PIT 0.334632485 5.157895 19 98 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/vbabaquryx2jw1uahay2
J.Fields CHI 0.325976183 5.578947 76 424 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/wllk6sizrqmxhcjavard
P.Mahomes KC 0.310359993 5.380952 21 113 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/vs40h82nvqaqvyephwwu
T.Lance SF 0.304378472 4.187500 16 67 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/q4iahwof4dq9xf3vqdcb
J.Burrow CIN 0.272125841 4.258065 31 132 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/pbl27kxsr5ulgxmvtvfn
D.Jones NYG 0.266698466 5.671875 64 363 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/af6nbjqa6qubnu8oi4ms
K.Murray ARI 0.217541677 5.788462 52 301 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/nkb9x92lbea6ayknzoba
K.Cousins MIN 0.157339211 3.076923 13 40 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/h9ndf9ralxifgjvot2q4
J.Hurts PHI 0.151733908 3.835443 79 303 6 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/iv04omcunr78zezpnf8t
T.Lawrence JAX 0.137825031 3.666667 27 99 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/nel2x2crpaasltsb6vwj
C.Wentz WAS 0.126443902 4.157895 19 79 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/indtay8ngoq8ubhhydbp
M.Jones NE 0.098671547 3.333333 24 80 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/hhexcwf5oe7onhlemzq9
M.Mariota ATL 0.073667942 5.090909 55 280 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/sd59iwjj69atxaqjti1r
R.Wilson DEN 0.039562910 4.520000 25 113 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/gebzv2aljujgleboe4ns
D.Carr LV 0.013295016 4.466667 15 67 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/l1eqjrajbnisr91dmlcl
J.Herbert LAC 0.005741635 2.285714 21 48 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/qknkhe4w6liovjubzmnh
J.Garoppolo SF -0.086321370 1.769231 13 23 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/fuepwtifp7wd9grpzvwb
Z.Wilson NYJ -0.095414433 2.750000 16 44 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/gr2wm3rg1bcgtpj58i7l
J.Goff DET -0.275612063 2.461538 13 32 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/s4hzfab5iglq2tmjb0xy
M.Trubisky PIT -0.333272156 2.230769 13 29 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/wx17o0eu4hffytrbdjgl
T.Tagovailoa MIA -0.360181131 2.692308 13 35 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/dyrrl1knhcxqjwuvhcwm
B.Mayfield CAR -0.381053088 2.857143 14 40 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/aztmh1hzf253ntkzjgkn
R.Tannehill TEN -0.465958769 1.526316 19 29 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/ns30g7th6gkhjaeoivgj

SO that’s maybe prettier, but not pretty yet.

Adding a header and formatting numbers

gg_runners %>% 
    gt() %>% 
     tab_header( 
        title = "Most Efficient Runners", # ...with this title
        subtitle = "Max Whalen") %>% 
    fmt_number( # A column (numeric data)
        columns = c(epa_per_run, yards_per_run), # What column variable? BOD$Time
        decimals = 2 # With two decimal places
    )
Most Efficient Runners
Max Whalen
rusher_player_name posteam epa_per_run yards_per_run carries yards TDs position headshot_url
T.Hill NO 0.58 8.64 39 337 5 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/pjp1aahvacigoh0sorpg
J.Brissett CLE 0.43 4.58 31 142 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/z12an2n54jj6ycouwg9h
J.Allen BUF 0.39 5.77 53 306 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/btfnqtymqsqgybnv4u6n
L.Jackson BAL 0.38 7.37 75 553 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/gylnuxkxgm3zd4j0d0ku
K.Pickett PIT 0.33 5.16 19 98 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/vbabaquryx2jw1uahay2
J.Fields CHI 0.33 5.58 76 424 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/wllk6sizrqmxhcjavard
P.Mahomes KC 0.31 5.38 21 113 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/vs40h82nvqaqvyephwwu
T.Lance SF 0.30 4.19 16 67 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/q4iahwof4dq9xf3vqdcb
J.Burrow CIN 0.27 4.26 31 132 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/pbl27kxsr5ulgxmvtvfn
D.Jones NYG 0.27 5.67 64 363 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/af6nbjqa6qubnu8oi4ms
K.Murray ARI 0.22 5.79 52 301 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/nkb9x92lbea6ayknzoba
K.Cousins MIN 0.16 3.08 13 40 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/h9ndf9ralxifgjvot2q4
J.Hurts PHI 0.15 3.84 79 303 6 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/iv04omcunr78zezpnf8t
T.Lawrence JAX 0.14 3.67 27 99 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/nel2x2crpaasltsb6vwj
C.Wentz WAS 0.13 4.16 19 79 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/indtay8ngoq8ubhhydbp
M.Jones NE 0.10 3.33 24 80 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/hhexcwf5oe7onhlemzq9
M.Mariota ATL 0.07 5.09 55 280 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/sd59iwjj69atxaqjti1r
R.Wilson DEN 0.04 4.52 25 113 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/gebzv2aljujgleboe4ns
D.Carr LV 0.01 4.47 15 67 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/l1eqjrajbnisr91dmlcl
J.Herbert LAC 0.01 2.29 21 48 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/qknkhe4w6liovjubzmnh
J.Garoppolo SF −0.09 1.77 13 23 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/fuepwtifp7wd9grpzvwb
Z.Wilson NYJ −0.10 2.75 16 44 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/gr2wm3rg1bcgtpj58i7l
J.Goff DET −0.28 2.46 13 32 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/s4hzfab5iglq2tmjb0xy
M.Trubisky PIT −0.33 2.23 13 29 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/wx17o0eu4hffytrbdjgl
T.Tagovailoa MIA −0.36 2.69 13 35 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/dyrrl1knhcxqjwuvhcwm
B.Mayfield CAR −0.38 2.86 14 40 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/aztmh1hzf253ntkzjgkn
R.Tannehill TEN −0.47 1.53 19 29 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/ns30g7th6gkhjaeoivgj

Better, but the columns are ugly

Changing Column Names and Formatting

gg_runners %>% 
    gt() %>% 
     tab_header( 
        title = "Most Efficient Runners", # ...with this title
        subtitle = "Max Whalen") %>% 
    fmt_number( # A column (numeric data)
        columns = c(epa_per_run, yards_per_run), # What column variable? BOD$Time
        decimals = 2 # With two decimal places
    ) %>% 
    cols_label(rusher_player_name = "Rusher", epa_per_run = "EPA/Carry", posteam = "Team",
           yards_per_run = "Yards/Carry", carries = "Carries", yards = "Yards") %>% 
    cols_align(
        align = "center",
        columns = colnames(runners) # Selecting all columns
    )
Most Efficient Runners
Max Whalen
Rusher Team EPA/Carry Yards/Carry Carries Yards TDs position headshot_url
T.Hill NO 0.58 8.64 39 337 5 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/pjp1aahvacigoh0sorpg
J.Brissett CLE 0.43 4.58 31 142 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/z12an2n54jj6ycouwg9h
J.Allen BUF 0.39 5.77 53 306 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/btfnqtymqsqgybnv4u6n
L.Jackson BAL 0.38 7.37 75 553 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/gylnuxkxgm3zd4j0d0ku
K.Pickett PIT 0.33 5.16 19 98 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/vbabaquryx2jw1uahay2
J.Fields CHI 0.33 5.58 76 424 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/wllk6sizrqmxhcjavard
P.Mahomes KC 0.31 5.38 21 113 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/vs40h82nvqaqvyephwwu
T.Lance SF 0.30 4.19 16 67 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/q4iahwof4dq9xf3vqdcb
J.Burrow CIN 0.27 4.26 31 132 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/pbl27kxsr5ulgxmvtvfn
D.Jones NYG 0.27 5.67 64 363 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/af6nbjqa6qubnu8oi4ms
K.Murray ARI 0.22 5.79 52 301 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/nkb9x92lbea6ayknzoba
K.Cousins MIN 0.16 3.08 13 40 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/h9ndf9ralxifgjvot2q4
J.Hurts PHI 0.15 3.84 79 303 6 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/iv04omcunr78zezpnf8t
T.Lawrence JAX 0.14 3.67 27 99 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/nel2x2crpaasltsb6vwj
C.Wentz WAS 0.13 4.16 19 79 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/indtay8ngoq8ubhhydbp
M.Jones NE 0.10 3.33 24 80 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/hhexcwf5oe7onhlemzq9
M.Mariota ATL 0.07 5.09 55 280 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/sd59iwjj69atxaqjti1r
R.Wilson DEN 0.04 4.52 25 113 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/gebzv2aljujgleboe4ns
D.Carr LV 0.01 4.47 15 67 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/l1eqjrajbnisr91dmlcl
J.Herbert LAC 0.01 2.29 21 48 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/qknkhe4w6liovjubzmnh
J.Garoppolo SF −0.09 1.77 13 23 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/fuepwtifp7wd9grpzvwb
Z.Wilson NYJ −0.10 2.75 16 44 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/gr2wm3rg1bcgtpj58i7l
J.Goff DET −0.28 2.46 13 32 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/s4hzfab5iglq2tmjb0xy
M.Trubisky PIT −0.33 2.23 13 29 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/wx17o0eu4hffytrbdjgl
T.Tagovailoa MIA −0.36 2.69 13 35 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/dyrrl1knhcxqjwuvhcwm
B.Mayfield CAR −0.38 2.86 14 40 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/aztmh1hzf253ntkzjgkn
R.Tannehill TEN −0.47 1.53 19 29 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/ns30g7th6gkhjaeoivgj

This looks presentable; now let’s change the formatting within the columns

gg_runners %>% 
    gt() %>% 
     tab_header( 
        title = "Most Efficient Runners", # ...with this title
        subtitle = "Max Whalen") %>% 
    fmt_number( # A column (numeric data)
        columns = c(epa_per_run, yards_per_run), # What column variable? BOD$Time
        decimals = 2 # With two decimal places
    ) %>% 
    cols_label(rusher_player_name = "Rusher", epa_per_run = "EPA/Carry", posteam = "Team",
           yards_per_run = "Yards/Carry", carries = "Carries", yards = "Yards") %>% 
    cols_align(
        align = "center",
        columns = colnames(runners) # Selecting all columns
    ) %>% 
    opt_table_outline() %>% # Creating outline
    data_color( # Update cell colors...
        columns = c(TDs), # ...for mean_len column
        colors = scales::col_numeric(
            palette = c(
                "#FF6962", "white","#5FA777" ), # Overboard colors! 
            domain = c(0,max(TDs)) # Column scale endpoints
        )
    ) %>%
    data_color( # Update cell colors...
        columns = c(yards_per_run), # ...for mean_len column
        colors = scales::col_numeric(
            palette = c(
                "#FF6962", "white","#5FA777" ), # Overboard colors! 
            domain = c(0,max(yards_per_run)) # Column scale endpoints
        )
    ) %>% 
    data_color( # Update cell colors...
        columns = c(epa_per_run), # ...for mean_len column
        colors = scales::col_numeric(
            palette = c(
                "#FF6962", "white","#5FA777" ), # Overboard colors! 
            domain = c(min(epa_per_run),max(epa_per_run)) # Column scale endpoints
        )
    ) %>% 
    data_color( # Update cell colors...
        columns = c(carries), # ...for mean_len column
        colors = scales::col_numeric(
            palette = c(
                "#FF6962", "white","#5FA777" ), # Overboard colors! 
            domain = c(0,max(carries)) # Column scale endpoints
        )
    ) %>% 
    data_color( # Update cell colors...
        columns = c(yards), # ...for mean_len column
        colors = scales::col_numeric(
            palette = c(
                "#FF6962", "white","#5FA777" ), # Overboard colors! 
            domain = c(0,max(yards)) # Column scale endpoints
        )
    )
Most Efficient Runners
Max Whalen
Rusher Team EPA/Carry Yards/Carry Carries Yards TDs position headshot_url
T.Hill NO 0.58 8.64 39 337 5 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/pjp1aahvacigoh0sorpg
J.Brissett CLE 0.43 4.58 31 142 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/z12an2n54jj6ycouwg9h
J.Allen BUF 0.39 5.77 53 306 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/btfnqtymqsqgybnv4u6n
L.Jackson BAL 0.38 7.37 75 553 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/gylnuxkxgm3zd4j0d0ku
K.Pickett PIT 0.33 5.16 19 98 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/vbabaquryx2jw1uahay2
J.Fields CHI 0.33 5.58 76 424 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/wllk6sizrqmxhcjavard
P.Mahomes KC 0.31 5.38 21 113 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/vs40h82nvqaqvyephwwu
T.Lance SF 0.30 4.19 16 67 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/q4iahwof4dq9xf3vqdcb
J.Burrow CIN 0.27 4.26 31 132 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/pbl27kxsr5ulgxmvtvfn
D.Jones NYG 0.27 5.67 64 363 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/af6nbjqa6qubnu8oi4ms
K.Murray ARI 0.22 5.79 52 301 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/nkb9x92lbea6ayknzoba
K.Cousins MIN 0.16 3.08 13 40 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/h9ndf9ralxifgjvot2q4
J.Hurts PHI 0.15 3.84 79 303 6 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/iv04omcunr78zezpnf8t
T.Lawrence JAX 0.14 3.67 27 99 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/nel2x2crpaasltsb6vwj
C.Wentz WAS 0.13 4.16 19 79 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/indtay8ngoq8ubhhydbp
M.Jones NE 0.10 3.33 24 80 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/hhexcwf5oe7onhlemzq9
M.Mariota ATL 0.07 5.09 55 280 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/sd59iwjj69atxaqjti1r
R.Wilson DEN 0.04 4.52 25 113 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/gebzv2aljujgleboe4ns
D.Carr LV 0.01 4.47 15 67 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/l1eqjrajbnisr91dmlcl
J.Herbert LAC 0.01 2.29 21 48 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/qknkhe4w6liovjubzmnh
J.Garoppolo SF −0.09 1.77 13 23 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/fuepwtifp7wd9grpzvwb
Z.Wilson NYJ −0.10 2.75 16 44 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/gr2wm3rg1bcgtpj58i7l
J.Goff DET −0.28 2.46 13 32 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/s4hzfab5iglq2tmjb0xy
M.Trubisky PIT −0.33 2.23 13 29 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/wx17o0eu4hffytrbdjgl
T.Tagovailoa MIA −0.36 2.69 13 35 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/dyrrl1knhcxqjwuvhcwm
B.Mayfield CAR −0.38 2.86 14 40 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/aztmh1hzf253ntkzjgkn
R.Tannehill TEN −0.47 1.53 19 29 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/ns30g7th6gkhjaeoivgj

Adding Pictures to our table

For this we need to use a join…

First let’s look at the table we need to get our images

head(teams_colors_logos)
## # A tibble: 6 x 15
##   team_abbr team_name    team_id team_~1 team_~2 team_~3 team_~4 team_~5 team_~6
##   <chr>     <chr>        <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
## 1 ARI       Arizona Car~ 3800    Cardin~ NFC     NFC We~ #97233F #000000 #ffb612
## 2 ATL       Atlanta Fal~ 0200    Falcons NFC     NFC So~ #A71930 #000000 #a5acaf
## 3 BAL       Baltimore R~ 0325    Ravens  AFC     AFC No~ #241773 #9E7C0C #9e7c0c
## 4 BUF       Buffalo Bil~ 0610    Bills   AFC     AFC Ea~ #00338D #C60C30 #0c2e82
## 5 CAR       Carolina Pa~ 0750    Panthe~ NFC     NFC So~ #0085CA #000000 #bfc0bf
## 6 CHI       Chicago Bea~ 0810    Bears   NFC     NFC No~ #0B162A #C83803 #0b162a
## # ... with 6 more variables: team_color4 <chr>, team_logo_wikipedia <chr>,
## #   team_logo_espn <chr>, team_wordmark <chr>, team_conference_logo <chr>,
## #   team_league_logo <chr>, and abbreviated variable names 1: team_nick,
## #   2: team_conf, 3: team_division, 4: team_color, 5: team_color2,
## #   6: team_color3

Here, we only need the team logo for our purposes

teams_colors_logos %>% 
    select(posteam = team_abbr, logo = team_logo_espn) -> logos # Rename w/select
gg_runners %>% 
    left_join(logos) %>% 
    mutate(posteam = logo) %>% # Switching values in these columns
    select(-logo) %>% # '-' drops entire column
    gt() %>% 
     tab_header( 
        title = "Most Efficient Runners", # ...with this title
        subtitle = "Max Whalen") %>% 
    fmt_number( # A column (numeric data)
        columns = c(epa_per_run, yards_per_run), # What column variable? BOD$Time
        decimals = 2 # With two decimal places
    ) %>% 
    cols_label(rusher_player_name = "Rusher", epa_per_run = "EPA/Carry", posteam = "Team",
           yards_per_run = "Yards/Carry", carries = "Carries", yards = "Yards") %>% 
    cols_align(
        align = "center",
        columns = colnames(runners) # Selecting all columns
    ) %>% 
    opt_table_outline() %>% # Creating outline
    gt_img_rows(columns = posteam, height = 25) %>%
    data_color( # Update cell colors...
        columns = c(TDs), # ...for mean_len column
        colors = scales::col_numeric(
            palette = c(
                "#FF6962", "white","#5FA777" ), # Overboard colors! 
            domain = c(0,max(TDs)) # Column scale endpoints
        )
    ) %>%
    data_color( # Update cell colors...
        columns = c(yards_per_run), # ...for mean_len column
        colors = scales::col_numeric(
            palette = c(
                "#FF6962", "white","#5FA777" ), # Overboard colors! 
            domain = c(0,max(yards_per_run)) # Column scale endpoints
        )
    ) %>% 
    data_color( # Update cell colors...
        columns = c(epa_per_run), # ...for mean_len column
        colors = scales::col_numeric(
            palette = c(
                "#FF6962", "white","#5FA777" ), # Overboard colors! 
            domain = c(min(epa_per_run),max(epa_per_run)) # Column scale endpoints
        )
    ) %>% 
    data_color( # Update cell colors...
        columns = c(carries), # ...for mean_len column
        colors = scales::col_numeric(
            palette = c(
                "#FF6962", "white","#5FA777" ), # Overboard colors! 
            domain = c(0,max(carries)) # Column scale endpoints
        )
    ) %>% 
    data_color( # Update cell colors...
        columns = c(yards), # ...for mean_len column
        colors = scales::col_numeric(
            palette = c(
                "#FF6962", "white","#5FA777" ), # Overboard colors! 
            domain = c(0,max(yards)) # Column scale endpoints
        )
    ) 
## Joining, by = "posteam"
Most Efficient Runners
Max Whalen
Rusher Team EPA/Carry Yards/Carry Carries Yards TDs position headshot_url
T.Hill 0.58 8.64 39 337 5 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/pjp1aahvacigoh0sorpg
J.Brissett 0.43 4.58 31 142 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/z12an2n54jj6ycouwg9h
J.Allen 0.39 5.77 53 306 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/btfnqtymqsqgybnv4u6n
L.Jackson 0.38 7.37 75 553 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/gylnuxkxgm3zd4j0d0ku
K.Pickett 0.33 5.16 19 98 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/vbabaquryx2jw1uahay2
J.Fields 0.33 5.58 76 424 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/wllk6sizrqmxhcjavard
P.Mahomes 0.31 5.38 21 113 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/vs40h82nvqaqvyephwwu
T.Lance 0.30 4.19 16 67 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/q4iahwof4dq9xf3vqdcb
J.Burrow 0.27 4.26 31 132 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/pbl27kxsr5ulgxmvtvfn
D.Jones 0.27 5.67 64 363 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/af6nbjqa6qubnu8oi4ms
K.Murray 0.22 5.79 52 301 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/nkb9x92lbea6ayknzoba
K.Cousins 0.16 3.08 13 40 2 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/h9ndf9ralxifgjvot2q4
J.Hurts 0.15 3.84 79 303 6 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/iv04omcunr78zezpnf8t
T.Lawrence 0.14 3.67 27 99 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/nel2x2crpaasltsb6vwj
C.Wentz 0.13 4.16 19 79 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/indtay8ngoq8ubhhydbp
M.Jones 0.10 3.33 24 80 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/hhexcwf5oe7onhlemzq9
M.Mariota 0.07 5.09 55 280 3 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/sd59iwjj69atxaqjti1r
R.Wilson 0.04 4.52 25 113 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/gebzv2aljujgleboe4ns
D.Carr 0.01 4.47 15 67 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/l1eqjrajbnisr91dmlcl
J.Herbert 0.01 2.29 21 48 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/qknkhe4w6liovjubzmnh
J.Garoppolo −0.09 1.77 13 23 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/fuepwtifp7wd9grpzvwb
Z.Wilson −0.10 2.75 16 44 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/gr2wm3rg1bcgtpj58i7l
J.Goff −0.28 2.46 13 32 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/s4hzfab5iglq2tmjb0xy
M.Trubisky −0.33 2.23 13 29 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/wx17o0eu4hffytrbdjgl
T.Tagovailoa −0.36 2.69 13 35 0 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/dyrrl1knhcxqjwuvhcwm
B.Mayfield −0.38 2.86 14 40 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/aztmh1hzf253ntkzjgkn
R.Tannehill −0.47 1.53 19 29 1 QB https://static.www.nfl.com/image/private/f_auto,q_auto/league/ns30g7th6gkhjaeoivgj

ggplot

Starting with a scatter plot

Let’s continue with the RB Data

gg_runners %>% 
    left_join(logos) -> gg_data
## Joining, by = "posteam"

Creating the plot

First, pipe the dataframe into the function. Most of the work happens in the aes() function.

gg_data %>% 
    ggplot(aes(x = yards_per_run, y = epa_per_run)) +
    geom_point()

There we have a simple plot. One that doesn’t tell much of a story though

Adding Labels + more

Note we use a ‘+’ instead of a ‘%>%’ within the ggplot world

gg_data %>% 
    ggplot(aes(x = yards_per_run, y = epa_per_run, size = carries, 
               label = rusher_player_name)) +
    geom_point() +
    geom_text(nudge_y = 0.04)

Better but still ugly and pretty useless. There’s a great package to resolve our labeling issue though

library(ggrepel)

gg_data %>% 
    ggplot(aes(x = yards_per_run, y = epa_per_run, size = carries, 
               label = rusher_player_name)) +
    geom_point() +
    geom_text_repel() # replace w/repel function

So it’s still noisy, but this is a lot better. Now let’s clean up the labels

gg_data %>% 
    ggplot(aes(x = yards_per_run, y = epa_per_run, size = carries, 
               label = rusher_player_name)) +
    geom_point() +
    geom_text_repel() +
     labs( x = "Yards/Carry",
           y = "EPA/Carry",
           title = "Who are the NFL's best runners?",
           subtitle = "EPA is used measure the value of a run in the context
            of the game")

ggthemes

We can make our visualizations a lot cleaner, without much work, using the ggthemes package.

library(ggthemes)

It’s worth exploring the different options, but I always use the light theme or fivethirtyeight. Here’s a look…

Light theme

gg_data %>% 
    ggplot(aes(x = yards_per_run, y = epa_per_run, size = carries, 
               label = rusher_player_name)) +
    geom_point() +
    geom_text_repel() +
     labs( x = "Yards/Carry",
           y = "EPA/Carry",
           title = "Who are the NFL's best runners?",
           subtitle = "EPA is used measure the value of a run in the context
            of the game") +
    theme_light()

538 Theme

gg_data %>% 
    ggplot(aes(x = yards_per_run, y = epa_per_run, size = carries, 
               label = rusher_player_name)) +
    geom_point() +
    geom_text_repel() +
     labs( x = "Yards/Carry",
           y = "EPA/Carry",
           title = "Who are the NFL's best runners?",
           subtitle = "EPA is used measure the value of a run in the context
            of the game") +
    theme_fivethirtyeight()

gg_image

Here’s a cool package that helps us replace these black dots with something a little more visually appealling.

library(ggimage)
gg_data %>% 
    ggplot(aes(x = yards_per_run, y = epa_per_run, label = rusher_player_name)) +
    geom_image(aes(image=headshot_url, size = I(0.1))) + # UPDATE
    geom_text_repel() +
     labs( x = "Yards/Carry",
           y = "EPA/Carry",
           title = "Who are the NFL's best running QBs?",
           subtitle = "EPA is used measure the value of a run in the context
            of the game") +
    theme_fivethirtyeight()

gghighlight

This is a super helpful package for highlighting certain features of the data you want to highlight

# gg_data %>% 
#     ggplot(aes(x = yards_per_run, y = epa_per_run, label = rusher_player_name)) +
#     geom_image(aes(image=logo, size = I(0.1))) +
#     gghighlight(carries >= 25) +  # UPDATE
#     geom_text_repel() +
#      labs( x = "Yards/Carry",
#            y = "EPA/Carry",
#            title = "Who are the NFL's best runners?",
#            subtitle = "EPA is used measure the value of a run in the context
#             of the game") +
#     theme_fivethirtyeight()