Module 5: more dplyr

We’ll be picking up where we left off from the previous markdown to create the graph seen in Brightspace:

Game Data Set

We’ll be creating the same game data we’ve been looking at for the last few examples:

nfl_teams <- 
  pbp |> 
  # Keeping just regular season games:
  filter(season_type == 'REG') |> 
  # Keeping the specified columns
  dplyr::select(game_id, season, game_date, home_team, 
                away_team, home_score, away_score) |> 
  # Keeping each distinct row, dropping any duplicates
  distinct() |> 
  # Moving the home_team and away_team into one column
  pivot_longer(
    cols = home_team:away_team,
    names_to = 'location',
    values_to = 'team'
  ) |> 
  # Dropping the '_team' from location and finding the score_diff for each team
  mutate(
    location = str_remove(location, '_team'),
    point_diff = if_else(location == 'home', home_score - away_score, away_score - home_score)
  ) |> 
  # Calculating the total point differential per team by year
  summarize(
    .by = c(season, team),
    total_point_diff = sum(point_diff) 
  ) |> 
  arrange(season, team) 

nfl_teams

## # A tibble: 320 × 3
##    season team  total_point_diff
##     <int> <chr>            <int>
##  1   2016 ARI                 56
##  2   2016 ATL                134
##  3   2016 BAL                 22
##  4   2016 BUF                 21
##  5   2016 CAR                -33
##  6   2016 CHI               -120
##  7   2016 CIN                 10
##  8   2016 CLE               -188
##  9   2016 DAL                115
## 10   2016 DEN                 36
## # ℹ 310 more rows

Removing pbp and freeing up the memory:

rm(pbp); gc()

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  2652059 141.7    5906554  315.5   3547180  189.5
## Vcells 62065971 473.6  373084051 2846.5 466239931 3557.2

In order to make the graph in Brightspace, we need to add two columns (really three, but we’ll see that later…):

team color: The color of the lines based on each team’s color.
team logo: a URL to an image of the team’s logo.

How can we do that?

If we use mutate(), we’d have to manually enter vectors that are 320 elements long each. That would take quite a while to do, in addition to having to find each team’s primary color. So what do we do instead?

Thankfully, the nflfastR package has a build in data frame that has team level info:

data(teams_colors_logos)
teams_colors_logos

## # A tibble: 36 × 15
##    team_abbr team_name      team_id team_nick team_conf team_division team_color
##    <chr>     <chr>          <chr>   <chr>     <chr>     <chr>         <chr>     
##  1 ARI       Arizona Cardi… 3800    Cardinals NFC       NFC West      #97233F   
##  2 ATL       Atlanta Falco… 0200    Falcons   NFC       NFC South     #A71930   
##  3 BAL       Baltimore Rav… 0325    Ravens    AFC       AFC North     #241773   
##  4 BUF       Buffalo Bills  0610    Bills     AFC       AFC East      #00338D   
##  5 CAR       Carolina Pant… 0750    Panthers  NFC       NFC South     #0085CA   
##  6 CHI       Chicago Bears  0810    Bears     NFC       NFC North     #0B162A   
##  7 CIN       Cincinnati Be… 0920    Bengals   AFC       AFC North     #FB4F14   
##  8 CLE       Cleveland Bro… 1050    Browns    AFC       AFC North     #FF3C00   
##  9 DAL       Dallas Cowboys 1200    Cowboys   NFC       NFC East      #002244   
## 10 DEN       Denver Broncos 1400    Broncos   AFC       AFC West      #002244   
## # ℹ 26 more rows
## # ℹ 8 more variables: team_color2 <chr>, team_color3 <chr>, team_color4 <chr>,
## #   team_logo_wikipedia <chr>, team_logo_espn <chr>, team_wordmark <chr>,
## #   team_conference_logo <chr>, team_league_logo <chr>

Joining Data Sets

We have all the info we need, but unfortunately it is in two different data frames :(

How can we combine the two data sets?

We need someway of merging the data sets together so that in nfl_teams, every row where team says ‘NE’ pulls the correct row of teams_colors_logos and adds its information to our current data set.

Key columns

The first task is to identify the ID columns in the two data sets that tells us which rows belong together, often called a key column.

What columns should we use as the key column from each data set?

Types of joins

There are four types of joins, depending on how you want to merge the data:

inner_join(): Keeps rows where the the value in the key columns appear in BOTH data sets.
full_join(): Keeps all the rows in both data sets, even in the value in a key column only appears in one of the two data sets.
left_join(): Keeps all the rows in the ‘left’ data set and adds the information from the right data set for matching values of the key column
right_join(): Like left_join(), but for the right data set instead.

Left and right join

Generally, when using left_join() or right_join(), the corresponding data set (left for left_join()) is called the master list while the other data set (right for left_join()) is called the reference list.

For these types of joins, it is very common for the master list to have repeat values in the key column.

For our NFL example, each team in the team column appears 10 times.

However, the values in the key column of the reference list should only occur once, otherwise there will be duplicates if a value in the key column appears multiple times (see the slides for an example).

If ‘LAR’ appears twice in teams_colors_logos, then each row that has ‘LAR’ in nfl_teams will appear twice for each season instead of only once!

We can check how many duplicates there are using count() and arrange()

teams_colors_logos |> 
  # Counting how many times each key value appears
  count(team_abbr) |> 
  # Arranging from highest to lowest count
  arrange(-n)

## # A tibble: 36 × 2
##    team_abbr     n
##    <chr>     <int>
##  1 ARI           1
##  2 ATL           1
##  3 BAL           1
##  4 BUF           1
##  5 CAR           1
##  6 CHI           1
##  7 CIN           1
##  8 CLE           1
##  9 DAL           1
## 10 DEN           1
## # ℹ 26 more rows

As long as the largest count is 1, that means there aren’t any duplicates!

Since we want to add the info in teams_colors_logos to nfl_teams, we can use left_join(), which has three arguments:

x = the left data set
y = the right data set
by = the key column(s)

If the key column has the same name: by = '{colname}'
If the key column has different names in the left set vs right: by = c('{col_left}' = '{col_right}')
If you need multiple columns to uniquely ID each row, you can use by = c('{col1}, {col2}, ...)

What are the key columns for our data set?

nfl_teams:________________

teams_colors_logos:___________________

Combine the two data sets together and call it nfl_teams2

nfl_teams2 <- 
  left_join(
    x = nfl_teams,
    y = teams_colors_logos,
    by = c('team' = 'team_abbr')
  )


nfl_teams2

## # A tibble: 320 × 17
##    season team  total_point_diff team_name          team_id team_nick team_conf
##     <int> <chr>            <int> <chr>              <chr>   <chr>     <chr>    
##  1   2016 ARI                 56 Arizona Cardinals  3800    Cardinals NFC      
##  2   2016 ATL                134 Atlanta Falcons    0200    Falcons   NFC      
##  3   2016 BAL                 22 Baltimore Ravens   0325    Ravens    AFC      
##  4   2016 BUF                 21 Buffalo Bills      0610    Bills     AFC      
##  5   2016 CAR                -33 Carolina Panthers  0750    Panthers  NFC      
##  6   2016 CHI               -120 Chicago Bears      0810    Bears     NFC      
##  7   2016 CIN                 10 Cincinnati Bengals 0920    Bengals   AFC      
##  8   2016 CLE               -188 Cleveland Browns   1050    Browns    AFC      
##  9   2016 DAL                115 Dallas Cowboys     1200    Cowboys   NFC      
## 10   2016 DEN                 36 Denver Broncos     1400    Broncos   AFC      
## # ℹ 310 more rows
## # ℹ 10 more variables: team_division <chr>, team_color <chr>,
## #   team_color2 <chr>, team_color3 <chr>, team_color4 <chr>,
## #   team_logo_wikipedia <chr>, team_logo_espn <chr>, team_wordmark <chr>,
## #   team_conference_logo <chr>, team_league_logo <chr>

AFCE data set

Now that we’ve joined the two data sets together, we can create AFCE_point_diff using nfl_teams2 a little easier than team == c('NE', 'NYJ', 'BUF', 'MIA'). The merged data set has a column called team_division that has the division of each team (like ‘AFC East’)

Create the AFCE_point_diff data set below:

AFCE_point_diff <- 
  nfl_teams2 |> 
  filter(team_division == 'AFC East')
  

AFCE_point_diff

## # A tibble: 40 × 17
##    season team  total_point_diff team_name           team_id team_nick team_conf
##     <int> <chr>            <int> <chr>               <chr>   <chr>     <chr>    
##  1   2016 BUF                 21 Buffalo Bills       0610    Bills     AFC      
##  2   2016 MIA                -17 Miami Dolphins      2700    Dolphins  AFC      
##  3   2016 NE                 191 New England Patrio… 3200    Patriots  AFC      
##  4   2016 NYJ               -134 New York Jets       3430    Jets      AFC      
##  5   2017 BUF                -57 Buffalo Bills       0610    Bills     AFC      
##  6   2017 MIA               -112 Miami Dolphins      2700    Dolphins  AFC      
##  7   2017 NE                 162 New England Patrio… 3200    Patriots  AFC      
##  8   2017 NYJ                -84 New York Jets       3430    Jets      AFC      
##  9   2018 BUF               -105 Buffalo Bills       0610    Bills     AFC      
## 10   2018 MIA               -114 Miami Dolphins      2700    Dolphins  AFC      
## # ℹ 30 more rows
## # ℹ 10 more variables: team_division <chr>, team_color <chr>,
## #   team_color2 <chr>, team_color3 <chr>, team_color4 <chr>,
## #   team_logo_wikipedia <chr>, team_logo_espn <chr>, team_wordmark <chr>,
## #   team_conference_logo <chr>, team_league_logo <chr>

ggplot(
  data = AFCE_point_diff,
  # Set up the global aesthetics here
  mapping = aes(
    x = season, 
    y = total_point_diff
  )
) + 
  # Adding the lines
  geom_line(
    # Add color for each of the lines
    mapping = aes(
      color = if_else(team == 'NE', team_color2, team_color)
    ),
    linewidth = 1
  ) + 
  # Adding the logos to the end of the line
  ggimage::geom_image(
    # Keeping only the rows for the last season
    data = AFCE_point_diff |> filter(season == max(season)),
    mapping = aes(image = team_logo_espn),
    nudge_x = 0.3,
    size = 0.1
  ) +
  # Horizontal line at 0
  geom_hline(
    yintercept = 0,
    linetype = 'dashed',
    color = 'grey70'
  ) +
  # Changing x-y labels
  labs(
    x = 'Season',
    y = 'Points Scored - Points Allowed'
  ) + 
  # changing the theme
  theme_bw() +
  # Removing the space on the left of the graph and adding a little to the right
  scale_x_continuous(
    expand = c(0, 0, 0.075, 0),
    breaks = unique(AFCE_point_diff$season),
    minor_breaks = NULL
  ) + 
  scale_color_identity()

And since we have the division for each team, we can facet by division instead of filter for just one:

ggplot(
  data = nfl_teams2,
  # Set up the global aesthetics here
  mapping = aes(
    x = factor(season),
    y = total_point_diff
  )
) + 
  geom_line(
    mapping = aes(
      color = team_color,
      group = team
    ),
    linewidth = 1
  ) + 
  # geom_text(
  #   data = AFCE_point_diff |> filter(season == max(season)),
  #   mapping = aes(label = team),
  #   nudge_x = 0.3
  # ) + 
  ggimage::geom_image(
    data = nfl_teams2 |> filter(season == max(season)),
    mapping = aes(image = team_logo_espn),
    nudge_x = 0.3,
    size = 0.15
  ) +
  geom_hline(
    yintercept = 0,
    linetype = 'dashed',
    color = 'grey70'
  ) +
  labs(
    x = 'Season',
    y = 'Points Scored - Points Allowed'
  ) +
  # Different line graph for each division
  facet_wrap(
    facets = vars(team_division),
    ncol = 2
  ) +
  theme_bw() + 
  theme(legend.position = 'none') +
  # Using the colors provided instead of the default group colors
  scale_color_identity() + 
  scale_x_discrete(
    expand = c(0, 0, 0.075, 0)
  )

Module 5: more dplyr - joins