We’ll be picking up where we left off from the previous markdown to create the graph seen in Brightspace:
We’ll be creating the same game data we’ve been looking at for the last few examples:
nfl_teams <-
pbp |>
# Keeping just regular season games:
filter(season_type == 'REG') |>
# Keeping the specified columns
dplyr::select(game_id, season, game_date, home_team,
away_team, home_score, away_score) |>
# Keeping each distinct row, dropping any duplicates
distinct() |>
# Moving the home_team and away_team into one column
pivot_longer(
cols = home_team:away_team,
names_to = 'location',
values_to = 'team'
) |>
# Dropping the '_team' from location and finding the score_diff for each team
mutate(
location = str_remove(location, '_team'),
point_diff = if_else(location == 'home', home_score - away_score, away_score - home_score)
) |>
# Calculating the total point differential per team by year
summarize(
.by = c(season, team),
total_point_diff = sum(point_diff)
) |>
arrange(season, team)
nfl_teams
## # A tibble: 320 × 3
## season team total_point_diff
## <int> <chr> <int>
## 1 2016 ARI 56
## 2 2016 ATL 134
## 3 2016 BAL 22
## 4 2016 BUF 21
## 5 2016 CAR -33
## 6 2016 CHI -120
## 7 2016 CIN 10
## 8 2016 CLE -188
## 9 2016 DAL 115
## 10 2016 DEN 36
## # ℹ 310 more rows
Removing pbp and freeing up the memory:
rm(pbp); gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2652059 141.7 5906554 315.5 3547180 189.5
## Vcells 62065971 473.6 373084051 2846.5 466239931 3557.2
In order to make the graph in Brightspace, we need to add two columns (really three, but we’ll see that later…):
How can we do that?
If we use mutate(), we’d have to manually enter vectors
that are 320 elements long each. That would take quite a while to do, in
addition to having to find each team’s primary color. So what do we do
instead?
Thankfully, the nflfastR package has a build in data
frame that has team level info:
data(teams_colors_logos)
teams_colors_logos
## # A tibble: 36 × 15
## team_abbr team_name team_id team_nick team_conf team_division team_color
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ARI Arizona Cardi… 3800 Cardinals NFC NFC West #97233F
## 2 ATL Atlanta Falco… 0200 Falcons NFC NFC South #A71930
## 3 BAL Baltimore Rav… 0325 Ravens AFC AFC North #241773
## 4 BUF Buffalo Bills 0610 Bills AFC AFC East #00338D
## 5 CAR Carolina Pant… 0750 Panthers NFC NFC South #0085CA
## 6 CHI Chicago Bears 0810 Bears NFC NFC North #0B162A
## 7 CIN Cincinnati Be… 0920 Bengals AFC AFC North #FB4F14
## 8 CLE Cleveland Bro… 1050 Browns AFC AFC North #FF3C00
## 9 DAL Dallas Cowboys 1200 Cowboys NFC NFC East #002244
## 10 DEN Denver Broncos 1400 Broncos AFC AFC West #002244
## # ℹ 26 more rows
## # ℹ 8 more variables: team_color2 <chr>, team_color3 <chr>, team_color4 <chr>,
## # team_logo_wikipedia <chr>, team_logo_espn <chr>, team_wordmark <chr>,
## # team_conference_logo <chr>, team_league_logo <chr>
We have all the info we need, but unfortunately it is in two different data frames :(
How can we combine the two data sets?
We need someway of merging the data sets together so that in
nfl_teams, every row where team says ‘NE’ pulls the correct
row of teams_colors_logos and adds its information to our
current data set.
The first task is to identify the ID columns in the two data sets that tells us which rows belong together, often called a key column.
What columns should we use as the key column from each data set?
There are four types of joins, depending on how you want to merge the data:
inner_join(): Keeps rows where the the value in the key
columns appear in BOTH data sets.full_join(): Keeps all the rows in both data sets, even
in the value in a key column only appears in one of the two data
sets.left_join(): Keeps all the rows in the ‘left’ data set
and adds the information from the right data set for matching values of
the key columnright_join(): Like left_join(), but for
the right data set instead.Generally, when using left_join() or
right_join(), the corresponding data set (left for
left_join()) is called the master list
while the other data set (right for left_join()) is called
the reference list.
For these types of joins, it is very common for the master list to have repeat values in the key column.
For our NFL example, each team in the team column
appears 10 times.
However, the values in the key column of the reference list should only occur once, otherwise there will be duplicates if a value in the key column appears multiple times (see the slides for an example).
If ‘LAR’ appears twice in teams_colors_logos, then each
row that has ‘LAR’ in nfl_teams will appear twice for each
season instead of only once!
We can check how many duplicates there are using count()
and arrange()
teams_colors_logos |>
# Counting how many times each key value appears
count(team_abbr) |>
# Arranging from highest to lowest count
arrange(-n)
## # A tibble: 36 × 2
## team_abbr n
## <chr> <int>
## 1 ARI 1
## 2 ATL 1
## 3 BAL 1
## 4 BUF 1
## 5 CAR 1
## 6 CHI 1
## 7 CIN 1
## 8 CLE 1
## 9 DAL 1
## 10 DEN 1
## # ℹ 26 more rows
As long as the largest count is 1, that means there aren’t any duplicates!
Since we want to add the info in teams_colors_logos to
nfl_teams, we can use left_join(), which has
three arguments:
x = the left data sety = the right data setby = the key column(s)by = '{colname}'by = c('{col_left}' = '{col_right}')by = c('{col1}, {col2}, ...)What are the key columns for our data set?
nfl_teams:________________
teams_colors_logos:___________________
Combine the two data sets together and call it
nfl_teams2
nfl_teams2 <-
left_join(
x = nfl_teams,
y = teams_colors_logos,
by = c('team' = 'team_abbr')
)
nfl_teams2
## # A tibble: 320 × 17
## season team total_point_diff team_name team_id team_nick team_conf
## <int> <chr> <int> <chr> <chr> <chr> <chr>
## 1 2016 ARI 56 Arizona Cardinals 3800 Cardinals NFC
## 2 2016 ATL 134 Atlanta Falcons 0200 Falcons NFC
## 3 2016 BAL 22 Baltimore Ravens 0325 Ravens AFC
## 4 2016 BUF 21 Buffalo Bills 0610 Bills AFC
## 5 2016 CAR -33 Carolina Panthers 0750 Panthers NFC
## 6 2016 CHI -120 Chicago Bears 0810 Bears NFC
## 7 2016 CIN 10 Cincinnati Bengals 0920 Bengals AFC
## 8 2016 CLE -188 Cleveland Browns 1050 Browns AFC
## 9 2016 DAL 115 Dallas Cowboys 1200 Cowboys NFC
## 10 2016 DEN 36 Denver Broncos 1400 Broncos AFC
## # ℹ 310 more rows
## # ℹ 10 more variables: team_division <chr>, team_color <chr>,
## # team_color2 <chr>, team_color3 <chr>, team_color4 <chr>,
## # team_logo_wikipedia <chr>, team_logo_espn <chr>, team_wordmark <chr>,
## # team_conference_logo <chr>, team_league_logo <chr>
Now that we’ve joined the two data sets together, we can create
AFCE_point_diff using nfl_teams2 a little
easier than team == c('NE', 'NYJ', 'BUF', 'MIA'). The
merged data set has a column called team_division that has
the division of each team (like ‘AFC East’)
Create the AFCE_point_diff data set below:
AFCE_point_diff <-
nfl_teams2 |>
filter(team_division == 'AFC East')
AFCE_point_diff
## # A tibble: 40 × 17
## season team total_point_diff team_name team_id team_nick team_conf
## <int> <chr> <int> <chr> <chr> <chr> <chr>
## 1 2016 BUF 21 Buffalo Bills 0610 Bills AFC
## 2 2016 MIA -17 Miami Dolphins 2700 Dolphins AFC
## 3 2016 NE 191 New England Patrio… 3200 Patriots AFC
## 4 2016 NYJ -134 New York Jets 3430 Jets AFC
## 5 2017 BUF -57 Buffalo Bills 0610 Bills AFC
## 6 2017 MIA -112 Miami Dolphins 2700 Dolphins AFC
## 7 2017 NE 162 New England Patrio… 3200 Patriots AFC
## 8 2017 NYJ -84 New York Jets 3430 Jets AFC
## 9 2018 BUF -105 Buffalo Bills 0610 Bills AFC
## 10 2018 MIA -114 Miami Dolphins 2700 Dolphins AFC
## # ℹ 30 more rows
## # ℹ 10 more variables: team_division <chr>, team_color <chr>,
## # team_color2 <chr>, team_color3 <chr>, team_color4 <chr>,
## # team_logo_wikipedia <chr>, team_logo_espn <chr>, team_wordmark <chr>,
## # team_conference_logo <chr>, team_league_logo <chr>
ggplot(
data = AFCE_point_diff,
# Set up the global aesthetics here
mapping = aes(
x = season,
y = total_point_diff
)
) +
# Adding the lines
geom_line(
# Add color for each of the lines
mapping = aes(
color = if_else(team == 'NE', team_color2, team_color)
),
linewidth = 1
) +
# Adding the logos to the end of the line
ggimage::geom_image(
# Keeping only the rows for the last season
data = AFCE_point_diff |> filter(season == max(season)),
mapping = aes(image = team_logo_espn),
nudge_x = 0.3,
size = 0.1
) +
# Horizontal line at 0
geom_hline(
yintercept = 0,
linetype = 'dashed',
color = 'grey70'
) +
# Changing x-y labels
labs(
x = 'Season',
y = 'Points Scored - Points Allowed'
) +
# changing the theme
theme_bw() +
# Removing the space on the left of the graph and adding a little to the right
scale_x_continuous(
expand = c(0, 0, 0.075, 0),
breaks = unique(AFCE_point_diff$season),
minor_breaks = NULL
) +
scale_color_identity()
And since we have the division for each team, we can facet by division instead of filter for just one:
ggplot(
data = nfl_teams2,
# Set up the global aesthetics here
mapping = aes(
x = factor(season),
y = total_point_diff
)
) +
geom_line(
mapping = aes(
color = team_color,
group = team
),
linewidth = 1
) +
# geom_text(
# data = AFCE_point_diff |> filter(season == max(season)),
# mapping = aes(label = team),
# nudge_x = 0.3
# ) +
ggimage::geom_image(
data = nfl_teams2 |> filter(season == max(season)),
mapping = aes(image = team_logo_espn),
nudge_x = 0.3,
size = 0.15
) +
geom_hline(
yintercept = 0,
linetype = 'dashed',
color = 'grey70'
) +
labs(
x = 'Season',
y = 'Points Scored - Points Allowed'
) +
# Different line graph for each division
facet_wrap(
facets = vars(team_division),
ncol = 2
) +
theme_bw() +
theme(legend.position = 'none') +
# Using the colors provided instead of the default group colors
scale_color_identity() +
scale_x_discrete(
expand = c(0, 0, 0.075, 0)
)