Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from TidyTuesday Project.

attendance <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-02-04/attendance.csv')

## Rows: 10846 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): team, team_name
## dbl (6): year, total, home, away, week, weekly_attendance
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

games <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-02-04/games.csv')

## Rows: 5324 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (11): week, home_team, away_team, winner, tie, day, date, home_team_nam...
## dbl   (7): year, pts_win, pts_loss, yds_win, turnovers_win, yds_loss, turnov...
## time  (1): time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Make data small

Describe the two datasets:

Data1

Columns:team city, week, weekly attendance
Rows: 10 rows

set.seed(1234)
attendance_small <- attendance %>% select(team, week, weekly_attendance) %>% sample_n(10)

attendance_small

## # A tibble: 10 × 3
##    team          week weekly_attendance
##    <chr>        <dbl>             <dbl>
##  1 Pittsburgh       6             76957
##  2 San Diego        9             70222
##  3 Cleveland        5             71239
##  4 Tampa Bay       11             77442
##  5 Indianapolis    10             66004
##  6 Tennessee       16             59621
##  7 Chicago         11             64214
##  8 Pittsburgh      16             63751
##  9 Kansas City      7             68350
## 10 Arizona          4             28109

Data 2

Columns: team city, week, time of day
Rows: 10 rows

set.seed(1234)
games_small <- games %>% select(home_team_city, week, time) %>% sample_n(10)
colnames(games_small)[1] = "team" 

games_small$week <- as.numeric(games_small$week)

games_small

## # A tibble: 10 × 3
##    team           week time  
##    <chr>         <dbl> <time>
##  1 Denver           15 16:05 
##  2 Baltimore         8 13:02 
##  3 New Orleans       3 13:03 
##  4 Cleveland        11 13:03 
##  5 Dallas           16 20:30 
##  6 San Francisco     2 16:05 
##  7 Seattle           2 16:06 
##  8 Cleveland        17 16:16 
##  9 Baltimore         9 13:02 
## 10 Baltimore        17 16:16

3. inner_join

Describe the resulting data:

Columns: team. x, week, time, team.y, weekly_attendance
Rows: 5

How is it different from the original two datasets?

*Only 5 rows compared to 10. Added columns team.y and weekly_attendance

inner_join(games_small, attendance_small, by = "week")

## # A tibble: 5 × 5
##   team.x     week time   team.y     weekly_attendance
##   <chr>     <dbl> <time> <chr>                  <dbl>
## 1 Cleveland    11 13:03  Tampa Bay              77442
## 2 Cleveland    11 13:03  Chicago                64214
## 3 Dallas       16 20:30  Tennessee              59621
## 4 Dallas       16 20:30  Pittsburgh             63751
## 5 Baltimore     9 13:02  San Diego              70222

4. left_join

Describe the resulting data:

Columns: team.x, week, time, team.y, weekly_attendance
Rows: 12

How is it different from the original two datasets?

Added columns team.y and weekly_attendance. Added two rows where attendance data had more than one match for week

left_join(games_small, attendance_small, by = "week")

## # A tibble: 12 × 5
##    team.x         week time   team.y     weekly_attendance
##    <chr>         <dbl> <time> <chr>                  <dbl>
##  1 Denver           15 16:05  <NA>                      NA
##  2 Baltimore         8 13:02  <NA>                      NA
##  3 New Orleans       3 13:03  <NA>                      NA
##  4 Cleveland        11 13:03  Tampa Bay              77442
##  5 Cleveland        11 13:03  Chicago                64214
##  6 Dallas           16 20:30  Tennessee              59621
##  7 Dallas           16 20:30  Pittsburgh             63751
##  8 San Francisco     2 16:05  <NA>                      NA
##  9 Seattle           2 16:06  <NA>                      NA
## 10 Cleveland        17 16:16  <NA>                      NA
## 11 Baltimore         9 13:02  San Diego              70222
## 12 Baltimore        17 16:16  <NA>                      NA

5. right_join

Describe the resulting data:

Columns: team.x, week, time, team.y, weekly_attendance
Rows: 10

How is it different from the original two datasets? *Added columns team.x and time. Did not add any rows becuase there were no repeat key matches in games data.

right_join(games_small, attendance_small, by = "week")

## # A tibble: 10 × 5
##    team.x     week time   team.y       weekly_attendance
##    <chr>     <dbl> <time> <chr>                    <dbl>
##  1 Cleveland    11 13:03  Tampa Bay                77442
##  2 Cleveland    11 13:03  Chicago                  64214
##  3 Dallas       16 20:30  Tennessee                59621
##  4 Dallas       16 20:30  Pittsburgh               63751
##  5 Baltimore     9 13:02  San Diego                70222
##  6 <NA>          6    NA  Pittsburgh               76957
##  7 <NA>          5    NA  Cleveland                71239
##  8 <NA>         10    NA  Indianapolis             66004
##  9 <NA>          7    NA  Kansas City              68350
## 10 <NA>          4    NA  Arizona                  28109

6. full_join

Describe the resulting data:

Columns: Team.x, week, time, team.y, weekly_attendance
Rows: 17

How is it different from the original two datasets? *Used columns from each of the datasets. Added 7 rows so that each key, from both data sets, was represented.

full_join(games_small, attendance_small, by = "week")

## # A tibble: 17 × 5
##    team.x         week time   team.y       weekly_attendance
##    <chr>         <dbl> <time> <chr>                    <dbl>
##  1 Denver           15 16:05  <NA>                        NA
##  2 Baltimore         8 13:02  <NA>                        NA
##  3 New Orleans       3 13:03  <NA>                        NA
##  4 Cleveland        11 13:03  Tampa Bay                77442
##  5 Cleveland        11 13:03  Chicago                  64214
##  6 Dallas           16 20:30  Tennessee                59621
##  7 Dallas           16 20:30  Pittsburgh               63751
##  8 San Francisco     2 16:05  <NA>                        NA
##  9 Seattle           2 16:06  <NA>                        NA
## 10 Cleveland        17 16:16  <NA>                        NA
## 11 Baltimore         9 13:02  San Diego                70222
## 12 Baltimore        17 16:16  <NA>                        NA
## 13 <NA>              6    NA  Pittsburgh               76957
## 14 <NA>              5    NA  Cleveland                71239
## 15 <NA>             10    NA  Indianapolis             66004
## 16 <NA>              7    NA  Kansas City              68350
## 17 <NA>              4    NA  Arizona                  28109

7. semi_join

Describe the resulting data:

Columns: team, week, time
Rows:3

How is it different from the original two datasets? *Only uses columns from games data. Rows come from key values (week) that are shared between the datasets, but the team and time data comes from games_small.

semi_join(games_small, attendance_small, by = "week")

## # A tibble: 3 × 3
##   team       week time  
##   <chr>     <dbl> <time>
## 1 Cleveland    11 13:03 
## 2 Dallas       16 20:30 
## 3 Baltimore     9 13:02

8. anti_join

Describe the resulting data:

Columns: team, week time
Rows: 7

How is it different from the original two datasets? *Uses columns from games data. Rows come from games data where there is no common key with attendance data.

anti_join(games_small, attendance_small, by = "week")

## # A tibble: 7 × 3
##   team           week time  
##   <chr>         <dbl> <time>
## 1 Denver           15 16:05 
## 2 Baltimore         8 13:02 
## 3 New Orleans       3 13:03 
## 4 San Francisco     2 16:05 
## 5 Seattle           2 16:06 
## 6 Cleveland        17 16:16 
## 7 Baltimore        17 16:16

Week 9: Apply it to your data 8

James Kelly

2025-03-27

1. Import your data

2. Make data small

3. inner_join

4. left_join

5. right_join

6. full_join

7. semi_join

8. anti_join