Import two related datasets from TidyTuesday Project.
attendance <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-02-04/attendance.csv')
## Rows: 10846 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): team, team_name
## dbl (6): year, total, home, away, week, weekly_attendance
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
games <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-02-04/games.csv')
## Rows: 5324 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): week, home_team, away_team, winner, tie, day, date, home_team_nam...
## dbl (7): year, pts_win, pts_loss, yds_win, turnovers_win, yds_loss, turnov...
## time (1): time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Describe the two datasets:
Data1
set.seed(1234)
attendance_small <- attendance %>% select(team, week, weekly_attendance) %>% sample_n(10)
attendance_small
## # A tibble: 10 × 3
## team week weekly_attendance
## <chr> <dbl> <dbl>
## 1 Pittsburgh 6 76957
## 2 San Diego 9 70222
## 3 Cleveland 5 71239
## 4 Tampa Bay 11 77442
## 5 Indianapolis 10 66004
## 6 Tennessee 16 59621
## 7 Chicago 11 64214
## 8 Pittsburgh 16 63751
## 9 Kansas City 7 68350
## 10 Arizona 4 28109
Data 2
set.seed(1234)
games_small <- games %>% select(home_team_city, week, time) %>% sample_n(10)
colnames(games_small)[1] = "team"
games_small$week <- as.numeric(games_small$week)
games_small
## # A tibble: 10 × 3
## team week time
## <chr> <dbl> <time>
## 1 Denver 15 16:05
## 2 Baltimore 8 13:02
## 3 New Orleans 3 13:03
## 4 Cleveland 11 13:03
## 5 Dallas 16 20:30
## 6 San Francisco 2 16:05
## 7 Seattle 2 16:06
## 8 Cleveland 17 16:16
## 9 Baltimore 9 13:02
## 10 Baltimore 17 16:16
Describe the resulting data:
How is it different from the original two datasets?
*Only 5 rows compared to 10. Added columns team.y and weekly_attendance
inner_join(games_small, attendance_small, by = "week")
## # A tibble: 5 × 5
## team.x week time team.y weekly_attendance
## <chr> <dbl> <time> <chr> <dbl>
## 1 Cleveland 11 13:03 Tampa Bay 77442
## 2 Cleveland 11 13:03 Chicago 64214
## 3 Dallas 16 20:30 Tennessee 59621
## 4 Dallas 16 20:30 Pittsburgh 63751
## 5 Baltimore 9 13:02 San Diego 70222
Describe the resulting data:
How is it different from the original two datasets?
Added columns team.y and weekly_attendance. Added two rows where attendance data had more than one match for week
left_join(games_small, attendance_small, by = "week")
## # A tibble: 12 × 5
## team.x week time team.y weekly_attendance
## <chr> <dbl> <time> <chr> <dbl>
## 1 Denver 15 16:05 <NA> NA
## 2 Baltimore 8 13:02 <NA> NA
## 3 New Orleans 3 13:03 <NA> NA
## 4 Cleveland 11 13:03 Tampa Bay 77442
## 5 Cleveland 11 13:03 Chicago 64214
## 6 Dallas 16 20:30 Tennessee 59621
## 7 Dallas 16 20:30 Pittsburgh 63751
## 8 San Francisco 2 16:05 <NA> NA
## 9 Seattle 2 16:06 <NA> NA
## 10 Cleveland 17 16:16 <NA> NA
## 11 Baltimore 9 13:02 San Diego 70222
## 12 Baltimore 17 16:16 <NA> NA
Describe the resulting data:
How is it different from the original two datasets? *Added columns team.x and time. Did not add any rows becuase there were no repeat key matches in games data.
right_join(games_small, attendance_small, by = "week")
## # A tibble: 10 × 5
## team.x week time team.y weekly_attendance
## <chr> <dbl> <time> <chr> <dbl>
## 1 Cleveland 11 13:03 Tampa Bay 77442
## 2 Cleveland 11 13:03 Chicago 64214
## 3 Dallas 16 20:30 Tennessee 59621
## 4 Dallas 16 20:30 Pittsburgh 63751
## 5 Baltimore 9 13:02 San Diego 70222
## 6 <NA> 6 NA Pittsburgh 76957
## 7 <NA> 5 NA Cleveland 71239
## 8 <NA> 10 NA Indianapolis 66004
## 9 <NA> 7 NA Kansas City 68350
## 10 <NA> 4 NA Arizona 28109
Describe the resulting data:
How is it different from the original two datasets? *Used columns from each of the datasets. Added 7 rows so that each key, from both data sets, was represented.
full_join(games_small, attendance_small, by = "week")
## # A tibble: 17 × 5
## team.x week time team.y weekly_attendance
## <chr> <dbl> <time> <chr> <dbl>
## 1 Denver 15 16:05 <NA> NA
## 2 Baltimore 8 13:02 <NA> NA
## 3 New Orleans 3 13:03 <NA> NA
## 4 Cleveland 11 13:03 Tampa Bay 77442
## 5 Cleveland 11 13:03 Chicago 64214
## 6 Dallas 16 20:30 Tennessee 59621
## 7 Dallas 16 20:30 Pittsburgh 63751
## 8 San Francisco 2 16:05 <NA> NA
## 9 Seattle 2 16:06 <NA> NA
## 10 Cleveland 17 16:16 <NA> NA
## 11 Baltimore 9 13:02 San Diego 70222
## 12 Baltimore 17 16:16 <NA> NA
## 13 <NA> 6 NA Pittsburgh 76957
## 14 <NA> 5 NA Cleveland 71239
## 15 <NA> 10 NA Indianapolis 66004
## 16 <NA> 7 NA Kansas City 68350
## 17 <NA> 4 NA Arizona 28109
Describe the resulting data:
How is it different from the original two datasets? *Only uses columns from games data. Rows come from key values (week) that are shared between the datasets, but the team and time data comes from games_small.
semi_join(games_small, attendance_small, by = "week")
## # A tibble: 3 × 3
## team week time
## <chr> <dbl> <time>
## 1 Cleveland 11 13:03
## 2 Dallas 16 20:30
## 3 Baltimore 9 13:02
Describe the resulting data:
How is it different from the original two datasets? *Uses columns from games data. Rows come from games data where there is no common key with attendance data.
anti_join(games_small, attendance_small, by = "week")
## # A tibble: 7 × 3
## team week time
## <chr> <dbl> <time>
## 1 Denver 15 16:05
## 2 Baltimore 8 13:02
## 3 New Orleans 3 13:03
## 4 San Francisco 2 16:05
## 5 Seattle 2 16:06
## 6 Cleveland 17 16:16
## 7 Baltimore 17 16:16