Import two related datasets from TidyTuesday Project.
canada_births_1991_2022 <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-01-09/canada_births_1991_2022.csv')
## Rows: 384 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): year, month, births
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nhl_player_births <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-01-09/nhl_player_births.csv')
## Rows: 8474 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): first_name, last_name, birth_city, birth_country, birth_state_prov...
## dbl (3): player_id, birth_year, birth_month
## date (1): birth_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Describe the two datasets:
Data1: nhl_player_births
Data 2: canada_births_1991_2022
set.seed(1234)
nhl_player_births_small <- nhl_player_births %>% select(first_name, birth_date, birth_month) %>% sample_n(10)
canada_births_1991_2022_small <- canada_births_1991_2022 %>% select(month,year, births) %>% sample_n(10)
nhl_player_births_small
## # A tibble: 10 × 3
## first_name birth_date birth_month
## <chr> <date> <dbl>
## 1 Marc 1985-06-21 6
## 2 Igor 1995-12-30 12
## 3 Timo 1989-07-07 7
## 4 Gregory 1992-11-13 11
## 5 Greg 1962-01-22 1
## 6 Mike 1956-06-10 6
## 7 Bob 1961-04-22 4
## 8 Emory 1898-09-15 9
## 9 Egor 1991-09-17 9
## 10 Jamie 1975-03-16 3
canada_births_1991_2022_small
## # A tibble: 10 × 3
## month year births
## <dbl> <dbl> <dbl>
## 1 6 1998 29399
## 2 2 2018 27677
## 3 7 1997 31114
## 4 12 2021 28630
## 5 6 2013 31735
## 6 4 2006 28917
## 7 2 1996 28843
## 8 4 1991 35398
## 9 5 2003 29505
## 10 4 1994 33228
Describe the resulting data:
How is it different from the original two datasets? there is only 9 rows instead of 10
inner_result <- nhl_player_births_small %>%
inner_join(canada_births_1991_2022_small, by = c("birth_month" = "month"), relationship = "many-to-many")
inner_result
## # A tibble: 9 × 5
## first_name birth_date birth_month year births
## <chr> <date> <dbl> <dbl> <dbl>
## 1 Marc 1985-06-21 6 1998 29399
## 2 Marc 1985-06-21 6 2013 31735
## 3 Igor 1995-12-30 12 2021 28630
## 4 Timo 1989-07-07 7 1997 31114
## 5 Mike 1956-06-10 6 1998 29399
## 6 Mike 1956-06-10 6 2013 31735
## 7 Bob 1961-04-22 4 2006 28917
## 8 Bob 1961-04-22 4 1991 35398
## 9 Bob 1961-04-22 4 1994 33228
Describe the resulting data:
How is it different from the original two datasets? there are 14 rows instead of 10 from the original dataset
left_result <- nhl_player_births_small %>%
left_join(canada_births_1991_2022_small,
by = c("birth_month" = "month"),
relationship = "many-to-many")
left_result
## # A tibble: 14 × 5
## first_name birth_date birth_month year births
## <chr> <date> <dbl> <dbl> <dbl>
## 1 Marc 1985-06-21 6 1998 29399
## 2 Marc 1985-06-21 6 2013 31735
## 3 Igor 1995-12-30 12 2021 28630
## 4 Timo 1989-07-07 7 1997 31114
## 5 Gregory 1992-11-13 11 NA NA
## 6 Greg 1962-01-22 1 NA NA
## 7 Mike 1956-06-10 6 1998 29399
## 8 Mike 1956-06-10 6 2013 31735
## 9 Bob 1961-04-22 4 2006 28917
## 10 Bob 1961-04-22 4 1991 35398
## 11 Bob 1961-04-22 4 1994 33228
## 12 Emory 1898-09-15 9 NA NA
## 13 Egor 1991-09-17 9 NA NA
## 14 Jamie 1975-03-16 3 NA NA
Describe the resulting data:
How is it different from the original two datasets? Keeps all rows from the Canadian births dataset, even if no match in the NHL dataset.
right_join_result <- nhl_player_births_small %>%
right_join(canada_births_1991_2022_small,
by = c("birth_month" = "month"),
relationship = "many-to-many")
right_join_result
## # A tibble: 12 × 5
## first_name birth_date birth_month year births
## <chr> <date> <dbl> <dbl> <dbl>
## 1 Marc 1985-06-21 6 1998 29399
## 2 Marc 1985-06-21 6 2013 31735
## 3 Igor 1995-12-30 12 2021 28630
## 4 Timo 1989-07-07 7 1997 31114
## 5 Mike 1956-06-10 6 1998 29399
## 6 Mike 1956-06-10 6 2013 31735
## 7 Bob 1961-04-22 4 2006 28917
## 8 Bob 1961-04-22 4 1991 35398
## 9 Bob 1961-04-22 4 1994 33228
## 10 <NA> NA 2 2018 27677
## 11 <NA> NA 2 1996 28843
## 12 <NA> NA 5 2003 29505
Describe the resulting data:
How is it different from the original two datasets? Keeps all rows from both datasets; unmatched rows get NA values.
full_join_result <- nhl_player_births_small %>%
full_join(canada_births_1991_2022_small,
by = c("birth_month" = "month"),
relationship = "many-to-many")
full_join_result
## # A tibble: 17 × 5
## first_name birth_date birth_month year births
## <chr> <date> <dbl> <dbl> <dbl>
## 1 Marc 1985-06-21 6 1998 29399
## 2 Marc 1985-06-21 6 2013 31735
## 3 Igor 1995-12-30 12 2021 28630
## 4 Timo 1989-07-07 7 1997 31114
## 5 Gregory 1992-11-13 11 NA NA
## 6 Greg 1962-01-22 1 NA NA
## 7 Mike 1956-06-10 6 1998 29399
## 8 Mike 1956-06-10 6 2013 31735
## 9 Bob 1961-04-22 4 2006 28917
## 10 Bob 1961-04-22 4 1991 35398
## 11 Bob 1961-04-22 4 1994 33228
## 12 Emory 1898-09-15 9 NA NA
## 13 Egor 1991-09-17 9 NA NA
## 14 Jamie 1975-03-16 3 NA NA
## 15 <NA> NA 2 2018 27677
## 16 <NA> NA 2 1996 28843
## 17 <NA> NA 5 2003 29505
Describe the resulting data:
Columns: first_name, birth_date, birth_month
Rows: 5
How is it different from the original two datasets? Filters rows from nhl_player_births_small without adding new columns.
semi_join_result <- nhl_player_births_small %>%
semi_join(canada_births_1991_2022_small,
by = c("birth_month" = "month"))
semi_join_result
## # A tibble: 5 × 3
## first_name birth_date birth_month
## <chr> <date> <dbl>
## 1 Marc 1985-06-21 6
## 2 Igor 1995-12-30 12
## 3 Timo 1989-07-07 7
## 4 Mike 1956-06-10 6
## 5 Bob 1961-04-22 4
Describe the resulting data:
How is it different from the original two datasets? Shows unmatched rows from the left dataset (NHL).
anti_join_result <- nhl_player_births_small %>%
anti_join(canada_births_1991_2022_small,
by = c("birth_month" = "month"))
anti_join_result
## # A tibble: 5 × 3
## first_name birth_date birth_month
## <chr> <date> <dbl>
## 1 Gregory 1992-11-13 11
## 2 Greg 1962-01-22 1
## 3 Emory 1898-09-15 9
## 4 Egor 1991-09-17 9
## 5 Jamie 1975-03-16 3