Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from TidyTuesday Project.

canada_births_1991_2022 <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-01-09/canada_births_1991_2022.csv')

## Rows: 384 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): year, month, births
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nhl_player_births <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-01-09/nhl_player_births.csv')

## Rows: 8474 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): first_name, last_name, birth_city, birth_country, birth_state_prov...
## dbl  (3): player_id, birth_year, birth_month
## date (1): birth_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Make data small

Describe the two datasets:

Data1: nhl_player_births

Columns: first_name, birth_date, birth_country
Rows: 10

Data 2: canada_births_1991_2022

Columns: month,year, births
Rows: 10

set.seed(1234)
nhl_player_births_small <- nhl_player_births %>% select(first_name, birth_date, birth_month) %>% sample_n(10)
canada_births_1991_2022_small          <- canada_births_1991_2022 %>% select(month,year, births) %>% sample_n(10)

nhl_player_births_small

## # A tibble: 10 × 3
##    first_name birth_date birth_month
##    <chr>      <date>           <dbl>
##  1 Marc       1985-06-21           6
##  2 Igor       1995-12-30          12
##  3 Timo       1989-07-07           7
##  4 Gregory    1992-11-13          11
##  5 Greg       1962-01-22           1
##  6 Mike       1956-06-10           6
##  7 Bob        1961-04-22           4
##  8 Emory      1898-09-15           9
##  9 Egor       1991-09-17           9
## 10 Jamie      1975-03-16           3

canada_births_1991_2022_small

## # A tibble: 10 × 3
##    month  year births
##    <dbl> <dbl>  <dbl>
##  1     6  1998  29399
##  2     2  2018  27677
##  3     7  1997  31114
##  4    12  2021  28630
##  5     6  2013  31735
##  6     4  2006  28917
##  7     2  1996  28843
##  8     4  1991  35398
##  9     5  2003  29505
## 10     4  1994  33228

3. inner_join

Describe the resulting data:

Columns: first_name, birth_date, birth_month, year, births
Rows:9

How is it different from the original two datasets? there is only 9 rows instead of 10

inner_result <- nhl_player_births_small %>% 
  inner_join(canada_births_1991_2022_small, by = c("birth_month" = "month"), relationship = "many-to-many")

inner_result

## # A tibble: 9 × 5
##   first_name birth_date birth_month  year births
##   <chr>      <date>           <dbl> <dbl>  <dbl>
## 1 Marc       1985-06-21           6  1998  29399
## 2 Marc       1985-06-21           6  2013  31735
## 3 Igor       1995-12-30          12  2021  28630
## 4 Timo       1989-07-07           7  1997  31114
## 5 Mike       1956-06-10           6  1998  29399
## 6 Mike       1956-06-10           6  2013  31735
## 7 Bob        1961-04-22           4  2006  28917
## 8 Bob        1961-04-22           4  1991  35398
## 9 Bob        1961-04-22           4  1994  33228

4. left_join

Describe the resulting data:

Columns:first_name birth_date birth_month year births
Rows:14

How is it different from the original two datasets? there are 14 rows instead of 10 from the original dataset

left_result <- nhl_player_births_small %>%
  left_join(canada_births_1991_2022_small, 
            by = c("birth_month" = "month"),
            relationship = "many-to-many")

left_result

## # A tibble: 14 × 5
##    first_name birth_date birth_month  year births
##    <chr>      <date>           <dbl> <dbl>  <dbl>
##  1 Marc       1985-06-21           6  1998  29399
##  2 Marc       1985-06-21           6  2013  31735
##  3 Igor       1995-12-30          12  2021  28630
##  4 Timo       1989-07-07           7  1997  31114
##  5 Gregory    1992-11-13          11    NA     NA
##  6 Greg       1962-01-22           1    NA     NA
##  7 Mike       1956-06-10           6  1998  29399
##  8 Mike       1956-06-10           6  2013  31735
##  9 Bob        1961-04-22           4  2006  28917
## 10 Bob        1961-04-22           4  1991  35398
## 11 Bob        1961-04-22           4  1994  33228
## 12 Emory      1898-09-15           9    NA     NA
## 13 Egor       1991-09-17           9    NA     NA
## 14 Jamie      1975-03-16           3    NA     NA

5. right_join

Describe the resulting data:

Columns:player_name, birth_month, team, year, births)
Rows:

How is it different from the original two datasets? Keeps all rows from the Canadian births dataset, even if no match in the NHL dataset.

right_join_result <- nhl_player_births_small %>%
  right_join(canada_births_1991_2022_small, 
             by = c("birth_month" = "month"),
             relationship = "many-to-many")

right_join_result

## # A tibble: 12 × 5
##    first_name birth_date birth_month  year births
##    <chr>      <date>           <dbl> <dbl>  <dbl>
##  1 Marc       1985-06-21           6  1998  29399
##  2 Marc       1985-06-21           6  2013  31735
##  3 Igor       1995-12-30          12  2021  28630
##  4 Timo       1989-07-07           7  1997  31114
##  5 Mike       1956-06-10           6  1998  29399
##  6 Mike       1956-06-10           6  2013  31735
##  7 Bob        1961-04-22           4  2006  28917
##  8 Bob        1961-04-22           4  1991  35398
##  9 Bob        1961-04-22           4  1994  33228
## 10 <NA>       NA                   2  2018  27677
## 11 <NA>       NA                   2  1996  28843
## 12 <NA>       NA                   5  2003  29505

6. full_join

Describe the resulting data:

Columns: first_name, birth_date, birth_month, year, births
Rows: 17

How is it different from the original two datasets? Keeps all rows from both datasets; unmatched rows get NA values.

full_join_result <- nhl_player_births_small %>%
  full_join(canada_births_1991_2022_small, 
            by = c("birth_month" = "month"),
            relationship = "many-to-many")

full_join_result

## # A tibble: 17 × 5
##    first_name birth_date birth_month  year births
##    <chr>      <date>           <dbl> <dbl>  <dbl>
##  1 Marc       1985-06-21           6  1998  29399
##  2 Marc       1985-06-21           6  2013  31735
##  3 Igor       1995-12-30          12  2021  28630
##  4 Timo       1989-07-07           7  1997  31114
##  5 Gregory    1992-11-13          11    NA     NA
##  6 Greg       1962-01-22           1    NA     NA
##  7 Mike       1956-06-10           6  1998  29399
##  8 Mike       1956-06-10           6  2013  31735
##  9 Bob        1961-04-22           4  2006  28917
## 10 Bob        1961-04-22           4  1991  35398
## 11 Bob        1961-04-22           4  1994  33228
## 12 Emory      1898-09-15           9    NA     NA
## 13 Egor       1991-09-17           9    NA     NA
## 14 Jamie      1975-03-16           3    NA     NA
## 15 <NA>       NA                   2  2018  27677
## 16 <NA>       NA                   2  1996  28843
## 17 <NA>       NA                   5  2003  29505

7. semi_join

Describe the resulting data:

Columns: first_name, birth_date, birth_month
Rows: 5

How is it different from the original two datasets? Filters rows from nhl_player_births_small without adding new columns.

semi_join_result <- nhl_player_births_small %>%
  semi_join(canada_births_1991_2022_small, 
            by = c("birth_month" = "month"))

semi_join_result

## # A tibble: 5 × 3
##   first_name birth_date birth_month
##   <chr>      <date>           <dbl>
## 1 Marc       1985-06-21           6
## 2 Igor       1995-12-30          12
## 3 Timo       1989-07-07           7
## 4 Mike       1956-06-10           6
## 5 Bob        1961-04-22           4

8. anti_join

Describe the resulting data:

Columns: first_name, birth_date, birth_month
Rows: 5

How is it different from the original two datasets? Shows unmatched rows from the left dataset (NHL).

anti_join_result <- nhl_player_births_small %>%
  anti_join(canada_births_1991_2022_small, 
            by = c("birth_month" = "month"))

anti_join_result

## # A tibble: 5 × 3
##   first_name birth_date birth_month
##   <chr>      <date>           <dbl>
## 1 Gregory    1992-11-13          11
## 2 Greg       1962-01-22           1
## 3 Emory      1898-09-15           9
## 4 Egor       1991-09-17           9
## 5 Jamie      1975-03-16           3

Week 9: Apply it to your data 8

Thomas Kaufild

2022-10-22

1. Import your data

2. Make data small

3. inner_join

4. left_join

5. right_join

6. full_join

7. semi_join

8. anti_join