Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from the TidyTuesday project.

nyt_full <- readr::read_tsv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2022/2022-05-10/nyt_full.tsv')

## Rows: 60386 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (2): title, author
## dbl  (3): year, rank, title_id
## date (1): week
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nyt_titles <- readr::read_tsv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2022/2022-05-10/nyt_titles.tsv')

## Rows: 7431 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (2): title, author
## dbl  (5): id, year, total_weeks, debut_rank, best_rank
## date (1): first_week
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Make data small

Describe the two datasets:

Data 1: nyt_full

Columns: author, year, rank
Rows: 10 rows

Data 2: nyt_titles

Columns: author, year, best_rank
Rows: 10 rows

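library(dplyr)  # provides %>%, select(), sample_n(), and the join functions
set.seed(1234)  # fix the random seed so the same 10 rows are sampled on every run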
nytfull_small <- nyt_full %>% select(author, year, rank) %>% sample_n(10)
nyttitles_small <- nyt_titles %>% select(author, year, best_rank) %>% sample_n(10)

nytfull_small

## # A tibble: 10 × 3
##    author          year  rank
##    <chr>          <dbl> <dbl>
##  1 James Redfield  1996     4
##  2 John Darnton    1996    13
##  3 Caleb Carr      1997     4
##  4 Graham Greene   1958    11
##  5 John Gardner    1987    11
##  6 Jimmy Stewart   1989    11
##  7 Fannie Flagg    2013    14
##  8 Irving Stone    1961     1
##  9 Rona Jaffe      1958     6
## 10 Len Deighton    1965     6

nyttitles_small

## # A tibble: 10 × 3
##    author                          year best_rank
##    <chr>                          <dbl>     <dbl>
##  1 Maeve Binchy                    2011         4
##  2 Jack Higgins                    1994         7
##  3 James Lee Burke                 2007         4
##  4 Hilary Mantel                   2014        15
##  5 Jacqueline Winspear             2016         4
##  6 John Steinbeck                  1937         4
##  7 Brad Meltzer                    2002         3
##  8 Richard Castle                  2014         6
##  9 Clive Cussler and Thomas Perry  2013         2
## 10 Brad Thor                       2010         6

3. inner_join

Describe the resulting data:

Columns: author, year, rank, best_rank
Rows: 0

How is it different from the original two datasets?

* There are 0 rows, compared to 10 in each of the original datasets. None of the randomly sampled rows in nytfull_small shares an author and year combination with a row in nyttitles_small, so inner_join has nothing to keep.
* All of the columns from both original datasets are present.

nytfull_small %>% inner_join(nyttitles_small, by = c("author", "year"))

## # A tibble: 0 × 4
## # ℹ 4 variables: author <chr>, year <dbl>, rank <dbl>, best_rank <dbl>
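
The empty result above comes from the two random samples sharing no keys. As a minimal sketch of what inner_join returns when keys do overlap, the tibbles below (toy_full and toy_titles, hypothetical data made up for illustration and not taken from the NYT datasets) share exactly one author and year combination:

```r
library(dplyr)

# Hypothetical toy data (not from the NYT datasets), built so that one key matches
toy_full <- tibble(
  author = c("Author A", "Author B", "Author C"),
  year   = c(2001, 2002, 2003),
  rank   = c(5, 9, 2)
)
toy_titles <- tibble(
  author    = c("Author B", "Author D"),
  year      = c(2002, 2004),
  best_rank = c(1, 7)
)

# Only the row whose author AND year appear in both tables is kept
toy_inner <- toy_full %>% inner_join(toy_titles, by = c("author", "year"))
toy_inner
```

Here the result should have a single row for "Author B" in 2002, carrying both rank and best_rank.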

4. left_join

Describe the resulting data:

Columns: author, year, rank, best_rank
Rows: 10

How is it different from the original two datasets?

* All of the columns from both original datasets are present.
* All 10 rows of nytfull_small are kept. None of them has a match in nyttitles_small, so “best_rank” is NA in every row.

nytfull_small %>% left_join(nyttitles_small, by = c("author", "year"))

## # A tibble: 10 × 4
##    author          year  rank best_rank
##    <chr>          <dbl> <dbl>     <dbl>
##  1 James Redfield  1996     4        NA
##  2 John Darnton    1996    13        NA
##  3 Caleb Carr      1997     4        NA
##  4 Graham Greene   1958    11        NA
##  5 John Gardner    1987    11        NA
##  6 Jimmy Stewart   1989    11        NA
##  7 Fannie Flagg    2013    14        NA
##  8 Irving Stone    1961     1        NA
##  9 Rona Jaffe      1958     6        NA
## 10 Len Deighton    1965     6        NA
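
The NA values above mark rows of the left table that found no partner. A small sketch of counting them, using hypothetical toy tibbles (toy_full and toy_titles are made up for illustration) and assuming best_rank has no missing values of its own in the right table:

```r
library(dplyr)

# Hypothetical toy data (not from the NYT datasets)
toy_full <- tibble(
  author = c("Author A", "Author B", "Author C"),
  year   = c(2001, 2002, 2003),
  rank   = c(5, 9, 2)
)
toy_titles <- tibble(
  author    = c("Author B", "Author D"),
  year      = c(2002, 2004),
  best_rank = c(1, 7)
)

# left_join keeps every row of toy_full, matched or not
toy_left <- toy_full %>% left_join(toy_titles, by = c("author", "year"))

# Rows of toy_full with no partner in toy_titles get NA in best_rank.
# Counting them this way assumes best_rank has no missing values of its own.
n_unmatched <- toy_left %>% filter(is.na(best_rank)) %>% nrow()
n_unmatched
```

Applied to the join above, the same count would be 10, since every best_rank is NA.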

5. right_join

Describe the resulting data:

Columns: author, year, rank, best_rank
Rows: 10

How is it different from the original two datasets?

* All of the columns from both original datasets are present.
* All 10 rows of nyttitles_small are kept. None of them has a match in nytfull_small, so “rank” is NA in every row.

nytfull_small %>% right_join(nyttitles_small, by = c("author", "year"))

## # A tibble: 10 × 4
##    author                          year  rank best_rank
##    <chr>                          <dbl> <dbl>     <dbl>
##  1 Maeve Binchy                    2011    NA         4
##  2 Jack Higgins                    1994    NA         7
##  3 James Lee Burke                 2007    NA         4
##  4 Hilary Mantel                   2014    NA        15
##  5 Jacqueline Winspear             2016    NA         4
##  6 John Steinbeck                  1937    NA         4
##  7 Brad Meltzer                    2002    NA         3
##  8 Richard Castle                  2014    NA         6
##  9 Clive Cussler and Thomas Perry  2013    NA         2
## 10 Brad Thor                       2010    NA         6

6. full_join

Describe the resulting data:

Columns: author, year, rank, best_rank
Rows: 20

How is it different from the original two datasets?

* All observations from both datasets are present, even those without a match.
* There are no matching author and year combinations, so every row has NA in either “rank” or “best_rank”.

nytfull_small %>% full_join(nyttitles_small, by = c("author", "year"))

## # A tibble: 20 × 4
##    author                          year  rank best_rank
##    <chr>                          <dbl> <dbl>     <dbl>
##  1 James Redfield                  1996     4        NA
##  2 John Darnton                    1996    13        NA
##  3 Caleb Carr                      1997     4        NA
##  4 Graham Greene                   1958    11        NA
##  5 John Gardner                    1987    11        NA
##  6 Jimmy Stewart                   1989    11        NA
##  7 Fannie Flagg                    2013    14        NA
##  8 Irving Stone                    1961     1        NA
##  9 Rona Jaffe                      1958     6        NA
## 10 Len Deighton                    1965     6        NA
## 11 Maeve Binchy                    2011    NA         4
## 12 Jack Higgins                    1994    NA         7
## 13 James Lee Burke                 2007    NA         4
## 14 Hilary Mantel                   2014    NA        15
## 15 Jacqueline Winspear             2016    NA         4
## 16 John Steinbeck                  1937    NA         4
## 17 Brad Meltzer                    2002    NA         3
## 18 Richard Castle                  2014    NA         6
## 19 Clive Cussler and Thomas Perry  2013    NA         2
## 20 Brad Thor                       2010    NA         6
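
The 20 rows above are 0 matched rows plus 10 rows found only in nytfull_small plus 10 found only in nyttitles_small. The sketch below checks that arithmetic on hypothetical toy tibbles (toy_full and toy_titles are made up for illustration). It assumes each author and year combination appears at most once per table; with duplicated keys the counts would not add up this simply.

```r
library(dplyr)

# Hypothetical toy data (not from the NYT datasets); keys are unique within each table
toy_full <- tibble(
  author = c("Author A", "Author B", "Author C"),
  year   = c(2001, 2002, 2003),
  rank   = c(5, 9, 2)
)
toy_titles <- tibble(
  author    = c("Author B", "Author D"),
  year      = c(2002, 2004),
  best_rank = c(1, 7)
)
keys <- c("author", "year")

n_both        <- toy_full   %>% inner_join(toy_titles, by = keys) %>% nrow()  # in both tables
n_only_full   <- toy_full   %>% anti_join(toy_titles, by = keys)  %>% nrow()  # only in toy_full
n_only_titles <- toy_titles %>% anti_join(toy_full, by = keys)    %>% nrow()  # only in toy_titles
n_full_join   <- toy_full   %>% full_join(toy_titles, by = keys)  %>% nrow()

# With unique keys, the full join has one row per matched or unmatched key
n_full_join == n_both + n_only_full + n_only_titles
```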

7. semi_join

Describe the resulting data:

Columns: author, year, rank
Rows: 0

How is it different from the original two datasets?

* Only the columns from nytfull_small are present.
* Only rows of nytfull_small that have a match in nyttitles_small are kept. There are none, so the result has 0 rows.

nytfull_small %>% semi_join(nyttitles_small, by = c("author", "year"))

## # A tibble: 0 × 3
## # ℹ 3 variables: author <chr>, year <dbl>, rank <dbl>

8. anti_join

Describe the resulting data:

Columns: author, year, rank
Rows: 10

How is it different from the original two datasets?

* Only the columns from nytfull_small are present.
* Only rows of nytfull_small without a match in nyttitles_small are kept. Since nothing matches, all 10 rows are returned.

nytfull_small %>% anti_join(nyttitles_small, by = c("author", "year"))

## # A tibble: 10 × 3
##    author          year  rank
##    <chr>          <dbl> <dbl>
##  1 James Redfield  1996     4
##  2 John Darnton    1996    13
##  3 Caleb Carr      1997     4
##  4 Graham Greene   1958    11
##  5 John Gardner    1987    11
##  6 Jimmy Stewart   1989    11
##  7 Fannie Flagg    2013    14
##  8 Irving Stone    1961     1
##  9 Rona Jaffe      1958     6
## 10 Len Deighton    1965     6
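
semi_join and anti_join split the left table into matched and unmatched rows, which is why the two results here have 0 and 10 rows out of 10. A minimal sketch of that split on hypothetical toy tibbles (toy_full and toy_titles are made up for illustration, not taken from the NYT datasets):

```r
library(dplyr)

# Hypothetical toy data (not from the NYT datasets)
toy_full <- tibble(
  author = c("Author A", "Author B", "Author C"),
  year   = c(2001, 2002, 2003),
  rank   = c(5, 9, 2)
)
toy_titles <- tibble(
  author    = c("Author B", "Author D"),
  year      = c(2002, 2004),
  best_rank = c(1, 7)
)

# Rows of toy_full that have a match in toy_titles
toy_semi <- toy_full %>% semi_join(toy_titles, by = c("author", "year"))
# Rows of toy_full that have no match in toy_titles
toy_anti <- toy_full %>% anti_join(toy_titles, by = c("author", "year"))

# Every row of toy_full lands in exactly one of the two results,
# and neither result gains columns from toy_titles
nrow(toy_semi) + nrow(toy_anti) == nrow(toy_full)
```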

Taylor Nelson

2026-11-06
