Import two related datasets from the TidyTuesday project.
nyt_full <- readr::read_tsv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2022/2022-05-10/nyt_full.tsv')
## Rows: 60386 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): title, author
## dbl (3): year, rank, title_id
## date (1): week
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nyt_titles <- readr::read_tsv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2022/2022-05-10/nyt_titles.tsv')
## Rows: 7431 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): title, author
## dbl (5): id, year, total_weeks, debut_rank, best_rank
## date (1): first_week
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
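The column-specification message printed above can be silenced in the two ways it suggests. The following is a minimal sketch, assuming readr 2.0 or later (for `show_col_types` and literal input via `I()`); the inline TSV and the names `toy_tsv`, `quiet`, and `typed` are hypothetical stand-ins, not the TidyTuesday files.

```r
library(readr)

# Hypothetical two-row TSV supplied inline via I(); not the real NYT data.
toy_tsv <- I("title\tauthor\tyear\nBook One\tAuthor A\t2001\nBook Two\tAuthor B\t2002\n")

# Option 1: keep type guessing but silence the message.
quiet <- read_tsv(toy_tsv, show_col_types = FALSE)

# Option 2: declare the column types explicitly, which also silences the message.
typed <- read_tsv(
  toy_tsv,
  col_types = cols(
    title  = col_character(),
    author = col_character(),
    year   = col_double()
  )
)
```

Declaring the types explicitly has the side benefit that a change in the file's format is more likely to surface as a parsing problem than as a silently different column type.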
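Describe the two datasets:
Data 1: nyt_full (60,386 rows and 6 columns: title, author, year, rank, title_id, week)
Data 2: nyt_titles (7,431 rows and 8 columns: title, author, id, year, total_weeks, debut_rank, best_rank, first_week)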
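library(dplyr)  # provides %>%, select(), sample_n(), and the join functions
set.seed(1234)  # make the random samples reproducible
nytfull_small <- nyt_full %>% select(author, year, rank) %>% sample_n(10)
nyttitles_small <- nyt_titles %>% select(author, year, best_rank) %>% sample_n(10)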
nytfull_small
## # A tibble: 10 × 3
## author year rank
## <chr> <dbl> <dbl>
## 1 James Redfield 1996 4
## 2 John Darnton 1996 13
## 3 Caleb Carr 1997 4
## 4 Graham Greene 1958 11
## 5 John Gardner 1987 11
## 6 Jimmy Stewart 1989 11
## 7 Fannie Flagg 2013 14
## 8 Irving Stone 1961 1
## 9 Rona Jaffe 1958 6
## 10 Len Deighton 1965 6
nyttitles_small
## # A tibble: 10 × 3
## author year best_rank
## <chr> <dbl> <dbl>
## 1 Maeve Binchy 2011 4
## 2 Jack Higgins 1994 7
## 3 James Lee Burke 2007 4
## 4 Hilary Mantel 2014 15
## 5 Jacqueline Winspear 2016 4
## 6 John Steinbeck 1937 4
## 7 Brad Meltzer 2002 3
## 8 Richard Castle 2014 6
## 9 Clive Cussler and Thomas Perry 2013 2
## 10 Brad Thor 2010 6
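The samples above are drawn with `sample_n()`, which dplyr 1.0.0 and later mark as superseded in favor of `slice_sample()`. A minimal sketch of the newer form follows, using the built-in `mtcars` data as a stand-in; the name `cars_small` is hypothetical, and this sketch makes no claim that the two functions pick the same rows for a given seed.

```r
library(dplyr)

set.seed(1234)  # make the random sample reproducible

# mtcars is a built-in stand-in here, not the NYT data.
cars_small <- mtcars %>%
  select(mpg, cyl) %>%
  slice_sample(n = 10)  # draw 10 rows at random, without replacement
```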
Describe the resulting data:
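How is it different from the original two datasets?
* The result has 0 rows, compared with 10 in each of the small datasets. An inner join keeps only rows with a match in both tables, and no author and year combination in nytfull_small also appears in nyttitles_small.
* All of the columns from both small datasets are present: the key columns author and year, then rank and best_rank.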
nytfull_small %>% inner_join(nyttitles_small, by = c("author", "year"))
## # A tibble: 0 × 4
## # ℹ 4 variables: author <chr>, year <dbl>, rank <dbl>, best_rank <dbl>
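Because the two random samples share no keys, the output above cannot show what a successful match looks like. Here is a minimal sketch with hypothetical toy tables (`weekly` and `titles`, with invented authors and ranks) in which exactly one author and year pair appears on both sides.

```r
library(dplyr)

# Hypothetical toy tables, not drawn from the NYT data.
weekly <- tibble(
  author = c("Author A", "Author B", "Author C"),
  year   = c(2001, 2002, 2003),
  rank   = c(5, 9, 2)
)
titles <- tibble(
  author    = c("Author B", "Author D"),
  year      = c(2002, 2004),
  best_rank = c(1, 7)
)

# Only the ("Author B", 2002) key appears in both tables, so one row survives.
matched <- weekly %>% inner_join(titles, by = c("author", "year"))
matched
```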
Describe the resulting data:
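How is it different from the original two datasets?
* All 10 rows from nytfull_small are kept, because a left join retains every row of the left table.
* All of the columns from both small datasets are present.
* No row in nyttitles_small matches on author and year, so best_rank is NA in every row.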
nytfull_small %>% left_join(nyttitles_small, by = c("author", "year"))
## # A tibble: 10 × 4
## author year rank best_rank
## <chr> <dbl> <dbl> <dbl>
## 1 James Redfield 1996 4 NA
## 2 John Darnton 1996 13 NA
## 3 Caleb Carr 1997 4 NA
## 4 Graham Greene 1958 11 NA
## 5 John Gardner 1987 11 NA
## 6 Jimmy Stewart 1989 11 NA
## 7 Fannie Flagg 2013 14 NA
## 8 Irving Stone 1961 1 NA
## 9 Rona Jaffe 1958 6 NA
## 10 Len Deighton 1965 6 NA
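In the output above every best_rank is NA, so it may help to see a left join where some rows do match. This is a minimal sketch on hypothetical toy tables (`weekly`, `titles`, and the invented values in them are assumptions for illustration).

```r
library(dplyr)

# Hypothetical toy tables, not drawn from the NYT data.
weekly <- tibble(
  author = c("Author A", "Author B", "Author C"),
  year   = c(2001, 2002, 2003),
  rank   = c(5, 9, 2)
)
titles <- tibble(
  author    = c("Author B", "Author D"),
  year      = c(2002, 2004),
  best_rank = c(1, 7)
)

# Every row of `weekly` is kept; best_rank is filled in only where the key matches.
left_result <- weekly %>% left_join(titles, by = c("author", "year"))
left_result
```

Here the three rows of `weekly` all survive, and only the "Author B" row carries a non-missing best_rank.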
Describe the resulting data:
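How is it different from the original two datasets?
* All 10 rows from nyttitles_small are kept, because a right join retains every row of the right table.
* All of the columns from both small datasets are present.
* No row in nytfull_small matches on author and year, so rank is NA in every row.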
nytfull_small %>% right_join(nyttitles_small, by = c("author", "year"))
## # A tibble: 10 × 4
## author year rank best_rank
## <chr> <dbl> <dbl> <dbl>
## 1 Maeve Binchy 2011 NA 4
## 2 Jack Higgins 1994 NA 7
## 3 James Lee Burke 2007 NA 4
## 4 Hilary Mantel 2014 NA 15
## 5 Jacqueline Winspear 2016 NA 4
## 6 John Steinbeck 1937 NA 4
## 7 Brad Meltzer 2002 NA 3
## 8 Richard Castle 2014 NA 6
## 9 Clive Cussler and Thomas Perry 2013 NA 2
## 10 Brad Thor 2010 NA 6
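A right join can be read as a left join with the two tables swapped. The sketch below checks this on hypothetical toy tables (`weekly` and `titles` are invented for illustration); the results are compared only after aligning column and row order, since the two calls are not assumed to agree on ordering.

```r
library(dplyr)

# Hypothetical toy tables, not drawn from the NYT data.
weekly <- tibble(
  author = c("Author A", "Author B"),
  year   = c(2001, 2002),
  rank   = c(5, 9)
)
titles <- tibble(
  author    = c("Author B", "Author D"),
  year      = c(2002, 2004),
  best_rank = c(1, 7)
)

# Right join: keep every row of `titles`.
via_right <- weekly %>%
  right_join(titles, by = c("author", "year")) %>%
  arrange(author, year)

# Same idea as a left join with the tables swapped, then columns reordered.
via_left <- titles %>%
  left_join(weekly, by = c("author", "year")) %>%
  select(author, year, rank, best_rank) %>%
  arrange(author, year)

# TRUE when both approaches hold the same rows and values.
same <- isTRUE(all.equal(as.data.frame(via_right), as.data.frame(via_left)))
same
```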
Describe the resulting data:
How is it different from the original two datasets?
* All observations from both datasets are present, even those without a match, giving 10 + 10 = 20 rows.
* There are no matching author and year combinations, so every row shows NA in either rank or best_rank.
nytfull_small %>% full_join(nyttitles_small, by = c("author", "year"))
## # A tibble: 20 × 4
## author year rank best_rank
## <chr> <dbl> <dbl> <dbl>
## 1 James Redfield 1996 4 NA
## 2 John Darnton 1996 13 NA
## 3 Caleb Carr 1997 4 NA
## 4 Graham Greene 1958 11 NA
## 5 John Gardner 1987 11 NA
## 6 Jimmy Stewart 1989 11 NA
## 7 Fannie Flagg 2013 14 NA
## 8 Irving Stone 1961 1 NA
## 9 Rona Jaffe 1958 6 NA
## 10 Len Deighton 1965 6 NA
## 11 Maeve Binchy 2011 NA 4
## 12 Jack Higgins 1994 NA 7
## 13 James Lee Burke 2007 NA 4
## 14 Hilary Mantel 2014 NA 15
## 15 Jacqueline Winspear 2016 NA 4
## 16 John Steinbeck 1937 NA 4
## 17 Brad Meltzer 2002 NA 3
## 18 Richard Castle 2014 NA 6
## 19 Clive Cussler and Thomas Perry 2013 NA 2
## 20 Brad Thor 2010 NA 6
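The 20 rows above are the 10 + 10 unmatched rows from the two samples. When keys do overlap, each matched pair collapses into a single row. A minimal sketch on hypothetical toy tables (`weekly` and `titles` are invented, and the row-count arithmetic assumes each key is unique within its table):

```r
library(dplyr)

# Hypothetical toy tables, not drawn from the NYT data; keys are unique in each.
weekly <- tibble(
  author = c("Author A", "Author B", "Author C"),
  year   = c(2001, 2002, 2003),
  rank   = c(5, 9, 2)
)
titles <- tibble(
  author    = c("Author B", "Author D"),
  year      = c(2002, 2004),
  best_rank = c(1, 7)
)

full_result <- weekly %>% full_join(titles, by = c("author", "year"))

# Number of keys present in both tables.
n_matched <- nrow(inner_join(weekly, titles, by = c("author", "year")))

# With unique keys: rows in full join = left rows + right rows - matched keys.
expected_rows <- nrow(weekly) + nrow(titles) - n_matched
c(actual = nrow(full_result), expected = expected_rows)
```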
Describe the resulting data:
How is it different from the original two datasets?
* Only the columns from nytfull_small are present.
* A semi join keeps only the rows of nytfull_small that have a match in nyttitles_small. There are none, so the result has 0 rows.
nytfull_small %>% semi_join(nyttitles_small, by = c("author", "year"))
## # A tibble: 0 × 3
## # ℹ 3 variables: author <chr>, year <dbl>, rank <dbl>
Describe the resulting data:
How is it different from the original two datasets?
* Only the columns from nytfull_small are present.
* An anti join keeps the rows of nytfull_small that have no match in nyttitles_small. None of them match, so all 10 rows are returned.
nytfull_small %>% anti_join(nyttitles_small, by = c("author", "year"))
## # A tibble: 10 × 3
## author year rank
## <chr> <dbl> <dbl>
## 1 James Redfield 1996 4
## 2 John Darnton 1996 13
## 3 Caleb Carr 1997 4
## 4 Graham Greene 1958 11
## 5 John Gardner 1987 11
## 6 Jimmy Stewart 1989 11
## 7 Fannie Flagg 2013 14
## 8 Irving Stone 1961 1
## 9 Rona Jaffe 1958 6
## 10 Len Deighton 1965 6
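The semi join and anti join results above (0 rows and 10 rows) add up to the 10 rows of nytfull_small, because the two filtering joins split the left table into matched and unmatched rows. A minimal sketch of that relationship on hypothetical toy tables (`weekly` and `titles` are invented for illustration):

```r
library(dplyr)

# Hypothetical toy tables, not drawn from the NYT data.
weekly <- tibble(
  author = c("Author A", "Author B", "Author C"),
  year   = c(2001, 2002, 2003),
  rank   = c(5, 9, 2)
)
titles <- tibble(
  author    = c("Author B", "Author D"),
  year      = c(2002, 2004),
  best_rank = c(1, 7)
)

# Rows of `weekly` that have a match in `titles`.
kept <- weekly %>% semi_join(titles, by = c("author", "year"))

# Rows of `weekly` that have no match in `titles`.
dropped <- weekly %>% anti_join(titles, by = c("author", "year"))

# Together the two results account for every row of `weekly`.
nrow(kept) + nrow(dropped) == nrow(weekly)
```

Neither result gains a best_rank column: filtering joins only decide which rows of the left table to keep.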