Import two related datasets from the TidyTuesday project.
christmas_novel_authors <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-12-30/christmas_novel_authors.csv')
## Rows: 35 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): author, wikipedia, aliases
## dbl (3): gutenberg_author_id, birthdate, deathdate
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
christmas_novels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-12-30/christmas_novels.csv')
## Rows: 42 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): title
## dbl (2): gutenberg_id, gutenberg_author_id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Describe the two datasets:
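Data 1: christmas_novel_authors has 35 rows and 6 columns: author, wikipedia, aliases, gutenberg_author_id, birthdate and deathdate.
Data 2: christmas_novels has 42 rows and 3 columns: title, gutenberg_id and gutenberg_author_id. It shares the gutenberg_author_id key with Data 1.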
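library(dplyr) # provides %>%, select(), sample_n() and the join verbs
set.seed(1234) # fixed seed so the two random samples are reproducible
# Keep the join key plus two columns from each table, then sample 10 rows
novel_authors_small <- christmas_novel_authors %>% select(gutenberg_author_id, author, birthdate) %>% sample_n(10)
novels_small <- christmas_novels %>% select(gutenberg_author_id, title, gutenberg_id) %>% sample_n(10)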
novel_authors_small
## # A tibble: 10 × 3
## gutenberg_author_id author birthdate
## <dbl> <chr> <dbl>
## 1 8194 Auerbach, Berthold 1812
## 2 7088 Parker, Theodore 1810
## 3 9034 McIntosh, Maria J. (Maria Jane) 1803
## 4 102 Alcott, Louisa May 1832
## 5 2497 Dawson, Coningsby 1883
## 6 1253 Barclay, Florence L. (Florence Louisa) 1862
## 7 2044 Finley, Martha 1828
## 8 2241 Frey, Hildegard G. 1891
## 9 1324 Locke, William John 1863
## 10 4001 Bacheller, Irving 1859
novels_small
## # A tibble: 10 × 3
## gutenberg_author_id title gutenberg_id
## <dbl> <chr> <dbl>
## 1 362 "The Lost Word: A Christmas Legend of Long … 4384
## 2 7166 "Christmas Holidays at Merryvale\nThe Merry… 23569
## 3 3796 "Angel Unawares: A Story of Christmas Eve" 42919
## 4 585 "The Thin Santa Claus: The Chicken Yard Tha… 17937
## 5 9034 "Evenings at Donaldson Manor; Or, The Chris… 20018
## 6 1324 "A Christmas Mystery: The Story of Three Wi… 10707
## 7 362 "The First Christmas Tree: A Story of the F… 16134
## 8 6328 "Uncle Noah's Christmas Inspiration" 15826
## 9 650 "Christmas Eve and Christmas Day: Ten Chris… 32455
## 10 3293 "Mr. Blake's Walking-Stick: A Christmas Sto… 52935
Describe the resulting data:
How is it different from the original two datasets?
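Only 2 rows remain, compared with 10 rows in each of the two sampled datasets: the rows whose gutenberg_author_id appears in both.
All columns from the two datasets are included, with the key column gutenberg_author_id appearing once.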
novel_authors_small %>% inner_join(novels_small)
## Joining with `by = join_by(gutenberg_author_id)`
## # A tibble: 2 × 5
## gutenberg_author_id author birthdate title gutenberg_id
## <dbl> <chr> <dbl> <chr> <dbl>
## 1 9034 McIntosh, Maria J. (Maria Ja… 1803 Even… 20018
## 2 1324 Locke, William John 1863 A Ch… 10707
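The `Joining with` message in the output above means dplyr chose the key itself, from the one column name the two tables share. A minimal sketch of naming the key explicitly follows. `authors_demo` and `novels_demo` are illustrative names, built from a few ids and birthdates hand-copied from the printed samples; the title column is left out because the titles are truncated in the print.

```r
library(dplyr)

# Illustrative subsets, hand-copied from the printed samples
authors_demo <- tibble(
  gutenberg_author_id = c(8194, 9034, 1324),
  birthdate = c(1812, 1803, 1863)
)
novels_demo <- tibble(
  gutenberg_author_id = c(362, 9034, 1324),
  gutenberg_id = c(4384, 20018, 10707)
)

# Naming the key explicitly documents the intent and avoids the message
joined <- authors_demo %>%
  inner_join(novels_demo, by = "gutenberg_author_id")

joined  # 2 rows: only ids 9034 and 1324 appear in both tables
```

Spelling out `by` also guards against an accidental join on a second column that happens to share a name in both tables.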
Describe the resulting data:
How is it different from the original two datasets?
The result has 10 rows, one for each row of novel_authors_small, and all columns from the two datasets.
8 of the 10 rows have missing values (NA) in the title and gutenberg_id columns, because those authors have no novel in novels_small.
novel_authors_small %>% left_join(novels_small)
## Joining with `by = join_by(gutenberg_author_id)`
## # A tibble: 10 × 5
## gutenberg_author_id author birthdate title gutenberg_id
## <dbl> <chr> <dbl> <chr> <dbl>
## 1 8194 Auerbach, Berthold 1812 <NA> NA
## 2 7088 Parker, Theodore 1810 <NA> NA
## 3 9034 McIntosh, Maria J. (Maria J… 1803 Even… 20018
## 4 102 Alcott, Louisa May 1832 <NA> NA
## 5 2497 Dawson, Coningsby 1883 <NA> NA
## 6 1253 Barclay, Florence L. (Flore… 1862 <NA> NA
## 7 2044 Finley, Martha 1828 <NA> NA
## 8 2241 Frey, Hildegard G. 1891 <NA> NA
## 9 1324 Locke, William John 1863 A Ch… 10707
## 10 4001 Bacheller, Irving 1859 <NA> NA
Describe the resulting data:
How is it different from the original two datasets?
There are still 10 rows, but the set of gutenberg_author_id values differs from the left join result, because every row of novels_small is kept instead.
8 of the 10 rows have missing values (NA) in the author and birthdate columns.
novel_authors_small %>% right_join(novels_small)
## Joining with `by = join_by(gutenberg_author_id)`
## # A tibble: 10 × 5
## gutenberg_author_id author birthdate title gutenberg_id
## <dbl> <chr> <dbl> <chr> <dbl>
## 1 9034 McIntosh, Maria J. (Maria J… 1803 "Eve… 20018
## 2 1324 Locke, William John 1863 "A C… 10707
## 3 362 <NA> NA "The… 4384
## 4 7166 <NA> NA "Chr… 23569
## 5 3796 <NA> NA "Ang… 42919
## 6 585 <NA> NA "The… 17937
## 7 362 <NA> NA "The… 16134
## 8 6328 <NA> NA "Unc… 15826
## 9 650 <NA> NA "Chr… 32455
## 10 3293 <NA> NA "Mr.… 52935
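A right join keeps every row of the second table, so it should return the same rows as a left join with the two tables swapped. The sketch below checks this on small illustrative tables (`authors_demo` and `novels_demo` are made-up names, with ids hand-copied from the printed samples and titles omitted). Only the row order and column order are expected to differ, so both are aligned before comparing.

```r
library(dplyr)

# Illustrative subsets, hand-copied from the printed samples
authors_demo <- tibble(
  gutenberg_author_id = c(8194, 9034, 1324),
  birthdate = c(1812, 1803, 1863)
)
novels_demo <- tibble(
  gutenberg_author_id = c(362, 9034, 1324),
  gutenberg_id = c(4384, 20018, 10707)
)

right <- authors_demo %>%
  right_join(novels_demo, by = "gutenberg_author_id")
left_swapped <- novels_demo %>%
  left_join(authors_demo, by = "gutenberg_author_id")

# Align column order and row order, then compare as plain data frames
right_sorted <- arrange(right, gutenberg_id)
left_sorted <- left_swapped %>%
  select(all_of(names(right))) %>%
  arrange(gutenberg_id)

same_rows <- isTRUE(all.equal(
  as.data.frame(right_sorted),
  as.data.frame(left_sorted)
))
same_rows
```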
Describe the resulting data:
How is it different from the original two datasets?
The two datasets are combined into one table with 18 rows.
Two rows have complete information, whereas the remaining 16 are missing either title and gutenberg_id or author and birthdate.
novel_authors_small %>% full_join(novels_small)
## Joining with `by = join_by(gutenberg_author_id)`
## # A tibble: 18 × 5
## gutenberg_author_id author birthdate title gutenberg_id
## <dbl> <chr> <dbl> <chr> <dbl>
## 1 8194 Auerbach, Berthold 1812 <NA> NA
## 2 7088 Parker, Theodore 1810 <NA> NA
## 3 9034 McIntosh, Maria J. (Maria J… 1803 "Eve… 20018
## 4 102 Alcott, Louisa May 1832 <NA> NA
## 5 2497 Dawson, Coningsby 1883 <NA> NA
## 6 1253 Barclay, Florence L. (Flore… 1862 <NA> NA
## 7 2044 Finley, Martha 1828 <NA> NA
## 8 2241 Frey, Hildegard G. 1891 <NA> NA
## 9 1324 Locke, William John 1863 "A C… 10707
## 10 4001 Bacheller, Irving 1859 <NA> NA
## 11 362 <NA> NA "The… 4384
## 12 7166 <NA> NA "Chr… 23569
## 13 3796 <NA> NA "Ang… 42919
## 14 585 <NA> NA "The… 17937
## 15 362 <NA> NA "The… 16134
## 16 6328 <NA> NA "Unc… 15826
## 17 650 <NA> NA "Chr… 32455
## 18 3293 <NA> NA "Mr.… 52935
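The 18 rows of the full join can be broken down as matched rows plus the unmatched rows from each side. The sketch below reproduces that count from the key columns alone. `authors_keys` and `novels_keys` are illustrative names, and their values are hand-copied from the two printed samples.

```r
library(dplyr)

# Key columns hand-copied from the two printed samples
authors_keys <- tibble(
  gutenberg_author_id = c(8194, 7088, 9034, 102, 2497,
                          1253, 2044, 2241, 1324, 4001)
)
novels_keys <- tibble(
  gutenberg_author_id = c(362, 7166, 3796, 585, 9034,
                          1324, 362, 6328, 650, 3293),
  gutenberg_id = c(4384, 23569, 42919, 17937, 20018,
                   10707, 16134, 15826, 32455, 52935)
)

key <- "gutenberg_author_id"
n_matched      <- nrow(inner_join(authors_keys, novels_keys, by = key))
n_only_authors <- nrow(anti_join(authors_keys, novels_keys, by = key))
n_only_novels  <- nrow(anti_join(novels_keys, authors_keys, by = key))
n_full         <- nrow(full_join(authors_keys, novels_keys, by = key))

c(n_matched, n_only_authors, n_only_novels, n_full)  # 2 8 8 18
```

The full join count equals the sum of the first three numbers: 2 + 8 + 8 = 18.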
Describe the resulting data:
novel_authors_small:
novels_small:
How is it different from the original two datasets?
Only 2 rows are shown: the ones whose gutenberg_author_id appears in both datasets.
With novel_authors_small as the left table, only that dataset's columns are returned.
With novels_small as the left table, only that dataset's columns are returned.
novel_authors_small %>% semi_join(novels_small)
## Joining with `by = join_by(gutenberg_author_id)`
## # A tibble: 2 × 3
## gutenberg_author_id author birthdate
## <dbl> <chr> <dbl>
## 1 9034 McIntosh, Maria J. (Maria Jane) 1803
## 2 1324 Locke, William John 1863
novels_small %>% semi_join(novel_authors_small)
## Joining with `by = join_by(gutenberg_author_id)`
## # A tibble: 2 × 3
## gutenberg_author_id title gutenberg_id
## <dbl> <chr> <dbl>
## 1 9034 Evenings at Donaldson Manor; Or, The Christm… 20018
## 2 1324 A Christmas Mystery: The Story of Three Wise… 10707
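In this sample the semi join and the inner join return the same number of rows, but the two can differ when a key matches more than once. Author id 362 appears twice in the printed novels_small, so the sketch below uses a hypothetical authors table that contains that id (the real novel_authors_small does not). All table names here are illustrative.

```r
library(dplyr)

# Hypothetical authors table: id 362 is NOT in the real novel_authors_small
authors_demo <- tibble(gutenberg_author_id = c(362, 9034))

# Rows copied from the printed novels_small: id 362 has two novels
novels_demo <- tibble(
  gutenberg_author_id = c(362, 362, 9034),
  gutenberg_id = c(4384, 16134, 20018)
)

inner <- inner_join(authors_demo, novels_demo, by = "gutenberg_author_id")
semi  <- semi_join(authors_demo, novels_demo, by = "gutenberg_author_id")

nrow(inner)  # 3: the author with two novels is repeated
nrow(semi)   # 2: each matching author is kept once
names(semi)  # only the columns of authors_demo
```

A semi join acts as a filter on the left table, so it never duplicates rows and never adds columns.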
Describe the resulting data:
novel_authors_small:
novels_small:
How is it different from the original two datasets?
8 rows remain, compared with 10 rows in each sampled dataset.
The 2 rows whose gutenberg_author_id appears in both datasets are removed, and only the left table's columns are returned.
novel_authors_small %>% anti_join(novels_small)
## Joining with `by = join_by(gutenberg_author_id)`
## # A tibble: 8 × 3
## gutenberg_author_id author birthdate
## <dbl> <chr> <dbl>
## 1 8194 Auerbach, Berthold 1812
## 2 7088 Parker, Theodore 1810
## 3 102 Alcott, Louisa May 1832
## 4 2497 Dawson, Coningsby 1883
## 5 1253 Barclay, Florence L. (Florence Louisa) 1862
## 6 2044 Finley, Martha 1828
## 7 2241 Frey, Hildegard G. 1891
## 8 4001 Bacheller, Irving 1859
novels_small %>% anti_join(novel_authors_small)
## Joining with `by = join_by(gutenberg_author_id)`
## # A tibble: 8 × 3
## gutenberg_author_id title gutenberg_id
## <dbl> <chr> <dbl>
## 1 362 "The Lost Word: A Christmas Legend of Long A… 4384
## 2 7166 "Christmas Holidays at Merryvale\nThe Merryv… 23569
## 3 3796 "Angel Unawares: A Story of Christmas Eve" 42919
## 4 585 "The Thin Santa Claus: The Chicken Yard That… 17937
## 5 362 "The First Christmas Tree: A Story of the Fo… 16134
## 6 6328 "Uncle Noah's Christmas Inspiration" 15826
## 7 650 "Christmas Eve and Christmas Day: Ten Christ… 32455
## 8 3293 "Mr. Blake's Walking-Stick: A Christmas Stor… 52935
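For a single key column, an anti join can be read as a filter that drops every row whose key is found in the other table. The sketch below compares the two forms on the key columns hand-copied from the printed samples; `authors_keys` and `novels_keys` are illustrative names.

```r
library(dplyr)

# Key columns hand-copied from the two printed samples
authors_keys <- tibble(
  gutenberg_author_id = c(8194, 7088, 9034, 102, 2497,
                          1253, 2044, 2241, 1324, 4001)
)
novels_keys <- tibble(
  gutenberg_author_id = c(362, 7166, 3796, 585, 9034,
                          1324, 362, 6328, 650, 3293),
  gutenberg_id = c(4384, 23569, 42919, 17937, 20018,
                   10707, 16134, 15826, 32455, 52935)
)

# Novels whose author is missing from the authors sample, two ways
via_anti <- anti_join(novels_keys, authors_keys, by = "gutenberg_author_id")
via_filter <- novels_keys %>%
  filter(!gutenberg_author_id %in% authors_keys$gutenberg_author_id)

nrow(via_anti)  # 8
identical(via_anti$gutenberg_id, via_filter$gutenberg_id)
```

The join form scales more naturally to keys made of several columns, where the `%in%` test would need one condition per column combination.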