Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from the TidyTuesday project.

christmas_novel_authors <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-12-30/christmas_novel_authors.csv')

## Rows: 35 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): author, wikipedia, aliases
## dbl (3): gutenberg_author_id, birthdate, deathdate
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

christmas_novels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-12-30/christmas_novels.csv')

## Rows: 42 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): title
## dbl (2): gutenberg_id, gutenberg_author_id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Make data small

Describe the two datasets:

Data 1: christmas_novel_authors

Columns: gutenberg_author_id, author, birthdate
Rows: 10

Data 2: christmas_novels

Columns: gutenberg_author_id, title, gutenberg_id
Rows: 10

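library(dplyr)  # %>%, select(), sample_n(), and the join verbs
set.seed(1234)  # make the random samples reproducible

# Keep three columns and draw 10 random rows from each dataset
novel_authors_small <- christmas_novel_authors %>%
    select(gutenberg_author_id, author, birthdate) %>% sample_n(10)

novels_small <- christmas_novels %>%
    select(gutenberg_author_id, title, gutenberg_id) %>% sample_n(10)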

novel_authors_small

## # A tibble: 10 × 3
##    gutenberg_author_id author                                 birthdate
##                  <dbl> <chr>                                      <dbl>
##  1                8194 Auerbach, Berthold                          1812
##  2                7088 Parker, Theodore                            1810
##  3                9034 McIntosh, Maria J. (Maria Jane)             1803
##  4                 102 Alcott, Louisa May                          1832
##  5                2497 Dawson, Coningsby                           1883
##  6                1253 Barclay, Florence L. (Florence Louisa)      1862
##  7                2044 Finley, Martha                              1828
##  8                2241 Frey, Hildegard G.                          1891
##  9                1324 Locke, William John                         1863
## 10                4001 Bacheller, Irving                           1859

novels_small

## # A tibble: 10 × 3
##    gutenberg_author_id title                                        gutenberg_id
##                  <dbl> <chr>                                               <dbl>
##  1                 362 "The Lost Word: A Christmas Legend of Long …         4384
##  2                7166 "Christmas Holidays at Merryvale\nThe Merry…        23569
##  3                3796 "Angel Unawares: A Story of Christmas Eve"          42919
##  4                 585 "The Thin Santa Claus: The Chicken Yard Tha…        17937
##  5                9034 "Evenings at Donaldson Manor; Or, The Chris…        20018
##  6                1324 "A Christmas Mystery: The Story of Three Wi…        10707
##  7                 362 "The First Christmas Tree: A Story of the F…        16134
##  8                6328 "Uncle Noah's Christmas Inspiration"                15826
##  9                 650 "Christmas Eve and Christmas Day: Ten Chris…        32455
## 10                3293 "Mr. Blake's Walking-Stick: A Christmas Sto…        52935
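
Every join below matches on gutenberg_author_id, so it can help to check first which key values the two samples share. The following is a minimal sketch in base R, not part of the original analysis: the vectors author_ids and novel_author_ids are hypothetical names, and their values are typed by hand from the two printed tibbles above.

```r
# Key values copied by hand from the printed tibbles (assumption: no typos)
author_ids <- c(8194, 7088, 9034, 102, 2497, 1253, 2044, 2241, 1324, 4001)
novel_author_ids <- c(362, 7166, 3796, 585, 9034, 1324, 362, 6328, 650, 3293)

# IDs present in both samples: only these can produce matched rows in a join
shared_ids <- intersect(author_ids, novel_author_ids)
shared_ids

# IDs that occur more than once on the novels side
repeated_ids <- novel_author_ids[duplicated(novel_author_ids)]
repeated_ids
```

The two shared IDs (9034 and 1324) are why only two rows match in the joins that follow. The repeated ID 362 shows that one author can have more than one novel in novels_small.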

3. inner_join

Describe the resulting data:

Columns: gutenberg_author_id, author, birthdate, title, gutenberg_id
Rows: 2

How is it different from the original two datasets?

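The result has 2 rows, compared with 10 rows in each original dataset: only rows whose gutenberg_author_id appears in both samples are kept.
It contains all columns from both datasets, with the shared key gutenberg_author_id appearing once.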

novel_authors_small %>% inner_join(novels_small)

## Joining with `by = join_by(gutenberg_author_id)`

## # A tibble: 2 × 5
##   gutenberg_author_id author                        birthdate title gutenberg_id
##                 <dbl> <chr>                             <dbl> <chr>        <dbl>
## 1                9034 McIntosh, Maria J. (Maria Ja…      1803 Even…        20018
## 2                1324 Locke, William John                1863 A Ch…        10707
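
The message "Joining with by = join_by(gutenberg_author_id)" appears because no key was named, so dplyr used the column the two tables have in common. The sketch below names the key explicitly with the by argument. The tibbles authors and novels are hypothetical, hand-typed three-row stand-ins built from values printed above (titles are left out because they are truncated in the output).

```r
library(dplyr)

# Hypothetical stand-ins typed from the printed output above
authors <- tibble(
  gutenberg_author_id = c(9034, 1324, 102),
  author = c("McIntosh, Maria J. (Maria Jane)",
             "Locke, William John",
             "Alcott, Louisa May")
)
novels <- tibble(
  gutenberg_author_id = c(9034, 1324, 362),
  gutenberg_id = c(20018, 10707, 4384)
)

# Naming the key makes the join explicit and suppresses the message
joined <- inner_join(authors, novels, by = "gutenberg_author_id")
joined
```

Only the two authors with a matching novel survive. Alcott (102) and the novel by author 362 are dropped because neither has a partner in the other table.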

4. left_join

Describe the resulting data:

Columns: gutenberg_author_id, author, birthdate, title, gutenberg_id
Rows: 10

How is it different from the original two datasets?

All 10 rows of novel_authors_small remain, but only two of them have values in all 5 columns. The remaining 8 rows have NA in the title and gutenberg_id columns because no novel in novels_small matches those authors.

novel_authors_small %>% left_join(novels_small)

## Joining with `by = join_by(gutenberg_author_id)`

## # A tibble: 10 × 5
##    gutenberg_author_id author                       birthdate title gutenberg_id
##                  <dbl> <chr>                            <dbl> <chr>        <dbl>
##  1                8194 Auerbach, Berthold                1812 <NA>            NA
##  2                7088 Parker, Theodore                  1810 <NA>            NA
##  3                9034 McIntosh, Maria J. (Maria J…      1803 Even…        20018
##  4                 102 Alcott, Louisa May                1832 <NA>            NA
##  5                2497 Dawson, Coningsby                 1883 <NA>            NA
##  6                1253 Barclay, Florence L. (Flore…      1862 <NA>            NA
##  7                2044 Finley, Martha                    1828 <NA>            NA
##  8                2241 Frey, Hildegard G.                1891 <NA>            NA
##  9                1324 Locke, William John               1863 A Ch…        10707
## 10                4001 Bacheller, Irving                 1859 <NA>            NA
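
The number of unmatched rows can also be counted rather than read off the printout. This is a sketch under assumptions: authors and novels are hypothetical hand-typed copies of the numeric columns printed above, not the original objects.

```r
library(dplyr)

# Hypothetical copies of the numeric columns shown in the printed tibbles
authors <- tibble(
  gutenberg_author_id = c(8194, 7088, 9034, 102, 2497, 1253, 2044, 2241, 1324, 4001),
  birthdate = c(1812, 1810, 1803, 1832, 1883, 1862, 1828, 1891, 1863, 1859)
)
novels <- tibble(
  gutenberg_author_id = c(362, 7166, 3796, 585, 9034, 1324, 362, 6328, 650, 3293),
  gutenberg_id = c(4384, 23569, 42919, 17937, 20018, 10707, 16134, 15826, 32455, 52935)
)

left <- left_join(authors, novels, by = "gutenberg_author_id")

# Rows that found no partner have NA in the columns coming from novels
n_unmatched <- sum(is.na(left$gutenberg_id))
n_unmatched
```

With these values the count is 8, matching the 8 NA rows in the output above.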

5. right_join

Describe the resulting data:

Columns: gutenberg_author_id, author, birthdate, title, gutenberg_id
Rows: 10

How is it different from the original two datasets?

There are still 10 rows, but they are the rows of novels_small, so the set of gutenberg_author_id values differs from the left_join result.
8 of the 10 rows have missing values (NA) in the author and birthdate columns.

novel_authors_small %>% right_join(novels_small)

## Joining with `by = join_by(gutenberg_author_id)`

## # A tibble: 10 × 5
##    gutenberg_author_id author                       birthdate title gutenberg_id
##                  <dbl> <chr>                            <dbl> <chr>        <dbl>
##  1                9034 McIntosh, Maria J. (Maria J…      1803 "Eve…        20018
##  2                1324 Locke, William John               1863 "A C…        10707
##  3                 362 <NA>                                NA "The…         4384
##  4                7166 <NA>                                NA "Chr…        23569
##  5                3796 <NA>                                NA "Ang…        42919
##  6                 585 <NA>                                NA "The…        17937
##  7                 362 <NA>                                NA "The…        16134
##  8                6328 <NA>                                NA "Unc…        15826
##  9                 650 <NA>                                NA "Chr…        32455
## 10                3293 <NA>                                NA "Mr.…        52935
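
One way to read a right join is as a left join with the two tables swapped. The sketch below checks this on hypothetical hand-typed copies of the numeric columns printed above (authors and novels are assumed names). The two results are compared only after putting columns and rows in the same order, since the join type affects ordering.

```r
library(dplyr)

# Hypothetical copies of the numeric columns shown in the printed tibbles
authors <- tibble(
  gutenberg_author_id = c(8194, 7088, 9034, 102, 2497, 1253, 2044, 2241, 1324, 4001),
  birthdate = c(1812, 1810, 1803, 1832, 1883, 1862, 1828, 1891, 1863, 1859)
)
novels <- tibble(
  gutenberg_author_id = c(362, 7166, 3796, 585, 9034, 1324, 362, 6328, 650, 3293),
  gutenberg_id = c(4384, 23569, 42919, 17937, 20018, 10707, 16134, 15826, 32455, 52935)
)

right <- right_join(authors, novels, by = "gutenberg_author_id")
swapped <- left_join(novels, authors, by = "gutenberg_author_id")

# Align column order and row order before comparing
right_sorted <- arrange(right, gutenberg_id)
swapped_sorted <- arrange(select(swapped, all_of(names(right))), gutenberg_id)
same_content <- isTRUE(all.equal(as.data.frame(right_sorted),
                                 as.data.frame(swapped_sorted)))
same_content
```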

6. full_join

Describe the resulting data:

Columns: gutenberg_author_id, author, birthdate, title, gutenberg_id
Rows: 18

How is it different from the original two datasets?

The two datasets are combined into one table that keeps every row from both.
Two rows have complete information, whereas the remaining 16 are missing either title and gutenberg_id, or author and birthdate.

novel_authors_small %>% full_join(novels_small)

## Joining with `by = join_by(gutenberg_author_id)`

## # A tibble: 18 × 5
##    gutenberg_author_id author                       birthdate title gutenberg_id
##                  <dbl> <chr>                            <dbl> <chr>        <dbl>
##  1                8194 Auerbach, Berthold                1812  <NA>           NA
##  2                7088 Parker, Theodore                  1810  <NA>           NA
##  3                9034 McIntosh, Maria J. (Maria J…      1803 "Eve…        20018
##  4                 102 Alcott, Louisa May                1832  <NA>           NA
##  5                2497 Dawson, Coningsby                 1883  <NA>           NA
##  6                1253 Barclay, Florence L. (Flore…      1862  <NA>           NA
##  7                2044 Finley, Martha                    1828  <NA>           NA
##  8                2241 Frey, Hildegard G.                1891  <NA>           NA
##  9                1324 Locke, William John               1863 "A C…        10707
## 10                4001 Bacheller, Irving                 1859  <NA>           NA
## 11                 362 <NA>                                NA "The…         4384
## 12                7166 <NA>                                NA "Chr…        23569
## 13                3796 <NA>                                NA "Ang…        42919
## 14                 585 <NA>                                NA "The…        17937
## 15                 362 <NA>                                NA "The…        16134
## 16                6328 <NA>                                NA "Unc…        15826
## 17                 650 <NA>                                NA "Chr…        32455
## 18                3293 <NA>                                NA "Mr.…        52935
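
The 18 rows can be accounted for as matched rows plus the unmatched rows from each side: 2 + 8 + 8 = 18. The sketch below checks that arithmetic on hypothetical hand-typed copies of the numeric columns printed above (authors and novels are assumed names, not the original objects).

```r
library(dplyr)

# Hypothetical copies of the numeric columns shown in the printed tibbles
authors <- tibble(
  gutenberg_author_id = c(8194, 7088, 9034, 102, 2497, 1253, 2044, 2241, 1324, 4001),
  birthdate = c(1812, 1810, 1803, 1832, 1883, 1862, 1828, 1891, 1863, 1859)
)
novels <- tibble(
  gutenberg_author_id = c(362, 7166, 3796, 585, 9034, 1324, 362, 6328, 650, 3293),
  gutenberg_id = c(4384, 23569, 42919, 17937, 20018, 10707, 16134, 15826, 32455, 52935)
)

key <- "gutenberg_author_id"
n_matched <- nrow(inner_join(authors, novels, by = key))  # rows with a partner
n_only_authors <- nrow(anti_join(authors, novels, by = key))  # authors without a novel
n_only_novels <- nrow(anti_join(novels, authors, by = key))  # novels without an author
n_full <- nrow(full_join(authors, novels, by = key))

c(matched = n_matched, only_authors = n_only_authors,
  only_novels = n_only_novels, full = n_full)
```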

7. semi_join

Describe the resulting data:

novel_authors_small:

Columns: gutenberg_author_id, author, birthdate
Rows: 2

novels_small:

Columns: gutenberg_author_id, title, gutenberg_id
Rows: 2

How is it different from the original two datasets?

Only 2 rows are returned: the rows whose gutenberg_author_id appears in both datasets.
With novel_authors_small as the first (left) dataset, only that dataset's columns are returned.
With novels_small as the first (left) dataset, only its columns are returned.

novel_authors_small %>% semi_join(novels_small)

## Joining with `by = join_by(gutenberg_author_id)`

## # A tibble: 2 × 3
##   gutenberg_author_id author                          birthdate
##                 <dbl> <chr>                               <dbl>
## 1                9034 McIntosh, Maria J. (Maria Jane)      1803
## 2                1324 Locke, William John                  1863

novels_small %>% semi_join(novel_authors_small)

## Joining with `by = join_by(gutenberg_author_id)`

## # A tibble: 2 × 3
##   gutenberg_author_id title                                         gutenberg_id
##                 <dbl> <chr>                                                <dbl>
## 1                9034 Evenings at Donaldson Manor; Or, The Christm…        20018
## 2                1324 A Christmas Mystery: The Story of Three Wise…        10707
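
Because a semi join only filters the first table, it can be written as a filter with %in%. The sketch below compares the two on hypothetical hand-typed copies of the numeric columns printed above (authors and novels are assumed names).

```r
library(dplyr)

# Hypothetical copies of the numeric columns shown in the printed tibbles
authors <- tibble(
  gutenberg_author_id = c(8194, 7088, 9034, 102, 2497, 1253, 2044, 2241, 1324, 4001),
  birthdate = c(1812, 1810, 1803, 1832, 1883, 1862, 1828, 1891, 1863, 1859)
)
novels <- tibble(
  gutenberg_author_id = c(362, 7166, 3796, 585, 9034, 1324, 362, 6328, 650, 3293),
  gutenberg_id = c(4384, 23569, 42919, 17937, 20018, 10707, 16134, 15826, 32455, 52935)
)

semi <- semi_join(authors, novels, by = "gutenberg_author_id")

# Same idea with filter(): keep authors whose ID occurs in novels
filtered <- filter(authors, gutenberg_author_id %in% novels$gutenberg_author_id)

semi
filtered
```

Both results keep only the columns of authors, which is the main difference from inner_join.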

8. anti_join

Describe the resulting data:

novel_authors_small:

Columns: gutenberg_author_id, author, birthdate
Rows: 8

novels_small:

Columns: gutenberg_author_id, title, gutenberg_id
Rows: 8

How is it different from the original two datasets?

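Each result has 8 rows, compared with 10 rows in the original dataset.
The 2 rows whose gutenberg_author_id appears in both datasets are removed, and only the columns of the first dataset are kept.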

novel_authors_small %>% anti_join(novels_small)

## Joining with `by = join_by(gutenberg_author_id)`

## # A tibble: 8 × 3
##   gutenberg_author_id author                                 birthdate
##                 <dbl> <chr>                                      <dbl>
## 1                8194 Auerbach, Berthold                          1812
## 2                7088 Parker, Theodore                            1810
## 3                 102 Alcott, Louisa May                          1832
## 4                2497 Dawson, Coningsby                           1883
## 5                1253 Barclay, Florence L. (Florence Louisa)      1862
## 6                2044 Finley, Martha                              1828
## 7                2241 Frey, Hildegard G.                          1891
## 8                4001 Bacheller, Irving                           1859

novels_small %>% anti_join(novel_authors_small)

## Joining with `by = join_by(gutenberg_author_id)`

## # A tibble: 8 × 3
##   gutenberg_author_id title                                         gutenberg_id
##                 <dbl> <chr>                                                <dbl>
## 1                 362 "The Lost Word: A Christmas Legend of Long A…         4384
## 2                7166 "Christmas Holidays at Merryvale\nThe Merryv…        23569
## 3                3796 "Angel Unawares: A Story of Christmas Eve"           42919
## 4                 585 "The Thin Santa Claus: The Chicken Yard That…        17937
## 5                 362 "The First Christmas Tree: A Story of the Fo…        16134
## 6                6328 "Uncle Noah's Christmas Inspiration"                 15826
## 7                 650 "Christmas Eve and Christmas Day: Ten Christ…        32455
## 8                3293 "Mr. Blake's Walking-Stick: A Christmas Stor…        52935
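
The anti join is the complement of the semi join: together they split the first table into rows with a match and rows without one. The sketch below checks this for the novels side, using hypothetical hand-typed copies of the numeric columns printed above (authors and novels are assumed names).

```r
library(dplyr)

# Hypothetical copies of the numeric columns shown in the printed tibbles
authors <- tibble(
  gutenberg_author_id = c(8194, 7088, 9034, 102, 2497, 1253, 2044, 2241, 1324, 4001),
  birthdate = c(1812, 1810, 1803, 1832, 1883, 1862, 1828, 1891, 1863, 1859)
)
novels <- tibble(
  gutenberg_author_id = c(362, 7166, 3796, 585, 9034, 1324, 362, 6328, 650, 3293),
  gutenberg_id = c(4384, 23569, 42919, 17937, 20018, 10707, 16134, 15826, 32455, 52935)
)

with_author <- semi_join(novels, authors, by = "gutenberg_author_id")
without_author <- anti_join(novels, authors, by = "gutenberg_author_id")

# Every novel lands in exactly one of the two groups
nrow(with_author) + nrow(without_author) == nrow(novels)

# Equivalent filter(): keep novels whose author ID is absent from authors
filtered <- filter(novels, !(gutenberg_author_id %in% authors$gutenberg_author_id))
filtered
```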


Marlene Sophie Krohn

2026-March-19
