Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from the TidyTuesday project.

# read_csv() and %>% come from the tidyverse
library(tidyverse)

gutenberg_languages <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-06-03/gutenberg_languages.csv")

## Rows: 76205 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): language
## dbl (2): gutenberg_id, total_languages
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

gutenberg_metadata <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-06-03/gutenberg_metadata.csv")

## Rows: 79491 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): title, author, language, gutenberg_bookshelf, rights
## dbl (2): gutenberg_id, gutenberg_author_id
## lgl (1): has_text
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Make data small

Describe the two datasets:

Data 1: languages

Columns: gutenberg_id, language, total_languages
Rows: 10

Data 2: metadata

Columns: gutenberg_id, language, title
Rows: 10

set.seed(1234)
# Keep three columns from each dataset and draw 10 random rows
languages_small <- gutenberg_languages %>%
    select(gutenberg_id, language, total_languages) %>%
    sample_n(10)
metadata_small <- gutenberg_metadata %>%
    select(gutenberg_id, language, title) %>%
    sample_n(10)

languages_small

## # A tibble: 10 × 3
##    gutenberg_id language total_languages
##           <dbl> <chr>              <dbl>
##  1        41876 en                     1
##  2        15239 fr                     1
##  3        33643 en                     1
##  4        68029 en                     1
##  5        59821 en                     1
##  6        68415 fi                     1
##  7        17360 fr                     1
##  8        33190 en                     1
##  9        49686 en                     1
## 10        16943 en                     1

metadata_small

## # A tibble: 10 × 3
##    gutenberg_id language title                                                  
##           <dbl> <chr>    <chr>                                                  
##  1        30757 fi       "Perhe: Kuvauksia jokapäiväisestä elämästä"            
##  2        48212 fr       "Le feu (Journal d'une Escouade)"                      
##  3        31950 en       "Encyclopaedia Britannica, 11th Edition, \"Columbus\" …
##  4        47318 de       "Deutsche Humoristen, 2. Band (von 8)"                 
##  5        65485 da       "Eneboerne"                                            
##  6          874 en       "A History of Aeronautics"                             
##  7        44519 fr       "L'Illustration, No. 2501, 31 Janvier 1891"            
##  8        35731 en       "Blackwood's Edinburgh Magazine, Volume 60, No. 370, A…
##  9        49300 en       "Biscayne Bay, Dade Co., Florida, Between the 25th and…
## 10        67214 en       "The Book of History (Vol. 1 of 18)\r\nA History of Al…
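The two samples above are drawn independently, so they are unlikely to share any gutenberg_id. A minimal sketch of one alternative, assuming the goal is small tables that do contain matches: sample from the keys that both tables share, then filter each table down to those keys. The toy tables and the names `languages_toy`, `metadata_toy`, `shared_keys`, and `keys_small` are hypothetical stand-ins, not part of the original assignment.

```r
library(dplyr)
library(tibble)

# Hypothetical toy stand-ins for the two full datasets
languages_toy <- tibble(
    gutenberg_id    = 1:6,
    language        = c("en", "fr", "en", "fi", "de", "en"),
    total_languages = 1
)
metadata_toy <- tibble(
    gutenberg_id = 3:8,
    language     = c("en", "fi", "de", "en", "fr", "en"),
    title        = paste("Book", 3:8)
)

keys <- c("gutenberg_id", "language")

# Key pairs that appear in both tables
shared_keys <- languages_toy %>%
    select(all_of(keys)) %>%
    inner_join(select(metadata_toy, all_of(keys)), by = keys)

# Sample from the shared keys, then keep only those rows in each table
set.seed(1234)
keys_small <- shared_keys %>% slice_sample(n = 3)

languages_matched <- languages_toy %>% semi_join(keys_small, by = keys)
metadata_matched  <- metadata_toy %>% semi_join(keys_small, by = keys)
```

With samples built this way, every row of `languages_matched` has a partner in `metadata_matched`, so the joins in the following sections would return non-empty results.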

3. inner_join

Describe the resulting data:

Columns: gutenberg_id, language, total_languages, title
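Rows: 0

How is it different from the original two datasets?

There are 0 rows instead of 10, and the result has four columns because it combines the columns of both datasets.

inner_join keeps only rows whose gutenberg_id and language both match in the two datasets. The two random samples share no such pair, so nothing is kept.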

languages_small %>% inner_join(metadata_small, by = c("gutenberg_id", "language"))

## # A tibble: 0 × 4
## # ℹ 4 variables: gutenberg_id <dbl>, language <chr>, total_languages <dbl>,
## #   title <chr>
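Because the result here is empty, it may help to see inner_join on data that does overlap. The following is a small sketch on hypothetical toy tables `x` and `y` (made-up values, not the Gutenberg data). Note that id 3 appears in both tables but with different languages, so it does not count as a match when joining on both keys.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables; only (2, "fr") matches on both keys
x <- tibble(
    gutenberg_id    = c(1, 2, 3),
    language        = c("en", "fr", "en"),
    total_languages = c(1, 1, 2)
)
y <- tibble(
    gutenberg_id = c(2, 3, 4),
    language     = c("fr", "de", "en"),
    title        = c("Book B", "Book C", "Book D")
)

# Keeps only rows where gutenberg_id AND language match in both tables
matched <- x %>% inner_join(y, by = c("gutenberg_id", "language"))
matched
```

In this sketch `matched` has one row (id 2) and four columns, since the non-key columns of both tables are carried over.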

4. left_join

Describe the resulting data:

Columns: gutenberg_id, language, total_languages, title
Rows: 10

How is it different from the original two datasets?

It has the same 10 rows as the languages data, with a title column added. Every title is NA because no row in the metadata sample matches.

languages_small %>% left_join(metadata_small, by = c("gutenberg_id", "language"))

## # A tibble: 10 × 4
##    gutenberg_id language total_languages title
##           <dbl> <chr>              <dbl> <chr>
##  1        41876 en                     1 <NA> 
##  2        15239 fr                     1 <NA> 
##  3        33643 en                     1 <NA> 
##  4        68029 en                     1 <NA> 
##  5        59821 en                     1 <NA> 
##  6        68415 fi                     1 <NA> 
##  7        17360 fr                     1 <NA> 
##  8        33190 en                     1 <NA> 
##  9        49686 en                     1 <NA> 
## 10        16943 en                     1 <NA>
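As a sketch of what left_join looks like when some rows do match, here is the same idea on hypothetical toy tables `x` and `y` (made-up values). Counting the non-missing titles is one rough way to see how many rows found a partner, assuming title is never missing in `y` itself.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables; only (2, "fr") matches on both keys
x <- tibble(
    gutenberg_id    = c(1, 2, 3),
    language        = c("en", "fr", "en"),
    total_languages = c(1, 1, 2)
)
y <- tibble(
    gutenberg_id = c(2, 3, 4),
    language     = c("fr", "de", "en"),
    title        = c("Book B", "Book C", "Book D")
)

# All rows of x are kept; title is filled in only where a match exists
joined <- x %>% left_join(y, by = c("gutenberg_id", "language"))

# Number of x rows that found a match (assumes y$title has no NA of its own)
n_matched <- sum(!is.na(joined$title))
```

Here `joined` keeps all three rows of `x`, and only the row for id 2 gets a title.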

5. right_join

Describe the resulting data:

Columns: gutenberg_id, language, total_languages, title
Rows: 10

How is it different from the original two datasets?

It has the same 10 rows as the metadata dataset, with a total_languages column added. Every total_languages value is NA because no row in the languages sample matches.

languages_small %>% right_join(metadata_small, by = c("gutenberg_id", "language"))

## # A tibble: 10 × 4
##    gutenberg_id language total_languages title                                  
##           <dbl> <chr>              <dbl> <chr>                                  
##  1        30757 fi                    NA "Perhe: Kuvauksia jokapäiväisestä eläm…
##  2        48212 fr                    NA "Le feu (Journal d'une Escouade)"      
##  3        31950 en                    NA "Encyclopaedia Britannica, 11th Editio…
##  4        47318 de                    NA "Deutsche Humoristen, 2. Band (von 8)" 
##  5        65485 da                    NA "Eneboerne"                            
##  6          874 en                    NA "A History of Aeronautics"             
##  7        44519 fr                    NA "L'Illustration, No. 2501, 31 Janvier …
##  8        35731 en                    NA "Blackwood's Edinburgh Magazine, Volum…
##  9        49300 en                    NA "Biscayne Bay, Dade Co., Florida, Betw…
## 10        67214 en                    NA "The Book of History (Vol. 1 of 18)\r\…
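One way to think about right_join is as a left_join with the two tables swapped. The sketch below checks this on hypothetical toy tables `x` and `y` (made-up values): after putting the columns and rows in the same order, the two results hold the same data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables; only (2, "fr") matches on both keys
x <- tibble(
    gutenberg_id    = c(1, 2, 3),
    language        = c("en", "fr", "en"),
    total_languages = c(1, 1, 2)
)
y <- tibble(
    gutenberg_id = c(2, 3, 4),
    language     = c("fr", "de", "en"),
    title        = c("Book B", "Book C", "Book D")
)

keys <- c("gutenberg_id", "language")

# right_join keeps every row of y
right <- x %>%
    right_join(y, by = keys) %>%
    arrange(gutenberg_id, language)

# Swapping the tables and using left_join gives the same rows;
# only the column order differs, so reorder before comparing
flipped <- y %>%
    left_join(x, by = keys) %>%
    select(gutenberg_id, language, total_languages, title) %>%
    arrange(gutenberg_id, language)

same_rows <- isTRUE(all.equal(as.data.frame(right), as.data.frame(flipped)))
```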

6. full_join

Describe the resulting data:

Columns: gutenberg_id, language, total_languages, title
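Rows: 20

How is it different from the original two datasets?

It keeps every row from both datasets in one table. Because nothing matches, the result has all 10 languages rows (with title set to NA) followed by all 10 metadata rows (with total_languages set to NA).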

languages_small %>% full_join(metadata_small, by = c("gutenberg_id", "language"))

## # A tibble: 20 × 4
##    gutenberg_id language total_languages title                                  
##           <dbl> <chr>              <dbl> <chr>                                  
##  1        41876 en                     1  <NA>                                  
##  2        15239 fr                     1  <NA>                                  
##  3        33643 en                     1  <NA>                                  
##  4        68029 en                     1  <NA>                                  
##  5        59821 en                     1  <NA>                                  
##  6        68415 fi                     1  <NA>                                  
##  7        17360 fr                     1  <NA>                                  
##  8        33190 en                     1  <NA>                                  
##  9        49686 en                     1  <NA>                                  
## 10        16943 en                     1  <NA>                                  
## 11        30757 fi                    NA "Perhe: Kuvauksia jokapäiväisestä eläm…
## 12        48212 fr                    NA "Le feu (Journal d'une Escouade)"      
## 13        31950 en                    NA "Encyclopaedia Britannica, 11th Editio…
## 14        47318 de                    NA "Deutsche Humoristen, 2. Band (von 8)" 
## 15        65485 da                    NA "Eneboerne"                            
## 16          874 en                    NA "A History of Aeronautics"             
## 17        44519 fr                    NA "L'Illustration, No. 2501, 31 Janvier …
## 18        35731 en                    NA "Blackwood's Edinburgh Magazine, Volum…
## 19        49300 en                    NA "Biscayne Bay, Dade Co., Florida, Betw…
## 20        67214 en                    NA "The Book of History (Vol. 1 of 18)\r\…
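The row count of a full join can be worked out from the other joins: when each key pair appears at most once per table, it is the rows of the first table plus the rows of the second, minus the matches. For the output above that is 10 + 10 - 0 = 20. A small sketch on hypothetical toy tables `x` and `y` (made-up values) checks the same arithmetic when there is one match.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables; only (2, "fr") matches on both keys
x <- tibble(
    gutenberg_id    = c(1, 2, 3),
    language        = c("en", "fr", "en"),
    total_languages = c(1, 1, 2)
)
y <- tibble(
    gutenberg_id = c(2, 3, 4),
    language     = c("fr", "de", "en"),
    title        = c("Book B", "Book C", "Book D")
)

keys <- c("gutenberg_id", "language")

n_matched <- nrow(inner_join(x, y, by = keys))
full      <- full_join(x, y, by = keys)

# Holds when key pairs are unique within each table
expected_rows <- nrow(x) + nrow(y) - n_matched
```

In this sketch `expected_rows` is 3 + 3 - 1 = 5, which agrees with `nrow(full)`.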

7. semi_join

Describe the resulting data:

Columns: gutenberg_id, language, total_languages
Rows: 0

How is it different from the original two datasets?

It keeps only the columns of the languages data and filters its rows to those that have a match in the metadata dataset. The metadata is used only to check for matches, and since there are none, no rows remain.

languages_small %>% semi_join(metadata_small, by = c("gutenberg_id", "language"))

## # A tibble: 0 × 3
## # ℹ 3 variables: gutenberg_id <dbl>, language <chr>, total_languages <dbl>

8. anti_join

Describe the resulting data:

Columns: gutenberg_id, language, total_languages
Rows: 10

How is it different from the original two datasets?

It is identical to the languages dataset. anti_join keeps only the rows that have no match in the metadata dataset, and here none of the 10 rows has a match.

languages_small %>% anti_join(metadata_small, by = c("gutenberg_id", "language"))

## # A tibble: 10 × 3
##    gutenberg_id language total_languages
##           <dbl> <chr>              <dbl>
##  1        41876 en                     1
##  2        15239 fr                     1
##  3        33643 en                     1
##  4        68029 en                     1
##  5        59821 en                     1
##  6        68415 fi                     1
##  7        17360 fr                     1
##  8        33190 en                     1
##  9        49686 en                     1
## 10        16943 en                     1
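semi_join and anti_join are complements: between them they split the first table into the rows that have a match and the rows that do not. The sketch below shows this on hypothetical toy tables `x` and `y` (made-up values); stacking the two pieces back together recovers the original rows of `x`.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables; only (2, "fr") matches on both keys
x <- tibble(
    gutenberg_id    = c(1, 2, 3),
    language        = c("en", "fr", "en"),
    total_languages = c(1, 1, 2)
)
y <- tibble(
    gutenberg_id = c(2, 3, 4),
    language     = c("fr", "de", "en"),
    title        = c("Book B", "Book C", "Book D")
)

keys <- c("gutenberg_id", "language")

kept    <- x %>% semi_join(y, by = keys)   # rows of x with a match in y
dropped <- x %>% anti_join(y, by = keys)   # rows of x without a match in y

# Neither join adds columns from y, so the pieces stack back into x
rebuilt <- bind_rows(kept, dropped) %>% arrange(gutenberg_id)
```

This also explains the results above: with no matches at all, semi_join returns 0 rows and anti_join returns all 10.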


Connor Strobel

2022-03-19
