Import two related datasets from the TidyTuesday project.
gutenberg_languages = read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-06-03/gutenberg_languages.csv")
## Rows: 76205 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): language
## dbl (2): gutenberg_id, total_languages
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gutenberg_metadata = read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-06-03/gutenberg_metadata.csv")
## Rows: 79491 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): title, author, language, gutenberg_bookshelf, rights
## dbl (2): gutenberg_id, gutenberg_author_id
## lgl (1): has_text
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
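The column specification message printed above can be silenced in either of the two ways it suggests. A minimal sketch, using a small made-up inline CSV (not the real files) so that it runs without a network connection:

```r
library(readr)

# Made-up two-row CSV standing in for gutenberg_languages.csv
csv_text <- I("gutenberg_id,language,total_languages\n1,en,1\n2,fr,1\n")

# Option 1: keep type guessing but hide the message
quiet_guess <- read_csv(csv_text, show_col_types = FALSE)

# Option 2: state the column types explicitly, which also hides the message
quiet_spec <- read_csv(
  csv_text,
  col_types = cols(
    gutenberg_id = col_double(),
    language = col_character(),
    total_languages = col_double()
  )
)
```

Stating the types explicitly has the side benefit that a changed upstream file is less likely to be silently read with different column types.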
Describe the two datasets:
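Data 1: languages (gutenberg_languages), 76,205 rows and 3 columns: gutenberg_id, language, total_languages
Data 2: metadata (gutenberg_metadata), 79,491 rows and 8 columns, including gutenberg_id, language, title, and author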
set.seed(1234)
languages_small <- gutenberg_languages %>% select(gutenberg_id, language, total_languages) %>% sample_n(10)
metadata_small <- gutenberg_metadata %>% select(gutenberg_id, language, title) %>% sample_n(10)
languages_small
## # A tibble: 10 × 3
## gutenberg_id language total_languages
## <dbl> <chr> <dbl>
## 1 41876 en 1
## 2 15239 fr 1
## 3 33643 en 1
## 4 68029 en 1
## 5 59821 en 1
## 6 68415 fi 1
## 7 17360 fr 1
## 8 33190 en 1
## 9 49686 en 1
## 10 16943 en 1
metadata_small
## # A tibble: 10 × 3
## gutenberg_id language title
## <dbl> <chr> <chr>
## 1 30757 fi "Perhe: Kuvauksia jokapäiväisestä elämästä"
## 2 48212 fr "Le feu (Journal d'une Escouade)"
## 3 31950 en "Encyclopaedia Britannica, 11th Edition, \"Columbus\" …
## 4 47318 de "Deutsche Humoristen, 2. Band (von 8)"
## 5 65485 da "Eneboerne"
## 6 874 en "A History of Aeronautics"
## 7 44519 fr "L'Illustration, No. 2501, 31 Janvier 1891"
## 8 35731 en "Blackwood's Edinburgh Magazine, Volume 60, No. 370, A…
## 9 49300 en "Biscayne Bay, Dade Co., Florida, Between the 25th and…
## 10 67214 en "The Book of History (Vol. 1 of 18)\r\nA History of Al…
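The two samples printed above are drawn independently, 10 rows each out of tens of thousands, so they are unlikely to share a gutenberg_id. If overlapping keys are wanted, one possible approach is to sample the ids first and then filter both tables to them. The sketch below uses hypothetical toy tables (`languages_toy`, `metadata_toy`) rather than the real data:

```r
library(dplyr)

set.seed(1234)

# Hypothetical stand-ins for gutenberg_languages and gutenberg_metadata
languages_toy <- tibble(
  gutenberg_id = 1:100,
  language = "en",
  total_languages = 1
)
metadata_toy <- tibble(
  gutenberg_id = 51:150,
  language = "en",
  title = paste("Book", 51:150)
)

# Sample from the ids present in both tables, then filter each table to them
shared_ids <- intersect(languages_toy$gutenberg_id, metadata_toy$gutenberg_id)
sampled_ids <- sample(shared_ids, 10)

languages_overlap <- languages_toy %>% filter(gutenberg_id %in% sampled_ids)
metadata_overlap <- metadata_toy %>% filter(gutenberg_id %in% sampled_ids)
```

With samples built this way every row has a match, so the different joins would give much the same rows; mixing in a few unshared ids would be needed to see them differ.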
Describe the resulting data:
How is it different from the original two datasets?
There are 0 rows instead of 10, and 4 columns: the three from the languages sample plus title.
inner_join keeps only the rows that have a match in both datasets on both gutenberg_id and language. The two samples share no such pair, so nothing is kept.
languages_small %>% inner_join(metadata_small, by = c("gutenberg_id", "language"))
## # A tibble: 0 × 4
## # ℹ 4 variables: gutenberg_id <dbl>, language <chr>, total_languages <dbl>,
## # title <chr>
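Joining on two keys means both must agree for a row to count as a match. A small sketch on hypothetical toy tables (the names and values are made up for illustration) shows the difference between joining on both keys and on the id alone:

```r
library(dplyr)

# Hypothetical toy tables; ids 2 and 3 appear in both
languages_toy <- tibble(
  gutenberg_id = c(1, 2, 3),
  language = c("en", "fr", "en"),
  total_languages = c(1, 1, 1)
)
metadata_toy <- tibble(
  gutenberg_id = c(2, 3, 4),
  language = c("fr", "de", "en"),
  title = c("Book B", "Book C", "Book D")
)

# Both keys must agree: id 3 is dropped because "en" != "de"
both_keys <- languages_toy %>%
  inner_join(metadata_toy, by = c("gutenberg_id", "language"))

# Joining on the id alone keeps ids 2 and 3; the two language columns
# are kept side by side with the default .x / .y suffixes
id_only <- languages_toy %>%
  inner_join(metadata_toy, by = "gutenberg_id")
```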
Describe the resulting data:
How is it different from the original two datasets?
It has the same 10 rows as the languages sample plus a title column. Every title is NA because no row has a match in the metadata sample.
languages_small %>% left_join(metadata_small, by = c("gutenberg_id", "language"))
## # A tibble: 10 × 4
## gutenberg_id language total_languages title
## <dbl> <chr> <dbl> <chr>
## 1 41876 en 1 <NA>
## 2 15239 fr 1 <NA>
## 3 33643 en 1 <NA>
## 4 68029 en 1 <NA>
## 5 59821 en 1 <NA>
## 6 68415 fi 1 <NA>
## 7 17360 fr 1 <NA>
## 8 33190 en 1 <NA>
## 9 49686 en 1 <NA>
## 10 16943 en 1 <NA>
Describe the resulting data:
How is it different from the original two datasets?
It has the same 10 rows as the metadata sample plus a total_languages column. Every total_languages value is NA because no row has a match in the languages sample.
languages_small %>% right_join(metadata_small, by = c("gutenberg_id", "language"))
## # A tibble: 10 × 4
## gutenberg_id language total_languages title
## <dbl> <chr> <dbl> <chr>
## 1 30757 fi NA "Perhe: Kuvauksia jokapäiväisestä eläm…
## 2 48212 fr NA "Le feu (Journal d'une Escouade)"
## 3 31950 en NA "Encyclopaedia Britannica, 11th Editio…
## 4 47318 de NA "Deutsche Humoristen, 2. Band (von 8)"
## 5 65485 da NA "Eneboerne"
## 6 874 en NA "A History of Aeronautics"
## 7 44519 fr NA "L'Illustration, No. 2501, 31 Janvier …
## 8 35731 en NA "Blackwood's Edinburgh Magazine, Volum…
## 9 49300 en NA "Biscayne Bay, Dade Co., Florida, Betw…
## 10 67214 en NA "The Book of History (Vol. 1 of 18)\r\…
Describe the resulting data:
How is it different from the original two datasets?
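It contains every row from both samples: 20 rows (10 + 10, since no rows match) and 4 columns. Rows from the languages sample have NA for title, and rows from the metadata sample have NA for total_languages.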
languages_small %>% full_join(metadata_small, by = c("gutenberg_id", "language"))
## # A tibble: 20 × 4
## gutenberg_id language total_languages title
## <dbl> <chr> <dbl> <chr>
## 1 41876 en 1 <NA>
## 2 15239 fr 1 <NA>
## 3 33643 en 1 <NA>
## 4 68029 en 1 <NA>
## 5 59821 en 1 <NA>
## 6 68415 fi 1 <NA>
## 7 17360 fr 1 <NA>
## 8 33190 en 1 <NA>
## 9 49686 en 1 <NA>
## 10 16943 en 1 <NA>
## 11 30757 fi NA "Perhe: Kuvauksia jokapäiväisestä eläm…
## 12 48212 fr NA "Le feu (Journal d'une Escouade)"
## 13 31950 en NA "Encyclopaedia Britannica, 11th Editio…
## 14 47318 de NA "Deutsche Humoristen, 2. Band (von 8)"
## 15 65485 da NA "Eneboerne"
## 16 874 en NA "A History of Aeronautics"
## 17 44519 fr NA "L'Illustration, No. 2501, 31 Janvier …
## 18 35731 en NA "Blackwood's Edinburgh Magazine, Volum…
## 19 49300 en NA "Biscayne Bay, Dade Co., Florida, Betw…
## 20 67214 en NA "The Book of History (Vol. 1 of 18)\r\…
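The 20 rows above follow from a simple count: when each key combination is unique within its table, a full join has as many rows as the two tables combined minus the matched rows (here 10 + 10 - 0). A sketch of that check on hypothetical toy tables with one matching row:

```r
library(dplyr)

# Hypothetical toy tables; only id 3 / "en" appears in both
languages_toy <- tibble(
  gutenberg_id = c(1, 2, 3),
  language = c("en", "fr", "en"),
  total_languages = c(1, 1, 1)
)
metadata_toy <- tibble(
  gutenberg_id = c(3, 4),
  language = c("en", "de"),
  title = c("Book C", "Book D")
)

keys <- c("gutenberg_id", "language")

n_matched <- nrow(inner_join(languages_toy, metadata_toy, by = keys))
full_toy <- full_join(languages_toy, metadata_toy, by = keys)

# Assumes unique key combinations in each table: 3 + 2 - 1 = 4
expected_rows <- nrow(languages_toy) + nrow(metadata_toy) - n_matched
nrow(full_toy) == expected_rows
```

The identity does not hold if a key combination is repeated in either table, because each repeat multiplies the matched rows.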
Describe the resulting data:
How is it different from the original two datasets?
It keeps only the columns of the languages sample and only the rows that have a match in the metadata sample. The metadata is used solely to check for matches; since there are none, the result has 0 rows.
languages_small %>% semi_join(metadata_small, by = c("gutenberg_id", "language"))
## # A tibble: 0 × 3
## # ℹ 3 variables: gutenberg_id <dbl>, language <chr>, total_languages <dbl>
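Because both results are empty here, semi_join can look the same as inner_join. A sketch on hypothetical toy tables, where one id has two metadata rows, shows how they differ once there are matches:

```r
library(dplyr)

# Hypothetical toy tables; id 2 has two rows in the metadata table
languages_toy <- tibble(
  gutenberg_id = c(1, 2),
  language = c("en", "fr"),
  total_languages = c(1, 1)
)
metadata_toy <- tibble(
  gutenberg_id = c(2, 2),
  language = c("fr", "fr"),
  title = c("Book B, Vol. 1", "Book B, Vol. 2")
)

keys <- c("gutenberg_id", "language")

# semi_join filters: one row, and only the columns of languages_toy
semi_toy <- semi_join(languages_toy, metadata_toy, by = keys)

# inner_join combines: one row per matching pair, plus the title column
inner_toy <- inner_join(languages_toy, metadata_toy, by = keys)
```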
Describe the resulting data:
How is it different from the original two datasets?
It is identical to the languages sample. anti_join keeps only the rows that have no match in the metadata sample; here none of the 10 rows has a match, so all of them are kept.
languages_small %>% anti_join(metadata_small, by = c("gutenberg_id", "language"))
## # A tibble: 10 × 3
## gutenberg_id language total_languages
## <dbl> <chr> <dbl>
## 1 41876 en 1
## 2 15239 fr 1
## 3 33643 en 1
## 4 68029 en 1
## 5 59821 en 1
## 6 68415 fi 1
## 7 17360 fr 1
## 8 33190 en 1
## 9 49686 en 1
## 10 16943 en 1
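semi_join and anti_join are complements: between them they split the first table into matched and unmatched rows. A minimal sketch on hypothetical toy tables, which is also a handy way to see which keys fail to match before running a mutating join:

```r
library(dplyr)

# Hypothetical toy tables; only id 3 / "en" appears in both
languages_toy <- tibble(
  gutenberg_id = c(1, 2, 3),
  language = c("en", "fr", "en"),
  total_languages = c(1, 1, 1)
)
metadata_toy <- tibble(
  gutenberg_id = c(3, 4),
  language = c("en", "de"),
  title = c("Book C", "Book D")
)

keys <- c("gutenberg_id", "language")

matched <- semi_join(languages_toy, metadata_toy, by = keys)    # id 3
unmatched <- anti_join(languages_toy, metadata_toy, by = keys)  # ids 1 and 2

# Every row of languages_toy lands in exactly one of the two results
nrow(matched) + nrow(unmatched) == nrow(languages_toy)
```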