Import two related datasets from TidyTuesday Project.
ratings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2022/2022-01-25/ratings.csv', show_col_types = FALSE)
details <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2022/2022-01-25/details.csv', show_col_types = FALSE)
Describe the two datasets:
Data1: ratings
Data2: details
set.seed(1234)
ratings_small <- ratings %>% select(id, rank, name) %>% sample_n(5)
details_small <- details %>% select(id, minplaytime, yearpublished) %>% sample_n(5)
ratings_small
## # A tibble: 5 × 3
## id rank name
## <dbl> <dbl> <chr>
## 1 7483 6221 Schatz der Drachen
## 2 234292 8733 13 Ghosts
## 3 304985 3541 Dark Ages: Holy Roman Empire
## 4 89403 8189 Big Five
## 5 208663 10188 Godforsaken Scavengers
details_small
## # A tibble: 5 × 3
## id minplaytime yearpublished
## <dbl> <dbl> <dbl>
## 1 213893 40 2017
## 2 256151 20 2018
## 3 143416 25 2015
## 4 180852 30 2016
## 5 98975 180 2012
Describe the resulting data:
How is it different from the original two datasets? There are no rows anymore. This tibble has all the columns from both datasets. It is by random chance that there are no matching id numbers so there can be no inner join.
inner_join(ratings_small, details_small, by = "id")
## # A tibble: 0 × 5
## # ℹ 5 variables: id <dbl>, rank <dbl>, name <chr>, minplaytime <dbl>,
## # yearpublished <dbl>
Describe the resulting data:
How is it different from the original two datasets? It has columns from both tibble’s columns combined, but it only uses ids from the ratings’s ids. Rank and name are preserved from the ratings dataset, but the minplaytime and yearpublished are NA because there were no rows with the ratings_small ids in the details_small table.
left_join(ratings_small, details_small, by = "id")
## # A tibble: 5 × 5
## id rank name minplaytime yearpublished
## <dbl> <dbl> <chr> <dbl> <dbl>
## 1 7483 6221 Schatz der Drachen NA NA
## 2 234292 8733 13 Ghosts NA NA
## 3 304985 3541 Dark Ages: Holy Roman Empire NA NA
## 4 89403 8189 Big Five NA NA
## 5 208663 10188 Godforsaken Scavengers NA NA
Describe the resulting data:
How is it different from the original two datasets? It has columns from both tibble’s columns combined, but it only uses ids from the details’s ids. minplaytime and yearpublished are preserved from the details dataset, but the rank and name are NA because there were no rows with the details_small ids in the ratings_small table.
right_join(ratings_small, details_small, by = "id")
## # A tibble: 5 × 5
## id rank name minplaytime yearpublished
## <dbl> <dbl> <chr> <dbl> <dbl>
## 1 213893 NA <NA> 40 2017
## 2 256151 NA <NA> 20 2018
## 3 143416 NA <NA> 25 2015
## 4 180852 NA <NA> 30 2016
## 5 98975 NA <NA> 180 2012
Describe the resulting data:
How is it different from the original two datasets? It has columns from both tibble’s columns combined. It also has the entire tables stacked on top of each other with the items for the oppsing table’s items as NA. For example, rows in the details table appear with a rank as NA in this table.
full_join(ratings_small, details_small, by = "id")
## # A tibble: 10 × 5
## id rank name minplaytime yearpublished
## <dbl> <dbl> <chr> <dbl> <dbl>
## 1 7483 6221 Schatz der Drachen NA NA
## 2 234292 8733 13 Ghosts NA NA
## 3 304985 3541 Dark Ages: Holy Roman Empire NA NA
## 4 89403 8189 Big Five NA NA
## 5 208663 10188 Godforsaken Scavengers NA NA
## 6 213893 NA <NA> 40 2017
## 7 256151 NA <NA> 20 2018
## 8 143416 NA <NA> 25 2015
## 9 180852 NA <NA> 30 2016
## 10 98975 NA <NA> 180 2012
Describe the resulting data:
How is it different from the original two datasets? It only has columns from the ratings dataset but it still does not have any rows because none of the ids match up.
semi_join(ratings_small, details_small, by = "id")
## # A tibble: 0 × 3
## # ℹ 3 variables: id <dbl>, rank <dbl>, name <chr>
Describe the resulting data:
How is it different from the original two datasets? It has only the columns from the ratings dataset, but it has all rows that do not share an id with a row from the details table.
anti_join(ratings_small, details_small, by = "id")
## # A tibble: 5 × 3
## id rank name
## <dbl> <dbl> <chr>
## 1 7483 6221 Schatz der Drachen
## 2 234292 8733 13 Ghosts
## 3 304985 3541 Dark Ages: Holy Roman Empire
## 4 89403 8189 Big Five
## 5 208663 10188 Godforsaken Scavengers