Import two related datasets from the TidyTuesday project.
episodes <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2023/2023-01-24/episodes.csv')
## Rows: 98 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): version, title, quote, author
## dbl (6): season, episode_number_overall, episode, viewers, imdb_rating, n_r...
## date (1): air_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
seasons <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2023/2023-01-24/seasons.csv')
## Rows: 9 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): version, location, country
## dbl (4): season, n_survivors, lat, lon
## date (1): date_drop_off
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Describe the two datasets:
Data1: Episodes (98 rows, 11 columns)
Data2: Seasons (9 rows, 8 columns)
library(dplyr)
set.seed(1234)
episodes_small <- episodes %>% select(title, air_date, episode, season) %>% sample_n(10)
seasons_small <- seasons %>% select(location, season, version) %>% sample_n(9)
episodes_small
## # A tibble: 10 × 4
## title air_date episode season
## <chr> <date> <dbl> <dbl>
## 1 Outfoxed 2016-12-29 4 3
## 2 Far From Home 2021-06-24 4 8
## 3 Winter's Fury 2016-07-07 11 2
## 4 The Freeze 2015-08-06 9 1
## 5 Winds of Hell 2015-07-16 5 1
## 6 The Last Mile 2017-07-06 4 4
## 7 Storm Rising 2016-05-19 5 2
## 8 Stalked 2015-07-09 4 1
## 9 All In 2021-08-12 10 8
## 10 The Rock 2020-07-09 5 7
seasons_small
## # A tibble: 9 × 3
## location season version
## <chr> <dbl> <chr>
## 1 Quatsino 4 US
## 2 Great Slave Lake 6 US
## 3 Chilko Lake 8 US
## 4 Great Slave Lake 7 US
## 5 Nunatsiavut 9 US
## 6 Selenge Province 5 US
## 7 Quatsino 1 US
## 8 Quatsino 2 US
## 9 Patagonia 3 US
Describe the resulting data:
How is it different from the original two datasets?
Unlike either original dataset, the inner join keeps only the observations whose key, season, appears in both tables, and the result contains the columns of both: the four from episodes_small plus location and version from seasons_small. Every sampled episode has a matching season here, so all 10 rows remain.
episodes_small %>% inner_join(seasons_small)
## Joining with `by = join_by(season)`
## # A tibble: 10 × 6
## title air_date episode season location version
## <chr> <date> <dbl> <dbl> <chr> <chr>
## 1 Outfoxed 2016-12-29 4 3 Patagonia US
## 2 Far From Home 2021-06-24 4 8 Chilko Lake US
## 3 Winter's Fury 2016-07-07 11 2 Quatsino US
## 4 The Freeze 2015-08-06 9 1 Quatsino US
## 5 Winds of Hell 2015-07-16 5 1 Quatsino US
## 6 The Last Mile 2017-07-06 4 4 Quatsino US
## 7 Storm Rising 2016-05-19 5 2 Quatsino US
## 8 Stalked 2015-07-09 4 1 Quatsino US
## 9 All In 2021-08-12 10 8 Chilko Lake US
## 10 The Rock 2020-07-09 5 7 Great Slave Lake US
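As a minimal sketch of the same behavior on toy data (the tibbles `x` and `y` and their values are hypothetical stand-ins, not the TidyTuesday data), an inner join drops any row whose key has no partner in the other table. Passing `by = "season"` names the key explicitly instead of letting dplyr pick it.

```r
library(dplyr)

# Hypothetical toy tables standing in for episodes_small and seasons_small
x <- tibble::tibble(title = c("A", "B", "C"), season = c(1, 1, 4))
y <- tibble::tibble(season = c(1, 2), location = c("Quatsino", "Quatsino"))

# Only season 1 appears in both tables, so title "C" (season 4)
# and the season 2 row of y are both dropped
inner <- inner_join(x, y, by = "season")
inner
```

The result carries the columns of both tables, with the key column appearing once.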
Describe the resulting data:
How is it different from the original two datasets?
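The left join keeps every observation from x (episodes_small) and adds the columns of y (seasons_small) wherever season matches; observations in y with no matching episode are dropped. Because every sampled episode has a matching season, the result here is identical to the inner join.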
episodes_small %>% left_join(seasons_small)
## Joining with `by = join_by(season)`
## # A tibble: 10 × 6
## title air_date episode season location version
## <chr> <date> <dbl> <dbl> <chr> <chr>
## 1 Outfoxed 2016-12-29 4 3 Patagonia US
## 2 Far From Home 2021-06-24 4 8 Chilko Lake US
## 3 Winter's Fury 2016-07-07 11 2 Quatsino US
## 4 The Freeze 2015-08-06 9 1 Quatsino US
## 5 Winds of Hell 2015-07-16 5 1 Quatsino US
## 6 The Last Mile 2017-07-06 4 4 Quatsino US
## 7 Storm Rising 2016-05-19 5 2 Quatsino US
## 8 Stalked 2015-07-09 4 1 Quatsino US
## 9 All In 2021-08-12 10 8 Chilko Lake US
## 10 The Rock 2020-07-09 5 7 Great Slave Lake US
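Because every episode found a match above, the output does not show what a left join does with an unmatched row. A small sketch on hypothetical toy data (the names `x`, `y`, and the values are assumptions for illustration) makes that case visible:

```r
library(dplyr)

# Hypothetical toy tables; season 10 exists only in x, season 3 only in y
x <- tibble::tibble(title = c("A", "B", "C"), season = c(1, 2, 10))
y <- tibble::tibble(season = c(1, 2, 3),
                    location = c("Quatsino", "Quatsino", "Patagonia"))

# All three rows of x survive; the unmatched one gets NA in y's columns
left <- left_join(x, y, by = "season")
left
```

The season 3 row of `y` does not appear, since a left join only keeps keys present in `x`.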
Describe the resulting data:
How is it different from the original two datasets?
The right join keeps every observation from y (seasons_small). Seasons 5, 6, and 9 have no matching episode in x (episodes_small), so they are appended as the last three rows with NA in the columns that come from x (title, air_date, episode).
episodes_small %>% right_join(seasons_small)
## Joining with `by = join_by(season)`
## # A tibble: 13 × 6
## title air_date episode season location version
## <chr> <date> <dbl> <dbl> <chr> <chr>
## 1 Outfoxed 2016-12-29 4 3 Patagonia US
## 2 Far From Home 2021-06-24 4 8 Chilko Lake US
## 3 Winter's Fury 2016-07-07 11 2 Quatsino US
## 4 The Freeze 2015-08-06 9 1 Quatsino US
## 5 Winds of Hell 2015-07-16 5 1 Quatsino US
## 6 The Last Mile 2017-07-06 4 4 Quatsino US
## 7 Storm Rising 2016-05-19 5 2 Quatsino US
## 8 Stalked 2015-07-09 4 1 Quatsino US
## 9 All In 2021-08-12 10 8 Chilko Lake US
## 10 The Rock 2020-07-09 5 7 Great Slave Lake US
## 11 <NA> NA NA 6 Great Slave Lake US
## 12 <NA> NA NA 9 Nunatsiavut US
## 13 <NA> NA NA 5 Selenge Province US
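One way to confirm which side the extra rows come from is to filter on a column that only x supplies. The following is a sketch on hypothetical toy data (`x`, `y`, and their values are illustrative assumptions):

```r
library(dplyr)

# Hypothetical toy tables; season 3 exists only in y
x <- tibble::tibble(title = c("A", "B"), season = c(1, 2))
y <- tibble::tibble(season = c(1, 2, 3),
                    location = c("Quatsino", "Quatsino", "Patagonia"))

right <- right_join(x, y, by = "season")

# Rows where an x-only column is NA are the ones that exist only in y
only_in_y <- filter(right, is.na(title))
only_in_y
```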
Describe the resulting data:
How is it different from the original two datasets?
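The full join keeps all observations from both x (episodes_small) and y (seasons_small), filling in NA wherever one side has no match. Since every sampled episode already has a matching season, only the unmatched seasons add rows, and the result is the same 13 rows as the right join.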
episodes_small %>% full_join(seasons_small)
## Joining with `by = join_by(season)`
## # A tibble: 13 × 6
## title air_date episode season location version
## <chr> <date> <dbl> <dbl> <chr> <chr>
## 1 Outfoxed 2016-12-29 4 3 Patagonia US
## 2 Far From Home 2021-06-24 4 8 Chilko Lake US
## 3 Winter's Fury 2016-07-07 11 2 Quatsino US
## 4 The Freeze 2015-08-06 9 1 Quatsino US
## 5 Winds of Hell 2015-07-16 5 1 Quatsino US
## 6 The Last Mile 2017-07-06 4 4 Quatsino US
## 7 Storm Rising 2016-05-19 5 2 Quatsino US
## 8 Stalked 2015-07-09 4 1 Quatsino US
## 9 All In 2021-08-12 10 8 Chilko Lake US
## 10 The Rock 2020-07-09 5 7 Great Slave Lake US
## 11 <NA> NA NA 6 Great Slave Lake US
## 12 <NA> NA NA 9 Nunatsiavut US
## 13 <NA> NA NA 5 Selenge Province US
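The output above only shows NA on the x side. A sketch on hypothetical toy data (names and values are assumptions) in which each table has a key the other lacks shows NA appearing on both sides:

```r
library(dplyr)

# Hypothetical toy tables; season 10 is only in x, season 3 is only in y
x <- tibble::tibble(title = c("A", "B"), season = c(1, 10))
y <- tibble::tibble(season = c(1, 3), location = c("Quatsino", "Patagonia"))

# Every key from either table appears once in the result
full <- full_join(x, y, by = "season")
full
```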
Describe the resulting data:
How is it different from the original two datasets?
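The semi join keeps the observations in x (episodes_small) that have a match in y (seasons_small) but adds no columns from y, so the result has only the four columns of episodes_small. All 10 sampled episodes have a matching season, so all 10 are kept.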
episodes_small %>% semi_join(seasons_small)
## Joining with `by = join_by(season)`
## # A tibble: 10 × 4
## title air_date episode season
## <chr> <date> <dbl> <dbl>
## 1 Outfoxed 2016-12-29 4 3
## 2 Far From Home 2021-06-24 4 8
## 3 Winter's Fury 2016-07-07 11 2
## 4 The Freeze 2015-08-06 9 1
## 5 Winds of Hell 2015-07-16 5 1
## 6 The Last Mile 2017-07-06 4 4
## 7 Storm Rising 2016-05-19 5 2
## 8 Stalked 2015-07-09 4 1
## 9 All In 2021-08-12 10 8
## 10 The Rock 2020-07-09 5 7
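A semi join acts as a filter rather than a merge. The sketch below, on hypothetical toy data (the names `x`, `y`, and the site labels are made up for illustration), contrasts it with an inner join when y holds the same key twice:

```r
library(dplyr)

# Hypothetical toy tables; y lists season 1 twice
x <- tibble::tibble(title = c("A", "B"), season = c(1, 2))
y <- tibble::tibble(season = c(1, 1), location = c("Site 1", "Site 2"))

# inner_join returns one row per matching pair, so "A" appears twice
inner <- inner_join(x, y, by = "season")

# semi_join keeps each matching row of x once and adds no columns from y
semi <- semi_join(x, y, by = "season")
semi
```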
Describe the resulting data:
How is it different from the original two datasets?
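The anti join is the complement of the semi join: it drops the observations in x (episodes_small) that have a match in y (seasons_small) and keeps only those without one. Every sampled episode has a matching season, so the result has 0 rows.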
episodes_small %>% anti_join(seasons_small)
## Joining with `by = join_by(season)`
## # A tibble: 0 × 4
## # ℹ 4 variables: title <chr>, air_date <date>, episode <dbl>, season <dbl>
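Since the anti join above returns nothing, a sketch on hypothetical toy data (names and values are assumptions) shows a non-empty result and checks that the semi and anti joins together split x into two disjoint parts:

```r
library(dplyr)

# Hypothetical toy tables; season 10 has no partner in y
x <- tibble::tibble(title = c("A", "B", "C"), season = c(1, 2, 10))
y <- tibble::tibble(season = c(1, 2), location = c("Quatsino", "Quatsino"))

matched   <- semi_join(x, y, by = "season")  # rows of x with a match in y
unmatched <- anti_join(x, y, by = "season")  # rows of x without one

# Every row of x lands in exactly one of the two results
nrow(matched) + nrow(unmatched) == nrow(x)
unmatched
```

Run before a join, this kind of check is a quick way to find keys that would otherwise be dropped or filled with NA.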