Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from the TidyTuesday project.

episodes <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2023/2023-01-24/episodes.csv')

## Rows: 98 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): version, title, quote, author
## dbl  (6): season, episode_number_overall, episode, viewers, imdb_rating, n_r...
## date (1): air_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

seasons <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2023/2023-01-24/seasons.csv')

## Rows: 9 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): version, location, country
## dbl  (4): season, n_survivors, lat, lon
## date (1): date_drop_off
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Make data small

Describe the two datasets:

Data1: Episodes

Columns: title, air_date, episode, season
Rows: 10 rows

Data2: Seasons

Columns: location, season, version
Rows: 9 rows

library(dplyr)

set.seed(1234)  # fix the seed so the random sample is reproducible
episodes_small <- episodes %>% select(title, air_date, episode, season) %>% sample_n(10)
seasons_small <- seasons %>% select(location, season, version) %>% sample_n(9)

episodes_small

## # A tibble: 10 × 4
##    title         air_date   episode season
##    <chr>         <date>       <dbl>  <dbl>
##  1 Outfoxed      2016-12-29       4      3
##  2 Far From Home 2021-06-24       4      8
##  3 Winter's Fury 2016-07-07      11      2
##  4 The Freeze    2015-08-06       9      1
##  5 Winds of Hell 2015-07-16       5      1
##  6 The Last Mile 2017-07-06       4      4
##  7 Storm Rising  2016-05-19       5      2
##  8 Stalked       2015-07-09       4      1
##  9 All In        2021-08-12      10      8
## 10 The Rock      2020-07-09       5      7

seasons_small

## # A tibble: 9 × 3
##   location         season version
##   <chr>             <dbl> <chr>  
## 1 Quatsino              4 US     
## 2 Great Slave Lake      6 US     
## 3 Chilko Lake           8 US     
## 4 Great Slave Lake      7 US     
## 5 Nunatsiavut           9 US     
## 6 Selenge Province      5 US     
## 7 Quatsino              1 US     
## 8 Quatsino              2 US     
## 9 Patagonia             3 US
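
Because every join below matches on season, it can help to confirm first that season identifies each row of seasons_small uniquely. The following is a minimal sketch rather than part of the original assignment: the tibble is retyped from the printout above (version omitted), and duplicate_keys is an illustrative name.

```r
library(dplyr)

# Values copied from the seasons_small printout; the version column is left out
seasons_small <- tibble(
  location = c("Quatsino", "Great Slave Lake", "Chilko Lake", "Great Slave Lake",
               "Nunatsiavut", "Selenge Province", "Quatsino", "Quatsino", "Patagonia"),
  season   = c(4, 6, 8, 7, 9, 5, 1, 2, 3)
)

# Seasons that occur more than once; zero rows means season is a unique key
duplicate_keys <- seasons_small %>%
  count(season) %>%
  filter(n > 1)

nrow(duplicate_keys)
```

If this count were above zero, a join on season could return more rows than episodes_small contains.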

3. inner_join

Describe the resulting data:

Columns: title, air_date, episode, season, location, version
Rows: 10 rows

How is it different from the original two datasets?

This data differs from the original two datasets because inner_join() matched each episode to a season wherever the key, season, was equal in both tables. The result contains the columns from both datasets, with the shared key column season appearing once. Seasons 5, 6, and 9 have no sampled episode, so they do not appear.

episodes_small %>% inner_join(seasons_small)

## Joining with `by = join_by(season)`

## # A tibble: 10 × 6
##    title         air_date   episode season location         version
##    <chr>         <date>       <dbl>  <dbl> <chr>            <chr>  
##  1 Outfoxed      2016-12-29       4      3 Patagonia        US     
##  2 Far From Home 2021-06-24       4      8 Chilko Lake      US     
##  3 Winter's Fury 2016-07-07      11      2 Quatsino         US     
##  4 The Freeze    2015-08-06       9      1 Quatsino         US     
##  5 Winds of Hell 2015-07-16       5      1 Quatsino         US     
##  6 The Last Mile 2017-07-06       4      4 Quatsino         US     
##  7 Storm Rising  2016-05-19       5      2 Quatsino         US     
##  8 Stalked       2015-07-09       4      1 Quatsino         US     
##  9 All In        2021-08-12      10      8 Chilko Lake      US     
## 10 The Rock      2020-07-09       5      7 Great Slave Lake US
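
The join above lets dplyr pick the shared column automatically, as the message shows. Below is a hedged sketch of the same join with the key spelled out through by = "season". The tibbles are retyped from the printouts and keep only title, season, and location, so the column count differs from the output above; joined is an illustrative name.

```r
library(dplyr)

# Reduced copies of the two small tables (values taken from the printouts)
episodes_small <- tibble(
  title  = c("Outfoxed", "Far From Home", "Winter's Fury", "The Freeze", "Winds of Hell",
             "The Last Mile", "Storm Rising", "Stalked", "All In", "The Rock"),
  season = c(3, 8, 2, 1, 1, 4, 2, 1, 8, 7)
)
seasons_small <- tibble(
  location = c("Quatsino", "Great Slave Lake", "Chilko Lake", "Great Slave Lake",
               "Nunatsiavut", "Selenge Province", "Quatsino", "Quatsino", "Patagonia"),
  season   = c(4, 6, 8, 7, 9, 5, 1, 2, 3)
)

# Naming the key explicitly suppresses the "Joining with" message
joined <- episodes_small %>% inner_join(seasons_small, by = "season")

nrow(joined)    # one row per matched episode
names(joined)   # x columns first, then the non-key columns of y
```

Stating the key guards against an accidental join on some other column that happens to share a name in both tables.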

4. left_join

Describe the resulting data:

Columns: title, air_date, episode, season, location, version
Rows: 10 rows

How is it different from the original two datasets?

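This data differs from the original two datasets because left_join() kept every observation from x (episodes_small) and added the matching columns from y (seasons_small). The unmatched observations from y (seasons 5, 6, and 9) were dropped. Because every sampled episode has a matching season, the result has the same rows as the inner_join.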

episodes_small %>% left_join(seasons_small)

## Joining with `by = join_by(season)`

## # A tibble: 10 × 6
##    title         air_date   episode season location         version
##    <chr>         <date>       <dbl>  <dbl> <chr>            <chr>  
##  1 Outfoxed      2016-12-29       4      3 Patagonia        US     
##  2 Far From Home 2021-06-24       4      8 Chilko Lake      US     
##  3 Winter's Fury 2016-07-07      11      2 Quatsino         US     
##  4 The Freeze    2015-08-06       9      1 Quatsino         US     
##  5 Winds of Hell 2015-07-16       5      1 Quatsino         US     
##  6 The Last Mile 2017-07-06       4      4 Quatsino         US     
##  7 Storm Rising  2016-05-19       5      2 Quatsino         US     
##  8 Stalked       2015-07-09       4      1 Quatsino         US     
##  9 All In        2021-08-12      10      8 Chilko Lake      US     
## 10 The Rock      2020-07-09       5      7 Great Slave Lake US
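
In this sample left_join and inner_join return the same rows, which hides how they differ. The sketch below adds one made-up episode from a season 10 that is not in seasons_small. That row, and the names episodes_plus, left, and inner, are hypothetical and only serve the illustration; the remaining values are retyped from the printouts.

```r
library(dplyr)

episodes_small <- tibble(
  title  = c("Outfoxed", "Far From Home", "Winter's Fury", "The Freeze", "Winds of Hell",
             "The Last Mile", "Storm Rising", "Stalked", "All In", "The Rock"),
  season = c(3, 8, 2, 1, 1, 4, 2, 1, 8, 7)
)
seasons_small <- tibble(
  location = c("Quatsino", "Great Slave Lake", "Chilko Lake", "Great Slave Lake",
               "Nunatsiavut", "Selenge Province", "Quatsino", "Quatsino", "Patagonia"),
  season   = c(4, 6, 8, 7, 9, 5, 1, 2, 3)
)

# Hypothetical extra row whose key has no partner in seasons_small
episodes_plus <- bind_rows(
  episodes_small,
  tibble(title = "Hypothetical episode", season = 10)
)

left  <- episodes_plus %>% left_join(seasons_small, by = "season")
inner <- episodes_plus %>% inner_join(seasons_small, by = "season")

nrow(left)    # keeps the unmatched episode
nrow(inner)   # drops it
left %>% filter(is.na(location))
```

The left join keeps the extra episode and fills location with NA, while the inner join discards it.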

5. right_join

Describe the resulting data:

Columns: title, air_date, episode, season, location, version
Rows: 13 rows

How is it different from the original two datasets?

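This dataset differs from the original two datasets because right_join() kept every observation from y (seasons_small). Seasons 5, 6, and 9 have no matching episode in x (episodes_small), so they appear in the last three rows with NA in the columns that come from x (title, air_date, episode).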

episodes_small %>% right_join(seasons_small)

## Joining with `by = join_by(season)`

## # A tibble: 13 × 6
##    title         air_date   episode season location         version
##    <chr>         <date>       <dbl>  <dbl> <chr>            <chr>  
##  1 Outfoxed      2016-12-29       4      3 Patagonia        US     
##  2 Far From Home 2021-06-24       4      8 Chilko Lake      US     
##  3 Winter's Fury 2016-07-07      11      2 Quatsino         US     
##  4 The Freeze    2015-08-06       9      1 Quatsino         US     
##  5 Winds of Hell 2015-07-16       5      1 Quatsino         US     
##  6 The Last Mile 2017-07-06       4      4 Quatsino         US     
##  7 Storm Rising  2016-05-19       5      2 Quatsino         US     
##  8 Stalked       2015-07-09       4      1 Quatsino         US     
##  9 All In        2021-08-12      10      8 Chilko Lake      US     
## 10 The Rock      2020-07-09       5      7 Great Slave Lake US     
## 11 <NA>          NA              NA      6 Great Slave Lake US     
## 12 <NA>          NA              NA      9 Nunatsiavut      US     
## 13 <NA>          NA              NA      5 Selenge Province US
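
One way to see which side the NA values come from is to count missing values per column. This is a sketch on reduced copies of the two tables (title, season, and location only, retyped from the printouts); right and na_counts are illustrative names.

```r
library(dplyr)

episodes_small <- tibble(
  title  = c("Outfoxed", "Far From Home", "Winter's Fury", "The Freeze", "Winds of Hell",
             "The Last Mile", "Storm Rising", "Stalked", "All In", "The Rock"),
  season = c(3, 8, 2, 1, 1, 4, 2, 1, 8, 7)
)
seasons_small <- tibble(
  location = c("Quatsino", "Great Slave Lake", "Chilko Lake", "Great Slave Lake",
               "Nunatsiavut", "Selenge Province", "Quatsino", "Quatsino", "Patagonia"),
  season   = c(4, 6, 8, 7, 9, 5, 1, 2, 3)
)

right <- episodes_small %>% right_join(seasons_small, by = "season")

# Missing values per column: only columns that come from x can be NA here
na_counts <- colSums(is.na(right))
na_counts

# The seasons behind those NA rows
right %>% filter(is.na(title)) %>% pull(season) %>% sort()
```

Only title has missing values, because every row of y is kept and the episode columns have nothing to fill in for seasons without a sampled episode.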

6. full_join

Describe the resulting data:

Columns: title, air_date, episode, season, location, version
Rows: 13 rows

How is it different from the original two datasets?

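This dataset differs from the original two datasets because full_join() kept all observations from both x (episodes_small) and y (seasons_small). Here every row of x has a match, so the only unmatched rows come from y, and the result has the same 13 rows as the right_join.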

episodes_small %>% full_join(seasons_small)

## Joining with `by = join_by(season)`

## # A tibble: 13 × 6
##    title         air_date   episode season location         version
##    <chr>         <date>       <dbl>  <dbl> <chr>            <chr>  
##  1 Outfoxed      2016-12-29       4      3 Patagonia        US     
##  2 Far From Home 2021-06-24       4      8 Chilko Lake      US     
##  3 Winter's Fury 2016-07-07      11      2 Quatsino         US     
##  4 The Freeze    2015-08-06       9      1 Quatsino         US     
##  5 Winds of Hell 2015-07-16       5      1 Quatsino         US     
##  6 The Last Mile 2017-07-06       4      4 Quatsino         US     
##  7 Storm Rising  2016-05-19       5      2 Quatsino         US     
##  8 Stalked       2015-07-09       4      1 Quatsino         US     
##  9 All In        2021-08-12      10      8 Chilko Lake      US     
## 10 The Rock      2020-07-09       5      7 Great Slave Lake US     
## 11 <NA>          NA              NA      6 Great Slave Lake US     
## 12 <NA>          NA              NA      9 Nunatsiavut      US     
## 13 <NA>          NA              NA      5 Selenge Province US
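
The 13 rows can be accounted for by hand: matched rows, plus rows found only in x, plus rows found only in y. A small sketch of that arithmetic, using just the season values from the two printouts (x_keys, y_keys, and the n_ names are illustrative). It assumes the keys in y are unique, which holds for this sample.

```r
library(dplyr)

x_keys <- c(3, 8, 2, 1, 1, 4, 2, 1, 8, 7)   # season column of episodes_small
y_keys <- c(4, 6, 8, 7, 9, 5, 1, 2, 3)      # season column of seasons_small

n_matched <- sum(x_keys %in% y_keys)      # x rows with a partner in y
n_only_x  <- sum(!(x_keys %in% y_keys))   # x rows without a partner
n_only_y  <- sum(!(y_keys %in% x_keys))   # y rows without a partner

n_matched + n_only_x + n_only_y   # 10 + 0 + 3 = 13

# Cross-check against an actual full join on the keys alone
n_full <- nrow(full_join(tibble(season = x_keys), tibble(season = y_keys), by = "season"))
n_full
```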

7. semi_join

Describe the resulting data:

Columns: title, air_date, episode, season
Rows: 10 rows

How is it different from the original two datasets?

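This dataset differs from the original two datasets because semi_join() kept the observations in x (episodes_small) that have a match in y (seasons_small) and returned only the columns of x. All 10 sampled episodes have a matching season, so no rows were removed.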

episodes_small %>% semi_join(seasons_small)

## Joining with `by = join_by(season)`

## # A tibble: 10 × 4
##    title         air_date   episode season
##    <chr>         <date>       <dbl>  <dbl>
##  1 Outfoxed      2016-12-29       4      3
##  2 Far From Home 2021-06-24       4      8
##  3 Winter's Fury 2016-07-07      11      2
##  4 The Freeze    2015-08-06       9      1
##  5 Winds of Hell 2015-07-16       5      1
##  6 The Last Mile 2017-07-06       4      4
##  7 Storm Rising  2016-05-19       5      2
##  8 Stalked       2015-07-09       4      1
##  9 All In        2021-08-12      10      8
## 10 The Rock      2020-07-09       5      7
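
A semi join acts as a filter on x. For a single key column, the sketch below compares it with filter() and %in% on reduced copies of the tables (values retyped from the printouts; semi and filtered are illustrative names).

```r
library(dplyr)

episodes_small <- tibble(
  title  = c("Outfoxed", "Far From Home", "Winter's Fury", "The Freeze", "Winds of Hell",
             "The Last Mile", "Storm Rising", "Stalked", "All In", "The Rock"),
  season = c(3, 8, 2, 1, 1, 4, 2, 1, 8, 7)
)
seasons_small <- tibble(
  location = c("Quatsino", "Great Slave Lake", "Chilko Lake", "Great Slave Lake",
               "Nunatsiavut", "Selenge Province", "Quatsino", "Quatsino", "Patagonia"),
  season   = c(4, 6, 8, 7, 9, 5, 1, 2, 3)
)

semi     <- episodes_small %>% semi_join(seasons_small, by = "season")
filtered <- episodes_small %>% filter(season %in% seasons_small$season)

names(semi)                             # only the columns of x
identical(semi$title, filtered$title)   # same episodes, same order
```

Unlike inner_join, neither form adds columns from y.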

8. anti_join

Describe the resulting data:

Columns: title, air_date, episode, season
Rows: 0 rows

How is it different from the original two datasets?

This dataset differs from the original two datasets because anti_join() dropped every observation in x (episodes_small) that has a match in y (seasons_small). All 10 sampled episodes have a matching season, so the result has 0 rows.

episodes_small %>% anti_join(seasons_small)

## Joining with `by = join_by(season)`

## # A tibble: 0 × 4
## # ℹ 4 variables: title <chr>, air_date <date>, episode <dbl>, season <dbl>
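
Swapping the two tables gives a more informative anti join for this sample: it lists the seasons that have no sampled episode. This is a sketch on reduced copies of the tables (values retyped from the printouts); seasons_without_episodes is an illustrative name.

```r
library(dplyr)

episodes_small <- tibble(
  title  = c("Outfoxed", "Far From Home", "Winter's Fury", "The Freeze", "Winds of Hell",
             "The Last Mile", "Storm Rising", "Stalked", "All In", "The Rock"),
  season = c(3, 8, 2, 1, 1, 4, 2, 1, 8, 7)
)
seasons_small <- tibble(
  location = c("Quatsino", "Great Slave Lake", "Chilko Lake", "Great Slave Lake",
               "Nunatsiavut", "Selenge Province", "Quatsino", "Quatsino", "Patagonia"),
  season   = c(4, 6, 8, 7, 9, 5, 1, 2, 3)
)

# Rows of seasons_small whose season never occurs in episodes_small
seasons_without_episodes <- seasons_small %>%
  anti_join(episodes_small, by = "season") %>%
  arrange(season)

seasons_without_episodes
```

These are the same seasons (5, 6, and 9) that show up with NA values in the right_join and full_join results.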

Alex Lenfest

2026-03-26
