Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from the TidyTuesday project.

library(tidyverse) # provides read_csv() and the dplyr join functions used below

pixar_films <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-03-11/pixar_films.csv')

## Rows: 27 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): film, film_rating
## dbl  (2): number, run_time
## date (1): release_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

public_response <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-03-11/public_response.csv')

## Rows: 24 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): film, cinema_score
## dbl (3): rotten_tomatoes, metacritic, critics_choice
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
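
The column-specification message above can be avoided in either of the two ways it suggests. A minimal sketch, assuming readr 2.0 or later; the inline CSV text is made-up stand-in data (wrapped in `I()` so readr treats it as literal data rather than a file path), so the example runs without a network connection:

```r
library(readr)

# Made-up two-row CSV standing in for the real file (an assumption, not the TidyTuesday data).
csv_text <- I("film,run_time\nFilm A,90\nFilm B,105\n")

# Option 1: declare the column types up front, so nothing has to be guessed.
films_typed <- read_csv(
  csv_text,
  col_types = cols(film = col_character(), run_time = col_double())
)

# Option 2: keep type guessing but suppress the message.
films_quiet <- read_csv(csv_text, show_col_types = FALSE)
```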

2. Make data small

Describe the two datasets:

Data 1: pixar_films

Columns: film, film_rating, release_date
Rows: 10

Data 2: public_response

Columns: film, cinema_score, metacritic
Rows: 10

set.seed(1234)
pixar_films_small <- pixar_films %>% select(film, film_rating, release_date) %>% sample_n(10)

public_response_small <- public_response %>% select(film, cinema_score, metacritic) %>% sample_n(10)

pixar_films_small

## # A tibble: 10 × 3
##    film              film_rating release_date
##    <chr>             <chr>       <date>      
##  1 The Good Dinosaur PG          2015-11-25  
##  2 Lightyear         N/A         2022-06-17  
##  3 Onward            PG          2020-03-06  
##  4 Finding Nemo      G           2003-05-30  
##  5 Cars 2            G           2011-06-24  
##  6 Inside Out        PG          2015-06-19  
##  7 WALL-E            G           2008-06-27  
##  8 Luca              N/A         2021-06-18  
##  9 The Incredibles   PG          2004-11-05  
## 10 <NA>              Not Rated   2023-06-16

public_response_small

## # A tibble: 10 × 3
##    film                cinema_score metacritic
##    <chr>               <chr>             <dbl>
##  1 Monsters, Inc.      A+                   79
##  2 A Bug's Life        A                    77
##  3 Cars                A                    73
##  4 The Incredibles     A+                   90
##  5 Inside Out          A                    94
##  6 Monsters University A                    65
##  7 Coco                A+                   81
##  8 Luca                <NA>                 NA
##  9 Finding Dory        A                    77
## 10 Finding Nemo        A+                   90

3. inner_join

Describe the resulting data:

Columns: film, film_rating, release_date, cinema_score, metacritic
Rows: 4

How is it different from the original two datasets?

The result has 4 rows, compared with 10 rows in each original dataset: only the films that appear in both samples are kept.
It contains all columns from both datasets.

pixar_films_small %>% inner_join(public_response_small, by = c("film"))

## # A tibble: 4 × 5
##   film            film_rating release_date cinema_score metacritic
##   <chr>           <chr>       <date>       <chr>             <dbl>
## 1 Finding Nemo    G           2003-05-30   A+                   90
## 2 Inside Out      PG          2015-06-19   A                    94
## 3 Luca            N/A         2021-06-18   <NA>                 NA
## 4 The Incredibles PG          2004-11-05   A+                   90
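
The count of 4 rows relies on film being unique in each sample, because an inner join returns one row per matching pair of rows. A minimal sketch of what happens otherwise, using hypothetical toy tables (`films`, `scores` and their values are made up, not the Pixar data):

```r
library(dplyr)

# Hypothetical toy tables: "Film A" appears twice in `scores`.
films  <- tibble(film = c("Film A", "Film B"), rating = c("G", "PG"))
scores <- tibble(film = c("Film A", "Film A"), score = c(90, 85))

# Count how often each key occurs; any n > 1 means the key is not unique.
duplicated_keys <- scores %>% count(film) %>% filter(n > 1)

# Each match produces a row, so "Film A" appears twice in the result
# and "Film B" (no match) is dropped.
joined <- films %>% inner_join(scores, by = "film")
```

Running the same `count()` check on pixar_films_small and public_response_small before joining is one way to confirm that the key is unique.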

4. left_join

Describe the resulting data:

Columns: film, film_rating, release_date, cinema_score, metacritic
Rows: 10

How is it different from the original two datasets?

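All columns from both datasets are included.
All 10 rows of pixar_films_small are kept; cinema_score and metacritic are NA for the 6 rows with no match in public_response_small.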

pixar_films_small %>% left_join(public_response_small, by = c("film"))

## # A tibble: 10 × 5
##    film              film_rating release_date cinema_score metacritic
##    <chr>             <chr>       <date>       <chr>             <dbl>
##  1 The Good Dinosaur PG          2015-11-25   <NA>                 NA
##  2 Lightyear         N/A         2022-06-17   <NA>                 NA
##  3 Onward            PG          2020-03-06   <NA>                 NA
##  4 Finding Nemo      G           2003-05-30   A+                   90
##  5 Cars 2            G           2011-06-24   <NA>                 NA
##  6 Inside Out        PG          2015-06-19   A                    94
##  7 WALL-E            G           2008-06-27   <NA>                 NA
##  8 Luca              N/A         2021-06-18   <NA>                 NA
##  9 The Incredibles   PG          2004-11-05   A+                   90
## 10 <NA>              Not Rated   2023-06-16   <NA>                 NA
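
In the output above, an NA in cinema_score does not by itself mean the film had no match: Luca is present in public_response_small, but its scores are missing there. One way to tell the two cases apart is to add a flag column before joining. A minimal sketch on three rows copied from the printed samples; the column name `in_response` is made up for illustration:

```r
library(dplyr)

# Three rows copied from the pixar_films_small output.
films <- tibble(
  film        = c("Finding Nemo", "Luca", "WALL-E"),
  film_rating = c("G", "N/A", "G")
)

# Luca is present in the response data, but its scores are missing.
response <- tibble(
  film         = c("Finding Nemo", "Luca"),
  cinema_score = c("A+", NA),
  metacritic   = c(90, NA)
)

flagged <- films %>%
  left_join(response %>% mutate(in_response = TRUE), by = "film") %>%
  # Unmatched rows get NA for the flag; turn that into FALSE.
  mutate(in_response = coalesce(in_response, FALSE))
```

Here Luca and WALL-E both have NA in cinema_score, but only WALL-E has `in_response` equal to FALSE.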

5. right_join

Describe the resulting data:

Columns: film, film_rating, release_date, cinema_score, metacritic
Rows: 10

How is it different from the original two datasets?

film_rating and release_date are filled in only for the first 4 rows (the films that match); for the other 6 rows they are NA.
All columns from both datasets are included.

pixar_films_small %>% right_join(public_response_small, by = c("film"))

## # A tibble: 10 × 5
##    film                film_rating release_date cinema_score metacritic
##    <chr>               <chr>       <date>       <chr>             <dbl>
##  1 Finding Nemo        G           2003-05-30   A+                   90
##  2 Inside Out          PG          2015-06-19   A                    94
##  3 Luca                N/A         2021-06-18   <NA>                 NA
##  4 The Incredibles     PG          2004-11-05   A+                   90
##  5 Monsters, Inc.      <NA>        NA           A+                   79
##  6 A Bug's Life        <NA>        NA           A                    77
##  7 Cars                <NA>        NA           A                    73
##  8 Monsters University <NA>        NA           A                    65
##  9 Coco                <NA>        NA           A+                   81
## 10 Finding Dory        <NA>        NA           A                    77
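
A right join returns the same data as a left join with the two tables swapped; only the row and column order differ. A minimal sketch that checks this on a few rows copied from the printed samples:

```r
library(dplyr)

# A few rows copied from the two samples printed in section 2.
films <- tibble(
  film        = c("Finding Nemo", "Luca", "WALL-E"),
  film_rating = c("G", "N/A", "G")
)
response <- tibble(
  film       = c("Coco", "Finding Nemo", "Luca"),
  metacritic = c(81, 90, NA)
)

via_right <- films %>% right_join(response, by = "film")
via_left  <- response %>% left_join(films, by = "film")

# Put rows and columns in a common order before comparing.
same_result <- isTRUE(all.equal(
  via_right %>% arrange(film),
  via_left %>% select(film, film_rating, metacritic) %>% arrange(film)
))
```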

6. full_join

Describe the resulting data:

Columns: film, film_rating, release_date, cinema_score, metacritic
Rows: 16

How is it different from the original two datasets?

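All columns from both original datasets are shown.
There are 16 rows instead of 10: the 4 matched films, plus 6 found only in pixar_films_small and 6 found only in public_response_small.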

pixar_films_small %>% full_join(public_response_small, by = "film")

## # A tibble: 16 × 5
##    film                film_rating release_date cinema_score metacritic
##    <chr>               <chr>       <date>       <chr>             <dbl>
##  1 The Good Dinosaur   PG          2015-11-25   <NA>                 NA
##  2 Lightyear           N/A         2022-06-17   <NA>                 NA
##  3 Onward              PG          2020-03-06   <NA>                 NA
##  4 Finding Nemo        G           2003-05-30   A+                   90
##  5 Cars 2              G           2011-06-24   <NA>                 NA
##  6 Inside Out          PG          2015-06-19   A                    94
##  7 WALL-E              G           2008-06-27   <NA>                 NA
##  8 Luca                N/A         2021-06-18   <NA>                 NA
##  9 The Incredibles     PG          2004-11-05   A+                   90
## 10 <NA>                Not Rated   2023-06-16   <NA>                 NA
## 11 Monsters, Inc.      <NA>        NA           A+                   79
## 12 A Bug's Life        <NA>        NA           A                    77
## 13 Cars                <NA>        NA           A                    73
## 14 Monsters University <NA>        NA           A                    65
## 15 Coco                <NA>        NA           A+                   81
## 16 Finding Dory        <NA>        NA           A                    77
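
The row counts of the four joins so far follow from how the film keys overlap. A sketch of that arithmetic, using the film columns exactly as printed in section 2 and assuming (as that output shows) that film is unique within each sample:

```r
# Film keys copied from the two samples printed in section 2.
pixar_keys <- c("The Good Dinosaur", "Lightyear", "Onward", "Finding Nemo",
                "Cars 2", "Inside Out", "WALL-E", "Luca", "The Incredibles", NA)
response_keys <- c("Monsters, Inc.", "A Bug's Life", "Cars", "The Incredibles",
                   "Inside Out", "Monsters University", "Coco", "Luca",
                   "Finding Dory", "Finding Nemo")

# Films present in both samples.
n_both <- length(intersect(pixar_keys, response_keys))

n_inner <- n_both                                               # inner_join: 4
n_left  <- length(pixar_keys)                                   # left_join: 10
n_right <- length(response_keys)                                # right_join: 10
n_full  <- length(pixar_keys) + length(response_keys) - n_both  # full_join: 10 + 10 - 4 = 16
```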

7. semi_join

Describe the resulting data:

Columns: film, film_rating, release_date
Rows: 4

How is it different from the original two datasets?

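Only the 3 columns of pixar_films_small are shown; no columns are added from public_response_small.
There are 4 rows instead of 10: the films that have a match in public_response_small.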

pixar_films_small %>%
  semi_join(public_response_small, by = c("film"))

## # A tibble: 4 × 3
##   film            film_rating release_date
##   <chr>           <chr>       <date>      
## 1 Finding Nemo    G           2003-05-30  
## 2 Inside Out      PG          2015-06-19  
## 3 Luca            N/A         2021-06-18  
## 4 The Incredibles PG          2004-11-05

8. anti_join

Describe the resulting data:

Columns: film, film_rating, release_date
Rows: 6

How is it different from the original two datasets?

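Only the 3 columns of pixar_films_small are shown; no columns are added from public_response_small.
There are 6 rows instead of 10: the films that have no match in public_response_small.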

pixar_films_small %>%
  anti_join(public_response_small, by = c("film"))

## # A tibble: 6 × 3
##   film              film_rating release_date
##   <chr>             <chr>       <date>      
## 1 The Good Dinosaur PG          2015-11-25  
## 2 Lightyear         N/A         2022-06-17  
## 3 Onward            PG          2020-03-06  
## 4 Cars 2            G           2011-06-24  
## 5 WALL-E            G           2008-06-27  
## 6 <NA>              Not Rated   2023-06-16
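
semi_join and anti_join only filter the rows of the first table, which is why no columns from public_response_small appear in either result. With a single key column, each can be written as a filter on that column, and together they split the first table into two complementary pieces (4 + 6 = 10 rows in the outputs above). A minimal sketch on a few film names copied from the printed samples:

```r
library(dplyr)

# Film names copied from the two samples printed in section 2.
films    <- tibble(film = c("Finding Nemo", "Luca", "WALL-E", "Cars 2"))
response <- tibble(film = c("Coco", "Finding Nemo", "Luca"))

kept    <- films %>% semi_join(response, by = "film")
dropped <- films %>% anti_join(response, by = "film")

# Equivalent filters on the key column.
kept_filter    <- films %>% filter(film %in% response$film)
dropped_filter <- films %>% filter(!film %in% response$film)

# Every row of `films` lands in exactly one of the two results.
splits_cleanly <- nrow(kept) + nrow(dropped) == nrow(films)
```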

Ava Clark

2025-03-28
