Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)

Joins in R

This vignette demonstrates how to perform a full join with the dplyr package in R. We’ll work with two datasets of movie ratings, one for 2021 and one for 2022. The goal is to merge them into a single table that keeps every movie from both years, with NA in the columns for any year in which a movie does not appear.

Dataset Description

We have two datasets:

movieratings_2021 contains movie ratings for the year 2021.
movieratings_2022 contains movie ratings for the year 2022.

We want to merge these datasets on a common key, which in this case is the Film column that identifies each movie.

movieratings_2021 <- read_csv("https://raw.githubusercontent.com/MAB592/Data-607-Assignments/main/The%20Hollywood%20Inider%20-%20all%20data%20-%202021.csv")
## Rows: 101 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (19): Film, Rotten Tomatoes  critics, Metacritic  critics, Average criti...
## dbl  (7): Year, Rotten Tomatoes Audience, Average audience, Domestic Gross, ...
## lgl  (9): Script Type, Foreign Gross ($million), Foreign Gross, % of Gross e...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
movieratings_2022 <- read_csv("https://raw.githubusercontent.com/MAB592/Data-607-Assignments/main/The%20Hollywood%20Inider%20-%20all%20data%20-%202022.csv")
## Rows: 92 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (20): Film, Metacritic  critics, Metacritic Audience, Rotten Tomatoes vs...
## dbl (10): Year, Rotten Tomatoes  critics, Average critics, Rotten Tomatoes A...
## lgl  (5): Script Type, Link, None, Release Date (US), film list here https:/...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(movieratings_2021)
head(movieratings_2022)
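
Before joining, it can help to confirm that the key column identifies each row uniquely and has no missing values, since repeated keys multiply rows in a join. The sketch below shows one way to check this on a small, hypothetical table; the film names and the Rating column are made up for illustration and are not taken from the datasets above.

```r
library(dplyr)

# Hypothetical toy table standing in for one year's ratings
ratings <- tibble(
  Film   = c("Film A", "Film B", "Film B", NA),
  Rating = c(70, 85, 88, 60)
)

# Key values that appear on more than one row
duplicate_keys <- ratings %>%
  filter(!is.na(Film)) %>%
  count(Film, name = "n_rows") %>%
  filter(n_rows > 1)

# Rows whose key is missing
missing_keys <- ratings %>%
  filter(is.na(Film))

duplicate_keys
missing_keys
```

If both results are empty for a real table, each Film value occurs exactly once, and joining on it will not duplicate rows. The same check could be run on movieratings_2021 and movieratings_2022 by swapping in those names.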

Performing a Full Join

To merge the datasets, we will use the full_join() function from the dplyr package. A full join keeps all rows from both datasets: rows whose Film values match are combined into a single row, and rows that appear in only one dataset are kept, with NA in the columns that come from the other dataset.

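# Join on Film; columns that share a name in both tables get a year suffix
# instead of dplyr's default .x / .y
full_movie_ratings <- full_join(movieratings_2021, movieratings_2022, by = "Film", suffix = c("_2021", "_2022"))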
head(full_movie_ratings)
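
To make the NA-filling behavior easier to see, here is a minimal sketch of a full join on two tiny, hypothetical tables. The film names and ratings are invented for illustration. The suffix argument is optional and is used here only to label which table each non-key column came from.

```r
library(dplyr)

# Hypothetical toy tables: "Film B" appears in both, the others in only one
ratings_a <- tibble(Film = c("Film A", "Film B"), Rating = c(70, 85))
ratings_b <- tibble(Film = c("Film B", "Film C"), Rating = c(90, 60))

# Keep every film from both tables, matching rows on Film
joined <- full_join(
  ratings_a, ratings_b,
  by = "Film",
  suffix = c("_2021", "_2022")
)

joined
```

The result has three rows, one per distinct film. "Film A" has NA in Rating_2022 and "Film C" has NA in Rating_2021, while "Film B" carries a value in both columns.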

Conclusion

In this vignette, we used dplyr's full_join() function to merge movie ratings data from two different years. The full join keeps every movie in the result, including those that appear in only one year. This technique is useful for combining data from multiple sources when no row from either source should be dropped.