library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
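library(readr)  # already attached by library(tidyverse) above; loaded explicitly because read_csv() comes from readr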
This vignette demonstrates how to perform a full join with the dplyr package in R. We work with two datasets of movie ratings, one for 2021 and one for 2022. The goal is to merge them into a single table that keeps every movie from both years and fills in NA wherever a movie has no matching row in the other year.
We have two datasets:
movieratings_2021 contains movie ratings for the year 2021.
movieratings_2022 contains movie ratings for the year 2022.
We want to merge these datasets on a common key. The join below uses the Film column, which holds each movie's title, as that key.
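The check below is a minimal sketch of how one might confirm that a column is safe to use as a join key, meaning that each value appears only once. The `ratings` table and its film titles are hypothetical stand-ins, not rows from the actual CSV files.

```r
library(dplyr)
library(tibble)

# Hypothetical toy table standing in for one year's ratings
ratings <- tibble(
  Film = c("Film A", "Film B", "Film A"),
  Year = c(2021, 2021, 2021)
)

# Count how often each title occurs and keep the titles that repeat
duplicated_keys <- ratings %>%
  count(Film) %>%
  filter(n > 1)

duplicated_keys
```

An empty result means every title is unique. If the same title is repeated in both tables, a join on that column pairs each matching row with every other matching row, so the merged table can contain more rows than expected.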
movieratings_2021 <- read_csv("https://raw.githubusercontent.com/MAB592/Data-607-Assignments/main/The%20Hollywood%20Inider%20-%20all%20data%20-%202021.csv")
## Rows: 101 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (19): Film, Rotten Tomatoes critics, Metacritic critics, Average criti...
## dbl (7): Year, Rotten Tomatoes Audience, Average audience, Domestic Gross, ...
## lgl (9): Script Type, Foreign Gross ($million), Foreign Gross, % of Gross e...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
movieratings_2022 <- read_csv("https://raw.githubusercontent.com/MAB592/Data-607-Assignments/main/The%20Hollywood%20Inider%20-%20all%20data%20-%202022.csv")
## Rows: 92 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (20): Film, Metacritic critics, Metacritic Audience, Rotten Tomatoes vs...
## dbl (10): Year, Rotten Tomatoes critics, Average critics, Rotten Tomatoes A...
## lgl (5): Script Type, Link, None, Release Date (US), film list here https:/...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(movieratings_2021)
head(movieratings_2022)
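To merge the datasets, we use the full_join() function from dplyr. A full join keeps every row from both datasets: films found in both years are combined into one row, and films found in only one year still appear, with NA in the columns that come from the other dataset. Because the two files share other column names as well (Year, for example) and we join only on Film, dplyr distinguishes those columns in the result with the suffixes .x (2021) and .y (2022).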
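To see this behaviour on a small scale, here is a sketch with two hypothetical toy tables. The names `toy_2021`, `toy_2022`, and `Score`, and all values in them, are illustrative assumptions and do not come from the ratings files. The `suffix` argument of full_join() is used here to replace the default `.x`/`.y` labels with the year.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables, not the real ratings data
toy_2021 <- tibble(Film = c("Film A", "Film B"), Score = c(80, 65))
toy_2022 <- tibble(Film = c("Film B", "Film C"), Score = c(70, 90))

# Keep every film from both tables; label the shared Score column by year
toy_joined <- full_join(
  toy_2021, toy_2022,
  by = "Film",
  suffix = c("_2021", "_2022")
)

toy_joined
```

Film A and Film C each appear in only one table, so they get NA in the score column from the other year, while Film B carries both scores.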
full_movie_ratings <- full_join(movieratings_2021, movieratings_2022, by = "Film")
head(full_movie_ratings)
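After a full join it can be useful to know which films appear in only one of the two years. One possible way, sketched here with hypothetical toy tables rather than the real data, is dplyr's anti_join(), which returns the rows of the first table that have no match in the second.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables, not the real ratings data
films_2021 <- tibble(Film = c("Film A", "Film B"))
films_2022 <- tibble(Film = c("Film B", "Film C"))

# Films present in one year but absent from the other
only_2021 <- anti_join(films_2021, films_2022, by = "Film")
only_2022 <- anti_join(films_2022, films_2021, by = "Film")

only_2021
only_2022
```

These are the films whose rows in the full join result contain NA for the other year's columns.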
In this vignette, we used dplyr's full_join() to merge movie ratings data for 2021 and 2022. The full join keeps every movie in the result, including movies that appear in only one of the two years. This makes it a good fit for combining data from multiple sources when no rows should be dropped during the merge.