library(tidyverse)
library(httr)
library(rvest)
library(lubridate)
library(magrittr)
library(ggplot2)
Rotten Tomatoes Analysis
Purpose
In this project, I investigate which factors and characteristics are associated with highly rated movies, specifically those on Rotten Tomatoes’ “100 Greatest Movies of the 21st Century” list. I will explore relationships among critic and audience scores, release years, directors, and cast.
I find this topic interesting because audiences and critics often disagree on what makes a film “great”, especially in major forms of entertainment like movies. This data set could also reveal recurring directors or actors whose films tend to earn higher scores from critics or audiences.
Data, Location, and Collection
The data we will be pulling from the Rotten Tomatoes site is located at https://editorial.rottentomatoes.com/guide/best-movies-21st-century/, an editorial titled “100 Greatest Movies of the 21st Century”.
List of libraries needed:
Above is the list of R packages needed to extract, transform, and analyze the Rotten Tomatoes movie data.
The tidyverse package provides a collection of tools useful for data wrangling, cleaning, and visualization. The httr package helps manage web requests and lets the project set the user agent it reports (browser or bot) when accessing the Rotten Tomatoes webpage. The rvest package is the primary tool used for web scraping. It allows HTML elements from the Rotten Tomatoes editorial page to be located and extracted. The lubridate package provides tools for working with dates and years, which may be useful later in the analysis when examining trends across movie release years. The magrittr package provides the %>% pipe operator, which improves readability by chaining data transformations in the order they are applied. Finally, ggplot2 is used to create visualizations and graphs that help identify relationships and trends within the movie data, such as critic scores, audience scores, and release year patterns. (ggplot2 is also attached by tidyverse, so loading it separately is optional.)
Identify as Google Bot:
set_config(user_agent("Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html"))
We identify as Googlebot so that Rotten Tomatoes allows us to view and scrape the page. More specifically, set_config() makes requests sent through httr report a User-Agent resembling that of Google’s web crawler. This step is sometimes used in web scraping projects when a site refuses requests that carry the default user agent.
Load in HTML of Editorial:
tomato_html <-
  read_html("https://editorial.rottentomatoes.com/guide/best-movies-21st-century/")
Above is the code that reads in the webpage we will be scraping. The read_html() function from the rvest package connects to the Rotten Tomatoes editorial page and downloads its HTML into R as a parsed document.
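Whether read_html() called on a bare URL picks up httr's session configuration depends on how the page is fetched internally, so one way to remove the doubt is to send the request through httr explicitly and parse the response. The sketch below is only an illustration under that assumption; fetch_html() is a hypothetical helper, not part of the original project.

```r
library(httr)
library(rvest)

# Hypothetical helper (not in the original project): fetch a page through
# httr with an explicit user agent, then parse the response body as HTML.
fetch_html <- function(url, agent) {
  response <- GET(url, user_agent(agent))
  stop_for_status(response)  # raise an R error on an HTTP 4xx/5xx status
  read_html(response)        # parse the httr response object
}

# Usage (not run here, since it needs network access):
# tomato_html <- fetch_html(
#   "https://editorial.rottentomatoes.com/guide/best-movies-21st-century/",
#   "Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html"
# )
```

Stopping on a failed status makes a blocked request visible immediately instead of surfacing later as empty scraped vectors.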
Grab Data from Inspector:
movie_name <-
  tomato_html %>%
  html_elements("div.meta-data-wrapper") %>%
  html_elements("div.meta-title-wrapper") %>%
  html_elements("a") %>%
  html_text2()
movie_link <-
  tomato_html %>%
  html_elements("div.meta-data-wrapper") %>%
  html_elements("div.meta-title-wrapper") %>%
  html_elements("a") %>%
  html_attr("href")
Above are two examples of how we pulled data after locating it in the inspector window of the Rotten Tomatoes webpage. By using the browser’s inspect tool, we were able to identify the exact HTML elements that contain the information we wanted to extract.
The first block of code extracts the movie titles from the webpage. The html_elements() function searches through the HTML structure and locates the sections containing movie metadata and titles. The html_text2() function then converts the HTML text into text values that can be stored in R.
The second block of code extracts the hyperlink associated with each movie title. Instead of collecting the text of an HTML element, the html_attr("href") function pulls the value stored in the element’s href attribute, which contains the link to the movie’s Rotten Tomatoes page.
This process was repeated for seven other variables: movie year, director, main actors, synopsis, critic consensus, critic score, and audience score.
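The difference between reading an element's text and reading one of its attributes can be tried offline on a small made-up page. In this sketch the HTML, the example.com links, and the toy_ names are all invented for illustration; only the two class names are borrowed from the code above.

```r
library(rvest)
library(magrittr)

# Toy HTML that imitates the class names used above; the real page may differ.
toy_page <- minimal_html('
  <div class="meta-data-wrapper">
    <div class="meta-title-wrapper">
      <a href="https://example.com/m/first">First Movie</a>
    </div>
  </div>
  <div class="meta-data-wrapper">
    <div class="meta-title-wrapper">
      <a href="https://example.com/m/second">Second Movie</a>
    </div>
  </div>
')

# Locate the title links once, then read two different things from them.
title_links <- toy_page %>%
  html_elements("div.meta-data-wrapper div.meta-title-wrapper a")

toy_names <- html_text2(title_links)         # visible text of each link
toy_links <- html_attr(title_links, "href")  # value of each href attribute
```

Here the three chained html_elements() calls are collapsed into one descendant selector, which selects the same nodes on this toy page.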
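When each variable is scraped as its own vector, a movie that lacks one field would shorten that vector and shift every later value out of line. A possible safeguard, sketched here on invented HTML with hypothetical class names (card, title, year), is to select one container per movie and then use html_element() (singular), which returns exactly one result per container and NA where the field is absent.

```r
library(rvest)
library(magrittr)
library(tibble)

# Invented markup: the second card deliberately has no year.
toy_page <- minimal_html('
  <div class="card">
    <a class="title" href="/m/first">First Movie</a>
    <span class="year">(2019)</span>
  </div>
  <div class="card">
    <a class="title" href="/m/second">Second Movie</a>
  </div>
')

# One node per movie.
cards <- toy_page %>% html_elements("div.card")

# html_element() keeps one output per card, so columns stay aligned.
toy_df <- tibble(
  movie_name = cards %>% html_element("a.title") %>% html_text2(),
  movie_year = cards %>% html_element("span.year") %>% html_text2()
)
```

The second row keeps its title and gets NA for the year, rather than the data frame failing to build or silently misaligning.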
Convert into Data Frame:
tomato_messy_df <-
  tibble(movie_name, movie_year, stars, director,
         synopsis, critic_consensus, critic_score,
         audience_score, movie_link)
The tibble() function then takes all of the extracted variables, each stored as a character vector, and combines them into one data frame of 100 rows and 9 columns.
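Because tibble() only recycles vectors of length one, it stops with an error if the scraped vectors differ in length. A quick check beforehand shows which vector is the odd one out. The values below are made-up stand-ins for the scraped data, and the object names are hypothetical.

```r
# Made-up stand-ins for three of the scraped character vectors.
scraped <- list(
  movie_name   = c("First Movie", "Second Movie", "Third Movie"),
  critic_score = c("99%", "96%", "98%"),
  movie_link   = c("/m/first", "/m/second", "/m/third")
)

# Length of every vector, named by field, for a quick visual check.
field_lengths <- lengths(scraped)
print(field_lengths)

# Fail early, before building the data frame, if any field is short.
stopifnot(length(unique(field_lengths)) == 1)

toy_messy_df <- tibble::as_tibble(scraped)
```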
Example Outputs:
head(tomato_messy_df)
# A tibble: 6 × 9
movie_name movie_year stars director synopsis critic_consensus critic_score
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Parasite (2019) Star… Directe… Synopsi… Critics Consens… 99%
2 Top Gun: Mav… (2022) Star… Directe… Synopsi… Critics Consens… 96%
3 Finding Nemo (2003) Star… Directe… Synopsi… Critics Consens… 99%
4 How to Train… (2010) Star… Directe… Synopsi… Critics Consens… 99%
5 Toy Story 3 (2010) Star… Directe… Synopsi… Critics Consens… 98%
6 Up (2009) Star… Directe… Synopsi… Critics Consens… 98%
# ℹ 2 more variables: audience_score <chr>, movie_link <chr>
Above, we used the head() function to look at the first few rows of our new data set and see how it is structured before any transformations.
In the next section, we will transform each of these variables so that the relationships between them can be investigated.
Data Transformation
Create Transformed Data Frame:
tomato_df <- tomato_messy_df %>%
  mutate(movie_year = str_replace_all(movie_year, "\\(", "")) %>%
  mutate(movie_year = as.numeric(str_replace_all(movie_year, "\\)", ""))) %>%
  mutate(stars = str_replace_all(stars, "Starring: ", "")) %>%
  mutate(director = str_replace_all(director, "Directed By: ", "")) %>%
  mutate(synopsis = str_replace_all(synopsis, "Synopsis: ", "")) %>%
  mutate(critic_consensus = str_replace_all(critic_consensus, "Critics Consensus: ", "")) %>%
  mutate(critic_score = as.numeric(str_replace_all(critic_score, "%", ""))) %>%
  mutate(audience_score = as.numeric(str_replace_all(audience_score, "%", ""))) %>%
  mutate(score_differ = audience_score - critic_score)
Above is the code used to transform the original raw data set into a cleaner and more analysis-ready version. Data collected directly from the website is messy and contains extra text, symbols, or formatting that can make analysis more difficult. These transformations help standardize the variables so they can be used more effectively in calculations, summaries, and visualizations.
The mutate() function from dplyr is used repeatedly to modify and create variables within the data set. Several variables contained unnecessary labels or symbols that needed to be removed. For example, the movie_year variable originally included parentheses around the year, so the str_replace_all() function was used to remove both the opening and closing parentheses before converting the values into numeric form.
The stars, director, synopsis, and critic_consensus variables also contained labels directly copied from the webpage, such as "Starring: " or "Directed By: ". These labels were removed so that the variables only contain the important information needed for analysis.
The critic and audience scores originally included percent signs, which prevented them from being treated as numerical variables. The percent symbols were removed and both variables were converted into numeric data types using as.numeric(). This allows calculations and visualizations involving scores to be performed correctly.
Finally, a new variable called score_differ was created by subtracting the critic score from the audience score. This variable can help measure disagreement between critics and audiences, making it easier to identify movies where public opinion differed greatly from critic reviews.
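The same cleaning steps can be tried on a two-row toy data set to see exactly what each one does. This is a condensed sketch, not the project's code: the scores are invented, the steps are combined into a single mutate() call, and str_remove_all() is used as stringr's shorthand for replacing a pattern with an empty string.

```r
library(dplyr)
library(stringr)
library(tibble)

# Invented values in the same raw format as the scraped columns.
toy_messy <- tibble(
  movie_year     = c("(2019)", "(2022)"),
  critic_score   = c("99%", "96%"),
  audience_score = c("90%", "99%")
)

toy_clean <- toy_messy %>%
  mutate(
    # "[()]" matches either parenthesis, so one call removes both.
    movie_year     = as.numeric(str_remove_all(movie_year, "[()]")),
    critic_score   = as.numeric(str_remove_all(critic_score, "%")),
    audience_score = as.numeric(str_remove_all(audience_score, "%")),
    # Positive values mean audiences scored the film higher than critics.
    score_differ   = audience_score - critic_score
  )
```

In this toy data the first row gets a score_differ of -9 (critics scored it higher) and the second a score_differ of 3 (audiences scored it higher).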
Example Outputs:
head(tomato_df)
# A tibble: 6 × 10
movie_name movie_year stars director synopsis critic_consensus critic_score
<chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 Parasite 2019 Song… Bong Jo… Greed a… An urgent, bril… 99
2 Top Gun: Mav… 2022 Tom … Joseph … After m… Top Gun: Maveri… 96
3 Finding Nemo 2003 Albe… Andrew … Marlin … Breathtakingly … 99
4 How to Train… 2010 Jay … Christo… A misfi… Boasting dazzli… 99
5 Toy Story 3 2010 Tom … Lee Unk… With th… Deftly blending… 98
6 Up 2009 Ed A… Pete Do… Carl Fr… An exciting, fu… 98
# ℹ 3 more variables: audience_score <dbl>, movie_link <chr>,
# score_differ <dbl>
Again we can use the head() function to look at our newly transformed data.
Visualizations:
Below we will create and analyze three different visualizations concerning the data we pulled from Rotten Tomatoes.
This visualization shows the ten movies on the top 100 list that audiences rated furthest above critics. The top four are all action, hero-based films, and three of them belong to large franchises. This suggests that audiences reward overall entertainment and anticipation, while critics may place more weight on other elements such as originality or artistic achievement.
Above is a visualization similar to the first, now showing the ten films that critics scored furthest above audiences. The top five are indie films, comedies, or dramas. This suggests that critics place more value on originality, new directors and casts, and impressive artistic feats achieved with small budgets.
The last visualization shows the average critic score by release year from 2000 to 2024. From 2000 to 2018, critic scores trend generally upward, possibly reflecting advancing technology, newly emerging studios, or up-and-coming actors. In 2020 the average drops sharply, with COVID-19 the most likely explanation: movie releases and announcements were at an all-time low, which may have dampened critics’ enthusiasm when experiencing these films for the first time. After 2020, critic scores bounced back and have trended upward since.
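The plotting code for these two rankings is not shown in this report, so the following is only a guess at how such charts could be built from the score_differ variable. The toy data, the object names, and the choice of a horizontal bar chart are all assumptions; the real charts would use n = 10 on tomato_df.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Invented movies and score gaps, for illustration only.
toy_df <- tibble(
  movie_name   = c("Movie A", "Movie B", "Movie C", "Movie D"),
  score_differ = c(5, -12, 2, -3)
)

# Largest positive gaps: audiences scored these highest relative to critics.
audience_favored <- toy_df %>% slice_max(score_differ, n = 2)

# Most negative gaps: critics scored these highest relative to audiences.
critic_favored <- toy_df %>% slice_min(score_differ, n = 2)

# Horizontal bars, ordered by the size of the gap.
audience_plot <- ggplot(
  audience_favored,
  aes(x = score_differ, y = reorder(movie_name, score_differ))
) +
  geom_col() +
  labs(x = "Audience score minus critic score", y = NULL)
```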
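A yearly average like the one described above could be computed by grouping on release year, sketched here on invented scores since the original plotting code is not shown. The n_movies column is an addition of this sketch, not something the report computes.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Invented release years and critic scores, for illustration only.
toy_df <- tibble(
  movie_year   = c(2019, 2019, 2020, 2021),
  critic_score = c(99, 97, 90, 95)
)

# Average critic score per release year, plus how many films back each average.
yearly_avg <- toy_df %>%
  group_by(movie_year) %>%
  summarise(
    avg_critic_score = mean(critic_score),
    n_movies         = n()
  )

# Line chart of the yearly averages.
trend_plot <- ggplot(yearly_avg, aes(x = movie_year, y = avg_critic_score)) +
  geom_line() +
  geom_point() +
  labs(x = "Release year", y = "Average critic score")
```

Keeping the film count next to each average helps when reading the trend: with 100 movies spread over roughly 25 years, a single year's average may rest on only a few films.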