Scraping 200 Best Movies of 2010s from Rotten Tomatoes

Overview

The goal of this project is to scrape data on the 200 best movies of the last decade from the Rotten Tomatoes website with the R rvest package, and finally create a dashboard in Tableau. The idea is to show all the movies in one place. Hovering over the movie should reveal relevant data in the tooltip for quick overview. Clicking on the movie should open the movie’s website for more information.

I’ve learned so much while working on this project (like web scraping, writing functions, iteration,…). The purrr package for functional programming is super-cool. It allows iteration with just one line of code (a very handy replacement for for loops).

I hope you’ll enjoy the process as much as I did. At times it was quite challenging, but that’s how we learn!

Setting up the programming evironment

# loading packages
library(tidyverse)
library(rvest)

Are we allowed to scrape data from the Rotten Tomatoes website?

robotstxt::paths_allowed("https://www.rottentomatoes.com/")

[1] TRUE

Plan

The data will be scraped from this page. Since it doesn’t contain all the data I am interested in, I have to visit every movie’s web page on the list and scrape data from there. Here is the plan:

Scrape data from the main page: the urls of movies, and the urls of images.
Scrape title, year_genre_runtime, critics_score, audiaece_score, and synopsis from the first movie to develop the code.
Write a function that scrapes data based on movie’s url.
Iteration - use this function to scrape data from each individual movie and create a data frame with the columns title, year_genre_runtime, critics_score, audiaece_score, synopsis, and url.
Download images
Wrangle data
Create a dashboard in Tableau

1. Scraping data from the main page

Read the main page with read_html().

main_url <- "https://editorial.rottentomatoes.com/guide/the-200-best-movies-of-the-2010s/"
main_page <- read_html(main_url)

Figure 1: The main page

I make use of the SelectorGadget to identify the tags for the relevant nodes. Here is the link for Chrome (recommended).

Extract urls of movies

The same nodes that contain the text for the titles also contain information on the links to individual movie pages for each title. We can extract this information using the html_attr() function, which extracts attributes.

movie_urls <- main_page %>% 
  html_nodes(".article_movie_title a") %>% 
  html_attr("href")

movie_urls %>% head()

[1] "https://www.rottentomatoes.com/m/12_years_a_slave"    
[2] "https://www.rottentomatoes.com/m/20_feet_from_stardom"
[3] "https://www.rottentomatoes.com/m/45_years"            
[4] "https://www.rottentomatoes.com/m/all_is_lost_2013"    
[5] "https://www.rottentomatoes.com/m/amazing_grace_2018"  
[6] "https://www.rottentomatoes.com/m/american_hustle"

Extract urls of images

image_urls <- main_page %>% 
  html_nodes(".article_poster") %>% 
  html_attr("src")

Let’s check the image for the 6th title.

knitr::include_graphics(image_urls[6])

2. Scraping data for the first movie on the list

I am going to scrape data for the movie 12 Years a Slave in order to develop the code.

Read page for the first movie.

url <- "https://www.rottentomatoes.com/m/12_years_a_slave"
movie_page <- read_html(url)

Figure 2: Title, year, genre, runtime, critics and audience score

Scroll down the page and you’ll find the movie synopsis.

Figure 3: Synopsis

Extract title

title <- movie_page %>% 
  html_node(".scoreboard__title") %>% 
  html_text()

title

[1] "12 Years a Slave"

Extract year, genre, and runtime

year_genre_runtime <- movie_page %>% 
  html_node(".scoreboard__info") %>% 
  html_text()

year_genre_runtime

[1] "2013, History/Drama, 2h 14m"

Extract critics score

The next two are tricky. I had to look at the page source and find them manually.

critics_score <- movie_page %>% 
  html_element("score-board") %>% 
  html_attr("tomatometerscore") %>% 
  str_c(.,"%")

critics_score

[1] "95%"

Extract audience score

audience_score <- movie_page %>% 
  html_element("score-board") %>% 
  html_attr("audiencescore") %>% 
  str_c(.,"%")

audience_score

[1] "90%"

Extract movie synopsis

synopsis <- movie_page %>% 
  html_node("#movieSynopsis") %>% 
  html_text2()

synopsis

[1] "In the years before the Civil War, Solomon Northup (Chiwetel Ejiofor), a free black man from upstate New York, is kidnapped and sold into slavery in the South. Subjected to the cruelty of one malevolent owner (Michael Fassbender), he also finds unexpected kindness from another, as he struggles continually to survive and maintain some of his dignity. Then in the 12th year of the disheartening ordeal, a chance meeting with an abolitionist from Canada changes Solomon's life forever."

Making a data frame of extracted elements

movie  <- tibble(title = title, 
                 year_genre_runtime = year_genre_runtime,
                 critics_score = critics_score,
                 audience_score = audience_score,
                 synopsis = synopsis,  
                 url = url)

movie %>% glimpse()

Rows: 1
Columns: 6
$ title              <chr> "12 Years a Slave"
$ year_genre_runtime <chr> "2013, History/Drama, 2h 14m"
$ critics_score      <chr> "95%"
$ audience_score     <chr> "90%"
$ synopsis           <chr> "In the years before the Civil War, Solom…
$ url                <chr> "https://www.rottentomatoes.com/m/12_year…

3. Writing a function

Instead of manually scraping individual movies, I’ll write a function to do the same.

scrape_movie <- function(x, ...){
  
  movie_page <- read_html(x)
  
  title <- movie_page %>% 
    html_node(".scoreboard__title") %>% 
    html_text()
  
  year_genre_runtime <- movie_page %>% 
    html_node(".scoreboard__info") %>% 
    html_text()
  
  critics_score <- movie_page %>% 
    html_element("score-board") %>% 
    html_attr("tomatometerscore") %>% 
    str_c(.,"%")
  
  audience_score <- movie_page %>% 
    html_element("score-board") %>% 
    html_attr("audiencescore") %>% 
    str_c(.,"%")
  
  synopsis <- movie_page %>% 
    html_node("#movieSynopsis") %>% 
    html_text2()
  
  movie_df <- tibble(title = title, 
                     year_genre_runtime = year_genre_runtime,
                     critics_score = critics_score,
                     audience_score = audience_score,
                     synopsis = synopsis,
                     url = x)
  
  return(movie_df)
  
}

Function in action

Now that we have the scrape_movie() function, let’s scrape data for the movie “American Hustle”.

scrape_movie(movie_urls[6]) %>% glimpse()

Rows: 1
Columns: 6
$ title              <chr> "American Hustle"
$ year_genre_runtime <chr> "2013, Crime/Drama, 2h 18m"
$ critics_score      <chr> "92%"
$ audience_score     <chr> "74%"
$ synopsis           <chr> "Irving Rosenfeld (Christian Bale) dabble…
$ url                <chr> "https://www.rottentomatoes.com/m/america…

Or “Ex Machina”, another great movie.

 scrape_movie(movie_urls[53]) %>% glimpse()

Rows: 1
Columns: 6
$ title              <chr> "Ex Machina"
$ year_genre_runtime <chr> "2014, Sci-fi/Mystery & thriller, 1h 47m"
$ critics_score      <chr> "92%"
$ audience_score     <chr> "86%"
$ synopsis           <chr> "Caleb Smith (Domhnall Gleeson) a program…
$ url                <chr> "https://www.rottentomatoes.com/m/ex_mach…

4. Iteration

To make my workflow a little more efficient, I make use of the map_dfr() function from the purrr package to iterate over all movie pages. map_dfr() will apply the scrape_movie()function to each element in the vector of links, and return a data frame created by row-binding.

movies <- map_dfr(movie_urls, scrape_movie)

movies %>% head()

# A tibble: 6 × 6
  title   year_genre_runt… critics_score audience_score synopsis url  
  <chr>   <chr>            <chr>         <chr>          <chr>    <chr>
1 12 Yea… 2013, History/D… 95%           90%            In the … http…
2 20 Fee… 2013, Documenta… 99%           82%            Filmmak… http…
3 45 Yea… 2015, Drama, 1h… 97%           67%            As thei… http…
4 All Is… 2013, Adventure… 94%           63%            During … http…
5 Amazin… 2018, Documenta… 99%           80%            Singer … http…
6 Americ… 2013, Crime/Dra… 92%           74%            Irving … http…

5. Downloading images

I’ve already extracted urls of images in the step 1 and saved them to image_urls. Now I’m going to create a directory and directory paths for the images.

fs::dir_create("images/top_200_images/")

paths <- c(str_c("images/top_200_images/", sprintf("%0.3d", 1:200), ".jpg"))

paths %>% head()

[1] "images/top_200_images/001.jpg" "images/top_200_images/002.jpg"
[3] "images/top_200_images/003.jpg" "images/top_200_images/004.jpg"
[5] "images/top_200_images/005.jpg" "images/top_200_images/006.jpg"

Download images

This time I’ll use map2() function from the purrr package, It will apply the download.file() function to pairs of elements from two vectors, image_urls and paths.

map2(image_urls, paths, function(.x, .y) download.file(.x, .y, mode="wb")) %>% 
  head(3)

[[1]]
[1] 0

[[2]]
[1] 0

[[3]]
[1] 0

Are the images properly saved? Let’s read in the image for the first movie.

knitr::include_graphics("images/top_200_images/001.jpg")

6. Data wrangling

Preparing the final dataset to be used in Tableau.

movies <- movies %>% 
  
  # separate year_genre_runtime column into year, genre, and runtime
  separate(year_genre_runtime, sep = ", ", into = c("year", "genre", "runtime")) %>% 
  mutate(year = as.factor(year)) %>% 
  
  # separate genre column into primary and secondary genre
  separate(genre, sep = "/", into = c("genre_1", "genre_2"), remove = FALSE) %>% 
  
  # create id column with leading zeroes so Tableau can automatically match the images
  mutate(id = sprintf("%0.3d", 1:200)) %>% 
  select(id, everything())

movies %>% head()

# A tibble: 6 × 11
  id    title        year  genre genre_1 genre_2 runtime critics_score
  <chr> <chr>        <fct> <chr> <chr>   <chr>   <chr>   <chr>        
1 001   12 Years a … 2013  Hist… History Drama   2h 14m  95%          
2 002   20 Feet Fro… 2013  Docu… Docume… <NA>    1h 30m  99%          
3 003   45 Years     2015  Drama Drama   <NA>    1h 33m  97%          
4 004   All Is Lost  2013  Adve… Advent… Myster… 1h 45m  94%          
5 005   Amazing Gra… 2018  Docu… Docume… Music   1h 27m  99%          
6 006   American Hu… 2013  Crim… Crime   Drama   2h 18m  92%          
# … with 3 more variables: audience_score <chr>, synopsis <chr>,
#   url <chr>

# number of unique values in genre column
movies$genre %>% unique() %>% length()

[1] 58

# unique values in genre_1
movies$genre_1 %>% unique()

 [1] "History"            "Documentary"        "Drama"             
 [4] "Adventure"          "Crime"              "Comedy"            
 [7] "Action"             "Sci-fi"             "Romance"           
[10] "Horror"             "Biography"          "Mystery & thriller"
[13] "Kids & family"      "War"                "Fantasy"           
[16] "Musical"            "Western"

# unique values in genre_2
movies$genre_2 %>% unique()

 [1] "Drama"              NA                   "Mystery & thriller"
 [4] "Music"              "Biography"          "Adventure"         
 [7] "History"            "Romance"            "Comedy"            
[10] "Lgbtq+"             "Action"             "War"               
[13] "Fantasy"            "Sci-fi"             "Crime"             
[16] "Musical"            "Western"            "Anime"             
[19] "Horror"

Finding values in genre_2, that are not in genre_1. This will help when creating a list parameter for filtering by primary or secondary genre.

setdiff(movies$genre_2, movies$genre_1)

[1] NA       "Music"  "Lgbtq+" "Anime"

DT table

movies %>% 
  select(1:9) %>% 
  DT::datatable(rownames = FALSE)

Writing file

Saving the dataset to excel file for dashboard creation in Tableau. Note: csv would remove leading zeros in the id column.

movies %>% writexl::write_xlsx("datasets/top_200_movies_2010s_rotten_tomatoes.xlsx")

7. Tableau dashboard

The final dashboard is created in Tableau. It’s actually a jitter plot, which separates overlapping movies with the same critics’ score.

To avoid two filters, one for primary and one for secondary genre, a list parameter is created that filters movies by primary or secondary genre, or “All” values.

For the best viewing experience, please click on the full screen in the bottom right corner.

You can nteract with the dashboard on Tableau Public. Enjoy!