Scraping data on movies from Rotten Tomatoes for dashboard creation in Tableau.
The goal of this project is to scrape data on the 200 best movies of
the last decade from the Rotten
Tomatoes website with the R rvest package, and finally
create a dashboard in Tableau. The idea is to show all the
movies in one place. Hovering over the movie should reveal relevant data
in the tooltip for quick overview. Clicking on the movie should open the
movie’s website for more information.
I’ve learned so much while working on this project (like web
scraping, writing functions, iteration,…). The purrr
package for functional programming is super-cool. It allows iteration
with just one line of code (a very handy replacement for
for loops).
I hope you’ll enjoy the process as much as I did. At times it was quite challenging, but that’s how we learn!
Are we allowed to scrape data from the Rotten Tomatoes website?
robotstxt::paths_allowed("https://www.rottentomatoes.com/")
[1] TRUE
The data will be scraped from this page. Since it doesn’t contain all the data I am interested in, I have to visit every movie’s web page on the list and scrape data from there. Here is the plan:
1. Scrape data from the main page: the urls of movies and the urls of images.
2. Scrape title, year_genre_runtime, critics_score, audience_score, and synopsis from the first movie to develop the code.
3. Write a function that scrapes data based on a movie’s url.
4. Iteration: use this function to scrape data from each individual movie and create a data frame with the columns title, year_genre_runtime, critics_score, audience_score, synopsis, and url.
5. Download images
6. Wrangle data
7. Create a dashboard in Tableau
Read the main page with read_html().
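library(tidyverse) # pipes, tibble, stringr, purrr, tidyr
library(rvest) # read_html(), html_nodes(), html_attr()
main_url <- "https://editorial.rottentomatoes.com/guide/the-200-best-movies-of-the-2010s/"
main_page <- read_html(main_url)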
Figure 1: The main page
I use SelectorGadget to identify the CSS selectors
for the relevant nodes. Here is the link
for Chrome (recommended).
Extract urls of movies
The same nodes that contain the text for the titles also contain
information on the links to individual movie pages for each title. We
can extract this information using the html_attr()
function, which extracts attributes.
movie_urls <- main_page %>%
html_nodes(".article_movie_title a") %>%
html_attr("href")
movie_urls %>% head()
[1] "https://www.rottentomatoes.com/m/12_years_a_slave"
[2] "https://www.rottentomatoes.com/m/20_feet_from_stardom"
[3] "https://www.rottentomatoes.com/m/45_years"
[4] "https://www.rottentomatoes.com/m/all_is_lost_2013"
[5] "https://www.rottentomatoes.com/m/amazing_grace_2018"
[6] "https://www.rottentomatoes.com/m/american_hustle"
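To make the difference between the text of a node and its attributes concrete, here is a minimal offline sketch. The HTML snippet and the example.com links are made up for illustration; only the selector mirrors the one used above.

```r
library(rvest)

# Hypothetical stand-in for the list page: two titles linking to movie pages
page <- minimal_html('
  <div class="article_movie_title"><a href="https://example.com/m/first">First</a></div>
  <div class="article_movie_title"><a href="https://example.com/m/second">Second</a></div>
')

links <- page %>% html_nodes(".article_movie_title a")

# html_text() returns what is displayed, html_attr() returns an attribute value
titles <- links %>% html_text()
urls <- links %>% html_attr("href")

titles
urls
```

The same set of nodes yields both vectors, so titles and urls stay in matching order.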
Extract urls of images
image_urls <- main_page %>%
html_nodes(".article_poster") %>%
html_attr("src")
Let’s check the image for the 6th title.
knitr::include_graphics(image_urls[6])
I am going to scrape data for the movie 12 Years a Slave in order to develop the code.
Read page for the first movie.
url <- "https://www.rottentomatoes.com/m/12_years_a_slave"
movie_page <- read_html(url)
Figure 2: Title, year, genre, runtime, critics and audience score
Scroll down the page and you’ll find the movie synopsis.
Figure 3: Synopsis
Extract title
title <- movie_page %>%
html_node(".scoreboard__title") %>%
html_text()
title
[1] "12 Years a Slave"
Extract year, genre, and
runtime
year_genre_runtime <- movie_page %>%
html_node(".scoreboard__info") %>%
html_text()
year_genre_runtime
[1] "2013, History/Drama, 2h 14m"
Extract critics score
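The next two are tricky: the scores are not part of the visible text but are stored as attributes of the score-board element, so I had to look at the page source and find them manually.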
critics_score <- movie_page %>%
html_element("score-board") %>%
html_attr("tomatometerscore") %>%
str_c(.,"%")
critics_score
[1] "95%"
Extract audience score
audience_score <- movie_page %>%
html_element("score-board") %>%
html_attr("audiencescore") %>%
str_c(.,"%")
audience_score
[1] "90%"
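Here is a small offline sketch of reading scores from element attributes. The one-line HTML snippet is invented and only imitates the score-board element; the attribute nosuchscore is hypothetical and is there to show what happens when an attribute is absent.

```r
library(rvest)
library(stringr)

# Made-up snippet imitating the custom element found in the page source
page <- minimal_html('<score-board tomatometerscore="95" audiencescore="90"></score-board>')
board <- page %>% html_element("score-board")

critics <- board %>% html_attr("tomatometerscore")
audience <- board %>% html_attr("audiencescore")

# A missing attribute gives NA instead of an error
missing_score <- board %>% html_attr("nosuchscore")

# Attributes always come back as character, so convert explicitly if numbers are needed
critics_label <- str_c(critics, "%")
critics_number <- as.numeric(critics)
```

Keeping the numeric version around can be handy if the scores are later used for sorting or plotting rather than just for display.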
Extract movie synopsis
synopsis <- movie_page %>%
html_node("#movieSynopsis") %>%
html_text2()
synopsis
[1] "In the years before the Civil War, Solomon Northup (Chiwetel Ejiofor), a free black man from upstate New York, is kidnapped and sold into slavery in the South. Subjected to the cruelty of one malevolent owner (Michael Fassbender), he also finds unexpected kindness from another, as he struggles continually to survive and maintain some of his dignity. Then in the 12th year of the disheartening ordeal, a chance meeting with an abolitionist from Canada changes Solomon's life forever."
Making a data frame of extracted elements
movie <- tibble(title = title,
year_genre_runtime = year_genre_runtime,
critics_score = critics_score,
audience_score = audience_score,
synopsis = synopsis,
url = url)
movie %>% glimpse()
Rows: 1
Columns: 6
$ title <chr> "12 Years a Slave"
$ year_genre_runtime <chr> "2013, History/Drama, 2h 14m"
$ critics_score <chr> "95%"
$ audience_score <chr> "90%"
$ synopsis <chr> "In the years before the Civil War, Solom…
$ url <chr> "https://www.rottentomatoes.com/m/12_year…
Instead of manually scraping individual movies, I’ll write a function to do the same.
scrape_movie <- function(x, ...){
movie_page <- read_html(x)
title <- movie_page %>%
html_node(".scoreboard__title") %>%
html_text()
year_genre_runtime <- movie_page %>%
html_node(".scoreboard__info") %>%
html_text()
critics_score <- movie_page %>%
html_element("score-board") %>%
html_attr("tomatometerscore") %>%
str_c(.,"%")
audience_score <- movie_page %>%
html_element("score-board") %>%
html_attr("audiencescore") %>%
str_c(.,"%")
synopsis <- movie_page %>%
html_node("#movieSynopsis") %>%
html_text2()
movie_df <- tibble(title = title,
year_genre_runtime = year_genre_runtime,
critics_score = critics_score,
audience_score = audience_score,
synopsis = synopsis,
url = x)
return(movie_df)
}
Function in action
Now that we have the scrape_movie() function, let’s
scrape data for the movie “American Hustle”.
scrape_movie(movie_urls[6]) %>% glimpse()
Rows: 1
Columns: 6
$ title <chr> "American Hustle"
$ year_genre_runtime <chr> "2013, Crime/Drama, 2h 18m"
$ critics_score <chr> "92%"
$ audience_score <chr> "74%"
$ synopsis <chr> "Irving Rosenfeld (Christian Bale) dabble…
$ url <chr> "https://www.rottentomatoes.com/m/america…
Or “Ex Machina”, another great movie.
Rows: 1
Columns: 6
$ title <chr> "Ex Machina"
$ year_genre_runtime <chr> "2014, Sci-fi/Mystery & thriller, 1h 47m"
$ critics_score <chr> "92%"
$ audience_score <chr> "86%"
$ synopsis <chr> "Caleb Smith (Domhnall Gleeson) a program…
$ url <chr> "https://www.rottentomatoes.com/m/ex_mach…
To make my workflow a little more efficient, I use the
map_dfr() function from the purrr package to
iterate over all movie pages. map_dfr() will apply the
scrape_movie() function to each element in the vector of
links, and return a data frame created by row-binding.
movies <- map_dfr(movie_urls, scrape_movie)
movies %>% head()
# A tibble: 6 × 6
title year_genre_runt… critics_score audience_score synopsis url
<chr> <chr> <chr> <chr> <chr> <chr>
1 12 Yea… 2013, History/D… 95% 90% In the … http…
2 20 Fee… 2013, Documenta… 99% 82% Filmmak… http…
3 45 Yea… 2015, Drama, 1h… 97% 67% As thei… http…
4 All Is… 2013, Adventure… 94% 63% During … http…
5 Amazin… 2018, Documenta… 99% 80% Singer … http…
6 Americ… 2013, Crime/Dra… 92% 74% Irving … http…
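With 200 requests, a single broken page would stop the whole run. One possible safeguard, sketched here under assumptions, is to wrap the scraper with purrr's possibly(). The function fake_scrape() is a hypothetical stand-in for scrape_movie() so the example runs without a network connection.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for scrape_movie(): fails on one input to mimic a broken page
fake_scrape <- function(x) {
  if (x == "bad") stop("page not found")
  tibble(title = toupper(x), url = x)
}

# possibly() returns NULL instead of raising an error
safe_scrape <- possibly(fake_scrape, otherwise = NULL)

# NULL results are dropped when the rows are bound together
results <- map_dfr(c("a", "bad", "b"), safe_scrape)
results
```

The failed input simply disappears from the result, so it is worth comparing nrow(results) with the number of urls afterwards.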
I’ve already extracted the urls of the images in step 1 and saved them to
image_urls. Now I’m going to create a directory and
file paths for the images.
fs::dir_create("images/top_200_images/")
paths <- c(str_c("images/top_200_images/", sprintf("%0.3d", 1:200), ".jpg"))
paths %>% head()
[1] "images/top_200_images/001.jpg" "images/top_200_images/002.jpg"
[3] "images/top_200_images/003.jpg" "images/top_200_images/004.jpg"
[5] "images/top_200_images/005.jpg" "images/top_200_images/006.jpg"
Download images
This time I’ll use the map2() function from the
purrr package. It will apply the
download.file() function to pairs of elements from two
vectors, image_urls and paths.
map2(image_urls, paths, function(.x, .y) download.file(.x, .y, mode="wb")) %>%
head(3)
[[1]]
[1] 0
[[2]]
[1] 0
[[3]]
[1] 0
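Since downloading is done for its side effect, walk2() is an alternative to map2() that pairs elements the same way but does not print a list of return codes. The sketch below is an assumption-laden illustration: it copies made-up local files instead of downloading, and builds the numbered paths from the length of the input rather than a fixed 1:200.

```r
library(purrr)
library(stringr)

# Hypothetical inputs: three local files standing in for image_urls
src_dir <- file.path(tempdir(), "src")
out_dir <- file.path(tempdir(), "out")
dir.create(src_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)
sources <- file.path(src_dir, c("a.jpg", "b.jpg", "c.jpg"))
walk(sources, ~ writeLines("fake image", .x))

# Zero-padded target paths, one per source
targets <- file.path(out_dir, str_c(sprintf("%03d", seq_along(sources)), ".jpg"))

# walk2() iterates over both vectors in parallel, like map2(), for side effects only
walk2(sources, targets, ~ file.copy(.x, .y, overwrite = TRUE))

basename(targets)
```

Swapping file.copy() for download.file(.x, .y, mode = "wb") would give the download step from above.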
Are the images properly saved? Let’s read in the image for the first movie.
knitr::include_graphics("images/top_200_images/001.jpg")
Preparing the final dataset to be used in Tableau.
movies <- movies %>%
# separate year_genre_runtime column into year, genre, and runtime
separate(year_genre_runtime, sep = ", ", into = c("year", "genre", "runtime")) %>%
mutate(year = as.factor(year)) %>%
# separate genre column into primary and secondary genre
separate(genre, sep = "/", into = c("genre_1", "genre_2"), remove = FALSE) %>%
# create id column with leading zeroes so Tableau can automatically match the images
mutate(id = sprintf("%0.3d", 1:200)) %>%
select(id, everything())
movies %>% head()
# A tibble: 6 × 11
id title year genre genre_1 genre_2 runtime critics_score
<chr> <chr> <fct> <chr> <chr> <chr> <chr> <chr>
1 001 12 Years a … 2013 Hist… History Drama 2h 14m 95%
2 002 20 Feet Fro… 2013 Docu… Docume… <NA> 1h 30m 99%
3 003 45 Years 2015 Drama Drama <NA> 1h 33m 97%
4 004 All Is Lost 2013 Adve… Advent… Myster… 1h 45m 94%
5 005 Amazing Gra… 2018 Docu… Docume… Music 1h 27m 99%
6 006 American Hu… 2013 Crim… Crime Drama 2h 18m 92%
# … with 3 more variables: audience_score <chr>, synopsis <chr>,
# url <chr>
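The two separate() calls can be tried on a tiny example. The two rows below are taken from the scraped values shown earlier; fill = "right" is my addition, so that a movie with a single genre gets NA in genre_2 without a warning.

```r
library(tibble)
library(tidyr)

toy <- tibble(year_genre_runtime = c("2013, History/Drama, 2h 14m",
                                     "2015, Drama, 1h 33m"))

toy_split <- toy %>%
  separate(year_genre_runtime, sep = ", ", into = c("year", "genre", "runtime")) %>%
  # fill = "right" pads a missing secondary genre with NA
  separate(genre, sep = "/", into = c("genre_1", "genre_2"),
           remove = FALSE, fill = "right")

toy_split
```

Because remove = FALSE keeps the original genre column, the combined value is still available for the tooltip.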
unique(movies$genre_1)
[1] "History" "Documentary" "Drama"
[4] "Adventure" "Crime" "Comedy"
[7] "Action" "Sci-fi" "Romance"
[10] "Horror" "Biography" "Mystery & thriller"
[13] "Kids & family" "War" "Fantasy"
[16] "Musical" "Western"
unique(movies$genre_2)
[1] "Drama" NA "Mystery & thriller"
[4] "Music" "Biography" "Adventure"
[7] "History" "Romance" "Comedy"
[10] "Lgbtq+" "Action" "War"
[13] "Fantasy" "Sci-fi" "Crime"
[16] "Musical" "Western" "Anime"
[19] "Horror"
Finding values in genre_2 that are not in
genre_1. This will help when creating a list parameter for
filtering by primary or secondary genre.
setdiff(movies$genre_2, movies$genre_1)
[1] NA "Music" "Lgbtq+" "Anime"
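As a rough sketch of how the values for such a list parameter could be assembled in R, the snippet below combines both genre columns and prepends "All". The two short vectors are toy data and the name param_values is hypothetical; the actual parameter is built in Tableau.

```r
# Toy vectors standing in for movies$genre_1 and movies$genre_2
genre_1 <- c("History", "Documentary", "Drama")
genre_2 <- c("Drama", NA, "Music")

# Values that appear only as a secondary genre (NA is included, as above)
only_secondary <- setdiff(genre_2, genre_1)

# union() removes duplicates, sort() drops NA by default
all_genres <- sort(union(genre_1, genre_2))
param_values <- c("All", all_genres)
param_values
```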
Writing file
Saving the dataset to an Excel file for dashboard creation in Tableau.
Note: I avoid csv here because the id column could be read back as a number, which would drop its leading zeros.
movies %>% writexl::write_xlsx("datasets/top_200_movies_2010s_rotten_tomatoes.xlsx")
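The leading-zero problem is easy to reproduce in R itself. This sketch uses base R's read.csv() as a stand-in; that Tableau guesses column types in a similar way is an assumption on my part.

```r
# Ids with leading zeros, as in the id column
ids <- data.frame(id = sprintf("%03d", 1:3), stringsAsFactors = FALSE)

csv_path <- tempfile(fileext = ".csv")
write.csv(ids, csv_path, row.names = FALSE)

# Default type guessing turns the ids into integers
guessed <- read.csv(csv_path)

# Forcing character keeps them intact
kept <- read.csv(csv_path, colClasses = "character")

guessed$id
kept$id
```

So csv can work when the reader is told to treat the column as text; the Excel file just avoids having to do that.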
The final dashboard is created in Tableau. It’s actually a jitter plot, which separates overlapping movies that share the same critics’ score.
To avoid two separate filters, one for primary and one for secondary genre, I created a list parameter that filters movies by primary or secondary genre, or shows all of them when set to “All”.
For the best viewing experience, click the full-screen button in the bottom right corner.
You can interact with the dashboard on Tableau Public. Enjoy!