IMDB Scraping for Quentin Tarantino Movies

The goal of this scrape was to see whether reviews of movies directed by Quentin Tarantino lean positive or negative. Opinions on Tarantino differ: some say his movies are too violent and view them negatively, while others call him a true filmmaker. The analysis checks whether the numeric ratings and a sentiment analysis of the review text back either view, to show what reviewers think of Quentin Tarantino's movies.

Background

The reviews come from IMDB, which describes itself as a database of movies, TV, and other forms of media. Each title has a user review section where users can write a review and leave a rating out of 10. The scrape collected 25 reviews for each Tarantino movie, and these are used for the analysis.

Process

The data was taken from each movie's user review page on the website. First, load the packages below and set the user agent, which is the name the scraper identifies itself by; here it is the generic bot name below.

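library(tidyverse)
library(httr)
library(rvest)
library(lubridate)
library(magrittr)
library(tidytext)  # provides unnest_tokens(), get_sentiments(), and stop_words used in the Sentiment section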
set_config(user_agent("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/W.X.Y.Z Safari/537.36"))

Example of Data Scrape

This example uses a well-known movie that was not directed by Quentin Tarantino, Star Wars: A New Hope, to walk through how each variable was acquired. The first step is to read the URL into an object. The second is to remove parts of the HTML that are not needed and would otherwise prevent the data from being collected correctly.

Star_Wars_url <- 
  read_html("https://www.imdb.com/title/tt0076759/reviews")
Star_Wars_url %>% 
  html_elements("span.point-scale") %>% 
  xml2::xml_remove()
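
To make the removal step concrete, here is a minimal sketch on a hand-written HTML fragment. The markup is a simplified assumption about how a rating is laid out on the review page, not a copy of IMDB's actual HTML. It shows that `xml_remove()` changes the parsed page in place, so the `/10` scale no longer appears next to the rating.

```r
library(rvest)

# Hypothetical, simplified markup standing in for one rating on the page
page <- minimal_html('
  <span class="rating-other-user-rating">
    <span>8</span><span class="point-scale">/10</span>
  </span>')

# Before removal, both nested spans are matched
before <- page %>%
  html_elements("span.rating-other-user-rating span") %>%
  html_text2()

# xml_remove() modifies the parsed document in place, so nothing is reassigned
page %>%
  html_elements("span.point-scale") %>%
  xml2::xml_remove()

# After removal, only the numeric rating is left
after <- page %>%
  html_elements("span.rating-other-user-rating span") %>%
  html_text2()

print(before)
print(after)
```

Because the page object is modified directly, the removal only needs to run once before any of the variables are extracted.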

The next step is to collect the variables that will be used for analysis; the code is similar for most of them. The variables are listed below:

reviewer_name - The user name of the poster

reviewer_date - The date the review was posted

reviewer_title - The title the user gave the review

reviewer_content - The review in written form

reviewer_rating - A numeric variable that is a rating out of 10

spoiler_warning - 1 if the review contains a spoiler warning, otherwise 0

reviewer_name <- 
  Star_Wars_url %>% 
  html_elements("div.review-container") %>% 
  html_element("div.display-name-date") %>% 
  html_element("span.display-name-link") %>% 
  html_text2()

reviewer_date <-
  Star_Wars_url %>% 
  html_elements("div.review-container") %>% 
  html_elements("div.display-name-date") %>% 
  html_elements("span.review-date") %>% 
  html_text2() %>% 
  dmy()

reviewer_title <-
  Star_Wars_url %>% 
  html_elements("div.lister-item-content") %>% 
  html_elements("a.title") %>% 
  html_text2()

reviewer_content <-
  Star_Wars_url %>% 
  html_elements("div.review-container") %>% 
  html_elements("div.text.show-more__control") %>% 
  html_text2()

reviewer_rating <- 
  ifelse(
    str_detect(
      Star_Wars_url %>% 
        html_elements("div.review-container") %>% 
        as.character(), "ipl-ratings-bar"),
    Star_Wars_url %>% 
      html_elements("div.review-container") %>% 
      html_elements("div.ipl-ratings-bar") %>% 
      html_elements("span.rating-other-user-rating") %>% 
      html_elements("span") %>% 
      html_text2() %>% 
      as.numeric(),
    NA
  )

spoiler_warning <-
  ifelse(
    str_detect(
      Star_Wars_url %>% 
        html_elements("div.review-container") %>% 
        as.character(), "spoiler-warning"),
    1, 0)
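
One thing worth checking with the `ifelse()` approach to ratings: `ifelse()` recycles its "yes" vector, so when a review has no rating the shorter vector of scraped ratings may end up matched to the wrong reviews. The sketch below shows an alternative on hand-written HTML (the markup is a simplified assumption, not IMDB's real page). It relies on `html_element()` (singular) returning one result per review container, with `NA` where nothing matches.

```r
library(rvest)

# Hypothetical markup: three reviews, the second of which has no rating
page <- minimal_html('
  <div class="review-container">
    <span class="rating-other-user-rating"><span>9</span></span>
  </div>
  <div class="review-container"><p>No rating given</p></div>
  <div class="review-container">
    <span class="rating-other-user-rating"><span>4</span></span>
  </div>')

containers <- page %>% html_elements("div.review-container")

# html_element() keeps one entry per container, so ratings stay aligned
# with their reviews and unrated reviews become NA
ratings <- containers %>%
  html_element("span.rating-other-user-rating span") %>%
  html_text2() %>%
  as.numeric()

print(ratings)
```

The result has the same length as the number of review containers, which is what `data.frame()` needs when the columns are combined.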

The next step is to combine these variables into a data frame.

imdb_df <-
  data.frame(reviewer_name, reviewer_date, reviewer_rating, reviewer_title, reviewer_content, spoiler_warning)

This example works for one page, so the steps need to be wrapped in a function that can be used for any URL entered.

Function

This function combines the steps above so they can be reused. It adds one piece of information, the movie title, which should be entered along with the URL. The title is added to every review of that movie.

imdb_page_scrape <- function(url, movie_title){
  
  imdb_url <- 
    read_html(url)
  
  imdb_url %>% 
    html_elements("span.point-scale") %>% 
    xml2::xml_remove()
  
  reviewer_name <- 
    imdb_url %>% 
    html_elements("div.review-container") %>% 
    html_element("div.display-name-date") %>% 
    html_element("span.display-name-link") %>% 
    html_text2()
  
  reviewer_date <-
    imdb_url %>% 
    html_elements("div.review-container") %>% 
    html_elements("div.display-name-date") %>% 
    html_elements("span.review-date") %>% 
    html_text2() %>% 
    dmy()
  
  reviewer_title <-
    imdb_url %>% 
    html_elements("div.lister-item-content") %>% 
    html_elements("a.title") %>% 
    html_text2()
  
  reviewer_content <-
    imdb_url %>% 
    html_elements("div.review-container") %>% 
    html_elements("div.text.show-more__control") %>% 
    html_text2()
  
  reviewer_rating <- 
    ifelse(
      # Logical test checking if there is an `ipl-ratings-bar` element in the 
      # review-container element
      str_detect(
        imdb_url %>% 
          html_elements("div.review-container") %>% 
          as.character(),"ipl-ratings-bar"),
      # True condition: if the `ipl-ratings-bar` exists in the review-container
      # element, record the ratings that exists.
      imdb_url %>% 
        html_elements("div.review-container") %>% 
        html_elements("div.ipl-ratings-bar") %>% 
        html_elements("span.rating-other-user-rating") %>% 
        html_elements("span") %>% 
        html_text2() %>% 
        as.numeric(),
      # False condition: if the `ipl-ratings-bar` element does not exist, there must
      # logically not be a rating to scrape, so we should record an NA instead
      NA
    )
  
  spoiler_warning <-
    ifelse(
      str_detect(
        imdb_url %>% 
          html_elements("div.review-container") %>% 
          as.character(),"spoiler-warning"),
      1,0)
  
  imdb_df <-
    data.frame(reviewer_name, reviewer_date, reviewer_rating, reviewer_title, reviewer_content, spoiler_warning,movie_title)
  
  return(imdb_df)
}

Reading in Tarantino Movies

The next step is simply to scrape each Tarantino movie. Listing them one by one is not the most efficient way, but IMDB does not make it easy to list movies by director and identifies each movie with a numeric ID, seemingly in the order titles were entered into the database. At the end, the results are all combined into one data frame.

Reservoir_Dogs <- imdb_page_scrape("https://www.imdb.com/title/tt0105236/reviews","Reservoir Dogs")

Pulp_Fiction <- imdb_page_scrape("https://www.imdb.com/title/tt0110912/reviews","Pulp Fiction")

Jackie_Brown <- imdb_page_scrape("https://www.imdb.com/title/tt0119396/reviews","Jackie Brown")

Kill_Bill_1 <- imdb_page_scrape("https://www.imdb.com/title/tt0266697/reviews","Kill Bill Vol 1")

Kill_Bill_2 <- imdb_page_scrape("https://www.imdb.com/title/tt0378194/reviews","Kill Bill Vol 2")

Inglourious_Basterds <- imdb_page_scrape("https://www.imdb.com/title/tt0361748/reviews","Inglourious Basterds")

Django_Unchained <- imdb_page_scrape("https://www.imdb.com/title/tt1853728/reviews","Django Unchained")

The_Hateful_Eight <- imdb_page_scrape("https://www.imdb.com/title/tt3460252/reviews","The Hateful Eight")

Once_Upon_a_Time_in_Hollywood <- imdb_page_scrape("https://www.imdb.com/title/tt7131622/reviews","Once Upon a Time in Hollywood")

# Jackie_Brown is included here so every scraped movie ends up in the combined data
Tarantino_df <-
  bind_rows(Reservoir_Dogs,Pulp_Fiction,Jackie_Brown,Kill_Bill_1,Kill_Bill_2,Inglourious_Basterds,Django_Unchained,The_Hateful_Eight,Once_Upon_a_Time_in_Hollywood)
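
Since every call follows the same pattern, the list of movies could also be kept in a small table and mapped over. This is a minimal sketch, not the code used for the report: `review_url()`, `scrape_all()`, and `fake_scrape()` are hypothetical helpers introduced here, and the stand-in scraper is used so the example runs without touching the website.

```r
library(purrr)
library(tibble)

# IMDB title IDs and titles (only the first three movies shown)
movies <- tribble(
  ~imdb_id,    ~movie_title,
  "tt0105236", "Reservoir Dogs",
  "tt0110912", "Pulp Fiction",
  "tt0119396", "Jackie Brown"
)

# Hypothetical helper: build the user review URL from a title ID
review_url <- function(imdb_id) {
  paste0("https://www.imdb.com/title/", imdb_id, "/reviews")
}

# Hypothetical helper: apply a scraper to every row and stack the results
scrape_all <- function(movies, scraper) {
  results <- map2(review_url(movies$imdb_id), movies$movie_title, scraper)
  dplyr::bind_rows(results)
}

# Stand-in scraper so the sketch runs offline; it only echoes its inputs
fake_scrape <- function(url, movie_title) {
  data.frame(url = url, movie_title = movie_title)
}

demo <- scrape_all(movies, fake_scrape)
print(demo)

# With the real function this would be:
# Tarantino_df <- scrape_all(movies, imdb_page_scrape)
```

Keeping the movies in one table means a title cannot be scraped and then left out of the combined data frame by accident.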

Analysis

The first step of the analysis is to see how reviewers rate the movies numerically.

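# summarise() gives one bar per movie at its mean rating,
# without relying on there being exactly 25 reviews per movie
Tarantino_df %>%
  group_by(movie_title) %>%
  summarise(avg_rating = mean(reviewer_rating, na.rm = TRUE)) %>%
  ggplot(aes(x = movie_title, y = avg_rating)) +
  geom_col() +
  labs(title = "", y = "Average Rating", x = "Movie")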

In this graph, every Tarantino movie has an average rating above 7.5, which suggests that most reviewers in the sample liked his movies throughout his career.
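
The averages behind a chart like this can also be put in a small table, together with how many reviews actually carried a rating. The sketch below uses made-up toy data (the movie names "A" and "B" and their ratings are assumptions for illustration only), with the same column names as the scraped data.

```r
library(dplyr)

# Toy data standing in for the scraped reviews; not real ratings
toy_reviews <- tibble(
  movie_title     = c("A", "A", "A", "B", "B"),
  reviewer_rating = c(8, 10, NA, 6, 7)
)

rating_summary <- toy_reviews %>%
  group_by(movie_title) %>%
  summarise(
    n_reviews  = n(),                                # reviews scraped
    n_rated    = sum(!is.na(reviewer_rating)),       # reviews with a rating
    avg_rating = mean(reviewer_rating, na.rm = TRUE),
    .groups    = "drop"
  )

print(rating_summary)
```

Reporting `n_rated` alongside the average shows how many ratings each bar is based on, since reviews without a rating are dropped from the mean.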

Sentiment

First, load the lexicon and remove the words that should not be scored: stop words and the words in the movie titles, which reviewers have to use when discussing a movie. Then plot the remaining scored words to see which ones make sense to keep. "Plot" is scored as negative and "top" as positive, but in the context of movie reviews neither carries sentiment, so both are removed.

bing <- 
  get_sentiments("bing")

# Words from each movie's title, kept alongside the review they belong to
movie_word <-
  Tarantino_df %>% 
  unnest_tokens(word, movie_title)

tidy_Tarantino <- 
  Tarantino_df %>%
  unnest_tokens(word, reviewer_content) %>%
  anti_join(stop_words, by = "word") %>% 
  # Joins on all shared columns, so title words are dropped only from
  # reviews of the movie they belong to
  anti_join(movie_word)

Tarantino_counts <-
  tidy_Tarantino %>% 
  group_by(word) %>%
  summarise(n = n()) %>% 
  inner_join(bing, by = "word")

Tarantino_counts %>%
  filter(!word %in% c("plot", "top")) %>% 
  filter(n > 25) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Tarantino Sentiment Scores by Word",
       subtitle = "Scorable words appearing more than 25 times",
       x = "Word",
       y = "Contribution to sentiment")

In this graph there are more positive words than negative words in the reviews. This suggests that Tarantino's films are reviewed more positively than negatively.
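
To put a number on the balance of positive and negative words, the share of each sentiment could be computed per movie. This is a minimal sketch under stated assumptions: the four-word lexicon and the review words are invented toy data, standing in for the bing lexicon and the tokenized reviews.

```r
library(dplyr)

# Toy lexicon standing in for the bing lexicon
toy_lexicon <- tibble(
  word      = c("great", "brilliant", "boring", "violent"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Toy tokens standing in for the tokenized reviews; not real data
toy_words <- tibble(
  movie_title = c("A", "A", "A", "B", "B", "B"),
  word        = c("great", "violent", "brilliant", "boring", "great", "film")
)

sentiment_share <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%   # keep only scorable words
  count(movie_title, sentiment) %>%          # words per movie and sentiment
  group_by(movie_title) %>%
  mutate(share = n / sum(n)) %>%             # proportion within each movie
  ungroup()

print(sentiment_share)
```

A positive share above 0.5 for a movie would mean its scorable words lean positive, which gives a per-movie counterpart to the overall word chart.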