In this assignment, we practiced a technique known as web scraping, which converts data embedded in HTML tags on a web page into structured data that can be easily accessed and used. Specifically, we scraped data on the most popular feature films of 2016 from the IMDb website.

First, we load the necessary packages. Note “rvest”, the package that lets us scrape data from the web page.

# install.packages("rvest")

library(rvest)
library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.2     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()         masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag()            masks stats::lag()
library(dplyr)

Finding the Data and Assigning It

Now that the packages are loaded, we specify the URL of the page that holds our desired data and read its HTML code.

url <- "http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature"

webpage <- read_html(url)

Scraping the Actual Data

Scraping Rankings

The website gives us lots of data, from movie rank, genre, Metascore, Actors, etc. We’ll need to scrape all of this. But let’s start with the rankings first.

First, we needed to find the specific CSS selector that identifies the rankings. We did this with the help of SelectorGadget and determined that it was “.text-primary”. Then we convert the selected nodes to text.

rank_data_html <- html_nodes(webpage, '.text-primary')

rank_data <- html_text(rank_data_html)

head(rank_data)
## [1] "1." "2." "3." "4." "5." "6."

Looking at the first few values, we can see that the code worked. However, the ranks are still character strings, so we need to convert them to numeric values.

rank_data<-as.numeric(rank_data)

head(rank_data)
## [1] 1 2 3 4 5 6
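
The select, extract, convert pattern used for the rankings can be tried offline on a small hand-written HTML snippet. The sketch below is only an illustration: `toy_html`, its contents, and the other `toy_` names are made up, and only the `.text-primary` class name is taken from the step above.

```r
library(rvest)

# Hypothetical stand-in for the IMDb page: two headers with a rank span each.
toy_html <- read_html('<html><body>
  <h3><span class="text-primary">1.</span> <a>Movie A</a></h3>
  <h3><span class="text-primary">2.</span> <a>Movie B</a></h3>
</body></html>')

# Select the nodes by CSS class, then pull out their text.
toy_rank_text <- html_text(html_nodes(toy_html, ".text-primary"))

# as.numeric() accepts the trailing dot, so "1." becomes 1.
toy_rank <- as.numeric(toy_rank_text)
print(toy_rank)
```

Testing a selector on a tiny snippet like this makes it easier to tell a wrong selector apart from a change in the live page.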

Scraping Titles

Now that we’ve converted the ranks, we need to do the same for the other variables. Next we’ll scrape the titles.

Once again, we find the CSS selector, extract the text from the selected nodes, and inspect the result.

title_data_html <- html_nodes(webpage,'.lister-item-header a')

title_data <- html_text(title_data_html)

head(title_data)
## [1] "Suicide Squad"     "The Conjuring 2"   "Captain Fantastic"
## [4] "Sing"              "Deadpool"          "Hidden Figures"

Let’s finish scraping for the other variables.

Scraping Movie Descriptions

description_data_html <- html_nodes(webpage,'.ratings-bar+ .text-muted')

description_data <- html_text(description_data_html)

# head(description_data)

The descriptions contain extraneous newline characters (“\n”), so it’s best to get rid of them.

description_data<-gsub("\n","",description_data)

# head(description_data)
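
As a small illustration of this cleaning step, the sketch below uses made-up strings (not the real scraped descriptions) and assumes the raw text may also carry padding spaces, which `trimws()` removes after `gsub()` has dropped the newlines.

```r
# Made-up raw strings imitating scraped descriptions
# (assumption: a leading "\n" and possibly some padding spaces).
toy_description <- c("\n    A first example plot summary.",
                     "\nA second example plot summary.   ")

# gsub() removes every newline; trimws() then strips leading and trailing spaces.
toy_clean <- trimws(gsub("\n", "", toy_description))
print(toy_clean)
```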

Scraping Movie Runtimes

runtime_data_html <- html_nodes(webpage,'.text-muted .runtime')

runtime_data <- html_text(runtime_data_html)

# head(runtime_data)
# Removing " min"

runtime_data<-gsub(" min","",runtime_data)

runtime_data<-as.numeric(runtime_data)

head(runtime_data)
## [1] 123 134 118 108 108 127

Scraping Movie Genre

genre_data_html <- html_nodes(webpage,'.genre')

genre_data <- html_text(genre_data_html)

# head(genre_data)

This one has a lot of extraneous characters, so let’s get rid of all of them.

# removing \n
genre_data<-gsub("\n","",genre_data)

# removing spaces
genre_data<-gsub(" ","",genre_data)

# only looking at the first genre of the movie

genre_data<-gsub(",.*","",genre_data)

# converting genres from texts to factors
genre_data<-as.factor(genre_data)

head(genre_data)
## [1] Action    Horror    Comedy    Animation Action    Biography
## Levels: Action Adventure Animation Biography Comedy Crime Drama Horror
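
The same "keep only the first genre" idea can be written a little more compactly. This is a sketch on made-up input strings (the `toy_` names are hypothetical); it drops everything from the first comma onward and then trims the surrounding whitespace instead of deleting every space.

```r
# Made-up raw genre strings, shaped like "\nGenre1, Genre2, ...   ".
toy_genre_raw <- c("\nAction, Adventure, Fantasy            ",
                   "\nHorror, Mystery, Thriller            ",
                   "\nComedy            ")

# Drop everything from the first comma on, then trim newlines and spaces.
toy_first_genre <- trimws(sub(",.*", "", toy_genre_raw))

# Convert to a factor, as in the step above.
toy_genre <- factor(toy_first_genre)
print(toy_genre)
```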

Scraping IMDb Movie Ratings

rating_data_html <- html_nodes(webpage,'.ratings-imdb-rating strong')

rating_data <- html_text(rating_data_html)

head(rating_data)
## [1] "5.9" "7.3" "7.9" "7.1" "8.0" "7.8"
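
The ratings above are still character strings (the str() output of the final data frame lists Rating as chr). If numeric ratings are wanted, for example to compute an average, a conversion along these lines could be applied; this is a sketch on the first three values shown above, and the `toy_` names are illustrative.

```r
# First three ratings from the output above, still stored as text.
toy_rating_text <- c("5.9", "7.3", "7.9")

# Convert to numeric so that arithmetic and numeric plot axes work.
toy_rating <- as.numeric(toy_rating_text)
print(mean(toy_rating))
```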

Scraping Movie Votes

votes_data_html <- html_nodes(webpage,'.sort-num_votes-visible span:nth-child(2)')

votes_data <- html_text(votes_data_html)

# head(votes_data)
votes_data<-gsub(",","",votes_data)

votes_data<-as.numeric(votes_data)

head(votes_data)
## [1] 622787 239737 199912 138656 928635 208154

Scraping Movie Directors

directors_data_html <- html_nodes(webpage,'.text-muted+ p a:nth-child(1)')

directors_data <- html_text(directors_data_html)

# head(directors_data)
directors_data<-as.factor(directors_data)

head(directors_data)
## [1] David Ayer     James Wan      Matt Ross      Garth Jennings Tim Miller    
## [6] Theodore Melfi
## 99 Levels: Alex Proyas Ana Lily Amirpour André Øvredal ... Zack Snyder

Scraping Movie Actors

actors_data_html <- html_nodes(webpage,'.lister-item-content .ghost+ a')

actors_data <- html_text(actors_data_html)

# head(actors_data)
actors_data<-as.factor(actors_data)

head(actors_data)
## [1] Will Smith          Vera Farmiga        Viggo Mortensen    
## [4] Matthew McConaughey Ryan Reynolds       Taraji P. Henson   
## 90 Levels: Aamir Khan Alexander Skarsgård Amy Adams ... Zoey Deutch

Metascore Problem

metascore_data_html <- html_nodes(webpage,'.metascore')

metascore_data <- html_text(metascore_data_html)

# head(metascore_data)
metascore_data<-gsub(" ","",metascore_data)

length(metascore_data)
## [1] 96

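The metascore vector has only 96 entries even though there are 100 movies, because 4 movies have no Metascore on the page. This must be fixed before building the data frame: a 96-element column cannot be combined with 100-element columns, and every score after the first gap would be attached to the wrong movie.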

Fixing the Issue

I manually identified the movies with no Metascore (movies 34, 64, 82, and 88) and inserted an NA placeholder at each of those positions.

for (i in c(34, 64, 82, 88)){

a<-metascore_data[1:(i-1)]

b<-metascore_data[i:length(metascore_data)]

metascore_data<-append(a,list("NA"))

metascore_data<-append(metascore_data,b)}


metascore_data<-as.numeric(metascore_data)
## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion
length(metascore_data)
## [1] 100

The length now shows 100 indicating that it was fixed.

summary(metascore_data)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   23.00   46.75   59.50   59.15   72.00   99.00       4

The summary reports 4 NA values, one for each movie without a Metascore, which confirms the fix.
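
The gap-filling loop can also be sketched with the `after` argument of base R's `append()`, which inserts a value at a given position in one call. The example below runs on a made-up vector: four scores for six movies, with the gaps assumed to be at positions 3 and 5.

```r
# Toy vector: 4 scraped scores for 6 movies; movies 3 and 5 have no score
# (assumed positions, for illustration only).
toy_scores <- c("40", "65", "59", "74")
toy_missing <- c(3, 5)  # final positions of the gaps, in ascending order

for (i in toy_missing) {
  # Insert a true NA (not the string "NA") so that position i becomes the gap.
  toy_scores <- append(toy_scores, NA, after = i - 1)
}

toy_scores <- as.numeric(toy_scores)
print(toy_scores)
```

Because a real NA is inserted rather than the string "NA", the conversion with as.numeric() has nothing to coerce and raises no warning. The positions must be listed in ascending order, since each insertion shifts the later elements by one.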

Similar Issue With the Gross Variable

gross_data_html <- html_nodes(webpage,'.ghost~ .text-muted+ span')

gross_data <- html_text(gross_data_html)

head(gross_data)
## [1] "$325.10M" "$102.47M" "$5.88M"   "$270.40M" "$363.07M" "$169.61M"
gross_data<-gsub("M","",gross_data)

gross_data<-substring(gross_data,2,6)

length(gross_data)
## [1] 89
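
Note that substring(gross_data, 2, 6) keeps only five characters, so a value such as "$102.47M" ends up as 102.4 (visible in the str() output of the final data frame). A possible alternative, sketched here on the first three values shown above, strips the "$" and "M" characters directly and keeps all digits; the `toy_` names are illustrative.

```r
# First three gross values from the output above.
toy_gross_text <- c("$325.10M", "$102.47M", "$5.88M")

# Inside a character class, "$" is a literal dollar sign,
# so this removes every "$" and "M" and nothing else.
toy_gross <- as.numeric(gsub("[$M]", "", toy_gross_text))
print(toy_gross)
```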

Fixing the Issue

Similarly, the Gross variable was missing data for 11 movies. So once again I manually identified the positions of those movies and inserted an NA placeholder at each one.

for (i in c(34, 50, 55, 60, 62, 63, 82, 87, 88, 89, 96)){

a<-gross_data[1:(i-1)]

b<-gross_data[i:length(gross_data)]

gross_data<-append(a,list("NA"))

gross_data<-append(gross_data,b)}


gross_data<-as.numeric(gross_data)
## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion
length(gross_data)
## [1] 100

The length is now 100, and the summary below reports the 11 NA values.

summary(gross_data)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.18   26.86   58.70   96.47  125.00  532.10      11
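
Both the Metascore and the Gross gaps arise because html_nodes() returns only the nodes that exist. One way to avoid looking up missing positions by hand is to select each movie's container first and then look for the field inside it: html_node() (singular) returns one result per container, with a missing value where the field is absent. The sketch below uses a made-up page with hypothetical class names `item`, `title`, and `score`, not IMDb's real markup.

```r
library(rvest)

# Hypothetical page: three items, the second one has no score.
toy_page <- read_html('<html><body>
  <div class="item"><span class="title">A</span><span class="score">40</span></div>
  <div class="item"><span class="title">B</span></div>
  <div class="item"><span class="title">C</span><span class="score">72</span></div>
</body></html>')

# One node per movie container.
toy_items <- html_nodes(toy_page, ".item")

# html_node() yields exactly one result per container; absent fields become NA.
toy_titles <- html_text(html_node(toy_items, ".title"))
toy_score <- as.numeric(html_text(html_node(toy_items, ".score")))
print(data.frame(Title = toy_titles, Score = toy_score))
```

With this approach every column has one entry per movie, so the vectors line up without manual patching.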

Combining the Lists to Create the Dataframe

Now that the data for each variable has been scraped and converted, I combined all of it into a single data frame. With this data frame, further analysis, such as creating plots with ggplot2, can be done with ease.

movies_df <- data.frame(Rank = rank_data, Title = title_data,
                        Description = description_data, Runtime = runtime_data,
                        Genre = genre_data, Rating = rating_data,
                        Metascore = metascore_data, Votes = votes_data,
                        Gross_Earning_in_Mil = gross_data,
                        Director = directors_data, Actor = actors_data)


str(movies_df)
## 'data.frame':    100 obs. of  11 variables:
##  $ Rank                : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title               : chr  "Suicide Squad" "The Conjuring 2" "Captain Fantastic" "Sing" ...
##  $ Description         : chr  "A secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive "| __truncated__ "Ed and Lorraine Warren travel to North London to help a single mother raising four children alone in a house pl"| __truncated__ "In the forests of the Pacific Northwest, a father devoted to raising his six kids with a rigorous physical and "| __truncated__ "In a city of humanoid animals, a hustling theater impresario's attempt to save his theater with a singing compe"| __truncated__ ...
##  $ Runtime             : num  123 134 118 108 108 127 107 117 132 115 ...
##  $ Genre               : Factor w/ 8 levels "Action","Adventure",..: 1 8 5 3 1 4 3 8 1 1 ...
##  $ Rating              : chr  "5.9" "7.3" "7.9" "7.1" ...
##  $ Metascore           : num  40 65 72 59 65 74 81 62 54 72 ...
##  $ Votes               : num  622787 239737 199912 138656 928635 ...
##  $ Gross_Earning_in_Mil: num  325.1 102.4 5.88 270.4 363 ...
##  $ Director            : Factor w/ 99 levels "Alex Proyas",..: 23 42 59 35 95 93 83 56 8 87 ...
##  $ Actor               : Factor w/ 90 levels "Aamir Khan","Alexander Skarsgård",..: 88 86 87 59 73 81 6 39 22 8 ...
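
Since data.frame() needs every column to have the same length, it can help to check the lengths before combining. The following is a minimal sketch on made-up vectors; `toy_columns` and its contents are hypothetical.

```r
# Made-up columns standing in for the scraped vectors.
toy_columns <- list(
  Rank = c(1, 2, 3),
  Title = c("Movie A", "Movie B", "Movie C"),
  Metascore = c(40, NA, 72)
)

# lengths() returns the length of each list element; they should all agree.
toy_lengths <- lengths(toy_columns)
stopifnot(length(unique(toy_lengths)) == 1)

# Safe to combine once the check passes.
toy_df <- as.data.frame(toy_columns, stringsAsFactors = FALSE)
str(toy_df)
```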

For example, the histogram below shows the distribution of runtimes, colored by genre:

qplot(data = movies_df, Runtime, fill = Genre, bins = 30)

Questions

Question 1: Based on the above data, which movie from which Genre had the longest runtime?

runtimeorder <- movies_df %>%
  arrange(desc(Runtime))

# runtimeorder
ggplot(movies_df, aes(x = Runtime, y = Rating)) +
geom_point(aes(size = Votes, col = Genre)) +
  ggtitle("Movie Runtimes and Their Ratings") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(
    x = "Runtime (in min)",
    y = "Movie Rating (out of 10.0)")

An adventure movie called American Honey had the longest runtime.
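
Besides reading the answer off the plot, the longest movie can be looked up directly by sorting. The sketch below uses a made-up three-row data frame with the same column names as movies_df; the titles and values are hypothetical.

```r
library(dplyr)

# Toy data with the same column names as movies_df (values are made up).
toy_movies <- data.frame(
  Title = c("Movie A", "Movie B", "Movie C"),
  Genre = c("Action", "Drama", "Adventure"),
  Runtime = c(123, 134, 163),
  stringsAsFactors = FALSE
)

# Sort by runtime, longest first, and keep the top row.
toy_longest <- toy_movies %>%
  arrange(desc(Runtime)) %>%
  head(1)
print(toy_longest)
```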

Question 2: Based on the above data, among movies with a runtime of 130 to 160 minutes, which genre has the highest number of votes?

ggplot(movies_df, aes(x = Runtime, y = Gross_Earning_in_Mil)) +
geom_point(aes(col = Genre)) +
  xlim(130, 160) +
  ggtitle("Movie Runtimes and Their Gross Earning") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(
    x = "Runtime (in min)",
    y = "Gross Earning (in mil)")
## Warning: Removed 86 rows containing missing values (geom_point).

An Action movie had the highest number of votes.
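
The plot above shows gross earnings rather than votes, so the vote counts can also be checked numerically. This is a sketch on made-up data with the same column names as movies_df; it keeps runtimes from 130 to 160 minutes and reports the highest vote count per genre.

```r
library(dplyr)

# Toy data (values are made up for illustration).
toy_movies <- data.frame(
  Genre = c("Action", "Action", "Drama", "Comedy"),
  Runtime = c(135, 150, 140, 110),
  Votes = c(500000, 200000, 300000, 900000),
  stringsAsFactors = FALSE
)

toy_votes <- toy_movies %>%
  filter(Runtime >= 130, Runtime <= 160) %>%  # keep the runtime window
  group_by(Genre) %>%
  summarise(Max_Votes = max(Votes)) %>%       # top vote count per genre
  arrange(desc(Max_Votes))
print(toy_votes)
```

In the toy data the Comedy row has the most votes overall but falls outside the runtime window, so it is excluded before the comparison.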

Question 3: Based on the above data, which genre has the highest average gross earnings among movies with a runtime of 100 to 120 minutes?

ggplot() +
geom_bar(data = movies_df, aes(x = Genre, y = Gross_Earning_in_Mil, fill = Genre), 
         position = "dodge", stat = "identity") +
  ggtitle("Movie Genres and Their Gross Earning") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(
    x = "Genres",
    y = "Gross Earning (in mil)")
## Warning: Removed 11 rows containing missing values (geom_bar).

The Animation genre had the highest Gross Earning.
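
The bar chart above covers all runtimes and does not average within a genre, so the question can also be answered with a grouped summary. The sketch below uses made-up data with the same column names as movies_df; it drops missing gross values, keeps runtimes from 100 to 120 minutes, and averages per genre.

```r
library(dplyr)

# Toy data (values are made up for illustration).
toy_movies <- data.frame(
  Genre = c("Animation", "Animation", "Action", "Action", "Drama"),
  Runtime = c(105, 118, 110, 140, 115),
  Gross_Earning_in_Mil = c(270.4, 100.0, 150.0, 400.0, NA),
  stringsAsFactors = FALSE
)

toy_avg_gross <- toy_movies %>%
  filter(!is.na(Gross_Earning_in_Mil)) %>%       # ignore movies without gross data
  filter(Runtime >= 100, Runtime <= 120) %>%     # keep the runtime window
  group_by(Genre) %>%
  summarise(Mean_Gross = mean(Gross_Earning_in_Mil)) %>%
  arrange(desc(Mean_Gross))
print(toy_avg_gross)
```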