In this assignment, we practiced a technique known as web scraping. Essentially, it lets us convert data embedded in HTML tags on a web page into structured data that can be easily accessed and used. Specifically, we scraped data on the most popular feature films of 2016 from the IMDb website.
First, we need the necessary packages. Notice “rvest”, which is the package that allows us to scrape data from the web page.
# install.packages("rvest")
library(rvest)
library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.2 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag() masks stats::lag()
library(dplyr)
Now that the packages are loaded, we specify the URL and read the HTML code from the website.
url <- "http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature"
webpage <- read_html(url)
The website gives us lots of data, from movie rank, genre, Metascore, Actors, etc. We’ll need to scrape all of this. But let’s start with the rankings first.
First, we needed to find the specific CSS selector that identifies the rankings. We did this with the help of SelectorGadget and determined that it was “.text-primary”. Then we convert the data to text.
rank_data_html <- html_nodes(webpage, '.text-primary')
rank_data <- html_text(rank_data_html)
head(rank_data)
## [1] "1." "2." "3." "4." "5." "6."
Looking at the head data, we can see that the code worked. However, we still need to convert the data to numerics.
rank_data<-as.numeric(rank_data)
head(rank_data)
## [1] 1 2 3 4 5 6
Now that we’ve converted the data for ranks we need to do the same for the other variables. Next we’ll be doing the titles.
Once again, we find the CSS selector, change it into text, and observe the data again.
title_data_html <- html_nodes(webpage,'.lister-item-header a')
title_data <- html_text(title_data_html)
head(title_data)
## [1] "Suicide Squad" "The Conjuring 2" "Captain Fantastic"
## [4] "Sing" "Deadpool" "Hidden Figures"
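To make the selector step easier to follow without hitting the live site, here is a minimal sketch on a tiny HTML fragment. The fragment and the film names are invented for illustration; only the class names mirror the selectors used above, and `read_html()` is given the HTML as a literal string instead of a URL.

```r
library(rvest)

# Hypothetical page fragment, invented for illustration. It only mimics
# the class names targeted by the selectors above.
snippet <- '
<div class="lister-item-content">
  <h3 class="lister-item-header">
    <span class="text-primary">1.</span> <a href="#">Film A</a>
  </h3>
</div>
<div class="lister-item-content">
  <h3 class="lister-item-header">
    <span class="text-primary">2.</span> <a href="#">Film B</a>
  </h3>
</div>'

page <- read_html(snippet)

# Select nodes by CSS selector, extract their text, then convert ranks
ranks  <- as.numeric(html_text(html_nodes(page, ".text-primary")))
titles <- html_text(html_nodes(page, ".lister-item-header a"))

print(ranks)
print(titles)
```

Each selector returns one node per match, so the two vectors line up only as long as every movie has both elements.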
Let’s finish scraping for the other variables.
description_data_html <- html_nodes(webpage,'.ratings-bar+ .text-muted')
description_data <- html_text(description_data_html)
# head(description_data)
The descriptions contain the extraneous character “\n”, so it's best to get rid of it.
description_data<-gsub("\n","",description_data)
# head(description_data)
runtime_data_html <- html_nodes(webpage,'.text-muted .runtime')
runtime_data <- html_text(runtime_data_html)
# head(runtime_data)
# Removing " min"
runtime_data<-gsub(" min","",runtime_data)
runtime_data<-as.numeric(runtime_data)
head(runtime_data)
## [1] 123 134 118 108 108 127
genre_data_html <- html_nodes(webpage,'.genre')
genre_data <- html_text(genre_data_html)
# head(genre_data)
This one has a lot of extraneous characters, so let’s get rid of all of them.
# removing \n
genre_data<-gsub("\n","",genre_data)
# removing spaces
genre_data<-gsub(" ","",genre_data)
# only looking at the first genre of the movie
genre_data<-gsub(",.*","",genre_data)
# converting genres from texts to factors
genre_data<-as.factor(genre_data)
head(genre_data)
## [1] Action Horror Comedy Animation Action Biography
## Levels: Action Adventure Animation Biography Comedy Crime Drama Horror
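As a sketch of an alternative, the same genre cleanup can be done by trimming only the surrounding whitespace and then dropping everything after the first comma. The sample strings below are made up to resemble the raw scraped text. Unlike removing every space, `trimws()` would keep a space inside a multi-word genre name.

```r
# Made-up raw strings resembling the scraped genre text
raw_genre <- c("\nAction, Adventure, Fantasy            ",
               "\nHorror, Mystery, Thriller            ",
               "\nComedy            ")

# Trim leading/trailing whitespace (including "\n"), then keep the first genre
first_genre <- sub(",.*", "", trimws(raw_genre))

# Convert to a factor, as in the steps above
genre_factor <- factor(first_genre)

print(first_genre)
print(levels(genre_factor))
```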
rating_data_html <- html_nodes(webpage,'.ratings-imdb-rating strong')
rating_data <- html_text(rating_data_html)
head(rating_data)
## [1] "5.9" "7.3" "7.9" "7.1" "8.0" "7.8"
votes_data_html <- html_nodes(webpage,'.sort-num_votes-visible span:nth-child(2)')
votes_data <- html_text(votes_data_html)
# head(votes_data)
votes_data<-gsub(",","",votes_data)
votes_data<-as.numeric(votes_data)
head(votes_data)
## [1] 622787 239737 199912 138656 928635 208154
directors_data_html <- html_nodes(webpage,'.text-muted+ p a:nth-child(1)')
directors_data <- html_text(directors_data_html)
# head(directors_data)
directors_data<-as.factor(directors_data)
head(directors_data)
## [1] David Ayer James Wan Matt Ross Garth Jennings Tim Miller
## [6] Theodore Melfi
## 99 Levels: Alex Proyas Ana Lily Amirpour André Øvredal ... Zack Snyder
actors_data_html <- html_nodes(webpage,'.lister-item-content .ghost+ a')
actors_data <- html_text(actors_data_html)
# head(actors_data)
actors_data<-as.factor(actors_data)
head(actors_data)
## [1] Will Smith Vera Farmiga Viggo Mortensen
## [4] Matthew McConaughey Ryan Reynolds Taraji P. Henson
## 90 Levels: Aamir Khan Alexander Skarsgård Amy Adams ... Zoey Deutch
metascore_data_html <- html_nodes(webpage,'.metascore')
metascore_data <- html_text(metascore_data_html)
# head(metascore_data)
metascore_data<-gsub(" ","",metascore_data)
length(metascore_data)
## [1] 96
The length of the Metascore data is 96, even though there are 100 movies, because 4 of the movies have no Metascore. This needs to be fixed; otherwise every Metascore after a gap would line up with the wrong movie once the variables are combined.
I manually identified the movies with missing Metascores (movies 34, 64, 82, and 88) and inserted NA at those positions to fix the issue.
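for (i in c(34, 64, 82, 88)) {
  # split the vector just before position i
  a <- metascore_data[1:(i-1)]
  b <- metascore_data[i:length(metascore_data)]
  # put an "NA" placeholder between the two halves
  metascore_data <- append(a, list("NA"))
  metascore_data <- append(metascore_data, b)
}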
metascore_data<-as.numeric(metascore_data)
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
length(metascore_data)
## [1] 100
The length now shows 100 indicating that it was fixed.
summary(metascore_data)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 23.00 46.75 59.50 59.15 72.00 99.00 4
The summary confirms the fix: the 4 missing Metascores now appear as NA's.
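One way to avoid looking up the gaps by hand, sketched here under assumptions, is to select each movie's container first and then look for the Metascore inside each container with `html_node()` (singular). It returns one result per container, and the extracted text is NA where nothing matches. The HTML fragment and the `.t` class are invented for illustration; `.lister-item-content` and `.metascore` are the selectors already used above.

```r
library(rvest)

# Hypothetical fragment: the second movie has no Metascore element
snippet <- '
<div class="lister-item-content"><a class="t">Film A</a><span class="metascore">40        </span></div>
<div class="lister-item-content"><a class="t">Film B</a></div>
<div class="lister-item-content"><a class="t">Film C</a><span class="metascore">72        </span></div>'

page <- read_html(snippet)

# One node per movie
items <- html_nodes(page, ".lister-item-content")

# html_node() keeps one entry per movie; missing matches become NA text
metascore <- as.numeric(trimws(html_text(html_node(items, ".metascore"))))

print(metascore)
```

Because the result always has one entry per container, its length matches the number of movies without any manual insertion.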
gross_data_html <- html_nodes(webpage,'.ghost~ .text-muted+ span')
gross_data <- html_text(gross_data_html)
head(gross_data)
## [1] "$325.10M" "$102.47M" "$5.88M" "$270.40M" "$363.07M" "$169.61M"
gross_data<-gsub("M","",gross_data)
gross_data<-substring(gross_data,2,6)
length(gross_data)
## [1] 89
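Note that `substring(gross_data, 2, 6)` keeps at most five characters, so a value such as "102.47" is shortened to "102.4". A small sketch of a cleanup that strips only the "$" and "M" characters, and so keeps every digit, could look like this. The vector reuses three of the raw strings printed above.

```r
# Three of the raw strings shown in the head() output above
raw_gross <- c("$325.10M", "$102.47M", "$5.88M")

# Remove the dollar sign and the "M" suffix, then convert to numeric
gross_mil <- as.numeric(gsub("[$M]", "", raw_gross))

print(gross_mil)
```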
Similarly, the Gross variable was missing data for 11 movies. So once again I manually identified those movies and inserted NA at their positions.
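for (i in c(34, 50, 55, 60, 62, 63, 82, 87, 88, 89, 96)) {
  # split the vector just before position i
  a <- gross_data[1:(i-1)]
  b <- gross_data[i:length(gross_data)]
  # put an "NA" placeholder between the two halves
  gross_data <- append(a, list("NA"))
  gross_data <- append(gross_data, b)
}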
gross_data<-as.numeric(gross_data)
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
length(gross_data)
## [1] 100
The length now shows 100.
summary(gross_data)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.18 26.86 58.70 96.47 125.00 532.10 11
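Since the same insertion loop is used for both Metascore and Gross, it could be wrapped in a small helper. This is only a sketch: `insert_na` is a hypothetical name, and it uses the `after` argument of base R's `append()` with a real `NA` instead of the string "NA", so the later `as.numeric()` call should not raise coercion warnings.

```r
# Hypothetical helper: insert NA so that it ends up at each given position
insert_na <- function(x, positions) {
  for (i in sort(positions)) {
    # after = i - 1 places the NA at index i of the growing vector
    x <- append(x, NA, after = i - 1)
  }
  x
}

# Made-up scraped values with entries missing at positions 2 and 4
scraped <- c("40", "65", "72")
filled  <- insert_na(scraped, c(2, 4))

print(as.numeric(filled))
```

The positions must be given in terms of the final, complete vector, just like the indices in the loops above.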
Now that all the data has been scraped and converted for each variable, I combined it all into a data frame. With this data frame, further analysis, such as creating plots with ggplot2, can be done with ease.
movies_df <- data.frame(Rank = rank_data, Title = title_data,
Description = description_data, Runtime = runtime_data,
Genre = genre_data, Rating = rating_data,
Metascore = metascore_data, Votes = votes_data, Gross_Earning_in_Mil = gross_data,
Director = directors_data, Actor = actors_data)
str(movies_df)
## 'data.frame': 100 obs. of 11 variables:
## $ Rank : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Title : chr "Suicide Squad" "The Conjuring 2" "Captain Fantastic" "Sing" ...
## $ Description : chr "A secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive "| __truncated__ "Ed and Lorraine Warren travel to North London to help a single mother raising four children alone in a house pl"| __truncated__ "In the forests of the Pacific Northwest, a father devoted to raising his six kids with a rigorous physical and "| __truncated__ "In a city of humanoid animals, a hustling theater impresario's attempt to save his theater with a singing compe"| __truncated__ ...
## $ Runtime : num 123 134 118 108 108 127 107 117 132 115 ...
## $ Genre : Factor w/ 8 levels "Action","Adventure",..: 1 8 5 3 1 4 3 8 1 1 ...
## $ Rating : chr "5.9" "7.3" "7.9" "7.1" ...
## $ Metascore : num 40 65 72 59 65 74 81 62 54 72 ...
## $ Votes : num 622787 239737 199912 138656 928635 ...
## $ Gross_Earning_in_Mil: num 325.1 102.4 5.88 270.4 363 ...
## $ Director : Factor w/ 99 levels "Alex Proyas",..: 23 42 59 35 95 93 83 56 8 87 ...
## $ Actor : Factor w/ 90 levels "Aamir Khan","Alexander Skarsgård",..: 88 86 87 59 73 81 6 39 22 8 ...
For example, the histogram below shows the distribution of runtimes, colored by genre:
qplot(data = movies_df, Runtime, fill = Genre, bins = 30)
runtimeorder <- movies_df %>%
arrange(desc(Runtime))
# runtimeorder
# Rating was scraped as text, so convert it to numeric for a continuous y-axis
ggplot(movies_df, aes(x = Runtime, y = as.numeric(Rating))) +
geom_point(aes(size = Votes, col = Genre)) +
ggtitle("Movie Runtimes and Their Ratings") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(
x = "Runtime (in min)",
y = "Movie Rating (out of 10.0)")
An adventure movie called American Honey had the longest runtime.
ggplot(movies_df, aes(x = Runtime, y = Gross_Earning_in_Mil)) +
geom_point(aes(col = Genre)) +
xlim(130, 160) +
ggtitle("Movie Runtimes and Their Gross Earning") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(
x = "Runtime (in min)",
y = "Gross Earning (in mil)")
## Warning: Removed 86 rows containing missing values (geom_point).
An Action movie had the highest number of votes.
ggplot() +
geom_bar(data = movies_df, aes(x = Genre, y = Gross_Earning_in_Mil, fill = Genre),
position = "dodge", stat = "identity") +
ggtitle("Movie Genres and Their Gross Earning") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(
x = "Genres",
y = "Gross Earning (in mil)")
## Warning: Removed 11 rows containing missing values (geom_bar).
The Animation genre had the highest Gross Earning.
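To back up a comparison like this with numbers, one option is to aggregate gross earnings per genre before plotting. The sketch below uses a small made-up data frame, not the scraped data, with the same column names as `movies_df`. It computes both the total and the mean per genre, since the two can point to different genres.

```r
# Made-up data with the same column names as movies_df
toy <- data.frame(
  Genre = factor(c("Action", "Action", "Animation", "Drama")),
  Gross_Earning_in_Mil = c(100, 50, 120, NA)
)

# Total and mean gross per genre, ignoring missing values
total_by_genre <- tapply(toy$Gross_Earning_in_Mil, toy$Genre, sum, na.rm = TRUE)
mean_by_genre  <- tapply(toy$Gross_Earning_in_Mil, toy$Genre, mean, na.rm = TRUE)

# Genre with the highest total and the highest mean
top_total <- names(which.max(total_by_genre))
top_mean  <- names(which.max(mean_by_genre))

print(top_total)
print(top_mean)
```

In this toy data, Action has the highest total while Animation has the highest mean, so it helps to state which measure a claim about the "highest Gross Earning" refers to.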