Webscraping Tutorial

Install necessary packages for this project

# Install the packages once if they are not already available
# install.packages('rvest')
# install.packages('tidyverse')
# install.packages('plotly')

# Load the rvest package
library(rvest)

## Warning: package 'rvest' was built under R version 4.2.1

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()         masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag()            masks stats::lag()

library(plotly)

## Warning: package 'plotly' was built under R version 4.2.1

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

Scrape the IMDb website to create a data frame of information on the top 100 movies of 2016

url <- 'http://www.imdb.com/search/title/?count=100&release_date=2016,2016&title_type=feature'
webpage <- read_html(url)
# xml2::write_html(webpage, file = 'webpage.html') # optional: save a local copy of the page

Load various elements and clean data using gsub.

Scrape for Movie Rank Information

Use length() to check that each vector contains 100 elements, with NAs standing in for any missing data so that the total is still 100.

#Using CSS selectors to scrape the rankings section
rank_data_html <- html_nodes(webpage,'.text-primary')

#Converting the ranking data to text
rank_data <- html_text(rank_data_html)

#Let's have a look at the rankings
head(rank_data)

## [1] "1." "2." "3." "4." "5." "6."

#Data-Preprocessing: Converting rankings to numerical
rank_data<-as.numeric(rank_data)

#Let's have another look at the rankings
head(rank_data)

## [1] 1 2 3 4 5 6

length(rank_data)

## [1] 100
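
The length check above can be wrapped in a small helper so that a mismatch stops the script early. This is a minimal sketch and not part of the original tutorial: `check_length()` is a hypothetical helper name, and the sample vectors stand in for the scraped data.

```r
# Hypothetical helper: stop with a clear message when a scraped vector
# does not have the expected number of elements.
check_length <- function(x, expected = 100) {
  if (length(x) != expected) {
    stop("expected ", expected, " elements but found ", length(x))
  }
  invisible(TRUE)
}

# Stand-in data: three ranks, one of them missing (an NA still counts as an element).
sample_ranks <- c(1, 2, NA)
check_length(sample_ranks, expected = 3)

# A vector that is one element short triggers the error; capture its message.
short_result <- tryCatch(
  check_length(c(1, 2), expected = 3),
  error = function(e) conditionMessage(e)
)
print(short_result)
```

Calling `check_length()` after each scrape step would flag a selector that silently skipped a movie before the vectors are combined.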

Scrape for Title Information

#Using CSS selectors to scrape the title section
title_data_html <- html_nodes(webpage,'.lister-item-header a')

#Converting the title data to text
title_data <- html_text(title_data_html)

#Let's have a look at the title
head(title_data)

## [1] "Doctor Strange"                         
## [2] "Rogue One: A Star Wars Story"           
## [3] "Suicide Squad"                          
## [4] "Fantastic Beasts and Where to Find Them"
## [5] "La La Land"                             
## [6] "Moana"

length(title_data)

## [1] 100
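
To see how a CSS selector such as '.lister-item-header a' picks out nodes, it can help to run it against a tiny page that needs no network access. The sketch below assumes invented markup that only loosely imitates the real page; the movie names are made up.

```r
library(rvest)

# Invented markup: two list entries, each with a rank and a linked title.
sample_page <- minimal_html('
  <div class="lister-item">
    <h3 class="lister-item-header">
      <span class="text-primary">1.</span> <a href="#">First Movie</a>
    </h3>
  </div>
  <div class="lister-item">
    <h3 class="lister-item-header">
      <span class="text-primary">2.</span> <a href="#">Second Movie</a>
    </h3>
  </div>')

# ".lister-item-header a" selects every <a> nested inside an element
# with class "lister-item-header".
sample_titles <- html_text(html_nodes(sample_page, ".lister-item-header a"))
print(sample_titles)
```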

Scrape for Movie Description Information

#Using CSS selectors to scrape the description section
description_data_html <- html_nodes(webpage,'.ratings-bar+ .text-muted')

#Converting the description data to text
description_data <- html_text(description_data_html)

#Let's have a look at the description data
head(description_data)

## [1] "\nWhile on a journey of physical and spiritual healing, a brilliant neurosurgeon is drawn into the world of the mystic arts."                                                         
## [2] "\nIn a time of conflict, a group of unlikely heroes band together on a mission to steal the plans to the Death Star, the Empire's ultimate weapon of destruction."                    
## [3] "\nA secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive task force. Their first mission: save the world from the apocalypse."
## [4] "\nThe adventures of writer Newt Scamander in New York's secret community of witches and wizards seventy years before Harry Potter reads his book in school."                          
## [5] "\nWhile navigating their careers in Los Angeles, a pianist and an actress fall in love while attempting to reconcile their aspirations for the future."                               
## [6] "\nIn Ancient Polynesia, when a terrible curse incurred by the Demigod Maui reaches Moana's island, she answers the Ocean's call to seek out the Demigod to set things right."

#Data-Preprocessing: removing '\n'
description_data<-gsub("\n","",description_data)

#Let's have another look at the description data
head(description_data)

## [1] "While on a journey of physical and spiritual healing, a brilliant neurosurgeon is drawn into the world of the mystic arts."                                                         
## [2] "In a time of conflict, a group of unlikely heroes band together on a mission to steal the plans to the Death Star, the Empire's ultimate weapon of destruction."                    
## [3] "A secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive task force. Their first mission: save the world from the apocalypse."
## [4] "The adventures of writer Newt Scamander in New York's secret community of witches and wizards seventy years before Harry Potter reads his book in school."                          
## [5] "While navigating their careers in Los Angeles, a pianist and an actress fall in love while attempting to reconcile their aspirations for the future."                               
## [6] "In Ancient Polynesia, when a terrible curse incurred by the Demigod Maui reaches Moana's island, she answers the Ocean's call to seek out the Demigod to set things right."

length(description_data)

## [1] 100

Scrape for Movie Run Times

#Using CSS selectors to scrape the Movie runtime section
runtime_data_html <- html_nodes(webpage,'.text-muted .runtime')

#Converting the runtime data to text
runtime_data <- html_text(runtime_data_html)

#Let's have a look at the runtime
head(runtime_data)

## [1] "115 min" "133 min" "123 min" "132 min" "128 min" "107 min"

#Data-Preprocessing: removing mins and converting it to numerical
runtime_data<-gsub(" min","",runtime_data)
runtime_data<-as.numeric(runtime_data)

#Let's have another look at the runtime data
head(runtime_data)

## [1] 115 133 123 132 128 107

length(runtime_data)

## [1] 100

Scrape for Movie Genre Information

#Using CSS selectors to scrape the Movie genre section
genre_data_html <- html_nodes(webpage,'.genre')

#Converting the genre data to text
genre_data <- html_text(genre_data_html)

#Let's have a look at the genre
head(genre_data)

## [1] "\nAction, Adventure, Fantasy            "  
## [2] "\nAction, Adventure, Sci-Fi            "   
## [3] "\nAction, Adventure, Fantasy            "  
## [4] "\nAdventure, Family, Fantasy            "  
## [5] "\nComedy, Drama, Music            "        
## [6] "\nAnimation, Adventure, Comedy            "

#Data-Preprocessing: removing \n
genre_data<-gsub("\n","",genre_data)

#Data-Preprocessing: removing excess spaces
genre_data<-gsub(" ","",genre_data)

#taking only the first genre of each movie
genre_data<-gsub(",.*","",genre_data)

#Converting each genre from text to factor
genre_data<-as.factor(genre_data)

#Let's have another look at the genre data
head(genre_data)

## [1] Action    Action    Action    Adventure Comedy    Animation
## Levels: Action Adventure Animation Biography Comedy Crime Drama Horror

length(genre_data)

## [1] 100
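
The same genre clean-up can be sketched with base R's trimws() and sub() in place of the gsub() chain. This is an alternative under stated assumptions, not the tutorial's method; the sample strings are copied from the raw output above.

```r
# Sample values copied from the raw genre output above.
raw_genres <- c("\nAction, Adventure, Fantasy            ",
                "\nComedy, Drama, Music            ")

# trimws() removes the leading newline and the trailing spaces in one step.
trimmed <- trimws(raw_genres)

# sub() drops everything from the first comma onward, keeping the first genre.
first_genre <- sub(",.*", "", trimmed)
print(first_genre)
```

Unlike gsub(" ", "", x), trimws() leaves any spaces inside a value untouched, which matters only if a first genre ever contains a space.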

Scrape for Movie Rating Information

These values change over time because the webpage is updated regularly.

#Using CSS selectors to scrape the IMDB rating section
rating_data_html <- html_nodes(webpage,'.ratings-imdb-rating strong')

#Converting the ratings data to text
rating_data <- html_text(rating_data_html)

#Let's have a look at the ratings
head(rating_data)

## [1] "7.5" "7.8" "5.9" "7.2" "8.0" "7.6"

#Data-Preprocessing: converting ratings to numerical
rating_data<-as.numeric(rating_data)

#Let's have another look at the ratings data
head(rating_data)

## [1] 7.5 7.8 5.9 7.2 8.0 7.6

length(rating_data)

## [1] 100

Scrape for Voting Information

#Using CSS selectors to scrape the votes section
votes_data_html <- html_nodes(webpage,'.sort-num_votes-visible span:nth-child(2)')

#Converting the votes data to text
votes_data <- html_text(votes_data_html)

#Let's have a look at the votes data
head(votes_data)

## [1] "716,649" "613,137" "675,128" "466,933" "569,578" "324,382"

#Data-Preprocessing: removing commas
votes_data<-gsub(",","",votes_data)

#Data-Preprocessing: converting votes to numerical
votes_data<-as.numeric(votes_data)

#Let's have another look at the votes data
head(votes_data)

## [1] 716649 613137 675128 466933 569578 324382

length(votes_data)

## [1] 100
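
For the runtime and votes columns, readr's parse_number() is one possible shortcut for the gsub() plus as.numeric() steps. The sketch below uses sample values copied from the outputs above and assumes the default locale, where "," is the grouping mark.

```r
library(readr)

# Sample values copied from the raw runtime and votes output above.
runtime_raw <- c("115 min", "133 min")
votes_raw <- c("716,649", "613,137")

# parse_number() drops the non-numeric text around the first number it finds
# and, with the default locale, treats "," as a grouping mark.
runtime_num <- parse_number(runtime_raw)
votes_num <- parse_number(votes_raw)
print(runtime_num)
print(votes_num)
```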

Scrape for Movie Director Information

#Using CSS selectors to scrape the directors section
directors_data_html <- html_nodes(webpage,'.text-muted+ p a:nth-child(1)')

#Converting the directors data to text
directors_data <- html_text(directors_data_html)

#Let's have a look at the directors data
head(directors_data)

## [1] "Scott Derrickson" "Gareth Edwards"   "David Ayer"       "David Yates"     
## [5] "Damien Chazelle"  "Ron Clements"

#Data-Preprocessing: converting directors data into factors
directors_data<-as.factor(directors_data)
length(directors_data)

## [1] 100

Scrape for Movie Actor Information

#Using CSS selectors to scrape the actors section
actors_data_html <- html_nodes(webpage,'.lister-item-content .ghost+ a')

#Converting the actors data to text
actors_data <- html_text(actors_data_html)

#Let's have a look at the actors data
head(actors_data)

## [1] "Benedict Cumberbatch" "Felicity Jones"       "Will Smith"          
## [4] "Eddie Redmayne"       "Ryan Gosling"         "Auli'i Cravalho"

#Data-Preprocessing: converting actors data into factors
actors_data<-as.factor(actors_data)
length(actors_data)

## [1] 100

Find metascore data with missing values and replace with NAs

(This automated method replaces the error-prone method given in the original tutorial.)

# scrape the ratings bar and convert to text
ratings_bar_data <- html_nodes(webpage,'.ratings-bar') %>%
  html_text2()

head(ratings_bar_data)

## [1] "7.5\nRate this\n 1 2 3 4 5 6 7 8 9 10 7.5/10 X \n72 Metascore"
## [2] "7.8\nRate this\n 1 2 3 4 5 6 7 8 9 10 7.8/10 X \n65 Metascore"
## [3] "5.9\nRate this\n 1 2 3 4 5 6 7 8 9 10 5.9/10 X \n40 Metascore"
## [4] "7.2\nRate this\n 1 2 3 4 5 6 7 8 9 10 7.2/10 X \n66 Metascore"
## [5] "8.0\nRate this\n 1 2 3 4 5 6 7 8 9 10 8/10 X \n94 Metascore"  
## [6] "7.6\nRate this\n 1 2 3 4 5 6 7 8 9 10 7.6/10 X \n81 Metascore"

# extract the digits that directly precede " Metascore" (NA where a movie has none)
metascore_data <- str_extract(ratings_bar_data, "\\d+(?= Metascore)") %>%
  as.numeric() # convert to number
length(metascore_data)

## [1] 100

metascore_data

##   [1] 72 65 40 66 94 81 65 71 59 75 84 81 57 62 51 74 41 72 78 51 42 42 88 67 73
##  [26] 52 47 79 54 32 64 68 48 66 44 76 71 42 70 37 60 99 81 57 NA 51 44 25 60 NA
##  [51] 42 78 52 32 96 33 59 NA 77 81 69 40 66 62 79 36 77 65 26 66 77 66 83 45 48
##  [76] 74 47 35 67 39 81 60 58 33 46 58 61 NA 65 68 28 60 62 23 65 34 NA 69 44 22

summary(metascore_data)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   22.00   44.50   62.00   59.28   72.00   99.00       5
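
Another way to keep missing values aligned is to select one parent node per movie and then look up a single child inside each parent. The sketch below is an alternative under stated assumptions, not the method used above: the markup and the class names `item` and `score` are invented for illustration.

```r
library(rvest)

# Invented markup: three movie entries, the second has no score.
sample_page <- minimal_html('
  <div class="item"><span class="score">72</span></div>
  <div class="item"><p>No score yet</p></div>
  <div class="item"><span class="score">94</span></div>')

# Select one parent node per movie first ...
items <- html_nodes(sample_page, ".item")

# ... then look up exactly one child per parent. html_node() (singular)
# returns a missing node where nothing matches, and html_text() turns
# that missing node into NA.
scores <- as.numeric(html_text(html_node(items, ".score")))
print(scores)
```

Because the result always has one element per parent node, its length matches the number of movies even when some scores are absent.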

Find the missing gross earnings

(This automated method also replaces the tutorial's method, which has issues.)

Earnings are part of the votes bar in the HTML. Scrape the votes bar and extract the earnings with a regular expression, so that a movie without a gross figure yields NA in the correct position.

# scrape the votes bar and convert to text
votes_bar_data <- html_nodes(webpage,'.sort-num_votes-visible') %>%
html_text2()

head(votes_bar_data) # look at the votes bar data

## [1] "Votes: 716,649 | Gross: $232.64M" "Votes: 613,137 | Gross: $532.18M"
## [3] "Votes: 675,128 | Gross: $325.10M" "Votes: 466,933 | Gross: $234.04M"
## [5] "Votes: 569,578 | Gross: $151.10M" "Votes: 324,382 | Gross: $248.76M"

# extract the figure between '$' and 'M' (NA where no gross is listed) and convert to a number
gross_data <- str_extract(votes_bar_data, "(?<=\\$)[0-9.]+(?=M)") %>%
  as.numeric()
length(gross_data)

## [1] 100
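
How the regular expression behaves on an entry without a gross figure can be checked on a couple of sample strings. In this sketch the first string is copied from the votes bar output above, and the second is a made-up entry with no "Gross" part.

```r
library(stringr)

# First value copied from the votes bar output above; the second is a
# made-up entry with no gross figure, to show how a gap is handled.
votes_bar_sample <- c("Votes: 716,649 | Gross: $232.64M",
                      "Votes: 47,724")

# Capture the digits and decimal point between "$" and "M";
# str_extract() returns NA where the pattern does not match.
gross_sample <- as.numeric(
  str_extract(votes_bar_sample, "(?<=\\$)[0-9.]+(?=M)")
)
print(gross_sample)
```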

Combine all the vectors to form a data frame

movies_df<-data.frame(Rank = rank_data, Title = title_data, Description = description_data,
Runtime = runtime_data, Genre = genre_data, Rating = rating_data,
Director = directors_data, Actors = actors_data,
Metascore = metascore_data, Votes = votes_data, Gross_Earning_in_Mil = gross_data)

# Director and Actors are included because both vectors have 100 observations in this run.
# data.frame() needs columns of equal length, so leave them out if a later scrape returns fewer.

#Structure of the data frame
str(movies_df)

## 'data.frame':    100 obs. of  11 variables:
##  $ Rank                : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title               : chr  "Doctor Strange" "Rogue One: A Star Wars Story" "Suicide Squad" "Fantastic Beasts and Where to Find Them" ...
##  $ Description         : chr  "While on a journey of physical and spiritual healing, a brilliant neurosurgeon is drawn into the world of the mystic arts." "In a time of conflict, a group of unlikely heroes band together on a mission to steal the plans to the Death St"| __truncated__ "A secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive "| __truncated__ "The adventures of writer Newt Scamander in New York's secret community of witches and wizards seventy years bef"| __truncated__ ...
##  $ Runtime             : num  115 133 123 132 128 107 108 139 108 147 ...
##  $ Genre               : Factor w/ 8 levels "Action","Adventure",..: 1 1 1 2 5 3 1 4 3 1 ...
##  $ Rating              : num  7.5 7.8 5.9 7.2 8 7.6 8 8.1 7.1 7.8 ...
##  $ Director            : Factor w/ 96 levels "Alessandro Carloni",..: 84 31 22 25 17 80 92 62 33 6 ...
##  $ Actors              : Factor w/ 90 levels "Adam Sandler",..: 7 32 89 24 74 5 75 4 58 17 ...
##  $ Metascore           : num  72 65 40 66 94 81 65 71 59 75 ...
##  $ Votes               : num  716649 613137 675128 466933 569578 ...
##  $ Gross_Earning_in_Mil: num  233 532 325 234 151 ...

Question 1: Based on the above data, which movie from which genre had the longest runtime?

You can add plotly to get more information on each bar segment. You will also need additional code to filter the data frame for the exact movie.

# p1 <- qplot(data = movies_df,Runtime,fill = Genre,bins = 30)

p1 <- movies_df %>%
  ggplot(aes(x = Runtime, fill = Genre)) +
  geom_histogram(position = "identity", alpha = 0.5, binwidth = 5, color = "white") +
  scale_fill_discrete(name = "Genre") +
  labs(title = "Top 100 Movies of 2016 Runtime by Genre")
ggplotly(p1)

Answer 1

movies_df %>%
  rownames_to_column(var = "Name") %>%
  filter(Runtime == max(Runtime))

##   Name Rank                                               Title
## 1   97   97 Batman v Superman: Dawn of Justice Ultimate Edition
##                                                                                                                                                                                                                               Description
## 1 Batman is manipulated by Lex Luthor to fear Superman. Superman´s existence is meanwhile dividing the world and he is framed for murder during an international crisis. The heroes clash and force the neutral Wonder Woman to reemerge.
##   Runtime  Genre Rating    Director    Actors Metascore Votes
## 1     182 Action    7.2 Zack Snyder Amy Adams        NA 47724
##   Gross_Earning_in_Mil
## 1                   NA

A1: Batman v Superman: Dawn of Justice Ultimate Edition, in the action genre, has the longest runtime.
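
The "filter to the maximum" step can also be written with dplyr's slice_max(). This is a minimal sketch on an invented stand-in for movies_df; the titles and numbers are made up.

```r
library(dplyr)

# Invented stand-in for movies_df with only the columns needed here.
toy_movies <- data.frame(
  Title = c("Movie A", "Movie B", "Movie C"),
  Genre = c("Action", "Drama", "Comedy"),
  Runtime = c(120, 182, 95)
)

# slice_max() keeps the row(s) with the largest value of the given column.
longest <- toy_movies %>% slice_max(Runtime, n = 1)
print(longest)
```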

Question 2: Based on the above data, among movies with a runtime of 130-160 minutes, which movie from which genre has the most votes?

Again, use the filter to get the exact movie which answers this question.

p2 <- movies_df %>%
  ggplot(aes(x = Runtime, y = Rating)) +
  # the text aesthetic is only used for ggplotly() tooltips, so ggplot2 warns about it
  geom_point(aes(size = Votes, col = Genre, text = paste("Movie Title:", Title)), alpha = 0.7) +
  labs(title = "Top 100 Movies of 2016 Runtime by Ratings")

## Warning: Ignoring unknown aesthetics: text

p2

Answer 2

movies_df %>%
  rownames_to_column(var = "Name") %>%
  filter(Runtime >= 130 & Runtime <= 160) %>%
  filter(Votes == max(Votes))

##   Name Rank                      Title
## 1   10   10 Captain America: Civil War
##                                                                                          Description
## 1 Political involvement in the Avengers' affairs causes a rift between Captain America and Iron Man.
##   Runtime  Genre Rating      Director      Actors Metascore  Votes
## 1     147 Action    7.8 Anthony Russo Chris Evans        75 761537
##   Gross_Earning_in_Mil
## 1                  408

A2: In the runtime 130 - 160 minutes category, Captain America: Civil War in the action genre had the highest number of votes.
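
The range filter for this question can be sketched with dplyr's between(), which includes both endpoints. The data frame below is an invented stand-in for movies_df with made-up values.

```r
library(dplyr)

# Invented stand-in for movies_df; "Movie A" has the most votes overall
# but falls outside the 130-160 minute window.
toy_movies <- data.frame(
  Title = c("Movie A", "Movie B", "Movie C", "Movie D"),
  Runtime = c(125, 135, 160, 147),
  Votes = c(900, 300, 500, 400)
)

# between(x, left, right) is shorthand for x >= left & x <= right,
# so runtimes of exactly 130 or 160 are kept.
most_voted <- toy_movies %>%
  filter(between(Runtime, 130, 160)) %>%
  slice_max(Votes, n = 1)
print(most_voted)
```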

Question 3: Based on the above data, which genre has the highest average gross earnings among movies with a runtime of 100 to 120 minutes?

p3 <- movies_df %>%
  ggplot(aes(x = Runtime, y = Gross_Earning_in_Mil)) +
  geom_point(aes(size = Rating, col = Genre), alpha = 0.5) +
  labs(title = "Top 100 Movies of 2016 Runtime by Gross Earnings in Millions") +
  scale_y_continuous("Gross Earnings in Millions", limits = c(-10, 600))
p3

## Warning: Removed 14 rows containing missing values (geom_point).

Answer 3

movies_df %>%
  rownames_to_column(var = "Name") %>%
  filter(Runtime >= 100 & Runtime <=120 & !is.na(Gross_Earning_in_Mil)) %>%
  group_by(Genre) %>%
  summarize(averageGross = mean(Gross_Earning_in_Mil)) %>%
  filter(averageGross == max(averageGross))

## # A tibble: 1 × 2
##   Genre     averageGross
##   <fct>            <dbl>
## 1 Animation         216.

A3: The animation genre has the highest average gross earnings among movies with a runtime of 100 to 120 minutes.

Tracee Matthias

2022-07-02