Webscraping W5

Install necessary packages for this project

#install.packages('rvest')
#Loading the rvest package
library(rvest)

## Warning: package 'rvest' was built under R version 4.1.3

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.3

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.6     v purrr   0.3.4
## v tibble  3.1.7     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## Warning: package 'ggplot2' was built under R version 4.1.3

## Warning: package 'tibble' was built under R version 4.1.3

## Warning: package 'tidyr' was built under R version 4.1.3

## Warning: package 'readr' was built under R version 4.1.3

## Warning: package 'purrr' was built under R version 4.1.3

## Warning: package 'dplyr' was built under R version 4.1.3

## Warning: package 'stringr' was built under R version 4.1.3

## Warning: package 'forcats' was built under R version 4.1.3

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter()         masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag()            masks stats::lag()

library(plotly)

## Warning: package 'plotly' was built under R version 4.1.3

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(dplyr)

Scrape the IMDb website to create a data frame of information on the top 100 movies of 2016

Step one: Use the following IMDb URL for movies released in 2016

http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature

#Specifying the url for desired website to be scraped
url <- 'http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature'

#Reading the HTML code from the website
webpage <- read_html(url)

# Optional: save a local copy of the page
# xml2::write_html(webpage, "webpage.html")

Step two: Load various elements and clean data using gsub.

#Using CSS selectors to scrape the rankings section
rank_data_html <- html_nodes(webpage,'.text-primary')

#Converting the ranking data to text
rank_data <- html_text(rank_data_html)

#Data-Preprocessing: Converting rankings to numerical
rank_data<-as.numeric(rank_data)

#Let's have a look at the rankings and the length
head(rank_data)

## [1] 1 2 3 4 5 6

length(rank_data)

## [1] 100
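The select, extract, convert pattern used for the rankings can be tried offline. The sketch below is only an illustration, assuming rvest 1.0 or later; the HTML fragment is made up to imitate the `.text-primary` rank spans, and `html_elements()` is the newer name for `html_nodes()`.

```r
library(rvest)

# Made-up fragment that imitates the rank markup on the listing page
page <- minimal_html('
  <span class="text-primary">1.</span>
  <span class="text-primary">2.</span>
  <span class="text-primary">3.</span>
')

# Select the nodes, pull out their text, then convert "1." to 1
rank_text <- html_text2(html_elements(page, ".text-primary"))
ranks <- as.numeric(rank_text)

ranks
length(ranks)
```

Checking `length()` after every scrape, as done above, is what reveals a selector that missed some movies.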

Scrape for Title Information

#Using CSS selectors to scrape the title section
title_data_html <- html_nodes(webpage,'.lister-item-header a')

#Converting the title data to text
title_data <- html_text(title_data_html)

#Let's have a look at the title and the length
head(title_data)

## [1] "Doctor Strange"                         
## [2] "Rogue One: A Star Wars Story"           
## [3] "Suicide Squad"                          
## [4] "Fantastic Beasts and Where to Find Them"
## [5] "La La Land"                             
## [6] "Moana"

length(title_data)

## [1] 100

Scrape for Movie Description Information

#Using CSS selectors to scrape the description section
description_data_html <- html_nodes(webpage,'.ratings-bar+ .text-muted')

#Converting the description data to text
description_data <- html_text(description_data_html)

#Data-Preprocessing: removing '\n'
description_data<-gsub("\n","",description_data)

#Let's have another look at the description data and length
head(description_data)

## [1] "While on a journey of physical and spiritual healing, a brilliant neurosurgeon is drawn into the world of the mystic arts."                                                         
## [2] "In a time of conflict, a group of unlikely heroes band together on a mission to steal the plans to the Death Star, the Empire's ultimate weapon of destruction."                    
## [3] "A secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive task force. Their first mission: save the world from the apocalypse."
## [4] "The adventures of writer Newt Scamander in New York's secret community of witches and wizards seventy years before Harry Potter reads his book in school."                          
## [5] "While navigating their careers in Los Angeles, a pianist and an actress fall in love while attempting to reconcile their aspirations for the future."                               
## [6] "In Ancient Polynesia, when a terrible curse incurred by the Demigod Maui reaches Moana's island, she answers the Ocean's call to seek out the Demigod to set things right."

length(description_data)

## [1] 100

Scrape for Movie Run Times

#Using CSS selectors to scrape the Movie runtime section
runtime_data_html <- html_nodes(webpage,'.text-muted .runtime')

#Converting the runtime data to text
runtime_data <- html_text(runtime_data_html)

#Data-Preprocessing: removing mins and converting it to numerical
runtime_data<-gsub(" min","",runtime_data)
runtime_data<-as.numeric(runtime_data)

#Let's have a look at the runtime data and its length
head(runtime_data)

## [1] 115 133 123 132 128 107

length(runtime_data)

## [1] 100
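The runtime cleaning step can also be wrapped in a small function. This is a minimal sketch in base R; `parse_runtime` is a hypothetical helper name and the input strings are invented samples, not scraped values.

```r
# Hypothetical helper: keep only the digits of a runtime label, then convert
parse_runtime <- function(x) {
  as.numeric(gsub("[^0-9]", "", x))
}

# Invented sample labels, including one with a thousands separator
runtime_demo <- parse_runtime(c("115 min", "133 min", "1,020 min"))
runtime_demo
```

Dropping every non-digit character instead of the literal " min" also copes with labels that contain commas or extra whitespace.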

Scrape for Movie Genre Information

#Using CSS selectors to scrape the Movie genre section
genre_data_html <- html_nodes(webpage,'.genre')

#Converting the genre data to text
genre_data <- html_text(genre_data_html)

#Data-Preprocessing: removing \n
genre_data<-gsub("\n","",genre_data)

#Data-Preprocessing: removing excess spaces
genre_data<-gsub(" ","",genre_data)

#taking only the first genre of each movie
genre_data<-gsub(",.*","",genre_data)

#Converting each genre from text to factor
genre_data<-as.factor(genre_data)

#Let's have a look at the genre data
head(genre_data)

## [1] Action    Action    Action    Adventure Comedy    Animation
## Levels: Action Adventure Animation Biography Comedy Crime Drama Horror
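The genre clean-up can be checked on a few strings without scraping. The sketch below uses base R only; the sample strings are made up to resemble the raw genre text, and it trims whitespace at the ends with `trimws()` rather than deleting every space.

```r
# Made-up strings shaped like the scraped genre text
raw_genre <- c("\nAction, Adventure, Fantasy            ",
               "\nComedy, Drama, Music            ",
               "\nHorror            ")

# Trim surrounding whitespace, then drop everything from the first comma on
first_genre <- sub(",.*", "", trimws(raw_genre))

# Convert to a factor, as in the main script
genre_factor <- factor(first_genre)
genre_factor
```

Trimming only the ends means a genre name that contained an internal space would survive unchanged.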

Scrape for Movie Rating Information

These values change over time because IMDb updates the page regularly.

#Using CSS selectors to scrape the IMDB rating section
rating_data_html <- html_nodes(webpage,'.ratings-imdb-rating strong')

#Converting the ratings data to text
rating_data <- html_text(rating_data_html)

#Data-Preprocessing: converting ratings to numerical
rating_data<-as.numeric(rating_data)

#Let's have another look at the ratings data
head(rating_data)

## [1] 7.5 7.8 5.9 7.2 8.0 7.6

Scrape for Voting Information

#Using CSS selectors to scrape the votes section
votes_data_html <- html_nodes(webpage,'.sort-num_votes-visible span:nth-child(2)')

#Converting the votes data to text
votes_data <- html_text(votes_data_html)

#Data-Preprocessing: removing commas
votes_data<-gsub(",","",votes_data)

#Data-Preprocessing: converting votes to numerical
votes_data<-as.numeric(votes_data)

Scrape for Movie Director Information

#Using CSS selectors to scrape the directors section
directors_data_html <- html_nodes(webpage,'.text-muted+ p a:nth-child(1)')

#Converting the directors data to text
directors_data <- html_text(directors_data_html)

#Data-Preprocessing: converting directors data into factors
directors_data<-as.factor(directors_data)

Scrape for Movie Actor Information

#Using CSS selectors to scrape the actors section
actors_data_html <- html_nodes(webpage,'.lister-item-content .ghost+ a')

#Converting the actors data to text
actors_data <- html_text(actors_data_html)

#Data-Preprocessing: converting actors data into factors
actors_data<-as.factor(actors_data)
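Selectors applied to the whole page return only the nodes that exist, so a movie with a missing field shortens the vector and shifts the later entries. One possible safeguard, sketched here on made-up HTML and assuming rvest 1.0 or later, is to select each movie's container first and then call `html_element()` (singular), which returns one result per container and NA where nothing matches.

```r
library(rvest)

# Made-up listing with two movies; the second has no runtime span
page <- minimal_html('
  <div class="lister-item-content">
    <h3 class="lister-item-header"><a>Movie A</a></h3>
    <span class="runtime">115 min</span>
  </div>
  <div class="lister-item-content">
    <h3 class="lister-item-header"><a>Movie B</a></h3>
  </div>
')

# One node per movie
items <- html_elements(page, ".lister-item-content")

# html_element() keeps one entry per movie, so the columns stay aligned
titles  <- html_text2(html_element(items, ".lister-item-header a"))
runtime <- html_text2(html_element(items, ".runtime"))

data.frame(Title = titles, Runtime = runtime)
```

With this layout, directors or actors that are missing for one movie would appear as NA in that row instead of producing a 99-element vector.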

Find metascore data with missing values and replace with NAs

(This automated approach replaces the error-prone manual method given in the tutorial.)

# scrape the ratings bar and convert to text
ratings_bar_data <- html_nodes(webpage,'.ratings-bar') %>%
 html_text2()

# extract the digits in front of "Metascore" (one to three digits) and convert to number;
# movies without a Metascore become NA
metascore_data <- str_extract(ratings_bar_data, "\\d{1,3}(?= Metascore)") %>%
 as.numeric()

summary(metascore_data)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   22.00   44.50   62.00   59.28   72.00   99.00       5

length(metascore_data)

## [1] 100
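The regular-expression step can be tested on its own. The sketch below assumes the stringr package; the ratings-bar strings are invented, and the pattern uses a lookahead so that scores with one, two, or three digits are all captured.

```r
library(stringr)

# Invented ratings-bar strings: a normal score, a three-digit score,
# and a movie with no Metascore at all
ratings_bar <- c("7.5 Rate this 72 Metascore",
                 "8.0 Rate this 100 Metascore",
                 "6.1 Rate this")

# Capture the digits that sit directly before the word "Metascore"
metascore <- as.numeric(str_extract(ratings_bar, "\\d+(?=\\s*Metascore)"))
metascore
```

Because `str_extract()` returns NA where the pattern is absent, the result always has one entry per movie.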

Find the missing gross earnings

(This automated approach also replaces the tutorial method, which has issues.) Gross earnings are part of the votes bar in the HTML. Scrape the whole votes bar for every movie and extract the earnings with a regular expression, so that movies without earnings get an NA in the correct position.

# scrape the votes bar and convert to text
votes_bar_data <- html_nodes(webpage,'.sort-num_votes-visible') %>%
 html_text2()

gross_data <- str_extract(votes_bar_data, "\\$[0-9.,]+M") # extract the gross earnings, e.g. "$532.18M"
gross_data <- gsub("[$M,]","",gross_data) %>% # clean data: remove '$', 'M' and commas
 as.numeric()
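The extraction of gross earnings can be verified on sample text. This is a sketch assuming the stringr package; the votes-bar strings are invented, and a capture group is used so the full number is returned regardless of how many digits it has.

```r
library(stringr)

# Invented votes-bar strings; the third movie has no gross earnings
votes_bar <- c("Votes: 715,977 | Gross: $232.64M",
               "Votes: 612,749 | Gross: $532.18M",
               "Votes: 10,432")

# Column 2 of the str_match() result holds the first capture group
gross <- as.numeric(str_match(votes_bar, "\\$([0-9.]+)M")[, 2])
gross
```

Movies without a dollar amount yield NA, so the vector keeps one entry per movie.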

Combine all the lists to form a data frame

movies_df<-data.frame(Rank = rank_data, Title = title_data,
                      Description = description_data, Runtime = runtime_data,
                      Genre = genre_data, Rating = rating_data,
                      Director = directors_data, Actors = actors_data,
                      Metascore = metascore_data, Votes = votes_data,
                      Gross_Earning_in_Mil = gross_data)
# Director and Actors can stay in the data frame only while both vectors have 100 entries;
# if either scrape returns fewer (e.g. 99), data.frame() fails and they must be dropped.
#Structure of the data frame
str(movies_df)

## 'data.frame':    100 obs. of  11 variables:
##  $ Rank                : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title               : chr  "Doctor Strange" "Rogue One: A Star Wars Story" "Suicide Squad" "Fantastic Beasts and Where to Find Them" ...
##  $ Description         : chr  "While on a journey of physical and spiritual healing, a brilliant neurosurgeon is drawn into the world of the mystic arts." "In a time of conflict, a group of unlikely heroes band together on a mission to steal the plans to the Death St"| __truncated__ "A secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive "| __truncated__ "The adventures of writer Newt Scamander in New York's secret community of witches and wizards seventy years bef"| __truncated__ ...
##  $ Runtime             : num  115 133 123 132 128 107 108 139 108 147 ...
##  $ Genre               : Factor w/ 8 levels "Action","Adventure",..: 1 1 1 2 5 3 1 4 3 1 ...
##  $ Rating              : num  7.5 7.8 5.9 7.2 8 7.6 8 8.1 7.1 7.8 ...
##  $ Director            : Factor w/ 96 levels "Alessandro Carloni",..: 84 31 22 25 17 80 92 62 33 6 ...
##  $ Actors              : Factor w/ 90 levels "Adam Sandler",..: 7 32 89 24 74 5 75 4 58 17 ...
##  $ Metascore           : num  72 65 40 66 94 81 65 71 59 75 ...
##  $ Votes               : num  715977 612749 674939 466725 569293 ...
##  $ Gross_Earning_in_Mil: num  233 532 325 234 151 ...
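Since `data.frame()` stops when its columns differ in length, it can help to check the lengths first and report which column is short. The sketch below is base R only; `check_lengths` is a hypothetical helper and the vectors are toy stand-ins for the scraped columns.

```r
# Hypothetical helper: stop early if the supplied vectors differ in length
check_lengths <- function(...) {
  n <- lengths(list(...))
  if (length(unique(n)) > 1) {
    stop("Column lengths differ: ",
         paste(names(n), n, sep = " = ", collapse = ", "))
  }
  invisible(n)
}

# Toy vectors standing in for the scraped columns
rank_demo  <- 1:3
title_demo <- c("A", "B", "C")
actor_demo <- c("X", "Y")   # one entry short

# Passes silently because both vectors have three entries
check_lengths(Rank = rank_demo, Title = title_demo)

# Fails; the error message is captured instead of stopping the script
result <- tryCatch(
  check_lengths(Rank = rank_demo, Actors = actor_demo),
  error = function(e) conditionMessage(e)
)
result
```

The message names each column with its length, which points directly at the selector that needs attention.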

Question 1: Based on the above data, which movie from which genre had the longest runtime?

#select title, genre and runtime
movie1 <- select(movies_df, Title, Genre, Runtime)

# find the row with maximum runtime
movie1[which.max(movie1$Runtime),]

##                                                  Title  Genre Runtime
## 97 Batman v Superman: Dawn of Justice Ultimate Edition Action     182

Title = Batman v Superman: Dawn of Justice Ultimate Edition

Genre = Action

Run_time = 182
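An alternative to `which.max()` is `slice_max()`, available in dplyr 1.0.0 and later. The sketch below runs on a toy data frame with invented values rather than on `movies_df`.

```r
library(dplyr)

# Toy data with invented values
toy <- data.frame(
  Title   = c("Film A", "Film B", "Film C"),
  Genre   = c("Action", "Drama", "Comedy"),
  Runtime = c(120, 182, 95)
)

# Keep the row with the largest Runtime
longest <- toy %>%
  select(Title, Genre, Runtime) %>%
  slice_max(Runtime, n = 1)

longest
```

Unlike `which.max()`, which returns only the first maximum, `slice_max()` keeps tied rows by default.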

Question 2: Based on the above data, among movies with a runtime of 130 to 160 minutes, which movie from which genre has the most votes?

#Find the subset with the four columns: Title, Genre, Runtime, and Votes
movie2 <- select(movies_df, Title, Genre, Runtime, Votes)

#Filter all movies that have runtime between 130 and 160
movie2 <- filter(movie2, Runtime>=130 & Runtime <=160)

#Find the row with the maximum votes
movie2[which.max(movie2$Votes),]

##                        Title  Genre Runtime  Votes
## 4 Captain America: Civil War Action     147 761110

Title = Captain America: Civil War

Genre = Action

Run_time = 147

Votes = 761110

Question 3: Based on the above data, which genre has the highest average gross earnings among movies with a runtime of 100 to 120 minutes?

#Find the subset with the four columns: Title, Genre, Runtime, and gross
movie3 <- select(movies_df, Title, Genre, Runtime, Gross_Earning_in_Mil)

#Filter all movies that have run time between 100 and 120
movie3 <- filter(movie3, Runtime>=100 & Runtime <=120)

# find the mean of each genre
Genre_mean <- movie3 %>%
  group_by(Genre) %>%                                           #Grouping by Genre
  summarise_at(vars(Gross_Earning_in_Mil), mean, na.rm = TRUE)  #Specify column and function

#find the max
Genre_mean[which.max(Genre_mean$Gross_Earning_in_Mil),]

## # A tibble: 1 x 2
##   Genre     Gross_Earning_in_Mil
##   <fct>                    <dbl>
## 1 Animation                 216.
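The filter, group, and average steps can be written as one pipeline with `summarise()`. This is a sketch on a toy data frame with invented values; the column names `Gross` and `Mean_Gross` are placeholders, not columns of `movies_df`.

```r
library(dplyr)

# Toy data with invented values; one gross value is missing
toy <- data.frame(
  Genre   = c("Action", "Action", "Animation", "Animation", "Drama"),
  Runtime = c(110, 140, 105, 118, 101),
  Gross   = c(100, 900, 200, NA, 50)
)

genre_mean <- toy %>%
  filter(Runtime >= 100, Runtime <= 120) %>%          # keep runtimes of 100 to 120
  group_by(Genre) %>%                                 # one group per genre
  summarise(Mean_Gross = mean(Gross, na.rm = TRUE)) %>%  # average, ignoring NAs
  arrange(desc(Mean_Gross))                           # highest average first

genre_mean
```

Sorting the summary shows the full ranking of genres, not only the top one.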

Mais Alraee

2022-06-29

Answer to Question 3: Animation has the highest average gross earnings (about $216 million) among movies with a runtime of 100 to 120 minutes.

Thank you!