WebScraping

Load the necessary packages and data

#Loading the rvest package
library('rvest')

#Specifying the url for desired website to be scraped
url <- 'http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature'

#Reading the HTML code from the website
webpage <- read_html(url)

Selecting ranks of movies

#Using CSS selectors to scrape the rankings section
rank_data_html <- html_nodes(webpage,'.text-primary')

#Converting the ranking data to text
rank_data <- html_text(rank_data_html)

#Let's have a look at the rankings
head(rank_data)

## [1] "1." "2." "3." "4." "5." "6."

(Clean UP) Process Rank data

#Data-Preprocessing: Converting rankings to numerical
rank_data<-as.numeric(rank_data)

#Let's have another look at the rankings
head(rank_data)

## [1] 1 2 3 4 5 6

List the titles

#Using CSS selectors to scrape the title section
title_data_html <- html_nodes(webpage,'.lister-item-header a')

#Converting the title data to text
title_data <- html_text(title_data_html)

#Let's have a look at the title
head(title_data)

## [1] "Suicide Squad"     "The Conjuring 2"   "Captain Fantastic"
## [4] "Sing"              "Deadpool"          "Hidden Figures"

Scrape the movie description

#Using CSS selectors to scrape the description section
description_data_html <- html_nodes(webpage,'.ratings-bar+ .text-muted')

#Converting the description data to text
description_data <- html_text(description_data_html)

#Let's have a look at the description data
head(description_data)

## [1] "\nA secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive task force. Their first mission: save the world from the apocalypse."                                                             
## [2] "\nEd and Lorraine Warren travel to North London to help a single mother raising four children alone in a house plagued by a supernatural spirit."                                                                                                  
## [3] "\nIn the forests of the Pacific Northwest, a father devoted to raising his six kids with a rigorous physical and intellectual education is forced to leave his paradise and enter the world, challenging his idea of what it means to be a parent."
## [4] "\nIn a city of humanoid animals, a hustling theater impresario's attempt to save his theater with a singing competition becomes grander than he anticipates even as its finalists find that their lives will never be the same."                   
## [5] "\nA wisecracking mercenary gets experimented on and becomes immortal but ugly, and sets out to track down the man who ruined his looks."                                                                                                           
## [6] "\nThe story of a team of female African-American mathematicians who served a vital role in NASA during the early years of the U.S. space program."

#Data-Preprocessing: removing '\n'
description_data<-gsub("\n","",description_data)

#Let's have another look at the description data 
head(description_data)

## [1] "A secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive task force. Their first mission: save the world from the apocalypse."                                                             
## [2] "Ed and Lorraine Warren travel to North London to help a single mother raising four children alone in a house plagued by a supernatural spirit."                                                                                                  
## [3] "In the forests of the Pacific Northwest, a father devoted to raising his six kids with a rigorous physical and intellectual education is forced to leave his paradise and enter the world, challenging his idea of what it means to be a parent."
## [4] "In a city of humanoid animals, a hustling theater impresario's attempt to save his theater with a singing competition becomes grander than he anticipates even as its finalists find that their lives will never be the same."                   
## [5] "A wisecracking mercenary gets experimented on and becomes immortal but ugly, and sets out to track down the man who ruined his looks."                                                                                                           
## [6] "The story of a team of female African-American mathematicians who served a vital role in NASA during the early years of the U.S. space program."
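The gsub() call above removes only the newline character. As a small sketch of an alternative, base R's trimws() strips leading and trailing whitespace (newlines, tabs, and spaces) in one step. The second sample string below has trailing spaces added purely for illustration; they are not in the scraped data.

```r
# Sample strings shaped like the scraped descriptions.
# The trailing spaces on the second one are an illustrative assumption.
raw_description <- c(
  "\nEd and Lorraine Warren travel to North London to help a single mother raising four children alone in a house plagued by a supernatural spirit.",
  "\nA wisecracking mercenary gets experimented on and becomes immortal but ugly, and sets out to track down the man who ruined his looks.   "
)

# trimws() removes whitespace from both ends of each string
clean_description <- trimws(raw_description)

print(clean_description)
```

Unlike gsub("\n", "", x), this leaves any whitespace inside the text untouched and only trims the ends.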

Scrape the movie runtime

#Using CSS selectors to scrape the Movie runtime section
runtime_data_html <- html_nodes(webpage,'.text-muted .runtime')

#Converting the runtime data to text
runtime_data <- html_text(runtime_data_html)

#Let's have a look at the runtime
head(runtime_data)

## [1] "123 min" "134 min" "118 min" "108 min" "108 min" "127 min"

(Clean up) Remove the " min" suffix to get numeric data

#Data-Preprocessing: removing mins and converting it to numerical

runtime_data<-gsub(" min","",runtime_data)
runtime_data<-as.numeric(runtime_data)

#Let's have another look at the runtime data
head(runtime_data)

## [1] 123 134 118 108 108 127

Scrape the genre

#Using CSS selectors to scrape the Movie genre section
genre_data_html <- html_nodes(webpage,'.genre')

#Converting the genre data to text
genre_data <- html_text(genre_data_html)

#Let's have a look at the genre
head(genre_data)

## [1] "\nAction, Adventure, Fantasy            "
## [2] "\nHorror, Mystery, Thriller            " 
## [3] "\nComedy, Drama            "             
## [4] "\nAnimation, Comedy, Family            " 
## [5] "\nAction, Adventure, Comedy            " 
## [6] "\nBiography, Drama, History            "

(Clean up) Remove newlines and spaces, keep only the first genre

#Data-Preprocessing: removing \n
genre_data<-gsub("\n","",genre_data)

#Data-Preprocessing: removing excess spaces
genre_data<-gsub(" ","",genre_data)

#taking only the first genre of each movie
genre_data<-gsub(",.*","",genre_data)

#Converting each genre from text to factor
genre_data<-as.factor(genre_data)

#Let's have another look at the genre data
head(genre_data)

## [1] Action    Horror    Comedy    Animation Action    Biography
## Levels: Action Adventure Animation Biography Comedy Crime Drama Horror
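The clean-up above keeps only the first genre of each movie. As a sketch of a variant that keeps the full genre list available, the strings can be split on commas with base R and the first element taken afterwards. The names genre_list and first_genre are introduced here for illustration only.

```r
# Two raw genre strings as they appear in the scraped output above
raw_genre <- c("\nAction, Adventure, Fantasy            ",
               "\nComedy, Drama            ")

# Trim surrounding whitespace, then split on a comma plus optional spaces
genre_list <- strsplit(trimws(raw_genre), ",\\s*")

# Take the first genre of each movie
first_genre <- vapply(genre_list, function(g) g[1], character(1))

print(genre_list)
print(first_genre)
```

Keeping genre_list around makes it possible to analyze secondary genres later, which the gsub(",.*", "", x) approach discards.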

Scrape the ratings

#Using CSS selectors to scrape the IMDB rating section
rating_data_html <- html_nodes(webpage,'.ratings-imdb-rating strong')

#Converting the ratings data to text
rating_data <- html_text(rating_data_html)

#Let's have a look at the ratings
head(rating_data)

## [1] "5.9" "7.3" "7.9" "7.1" "8.0" "7.8"

(Clean up) Convert the ratings to numeric

#Data-Preprocessing: converting ratings to numerical
rating_data<-as.numeric(rating_data)

#Let's have another look at the ratings data
head(rating_data)

## [1] 5.9 7.3 7.9 7.1 8.0 7.8

Scrape the votes

#Using CSS selectors to scrape the votes section
votes_data_html <- html_nodes(webpage,'.sort-num_votes-visible span:nth-child(2)')

#Converting the votes data to text
votes_data <- html_text(votes_data_html)

#Let's have a look at the votes data
head(votes_data)

## [1] "622,768" "239,684" "199,896" "138,646" "928,600" "208,144"

(Clean up) Remove the commas

#Data-Preprocessing: removing commas
votes_data<-gsub(",","",votes_data)

#Data-Preprocessing: converting votes to numerical
votes_data<-as.numeric(votes_data)

#Let's have another look at the votes data
head(votes_data)

## [1] 622768 239684 199896 138646 928600 208144
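The rank, runtime, and votes clean-ups all follow the same pattern: strip the non-numeric characters, then call as.numeric(). A minimal sketch of a shared helper is shown below. The name to_number is hypothetical and is not part of rvest or base R.

```r
# Hypothetical helper: keep only digits and the decimal point, then convert.
# Assumes non-negative values with at most one decimal point.
to_number <- function(x) {
  as.numeric(gsub("[^0-9.]", "", x))
}

# Sample values taken from the outputs shown above
ranks   <- to_number(c("1.", "2."))
runtime <- to_number(c("123 min", "134 min"))
votes   <- to_number(c("622,768", "239,684"))

print(ranks)
print(runtime)
print(votes)
```

The helper would mangle negative numbers or strings containing several dots, so it is only suitable for fields shaped like the ones in this page.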

Scrape the directors

#Using CSS selectors to scrape the directors section
directors_data_html <- html_nodes(webpage,'.text-muted+ p a:nth-child(1)')

#Converting the directors data to text
directors_data <- html_text(directors_data_html)

#Let's have a look at the directors data
head(directors_data)

## [1] "David Ayer"     "James Wan"      "Matt Ross"      "Garth Jennings"
## [5] "Tim Miller"     "Theodore Melfi"

#Data-Preprocessing: converting directors data into factors
directors_data<-as.factor(directors_data)

Scrape the actors

#Using CSS selectors to scrape the actors section
actors_data_html <- html_nodes(webpage,'.lister-item-content .ghost+ a')

#Converting the actors data to text
actors_data <- html_text(actors_data_html)

#Let's have a look at the actors data
head(actors_data)

## [1] "Will Smith"          "Vera Farmiga"        "Viggo Mortensen"    
## [4] "Matthew McConaughey" "Ryan Reynolds"       "Taraji P. Henson"

#Data-Preprocessing: converting actors data into factors
actors_data <- as.factor(actors_data)

Scrape the metascore

#Using CSS selectors to scrape the metascore section
metascore_data_html <- html_nodes(webpage,'.metascore')

#Converting the metascore data to text
metascore_data <- html_text(metascore_data_html)

#Let's have a look at the metascore data
head(metascore_data)

## [1] "40        " "65        " "72        " "59        " "65        "
## [6] "74        "

Count the number of movies that have a metascore

#Data-Preprocessing: removing extra space in metascore
metascore_data<-gsub(" ","",metascore_data)

#Let's check the length of metascore data
length(metascore_data)

## [1] 96

Find metascore data with missing values and replace with NAs (this is an automated method)

library(knitr)
library(stringr)
ratings_bar_data <- html_nodes(webpage,'.ratings-bar') %>%      # scrape the ratings bar and convert to text
  html_text2()

head(ratings_bar_data)                                                 # look at the ratings bar

## [1] "5.9\nRate this\n 1 2 3 4 5 6 7 8 9 10 5.9/10 X \n40 Metascore"
## [2] "7.3\nRate this\n 1 2 3 4 5 6 7 8 9 10 7.3/10 X \n65 Metascore"
## [3] "7.9\nRate this\n 1 2 3 4 5 6 7 8 9 10 7.9/10 X \n72 Metascore"
## [4] "7.1\nRate this\n 1 2 3 4 5 6 7 8 9 10 7.1/10 X \n59 Metascore"
## [5] "8.0\nRate this\n 1 2 3 4 5 6 7 8 9 10 8/10 X \n65 Metascore"  
## [6] "7.8\nRate this\n 1 2 3 4 5 6 7 8 9 10 7.8/10 X \n74 Metascore"

metascore_data <- str_match(ratings_bar_data, "\\d{2} Metascore") %>%  # extract Metascore 
  str_match("\\d{2}") %>% 
  as.numeric()                                                         # convert to number  

length(metascore_data)

## [1] 100

metascore_data

##   [1] 40 65 72 59 65 74 81 62 54 72 67 81 75 71 94 70 78 51 44 41 84 72 65 68 25
##  [26] 79 71 51 66 51 48 52 99 NA 48 96 57 44 32 57 88 79 77 52 80 58 28 81 66 78
##  [51] 81 32 76 66 42 60 62 33 51 67 52 81 46 NA 69 23 77 58 58 47 49 23 59 36 46
##  [76] 60 78 42 39 55 49 NA 77 51 64 68 55 NA 65 72 74 35 26 40 42 66 34 36 33 55

Summary

#Let's look at summary statistics
summary(metascore_data)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   23.00   46.75   59.50   59.15   72.00   99.00       4
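The length mismatch (96 metascores against 100 movies) arises because a page-wide selector silently skips movies that lack the element. Another way to keep the vectors aligned, sketched here on a small inline page, is to select each movie container first and then look up one element per container with html_element(), which returns a missing node (converted to NA) when nothing matches. The markup and the `.lister-item` class below are simplified assumptions for illustration, not the actual IMDb page.

```r
library(rvest)

# Hypothetical, simplified markup: the second movie has no metascore
page <- minimal_html('
  <div class="lister-item"><h3>Movie A</h3><span class="metascore">40        </span></div>
  <div class="lister-item"><h3>Movie B</h3></div>
  <div class="lister-item"><h3>Movie C</h3><span class="metascore">72        </span></div>
')

# One node per movie
items <- html_elements(page, ".lister-item")

# html_element() (singular) returns exactly one result per item,
# so a movie without a metascore yields NA instead of being dropped
metascore <- items %>%
  html_element(".metascore") %>%
  html_text(trim = TRUE) %>%
  as.numeric()

print(metascore)
```

This gives the same aligned result as the regular-expression approach above without depending on the exact text of the ratings bar.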

Scrape the gross earnings in millions

#Using CSS selectors to scrape the gross revenue section
gross_data_html <- html_nodes(webpage,'.ghost~ .text-muted+ span')

#Converting the gross revenue data to text
gross_data <- html_text(gross_data_html)

#Let's have a look at the gross revenue data
head(gross_data)

## [1] "$325.10M" "$102.47M" "$5.88M"   "$270.40M" "$363.07M" "$169.61M"

Count the number of movies that have gross earnings data

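#Data-Preprocessing: removing '$' and 'M' signs
gross_data<-gsub("M","",gross_data)

#Note: substring(x,2,6) drops the leading '$' but keeps at most five characters, so "102.47" is truncated to "102.4"
gross_data<-substring(gross_data,2,6)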

#Let's check the length of gross data
length(gross_data)

## [1] 89

Find the missing gross earnings (automated)

# scrape the votes bar and convert to text
votes_bar_data <- html_nodes(webpage,'.sort-num_votes-visible') %>% 
  html_text2()

head(votes_bar_data)                                                 # look at the votes bar data

## [1] "Votes: 622,768 | Gross: $325.10M" "Votes: 239,684 | Gross: $102.47M"
## [3] "Votes: 199,896 | Gross: $5.88M"   "Votes: 138,646 | Gross: $270.40M"
## [5] "Votes: 928,600 | Gross: $363.07M" "Votes: 208,144 | Gross: $169.61M"

gross_data <- str_match(votes_bar_data, "\\$.+$")                    # extract the gross earnings (NA when there is no '$')

gross_data <- gsub("M","",gross_data)                                # clean data: remove 'M' sign 

gross_data <- substring(gross_data,2,6) %>%                          # clean data: drop '$'; keeps at most 5 characters, so "102.47" becomes "102.4"
  as.numeric()

length(gross_data)

## [1] 100

Summary

summary(gross_data)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.18   26.86   58.70   96.47  125.00  532.10      11
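Because substring(x, 2, 6) keeps at most five characters, values such as "102.47" lose their last digit. A sketch of a non-truncating alternative in base R captures the number between "$" and "M" with a regular expression and leaves NA where no gross figure is present. The third sample string is an invented example of a movie without gross data.

```r
# Sample strings shaped like the votes bar text above;
# the last one (no gross figure) is an illustrative assumption
votes_bar <- c("Votes: 622,768 | Gross: $325.10M",
               "Votes: 239,684 | Gross: $102.47M",
               "Votes: 12,345")

# Digits with an optional decimal part, between '$' and 'M'
pattern <- "\\$([0-9]+\\.?[0-9]*)M"

# Start with NA everywhere, then fill in the entries that match
has_gross <- grepl(pattern, votes_bar)
gross <- rep(NA_real_, length(votes_bar))
gross[has_gross] <- as.numeric(
  sub(paste0(".*", pattern, ".*"), "\\1", votes_bar[has_gross])
)

print(gross)
```

With this approach the summary statistics would differ slightly from the ones shown above, since no digits are dropped.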

Combine the variables into a data frame

#Combining all the vectors to form a data frame
movies_df <- data.frame(Rank = rank_data, Title = title_data,
                        Description = description_data, Runtime = runtime_data,
                        Genre = genre_data, Rating = rating_data,
                        Metascore = metascore_data, Votes = votes_data,
                        Gross_Earning_in_Mil = gross_data,
                        Director = directors_data, Actor = actors_data)

#Structure of the data frame

str(movies_df)

## 'data.frame':    100 obs. of  11 variables:
##  $ Rank                : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title               : chr  "Suicide Squad" "The Conjuring 2" "Captain Fantastic" "Sing" ...
##  $ Description         : chr  "A secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive "| __truncated__ "Ed and Lorraine Warren travel to North London to help a single mother raising four children alone in a house pl"| __truncated__ "In the forests of the Pacific Northwest, a father devoted to raising his six kids with a rigorous physical and "| __truncated__ "In a city of humanoid animals, a hustling theater impresario's attempt to save his theater with a singing compe"| __truncated__ ...
##  $ Runtime             : num  123 134 118 108 108 127 107 117 132 115 ...
##  $ Genre               : Factor w/ 8 levels "Action","Adventure",..: 1 8 5 3 1 4 3 8 1 1 ...
##  $ Rating              : num  5.9 7.3 7.9 7.1 8 7.8 7.6 7.3 6.9 7.5 ...
##  $ Metascore           : num  40 65 72 59 65 74 81 62 54 72 ...
##  $ Votes               : num  622768 239684 199896 138646 928600 ...
##  $ Gross_Earning_in_Mil: num  325.1 102.4 5.88 270.4 363 ...
##  $ Director            : Factor w/ 99 levels "Alex Proyas",..: 23 42 59 35 95 93 83 56 8 87 ...
##  $ Actor               : Factor w/ 90 levels "Aamir Khan","Alexander Skarsgård",..: 88 86 87 59 73 81 6 39 22 8 ...
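data.frame() stops with an error when its columns have different lengths, which is exactly what the 96-element metascore vector would have caused. A minimal sketch of a pre-flight check is shown below, using toy vectors in place of the scraped ones. The names columns and movies_toy are made up for this example.

```r
# Toy stand-ins for the scraped vectors (values are illustrative)
columns <- list(
  Rank      = c(1, 2, 3),
  Title     = c("Suicide Squad", "The Conjuring 2", "Captain Fantastic"),
  Metascore = c(40, NA, 72)
)

# Every column must have the same number of elements
col_lengths <- lengths(columns)
print(col_lengths)
stopifnot(length(unique(col_lengths)) == 1)

# Safe to combine once the lengths agree
movies_toy <- as.data.frame(columns, stringsAsFactors = FALSE)
str(movies_toy)
```

Printing the lengths first makes it easy to see which field is short when the check fails.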

Analyzing scraped data from the web

library('ggplot2')
qplot(data = movies_df, Runtime, fill = Genre, bins = 30)

Question 1: Based on the above data, which movie from which Genre had the longest runtime?

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Sort by the Runtime column of the data frame (not the external runtime_data vector)
top3 <- movies_df %>%
  arrange(desc(Runtime)) %>%
  head(3)
top3

##   Rank          Title
## 1   45 American Honey
## 2   42        Silence
## 3   64         Dangal
##                                                                                                                                                                                                          Description
## 1 A teenage girl with nothing to lose joins a traveling magazine sales crew, and gets caught up in a whirlwind of hard partying, law bending and young love as she criss-crosses the Midwest with a band of misfits.
## 2                                  In the 17th century, two Portuguese Jesuit priests travel to Japan in an attempt to locate their mentor, who is rumored to have committed apostasy, and to propagate Catholicism.
## 3                                                           Former wrestler Mahavir Singh Phogat and his two wrestler daughters struggle towards glory at the Commonwealth Games in the face of societal oppression.
##   Runtime     Genre Rating Metascore  Votes Gross_Earning_in_Mil
## 1     163 Adventure    7.0        80  39827                 0.66
## 2     161     Drama    7.2        79 103410                 7.10
## 3     161    Action    8.4        NA 165822                12.39
##          Director           Actor
## 1   Andrea Arnold      Sasha Lane
## 2 Martin Scorsese Andrew Garfield
## 3   Nitesh Tiwari      Aamir Khan

# Bar height shows the runtime; bar colour shows the genre
top3 %>%
  ggplot() +
  geom_col(aes(x = Title, y = Runtime, fill = Genre)) +
  ggtitle("Top 3 Runtime Movies")

Answer: The longest movie is American Honey, from the Adventure genre, with a runtime of 163 minutes.
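The same answer can be read off directly with base R, without sorting the whole data frame. The sketch below rebuilds the three rows shown in the output above as a toy data frame. The names top3_toy and longest are introduced for illustration.

```r
# The three longest movies from the output above, as a toy data frame
top3_toy <- data.frame(
  Title   = c("American Honey", "Silence", "Dangal"),
  Genre   = c("Adventure", "Drama", "Action"),
  Runtime = c(163, 161, 161),
  stringsAsFactors = FALSE
)

# which.max() returns the index of the first maximum
longest <- top3_toy[which.max(top3_toy$Runtime), ]

print(longest)
```

Note that which.max() reports only the first row in case of a tie, so ties such as Silence and Dangal at 161 minutes need arrange() or order() to be seen together.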

ggplot(movies_df,aes(x=Runtime,y=Rating))+
geom_point(aes(size=Votes,col=Genre))

Question 2: Based on the above data, in the Runtime of 130-160 mins, which genre has the highest votes?

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

highestvote <-
  ggplot(movies_df,aes(x=Runtime, y=Rating)) +
  scale_x_continuous(limits = c(130,160)) +
  geom_point(aes(size=Votes, col=Genre)) +
  labs(title = "Votes in Runtime 130 to 160 mins",
  x = "Runtime (minutes)", y = "Rating")
# Pass the plot explicitly instead of relying on the last plot created
ggplotly(highestvote)

Answer: Larger circles mean more votes. It shows Action genre has the highest votes in the Runtime of 130-160 mins.
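Reading circle sizes off a plot is approximate, so a numeric cross-check can help. The sketch below filters a toy data frame to the 130 to 160 minute range and totals the votes per genre with base R. All numbers are invented for illustration and are not the scraped values, and it assumes "highest votes" means the largest total per genre.

```r
# Toy data: values are made up for illustration only
toy <- data.frame(
  Genre   = c("Action", "Action", "Drama", "Comedy"),
  Runtime = c(135, 150, 140, 110),
  Votes   = c(500, 300, 400, 900),
  stringsAsFactors = FALSE
)

# Keep only movies whose runtime is between 130 and 160 minutes
in_range <- subset(toy, Runtime >= 130 & Runtime <= 160)

# Total votes per genre within that range
votes_by_genre <- aggregate(Votes ~ Genre, data = in_range, FUN = sum)

# Genre with the largest total
top_genre <- votes_by_genre$Genre[which.max(votes_by_genre$Votes)]

print(votes_by_genre)
print(top_genre)
```

Replacing FUN = sum with FUN = max would instead answer which genre contains the single most-voted movie.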

ggplot(movies_df,aes(x=Runtime, y=Gross_Earning_in_Mil))+
geom_point(aes(size=Rating,col=Genre))

## Warning: Removed 11 rows containing missing values (geom_point).

Question 3: Based on the above data, which genre has the highest average gross earnings among movies with a runtime of 100 to 120 minutes?

# Named avg_gross rather than mean, so the base function mean() is not shadowed
# Runtime is kept in summarize() so each movie can be plotted; the result has one row per movie, not one per genre
avg_gross <- movies_df %>%
  filter(Runtime >= 100 & Runtime <= 120) %>%
  group_by(Genre) %>%
  summarize(AverageGross = mean(Gross_Earning_in_Mil), Runtime) %>%
  arrange(desc(AverageGross))

## `summarise()` has grouped output by 'Genre'. You can override using the `.groups` argument.

avg_gross

## # A tibble: 48 x 3
## # Groups:   Genre [8]
##    Genre     AverageGross Runtime
##    <fct>            <dbl>   <dbl>
##  1 Animation        216.      108
##  2 Animation        216.      107
##  3 Animation        216.      108
##  4 Animation        216.      106
##  5 Adventure        185.      106
##  6 Adventure        185.      101
##  7 Action            78.4     108
##  8 Action            78.4     115
##  9 Action            78.4     116
## 10 Action            78.4     118
## # … with 38 more rows

grossearnings <-
  ggplot(avg_gross, aes(x=Runtime, y=AverageGross)) +
  scale_x_continuous(limits = c(100,120)) +
  geom_point(aes(size=AverageGross, col=Genre)) +
  labs(title = "Gross Earnings in Runtime 100 to 120 mins",
  x = "Runtime (minutes)", y = "Average Gross Earnings (Millions)")
# Pass the plot explicitly instead of relying on the last plot created
ggplotly(grossearnings)
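For a table with one row per genre, the averages can also be computed with base R's aggregate(), whose formula interface drops rows with missing gross earnings before averaging. The sketch below uses a toy data frame with invented numbers; the column name Gross is shortened for the example.

```r
# Toy data: values are made up for illustration only
toy <- data.frame(
  Genre   = c("Animation", "Animation", "Action", "Action", "Drama"),
  Runtime = c(108, 107, 115, 118, 125),
  Gross   = c(200, 100, 80, NA, 50),
  stringsAsFactors = FALSE
)

# Keep only movies with a runtime of 100 to 120 minutes
in_range <- toy[toy$Runtime >= 100 & toy$Runtime <= 120, ]

# Mean gross per genre; rows with NA gross are omitted by the formula interface
avg_by_genre <- aggregate(Gross ~ Genre, data = in_range, FUN = mean)

# Highest average first
avg_by_genre <- avg_by_genre[order(-avg_by_genre$Gross), ]

print(avg_by_genre)
```

The first row of the sorted table then names the genre with the highest average directly.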

Answer: It shows Animation genre has the highest average gross earnings in runtime 100 to 120 mins.

Soojin Kim

6/29/2021

Thank you :)