Web Scraping

Scraping the IMDb website for the 100 most popular feature films released in 2016

#Install the necessary packages and load the rvest package

#install.packages('rvest')
library('rvest')
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 0.5.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()         masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag()            masks stats::lag()

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(dplyr)

Scrape the IMDb website to create a data frame of information on the top 100 movies of 2016

Use the following IMDb search URL for feature films released in 2016

https://www.imdb.com/search/title/?count=100&release_date=2016,2016&title_type=feature

#Specifying the url for desired website to be scraped
url <- 'http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature'

#Reading the HTML code from the website
webpage <- read_html(url)
# xml2::write_html(webpage, "webpage.html")  # optionally save a local copy of the page
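
Because the IMDb listing changes over time, rerunning `read_html(url)` may not reproduce the output shown in this report. One option is to keep a local copy of the page. The sketch below is a minimal illustration and not part of the original analysis: `read_html_cached` is a hypothetical helper, and the cache file is pre-filled with made-up HTML so that the example runs without network access.

```r
library(rvest)

# Hypothetical helper: download the page only if no local copy exists,
# then parse the local copy.
read_html_cached <- function(url, cache_file) {
  if (!file.exists(cache_file)) {
    download.file(url, cache_file, quiet = TRUE)
  }
  read_html(cache_file)
}

# Demonstration with a pre-filled cache file (made-up HTML, so no download happens)
cache_file <- tempfile(fileext = ".html")
writeLines("<html><body><p class='demo'>cached copy</p></body></html>", cache_file)

page <- read_html_cached("https://example.invalid/", cache_file)
cached_text <- html_text(html_nodes(page, ".demo"))
cached_text
```

With a real cache path, the first call would download the page and later calls would reuse the saved file.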

Load various elements and clean data using gsub

Scrape for Movie Rank Information

#Using CSS Selectors to scrape the rankings section
rank_data_html <- html_nodes(webpage,'.text-primary')

#Converting the ranking data to text
rank_data <- html_text(rank_data_html)

rank_data <- as.numeric(rank_data)
head(rank_data)

## [1] 1 2 3 4 5 6

length(rank_data)

## [1] 100

Scrape for title information

#Using CSS Selectors to scrape the title section
title_data_html <- html_nodes(webpage,'.lister-item-header a')

#Converting the title data to text
title_data <- html_text(title_data_html)

#Let's have a look at the title
head(title_data)

## [1] "The Magnificent Seven"        "Me Before You"               
## [3] "Rogue One: A Star Wars Story" "Hidden Figures"              
## [5] "Suicide Squad"                "Sing"

length(title_data)

## [1] 100

Scrape for Movie Description Information

#Using CSS Selectors to scrape the description section
description_data_html <- html_nodes(webpage, '.ratings-bar+ .text-muted')

#Converting the description data to text
description_data <- html_text(description_data_html)

#Data-Preprocessing: removing '\n'
description_data<-gsub("\n","",description_data)


#Let's have a look at the description data
head(description_data)

## [1] "Seven gunmen from a variety of backgrounds are brought together by a vengeful young widow to protect her town from the private army of a destructive industrialist."                                                          
## [2] "A girl in a small town forms an unlikely bond with a recently-paralyzed man she's taking care of."                                                                                                                            
## [3] "In a time of conflict, a group of unlikely heroes band together on a mission to steal the plans to the Death Star, the Empire's ultimate weapon of destruction."                                                              
## [4] "The story of a team of female African-American mathematicians who served a vital role in NASA during the early years of the U.S. space program."                                                                              
## [5] "A secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive task force. Their first mission: save the world from the apocalypse."                                          
## [6] "In a city of humanoid animals, a hustling theater impresario's attempt to save his theater with a singing competition becomes grander than he anticipates even as its finalists find that their lives will never be the same."

length(description_data)

## [1] 100

Scrape for movie run times

#Using CSS selectors to scrape the runtime section
runtime_data_html <- html_nodes(webpage, '.text-muted .runtime')

#Converting the runtime data to text
runtime_data <- html_text(runtime_data_html)

#Data-Preprocessing: removing mins and converting it to numerical
runtime_data<-gsub(" min","",runtime_data)
runtime_data<-as.numeric(runtime_data)

#Let's have a look at the runtime data
head(runtime_data)

## [1] 132 106 133 127 123 108

Scrape for movie genre

#Using CSS selectors to scrape the Movie genre section
genre_data_html <- html_nodes(webpage,'.genre')

#Converting the genre data to text
genre_data <- html_text(genre_data_html)

#Data-Preprocessing: removing \n
genre_data<-gsub("\n","",genre_data)

#Data-Preprocessing: removing excess spaces
genre_data<-gsub(" ","",genre_data)

#taking only the first genre of each movie
genre_data<-gsub(",.*","",genre_data)

#Converting each genre from text to factor
genre_data<-as.factor(genre_data)

#Let's have a look at the genre data
head(genre_data)

## [1] Action    Drama     Action    Biography Action    Animation
## Levels: Action Adventure Animation Biography Comedy Crime Drama Horror
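
The three `gsub` calls above can also be collapsed into two steps. The following is a minimal sketch on made-up strings shaped like the raw `.genre` text; it assumes the first genre is always followed by a comma or by trailing whitespace.

```r
# Made-up strings in the shape html_text() returns for '.genre'
genre_sample <- c("\nAction, Adventure, Western            ",
                  "\nDrama            ")

# Drop everything from the first comma onward, then trim surrounding whitespace
first_genre <- trimws(sub(",.*", "", genre_sample))
first_genre
```

Unlike removing every space, `trimws` only touches leading and trailing whitespace, so a genre name containing a space would survive intact.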

Scrape for Movie Rating Information

#Using CSS selectors to scrape the IMDB rating section
rating_data_html <- html_nodes(webpage,'.ratings-imdb-rating strong')

#Converting the ratings data to text
rating_data <- html_text(rating_data_html)

#Data-Preprocessing: converting ratings to numerical
rating_data<-as.numeric(rating_data)

#Let's have a look at the ratings
head(rating_data)

## [1] 6.8 7.4 7.8 7.8 5.9 7.1

Scrape for Voting Information Section

#Using CSS selectors to scrape the votes section
votes_data_html <- html_nodes(webpage,'.sort-num_votes-visible span:nth-child(2)')

#Converting the votes data to text
votes_data <- html_text(votes_data_html)

#Data-Preprocessing: removing commas
votes_data<-gsub(",","",votes_data)

#Data-Preprocessing: converting votes to numerical
votes_data<-as.numeric(votes_data)

#Let's have a look at the votes data
head(votes_data)

## [1] 217113 263258 651949 238276 695463 176639

length(votes_data)

## [1] 100
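
The rank, runtime, rating, and votes columns all follow the same pattern: strip the non-numeric characters, then call `as.numeric`. As a rough sketch, a single helper could cover these cases. The name `to_number` is hypothetical and the inputs are made-up samples in the formats seen on the page.

```r
# Hypothetical helper: keep only digits and the decimal point, then convert
to_number <- function(x) as.numeric(gsub("[^0-9.]", "", x))

# Made-up inputs: a rank, a runtime, a vote count, and a rating
parsed <- to_number(c("1.", "132 min", "217,113", "6.8"))
parsed
```

This assumes each string holds exactly one number; text with several numbers or several decimal points would need a stricter pattern.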

Scrape for Movie Director Information

#Using CSS selectors to scrape the directors section
directors_data_html <- html_nodes(webpage,'.text-muted+ p a:nth-child(1)')

#Converting the directors data to text
directors_data <- html_text(directors_data_html)

#Let's have a look at the directors data
head(directors_data)

## [1] "Antoine Fuqua"  "Thea Sharrock"  "Gareth Edwards" "Theodore Melfi"
## [5] "David Ayer"     "Garth Jennings"

#Data-Preprocessing: converting directors data into factors
directors_data<-as.factor(directors_data)
length(directors_data)

## [1] 100

Scrape for Movie Actor Information

#Using CSS selectors to scrape the actors section
actors_data_html <- html_nodes(webpage,'.lister-item-content .ghost+ a')

#Converting the actors data to text
actors_data <- html_text(actors_data_html)

#Data-Preprocessing: converting actors data into factors
actors_data<-as.factor(actors_data)

#Let's have a look at the actors data
head(actors_data)

## [1] Denzel Washington   Emilia Clarke       Felicity Jones     
## [4] Taraji P. Henson    Will Smith          Matthew McConaughey
## 92 Levels: Adam Sandler Alexander Skarsgård Amy Adams ... Zoey Deutch

length(actors_data)

## [1] 100

Find the Metascore data and fill missing values with NA

#Scrape the ratings bar and convert it to text
ratings_bar_data <- html_nodes(webpage,'.ratings-bar') %>%
  html_text2()

head(ratings_bar_data)

## [1] "6.8\nRate this\n 1 2 3 4 5 6 7 8 9 10 6.8/10 X \n54 Metascore"
## [2] "7.4\nRate this\n 1 2 3 4 5 6 7 8 9 10 7.4/10 X \n51 Metascore"
## [3] "7.8\nRate this\n 1 2 3 4 5 6 7 8 9 10 7.8/10 X \n65 Metascore"
## [4] "7.8\nRate this\n 1 2 3 4 5 6 7 8 9 10 7.8/10 X \n74 Metascore"
## [5] "5.9\nRate this\n 1 2 3 4 5 6 7 8 9 10 5.9/10 X \n40 Metascore"
## [6] "7.1\nRate this\n 1 2 3 4 5 6 7 8 9 10 7.1/10 X \n59 Metascore"

#Pull the "NN Metascore" fragment from each ratings bar (NA when it is absent),
#then extract the two-digit score and convert it to numeric
metascore_data <- str_match(ratings_bar_data, "\\d{2} Metascore") %>%
  str_match("\\d{2}") %>%
  as.numeric()
length(metascore_data)

## [1] 100

metascore_data

##   [1] 54 51 65 74 40 59 94 65 71 81 81 78 84 79 62 66 70 56 NA 68 67 25 73 52 96
##  [26] 44 64 55 99 76 88 44 75 36 41 47 51 72 65 57 69 48 66 32 81 72 74 51 65 66
##  [51] 77 NA 71 42 81 33 58 65 48 57 67 62 79 80 32 42 46 21 NA 79 52 45 48 42 77
##  [76] 77 34 73 33 46 60 NA 78 61 76 66 40 58 23 44 59 22 60 58 35 39 60 34 81 49

summary(metascore_data)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   21.00   46.00   60.50   59.57   73.25   99.00       4
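
Another way to keep missing values aligned with their movies is to select each list entry first and then look inside it with `html_element()` (singular), which returns one result per entry and NA where the element is absent. The sketch below runs on made-up HTML; the class names `lister-item` and `metascore` are assumptions for illustration and may not match the live page.

```r
library(rvest)

# Made-up HTML that mimics two list entries; the second has no Metascore
page <- minimal_html('
  <div class="lister-item">
    <h3 class="lister-item-header"><a>Film A</a></h3>
    <span class="metascore">54</span>
  </div>
  <div class="lister-item">
    <h3 class="lister-item-header"><a>Film B</a></h3>
  </div>
')

# One node per movie, then one value (or NA) per node
items <- html_elements(page, ".lister-item")
titles <- html_text2(html_element(items, ".lister-item-header a"))
metascores <- as.numeric(html_text2(html_element(items, ".metascore")))

data.frame(Title = titles, Metascore = metascores)
```

Because every column is derived from the same `items` node set, all vectors have one element per movie by construction.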

Find the missing gross earnings

# scrape the votes bar and convert to text
votes_bar_data <- html_nodes(webpage,'.sort-num_votes-visible') %>%
  html_text2()
head(votes_bar_data)

## [1] "Votes: 217,113 | Gross: $93.43M"  "Votes: 263,258 | Gross: $56.25M" 
## [3] "Votes: 651,949 | Gross: $532.18M" "Votes: 238,276 | Gross: $169.61M"
## [5] "Votes: 695,463 | Gross: $325.10M" "Votes: 176,639 | Gross: $270.40M"

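#Extract the gross earnings from the votes bar (NA when no gross is listed)
gross_data <- str_match(votes_bar_data, "\\$.+$")

#Clean data: remove the 'M' suffix
gross_data <- gsub("M", "",gross_data)

#Keep characters 2 to 6, which drops the leading '$', then convert to numeric
#(values of $100M or more lose their second decimal place, e.g. 532.18 becomes 532.1)
gross_data <- substring(gross_data, 2,6) %>%
  as.numeric()
length(gross_data)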

## [1] 100
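
Note that `substring(gross_data, 2, 6)` keeps at most five characters, so a three-digit gross such as `$532.18M` is cut to 532.1. A capture group avoids the fixed width. The following is a minimal sketch on sample strings in the same shape as `votes_bar_data`; the last entry, with no gross figure, is made up.

```r
library(stringr)

# Sample strings shaped like votes_bar_data; the last one has no gross figure
votes_bar_sample <- c(
  "Votes: 651,949 | Gross: $532.18M",
  "Votes: 217,113 | Gross: $93.43M",
  "Votes: 12,345"
)

# Column 2 of str_match() holds the capture group: the digits between "$" and "M".
# Rows without a match yield NA.
gross_sample <- as.numeric(str_match(votes_bar_sample, "\\$([0-9.]+)M")[, 2])
gross_sample
```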

Combine all the vectors to form a data frame

movies_df<-data.frame(Rank = rank_data, Title = title_data,
Description = description_data, Runtime = runtime_data,
Genre = genre_data, Rating = rating_data,
Metascore = metascore_data, Votes = votes_data,
Gross_Earning_in_Mil = gross_data,
Director = directors_data, Actor = actors_data)

#Structure of the data frame
str(movies_df)

## 'data.frame':    100 obs. of  11 variables:
##  $ Rank                : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title               : chr  "The Magnificent Seven" "Me Before You" "Rogue One: A Star Wars Story" "Hidden Figures" ...
##  $ Description         : chr  "Seven gunmen from a variety of backgrounds are brought together by a vengeful young widow to protect her town f"| __truncated__ "A girl in a small town forms an unlikely bond with a recently-paralyzed man she's taking care of." "In a time of conflict, a group of unlikely heroes band together on a mission to steal the plans to the Death St"| __truncated__ "The story of a team of female African-American mathematicians who served a vital role in NASA during the early "| __truncated__ ...
##  $ Runtime             : num  132 106 133 127 123 108 128 108 139 116 ...
##  $ Genre               : Factor w/ 8 levels "Action","Adventure",..: 1 7 1 4 1 3 5 1 4 7 ...
##  $ Rating              : num  6.8 7.4 7.8 7.8 5.9 7.1 8 8 8.1 7.9 ...
##  $ Metascore           : num  54 51 65 74 40 59 94 65 71 81 ...
##  $ Votes               : num  217113 263258 651949 238276 695463 ...
##  $ Gross_Earning_in_Mil: num  93.4 56.2 532.1 169.6 325.1 ...
##  $ Director            : Factor w/ 99 levels "Aisling Walsh",..: 11 91 34 92 26 36 20 94 61 30 ...
##  $ Actor               : Factor w/ 92 levels "Adam Sandler",..: 19 25 30 85 91 59 74 75 4 3 ...
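
`data.frame()` stops with an error when its columns differ in length, which happens if a selector misses some movies. A quick check before combining can point to the offending column. This is a sketch on toy vectors rather than the scraped data; `columns` and `expected_n` are illustrative names.

```r
# Toy vectors standing in for the scraped columns; 'runtime' is one element short
columns <- list(
  rank    = 1:3,
  title   = c("Film A", "Film B", "Film C"),
  runtime = c(132, 106)
)

# Names of the columns whose length differs from the expected row count
expected_n <- 3
mismatched <- names(columns)[lengths(columns) != expected_n]
mismatched
```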

Three graphs from the tutorial

library('ggplot2')

qplot(data = movies_df,Runtime,fill = Genre,bins = 30)

## Warning: `qplot()` was deprecated in ggplot2 3.4.0.

ggplot(movies_df,aes(x=Runtime,y=Rating))+
geom_point(aes(size=Rating,col=Genre))

ggplot(movies_df,aes(x=Runtime,y=Gross_Earning_in_Mil))+
geom_point(aes(size=Rating,col=Genre))

## Warning: Removed 11 rows containing missing values (`geom_point()`).
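
Since `qplot()` is deprecated, the first plot can be written with `ggplot()` and `geom_histogram()` instead. The sketch below uses a toy data frame built from the first six runtimes and genres shown earlier, so it runs without the scraped `movies_df`.

```r
library(ggplot2)

# Toy data: the first six runtimes and genres printed above
toy_df <- data.frame(
  Runtime = c(132, 106, 133, 127, 123, 108),
  Genre = factor(c("Action", "Drama", "Action", "Biography", "Action", "Animation"))
)

# Histogram of runtimes, with bars filled by genre
p <- ggplot(toy_df, aes(x = Runtime, fill = Genre)) +
  geom_histogram(bins = 30)
p
```

Replacing `toy_df` with `movies_df` should give the equivalent of the `qplot()` call without the deprecation warning.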

Question 1: Based on the above data, which movie from which Genre had the longest runtime?

longest <- select(movies_df, Title, Genre, Runtime) %>%
  arrange(desc(Runtime))
head(longest)

##                                Title     Genre Runtime
## 1                     American Honey Adventure     163
## 2                            Silence     Drama     161
## 3                        The Wailing     Drama     156
## 4 Batman v Superman: Dawn of Justice    Action     151
## 5                          Brimstone     Drama     148
## 6         Captain America: Civil War    Action     147

Answer: Based on the data, American Honey from the genre Adventure has the longest runtime of 163 minutes.
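
The same question can be answered without sorting the whole table by using `slice_max()` from dplyr. This is a minimal sketch on a toy data frame holding the top three rows printed above.

```r
library(dplyr)

# Toy data: the three longest films from the output above
toy_df <- data.frame(
  Title = c("American Honey", "Silence", "The Wailing"),
  Genre = c("Adventure", "Drama", "Drama"),
  Runtime = c(163, 161, 156)
)

# Keep only the row(s) with the largest Runtime; ties are kept by default
longest_film <- toy_df %>% slice_max(Runtime, n = 1)
longest_film
```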

Question 2: Based on the above data, in the Runtime of 130-160 mins, which genre has the highest votes?

highest_votes <- select(movies_df, Runtime, Genre, Rating) %>%
  filter (Runtime >= 130 & Runtime <= 160) %>%
  arrange(desc(Rating))
head(highest_votes)

##   Runtime     Genre Rating
## 1     139 Biography    8.1
## 2     145     Drama    8.1
## 3     130 Animation    8.1
## 4     133    Action    7.8
## 5     137     Drama    7.8
## 6     147    Action    7.8

Answer: The query above sorts by Rating rather than Votes, so it identifies the highest-rated films. For runtimes of 130 to 160 minutes, the genres Biography, Drama, and Animation tie for the highest rating, 8.1.
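
Question 2 asks about votes, whereas the query above selects and sorts by Rating. One possible reading of the question is "which genre collects the most votes in total within that runtime range". The sketch below follows that interpretation on a toy data frame; the values are illustrative and are not scraped results.

```r
library(dplyr)

# Toy data (illustrative values, not the scraped results)
toy_df <- data.frame(
  Genre = c("Action", "Action", "Drama", "Comedy"),
  Runtime = c(132, 133, 145, 110),
  Votes = c(217113, 651949, 300000, 500000)
)

# Total votes per genre among films running 130 to 160 minutes
votes_by_genre <- toy_df %>%
  filter(Runtime >= 130, Runtime <= 160) %>%
  group_by(Genre) %>%
  summarize(total_votes = sum(Votes)) %>%
  arrange(desc(total_votes))
votes_by_genre
```

If the question instead means the single most-voted film, replacing the `group_by()` and `summarize()` steps with `slice_max(Votes, n = 1)` would return that film's genre.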

Question 3: Based on the above data, across all genres which genre has the highest average gross earnings in runtime 100 to 120?

highest_avg_gross <- select(movies_df, Genre, Gross_Earning_in_Mil, Runtime) %>%
  filter (Runtime >= 100 & Runtime <= 120) %>%
  group_by(Genre) %>%
  summarize(average_gross_earnings = mean(Gross_Earning_in_Mil, na.rm = TRUE))
head(highest_avg_gross)

## # A tibble: 6 × 2
##   Genre     average_gross_earnings
##   <fct>                      <dbl>
## 1 Action                      92.4
## 2 Adventure                  149. 
## 3 Animation                  216. 
## 4 Biography                   35.1
## 5 Comedy                      45.9
## 6 Crime                       51.1

Average_gross <- highest_avg_gross %>%
 arrange(desc(average_gross_earnings))
head(Average_gross)

## # A tibble: 6 × 2
##   Genre     average_gross_earnings
##   <fct>                      <dbl>
## 1 Animation                  216. 
## 2 Adventure                  149. 
## 3 Action                      92.4
## 4 Horror                      69.8
## 5 Drama                       61.2
## 6 Crime                       51.1

Linh Le

2023-04-03

Answer: Based on the above data, Animation has the highest average gross earnings (about $216 million) among films with runtimes of 100 to 120 minutes.