WebScraping

This assignment is based on the following article:

< Beginner’s Guide on Web Scraping in R (using rvest) with hands-on example >

https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/

#installing the rvest package.
#install.packages('rvest')

#loading the rvest package.
library('rvest')

#specifying the url for desired website to be scraped
url <- 'http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature'

#Reading the HTML code from the website
webpage <- read_html(url)

➭ Using CSS selectors to scrape the rankings section

rank_data_html <- html_nodes(webpage, '.text-primary')

#converting the ranking data to text
rank_data <- html_text(rank_data_html)

#let's have a look at the rankings.
str(rank_data)

##  chr [1:100] "1." "2." "3." "4." "5." "6." "7." "8." "9." "10." "11." "12." ...

#data-preprocessing : converting rankings type from character to numerical
rank_data <- as.numeric(rank_data)
str(rank_data)

##  num [1:100] 1 2 3 4 5 6 7 8 9 10 ...

➭ Using CSS selectors to scrape the title section

# I could not get the same CSS selector '.lister-item-header a' when I selected the title with the cursor.
title_data_html <- html_nodes(webpage,'.lister-item-header a')

#converting the title data to text
title_data <- html_text(title_data_html)

#let's have a look at the title
str(title_data)

##  chr [1:100] "Batman v Superman: Dawn of Justice Ultimate Edition" ...

➭ Using CSS selectors to scrape the description section

description_data_html <- html_nodes(webpage,'.ratings-bar+ .text-muted')
#converting the description data to text
description_data <- html_text(description_data_html)
#let's have a look at the description data
head(description_data)

## [1] "\nBatman is manipulated by Lex Luthor to fear Superman. Superman´s existence is meanwhile dividing the world and he is framed for murder during an international crisis. The heroes clash and force the neutral Wonder Woman to reemerge."
## [2] "\nThe adventures of writer Newt Scamander in New York's secret community of witches and wizards seventy years before Harry Potter reads his book in school."                                                                              
## [3] "\nA secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive task force. Their first mission: save the world from the apocalypse."                                                    
## [4] "\nA wisecracking mercenary gets experimented on and becomes immortal but ugly, and sets out to track down the man who ruined his looks."                                                                                                  
## [5] "\nFearing that the actions of Superman are left unchecked, Batman takes on the Man of Steel, while the world wrestles with what kind of a hero it really needs."                                                                          
## [6] "\nWorld War II American Army Medic Desmond T. Doss, who served during the Battle of Okinawa, refuses to kill people and becomes the first man in American history to receive the Medal of Honor without firing a shot."

#Data-preprocessing : removing '\n'
description_data <-gsub(pattern="\n", replacement = "", x = description_data)
head(description_data)

## [1] "Batman is manipulated by Lex Luthor to fear Superman. Superman´s existence is meanwhile dividing the world and he is framed for murder during an international crisis. The heroes clash and force the neutral Wonder Woman to reemerge."
## [2] "The adventures of writer Newt Scamander in New York's secret community of witches and wizards seventy years before Harry Potter reads his book in school."                                                                              
## [3] "A secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive task force. Their first mission: save the world from the apocalypse."                                                    
## [4] "A wisecracking mercenary gets experimented on and becomes immortal but ugly, and sets out to track down the man who ruined his looks."                                                                                                  
## [5] "Fearing that the actions of Superman are left unchecked, Batman takes on the Man of Steel, while the world wrestles with what kind of a hero it really needs."                                                                          
## [6] "World War II American Army Medic Desmond T. Doss, who served during the Battle of Okinawa, refuses to kill people and becomes the first man in American history to receive the Medal of Honor without firing a shot."

str(description_data)

##  chr [1:100] "Batman is manipulated by Lex Luthor to fear Superman. Superman´s existence is meanwhile dividing the world and "| __truncated__ ...

➭ Using CSS selectors to scrape the Movie runtime section

runtime_data_html <- html_nodes(webpage,'.runtime')
runtime_data <- html_text(runtime_data_html)
head(runtime_data)

## [1] "182 min" "132 min" "123 min" "108 min" "152 min" "139 min"

#Data-Preprocessing : removing mins and converting it to numerical
runtime_data <- gsub(pattern = " min", replacement = "", x= runtime_data)
runtime_data <- as.numeric(runtime_data)
head(runtime_data)

## [1] 182 132 123 108 152 139

length(runtime_data)

## [1] 100

➭ Using CSS selectors to scrape the Movie genre section

genre_data_html <- html_nodes(webpage,'.genre')

#Converting the genre data to text
genre_data <- html_text(genre_data_html)
head(genre_data)

## [1] "\nAction, Adventure, Sci-Fi            " 
## [2] "\nAdventure, Family, Fantasy            "
## [3] "\nAction, Adventure, Fantasy            "
## [4] "\nAction, Adventure, Comedy            " 
## [5] "\nAction, Adventure, Sci-Fi            " 
## [6] "\nBiography, Drama, History            "

#data_preprocessing: removing \n
genre_data <- gsub(pattern = "\n", replacement = "",x= genre_data)

#data_preprocessing : removing excess spaces
genre_data <- gsub(pattern = "            ",replacement = "", x = genre_data)
head(genre_data)

## [1] "Action, Adventure, Sci-Fi"  "Adventure, Family, Fantasy"
## [3] "Action, Adventure, Fantasy" "Action, Adventure, Comedy" 
## [5] "Action, Adventure, Sci-Fi"  "Biography, Drama, History"

#taking only the first genre of each movie
genre_data <- gsub(pattern = ",.*", replacement = "", x= genre_data)
head(genre_data)

## [1] "Action"    "Adventure" "Action"    "Action"    "Action"    "Biography"
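
Matching an exact run of spaces works for this page but would break if the padding ever changed. The following is a minimal sketch of a more tolerant cleanup using base R's trimws(); the vector `raw_genre` is made up for illustration and only mimics the shape of the scraped text.

```r
# Hypothetical raw strings shaped like the scraped genre text
raw_genre <- c("\nAction, Adventure, Sci-Fi            ",
               "\nBiography, Drama, History            ")

# trimws() strips leading and trailing whitespace, including "\n"
clean_genre <- trimws(raw_genre)

# Keep only the text before the first comma
first_genre <- sub(",.*", "", clean_genre)
first_genre
```

This does not depend on how many spaces or newlines surround the text.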

#converting each genre from text to factor.
class(genre_data)

## [1] "character"

genre_data <- as.factor(genre_data)

#let's have another look at the genre data
length(genre_data)

## [1] 100

table(genre_data)

## genre_data
##    Action Adventure Animation Biography    Comedy     Crime     Drama    Horror 
##        39         6        11         6        12         3        18         5

➭ Using CSS selectors to scrape the IMDB rating section

rating_data_html <- html_nodes(webpage,'.ratings-imdb-rating strong')
#converting the ratings data to text
rating_data <- html_text(rating_data_html)
#let's have a look at the ratings
head(rating_data)

## [1] "8.4" "7.3" "5.9" "8.0" "6.5" "8.1"

#data-preprocessing : converting ratings to numerical
rating_data <- as.numeric(rating_data)
#let's have another look at the ratings data
str(rating_data)

##  num [1:100] 8.4 7.3 5.9 8 6.5 8.1 8 6.9 7.5 7.6 ...

➭ Using CSS selectors to scrape the votes section

votes_data_html <- html_nodes(webpage,'.sort-num_votes-visible span:nth-child(2)')
#converting the votes data to text
votes_data <- html_text(votes_data_html)
#let's have a look at the votes data
str(votes_data)

##  chr [1:100] "10,375" "447,783" "669,445" "989,423" "683,393" "497,182" ...

#data-preprocessing : removing commas
votes_data <- gsub(pattern= "," , replacement ="", x=votes_data)
class(votes_data)

## [1] "character"

#data-preprocessing: converting votes to numerical
votes_data <- as.numeric(votes_data)
#let's have another look at the votes data
str(votes_data)

##  num [1:100] 10375 447783 669445 989423 683393 ...

➭ Using CSS selectors to scrape the directors section

director_data_html <- html_nodes(webpage,'.text-muted+ p a:nth-child(1)')
#converting the directors data to text
director_data <- html_text(director_data_html)
#let's have a look at the directors data
str(director_data)

##  chr [1:99] "David Yates" "David Ayer" "Tim Miller" "Zack Snyder" ...


#```{r}
#filling missing entries with NA
for (i in c(1)){
a <- director_data[1:(i-1)]
b <- director_data[i:length(director_data)]
director_data <- append(a,list("NA"))
director_data <- append(director_data,b)
}
#```
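This loop does not work for the first row. With i = 1, 1:(i-1) evaluates to 1:0, which counts down to c(1, 0), so a keeps the first element and the NA is inserted as the second observation instead of the first. So I had to add the first row a different way.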

# I used the append function instead. Its third argument, after, is the index after which the values are inserted, so after = 0 places NA before the first element.
director_data <- append(x = director_data, values = NA, after = 0)
head(director_data)

## [1] NA            "David Yates" "David Ayer"  "Tim Miller"  "Zack Snyder"
## [6] "Mel Gibson"
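
The two behaviours noted above can be reproduced on a small vector. This is a minimal sketch; `directors` is a hypothetical three-element stand-in for director_data, and seq_len() is shown as one possible way to get an empty index range.

```r
# Hypothetical three-element vector standing in for director_data
directors <- c("David Yates", "David Ayer", "Tim Miller")

i <- 1
idx <- 1:(i - 1)       # 1:0 counts down, giving c(1, 0), not an empty range
a <- directors[idx]    # index 0 is dropped, so `a` still holds the first element

# seq_len(0) is empty, which gives the intended "nothing before position 1"
a_empty <- directors[seq_len(i - 1)]

# after = 0 inserts before the first element; after = 1 inserts after it
front  <- append(directors, NA, after = 0)
second <- append(directors, NA, after = 1)
```

Here `front` starts with NA, while `second` has NA in its second position, which matches what the loop produced for i = 1.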

#data-preprocessing: converting directors data into factors
director_data <- as.factor(director_data)
length(director_data)

## [1] 100

head(director_data)

## [1] <NA>        David Yates David Ayer  Tim Miller  Zack Snyder Mel Gibson 
## 98 Levels: Alex Proyas Ana Lily Amirpour André Øvredal ... Zack Snyder

➭ Using CSS selectors to scrape the actors section

actor_data_html <- html_nodes(webpage,'.lister-item-content .ghost+ a')
#converting the actors data to text
actor_data <- html_text(actor_data_html)
#let's have a look at the actors data
str(actor_data)

##  chr [1:99] "Eddie Redmayne" "Will Smith" "Ryan Reynolds" "Ben Affleck" ...

#the first row is missing here too, so I used the append function again.
actor_data <- append(x = actor_data, values = NA, after = 0)
head(actor_data)

## [1] NA                "Eddie Redmayne"  "Will Smith"      "Ryan Reynolds"  
## [5] "Ben Affleck"     "Andrew Garfield"

#data-preprocessing : converting actors data into factors
actor_data <- as.factor(actor_data)
str(actor_data)

##  Factor w/ 89 levels "Alexander Skarsgård",..: NA 23 88 76 6 3 75 39 7 5 ...

length(actor_data)

## [1] 100

head(actor_data)

## [1] <NA>            Eddie Redmayne  Will Smith      Ryan Reynolds  
## [5] Ben Affleck     Andrew Garfield
## 89 Levels: Alexander Skarsgård Amy Adams Andrew Garfield ... Zoey Deutch

➭ Using CSS selectors to scrape the metascore section

metascore_data_html <- html_nodes(webpage,'.metascore')
#converting the metascore data to text
metascore_data <- html_text(metascore_data_html)
#let's have a look at the metascore
str(metascore_data)

##  chr [1:96] "66        " "40        " "65        " "44        " ...

#There is no need to strip the trailing spaces from metascore with gsub():
#as.numeric() ignores leading and trailing whitespace when converting.
metascore_data <- as.numeric(metascore_data)
str(metascore_data)

##  num [1:96] 66 40 65 44 71 94 52 72 81 59 ...

length(metascore_data)

## [1] 96

# Adding the first row for NA
metascore_data <- append(x = metascore_data, values = NA, 0)

# Adding the other 3 NAs
for (i in c(50, 71, 93)){
a <- metascore_data[1:(i-1)]
b <- metascore_data[i:length(metascore_data)]
metascore_data <- append(a,list("NA"))
metascore_data <- append(metascore_data,b)
}

length(metascore_data)

## [1] 100

head(metascore_data)

## [[1]]
## [1] NA
## 
## [[2]]
## [1] 66
## 
## [[3]]
## [1] 40
## 
## [[4]]
## [1] 65
## 
## [[5]]
## [1] 44
## 
## [[6]]
## [1] 71

#let's look at summary statistics
class(metascore_data)

## [1] "list"

#appending list("NA") turned the vector into a list,
#so I convert it back to numeric.
metascore_data <- as.numeric(metascore_data)

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

summary(metascore_data)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   18.00   47.00   62.00   60.15   74.00   99.00       4
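
The detour through a list can be avoided by inserting a real NA rather than list("NA"). The following is a minimal sketch under that assumption; `scores` and `missing_at` are made-up values, not the scraped data.

```r
# Hypothetical scores standing in for the scraped metascore vector
scores <- c(66, 40, 65, 44)

# Positions in the final vector where a value is missing (made up for illustration)
missing_at <- c(1, 4)

# Insert NA (not the string "NA") in ascending order so later positions stay valid
for (i in sort(missing_at)) {
  scores <- append(scores, NA, after = i - 1)
}

scores
class(scores)
```

Because NA is inserted directly, the vector stays numeric and needs no second as.numeric() call. Using after = i - 1 also handles the first position, since after = 0 inserts at the front.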

➭ Using CSS selectors to scrape the gross revenue section.

gross_data_html <- html_nodes(webpage, '.ghost~ .text-muted+ span')
#converting the gross revenue data to text
gross_data <- html_text(gross_data_html)
#let's have a look at the gross data
str(gross_data)

##  chr [1:89] "$234.04M" "$325.10M" "$363.07M" "$330.36M" "$67.21M" ...

#data-preprocessing: removing '$' and 'M' signs.

gross_data <- gsub(pattern = "M", replacement = "", x= gross_data)

gross_data <- gsub("$", "", gross_data)

#as a regex, "$" is the end-of-string anchor, so the gsub() call above does not remove the dollar sign; substring() drops it by position instead.
gross_data <- substring(text = gross_data, first = 2, last = 7)
head(gross_data)

## [1] "234.04" "325.10" "363.07" "330.36" "67.21"  "151.10"
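
The "$" character can still be removed with gsub() once it is matched literally. This is a minimal sketch of two ways to do that; the vector `gross` is made up and only mirrors the format of the scraped values.

```r
# Hypothetical values in the same format as the scraped gross figures
gross <- c("$234.04M", "$0.18M")

# As a regex, "$" only matches the empty end of each string, so nothing is removed
unchanged <- gsub("$", "", gross)

# fixed = TRUE treats the pattern as a literal character
no_dollar <- gsub("$", "", gross, fixed = TRUE)

# Regex alternative: inside a character class, "$" is literal
cleaned <- as.numeric(gsub("[$M]", "", gross))
cleaned
```

Unlike substring() with fixed positions, the character-class version does not depend on how many digits each value has.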

#let's check the length of gross data
length(gross_data)

## [1] 89

#filling missing entries with NA
for (i in c(46, 49,55, 57, 70, 71, 83, 84, 85,  95, 99)){
a <- gross_data[1:(i-1)]
b <- gross_data[i:length(gross_data)]
gross_data <- append(a,list("NA"))
gross_data <- append(gross_data,b)
}

#data-preprocessing: converting gross to numerical

gross_data <- as.numeric(gross_data)

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

head(gross_data)

## [1] 234.04 325.10 363.07 330.36  67.21 151.10

length(gross_data)

## [1] 100

summary(gross_data)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.18   18.71   57.64   99.39  126.64  532.18      11
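
Listing the positions of missing entries by hand is fragile, because the positions change whenever the page does. A possible alternative, sketched here on a made-up HTML fragment (the class names `item`, `title` and `score` are hypothetical, not IMDb's), is to select one container per movie first and then look up each field inside it. html_node() returns one result per input node, and html_text() gives NA where the field is absent.

```r
library(rvest)

# Hypothetical markup: the second item has no score element
page <- read_html('<html><body>
  <div class="item"><span class="title">A</span><span class="score">66</span></div>
  <div class="item"><span class="title">B</span></div>
  <div class="item"><span class="title">C</span><span class="score">40</span></div>
</body></html>')

# One node per movie
items <- html_nodes(page, ".item")

# html_node() (singular) keeps one entry per item, missing where nothing matches
titles <- html_text(html_node(items, ".title"))
scores <- as.numeric(html_text(html_node(items, ".score")))

data.frame(Title = titles, Score = scores)
```

With this approach every column has the same length as the number of items, so no NA positions need to be filled in afterwards.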

➭ Combining all the lists to form a data frame

movie_df <- data.frame(
  Rank = rank_data,
  Title = title_data,
  Description = description_data,
  Runtime = runtime_data,
  Genre = genre_data,
  Rating = rating_data,
  Metascore = metascore_data,
  Votes = votes_data,
  Gross_Earning_in_Mil = gross_data,
  Director = director_data,
  Actor = actor_data
)

str(movie_df)

## 'data.frame':    100 obs. of  11 variables:
##  $ Rank                : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title               : chr  "Batman v Superman: Dawn of Justice Ultimate Edition" "Fantastic Beasts and Where to Find Them" "Suicide Squad" "Deadpool" ...
##  $ Description         : chr  "Batman is manipulated by Lex Luthor to fear Superman. Superman´s existence is meanwhile dividing the world and "| __truncated__ "The adventures of writer Newt Scamander in New York's secret community of witches and wizards seventy years bef"| __truncated__ "A secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive "| __truncated__ "A wisecracking mercenary gets experimented on and becomes immortal but ugly, and sets out to track down the man"| __truncated__ ...
##  $ Runtime             : num  182 132 123 108 152 139 128 144 115 107 ...
##  $ Genre               : Factor w/ 8 levels "Action","Adventure",..: 1 2 1 1 1 4 5 1 1 3 ...
##  $ Rating              : num  8.4 7.3 5.9 8 6.5 8.1 8 6.9 7.5 7.6 ...
##  $ Metascore           : num  NA 66 40 65 44 71 94 52 72 81 ...
##  $ Votes               : num  10375 447783 669445 989423 683393 ...
##  $ Gross_Earning_in_Mil: num  234 325.1 363.1 330.4 67.2 ...
##  $ Director            : Factor w/ 98 levels "Alex Proyas",..: NA 25 22 94 98 60 18 11 84 79 ...
##  $ Actor               : Factor w/ 89 levels "Alexander Skarsgård",..: NA 23 88 76 6 3 75 39 7 5 ...

library(tidyverse)
library(plotly)
#view(movie_df)

➭ Analyzing scraped data from the web

qplot(data = movie_df, Runtime, fill = Genre,bins = 30)

# Recreating the qplot graph above with ggplot
ggplot(movie_df, aes(x= Runtime, fill = Genre)) +
  geom_histogram(colour = "gray", lwd= 0.2, position = "stack")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Exact values are hard to read from the stacked histogram, so I split it into one panel per genre.
ggplot(movie_df, aes(x= Runtime, fill= Genre)) +
  geom_histogram(color = "gray") +
  facet_wrap(Genre ~ .)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

☑ Question 1: Based on the above data, which movie from which Genre had the longest runtime?

➪ Answer: Action. The only bar above a runtime of 175 minutes is orange, the color for Action.
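
Reading a maximum off a histogram is approximate, so the answer can also be checked directly. This is a minimal sketch on a made-up data frame (`toy_df` is hypothetical and only reuses the column names of movie_df); the same indexing should work on movie_df itself.

```r
# Hypothetical rows using the same column names as movie_df
toy_df <- data.frame(
  Title   = c("Film A", "Film B", "Film C"),
  Genre   = factor(c("Action", "Drama", "Comedy")),
  Runtime = c(182, 139, 108)
)

# which.max() returns the index of the first maximum and skips NA values
longest <- toy_df[which.max(toy_df$Runtime), c("Title", "Genre", "Runtime")]
longest
```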

# By adding vlines, you can see which bubbles are inside the range in the question.
gr <- ggplot(movie_df, aes(x= Runtime, y = Rating)) +
  geom_point(aes(size = Votes, col = Genre), alpha = 0.9) +
  geom_vline(xintercept = 130, alpha = 0.4) +
  geom_vline(xintercept = 160, alpha = 0.4) +
  theme_bw()

ggplotly(gr)

## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

☑ Question 2 : Based on the above data, in the Runtime of 130-160 mins, which genre has the highest votes?

➪ Answer: Action. Bubble size represents the number of votes, and within the 130-160 minute runtime range there are several large orange bubbles, which represent Action.
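
Comparing bubble sizes by eye can be backed up with a small calculation. The sketch below assumes "highest votes" means the largest total votes per genre; `toy_votes` and its numbers are made up for illustration and only reuse the column names of movie_df.

```r
# Hypothetical rows using the same column names as movie_df
toy_votes <- data.frame(
  Genre   = c("Action", "Action", "Drama", "Comedy"),
  Runtime = c(152, 139, 144, 108),
  Votes   = c(500, 300, 400, 900)
)

# Keep only movies whose runtime falls in the 130-160 minute window
in_range <- subset(toy_votes, Runtime >= 130 & Runtime <= 160)

# Total votes per genre within the window
votes_by_genre <- tapply(in_range$Votes, in_range$Genre, sum)

# Genre with the largest total
top_genre <- names(which.max(votes_by_genre))
top_genre
```

Replacing sum with max would instead pick the genre of the single most-voted movie in the window.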

gr2 <-ggplot(movie_df, aes(x = Runtime, y = Gross_Earning_in_Mil)) +
  geom_point(aes(size = Rating, col = Genre), alpha = 0.9) +
  geom_vline(xintercept = 100, alpha = 0.4) +
  geom_vline(xintercept = 120, alpha = 0.4) +
  theme_classic()

ggplotly(gr2)

#There are too many bubbles to read, so to answer Question 3 I will plot the average gross earnings by genre instead.

df_gross <- movie_df %>%
  group_by(Genre) %>%
  filter(Runtime >=100 & Runtime <=120) %>%
  summarise(mean_gross = mean(Gross_Earning_in_Mil, na.rm = TRUE))
df_gross

## # A tibble: 8 x 2
##   Genre     mean_gross
##   <fct>          <dbl>
## 1 Action          75.8
## 2 Adventure      125. 
## 3 Animation      125. 
## 4 Biography       19.1
## 5 Comedy          37.7
## 6 Crime          117. 
## 7 Drama           64.9
## 8 Horror         177.

ggplot(df_gross, aes(x= Genre, y= mean_gross, fill = Genre)) +
  geom_col() +
  ylab("Average Gross Earning in Million") +
  ggtitle("Average Gross Earnings of Movies with a Runtime of 100-120 Minutes (By Genre)")

☑ Question 3: Based on the above data, across all genres, which genre has the highest average gross earnings for runtimes of 100 to 120 minutes?

➪ Answer: Horror. The horror genre has the highest average gross earnings in runtime 100 to 120.

The End. Thank you!

WebScraping_yk

Yunji Kim

2022-04-03
