Most data on the web is not readily available. It sits in an unstructured format, cannot be downloaded directly, and takes some know-how to turn into something a useful model can be built on.

## What is web scraping?

Web scraping is a technique for converting data held in HTML tags on the web into a structured format that can be easily accessed and used.

## Why do we need it?

To provide you with hands-on knowledge, we are going to scrape data from IMDb. Some other possible applications of web scraping are:
- Scraping movie rating data to create movie recommendation engines.
- Scraping text data from Wikipedia and other sources to build NLP-based systems or to train deep learning models for tasks like topic recognition from a given text.
- Scraping labeled image data from websites like Google and Flickr to train image classification models.
- Scraping data from social media sites like Facebook and Twitter for tasks like sentiment analysis and opinion mining.
- Scraping user reviews and feedback from e-commerce sites like Amazon and Flipkart.
There are several ways to scrape data from the web:

- Human copy-paste: A slow, manual way of scraping data from the web, in which people analyze the data themselves and copy it to local storage.
- Text pattern matching: A simple yet powerful approach that extracts information from the web using the regular expression matching facilities of a programming language.
- API interface: Many websites, such as Facebook, Twitter, and LinkedIn, provide public and/or private APIs that can be called with standard code to retrieve data in a prescribed format.
- DOM parsing: By using web browsers, programs can retrieve the dynamic content generated by client-side scripts. It is also possible to parse web pages into a DOM tree, from which programs can retrieve parts of these pages.

We'll use the DOM parsing approach in this article and rely on the CSS selectors of the webpage to find the fields that contain the desired information. Before we begin, there are a few prerequisites for scraping data from any website proficiently.

One of the best sources I could find for learning HTML and CSS is https://flukeout.github.io/. I have observed that many data scientists are not well versed in HTML and CSS. Therefore, we'll use an open-source tool named Selector Gadget, which is more than sufficient for performing web scraping. You can access and download the Selector Gadget extension at https://selectorgadget.com/; install it by following the instructions on that website. I'm using Google Chrome and can access the extension from the extension bar at the top right.
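As a small illustration of the text pattern matching approach, the sketch below pulls numbers out of an HTML fragment with base R regular expressions. The fragment and the `rating` class name are made up for the example and are not taken from any real page.

```r
# Hypothetical HTML fragment (invented for this example)
html <- '<li><span class="rating">7.8</span></li><li><span class="rating">6.4</span></li>'

# Find every <span class="rating">...</span> occurrence
matches <- regmatches(html, gregexpr('<span class="rating">[0-9.]+</span>', html))[[1]]

# Strip the surrounding tags and convert the remaining text to numbers
ratings <- as.numeric(gsub("<[^>]+>", "", matches))
ratings
```

Patterns like this are quick to write for simple, regular markup, but they tend to break when the page layout changes, which is one reason to prefer DOM parsing for anything larger.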
#install.packages("rvest")
#Load rvest package
library("rvest")
#specify URL for desired website to be scraped
url <-"https://www.imdb.com/search/title/?count=100&release_date=2016,2016&title_type=feature"
#Read the HTML code from the website
webpage <- read_html(url)
Now, we’ll be scraping the following data from this website.
- Rank: The rank of the film from 1 to 100 on the list of 100 most popular feature films released in 2016.
- Title: The title of the feature film.
- Description: The description of the feature film.
- Runtime: The duration of the feature film.
- Genre: The genre of the feature film.
- Rating: The IMDb rating of the feature film.
- Metascore: The metascore on the IMDb website for the feature film.
- Votes: Votes cast in favor of the feature film.
- Gross_Earning_in_Mil: The gross earnings of the feature film in millions.
- Director: The main director of the feature film. Note that in case of multiple directors, I'll take only the first.
- Actor: The main actor in the feature film. Note that in case of multiple actors, I'll take only the first.
Use Selector Gadget (in your Chrome extensions) to find the CSS selector that encloses the rankings:

1. Mark all rankings on the site.
2. Now that we know which CSS selector contains the rankings, we can use simple R code to get them.

## Rank Data
#using css selector to scrape rankings section
rank_data_html <- html_nodes(webpage, ".text-primary")
#converting ranking data to text
rank_data <- html_text(rank_data_html)
#Look at the rankings
head(rank_data)
## [1] "1." "2." "3." "4." "5." "6."
#Data-preprocessing: converting rankings to numerical
rank_data <- as.numeric(rank_data)
#Let's have another look at the rankings
head(rank_data)
## [1] 1 2 3 4 5 6
#Using CSS selector to scrape title section
title_data_html <- html_nodes(webpage, ".lister-item-header a")
#convert title to text
title_data <- html_text(title_data_html)
#Let's look at the titles we have
head(title_data)
## [1] "Suicide Squad" "Batman v Superman: Dawn of Justice"
## [3] "Captain America: Civil War" "Captain Fantastic"
## [5] "Deadpool" "The Accountant"
Apply the same scraping technique to the Description, Runtime, Genre, Rating, Metascore, Votes, Gross_Earning_in_Mil, Director, and Actor data.
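Since every field follows the same select-then-extract pattern, the two steps can be wrapped in a small helper. This is only a sketch: `scrape_text` is a hypothetical name, and the inline HTML is a made-up stand-in for the IMDb page so that the example runs offline.

```r
library(rvest)

# Hypothetical helper: select nodes by CSS selector and return their trimmed text
scrape_text <- function(page, css) {
  nodes <- html_nodes(page, css)
  html_text(nodes, trim = TRUE)
}

# Made-up page fragment standing in for the real search results
page <- read_html('<div class="lister-item-header"><a>Movie A</a></div>
                   <div class="lister-item-header"><a>Movie B</a></div>')

titles <- scrape_text(page, ".lister-item-header a")
titles
```

The step-by-step version used in the rest of this article does the same work, but keeps the intermediate node sets around for inspection.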
#Use CSS Selectors to scrape the description section
description_data_html <- html_nodes(webpage, ".ratings-bar+ .text-muted")
#convert description data to text
description_data <- html_text(description_data_html)
#View the description data
head(description_data)
## [1] "\n A secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive task force. Their first mission: save the world from the apocalypse."
## [2] "\n Fearing that the actions of Superman are left unchecked, Batman takes on the Man of Steel, while the world wrestles with what kind of a hero it really needs."
## [3] "\n Political involvement in the Avengers' affairs causes a rift between Captain America and Iron Man."
## [4] "\n In the forests of the Pacific Northwest, a father devoted to raising his six kids with a rigorous physical and intellectual education is forced to leave his paradise and enter the world, challenging his idea of what it means to be a parent."
## [5] "\n A wisecracking mercenary gets experimented on and becomes immortal but ugly, and sets out to track down the man who ruined his looks."
## [6] "\n As a math savant uncooks the books for a new client, the Treasury Department closes in on his activities, and the body count starts to rise."
#remove the newline characters
description_data <- gsub("\n", "", description_data)
head(description_data)
## [1] " A secret government agency recruits some of the most dangerous incarcerated super-villains to form a defensive task force. Their first mission: save the world from the apocalypse."
## [2] " Fearing that the actions of Superman are left unchecked, Batman takes on the Man of Steel, while the world wrestles with what kind of a hero it really needs."
## [3] " Political involvement in the Avengers' affairs causes a rift between Captain America and Iron Man."
## [4] " In the forests of the Pacific Northwest, a father devoted to raising his six kids with a rigorous physical and intellectual education is forced to leave his paradise and enter the world, challenging his idea of what it means to be a parent."
## [5] " A wisecracking mercenary gets experimented on and becomes immortal but ugly, and sets out to track down the man who ruined his looks."
## [6] " As a math savant uncooks the books for a new client, the Treasury Department closes in on his activities, and the body count starts to rise."
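The descriptions still start with leading spaces once the newline is removed. One way to drop both in a single step is base R's `trimws()`, sketched here on a made-up string rather than on the scraped data.

```r
# Made-up description with the same kind of leading newline and spaces
raw_description <- "\n    A secret government agency recruits villains."

# trimws() removes leading and trailing whitespace, including newlines
clean_description <- trimws(raw_description)
clean_description
```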
#scrape runtime section
runtime_data_html <- html_nodes(webpage,".text-muted .runtime")
#converting runtime data to text
runtime_data <- html_text(runtime_data_html)
#display runtime
head(runtime_data)
## [1] "123 min" "152 min" "147 min" "118 min" "108 min" "128 min"
#remove "min" and convert to numerical
runtime_data <- gsub("min", "", runtime_data)
runtime_data <- as.numeric(runtime_data)
#display numeric runtime
head(runtime_data)
## [1] 123 152 147 118 108 128
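The same strip-then-convert idea can be written as a reusable function. The sketch below is one assumed way to generalize it (`to_number` is a hypothetical name): it drops every character that is not a digit or a decimal point before converting, so it also handles values with thousands separators such as the vote counts scraped later.

```r
# Hypothetical helper: keep only digits and decimal points, then convert
to_number <- function(x) {
  as.numeric(gsub("[^0-9.]", "", x))
}

runtimes <- to_number(c("123 min", "152 min"))   # runtime strings
votes    <- to_number(c("612,283", "643,222"))   # counts with thousands separators
runtimes
votes
```

Because it removes everything except digits and dots, this helper is not suitable for values that may carry a minus sign.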
genre_data_html <- html_nodes(webpage, ".genre")
genre_data <- html_text(genre_data_html)
head(genre_data)
## [1] "\nAction, Adventure, Fantasy "
## [2] "\nAction, Adventure, Sci-Fi "
## [3] "\nAction, Adventure, Sci-Fi "
## [4] "\nComedy, Drama "
## [5] "\nAction, Adventure, Comedy "
## [6] "\nAction, Crime, Drama "
#remove \n
genre_data <- gsub("\n", "", genre_data)
#remove excess spacing
genre_data <- gsub(" ", "", genre_data)
#take only the first genre of each movie
genre_data <- gsub(",.*","", genre_data)
#convert each genre from text to factor
genre_data <- as.factor(genre_data)
#display data
head(genre_data)
## [1] Action Action Action Comedy Action Action
## Levels: Action Adventure Animation Biography Comedy Crime Drama Horror
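Keeping only the first genre discards information. If all genres are needed later, one option (a sketch, not part of the workflow above) is to split each string on commas with `strsplit()`. The input mimics two of the raw genre strings returned by the scrape.

```r
# Two raw genre strings in the format returned by the scrape
raw_genre <- c("\nAction, Adventure, Fantasy ", "\nComedy, Drama ")

# Trim whitespace, then split each entry on a comma followed by optional spaces
genre_list <- strsplit(trimws(raw_genre), ", *")

# First genre of each movie, matching the result of the gsub(",.*", "") step
first_genre <- vapply(genre_list, function(g) g[1], character(1))
first_genre
```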
rating_data_html <- html_nodes(webpage,".ratings-imdb-rating strong")
#convert data to text
rating_data <- html_text(rating_data_html)
#Let's look at ratings
head(rating_data)
## [1] "6.0" "6.4" "7.8" "7.9" "8.0" "7.3"
#convert ratings to numerical
rating_data <- as.numeric(rating_data)
head(rating_data)
## [1] 6.0 6.4 7.8 7.9 8.0 7.3
#CSS selectors to scrape the votes
votes_data_html <- html_nodes(webpage, ".sort-num_votes-visible span:nth-child(2)")
#convert data to text
votes_data <- html_text(votes_data_html)
#review the votes
head(votes_data)
## [1] "612,283" "643,222" "676,168" "194,549" "913,827" "264,380"
#remove the commas
votes_data <- gsub(",", "", votes_data)
#convert votes to numerical
votes_data <- as.numeric(votes_data)
#display votes
head(votes_data)
## [1] 612283 643222 676168 194549 913827 264380
#scrape directors
directors_data_html <- html_nodes(webpage,".text-muted+ p a:nth-child(1)")
#converting data to text
directors_data <- html_text(directors_data_html)
#display directors
head(directors_data)
## [1] "David Ayer" "Zack Snyder" "Anthony Russo" "Matt Ross"
## [5] "Tim Miller" "Gavin O'Connor"
#convert directors data into factors
directors_data <- as.factor(directors_data)
#CSS selectors to scrape the actors
actors_data_html <- html_nodes(webpage, ".lister-item-content .ghost+ a")
#convert actors data to text
actors_data <- html_text(actors_data_html)
#View actors data
head(actors_data)
## [1] "Will Smith" "Ben Affleck" "Chris Evans" "Viggo Mortensen"
## [5] "Ryan Reynolds" "Ben Affleck"
#convert data into factors
actors_data <- as.factor(actors_data)
#scrape metascore section
metascore_data_html <- html_nodes(webpage, ".metascore")
#convert metascore data to text
metascore_data <- html_text(metascore_data_html)
#Look at metascore
head(metascore_data)
## [1] "40 " "44 " "75 " "72 " "65 "
## [6] "51 "
#remove extra space
metascore_data <- gsub(" ", "", metascore_data)
#check length of metascore data
length(metascore_data)
## [1] 97
#insert NA at the positions of the movies without metascores
for (i in c(18, 57, 100)){
  metascore_data <- append(metascore_data, NA, after = i - 1)
}
#Data-Preprocessing: converting metascore to numerical
metascore_data <- as.numeric(metascore_data)
#Let's have another look at length of the metascore data
length(metascore_data)
## [1] 100
#Let's look at summary statistics
summary(metascore_data)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 25.00 48.00 62.00 60.44 72.00 99.00 3
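Hard-coding the positions of the missing metascores only works for one snapshot of the page. A more robust sketch is to select each movie's container first and then call `html_node()` (singular) inside it, which yields one result per container and a missing value where the element is absent. The HTML below is a made-up stand-in, and the assumption that each movie sits in its own `.lister-item-content` container is inferred from the actor selector used earlier.

```r
library(rvest)

# Made-up fragment: the second movie has no metascore
page <- read_html('
  <div class="lister-item-content"><span class="metascore">40 </span></div>
  <div class="lister-item-content"></div>
  <div class="lister-item-content"><span class="metascore">75 </span></div>')

# One container per movie
items <- html_nodes(page, ".lister-item-content")

# html_node() gives one result per container; absent elements become NA
metascore <- as.numeric(html_text(html_node(items, ".metascore"), trim = TRUE))
metascore
```

With this pattern the result always has one entry per movie, so no manual padding is needed before building the data frame.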
#scrape gross_data
gross_data_html <- html_nodes(webpage, ".ghost~ .text-muted+ span")
#convert revenue data to text
gross_data <-html_text(gross_data_html)
#look at gross data
head(gross_data)
## [1] "$325.10M" "$330.36M" "$408.08M" "$5.88M" "$363.07M" "$86.26M"
#remove "M" and the leading "$"
gross_data <- gsub("M","", gross_data)
gross_data <- substring(gross_data,2,6)
#look at the cleaned gross data, then check its length
head(gross_data)
## [1] "325.1" "330.3" "408.0" "5.88" "363.0" "86.26"
length(gross_data)
## [1] 92
#fill in the missing data
for (i in c(18,67,73,75,83,87,98,100))
{
  gross_data <- append(gross_data, NA, after = i - 1)
}
#convert gross to numerical
gross_data <- as.numeric(gross_data)
#length of gross
length(gross_data)
## [1] 100
summary(gross_data)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.04 14.39 52.30 91.32 116.15 532.10 8
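The `substring(gross_data, 2, 6)` step keeps at most five characters, so `"$330.36M"` ends up as `330.3` rather than `330.36`. A sketch that avoids the truncation, shown on three values in the scraped format, strips the `$` and the `M` with a single pattern instead.

```r
# Raw values in the format returned by the scrape
raw_gross <- c("$325.10M", "$330.36M", "$5.88M")

# Remove the leading "$" and the trailing "M", keeping all digits
gross_mil <- as.numeric(gsub("^\\$|M$", "", raw_gross))
gross_mil
```

Applying this to the full data would shift the summary statistics slightly relative to the output shown in this article.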
#combine vectors to form a data frame
movies_df <- data.frame(Rank = rank_data, Title = title_data,
                        Runtime = runtime_data, Genre = genre_data,
                        Description = description_data, Rating = rating_data,
                        Metascore = metascore_data, Votes = votes_data,
                        Gross_Earning_in_Mil = gross_data,
                        Director = directors_data, Actor = actors_data)
#structure of data frame
str(movies_df)
## 'data.frame': 100 obs. of 11 variables:
## $ Rank : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Title : chr "Suicide Squad" "Batman v Superman: Dawn of Justice" "Captain America: Civil War" "Captain Fantastic" ...
## $ Runtime : num 123 152 147 118 108 128 120 116 107 116 ...
## $ Genre : Factor w/ 8 levels "Action","Adventure",..: 1 1 1 5 1 1 1 1 3 7 ...
## $ Description : chr " A secret government agency recruits some of the most dangerous incarcerated super-villains to form a defens"| __truncated__ " Fearing that the actions of Superman are left unchecked, Batman takes on the Man of Steel, while the world "| __truncated__ " Political involvement in the Avengers' affairs causes a rift between Captain America and Iron Man." " In the forests of the Pacific Northwest, a father devoted to raising his six kids with a rigorous physical "| __truncated__ ...
## $ Rating : num 6 6.4 7.8 7.9 8 7.3 6.8 7.4 7.6 7.9 ...
## $ Metascore : num 40 44 75 72 65 51 67 70 81 81 ...
## $ Votes : num 612283 643222 676168 194549 913827 ...
## $ Gross_Earning_in_Mil: num 325.1 330.3 408 5.88 363 ...
## $ Director : Factor w/ 98 levels "Adam Wingard",..: 23 98 6 61 93 36 40 86 82 27 ...
## $ Actor : Factor w/ 91 levels "Aamir Khan","Alexander Skarsgård",..: 89 8 19 88 75 8 39 73 7 3 ...
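`data.frame()` stops with an error when, for example, one column has 97 values and another has 100, which is why the missing metascore and gross values had to be filled in first. A quick check along the following lines can catch a mismatch early; the vectors here are short made-up stand-ins for the scraped ones.

```r
# Made-up stand-ins for the scraped vectors
rank_data  <- c(1, 2, 3)
title_data <- c("Movie A", "Movie B", "Movie C")
gross_data <- c(325.1, NA, 5.88)

# Collect the length of every column and confirm they all agree
column_lengths <- lengths(list(Rank = rank_data, Title = title_data, Gross = gross_data))
stopifnot(length(unique(column_lengths)) == 1)
column_lengths
```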
#Visualizations
library("ggplot2")
#Create histogram of Runtime, filled by Genre
qplot(data = movies_df, Runtime, fill = Genre, bins = 30)
With Runtime on the x-axis, the longest-running movies sit on the right side of the graph, near 160 minutes. For 2016, only one movie in the Drama genre and one movie in the Action genre have a runtime close to 160 minutes. Dangal (the action film) and Silence I (the drama film) are the two longest movies, both at 161 minutes. I found them by using this graph together with the IMDb chart to narrow down the longest-running movies.
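The same question can be answered without reading values off the plot. As a sketch, the snippet below sorts a data frame by runtime; `toy_df` is a made-up miniature of `movies_df` with hypothetical titles, so only the technique carries over.

```r
# Made-up miniature of movies_df (titles and values are invented)
toy_df <- data.frame(
  Title   = c("Movie A", "Movie B", "Movie C", "Movie D"),
  Genre   = c("Action", "Drama", "Comedy", "Action"),
  Runtime = c(161, 161, 98, 120)
)

# Sort by runtime, longest first, and keep the top two rows
longest <- toy_df[order(toy_df$Runtime, decreasing = TRUE), ][1:2, ]
longest
```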
#scatter plot of Rating against Runtime, sized by Votes
ggplot(movies_df, aes(x=Runtime, y = Rating)) +
geom_point(aes(size = Votes, col = Genre))
For movies with a runtime between 130 and 160 minutes, the largest circles mark the movies with the most votes. In this range the Action genre has three medium-to-large circles, more than any other genre. Action movies therefore have the highest votes among movies with runtimes between 130 and 160 minutes.
#scatter plot of Gross_Earning_in_Mil against Runtime, sized by Rating
ggplot(movies_df, aes(x = Runtime, y = Gross_Earning_in_Mil))+
geom_point(aes(size = Rating, col = Genre))
## Warning: Removed 8 rows containing missing values (geom_point).
Between 100 and 120 minutes on the x-axis, there is a lot going on. The most important dots are the ones highest on the graph, because those movies have higher gross earnings than the ones sitting lower down. Two dots sit very close together, almost in line with each other, and they belong to the Adventure and Action genres. Both of these movies show gross earnings of approximately $375 million for 2016.
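Reading earnings off a crowded scatter plot is approximate. A numeric cross-check could filter to the runtime window and take the highest gross per genre, as sketched below on made-up data (the figures are invented and do not come from IMDb).

```r
# Made-up data; values are invented for illustration
toy_df <- data.frame(
  Genre   = c("Action", "Adventure", "Comedy", "Action"),
  Runtime = c(108, 115, 101, 147),
  Gross_Earning_in_Mil = c(363.0, 364.0, 55.5, 408.0)
)

# Keep movies with a runtime between 100 and 120 minutes
window <- subset(toy_df, Runtime >= 100 & Runtime <= 120)

# Highest gross earnings per genre within that window
top_gross <- aggregate(Gross_Earning_in_Mil ~ Genre, data = window, FUN = max)
top_gross
```

Rows with a missing gross value are dropped by the formula interface of `aggregate()` by default, which matches the rows that the plot leaves out.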