Web scraping involves using a set of automated tools to copy information from a website rather than doing it manually (Vargiu & Urru 2012). Due to the volume, types and inconsistent formats of information available on the Web, Web scraping provides an effective way to extract large amounts of data quickly for further analysis. For this reason, it is also an efficient method often preferred by researchers (Krotov & Silva 2018).
Before we get started, there are legal and ethical implications of Web scraping to be aware of. Krotov & Silva (2018) comment that there are no direct laws preventing Web scraping at this point in time however it is still guided by an existing legal framework of copyright infringement, fraud, abuse and breach of contract. Some common examples include:
Terms of use - Websites may prohibit Web scraping through their ‘terms of use’ policy and failure to comply would be deemed as breach of contract. However, to violate the contract, the person must accept the terms.
Copyright - Extracting and republishing the exact data ‘owned’ by the Website could be subject to copyright. This does not include comments from users using the website. It would also be possible to extract and summarise the information without breaching copyright.
Fraud - Accessing data that is for illegal or fraudulent purposes.
Abuse - The web scraping method causes damage to the Website.
Ethically, there are also individual and organisational privacy and unfair use areas to consider (Krotov & Silva 2018). As each specific reason for Web scraping may differ, it is important to review the legal risks in your area before proceeding.
This guide aims to demonstrate how to extract data from a website through R by using the httr and rvest packages. As an example we will extract weekly movie sales from the boxofficemojo.com website (IMDB 2018).
The boxofficemojo.com website is owned by IMDB and reports on United States box office sales for movies. For this guide, our goal will be to extracting weekly sales from this URL https://www.boxofficemojo.com/weekly/chart/?yr=2018&wk=31&p=.htm and store it in a data frame to allow further analysis.
Figure 1 – Weekly Box Office results (IMDB 2018B):
Firstly we’ll add the packages we need:
install.packages("httr")
library(httr)
install.packages("rvest")
library(rvest)
install.packages("dplyr")
library(dplyr)
install.packages("magrittr")
library(magrittr)
install.packages("lubridate")
library(lubridate)
install.packages("lubridate")
library(lubridate)
We will use the httr package to build the query. Notice how in the URL https://www.boxofficemojo.com/weekly/chart/?yr=2018&wk=31&p=.htm there is a ‘yr’ and ‘wk’ parameter. This refers to the year and calendar week. It is possible to change the week number in the URL to see different week results without having to click on a different hyperlink. Try changing it.
Having the ability to adjust the parameter means the query can automatically be generated and return different weekly results. The Httr package provides the ability to call a URL via the GET() function and pass a series of parameters.
getYear = "2018"
getWeek = "31"
base_url = "https://www.boxofficemojo.com/weekly/chart/"
query_params = list(yr=getYear, wk=getWeek)
resp = GET(url=base_url, query=query_params)
resp
## Response [https://www.boxofficemojo.com/weekly/chart/?yr=2018&wk=31]
## Date: 2018-08-14 12:26
## Status: 200
## Content-Type: text/html
## Size: 91.3 kB
## NA
Notice the status 200. This means we received a successful response back. If you get a status of 4XX it means a problem on the website. If you get a status of 5XX it means a problem on your side.
Now that we have the response (resp) back from the Website, we can extract the contents by using the read_html() function in the rvest package.
page_html = read_html(resp)
We can write the contents to file to read the contents and find out what to extract.
write_xml(page_html, file="WeeklyBoxOfficeSales.html")
Here’s what the content looks like.
Figure 2 – HTML contents of webpage:
HTML consists of a series of text within tags eg. <title>Weekly Box Office Results</title>
In rvest, each tag is considered a node. A node can be extracted as a object using the html_nodes() function passing the name of the node. For example, calling html_nodes(“title”) will return all nodes called ‘<title>’ to a list. To extract the actual title, the html_text(
page_html %>%
html_nodes("title") %>%
html_text()
## [1] "Weekly Box Office Results for August 3-9, 2018 - Box Office Mojo"
Our objective is to get the table from the Webpage into a data frame. This can be done in a series of steps.
page_html %>%
html_nodes("table")
## {xml_nodeset (5)}
## [1] <table border="0" cellspacing="0" cellpadding="0"><tr><form method=" ...
## [2] <table border="0" cellspacing="0" cellpadding="0" width="100%"><tr b ...
## [3] <table border="0" cellpadding="0" cellspacing="0" width="100%"><tr>\ ...
## [4] <table border="0 cellspacing=" cellpadding="0">\n<tr>\n<td width="25 ...
## [5] <table border="0" cellspacing="1" cellpadding="5">\n<tr bgcolor="#dc ...
page_html %>%
html_nodes("table") %>%
extract2(5)
## {xml_node}
## <table border="0" cellspacing="1" cellpadding="5">
## [1] <tr bgcolor="#dcdcdc">\n<td align="center"><font size="1"><a href=" ...
## [2] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">1</font>< ...
## [3] <tr bgcolor="#ffff99">\n<td align="center"><font size="2">2</font>< ...
## [4] <tr bgcolor="#ffff99">\n<td align="center"><font size="2">3</font>< ...
## [5] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">4</font>< ...
## [6] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">5</font>< ...
## [7] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">6</font>< ...
## [8] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">7</font>< ...
## [9] <tr bgcolor="#ffff99">\n<td align="center"><font size="2">8</font>< ...
## [10] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">9</font>< ...
## [11] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">10</font> ...
## [12] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">11</font> ...
## [13] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">12</font> ...
## [14] <tr bgcolor="#ffff99">\n<td align="center"><font size="2">13</font> ...
## [15] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">14</font> ...
## [16] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">15</font> ...
## [17] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">16</font> ...
## [18] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">17</font> ...
## [19] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">18</font> ...
## [20] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">19</font> ...
## ...
mydf = page_html %>%
html_nodes("table") %>%
extract2(5) %>%
html_table()
| X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TW | LW | Title (Click to view) | Studio | Weekly Gross | % Change | Theater Count / Change | Theater Count / Change | Average | Total Gross | Budget* | Week # |
| 1 | 1 | Mission: Impossible - Fallout | Par. | $52,479,913 | -41.4% | 4,395 | +9 | $11,941 | $141,967,284 | $178 | 2 |
| 2 | N | Disney’s Christopher Robin | BV | $37,589,317 | - | 3,602 | - | $10,436 | $37,589,317 | - | 1 |
| 3 | N | The Spy Who Dumped Me | LGF | $17,960,385 | - | 3,111 | - | $5,773 | $17,960,385 | $40 | 1 |
| 4 | 2 | Mamma Mia! Here We Go Again | Uni. | $15,764,650 | -41.2% | 3,359 | -155 | $4,693 | $98,009,200 | $75 | 3 |
myColNames = c("ThisWeek", "LastWeek", "Title", "Studio", "WeeklyGross", "PctChange", "TheatreCount", "TheatreChange", "Average", "TotalGross", "Budget", "WeekNum")
mydf = page_html %>%
html_nodes("table") %>%
extract2(5) %>%
html_table() %>%
setNames(myColNames)
| ThisWeek | LastWeek | Title | Studio | WeeklyGross | PctChange | TheatreCount | TheatreChange | Average | TotalGross | Budget | WeekNum |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TW | LW | Title (Click to view) | Studio | Weekly Gross | % Change | Theater Count / Change | Theater Count / Change | Average | Total Gross | Budget* | Week # |
| 1 | 1 | Mission: Impossible - Fallout | Par. | $52,479,913 | -41.4% | 4,395 | +9 | $11,941 | $141,967,284 | $178 | 2 |
| 2 | N | Disney’s Christopher Robin | BV | $37,589,317 | - | 3,602 | - | $10,436 | $37,589,317 | - | 1 |
| 3 | N | The Spy Who Dumped Me | LGF | $17,960,385 | - | 3,111 | - | $5,773 | $17,960,385 | $40 | 1 |
| ThisWeek | LastWeek | Title | Studio | WeeklyGross | PctChange | TheatreCount | TheatreChange | Average | TotalGross | Budget | WeekNum | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 95 | 94 | 91 | Filmworker | KL | $388 | -72.3% | 1 | -1 | $388 | $97,004 | - | 13 |
| 96 | 95 | - | Gabriel and the Mountain | Strand | $240 | - | 1 | - | $240 | $15,065 | - | 7 |
| 97 | 96 | - | Marlina the Murderer in Four Acts | Icar. | $98 | - | 3 | - | $33 | $16,133 | - | 6 |
| 98 | TOTAL (96 MOVIES): | TOTAL (96 MOVIES): | TOTAL (96 MOVIES): | TOTAL (96 MOVIES): | $216,340,190 | -11.6% | 43,783 | +4,207 | $4,941 |
mydf = page_html %>%
html_nodes("table") %>%
extract2(5) %>%
html_table() %>%
setNames(myColNames) %>%
filter(row_number()!=1) %>%
filter(row_number()!=n())
| ThisWeek | LastWeek | Title | Studio | WeeklyGross | PctChange | TheatreCount | TheatreChange | Average | TotalGross | Budget | WeekNum |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Mission: Impossible - Fallout | Par. | $52,479,913 | -41.4% | 4,395 | +9 | $11,941 | $141,967,284 | $178 | 2 |
| 2 | N | Disney’s Christopher Robin | BV | $37,589,317 | - | 3,602 | - | $10,436 | $37,589,317 | - | 1 |
| 3 | N | The Spy Who Dumped Me | LGF | $17,960,385 | - | 3,111 | - | $5,773 | $17,960,385 | $40 | 1 |
| 4 | 2 | Mamma Mia! Here We Go Again | Uni. | $15,764,650 | -41.2% | 3,359 | -155 | $4,693 | $98,009,200 | $75 | 3 |
| ThisWeek | LastWeek | Title | Studio | WeeklyGross | PctChange | TheatreCount | TheatreChange | Average | TotalGross | Budget | WeekNum | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 93 | 93 | 69 | En el Septimo Dia (On the Seventh Day) | CGld | $840 | -88.8% | 2 | -3 | $420 | $71,531 | - | 9 |
| 94 | 94 | 91 | Filmworker | KL | $388 | -72.3% | 1 | -1 | $388 | $97,004 | - | 13 |
| 95 | 95 | - | Gabriel and the Mountain | Strand | $240 | - | 1 | - | $240 | $15,065 | - | 7 |
| 96 | 96 | - | Marlina the Murderer in Four Acts | Icar. | $98 | - | 3 | - | $33 | $16,133 | - | 6 |
mydf = page_html %>%
html_nodes("table") %>%
extract2(5) %>%
html_table() %>%
setNames(myColNames) %>%
filter(row_number()!=1) %>%
filter(row_number()!=n()) %>%
mutate(calYear=getYear, calWeek=getWeek)
| ThisWeek | LastWeek | Title | Studio | WeeklyGross | PctChange | TheatreCount | TheatreChange | Average | TotalGross | Budget | WeekNum | calYear | calWeek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Mission: Impossible - Fallout | Par. | $52,479,913 | -41.4% | 4,395 | +9 | $11,941 | $141,967,284 | $178 | 2 | 2018 | 31 |
| 2 | N | Disney’s Christopher Robin | BV | $37,589,317 | - | 3,602 | - | $10,436 | $37,589,317 | - | 1 | 2018 | 31 |
tryCatch(
{
page_html %>%
html_nodes("table") %>%
extract2(5) %>%
html_table() %>%
setNames(myColNames) %>%
filter(row_number()!=1) %>%
filter(row_number()!=n()) %>%
mutate(calYear=getYear, calWeek=getWeek)
}
,
error=function(e) return(NULL)
)
getWeeklyBoxOffice = function(theYear, theWeek) {
if (theYear <= year(now()) && theWeek <= week(now())-2) {
tryCatch(
{
base_url = "https://www.boxofficemojo.com/weekly/chart/"
query_params = list(yr=theYear, wk=theWeek)
myResp = GET(url = base_url, query=query_params)
myPage_html = read_html(myResp)
myColNames = c("ThisWeek", "LastWeek", "Title", "Studio", "WeeklyGross", "PctChange", "TheatreCount", "TheatreChange", "Average", "TotalGross", "Budget", "WeekNum")
myWeeklyBoxOffice = myPage_html %>%
html_nodes("table") %>%
extract2(5) %>%
html_table() %>%
setNames(myColNames) %>%
filter(row_number()!=1) %>%
filter(row_number()!=n()) %>%
mutate(calYear=theYear, calWeek=theWeek)
return(myWeeklyBoxOffice)
},
error=function(e) return(NULL)
)
}
else {
return(NULL)
}
}
df1 = getWeeklyBoxOffice("2018", "31")
df2 = getWeeklyBoxOffice("2018", "30")
| ThisWeek | LastWeek | Title | Studio | WeeklyGross | PctChange | TheatreCount | TheatreChange | Average | TotalGross | Budget | WeekNum | calYear | calWeek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Mission: Impossible - Fallout | Par. | $52,479,913 | -41.4% | 4,395 | +9 | $11,941 | $141,967,284 | $178 | 2 | 2018 | 31 |
| 2 | N | Disney’s Christopher Robin | BV | $37,589,317 | - | 3,602 | - | $10,436 | $37,589,317 | - | 1 | 2018 | 31 |
| ThisWeek | LastWeek | Title | Studio | WeeklyGross | PctChange | TheatreCount | TheatreChange | Average | TotalGross | Budget | WeekNum | calYear | calWeek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | N | Mission: Impossible - Fallout | Par. | $89,487,371 | - | 4,386 | - | $20,403 | $89,487,371 | $178 | 1 | 2018 | 30 |
| 2 | 1 | Mamma Mia! Here We Go Again | Uni. | $26,820,025 | -51.6% | 3,514 | +197 | $7,632 | $82,244,550 | $75 | 2 | 2018 | 30 |
We can create another function that passes a date, works out all the weeks in between to today and calls the same getWeeklyBoxOffice() function.
getWeeklyBoxOfficeByDate = function(startDate) {
countWeeks = floor(interval(startDate, now()) / duration(num=1, units="weeks"))
weekDates = ymd(startDate) + weeks(x = seq.int(from = 0, to = countWeeks, by = 1))
yearWeeks = data.frame(yr=year(weekDates), wk=week(weekDates))
mydf = NULL
fulldf = NULL
for (x in 1:nrow(yearWeeks)) {
myYear = yearWeeks[x,1]
myWeek = yearWeeks[x,2]
mydf = getWeeklyBoxOffice(myYear, myWeek)
if(!is.null(mydf)) {
if(!is.null(fulldf)) {
fulldf = rbind(fulldf, mydf)
} else {
fulldf = mydf
}
}
}
return(fulldf)
}
fulldf = getWeeklyBoxOfficeByDate("2018-07-19")
Now that we have a data frame with many weekly results we can apply further exploratory data analysis and plotting to analyse the results.
IMDB 2018A, Box Office Mojo, viewed 14 August 2018, <https://www.boxofficemojo.com>.
IMDB 2018B, Weekly Box Office, Box Office Mojo, viewed 14 August 2018, <https://www.boxofficemojo.com/weekly/chart/?yr=2018&wk=31&p=.htm>
Krotov, V. & Silva, L. 2018, ‘Legality and Ethics of Web Scraping’, Twenty-fourth Americas Conference on Information Systems.
Vargiu, E. & Urru, M. 2012, ‘Exploiting web scraping in a collaborative filtering- based approach to web advertising’, Artificial Intelligence Research, vol. 2, no. 1, p. 44, viewed 14 August 2018, <http://www.sciedu.ca/journal/index.php/air/article/view/1390>.