Introduction

Web scraping is the use of automated tools to copy information from a website rather than doing so manually (Vargiu & Urru 2012). Given the volume, variety and inconsistent formats of information available on the Web, scraping provides an effective way to extract large amounts of data quickly for further analysis, which is why it is a method often preferred by researchers (Krotov & Silva 2018).

Before we get started, there are legal and ethical implications of Web scraping to be aware of. Krotov & Silva (2018) note that, at the time of writing, there are no laws directly prohibiting Web scraping; however, it is still governed by existing legal frameworks covering areas such as copyright infringement, fraud, abuse and breach of contract.

Ethically, there are also individual and organisational privacy concerns and questions of unfair use to consider (Krotov & Silva 2018). As every scraping project is different, it is important to review the legal risks in your jurisdiction before proceeding.

Objective

This guide demonstrates how to extract data from a website in R using the httr and rvest packages. As an example, we will extract weekly movie sales from the boxofficemojo.com website (IMDB 2018A).

The boxofficemojo.com website is owned by IMDB and reports on United States box office sales for movies. For this guide, our goal is to extract the weekly sales table from the URL https://www.boxofficemojo.com/weekly/chart/?yr=2018&wk=31&p=.htm and store it in a data frame to allow further analysis.

Figure 1 – Weekly Box Office results (IMDB 2018B):

First, we install and load the packages we need:

install.packages("httr")
library(httr)

install.packages("rvest")
library(rvest)

install.packages("dplyr")
library(dplyr)

install.packages("magrittr")
library(magrittr)

install.packages("lubridate")
library(lubridate)

install.packages("lubridate")
library(lubridate)

We will use the httr package to build the query. Notice how the URL https://www.boxofficemojo.com/weekly/chart/?yr=2018&wk=31&p=.htm contains a ‘yr’ and a ‘wk’ parameter, referring to the year and the calendar week. You can change the week number in the URL to see a different week’s results without having to click on a different hyperlink. Try changing it.

Being able to adjust these parameters means queries can be generated automatically to return different weekly results. The httr package’s GET() function lets us call a URL and pass a list of query parameters.

getYear = "2018"
getWeek = "31"

base_url = "https://www.boxofficemojo.com/weekly/chart/"
query_params = list(yr=getYear, wk=getWeek)

resp = GET(url=base_url, query=query_params)

resp
## Response [https://www.boxofficemojo.com/weekly/chart/?yr=2018&wk=31]
##   Date: 2018-08-14 12:26
##   Status: 200
##   Content-Type: text/html
##   Size: 91.3 kB
## NA

Notice the status of 200. This means we received a successful response back. A status in the 4XX range indicates a problem with our request (a client-side error), while a status in the 5XX range indicates a problem on the website’s server.
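
If you want the script to fail fast on a bad response rather than carrying on and parsing an error page, you can check the status programmatically. Below is a minimal sketch, assuming the resp object created above; status_code() and stop_for_status() are standard httr helpers.

if (status_code(resp) != 200) {
  stop("Request failed with status ", status_code(resp))
}

# Or let httr raise an informative error for any 4XX/5XX response
stop_for_status(resp)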

Now that we have the response (resp) back from the website, we can parse its contents using the read_html() function available through rvest.

page_html = read_html(resp)

We can write the contents to a file so we can inspect them and work out what to extract.

write_xml(page_html, file="WeeklyBoxOfficeSales.html")

Here’s what the content looks like.

Figure 2 – HTML contents of webpage:

HTML consists of text enclosed within tags, e.g. <title>Weekly Box Office Results</title>.

In rvest, each tag is treated as a node. Nodes can be extracted using the html_nodes() function, passing the name of the tag. For example, calling html_nodes("title") returns all <title> nodes as a list. To extract the text inside a node, the html_text() function can be used. For example, to get the title of the page we can do the following:

page_html %>% 
  html_nodes("title") %>%
  html_text()
## [1] "Weekly Box Office Results for August 3-9, 2018 - Box Office Mojo"

Our objective is to get the table from the Webpage into a data frame. This can be done in a series of steps.

  1. Calling html_nodes("table") extracts all the tables in the HTML document. Notice how several tables are returned; you will need to find which one contains the content you are after.
page_html %>%
  html_nodes("table")
## {xml_nodeset (5)}
## [1] <table border="0" cellspacing="0" cellpadding="0"><tr><form method=" ...
## [2] <table border="0" cellspacing="0" cellpadding="0" width="100%"><tr b ...
## [3] <table border="0" cellpadding="0" cellspacing="0" width="100%"><tr>\ ...
## [4] <table border="0 cellspacing=" cellpadding="0">\n<tr>\n<td width="25 ...
## [5] <table border="0" cellspacing="1" cellpadding="5">\n<tr bgcolor="#dc ...
  2. In our case, the data we need is in table 5. It can be extracted from the list using the magrittr::extract2() function. You can see that a single table is returned with the rows and columns we need.
page_html %>%
  html_nodes("table") %>%
  extract2(5) 
## {xml_node}
## <table border="0" cellspacing="1" cellpadding="5">
##  [1] <tr bgcolor="#dcdcdc">\n<td align="center"><font size="1"><a href=" ...
##  [2] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">1</font>< ...
##  [3] <tr bgcolor="#ffff99">\n<td align="center"><font size="2">2</font>< ...
##  [4] <tr bgcolor="#ffff99">\n<td align="center"><font size="2">3</font>< ...
##  [5] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">4</font>< ...
##  [6] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">5</font>< ...
##  [7] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">6</font>< ...
##  [8] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">7</font>< ...
##  [9] <tr bgcolor="#ffff99">\n<td align="center"><font size="2">8</font>< ...
## [10] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">9</font>< ...
## [11] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">10</font> ...
## [12] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">11</font> ...
## [13] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">12</font> ...
## [14] <tr bgcolor="#ffff99">\n<td align="center"><font size="2">13</font> ...
## [15] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">14</font> ...
## [16] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">15</font> ...
## [17] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">16</font> ...
## [18] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">17</font> ...
## [19] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">18</font> ...
## [20] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">19</font> ...
## ...
  3. The html_table() function converts the HTML table into a more useful data frame; however, we are not quite done yet.
mydf = page_html %>%
  html_nodes("table") %>%
  extract2(5) %>% 
  html_table()
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
TW LW Title (Click to view) Studio Weekly Gross % Change Theater Count / Change Theater Count / Change Average Total Gross Budget* Week #
1 1 Mission: Impossible - Fallout Par. $52,479,913 -41.4% 4,395 +9 $11,941 $141,967,284 $178 2
2 N Disney’s Christopher Robin BV $37,589,317 - 3,602 - $10,436 $37,589,317 - 1
3 N The Spy Who Dumped Me LGF $17,960,385 - 3,111 - $5,773 $17,960,385 $40 1
4 2 Mamma Mia! Here We Go Again Uni. $15,764,650 -41.2% 3,359 -155 $4,693 $98,009,200 $75 3
  4. We can use setNames() to give the columns of our data frame meaningful names.
myColNames = c("ThisWeek", "LastWeek", "Title", "Studio", "WeeklyGross", "PctChange", "TheatreCount", "TheatreChange", "Average", "TotalGross", "Budget", "WeekNum")

mydf = page_html %>%
  html_nodes("table") %>%
  extract2(5) %>% 
  html_table() %>%
  setNames(myColNames)
First 4 rows:
ThisWeek LastWeek Title Studio WeeklyGross PctChange TheatreCount TheatreChange Average TotalGross Budget WeekNum
TW LW Title (Click to view) Studio Weekly Gross % Change Theater Count / Change Theater Count / Change Average Total Gross Budget* Week #
1 1 Mission: Impossible - Fallout Par. $52,479,913 -41.4% 4,395 +9 $11,941 $141,967,284 $178 2
2 N Disney’s Christopher Robin BV $37,589,317 - 3,602 - $10,436 $37,589,317 - 1
3 N The Spy Who Dumped Me LGF $17,960,385 - 3,111 - $5,773 $17,960,385 $40 1
Last 4 rows:
ThisWeek LastWeek Title Studio WeeklyGross PctChange TheatreCount TheatreChange Average TotalGross Budget WeekNum
95 94 91 Filmworker KL $388 -72.3% 1 -1 $388 $97,004 - 13
96 95 - Gabriel and the Mountain Strand $240 - 1 - $240 $15,065 - 7
97 96 - Marlina the Murderer in Four Acts Icar. $98 - 3 - $33 $16,133 - 6
98 TOTAL (96 MOVIES): TOTAL (96 MOVIES): TOTAL (96 MOVIES): TOTAL (96 MOVIES): $216,340,190 -11.6% 43,783 +4,207 $4,941
  5. Notice how the first row still contains the column names and the last row holds the totals. We can filter these out with filter(row_number() != 1) and filter(row_number() != n()).
mydf = page_html %>%
  html_nodes("table") %>%
  extract2(5) %>% 
  html_table() %>%
  setNames(myColNames) %>%
  filter(row_number()!=1) %>% 
  filter(row_number()!=n())
First 4 rows:
ThisWeek LastWeek Title Studio WeeklyGross PctChange TheatreCount TheatreChange Average TotalGross Budget WeekNum
1 1 Mission: Impossible - Fallout Par. $52,479,913 -41.4% 4,395 +9 $11,941 $141,967,284 $178 2
2 N Disney’s Christopher Robin BV $37,589,317 - 3,602 - $10,436 $37,589,317 - 1
3 N The Spy Who Dumped Me LGF $17,960,385 - 3,111 - $5,773 $17,960,385 $40 1
4 2 Mamma Mia! Here We Go Again Uni. $15,764,650 -41.2% 3,359 -155 $4,693 $98,009,200 $75 3
Last 4 rows:
ThisWeek LastWeek Title Studio WeeklyGross PctChange TheatreCount TheatreChange Average TotalGross Budget WeekNum
93 93 69 En el Septimo Dia (On the Seventh Day) CGld $840 -88.8% 2 -3 $420 $71,531 - 9
94 94 91 Filmworker KL $388 -72.3% 1 -1 $388 $97,004 - 13
95 95 - Gabriel and the Mountain Strand $240 - 1 - $240 $15,065 - 7
96 96 - Marlina the Murderer in Four Acts Icar. $98 - 3 - $33 $16,133 - 6
  6. That’s better. We should also record which year and week we requested by adding columns with mutate(calYear = getYear, calWeek = getWeek).
mydf = page_html %>%
  html_nodes("table") %>%
  extract2(5) %>% 
  html_table() %>%
  setNames(myColNames) %>%
  filter(row_number()!=1) %>% 
  filter(row_number()!=n()) %>%
  mutate(calYear=getYear, calWeek=getWeek)
ThisWeek LastWeek Title Studio WeeklyGross PctChange TheatreCount TheatreChange Average TotalGross Budget WeekNum calYear calWeek
1 1 Mission: Impossible - Fallout Par. $52,479,913 -41.4% 4,395 +9 $11,941 $141,967,284 $178 2 2018 31
2 N Disney’s Christopher Robin BV $37,589,317 - 3,602 - $10,436 $37,589,317 - 1 2018 31
  7. Brilliant. Now to make it a little more robust. If you specify a week for which there are no results, you need a way to stop the code from failing. The tryCatch() function catches errors gracefully and lets you return a value when an error occurs; here we will return NULL on error.
  tryCatch(
    {
      page_html %>%
        html_nodes("table") %>%
        extract2(5) %>% 
        html_table() %>%
        setNames(myColNames) %>%
        filter(row_number()!=1) %>% 
        filter(row_number()!=n()) %>%
        mutate(calYear=getYear, calWeek=getWeek)
    }
    ,
    error=function(e) return(NULL)
  )
  8. Almost there. Now we can turn this into a function that takes the year and week, making it easier to get a data frame of results for any week.
getWeeklyBoxOffice = function(theYear, theWeek) {
  
  theYear = as.numeric(theYear)
  theWeek = as.numeric(theWeek)
  
  # Only request weeks that have finished reporting: any week from a previous
  # year, or a week at least two weeks in the past within the current year.
  if (theYear < year(now()) || (theYear == year(now()) && theWeek <= week(now()) - 2)) {
  
    tryCatch(
      {  
        # Build and send the request for the requested year and week
        base_url = "https://www.boxofficemojo.com/weekly/chart/"
        query_params = list(yr=theYear, wk=theWeek)
        
        myResp = GET(url = base_url, query=query_params)
        
        myPage_html = read_html(myResp)
        
        myColNames = c("ThisWeek", "LastWeek", "Title", "Studio", "WeeklyGross", "PctChange", "TheatreCount", "TheatreChange", "Average", "TotalGross", "Budget", "WeekNum")
        
        # Extract table 5, name the columns, drop the header and totals rows,
        # and record which year and week was requested
        myWeeklyBoxOffice = myPage_html %>%
          html_nodes("table") %>%
          extract2(5) %>%
          html_table() %>%
          setNames(myColNames) %>%
          filter(row_number()!=1) %>% 
          filter(row_number()!=n()) %>%
          mutate(calYear=theYear, calWeek=theWeek)
        
        return(myWeeklyBoxOffice)
      },
      error=function(e) return(NULL)
    )
  }
  else {
    return(NULL) 
  }
}
         
df1 = getWeeklyBoxOffice("2018", "31")
df2 = getWeeklyBoxOffice("2018", "30")
Week 31 results:
ThisWeek LastWeek Title Studio WeeklyGross PctChange TheatreCount TheatreChange Average TotalGross Budget WeekNum calYear calWeek
1 1 Mission: Impossible - Fallout Par. $52,479,913 -41.4% 4,395 +9 $11,941 $141,967,284 $178 2 2018 31
2 N Disney’s Christopher Robin BV $37,589,317 - 3,602 - $10,436 $37,589,317 - 1 2018 31
Week 30 results:
ThisWeek LastWeek Title Studio WeeklyGross PctChange TheatreCount TheatreChange Average TotalGross Budget WeekNum calYear calWeek
1 N Mission: Impossible - Fallout Par. $89,487,371 - 4,386 - $20,403 $89,487,371 $178 1 2018 30
2 1 Mamma Mia! Here We Go Again Uni. $26,820,025 -51.6% 3,514 +197 $7,632 $82,244,550 $75 2 2018 30
  9. Lastly, we can do even better. The previous step works well for one week at a time, but what if we want all the weekly sales going back a couple of months?

We can create another function that takes a start date, works out all the weeks between that date and today, and calls the same getWeeklyBoxOffice() function for each of them.

getWeeklyBoxOfficeByDate = function(startDate) {
  
  # Parse the start date once so the calculations below use a proper Date
  startDate = ymd(startDate)
  
  # Number of whole weeks between the start date and today
  countWeeks = floor(interval(startDate, now()) / duration(num=1, units="weeks"))
  
  # One date per week from the start date up to today
  weekDates = startDate + weeks(x = seq.int(from = 0, to = countWeeks, by = 1))
  
  yearWeeks = data.frame(yr=year(weekDates), wk=week(weekDates))
  
  # Fetch each week in turn and bind the results together, skipping any week
  # that returned NULL
  mydf = NULL
  fulldf = NULL
  for (x in 1:nrow(yearWeeks)) {
    myYear = yearWeeks[x,1]
    myWeek = yearWeeks[x,2]
    
    mydf = getWeeklyBoxOffice(myYear, myWeek)
    if(!is.null(mydf)) {
      if(!is.null(fulldf)) {
        fulldf = rbind(fulldf, mydf)  
      } else {
        fulldf = mydf
      }
    }
  }
  return(fulldf)  
}
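
One small courtesy worth adding (our suggestion, not something the website requires): the loop above sends one request per week, so pausing briefly between iterations keeps the load on the site light. For example, inside the for loop:

    mydf = getWeeklyBoxOffice(myYear, myWeek)
    Sys.sleep(1)  # pause roughly one second between requests; the delay length is an arbitrary choice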

fulldf = getWeeklyBoxOfficeByDate("2018-07-19")

Now that we have a data frame with many weekly results we can apply further exploratory data analysis and plotting to analyse the results.
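
As a small illustration of that next step (a sketch of ours, not part of the scrape itself): the dollar columns are still character strings such as "$52,479,913", so they need cleaning before any numeric analysis. The names WeeklyGrossNum and weeklyTotals below are illustrative names we introduce here; everything else uses the packages already loaded.

# Convert WeeklyGross from a character string to a number by stripping "$" and ","
fulldf = fulldf %>%
  mutate(WeeklyGrossNum = as.numeric(gsub("[$,]", "", WeeklyGross)))

# Total gross per calendar week across all movies
weeklyTotals = fulldf %>%
  group_by(calYear, calWeek) %>%
  summarise(TotalWeeklyGross = sum(WeeklyGrossNum, na.rm = TRUE))

# Simple base R plot of the weekly totals (assumes a single calendar year for simplicity)
plot(weeklyTotals$calWeek, weeklyTotals$TotalWeeklyGross, type = "b",
     xlab = "Calendar week", ylab = "Total weekly gross (USD)",
     main = "Weekly box office totals")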

References

IMDB 2018A, Box Office Mojo, viewed 14 August 2018, <https://www.boxofficemojo.com>.

IMDB 2018B, Weekly Box Office, Box Office Mojo, viewed 14 August 2018, <https://www.boxofficemojo.com/weekly/chart/?yr=2018&wk=31&p=.htm>.

Krotov, V. & Silva, L. 2018, ‘Legality and Ethics of Web Scraping’, Twenty-fourth Americas Conference on Information Systems.

Vargiu, E. & Urru, M. 2012, ‘Exploiting web scraping in a collaborative filtering- based approach to web advertising’, Artificial Intelligence Research, vol. 2, no. 1, p. 44, viewed 14 August 2018, <http://www.sciedu.ca/journal/index.php/air/article/view/1390>.