Introduction

Web scraping is the use of automated tools to copy information from a website rather than doing so manually (Vargiu & Urru 2012). Given the volume, variety and inconsistent formats of information available on the Web, scraping provides an effective way to extract large amounts of data quickly for further analysis, which is why it is a method often preferred by researchers (Krotov & Silva 2018).

Before we get started, there are legal and ethical implications of Web scraping to be aware of. Krotov & Silva (2018) note that, at the time of writing, there are no laws directly prohibiting Web scraping; however, it is still governed by existing legal frameworks covering areas such as copyright infringement, fraud, abuse and breach of contract.

Ethically, there are also individual and organisational privacy concerns and questions of unfair use to consider (Krotov & Silva 2018). As every scraping project is different, it is important to review the legal risks in your jurisdiction before proceeding.

Objective

This guide demonstrates how to extract data from a website in R using the httr and rvest packages. As an example, we will extract weekly movie sales from the boxofficemojo.com website (IMDB 2018A).

The boxofficemojo.com website is owned by IMDB and reports on United States box office sales for movies. For this guide, our goal is to extract the weekly sales table from the URL https://www.boxofficemojo.com/weekly/chart/?yr=2018&wk=31&p=.htm and store it in a data frame to allow further analysis.

Figure 1 – Weekly Box Office results (IMDB 2018B):

First, we install and load the packages we need:

install.packages("httr")
library(httr)

install.packages("rvest")
library(rvest)

install.packages("dplyr")
library(dplyr)

install.packages("magrittr")
library(magrittr)

install.packages("lubridate")
library(lubridate)

install.packages("lubridate")
library(lubridate)

We will use the httr package to build the query. Notice how the URL https://www.boxofficemojo.com/weekly/chart/?yr=2018&wk=31&p=.htm contains a ‘yr’ and a ‘wk’ parameter, referring to the year and the calendar week. You can change the week number in the URL to see a different week’s results without having to click on a different hyperlink. Try changing it.

Being able to adjust these parameters means queries can be generated automatically to return different weekly results. The httr package’s GET() function lets us call a URL and pass a list of query parameters.

getYear = "2018"
getWeek = "31"

base_url = "https://www.boxofficemojo.com/weekly/chart/"
query_params = list(yr=getYear, wk=getWeek)

resp = GET(url=base_url, query=query_params)

resp
## Response [https://www.boxofficemojo.com/weekly/chart/?yr=2018&wk=31]
##   Date: 2018-08-14 12:26
##   Status: 200
##   Content-Type: text/html
##   Size: 91.3 kB
## NA

Notice the status of 200. This means we received a successful response back. A status in the 4XX range indicates a problem with our request (a client-side error), while a status in the 5XX range indicates a problem on the website’s server.
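
If you want the script to fail fast on a bad response rather than carrying on and parsing an error page, you can check the status programmatically. Below is a minimal sketch, assuming the resp object created above; status_code() and stop_for_status() are standard httr helpers.

if (status_code(resp) != 200) {
  stop("Request failed with status ", status_code(resp))
}

# Or let httr raise an informative error for any 4XX/5XX response
stop_for_status(resp)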

Now that we have the response (resp) back from the website, we can parse its contents using the read_html() function available through rvest.

page_html = read_html(resp)

We can write the contents to a file so we can inspect them and work out what to extract.

write_xml(page_html, file="WeeklyBoxOfficeSales.html")

Here’s what the content looks like.

Figure 2 – HTML contents of webpage:

HTML consists of text enclosed within tags, e.g. <title>Weekly Box Office Results</title>.

In rvest, each tag is treated as a node. Nodes can be extracted using the html_nodes() function, passing the name of the tag. For example, calling html_nodes("title") returns all <title> nodes as a list. To extract the text inside a node, the html_text() function can be used. For example, to get the title of the page we can do the following:

page_html %>% 
  html_nodes("title") %>%
  html_text()
## [1] "Weekly Box Office Results for August 3-9, 2018 - Box Office Mojo"

Our objective is to get the table from the Webpage into a data frame. This can be done in a series of steps.

  1. Calling html_nodes("table") extracts all the tables in the HTML document. Notice how several tables are returned; you will need to find which one contains the content you are after.
page_html %>%
  html_nodes("table")
## {xml_nodeset (5)}
## [1] <table border="0" cellspacing="0" cellpadding="0"><tr><form method=" ...
## [2] <table border="0" cellspacing="0" cellpadding="0" width="100%"><tr b ...
## [3] <table border="0" cellpadding="0" cellspacing="0" width="100%"><tr>\ ...
## [4] <table border="0 cellspacing=" cellpadding="0">\n<tr>\n<td width="25 ...
## [5] <table border="0" cellspacing="1" cellpadding="5">\n<tr bgcolor="#dc ...
  2. In our case, the data we need is in table 5. It can be extracted from the list using the magrittr::extract2() function. You can see that a single table is returned with the rows and columns we need.
page_html %>%
  html_nodes("table") %>%
  extract2(5) 
## {xml_node}
## <table border="0" cellspacing="1" cellpadding="5">
##  [1] <tr bgcolor="#dcdcdc">\n<td align="center"><font size="1"><a href=" ...
##  [2] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">1</font>< ...
##  [3] <tr bgcolor="#ffff99">\n<td align="center"><font size="2">2</font>< ...
##  [4] <tr bgcolor="#ffff99">\n<td align="center"><font size="2">3</font>< ...
##  [5] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">4</font>< ...
##  [6] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">5</font>< ...
##  [7] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">6</font>< ...
##  [8] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">7</font>< ...
##  [9] <tr bgcolor="#ffff99">\n<td align="center"><font size="2">8</font>< ...
## [10] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">9</font>< ...
## [11] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">10</font> ...
## [12] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">11</font> ...
## [13] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">12</font> ...
## [14] <tr bgcolor="#ffff99">\n<td align="center"><font size="2">13</font> ...
## [15] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">14</font> ...
## [16] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">15</font> ...
## [17] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">16</font> ...
## [18] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">17</font> ...
## [19] <tr bgcolor="#f4f4ff">\n<td align="center"><font size="2">18</font> ...
## [20] <tr bgcolor="#ffffff">\n<td align="center"><font size="2">19</font> ...
## ...
  3. The html_table() function converts the HTML table into a more useful data frame; however, we are not quite done yet.
mydf = page_html %>%
  html_nodes("table") %>%
  extract2(5) %>% 
  html_table()
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
TW LW Title (Click to view) Studio Weekly Gross % Change Theater Count / Change Theater Count / Change Average Total Gross Budget* Week #
1 1 Mission: Impossible - Fallout Par. $52,479,913 -41.4% 4,395 +9 $11,941 $141,967,284 $178 2
2 N Disney’s Christopher Robin BV $37,589,317 - 3,602 - $10,436 $37,589,317 - 1
3 N The Spy Who Dumped Me LGF $17,960,385 - 3,111 - $5,773 $17,960,385 $40 1
4 2 Mamma Mia! Here We Go Again Uni. $15,764,650 -41.2% 3,359 -155 $4,693 $98,009,200 $75 3
  4. We can use setNames() to give the columns of our data frame meaningful names.
myColNames = c("ThisWeek", "LastWeek", "Title", "Studio", "WeeklyGross", "PctChange", "TheatreCount", "TheatreChange", "Average", "TotalGross", "Budget", "WeekNum")

mydf = page_html %>%
  html_nodes("table") %>%
  extract2(5) %>% 
  html_table() %>%
  setNames(myColNames)
First 4 rows:
ThisWeek LastWeek Title Studio WeeklyGross PctChange TheatreCount TheatreChange Average TotalGross Budget WeekNum
TW LW Title (Click to view) Studio Weekly Gross % Change Theater Count / Change Theater Count / Change Average Total Gross Budget* Week #
1 1 Mission: Impossible - Fallout Par. $52,479,913 -41.4% 4,395 +9 $11,941 $141,967,284 $178 2
2 N Disney’s Christopher Robin BV $37,589,317 - 3,602 - $10,436 $37,589,317 - 1
3 N The Spy Who Dumped Me LGF $17,960,385 - 3,111 - $5,773 $17,960,385 $40 1
Last 4 rows:
ThisWeek LastWeek Title Studio WeeklyGross PctChange TheatreCount TheatreChange Average TotalGross Budget WeekNum
95 94 91 Filmworker KL $388 -72.3% 1 -1 $388 $97,004 - 13
96 95 - Gabriel and the Mountain Strand $240 - 1 - $240 $15,065 - 7
97 96 - Marlina the Murderer in Four Acts Icar. $98 - 3 - $33 $16,133 - 6
98 TOTAL (96 MOVIES): TOTAL (96 MOVIES): TOTAL (96 MOVIES): TOTAL (96 MOVIES): $216,340,190 -11.6% 43,783 +4,207 $4,941
  5. Notice how the first row still contains the column names and the last row holds the totals. We can filter these out with filter(row_number() != 1) and filter(row_number() != n()).
mydf = page_html %>%
  html_nodes("table") %>%
  extract2(5) %>% 
  html_table() %>%
  setNames(myColNames) %>%
  filter(row_number()!=1) %>% 
  filter(row_number()!=n())
First 4 rows:
ThisWeek LastWeek Title Studio WeeklyGross PctChange TheatreCount TheatreChange Average TotalGross Budget WeekNum
1 1 Mission: Impossible - Fallout Par. $52,479,913 -41.4% 4,395 +9 $11,941 $141,967,284 $178 2
2 N Disney’s Christopher Robin BV $37,589,317 - 3,602 - $10,436 $37,589,317 - 1
3 N The Spy Who Dumped Me LGF $17,960,385 - 3,111 - $5,773 $17,960,385 $40 1
4 2 Mamma Mia! Here We Go Again Uni. $15,764,650 -41.2% 3,359 -155 $4,693 $98,009,200 $75 3
Last 4 rows:
ThisWeek LastWeek Title Studio WeeklyGross PctChange TheatreCount TheatreChange Average TotalGross Budget WeekNum
93 93 69 En el Septimo Dia (On the Seventh Day) CGld $840 -88.8% 2 -3 $420 $71,531 - 9
94 94 91 Filmworker KL $388 -72.3% 1 -1 $388 $97,004 - 13
95 95 - Gabriel and the Mountain Strand $240 - 1 - $240 $15,065 - 7
96 96 - Marlina the Murderer in Four Acts Icar. $98 - 3 - $33 $16,133 - 6
  6. That’s better. We should also record which year and week we requested by adding columns with mutate(calYear = getYear, calWeek = getWeek).
mydf = page_html %>%
  html_nodes("table") %>%
  extract2(5) %>% 
  html_table() %>%
  setNames(myColNames) %>%
  filter(row_number()!=1) %>% 
  filter(row_number()!=n()) %>%
  mutate(calYear=getYear, calWeek=getWeek)
ThisWeek LastWeek Title Studio WeeklyGross PctChange TheatreCount TheatreChange Average TotalGross Budget WeekNum calYear calWeek
1 1 Mission: Impossible - Fallout Par. $52,479,913 -41.4% 4,395 +9 $11,941 $141,967,284 $178 2 2018 31
2 N Disney’s Christopher Robin BV $37,589,317 - 3,602 - $10,436 $37,589,317 - 1 2018 31
  7. Brilliant. Now to make it a little more robust. If you specify a week for which there are no results, you need a way to stop the code from failing. The tryCatch() function catches errors gracefully and lets you return a value when an error occurs; here we will return NULL on error.
  tryCatch(
    {
      page_html %>%
        html_nodes("table") %>%
        extract2(5) %>% 
        html_table() %>%
        setNames(myColNames) %>%
        filter(row_number()!=1) %>% 
        filter(row_number()!=n()) %>%
        mutate(calYear=getYear, calWeek=getWeek)
    }
    ,
    error=function(e) return(NULL)
  )
  8. Almost there. Now we can turn this into a function that takes the year and week, making it easier to get a data frame of results for any week.
getWeeklyBoxOffice = function(theYear, theWeek) {
  
  theYear = as.numeric(theYear)
  theWeek = as.numeric(theWeek)
  
  # Only request weeks that have finished reporting: any week from a previous
  # year, or a week at least two weeks in the past within the current year.
  if (theYear < year(now()) || (theYear == year(now()) && theWeek <= week(now()) - 2)) {
  
    tryCatch(
      {  
        # Build and send the request for the requested year and week
        base_url = "https://www.boxofficemojo.com/weekly/chart/"
        query_params = list(yr=theYear, wk=theWeek)
        
        myResp = GET(url = base_url, query=query_params)
        
        myPage_html = read_html(myResp)
        
        myColNames = c("ThisWeek", "LastWeek", "Title", "Studio", "WeeklyGross", "PctChange", "TheatreCount", "TheatreChange", "Average", "TotalGross", "Budget", "WeekNum")
        
        # Extract table 5, name the columns, drop the header and totals rows,
        # and record which year and week was requested
        myWeeklyBoxOffice = myPage_html %>%
          html_nodes("table") %>%
          extract2(5) %>%
          html_table() %>%
          setNames(myColNames) %>%
          filter(row_number()!=1) %>% 
          filter(row_number()!=n()) %>%
          mutate(calYear=theYear, calWeek=theWeek)
        
        return(myWeeklyBoxOffice)
      },
      error=function(e) return(NULL)
    )
  }
  else {
    return(NULL) 
  }
}
         
df1 = getWeeklyBoxOffice("2018", "31")
df2 = getWeeklyBoxOffice("2018", "30")
Week 31 results:
ThisWeek LastWeek Title Studio WeeklyGross PctChange TheatreCount TheatreChange Average TotalGross Budget WeekNum calYear calWeek
1 1 Mission: Impossible - Fallout Par. $52,479,913 -41.4% 4,395 +9 $11,941 $141,967,284 $178 2 2018 31
2 N Disney’s Christopher Robin BV $37,589,317 - 3,602 - $10,436 $37,589,317 - 1 2018 31
Week 30 results:
ThisWeek LastWeek Title Studio WeeklyGross PctChange TheatreCount TheatreChange Average TotalGross Budget WeekNum calYear calWeek
1 N Mission: Impossible - Fallout Par. $89,487,371 - 4,386 - $20,403 $89,487,371 $178 1 2018 30
2 1 Mamma Mia! Here We Go Again Uni. $26,820,025 -51.6% 3,514 +197 $7,632 $82,244,550 $75 2 2018 30
  9. Lastly, we can do even better. The previous step works well for one week at a time, but what if we want all the weekly sales going back a couple of months?

We can create another function that takes a start date, works out all the weeks between that date and today, and calls the same getWeeklyBoxOffice() function for each of them.

getWeeklyBoxOfficeByDate = function(startDate) {
  
  # Parse the start date once so the calculations below use a proper Date
  startDate = ymd(startDate)
  
  # Number of whole weeks between the start date and today
  countWeeks = floor(interval(startDate, now()) / duration(num=1, units="weeks"))
  
  # One date per week from the start date up to today
  weekDates = startDate + weeks(x = seq.int(from = 0, to = countWeeks, by = 1))
  
  yearWeeks = data.frame(yr=year(weekDates), wk=week(weekDates))
  
  # Fetch each week in turn and bind the results together, skipping any week
  # that returned NULL
  mydf = NULL
  fulldf = NULL
  for (x in 1:nrow(yearWeeks)) {
    myYear = yearWeeks[x,1]
    myWeek = yearWeeks[x,2]
    
    mydf = getWeeklyBoxOffice(myYear, myWeek)
    if(!is.null(mydf)) {
      if(!is.null(fulldf)) {
        fulldf = rbind(fulldf, mydf)  
      } else {
        fulldf = mydf
      }
    }
  }
  return(fulldf)  
}
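
One small courtesy worth adding (our suggestion, not something the website requires): the loop above sends one request per week, so pausing briefly between iterations keeps the load on the site light. For example, inside the for loop:

    mydf = getWeeklyBoxOffice(myYear, myWeek)
    Sys.sleep(1)  # pause roughly one second between requests; the delay length is an arbitrary choice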

fulldf = getWeeklyBoxOfficeByDate("2018-07-19")

Now that we have a data frame with many weekly results we can apply further exploratory data analysis and plotting to analyse the results.
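
As a small illustration of that next step (a sketch of ours, not part of the scrape itself): the dollar columns are still character strings such as "$52,479,913", so they need cleaning before any numeric analysis. The names WeeklyGrossNum and weeklyTotals below are illustrative names we introduce here; everything else uses the packages already loaded.

# Convert WeeklyGross from a character string to a number by stripping "$" and ","
fulldf = fulldf %>%
  mutate(WeeklyGrossNum = as.numeric(gsub("[$,]", "", WeeklyGross)))

# Total gross per calendar week across all movies
weeklyTotals = fulldf %>%
  group_by(calYear, calWeek) %>%
  summarise(TotalWeeklyGross = sum(WeeklyGrossNum, na.rm = TRUE))

# Simple base R plot of the weekly totals (assumes a single calendar year for simplicity)
plot(weeklyTotals$calWeek, weeklyTotals$TotalWeeklyGross, type = "b",
     xlab = "Calendar week", ylab = "Total weekly gross (USD)",
     main = "Weekly box office totals")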

References

IMDB 2018A, Box Office Mojo, viewed 14 August 2018, <https://www.boxofficemojo.com>.

IMDB 2018B, Weekly Box Office, Box Office Mojo, viewed 14 August 2018, <https://www.boxofficemojo.com/weekly/chart/?yr=2018&wk=31&p=.htm>.

Krotov, V. & Silva, L. 2018, ‘Legality and Ethics of Web Scraping’, Twenty-fourth Americas Conference on Information Systems.

Vargiu, E. & Urru, M. 2012, ‘Exploiting web scraping in a collaborative filtering- based approach to web advertising’, Artificial Intelligence Research, vol. 2, no. 1, p. 44, viewed 14 August 2018, <http://www.sciedu.ca/journal/index.php/air/article/view/1390>.