Overview/Usefulness

There has never been a time when information has been more readily available online. The amount of data on the World Wide Web has grown exponentially over the past decade and shows no sign of slowing. Yet while online information is clearly abundant, accessing it is not always a simple endeavor. This tutorial is designed to help those who need access to online information by demonstrating how to extract data from webpages via web scraping, using the R programming language and the rvest package.

Web scraping is the process of extracting large amounts of data from resources located on the World Wide Web and storing it on the scraper’s computer or in a database. Many businesses and organizations across the globe use this technique to maintain a competitive advantage, increase revenue, or keep a working knowledge of what their competition is doing. Governments use web scraping for similar competitor-style analyses, as well as to gain insight into issues facing their citizens through social media; applications also extend to procurement research in the acquisitions process used by military agencies. Government, however, is not the only entity that benefits from web scraping. Industry examples include companies gathering email addresses to bolster lead generation, learning what competitors are selling in order to offer similar or identical products, inspecting competitor prices, and scraping social media websites to learn what’s trending. Web scraping is, in most circumstances, straightforward in concept, but it presents a number of practical challenges in execution.

Prereqs

The prerequisites for web scraping with R consist of an R package and a Google Chrome extension. This tutorial will utilize the rvest package authored by Hadley Wickham; rvest is designed to help users scrape information from webpages. SelectorGadget is a Google Chrome extension that allows a user to easily extract CSS selector nodes from HTML webpages. To download the extension, click here. SelectorGadget is not the sole means of accessing a webpage’s HTML; the following section will also cover methods to examine HTML webpage data through the browser’s developer tools. In addition to the rvest package, several others are utilized for secondary operations such as data cleaning.

library(rvest)      # web scraping: read_html(), html_nodes(), html_text(), html_attr()
library(tidyverse)  # data manipulation and the %>% pipe
library(stringr)    # string cleaning: str_replace_all(), str_trim()
library(knitr)      # kable() for printing tables

HTML Overview

This section covers the foundation of scraping website data from a single webpage. Specifically, it illustrates a basic method of extracting specified elements of information embedded within a webpage, with an explicit focus on extracting data from HTML websites. To begin, it is helpful to explain concisely how HTML webpages are typically arranged. HTML layouts are controlled by Cascading Style Sheets (CSS) instructions that are referenced from or embedded in the HTML. CSS is a web style sheet language used to describe the presentation of a document written in a markup language, and it is used by many websites to deliver visually engaging webpages and user interfaces for both web and mobile applications. CSS separates the presentation aspects of a webpage from its content, which permits website developers to maintain consistent themes across multiple webpages while changing the content of each page.

This structure is governed by a set of rules, housed within each style sheet, each made up of one or more selectors. CSS selectors define which parts of the HTML a style applies to by matching tags and attributes in the markup itself. Selectors can be applied to an entire HTML document as well as to specific components such as headings; for example, the main heading is tagged h1, sub-headings h2, and sub-sub-headings h3. HTML elements are written with a start tag identifying the section, the content, and an end tag identifying the close of the section. The start tag identifier is housed between the < and > symbols, with the content following directly after, and the end tag identifier is housed between the </ and > symbols. An example of an HTML heading element is:

<h1>
Chapter 1: Putting the Bae in Bayesian Statistics
</h1>

Some of the most commonly encountered tags referenced by CSS selectors include <h1> through <h6> (headings), <p> (paragraphs), <a> (links), <div> (divisions or sections), <span> (inline containers), <table> (tables), and <img> (images).

Additional information regarding HTML elements can be found here.
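
Before moving to a real website, a short sketch can make the link between these selectors and scraping concrete. The HTML snippet and the summary class name below are invented purely for illustration; rvest can parse an HTML string directly with read_html() and then query it with tag and class selectors:

library(rvest)

# A tiny, made-up HTML document parsed from a string
page <- read_html('
  <html><body>
    <h1>Chapter 1: Putting the Bae in Bayesian Statistics</h1>
    <p class="summary">A short plot summary.</p>
    <p>Some other paragraph we do not care about.</p>
  </body></html>
')

# Tag selector: matches the <h1> element
page %>% html_node("h1") %>% html_text()
## [1] "Chapter 1: Putting the Bae in Bayesian Statistics"

# Class selector: a leading "." matches elements with that class attribute
page %>% html_node(".summary") %>% html_text()
## [1] "A short plot summary."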

While implementing the aforementioned CSS selectors in a web scraping tool facilitates the collection of data across basic HTML structures, it does not guarantee information retrieval for more complex HTML webpages or for more focused extraction. Websites that include unique features, consumer ads, or more dynamic behavior may employ CSS selectors that require a finer level of identification to extract the content. In cases such as these, the user may need to open the browser’s developer tools to perform a more detailed examination of the CSS elements that need to be identified in the HTML. To do this, press F12 (Cmd + Opt + I on a Mac) in Chrome or Firefox; in Safari, use Cmd + Opt + I. To demonstrate, we’ll look at this webpage. The developer tools may look a little overwhelming at first, but we only need to focus on a few aspects, which we discuss next.


The Elements panel will be the primary focus when using the developer tools for scraping. It is usually preselected when the developer tools open.


As the cursor is moved over different elements of the webpage source in the developer tools, the corresponding elements on the webpage become highlighted.


Locating the element in the developer tools that corresponds to the movie summary shows that the data within that <div> node can be extracted by referencing the CSS selector .summary_text, as the short sketch below demonstrates. The process of locating each node in an HTML file can be a lengthy and arduous task depending on the complexity of the webpage; luckily, there are options at a web scraper’s disposal that make this task much simpler and faster.
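
As a hedged illustration only (the URL is an assumption; it is one of the IMDb title pages visited later in the follow-along), the selector identified in the developer tools can be passed straight to rvest:

# Illustrative IMDb title page; .summary_text is the selector found above
movie_page <- read_html("http://www.imdb.com/title/tt0111161/")

movie_page %>% 
  html_node(".summary_text") %>% 
  html_text() %>% 
  trimws()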

Selector Gadget

SelectorGadget, created by Andrew Cantino, is an open source tool built to make CSS selector generation and discovery on complicated websites simple. SelectorGadget allows the user to click on a page element of interest and have that element’s identifier presented to the user. This tool greatly reduces the time associated with locating CSS selectors through a website’s developer tools. To install the Chrome extension, click here. Once installed, the SelectorGadget icon will be located in the upper-right corner of your web browser; click on it to open the interface.

Selecting the movie summary with SelectorGadget gives us the same html node we found in the developer tools.


Often, an html node will refer to many pieces of a web page, including information that you may not want to scrape. One neat advantage of SelectorGadget is that it allows us to deselect all irrelevant nodes and point our scraper at very specific elements. All of the nodes targeted by .itemprop are shown in yellow, with the original selection highlighted in green.


Clicking the nodes we don’t want helps us target specific information and the corresponding html node(s).
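
In code, the difference between the broad and the refined selection looks roughly like this (a hedged sketch; movie_page is the same illustrative IMDb title page read in the earlier sketch, and the refined selector is the one used for the director in the follow-along below):

# The broad selector matches many elements across the page...
movie_page %>% 
  html_nodes(".itemprop") %>% 
  html_text() %>% 
  head()

# ...while the refined selector, produced after deselecting unwanted nodes,
# isolates a single item (here, the director)
movie_page %>% 
  html_node(".summary_text+ .credit_summary_item .itemprop") %>% 
  html_text()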

Follow-along

Now that we’ve covered some of the basics, let’s put those skills into practice. Using IMDB, let’s see what we can glean about the top-ranked movies.

Collecting the data

First let’s pull the rank, title, and year of each film.

url <- read_html("http://www.imdb.com/chart/top?ref_=nv_mv_250_6")

# Using selector gadget or developer tab to identify the pertinent html nodes
title_info <- html_nodes(url,'.titleColumn a') %>% 
  html_text()

year_info <- html_nodes(url, '.secondaryInfo') %>% 
  html_text()

rank_info <- html_nodes(url, '.imdbRating') %>% 
  html_text()

# Let's check out what the scraper returns
head(title_info)
## [1] "The Shawshank Redemption" "The Godfather"           
## [3] "The Godfather: Part II"   "The Dark Knight"         
## [5] "12 Angry Men"             "Schindler's List"
head(year_info)
## [1] "(1994)" "(1972)" "(1974)" "(2008)" "(1957)" "(1993)"
head(rank_info)
## [1] "\n            9.2\n    " "\n            9.2\n    "
## [3] "\n            9.0\n    " "\n            9.0\n    "
## [5] "\n            8.9\n    " "\n            8.9\n    "


Cleaning the data

Information is initially returned as raw character strings, so we’ll want to unlist() the data to separate the entries. And since the scraper also extracts all of the surrounding html text, we need to clean the returned information into something more usable.

# Other than unlisting, nothing is really needed for the title info
title_info <- title_info %>% unlist()

title_info[1]
## [1] "The Shawshank Redemption"

The release year comes surrounded by parentheses, so we’ll use a regexp to remove the unwanted characters and store the information as numeric.

year_info <- year_info %>% 
  str_replace_all(pattern = "[\\(\\)]", replacement = "") %>% 
  unlist() %>% 
  as.numeric()

year_info[1]
## [1] 1994

The IMDB ranking has a little more baggage associated with it; we’ll need to remove the newline characters and clear the whitespace in addition to unlisting the data.

rank_info <- rank_info %>% 
  str_replace_all(pattern = "\n", replacement = "") %>% 
  str_trim(side = "both") %>% 
  unlist() %>% 
  as.numeric()

rank_info[1]
## [1] 9.2


Visiting other urls

Now we’ll want to gather additional information about each movie - director, run time, genres, and the metascore. However, this information does not exist on our current url. Instead, the data are contained within each movie’s individual web page so we’ll need to redirect our scraper. To do that, we have to capture the url leading to each movie’s individual page and store it in a list.

title_url <- html_nodes(url,'.titleColumn a')

Once we have that list, we can use the html_attr() function and the "href" argument to tell the scraper to return the web address.

title_url_we <- data.frame(html_attr(title_url,"href"))

top_n(title_url_we, 1)
##                                                                                                                           html_attr.title_url...href..
## 1 /title/tt5311514/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=1TEY0ZG2G4YB3VTNDWZV&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_88

What is returned is only the part of the address specific to each movie (/title/tt…), so we paste on the main site ‘http://www.imdb.com’ in order to have a functional url for our scraper to reference.

title_url_we <- as.character(title_url_we$html_attr.title_url...href..)
title_url_we <- paste0("http://www.imdb.com", title_url_we) %>% 
  unlist()

# Check to ensure process generated what we want.
title_url_we[1]
## [1] "http://www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=1TEY0ZG2G4YB3VTNDWZV&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1"


Storing the data

In order to store the data, we’ll create an empty table and then later redefine the value of each cell as the scraper progresses.

extra_info <- data.frame(matrix(nrow = length(title_url_we), ncol = 4),
                         stringsAsFactors = FALSE)
colnames(extra_info) <- c("Director", "MetaScore", "RunTime", "Genre")

Additionally, the html node that we’ve selected for the movie run time returns a lot of extra html text and comes in a clunky format (e.g. ‘2h 23min’). After removing the extraneous characters from the returned values, the run time is a tad more useful but still encoded as hours-and-minutes digits (e.g. ‘2h 23min’ -> 223). To finish the job, we can write a custom function that transforms this value into elapsed minutes (e.g. 223 -> 143).

t2m <- function(x){
  hours <- trunc(x/100)
  mins <- 100 * (x/100 - hours)
  hours*60 + mins
}
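
As a quick check, converting the cleaned value from the example above returns the expected number of elapsed minutes:

t2m(223)
## [1] 143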

From here, we can loop through and direct our scraper to use each item in title_url_we as the base url and capture our movie information. Fair warning: the scraper will take some time to finish running, as each webpage is accessed iteratively. If you don’t want to wait for the scraper to do its thing, you can download the table it produces as an .RData file.

# Loop through urls
for (i in 1:nrow(extra_info)) {
  
  # Read in website
  temp_url <- read_html(title_url_we[i])
  
  # Extract Information
  extra_info[i,] <- c(
    # Director
    html_node(temp_url, '.summary_text+ .credit_summary_item .itemprop') %>%
      html_text(),
    # MetaScore
    html_node(temp_url, '.metacriticScore') %>%
      html_text(),
    # RunTime
    html_node(temp_url, 'time') %>%
      html_text(),
    # Genre
    html_node(temp_url, '.ghost+ a .itemprop') %>% 
      html_text()
  )
}

There is a bit of final tidying up we might want to do before performing any analysis on the data. First, we’ll want to make sure any numeric variable is stored as just that. We’ll also want to strip vestigial html text from our variables.

# Recharacterize the metascore data as numeric
extra_info$MetaScore <- as.numeric(extra_info$MetaScore)

# Remove the extraneous html text from the run time
extra_info$RunTime <- extra_info$RunTime %>% 
  gsub(pattern = "\n", replacement = "", .) %>% 
  gsub(pattern = " ", replacement = "", .) %>% 
  gsub(pattern = "h", replacement = "", .) %>% 
  gsub(pattern = "min", replacement = "", .) %>% 
  as.numeric() %>% 
  t2m()

Now that we have all the data we want (and in the form we want), let’s combine it into one cohesive table.

movie_data <- data.frame(Title = title_info, ReleaseYear = year_info, 
                         ImdbRating = rank_info, stringsAsFactors = FALSE) %>% 
                        cbind(extra_info) %>% 
                        as.tibble()

movie_data[1:5,] %>% 
  kable(caption = "movie_data")
movie_data
Title                     ReleaseYear  ImdbRating  Director              MetaScore  RunTime  Genre
The Shawshank Redemption  1994         9.2         Frank Darabont        80         142      Crime
The Godfather             1972         9.2         Francis Ford Coppola  100        175      Crime
The Godfather: Part II    1974         9.0         Francis Ford Coppola  85         202      Crime
The Dark Knight           2008         9.0         Christopher Nolan     82         152      Action
12 Angry Men              1957         8.9         Sidney Lumet          96         96       Crime

Individual Uses

How might this be useful for thesis work? In many instances the thesis sponsor either has the data, or access to it, and can provide it to you; that may not always be the case. One example is a thesis concerning the relationship between personnel attrition in the Air Force (specifically the officer corps) and the economic environment - how the economic climate affects attrition trends and decisions. The personnel data are available and, though the economic data certainly exist online, the particular data set needed does not exist in any single location. In order to perform any substantial analysis, a data set must be created. This involves scraping several different economic databases, constructing the economic data set, and then marrying it with the personnel data. Another example is a thesis focused on developing a software package that can ingest large corpora of news articles, analyze the text within each article, and visualize the results of that analysis. Different news sources post hundreds of articles on their websites daily, and web scraping techniques make it easy to gather data such as author, title, and article text across thousands of articles.

References

Klepeis, Neil E. Convert Time in “hhmm” to Elapsed Minutes. Retrieved from http://exposurescience.org/heR.doc/library/heR.Activities/html/t2m.html

Morey, Richard (2014, 12 September). Embedding RData Files in R Markdown Files for more Reproducible Analyses. Retrieved from http://rmarkdown.rstudio.com/articles_rdata.html

Boehmke, Bradley. Scraping Data. Retrieved from https://afit-r.github.io/scraping