Web Scraping

Overview/Usefulness

Information has never been more readily available online. The volume of data on the World Wide Web has grown exponentially over the past decade and shows no sign of slowing. While online information is abundant, accessing it is not always simple. This tutorial is designed to help those who need online information by presenting a method for extracting data from webpages via web scraping, using the programming language R and a package called rvest.

Web scraping is the process of extracting large amounts of data from resources on the World Wide Web. The data is extracted and stored on the scraper’s computer or in a database. Many businesses and organizations across the globe rely on this technique to maintain a competitive advantage, increase revenue, or keep track of what their competition is doing. Government uses of web scraping include competitor analysis and gaining insight, through social media, into the circumstances facing people in the country. Applications also extend to procurement research in the acquisitions process used by military agencies. Government, however, is not the only entity that benefits from web scraping. Industry examples include gathering email addresses to bolster lead generation, learning what competitors are selling in order to sell similar or identical products, inspecting competitor prices, and scraping social media websites to learn what’s trending. Web scraping is usually straightforward in concept but presents many challenges, including:

Each website has a unique infrastructure and requires a unique script.
A separate script may need to be written for each page within a single website.
Webpages may be altered regularly by web developers. Slight changes in a page’s code may require a complete rewrite of the scraping script.
Successfully scraping a specific piece of data from a website does not mean that the information itself will be imported perfectly. It may be, and often is, necessary to purge the data of irregularities.
Some web pages have been purposefully designed to prevent actions such as web scraping. Many professional web crawling companies have emerged to provide businesses with data on their competition. The websites being scraped need to protect themselves and their data, so they put systems in place to interfere with scraping. A guide to prevent web scraping can be found here.

Prereqs

The prerequisites for web scraping using R consist of a set of R packages and a Google Chrome extension. This tutorial uses the rvest package authored by Hadley Wickham, which is designed to help users scrape information from webpages. SelectorGadget is a Google Chrome extension that lets a user easily find the CSS selectors for nodes on HTML webpages. To download the extension, click here. SelectorGadget is not the only way to inspect a webpage’s HTML. The following section also covers how to inspect a page’s HTML through the browser’s developer tools. In addition to the rvest package, several others are used for secondary operations such as data cleaning.

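library(rvest)      # scraping: read_html(), html_nodes(), html_text()
library(tidyverse)  # data manipulation; also attaches stringr
library(stringr)    # string cleaning, loaded explicitly for clarity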

HTML Overview

This section covers the foundation of scraping data from a single webpage. It illustrates a basic method for extracting specific pieces of information embedded within a webpage, with an explicit focus on HTML. To begin, it is necessary to explain concisely how HTML webpages are typically arranged.

The layout of an HTML page is controlled by Cascading Style Sheets (CSS) instructions that are embedded in or linked from the HTML. CSS is a style sheet language used to describe the presentation of a document written in a markup language. Many websites use it to deliver visually engaging webpages and user interfaces for both web applications and mobile applications. CSS separates the presentation of a webpage from its content, which lets website developers maintain a consistent theme across multiple webpages while changing the content of each page. Each style sheet is made up of rules, and each rule begins with one or more selectors. CSS selectors determine which HTML elements a rule’s styles apply to by matching tags and attributes in the markup itself. A selector can match every element in an HTML document or only specific components, such as headers. For example, the selector h1 matches main headings, h2 matches sub-headings, and h3 matches sub-sub-headings.

HTML elements are written with a start tag that identifies the element, followed by the content, and then an end tag that closes the element. The start tag name is enclosed between the < and > symbols, and the content follows directly after. The end tag name is enclosed between the </ and > symbols. An example of an HTML element that the selector h1 would match is:

<h1>
Chapter 1: Putting the Bae in Bayesian Statistics
</h1>

Some of the most common HTML tags that CSS selectors target are:

<h1>, <h2>, …, <h6>: Largest headings, second largest headings, etc.
<p>: paragraph elements
<ul>: Unordered bulleted list
<ol>: Ordered list
<li>: Individual List item
<div>: Division or section
<table>: Table

Additional information regarding HTML elements can be found here.
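To make the relationship between tags and selectors concrete, the following is a minimal sketch that parses a small HTML string and selects elements by tag name. The HTML string and the variable names are made up for illustration; only read_html(), html_nodes(), and html_text() come from rvest.

```r
library(rvest)

# A small, made-up HTML document used only for illustration
html_string <- "<html>
  <body>
    <h1>Chapter 1: Putting the Bae in Bayesian Statistics</h1>
    <p>An introductory paragraph.</p>
    <ul>
      <li>First item</li>
      <li>Second item</li>
    </ul>
  </body>
</html>"

# read_html() accepts a literal HTML string as well as a URL
page <- read_html(html_string)

# A bare tag name is itself a CSS selector: it matches every element with that tag
heading <- page %>% html_nodes("h1") %>% html_text()
items   <- page %>% html_nodes("li") %>% html_text()

heading
items
```

Here heading holds the text of the single <h1> element, and items holds one string per <li> element, in document order.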

While the tag-based CSS selectors above are enough to collect data from basic HTML structures, they do not guarantee retrieval from more complex webpages or of more narrowly targeted information. Websites that include unique features, consumer ads, or more dynamic behavior may require selectors that identify the content at a finer level, such as by class or id. In cases such as these, the user may need to open the browser’s developer tools to examine in more detail which elements of the HTML need to be targeted. To do this, press F12 in Chrome or Firefox (Cmd + Opt + I on a Mac); in Safari, press Cmd + Opt + I. To demonstrate we’ll look at this webpage. The developer tab may look a little overwhelming at first, but we’ll only need to focus on a few aspects, which we discuss next.

The Elements section will be the primary focus when accessing the developer’s tab for scraping. It is usually preselected when you open the developer’s tab.

As the cursor is moved over different elements of the webpage script in the developer’s tab, the corresponding elements on the webpage will become highlighted.

Locating the movie summary in the developer tools shows that the data within the <div> node can be extracted using the CSS selector .summary_text. Locating each node in an HTML file can be a lengthy and arduous task, depending on the complexity of the webpage. Luckily, there are options at a web scraper’s disposal that make this task much simpler and faster.
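Once a class selector such as .summary_text has been identified, it can be passed to rvest in the same way as a tag name. The sketch below assumes a made-up HTML fragment that only imitates the structure described above; the placeholder text and the credit_summary_item class are hypothetical and are not taken from the real page.

```r
library(rvest)

# Hypothetical fragment imitating the structure seen in the developer tools
snippet <- read_html('
<div class="plot_summary">
  <div class="summary_text">
    A short plot summary would appear here.
  </div>
  <div class="credit_summary_item">Credits would appear here.</div>
</div>')

# A leading dot selects by class; trim = TRUE drops the surrounding whitespace
summary_text <- snippet %>%
  html_nodes(".summary_text") %>%
  html_text(trim = TRUE)

summary_text
```

Only the node carrying the summary_text class is returned; the sibling <div> is ignored because its class does not match.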

Selector Gadget

SelectorGadget, created by Andrew Cantino, is an open source tool built to make CSS selector generation and discovery on complicated websites simple. SelectorGadget lets the user click on the page element of interest and then displays the selector for that element. This tool greatly reduces the time associated with locating CSS selectors through the browser’s developer tools. To install the Chrome Extension click here. Once installed, SelectorGadget will be located in the upper-right corner of your web browser; click on it to open the interface.

Selecting the movie summary with SelectorGadget gives us the same html node found from the developer’s tab.

Often, an html node will refer to many pieces of a web page, including information that you may not want to scrape. One neat advantage of SelectorGadget is that it allows us to deselect all irrelevant nodes and point our scraper to very specific aspects. All of the nodes targeted by .itemprop are shown in yellow, with the original selection highlighted in green.

Clicking the nodes we don’t want helps us target specific information and the corresponding html node(s).
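The same narrowing can be expressed directly in a selector. The sketch below is an illustration under assumptions: the HTML fragment is invented, and it simply reuses the class names .itemprop and .summary_text mentioned above to show how adding a parent to the selector reduces the set of matched nodes.

```r
library(rvest)

# Made-up markup: the class "itemprop" is shared by several unrelated elements
snippet <- read_html('
<div class="plot_summary">
  <div class="summary_text"><span class="itemprop">Plot text</span></div>
  <div class="credit_summary_item"><span class="itemprop">Director name</span></div>
  <div class="credit_summary_item"><span class="itemprop">Writer name</span></div>
</div>')

# The broad selector matches all three spans
broad <- snippet %>% html_nodes(".itemprop") %>% html_text()

# Prefixing a parent class keeps only the spans nested inside .summary_text
narrow <- snippet %>% html_nodes(".summary_text .itemprop") %>% html_text()

length(broad)
narrow
```

A space between two selectors means "descendant of", so the second call keeps only the span nested inside the summary block. This is one way to write by hand the kind of narrower selector that SelectorGadget proposes as nodes are deselected.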

Follow-along

Now that we’ve covered some of the basics, let’s put those skills into practice. Using IMDB, let’s see what we can glean about the top-ranked movies.

Collecting the data

First, let’s pull the title, year, and rating of each film (the rating is stored in rank_info below).

url <- read_html("http://www.imdb.com/chart/top?ref_=nv_mv_250_6")

# Using selector gadget or developer tab to identify the pertinent html nodes
title_info <- html_nodes(url,'.titleColumn a') %>% 
  html_text()

year_info <- html_nodes(url, '.secondaryInfo') %>% 
  html_text()

rank_info <- html_nodes(url, '.imdbRating') %>% 
  html_text()

# Let's check out what the scraper returns
head(title_info)

## [1] "The Shawshank Redemption" "The Godfather"           
## [3] "The Godfather: Part II"   "The Dark Knight"         
## [5] "12 Angry Men"             "Schindler's List"

head(year_info)

## [1] "(1994)" "(1972)" "(1974)" "(2008)" "(1957)" "(1993)"

head(rank_info)

## [1] "\n            9.2\n    " "\n            9.2\n    "
## [3] "\n            9.0\n    " "\n            9.0\n    "
## [5] "\n            8.9\n    " "\n            8.9\n    "
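The raw strings still contain parentheses and stray whitespace, so some cleaning is needed before analysis. The following is a minimal sketch of one possible approach, assuming the values look like the head() output above. The *_sample vectors are copied from that output and stand in for the full scraped vectors, and the names year_clean, rating_clean, and movies are hypothetical.

```r
library(stringr)

# Values copied from the head() output above, standing in for the full vectors
title_sample <- c("The Shawshank Redemption", "The Godfather")
year_sample  <- c("(1994)", "(1972)")
rank_sample  <- c("\n            9.2\n    ", "\n            9.2\n    ")

# Remove the parentheses and convert the year to an integer
year_clean <- as.integer(str_remove_all(year_sample, "[()]"))

# Trim the surrounding whitespace and convert the rating to a number
rating_clean <- as.numeric(str_trim(rank_sample))

# Combine the cleaned vectors into one data frame, one row per film
movies <- data.frame(
  title  = title_sample,
  year   = year_clean,
  rating = rating_clean,
  stringsAsFactors = FALSE
)

movies
```

The same calls should work on the full title_info, year_info, and rank_info vectors, provided all three have the same length and every entry follows the pattern shown in the sample.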