Learning Outcomes

Webscraping in R

This workshop will introduce you to the concept and practices of web scraping in R using the rvest package. By the end, you will have worked through the process of writing scrapers for two websites using the basic functions of rvest. The process of scraping data from the web is a lesson in the power of data science, because it exemplifies the computer-plus-human model of computing. It is also a nice entry point into the process of building software, since custom scrapers are essentially software for scraping a specific website.

This workshop is for beginners, but it does assume that you have R installed, have a basic familiarity with how R works, and are familiar with at least one R interface (e.g. RStudio or the R command prompt). The workshop demonstrates multiple data formats, including lists and dataframes, and exploratory data visualization techniques, though this is not the focus of the workshop.

This tutorial is aimed at a range of skill sets, while still trying to avoid “something for everyone and everything for no one.”

General Idea of Webscraping

Since no two websites are the same, webscraping requires you to identify and exploit patterns in the code that renders websites. Each website is rendered by your browser from HTML, and the goal of webscraping is to parse the HTML that is sent to your browser into usable data. Generally speaking, the steps for webscraping are as follows:

  • access a web page from R,
  • tell R where to “look” on the page,
  • and manipulate the data into a usable format within R.

The first section will cover the magrittr package, which provides a piping operator to help us filter through the text. Without magrittr, we would have to assign a bunch of temporary variables or use nested functions to do operations on a string of HTML.

A Prerequisite: magrittr

Rvest is designed to work with magrittr and the %>% piping notation. This can be a little goofy when you first see it, so let’s take a second to get oriented to the %>% piping notation and what it means. Piping is also useful for understanding other packages, such as ggplot and dplyr. Just remember the main advantage of piping is to minimize the need for nesting functions.

The magrittr package implements a piping function, which takes the output of an operation and “pipes” it to the next function.
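For example, here is a minimal sketch (using only base R functions) of the same computation written with nested calls and with the pipe:

library(magrittr)

# Nested: the innermost call runs first, so you read it inside out.
round(mean(sqrt(c(1, 4, 9, 16))), 1)

# Piped: each result is handed on to the next function, so you read left to right.
c(1, 4, 9, 16) %>% sqrt() %>% mean() %>% round(1)

# Both return 2.5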

Why Scrape?

  • Sometimes data are not downloadable.
  • The only source might be a web page.
  • You want to update the data (in the future).
  • There are lots of ‘high quality relational data’ out there to be scraped.

Other Presentations on Webscraping in R

Generally speaking, the steps for webscraping are as follows:

  • Inspect the page in your browser to determine if you really want to scrape it.
  • Once you have decided, set aside some time and don’t doubt yourself.
  • Access a web page from R;
  • tell R where to “look” on the page;
  • and manipulate the data into a usable format within R.

Now on to the scraping

  1. Getting Familiar with the Website

Because each website is different, we’ll be scraping a couple during this tutorial to give you an idea of the general principles (as well as a few examples of working code to start your next project). First up is SciStarter.org, a platform for advertising and joining citizen science projects. Take a look at the projects page.

Projects filtered by location “At the beach”.

To scrape the project information from this page, we need to take a look at the HTML.

You can do this by right-clicking and selecting "view page source" from Chrome, Firefox, Safari, and most other modern browsers. You should see something that looks like the image below.

Source code for SciStarter.

If you don’t know HTML, this looks pretty daunting! Luckily, there are some built-in tools in your browser that will help you parse the code to find the relevant sections of the page. You can right-click and select the “Inspect Element” option, which lets you interactively scroll over the page and see how each line of HTML corresponds to a different part of the rendered page. See the screenshot below for an example.

Example of “Inspect Element” screen for SciStarter.

OK, now we’re going to start trying to isolate the specific information we’re after. Let’s say our goal is to grab all the data from the first ten records on the page. We can see from the image above that the first piece of information we need is a link with the node “a” and the attribute “href.” Let’s see if we can grab that using a basic function in the rvest package.

  • First, let’s install the rvest package with the command install.packages('rvest').
  • Once it is installed, you can access the help by typing ?rvest in your command prompt.
  • The documentation for specific functions can also be accessed the same way, such as ?html_node.
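Collecting those commands in one place (a quick setup sketch):

# install.packages("rvest")  # run once per machine
library(rvest)               # load rvest (it brings along its xml2 dependency)

?rvest       # package-level help
?html_nodes  # help for a specific function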

The basic functions in rvest are powerful, and you should reach for the following functions when starting a new project.

  • html_nodes(): selects the HTML nodes (tags) that match a selector
  • html_nodes(".class"): selects nodes by CSS class
  • html_nodes("#id"): selects nodes by id
  • html_nodes(xpath="xpath"): selects nodes by XPath (we’ll cover this later)
  • html_attrs(): identifies attributes (useful for debugging)
  • html_table(): turns HTML tables into data frames
  • html_text(): strips the HTML tags and extracts only the text

A note on plurals: html_node() returns the first matching node, while html_nodes() returns all of the matching nodes.
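Here is a minimal sketch of that difference on a toy HTML string (the string and the toy variable are made up for illustration):

library(rvest)

toy <- read_html("<html><body><p>first</p><p>second</p></body></html>")

toy %>% html_node("p")   # a single node: only the first <p>
toy %>% html_nodes("p")  # a node set containing both <p> tags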

html_nodes()

The html_nodes() function returns the set of HTML nodes that match a selector (an xml_nodeset), with one element for each matching tag.

library(rvest)
## Loading required package: xml2
# Define the url once.
URL <- "https://scistarter.com/finder?phrase=&lat=&lng=&activity=At%20the%20beach&topic=&search_filters=&search_audience=&page=1#view-projects"

scistarter_html <- read_html(URL)
scistarter_html
## {xml_document}
## <html class="no-js" lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body>\n    \n    \n    <svg style="position: absolute; width: 0; he ...

This isn’t useful yet, but it does show that we can retrieve in R the same HTML code we saw in our browser. Now we will begin filtering through the HTML to find the data we’re after.

The data we want are stored in a table, which we can tell by looking at the “Inspect Element” window.

This grabs all the <a> nodes, i.e. every link on the page:

scistarter_html %>%
  html_nodes("a") %>%
  head()
## {xml_nodeset (6)}
## [1] <a href="/index.html" class="site-header__branding" title="go to the ...
## [2] <a href="/dashboard">My Account</a>
## [3] <a href="/finder" class="is-active">Project Finder</a>
## [4] <a href="/events">Event Finder</a>
## [5] <a href="/people-finder">People Finder</a>
## [6] <a href="#dialog-login" rel="modal:open">log in</a>

In a more complex example, we could use this to “crawl” the page, but that’s for another day.
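As a rough sketch of what that could look like, html_attr() pulls a named attribute out of each node, so the href values above could be collected and then fed back into read_html() (the links variable here is just for illustration):

# Collect the href attribute from every <a> node -- a character vector of link targets.
links <- scistarter_html %>%
  html_nodes("a") %>%
  html_attr("href")

head(links)  # relative paths like "/dashboard" would need the site's base URL pasted on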

Every div on the page:

scistarter_html %>%
  html_nodes("div") %>%
  head()
## {xml_nodeset (6)}
## [1] <div class="site-header__nav js-hamburger b-utility">\n        <butt ...
## [2] <div class="site-header__nav__body js-hamburger__body">\n          < ...
## [3] <div class="nav-tools">\n            <div class="nav-tools__search"> ...
## [4] <div class="nav-tools__search">\n              <div class="field">\n ...
## [5] <div class="field">\n                <form method="get" action="/fin ...
## [6] <div class="input-group input-group--flush">\n                    <d ...

… just the nav-tools div. This selects by CSS class, matching elements where class="nav-tools".

scistarter_html %>%
  html_nodes("div.nav-tools") %>%
  head()
## {xml_nodeset (1)}
## [1] <div class="nav-tools">\n            <div class="nav-tools__search"> ...

We can call the nodes by id as follows.

scistarter_html %>%
  html_nodes("div#project-listing") %>%
  head()
## {xml_nodeset (1)}
## [1] <div id="project-listing" class="subtabContent">\n          \n       ...

We can grab all the tables as follows:

scistarter_html %>%
  html_nodes("table") %>%
  head()
## {xml_nodeset (6)}
## [1] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [2] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [3] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [4] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [5] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [6] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...

This will help us access one of rvest’s most powerful features, html_table().

Now that it’s clear what html_nodes() does, let’s look at the source code for the site again to see if we can combine rvest functions (and some base R functions) to get at the right data.

Putting it together

With the piping function and what we learned about rvest, we can now start scraping this page.

scistarter_html %>%
  html_nodes("div#project-listing") %>%
  html_nodes("table") %>%
  html_table() %>%
  "["(1) %>% str()
## List of 1
##  $ :'data.frame':    3 obs. of  2 variables:
##   ..$ X1: chr [1:3] "Goal" "Task" "Where"
##   ..$ X2: chr [1:3] "Study where ants live and what they eat" "Build your kit. Catch your ants. Send them to us!" "United States of America"

That gives us the goal, task, and location details for a single project, but not the project title. Let’s try another selector.

scistarter_html %>%
  html_nodes("div#project-listing") %>% #filter to the projec-listing div
  html_nodes("h3") # filter the tables in the project-listing div 
## {xml_nodeset (10)}
##  [1] <h3 class="b-heading-5--lc u-mb-xs"> <a href="/project/17003-School ...
##  [2] <h3 class="b-heading-5--lc u-mb-xs"> <a href="/project/16830-NASA-G ...
##  [3] <h3 class="b-heading-5--lc u-mb-xs"> <a href="/project/17039-Britis ...
##  [4] <h3 class="b-heading-5--lc u-mb-xs"> <a href="/project/17011-Coasta ...
##  [5] <h3 class="b-heading-5--lc u-mb-xs"> <a href="/project/16909-Pl%40n ...
##  [6] <h3 class="b-heading-5--lc u-mb-xs"> <a href="/project/16868-Wiscon ...
##  [7] <h3 class="b-heading-5--lc u-mb-xs"> <a href="/project/15731-Sentin ...
##  [8] <h3 class="b-heading-5--lc u-mb-xs"> <a href="/project/15069-Surfri ...
##  [9] <h3 class="b-heading-5--lc u-mb-xs"> <a href="/project/15066-Surfri ...
## [10] <h3 class="b-heading-5--lc u-mb-xs"> <a href="/project/14743-Surfri ...
scistarter_html %>%
  html_nodes("div#project-listing") %>% #filter to the projec-listing div
  html_nodes("h3") %>%                  # get the headings
  html_text() %>%                       #get the text, not the HTML tags
  gsub("^\\s+|\\s+$", "", .)            #strip the white space from the beginning and end of a string.
##  [1] "School of Ants USA"                                          
##  [2] "NASA GLOBE Observer: Clouds"                                 
##  [3] "British Columbia Beached Bird Survey"                        
##  [4] "Coastal Research Volunteers"                                 
##  [5] "Pl@ntNet"                                                    
##  [6] "Wisconsin Breeding Bird Atlas II"                            
##  [7] "Sentinels of the Sounds"                                     
##  [8] "Surfrider Foundation's Blue Water Task Force Rincon"         
##  [9] "Surfrider Foundation's Blue Water Task Force San Luis Obispo"
## [10] "Surfrider Foundation's Blue Water Task Force Marin County"

Now we have isolated the titles of each citizen science project. This will make up one of the columns of our dataframe, but it’s not the whole thing.

The other columns are embedded in separate tables on the page. This is not ideal for webscraping, and there are multiple ways to access that information. One method is to scrape the information from each table by filtering down to the <td> cells, as below.

scistarter_html %>%
  html_nodes("td") %>% #grab the <td> tags
  html_text() %>% # isolate the text from the html tages
  gsub("^\\s+|\\s+$", "", .) %>% #strip the white space from the beginning and end of a string.
  head(n=12) # take a peek at the first 12 records
##  [1] "Study where ants live and what they eat"                       
##  [2] "Build your kit. Catch your ants. Send them to us!"             
##  [3] "United States of America"                                      
##  [4] "Help scientists understand the sky from above and below"       
##  [5] "Photograph clouds, record sky observations and share with NASA"
##  [6] "Anywhere"                                                      
##  [7] "Monitor seabird mortality in British Columbia"                 
##  [8] "Walk your local beach while looking for dead birds"            
##  [9] "British Columbia"                                              
## [10] "Contribute to coastal science, stewardship, and resilience"    
## [11] "Participate in one of our many coastal research opportunities" 
## [12] "New Hampshire"

Putting it all together

Notice the pattern here? Every third record is a goal, task, or location. We can use our base R knowledge to transform this into three columns that have only goals, tasks, or locations. After that, we will put the columns in a data.frame.

page_list <- scistarter_html %>%
  html_nodes("td") %>%
  html_text() %>%
  gsub("^\\s+|\\s+$", "", .) #strip the white space from the beginning and end of a string.

goals <- page_list[seq(from=1, to=30,by=3)] # make a sequence to select the goals
task <- page_list[seq(from=2, to=30,by=3)]
location <- page_list[seq(from=3, to=30,by=3)]

title <- scistarter_html %>%
  html_nodes("div#project-listing") %>% #filter to the projec-listing div
  html_nodes("h3") %>%                  # get the headings
  html_text() %>%                       #get the text, not the HTML tags
  gsub("^\\s+|\\s+$", "", .) 

scistarter_df <- data.frame(title, goals, task, location)

Now you have scraped SciStarter’s first project page! From here, you can write a loop that will build up a data frame from multiple pages by going to each page and scraping the data.

pages <- ceiling(832/10) # number of pages to go through
sci_df <- data.frame()

#for (page in (1:pages)) { # uncomment this line (and comment out the next) for all the pages
for (page in (1:5)) {

  print(paste0("getting data for page: ", page))
  URL <- paste0("https://scistarter.com/finder?phrase=&lat=&lng=&activity=&topic=&search_filters=&search_audience=&page=", page, "#view-projects")

  sci_html <- read_html(URL)
  page_list <- sci_html %>%
    html_nodes("td") %>%
    html_text() %>%
    gsub("^\\s+|\\s+$", "", .) # strip the white space from the beginning and end of each string

  goal <- page_list[seq(from=1, to=30, by=3)]
  task <- page_list[seq(from=2, to=30, by=3)]
  location <- page_list[seq(from=3, to=30, by=3)]

  title <- sci_html %>%
    html_nodes("div#project-listing") %>% # filter to the project-listing div
    html_nodes("h3") %>%                  # get the headings
    html_text() %>%                       # get the text, not the HTML tags
    gsub("^\\s+|\\s+$", "", .)            # strip the white space

  tmp <- data.frame(title, goal, task, location)
  if (page == 1) {
    sci_df <- data.frame(tmp)
  } else {
    sci_df <- rbind(sci_df, tmp)
  }
}
## [1] "geting data for page: 1"
## [1] "geting data for page: 2"
## [1] "geting data for page: 3"
## [1] "geting data for page: 4"
## [1] "geting data for page: 5"
sci_df %>% str()
## 'data.frame':    50 obs. of  4 variables:
##  $ title   : Factor w/ 50 levels "California Roadkill Observation System",..: 10 5 7 6 4 9 3 2 1 8 ...
##  $ goal    : Factor w/ 50 levels "Contribute to better health care. No experience necessary!",..: 9 7 5 4 8 2 6 1 10 3 ...
##  $ task    : Factor w/ 50 levels "Add your roadkill observations.",..: 9 5 2 6 10 8 7 3 1 4 ...
##  $ location: Factor w/ 22 levels "","Anywhere",..: 1 1 5 2 4 4 2 4 3 2 ...
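One aside: this was run on an older version of R (before 4.0), so data.frame() converted the character vectors to factors, which is why str() reports Factor columns. If you would rather keep plain character columns, a small tweak inside the loop (a sketch, not required) is:

tmp <- data.frame(title, goal, task, location, stringsAsFactors = FALSE)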

Easier Targets - html_table()

Some websites publish their data in an easy-to-read table without offering the option to download the data.

  • rvest has a great tool built in for this: html_table().
  • Using the functions listed above, isolate the table on the page.
  • Then pass the HTML table to html_table() and voilà, a shiny R data frame is ready for you to analyze.

html_table() example

Go to https://www.nis.gov.kh/cpi/Jan14.html and inspect the html.

# To scrape a table from a website, the html_table() function can be a game-changer.
# But it doesn't give us the right output right away. 
URL2 <- "https://www.nis.gov.kh/cpi/Apr14.html"

# TIP: When debugging or building your scraper, assign the raw HTML to a variable.
# That way you only have to read the page once.
accounts <- read_html(URL2) 

table <- accounts %>%
  html_nodes("table") %>%
  html_table(header=T)

# You can clean up the table with the following code, or something like it. 
# table[[1]]
dict <- table[[1]][,1:2]
accounts_df <- table[[1]][6:18,-1]

names <- c('id', 'weight.pct', 'jan.2013', 'dec.2013', 'jan.2014', 'mo.pctch', 'yr.pctch', 'mo.cont', 'yr.cont')
colnames(accounts_df) <- names

accounts_df #%>% str()
##                                                     id weight.pct jan.2013
## 6                               All ITEMS  (CPI TOTAL)    100.000    150.2
## 7                  FOOD AND    NON-ALCOHOLIC BEVERAGES     44.775    170.4
## 8        ALCOHOLIC BEVERAGES,    TOBACCO AND NARCOTICS      1.625    127.9
## 9                                CLOTHING AND FOOTWEAR      3.036    125.8
## 10 HOUSING, WATER,    ELECTRICITY, GAS AND OTHER FUELS     17.084    127.8
## 11               FURNISHINGS,    HOUSEHOLD MAINTENANCE      2.743    130.5
## 12                                              HEALTH      5.141    116.8
## 13                                           TRANSPORT     12.228    128.2
## 14                                       COMMUNICATION      1.136     71.3
## 15                           RECREATION AND    CULTURE      2.912    107.8
## 16                                           EDUCATION      1.174    146.1
## 17                                         RESTAURANTS      5.861    201.2
## 18                 MISCELLANEOUS GOODS    AND SERVICES      2.285    152.8
##    dec.2013 jan.2014 mo.pctch yr.pctch mo.cont yr.cont
## 6     156.8    157.7      0.5      4.9     0.5     4.9
## 7     178.7    180.0      0.7      5.6     0.4     2.9
## 8     136.0    136.9      0.6      7.1     0.0     0.1
## 9     130.1    130.4      0.2      3.6     0.0     0.1
## 10    131.1    131.5      0.3      2.9     0.0     0.4
## 11    141.3    141.9      0.4      8.7     0.0     0.2
## 12    127.1    127.5      0.3      9.2     0.0     0.4
## 13    129.7    129.8      0.1      1.2     0.0     0.1
## 14     69.6     69.4     -0.3     -2.7     0.0     0.0
## 15    110.7    110.5     -0.2      2.5     0.0     0.1
## 16    159.9    159.9      0.0      9.5     0.0     0.1
## 17    218.0    220.1      0.9      9.4     0.1     0.7
## 18    145.0    144.7     -0.3     -5.3     0.0    -0.1
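Depending on how the headers parse, the value columns may come through as character rather than numeric. A cleanup sketch, assuming the accounts_df built above:

# Convert the value columns to numeric so they can be plotted or summarized.
num_cols <- c('weight.pct', 'jan.2013', 'dec.2013', 'jan.2014',
              'mo.pctch', 'yr.pctch', 'mo.cont', 'yr.cont')
accounts_df[num_cols] <- lapply(accounts_df[num_cols], as.numeric)
str(accounts_df)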

Using xpath

The final method for extracting data from a webpage is to call the data using its XPath. Sometimes we want very specific data from a website; maybe we don’t want a whole table, only a particular value from within it.

The xpath option can be very useful for doing this, but it is not super intuitive. Think of it as directions you are giving rvest to the specific piece of data you’re interested in scraping.

For the last example, we’ll scrape wunderground.com, inspired by a great tutorial on webscraping in Python from Nathan Yau’s Visualize This.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
URL <- "https://www.wunderground.com/history/airport/WSSS/2016/1/1/DailyHistory.html?req_city=Singapore&req_statename=Singapore"

raw <- read_html(URL)
      
max <- raw %>% 
  html_nodes(xpath='//*[@id="historyTable"]/tbody/tr[3]/td[2]/span/span[1]')  %>%
  html_text() %>% as.numeric()
min <- raw %>%
  html_nodes(xpath='//*[@id="historyTable"]/tbody/tr[4]/td[2]/span/span[1]') %>%
  html_text() %>% as.numeric()
date <- ymd(paste("2016","1","1", sep="/"))

record <- data.frame(date, min, max)

record
##         date min max
## 1 2016-01-01  24  29

What is going on here?

  • //*[@id="historyTable"]/tbody/tr[3]/td[2]/span/span[1]

  • When you loaded rvest, you might have noticed the message that the xml2 package is a dependency.
  • rvest uses that XML machinery to read and navigate the HTML.
  • An XPath is the “address” of an element within the markup.
  • It isn’t necessary to understand the deeper points here (I certainly don’t!), but learning a little about XPath might demystify the next steps.

The address we supplied is //*[@id="historyTable"]/tbody/tr[3]/td[2]/span/span[1]. You can go back through the HTML using view source to reverse-engineer this xpath.
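If you want to get a feel for the notation without the full wunderground page, here is a toy sketch (the HTML string is invented for illustration):

toy_html <- read_html("<div id='historyTable'><table><tr><td>Max</td><td><span>31</span></td></tr></table></div>")

# //*[@id='historyTable'] : any element whose id attribute is 'historyTable'
# //td[2]/span            : a <span> inside the second <td> of a row
toy_html %>%
  html_nodes(xpath = "//*[@id='historyTable']//td[2]/span") %>%
  html_text()  # returns "31"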

For a nice overview, check out this StackOverflow response.

Loop through the URLs like so:

# go to https://www.wunderground.com/history/airport/WSSS/2017/3/5/DailyHistory.html?req_city=Singapore&req_statename=Singapore

years <- c(2016) # edit for the year(s) you want
months <- c(1:12)

for (y in years) {
  for (m in months) {
    if (m %in% c(4, 6, 9, 11)) {
      days <- c(1:30) # Apr, Jun, Sep, Nov have 30 days
    } else if (m == 2 && y %% 4 == 0 ) {
      days <- c(1:29) # leap year
    } else if (m == 2 && y %% 4 != 0 ) {
      days <- c(1:28) # non leap year Febs
    } else {
      days <- c(1:31) # all the rest have 31 days
    }
    #for (d in days) {
    for (d in 1) {
      URL <- paste0("https://www.wunderground.com/history/airport/WSSS/", 
                    y, "/", 
                    m, "/",
                    d, "/DailyHistory.html?req_city=Singapore&req_statename=Singapore")
      print(URL) # try this to test before running the script
      
      raw <- read_html(URL)
      
      max <- raw %>% 
        html_nodes(xpath='//*[@id="historyTable"]/tbody/tr[3]/td[2]/span/span[1]')  %>%
        html_text() %>% as.numeric()
      min <- raw %>%
        html_nodes(xpath='//*[@id="historyTable"]/tbody/tr[4]/td[2]/span/span[1]') %>%
        html_text() %>% as.numeric()
      
      date <- ymd(paste(y,m,d, sep="/"))
      record <- data.frame(date, min, max) # keep the Date class (cbind would coerce it to numeric)
      
      if ( date == "2016-01-01") {
        sing_temp <- record
      } else {
        sing_temp <- rbind(sing_temp, record) # stack each day's record as a new row
      }
    }
  }
}

Base R Scraper
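Before rvest, scrapers were often written with base R tools such as scan(), grep(), and gsub(). For comparison, here is an example (credited in the header comments) that scrapes the RE100 member list that way.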

# Angel Hsu
# Scrape of RE100 members
# October 27, 2016

#setwd("~/Dropbox/NAZCA DATA 2016/RE100")
#URL of the HTML webpage 
co.names <- data.frame()
url <- "http://there100.org/companies"

x <- scan(url, what="", sep="\n")

# location of company names starts at line 2464
# <p><a href="http://www.there100.org/ikea" target="_blank"><img src="http://media.virbcdn.com/cdn_images/resize_1024x1365/71/87aded9e69e34239-ikea.jpg" /></a></p>
# <p>The IKEA Group is a home furnishing company with 336 stores in 28 countries. The company has committed to produce as much renewable energy as the total energy it consumes in its buildings by 2020. Alongside Swiss Re, IKEA Group is a founding partner of the RE100 campaign.</p>
start <- grep("The IKEA Group", x) # start of companies
end <- grep("S.p.A is an Italian", x) # end of companies 

# each company name is preceded by either a .jpg or a .png
sub <- x[start:end]
sel <- grep("jpg|png", sub)

co.names <- sub[sel+1] # the line after each image holds the company name
# add back in Bank of America, which the image pattern misses
bofa <- grep("Bank of America", x)
co.names <- c(co.names, x[start], x[bofa])
co.names <- gsub("<.*?>", "", co.names) # strip any remaining HTML tags

write.csv(co.names, "RE100_2016.csv", row.names=F)