Web scraping with R

This tutorial serves as an intro to scraping with R using the rvest library. It walks through how to scrape a table of OSHA inspections, extract information based on the italic style of the text, and scrape hyperlinked tables. It also offers tips on how to deal with errors and avoid being blocked as a robot.

A quick intro to webpages

Webpages usually consist of:

  • HTML (HyperText Markup Language) files, which build the structure of the page
  • CSS (Cascading Style Sheets) files, which define the style or look of the page
  • JavaScript files, which make the page interactive

An HTML file is a text file with HTML tags, which are reserved keywords for certain elements. They tell your web browser, “hey, here’s a table/paragraph/list, please display it as a table/paragraph/list”. Most tags come in pairs, an opening tag and a closing tag, e.g. <table></table>, <p></p>, <li></li>.

These tags can have attributes such as:

  • hyperlinks: <a href='https://www.osha.gov/'>Occupational Safety and Health</a>
  • class: <table class='table table-bordered'>
  • id: <h1 id='myHeader'>Hello World!</h1>

You can learn more about HTML tags here.

Inspect elements

An HTML document is like a tree, and scraping data from it is like picking apples. You need to tell R which branches you want the apples from: the features of the branches, the riper apples, the ones without leaves, etc. Tags and attributes help you target the branches you want apples from. To find the right tags and attributes, we need to inspect the source code.

Click here to visit the page we’re going to scrape. Place your mouse on the “Activity” column of the table in the middle, right-click on the page and click “Inspect”. The “Elements” tab highlights where your mouse is placed. The “Sources” tab shows the entire HTML file.

If our apple is the data in the <table> tag that’s highlighted, it sits on a <table> branch of a <div> branch, which is a branch of another <div>, which, several layers of <div> branches later, is a branch of the <body> branch of the HTML tree. Any branch or sub-branch of this tree can also be called a “node”, and you will hear this word several times in this session.
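To make the tree idea concrete, here is a minimal sketch that parses a tiny, made-up HTML document and walks its branches. It uses rvest functions that we will formally load in the next section:

## a minimal sketch: parse a made-up HTML document and walk its tree
library(rvest)

mini <- read_html("<html><body><div><table><tr><td>an apple</td></tr></table></div></body></html>")

html_children(mini)                    # the <body> branch off the root
html_node(mini, "table")               # the <table> branch, found by its tag
html_node(mini, "td") %>% html_text()  # the apple: the text inside the <td>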

Now let’s scrape a table!

First load the libraries we need.

## install the package
#install.packages("rvest")
#install.packages("dplyr")
## load the package
library(rvest)
library(dplyr)

This webpage has 161 OSHA inspections in the messenger courier industry in 2019.

# url of the website to be scraped
url <- "https://www.osha.gov/pls/imis/industry.search?sic=&sicgroup=&naicsgroup=&naics=492110&state=All&officetype=All&office=All&startmonth=12&startday=31&startyear=2019&endmonth=01&endday=01&endyear=2019&opt=&optt=&scope=&fedagncode=&owner=&emph=&emphtp=&p_start=&p_finish=0&p_sort=&p_desc=DESC&p_direction=Next&p_show=200"

read_html(): read the webpage/html document into R

# read the html content into R and assign it to the webpage object
webpage <- read_html(url, encoding = "windows-1252")
webpage
## {html_document}
## <html lang="en-us">
## [1] <head>\n<meta charset="utf-8">\n<title>Industry SIC Search Results Page | ...
## [2] <body>\n<div class="block block-gtranslate block-gtranslate block-gtransl ...

Tip: to find the right encoding, run “document.inputEncoding” in your browser’s Console tab.

Character encoding is a method of converting bytes into characters. To validate or display an HTML document properly, a program must choose a proper character encoding. You can read more about it in this post.
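If you can’t check the browser console, you can also guess the encoding from the raw bytes. A sketch using the httr and stringi packages (both assumed to be installed; stri_enc_detect() returns candidate encodings with confidence scores):

library(httr)
library(stringi)
# fetch the raw bytes and let stringi guess how they are encoded
resp <- GET(url)
stri_enc_detect(content(resp, as = "raw"))[[1]]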

html_nodes(): select elements/nodes from the html

We can select certain elements in the html document, or “nodes”, by picking out certain features. Like we talked about, it’s picking which branches we want the apples from. We do that by passing what is called a “CSS selector” to the html_nodes() function. You can also pass an “XPath”, but we’re not covering that today.

The following line tells R to pull nodes, or tree branches, with “table” tags.

html_nodes(webpage,"table")
## {xml_nodeset (3)}
## [1] <table class="table table-bordered table-striped">\n<tr>\n<th>SIC</th>\n< ...
## [2] <table class="table table-bordered"><tr class="breadcrumb">\n<td>\n<stron ...
## [3] <table class="table table-bordered table-striped">\n<thead><tr>\n<th> </t ...

You can also choose elements or tree branches based on attributes. Here we can find the value of the class attribute of the <table> node/branch we want and pass that to the html_nodes() function. There are two tables with the same class attribute; the table we want is the second node in the returned nodeset.

html_nodes(webpage,"[class='table table-bordered table-striped']")[[2]]
## {html_node}
## <table class="table table-bordered table-striped">
##  [1] <thead><tr>\n<th> </th>\n<th>#</th>\n<th>Activity</th>\n<th>Opened</th>\ ...
##  [2] <tr>\n<td><input type="checkbox" name="id" value="1452519.015"></td>\n<t ...
##  [3] <tr>\n<td><input type="checkbox" name="id" value="1451959.015"></td>\n<t ...
##  [4] <tr>\n<td><input type="checkbox" name="id" value="1451637.015"></td>\n<t ...
##  [5] <tr>\n<td><input type="checkbox" name="id" value="1451571.015"></td>\n<t ...
##  [6] <tr>\n<td><input type="checkbox" name="id" value="1453178.015"></td>\n<t ...
##  [7] <tr>\n<td><input type="checkbox" name="id" value="1451355.015"></td>\n<t ...
##  [8] <tr>\n<td><input type="checkbox" name="id" value="1451044.015"></td>\n<t ...
##  [9] <tr>\n<td><input type="checkbox" name="id" value="1452754.015"></td>\n<t ...
## [10] <tr>\n<td><input type="checkbox" name="id" value="1450975.015"></td>\n<t ...
## [11] <tr>\n<td><input type="checkbox" name="id" value="1451657.015"></td>\n<t ...
## [12] <tr>\n<td><input type="checkbox" name="id" value="1449936.015"></td>\n<t ...
## [13] <tr>\n<td><input type="checkbox" name="id" value="1451983.015"></td>\n<t ...
## [14] <tr>\n<td><input type="checkbox" name="id" value="1450424.015"></td>\n<t ...
## [15] <tr>\n<td><input type="checkbox" name="id" value="1448259.015"></td>\n<t ...
## [16] <tr>\n<td><input type="checkbox" name="id" value="1447908.015"></td>\n<t ...
## [17] <tr>\n<td><input type="checkbox" name="id" value="1447553.015"></td>\n<t ...
## [18] <tr>\n<td><input type="checkbox" name="id" value="1451336.015"></td>\n<t ...
## [19] <tr>\n<td><input type="checkbox" name="id" value="1447118.015"></td>\n<t ...
## [20] <tr>\n<td><input type="checkbox" name="id" value="1447292.015"></td>\n<t ...
## ...

html_table(): parse the table

After we get the node, or the tree branch, with the inspections table, we can parse it with the html_table() function.

inspections <- html_nodes(webpage,"[class='table table-bordered table-striped']")[[2]] %>% html_table()

inspections <- inspections[,-c(1:2)] ## remove the first two columns. one is empty, the other is useless.

head(inspections)
##   Activity     Opened     RID St      Type            Sc  SIC  NAICS Vio
## 1  1452519 12/20/2019  453710 NC Complaint       Partial 4215 492110  NA
## 2  1451959 12/17/2019 1054111 OR   Planned No Insp/Other 4215 492110  NA
## 3  1451637 12/17/2019  950633 CA  Accident       Partial   NA 492110  NA
## 4  1451571 12/17/2019  524700 OH Complaint       Partial   NA 492110  NA
## 5  1453178 12/16/2019  317500 PA   Planned      Complete   NA 492110  NA
## 6  1451355 12/16/2019 1055330 WA Complaint       Partial   NA 492110  NA
##                        Establishment Name
## 1      144338 - Directlink Logistics, Inc
## 2    317726467 - Portland Pedal Power Llc
## 3                      Fedex Freight, Inc
## 4                   United Parcel Service
## 5      Fedex Ground Package Systems, Inc.
## 6 Wa317957214 - United Parcel Service Inc

Save the table

If you are happy with this table, you can save it locally as a csv file.

#write.csv(inspections, "~/Desktop/nicar2020/nicar_2020_scraping_r/inspections.csv")

Extract activity numbers with html_attr()

In the scraped table, the Activity column doesn’t have decimal places. Let’s re-scrape the complete activity numbers from the table and replace the Activity column.

What CSS selector do we use to target the nodes/tree branches with activity numbers?

Inspect the elements of those activity numbers, and you will realize they appear as the “title” attribute in the <a> tags, for example: <a href="establishment.inspection_detail?id=1452519.015" title="1452519.015">. (Yes, a tag can have multiple attributes.)

<a> tags in HTML are reserved for hyperlinks, so we will want nodes with <a> tags for sure, but not all of them.

Instead, we want <a> tags that are:

  • in <td> tags, in other words, in a table cell
  • in the third column of a table

To find this specific type of node/branch we need to understand two things.

First, “:nth-child(A)” selects an element that is the Ath child of its parent. What appears before the colon defines the type of the child element.

Go to level 18 of this interactive CSS tutorial and try the game after reading the examples on the right.

Now you will understand that “td:nth-child(3)” selects every <td> cell that is the third child of its row, in other words, the third cell in every table row on the page.

When you run the next code chunk, you will find the first node isn’t what we want. We will fix it next.

html_nodes(webpage, 'td:nth-child(3)') %>% head()
## {xml_nodeset (6)}
## [1] <td>01/01/2019 to 12/31/2019</td>\n
## [2] <td><a href="establishment.inspection_detail?id=1452519.015" title="14525 ...
## [3] <td><a href="establishment.inspection_detail?id=1451959.015" title="14519 ...
## [4] <td><a href="establishment.inspection_detail?id=1451637.015" title="14516 ...
## [5] <td><a href="establishment.inspection_detail?id=1451571.015" title="14515 ...
## [6] <td><a href="establishment.inspection_detail?id=1453178.015" title="14531 ...

The second thing you need to understand: “A B” selects all “B” inside of “A”. Go to level 4 of this interactive tutorial and try typing the answer; you will gain a deeper understanding.

So “td:nth-child(3) a” selects nodes with <a> tags inside the data cells in the third column. Because the third-column cells of the other tables aren’t hyperlinked and don’t include <a> tags, they won’t be selected.
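To see the combinator at work outside the OSHA page, here is a toy sketch with a made-up two-row table; only the hyperlinked third-column cell survives the second selector:

## a toy example (made-up HTML, not the OSHA page)
toy <- read_html("<table>
  <tr><td>a</td><td>b</td><td>01/01/2019 to 12/31/2019</td></tr>
  <tr><td>x</td><td>y</td><td><a title='1452519.015'>1452519</a></td></tr>
</table>")

html_nodes(toy, "td:nth-child(3)")   # matches both third-column cells
html_nodes(toy, "td:nth-child(3) a") # matches only the hyperlinked one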

I strongly recommend that you go through all 32 levels of this fun and interactive CSS selector tutorial. The SelectorGadget Chrome extension is also really useful in getting started with scraping, since it finds the CSS selector based on where you point and click.

Next save the activity numbers to a vector.

act_num <- html_nodes(webpage, 'td:nth-child(3) a') %>% html_attr("title")
length(act_num) ## double check how many activity numbers
## [1] 161
head(act_num) ## check out the first six
## [1] "1452519.015" "1451959.015" "1451637.015" "1451571.015" "1453178.015"
## [6] "1451355.015"

Replace the Activity column with complete activity numbers

# replace the Activity column with the act_num vector
inspections$Activity <- act_num
# check out the first six rows
head(inspections)
##      Activity     Opened     RID St      Type            Sc  SIC  NAICS Vio
## 1 1452519.015 12/20/2019  453710 NC Complaint       Partial 4215 492110  NA
## 2 1451959.015 12/17/2019 1054111 OR   Planned No Insp/Other 4215 492110  NA
## 3 1451637.015 12/17/2019  950633 CA  Accident       Partial   NA 492110  NA
## 4 1451571.015 12/17/2019  524700 OH Complaint       Partial   NA 492110  NA
## 5 1453178.015 12/16/2019  317500 PA   Planned      Complete   NA 492110  NA
## 6 1451355.015 12/16/2019 1055330 WA Complaint       Partial   NA 492110  NA
##                        Establishment Name
## 1      144338 - Directlink Logistics, Inc
## 2    317726467 - Portland Pedal Power Llc
## 3                      Fedex Freight, Inc
## 4                   United Parcel Service
## 5      Fedex Ground Package Systems, Inc.
## 6 Wa317957214 - United Parcel Service Inc

Extract incomplete inspections based on italic style with html_text()

A piece of information is missing in the table above compared to the table on the webpage. A message on the page says “inspections which are known to be incomplete will have the identifying Activity Nr shown in italic”. We want to include that information in our table too.

Inspect the elements and compare italic and non-italic numbers, and you will realize we need to target numbers wrapped in <em> tags. <em> in HTML means the text is displayed in italics. To avoid getting all the <em> tags on the page, “td a em” only selects <em> tags inside <a> tags inside <td> tags, like we explained earlier.

open_cases <- html_nodes(webpage,"td a em") %>%
  html_text()
length(open_cases)
## [1] 54
head(open_cases)
## [1] "1451637.015" "1451571.015" "1451044.015" "1452754.015" "1450975.015"
## [6] "1451657.015"

Create a new column for whether the case is incomplete

We can use the ifelse() function to create a new column that differentiates incomplete from complete cases.

inspections$status <- ifelse(inspections$Activity %in% open_cases, "incomplete", "complete")
inspections %>% head()
##      Activity     Opened     RID St      Type            Sc  SIC  NAICS Vio
## 1 1452519.015 12/20/2019  453710 NC Complaint       Partial 4215 492110  NA
## 2 1451959.015 12/17/2019 1054111 OR   Planned No Insp/Other 4215 492110  NA
## 3 1451637.015 12/17/2019  950633 CA  Accident       Partial   NA 492110  NA
## 4 1451571.015 12/17/2019  524700 OH Complaint       Partial   NA 492110  NA
## 5 1453178.015 12/16/2019  317500 PA   Planned      Complete   NA 492110  NA
## 6 1451355.015 12/16/2019 1055330 WA Complaint       Partial   NA 492110  NA
##                        Establishment Name     status
## 1      144338 - Directlink Logistics, Inc   complete
## 2    317726467 - Portland Pedal Power Llc   complete
## 3                      Fedex Freight, Inc incomplete
## 4                   United Parcel Service incomplete
## 5      Fedex Ground Package Systems, Inc.   complete
## 6 Wa317957214 - United Parcel Service Inc   complete

Recap

All we did today pretty much falls into the following rhythm:

  • step 1: read_html(), read the webpage into R
  • step 2: html_nodes(), pull elements/nodes with chosen tags or attributes, in other words, get the tree branch with the apples you want
  • step 3: extract text/attributes with html_attr()/html_text(), or parse the table with html_table()
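To make the rhythm concrete, here is a condensed sketch that repeats today’s whole scrape in four lines, reusing the url object from the top of this tutorial:

## the whole rhythm in one chunk
webpage <- read_html(url, encoding = "windows-1252")                                   # step 1: read the page into R
tbl <- html_nodes(webpage, "[class='table table-bordered table-striped']")[[2]]        # step 2: pick the branch
inspections <- html_table(tbl)                                                         # step 3: parse the table
inspections$Activity <- html_nodes(webpage, "td:nth-child(3) a") %>% html_attr("title") # step 3: extract attributes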

Dealing with errors

When you are scraping a lot of data, you might be identified as a robot and blocked by the website; you will see an “HTTP 403 Forbidden” error. You can often avoid that by adding a pause between each hit to the website server with Sys.sleep() in your code.

Look for ‘robots.txt’ on the site you’re trying to scrape; sometimes it tells you how many seconds to pause between requests.
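You can peek at that file straight from R; by convention it lives at the site root (a quick sketch):

# read the site's robots.txt, which may include a crawl-delay rule
readLines("https://www.osha.gov/robots.txt")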

In our example, you could change the function to this:

readAct_Num <- function(act_num) {
  url <- paste("https://www.osha.gov/pls/imis/establishment.inspection_detail?id=",act_num, sep="")
  table <- html_table(html_nodes(read_html(url, encoding = "windows-1252"), "[class='tablei table-borderedi']")[[2]])
  colnames(table) <- as.character(unlist(table[1,])) # use the first row as column names
  table <- table[-1,-1] # drop the header row and the first column
  table$activity_number <- act_num
  Sys.sleep(6) # pause before returning; code placed after return() would never run
  return(table)
}
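To actually use the function, loop it over the activity numbers we scraped earlier. A sketch: this makes one request per activity number, pausing six seconds each time, so the 161 detail pages take roughly 16 minutes.

# scrape the detail page for every activity number, one request every six seconds
detail_tables <- lapply(act_num, readAct_Num)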

Using a “User-agent” string that specifies a web browser also helps, because it tells the server you’re visiting with a web browser. You can achieve that with the httr library.

library(httr)
# a user-agent string mimicking a desktop Chrome browser
uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
# start a session that sends that user agent with every request
html_session(url,user_agent(uastring)) %>% html_node('table')
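If you prefer to make the request yourself, a roughly equivalent sketch fetches the page with httr’s GET() and hands the response text to read_html():

# fetch the page with the custom user agent, then parse the response text
resp <- GET(url, user_agent(uastring))
page <- read_html(content(resp, as = "text", encoding = "windows-1252"))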

More on avoiding getting blocked while scraping here.

Also, just in case an error occurs and you lose everything you already scraped, you can add tryCatch() to your scraping function. This way R keeps running when an error occurs and tells you afterwards where the error or warning happened. So in our example I would change the function to:

readAct_Num <- function(act_num) {
  url <- paste("https://www.osha.gov/pls/imis/establishment.inspection_detail?id=",act_num, sep="")
  out <- tryCatch(
    {
      table <- html_table(html_nodes(read_html(url, encoding = "windows-1252"), "[class='tablei table-borderedi']")[[2]])
      colnames(table) <- as.character(unlist(table[1,]))
      table <- table[-1,-1]
      table$activity_number <- act_num
      table # the last expression is the value tryCatch returns on success
    },
    error=function(cond) {
      message(paste("act_num caused an error:", act_num))
      message("Here's the original error message:")
      message(cond)
      # Choose a return value in case of error
      NA
    },
    warning=function(cond) {
      message(paste("act_num caused a warning:", act_num))
      message("Here's the original warning message:")
      message(cond)
      NULL
    },
    # finally runs whether or not an error occurred; note that code
    # placed after return() would never be reached
    finally = Sys.sleep(6)
  )
  return(out)
}
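Once that loop finishes, you can keep only the successful scrapes (the data frames, dropping the NA and NULL placeholders) and stack them. A sketch, assuming the detail tables share column names:

# run the safer scraper, then combine only the results that are data frames
detail_tables <- lapply(act_num, readAct_Num)
detail_df <- bind_rows(Filter(is.data.frame, detail_tables))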

Getting data to show on one page

A lot of data is displayed across multiple pages; some tweaking of the URL can help you get all the results to show up at once. Here’s a walk-through of how I got the URL we used in the scraping.

So you want to look up the OSHA violations in the messenger courier industry in 2019. You searched NAICS code “492110” here, and you got 161 results split across 9 pages, with 20 results per page for the first 8 pages.

Original URL:

https://www.osha.gov/pls/imis/industry.search?p_logger=1&sic=&naics=492110&State=All&officetype=All&Office=All&endmonth=01&endday=01&endyear=2019&startmonth=12&startday=31&startyear=2019&owner=&scope=&FedAgnCode=

When you click page 2, the URL becomes:

https://www.osha.gov/pls/imis/industry.search?sic=&sicgroup=&naicsgroup=&naics=492110&state=All&officetype=All&office=All&startmonth=12&startday=31&startyear=2019&endmonth=01&endday=01&endyear=2019&opt=&optt=&scope=&fedagncode=&owner=&emph=&emphtp=&p_start=&p_finish=20&p_sort=&p_desc=DESC&p_direction=Next&p_show=20

Click back to page 1, and the URL becomes:

https://www.osha.gov/pls/imis/industry.search?sic=&sicgroup=&naicsgroup=&naics=492110&state=All&officetype=All&office=All&startmonth=12&startday=31&startyear=2019&endmonth=01&endday=01&endyear=2019&opt=&optt=&scope=&fedagncode=&owner=&emph=&emphtp=&p_start=&p_finish=0&p_sort=&p_desc=DESC&p_direction=Next&p_show=20

Change “p_show=” to 200, and now you can retrieve all the results on one page.

https://www.osha.gov/pls/imis/industry.search?sic=&sicgroup=&naicsgroup=&naics=492110&state=All&officetype=All&office=All&startmonth=12&startday=31&startyear=2019&endmonth=01&endday=01&endyear=2019&opt=&optt=&scope=&fedagncode=&owner=&emph=&emphtp=&p_start=&p_finish=0&p_sort=&p_desc=DESC&p_direction=Next&p_show=200
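If a site caps the page size and you can’t show everything at once, the same pattern works in reverse: loop over pages by stepping the offset parameter. A sketch, assuming p_finish behaves as observed above (offsets 0, 20, …, 160 for the 9 pages):

# fetch each of the 9 pages by stepping p_finish in increments of 20
offsets <- seq(0, 160, by = 20)
pages <- lapply(offsets, function(offset) {
  page_url <- paste0(
    "https://www.osha.gov/pls/imis/industry.search?sic=&sicgroup=&naicsgroup=&naics=492110",
    "&state=All&officetype=All&office=All&startmonth=12&startday=31&startyear=2019",
    "&endmonth=01&endday=01&endyear=2019&opt=&optt=&scope=&fedagncode=&owner=&emph=&emphtp=",
    "&p_start=&p_sort=&p_desc=DESC&p_direction=Next&p_show=20&p_finish=", offset)
  Sys.sleep(6) # pause between requests
  page <- read_html(page_url, encoding = "windows-1252")
  html_table(html_nodes(page, "[class='table table-bordered table-striped']")[[2]])
})
all_pages <- do.call(rbind, pages)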

Homework

Download all the comment letters for this rule. Hint: you will need the download.file() function.
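To get you started, here is the shape of a download.file() call; the URL and destination below are hypothetical placeholders, not the actual comment letters:

# hypothetical placeholder URL, not a real comment letter
letter_url <- "https://www.regulations.gov/hypothetical-comment-letter.pdf"
download.file(letter_url, destfile = "comment_letter_1.pdf", mode = "wb") # mode "wb" for binary files like PDFs
# for many letters, loop over a vector of URLs:
# for (u in letter_urls) download.file(u, destfile = basename(u), mode = "wb")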