Web scraping with R

This is a tutorial that serves as an intro to scraping with R using Rvest library. It walks through how to scrape a table of OSHA inspections, extract information based on italic style of the text, scrape hyperlinked tables. It also offers tips on how to deal with errors and avoid being blocked as a robot.

A quick intro to webpage

Webpages usually consist of: * HTML (HyperText Markup Language) files, which build structure of the page * CSS (Cascading Style Sheet) files, which define the style or look of the page * JavaScript files, which make the page interactive

An HTML file is a text file with HTML tags, which are reserved keywords for certain elements and they remind your your web browswer, “hey, here’s a table/paragraph/list, please display it as a table/paragraph/list”. And most tags must be in pairs, an opening tag and a closing tag, i.e. <table></table>, <p></p>, <li></li>.

These tags can have attributes such as: * hyperlinks: <a href='https://www.osha.gov/'>Occupational Safety and Health</a> * class: <table class='table table-bordered'> * id: <h1 id='myHeader'>Hello World!</h1>

You can learn more about HTML tags here

Inspect elements

To extract the elements we want from the html document, we need to find the right tags and attributes in the source code.

Click here to visit the page we’re going to scrape. Place your mouse on “Activity” column of the table, right click on the page and click “Inspect”. “Elements” tab highlights where your mouse placed. “Sources” tab shows the entire html file.

The HTML code looks like a tree. It has two main branches: <head>, <body>, each has consist of subbranch <div>s, (think of <div>as a section). Within each <div>, there can be more <div>s, or <table> and other elements. Within a <table>, there is <thead> (table head), <tbody> (table body), <tr> (table row), <th> (header cell), <td> (data cell).

Now let’s scrape a table!

First load the libraries we need.

## install the package
#install.packages("rvest")
#install.packages("dplyr")
## load the package
library(rvest)
library(dplyr)

This webpage has 161 osha citations in the messenger courier industry in 2019.

#url of website to be scrapped
url <- "https://www.osha.gov/pls/imis/industry.search?sic=&sicgroup=&naicsgroup=&naics=492110&state=All&officetype=All&office=All&startmonth=12&startday=31&startyear=2019&endmonth=01&endday=01&endyear=2019&opt=&optt=&scope=&fedagncode=&owner=&emph=&emphtp=&p_start=&p_finish=0&p_sort=&p_desc=DESC&p_direction=Next&p_show=200"

read_html(): read the webpage/html document into R

#read the html content into R and assigns to webpage object
webpage <- read_html(url, encoding = "windows-1252")
webpage

## {html_document}
## <html lang="en-us">
## [1] <head>\n<meta charset="utf-8">\n<title>Industry SIC Search Results Page | ...
## [2] <body>\n<div class="block block-gtranslate block-gtranslate block-gtransl ...

Tip: to find the right encoding, run “document.inputEncoding” in the console tab.

Character encoding is a method of converting bytes into characters. To validate or display an HTML document properly, a program must choose a proper character encoding. You can read more about in this post.

html_nodes(): select elements/nodes from the html

We can select certain elements in the html document, or “nodes”, by picking out certain feature, we do that by passing on what is called “CSS selector” to the html_nodes() function. You can also pass on “Xpath” but we’re not covering it today.

The following line is telling R to pull nodes with “table” tags.

html_nodes(webpage,"table")

## {xml_nodeset (3)}
## [1] <table class="table table-bordered table-striped">\n<tr>\n<th>SIC</th>\n< ...
## [2] <table class="table table-bordered"><tr class="breadcrumb">\n<td>\n<stron ...
## [3] <table class="table table-bordered table-striped">\n<thead><tr>\n<th> </t ...

It can also choose elements based on attributes. Here we pass on the value of class attribute of the table we want to scrape. There are two tables with the same class attribute. The table we want is the second node in the returned nodeset.

html_nodes(webpage,"[class='table table-bordered table-striped']")[[2]]

## {html_node}
## <table class="table table-bordered table-striped">
##  [1] <thead><tr>\n<th> </th>\n<th>#</th>\n<th>Activity</th>\n<th>Opened</th>\ ...
##  [2] <tr>\n<td><input type="checkbox" name="id" value="1452519.015"></td>\n<t ...
##  [3] <tr>\n<td><input type="checkbox" name="id" value="1451959.015"></td>\n<t ...
##  [4] <tr>\n<td><input type="checkbox" name="id" value="1451637.015"></td>\n<t ...
##  [5] <tr>\n<td><input type="checkbox" name="id" value="1451571.015"></td>\n<t ...
##  [6] <tr>\n<td><input type="checkbox" name="id" value="1453178.015"></td>\n<t ...
##  [7] <tr>\n<td><input type="checkbox" name="id" value="1451355.015"></td>\n<t ...
##  [8] <tr>\n<td><input type="checkbox" name="id" value="1451044.015"></td>\n<t ...
##  [9] <tr>\n<td><input type="checkbox" name="id" value="1452754.015"></td>\n<t ...
## [10] <tr>\n<td><input type="checkbox" name="id" value="1450975.015"></td>\n<t ...
## [11] <tr>\n<td><input type="checkbox" name="id" value="1451657.015"></td>\n<t ...
## [12] <tr>\n<td><input type="checkbox" name="id" value="1449936.015"></td>\n<t ...
## [13] <tr>\n<td><input type="checkbox" name="id" value="1451983.015"></td>\n<t ...
## [14] <tr>\n<td><input type="checkbox" name="id" value="1450424.015"></td>\n<t ...
## [15] <tr>\n<td><input type="checkbox" name="id" value="1448259.015"></td>\n<t ...
## [16] <tr>\n<td><input type="checkbox" name="id" value="1447908.015"></td>\n<t ...
## [17] <tr>\n<td><input type="checkbox" name="id" value="1447553.015"></td>\n<t ...
## [18] <tr>\n<td><input type="checkbox" name="id" value="1451336.015"></td>\n<t ...
## [19] <tr>\n<td><input type="checkbox" name="id" value="1447118.015"></td>\n<t ...
## [20] <tr>\n<td><input type="checkbox" name="id" value="1447292.015"></td>\n<t ...
## ...

html_table(): parse the table

After we get the node with the table we’re trying to target, we can parse it with html_table() function.

inspections <- html_nodes(webpage,"[class='table table-bordered table-striped']")[[2]] %>% html_table()

inspections <- inspections[,-c(1:2)] ## remove the first two columns. one is empty, the other is useless.

head(inspections)

##   Activity     Opened     RID St      Type            Sc  SIC  NAICS Vio
## 1  1452519 12/20/2019  453710 NC Complaint       Partial 4215 492110  NA
## 2  1451959 12/17/2019 1054111 OR   Planned No Insp/Other 4215 492110  NA
## 3  1451637 12/17/2019  950633 CA  Accident       Partial   NA 492110  NA
## 4  1451571 12/17/2019  524700 OH Complaint       Partial   NA 492110  NA
## 5  1453178 12/16/2019  317500 PA   Planned      Complete   NA 492110  NA
## 6  1451355 12/16/2019 1055330 WA Complaint       Partial   NA 492110  NA
##                        Establishment Name
## 1      144338 - Directlink Logistics, Inc
## 2    317726467 - Portland Pedal Power Llc
## 3                      Fedex Freight, Inc
## 4                   United Parcel Service
## 5      Fedex Ground Package Systems, Inc.
## 6 Wa317957214 - United Parcel Service Inc

Save the table

If you are happy with this table, you can save it locally as a csv file.

#write.csv(inspections, "~/Desktop/nicar2020/nicar_2020_scraping_r/inspections.csv")

Extract activity numbers with html_attr()

In the scraped table, Activity column don’t have decimal places. Let’s scrape the complete activity numbers and replace the Activity column.

What CSS selector do we use to target the nodes (elements) with activity numbers?

Activity numbers appear as the “title” attribute in the <a> tags, for example: <a href="establishment.inspection_detail?id=1452519.015" title="1452519.015"> (Yes, there can be multiple attributes for a tag.) <a> tags in HTML are reserved for hyperlinks, so we will want nodes with <a> tags for sure, but not all of them. We want <a> tags: * in

tags * in the third column of a table

First, “:nth-child(A)” selects the nth child element in another element. What appears before the colon defines the type of chile element and parent element. As explained in level 18 of this interactive game-like CSS tutorial “:nth-child(8)” selects every element that is the 8th child of another element. “div p:nth-child(2)” selects the second p in every div.

“td:nth-child(3)” selects nodes in cells in third columns of any table on the page.

html_nodes(webpage, 'td:nth-child(3)') %>% head()

## {xml_nodeset (6)}
## [1] <td>01/01/2019 to 12/31/2019</td>\n
## [2] <td><a href="establishment.inspection_detail?id=1452519.015" title="14525 ...
## [3] <td><a href="establishment.inspection_detail?id=1451959.015" title="14519 ...
## [4] <td><a href="establishment.inspection_detail?id=1451637.015" title="14516 ...
## [5] <td><a href="establishment.inspection_detail?id=1451571.015" title="14515 ...
## [6] <td><a href="establishment.inspection_detail?id=1453178.015" title="14531 ...

Next, “A B” selects “all B inside of A”, as shown in level 4 of this interactive tutorial. So “td:nth-child(3) a” selects nodes with <a> tags inside the data cells in the third column, and because the other table’s third column data isn’t hyperlinked and doesn’t include <a> tags, it won’t be selected.

SelectorGaght Chrome extention is really useful in finding the CSS selector based on your point and clicks. I also strongly recommend that you go through this fun and interactive CSS selector tutorial.

Next save the activity numbers to a vector.

act_num <- html_nodes(webpage, 'td:nth-child(3) a') %>% html_attr("title")
length(act_num) ## double check how many activity numbers

## [1] 161

head(act_num) ## check out the first six

## [1] "1452519.015" "1451959.015" "1451637.015" "1451571.015" "1453178.015"
## [6] "1451355.015"

Replace the Activity column with complete activity numbers

# replace the Activity column with the act_num vector
inspections$Activity <- act_num
# check out the first six rows
head(inspections)

##      Activity     Opened     RID St      Type            Sc  SIC  NAICS Vio
## 1 1452519.015 12/20/2019  453710 NC Complaint       Partial 4215 492110  NA
## 2 1451959.015 12/17/2019 1054111 OR   Planned No Insp/Other 4215 492110  NA
## 3 1451637.015 12/17/2019  950633 CA  Accident       Partial   NA 492110  NA
## 4 1451571.015 12/17/2019  524700 OH Complaint       Partial   NA 492110  NA
## 5 1453178.015 12/16/2019  317500 PA   Planned      Complete   NA 492110  NA
## 6 1451355.015 12/16/2019 1055330 WA Complaint       Partial   NA 492110  NA
##                        Establishment Name
## 1      144338 - Directlink Logistics, Inc
## 2    317726467 - Portland Pedal Power Llc
## 3                      Fedex Freight, Inc
## 4                   United Parcel Service
## 5      Fedex Ground Package Systems, Inc.
## 6 Wa317957214 - United Parcel Service Inc

Extract incomplete inspections based on italic style with html_text()

The webpage says “inspections which are known to be incomplete will have the identifying Activity Nr shown in italic”. We want to include that information in our table too.

Inspect elements and compare italic and non-italic numbers, we realize we need to target numbers wrapped in <em> tags. <em> in HTML means the text is displayed in italic. To avoid getting all <em> tags on the page, “td a em” only selects <em> tags inside <a> tags inside <td> tags.

open_cases <- html_nodes(webpage,"td a em") %>%
  html_text()
length(open_cases)

## [1] 54

head(open_cases)

## [1] "1451637.015" "1451571.015" "1451044.015" "1452754.015" "1450975.015"
## [6] "1451657.015"

Create a new column for whether the case is incomplete

inspections$status <- ifelse(inspections$Activity %in% open_cases, "incomplete", "complete")
inspections %>% head()

##      Activity     Opened     RID St      Type            Sc  SIC  NAICS Vio
## 1 1452519.015 12/20/2019  453710 NC Complaint       Partial 4215 492110  NA
## 2 1451959.015 12/17/2019 1054111 OR   Planned No Insp/Other 4215 492110  NA
## 3 1451637.015 12/17/2019  950633 CA  Accident       Partial   NA 492110  NA
## 4 1451571.015 12/17/2019  524700 OH Complaint       Partial   NA 492110  NA
## 5 1453178.015 12/16/2019  317500 PA   Planned      Complete   NA 492110  NA
## 6 1451355.015 12/16/2019 1055330 WA Complaint       Partial   NA 492110  NA
##                        Establishment Name     status
## 1      144338 - Directlink Logistics, Inc   complete
## 2    317726467 - Portland Pedal Power Llc   complete
## 3                      Fedex Freight, Inc incomplete
## 4                   United Parcel Service incomplete
## 5      Fedex Ground Package Systems, Inc.   complete
## 6 Wa317957214 - United Parcel Service Inc   complete

Get the tables hidden behind the hyperlinks

So far we have scraped the inspection table on this page, but for inspections where violations were found, there’s more information on the cases’ hyperlinked pages, and that can be scraped too.

Test things out one inspection with violations

url2 <- "https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1447292.015"
violation <- html_table(html_nodes(read_html(url2, encoding = "windows-1252"), "[class='tablei table-borderedi']")[[2]]) %>% data.frame()
violation

##   Violation.Items Violation.Items.1 Violation.Items.2 Violation.Items.3
## 1              NA                 #                ID              Type
## 2              NA                1.             01001           Serious
## 3              NA                2.             02002             Other
## 4              NA                3.             02003             Other
## 5              NA                4.             02004             Other
##           Violation.Items.4 Violation.Items.5 Violation.Items.6
## 1                  Standard          Issuance             Abate
## 2 OAR 437-002-0221(13)(A)-1        12/18/2019        12/30/2019
## 3  OAR 437-002-0041(3)(A)-2        12/18/2019        01/06/2020
## 4    OAR 437-002-0221(4)(A)        12/18/2019        01/06/2020
## 5     OAR 437-001-0765(1)-3        12/18/2019        01/06/2020
##   Violation.Items.7 Violation.Items.8 Violation.Items.9 Violation.Items.10
## 1             Curr$             Init$              Fta$            Contest
## 2              $175              $175                $0                   
## 3                $0                $0                $0                   
## 4                $0                $0                $0                   
## 5                $0                $0                $0                   
##   Violation.Items.11
## 1          LastEvent
## 2         Z - Issued
## 3         Z - Issued
## 4         Z - Issued
## 5         Z - Issued

The column names appear this way because the table has a merged cell as header. Fix the column names by making the first row of the dataframe into column names.

#make first row the column names
colnames(violation) <- as.character(unlist(violation[1,]))
#delete the first row because it's now duplicate with column names
violation <- violation[-1,]
violation

##   NA  #    ID    Type                  Standard   Issuance      Abate Curr$
## 2 NA 1. 01001 Serious OAR 437-002-0221(13)(A)-1 12/18/2019 12/30/2019  $175
## 3 NA 2. 02002   Other  OAR 437-002-0041(3)(A)-2 12/18/2019 01/06/2020    $0
## 4 NA 3. 02003   Other    OAR 437-002-0221(4)(A) 12/18/2019 01/06/2020    $0
## 5 NA 4. 02004   Other     OAR 437-001-0765(1)-3 12/18/2019 01/06/2020    $0
##   Init$ Fta$ Contest  LastEvent
## 2  $175   $0         Z - Issued
## 3    $0   $0         Z - Issued
## 4    $0   $0         Z - Issued
## 5    $0   $0         Z - Issued

Create a function for all inspections with violations.

The pages of inspections have consistent urls: “https://www.osha.gov/pls/imis/establishment.inspection_detail?id=”, followed by the activity number. So in the function we just need to stich together that pattern with the individual activity number.

readAct_Num <- function(act_num) {
  url <- paste("https://www.osha.gov/pls/imis/establishment.inspection_detail?id=",act_num, sep="")
  # stitch together the pattern and each activity number, leave no space in between
  violation <- html_table(html_nodes(read_html(url, encoding = "windows-1252"), "[class='tablei table-borderedi']")[[2]]) %>% data.frame()
  colnames(violation) <- as.character(unlist(violation[1,]))
  violation <- violation[-1,]
  violation$activity_number <- act_num # create a column for the activity number for identifying
  return(violation) # return the violation table
}

Apply the function to all activity numbers that have violations.

## lapply(inspections$Activity[!is.na(inspections$Vio)], readAct_Num) %>% bind_rows() %>% unique() %>% head()

After running it you will realize the first column is always empty. And since the second column is also useless, so we can delete the first two columns in the violations table in the function.

readAct_Num <- function(act_num) {
  url <- paste("https://www.osha.gov/pls/imis/establishment.inspection_detail?id=",act_num, sep="")
  table <- html_table(html_nodes(read_html(url, encoding = "windows-1252"), "[class='tablei table-borderedi']")[[2]]) %>% data.frame()
  colnames(table) <- as.character(unlist(table[1,]))
  table <- table[-1,-c(1:2)] ## deletes the first two columns
  table$activity_number <- act_num
  return(table)
}

Apply the new function to all activity numbers that have violations.

## save to an object called violations
violations <- lapply(inspections$Activity[!is.na(inspections$Vio)], readAct_Num) %>% bind_rows() %>% unique()
violations %>% head()

##      ID    Type                  Standard   Issuance      Abate  Curr$  Init$
## 1 01001 Serious OAR 437-002-0221(13)(A)-1 12/18/2019 12/30/2019   $175   $175
## 2 02002   Other  OAR 437-002-0041(3)(A)-2 12/18/2019 01/06/2020     $0     $0
## 3 02003   Other    OAR 437-002-0221(4)(A) 12/18/2019 01/06/2020     $0     $0
## 4 02004   Other     OAR 437-001-0765(1)-3 12/18/2019 01/06/2020     $0     $0
## 5 01001   Other            19100178 M05 I 01/09/2020                $0     $0
## 6 01001   Other                   3395(I) 01/21/2020 02/24/2020 $1,275 $1,275
##   Fta$ Contest  LastEvent activity_number
## 1   $0         Z - Issued     1447292.015
## 2   $0         Z - Issued     1447292.015
## 3   $0         Z - Issued     1447292.015
## 4   $0         Z - Issued     1447292.015
## 5   $0         Z - Issued     1447493.015
## 6   $0         Z - Issued     1445680.015

Now we can merge the violations information with the inspections table

merge(inspections, violations, by.x = "Activity", by.y = "activity_number", all.x = TRUE) %>% head()

Recap

step 1: read_html(), read the webpage into R
step 2: html_nodes(), pull elements/nodes with chosen tags or attributes
step 3: extract text/attributes with html_attr()/html_text(), or parse table with html_table().

Dealing with errors

When the data you scrape is big, you might be identified as a robot and blocked by the website. You will see “Http 403 Forbidden Error”. You might avoid that by adding a pause between each time your hit the website server by adding Sys.sleep() in your code.

Look for ‘robots.txt’ on the site you’re trying to scrape. Sometimes it tells you how many seconds you need to pause.

In our example, you could change the function to this:

readAct_Num <- function(act_num) {
  url <- paste("https://www.osha.gov/pls/imis/establishment.inspection_detail?id=",act_num, sep="")
  table <- html_table(html_nodes(read_html(url, encoding = "windows-1252"), "[class='tablei table-borderedi']")[[2]])
  colnames(table) <- as.character(unlist(table[1,]))
  table <- table[-1,-1]
  table$activity_number <- act_num
  return(table)
  Sys.sleep(6)
}

Using a “User-agent” and specifying a web browser also helps because it’s telling the server you’re visint by a web browswer. You can achieve that with httr library.

library(httr)
uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
html_session(url,user_agent(uastring)) %>% html_node('table')

More on avoid getting blocked while scraping here.

Also, just in case an error occurs and you lose everything you already scraped, you can add tryCatch() to your scraping function. This way R keeps running when error occurs and tells you where error or warning is later. So in our example I would change the function to:

readAct_Num <- function(act_num) {
  url <- paste("https://www.osha.gov/pls/imis/establishment.inspection_detail?id=",act_num, sep="")
  out <- tryCatch(
    {
      table <- html_table(html_nodes(read_html(url, encoding = "windows-1252"), "[class='tablei table-borderedi']")[[2]])
      colnames(table) <- as.character(unlist(table[1,]))
      table <- table[-1,-1]
      table$activity_number <- act_num
      return(table)
      Sys.sleep(6)
    },
    error=function(cond) {
      message(paste("act_num caused an error:", act_num))
      message("Here's the original error message:")
      message(cond)
      # Choose a return value in case of error
      return(NA)
      Sys.sleep(6)
    },
    warning=function(cond) {
      message(paste("act_num caused a warning:", act_num))
      message("Here's the original warning message:")
      message(cond)
      return(NULL)
      Sys.sleep(6)
    }
  )
  return(out)
}

Geting data to show in one page

A lot of data are displayed on multiple pages, some tweaking of the URL helps you get all results to show up all at once. Here’s a walk-through of how I got the URL we used in the scraping.

So you want to look up the OSHA violations in the messenger courier industry in 2019, and you searched NAICS code “492110” here, and you got 161 results split up to 9 pages, 20 results each page for the first 8 pages.

Original url:

https://www.osha.gov/pls/imis/industry.search?p_logger=1&sic=&naics=492110&State=All&officetype=All&Office=All&endmonth=01&endday=01&endyear=2019&startmonth=12&startday=31&startyear=2019&owner=&scope=&FedAgnCode=

When you click page 2, the url becomes:

https://www.osha.gov/pls/imis/industry.search?sic=&sicgroup=&naicsgroup=&naics=492110&state=All&officetype=All&office=All&startmonth=12&startday=31&startyear=2019&endmonth=01&endday=01&endyear=2019&opt=&optt=&scope=&fedagncode=&owner=&emph=&emphtp=&p_start=&p_finish=20&p_sort=&p_desc=DESC&p_direction=Next&p_show=20

Click back to page 1, the url becomes:

https://www.osha.gov/pls/imis/industry.search?sic=&sicgroup=&naicsgroup=&naics=492110&state=All&officetype=All&office=All&startmonth=12&startday=31&startyear=2019&endmonth=01&endday=01&endyear=2019&opt=&optt=&scope=&fedagncode=&owner=&emph=&emphtp=&p_start=&p_finish=0&p_sort=&p_desc=DESC&p_direction=Next&p_show=20

Change “show=” to 200, and now you can retrieve all results in one page.

https://www.osha.gov/pls/imis/industry.search?sic=&sicgroup=&naicsgroup=&naics=492110&state=All&officetype=All&office=All&startmonth=12&startday=31&startyear=2019&endmonth=01&endday=01&endyear=2019&opt=&optt=&scope=&fedagncode=&owner=&emph=&emphtp=&p_start=&p_finish=0&p_sort=&p_desc=DESC&p_direction=Next&p_show=200

Homework

Download all the comment letters for this rule? Hint: you will need download.file() function.

nicar20_scraping_r_table

YHan

12/28/2019