This tutorial serves as an introduction to scraping with R using the rvest library. It walks through how to scrape a table of OSHA inspections, extract information based on the italic style of the text, and scrape hyperlinked tables. It also offers tips on how to deal with errors and avoid being blocked as a robot.
Webpages usually consist of:

* HTML (HyperText Markup Language) files, which build the structure of the page
* CSS (Cascading Style Sheets) files, which define the style or look of the page
* JavaScript files, which make the page interactive
An HTML file is a text file with HTML tags, which are reserved keywords for certain elements; they tell your web browser, “hey, here’s a table/paragraph/list, please display it as a table/paragraph/list”. Most tags come in pairs, an opening tag and a closing tag, e.g. <table></table>, <p></p>, <li></li>.
These tags can have attributes such as:

* hyperlinks: <a href='https://www.osha.gov/'>Occupational Safety and Health</a>
* class: <table class='table table-bordered'>
* id: <h1 id='myHeader'>Hello World!</h1>
You can learn more about HTML tags here.
To extract the elements we want from the HTML document, we need to find the right tags and attributes in the source code.
Click here to visit the page we’re going to scrape. Place your mouse on the “Activity” column of the table, right-click on the page and click “Inspect”. The “Elements” tab highlights the element your mouse is placed on. The “Sources” tab shows the entire HTML file.
The HTML code looks like a tree. It has two main branches, <head> and <body>, each of which consists of subbranch <div>s (think of a <div> as a section). Within each <div>, there can be more <div>s, or a <table> and other elements. Within a <table>, there is <thead> (table head), <tbody> (table body), <tr> (table row), <th> (header cell), and <td> (data cell).
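Here’s a minimal sketch of how R sees that tree; the HTML string is made up for illustration, and it uses rvest, which we load in the next step:
library(rvest)
## a tiny, made-up HTML document
doc <- read_html("<table class='table'><tr><th>Activity</th></tr><tr><td><a href='detail?id=1'>1</a></td></tr></table>")
## print the nested tag tree
xml2::xml_structure(doc)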
First load the libraries we need.
## install the package
#install.packages("rvest")
#install.packages("dplyr")
## load the package
library(rvest)
library(dplyr)
This webpage has 161 OSHA inspections in the messenger courier industry in 2019.
#url of the website to be scraped
url <- "https://www.osha.gov/pls/imis/industry.search?sic=&sicgroup=&naicsgroup=&naics=492110&state=All&officetype=All&office=All&startmonth=12&startday=31&startyear=2019&endmonth=01&endday=01&endyear=2019&opt=&optt=&scope=&fedagncode=&owner=&emph=&emphtp=&p_start=&p_finish=0&p_sort=&p_desc=DESC&p_direction=Next&p_show=200"
#read the html content into R and assign it to the webpage object
webpage <- read_html(url, encoding = "windows-1252")
webpage
## {html_document}
## <html lang="en-us">
## [1] <head>\n<meta charset="utf-8">\n<title>Industry SIC Search Results Page | ...
## [2] <body>\n<div class="block block-gtranslate block-gtranslate block-gtransl ...
Tip: to find the right encoding, run “document.inputEncoding” in the console tab.
Character encoding is a method of converting bytes into characters. To validate or display an HTML document properly, a program must choose a proper character encoding. You can read more about it in this post.
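You can also guess the encoding from within R. A sketch, assuming the stringi package is installed (it’s not part of this tutorial’s toolkit, just one way to check):
con <- url(url, "rb")
raw_bytes <- readBin(con, "raw", n = 50000)  ## first 50,000 bytes of the page
close(con)
stringi::stri_enc_detect(raw_bytes)  ## ranks likely encodings with confidence scores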
We can select certain elements in the HTML document, or “nodes”, by picking out certain features. We do that by passing what is called a “CSS selector” to the html_nodes() function. You can also pass an “XPath” expression, but we’re not covering that today.
The following line is telling R to pull nodes with “table” tags.
html_nodes(webpage,"table")
## {xml_nodeset (3)}
## [1] <table class="table table-bordered table-striped">\n<tr>\n<th>SIC</th>\n< ...
## [2] <table class="table table-bordered"><tr class="breadcrumb">\n<td>\n<stron ...
## [3] <table class="table table-bordered table-striped">\n<thead><tr>\n<th> </t ...
html_nodes() can also choose elements based on attributes. Here we pass the value of the class attribute of the table we want to scrape. There are two tables with the same class attribute; the table we want is the second node in the returned nodeset.
html_nodes(webpage,"[class='table table-bordered table-striped']")[[2]]
## {html_node}
## <table class="table table-bordered table-striped">
## [1] <thead><tr>\n<th> </th>\n<th>#</th>\n<th>Activity</th>\n<th>Opened</th>\ ...
## [2] <tr>\n<td><input type="checkbox" name="id" value="1452519.015"></td>\n<t ...
## [3] <tr>\n<td><input type="checkbox" name="id" value="1451959.015"></td>\n<t ...
## [4] <tr>\n<td><input type="checkbox" name="id" value="1451637.015"></td>\n<t ...
## [5] <tr>\n<td><input type="checkbox" name="id" value="1451571.015"></td>\n<t ...
## [6] <tr>\n<td><input type="checkbox" name="id" value="1453178.015"></td>\n<t ...
## [7] <tr>\n<td><input type="checkbox" name="id" value="1451355.015"></td>\n<t ...
## [8] <tr>\n<td><input type="checkbox" name="id" value="1451044.015"></td>\n<t ...
## [9] <tr>\n<td><input type="checkbox" name="id" value="1452754.015"></td>\n<t ...
## [10] <tr>\n<td><input type="checkbox" name="id" value="1450975.015"></td>\n<t ...
## [11] <tr>\n<td><input type="checkbox" name="id" value="1451657.015"></td>\n<t ...
## [12] <tr>\n<td><input type="checkbox" name="id" value="1449936.015"></td>\n<t ...
## [13] <tr>\n<td><input type="checkbox" name="id" value="1451983.015"></td>\n<t ...
## [14] <tr>\n<td><input type="checkbox" name="id" value="1450424.015"></td>\n<t ...
## [15] <tr>\n<td><input type="checkbox" name="id" value="1448259.015"></td>\n<t ...
## [16] <tr>\n<td><input type="checkbox" name="id" value="1447908.015"></td>\n<t ...
## [17] <tr>\n<td><input type="checkbox" name="id" value="1447553.015"></td>\n<t ...
## [18] <tr>\n<td><input type="checkbox" name="id" value="1451336.015"></td>\n<t ...
## [19] <tr>\n<td><input type="checkbox" name="id" value="1447118.015"></td>\n<t ...
## [20] <tr>\n<td><input type="checkbox" name="id" value="1447292.015"></td>\n<t ...
## ...
After we get the node with the table we’re trying to target, we can parse it with the html_table() function.
inspections <- html_nodes(webpage,"[class='table table-bordered table-striped']")[[2]] %>% html_table()
inspections <- inspections[,-c(1:2)] ## remove the first two columns. one is empty, the other is useless.
head(inspections)
## Activity Opened RID St Type Sc SIC NAICS Vio
## 1 1452519 12/20/2019 453710 NC Complaint Partial 4215 492110 NA
## 2 1451959 12/17/2019 1054111 OR Planned No Insp/Other 4215 492110 NA
## 3 1451637 12/17/2019 950633 CA Accident Partial NA 492110 NA
## 4 1451571 12/17/2019 524700 OH Complaint Partial NA 492110 NA
## 5 1453178 12/16/2019 317500 PA Planned Complete NA 492110 NA
## 6 1451355 12/16/2019 1055330 WA Complaint Partial NA 492110 NA
## Establishment Name
## 1 144338 - Directlink Logistics, Inc
## 2 317726467 - Portland Pedal Power Llc
## 3 Fedex Freight, Inc
## 4 United Parcel Service
## 5 Fedex Ground Package Systems, Inc.
## 6 Wa317957214 - United Parcel Service Inc
If you are happy with this table, you can save it locally as a csv file.
#write.csv(inspections, "~/Desktop/nicar2020/nicar_2020_scraping_r/inspections.csv")
In the scraped table, the Activity column doesn’t include the decimal part of the activity numbers. Let’s scrape the complete activity numbers and replace the Activity column.
What CSS selector do we use to target the nodes (elements) with activity numbers?
Activity numbers appear as the “title” attribute in the <a> tags, for example: <a href="establishment.inspection_detail?id=1452519.015" title="1452519.015"> (Yes, there can be multiple attributes for a tag.) <a> tags in HTML are reserved for hyperlinks, so we will want nodes with <a> tags for sure, but not all of them. We want <a> tags:

* inside <td> tags
* in the third column of a table
First, “:nth-child(A)” selects the Ath child element in another element. What appears before the colon defines the type of the child element (and of the parent element, if one is given). As explained in level 18 of this interactive game-like CSS tutorial, “:nth-child(8)” selects every element that is the 8th child of another element, and “div p:nth-child(2)” selects the second <p> in every <div>.
“td:nth-child(3)” selects the <td> cells that are the third child of their row, i.e. the cells in the third column of any table on the page.
html_nodes(webpage, 'td:nth-child(3)') %>% head()
## {xml_nodeset (6)}
## [1] <td>01/01/2019 to 12/31/2019</td>\n
## [2] <td><a href="establishment.inspection_detail?id=1452519.015" title="14525 ...
## [3] <td><a href="establishment.inspection_detail?id=1451959.015" title="14519 ...
## [4] <td><a href="establishment.inspection_detail?id=1451637.015" title="14516 ...
## [5] <td><a href="establishment.inspection_detail?id=1451571.015" title="14515 ...
## [6] <td><a href="establishment.inspection_detail?id=1453178.015" title="14531 ...
Next, “A B” selects “all B inside of A”, as shown in level 4 of this interactive tutorial. So “td:nth-child(3) a” selects nodes with <a> tags inside the data cells in the third column; because the other tables’ third-column cells aren’t hyperlinked and don’t include <a> tags, they won’t be selected.
The SelectorGadget Chrome extension is really useful for finding the CSS selector based on your point-and-clicks. I also strongly recommend that you go through this fun and interactive CSS selector tutorial.
Next save the activity numbers to a vector.
act_num <- html_nodes(webpage, 'td:nth-child(3) a') %>% html_attr("title")
length(act_num) ## double check how many activity numbers
## [1] 161
head(act_num) ## check out the first six
## [1] "1452519.015" "1451959.015" "1451637.015" "1451571.015" "1453178.015"
## [6] "1451355.015"
# replace the Activity column with the act_num vector
inspections$Activity <- act_num
# check out the first six rows
head(inspections)
## Activity Opened RID St Type Sc SIC NAICS Vio
## 1 1452519.015 12/20/2019 453710 NC Complaint Partial 4215 492110 NA
## 2 1451959.015 12/17/2019 1054111 OR Planned No Insp/Other 4215 492110 NA
## 3 1451637.015 12/17/2019 950633 CA Accident Partial NA 492110 NA
## 4 1451571.015 12/17/2019 524700 OH Complaint Partial NA 492110 NA
## 5 1453178.015 12/16/2019 317500 PA Planned Complete NA 492110 NA
## 6 1451355.015 12/16/2019 1055330 WA Complaint Partial NA 492110 NA
## Establishment Name
## 1 144338 - Directlink Logistics, Inc
## 2 317726467 - Portland Pedal Power Llc
## 3 Fedex Freight, Inc
## 4 United Parcel Service
## 5 Fedex Ground Package Systems, Inc.
## 6 Wa317957214 - United Parcel Service Inc
The webpage says “inspections which are known to be incomplete will have the identifying Activity Nr shown in italic”. We want to include that information in our table too.
Inspecting the elements and comparing italic and non-italic numbers, we realize we need to target numbers wrapped in <em> tags (<em> in HTML means the text is displayed in italics). To avoid getting all <em> tags on the page, “td a em” only selects <em> tags inside <a> tags inside <td> tags.
open_cases <- html_nodes(webpage,"td a em") %>%
html_text()
length(open_cases)
## [1] 54
head(open_cases)
## [1] "1451637.015" "1451571.015" "1451044.015" "1452754.015" "1450975.015"
## [6] "1451657.015"
inspections$status <- ifelse(inspections$Activity %in% open_cases, "incomplete", "complete")
inspections %>% head()
## Activity Opened RID St Type Sc SIC NAICS Vio
## 1 1452519.015 12/20/2019 453710 NC Complaint Partial 4215 492110 NA
## 2 1451959.015 12/17/2019 1054111 OR Planned No Insp/Other 4215 492110 NA
## 3 1451637.015 12/17/2019 950633 CA Accident Partial NA 492110 NA
## 4 1451571.015 12/17/2019 524700 OH Complaint Partial NA 492110 NA
## 5 1453178.015 12/16/2019 317500 PA Planned Complete NA 492110 NA
## 6 1451355.015 12/16/2019 1055330 WA Complaint Partial NA 492110 NA
## Establishment Name status
## 1 144338 - Directlink Logistics, Inc complete
## 2 317726467 - Portland Pedal Power Llc complete
## 3 Fedex Freight, Inc incomplete
## 4 United Parcel Service incomplete
## 5 Fedex Ground Package Systems, Inc. complete
## 6 Wa317957214 - United Parcel Service Inc complete
When the data you scrape is big, you might be identified as a robot and blocked by the website. You will see an “HTTP 403 Forbidden” error. You might avoid that by adding a pause between each hit to the website’s server, with Sys.sleep() in your code.
Look for “robots.txt” on the site you’re trying to scrape. Sometimes it tells you how many seconds you need to pause between requests.
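You can peek at a site’s robots.txt right from R; whether it includes a Crawl-delay line varies by site:
robots <- readLines("https://www.osha.gov/robots.txt", warn = FALSE)
robots[grepl("crawl-delay", robots, ignore.case = TRUE)]  ## pause hint, if any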
In our example, you could change the function to this:
readAct_Num <- function(act_num) {
  url <- paste("https://www.osha.gov/pls/imis/establishment.inspection_detail?id=", act_num, sep="")
  table <- html_table(html_nodes(read_html(url, encoding = "windows-1252"), "[class='tablei table-borderedi']")[[2]])
  colnames(table) <- as.character(unlist(table[1,]))  ## use the first row as column names
  table <- table[-1,-1]  ## drop the first row and first column
  table$activity_number <- act_num  ## record which inspection this table belongs to
  Sys.sleep(6)  ## pause BEFORE returning; code placed after return() would never run
  return(table)
}
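A hypothetical usage sketch, reusing the act_num vector from earlier; each call now pauses 6 seconds before returning:
## scrape the detail tables for the first three activity numbers
detail_tables <- lapply(head(act_num, 3), readAct_Num)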
Using a “User-agent” and specifying a web browser also helps, because it tells the server you’re visiting via a web browser. You can achieve that with the httr library.
library(httr)
uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
html_session(url,user_agent(uastring)) %>% html_node('table')
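An alternative sketch with httr (not from the original example): send a GET request that carries the user agent, then parse the response body yourself.
resp <- GET(url, user_agent(uastring))  ## the request now identifies as a browser
webpage <- read_html(content(resp, as = "text", encoding = "windows-1252"))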
More on avoiding getting blocked while scraping here.
Also, just in case an error occurs and you lose everything you’ve already scraped, you can add tryCatch() to your scraping function. This way R keeps running when an error occurs and tells you later where the error or warning happened. So in our example I would change the function to:
readAct_Num <- function(act_num) {
  url <- paste("https://www.osha.gov/pls/imis/establishment.inspection_detail?id=", act_num, sep="")
  out <- tryCatch(
    {
      table <- html_table(html_nodes(read_html(url, encoding = "windows-1252"), "[class='tablei table-borderedi']")[[2]])
      colnames(table) <- as.character(unlist(table[1,]))
      table <- table[-1,-1]
      table$activity_number <- act_num
      Sys.sleep(6)  ## pause first; a return() here would exit the whole function and skip the pause
      table  ## the last value of the block becomes tryCatch's result
    },
    error=function(cond) {
      message(paste("act_num caused an error:", act_num))
      message("Here's the original error message:")
      message(cond)
      Sys.sleep(6)
      ## Choose a return value in case of error
      return(NA)
    },
    warning=function(cond) {
      message(paste("act_num caused a warning:", act_num))
      message("Here's the original warning message:")
      message(cond)
      Sys.sleep(6)
      return(NULL)
    }
  )
  return(out)
}
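A hypothetical usage sketch: run the function over every activity number, keep only the lookups that returned a table (errors return NA, warnings return NULL), and stack them with dplyr’s bind_rows(), assuming the detail tables share column names.
results <- lapply(act_num, readAct_Num)
ok <- results[sapply(results, is.data.frame)]  ## drop the NA/NULL failures
all_details <- bind_rows(ok)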
A lot of data is displayed across multiple pages, but some tweaking of the URL can make all the results show up at once. Here’s a walk-through of how I got the URL we used in the scraping.
So you want to look up the OSHA violations in the messenger courier industry in 2019. You searched NAICS code “492110” here, and you got 161 results split into 9 pages, with 20 results per page for the first 8 pages.
Original url:
https://www.osha.gov/pls/imis/industry.search?p_logger=1&sic=&naics=492110&State=All&officetype=All&Office=All&endmonth=01&endday=01&endyear=2019&startmonth=12&startday=31&startyear=2019&owner=&scope=&FedAgnCode=
When you click page 2, the url becomes:
https://www.osha.gov/pls/imis/industry.search?sic=&sicgroup=&naicsgroup=&naics=492110&state=All&officetype=All&office=All&startmonth=12&startday=31&startyear=2019&endmonth=01&endday=01&endyear=2019&opt=&optt=&scope=&fedagncode=&owner=&emph=&emphtp=&p_start=&p_finish=20&p_sort=&p_desc=DESC&p_direction=Next&p_show=20
Click back to page 1, the url becomes:
https://www.osha.gov/pls/imis/industry.search?sic=&sicgroup=&naicsgroup=&naics=492110&state=All&officetype=All&office=All&startmonth=12&startday=31&startyear=2019&endmonth=01&endday=01&endyear=2019&opt=&optt=&scope=&fedagncode=&owner=&emph=&emphtp=&p_start=&p_finish=0&p_sort=&p_desc=DESC&p_direction=Next&p_show=20
Change “p_show=20” to “p_show=200”, and now you can retrieve all the results on one page.
https://www.osha.gov/pls/imis/industry.search?sic=&sicgroup=&naicsgroup=&naics=492110&state=All&officetype=All&office=All&startmonth=12&startday=31&startyear=2019&endmonth=01&endday=01&endyear=2019&opt=&optt=&scope=&fedagncode=&owner=&emph=&emphtp=&p_start=&p_finish=0&p_sort=&p_desc=DESC&p_direction=Next&p_show=200
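If the site capped results at 20 per page, a hypothetical fallback is to rebuild each page’s URL by swapping in the p_finish offset, which (judging from the page-2 URL above) advances by 20 per page:
page1 <- "https://www.osha.gov/pls/imis/industry.search?sic=&sicgroup=&naicsgroup=&naics=492110&state=All&officetype=All&office=All&startmonth=12&startday=31&startyear=2019&endmonth=01&endday=01&endyear=2019&opt=&optt=&scope=&fedagncode=&owner=&emph=&emphtp=&p_start=&p_finish=0&p_sort=&p_desc=DESC&p_direction=Next&p_show=20"
## build the URLs for all 9 pages (offsets 0, 20, ..., 160)
page_urls <- sapply(seq(0, 160, by = 20),
                    function(n) sub("p_finish=0", paste0("p_finish=", n), page1, fixed = TRUE))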
Can you download all the comment letters for this rule? Hint: you will need the download.file() function.
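As a reminder of the signature, with a hypothetical URL and file name:
## download.file() fetches the file at a URL and saves it to disk
download.file("https://example.com/comment_letter.pdf", destfile = "comment_letter.pdf", mode = "wb")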