Web scraping with R

This tutorial serves as an intro to scraping with R using the rvest library. It walks through how to scrape a table of OSHA inspections, extract information based on the italic style of the text, and scrape hyperlinked tables. It also offers tips on how to deal with errors and avoid being blocked as a robot.

A quick intro to webpages

Webpages usually consist of:

  • HTML (HyperText Markup Language) files, which build the structure of the page
  • CSS (Cascading Style Sheets) files, which define the style or look of the page
  • JavaScript files, which make the page interactive

An HTML file is a text file with HTML tags, which are reserved keywords for certain elements. They tell your web browser, “hey, here’s a table/paragraph/list, please display it as a table/paragraph/list”. Most tags come in pairs, an opening tag and a closing tag, e.g. <table></table>, <p></p>, <li></li>.

These tags can have attributes such as:

  • hyperlinks: <a href='https://www.osha.gov/'>Occupational Safety and Health</a>
  • class: <table class='table table-bordered'>
  • id: <h1 id='myHeader'>Hello World!</h1>

You can learn more about HTML tags here.

Inspect elements

An HTML document is like a tree, and scraping data from it is like picking apples. You need to tell R which branches you want the apples from: the features of the branches, the riper apples, the ones without leaves, etc. Tags and attributes help you target the branches you want apples from. To find the right tags and attributes, we need to inspect the source code.

Click here to visit the page we’re going to scrape. Place your mouse on the “Activity” column of the table in the middle, right-click on the page and click “Inspect”. The “Elements” tab highlights where your mouse is placed. The “Sources” tab shows the entire HTML file.

If our apple is the data in the <table> tag that’s highlighted, it sits on a <table> branch of a <div> branch, which is a branch of another <div>, which, several layers of <div> branches later, is a branch of the <body> branch of the HTML tree. Any branch or sub-branch of this tree can also be called a “node”, and you will hear this word several times in this session.
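To make the tree idea concrete, here is a minimal sketch that parses a tiny, made-up HTML document and walks its branches. It uses rvest functions that we will formally load in the next section:

## a minimal sketch: parse a made-up HTML document and walk its tree
library(rvest)

mini <- read_html("<html><body><div><table><tr><td>an apple</td></tr></table></div></body></html>")

html_children(mini)                    # the <body> branch off the root
html_node(mini, "table")               # the <table> branch, found by its tag
html_node(mini, "td") %>% html_text()  # the apple: the text inside the <td>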

Now let’s scrape a table!

First load the libraries we need.

## install the package
#install.packages("rvest")
#install.packages("dplyr")
## load the package
library(rvest)
library(dplyr)

This webpage has 161 OSHA inspections in the messenger courier industry in 2019.

# url of the website to be scraped
url <- "https://www.osha.gov/pls/imis/industry.search?sic=&sicgroup=&naicsgroup=&naics=492110&state=All&officetype=All&office=All&startmonth=12&startday=31&startyear=2019&endmonth=01&endday=01&endyear=2019&opt=&optt=&scope=&fedagncode=&owner=&emph=&emphtp=&p_start=&p_finish=0&p_sort=&p_desc=DESC&p_direction=Next&p_show=200"

read_html(): read the webpage/html document into R

# read the html content into R and assign it to the webpage object
webpage <- read_html(url, encoding = "windows-1252")
webpage
## {html_document}
## <html lang="en-us">
## [1] <head>\n<meta charset="utf-8">\n<title>Industry SIC Search Results Page | ...
## [2] <body>\n<div class="block block-gtranslate block-gtranslate block-gtransl ...

Tip: to find the right encoding, run “document.inputEncoding” in your browser’s Console tab.

Character encoding is a method of converting bytes into characters. To validate or display an HTML document properly, a program must choose a proper character encoding. You can read more about it in this post.
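If you can’t check the browser console, you can also guess the encoding from the raw bytes. A sketch using the httr and stringi packages (both assumed to be installed; stri_enc_detect() returns candidate encodings with confidence scores):

library(httr)
library(stringi)
# fetch the raw bytes and let stringi guess how they are encoded
resp <- GET(url)
stri_enc_detect(content(resp, as = "raw"))[[1]]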

html_nodes(): select elements/nodes from the html

We can select certain elements in the html document, or “nodes”, by picking out certain features. Like we talked about, it’s picking which branches we want the apples from. We do that by passing what is called a “CSS selector” to the html_nodes() function. You can also pass an “XPath”, but we’re not covering that today.

The following line tells R to pull nodes, or tree branches, with “table” tags.

html_nodes(webpage,"table")
## {xml_nodeset (3)}
## [1] <table class="table table-bordered table-striped">\n<tr>\n<th>SIC</th>\n< ...
## [2] <table class="table table-bordered"><tr class="breadcrumb">\n<td>\n<stron ...
## [3] <table class="table table-bordered table-striped">\n<thead><tr>\n<th> </t ...

You can also choose elements or tree branches based on attributes. Here we can find the value of the class attribute of the <table> node/branch we want and pass that to the html_nodes() function. There are two tables with the same class attribute; the table we want is the second node in the returned nodeset.

html_nodes(webpage,"[class='table table-bordered table-striped']")[[2]]
## {html_node}
## <table class="table table-bordered table-striped">
##  [1] <thead><tr>\n<th> </th>\n<th>#</th>\n<th>Activity</th>\n<th>Opened</th>\ ...
##  [2] <tr>\n<td><input type="checkbox" name="id" value="1452519.015"></td>\n<t ...
##  [3] <tr>\n<td><input type="checkbox" name="id" value="1451959.015"></td>\n<t ...
##  [4] <tr>\n<td><input type="checkbox" name="id" value="1451637.015"></td>\n<t ...
##  [5] <tr>\n<td><input type="checkbox" name="id" value="1451571.015"></td>\n<t ...
##  [6] <tr>\n<td><input type="checkbox" name="id" value="1453178.015"></td>\n<t ...
##  [7] <tr>\n<td><input type="checkbox" name="id" value="1451355.015"></td>\n<t ...
##  [8] <tr>\n<td><input type="checkbox" name="id" value="1451044.015"></td>\n<t ...
##  [9] <tr>\n<td><input type="checkbox" name="id" value="1452754.015"></td>\n<t ...
## [10] <tr>\n<td><input type="checkbox" name="id" value="1450975.015"></td>\n<t ...
## [11] <tr>\n<td><input type="checkbox" name="id" value="1451657.015"></td>\n<t ...
## [12] <tr>\n<td><input type="checkbox" name="id" value="1449936.015"></td>\n<t ...
## [13] <tr>\n<td><input type="checkbox" name="id" value="1451983.015"></td>\n<t ...
## [14] <tr>\n<td><input type="checkbox" name="id" value="1450424.015"></td>\n<t ...
## [15] <tr>\n<td><input type="checkbox" name="id" value="1448259.015"></td>\n<t ...
## [16] <tr>\n<td><input type="checkbox" name="id" value="1447908.015"></td>\n<t ...
## [17] <tr>\n<td><input type="checkbox" name="id" value="1447553.015"></td>\n<t ...
## [18] <tr>\n<td><input type="checkbox" name="id" value="1451336.015"></td>\n<t ...
## [19] <tr>\n<td><input type="checkbox" name="id" value="1447118.015"></td>\n<t ...
## [20] <tr>\n<td><input type="checkbox" name="id" value="1447292.015"></td>\n<t ...
## ...

html_table(): parse the table

After we get the node, or the tree branch, with the inspections table, we can parse it with the html_table() function.

inspections <- html_nodes(webpage,"[class='table table-bordered table-striped']")[[2]] %>% html_table()

inspections <- inspections[,-c(1:2)] ## remove the first two columns. one is empty, the other is useless.

head(inspections)
##   Activity     Opened     RID St      Type            Sc  SIC  NAICS Vio
## 1  1452519 12/20/2019  453710 NC Complaint       Partial 4215 492110  NA
## 2  1451959 12/17/2019 1054111 OR   Planned No Insp/Other 4215 492110  NA
## 3  1451637 12/17/2019  950633 CA  Accident       Partial   NA 492110  NA
## 4  1451571 12/17/2019  524700 OH Complaint       Partial   NA 492110  NA
## 5  1453178 12/16/2019  317500 PA   Planned      Complete   NA 492110  NA
## 6  1451355 12/16/2019 1055330 WA Complaint       Partial   NA 492110  NA
##                        Establishment Name
## 1      144338 - Directlink Logistics, Inc
## 2    317726467 - Portland Pedal Power Llc
## 3                      Fedex Freight, Inc
## 4                   United Parcel Service
## 5      Fedex Ground Package Systems, Inc.
## 6 Wa317957214 - United Parcel Service Inc

Save the table

If you are happy with this table, you can save it locally as a csv file.

#write.csv(inspections, "~/Desktop/nicar2020/nicar_2020_scraping_r/inspections.csv")

Extract activity numbers with html_attr()

In the scraped table, the Activity column doesn’t have decimal places. Let’s re-scrape the complete activity numbers from the table and replace the Activity column.

What CSS selector do we use to target the nodes/tree branches with activity numbers?

Inspect the elements of those activity numbers, and you will realize they appear as the “title” attribute in the <a> tags, for example: <a href="establishment.inspection_detail?id=1452519.015" title="1452519.015">. (Yes, a tag can have multiple attributes.)

<a> tags in HTML are reserved for hyperlinks, so we will want nodes with <a> tags for sure, but not all of them.

Instead, we want <a> tags that are:

  • in <td> tags, in other words, in a table cell
  • in the third column of a table

To find this specific type of node/branch we need to understand two things.

First, “:nth-child(A)” selects an element that is the Ath child of its parent. What appears before the colon defines the type of the child element.

Go to level 18 of this interactive CSS tutorial and try the game after reading the examples on the right.

Now you will understand that “td:nth-child(3)” selects every <td> cell that is the third child of its row, in other words, the third cell in every table row on the page.

When you run the next code chunk, you will find the first node isn’t what we want. We will fix it next.

html_nodes(webpage, 'td:nth-child(3)') %>% head()
## {xml_nodeset (6)}
## [1] <td>01/01/2019 to 12/31/2019</td>\n
## [2] <td><a href="establishment.inspection_detail?id=1452519.015" title="14525 ...
## [3] <td><a href="establishment.inspection_detail?id=1451959.015" title="14519 ...
## [4] <td><a href="establishment.inspection_detail?id=1451637.015" title="14516 ...
## [5] <td><a href="establishment.inspection_detail?id=1451571.015" title="14515 ...
## [6] <td><a href="establishment.inspection_detail?id=1453178.015" title="14531 ...

The second thing you need to understand: “A B” selects all “B” inside of “A”. Go to level 4 of this interactive tutorial and try typing the answer; you will gain a deeper understanding.

So “td:nth-child(3) a” selects nodes with <a> tags inside the data cells in the third column. Because the third-column cells of the other tables aren’t hyperlinked and don’t include <a> tags, they won’t be selected.
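To see the combinator at work outside the OSHA page, here is a toy sketch with a made-up two-row table; only the hyperlinked third-column cell survives the second selector:

## a toy example (made-up HTML, not the OSHA page)
toy <- read_html("<table>
  <tr><td>a</td><td>b</td><td>01/01/2019 to 12/31/2019</td></tr>
  <tr><td>x</td><td>y</td><td><a title='1452519.015'>1452519</a></td></tr>
</table>")

html_nodes(toy, "td:nth-child(3)")   # matches both third-column cells
html_nodes(toy, "td:nth-child(3) a") # matches only the hyperlinked one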

I strongly recommend that you go through all 32 levels of this fun and interactive CSS selector tutorial. The SelectorGadget Chrome extension is also really useful in getting started with scraping, since it finds the CSS selector based on where you point and click.

Next save the activity numbers to a vector.

act_num <- html_nodes(webpage, 'td:nth-child(3) a') %>% html_attr("title")
length(act_num) ## double check how many activity numbers
## [1] 161
head(act_num) ## check out the first six
## [1] "1452519.015" "1451959.015" "1451637.015" "1451571.015" "1453178.015"
## [6] "1451355.015"

Replace the Activity column with complete activity numbers

# replace the Activity column with the act_num vector
inspections$Activity <- act_num
# check out the first six rows
head(inspections)
##      Activity     Opened     RID St      Type            Sc  SIC  NAICS Vio
## 1 1452519.015 12/20/2019  453710 NC Complaint       Partial 4215 492110  NA
## 2 1451959.015 12/17/2019 1054111 OR   Planned No Insp/Other 4215 492110  NA
## 3 1451637.015 12/17/2019  950633 CA  Accident       Partial   NA 492110  NA
## 4 1451571.015 12/17/2019  524700 OH Complaint       Partial   NA 492110  NA
## 5 1453178.015 12/16/2019  317500 PA   Planned      Complete   NA 492110  NA
## 6 1451355.015 12/16/2019 1055330 WA Complaint       Partial   NA 492110  NA
##                        Establishment Name
## 1      144338 - Directlink Logistics, Inc
## 2    317726467 - Portland Pedal Power Llc
## 3                      Fedex Freight, Inc
## 4                   United Parcel Service
## 5      Fedex Ground Package Systems, Inc.
## 6 Wa317957214 - United Parcel Service Inc

Extract incomplete inspections based on italic style with html_text()

A piece of information is missing in the table above compared to the table on the webpage. A message on the page says “inspections which are known to be incomplete will have the identifying Activity Nr shown in italic”. We want to include that information in our table too.

Inspect the elements and compare italic and non-italic numbers, and you will realize we need to target numbers wrapped in <em> tags. <em> in HTML means the text is displayed in italics. To avoid getting all the <em> tags on the page, “td a em” only selects <em> tags inside <a> tags inside <td> tags, like we explained earlier.

open_cases <- html_nodes(webpage,"td a em") %>%
  html_text()
length(open_cases)
## [1] 54
head(open_cases)
## [1] "1451637.015" "1451571.015" "1451044.015" "1452754.015" "1450975.015"
## [6] "1451657.015"

Create a new column for whether the case is incomplete

We can use the ifelse() function to create a new column that differentiates incomplete from complete cases.

inspections$status <- ifelse(inspections$Activity %in% open_cases, "incomplete", "complete")
inspections %>% head()
##      Activity     Opened     RID St      Type            Sc  SIC  NAICS Vio
## 1 1452519.015 12/20/2019  453710 NC Complaint       Partial 4215 492110  NA
## 2 1451959.015 12/17/2019 1054111 OR   Planned No Insp/Other 4215 492110  NA
## 3 1451637.015 12/17/2019  950633 CA  Accident       Partial   NA 492110  NA
## 4 1451571.015 12/17/2019  524700 OH Complaint       Partial   NA 492110  NA
## 5 1453178.015 12/16/2019  317500 PA   Planned      Complete   NA 492110  NA
## 6 1451355.015 12/16/2019 1055330 WA Complaint       Partial   NA 492110  NA
##                        Establishment Name     status
## 1      144338 - Directlink Logistics, Inc   complete
## 2    317726467 - Portland Pedal Power Llc   complete
## 3                      Fedex Freight, Inc incomplete
## 4                   United Parcel Service incomplete
## 5      Fedex Ground Package Systems, Inc.   complete
## 6 Wa317957214 - United Parcel Service Inc   complete

Recap

All we did today pretty much falls into the following rhythm:

  • step 1: read_html(), read the webpage into R
  • step 2: html_nodes(), pull elements/nodes with chosen tags or attributes, in other words, get the tree branch with the apples you want
  • step 3: extract text/attributes with html_attr()/html_text(), or parse the table with html_table()
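To make the rhythm concrete, here is a condensed sketch that repeats today’s whole scrape in four lines, reusing the url object from the top of this tutorial:

## the whole rhythm in one chunk
webpage <- read_html(url, encoding = "windows-1252")                                   # step 1: read the page into R
tbl <- html_nodes(webpage, "[class='table table-bordered table-striped']")[[2]]        # step 2: pick the branch
inspections <- html_table(tbl)                                                         # step 3: parse the table
inspections$Activity <- html_nodes(webpage, "td:nth-child(3) a") %>% html_attr("title") # step 3: extract attributes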

Dealing with errors

When you are scraping a lot of data, you might be identified as a robot and blocked by the website; you will see an “HTTP 403 Forbidden” error. You can often avoid that by adding a pause between each hit to the website server with Sys.sleep() in your code.

Look for ‘robots.txt’ on the site you’re trying to scrape; sometimes it tells you how many seconds to pause between requests.
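You can peek at that file straight from R; by convention it lives at the site root (a quick sketch):

# read the site's robots.txt, which may include a crawl-delay rule
readLines("https://www.osha.gov/robots.txt")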

In our example, you could change the function to this:

readAct_Num <- function(act_num) {
  url <- paste("https://www.osha.gov/pls/imis/establishment.inspection_detail?id=",act_num, sep="")
  table <- html_table(html_nodes(read_html(url, encoding = "windows-1252"), "[class='tablei table-borderedi']")[[2]])
  colnames(table) <- as.character(unlist(table[1,])) # use the first row as column names
  table <- table[-1,-1] # drop the header row and the first column
  table$activity_number <- act_num
  Sys.sleep(6) # pause before returning; code placed after return() would never run
  return(table)
}
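To actually use the function, loop it over the activity numbers we scraped earlier. A sketch: this makes one request per activity number, pausing six seconds each time, so the 161 detail pages take roughly 16 minutes.

# scrape the detail page for every activity number, one request every six seconds
detail_tables <- lapply(act_num, readAct_Num)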

Using a “User-agent” string that specifies a web browser also helps, because it tells the server you’re visiting with a web browser. You can achieve that with the httr library.

library(httr)
# a user-agent string mimicking a desktop Chrome browser
uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
# start a session that sends that user agent with every request
html_session(url,user_agent(uastring)) %>% html_node('table')
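If you prefer to make the request yourself, a roughly equivalent sketch fetches the page with httr’s GET() and hands the response text to read_html():

# fetch the page with the custom user agent, then parse the response text
resp <- GET(url, user_agent(uastring))
page <- read_html(content(resp, as = "text", encoding = "windows-1252"))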

More on avoiding getting blocked while scraping here.

Also, just in case an error occurs and you lose everything you already scraped, you can add tryCatch() to your scraping function. This way R keeps running when an error occurs and tells you afterwards where the error or warning happened. So in our example I would change the function to:

readAct_Num <- function(act_num) {
  url <- paste("https://www.osha.gov/pls/imis/establishment.inspection_detail?id=",act_num, sep="")
  out <- tryCatch(
    {
      table <- html_table(html_nodes(read_html(url, encoding = "windows-1252"), "[class='tablei table-borderedi']")[[2]])
      colnames(table) <- as.character(unlist(table[1,]))
      table <- table[-1,-1]
      table$activity_number <- act_num
      table # the last expression is the value tryCatch returns on success
    },
    error=function(cond) {
      message(paste("act_num caused an error:", act_num))
      message("Here's the original error message:")
      message(cond)
      # Choose a return value in case of error
      NA
    },
    warning=function(cond) {
      message(paste("act_num caused a warning:", act_num))
      message("Here's the original warning message:")
      message(cond)
      NULL
    },
    # finally runs whether or not an error occurred; note that code
    # placed after return() would never be reached
    finally = Sys.sleep(6)
  )
  return(out)
}
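Once that loop finishes, you can keep only the successful scrapes (the data frames, dropping the NA and NULL placeholders) and stack them. A sketch, assuming the detail tables share column names:

# run the safer scraper, then combine only the results that are data frames
detail_tables <- lapply(act_num, readAct_Num)
detail_df <- bind_rows(Filter(is.data.frame, detail_tables))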

Getting data to show on one page

A lot of data is displayed across multiple pages; some tweaking of the URL can help you get all the results to show up at once. Here’s a walk-through of how I got the URL we used in the scraping.

So you want to look up the OSHA violations in the messenger courier industry in 2019. You searched NAICS code “492110” here, and you got 161 results split across 9 pages, with 20 results per page for the first 8 pages.

Original URL:

https://www.osha.gov/pls/imis/industry.search?p_logger=1&sic=&naics=492110&State=All&officetype=All&Office=All&endmonth=01&endday=01&endyear=2019&startmonth=12&startday=31&startyear=2019&owner=&scope=&FedAgnCode=

When you click page 2, the URL becomes:

https://www.osha.gov/pls/imis/industry.search?sic=&sicgroup=&naicsgroup=&naics=492110&state=All&officetype=All&office=All&startmonth=12&startday=31&startyear=2019&endmonth=01&endday=01&endyear=2019&opt=&optt=&scope=&fedagncode=&owner=&emph=&emphtp=&p_start=&p_finish=20&p_sort=&p_desc=DESC&p_direction=Next&p_show=20

Click back to page 1, and the URL becomes:

https://www.osha.gov/pls/imis/industry.search?sic=&sicgroup=&naicsgroup=&naics=492110&state=All&officetype=All&office=All&startmonth=12&startday=31&startyear=2019&endmonth=01&endday=01&endyear=2019&opt=&optt=&scope=&fedagncode=&owner=&emph=&emphtp=&p_start=&p_finish=0&p_sort=&p_desc=DESC&p_direction=Next&p_show=20

Change “p_show=” to 200, and now you can retrieve all the results on one page.

https://www.osha.gov/pls/imis/industry.search?sic=&sicgroup=&naicsgroup=&naics=492110&state=All&officetype=All&office=All&startmonth=12&startday=31&startyear=2019&endmonth=01&endday=01&endyear=2019&opt=&optt=&scope=&fedagncode=&owner=&emph=&emphtp=&p_start=&p_finish=0&p_sort=&p_desc=DESC&p_direction=Next&p_show=200
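If a site caps the page size and you can’t show everything at once, the same pattern works in reverse: loop over pages by stepping the offset parameter. A sketch, assuming p_finish behaves as observed above (offsets 0, 20, …, 160 for the 9 pages):

# fetch each of the 9 pages by stepping p_finish in increments of 20
offsets <- seq(0, 160, by = 20)
pages <- lapply(offsets, function(offset) {
  page_url <- paste0(
    "https://www.osha.gov/pls/imis/industry.search?sic=&sicgroup=&naicsgroup=&naics=492110",
    "&state=All&officetype=All&office=All&startmonth=12&startday=31&startyear=2019",
    "&endmonth=01&endday=01&endyear=2019&opt=&optt=&scope=&fedagncode=&owner=&emph=&emphtp=",
    "&p_start=&p_sort=&p_desc=DESC&p_direction=Next&p_show=20&p_finish=", offset)
  Sys.sleep(6) # pause between requests
  page <- read_html(page_url, encoding = "windows-1252")
  html_table(html_nodes(page, "[class='table table-bordered table-striped']")[[2]])
})
all_pages <- do.call(rbind, pages)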

Homework

Download all the comment letters for this rule. Hint: you will need the download.file() function.
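To get you started, here is the shape of a download.file() call; the URL and destination below are hypothetical placeholders, not the actual comment letters:

# hypothetical placeholder URL, not a real comment letter
letter_url <- "https://www.regulations.gov/hypothetical-comment-letter.pdf"
download.file(letter_url, destfile = "comment_letter_1.pdf", mode = "wb") # mode "wb" for binary files like PDFs
# for many letters, loop over a vector of URLs:
# for (u in letter_urls) download.file(u, destfile = basename(u), mode = "wb")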