This workshop will introduce you to the concepts and practices of web scraping in R using the rvest package. By the end, you will have worked through the process of writing scrapers for two websites using the basic functions of rvest. The process of scraping data from the web is a lesson in the power of data science, because it exemplifies the computer-plus-human model of computing. It is also a nice entry point into the process of building software, since custom scrapers are essentially software for scraping a specific website.
This workshop is for beginners, but it does assume that you have R installed, have a basic familiarity with how R works, and are familiar with at least one R interface (e.g. RStudio or the R command prompt). The workshop also touches on multiple data formats, including lists and dataframes, and on some exploratory data visualization techniques, though these are not its focus.
The tutorial is aimed at a range of skill sets, while still trying to avoid being “something for everyone and everything for no one.”
Since no two websites are the same, webscraping requires you to identify and exploit patterns in the code that renders websites. Each website is rendered by your browser from HTML, and the goal of webscraping is to parse the HTML that is sent to your browser into usable data. Generally speaking, the steps for webscraping are as follows:
- access a web page from R,
- tell R where to “look” on the page,
- and manipulate the data into a usable format within R.
The first section will cover the magrittr package, which provides a piping operator to help us filter through the text. Without magrittr, we would have to assign a bunch of temporary variables or use nested functions to do operations on a string of HTML.
magrittr
Rvest is designed to work with magrittr and the %>% piping notation. This can look a little goofy when you first see it, so let’s take a second to get oriented to the %>% notation and what it means. Piping is also useful for understanding other packages, such as ggplot2 and dplyr. Just remember that the main advantage of piping is to minimize the need for nested functions.
The magrittr package implements the pipe operator (%>%), which takes the output of one operation and “pipes” it into the next function.
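To see the idea in isolation, here is a minimal sketch with plain numbers rather than HTML (the values are made up for illustration): the nested call and the piped call give the same answer.
library(magrittr)
# Nested functions read inside-out ...
round(sqrt(sum(c(1, 4, 9))), 1)
## [1] 3.7
# ... while the pipe reads left-to-right, one step at a time.
c(1, 4, 9) %>% sum() %>% sqrt() %>% round(1)
## [1] 3.7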
Generally speaking, the steps for webscraping are as follows:
- Inspect the page in your browser to decide whether you really want to scrape it.
- Once you have decided, set aside some time and don’t doubt yourself.
- Access the web page from R;
- tell R where to “look” on the page;
- and manipulate the data into a usable format within R.
Because each website is different, we’ll be scraping a couple during this tutorial to give you an idea of the general principles (as well as a few examples of working code to start your next project). First up is SciStarter.org, a platform for advertising and joining citizen science projects. Take a look at the projects page.
Projects filtered by location “At the beach”.
To scrape the project info from this page, we need to take a look at the HTML. You can do this by right-clicking and selecting "View Page Source" in Chrome, Firefox, Safari, and most other modern browsers. You should see something that looks like the image below.
Source code for SciStarter.
If you don’t know HTML, this looks pretty daunting! Luckily, there are some built-in tools in your browser that will help you parse the code and find the relevant sections of the page. Right-click and select the “Inspect Element” option. This allows you to interactively scroll over the page and see how each line of HTML corresponds to different parts of the web page. See the screenshot below for an example.
Example of “Inspect Element” screen for SciStarter.
Ok, now we’re going to start trying to isolate the specific information we’re after. Let’s say our goal is to grab all the data from the first ten records on the page. We can see from the image above that the first piece of information we need is a link with the node “a” and the attribute “href.” Let’s see if we can grab that using a basic function in the rvest package.
If you haven’t already, install the rvest package with the command install.packages('rvest'). You can get help at any time by typing ?rvest or ?html_node in your command prompt.
The basic functions in rvest are powerful, and you should try to utilize the following functions when starting out a new project:
- html_nodes(): identifies HTML wrappers
- html_nodes(".class"): calls node based on css class
- html_nodes("#id"): calls node based on <div> id
- html_nodes(xpath="xpath"): calls node based on xpath (we’ll cover this later)
- html_attrs(): identifies attributes (useful for debugging)
- html_table(): turns HTML tables into data frames
- html_text(): strips the HTML tags and extracts only the text
Note on plurals: html_node() returns the first matching node, while html_nodes() iterates over all the matching nodes.
The html_nodes() function returns each matching HTML tag as an element of its result, which we can then filter further or convert into familiar R structures.
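Before touching a real page, here is a minimal sketch on a made-up HTML snippet (the tag and class names are invented for illustration) showing the difference between the singular and plural versions.
library(rvest)
# A toy document with two matching paragraphs.
toy <- read_html('<div id="projects"><p class="title">one</p><p class="title">two</p></div>')
toy %>% html_node("p.title") %>% html_text()  # first match only
## [1] "one"
toy %>% html_nodes("p.title") %>% html_text() # every match
## [1] "one" "two"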
library(rvest)
## Loading required package: xml2
# Define the url once.
URL <- "https://scistarter.com/finder?phrase=&lat=&lng=&activity=At%20the%20beach&topic=&search_filters=&search_audience=&page=1#view-projects"
scistarter_html <- read_html(URL)
scistarter_html
## {xml_document}
## <html class="no-js" lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body>\n \n \n <svg style="position: absolute; width: 0; he ...
We’re able to retrieve the same HTML code we saw in our browser. This isn’t useful yet, but it does confirm that read_html() is working. Now we will begin filtering through the HTML to find the data we’re after.
The data we want are stored in a table, which we can tell by looking at the “Inspect Element” window.
This grabs all the nodes that have links in them.
scistarter_html %>%
html_nodes("a") %>%
head()
## {xml_nodeset (6)}
## [1] <a href="/index.html" class="site-header__branding" title="go to the ...
## [2] <a href="/dashboard">My Account</a>
## [3] <a href="/finder" class="is-active">Project Finder</a>
## [4] <a href="/events">Event Finder</a>
## [5] <a href="/people-finder">People Finder</a>
## [6] <a href="#dialog-login" rel="modal:open">log in</a>
In a more complex example, we could use this to “crawl” the page, but that’s for another day.
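We won’t crawl anything today, but a minimal sketch of the first step would be to pull the href attribute out of every link; those URLs could then be fed back into read_html().
# Grab the href attribute from every <a> node on the page.
links <- scistarter_html %>%
  html_nodes("a") %>%
  html_attr("href")
head(links)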
Every div on the page:
scistarter_html %>%
html_nodes("div") %>%
head()
## {xml_nodeset (6)}
## [1] <div class="site-header__nav js-hamburger b-utility">\n <butt ...
## [2] <div class="site-header__nav__body js-hamburger__body">\n < ...
## [3] <div class="nav-tools">\n <div class="nav-tools__search"> ...
## [4] <div class="nav-tools__search">\n <div class="field">\n ...
## [5] <div class="field">\n <form method="get" action="/fin ...
## [6] <div class="input-group input-group--flush">\n <d ...
… and the nav-tools div. This selects by CSS class, where class="nav-tools".
scistarter_html %>%
html_nodes("div.nav-tools") %>%
head()
## {xml_nodeset (1)}
## [1] <div class="nav-tools">\n <div class="nav-tools__search"> ...
We can call the nodes by id as follows.
scistarter_html %>%
html_nodes("div#project-listing") %>%
head()
## {xml_nodeset (1)}
## [1] <div id="project-listing" class="subtabContent">\n \n ...
We can grab all the tables as follows:
scistarter_html %>%
html_nodes("table") %>%
head()
## {xml_nodeset (6)}
## [1] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [2] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [3] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [4] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [5] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [6] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
This will help us access one of rvest’s most powerful features, html_table().
Now that it’s clear what html_nodes() does, let’s look at the source code for the site again to see if we can combine rvest functions (and some base R functions) to get at the right data.
With the piping function and what we learned about rvest, we can now start scraping this page.
scistarter_html %>%
html_nodes("div#project-listing") %>%
html_nodes("table") %>%
html_table() %>%
"["(1) %>% str() # keep just the first table as a one-element list, then inspect its structure
## List of 1
## $ :'data.frame': 3 obs. of 2 variables:
## ..$ X1: chr [1:3] "Goal" "Task" "Where"
## ..$ X2: chr [1:3] "Study where ants live and what they eat" "Build your kit. Catch your ants. Send them to us!" "United States of America"
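A note on the "["(1) step above: it keeps the result as a one-element list. If you want the data frame itself, .[[1]] inside the pipe pulls it out instead; a quick sketch:
# .[[1]] extracts the first table as a data frame rather than a one-element list.
scistarter_html %>%
  html_nodes("div#project-listing") %>%
  html_nodes("table") %>%
  html_table() %>%
  .[[1]]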
The project titles aren’t in those tables, though. Let’s try another selector and grab the h3 headings instead.
scistarter_html %>%
html_nodes("div#project-listing") %>% # filter to the project-listing div
html_nodes("h3") # grab the h3 headings inside it
## {xml_nodeset (10)}
## [1] <h3 class="b-heading-5--lc u-mb-xs"> <a href="/project/17003-School ...
## [2] <h3 class="b-heading-5--lc u-mb-xs"> <a href="/project/16830-NASA-G ...
## [3] <h3 class="b-heading-5--lc u-mb-xs"> <a href="/project/17039-Britis ...
## [4] <h3 class="b-heading-5--lc u-mb-xs"> <a href="/project/17011-Coasta ...
## [5] <h3 class="b-heading-5--lc u-mb-xs"> <a href="/project/16909-Pl%40n ...
## [6] <h3 class="b-heading-5--lc u-mb-xs"> <a href="/project/16868-Wiscon ...
## [7] <h3 class="b-heading-5--lc u-mb-xs"> <a href="/project/15731-Sentin ...
## [8] <h3 class="b-heading-5--lc u-mb-xs"> <a href="/project/15069-Surfri ...
## [9] <h3 class="b-heading-5--lc u-mb-xs"> <a href="/project/15066-Surfri ...
## [10] <h3 class="b-heading-5--lc u-mb-xs"> <a href="/project/14743-Surfri ...
scistarter_html %>%
html_nodes("div#project-listing") %>% # filter to the project-listing div
html_nodes("h3") %>% # get the headings
html_text() %>% #get the text, not the HTML tags
gsub("^\\s+|\\s+$", "", .) #strip the white space from the beginning and end of a string.
## [1] "School of Ants USA"
## [2] "NASA GLOBE Observer: Clouds"
## [3] "British Columbia Beached Bird Survey"
## [4] "Coastal Research Volunteers"
## [5] "Pl@ntNet"
## [6] "Wisconsin Breeding Bird Atlas II"
## [7] "Sentinels of the Sounds"
## [8] "Surfrider Foundation's Blue Water Task Force Rincon"
## [9] "Surfrider Foundation's Blue Water Task Force San Luis Obispo"
## [10] "Surfrider Foundation's Blue Water Task Force Marin County"
Now we have isolated the titles of each citizen science project. This will make up one of the columns of our dataframe, but it’s not the whole thing.
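As an aside, the same h3 nodes also hold each project’s link, which could become another column later; a small sketch:
# The <a> tags inside the headings carry the relative project URLs.
project_urls <- scistarter_html %>%
  html_nodes("div#project-listing") %>%
  html_nodes("h3 a") %>%
  html_attr("href")
head(project_urls)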
Other columns are embedded in separate tables on the page. This is not ideal for webscraping, and there are multiple ways to access that information. Here is one method of scraping the information from each table: first, filter to the table cells (the <td> tags).
scistarter_html %>%
html_nodes("td") %>% #grab the <td> tags
html_text() %>% # isolate the text from the html tags
gsub("^\\s+|\\s+$", "", .) %>% #strip the white space from the beginning and end of a string.
head(n=12) # take a peek at the first 12 records
## [1] "Study where ants live and what they eat"
## [2] "Build your kit. Catch your ants. Send them to us!"
## [3] "United States of America"
## [4] "Help scientists understand the sky from above and below"
## [5] "Photograph clouds, record sky observations and share with NASA"
## [6] "Anywhere"
## [7] "Monitor seabird mortality in British Columbia"
## [8] "Walk your local beach while looking for dead birds"
## [9] "British Columbia"
## [10] "Contribute to coastal science, stewardship, and resilience"
## [11] "Participate in one of our many coastal research opportunities"
## [12] "New Hampshire"
Notice the pattern here? The cells cycle through goal, task, and location, so every third record belongs to the same column. We can use our base R knowledge to pull these apart into three vectors that hold only goals, tasks, or locations. After that, we will put the columns in a data.frame.
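Here is a quick sketch of that indexing trick on a made-up vector, before we apply it to the real page:
# A made-up vector with the same goal/task/location rhythm.
v <- c("goal A", "task A", "place A", "goal B", "task B", "place B")
v[seq(from = 1, to = length(v), by = 3)]
## [1] "goal A" "goal B"
v[seq(from = 2, to = length(v), by = 3)]
## [1] "task A" "task B"
v[seq(from = 3, to = length(v), by = 3)]
## [1] "place A" "place B"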
page_list <- scistarter_html %>%
html_nodes("td") %>%
html_text() %>%
gsub("^\\s+|\\s+$", "", .) #strip the white space from the beginning and end of a string.
goals <- page_list[seq(from=1, to=30,by=3)] # make a sequence to select the goals
task <- page_list[seq(from=2, to=30,by=3)]
location <- page_list[seq(from=3, to=30,by=3)]
title <- scistarter_html %>%
html_nodes("div#project-listing") %>% # filter to the project-listing div
html_nodes("h3") %>% # get the headings
html_text() %>% #get the text, not the HTML tags
gsub("^\\s+|\\s+$", "", .)
scistarter_df <- data.frame(title, goals, task, location)
Now you have scraped SciStarter’s first project page! From here, you can write a loop that will build up a data frame from multiple pages by going to each page and scraping the data.
pages <- ceiling(832/10) # number of pages to go through
sci_df <- data.frame()
#for (page in (1:pages)) { Uncomment this if you want all the pages.
for (page in (1:5)) {
print(paste0("getting data for page: ", page))
URL <- paste0("https://scistarter.com/finder?phrase=&lat=&lng=&activity=&topic=&search_filters=&search_audience=&page=", page, "#view-projects")
sci_html <- read_html(URL)
page_list <- sci_html %>%
html_nodes("td") %>%
html_text() %>%
gsub("^\\s+|\\s+$", "", .) #strip the white space from the beginning and end of a string.
goal <- page_list[seq(from=1, to=30,by=3)]
task <- page_list[seq(from=2, to=30,by=3)]
location <- page_list[seq(from=3, to=30,by=3)]
title <- sci_html %>%
html_nodes("div#project-listing") %>% # filter to the project-listing div
html_nodes("h3") %>% # get the headings
html_text() %>% #get the text, not the HTML tags
gsub("^\\s+|\\s+$", "", .) #strip the white space from the beginning and end of a string.
tmp <- data.frame(title, goal, task, location)
if (page == 1) {
sci_df <- data.frame(tmp)
} else {
sci_df <- rbind(sci_df, tmp)
}
}
## [1] "getting data for page: 1"
## [1] "getting data for page: 2"
## [1] "getting data for page: 3"
## [1] "getting data for page: 4"
## [1] "getting data for page: 5"
sci_df %>% str()
## 'data.frame': 50 obs. of 4 variables:
## $ title : Factor w/ 50 levels "California Roadkill Observation System",..: 10 5 7 6 4 9 3 2 1 8 ...
## $ goal : Factor w/ 50 levels "Contribute to better health care. No experience necessary!",..: 9 7 5 4 8 2 6 1 10 3 ...
## $ task : Factor w/ 50 levels "Add your roadkill observations.",..: 9 5 2 6 10 8 7 3 1 4 ...
## $ location: Factor w/ 22 levels "","Anywhere",..: 1 1 5 2 4 4 2 4 3 2 ...
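One note on the str() output above: older versions of R convert character vectors to factors inside data.frame() by default. If you would rather keep plain character columns, a small tweak to the line inside the loop does it (a sketch, assuming you are on a version of R where this matters):
# stringsAsFactors = FALSE keeps title, goal, task, and location as character vectors.
tmp <- data.frame(title, goal, task, location, stringsAsFactors = FALSE)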
Some websites publish their data in an easy-to-read table without offering the option to download it. In that case, you can scrape the page and hand the table straight to html_table(): call html_table() and voilà, a shiny R data frame is ready for you to analyze. Go to https://www.nis.gov.kh/cpi/Jan14.html and inspect the HTML.
# To scrape a table from a website, the html_table() function can be a game-changer.
# But it doesn't give us the right output right away.
URL2 <- "https://www.nis.gov.kh/cpi/Apr14.html"
# TIP: When debugging or building your scraper, assign a variable to the raw HTML.
# That way you only have to read the page once.
accounts <- read_html(URL2)
table <- accounts %>%
html_nodes("table") %>%
html_table(header=T)
# You can clean up the table with the following code, or something like it.
# table[[1]]
dict <- table[[1]][,1:2]
accounts_df <- table[[1]][6:18,-1]
names <- c('id', 'weight.pct', 'jan.2013', 'dec.2013', 'jan.2014', 'mo.pctch', 'yr.pctch', 'mo.cont', 'yr.cont')
colnames(accounts_df) <- names
accounts_df #%>% str()
## id weight.pct jan.2013
## 6 All ITEMS (CPI TOTAL) 100.000 150.2
## 7 FOOD AND NON-ALCOHOLIC BEVERAGES 44.775 170.4
## 8 ALCOHOLIC BEVERAGES, TOBACCO AND NARCOTICS 1.625 127.9
## 9 CLOTHING AND FOOTWEAR 3.036 125.8
## 10 HOUSING, WATER, ELECTRICITY, GAS AND OTHER FUELS 17.084 127.8
## 11 FURNISHINGS, HOUSEHOLD MAINTENANCE 2.743 130.5
## 12 HEALTH 5.141 116.8
## 13 TRANSPORT 12.228 128.2
## 14 COMMUNICATION 1.136 71.3
## 15 RECREATION AND CULTURE 2.912 107.8
## 16 EDUCATION 1.174 146.1
## 17 RESTAURANTS 5.861 201.2
## 18 MISCELLANEOUS GOODS AND SERVICES 2.285 152.8
## dec.2013 jan.2014 mo.pctch yr.pctch mo.cont yr.cont
## 6 156.8 157.7 0.5 4.9 0.5 4.9
## 7 178.7 180.0 0.7 5.6 0.4 2.9
## 8 136.0 136.9 0.6 7.1 0.0 0.1
## 9 130.1 130.4 0.2 3.6 0.0 0.1
## 10 131.1 131.5 0.3 2.9 0.0 0.4
## 11 141.3 141.9 0.4 8.7 0.0 0.2
## 12 127.1 127.5 0.3 9.2 0.0 0.4
## 13 129.7 129.8 0.1 1.2 0.0 0.1
## 14 69.6 69.4 -0.3 -2.7 0.0 0.0
## 15 110.7 110.5 -0.2 2.5 0.0 0.1
## 16 159.9 159.9 0.0 9.5 0.0 0.1
## 17 218.0 220.1 0.9 9.4 0.1 0.7
## 18 145.0 144.7 -0.3 -5.3 0.0 -0.1
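Depending on your version of R, the scraped columns may come through as character or factor rather than numeric. A hedged sketch of converting everything except the id column before analysis:
# Convert every column except id to numeric; as.character() first guards
# against factor columns turning into their level codes.
num_cols <- setdiff(names(accounts_df), "id")
accounts_df[num_cols] <- lapply(accounts_df[num_cols], function(x) as.numeric(as.character(x)))
str(accounts_df)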
The final method for extracting data from a webpage is to call the data using its xpath. Sometimes we want very specific data from a website; maybe we only want one or two values from inside a table.
The xpath option can be very useful for this, but it is not super intuitive. Think of it as directions you are giving rvest to the specific piece of data you’re interested in scraping.
For the last example, we’ll scrape wunderground.com, inspired by a great tutorial on webscraping in Python from Nathan Yau’s Visualize This.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
URL <- "https://www.wunderground.com/history/airport/WSSS/2016/1/1/DailyHistory.html?req_city=Singapore&req_statename=Singapore"
raw <- read_html(URL)
max <- raw %>%
html_nodes(xpath='//*[@id="historyTable"]/tbody/tr[3]/td[2]/span/span[1]') %>%
html_text() %>% as.numeric()
min <- raw %>%
html_nodes(xpath='//*[@id="historyTable"]/tbody/tr[4]/td[2]/span/span[1]') %>%
html_text() %>% as.numeric()
date <- ymd(paste("2016","1","1", sep="/"))
record <- data.frame(date, min, max)
record
## date min max
## 1 2016-01-01 24 29
What is going on here? We handed html_nodes() an xpath instead of a CSS selector. When you installed rvest, you might have seen a friendly warning from R that the xml2 package is a dependency: rvest uses XML notation to read the HTML. It isn’t necessary to understand the deeper points regarding this (I certainly don’t!), but learning a bit about xpath might demystify the next steps.
The address we supplied is //*[@id="historyTable"]/tbody/tr[3]/td[2]/span/span[1]. You can go back through the HTML using view source to reverse-engineer this xpath.
For a nice overview, check out this StackOverflow response.
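If hard-coding row and column positions feels brittle, a hedged alternative (assuming the table keeps its historyTable id) is to let html_table() parse the whole table and pick the values out of the resulting data frame:
# Parse the entire history table instead of addressing individual cells.
history <- raw %>%
  html_nodes(xpath = '//*[@id="historyTable"]') %>%
  html_table(fill = TRUE)
str(history)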
Loop through the URLs like so:
# go to https://www.wunderground.com/history/airport/WSSS/2017/3/5/DailyHistory.html?req_city=Singapore&req_statename=Singapore
years <- c(2016) # edit for the year(s) you want
months <- c(1:12)
for (y in years) {
for (m in months) {
if (m %in% c(4, 6, 9, 11)) {
days <- c(1:30) # Apr, Jun, Sep, Nov have 30 days
} else if (m == 2 && y %% 4 == 0) {
days <- c(1:29) # leap-year Feb
} else if (m == 2 && y %% 4 != 0) {
days <- c(1:28) # non-leap-year Feb
} else {
days <- c(1:31) # all the rest have 31 days
}
#for (d in days) {
for (d in 1) {
URL <- paste0("https://www.wunderground.com/history/airport/WSSS/",
y, "/",
m, "/",
d, "/DailyHistory.html?req_city=Singapore&req_statename=Singapore")
print(URL) # try this to test before running the script
raw <- read_html(URL)
max <- raw %>%
html_nodes(xpath='//*[@id="historyTable"]/tbody/tr[3]/td[2]/span/span[1]') %>%
html_text() %>% as.numeric()
min <- raw %>%
html_nodes(xpath='//*[@id="historyTable"]/tbody/tr[4]/td[2]/span/span[1]') %>%
html_text() %>% as.numeric()
date <- ymd(paste(y,m,d, sep="/"))
record <- data.frame(date, min, max)
if ( date == "2016-01-01") {
sing_temp <- record
} else {
sing_temp <- rbind(sing_temp, record) # stack each day's record as a new row
}
}
}
}
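Once the full loop has run (it takes a while, since every day is a separate page request), it is worth checking what accumulated and saving it; the file name below is just an example.
# Inspect the accumulated records and write them out for later analysis.
str(sing_temp)
write.csv(sing_temp, "singapore_temps.csv", row.names = FALSE)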
# Angel Hsu
# Scrape of RE100 members
# October 27, 2016
#setwd("~/Dropbox/NAZCA DATA 2016/RE100")
#URL of the HTML webpage
co.names <- data.frame()
url <- "http://there100.org/companies"
x <- scan(url, what="", sep="\n")
# location of company names starts at line 2464
# <p><a href="http://www.there100.org/ikea" target="_blank"><img src="http://media.virbcdn.com/cdn_images/resize_1024x1365/71/87aded9e69e34239-ikea.jpg" /></a></p>
# <p>The IKEA Group is a home furnishing company with 336 stores in 28 countries. The company has committed to produce as much renewable energy as the total energy it consumes in its buildings by 2020. Alongside Swiss Re, IKEA Group is a founding partner of the RE100 campaign.</p>
start <- grep("The IKEA Group", x) # start of companies
end <- grep("S.p.A is an Italian", x) # end of companies
# each company name is preceded by either a .jpg or a .png
sub <- x[start:end]
sel <- grep("jpg|png", sub)
co.names <- sub[sel+1]
# add back in bank of america
bofa <- grep("Bank of America", x)
co.names <- c(co.names, x[start], x[bofa])
co.names <- gsub("<.*?>", "", co.names)
write.csv(co.names, "RE100_2016.csv", row.names=F)
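For comparison with the rest of the workshop, here is a hedged sketch of the same idea using rvest instead of scan(); the "p" selector is an assumption and would likely need adjusting to the page's real structure.
library(rvest)
# Pull the text of every paragraph on the companies page, then filter
# down to the company descriptions with grep(), much as above.
paragraphs <- read_html("http://there100.org/companies") %>%
  html_nodes("p") %>%
  html_text()
head(paragraphs)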