Webscraping in R

Ryan Thomas
March 15, 2017

Assumptions

I'm assuming …

that you have R installed,
have a basic familiarity with how R works, and
are familiar with at least one R interface (e.g. RStudio or the R command prompt).

Audience

Introductions
- Interested / just getting started
- Data-proficient
- Intermediate hackers
- Experienced R/python coders
“Something for everyone and everything for no one.”
- (hopefully not)

Goals

General process (“steps”) for scraping a web page
Decide when it's appropriate to scrape a web page
- Tell if a page is scrapable before you start
- Make a difficulty/benefit estimation of a page's scrapability.
Know where to look for help

Knowing where to look ... lots of options

Base R: uses functions to sort through text
rvest: relies on pipes and HTML tags
Google specific searches
Stack Overflow
Alternatives
- Chrome apps
- other R packages
- Python packages (Beautiful Soup)
- many more …
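To make the base R option concrete, here is a minimal sketch of "sorting through text" with regular expressions. The HTML string is a made-up stand-in for a downloaded page (an assumption, not taken from any real site); for a real page the text would typically come from something like readLines() on a URL.

```r
# Hypothetical HTML snippet standing in for a downloaded page
page <- '<div><a href="/project/1">Beach Watch</a> <a href="/project/2">Shell Count</a></div>'

# Find every href="..." attribute with a regular expression
matches <- regmatches(page, gregexpr('href="[^"]*"', page))[[1]]

# Strip the surrounding href="..." so only the link targets remain
links <- gsub('^href="|"$', "", matches)
links
```

This works for simple, regular markup; for anything nested or irregular, an HTML parser such as rvest is usually less fragile than regular expressions.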

Overview

Why scrape a web page?
- Necessary concepts for web scraping
- Most useful functions in rvest.
WORM HOLE / TIME WARP
Examples
- Two examples in rvest
- Base R example

Why Scrape?

[Image: cure-all]

Why Scrape?

Probably not the first option, but …

Sometimes data are not accessible any other way
You want to update the data in the future
You need to download a lot of data files

Prerequisites for rvest and web scraping

magrittr
HTML tags (in theory)
“Inspect Element”
sometimes loops
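As a rough illustration of where loops come in, the sketch below repeats the same extraction over several pages and collects the results. The inline HTML strings and the h2.title selector are hypothetical stand-ins for pages that would normally be fetched by URL.

```r
library(rvest)

# Hypothetical pages: inline HTML strings stand in for pages fetched from URLs
pages <- c(
  '<html><body><h2 class="title">Project A</h2><h2 class="title">Project B</h2></body></html>',
  '<html><body><h2 class="title">Project C</h2></body></html>'
)

titles <- character(0)
for (p in pages) {
  # Parse one page and pull the text of every <h2 class="title"> node
  page_titles <- read_html(p) %>% html_nodes("h2.title") %>% html_text()
  # Append this page's titles to the running result
  titles <- c(titles, page_titles)
}
titles
```

With real pages, the vector would hold URLs instead of HTML strings, and a short pause between requests (for example with Sys.sleep()) is a common courtesy to the server.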

A Prerequisite to rvest: magrittr

This is not a pipe.

This is a pipe: %>%

The main advantage of piping is to minimize the need for nesting functions.
Piping is also useful in lots of other packages, such as ggplot2 and dplyr.

Silly magrittr example

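library(magrittr)
x <- seq(0, 20, by = 4)  # x is not defined on the slide; reconstructed from the output below
t(cbind(x, x))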

  [,1] [,2] [,3] [,4] [,5] [,6]
x    0    4    8   12   16   20
x    0    4    8   12   16   20

x %>% cbind( ., x ) %>% t()

  [,1] [,2] [,3] [,4] [,5] [,6]
.    0    4    8   12   16   20
x    0    4    8   12   16   20

Prerequisites to webscraping: HTML tags

View the HTML by right-clicking a page and selecting “View Page Source”.
Example: source code for SciStarter.

<div>
<a href=...
<table>
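To show how these tags look from the R side, here is a small sketch that parses a made-up page and selects nodes by tag name. The HTML below is hypothetical and is not the actual SciStarter source.

```r
library(rvest)

# Hypothetical page; not the actual SciStarter source
html <- '<html><body>
  <div class="project"><a href="/project/1">Beach Watch</a></div>
  <table><tr><th>Name</th></tr><tr><td>Beach Watch</td></tr></table>
</body></html>'

page <- read_html(html)

# <a> tags are links; the href attribute holds the link target
link_targets <- page %>% html_nodes("a") %>% html_attr("href")

# <div> tags group content; the class attribute helps tell them apart
div_classes <- page %>% html_nodes("div") %>% html_attr("class")

# <table> tags hold tabular data; count how many the page contains
n_tables <- page %>% html_nodes("table") %>% length()
```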

Too many prerequisites - I'm not interested!

cry baby

You only need to be familiar, not an expert.
The prerequisites will become second nature.
Don't worry, I will give you working code :)

Now on to the scraping

SciStarter website: projects filtered by location "At the beach".
To scrape the project info from this page, we need to look at the HTML tags.

Getting Familiar with the Website Tags

Right-click -> “Inspect Element”
Example of the "Inspect Element" screen for SciStarter.
We will make an R data frame of all the projects on this website.
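A minimal sketch of the idea, assuming “Inspect Element” showed each project sitting in its own div with a title and a topic inside. The class names (project, title, topic) and the HTML are hypothetical; the real SciStarter markup will differ and has to be read off the inspector.

```r
library(rvest)

# Hypothetical markup; class names are assumptions, not SciStarter's real ones
html <- '<div class="project"><h3 class="title">Beach Watch</h3><p class="topic">Ocean</p></div>
<div class="project"><h3 class="title">Shell Count</h3><p class="topic">Biology</p></div>'

page <- read_html(html)

# One node per project "card"
cards <- page %>% html_nodes("div.project")

# Pull one field per card and assemble the columns into a data frame
projects <- data.frame(
  title = cards %>% html_nodes(".title") %>% html_text(),
  topic = cards %>% html_nodes(".topic") %>% html_text(),
  stringsAsFactors = FALSE
)
projects
```

This column-by-column approach assumes every card has exactly one title and one topic; if some fields can be missing, the columns may end up with different lengths and need per-card handling.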

Basic Functions in rvest

Open up R or RStudio
Install rvest -> install.packages('rvest').
?html_nodes()
?html_table()
?html_text()
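The three functions above can be tried on a small inline page before pointing them at a real site. This is a sketch only; the HTML and the p.intro selector are made up for illustration.

```r
library(rvest)

# Hypothetical page with one paragraph and one table
html <- '<html><body>
<p class="intro">Projects at the beach</p>
<table>
<tr><th>Name</th><th>Topic</th></tr>
<tr><td>Beach Watch</td><td>Ocean</td></tr>
<tr><td>Shell Count</td><td>Biology</td></tr>
</table>
</body></html>'

page <- read_html(html)

# html_nodes() selects nodes; html_text() extracts the text inside them
intro <- page %>% html_nodes("p.intro") %>% html_text()

# html_table() converts each selected <table> into a data frame (returned as a list)
tables <- page %>% html_nodes("table") %>% html_table()
projects <- tables[[1]]
projects
```

When the data already live in a table tag, html_table() is usually the shortest route; html_nodes() plus html_text() is the fallback for everything else.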

Examples

Go to https://rpubs.com/ryanthomas/YNUS-Webscraping-in-R for the rest of the workshop.
