Webscraping in R

Ryan Thomas
March 15, 2017

Assumptions

I'm assuming …

  • that you have R installed,
  • have a basic familiarity with how R works, and
  • are familiar with at least one R interface (e.g. RStudio or the R command prompt).

Audience

  • Introductions

    • Interested / just getting started
    • Data-proficient
    • Intermediate hackers
    • Experienced R/python coders
  • “Something for everyone and everything for no one.”

    • (hopefully not)

Goals

  • General process (“steps”) for scraping a web page
  • Decide when it's appropriate to scrape a web page

    • Tell if a page is scrapable before you start
    • Make a difficulty/benefit estimation of a page's scrapability.
  • Know where to look for help

Knowing where to look ... lots of options

  • Base R: uses functions to sort through text
  • rvest: relies on pipes and built-in HTML tags
  • Google specific searches
  • Stack Overflow

  • Alternatives

    • Chrome apps
    • other R packages
    • Python packages (Beautiful Soup)
    • many more …

Overview

  • Why scrape a web page?

    • Necessary concepts for web scraping
    • Most useful functions in rvest.

    WORM HOLE / TIME WARP

  • Examples

    • Two examples in rvest
    • Base R example

Why Scrape?

Probably not the first option, but …

  • Sometimes data are not accessible any other way
  • You want to update the data in the future
  • You need to download a lot of data files

Prerequisites for rvest and web scraping

  • magrittr
  • HTML tags (in theory)
  • “Inspect Element”
  • sometimes loops
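Loops come in when the data are spread over several pages. A minimal sketch, assuming rvest is installed: the two HTML strings below are made-up stand-ins for downloaded pages (with a real site you would build each page's URL and pass it to read_html() instead).

```r
library(rvest)

# Stand-ins for several pages of results (hypothetical content).
# With a real site, these would be URLs passed to read_html().
pages <- c(
  "<html><body><h2>Beach Cleanup</h2><h2>Shell Survey</h2></body></html>",
  "<html><body><h2>Tide Watch</h2></body></html>"
)

all_titles <- character(0)
for (i in seq_along(pages)) {
  page <- read_html(pages[i])                           # parse one page
  titles <- page %>% html_nodes("h2") %>% html_text()   # pull its headings
  all_titles <- c(all_titles, titles)                   # append to the results
}

print(all_titles)
```

The same pattern works for downloading many data files: loop over a vector of file URLs and handle one per iteration.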

A Pre-requisite to rvest: magrittr

This is not a pipe.

This is a pipe: %>%

  • The main advantage of piping is to minimize the need for nesting functions.

  • Piping is also useful with lots of other packages, e.g. ggplot2 and dplyr.

Silly magrittr example

library(magrittr)
x <- seq(0, 20, by = 4)  # definition inferred from the output below
t(cbind(x, x))
  [,1] [,2] [,3] [,4] [,5] [,6]
x    0    4    8   12   16   20
x    0    4    8   12   16   20
x %>% cbind( ., x ) %>% t()
  [,1] [,2] [,3] [,4] [,5] [,6]
.    0    4    8   12   16   20
x    0    4    8   12   16   20

Prerequisites to webscraping: HTML tags

View a page's HTML by right-clicking and selecting “View page source”.

Figure: source code for SciStarter.

  • <div>
  • <a href=...
  • <table>
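To make these three tags concrete, here is a minimal sketch that parses a small, made-up HTML snippet and picks out each kind of tag. It assumes rvest is installed; the project names and example.org URLs are invented for illustration and are not the real SciStarter source.

```r
library(rvest)

# A made-up page (hypothetical content, NOT the real SciStarter source)
page <- read_html('
  <html><body>
    <div class="project">
      <a href="https://example.org/beach-cleanup">Beach Cleanup</a>
    </div>
    <div class="project">
      <a href="https://example.org/shell-survey">Shell Survey</a>
    </div>
    <table>
      <tr><th>Project</th><th>Volunteers</th></tr>
      <tr><td>Beach Cleanup</td><td>12</td></tr>
      <tr><td>Shell Survey</td><td>7</td></tr>
    </table>
  </body></html>
')

# <div> groups content into blocks; count the divs on the page
n_divs <- length(html_nodes(page, "div"))

# <a href=...> is a link; the href attribute holds the target URL
links <- page %>% html_nodes("a") %>% html_attr("href")

# <table> holds tabular data; count the tables on the page
n_tables <- length(html_nodes(page, "table"))
```

Real pages are much longer, but the idea is the same: find the tag that wraps the information you want, then select it by name.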

Too many prerequisites - I'm not interested!

  • You only need to be familiar, not an expert.
  • They will become second nature.
  • Don't worry, I will give you working code :)

Now on to the scraping

Figure: SciStarter website, projects filtered by location "At the beach".

To scrape the projects info from this page, we need to take a look at the HTML tags.

Getting Familiar with the Website Tags

Right-click -> “Inspect Element”

Figure: example of the "Inspect Element" screen for SciStarter.

We will make an R data frame of all the projects on this website.

Basic Functions in rvest

  • Open up R or RStudio
  • Install rvest -> install.packages('rvest').
  • ?html_nodes()
  • ?html_table()
  • ?html_text()
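As a rough preview of what these three functions do, here is a minimal sketch run on a made-up HTML snippet rather than a live website. It assumes rvest is installed; the page content, the "title" class, and the column names are hypothetical.

```r
library(rvest)

# Made-up HTML for illustration (hypothetical content)
page <- read_html('
  <html><body>
    <h2 class="title">Beach Cleanup</h2>
    <h2 class="title">Shell Survey</h2>
    <table>
      <tr><th>project</th><th>volunteers</th></tr>
      <tr><td>Beach Cleanup</td><td>12</td></tr>
      <tr><td>Shell Survey</td><td>7</td></tr>
    </table>
  </body></html>
')

# html_nodes(): select every node matching a CSS selector
title_nodes <- html_nodes(page, "h2.title")

# html_text(): extract the text inside the selected nodes
titles <- html_text(title_nodes)

# html_table(): convert <table> nodes into data frames
tables <- page %>% html_nodes("table") %>% html_table()
projects <- tables[[1]]  # html_table() on a set of nodes returns a list

print(titles)
print(projects)
```

In newer rvest releases html_nodes() has been superseded by html_elements(); the older name is used here to match the functions listed above.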

Examples

Go to https://rpubs.com/ryanthomas/YNUS-Webscraping-in-R for the rest of the workshop.