Web scraping tutorial!

Author

Julian Beckert

Web scraping uses CSS selectors to pull information out of a web page's HTML. Use your browser's Inspect Element tool to find the selector for each part of the page you want.
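As a minimal sketch of what selecting by CSS looks like in rvest: the page below is a made-up stand-in built with minimal_html so the example runs offline, and the id, class, and link in it are assumptions for illustration only. On a real site you would use read_html on the URL and the selectors you find with Inspect Element.

```r
library(rvest)

# A tiny made-up page, so nothing here depends on a real website
page <- minimal_html('
  <h1 id="title">Fruit prices</h1>
  <p class="price">Apple: $1</p>
  <p class="price">Pear: $2</p>
  <a href="https://example.com/more">More</a>
')

# "#title" selects by id, ".price" by class, "a" by tag name
title  <- page %>% html_element("#title") %>% html_text2()
prices <- page %>% html_elements(".price") %>% html_text2()

# html_attr pulls an attribute (here the link target) instead of the text
link <- page %>% html_element("a") %>% html_attr("href")
```

html_element returns the first match, while html_elements returns every match, which is why prices comes back with one entry per paragraph.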

I recommend pasting this code into a document of your own and following along. You can also download my Quarto document and some extra web scraping examples on my GitHub.

Libraries

library(rvest)
library(tidyverse) # can also use base R

SOME NOTES

  • If you’re having trouble turning something into text (for example, you’re scraping a list and the result comes back as itemitemitem with nothing separating the items), try html_text2 instead of html_text!
  • The best way to scrape a list is to take it as a character vector with one element per item. Taking it as a single string makes a mess. More info on this in the magnet example on my GitHub…
  • Some websites will try to block you from accessing them. If this only happens when you use a loop to scrape multiple pages on the site, it’s probably because you’re requesting too many pages too quickly. Call Sys.sleep before read_html to add a pause, or wrap read_html in purrr::insistently so a failed request is retried after a wait. This usually fixes it.
  • If it seems impossible to select an element no matter what you do, check the page source (in your browser, right click + “view page source”) and search for the element you want with ctrl+F. If it’s not there, then it’s loaded dynamically, and you won’t be able to access it with read_html. RSelenium and chromote are options for getting around this that I haven’t explored very much. rvest’s read_html_live also handles dynamic pages, but it is usually very slow on my computer when I’m scraping multiple pages.
  • If you want to scrape just one page and it’s saying it can’t open the connection, try read_html_live.
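To make the first two notes concrete, here is a small sketch on a made-up list (the items are placeholders, not the magnet example from my GitHub). It shows the itemitemitem problem when the whole list is taken as one string, and the one-element-per-item vector you get by selecting each li instead.

```r
library(rvest)

# Made-up list, built inline so the example runs offline
page <- minimal_html("<ul><li>first</li><li>second</li><li>third</li></ul>")

# Taking the whole <ul> as one string: html_text runs the items together
one_string <- page %>% html_element("ul") %>% html_text()

# html_text2 tries to lay the text out the way a browser would,
# which usually puts a separator between the items
one_string2 <- page %>% html_element("ul") %>% html_text2()

# Selecting every <li> gives a character vector, one element per item
items <- page %>% html_elements("li") %>% html_text2()
```

Here one_string has nothing left to split on, while items is already in a shape you can put straight into a data frame column.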
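For the note on getting blocked, this is one possible way to combine a pause with retries when looping over pages. The function name scrape_headings, the h1 selector, the example.com URLs, and the pause lengths are all assumptions for illustration, so adjust them to the site you are scraping.

```r
library(rvest)
library(purrr)

# Wrap read_html so a failed request is retried (up to 3 attempts),
# waiting longer between attempts each time
read_html_retry <- insistently(
  read_html,
  rate = rate_backoff(pause_base = 1, max_times = 3),
  quiet = FALSE
)

# Hypothetical helper: read each URL in turn and return its <h1> text
scrape_headings <- function(urls, reader = read_html_retry, pause = 2) {
  map(urls, function(url) {
    Sys.sleep(pause)  # wait before every request so the site is not hammered
    page <- reader(url)
    page %>% html_elements("h1") %>% html_text2()
  })
}

# Placeholder URLs; not run here because they are not real pages to scrape
# urls <- c("https://example.com/page/1", "https://example.com/page/2")
# headings <- scrape_headings(urls)
```

The reader argument is only there so the loop can be tried out without touching the network, for example by passing a function that returns a minimal_html page.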