Web scraping tutorial!

Author

Julian

Web scraping uses CSS selectors to pull information out of a web page. Use your browser’s Inspect Element tool to find the selector for the part of the page you want.

I recommend pasting this code into a document of your own and following along. You can also download my Quarto document on GitHub (eventually… link TBA)

Libraries

library(rvest)
library(tidyverse) # can also use base R
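To make the selector idea concrete, here is a minimal sketch of the basic workflow. It builds a tiny made-up page with rvest’s minimal_html() so it runs without a connection; the class names (title, intro) and the page contents are invented for illustration. On a real site you would start from read_html() with the page’s URL instead.

```r
library(rvest)

# A tiny made-up page, so the example runs offline.
# The tags, classes, and text here are illustrative assumptions.
page <- minimal_html('
  <h1 class="title">Example recipes</h1>
  <p class="intro">Three things to cook this week.</p>
')

# Pick one element with a CSS selector, then pull out its text
title <- page |> html_element("h1.title") |> html_text2()
intro <- page |> html_element("p.intro") |> html_text2()

title
intro
```

The selector strings are the same ones you would find with Inspect Element: tag.class for a class, #id for an id.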

SOME NOTES

  • If text comes out run together (for example, you’re trying to take a list, and the result is just listitemlistitemlistitem with nothing separating the items), try using html_text2() instead of html_text()!
  • If you want to take a list, try taking it as a vector: select the CSS class that each list item uses (often something like li.list-item-text) with html_elements(). You can also do .the-css-class-that-all-list-items-collapse-into li (or ol / ul, depending on the list type)
  • Some websites will try to block you from accessing them. If this only happens when using loops/scraping multiple pages on the site, it’s likely because you’re accessing too many pages too quickly. Call Sys.sleep() before read_html() to add a pause, or wrap read_html() in purrr::slowly() (adds a delay between calls) or purrr::insistently() (retries after a failure). This usually fixes it (Sys.sleep() uses seconds, so if you’re scraping hundreds of pages at once, it’s best to make the number small, not more than 2, to make it less painfully slow for you)
  • If it seems impossible to select an element no matter what you do, check the page source (in your browser, right click + “view page source”). Search for the element you want with ctrl+F/find in page. If it’s not there, then it’s loaded dynamically, and you won’t be able to access it with read_html(). RSelenium and chromote are options I haven’t explored very far (rvest’s read_html_live(), which uses chromote, is very slow on my computer if I’m trying to scrape multiple pages)
  • If you want to scrape just one page and it’s saying it can’t open the connection, try read_html_live()
  • Supposedly Python is better(?) for webscraping… I wouldn’t know. I think having to worry about indentation while web scraping is some sort of cruel and unusual punishment
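The first two notes are easier to see on a small example. This is a sketch on a made-up page built with minimal_html(); the class names (ingredients, list-item-text, blurb) and the contents are assumptions for illustration, not from any real site.

```r
library(rvest)

# Made-up page: a list written with no whitespace between items,
# plus a paragraph containing a <br>
page <- minimal_html(paste0(
  '<ul class="ingredients">',
  '<li class="list-item-text">flour</li>',
  '<li class="list-item-text">sugar</li>',
  '<li class="list-item-text">butter</li>',
  '</ul>',
  '<p class="blurb">Line one<br>Line two</p>'
))

# html_text() on the whole list runs the items together
run_together <- page |> html_element("ul.ingredients") |> html_text()

# html_elements() (plural) returns every match, so the text comes back
# as a character vector with one entry per list item
items <- page |> html_elements("li.list-item-text") |> html_text2()

# Same items, selected through the parent list instead of the item class
items_via_parent <- page |> html_elements("ul.ingredients li") |> html_text2()

# html_text() ignores the <br>; html_text2() turns it into a newline
blurb <- page |> html_element("p.blurb")
blurb_raw <- html_text(blurb)
blurb_clean <- html_text2(blurb)

items
```

Selecting the items individually is usually the more reliable route, since you get a vector directly instead of one string you then have to split.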
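For the note on pausing between pages, here is one way it could look. To keep it runnable offline, fake_read() is a hypothetical stand-in for read_html(), and the URLs are placeholders; the pause lengths are arbitrary small values chosen so the example finishes quickly.

```r
library(purrr)

# Hypothetical stand-in for rvest::read_html(); swap in the real
# function when scraping an actual site
fake_read <- function(url) paste("contents of", url)

# Placeholder URLs, for illustration only
urls <- c("https://example.com/page/1", "https://example.com/page/2")

# Option 1: Sys.sleep() inside a loop, right before each request
results_loop <- character(length(urls))
for (i in seq_along(urls)) {
  Sys.sleep(0.2)  # pause length is in seconds
  results_loop[i] <- fake_read(urls[i])
}

# Option 2: slowly() wraps a function so repeated calls are spaced out
polite_read <- slowly(fake_read, rate = rate_delay(pause = 0.2))
results_map <- map_chr(urls, polite_read)

# insistently() wraps a function so it is retried if it throws an error,
# waiting longer between attempts (up to max_times tries)
stubborn_read <- insistently(
  fake_read,
  rate = rate_backoff(pause_base = 0.2, max_times = 3)
)
one_page <- stubborn_read(urls[1])
```

slowly() and insistently() solve different problems: the first keeps you from hitting the site too fast, the second helps when a request fails now and then. Both return ordinary functions, so they can be combined if you need both behaviours.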