webscraping tutorial!
Webscraping uses css elements on a website to take information. Use inspect element to select different parts of a page.
Libraries
library(rvest)
library(tidyverse) # can also use base R
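No matter the site, the basic pattern is always the same: load the page, select elements with css, then pull out the text (or attributes). Here's a tiny sketch against a placeholder page (example.com is just a stand-in, not one of the sites used below):
page <- read_html("https://example.com")     # load the page
headings <- html_elements(page, css = "h1")  # select elements with css
html_text(headings)                          # pull out the text inside them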
Using Thriftbooks
Load page
url <- "https://www.thriftbooks.com/b/thriftbooks-deals/" # set the url that you are using
webpage <- read_html(url) # of course you can also just paste the url in here
Start taking elements
Use inspect element to choose elements to scrape.
# book titles
titleelem <- html_elements(webpage, css = "div.LandingPage-bookCardTitle")
title <- html_text(titleelem)
# book authors
authorelem <- html_elements(webpage, css = "div.LandingPage-bookCardAuthor")
author <- html_text(authorelem)
# book prices
priceelem <- html_elements(webpage, css = "div.BookSlide-Price")
price <- html_text(priceelem)
# that one is really messy so:
price_stripped <- parse_number(price)
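(parse_number() comes from readr, which loads with the tidyverse - it pulls the first number out of a messy string. A quick made-up example of what it does:)
parse_number("from $5.79 - list price $14.99") # returns 5.79 - it grabs the first number it finds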
Create a dataframe
thriftbooks <- tibble(
title = title,
author = author, # for some reason this cuts out at a certain point but I'm not going to bother fixing it for this example
price = price_stripped
)
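To sanity-check the result, you can peek at it and sort by price:
glimpse(thriftbooks)                       # column types and a quick preview
thriftbooks %>% arrange(price) %>% head()  # cheapest deals first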
Using eggplant recipes
Scrape multiple webpages into one dataframe. How you do this:
1) Make a list of the links you want to scrape
2) Write code that will scrape just one of those pages (but that can be applied to any of them)
3) Make an empty dataframe
4) Put the scrape code into a loop that indexes each item on the link list
5) Make a dataframe inside of the for loop and bind/append it to the dataframe from outside the loop
These are the steps I follow here, although I put #2 straight into #4. You can copy the code inside my for loop and paste it outside of it - it will work fine for individual pages. (There’s a bare-bones skeleton of this structure right below.)
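Here is that structure as a skeleton (everything in it is a placeholder - links would be your vector of urls, and .some-title / .some-price are stand-in selectors):
all_pages <- tibble()                          # 3) empty dataframe outside the loop
for (i in seq_along(links)) {                  # 4) loop over each item in the link list
  page <- read_html(links[i])                  # 2) scrape one page
  one_page <- tibble(                          # 5) dataframe made inside the loop...
    title = html_text(html_elements(page, css = ".some-title")),
    price = html_text(html_elements(page, css = ".some-price")),
    link  = links[i]
  )
  all_pages <- bind_rows(all_pages, one_page)  # ...bound onto the one from outside
}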
url <- "https://www.turkeysforlife.com/2021/11/turkish-aubergine-eggplant-recipes.html"
webpage <- read_html(url)
# here's a method to take all the links off a page, which I'm not using, but it can be helpful to experiment with:
#recipelinks <- webpage %>% html_nodes("a") %>% html_attr("href")
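If you did go that route, you'd get every link on the page and then have to filter the list down yourself - something like this (the pattern inside str_detect() is just a guess at what you might filter on):
all_links <- webpage %>% html_elements("a") %>% html_attr("href")
recipe_links <- all_links[!is.na(all_links) & str_detect(all_links, "turkeysforlife\\.com/20")] # rough filter, adjust to taste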
Take links
Now I actually show you what I’m doing in inspect element!
There are different ways to take elements (here are screenshots showing one way that I don’t always use), but this method is usually the best:
Notice the box that appears over the selected element on the page - that is the CSS that you want to copy. You can find it over on the side in the highlighted block in inspect element, just in a slightly different format:
Copy the highlighted CSS, and adjust the formatting to match what you see on the page:
string <- "wprm-recipe-roundup-link wprm-recipe-link wprm-block-text-normal wprm-recipe-roundup-link-inline-button wprm-recipe-link-inline-button wprm-color-accent"
string <- gsub(" ", ".", string) # replace all spaces with periods
string # copy this
[1] "wprm-recipe-roundup-link.wprm-recipe-link.wprm-block-text-normal.wprm-recipe-roundup-link-inline-button.wprm-recipe-link-inline-button.wprm-color-accent"
This piece of css comes from a block that starts with <a class=, so put a. before the above string:
linkel <- html_elements(webpage, css = "a.wprm-recipe-roundup-link.wprm-recipe-link.wprm-block-text-normal.wprm-recipe-roundup-link-inline-button.wprm-recipe-link-inline-button.wprm-color-accent")
links <- linkel %>% html_attr("href")
links
[1] "https://www.turkeysforlife.com/2021/09/kizartma-turkish-fried-vegetables.html"
[2] "https://www.turkeysforlife.com/2020/04/ali-nazik-kebab-recipe.html"
[3] "https://www.turkeysforlife.com/2020/05/imam-bayildi-recipe.html"
[4] "https://www.turkeysforlife.com/2020/04/karniyarik-stuffed-aubergine-recipe.html"
[5] "https://www.turkeysforlife.com/2014/06/turkish-recipes-baba-ganoush-meze-yoghurt.html"
[6] "https://www.turkeysforlife.com/2021/07/saksuka-recipe-turkish.html"
[7] "https://www.turkeysforlife.com/2011/07/aubergine-salad-eggplant-turkish.html"
[8] "https://www.turkeysforlife.com/2010/06/turkish-food-eksili-patlican-sour.html"
[9] "https://www.turkeysforlife.com/2012/09/turkish-recipes-hunkar-begendi-ottoman.html"
[10] "https://www.turkeysforlife.com/2010/10/turkish-food-turkish-musakka-recipe.html"
[11] "https://www.turkeysforlife.com/2021/05/chickpea-aubergine-stew-recipe.html"
[12] "https://www.turkeysforlife.com/2022/04/vegetable-guvec-recipe.html"
If the block starts with div, you’ll type div. first; if it starts with h3, you’ll type h3. first; and so on.
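If you end up doing that space-to-period conversion a lot, you can wrap it in a little helper function (this is just a convenience I'm sketching here, not something built into rvest):
make_selector <- function(tag, class_string) {
  # build a css selector from the tag name plus the copied class string
  paste0(tag, ".", gsub(" ", ".", class_string))
}
make_selector("a", "wprm-recipe-roundup-link wprm-recipe-link") # "a.wprm-recipe-roundup-link.wprm-recipe-link"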
Create dataframe
First: make a helpful function that fills in NA for elements that just aren’t there. This prevents errors later on - if you try to access the CSS for “prep time” but there is no “prep time”, it fills in NA instead of character(0):
NAcheck <- function(x) {
if(length(x) == 0) { # if x has no value (length == 0)
return(NA) # fill in NA
}
else { # otherwise
return(x) # leave as is
}
}
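To see what it's guarding against - a selector that matches nothing gives character(0) (length 0), which would mess up the dataframe, while NA just shows up as an empty cell:
NAcheck(character(0)) # what you get when a selector matches nothing - returns NA
NAcheck("20 minutes") # anything that actually exists passes through unchanged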
Use a for loop to scrape each recipe into a dataframe, using the links above
# make an empty tibble outside of the for loop to fill inside it
eggplantrecipes <- tibble()
for (item in 1:12) { # "item" is the loop variable; it takes the values 1 through 12, which is the number of links in the list
url <- links[item] # index the url that we'll scrape (each "item" - 1 through 12)
webpage <- read_html(url)
# next: code you would use to scrape just one recipe. because it's inside a for loop, this code will scrape all 12
# html elements
titleelem <- html_elements(webpage, css = ".entry-title")
updateelem <- html_elements(webpage, css = ".entry-date")
courseelem <- html_elements(webpage, css = ".wprm-recipe-course")
rateelem <- html_elements(webpage, css = ".wprm-recipe-rating-average")
prepelem <- html_elements(webpage, css = ".wprm-recipe-prep_time")
cookelem <- html_elements(webpage, css = ".wprm-recipe-cook_time")
servingselem <- html_elements(webpage, css = ".wprm-recipe-servings")
# html text
# using the NAcheck() function here because these are what we're going to put directly into the dataframe. this way, we'll get NA's instead of blank characters that will give errors
title <- NAcheck(html_text(titleelem))
update <- NAcheck(html_text(updateelem))
rate <- NAcheck(html_text(rateelem))
course <- NAcheck(html_text(courseelem))
prep <- NAcheck(html_text(prepelem))
cook <- NAcheck(html_text(cookelem))
servings <- NAcheck(html_text(servingselem))
# make a dataframe
# this only makes a dataframe for ONE recipe because it is being made inside of the loop
recipe <- tibble(
title = title,
lastupdated = update,
course = course,
preptime = prep,
cooktime = cook,
servings = servings,
link = url
)
# join this to dataframe/tibble made outside of loop (because it's outside the loop, each loop through this will add another row to it)
eggplantrecipes <- bind_rows(eggplantrecipes, recipe) # bind_rows() appends the new row; base R's rbind() works here too
}
# now download it
write_csv(eggplantrecipes, "eggplantrecipes.csv")
Goodreads example, using this random person’s book list
I’m doing this in two parts with two loops (1: make a list of links; 2: send the items in the list through the scraping code), but you could probably combine them into one loop.
Part 1: Create list of links
To scrape multiple pages from one site, you need to set up a loop that pastes the page number into the url - eg paste0("website.com/page/", pagenumber). Different websites have different url patterns for the page number. If that didn’t make sense, just read this code block carefully:
# url
baseurl <- "https://www.goodreads.com/review/list/24109488-arthur"
endurlpattern <- "?shelf=read" # the page number goes in the middle of this link for some reason
# empty tibble
booklinks <- tibble()
# loop to take all links
for (pagenumber in 1:6) {
if (pagenumber == 1) { # for the first page
url <- paste0(baseurl, endurlpattern) # just paste the entire link as is
}
else if (pagenumber > 1) { # for the pages after page 1
url <- paste0(baseurl, "?page=", pagenumber, endurlpattern) # paste this pattern and fill in the page number
}
# put link into a dataframe
links <- tibble(
link = url)
# combine dataframe with dataframe defined outside the loop ("append"). it's probably better to do this with a list/vector instead
booklinks <- bind_rows(booklinks, links)
}
# turn column into list
linklist <- booklinks$link
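As an aside (since the comment in the loop mentions it): paste0() is vectorised, so the same list of links can be built in one go, with no loop or tibble needed:
linklist <- c(paste0(baseurl, endurlpattern),                # page 1
              paste0(baseurl, "?page=", 2:6, endurlpattern)) # pages 2-6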
Part 2: Scrape from each page
shelf <- tibble() # create empty dataframe outside of loop to append/bind to
for (pagenumber in 1:6) { # change as more pages are added
url <- linklist[pagenumber] # index the item from the list of links
webpage <- read_html(url)
# ratings
rating_html <- html_elements(webpage, css = "td.field.avg_rating")
# rating #
ratenum_html <- html_elements(webpage, css = "td.field.num_ratings")
# title
title_html <- html_elements(webpage, css = "td.field.title")
# author
author_html <- html_elements(webpage, css = "td.field.author")
# turn to text
rating_data <- html_text(rating_html)
ratenum_data <- html_text(ratenum_html)
title_data <- html_text(title_html)
author_data <- html_text(author_html)
# strip down to just number for ratings and #
rating_data_stripped <- parse_number(rating_data)
ratenum_data_stripped <- parse_number(ratenum_data)
# cut out characters in title data
# the title text starts around character 17, so cut off everything before it
title_data_stripped <- str_sub(title_data, start = 17L)
# titles: remove everything after the \n
title_data_strippedtext <- str_extract(title_data_stripped, "^.*(?=(\n))")
# https://datascience.stackexchange.com/questions/8922/removing-strings-after-a-certain-character-in-a-given-text
# authors
author_data_stripped <- str_sub(author_data, start = 13L)
# remove \n in author
author_data_strippedtext <- str_extract(author_data_stripped, "^.*(?=(\n))")
# make tibble
shelfp <- tibble(
title = title_data_strippedtext,
author = author_data_strippedtext,
rating = rating_data_stripped,
ratenum = ratenum_data_stripped
)
# author was entered backwards as Lastname, Firstname
# the easiest way to fix that is using tidyverse and mutating columns in a dataframe:
# separate author column into first and last name
shelfpp <- shelfp %>%
separate(author, into = c("lastn", "firstn"), sep = ",")
# put author column back together
shelfpp <- shelfpp %>%
mutate(
author = paste(firstn, lastn)
) %>%
# remove first and last name columns
select(-firstn, -lastn) %>%
# relocate author column
relocate(author, .after = title)
# combine to supertibble
shelf <- bind_rows(shelf, shelfpp)
}
shelf %>% arrange(desc(rating))
# A tibble: 120 × 4
title author rating ratenum
<chr> <chr> <dbl> <dbl>
1 Node.js Design Patterns: Level up your Node.js skills … " Luc… 5 2
2 ASP.NET Core 9 Web API Cookbook: Over 60 hands-on reci… " Luk… 5 1
3 Modern Full-Stack Web Development with ASP.NET Core: A… " Ale… 5 1
4 Real-World Web Development with .NET 9: Build websites… " Mar… 5 2
5 API Testing and Development with Postman: API creation… " Dav… 5 1
6 Mastering Node.js Web Development: Go on a comprehensi… " Ada… 5 4
7 FuelPHP Application Development Blueprints " Séb… 5 1
8 React Key Concepts: An in-depth guide to React's core … " Max… 4.83 6
9 Dashboards for Excel " Jor… 4.8 10
10 There Are No Foreign Lands: An Inquiry Concerning Inte… " Jef… 4.67 3
# ℹ 110 more rows
# goodness arthur reads some fascinating books
Some websites try to prevent scraping (or limit how much can be scraped), so sometimes you will get a 403 error or similar, meaning that the website denied your request. Some websites may also try to ban you from accessing them. Don’t try scraping Google, and don’t run super extensive scraping code too many times in a row - for example, if you’re scraping 5000 pages on a website, limit your testing code to just 5-50 pages or so to limit your burden on the website. There’s always a way around these issues, but so far the only one I’ve experienced is the 403. Don’t quote me on any of this.
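If you do run into rate limits or flaky pages, two small precautions help: pause between requests, and catch failures so one bad page doesn't kill the whole loop. A rough sketch (the 2-second pause is arbitrary, and it reuses the eggplant links from earlier):
for (item in seq_along(links)) {
  Sys.sleep(2) # wait a couple of seconds between requests to go easy on the website
  webpage <- tryCatch(read_html(links[item]), error = function(e) NULL) # a 403 or timeout becomes NULL
  if (is.null(webpage)) next # skip pages that failed instead of erroring out
  # ... the scraping code from the for loop above would go here ...
}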