webscraping tutorial!
Webscraping uses css elements on a website to take information. Use inspect element to select different parts of a page.
Libraries
library(rvest)
library(tidyverse) # can also use base R
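No matter the site, the basic pattern is always the same: load the page, select elements with css, then pull out the text (or attributes). Here's a tiny sketch against a placeholder page (example.com is just a stand-in, not one of the sites used below):
page <- read_html("https://example.com")     # load the page
headings <- html_elements(page, css = "h1")  # select elements with css
html_text(headings)                          # pull out the text inside them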
Using Thriftbooks
Load page
url <- "https://www.thriftbooks.com/b/thriftbooks-deals/" # set the url that you are using
webpage <- read_html(url) # of course you can also just paste the url in here
Start taking elements
Use inspect element to choose elements to scrape.
# book titles
titleelem <- html_elements(webpage, css = "div.LandingPage-bookCardTitle")
title <- html_text(titleelem)
# book authors
authorelem <- html_elements(webpage, css = "div.LandingPage-bookCardAuthor")
author <- html_text(authorelem)
# book prices
priceelem <- html_elements(webpage, css = "div.BookSlide-Price")
price <- html_text(priceelem)
# that one is really messy so:
price_stripped <- parse_number(price)
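(parse_number() comes from readr, which loads with the tidyverse - it pulls the first number out of a messy string. A quick made-up example of what it does:)
parse_number("from $5.79 - list price $14.99") # returns 5.79 - it grabs the first number it finds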
Create a dataframe
thriftbooks <- tibble(
title = title,
author = author, # for some reason this cuts out at a certain point but I'm not going to bother fixing it for this example
price = price_stripped
)
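To sanity-check the result, you can peek at it and sort by price:
glimpse(thriftbooks)                       # column types and a quick preview
thriftbooks %>% arrange(price) %>% head()  # cheapest deals first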
Using eggplant recipes
Scrape multiple webpages into one dataframe. How you do this:
1) Make a list of the links you want to scrape
2) Write code that will scrape just one of those pages (but that can be applied to any of them)
3) Make an empty dataframe
4) Put the scrape code into a loop that indexes each item on the link list
5) Make a dataframe inside of the for loop and bind/append it to the dataframe from outside the loop
These are the steps I follow here, although I put #2 straight into #4. You can copy the code inside my for loop and paste it outside of it - it will work fine for individual pages. (There’s a bare-bones skeleton of this structure right below.)
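Here is that structure as a skeleton (everything in it is a placeholder - links would be your vector of urls, and .some-title / .some-price are stand-in selectors):
all_pages <- tibble()                          # 3) empty dataframe outside the loop
for (i in seq_along(links)) {                  # 4) loop over each item in the link list
  page <- read_html(links[i])                  # 2) scrape one page
  one_page <- tibble(                          # 5) dataframe made inside the loop...
    title = html_text(html_elements(page, css = ".some-title")),
    price = html_text(html_elements(page, css = ".some-price")),
    link  = links[i]
  )
  all_pages <- bind_rows(all_pages, one_page)  # ...bound onto the one from outside
}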
url <- "https://www.turkeysforlife.com/2021/11/turkish-aubergine-eggplant-recipes.html"
webpage <- read_html(url)
# here's a method to take all the links off a page, which I'm not using, but it can be helpful to experiment with:
#recipelinks <- webpage %>% html_nodes("a") %>% html_attr("href")
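If you did go that route, you'd get every link on the page and then have to filter the list down yourself - something like this (the pattern inside str_detect() is just a guess at what you might filter on):
all_links <- webpage %>% html_elements("a") %>% html_attr("href")
recipe_links <- all_links[!is.na(all_links) & str_detect(all_links, "turkeysforlife\\.com/20")] # rough filter, adjust to taste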
Take links
Now I actually show you what I’m doing in inspect element!
There are different ways to take elements (here are screenshots showing one way that I don’t always use), but this method is usually the best:
Notice the box that appears over the selected element on the page - that is the CSS that you want to copy. You can find it over on the side in the highlighted block in inspect element, just in a slightly different format:
Copy the highlighted CSS, and adjust the formatting to match what you see on the page:
string <- "wprm-recipe-roundup-link wprm-recipe-link wprm-block-text-normal wprm-recipe-roundup-link-inline-button wprm-recipe-link-inline-button wprm-color-accent"
string <- gsub(" ", ".", string) # replace all spaces with periods
string # copy this
[1] "wprm-recipe-roundup-link.wprm-recipe-link.wprm-block-text-normal.wprm-recipe-roundup-link-inline-button.wprm-recipe-link-inline-button.wprm-color-accent"
This piece of css comes from a block that starts with <a class=, so put a. before the above string:
linkel <- html_elements(webpage, css = "a.wprm-recipe-roundup-link.wprm-recipe-link.wprm-block-text-normal.wprm-recipe-roundup-link-inline-button.wprm-recipe-link-inline-button.wprm-color-accent")
links <- linkel %>% html_attr("href")
links
[1] "https://www.turkeysforlife.com/2021/09/kizartma-turkish-fried-vegetables.html"
[2] "https://www.turkeysforlife.com/2020/04/ali-nazik-kebab-recipe.html"
[3] "https://www.turkeysforlife.com/2020/05/imam-bayildi-recipe.html"
[4] "https://www.turkeysforlife.com/2020/04/karniyarik-stuffed-aubergine-recipe.html"
[5] "https://www.turkeysforlife.com/2014/06/turkish-recipes-baba-ganoush-meze-yoghurt.html"
[6] "https://www.turkeysforlife.com/2021/07/saksuka-recipe-turkish.html"
[7] "https://www.turkeysforlife.com/2011/07/aubergine-salad-eggplant-turkish.html"
[8] "https://www.turkeysforlife.com/2010/06/turkish-food-eksili-patlican-sour.html"
[9] "https://www.turkeysforlife.com/2012/09/turkish-recipes-hunkar-begendi-ottoman.html"
[10] "https://www.turkeysforlife.com/2010/10/turkish-food-turkish-musakka-recipe.html"
[11] "https://www.turkeysforlife.com/2021/05/chickpea-aubergine-stew-recipe.html"
[12] "https://www.turkeysforlife.com/2022/04/vegetable-guvec-recipe.html"
If the block starts with div, you’ll type div. first; if it starts with h3, you’ll type h3. first; and so on.
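If you end up doing that space-to-period conversion a lot, you can wrap it in a little helper function (this is just a convenience I'm sketching here, not something built into rvest):
make_selector <- function(tag, class_string) {
  # build a css selector from the tag name plus the copied class string
  paste0(tag, ".", gsub(" ", ".", class_string))
}
make_selector("a", "wprm-recipe-roundup-link wprm-recipe-link") # "a.wprm-recipe-roundup-link.wprm-recipe-link"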
Create dataframe
First: make a helpful function that fills in NA for elements that just aren’t there. This prevents errors later on - if you try to access the CSS for “prep time” but there is no “prep time”, it fills in NA instead of character(0):
NAcheck <- function(x) {
if(length(x) == 0) { # if x has no value (length == 0)
return(NA) # fill in NA
}
else { # otherwise
return(x) # leave as is
}
}
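To see what it's guarding against - a selector that matches nothing gives character(0) (length 0), which would mess up the dataframe, while NA just shows up as an empty cell:
NAcheck(character(0)) # what you get when a selector matches nothing - returns NA
NAcheck("20 minutes") # anything that actually exists passes through unchanged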
Use a for loop to scrape each recipe into a dataframe, using the links above
# make an empty tibble outside of the for loop to fill inside it
eggplantrecipes <- tibble()
for (item in 1:12) { # "item" is the loop variable; it takes the values 1 through 12, which is the number of links in the list
url <- links[item] # index the url that we'll scrape (each "item" - 1 through 12)
webpage <- read_html(url)
# next: code you would use to scrape just one recipe. because it's inside a for loop, this code will scrape all 12
# html elements
titleelem <- html_elements(webpage, css = ".entry-title")
updateelem <- html_elements(webpage, css = ".entry-date")
courseelem <- html_elements(webpage, css = ".wprm-recipe-course")
rateelem <- html_elements(webpage, css = ".wprm-recipe-rating-average")
prepelem <- html_elements(webpage, css = ".wprm-recipe-prep_time")
cookelem <- html_elements(webpage, css = ".wprm-recipe-cook_time")
servingselem <- html_elements(webpage, css = ".wprm-recipe-servings")
# html text
# using the NAcheck() function here because these are what we're going to put directly into the dataframe. this way, we'll get NA's instead of blank characters that will give errors
title <- NAcheck(html_text(titleelem))
update <- NAcheck(html_text(updateelem))
rate <- NAcheck(html_text(rateelem))
course <- NAcheck(html_text(courseelem))
prep <- NAcheck(html_text(prepelem))
cook <- NAcheck(html_text(cookelem))
servings <- NAcheck(html_text(servingselem))
# make a dataframe
# this only makes a dataframe for ONE recipe because it is being made inside of the loop
recipe <- tibble(
title = title,
lastupdated = update,
course = course,
preptime = prep,
cooktime = cook,
servings = servings,
link = url
)
# join this to dataframe/tibble made outside of loop (because it's outside the loop, each loop through this will add another row to it)
eggplantrecipes <- bind_rows(eggplantrecipes, recipe) # bind_rows() appends the new row; base R's rbind() works here too
}
# now download it
write_csv(eggplantrecipes, "eggplantrecipes.csv")
Goodreads example, using this random person’s book list
I’m doing this in two parts with two loops (1: make a list of links; 2: send the items in the list through the scraping code), but you could probably combine them into one loop.
Part 1: Create list of links
To scrape multiple pages from one site, you need to set up a loop that pastes the page number into the url - eg paste0("website.com/page/", pagenumber). Different websites have different url patterns for the page number. If that didn’t make sense, just read this code block carefully:
# url
baseurl <- "https://www.goodreads.com/review/list/24109488-arthur"
endurlpattern <- "?shelf=read" # the page number goes in the middle of this link for some reason
# empty tibble
booklinks <- tibble()
# loop to take all links
for (pagenumber in 1:6) {
if (pagenumber == 1) { # for the first page
url <- paste0(baseurl, endurlpattern) # just paste the entire link as is
}
else if (pagenumber > 1) { # for the pages after page 1
url <- paste0(baseurl, "?page=", pagenumber, endurlpattern) # paste this pattern and fill in the page number
}
# put link into a dataframe
links <- tibble(
link = url)
# combine dataframe with dataframe defined outside the loop ("append"). it's probably better to do this with a list/vector instead
booklinks <- bind_rows(booklinks, links)
}
# turn column into list
linklist <- booklinks$link
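As an aside (since the comment in the loop mentions it): paste0() is vectorised, so the same list of links can be built in one go, with no loop or tibble needed:
linklist <- c(paste0(baseurl, endurlpattern),                # page 1
              paste0(baseurl, "?page=", 2:6, endurlpattern)) # pages 2-6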
Part 2: Scrape from each page
shelf <- tibble() # create empty dataframe outside of loop to append/bind to
for (pagenumber in 1:6) { # change as more pages are added
url <- linklist[pagenumber] # index the item from the list of links
webpage <- read_html(url)
# ratings
rating_html <- html_elements(webpage, css = "td.field.avg_rating")
# rating #
ratenum_html <- html_elements(webpage, css = "td.field.num_ratings")
# title
title_html <- html_elements(webpage, css = "td.field.title")
# author
author_html <- html_elements(webpage, css = "td.field.author")
# turn to text
rating_data <- html_text(rating_html)
ratenum_data <- html_text(ratenum_html)
title_data <- html_text(title_html)
author_data <- html_text(author_html)
# strip down to just number for ratings and #
rating_data_stripped <- parse_number(rating_data)
ratenum_data_stripped <- parse_number(ratenum_data)
# cut out characters in title data
# the title text starts around character 17, so cut off everything before it
title_data_stripped <- str_sub(title_data, start = 17L)
# titles: remove everything after the \n
title_data_strippedtext <- str_extract(title_data_stripped, "^.*(?=(\n))")
# https://datascience.stackexchange.com/questions/8922/removing-strings-after-a-certain-character-in-a-given-text
# authors
author_data_stripped <- str_sub(author_data, start = 13L)
# remove \n in author
author_data_strippedtext <- str_extract(author_data_stripped, "^.*(?=(\n))")
# make tibble
shelfp <- tibble(
title = title_data_strippedtext,
author = author_data_strippedtext,
rating = rating_data_stripped,
ratenum = ratenum_data_stripped
)
# author was entered backwards as Lastname, Firstname
# the easiest way to fix that is using tidyverse and mutating columns in a dataframe:
# separate author column into first and last name
shelfpp <- shelfp %>%
separate(author, into = c("lastn", "firstn"), sep = ",")
# put author column back together
shelfpp <- shelfpp %>%
mutate(
author = paste(firstn, lastn)
) %>%
# remove first and last name columns
select(-firstn, -lastn) %>%
# relocate author column
relocate(author, .after = title)
# combine to supertibble
shelf <- bind_rows(shelf, shelfpp)
}
shelf %>% arrange(desc(rating))
# A tibble: 120 × 4
title author rating ratenum
<chr> <chr> <dbl> <dbl>
1 Node.js Design Patterns: Level up your Node.js skills … " Luc… 5 2
2 ASP.NET Core 9 Web API Cookbook: Over 60 hands-on reci… " Luk… 5 1
3 Modern Full-Stack Web Development with ASP.NET Core: A… " Ale… 5 1
4 Real-World Web Development with .NET 9: Build websites… " Mar… 5 2
5 API Testing and Development with Postman: API creation… " Dav… 5 1
6 Mastering Node.js Web Development: Go on a comprehensi… " Ada… 5 4
7 FuelPHP Application Development Blueprints " Séb… 5 1
8 React Key Concepts: An in-depth guide to React's core … " Max… 4.83 6
9 Dashboards for Excel " Jor… 4.8 10
10 There Are No Foreign Lands: An Inquiry Concerning Inte… " Jef… 4.67 3
# ℹ 110 more rows
# goodness arthur reads some fascinating books
Some websites try to prevent scraping (or limit how much can be scraped), so sometimes you will get a 403 error or similar, meaning that the website denied your request. Some websites may also try to ban you from accessing them. Don’t try scraping Google, and don’t run super extensive scraping code too many times in a row - for example, if you’re scraping 5000 pages on a website, limit your testing code to just 5-50 pages or so to limit your burden on the website. There’s always a way around these issues, but so far the only one I’ve experienced is the 403. Don’t quote me on any of this.
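If you do run into rate limits or flaky pages, two small precautions help: pause between requests, and catch failures so one bad page doesn't kill the whole loop. A rough sketch (the 2-second pause is arbitrary, and it reuses the eggplant links from earlier):
for (item in seq_along(links)) {
  Sys.sleep(2) # wait a couple of seconds between requests to go easy on the website
  webpage <- tryCatch(read_html(links[item]), error = function(e) NULL) # a 403 or timeout becomes NULL
  if (is.null(webpage)) next # skip pages that failed instead of erroring out
  # ... the scraping code from the for loop above would go here ...
}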