Web scraping tutorial!
Web scraping uses the CSS selectors of elements on a website to pull information off the page. Use inspect element to select different parts of a page.
I recommend pasting this code into a document of your own and following along. You can also download my Quarto document on GitHub (eventually… link TBA)
Libraries
library(rvest)
library(tidyverse) # can also use base R
Using Thriftbooks
Load page
url <- "https://www.thriftbooks.com/b/thriftbooks-deals/" # set the url that you are using
webpage <- read_html(url) # of course you can also just paste the url in here
Start taking elements
Use inspect element to choose elements to scrape.
# book titles
titleelem <- html_elements(webpage, css = "div.LandingPage-bookCardTitle")
title <- html_text(titleelem)
# book authors
authorelem <- html_elements(webpage, css = "div.LandingPage-bookCardAuthor")
author <- html_text(authorelem)
# book prices
priceelem <- html_elements(webpage, css = "div.BookSlide-Price")
price <- html_text(priceelem)
# that one is really messy so:
price_stripped <- parse_number(price)
Create a dataframe
thriftbooks <- tibble(
  title = title,
  author = author, # for some reason this cuts out at a certain point but I'm not going to bother fixing it for this example
  price = price_stripped
)
head(thriftbooks)
# A tibble: 6 × 3
  title                                  author          price
  <chr>                                  <chr>           <dbl>
1 3rd Degree                             James Patterson  4.79
2 Hatchet                                Gary Paulsen     3.59
3 The 7 Habits Of Highly Effective Teens Sean Covey       4.79
4 Eclipse                                Stephenie Meyer  4.19
5 Dr. Seuss's ABC                        Dr. Seuss        3.59
6 The Napping House                      Audrey Wood      4.19
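To see what parse_number() is doing to the messy price text, here is a small sketch on made-up strings. The strings are invented for illustration (an assumption about what messy price text can look like, not copied from Thriftbooks):

```r
library(readr)

# invented examples of messy price text (assumption: the real strings differ)
price <- c("\n  $4.79\n", "from $3.59 ", "$1,204.50")

# parse_number() keeps the first number it finds and drops the text around it,
# including currency symbols and the "," grouping mark
price_stripped <- parse_number(price)
price_stripped
```

The result is a numeric vector, so it can go straight into a tibble column and be sorted or averaged.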
Using eggplant recipes
Scrape multiple webpages into one dataframe. How you do this:
1) Make a list of the links you want to scrape
2) Write code that will scrape just one of those pages (but that can be applied to any of them)
3) Make an empty dataframe
4) Put the scrape code into a loop that indexes each item on the link list
5) Make a dataframe inside of the for loop and bind/append it to the dataframe from outside the loop
I explain this again in different wording in the example in the next tab 👍.
These are the steps I follow here, although I put #2 straight into #4. You can copy the code from inside my for loop and paste it outside of it; it will work fine for individual pages.
url <- "https://www.turkeysforlife.com/2021/11/turkish-aubergine-eggplant-recipes.html"
webpage <- read_html(url)
# here's a method to take all the links off a page, which I'm not using, but it can be helpful to experiment with:
# recipelinks <- webpage %>% html_nodes("a") %>% html_attr("href")
Take links
This is the part that requires the browser tool inspect element. In your browser (I recommend Firefox), right click anywhere on your website and click “Inspect Element”. It works best to highlight the text you want to scrape and then open inspect element.
This is the way I usually take elements:
Notice the box that appears over the selected element on the page - that is the CSS that you want to copy. You can find it over on the side in the highlighted block in inspect element, just in a slightly different format:
Copy the highlighted CSS, and adjust the formatting to match what you see on the page:
string <- "wprm-recipe-roundup-link wprm-recipe-link wprm-block-text-normal wprm-recipe-roundup-link-inline-button wprm-recipe-link-inline-button wprm-color-accent"
string <- gsub(" ", ".", string) # replace all spaces with periods
string # copy this
[1] "wprm-recipe-roundup-link.wprm-recipe-link.wprm-block-text-normal.wprm-recipe-roundup-link-inline-button.wprm-recipe-link-inline-button.wprm-color-accent"
This piece of css comes from a block that starts with <a class=, so put a. before the above string:
linkel <- html_elements(webpage, css = "a.wprm-recipe-roundup-link.wprm-recipe-link.wprm-block-text-normal.wprm-recipe-roundup-link-inline-button.wprm-recipe-link-inline-button.wprm-color-accent")
links <- linkel %>% html_attr("href")
links
[1] "https://www.turkeysforlife.com/2021/09/kizartma-turkish-fried-vegetables.html"
[2] "https://www.turkeysforlife.com/2020/04/ali-nazik-kebab-recipe.html"
[3] "https://www.turkeysforlife.com/2020/05/imam-bayildi-recipe.html"
[4] "https://www.turkeysforlife.com/2020/04/karniyarik-stuffed-aubergine-recipe.html"
[5] "https://www.turkeysforlife.com/2014/06/turkish-recipes-baba-ganoush-meze-yoghurt.html"
[6] "https://www.turkeysforlife.com/2021/07/saksuka-recipe-turkish.html"
[7] "https://www.turkeysforlife.com/2011/07/aubergine-salad-eggplant-turkish.html"
[8] "https://www.turkeysforlife.com/2010/06/turkish-food-eksili-patlican-sour.html"
[9] "https://www.turkeysforlife.com/2012/09/turkish-recipes-hunkar-begendi-ottoman.html"
[10] "https://www.turkeysforlife.com/2010/10/turkish-food-turkish-musakka-recipe.html"
[11] "https://www.turkeysforlife.com/2021/05/chickpea-aubergine-stew-recipe.html"
[12] "https://www.turkeysforlife.com/2022/04/vegetable-guvec-recipe.html"
If the block starts with div, you’ll type div. first, if it starts with h3, you’ll type h3. first, etc.
Sometimes, html_elements piped to html_attr("href") only takes a single link. In that case it usually works to instead pipe html_elements to html_nodes("a") and then html_attr("href"). You just need to experiment until you get it because there is always a way.
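Here is a small sketch of both approaches on a tiny made-up page, so you can try them without hitting a real website. The HTML, class names and example.com links are all invented for illustration; minimal_html() is an rvest helper that builds a page from a string.

```r
library(rvest)

# a tiny invented page: two links nested inside h3 blocks, one link with its own classes
page <- minimal_html('
  <h3 class="recipe-title big"><a href="https://example.com/one">One</a></h3>
  <h3 class="recipe-title big"><a href="https://example.com/two">Two</a></h3>
  <a class="recipe-link inline-button" href="https://example.com/three">Three</a>
')

# build the selector from the class attribute: spaces become periods, tag name goes in front
class_attr <- "recipe-link inline-button"
selector <- paste0("a.", gsub(" ", ".", class_attr))

# case 1: the classes sit on the <a> itself, so href can be read directly
direct <- page %>%
  html_elements(css = selector) %>%
  html_attr("href")

# case 2: the classes sit on a parent block, so step down to the <a> inside it first
nested <- page %>%
  html_elements(css = "h3.recipe-title") %>%
  html_elements("a") %>%
  html_attr("href")
```

If the direct version comes back empty or full of NA, that is usually the sign that the class belongs to a parent block and the nested version is the one to try.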
!! Higher level tip !!
Here’s another method to extract elements - by copying the CSS Selector. Sometimes using the manual method I described above will give you a list instead of the single element that you want, and when that happens, the CSS Selector method is usually better. However, if you’re scraping multiple pages, then the CSS selector method can be a problem because sometimes different pages use that same bit of CSS to refer to different things. In that case use the manual/class method and find a new way around any issues.
Create dataframe
First: make a helpful function that fills in NA for elements that just aren’t there (this can prevent errors, so if you’re trying to access the CSS for “prep time”, but there is no “prep time”, it will fill in NA instead of character(0)):
NAcheck <- function(x) {
  if (length(x) == 0) { # if x has no value (length == 0)
    return(NA) # fill in NA
  } else { # otherwise
    return(x) # leave as is
  }
}
Use a for loop to scrape each recipe into a dataframe, using the links above
If you have never seen a ‘for loop’ before, it’s basically just a block of code that repeats to a certain point. You write for (repeat this amount of times){} and then everything within the curly braces is repeated. The code you write inside the loop is just normal R code. I wrote for (item in 1:12). This means that I made a variable called “item” that will be assigned values 1 to 12, and everything inside the loop is repeated 12 times.
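If you want to see the loop pattern run without downloading anything, here is a minimal sketch that uses two made-up pages held in memory instead of real links. The HTML, the .prep class and the na_if_empty() helper are all invented for this sketch (the helper does the same job as NAcheck(), just returning a character NA).

```r
library(rvest)
library(tibble)
library(dplyr)

# hypothetical stand-ins for downloaded pages; the second one has no prep time
pages <- list(
  minimal_html('<h1 class="entry-title">Recipe A</h1><span class="prep">10 mins</span>'),
  minimal_html('<h1 class="entry-title">Recipe B</h1>')
)

# hypothetical helper: swap an empty result for NA so tibble() does not fail
na_if_empty <- function(x) if (length(x) == 0) NA_character_ else x

recipes <- tibble() # empty tibble made outside the loop
for (i in seq_along(pages)) {
  page <- pages[[i]]
  # one-row tibble for this page only
  one <- tibble(
    title = na_if_empty(html_text(html_elements(page, css = ".entry-title"))),
    prep  = na_if_empty(html_text(html_elements(page, css = ".prep")))
  )
  # attach this page's row to the outer tibble
  recipes <- bind_rows(recipes, one)
}
recipes
```

The outer tibble ends up with one row per page, and the page with no prep time gets NA in that column instead of stopping the loop.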
# make an empty tibble outside of the for loop to fill inside it
eggplantrecipes <- tibble()
for (item in 1:12) {
  url <- links[item] # index the url to be scraped (each "item" - 1 through 12)
  webpage <- read_html(url)
  # next: code you would use to scrape just one recipe. because it's inside a for loop, this code will scrape all 12
  # html elements
  titleelem <- html_elements(webpage, css = ".entry-title")
  updateelem <- html_elements(webpage, css = ".entry-date")
  courseelem <- html_elements(webpage, css = ".wprm-recipe-course")
  rateelem <- html_elements(webpage, css = ".wprm-recipe-rating-average")
  prepelem <- html_elements(webpage, css = ".wprm-recipe-prep_time")
  cookelem <- html_elements(webpage, css = ".wprm-recipe-cook_time")
  servingselem <- html_elements(webpage, css = ".wprm-recipe-servings")
  # html text
  # using the NAcheck() function here because these are what are going to be put directly into the dataframe. this way, you get NA's instead of blank characters that will give errors
  title <- NAcheck(html_text(titleelem))
  update <- NAcheck(html_text(updateelem))
  rate <- NAcheck(html_text(rateelem))
  course <- NAcheck(html_text(courseelem))
  prep <- NAcheck(html_text(prepelem))
  cook <- NAcheck(html_text(cookelem))
  servings <- NAcheck(html_text(servingselem))
  # make a dataframe
  # this only makes a dataframe for ONE recipe because it is being made inside of the loop
  recipe <- tibble(
    title = title,
    lastupdated = update,
    course = course,
    preptime = prep,
    cooktime = cook,
    servings = servings,
    link = url
  )
  # join this to dataframe/tibble made outside of loop (because it's outside the loop, each loop through this will add another row to it)
  eggplantrecipes <- bind_rows(eggplantrecipes, recipe)
}
# now download it
write_csv(eggplantrecipes, "eggplantrecipes.csv")
Scraping disc and cylinder magnets product listings. The results go up to page 23, but for this example I only want 3 pages.
I’m doing this in two parts with two loops: 1, make a list of links, and 2, send the items in the list through the scraping code.
Part 1: Create list of links
To scrape multiple pages from one site, you need to set up a loop that pastes the next page into the url - eg paste0("website.com/page/", pagenumber). Different websites have different patterns to show page number.
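As a quick sketch of the idea, you can build all the page URLs up front and look at them before scraping anything. This assumes the "?pg=" pattern used in this example, where page 1 is just the base URL; other sites will use a different pattern.

```r
# base URL for the listing; later pages add "?pg=" plus the page number
baseurl <- "https://www.kjmagnetics.com/products/disc-and-cylinder-magnets"
pages <- 1:3

# page 1 is the base URL as-is, every other page gets its number pasted on
urls <- ifelse(pages == 1, baseurl, paste0(baseurl, "?pg=", pages))
urls
```

Printing the vector first is a cheap way to check the pattern by pasting one of the URLs into your browser.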
# set base URL (string). will paste next page index onto it.
baseurl <- "https://www.kjmagnetics.com/products/disc-and-cylinder-magnets"
mylinks <- tibble()
for (page in 1:3) {
  # for peace of mind I recommend making the loop tell you where it is
  print(page)
  if (page == 1) {
    # the first page is the "baseurl", needs no changes
    url <- baseurl
  } else {
    # every later page gets its page number pasted on
    url <- paste0(baseurl, "?pg=", page) # paste0 makes a string with no spaces
  }
  webpage <- read_html(url)
  # now scrape all of the links
  link <- html_elements(webpage, css = "h2.product-card__title") %>%
    html_nodes("a") %>%
    html_attr("href")
  # make a dataframe to hold the links
  loopedlinks <- tibble(link = link)
  # bind the rows to the dataframe made outside the loop
  mylinks <- bind_rows(mylinks, loopedlinks)
}
[1] 1
[1] 2
[1] 3
Clean up links
# sometimes links are repeated on multiple pages, I always make sure to remove any duplicates
mylinks <- mylinks %>%
  distinct(.keep_all = TRUE) # this just removes extra copies of rows
mylinks$link[1]
[1] "/d0505-neodymium-cylinder-magnet?pl=1.1&pf="
The link is missing the actual website, so that needs to be added:
mylinks$link <- paste0("https://www.kjmagnetics.com", mylinks$link)
mylinks$link[1]
[1] "https://www.kjmagnetics.com/d0505-neodymium-cylinder-magnet?pl=1.1&pf="
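On some sites the scraped links are a mix of relative ("/something") and already-complete ones, and pasting the site name onto everything would break the complete ones. Here is a small sketch of a more careful version. The second link in the vector is an assumed example of an already-complete link, added just to show the difference.

```r
site <- "https://www.kjmagnetics.com"

# one relative link and one that is already complete (the second is an assumed example)
links <- c(
  "/d0505-neodymium-cylinder-magnet?pl=1.1&pf=",
  "https://www.kjmagnetics.com/d052-neodymium-disc-magnet?pl=1.2&pf="
)

# only add the site name to links that start with "/"
full_links <- ifelse(startsWith(links, "/"), paste0(site, links), links)
full_links
```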
Part 2: Scrape all pages inside a loop
When scraping multiple pages, I do these steps within a for loop:
1. Pull URL and read webpage
2. Extract elements
3. Create a dataframe with these elements (and do any necessary data cleaning)
4. Bind dataframe to a dataframe created outside of the loop
You need to make two dataframes: an empty dataframe outside (before) the loop, and a dataframe inside the loop containing the elements you’re scraping. The loop recreates the inner dataframe over and over, once per page, until the loop finishes, so you bind this inner dataframe (using bind_rows) to the outer dataframe while still inside the loop. This makes every row created attach itself to the outer dataframe.
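The inner dataframe in the loop below starts out with one row per product detail and gets reshaped into a single row. Here is a small sketch of just that reshaping step, so you can see what it does before it is buried in a loop. The product name and detail strings are invented for illustration; the real ones come from the product page.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(stringr)

# invented detail strings in a "name:\nvalue" shape (assumption, not real page text)
cleandetails <- c("Dimensions:\n1/2 in x 1/8 in", "Weight:\n0.1 oz")

# the single title is recycled, so this tibble has one row per detail
looped <- tibble(
  title = "Example magnet",
  details = cleandetails
)

wide <- looped %>%
  # split each string at ":" into a name column and a value column
  separate(details, into = c("titles", "details"), sep = ":") %>%
  # drop everything up to and including the line break in front of the value
  mutate(details = str_remove(details, ".*\n")) %>%
  # turn each name into its own column, leaving one row per product
  pivot_wider(names_from = titles, values_from = details)
wide
```

After the pivot there is one row with a Dimensions column and a Weight column, which is the shape that gets bound onto the outer dataframe.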
library(cowsay) # only needed for say() inside the loop
listedlinks <- mylinks$link # turn link DF into a vector
magnets <- tibble()
for (magnet in seq_along(listedlinks)) { # seq_along(listedlinks) = 1 up to the total number of items
  # Step 1: Pull and read URL
  url <- listedlinks[magnet] # index link
  webpage <- read_html(url)
  # Step 2: Extract elements
  product <- html_elements(webpage, css = "h1.product-title") %>% html_text2()
  innercategory <- html_elements(webpage, css = "li.ectbreadcrumb:nth-child(3) > a:nth-child(1) > span:nth-child(1)") %>% html_text() # the category text above the product image. I used "copy CSS selector" in inspect element
  priceforone <- html_elements(webpage, css = "span.price") %>% html_text()
  details <- html_elements(webpage, css = "li.product-details__spec") %>% html_text2()
  # Step 3: Clean data and make new dataframe
  cleandetails <- gsub("\r", "", details) # remove carriage returns
  cleandetails <- cleandetails[!grepl("Grade:", cleandetails)] # drop the "Grade:" entry
  cleandetails <- cleandetails[grepl(":", cleandetails)] # keep only entries that contain ":"
  # ⚠️ this part is where errors usually occur. I add a line that prints the URL and loop number in case it gives an error and fails in the next line.
  say(url)
  message(magnet)
  # using the custom NAcheck() function when putting the elements into a dataframe instead of when scraping them
  loopedmagnets <- tibble(
    title = NAcheck(product),
    category = NAcheck(innercategory),
    price = NAcheck(priceforone),
    link = NAcheck(url),
    details = NAcheck(cleandetails)
  )
  # clean up the details column
  sep_magnets <- loopedmagnets %>%
    # first, separate the details columns into one column called "titles" and one column called "details". what were the titles on the webpage all end in ":", so I'm separating the rows at that.
    separate(details, into = c("titles", "details"), sep = ":") %>%
    # next, clean up the details column (by making a new column with the same name to replace it with), removing everything before the "\n" symbol in the strings
    mutate(details = str_remove(details, ".*\n")) %>% # ".*\n" is regex
    # finally, "pivot" the dataframe. make the rows in the "titles" column become their own columns with their rows being the values in the "details" column
    pivot_wider(names_from = titles, values_from = details)
  # Step 4: Bind inner dataframe to outer dataframe
  magnets <- bind_rows(magnets, sep_magnets)
}
 ________________________________________________________________________
< https://www.kjmagnetics.com/d0505-neodymium-cylinder-magnet?pl=1.1&pf= >
 ------------------------------------------------------------------------
      \
       \
        ^__^
        (oo)\ ________
        (__)\         )\ /\
             ||------w|
             ||      ||
... (one cow per link, 60 in total, ending with https://www.kjmagnetics.com/d38-neodymium-cylinder-magnet?pl=3.20&pf=)
# download your dataset (and then comment it out)
# write_csv(magnets, "kjmagnetics_discscylinders.csv")
SOME NOTES
- If you’re having trouble turning something into text (for example, if you’re trying to take a list, and the result is just listitemlistitemlistitem with no value to separate by), try using html_text2 instead of html_text!
- If you want to take a list, try taking it as a vector by using the CSS class that each list item uses (often something like li.list-item-text). You can also do .the-css-class-that-all-list-items-collapse-into li (or ol/ul depending on the list type)
- Some websites will try to block you from accessing them. If this only happens when using loops/scraping multiple pages on the site, it’s likely because you’re accessing too many pages too quickly. Include Sys.sleep before read_html to add a pause, or wrap read_html in purrr::insistently so it retries after a failure. This usually fixes it (Sys.sleep() uses seconds, so if you’re scraping hundreds of pages at once, it’s best to make the number small, not more than 2, to make it less painfully slow for you)
- If it seems impossible to select an element no matter what you do, check the page source (in your browser, right click + “view page source”). Search for the element you want with ctrl+F/find in page. If it’s not there, then it’s loaded dynamically, and you won’t be able to access it with read_html. RSelenium/Chromote are options I haven’t explored very far (read_html_live is very slow on my computer if I’m trying to scrape multiple pages)
- If you want to scrape just one page and it’s saying it can’t open the connection, try read_html_live
- Supposedly Python is better(?) for webscraping… I wouldn’t know. I think having to worry about indentation while web scraping is some sort of cruel and unusual punishment
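For the note about getting blocked, here is a rough sketch of the retry idea that runs without touching a real website. flaky_fetch() is a made-up stand-in for read_html that fails twice before working, and the pause and retry numbers are arbitrary choices for the sketch.

```r
library(purrr)

attempts <- 0

# hypothetical stand-in for read_html(): errors on the first two calls, then works
flaky_fetch <- function(url) {
  attempts <<- attempts + 1
  if (attempts < 3) stop("temporarily blocked")
  paste("contents of", url)
}

# insistently() wraps a function so it is retried after an error,
# waiting 0.1 seconds between tries and giving up after 5 tries
patient_fetch <- insistently(
  flaky_fetch,
  rate = rate_delay(pause = 0.1, max_times = 5),
  quiet = TRUE
)

result <- patient_fetch("https://example.com/page")
result

# in a real scraping loop, the simpler option is a pause before each request:
# Sys.sleep(1)
# webpage <- read_html(url)
```

In a real scraper you would wrap read_html the same way, probably with a longer pause than 0.1 seconds.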