The goal of this note is to show a simple application of web scraping using the rvest package. Some of the difficulties of this process will be highlighted as well.

First of all, the SelectorGadget extension should be installed in Google Chrome.

Load the libraries.

library(rvest)
## Loading required package: xml2
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2. http://CRAN.R-project.org/package=stargazer
library(pander)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Here is an image of the page to be scraped.

knitr::include_graphics('./ML_books printscr.jpg')

Get the URL

url <- "https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Dstripbooks&field-keywords=machine+learning&rh=n%3A283155%2Ck%3Amachine+learning"

webpage <- read_html(url)

Get the data! Start by turning on SelectorGadget on the webpage, then click on the information to be extracted to figure out the CSS selectors; the matching elements will be highlighted. We pass the CSS selectors found with SelectorGadget to the html_nodes function.

price_data <- html_nodes(webpage, ".sx-price-whole")
price_data <- html_text(price_data)
length(price_data)
## [1] 24
price_frac_data <- html_nodes(webpage, ".sx-price-fractional")
price_frac_data <- html_text(price_frac_data)
length(price_frac_data)
## [1] 24
title_data <- html_nodes(webpage, ".s-access-title")
title_data <- html_text(title_data)
length(title_data)
## [1] 13
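
To see how html_nodes and html_text behave on a page where not every item carries every field, here is a minimal sketch on a made-up HTML fragment. The class names (.book-title, .price-whole) and the book names are assumptions for illustration only; they are not the selectors of the Amazon page.

```r
library(rvest)

# Hypothetical page fragment: the second book has no price element.
page <- read_html('
  <div class="result"><h2 class="book-title">Book A</h2><span class="price-whole">31</span></div>
  <div class="result"><h2 class="book-title">Book B</h2></div>
  <div class="result"><h2 class="book-title">Book C</h2><span class="price-whole">72</span></div>
')

# html_nodes() returns every match on the page; html_text() extracts the text.
titles <- html_text(html_nodes(page, ".book-title"))
prices <- html_text(html_nodes(page, ".price-whole"))

length(titles)  # 3
length(prices)  # 2
```

The two vectors come back with different lengths, and nothing in them says which title lost its price.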

From now on it’s classic R! All the values from the web are already stored on our local machine.

Note that the lengths differ. We have 13 titles and 24 prices, when we should have 26 prices (one for the physical copy and one for the Kindle edition of each title). Visually inspecting the page, we see that title 9 also has a rental price, which sits at position 16 of the price vectors; we remove it now.

price_data <- price_data[-16] #remove rental price
price_frac_data <- price_frac_data[-16] #remove rental price

Again by visual inspection, we identify 3 books that have a price for only one of the two editions (physical or Kindle). We’ll fill the missing values with NAs.

# The placeholder is the string "NA"; it turns into a real NA when the
# prices are converted to numeric later on.

# Title 4 has no physical price: insert a placeholder at position 7
price_data <- append(price_data, "NA", after = 6)
price_frac_data <- append(price_frac_data, "NA", after = 6)

# Title 12 has no kindle price: insert a placeholder at position 24
price_data <- append(price_data, "NA", after = 23)
price_frac_data <- append(price_frac_data, "NA", after = 23)

# Title 13 (the last one) has no kindle price: add a placeholder at the end
price_data <- append(price_data, "NA")
price_frac_data <- append(price_frac_data, "NA")

Now the vectors have the right size. The next step is to split the prices between physical and kindle editions.

physical_price <- price_data[c(TRUE, FALSE)]
physical_frac_price <- price_frac_data[c(TRUE, FALSE)]


kindle_price <- price_data[c(FALSE, TRUE)]
kindle_frac_price <- price_frac_data[c(FALSE, TRUE)]
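
The split above relies on R recycling a short logical index along the whole vector. A minimal sketch on a made-up vector (the values are illustrative only):

```r
# Prices stored as physical, kindle, physical, kindle, ...
prices <- c("31", "22", "9", "2", "72", "79")

# c(TRUE, FALSE) is recycled, so it keeps positions 1, 3, 5, ...
physical <- prices[c(TRUE, FALSE)]

# c(FALSE, TRUE) is recycled, so it keeps positions 2, 4, 6, ...
kindle <- prices[c(FALSE, TRUE)]

physical  # "31" "9" "72"
kindle    # "22" "2" "79"
```

This only works if every title occupies exactly two slots, which is why the placeholders had to be inserted first.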

Next, the whole and fractional parts of each price are joined and converted to numeric.

physical_price <- as.numeric(paste(physical_price, physical_frac_price, sep = "."))
## Warning: NAs introduced by coercion
kindle_price <- as.numeric(paste(kindle_price, kindle_frac_price, sep = "."))
## Warning: NAs introduced by coercion
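
The warnings appear because the placeholder pairs are pasted into the string "NA.NA", which as.numeric cannot parse. A possible variant, sketched here on made-up vectors (whole, frac and combined are hypothetical names), stores real missing values and skips the paste for them, so the conversion runs without warnings:

```r
# Hypothetical whole and fractional parts; the second price is missing.
whole <- c("31", NA, "72")
frac  <- c("96", NA, "00")

# Paste only where both parts exist; keep a real NA otherwise.
combined <- ifelse(is.na(whole) | is.na(frac),
                   NA_character_,
                   paste(whole, frac, sep = "."))

# as.numeric() maps NA_character_ to NA silently.
price <- as.numeric(combined)
price  # 31.96, NA, 72
```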

At this point we are ready to bundle all the information together into a data frame and see the result.

ML_books <- data.frame(title = title_data, physical_price = physical_price, kindle_price = kindle_price )

pander(ML_books)
| title | physical_price | kindle_price |
|---|---|---|
| Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent… | 31.96 | 22.56 |
| Machine Learning for Absolute Beginners: A Plain English Introduction | 9.89 | 2.81 |
| Deep Learning (Adaptive Computation and Machine Learning series) | 72 | 79.99 |
| Machine Learning With Random Forests And Decision Trees: A Visual Guide For Beginners | NA | 1.8 |
| Python Machine Learning | 40.49 | 30.88 |
| Machine Learning: The New AI (The MIT Press Essential Knowledge series) | 11.24 | 14.49 |
| Machine Learning with R - Second Edition | 35.99 | 37.75 |
| Introduction to Machine Learning with Python: A Guide for Data Scientists | 40.26 | 29.22 |
| Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms | 27.72 | 18.78 |
| The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series… | 69.41 | 67.34 |
| Machine Learning: A Probabilistic Perspective (Adaptive Computation and Machine Learning series) | 86.49 | 105 |
| TensorFlow Machine Learning Cookbook | 49.49 | NA |
| Python Machine Learning | 40.49 | NA |

Why not take a look at some summary statistics on prices?

stargazer(ML_books, type = "text")
## 
## ===============================================
## Statistic      N   Mean  St. Dev.  Min    Max  
## -----------------------------------------------
## physical_price 12 42.953  23.358  9.890 86.490 
## kindle_price   11 37.328  33.081  1.800 104.990
## -----------------------------------------------
ML_books[,2:3] %>% filter(complete.cases(.)) %>% cor()
##                physical_price kindle_price
## physical_price      1.0000000    0.9731935
## kindle_price        0.9731935    1.0000000
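
The same correlation can also be obtained without the dplyr pipeline, since cor has a use argument that drops incomplete rows. A minimal sketch on made-up numbers (the data frame prices is hypothetical, not the scraped data):

```r
# Hypothetical prices with one missing value in each column.
prices <- data.frame(
  physical = c(10, 20, NA, 40, 50),
  kindle   = c(8, 15, 30, NA, 45)
)

# Casewise deletion handled by cor() itself.
cor_builtin <- cor(prices, use = "complete.obs")

# Same result by filtering complete rows first, as done above.
cor_filtered <- cor(prices[complete.cases(prices), ])

all.equal(cor_builtin, cor_filtered)
```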

As we can see, web scraping is simple with rvest and SelectorGadget. But we have to pay attention to the page we are working on: missing values won’t be filled in automatically, and there may be extra information that we need to remove.
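
One way to reduce the manual alignment, when the page wraps each item in its own container, is to select the containers first and then look up one element per container with html_node (singular). The sketch below runs on a made-up HTML fragment; the class names (.result, .book-title, .physical-price, .kindle-price) are assumptions for illustration, and whether the Amazon page offers comparable selectors would have to be checked with SelectorGadget.

```r
library(rvest)

# Hypothetical markup: each result sits in its own container, and the
# second book has no kindle price.
page <- read_html('
  <div class="result">
    <h2 class="book-title">Book A</h2>
    <span class="physical-price">31.96</span>
    <span class="kindle-price">22.56</span>
  </div>
  <div class="result">
    <h2 class="book-title">Book B</h2>
    <span class="physical-price">49.49</span>
  </div>
')

# One node per search result.
results <- html_nodes(page, ".result")

# html_node() (singular) returns exactly one match per container, so a
# missing element shows up as NA instead of shortening the vector.
books <- data.frame(
  title = html_text(html_node(results, ".book-title")),
  physical_price = as.numeric(html_text(html_node(results, ".physical-price"))),
  kindle_price = as.numeric(html_text(html_node(results, ".kindle-price")))
)

books
```

With this layout every column has one entry per book, so no positions need to be patched by hand.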