The goal of this note is to show a simple application of web scraping using the rvest package. Some of the difficulties of this process will be highlighted as well.
First of all, the SelectorGadget extension should be installed in Google Chrome.
Load the libraries.
library(rvest)
## Loading required package: xml2
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2. http://CRAN.R-project.org/package=stargazer
library(pander)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Here is an image of the page to be scraped.
knitr::include_graphics('./ML_books printscr.jpg')
Get the URL and read the page.
url <- "https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Dstripbooks&field-keywords=machine+learning&rh=n%3A283155%2Ck%3Amachine+learning"
webpage <- read_html(url)
Get the data! Start by turning on SelectorGadget on the webpage, then click on the information to be extracted to find the CSS selectors; the matching elements will be highlighted. We pass the CSS selectors found with SelectorGadget to the html_nodes function.
price_data <- html_nodes(webpage, ".sx-price-whole")
price_data <- html_text(price_data)
length(price_data)
## [1] 24
price_frac_data <- html_nodes(webpage, ".sx-price-fractional")
price_frac_data <- html_text(price_frac_data)
length(price_frac_data)
## [1] 24
title_data <- html_nodes(webpage, ".s-access-title")
title_data <- html_text(title_data)
length(title_data)
## [1] 13
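To make the html_nodes/html_text pair concrete without depending on the live page, here is a minimal sketch that runs on a small inline HTML fragment. The fragment and its class names (.book-title, .price-whole) are made up for illustration and are not Amazon's markup.

```r
library(rvest)

# Hypothetical HTML fragment standing in for a search results page
html <- '<html><body>
  <h2 class="book-title">Book A</h2><span class="price-whole">31</span>
  <h2 class="book-title">Book B</h2><span class="price-whole">9</span>
</body></html>'
page <- read_html(html)

# html_nodes() returns every node matching the CSS selector;
# html_text() extracts the text content of each node
titles <- html_text(html_nodes(page, ".book-title"))
prices <- html_text(html_nodes(page, ".price-whole"))

length(titles)
length(prices)
```

As on the real page, each selector gives an independent flat vector, so nothing ties a given price to a given title except its position.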
From now on it’s classic R! All the values from the web are already stored on our local machine.
Note that the lengths differ. We have 13 titles and 24 prices, when we should have 26 prices (one for the physical copy and one for the Kindle edition of each title). Visually inspecting the page, we see that title 9 has a rental price; it will be removed now.
price_data <- price_data[-16] #remove rental price
price_frac_data <- price_frac_data[-16] #remove rental price
Again by visual inspection, we identify 3 books that have a price for only the physical or the Kindle edition, but not for both. We’ll fill the missing values with NAs.
# Title 4 has no physical price: insert a placeholder at position 7
a <- price_data[1:6]
b <- price_data[7:length(price_data)]
price_data <- append(a, "NA")
price_data <- append(price_data, b)
a <- price_frac_data[1:6]
b <- price_frac_data[7:length(price_frac_data)]
price_frac_data <- append(a, "NA")
price_frac_data <- append(price_frac_data, b)
# Title 12 has no Kindle price: insert a placeholder at position 24
a <- price_data[1:23]
b <- price_data[24:length(price_data)]
price_data <- append(a, "NA")
price_data <- append(price_data, b)
a <- price_frac_data[1:23]
b <- price_frac_data[24:length(price_frac_data)]
price_frac_data <- append(a, "NA")
price_frac_data <- append(price_frac_data, b)
# Title 13 (the last one) has no Kindle price: pad the end (position 26)
price_data <- append(price_data, "NA")
price_frac_data <- append(price_frac_data, "NA")
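The split-and-rejoin steps above can also be written with the after argument of base R's append(). A minimal sketch on a toy vector (the values are invented), offered as an alternative rather than as the method used in this note:

```r
# Toy vector of whole-dollar prices as scraped text (invented values)
prices <- c("31", "22", "9", "2", "72", "79", "1", "40")

# Insert a missing value after the 6th element, so it lands at position 7.
# NA_character_ keeps the vector of type character.
prices <- append(prices, NA_character_, after = 6)

prices
length(prices)
```

When several placeholders are needed, inserting from the highest position to the lowest keeps the earlier positions from shifting.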
Now the vectors have the right size. The next step is to split the prices between physical and kindle editions.
physical_price <- price_data[c(TRUE, FALSE)]
physical_frac_price <- price_frac_data[c(TRUE, FALSE)]
kindle_price <- price_data[c(FALSE, TRUE)]
kindle_frac_price <- price_frac_data[c(FALSE, TRUE)]
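The indexing above relies on R recycling the logical vector c(TRUE, FALSE) along the whole vector, which keeps every odd position; c(FALSE, TRUE) keeps every even one. A small sketch with invented values, assuming the entries strictly alternate between physical and Kindle:

```r
# Alternating physical / Kindle values for three books (invented)
x <- c("p1", "k1", "p2", "k2", "p3", "k3")

physical <- x[c(TRUE, FALSE)]  # positions 1, 3, 5
kindle   <- x[c(FALSE, TRUE)]  # positions 2, 4, 6

# Equivalent view: fill a two-column matrix row by row
m <- matrix(x, ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("physical", "kindle")))
```

This only works once every title has exactly two entries, which is why the placeholders had to be inserted first.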
Next, the whole and fractional parts of each price will be joined and converted to numeric class.
physical_price <- as.numeric(paste(physical_price, physical_frac_price, sep = "."))
## Warning: NAs introduced by coercion
kindle_price <- as.numeric(paste(kindle_price, kindle_frac_price, sep = "."))
## Warning: NAs introduced by coercion
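The warnings come from the placeholder entries: pasting "NA" and "NA" yields the string "NA.NA", which as.numeric() cannot parse, so it returns NA and warns. One possible way to avoid the warning, sketched on invented values and assuming the fractional part always has two digits, is to convert each part separately:

```r
# Whole and fractional parts as text, with one missing price (invented values)
whole <- c("31", NA, "9")
frac  <- c("96", NA, "05")

# as.numeric() maps a missing string to NA without a warning;
# dividing by 100 assumes a two-digit fractional part
price <- as.numeric(whole) + as.numeric(frac) / 100

price
```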
At this point we are ready to bundle all the information together into a data frame and see the result.
ML_books <- data.frame(title = title_data, physical_price = physical_price, kindle_price = kindle_price )
pander(ML_books)
| title | physical_price | kindle_price |
|---|---|---|
| Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent… | 31.96 | 22.56 |
| Machine Learning for Absolute Beginners: A Plain English Introduction | 9.89 | 2.81 |
| Deep Learning (Adaptive Computation and Machine Learning series) | 72 | 79.99 |
| Machine Learning With Random Forests And Decision Trees: A Visual Guide For Beginners | NA | 1.8 |
| Python Machine Learning | 40.49 | 30.88 |
| Machine Learning: The New AI (The MIT Press Essential Knowledge series) | 11.24 | 14.49 |
| Machine Learning with R - Second Edition | 35.99 | 37.75 |
| Introduction to Machine Learning with Python: A Guide for Data Scientists | 40.26 | 29.22 |
| Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms | 27.72 | 18.78 |
| The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series… | 69.41 | 67.34 |
| Machine Learning: A Probabilistic Perspective (Adaptive Computation and Machine Learning series) | 86.49 | 105 |
| TensorFlow Machine Learning Cookbook | 49.49 | NA |
| Python Machine Learning | 40.49 | NA |
Why not take a look at some summary statistics on prices?
stargazer(ML_books, type = "text")
##
## ===============================================
## Statistic N Mean St. Dev. Min Max
## -----------------------------------------------
## physical_price 12 42.953 23.358 9.890 86.490
## kindle_price 11 37.328 33.081 1.800 104.990
## -----------------------------------------------
ML_books[,2:3] %>% filter(complete.cases(.)) %>% cor()
## physical_price kindle_price
## physical_price 1.0000000 0.9731935
## kindle_price 0.9731935 1.0000000
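The pipeline above drops incomplete rows before computing the correlation. Base R's cor() can do the same through its use argument; a small sketch on an invented data frame:

```r
# Invented prices with missing values in both columns
df <- data.frame(physical = c(10, 20, 30, 40, NA),
                 kindle   = c(8, 15, 27, NA, 50))

# Drop incomplete rows explicitly, as in the text
manual <- cor(df[complete.cases(df), ])

# Let cor() discard incomplete rows itself
builtin <- cor(df, use = "complete.obs")

all.equal(manual, builtin)
```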
As we can see, web scraping is simple with rvest and SelectorGadget. But we have to pay attention to the page we are working on: missing values won’t be filled in automatically, and there may be extra information that we need to remove.
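One way to reduce the manual alignment is to select each result's container first and then look up one title and one price inside it with html_node(), which returns a missing node (and hence NA text) where nothing matches. A sketch on made-up HTML; the class names are hypothetical, and whether the real page offers a suitable container has to be checked with SelectorGadget:

```r
library(rvest)

# Hypothetical results page: the second item has no price
html <- '<html><body>
  <div class="item"><h2 class="title">Book A</h2><span class="price">31</span></div>
  <div class="item"><h2 class="title">Book B</h2></div>
</body></html>'
page <- read_html(html)

items <- html_nodes(page, ".item")  # one node per result

# html_node() (singular) yields exactly one result per item
titles <- html_text(html_node(items, ".title"))
prices <- html_text(html_node(items, ".price"))

data.frame(title = titles, price = as.numeric(prices))
```

Because both vectors have one entry per container, they stay aligned even when a price is absent.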