Assignment Overview

This assignment focuses on creating three files, in HTML, XML, and JSON formats, that describe three selected books of interest and can be parsed into R data frames. Each file stores the same attributes for every book, such as title and author. The HTML file uses an HTML table, and the files are named “books.html”, “books.xml”, and “books.json”.

The goal of this assignment is to use R code and any R packages of choice to load the information from each of the three sources into separate R data frames.

Setup

This assignment requires the following R packages:

  • XML
  • RCurl
  • plyr
  • jsonlite
  • knitr
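As a minimal sketch, the packages listed above could be checked and attached in one step. The variable names `pkgs` and `not_installed` are illustrative and not part of the original code.

```r
# Packages listed in the Setup section
pkgs <- c("XML", "RCurl", "plyr", "jsonlite", "knitr")

# Find any packages that are not installed, without raising an error
not_installed <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(not_installed) > 0) {
  message("Install first: ", paste(not_installed, collapse = ", "))
}

# Attach the packages that are available
for (p in setdiff(pkgs, not_installed)) {
  library(p, character.only = TRUE)
}
```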

The code for this assignment is available on GitHub.

The three files, books.html, books.xml, and books.json, can be found on GitHub using the link below:

https://github.com/kfolsom98/DATA607/tree/master/Week8/Data

Parsing HTML

Load the books.html file from GitHub:

# HTML file location on GitHub 
baseURL <- "https://raw.githubusercontent.com/kfolsom98/DATA607/master/Week8/Data/books.html"
txt <- getURL(url=baseURL)

Below is the structure of the books.html information in HTML format.

Books HTML table

Parse the HTML table using htmlParse and XPath:

xmltext <- htmlParse(txt, asText=TRUE)
xmltable <- xpathApply(xmltext, "//table//tbody//tr")

# solution adapted from Stack Overflow and modified for the books.html file
# http://stackoverflow.com/questions/6427061/parsing-html-tables-using-the-xml-rcurl-r-packages-without-using-the-readhtml

html_books <- as.data.frame(t(sapply(xmltable, function(x) {
  # keep children 1, 3, ..., 13 of each row, which hold the seven cell values
  unname(xmlSApply(x, xmlValue))[seq(1, 13, by = 2)]
})))

The resulting data frame is shown below. It has only the default column names V1 to V7, and every variable is stored as a factor.

V1 | V2 | V3 | V4 | V5 | V6 | V7
Automated Data Collection with R A practical Guide to Web Scraping and Text Mining | Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis | John Wiley and Sons, Ltd | 978-1118834817 | 480 | Data Mining | English
OpenIntro Statistics Second Edition | David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel | CreateSpace Independent Publishing Platform | 978-1478217206 | 426 | Probability / Statistics | English
Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan | John Kruschke | Academic Press | 978-0124058880 | 776 | Statistics | English

str(html_books)
## 'data.frame':    3 obs. of  7 variables:
##  $ V1: Factor w/ 3 levels "Automated Data Collection with R A practical Guide to Web Scraping and Text Mining",..: 1 3 2
##  $ V2: Factor w/ 3 levels "David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel",..: 3 1 2
##  $ V3: Factor w/ 3 levels "Academic Press",..: 3 2 1
##  $ V4: Factor w/ 3 levels "978-0124058880",..: 2 3 1
##  $ V5: Factor w/ 3 levels "426","480","776": 2 1 3
##  $ V6: Factor w/ 3 levels "Data Mining",..: 1 2 3
##  $ V7: Factor w/ 1 level "English": 1 1 1
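The row extraction above depends on hard-coded child positions. As a hedged alternative sketch, each row's td cells can be selected directly with a relative XPath expression, which avoids counting positions. The inline HTML string and the names `html_txt`, `doc`, `rows`, `cells`, and `demo` are hypothetical stand-ins for the real books.html content.

```r
library(XML)

# Hypothetical two-column table standing in for books.html
html_txt <- "<html><body><table>
<thead><tr><th>Title</th><th>Pages</th></tr></thead>
<tbody>
<tr><td>Book A</td><td>100</td></tr>
<tr><td>Book B</td><td>200</td></tr>
</tbody></table></body></html>"

doc  <- htmlParse(html_txt, asText = TRUE)
rows <- getNodeSet(doc, "//table//tbody//tr")

# For each row, collect the text of its td children only
cells <- lapply(rows, function(r) xpathSApply(r, "./td", xmlValue))

# Bind the rows into a data frame and keep strings as characters
demo <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
names(demo) <- c("Title", "Pages")
demo$Pages <- as.numeric(demo$Pages)
```

Passing stringsAsFactors = FALSE here means no separate factor-to-character step is needed afterwards.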

Apply column names and convert factor variables to characters:

colnames(html_books) <- c("Title",  "Authors", "Publisher", "ISBN", "Pages", "Topic",   "Language")

# convert the factors to characters
# http://stackoverflow.com/questions/27528907/how-to-convert-data-frame-column-from-factor-to-numeric

indx <- sapply(html_books, is.factor)
html_books[indx] <- lapply(html_books[indx], function(x) as.character(x))

html_books$Pages <- as.numeric(html_books$Pages)

Title | Authors | Publisher | ISBN | Pages | Topic | Language
Automated Data Collection with R A practical Guide to Web Scraping and Text Mining | Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis | John Wiley and Sons, Ltd | 978-1118834817 | 480 | Data Mining | English
OpenIntro Statistics Second Edition | David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel | CreateSpace Independent Publishing Platform | 978-1478217206 | 426 | Probability / Statistics | English
Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan | John Kruschke | Academic Press | 978-0124058880 | 776 | Statistics | English

Parsing XML

Load the books.xml file from GitHub:

# XML file location on GitHub 
baseURL <- "https://raw.githubusercontent.com/kfolsom98/DATA607/master/Week8/Data/books.xml"
txt <- getURL(url=baseURL)

Below is the structure of the books.xml information in XML format.

Books XML Structure

Two options for parsing the XML structure

1. Parse the XML using xmlParse and plyr

xml_books <- xmlParse(txt, validate = FALSE)

# http://www.informit.com/articles/article.aspx?p=2215520
# convert the parsed document to a list, then bind each book into one data frame row
books1 <- ldply(xmlToList(xml_books), data.frame)

str(books1)
## 'data.frame':    3 obs. of  9 variables:
##  $ .id      : chr  "book" "book" "book"
##  $ Title    : Factor w/ 3 levels "Automated Data Collection with R A practical Guide to Web Scraping and Text Mining",..: 1 2 3
##  $ Authors  : Factor w/ 3 levels "Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis",..: 1 2 3
##  $ Publisher: Factor w/ 3 levels "John Wiley and Sons, Ltd",..: 1 2 3
##  $ ISBN     : Factor w/ 3 levels "978-1118834817",..: 1 2 3
##  $ Pages    : Factor w/ 3 levels "480","426","776": 1 2 3
##  $ Topic    : Factor w/ 3 levels "Data Mining",..: 1 2 3
##  $ Language : Factor w/ 1 level "English": 1 1 1
##  $ .attrs   : Factor w/ 3 levels "1","2","3": 1 2 3

This method of parsing the XML adds two fields to the resulting data frame: .id and .attrs. The .id field holds the name of each book element in the XML file, and .attrs holds that element's id attribute. The .attrs field is converted into a more helpful Book.ID field.

The resulting data frame also stores every variable (Title, Authors, Pages, etc.) as a factor, so the factors are converted to character variables.
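The clean-up code for this step is not shown in the report. A minimal sketch of what it could look like follows, using a small hypothetical data frame (`books1_demo`) that mimics the structure of books1 with fewer columns and rows.

```r
# Hypothetical stand-in for books1, with the same kinds of columns
books1_demo <- data.frame(
  .id    = c("book", "book"),
  Title  = factor(c("Book A", "Book B")),
  Pages  = factor(c("480", "426")),
  .attrs = factor(c("1", "2")),
  stringsAsFactors = FALSE
)

# Convert every factor column to character
is_fac <- vapply(books1_demo, is.factor, logical(1))
books1_demo[is_fac] <- lapply(books1_demo[is_fac], as.character)

# Rename .attrs to Book.ID and drop the constant .id column
names(books1_demo)[names(books1_demo) == ".attrs"] <- "Book.ID"
books1_demo$.id <- NULL

# Convert the numeric fields after they have become character
books1_demo$Pages   <- as.numeric(books1_demo$Pages)
books1_demo$Book.ID <- as.integer(books1_demo$Book.ID)
```

Converting to character before calling as.numeric matters, because as.numeric applied directly to a factor returns its internal level codes.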

Books Dataframe using XML Parsing Option 1
Title | Authors | Publisher | ISBN | Pages | Topic | Language | Book.ID
Automated Data Collection with R A practical Guide to Web Scraping and Text Mining | Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis | John Wiley and Sons, Ltd | 978-1118834817 | 480 | Data Mining | English | 1
OpenIntro Statistics Second Edition | David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel | CreateSpace Independent Publishing Platform | 978-1478217206 | 426 | Probability / Statistics | English | 2
Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan | John Kruschke | Academic Press | 978-0124058880 | 776 | Statistics | English | 3

2. Parse the XML using xmlRoot and xmlToDataFrame

This option seems somewhat simpler, but it does not capture the id attribute of the book element. As with option 1, the resulting data frame stores every variable (Title, Authors, Pages, etc.) as a factor, so the factors are converted to character variables.

root <- xmlRoot(xml_books)

books2 <- xmlToDataFrame(root)

str(books2)
## 'data.frame':    3 obs. of  7 variables:
##  $ Title    : Factor w/ 3 levels "Automated Data Collection with R A practical Guide to Web Scraping and Text Mining",..: 1 3 2
##  $ Authors  : Factor w/ 3 levels "David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel",..: 3 1 2
##  $ Publisher: Factor w/ 3 levels "Academic Press",..: 3 2 1
##  $ ISBN     : Factor w/ 3 levels "978-0124058880",..: 2 3 1
##  $ Pages    : Factor w/ 3 levels "426","480","776": 2 1 3
##  $ Topic    : Factor w/ 3 levels "Data Mining",..: 1 2 3
##  $ Language : Factor w/ 1 level "English": 1 1 1
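Both limitations of this option can be worked around. The sketch below assumes a hypothetical inline XML string (with a made-up `books` root element) in place of books.xml. It asks xmlToDataFrame to keep strings as characters and reads the id attribute separately with xpathSApply and xmlGetAttr.

```r
library(XML)

# Hypothetical sample; element names other than book, Title, and Pages are assumptions
xml_txt <- paste0(
  '<books>',
  '<book id="1"><Title>Book A</Title><Pages>100</Pages></book>',
  '<book id="2"><Title>Book B</Title><Pages>200</Pages></book>',
  '</books>'
)

doc <- xmlParse(xml_txt, asText = TRUE)

# Keep columns as character instead of factor
demo <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Pull the id attribute of every book element into its own column
demo$Book.ID <- xpathSApply(doc, "//book", xmlGetAttr, "id")

demo$Pages <- as.numeric(demo$Pages)
```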
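
Books Dataframe using XML Parsing Option 2
Title | Authors | Publisher | ISBN | Pages | Topic | Language
Automated Data Collection with R A practical Guide to Web Scraping and Text Mining | Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis | John Wiley and Sons, Ltd | 978-1118834817 | 2 | Data Mining | English
OpenIntro Statistics Second Edition | David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel | CreateSpace Independent Publishing Platform | 978-1478217206 | 1 | Probability / Statistics | English
Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan | John Kruschke | Academic Press | 978-0124058880 | 3 | Statistics | English

Note that Pages shows 2, 1, and 3 here rather than 480, 426, and 776. These are the internal level codes of the Pages factor, which is what as.numeric returns when applied directly to a factor. Converting with as.numeric(as.character(books2$Pages)) yields the actual page counts.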
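The Pages values of 2, 1, and 3 in the Option 2 table match the level codes reported by str(books2). A short base R sketch of this behavior follows. The vector `pages` is a stand-in for books2$Pages, and `codes` and `counts` are illustrative names.

```r
# Stand-in for books2$Pages: page counts stored as a factor
pages <- factor(c("480", "426", "776"))

# Applied directly to a factor, as.numeric returns the level codes
codes <- as.numeric(pages)

# Going through character first recovers the page counts
counts <- as.numeric(as.character(pages))

print(codes)   # 2 1 3
print(counts)  # 480 426 776
```

Levels are sorted alphabetically ("426", "480", "776"), which is why 480 maps to code 2 and 426 to code 1.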

Parsing JSON

Load the books.json file from GitHub:

# JSON file location on GitHub
baseURL <- "https://raw.githubusercontent.com/kfolsom98/DATA607/master/Week8/Data/books.json"
txt <- getURL(url=baseURL)

Below is the structure of the books.json information in JSON format.

Books JSON Structure

Parse the JSON file using the jsonlite package.

json_books <- fromJSON(txt)

str(json_books)
## 'data.frame':    3 obs. of  7 variables:
##  $ Title    : chr  "Automated Data Collection with R A practical Guide to Web Scraping and Text Mining" "OpenIntro Statistics Second Edition" "Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan"
##  $ Authors  : chr  "Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis" "David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel" "John Kruschke"
##  $ Publisher: chr  "John Wiley and Sons, Ltd" "CreateSpace Independent Publishing Platform " "Academic Press"
##  $ ISBN     : chr  "978-1118834817" "978-1478217206" "978-0124058880"
##  $ Pages    : chr  "480" "426" "776"
##  $ Topic    : chr  "Data Mining" "Probability / Statistics" "Statistics"
##  $ Language : chr  "English" "English" "English"

json_books$Pages <- as.numeric(json_books$Pages)

In this case, all variables were loaded as characters instead of factors, so only Pages needed converting. Of the three approaches, jsonlite was the simplest: a single fromJSON call returned a usable data frame.

The final dataframe looks like:

Title | Authors | Publisher | ISBN | Pages | Topic | Language
Automated Data Collection with R A practical Guide to Web Scraping and Text Mining | Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis | John Wiley and Sons, Ltd | 978-1118834817 | 480 | Data Mining | English
OpenIntro Statistics Second Edition | David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel | CreateSpace Independent Publishing Platform | 978-1478217206 | 426 | Probability / Statistics | English
Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan | John Kruschke | Academic Press | 978-0124058880 | 776 | Statistics | English
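The reason fromJSON returns a data frame directly is that jsonlite simplifies a JSON array of records into one row per record by default. A minimal sketch follows, assuming a hypothetical two-field sample (`json_txt`) rather than the real books.json.

```r
library(jsonlite)

# Hypothetical sample: an array of objects with the same keys
json_txt <- '[
  {"Title": "Book A", "Pages": "100"},
  {"Title": "Book B", "Pages": "200"}
]'

# Each object becomes a row; quoted values arrive as character
demo <- fromJSON(json_txt)

# Pages is quoted in the JSON, so it needs an explicit numeric conversion
demo$Pages <- as.numeric(demo$Pages)
```

If the page counts were stored as unquoted JSON numbers, fromJSON would load them as numeric values and the conversion step would not be needed.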