DATA 607 Week 8 Assignment - Working with XML and JSON in R

Assignment Overview

This assignment focuses on creating three files (HTML, XML, and JSON) that describe three selected books of interest, and on parsing each file into an R dataframe. Using attributes such as title and author, the books' information is stored in HTML (as an HTML table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”).

The goal of this assignment is to use R code and any R packages of choice to load the information from each of the three sources into separate R data frames.

Setup

This assignment requires the following R packages:

XML
RCurl
plyr
jsonlite
knitr
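
Before running the rest of the code, it can help to confirm that these packages are available. The following is a minimal sketch, not part of the original assignment code; the variable names `required` and `installed` are illustrative.

```r
# Packages listed in the Setup section
required <- c("XML", "RCurl", "plyr", "jsonlite", "knitr")

# TRUE for each package that can be loaded, FALSE otherwise
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

if (any(!installed)) {
  message("Missing packages: ", paste(required[!installed], collapse = ", "))
}

# Attach only the packages that are actually available
invisible(lapply(required[installed], library, character.only = TRUE))
```

Any package reported as missing can be installed with install.packages() before continuing.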

The code for this assignment can be found on GitHub here.

The three files, books.html, books.xml, and books.json, can be found on GitHub using the link below:

https://github.com/kfolsom98/DATA607/tree/master/Week8/Data

Parsing HTML

Load the books.html file from GitHub:

# HTML file location on GitHub 
baseURL <- "https://raw.githubusercontent.com/kfolsom98/DATA607/master/Week8/Data/books.html"
txt <- getURL(url=baseURL)

Below is the structure of the books.html information in HTML format.

[Figure: Books HTML table]

Parse the HTML table using htmlParse and XPath:

xmltext <- htmlParse(txt, asText=TRUE)
xmltable <- xpathApply(xmltext, "//table//tbody//tr")

# solution used below found on stackoverflow
# modified code to apply to books.html file
# http://stackoverflow.com/questions/6427061/parsing-html-tables-using-the-xml-rcurl-r-packages-without-using-the-readhtml

html_books <- as.data.frame(t(sapply(xmltable, function(x)unname(xmlSApply(x, xmlValue))[c(1, 3, 5, 7, 9, 11, 13)])))

The resulting dataframe is shown below, but does not include column names. Additionally, all variables are defined as factors.

V1	V2	V3	V4	V5	V6	V7
Automated Data Collection with R A practical Guide to Web Scraping and Text Mining	Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis	John Wiley and Sons, Ltd	978-1118834817	480	Data Mining	English
OpenIntro Statistics Second Edition	David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel	CreateSpace Independent Publishing Platform	978-1478217206	426	Probability / Statistics	English
Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan	John Kruschke	Academic Press	978-0124058880	776	Statistics	English

## 'data.frame':    3 obs. of  7 variables:
##  $ V1: Factor w/ 3 levels "Automated Data Collection with R A practical Guide to Web Scraping and Text Mining",..: 1 3 2
##  $ V2: Factor w/ 3 levels "David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel",..: 3 1 2
##  $ V3: Factor w/ 3 levels "Academic Press",..: 3 2 1
##  $ V4: Factor w/ 3 levels "978-0124058880",..: 2 3 1
##  $ V5: Factor w/ 3 levels "426","480","776": 2 1 3
##  $ V6: Factor w/ 3 levels "Data Mining",..: 1 2 3
##  $ V7: Factor w/ 1 level "English": 1 1 1

Apply column names and convert factor variables to characters:

colnames(html_books) <- c("Title",  "Authors", "Publisher", "ISBN", "Pages", "Topic",   "Language")

# convert the factors to characters
# http://stackoverflow.com/questions/27528907/how-to-convert-data-frame-column-from-factor-to-numeric

indx <- sapply(html_books, is.factor)
html_books[indx] <- lapply(html_books[indx], function(x) as.character(x))

html_books$Pages <- as.numeric(html_books$Pages)

Title	Authors	Publisher	ISBN	Pages	Topic	Language
Automated Data Collection with R A practical Guide to Web Scraping and Text Mining	Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis	John Wiley and Sons, Ltd	978-1118834817	480	Data Mining	English
OpenIntro Statistics Second Edition	David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel	CreateSpace Independent Publishing Platform	978-1478217206	426	Probability / Statistics	English
Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan	John Kruschke	Academic Press	978-0124058880	776	Statistics	English
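
The odd indices c(1, 3, ..., 13) in the sapply call above presumably skip the whitespace text nodes that sit between the td cells of each row. An alternative that avoids hard-coded positions is to select the td elements directly with a relative XPath. The sketch below assumes a small inline table as a hypothetical stand-in for books.html (only two columns and two rows), and the names `html_txt`, `cells`, and `html_demo` are illustrative.

```r
library(XML)

# Hypothetical stand-in for books.html, with whitespace between cells
html_txt <- '<html><body><table><tbody>
  <tr>
    <td>OpenIntro Statistics Second Edition</td>
    <td>426</td>
  </tr>
  <tr>
    <td>Doing Bayesian Data Analysis</td>
    <td>776</td>
  </tr>
</tbody></table></body></html>'

doc  <- htmlParse(html_txt, asText = TRUE)
rows <- getNodeSet(doc, "//table//tbody//tr")

# For each row, take the text of its td children only
cells <- lapply(rows, function(r) xpathSApply(r, "./td", xmlValue))

# Bind the rows into a data frame of character columns (V1, V2)
html_demo <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
str(html_demo)
```

Because stringsAsFactors = FALSE is set, the columns arrive as characters and the factor-to-character step is not needed in this variant.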

Parsing XML

Load the books.xml file from GitHub:

# XML file location on GitHub 
baseURL <- "https://raw.githubusercontent.com/kfolsom98/DATA607/master/Week8/Data/books.xml"
txt <- getURL(url=baseURL)

Below is the structure of the books.xml information in XML format.

[Figure: Books XML structure]

Two options to parse the XML structure

1. Parse the XML using xmlParse and plyr (ldply)

xml_books <- xmlParse(txt, validate = FALSE)

# http://www.informit.com/articles/article.aspx?p=2215520
# convert the parsed document to a list, then bind one row per book
books1 <- ldply(xmlToList(xml_books), data.frame)

str(books1)

## 'data.frame':    3 obs. of  9 variables:
##  $ .id      : chr  "book" "book" "book"
##  $ Title    : Factor w/ 3 levels "Automated Data Collection with R A practical Guide to Web Scraping and Text Mining",..: 1 2 3
##  $ Authors  : Factor w/ 3 levels "Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis",..: 1 2 3
##  $ Publisher: Factor w/ 3 levels "John Wiley and Sons, Ltd",..: 1 2 3
##  $ ISBN     : Factor w/ 3 levels "978-1118834817",..: 1 2 3
##  $ Pages    : Factor w/ 3 levels "480","426","776": 1 2 3
##  $ Topic    : Factor w/ 3 levels "Data Mining",..: 1 2 3
##  $ Language : Factor w/ 1 level "English": 1 1 1
##  $ .attrs   : Factor w/ 3 levels "1","2","3": 1 2 3

This method of parsing the XML adds two fields to the resulting dataframe: .id and .attrs. The .id field holds the name of each book element in the XML file, and .attrs holds that element's id attribute. The .attrs field is converted to a more helpful Book.ID field.

All other variables in the resulting dataframe, such as Title, Authors, and Pages, are factors. These are converted to character variables.
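
The cleanup code for this step is not shown in the report. A minimal sketch of one way to do it in base R follows; `books1_demo` is a hypothetical two-row stand-in for books1 with only some of its columns, so the exact steps used by the author may differ.

```r
# Hypothetical stand-in for the ldply() result: factors plus .id and .attrs
books1_demo <- data.frame(
  .id    = c("book", "book"),
  Title  = factor(c("OpenIntro Statistics Second Edition",
                    "Doing Bayesian Data Analysis")),
  Pages  = factor(c("426", "776")),
  .attrs = factor(c("1", "2")),
  stringsAsFactors = FALSE
)

# Convert every factor column to character
is_fac <- vapply(books1_demo, is.factor, logical(1))
books1_demo[is_fac] <- lapply(books1_demo[is_fac], as.character)

# Rename .attrs to Book.ID and drop the constant .id column
names(books1_demo)[names(books1_demo) == ".attrs"] <- "Book.ID"
books1_demo$.id <- NULL

# Pages is now a character column, so a direct numeric conversion is safe
books1_demo$Pages <- as.numeric(books1_demo$Pages)
str(books1_demo)
```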

Books Dataframe using XML Parsing Option 1
Title	Authors	Publisher	ISBN	Pages	Topic	Language	Book.ID
Automated Data Collection with R A practical Guide to Web Scraping and Text Mining	Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis	John Wiley and Sons, Ltd	978-1118834817	480	Data Mining	English	1
OpenIntro Statistics Second Edition	David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel	CreateSpace Independent Publishing Platform	978-1478217206	426	Probability / Statistics	English	2
Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan	John Kruschke	Academic Press	978-0124058880	776	Statistics	English	3

2. Parse the XML using xmlRoot and using xmlToDataFrame

This option is somewhat simpler, but it does not carry over the id attribute from the book element. As with option 1, all variables in the resulting dataframe, such as Title, Authors, and Pages, are factors. These are converted to character variables.

root <- xmlRoot(xml_books)

books2 <- xmlToDataFrame(root)

str(books2)

## 'data.frame':    3 obs. of  7 variables:
##  $ Title    : Factor w/ 3 levels "Automated Data Collection with R A practical Guide to Web Scraping and Text Mining",..: 1 3 2
##  $ Authors  : Factor w/ 3 levels "David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel",..: 3 1 2
##  $ Publisher: Factor w/ 3 levels "Academic Press",..: 3 2 1
##  $ ISBN     : Factor w/ 3 levels "978-0124058880",..: 2 3 1
##  $ Pages    : Factor w/ 3 levels "426","480","776": 2 1 3
##  $ Topic    : Factor w/ 3 levels "Data Mining",..: 1 2 3
##  $ Language : Factor w/ 1 level "English": 1 1 1
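
One caution when converting Pages here: the str() output above shows the factor levels sorted as "426", "480", "776" with internal codes 2 1 3. Calling as.numeric() directly on a factor returns those codes rather than the page counts, so the factor should go through as.character() first. A short sketch, using a standalone `pages` vector as a stand-in for books2$Pages:

```r
# Stand-in for books2$Pages: a factor whose levels are sorted alphabetically
pages <- factor(c("480", "426", "776"))

# Direct conversion returns the internal level codes
codes <- as.numeric(pages)

# Converting to character first recovers the actual values
values <- as.numeric(as.character(pages))

print(codes)
print(values)
```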

Books Dataframe using XML Parsing Option 2
Title	Authors	Publisher	ISBN	Pages	Topic	Language
Automated Data Collection with R A practical Guide to Web Scraping and Text Mining	Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis	John Wiley and Sons, Ltd	978-1118834817	480	Data Mining	English
OpenIntro Statistics Second Edition	David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel	CreateSpace Independent Publishing Platform	978-1478217206	426	Probability / Statistics	English
Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan	John Kruschke	Academic Press	978-0124058880	776	Statistics	English
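
Although xmlToDataFrame does not return the id attribute of each book element, the attribute can be read separately with XPath and added as a column. The sketch below is one possible approach rather than the author's code. It assumes a small inline document as a hypothetical stand-in for books.xml, and the root element name `books` is an assumption.

```r
library(XML)

# Hypothetical stand-in for books.xml (root element name is assumed)
xml_txt <- paste0(
  '<books>',
  '<book id="1"><Title>OpenIntro Statistics Second Edition</Title><Pages>426</Pages></book>',
  '<book id="2"><Title>Doing Bayesian Data Analysis</Title><Pages>776</Pages></book>',
  '</books>'
)

doc <- xmlParse(xml_txt, asText = TRUE)

# One row per book element; keep text columns as characters
books_demo <- xmlToDataFrame(xmlRoot(doc), stringsAsFactors = FALSE)

# Read the id attribute of every book element, in document order
books_demo$Book.ID <- xpathSApply(doc, "//book", xmlGetAttr, "id")

books_demo$Pages <- as.numeric(books_demo$Pages)
str(books_demo)
```

Setting stringsAsFactors = FALSE also removes the need for the factor-to-character conversion in this variant.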

Parsing JSON

Load the books.json file from GitHub:

# JSON file location on GitHub
baseURL <- "https://raw.githubusercontent.com/kfolsom98/DATA607/master/Week8/Data/books.json"
txt <- getURL(url=baseURL)

Below is the structure of the books.json information in JSON format.

[Figure: Books JSON structure]

Parse the JSON file using the jsonlite package.

json_books <- fromJSON(txt)

str(json_books)

## 'data.frame':    3 obs. of  7 variables:
##  $ Title    : chr  "Automated Data Collection with R A practical Guide to Web Scraping and Text Mining" "OpenIntro Statistics Second Edition" "Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan"
##  $ Authors  : chr  "Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis" "David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel" "John Kruschke"
##  $ Publisher: chr  "John Wiley and Sons, Ltd" "CreateSpace Independent Publishing Platform " "Academic Press"
##  $ ISBN     : chr  "978-1118834817" "978-1478217206" "978-0124058880"
##  $ Pages    : chr  "480" "426" "776"
##  $ Topic    : chr  "Data Mining" "Probability / Statistics" "Statistics"
##  $ Language : chr  "English" "English" "English"

json_books$Pages <- as.numeric(json_books$Pages)

In this case, the variables were all loaded as characters instead of factors, so only Pages needed to be converted. Of the three approaches, the jsonlite method was the simplest and most straightforward.

The final dataframe looks like:

Title	Authors	Publisher	ISBN	Pages	Topic	Language
Automated Data Collection with R A practical Guide to Web Scraping and Text Mining	Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis	John Wiley and Sons, Ltd	978-1118834817	480	Data Mining	English
OpenIntro Statistics Second Edition	David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel	CreateSpace Independent Publishing Platform	978-1478217206	426	Probability / Statistics	English
Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan	John Kruschke	Academic Press	978-0124058880	776	Statistics	English
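
The JSON route is short because jsonlite's fromJSON, by default, simplifies a JSON array of objects with matching keys into a data frame. A minimal sketch of that behavior follows, assuming a small inline array as a hypothetical stand-in for books.json; the names `json_txt` and `json_demo` are illustrative.

```r
library(jsonlite)

# Hypothetical stand-in for books.json: an array of objects with the same keys
json_txt <- '[
  {"Title": "OpenIntro Statistics Second Edition", "Pages": "426"},
  {"Title": "Doing Bayesian Data Analysis", "Pages": "776"}
]'

# The array of records is simplified to a data frame with character columns
json_demo <- fromJSON(json_txt)

# Pages is stored as a JSON string here, so it still needs a numeric conversion
json_demo$Pages <- as.numeric(json_demo$Pages)
str(json_demo)
```

If the JSON file stored Pages as a number rather than a quoted string, the column would be read as numeric and the conversion step could be dropped.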