Week 9: XML and JSON

Summary

This is a warm-up exercise for getting more familiar with the HTML, XML, and JSON file formats, and with the R packages that read these formats into data frames for downstream use. In the next two class weeks, we’ll load these file formats from the web using web scraping and web APIs.

I selected three of my favorite books and recorded the following attributes for each:

Title
Authors
Publish Date
Hardcover Price
Kindle Price

The data was saved in three different formats:

HTML
JSON
XML

The Libraries

library(RCurl)

## Loading required package: bitops

library(knitr)
library(rjson)
library(plyr)
library(XML)

Getting JSON Data in R

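json_data_url <- "https://raw.githubusercontent.com/rmalarc/is607/master/assignment9/books.json"
# Download the raw JSON text
json_data <- getURL(json_data_url)

# Parse the JSON text into a nested R list
books_json_data <- fromJSON(json_data)

# Convert each list element to a data frame and row-bind the results
books_json_df <- ldply(books_json_data, function(x) { data.frame(x) })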

kable(books_json_df)

title	authors	publish_date	hardcover_price	kindle_price
Drive: The Surprising Truth About What Motivates Us	Daniel Pink	April 5, 2011	10.72	7.99
Quiet: The Power of Introverts in a World That Can’t Stop Talking	Susan Cain	January 24, 2012	17.91	8.99
Switch: How to Change Things When Change Is Hard	Chip Heath	February 16, 2010	17.63	11.99
Switch: How to Change Things When Change Is Hard	Dan Heath	February 16, 2010	17.63	11.99
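
The Switch book appears twice in the table above, once per author. That is consistent with data.frame() recycling the single-valued fields across an authors vector of length two. The sketch below reproduces the effect on a small inline sample and shows one possible alternative, collapsing the authors into a single string so each book stays on one row. The inline JSON, its array layout for authors, and the names per_author and per_book are assumptions made for illustration; the real books.json may be structured differently.

```r
library(rjson)
library(plyr)

# Hypothetical sample: "authors" is assumed to be a JSON array
json_text <- '[
  {"title": "Drive",  "authors": ["Daniel Pink"],             "kindle_price": 7.99},
  {"title": "Switch", "authors": ["Chip Heath", "Dan Heath"], "kindle_price": 11.99}
]'
books <- fromJSON(json_text)

# One row per author: data.frame() recycles title and price
# to match the length of the authors vector
per_author <- ldply(books, function(x) {
  data.frame(x, stringsAsFactors = FALSE)
})

# One row per book: collapse the authors vector into a single string first
per_book <- ldply(books, function(x) {
  x$authors <- paste(x$authors, collapse = "; ")
  data.frame(x, stringsAsFactors = FALSE)
})

print(per_author)
print(per_book)
```

Which shape is preferable depends on the downstream use: one row per author suits grouping by author, while one row per book avoids double-counting prices.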

Getting XML Data in R

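xml_data_url <- "https://raw.githubusercontent.com/rmalarc/is607/master/assignment9/books.xml"
# Download the raw XML text
xml_data <- getURL(xml_data_url)
# Parse the XML and convert the document tree to a nested R list
books_xml_list <- xmlToList(xmlParse(xml_data))

# Convert each book element to a data frame and row-bind the results
books_xml_df <- ldply(books_xml_list, function(x) { data.frame(x) })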

kable(books_xml_df)

.id	title	author	hardcover_price	kindle_price	publish_date	authors.author	authors.author.1
book	Drive: The Surprising Truth About What Motivates Us	Daniel Pink	10.72	7.99	April 5, 2011	NA	NA
book	Quiet: The Power of Introverts in a World That Can’t Stop Talking	Susan Cain	17.91	8.99	January 24, 2012	NA	NA
book	Switch: How to Change Things When Change Is Hard	NA	17.63	11.99	February 16, 2010	Chip Heath	Dan Heath
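
In the table above, the author names land in three different columns (author, authors.author, and authors.author.1), which suggests that single-author books use a bare author element while the multi-author book wraps its authors in an authors element. One way to get a uniform schema is to query the author nodes with XPath rather than flattening the whole tree. The following is a minimal sketch under that assumption; the inline XML and the name books_long are hypothetical, and the real books.xml may differ.

```r
library(XML)

# Hypothetical sample mirroring the assumed structure of books.xml
xml_text <- '<books>
  <book><title>Drive</title><author>Daniel Pink</author></book>
  <book><title>Switch</title>
    <authors><author>Chip Heath</author><author>Dan Heath</author></authors>
  </book>
</books>'

doc <- xmlParse(xml_text, asText = TRUE)
book_nodes <- getNodeSet(doc, "//book")

# For each book, collect every author element at any depth below it,
# so both layouts end up in the same "author" column
rows <- lapply(book_nodes, function(node) {
  data.frame(
    title  = xpathSApply(node, "./title", xmlValue),
    author = xpathSApply(node, ".//author", xmlValue),
    stringsAsFactors = FALSE
  )
})
books_long <- do.call(rbind, rows)

print(books_long)
```

This yields one row per author, the same shape as the JSON and HTML results, at the cost of listing each wanted field explicitly.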

Getting HTML Data in R

html_data_url <- "https://raw.githubusercontent.com/rmalarc/is607/master/assignment9/books.html"
html_data <- getURL(html_data_url)

# Read every table in the HTML document into a list of data frames
book_tables <- readHTMLTable(html_data)

# Count the rows in each table
n.rows <- unlist(lapply(book_tables, function(tbl) nrow(tbl)))

# Select the table with the most rows
books_html_df <- book_tables[[which.max(n.rows)]]


kable(books_html_df)

title	authors	publish_date	hardcover_price	kindle_price
Drive: The Surprising Truth About What Motivates Us	Daniel Pink	April 5, 2011	10.72	7.99
Quiet: The Power of Introverts in a World That Can’t Stop Talking	Susan Cain	January 24, 2012	17.91	8.99
Switch: How to Change Things When Change Is Hard	Chip Heath	February 16, 2010	17.63	11.99
Switch: How to Change Things When Change Is Hard	Dan Heath	February 16, 2010	17.63	11.99
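
Values scraped from an HTML table arrive as text, so the price columns would typically need converting before any arithmetic. The sketch below runs the same "largest table" selection on a small inline document containing two tables, then converts one price column to numeric. The inline HTML and the names html_text, tables, and books are assumptions for illustration and are not taken from books.html.

```r
library(XML)

# Hypothetical page with a small navigation table and a larger data table
html_text <- '<html><body>
<table><tr><th>link</th></tr><tr><td>Home</td></tr></table>
<table>
<tr><th>title</th><th>kindle_price</th></tr>
<tr><td>Drive</td><td>7.99</td></tr>
<tr><td>Quiet</td><td>8.99</td></tr>
<tr><td>Switch</td><td>11.99</td></tr>
</table>
</body></html>'

doc <- htmlParse(html_text, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Pick the table with the most rows
n_rows <- unlist(lapply(tables, nrow))
books <- tables[[which.max(n_rows)]]

# Convert the price column from text to numeric;
# as.character() guards against the column having been read as a factor
books$kindle_price <- as.numeric(as.character(books$kindle_price))

str(books)
```

Choosing the table with the most rows is a heuristic; on a page where the target table is not the largest, selecting by position or by a table id would be more reliable.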

Conclusion

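Even though R libraries are available for loading XML, HTML, and JSON, the resulting data frame schemas differ slightly. The XML version gains an .id column and spreads the authors across the author, authors.author, and authors.author.1 columns, while the JSON and HTML versions keep a single authors column and repeat the book row once per author.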
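
A quick way to see such schema differences is to compare the column names of the resulting data frames. The sketch below does this in base R on two small stand-in data frames; json_like and xml_like are hypothetical and only mimic the column layouts shown in the tables above.

```r
# Stand-ins that mimic the column names of the JSON and XML results
json_like <- data.frame(
  title = "Switch",
  authors = "Chip Heath",
  stringsAsFactors = FALSE
)
xml_like <- data.frame(
  .id = "book",
  title = "Switch",
  author = NA,
  authors.author = "Chip Heath",
  stringsAsFactors = FALSE
)

schemas <- list(json = names(json_like), xml = names(xml_like))

# Columns shared by both results
common <- Reduce(intersect, schemas)

# Columns that only the XML result has
only_xml <- setdiff(schemas$xml, schemas$json)

print(common)
print(only_xml)
```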

In addition, each data source has its own challenges, particularly HTML, where a document can contain multiple tables and the right one has to be selected.