Summary

This is a warm-up exercise for getting more familiar with the HTML, XML, and JSON file formats, and with the R packages that read these formats into data frames for downstream use. In the next two class weeks, we’ll be loading these file formats from the web, using web scraping and web APIs.

I selected three of my favorite books and recorded the following attributes for each: title, authors, publish date, hardcover price, and Kindle price.

The data was saved in three different formats: JSON, XML, and HTML.

The Libraries

library(RCurl)
## Loading required package: bitops
library(knitr)
library(rjson)
library(plyr)
library(XML)

Getting JSON Data in R

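json_data_url <- "https://raw.githubusercontent.com/rmalarc/is607/master/assignment9/books.json"
# download the raw JSON text
json_data <- getURL(json_data_url)

# parse the JSON text into a nested R list
books_json_data <- fromJSON(json_data)

# convert each list element to a data frame and row-bind the results
books_json_df <- ldply(books_json_data, function(x) { data.frame(x) })

kable(books_json_df)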
| title | authors | publish_date | hardcover_price | kindle_price |
|:------|:--------|:-------------|:----------------|:-------------|
| Drive: The Surprising Truth About What Motivates Us | Daniel Pink | April 5, 2011 | 10.72 | 7.99 |
| Quiet: The Power of Introverts in a World That Can’t Stop Talking | Susan Cain | January 24, 2012 | 17.91 | 8.99 |
| Switch: How to Change Things When Change Is Hard | Chip Heath | February 16, 2010 | 17.63 | 11.99 |
| Switch: How to Change Things When Change Is Hard | Dan Heath | February 16, 2010 | 17.63 | 11.99 |
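
The repeated "Switch" row above comes from how data.frame() recycles scalar fields when one field holds several values. The sketch below reproduces that behavior on a small inline sample. The sample JSON is hypothetical (shortened titles, a subset of the fields), and the assumption that books.json stores authors as an array is inferred from the output, not confirmed.

```r
library(rjson)
library(plyr)

# Hypothetical two-book sample; the real books.json may be laid out differently.
sample_json <- '[
  {"title": "Drive",  "authors": ["Daniel Pink"],             "kindle_price": 7.99},
  {"title": "Switch", "authors": ["Chip Heath", "Dan Heath"], "kindle_price": 11.99}
]'

sample_books <- fromJSON(sample_json)

# rjson simplifies the "authors" array to a character vector, so
# data.frame() recycles the single-valued fields once per author.
sample_df <- ldply(sample_books, function(x) data.frame(x, stringsAsFactors = FALSE))

print(sample_df)
```

With this layout, a book with two authors yields two rows, which is worth keeping in mind before summing or averaging prices.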

Getting XML Data in R

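xml_data_url <- "https://raw.githubusercontent.com/rmalarc/is607/master/assignment9/books.xml"
# download the raw XML text
xml_data <- getURL(xml_data_url)
# parse the XML and convert the document tree to a nested R list
books_xml_list <- xmlToList(xmlParse(xml_data))

# convert each list element to a data frame and row-bind the results
books_xml_df <- ldply(books_xml_list, function(x) { data.frame(x) })

kable(books_xml_df)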
| .id | title | author | hardcover_price | kindle_price | publish_date | authors.author | authors.author.1 |
|:----|:------|:-------|:----------------|:-------------|:-------------|:---------------|:-----------------|
| book | Drive: The Surprising Truth About What Motivates Us | Daniel Pink | 10.72 | 7.99 | April 5, 2011 | NA | NA |
| book | Quiet: The Power of Introverts in a World That Can’t Stop Talking | Susan Cain | 17.91 | 8.99 | January 24, 2012 | NA | NA |
| book | Switch: How to Change Things When Change Is Hard | NA | 17.63 | 11.99 | February 16, 2010 | Chip Heath | Dan Heath |
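
The XML result spreads authors over three columns (author, authors.author, authors.author.1) instead of one row per author. A possible way to get the same long layout as the JSON result is to walk the book nodes with XPath, sketched below on a hypothetical inline sample. The element names and nesting are assumptions inferred from the column names above, and titles are shortened.

```r
library(XML)

# Hypothetical sample; nesting is inferred from the column names in the output.
sample_xml <- '<books>
  <book>
    <title>Drive</title>
    <author>Daniel Pink</author>
  </book>
  <book>
    <title>Switch</title>
    <authors>
      <author>Chip Heath</author>
      <author>Dan Heath</author>
    </authors>
  </book>
</books>'

doc <- xmlParse(sample_xml, asText = TRUE)
book_nodes <- getNodeSet(doc, "//book")

# For each book, collect every <author> at any depth below it;
# data.frame() then repeats the title once per author.
rows <- lapply(book_nodes, function(book) {
  data.frame(
    title = xpathSApply(book, "./title", xmlValue),
    authors = xpathSApply(book, ".//author", xmlValue),
    stringsAsFactors = FALSE
  )
})

books_long_df <- do.call(rbind, rows)
print(books_long_df)
```

The relative path ".//author" matters here: a path starting with "//" would search the whole document rather than the current book.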

Getting HTML Data in R

html_data_url <- "https://raw.githubusercontent.com/rmalarc/is607/master/assignment9/books.html"
# download the raw HTML text
html_data <- getURL(html_data_url)

# read every table in the document into a list of data frames
book_tables <- readHTMLTable(html_data)

# count the rows in each table
n.rows <- unlist(lapply(book_tables, function(tbl) nrow(tbl)))

# select the table with the most rows
books_html_df <- book_tables[[which.max(n.rows)]]

kable(books_html_df)
| title | authors | publish_date | hardcover_price | kindle_price |
|:------|:--------|:-------------|:----------------|:-------------|
| Drive: The Surprising Truth About What Motivates Us | Daniel Pink | April 5, 2011 | 10.72 | 7.99 |
| Quiet: The Power of Introverts in a World That Can’t Stop Talking | Susan Cain | January 24, 2012 | 17.91 | 8.99 |
| Switch: How to Change Things When Change Is Hard | Chip Heath | February 16, 2010 | 17.63 | 11.99 |
| Switch: How to Change Things When Change Is Hard | Dan Heath | February 16, 2010 | 17.63 | 11.99 |

Conclusion

Even though R packages are available for loading XML, HTML, and JSON, the schema of the resulting data frame varies slightly from one format to the next.
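
To make the schema difference concrete, the column names printed in the JSON and XML outputs can be compared with setdiff(). The vectors below are copied by hand from those outputs, so this is only an illustration of the comparison, not a step from the original analysis.

```r
# Column names as they appear in the JSON and XML outputs
json_cols <- c("title", "authors", "publish_date",
               "hardcover_price", "kindle_price")
xml_cols <- c(".id", "title", "author", "hardcover_price", "kindle_price",
              "publish_date", "authors.author", "authors.author.1")

# Columns present in one result but not the other
only_in_xml <- setdiff(xml_cols, json_cols)
only_in_json <- setdiff(json_cols, xml_cols)

print(only_in_xml)
print(only_in_json)
```

Every mismatch involves the author fields or the .id column that ldply() adds for a named list, so the differences trace back to how multi-valued authors are represented.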

In addition, each data source has its own challenges, particularly HTML, where a document that contains multiple tables requires extra work to pick out the right one.
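
Picking the table with the most rows works when the table of interest is the largest one, but not otherwise. One alternative, sketched here on a hypothetical inline page, is to select the table by the column names it is expected to contain. The sample HTML and the expected_cols vector are assumptions made for illustration.

```r
library(XML)

# Hypothetical page where the largest table is NOT the books table
sample_html <- '<html><body>
<table>
  <tr><th>site</th><th>visits</th></tr>
  <tr><td>home</td><td>10</td></tr>
  <tr><td>about</td><td>4</td></tr>
  <tr><td>contact</td><td>2</td></tr>
</table>
<table>
  <tr><th>title</th><th>authors</th></tr>
  <tr><td>Drive</td><td>Daniel Pink</td></tr>
</table>
</body></html>'

doc <- htmlParse(sample_html, asText = TRUE)

# header = TRUE treats the first row of each table as column names
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# Keep the first table that contains all of the expected columns
expected_cols <- c("title", "authors")
has_cols <- vapply(tables,
                   function(tbl) all(expected_cols %in% names(tbl)),
                   logical(1))
books_tbl <- tables[[which(has_cols)[1]]]

print(books_tbl)
```

If no table matches, books_tbl ends up NULL, so a real script would want to check for that before continuing.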