This is a warm-up exercise to help you get more familiar with the HTML, XML, and JSON file formats, and with using packages to read these formats into R data frames for downstream use. In the next two class weeks, we’ll be loading these file formats from the web, using web scraping and web APIs.
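I selected three of my favorite books and recorded the following attributes for each: title, authors, publication date, hardcover price, and Kindle price.
The data was saved in three different formats: JSON, XML, and HTML.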
library(RCurl)
## Loading required package: bitops
library(knitr)
library(rjson)
library(plyr)
library(XML)
json_data_url <- "https://raw.githubusercontent.com/rmalarc/is607/master/assignment9/books.json"
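# download the raw JSON text
json_data <- getURL(json_data_url)
# parse the JSON text into a nested R list
books_json_data <- fromJSON(json_data)
# convert each list element to a data frame and stack the results row-wise
books_json_df <- ldply(books_json_data, function(x) { data.frame(x) } )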
kable(books_json_df)
| title | authors | publish_date | hardcover_price | kindle_price |
|---|---|---|---|---|
| Drive: The Surprising Truth About What Motivates Us | Daniel Pink | April 5, 2011 | 10.72 | 7.99 |
| Quiet: The Power of Introverts in a World That Can’t Stop Talking | Susan Cain | January 24, 2012 | 17.91 | 8.99 |
| Switch: How to Change Things When Change Is Hard | Chip Heath | February 16, 2010 | 17.63 | 11.99 |
| Switch: How to Change Things When Change Is Hard | Dan Heath | February 16, 2010 | 17.63 | 11.99 |
xml_data_url <- "https://raw.githubusercontent.com/rmalarc/is607/master/assignment9/books.xml"
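# download the raw XML text
xml_data <- getURL(xml_data_url)
# parse the XML and convert the document tree to a nested R list
books_xml_list <- xmlToList(xmlParse(xml_data))
# convert each <book> element to a data frame and stack the results row-wise
books_xml_df <- ldply(books_xml_list, function(x) { data.frame(x) } )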
kable(books_xml_df)
| .id | title | author | hardcover_price | kindle_price | publish_date | authors.author | authors.author.1 |
|---|---|---|---|---|---|---|---|
| book | Drive: The Surprising Truth About What Motivates Us | Daniel Pink | 10.72 | 7.99 | April 5, 2011 | NA | NA |
| book | Quiet: The Power of Introverts in a World That Can’t Stop Talking | Susan Cain | 17.91 | 8.99 | January 24, 2012 | NA | NA |
| book | Switch: How to Change Things When Change Is Hard | NA | 17.63 | 11.99 | February 16, 2010 | Chip Heath | Dan Heath |
html_data_url <- "https://raw.githubusercontent.com/rmalarc/is607/master/assignment9/books.html"
html_data <- getURL(html_data_url)
# get all the tables in the HTML document
book_tables <- readHTMLTable(html_data)
# count the rows in each table found in the HTML
n.rows <- unlist(lapply(book_tables, function(tbl) dim(tbl)[1]))
# select the table with the most rows
books_html_df <- book_tables[[which.max(n.rows)]]
kable(books_html_df)
| title | authors | publish_date | hardcover_price | kindle_price |
|---|---|---|---|---|
| Drive: The Surprising Truth About What Motivates Us | Daniel Pink | April 5, 2011 | 10.72 | 7.99 |
| Quiet: The Power of Introverts in a World That Can’t Stop Talking | Susan Cain | January 24, 2012 | 17.91 | 8.99 |
| Switch: How to Change Things When Change Is Hard | Chip Heath | February 16, 2010 | 17.63 | 11.99 |
| Switch: How to Change Things When Change Is Hard | Dan Heath | February 16, 2010 | 17.63 | 11.99 |
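Even though R packages are available for loading XML, HTML, and JSON, the schema of the resulting data frame varies slightly. The JSON and HTML versions produce one row per author, so the two-author book appears twice, while the XML version produces one row per book and spreads the authors across the columns author, authors.author, and authors.author.1.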
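One way to reconcile the XML result with the other two is to stack the author columns so that each author gets its own row. The following is only a sketch under stated assumptions: the data frame is a hand-typed stand-in for `books_xml_df` (titles shortened), the names `author_cols`, `long`, and `books_xml_tidy` are made up for illustration, and the column names are taken from the XML table above.

```r
# Stand-in for books_xml_df, typed in from the XML table above (titles shortened)
books_xml_df <- data.frame(
  .id = "book",
  title = c("Drive", "Quiet", "Switch"),
  author = c("Daniel Pink", "Susan Cain", NA),
  hardcover_price = c("10.72", "17.91", "17.63"),
  kindle_price = c("7.99", "8.99", "11.99"),
  publish_date = c("April 5, 2011", "January 24, 2012", "February 16, 2010"),
  authors.author = c(NA, NA, "Chip Heath"),
  authors.author.1 = c(NA, NA, "Dan Heath"),
  stringsAsFactors = FALSE
)

# Columns that may hold an author name (assumed from the table above)
author_cols <- c("author", "authors.author", "authors.author.1")

# Build one block of rows per author column, then stack the blocks
long <- do.call(rbind, lapply(author_cols, function(col) {
  data.frame(
    title = as.character(books_xml_df$title),
    authors = as.character(books_xml_df[[col]]),
    publish_date = as.character(books_xml_df$publish_date),
    # as.character() first, in case the columns were read as factors
    hardcover_price = as.numeric(as.character(books_xml_df$hardcover_price)),
    kindle_price = as.numeric(as.character(books_xml_df$kindle_price)),
    stringsAsFactors = FALSE
  )
}))

# Drop rows where the author slot was empty and restore the original book order
long <- long[!is.na(long$authors), ]
books_xml_tidy <- long[order(match(long$title, books_xml_df$title)), ]
rownames(books_xml_tidy) <- NULL
```

The result has the same five columns as the JSON and HTML data frames, with the two-author book repeated once per author.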
In addition, each data source has its own challenges. HTML is the most awkward: a document can contain several tables, so the code has to decide which one holds the data (here, the table with the most rows).
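Picking the largest table works for this file, but it can select the wrong one when a page also contains a longer layout or navigation table. A possible alternative, sketched below, is to pick the table by its column names. The list here is a hypothetical stand-in for what `readHTMLTable()` returns (one data frame per table), and `expected_cols` and `has_cols` are illustrative names, not part of the original code.

```r
# Hypothetical stand-in for the list returned by readHTMLTable()
book_tables <- list(
  # a navigation table that happens to have more rows than the data table
  data.frame(menu = c("Home", "About", "Contact", "Archive", "Links"),
             stringsAsFactors = FALSE),
  # the table we actually want
  data.frame(title = c("Drive", "Quiet"),
             authors = c("Daniel Pink", "Susan Cain"),
             kindle_price = c("7.99", "8.99"),
             stringsAsFactors = FALSE)
)

# Columns the books table is expected to contain
expected_cols <- c("title", "authors")

# TRUE for each table that contains all of the expected columns
has_cols <- vapply(
  book_tables,
  function(tbl) !is.null(tbl) && all(expected_cols %in% names(tbl)),
  logical(1)
)

# Fail loudly if no table matches, otherwise take the first match
stopifnot(any(has_cols))
books_html_df <- book_tables[[which(has_cols)[1]]]
```

In this stand-in, the most-rows rule would have returned the five-row navigation table, while matching on column names returns the books table.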