Instructions

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

Table Layout

I chose to use the following columns for my datasets:

  • TITLE
  • AUTHORS (one or more names per book)
  • YEAR
  • SUBJECT
  • AMAZON RATING

Libraries

library(XML)
library(RJSONIO)
library(rvest)
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:XML':
## 
##     xml
library(RCurl)
## Loading required package: bitops

Loading in the Files

HTML File

I will use the rvest package to read the HTML file and convert its table into a data frame.

htmlURL <- "https://raw.githubusercontent.com/amberferger/DATA607_HW7/master/books.html"

readHtml <- read_html(htmlURL)
tables <- html_nodes(readHtml,"table")
tables_ls <- html_table(tables, fill = TRUE)
booksHTML.df <- as.data.frame(tables_ls)

booksHTML.df
##   Book_Id
## 1       1
## 2       2
## 3       3
##                                                                    Title
## 1 Freakonomics: A Rogue Economist Explores the Hidden Side of Everything
## 2                          Blink: The Power of Thinking Without Thinking
## 3                    The Black Swan: The Impact of the Highly Improbable
##                               Authors Year    Subject Amazon_Rating
## 1 Steven D. Levitt, Stephen J. Dubner 2009  Economics           4.4
## 2                    Malcolm Gladwell 2007 Leadership           4.3
## 3               Nassim Nicholas Taleb 2010 Leadership           4.0
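
The same steps can be tried without a network connection. The sketch below is a minimal example, assuming a small inline table (the `html_text` string, its shortened titles, and the `demo_html_df` name are illustrative and are not the contents of books.html). It also selects the first table explicitly with `[[1]]`, which may be safer than passing the whole list to `as.data.frame()` if a page ever contains more than one table.

```r
library(rvest)

# Hypothetical inline HTML standing in for books.html (titles shortened for the demo)
html_text <- '<table>
  <tr><th>Book_Id</th><th>Title</th><th>Authors</th><th>Year</th><th>Amazon_Rating</th></tr>
  <tr><td>1</td><td>Freakonomics</td><td>Steven D. Levitt, Stephen J. Dubner</td><td>2009</td><td>4.4</td></tr>
  <tr><td>2</td><td>Blink</td><td>Malcolm Gladwell</td><td>2007</td><td>4.3</td></tr>
</table>'

page <- read_html(html_text)           # parse the literal HTML string
tables <- html_nodes(page, "table")    # every <table> node on the page
tables_ls <- html_table(tables)        # a list with one element per table

# Take the first (and here only) table and store it as a plain data frame
demo_html_df <- as.data.frame(tables_ls[[1]])

# Inspect the column types that html_table() inferred
str(demo_html_df)
```

Checking the column types with `str()` is useful later, when deciding whether the frames built from the three sources really match.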

XML File

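We will use the XML library to load the XML file and convert it to a data frame. Because xmlToDataFrame() reads every value in as text (character or factor), we then convert the year and amazon_rating columns to numeric. Note that the two authors of the first book are run together in the authors column with no separator, which suggests the file stores each author as a separate nested element whose text xmlToDataFrame() concatenates.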

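# Download the raw XML text (note: SSL verification is switched off in this call)
xmlData <- getURL('https://raw.githubusercontent.com/amberferger/DATA607_HW7/master/books.xml', ssl.verifyhost=FALSE, ssl.verifypeer=FALSE)

# Parse the downloaded string; asText = TRUE marks it as XML content rather than a file path
booksXML <- xmlParse(xmlData[1], asText = TRUE)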

root <- xmlRoot(booksXML)
booksXML.df <- xmlToDataFrame(root)

booksXML.df$year <- as.numeric(as.character(booksXML.df$year))
booksXML.df$amazon_rating <- as.numeric(as.character(booksXML.df$amazon_rating))

booksXML.df
##                                                                    title
## 1 Freakonomics: A Rogue Economist Explores the Hidden Side of Everything
## 2                          Blink: The Power of Thinking Without Thinking
## 3                    The Black Swan: The Impact of the Highly Improbable
##                             authors year    subject amazon_rating
## 1 Steven D. LevittStephen J. Dubner 2009  Economics           4.4
## 2                  Malcolm Gladwell 2007 Leadership           4.3
## 3             Nassim Nicholas Taleb 2010 Leadership           4.0
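
In the output above, the two authors of the first book are fused into "Steven D. LevittStephen J. Dubner". One possible way to keep them apart is to build each row by hand and join the author names with a separator. The sketch below assumes that every author sits in its own `<author>` element inside `<authors>`; that layout is inferred from the output, not confirmed from books.xml. The inline `xml_text`, the `book_to_row()` helper, and `demo_xml_df` are all hypothetical names used for illustration.

```r
library(XML)

# Hypothetical inline XML; the nested <author> layout is an assumption
xml_text <- '<books>
  <book>
    <title>Freakonomics</title>
    <authors><author>Steven D. Levitt</author><author>Stephen J. Dubner</author></authors>
    <year>2009</year>
    <amazon_rating>4.4</amazon_rating>
  </book>
  <book>
    <title>Blink</title>
    <authors><author>Malcolm Gladwell</author></authors>
    <year>2007</year>
    <amazon_rating>4.3</amazon_rating>
  </book>
</books>'

doc <- xmlParse(xml_text, asText = TRUE)
book_nodes <- getNodeSet(doc, "//book")

# Turn one <book> node into a one-row data frame
book_to_row <- function(node) {
  # Collect each author's text separately, then join with ", "
  authors <- xpathSApply(node, "./authors/author", xmlValue)
  data.frame(
    title = xmlValue(node[["title"]]),
    authors = paste(authors, collapse = ", "),
    year = as.numeric(xmlValue(node[["year"]])),
    amazon_rating = as.numeric(xmlValue(node[["amazon_rating"]])),
    stringsAsFactors = FALSE
  )
}

# Stack the per-book rows into a single data frame
demo_xml_df <- do.call(rbind, lapply(book_nodes, book_to_row))
demo_xml_df
```

Building the rows explicitly also lets the numeric conversion happen in the same place, so no separate as.numeric() step is needed afterwards.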

JSON File

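We will use the RJSONIO library to read the JSON file and convert it to a data frame. Note that the rbind approach below does not yield one row per book: because the first book has two authors, the result has two rows (one per author of that book), and the other books are spread across suffixed columns such as title.1 and title.2.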

jsonURL <- "https://raw.githubusercontent.com/amberferger/DATA607_HW7/master/books.json"

booksJSON <- fromJSON(jsonURL)
booksJSON.df <- do.call("rbind", lapply(booksJSON, as.data.frame))

booksJSON.df
##         book_id
## books.1       1
## books.2       1
##                                                                          title
## books.1 Freakonomics: A Rogue Economist Explores the Hidden Side of Everything
## books.2 Freakonomics: A Rogue Economist Explores the Hidden Side of Everything
##                   authors year   subject amazon_rating book_id.1
## books.1  Steven D. Levitt 2009 Economics           4.4         2
## books.2 Stephen J. Dubner 2009 Economics           4.4         2
##                                               title.1        authors.1
## books.1 Blink: The Power of Thinking Without Thinking Malcolm Gladwell
## books.2 Blink: The Power of Thinking Without Thinking Malcolm Gladwell
##         year.1  subject.1 amazon_rating.1 book_id.2
## books.1   2007 Leadership             4.3         3
## books.2   2007 Leadership             4.3         3
##                                                     title.2
## books.1 The Black Swan: The Impact of the Highly Improbable
## books.2 The Black Swan: The Impact of the Highly Improbable
##                     authors.2 year.2  subject.2 amazon_rating.2
## books.1 Nassim Nicholas Taleb   2010 Leadership               4
## books.2 Nassim Nicholas Taleb   2010 Leadership               4
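
The wide, two-row result above comes from calling as.data.frame() on the whole list of books at once. A possible alternative is to convert each book separately and collapse its authors into a single string first. The sketch below assumes a top-level "books" array whose entries hold "authors" as a JSON array; this is inferred from the row names and columns in the output, not checked against books.json. The inline `json_text`, the `book_to_row()` helper, and `demo_json_df` are hypothetical names.

```r
library(RJSONIO)

# Hypothetical inline JSON; the layout is an assumption (titles shortened for the demo)
json_text <- '{"books": [
  {"book_id": 1, "title": "Freakonomics",
   "authors": ["Steven D. Levitt", "Stephen J. Dubner"],
   "year": 2009, "subject": "Economics", "amazon_rating": 4.4},
  {"book_id": 2, "title": "Blink",
   "authors": ["Malcolm Gladwell"],
   "year": 2007, "subject": "Leadership", "amazon_rating": 4.3}
]}'

parsed <- fromJSON(json_text, asText = TRUE)

# Turn one parsed book into a one-row data frame
book_to_row <- function(book) {
  data.frame(
    book_id = as.integer(book[["book_id"]]),
    title = as.character(book[["title"]]),
    # Collapse one or more author names into a single string
    authors = paste(unlist(book[["authors"]]), collapse = ", "),
    year = as.numeric(book[["year"]]),
    subject = as.character(book[["subject"]]),
    amazon_rating = as.numeric(book[["amazon_rating"]]),
    stringsAsFactors = FALSE
  )
}

# One row per book, stacked into a single data frame
demo_json_df <- do.call(rbind, lapply(parsed[["books"]], book_to_row))
demo_json_df
```

Once each source yields one row per book, calling all.equal() on pairs of frames is one way to check whether they match in values, column names, and column types.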