Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.
Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?
Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
I chose to use the following columns for my datasets: book ID, title, authors, year, subject, and Amazon rating.
library(XML)
library(RJSONIO)
library(rvest)
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:XML':
##
## xml
library(RCurl)
## Loading required package: bitops
I will use the rvest package to read the HTML file and save its table as a data frame.
htmlURL <- "https://raw.githubusercontent.com/amberferger/DATA607_HW7/master/books.html"
readHtml <- read_html(htmlURL)
tables <- html_nodes(readHtml,"table")
tables_ls <- html_table(tables, fill = TRUE)
booksHTML.df <- as.data.frame(tables_ls)
booksHTML.df
## Book_Id
## 1 1
## 2 2
## 3 3
## Title
## 1 Freakonomics: A Rogue Economist Explores the Hidden Side of Everything
## 2 Blink: The Power of Thinking Without Thinking
## 3 The Black Swan: The Impact of the Highly Improbable
## Authors Year Subject Amazon_Rating
## 1 Steven D. Levitt, Stephen J. Dubner 2009 Economics 4.4
## 2 Malcolm Gladwell 2007 Leadership 4.3
## 3 Nassim Nicholas Taleb 2010 Leadership 4.0
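To make the HTML step easier to follow, here is a minimal sketch that runs the same rvest calls on an inline table instead of the GitHub URL. The `htmlText` string is a hypothetical stand-in for books.html, with shortened titles and fewer columns; it is not the actual file.

```r
library(rvest)

# Hypothetical stand-in for books.html (shortened; not the real file).
htmlText <- '<table>
  <tr><th>Book_Id</th><th>Title</th><th>Year</th></tr>
  <tr><td>1</td><td>Freakonomics</td><td>2009</td></tr>
  <tr><td>2</td><td>Blink</td><td>2007</td></tr>
</table>'

page <- read_html(htmlText)

# html_table() on a node set returns a list with one element per <table>.
tables <- html_table(html_nodes(page, "table"))

# Take the first (and only) table and convert it to a plain data frame.
books_df <- as.data.frame(tables[[1]])
books_df
```

Indexing with `tables[[1]]` makes it explicit that only one table is expected; if the page held several tables, each would be its own list element.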
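Next, we will use the XML library to load the XML file and convert it to a data frame. Because xmlToDataFrame reads every value as a string, we will convert the year and amazon_rating columns to numeric. Note that in the output below the two Freakonomics authors are run together with no separator ("Steven D. LevittStephen J. Dubner"), which suggests the authors are stored as nested elements whose text is concatenated during the conversion.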
xmlData <- getURL('https://raw.githubusercontent.com/amberferger/DATA607_HW7/master/books.xml', ssl.verifyhost=FALSE, ssl.verifypeer=FALSE)
booksXML <- xmlParse(file = xmlData[1])
root <- xmlRoot(booksXML)
booksXML.df <- xmlToDataFrame(root)
booksXML.df$year <- as.numeric(as.character(booksXML.df$year))
booksXML.df$amazon_rating <- as.numeric(as.character(booksXML.df$amazon_rating))
booksXML.df
## title
## 1 Freakonomics: A Rogue Economist Explores the Hidden Side of Everything
## 2 Blink: The Power of Thinking Without Thinking
## 3 The Black Swan: The Impact of the Highly Improbable
## authors year subject amazon_rating
## 1 Steven D. LevittStephen J. Dubner 2009 Economics 4.4
## 2 Malcolm Gladwell 2007 Leadership 4.3
## 3 Nassim Nicholas Taleb 2010 Leadership 4.0
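The run-together author names above could be avoided by pulling the authors out node by node. The following is only a sketch under assumptions: the inline `xmlText` is a hypothetical stand-in for books.xml, and the `<authors>`/`<author>` nesting is my guess at the file's structure based on the printed output, not something confirmed by the file itself.

```r
library(XML)

# Hypothetical stand-in for books.xml; the <author> child tag is an assumption.
xmlText <- '<books>
  <book>
    <title>Freakonomics</title>
    <authors><author>Steven D. Levitt</author><author>Stephen J. Dubner</author></authors>
  </book>
  <book>
    <title>Blink</title>
    <authors><author>Malcolm Gladwell</author></authors>
  </book>
</books>'

doc <- xmlParse(xmlText, asText = TRUE)
bookNodes <- getNodeSet(doc, "//book")

# For each book, collect its author names and join them with ", ".
authors <- sapply(bookNodes, function(node) {
  paste(xpathSApply(node, "./authors/author", xmlValue), collapse = ", ")
})
authors
```

The resulting character vector has one entry per book and could replace the authors column of the data frame, provided the books are in the same order.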
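We will use the RJSONIO library to read the JSON file and convert it to a data frame. Note that the conversion below does not produce one row per book: the result has two rows (one per Freakonomics author) and a repeated set of columns for each book, so it is not shaped like the HTML and XML data frames.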
jsonURL <- "https://raw.githubusercontent.com/amberferger/DATA607_HW7/master/books.json"
booksJSON <- fromJSON(jsonURL)
booksJSON.df <- do.call("rbind", lapply(booksJSON, as.data.frame))
booksJSON.df
## book_id
## books.1 1
## books.2 1
## title
## books.1 Freakonomics: A Rogue Economist Explores the Hidden Side of Everything
## books.2 Freakonomics: A Rogue Economist Explores the Hidden Side of Everything
## authors year subject amazon_rating book_id.1
## books.1 Steven D. Levitt 2009 Economics 4.4 2
## books.2 Stephen J. Dubner 2009 Economics 4.4 2
## title.1 authors.1
## books.1 Blink: The Power of Thinking Without Thinking Malcolm Gladwell
## books.2 Blink: The Power of Thinking Without Thinking Malcolm Gladwell
## year.1 subject.1 amazon_rating.1 book_id.2
## books.1 2007 Leadership 4.3 3
## books.2 2007 Leadership 4.3 3
## title.2
## books.1 The Black Swan: The Impact of the Highly Improbable
## books.2 The Black Swan: The Impact of the Highly Improbable
## authors.2 year.2 subject.2 amazon_rating.2
## books.1 Nassim Nicholas Taleb 2010 Leadership 4
## books.2 Nassim Nicholas Taleb 2010 Leadership 4
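One way to get one row per book from the parsed JSON is to collapse each book's authors into a single string before binding the rows. This is a sketch rather than the code used above: the `books` list is a hand-built stand-in for what I assume `booksJSON$books` looks like (inferred from the printed output), and `book_to_row` is a hypothetical helper name.

```r
# Stand-in for booksJSON$books: a list with one element per book.
# The shape is an assumption inferred from the printed output above.
books <- list(
  list(book_id = 1,
       title = "Freakonomics: A Rogue Economist Explores the Hidden Side of Everything",
       authors = c("Steven D. Levitt", "Stephen J. Dubner"),
       year = 2009, subject = "Economics", amazon_rating = 4.4),
  list(book_id = 2,
       title = "Blink: The Power of Thinking Without Thinking",
       authors = "Malcolm Gladwell",
       year = 2007, subject = "Leadership", amazon_rating = 4.3),
  list(book_id = 3,
       title = "The Black Swan: The Impact of the Highly Improbable",
       authors = "Nassim Nicholas Taleb",
       year = 2010, subject = "Leadership", amazon_rating = 4.0)
)

# Hypothetical helper: turn one book into a one-row data frame,
# collapsing multiple authors into a single comma-separated string.
book_to_row <- function(book) {
  book$authors <- paste(book$authors, collapse = ", ")
  as.data.frame(book, stringsAsFactors = FALSE)
}

# Bind the one-row data frames together: one row per book.
books_df <- do.call(rbind, lapply(books, book_to_row))
books_df
```

With all three sources in a one-row-per-book shape, `identical()` or `all.equal()` could then be used to compare them; differences in column names and column types would still need to be reconciled first.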