Grando Week 7

For this assignment, I have chosen some books written by ex-NHL players because I love all things relating to hockey and I hope to read a few of these during the upcoming break. The additional variables I have chosen are pages (the number of pages in the book) and ranking (the Amazon Best Sellers Rank).

options(width = 100)
# This is a standard setup I include so that my working
# directory is set correctly whether I work on one of my
# Windows or Linux machines.
if (Sys.info()["sysname"] == "Windows") {
    setwd("~/Masters/DATA607/Week7/Assignment")
} else {
    setwd("~/Documents/Masters/DATA607/Week7/Assignment")
}
suppressWarnings(suppressMessages(library(XML)))
suppressWarnings(suppressMessages(library(RCurl)))
suppressWarnings(suppressMessages(library(jsonlite)))
suppressWarnings(suppressMessages(library(plyr)))

Load the HTML Table

html_url <- getURL("https://raw.githubusercontent.com/john-grando/Masters/master/DATA607/Week7/Assignment/books.html?token=AXxIeh0P-FAhUUYjuzjXIh5kAO6KV_rfks5Z6VyKwA%3D%3D")
html_file <- as.data.frame(readHTMLTable(html_url))
names(html_file) <- c("Title", "Pages", "Ranking", "Author", 
    "Author2")

Load the JSON File

I know I could have arranged the JSON file to be consistent with the HTML table, but reordering the columns and splitting the author field with sapply is a neat trick.

json_url <- getURL("https://raw.githubusercontent.com/john-grando/Masters/master/DATA607/Week7/Assignment/books.json?token=AXxIek28j8YdBaH74UamYBwo1FB7WpRtks5Z6VzAwA%3D%3D")
json_file <- as.data.frame(fromJSON(json_url, simplifyDataFrame = TRUE))[, 
    c(2, 4, 5, 3)]
# Split the author field into first and second author;
# x[2] is NA when a book has only one author.
json_file$Author1 <- sapply(json_file$books.book.author, function(x) x[1])
json_file$Author2 <- sapply(json_file$books.book.author, function(x) x[2])
json_file <- json_file[, c(1, 2, 3, 5, 6)]
names(json_file) <- c("Title", "Pages", "Ranking", "Author", 
    "Author2")
html_file == json_file
##      Title Pages Ranking Author Author2
## [1,]  TRUE  TRUE    TRUE   TRUE    TRUE
## [2,]  TRUE  TRUE    TRUE   TRUE      NA
## [3,]  TRUE  TRUE    TRUE   TRUE      NA
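
The author-splitting step can be seen in isolation with a small stand-in list. This is only a sketch: `author_lists` is a hypothetical name, and it assumes the JSON author field arrives as a list holding one character vector per book, which is what the `sapply` calls above suggest.

```r
# Stand-in for the author field: one character vector per book
# (hypothetical object; the names are taken from the table in this report)
author_lists <- list(
  c("Sean Avery", "Michael McKinley"),
  "John Scott",
  "Greg Wyshynski"
)

# Indexing past the end of a vector returns NA, so books with a
# single author get NA in the second column
first_author  <- sapply(author_lists, function(x) x[1])
second_author <- sapply(author_lists, function(x) x[2])

split_authors <- data.frame(Author = first_author, Author2 = second_author,
                            stringsAsFactors = FALSE)
split_authors
```

Because `sapply` simplifies the results to a plain character vector, each new column has exactly one entry per book.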

Load the XML File

xml_url <- getURL("https://raw.githubusercontent.com/john-grando/Masters/master/DATA607/Week7/Assignment/books.xml?token=AXxIesZfmIIIHxQ3k7RW3PUbflox0UvJks5Z6V0lwA%3D%3D")
xml_file <- ldply(xmlToList(xml_url), function(x) {
    data.frame(x)
})[, c(2, 5, 6, 3, 4)]
names(xml_file) <- c("Title", "Pages", "Ranking", "Author", "Author2")
xml_file == html_file
##      Title Pages Ranking Author Author2
## [1,]  TRUE  TRUE    TRUE   TRUE    TRUE
## [2,]  TRUE  TRUE    TRUE   TRUE      NA
## [3,]  TRUE  TRUE    TRUE   TRUE      NA
xml_file == json_file
##      Title Pages Ranking Author Author2
## [1,]  TRUE  TRUE    TRUE   TRUE    TRUE
## [2,]  TRUE  TRUE    TRUE   TRUE      NA
## [3,]  TRUE  TRUE    TRUE   TRUE      NA
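
The NA entries in these comparisons appear because `==` returns NA whenever either side is NA, so the missing second authors cannot be confirmed as equal. A minimal sketch of an NA-aware check follows; `same_cells`, `a` and `b` are hypothetical names, and the two small data frames are stand-ins rather than the files loaded above.

```r
# Stand-in data frames that mimic the Author / Author2 columns
a <- data.frame(Author  = c("Sean Avery", "John Scott"),
                Author2 = c("Michael McKinley", NA),
                stringsAsFactors = FALSE)
b <- a

# Hypothetical helper: like `==`, but two NAs in the same cell count as a match
same_cells <- function(x, y) {
  eq <- x == y                      # logical matrix, NA where either side is NA
  eq[is.na(x) & is.na(y)] <- TRUE   # both missing: treat as equal
  eq
}

plain_result <- a == b              # contains NA for John Scott's Author2
aware_result <- same_cells(a, b)    # no NA left when both sides are missing
isTRUE(all(aware_result))
```

A cell where only one side is NA still comes back as NA, so `isTRUE(all(...))` reports a mismatch in that case instead of hiding it.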

Here is the resulting table:

html_file
##                                                                      Title Pages Ranking
## 1                    Ice Capades: A Memoir of Fast Living and Tough Hockey   323  132147
## 2                                  A Guy Like Me: Fighting to Make the Cut   224   35138
## 3 Take Your Eye Off the Puck: How to Watch Hockey By Knowing Where to Look   256   37643
##           Author          Author2
## 1     Sean Avery Michael McKinley
## 2     John Scott             <NA>
## 3 Greg Wyshynski             <NA>

Note that this table could be tidied by splitting it into two separate tables (one for books, one for authors), but for this assignment my goal is to make one consistent data frame across all file types, so I will not edit it further.
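
For reference, the split into a books table and an authors table could look roughly like the sketch below. It rebuilds the table by hand from the output above instead of reusing `html_file`, and the `BookID` key and the `books_wide`, `books` and `authors` names are assumptions made for illustration only.

```r
# Stand-in copy of the combined table shown above
books_wide <- data.frame(
  Title = c("Ice Capades: A Memoir of Fast Living and Tough Hockey",
            "A Guy Like Me: Fighting to Make the Cut",
            "Take Your Eye Off the Puck: How to Watch Hockey By Knowing Where to Look"),
  Pages   = c(323, 224, 256),
  Ranking = c(132147, 35138, 37643),
  Author  = c("Sean Avery", "John Scott", "Greg Wyshynski"),
  Author2 = c("Michael McKinley", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical surrogate key linking the two tables
books_wide$BookID <- seq_len(nrow(books_wide))

# One row per book
books <- books_wide[, c("BookID", "Title", "Pages", "Ranking")]

# One row per (book, author) pair: stack both author columns, drop the NAs
authors <- rbind(
  data.frame(BookID = books_wide$BookID, Author = books_wide$Author,
             stringsAsFactors = FALSE),
  data.frame(BookID = books_wide$BookID, Author = books_wide$Author2,
             stringsAsFactors = FALSE)
)
authors <- authors[!is.na(authors$Author), ]
authors <- authors[order(authors$BookID), ]
rownames(authors) <- NULL
authors
```

With this layout a book can have any number of authors without adding more `AuthorN` columns, and `merge(books, authors, by = "BookID")` brings the two tables back together.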