For this assignment, I chose books written by ex-NHL players because I love all things hockey and hope to read a few of them during the upcoming break. The additional variables I chose are pages (the number of pages in the book) and ranking (the Amazon Best Sellers Rank).
options(width = 100)
# This is a standard setup I include so that my working
# directory is set correctly whether I work on one of my
# Windows or Linux machines.
if (Sys.info()["sysname"] == "Windows") {
  setwd("~/Masters/DATA607/Week7/Assignment")
} else {
  setwd("~/Documents/Masters/DATA607/Week7/Assignment")
}
suppressWarnings(suppressMessages(library(XML)))
suppressWarnings(suppressMessages(library(RCurl)))
suppressWarnings(suppressMessages(library(jsonlite)))
suppressWarnings(suppressMessages(library(plyr)))
html_url <- getURL("https://raw.githubusercontent.com/john-grando/Masters/master/DATA607/Week7/Assignment/books.html?token=AXxIeh0P-FAhUUYjuzjXIh5kAO6KV_rfks5Z6VyKwA%3D%3D")
html_file <- as.data.frame(readHTMLTable(html_url))
names(html_file) <- c("Title", "Pages", "Ranking", "Author",
"Author2")
I know I could have arranged the JSON file to be consistent with the HTML table, but splitting the nested author list into separate columns with sapply() is a neat trick.
json_url <- getURL("https://raw.githubusercontent.com/john-grando/Masters/master/DATA607/Week7/Assignment/books.json?token=AXxIek28j8YdBaH74UamYBwo1FB7WpRtks5Z6VzAwA%3D%3D")
json_file <- as.data.frame(fromJSON(json_url, simplifyDataFrame = TRUE))[,
c(2, 4, 5, 3)]
# Pull the first and second author out of each book's author entry;
# x[2] is NA when a book has only one author.
json_file$Author1 <- sapply(json_file$books.book.author, function(x) x[1])
json_file$Author2 <- sapply(json_file$books.book.author, function(x) x[2])
json_file <- json_file[, c(1, 2, 3, 5, 6)]
names(json_file) <- c("Title", "Pages", "Ranking", "Author",
"Author2")
html_file == json_file
## Title Pages Ranking Author Author2
## [1,] TRUE TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE NA
## [3,] TRUE TRUE TRUE TRUE NA
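The author-splitting step used for the JSON file can be tried on its own, without downloading anything. The sketch below assumes the author column arrives as a list with one character vector per book, which is what the sapply() calls above suggest; the `authors` list is a hand-built stand-in, not the actual parsed file.

```r
# Hand-built stand-in for the nested author column (an assumption about
# its shape): one character vector per book, of varying length.
authors <- list(
  c("Sean Avery", "Michael McKinley"),
  "John Scott",
  "Greg Wyshynski"
)

# Indexing past the end of a character vector returns NA, so books with
# a single author get NA in the second column.
author1 <- sapply(authors, function(x) x[1])
author2 <- sapply(authors, function(x) x[2])

split_authors <- data.frame(Author = author1, Author2 = author2,
                            stringsAsFactors = FALSE)
print(split_authors)
```

This is why the Author2 column holds NA for the second and third books rather than an empty string.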
xml_url <- getURL("https://raw.githubusercontent.com/john-grando/Masters/master/DATA607/Week7/Assignment/books.xml?token=AXxIesZfmIIIHxQ3k7RW3PUbflox0UvJks5Z6V0lwA%3D%3D")
xml_file <- ldply(xmlToList(xml_url), function(x) {
data.frame(x)
})[, c(2, 5, 6, 3, 4)]
names(xml_file) <- c("Title", "Pages", "Ranking", "Author", "Author2")
xml_file == html_file
## Title Pages Ranking Author Author2
## [1,] TRUE TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE NA
## [3,] TRUE TRUE TRUE TRUE NA
xml_file == json_file
## Title Pages Ranking Author Author2
## [1,] TRUE TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE NA
## [3,] TRUE TRUE TRUE TRUE NA
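The NA entries in the comparisons above do not indicate a mismatch: `==` returns NA whenever either side is missing. As a rough sketch, a small helper can treat two NAs in the same cell as a match. The name `same_cells` and the toy data frames `a` and `b` are made up for illustration and are not part of the assignment files.

```r
# Two small made-up data frames standing in for any pair of the tables above.
a <- data.frame(Title = c("Book A", "Book B"),
                Author2 = c("Co-author", NA),
                stringsAsFactors = FALSE)
b <- a

# Plain == propagates NA wherever a value is missing.
print(a == b)

# Hypothetical helper: count a cell as equal when both sides are NA.
same_cells <- function(x, y) {
  eq <- x == y                     # logical matrix, NA where a value is missing
  both_na <- is.na(x) & is.na(y)   # TRUE only where both sides are missing
  ifelse(is.na(eq), both_na, eq)
}

print(all(same_cells(a, b)))
```

Collapsing the result with all() gives a single TRUE or FALSE for the whole table, which is easier to check at a glance than a matrix containing NA.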
Here is the resulting table:
html_file
## Title Pages Ranking
## 1 Ice Capades: A Memoir of Fast Living and Tough Hockey 323 132147
## 2 A Guy Like Me: Fighting to Make the Cut 224 35138
## 3 Take Your Eye Off the Puck: How to Watch Hockey By Knowing Where to Look 256 37643
## Author Author2
## 1 Sean Avery Michael McKinley
## 2 John Scott <NA>
## 3 Greg Wyshynski <NA>
Note: this table could be tidied by splitting it into two separate tables (one for books, one for authors), but for this assignment my goal is to make one consistent data frame across all file types, so I will not edit it further.
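For reference, the split into a books table and an authors table might look roughly like the sketch below. It rebuilds the combined table by hand from the printed output, and `book_id` is a hypothetical surrogate key introduced only to link the two tables; neither is part of the original files.

```r
# Combined table rebuilt by hand from the printed output above.
books_wide <- data.frame(
  Title = c("Ice Capades: A Memoir of Fast Living and Tough Hockey",
            "A Guy Like Me: Fighting to Make the Cut",
            "Take Your Eye Off the Puck: How to Watch Hockey By Knowing Where to Look"),
  Pages = c(323, 224, 256),
  Ranking = c(132147, 35138, 37643),
  Author = c("Sean Avery", "John Scott", "Greg Wyshynski"),
  Author2 = c("Michael McKinley", NA, NA),
  stringsAsFactors = FALSE
)

# book_id is a hypothetical key, not a column in the source data.
books_wide$book_id <- seq_len(nrow(books_wide))

# Books table: one row per book.
books <- books_wide[, c("book_id", "Title", "Pages", "Ranking")]

# Authors table: one row per (book, author) pair, dropping empty slots.
authors <- rbind(
  data.frame(book_id = books_wide$book_id, Author = books_wide$Author,
             stringsAsFactors = FALSE),
  data.frame(book_id = books_wide$book_id, Author = books_wide$Author2,
             stringsAsFactors = FALSE)
)
authors <- authors[!is.na(authors$Author), ]
authors <- authors[order(authors$book_id), ]
rownames(authors) <- NULL

print(books)
print(authors)
```

With this layout a book can have any number of authors without adding Author3, Author4, and so on, at the cost of needing a join on `book_id` to get back to the single-table view.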