Intoduction

For week 7, the task was to record book titles in several different formats and then read them into R. The book-data for this work is available in my github, and this document can be found on rpubs.

Clear the Workspace

rm(list = ls())

Load the Required Packages

library(XML)
library(RCurl)
library(rlist)
library(jsonlite)
library(compare)
library(data.table)

Load the HTML

First we’ll get the HTML file. We’ll read it from github using RCurl’s getURL

html <-getURL("https://raw.githubusercontent.com/plb2018/DATA607/master/books/data_607_books.html")


html.table <- readHTMLTable(html)
html.table <- html.table[[1]]

html.table

##                                                                    title
## 1 Shackleton's Way: Leadership Lessons from the Great Antarctic Explorer
## 2             The Last Gentleman Adventurer: Coming of Age in the Arctic
## 3                                                    The Great Explorers
##                              authors      type pages     ISBN10
## 1 Margot Morrel, Stephanie Capparell Paperback   256 0142002364
## 2           Edward Beauclerk Maurice Hardcover   416 0618517510
## 3              Robin Hanbury Tenison Hardcover   304 050025169X
##           ISBN13 amazonRating reviewCount
## 1 978-0142002360      3.7/5.0          27
## 2 978-0618517510      5.0/5.0           1
## 3 978-0500251690          N/A           0

The data is loaded to the dataframe!

Load the XML

Next we load the XML file. We’ll read the data from github in the same way as the HTML file.

xml.file <-getURL("https://raw.githubusercontent.com/plb2018/DATA607/master/books/data_607_books.xml")


xml <- xmlParse(xml.file)
xml.table <- xmlToDataFrame(xml)
xml.table

##                                                                    title
## 1 Shackleton's Way: Leadership Lessons from the Great Antarctic Explorer
## 2             The Last Gentleman Adventurer: Coming of Age in the Arctic
## 3                                                    The Great Explorers
##                              authors      type pages     isbn10
## 1 Margot Morrel, Stephanie Capparell Paperback   256 0142002364
## 2           Edward Beauclerk Maurice Hardcover   416 0618517510
## 3              Robin Hanbury Tenison Hardcover   304 050025169X
##           isbn13 amazonRating reviewCount
## 1 978-0142002360      3.7/5.0          27
## 2 978-0618517510      5.0/5.0           1
## 3 978-0500251690          N/A           0

Once again, the data is loaded to the dataframe without issue!

Load the JSON

We’ll now load the JSON using jsonlite:

json.file <-getURL("https://raw.githubusercontent.com/plb2018/DATA607/master/books/data_607_books.json")

json.table <- fromJSON(json.file)

json.table <- data.table::rbindlist(json.table)

json.table

##                                                                     title
## 1: Shackleton's Way: Leadership Lessons from the Great Antarctic Explorer
## 2:             The Last Gentleman Adventurer: Coming of Age in the Arctic
## 3:                                                    The Great Explorers
##                               authors      type pages     isbn10
## 1: Margot Morrel, Stephanie Capparell Paperback   256 0142002364
## 2:           Edward Beauclerk Maurice Hardcover   416 0618517510
## 3:              Robin Hanbury Tenison Hardcover   304 050025169X
##            isbn13 amazonRating reviewCount
## 1: 978-0142002360      3.7/5.0          27
## 2: 978-0618517510      5.0/5.0           1
## 3: 978-0500251690          N/A           0

The data is loaded.

Quick comparison

compare(html.table,xml.table,equal=TRUE)

## FALSE [TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE]

compare(html.table,json.table,equal=TRUE)

## FALSE [FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE]

We can see visually that the content is the same for all, and that the html and xml are the same, however, the json appears to be ever-so-slightly different in that the columns are factors instead of chrs. The is easy to change, as needed. Based on this experience, I’d probably say that XML was the easiest. It worked on my first attempt, and is easier to read than HTML. JSON seems like the easiest for a human to work with, and seems less verbose than XML, but i had some issues getting it to work.

DATA 607 - WEEK 7 Assignment

Paul Britton