For week 7, the task was to record book titles in several different formats and then read them into R. The book data for this work is available on my GitHub, and this document can be found on RPubs.
rm(list = ls())
library(XML)
library(RCurl)
library(rlist)
library(jsonlite)
library(compare)
library(data.table)
First we’ll get the HTML file. We’ll read it from GitHub using RCurl’s getURL.
html <- getURL("https://raw.githubusercontent.com/plb2018/DATA607/master/books/data_607_books.html")
html.table <- readHTMLTable(html)
html.table <- html.table[[1]]
html.table
## title
## 1 Shackleton's Way: Leadership Lessons from the Great Antarctic Explorer
## 2 The Last Gentleman Adventurer: Coming of Age in the Arctic
## 3 The Great Explorers
## authors type pages ISBN10
## 1 Margot Morrel, Stephanie Capparell Paperback 256 0142002364
## 2 Edward Beauclerk Maurice Hardcover 416 0618517510
## 3 Robin Hanbury Tenison Hardcover 304 050025169X
## ISBN13 amazonRating reviewCount
## 1 978-0142002360 3.7/5.0 27
## 2 978-0618517510 5.0/5.0 1
## 3 978-0500251690 N/A 0
The data is loaded to the dataframe!
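As a quick offline check of the same approach, here is a minimal sketch that parses a small inline HTML string instead of the GitHub file. The `html_text` snippet is a made-up stand-in rather than the real file, and `header = TRUE` and `stringsAsFactors = FALSE` are my own additions so that the first row supplies the column names and the columns stay character vectors.

```r
library(XML)

# Hypothetical stand-in for the downloaded HTML (not the real file)
html_text <- "<html><body><table>
<tr><th>title</th><th>pages</th></tr>
<tr><td>The Great Explorers</td><td>304</td></tr>
</table></body></html>"

# Parse the string, then pull every <table> out of the document
doc <- htmlParse(html_text, asText = TRUE)
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# readHTMLTable returns a list of tables, so take the first one
books_html <- tables[[1]]
str(books_html)
```

Without `colClasses`, every column comes back as text, so a field like pages would still need `as.integer()` before doing arithmetic on it.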
Next we load the XML file. We’ll read the data from GitHub in the same way as the HTML file.
xml.file <- getURL("https://raw.githubusercontent.com/plb2018/DATA607/master/books/data_607_books.xml")
xml <- xmlParse(xml.file)
xml.table <- xmlToDataFrame(xml)
xml.table
## title
## 1 Shackleton's Way: Leadership Lessons from the Great Antarctic Explorer
## 2 The Last Gentleman Adventurer: Coming of Age in the Arctic
## 3 The Great Explorers
## authors type pages isbn10
## 1 Margot Morrel, Stephanie Capparell Paperback 256 0142002364
## 2 Edward Beauclerk Maurice Hardcover 416 0618517510
## 3 Robin Hanbury Tenison Hardcover 304 050025169X
## isbn13 amazonRating reviewCount
## 1 978-0142002360 3.7/5.0 27
## 2 978-0618517510 5.0/5.0 1
## 3 978-0500251690 N/A 0
Once again, the data is loaded to the dataframe without issue!
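The same two steps can be tried on an inline string. This is only a sketch: the `<books>`/`<book>` layout below is my guess at a reasonable record structure, not a copy of the real file, and `stringsAsFactors = FALSE` is again my own addition.

```r
library(XML)

# Hypothetical XML with one child element per field (layout is an assumption)
xml_text <- "<books>
  <book><title>The Great Explorers</title><pages>304</pages></book>
  <book><title>The Last Gentleman Adventurer: Coming of Age in the Arctic</title><pages>416</pages></book>
</books>"

# Parse the string into an internal document
doc <- xmlParse(xml_text, asText = TRUE)

# Each child of the root becomes one row; each grandchild becomes a column
books_xml <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
str(books_xml)
```

`xmlToDataFrame` works well for flat records like these; more deeply nested XML would likely need XPath queries instead.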
We’ll now load the JSON using jsonlite:
json.file <- getURL("https://raw.githubusercontent.com/plb2018/DATA607/master/books/data_607_books.json")
json.table <- fromJSON(json.file)
json.table <- data.table::rbindlist(json.table)
json.table
## title
## 1: Shackleton's Way: Leadership Lessons from the Great Antarctic Explorer
## 2: The Last Gentleman Adventurer: Coming of Age in the Arctic
## 3: The Great Explorers
## authors type pages isbn10
## 1: Margot Morrel, Stephanie Capparell Paperback 256 0142002364
## 2: Edward Beauclerk Maurice Hardcover 416 0618517510
## 3: Robin Hanbury Tenison Hardcover 304 050025169X
## isbn13 amazonRating reviewCount
## 1: 978-0142002360 3.7/5.0 27
## 2: 978-0618517510 5.0/5.0 1
## 3: 978-0500251690 N/A 0
The data is loaded.
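The rbindlist step suggests that fromJSON returned a list rather than a bare data frame. Here is a minimal sketch of what is probably going on, assuming the file wraps its records in a single top-level key. The key name `books` and the inline string are hypothetical.

```r
library(jsonlite)

# Hypothetical JSON: an array of records under one top-level key
json_text <- '{"books": [
  {"title": "The Great Explorers", "pages": 304},
  {"title": "The Last Gentleman Adventurer: Coming of Age in the Arctic", "pages": 416}
]}'

# fromJSON returns a list with one element per top-level key
parsed <- fromJSON(json_text)

# jsonlite has already simplified the array of records into a data frame
books_json <- parsed[["books"]]
str(books_json)
```

With a structure like this, indexing the list is an alternative to rbindlist, and the result stays a plain data frame instead of becoming a data.table.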
compare(html.table, xml.table, equal = TRUE)
## FALSE [TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE]
compare(html.table, json.table, equal = TRUE)
## FALSE [FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE]
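We can see visually that the content is the same in all three tables. The comparison shows that the HTML and XML tables match column by column, although the overall result is still FALSE. The printed headers show one visible difference, ISBN10 and ISBN13 in the HTML table versus isbn10 and isbn13 in the XML table, which may account for it. The JSON table is ever so slightly more different: its columns do not share a type with the other two tables (factor on one side, character on the other). This is easy to change as needed.
Based on this experience, I’d say that XML was the easiest to load. It worked on my first attempt, and it is easier to read than HTML. JSON seems like the easiest format for a human to work with, and it is less verbose than XML, but I had some issues getting it to work.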
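One way to take type and naming differences out of the picture before comparing is to convert every column to character and lower-case the column names. The sketch below uses two small made-up tables and a hypothetical helper, `normalize_table`. It sticks to base R rather than the compare package, so it illustrates the idea and is not what was run above.

```r
# Stand-in tables: factor columns with upper-case names vs. character columns
tab_a <- data.frame(title = c("A", "B"),
                    ISBN10 = c("0142002364", "050025169X"),
                    stringsAsFactors = TRUE)
tab_b <- data.frame(title = c("A", "B"),
                    isbn10 = c("0142002364", "050025169X"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: plain data frame, character columns, lower-case names
normalize_table <- function(df) {
  df <- as.data.frame(df)
  df[] <- lapply(df, as.character)   # factors become their labels
  names(df) <- tolower(names(df))
  df
}

raw_match <- identical(tab_a, tab_b)
clean_match <- identical(normalize_table(tab_a), normalize_table(tab_b))
print(c(raw = raw_match, clean = clean_match))
```

Converting everything to character is a blunt choice. It suits a pure content check like this one, but numeric fields such as pages would need converting back before any analysis.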