Assignment
Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.
Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.
Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?
Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
Solution
I manually created three files (books.html, books.xml, books.json) with the same content and placed them in my GitHub repository.
The three files contain the same information about the three books, which was taken from real data provided by Barnes and Noble.
Library Definitions
library(knitr)
library(XML)
library(RCurl)
library(jsonlite)
jsonlite
As the library list shows, I chose the ‘jsonlite’ package to read the JSON file.
Description:
A fast JSON parser and generator optimized for statistical data and the web. Started out as a fork of ‘RJSONIO’, but has been completely rewritten in recent versions. The package offers flexible, robust, high performance tools for working with JSON in R and is particularly powerful for building pipelines and interacting with a web API. The implementation is based on the mapping described in the vignette (Ooms, 2014). In addition to converting JSON data from/to R objects, ‘jsonlite’ contains functions to stream, validate, and prettify JSON data. The unit tests included with the package verify that all edge cases are encoded and decoded consistently for use with dynamic data in systems and applications.
See official package reference manual documentation here:
https://cran.r-project.org/web/packages/jsonlite/jsonlite.pdf
HTML
.html url
url <- "https://raw.githubusercontent.com/dvillalobos/MSDA/master/607/Homework/Homework7/books.html"
html_file <- getURL(url)
.html table
html_table <- readHTMLTable(html_file, header=TRUE, which=1)
ID | Title | Author | ISBN-13 | Publisher | Publication date | Pages | Related Subject |
---|---|---|---|---|---|---|---|
01 | The Official Ubuntu Server Book | Kyle Rankin, Benjamin Mako Hill | 9780133017564 | Pearson Education | 7/12/2013 | 600 | Linux |
02 | The Official Joomla! Book | Jennifer Marriott, Elin Waring | 9780132978958 | Pearson Education | 12/29/2012 | 512 | Joomla Content Management |
03 | App Inventor 2 with MySQL database: remote management of data | Mr Antonio Taccetti | 9781537680156 | CreateSpace Publishing | 9/15/2016 | 74 | Android |
.html structure
## 'data.frame': 3 obs. of 8 variables:
## $ ID : Factor w/ 3 levels "01","02","03": 1 2 3
## $ Title : Factor w/ 3 levels "App Inventor 2 with MySQL database: remote management of data",..: 3 2 1
## $ Author : Factor w/ 3 levels "Jennifer Marriott, Elin Waring",..: 2 1 3
## $ ISBN-13 : Factor w/ 3 levels "9780132978958",..: 2 1 3
## $ Publisher : Factor w/ 2 levels "CreateSpace Publishing",..: 2 2 1
## $ Publication date: Factor w/ 3 levels "12/29/2012","7/12/2013",..: 2 1 3
## $ Pages : Factor w/ 3 levels "512","600","74": 2 1 3
## $ Related Subject : Factor w/ 3 levels "Android","Joomla Content Management",..: 3 2 1
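The structure above shows every column stored as a factor. As a minimal sketch of how that could be avoided at read time, the block below parses a small inline HTML table instead of the remote file. The inline markup, the reduced set of columns, and the names `html_text` and `books_html` are assumptions made for illustration, and it relies on `readHTMLTable` passing `stringsAsFactors` through to its table method.

```r
library(XML)

# Inline stand-in for books.html (hypothetical markup, three columns only).
html_text <- '
<html><body>
<table>
  <tr><th>ID</th><th>Title</th><th>Pages</th></tr>
  <tr><td>01</td><td>The Official Ubuntu Server Book</td><td>600</td></tr>
  <tr><td>02</td><td>The Official Joomla! Book</td><td>512</td></tr>
</table>
</body></html>'

# Parse the text explicitly as HTML content rather than as a file name.
doc <- htmlParse(html_text, asText = TRUE)

# Keep text columns as character instead of converting them to factors.
books_html <- readHTMLTable(doc, header = TRUE, which = 1,
                            stringsAsFactors = FALSE)

# Page counts arrive as text; convert them once the values are known to be numeric.
books_html$Pages <- as.integer(books_html$Pages)

str(books_html)
```

Reading the columns as character first and converting selected ones afterwards keeps identifiers such as "01" intact while still giving numeric page counts.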
XML
.xml url
url <- "https://raw.githubusercontent.com/dvillalobos/MSDA/master/607/Homework/Homework7/books.xml"
xml_file <- getURL(url)
.xml table
xml_table <- xmlToDataFrame(xml_file)
ID | Title | Author | ISBN-13 | Publisher | Publication_date | Pages | Related_Subject |
---|---|---|---|---|---|---|---|
01 | The Official Ubuntu Server Book | Kyle Rankin, Benjamin Mako Hill | 9780133017564 | Pearson Education | 7/12/2013 | 600 | Linux |
02 | The Official Joomla! Book | Jennifer Marriott, Elin Waring | 9780132978958 | Pearson Education | 12/29/2012 | 512 | Joomla Content Management |
03 | App Inventor 2 with MySQL database: remote management of data | Mr Antonio Taccetti | 9781537680156 | CreateSpace Publishing | 9/15/2016 | 74 | Android |
.xml structure
## 'data.frame': 3 obs. of 8 variables:
## $ ID : Factor w/ 3 levels "01","02","03": 1 2 3
## $ Title : Factor w/ 3 levels "App Inventor 2 with MySQL database: remote management of data",..: 3 2 1
## $ Author : Factor w/ 3 levels "Jennifer Marriott, Elin Waring",..: 2 1 3
## $ ISBN-13 : Factor w/ 3 levels "9780132978958",..: 2 1 3
## $ Publisher : Factor w/ 2 levels "CreateSpace Publishing",..: 2 2 1
## $ Publication_date: Factor w/ 3 levels "12/29/2012","7/12/2013",..: 2 1 3
## $ Pages : Factor w/ 3 levels "512","600","74": 2 1 3
## $ Related_Subject : Factor w/ 3 levels "Android","Joomla Content Management",..: 3 2 1
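Note that the XML column names use underscores (Publication_date, Related_Subject), since XML element names cannot contain spaces. The following sketch, under stated assumptions, reads a small inline XML document and converts the date column to a proper `Date`. The root element name `book-table`, the reduced set of fields, and the variable names are hypothetical; the real books.xml may be laid out differently.

```r
library(XML)

# Inline stand-in for books.xml (hypothetical layout, three fields only).
xml_text <- '<?xml version="1.0" encoding="UTF-8"?>
<book-table>
  <book>
    <ID>01</ID>
    <Title>The Official Ubuntu Server Book</Title>
    <Publication_date>7/12/2013</Publication_date>
  </book>
  <book>
    <ID>02</ID>
    <Title>The Official Joomla! Book</Title>
    <Publication_date>12/29/2012</Publication_date>
  </book>
</book-table>'

# Parse the string as XML content, then flatten each <book> into one row.
doc <- xmlParse(xml_text, asText = TRUE)
books_xml <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# The dates are written month/day/year, so state the format explicitly.
books_xml$Publication_date <- as.Date(books_xml$Publication_date,
                                      format = "%m/%d/%Y")

str(books_xml)
```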
JSON
.json url
url <- "https://raw.githubusercontent.com/dvillalobos/MSDA/master/607/Homework/Homework7/books.json"
json_file <- fromJSON(url)
.json table
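ID | Title | Author | ISBN-13 | Publisher | Publication date | Pages | Related Subject |
---|---|---|---|---|---|---|---|
01 | The Official Ubuntu Server Book | Kyle Rankin, Benjamin Mako Hill | 9780133017564 | Pearson Education | 7/12/2013 | 600 | Linux |
02 | The Official Joomla! Book | Jennifer Marriott, Elin Waring | 9780132978958 | Pearson Education | 12/29/2012 | 512 | Joomla Content Management |
03 | App Inventor 2 with MySQL database: remote management of data | Mr Antonio Taccetti | 9781537680156 | CreateSpace Publishing | 9/15/2016 | 74 | Android |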
.json structure
## List of 1
## $ book-table:List of 1
## ..$ book:'data.frame': 3 obs. of 8 variables:
## .. ..$ ID : chr [1:3] "01" "02" "03"
## .. ..$ Title : chr [1:3] "The Official Ubuntu Server Book" "The Official Joomla! Book" "App Inventor 2 with MySQL database: remote management of data"
## .. ..$ Author : chr [1:3] "Kyle Rankin, Benjamin Mako Hill" "Jennifer Marriott, Elin Waring" "Mr Antonio Taccetti"
## .. ..$ ISBN-13 : chr [1:3] "9780133017564" "9780132978958" "9781537680156"
## .. ..$ Publisher : chr [1:3] "Pearson Education" "Pearson Education" "CreateSpace Publishing"
## .. ..$ Publication date: chr [1:3] "7/12/2013" "12/29/2012" "9/15/2016"
## .. ..$ Pages : chr [1:3] "600" "512" "74"
## .. ..$ Related Subject : chr [1:3] "Linux" "Joomla Content Management" "Android"
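The structure above shows that `fromJSON` returns a nested list, with the data frame sitting under `book-table` and then `book`. The step that pulls out that data frame is not shown in the code above, so the sketch below is only one plausible way to do it. The inline JSON is a shortened stand-in for books.json, and the extraction line is an assumption rather than the original code.

```r
library(jsonlite)

# Inline stand-in for books.json (same nesting as the structure above, fewer fields).
json_text <- '{
  "book-table": {
    "book": [
      {"ID": "01", "Title": "The Official Ubuntu Server Book", "Pages": "600"},
      {"ID": "02", "Title": "The Official Joomla! Book", "Pages": "512"}
    ]
  }
}'

# fromJSON simplifies an array of records into a data frame by default.
json_file <- fromJSON(json_text)

# Assumed extraction step: walk the two list levels to reach the data frame.
# Double brackets are used because "book-table" is not a syntactic R name.
json_table <- json_file[["book-table"]][["book"]]

str(json_table)
```

Because the values are quoted in the JSON, they are read as character, which matches the `chr` columns in the structure above.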
Are the three data frames identical?
Data frame comparisons
html vs xml
html_xml <- html_table == xml_table
ID | Title | Author | ISBN-13 | Publisher | Publication date | Pages | Related Subject |
---|---|---|---|---|---|---|---|
TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE |
TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE |
TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE |
html vs json
html_json <- html_table == json_table
ID | Title | Author | ISBN-13 | Publisher | Publication date | Pages | Related Subject |
---|---|---|---|---|---|---|---|
FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
xml vs json
xml_json <- xml_table == json_table
ID | Title | Author | ISBN-13 | Publisher | Publication_date | Pages | Related_Subject |
---|---|---|---|---|---|---|---|
FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
Conclusion
Comparing the three data frames, we can conclude the following:
Display
A visual inspection of the tables shows that all three display the same, correct information.
Structure
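Analyzing the structure of the three data frames shows that the HTML and XML tables store every column as a factor, whereas the JSON table stores every column as character. The column names also differ slightly: the XML table uses underscores (Publication_date, Related_Subject) where the HTML and JSON tables use spaces.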
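A quick way to see this kind of structural difference without reading the full `str()` output is to tabulate the column classes side by side. The sketch below uses two small toy data frames built in base R; the names `from_html` and `from_json` and their values are stand-ins for illustration, not the tables loaded above.

```r
# Toy stand-ins: one frame with factor columns, one with character columns.
from_html <- data.frame(ID = c("01", "02"), Pages = c("600", "512"),
                        stringsAsFactors = TRUE)
from_json <- data.frame(ID = c("01", "02"), Pages = c("600", "512"),
                        stringsAsFactors = FALSE)

# One row per column, one column per source, each cell holding the class.
class_summary <- data.frame(
  html = sapply(from_html, class),
  json = sapply(from_json, class)
)
print(class_summary)

# A single logical answer to "do the column types match?"
same_classes <- identical(sapply(from_html, class),
                          sapply(from_json, class))
print(same_classes)
```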
Final conclusion
Based on display alone, we can be confident that the data was read correctly. A closer look, however, shows that R stores the information differently depending on the source format, so the data frames are not equal to one another. This difference can be resolved, for example by converting all columns to the same type and using the same column names, after which the three data frames should match.
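As a rough sketch of how the difference might be worked out, the block below converts every column to character, replaces underscores in column names with spaces, and then compares the results with `identical()`. The helper `normalize_books` and the three toy data frames are hypothetical and only mimic the factor, character, and naming differences described above.

```r
# Hypothetical helper: bring a books data frame to one common representation.
normalize_books <- function(df) {
  df[] <- lapply(df, as.character)                      # factors become character
  names(df) <- gsub("_", " ", names(df), fixed = TRUE)  # align XML-style names
  rownames(df) <- NULL                                  # reset row names
  df
}

# Toy stand-ins for the three sources.
from_html <- data.frame(ID = c("01", "02"),
                        `Publication date` = c("7/12/2013", "12/29/2012"),
                        stringsAsFactors = TRUE, check.names = FALSE)
from_xml  <- data.frame(ID = c("01", "02"),
                        Publication_date = c("7/12/2013", "12/29/2012"),
                        stringsAsFactors = TRUE)
from_json <- data.frame(ID = c("01", "02"),
                        `Publication date` = c("7/12/2013", "12/29/2012"),
                        stringsAsFactors = FALSE, check.names = FALSE)

clean <- lapply(list(html = from_html, xml = from_xml, json = from_json),
                normalize_books)

# identical() is strict about types, names, and attributes.
all_identical <- identical(clean$html, clean$xml) &&
  identical(clean$html, clean$json)
print(all_identical)
```

`identical()` is used here rather than an element-wise `==` because it returns a single answer for the whole data frame and also checks column types and names.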