Assignment

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.

Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

Solution

I manually created three files (books.html, books.xml, books.json) with the same content and placed them in my GitHub repository.

The three files contain the same information, which was taken from real data provided by Barnes and Noble.

Library Definitions

library(knitr)
library(XML)
library(RCurl)
library(jsonlite)

jsonlite

As the library list above shows, I chose the ‘jsonlite’ package to read the JSON file.

Description:

A fast JSON parser and generator optimized for statistical data and the web. Started out as a fork of ‘RJSONIO’, but has been completely rewritten in recent versions. The package offers flexible, robust, high performance tools for working with JSON in R and is particularly powerful for building pipelines and interacting with a web API. The implementation is based on the mapping described in the vignette (Ooms, 2014). In addition to converting JSON data from/to R objects, ‘jsonlite’ contains functions to stream, validate, and prettify JSON data. The unit tests included with the package verify that all edge cases are encoded and decoded consistently for use with dynamic data in systems and applications.

See the official package reference manual here:

https://cran.r-project.org/web/packages/jsonlite/jsonlite.pdf

HTML

.html url

url <- "https://raw.githubusercontent.com/dvillalobos/MSDA/master/607/Homework/Homework7/books.html"
html_file <- getURL(url)

.html table

html_table <- readHTMLTable(html_file, header=TRUE, which=1)
ID | Title | Author | ISBN-13 | Publisher | Publication date | Pages | Related Subject
01 | The Official Ubuntu Server Book | Kyle Rankin, Benjamin Mako Hill | 9780133017564 | Pearson Education | 7/12/2013 | 600 | Linux
02 | The Official Joomla! Book | Jennifer Marriott, Elin Waring | 9780132978958 | Pearson Education | 12/29/2012 | 512 | Joomla Content Management
03 | App Inventor 2 with MySQL database: remote management of data | Mr Antonio Taccetti | 9781537680156 | CreateSpace Publishing | 9/15/2016 | 74 | Android

.html structure

## 'data.frame':    3 obs. of  8 variables:
##  $ ID              : Factor w/ 3 levels "01","02","03": 1 2 3
##  $ Title           : Factor w/ 3 levels "App Inventor 2 with MySQL database: remote management of data",..: 3 2 1
##  $ Author          : Factor w/ 3 levels "Jennifer Marriott, Elin Waring",..: 2 1 3
##  $ ISBN-13         : Factor w/ 3 levels "9780132978958",..: 2 1 3
##  $ Publisher       : Factor w/ 2 levels "CreateSpace Publishing",..: 2 2 1
##  $ Publication date: Factor w/ 3 levels "12/29/2012","7/12/2013",..: 2 1 3
##  $ Pages           : Factor w/ 3 levels "512","600","74": 2 1 3
##  $ Related Subject : Factor w/ 3 levels "Android","Joomla Content Management",..: 3 2 1
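
The structure above lists every column as a factor, which R stores as integer codes pointing into a sorted set of unique levels. That is why Title appears as the codes 3 2 1 rather than as the titles themselves. The following is a minimal base-R sketch of that behaviour; the vector `titles` is a hand-typed stand-in for the Title column, not the downloaded table.

```r
# Hand-typed stand-in for the Title column (assumption: not read from books.html)
titles <- c("The Official Ubuntu Server Book",
            "The Official Joomla! Book",
            "App Inventor 2 with MySQL database: remote management of data")

title_factor <- factor(titles)

# The unique values, sorted; each element of the factor is an index into these
title_levels <- levels(title_factor)

# The integer codes that str() prints after the level list
title_codes <- as.integer(title_factor)

# as.character() recovers the original strings in their original order
title_text <- as.character(title_factor)

print(title_levels)
print(title_codes)
print(title_text)
```

Converting back with `as.character()` is usually the safer route when the values are text rather than categories, since the codes depend on the sort order of the levels.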

XML

.xml url

url <- "https://raw.githubusercontent.com/dvillalobos/MSDA/master/607/Homework/Homework7/books.xml"
xml_file <- getURL(url)

.xml table

xml_table <- xmlToDataFrame(xml_file)
ID | Title | Author | ISBN-13 | Publisher | Publication_date | Pages | Related_Subject
01 | The Official Ubuntu Server Book | Kyle Rankin, Benjamin Mako Hill | 9780133017564 | Pearson Education | 7/12/2013 | 600 | Linux
02 | The Official Joomla! Book | Jennifer Marriott, Elin Waring | 9780132978958 | Pearson Education | 12/29/2012 | 512 | Joomla Content Management
03 | App Inventor 2 with MySQL database: remote management of data | Mr Antonio Taccetti | 9781537680156 | CreateSpace Publishing | 9/15/2016 | 74 | Android

.xml structure

## 'data.frame':    3 obs. of  8 variables:
##  $ ID              : Factor w/ 3 levels "01","02","03": 1 2 3
##  $ Title           : Factor w/ 3 levels "App Inventor 2 with MySQL database: remote management of data",..: 3 2 1
##  $ Author          : Factor w/ 3 levels "Jennifer Marriott, Elin Waring",..: 2 1 3
##  $ ISBN-13         : Factor w/ 3 levels "9780132978958",..: 2 1 3
##  $ Publisher       : Factor w/ 2 levels "CreateSpace Publishing",..: 2 2 1
##  $ Publication_date: Factor w/ 3 levels "12/29/2012","7/12/2013",..: 2 1 3
##  $ Pages           : Factor w/ 3 levels "512","600","74": 2 1 3
##  $ Related_Subject : Factor w/ 3 levels "Android","Joomla Content Management",..: 3 2 1

JSON

.json url

url <- "https://raw.githubusercontent.com/dvillalobos/MSDA/master/607/Homework/Homework7/books.json"
json_file <- fromJSON(url)

.json table

json_table <- json_file$`book-table`$book
ID | Title | Author | ISBN-13 | Publisher | Publication date | Pages | Related Subject
01 | The Official Ubuntu Server Book | Kyle Rankin, Benjamin Mako Hill | 9780133017564 | Pearson Education | 7/12/2013 | 600 | Linux
02 | The Official Joomla! Book | Jennifer Marriott, Elin Waring | 9780132978958 | Pearson Education | 12/29/2012 | 512 | Joomla Content Management
03 | App Inventor 2 with MySQL database: remote management of data | Mr Antonio Taccetti | 9781537680156 | CreateSpace Publishing | 9/15/2016 | 74 | Android

.json structure

## List of 1
##  $ book-table:List of 1
##   ..$ book:'data.frame': 3 obs. of  8 variables:
##   .. ..$ ID              : chr [1:3] "01" "02" "03"
##   .. ..$ Title           : chr [1:3] "The Official Ubuntu Server Book" "The Official Joomla! Book" "App Inventor 2 with MySQL database: remote management of data"
##   .. ..$ Author          : chr [1:3] "Kyle Rankin, Benjamin Mako Hill" "Jennifer Marriott, Elin Waring" "Mr Antonio Taccetti"
##   .. ..$ ISBN-13         : chr [1:3] "9780133017564" "9780132978958" "9781537680156"
##   .. ..$ Publisher       : chr [1:3] "Pearson Education" "Pearson Education" "CreateSpace Publishing"
##   .. ..$ Publication date: chr [1:3] "7/12/2013" "12/29/2012" "9/15/2016"
##   .. ..$ Pages           : chr [1:3] "600" "512" "74"
##   .. ..$ Related Subject : chr [1:3] "Linux" "Joomla Content Management" "Android"
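
The structure above shows that `fromJSON()` returns a nested list, with the data frame sitting two levels down under `book-table` and `book`. The sketch below shows one way to reach it. The JSON text is a shortened, hand-written imitation of that nesting (an assumption, not the contents of books.json), and the names `json_text`, `parsed`, and `books` are illustrative only.

```r
library(jsonlite)

# Shortened, hand-written JSON that imitates the nesting shown by str()
json_text <- '{
  "book-table": {
    "book": [
      {"ID": "01", "Title": "The Official Ubuntu Server Book", "Pages": "600"},
      {"ID": "02", "Title": "The Official Joomla! Book", "Pages": "512"}
    ]
  }
}'

# fromJSON() simplifies an array of objects into a data frame by default
parsed <- fromJSON(json_text)

# Backticks are required because "book-table" is not a syntactic R name
books <- parsed$`book-table`$book

str(books)
```

Because the values are quoted strings in the JSON text, every column arrives as character, which matches the `chr` columns in the structure above.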

Are the three data frames identical?

Data frame comparisons

html vs xml

html_xml <- html_table == xml_table
ID | Title | Author | ISBN-13 | Publisher | Publication date | Pages | Related Subject
TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE
TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE
TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE

html vs json

html_json <- html_table == json_table
ID | Title | Author | ISBN-13 | Publisher | Publication date | Pages | Related Subject
FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE
FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE
FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE

xml vs json

xml_json <- xml_table == json_table
ID | Title | Author | ISBN-13 | Publisher | Publication_date | Pages | Related_Subject
FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE
FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE
FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE
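
The `==` comparisons above return one logical value per cell. For a single yes/no answer that also takes column names and column types into account, base R offers `identical()` and `all.equal()`. The sketch below is only an illustration under assumed data: `as_factor` and `as_text` are small hypothetical data frames holding the same values as factors and as character vectors, not the downloaded tables.

```r
# Hypothetical two-row frames with the same values, stored differently
as_factor <- data.frame(ID = c("01", "02"), Pages = c("600", "512"),
                        stringsAsFactors = TRUE)
as_text   <- data.frame(ID = c("01", "02"), Pages = c("600", "512"),
                        stringsAsFactors = FALSE)

# identical() is strict: a single TRUE or FALSE for the whole object.
# It is FALSE here because factor and character columns are different types.
same_object <- identical(as_factor, as_text)

# all.equal() returns TRUE, or a character vector describing the differences
differences <- all.equal(as_factor, as_text)

print(same_object)
print(differences)
```

Wrapping the second call as `isTRUE(all.equal(...))` gives a plain logical that is convenient inside an `if` statement.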

Conclusion

Comparing the three data frames, we can conclude the following:

Display

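A visual inspection of the tables shows that all three display the same, correct information. The only visible difference is in the column names: the XML version uses underscores (Publication_date, Related_Subject) because XML element names cannot contain spaces.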

Structure

Analyzing the structure of the three data frames shows that the HTML and XML tables have the same structure (every column is a factor), while the JSON table stores every column as a character vector.

Final conclusion

Based on the display alone, we can be confident that the data was read correctly from all three sources. A closer look, however, shows that R stores the information differently (factors for HTML and XML, character vectors for JSON), so the three data frames are not identical. This “problem” could be worked out, for example by converting the columns to a common type, so that all three match.
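
One way the type and naming differences could be reconciled is sketched below. This is an assumption-laden illustration rather than part of the original solution: `factors_to_character()` is a hypothetical helper, and `xml_like` and `json_like` are small hand-typed stand-ins for the real tables.

```r
# Hypothetical helper: convert every factor column of a data frame to character
factors_to_character <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  df[is_fac] <- lapply(df[is_fac], as.character)
  df
}

# Stand-in for the XML table: factor columns, underscore in the column name
xml_like <- data.frame(ID = c("01", "02"),
                       Publication_date = c("7/12/2013", "12/29/2012"),
                       stringsAsFactors = TRUE)

# Stand-in for the JSON table: character columns, space in the column name
json_like <- data.frame(ID = c("01", "02"),
                        `Publication date` = c("7/12/2013", "12/29/2012"),
                        stringsAsFactors = FALSE, check.names = FALSE)

# Step 1: bring the column types in line
xml_fixed <- factors_to_character(xml_like)

# Step 2: bring the column names in line
names(xml_fixed) <- gsub("_", " ", names(xml_fixed), fixed = TRUE)

# With types and names harmonized, the comparison should report a match
frames_match <- isTRUE(all.equal(xml_fixed, json_like))
print(frames_match)
```

An alternative is to avoid the factors at read time: both `readHTMLTable()` and `xmlToDataFrame()` accept a `stringsAsFactors` argument, so passing `stringsAsFactors = FALSE` should yield character columns from the start.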