Week 7 Assignment - Working with XML and JSON in R (D607)

Loading the required libraries

#install.packages("XML")
require(XML)

## Loading required package: XML

library(RCurl)
library(XML)
library(jsonlite)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(knitr)

Loading from HTML

# loading from HTML
html_url = "https://raw.githubusercontent.com/metis-macys-66898/data_607_sp2020/master/assignment_7.html"
html_url

## [1] "https://raw.githubusercontent.com/metis-macys-66898/data_607_sp2020/master/assignment_7.html"

html_books = readHTMLTable(getURLContent(html_url))[[1]]
kable(html_books)

title	author	ISBN-13	pages	Date_of_Published
The War on Normal People: The Truth About America’s Disappearing Jobs and Why Universal Basic Income Is Our Future	Andrew Yang	978-0316414210	304	April 2, 2019
Work Rules!: Insights from Inside Google That Will Transform How You Live and Lead	Laszlo Bock	978-1455554799	416	April 7, 2015
Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving (Chapman & Hall/CRC The R Series)	Deborah Nolan, Duncan Temple Lang	978-1482234817	539	April 21, 2015
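The HTML step can be tried without network access. The following is a minimal sketch, not the assignment's actual file: the inline `html_text` and the name `html_demo` are made up for illustration. It also passes `stringsAsFactors = FALSE`, which `readHTMLTable()` forwards when building the data frame, so text columns stay character.

```r
library(XML)

# Hypothetical inline HTML standing in for assignment_7.html
html_text <- '<html><body><table>
  <tr><th>title</th><th>pages</th></tr>
  <tr><td>Work Rules!</td><td>416</td></tr>
  <tr><td>Data Science in R</td><td>539</td></tr>
</table></body></html>'

# Parse the string explicitly as text, then read the first table on the page
doc <- htmlParse(html_text, asText = TRUE)
html_demo <- readHTMLTable(doc, stringsAsFactors = FALSE)[[1]]

# HTML carries no type information, so numeric fields are converted by hand;
# going through as.character() is safe even if a column arrived as a factor
html_demo$pages <- as.integer(as.character(html_demo$pages))
str(html_demo)
```

Converting through `as.character()` first avoids the classic pitfall of `as.integer()` returning factor level codes instead of the printed values.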

Loading from XML

# Make sure special characters in the XML file, such as ampersand (&) and apostrophe ('), are escaped or removed before parsing

xml_url = "https://raw.githubusercontent.com/metis-macys-66898/data_607_sp2020/master/assignment_7.xml"
xml_url

## [1] "https://raw.githubusercontent.com/metis-macys-66898/data_607_sp2020/master/assignment_7.xml"

xml_books = xmlToDataFrame(xmlParse(getURLContent(xml_url)))
kable(xml_books)

title	author	ISBN-13	pages	Date_of_Published
The War on Normal People: The Truth About America’s Disappearing Jobs and Why Universal Basic Income Is Our Future	Andrew Yang	978-0316414210	304	April 2, 2019
Work Rules!: Insights from Inside Google That Will Transform How You Live and Lead	Laszlo Bock	978-1455554799	416	April 7, 2015
Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving (Chapman & Hall/CRC The R Series)	Deborah Nolan, Duncan Temple Lang	978-1482234817	539	April 21, 2015
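The remark about special characters can be illustrated with a small sketch. The inline XML and the element names `books` and `book` are assumptions, since the layout of assignment_7.xml is not shown here. An ampersand written as `&amp;` parses cleanly and comes back as `&`, while a bare `&` is not well-formed XML.

```r
library(XML)

# Hypothetical inline XML; element names are assumed for illustration
xml_text <- '<books>
  <book><title>Chapman &amp; Hall/CRC The R Series</title><pages>539</pages></book>
  <book><title>Work Rules!</title><pages>416</pages></book>
</books>'

xml_demo <- xmlToDataFrame(xmlParse(xml_text, asText = TRUE),
                           stringsAsFactors = FALSE)
xml_demo$title[1]  # the parser turns &amp; back into a plain ampersand

# A bare ampersand is not well-formed, so the parse is expected to fail;
# tryCatch() records the outcome instead of stopping the script
bad_text <- '<books><book><title>Chapman & Hall</title></book></books>'
parse_ok <- tryCatch({
  xmlParse(bad_text, asText = TRUE)
  TRUE
}, error = function(e) FALSE)
parse_ok
```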

Loading from JSON

# Note that the file name in the URL is case sensitive (the extension here is .JSON, not .json)

json_url = "https://raw.githubusercontent.com/metis-macys-66898/data_607_sp2020/master/assignment_7.JSON"
json_url

## [1] "https://raw.githubusercontent.com/metis-macys-66898/data_607_sp2020/master/assignment_7.JSON"

json_books = fromJSON(json_url)[[1]]
kable(json_books)

title	author	ISBN-13	pages	Date_of_Published
The War on Normal People: The Truth About America’s Disappearing Jobs and Why Universal Basic Income Is Our Future	Andrew Yang	978-0316414210	304	April 2, 2019
Work Rules!: Insights from Inside Google That Will Transform How You Live and Lead	Laszlo Bock	978-1455554799	416	April 7, 2015
Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving (Chapman & Hall/CRC The R Series)	Deborah Nolan, Duncan Temple Lang	978-1482234817	539	April 21, 2015
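How `fromJSON()` assigns column types can be seen on a small inline document. This is a sketch under assumptions: the top-level key `books` and the two records are invented, and only the general shape (an array of book objects, with `author` as an array) is inferred from the output above.

```r
library(jsonlite)

# Hypothetical inline JSON; the top-level key name "books" is an assumption
json_text <- '{"books": [
  {"title": "Work Rules!", "author": ["Laszlo Bock"], "pages": 416},
  {"title": "Data Science in R",
   "author": ["Deborah Nolan", "Duncan Temple Lang"], "pages": 539}
]}'

# An array of objects is simplified into a data frame, one row per object
json_demo <- fromJSON(json_text)[[1]]

# Strings stay character, whole numbers become integer, and author arrays
# of differing lengths are kept as a list column
sapply(json_demo, class)
```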

Checking whether the HTML and XML data frames differ in content

all_equal(html_books, xml_books)

## [1] TRUE

Checking whether the HTML and JSON data frames differ in content

all_equal(html_books, json_books)

## [1] "Incompatible type for column `title`: x factor, y character"            
## [2] "Incompatible type for column `author`: x factor, y list"                
## [3] "Incompatible type for column `ISBN-13`: x factor, y character"          
## [4] "Incompatible type for column `pages`: x factor, y integer"              
## [5] "Incompatible type for column `Date_of_Published`: x factor, y character"

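The output shows that fromJSON() keeps the types found in the JSON source (pages is an integer, author is a list), while readHTMLTable() and xmlToDataFrame() returned every column as a factor. HTML and XML carry no type information of their own, so this conversion comes from the parsing functions rather than from the formats. The JSON-versus-XML check is skipped because the XML data frame is identical to the HTML one.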
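One way to compare contents despite the type mismatch is to harmonize the column types first. The sketch below uses small stand-in data frames with invented values, a hypothetical helper named `harmonize()`, and base R's `all.equal()` swapped in for dplyr's `all_equal()`. The list-valued author column is left out to keep the example short.

```r
# Small stand-ins for the data frames above (hypothetical values)
from_html <- data.frame(title = factor(c("A", "B")),
                        pages = factor(c("304", "416")))
from_json <- data.frame(title = c("A", "B"),
                        pages = c(304L, 416L),
                        stringsAsFactors = FALSE)

# Hypothetical helper: turn factor columns into character,
# then restore pages as an integer column
harmonize <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.factor(col)) as.character(col) else col
  })
  df$pages <- as.integer(df$pages)
  df
}

# After harmonizing, only the values themselves are compared
same_content <- isTRUE(all.equal(harmonize(from_html), harmonize(from_json)))
same_content
```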

Conclusions

The three source files contain the same content.
This is no surprise. To preserve the original data types, JSON is the better default, because fromJSON() does not coerce character vectors into factors. XML and HTML produce the same data frame and are compatible in the way they structure the data. Each has its advantages. HTML declares the column headers once up front, but in a very long table you have to scroll back to the top to see what each column means. XML is more self-describing: each book is an independent block in which every field is tagged by name.