Loading the required libraries

# install.packages("XML")  # uncomment to install the package on first use
library(RCurl)
library(XML)
library(jsonlite)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)

Loading from HTML

# Read the first table from the HTML file
html_url = "https://raw.githubusercontent.com/metis-macys-66898/data_607_sp2020/master/assignment_7.html"
html_url
## [1] "https://raw.githubusercontent.com/metis-macys-66898/data_607_sp2020/master/assignment_7.html"
html_books = readHTMLTable(getURLContent(html_url))[[1]]
kable(html_books)
| title | author | ISBN-13 | pages | Date_of_Published |
|---|---|---|---|---|
| The War on Normal People: The Truth About America’s Disappearing Jobs and Why Universal Basic Income Is Our Future | Andrew Yang | 978-0316414210 | 304 | April 2, 2019 |
| Work Rules!: Insights from Inside Google That Will Transform How You Live and Lead | Laszlo Bock | 978-1455554799 | 416 | April 7, 2015 |
| Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving (Chapman & Hall/CRC The R Series) | Deborah Nolan, Duncan Temple Lang | 978-1482234817 | 539 | April 21, 2015 |
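
An HTML table carries no type information, so every cell arrives as text. The sketch below shows one way to keep strings as plain character vectors and convert a numeric column explicitly. It uses a small inline table with made-up values rather than the remote file, and the names `html_text` and `html_demo` are illustrative only.

```r
library(XML)

# Hypothetical inline HTML standing in for the remote file
html_text <- paste0(
  "<table>",
  "<tr><th>title</th><th>pages</th></tr>",
  "<tr><td>Book A</td><td>304</td></tr>",
  "<tr><td>Book B</td><td>416</td></tr>",
  "</table>"
)

doc <- htmlParse(html_text, asText = TRUE)

# which = 1 returns the first table directly instead of a list of tables;
# stringsAsFactors = FALSE keeps text columns as character vectors
html_demo <- readHTMLTable(doc, header = TRUE, which = 1,
                           stringsAsFactors = FALSE)

# Convert the page counts from text to integers; going through
# as.character() is also safe if the column happens to be a factor
html_demo$pages <- as.integer(as.character(html_demo$pages))

str(html_demo)
```

The same two arguments should apply to the `readHTMLTable()` call on the real file, although that was not run here.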

Loading from XML

# Make sure special characters in the XML file, such as the ampersand (&), are escaped (e.g. &amp;); a bare & makes the file unparseable

xml_url = "https://raw.githubusercontent.com/metis-macys-66898/data_607_sp2020/master/assignment_7.xml"
xml_url
## [1] "https://raw.githubusercontent.com/metis-macys-66898/data_607_sp2020/master/assignment_7.xml"
xml_books = xmlToDataFrame(xmlParse(getURLContent(xml_url)))
kable(xml_books)
| title | author | ISBN-13 | pages | Date_of_Published |
|---|---|---|---|---|
| The War on Normal People: The Truth About America’s Disappearing Jobs and Why Universal Basic Income Is Our Future | Andrew Yang | 978-0316414210 | 304 | April 2, 2019 |
| Work Rules!: Insights from Inside Google That Will Transform How You Live and Lead | Laszlo Bock | 978-1455554799 | 416 | April 7, 2015 |
| Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving (Chapman & Hall/CRC The R Series) | Deborah Nolan, Duncan Temple Lang | 978-1482234817 | 539 | April 21, 2015 |
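
To illustrate the point about special characters, here is a minimal sketch using a hypothetical inline XML string (not the assignment file). The ampersand is written as `&amp;` and is decoded back to `&` when parsed. The second call shows an unescaped ampersand, which is expected to raise a parse error.

```r
library(XML)

# Hypothetical inline XML; the "&" in the first title is escaped as "&amp;"
xml_text <- paste0(
  "<books>",
  "<book><title>Data Science in R (Chapman &amp; Hall/CRC)</title>",
  "<pages>539</pages></book>",
  "<book><title>Work Rules!</title><pages>416</pages></book>",
  "</books>"
)

doc <- xmlParse(xml_text, asText = TRUE)

# One row per <book>, one column per child element
xml_demo <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
xml_demo$title[1]

# An unescaped ampersand is not well-formed XML; the parser is expected
# to fail, so the error is caught and replaced by a short message
bad <- tryCatch(
  xmlParse("<books><book><title>A & B</title></book></books>", asText = TRUE),
  error = function(e) "parse error"
)
```

As with HTML, the `pages` column comes back as text and needs an explicit `as.integer()` if a numeric column is wanted.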

Loading from JSON

# Note: the file name in the URL is case sensitive (the extension here is .JSON, not .json)

json_url = "https://raw.githubusercontent.com/metis-macys-66898/data_607_sp2020/master/assignment_7.JSON"
json_url
## [1] "https://raw.githubusercontent.com/metis-macys-66898/data_607_sp2020/master/assignment_7.JSON"
json_books = fromJSON(json_url)[[1]]
kable(json_books)
| title | author | ISBN-13 | pages | Date_of_Published |
|---|---|---|---|---|
| The War on Normal People: The Truth About America’s Disappearing Jobs and Why Universal Basic Income Is Our Future | Andrew Yang | 978-0316414210 | 304 | April 2, 2019 |
| Work Rules!: Insights from Inside Google That Will Transform How You Live and Lead | Laszlo Bock | 978-1455554799 | 416 | April 7, 2015 |
| Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving (Chapman & Hall/CRC The R Series) | Deborah Nolan, Duncan Temple Lang | 978-1482234817 | 539 | April 21, 2015 |
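
Unlike HTML and XML, JSON distinguishes strings, numbers, and arrays, and `fromJSON()` maps these onto R types. The following sketch uses a hypothetical inline JSON string with made-up values (the structure of the real file is assumed, not known) to show how the column classes come out.

```r
library(jsonlite)

# Hypothetical inline JSON: pages is a number, author is an array
json_text <- '{"books": [
  {"title": "Book A", "author": ["Author One"], "pages": 304},
  {"title": "Book B", "author": ["Author Two", "Author Three"], "pages": 539}
]}'

# The top-level object has a single element holding the array of books
json_demo <- fromJSON(json_text)[[1]]

# Inspect the class of each column
sapply(json_demo, class)
```

Because the author arrays differ in length, `author` becomes a list column, while `title` stays character and `pages` becomes integer.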

Checking whether the HTML and XML data frames differ in content

all_equal(html_books, xml_books)
## [1] TRUE

Checking whether the HTML and JSON data frames differ in content

all_equal(html_books, json_books)
## [1] "Incompatible type for column `title`: x factor, y character"            
## [2] "Incompatible type for column `author`: x factor, y list"                
## [3] "Incompatible type for column `ISBN-13`: x factor, y character"          
## [4] "Incompatible type for column `pages`: x factor, y integer"              
## [5] "Incompatible type for column `Date_of_Published`: x factor, y character"

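The comparison fails on column types rather than on values. JSON preserves the original data types (pages is an integer, and author is a list, presumably because one book has two authors), whereas the HTML and XML readers return every column as text, which was converted to factors here. The JSON versus XML check is skipped because the XML data frame is identical to the HTML one.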
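
To compare values rather than types, both data frames can first be brought to a common representation. The sketch below is one possible approach, not part of the original analysis: `html_like` and `json_like` are small hypothetical stand-ins for `html_books` and `json_books`, and `normalise()` is an illustrative helper. It uses base R's `all.equal()`, since `all_equal()` is deprecated as of dplyr 1.1.0.

```r
# Hypothetical stand-in for html_books: every column is a factor
html_like <- data.frame(
  title  = factor(c("Book A", "Book B")),
  author = factor(c("Author One", "Author Two, Author Three")),
  pages  = factor(c("304", "539"))
)

# Hypothetical stand-in for json_books: typed columns, author as a list
json_like <- data.frame(
  title = c("Book A", "Book B"),
  pages = c(304L, 539L),
  stringsAsFactors = FALSE
)
json_like$author <- list("Author One", c("Author Two", "Author Three"))

# Illustrative helper: reduce every column to plain character or integer
normalise <- function(df) {
  # Join multiple authors into one comma-separated string per book
  df$author <- vapply(
    df$author,
    function(a) paste(as.character(a), collapse = ", "),
    character(1),
    USE.NAMES = FALSE
  )
  df$title <- as.character(df$title)
  df$pages <- as.integer(as.character(df$pages))
  df[, c("title", "author", "pages")]
}

all.equal(normalise(html_like), normalise(json_like))
```

Applied to the real data frames, the helper would also need to cover the `ISBN-13` and `Date_of_Published` columns.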

Conclusions

  1. The three source files hold the same content; the only differences all_equal() reports are in column types.

  2. No surprises here. To preserve the original data types, JSON is the better default: it keeps numbers as numbers, whereas the HTML and XML readers return every column as text, which was then converted to factors. XML and HTML are otherwise largely alike and mutually compatible in how they structure data, and each has its advantage. An HTML table declares its column headers once, up front, which is tidy but awkward for a long table, where you have to scroll back to the top to see what each column means. XML uses an arguably more modern structure in which each book is a self-contained block that labels its own fields.