#install.packages("XML")
library(XML)
library(RCurl)
library(jsonlite)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
# loading from HTML
html_url = "https://raw.githubusercontent.com/metis-macys-66898/data_607_sp2020/master/assignment_7.html"
html_url
## [1] "https://raw.githubusercontent.com/metis-macys-66898/data_607_sp2020/master/assignment_7.html"
html_books = readHTMLTable(getURLContent(html_url))[[1]]
kable(html_books)
| title | author | ISBN-13 | pages | Date_of_Published |
|---|---|---|---|---|
| The War on Normal People: The Truth About America’s Disappearing Jobs and Why Universal Basic Income Is Our Future | Andrew Yang | 978-0316414210 | 304 | April 2, 2019 |
| Work Rules!: Insights from Inside Google That Will Transform How You Live and Lead | Laszlo Bock | 978-1455554799 | 416 | April 7, 2015 |
| Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving (Chapman & Hall/CRC The R Series) | Deborah Nolan, Duncan Temple Lang | 978-1482234817 | 539 | April 21, 2015 |
# make sure the XML file has no unescaped special characters such as ampersand (&) and apostrophe (')
xml_url = "https://raw.githubusercontent.com/metis-macys-66898/data_607_sp2020/master/assignment_7.xml"
xml_url
## [1] "https://raw.githubusercontent.com/metis-macys-66898/data_607_sp2020/master/assignment_7.xml"
xml_books = xmlToDataFrame(xmlParse(getURLContent(xml_url)))
kable(xml_books)
| title | author | ISBN-13 | pages | Date_of_Published |
|---|---|---|---|---|
| The War on Normal People: The Truth About America’s Disappearing Jobs and Why Universal Basic Income Is Our Future | Andrew Yang | 978-0316414210 | 304 | April 2, 2019 |
| Work Rules!: Insights from Inside Google That Will Transform How You Live and Lead | Laszlo Bock | 978-1455554799 | 416 | April 7, 2015 |
| Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving (Chapman & Hall/CRC The R Series) | Deborah Nolan, Duncan Temple Lang | 978-1482234817 | 539 | April 21, 2015 |
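The escaping requirement mentioned in the comment above can be tried on a small inline document. This is only a sketch: the two-book XML string and the names xml_text and xml_demo are made up for illustration, and it assumes the installed XML package accepts the stringsAsFactors argument of xmlToDataFrame().

```r
library(XML)

# Hypothetical two-book document; "&" is written as "&amp;" so the text is valid XML
xml_text <- '<books>
  <book><title>Work Rules!</title><pages>416</pages></book>
  <book><title>Chapman &amp; Hall</title><pages>539</pages></book>
</books>'

# Parse the string directly instead of downloading a file
doc <- xmlParse(xml_text, asText = TRUE)

# Ask for character columns instead of factors
xml_demo <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

str(xml_demo)      # inspect the column types
xml_demo$title[2]  # the parser decodes &amp; back to "&"
```

An unescaped "&" in the same string would make xmlParse() fail, which is why the source file needs the entity form.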
# note: the file name in the URL is case sensitive (the extension here is .JSON)
json_url = "https://raw.githubusercontent.com/metis-macys-66898/data_607_sp2020/master/assignment_7.JSON"
json_url
## [1] "https://raw.githubusercontent.com/metis-macys-66898/data_607_sp2020/master/assignment_7.JSON"
json_books = fromJSON(json_url)[[1]]
kable(json_books)
| title | author | ISBN-13 | pages | Date_of_Published |
|---|---|---|---|---|
| The War on Normal People: The Truth About America’s Disappearing Jobs and Why Universal Basic Income Is Our Future | Andrew Yang | 978-0316414210 | 304 | April 2, 2019 |
| Work Rules!: Insights from Inside Google That Will Transform How You Live and Lead | Laszlo Bock | 978-1455554799 | 416 | April 7, 2015 |
| Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving (Chapman & Hall/CRC The R Series) | Deborah Nolan, Duncan Temple Lang | 978-1482234817 | 539 | April 21, 2015 |
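To see which column types fromJSON() produces, a small inline document can be parsed the same way. This is a minimal sketch with invented records (json_text and json_demo are hypothetical names), not the assignment file itself. The author field is an array so that a book can have more than one author.

```r
library(jsonlite)

# Hypothetical records; "author" is a JSON array, "pages" a JSON number
json_text <- '{"books": [
  {"title": "Book A", "author": ["Author One"], "pages": 304},
  {"title": "Book B", "author": ["Author Two", "Author Three"], "pages": 539}
]}'

# The top-level object has a single element holding the table
json_demo <- fromJSON(json_text)[[1]]

# Report the class of each column
sapply(json_demo, class)
```

Because the author arrays have different lengths, jsonlite should keep that column as a list rather than flattening it.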
all_equal(html_books, xml_books)
## [1] TRUE
all_equal(html_books, json_books)
## [1] "Incompatible type for column `title`: x factor, y character"
## [2] "Incompatible type for column `author`: x factor, y list"
## [3] "Incompatible type for column `ISBN-13`: x factor, y character"
## [4] "Incompatible type for column `pages`: x factor, y integer"
## [5] "Incompatible type for column `Date_of_Published`: x factor, y character"
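The comparison shows that fromJSON() keeps type information: title, ISBN-13 and Date_of_Published come back as character, pages as integer, and author as a list. readHTMLTable() and xmlToDataFrame(), by contrast, return every column as a factor. I skipped the check between JSON and XML because the XML data frame is identical to the HTML one.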
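One way to compare the values themselves, ignoring the type differences, is to convert every column to character first. The sketch below uses small hand-built stand-ins (html_like, json_like) rather than the real data frames, and the helper as_chr() is hypothetical. It uses base all.equal() because dplyr::all_equal() is deprecated in recent dplyr releases.

```r
# Stand-in for the HTML/XML result: every column is a factor
html_like <- data.frame(
  title  = factor(c("Book A", "Book B")),
  pages  = factor(c("304", "539")),
  author = factor(c("Author One", "Author Two, Author Three"))
)

# Stand-in for the JSON result: character, integer, and a list column
json_like <- data.frame(
  title = c("Book A", "Book B"),
  pages = c(304L, 539L),
  stringsAsFactors = FALSE
)
json_like$author <- list("Author One", c("Author Two", "Author Three"))

# Hypothetical helper: turn every column into a plain character vector,
# joining multi-valued list entries with ", "
as_chr <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.list(col)) {
      vapply(col, paste, character(1), collapse = ", ")
    } else {
      as.character(col)
    }
  })
  df
}

all.equal(as_chr(html_like), as_chr(json_like))
```

If the helper were applied to html_books and json_books, a TRUE result would indicate that the two sources differ only in type, not in content.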
There are no content differences among the three source files.
No surprise here. When we want to preserve the original data types, we should default to JSON, because its character vectors are not automatically coerced into factors. XML and HTML are mostly the same and mutually compatible in how they structure the data. Each one has its advantage. HTML states the column headers once up front, but that works poorly for a very long table, where you have to scroll all the way back to the beginning to find the header definitions. XML feels like the more modern way to structure tabular information: the field names are repeated in every record, so each book is an independent, self-describing block of markup.