Overview

This is a warm-up exercise to help you become more familiar with the HTML, XML, and JSON file formats, and with using packages to read these formats into R data frames for downstream use. In the next two class weeks, we’ll be loading these file formats from the web, using web scraping and web APIs.

Load required libraries

Step 1 is to install and load the libraries required to read XML, JSON, and HTML files.

knitr::opts_chunk$set(eval = TRUE, results = FALSE)

library(tidyverse)
library(RCurl)
library(XML)
library(rvest)
library(rjson)
library(knitr)

Load HTML dataset

# Parse the HTML page, then convert its first <table> element into a data frame
books_loc <- read_html(x = "https://raw.githubusercontent.com/BharaniNittala/DATA-607/master/books.html")

books_html <- html_table(html_nodes(books_loc, "table")[[1]])

knitr::kable(books_html)
| Title | Author | Published_date | Languages_available | Available_on_Amazon | Price |
|:------|:-------|:---------------|:--------------------|:--------------------|:------|
| The Heartfulness Way | Kamlesh D Patel and Joshua Pollock | 1/16/2018 | English, Hindi, Tamil, Malayalam, Gujarati, Kannada, Marathi | Yes | $11.39 |
| The Outliers | Malcolm Gladwell | 11/18/2008 | English | Yes | $17.95 |
| The Challenger Sale | Brent Adamson and Matthew Dixon | 11/10/2011 | English | Yes | $21.00 |

Load XML dataset

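# Download the raw XML as text (despite its name, books_url holds the file content)
books_url <- getURL("https://raw.githubusercontent.com/BharaniNittala/DATA-607/master/books.xml")
# Parse the text into an XML document, then flatten its records into a data frame
books_xml <- xmlParse(books_url)
books_xml <- xmlToDataFrame(books_xml)
knitr::kable(books_xml)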
| Title | Author | Published_date | Languages_available | Available_on_Amazon | Price |
|:------|:-------|:---------------|:--------------------|:--------------------|:------|
| The Heartfulness Way | Kamlesh D Patel and Joshua Pollock | 1/16/2018 | English, Hindi, Tamil, Malayalam, Gujarati, Kannada, Marathi | Yes | $11.39 |
| The Outliers | Malcolm Gladwell | 11/18/2008 | English | Yes | $17.95 |
| The Challenger Sale | Brent Adamson and Matthew Dixon | 11/10/2011 | English | Yes | $21.00 |

Load JSON dataset

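# Download the raw JSON as text and parse it into a list with one element per book
book_jurl <- getURL("https://raw.githubusercontent.com/BharaniNittala/DATA-607/master/books.json")
book_jloc <- fromJSON(book_jurl)

# Convert each book to a one-row data frame, stack the rows, and reset the row names
books_json <- do.call("rbind", lapply(book_jloc, data.frame, stringsAsFactors = FALSE))
rownames(books_json) <- NULL
knitr::kable(books_json)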
| Title | Author | Published_date | Languages_available | Available_on_Amazon | Price |
|:------|:-------|:---------------|:--------------------|:--------------------|:------|
| The Heartfulness Way | Kamlesh D Patel and Joshua Pollock | 1/16/2018 | English, Hindi, Tamil, Malayalam, Gujarati, Kannada, Marathi | Yes | $11.39 |
| The Outliers | Malcolm Gladwell | 11/18/2008 | English | Yes | $17.95 |
| The Challenger Sale | Brent Adamson and Matthew Dixon | 11/10/2011 | English | Yes | $21.00 |
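
The row-stacking step used for the JSON file can be tried without a network connection. The sketch below is only an illustration: `parsed` is a hand-built stand-in for the list that the JSON parser returns, on the assumption that books.json is an array of flat objects, and only two of the six fields are shown.

```r
# Hypothetical stand-in for the parsed JSON: one named list per book.
# (Assumption: each book is a flat object; only Title and Price are included.)
parsed <- list(
  list(Title = "The Outliers", Price = "$17.95"),
  list(Title = "The Challenger Sale", Price = "$21.00")
)

# Turn each record into a one-row data frame, then stack the rows.
rows <- lapply(parsed, data.frame, stringsAsFactors = FALSE)
books <- do.call("rbind", rows)

# Reset the row names so they run 1..n, as in the chunk above.
rownames(books) <- NULL

str(books)
```

Because every record has the same field names, `rbind` can align the columns by name. Records with missing or nested fields would need extra handling before this step.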

Comparison

The output shows no difference between the books data read from the HTML, XML, and JSON files: the three resulting data frames are the same.

#1 Comparing HTML and XML
all.equal(books_html,books_xml)

[1] TRUE

#2 Comparing HTML and JSON
all.equal(books_html,books_json)

[1] TRUE

#3 Comparing XML and JSON
all.equal(books_xml,books_json)

[1] TRUE
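
One detail about `all.equal()` is worth keeping in mind: when two objects differ, it returns a character vector describing the differences rather than `FALSE`, so the result is usually wrapped in `isTRUE()` before being used in a condition. A minimal sketch with made-up data frames (`a` and `b` are hypothetical and not part of the books data):

```r
# Two small data frames that differ in one Price value (made-up data).
a <- data.frame(Title = c("A", "B"), Price = c("$1.00", "$2.00"),
                stringsAsFactors = FALSE)
b <- data.frame(Title = c("A", "B"), Price = c("$1.00", "$2.50"),
                stringsAsFactors = FALSE)

# isTRUE() collapses the result to a single logical value.
same_as_self <- isTRUE(all.equal(a, a))
same_as_b    <- isTRUE(all.equal(a, b))

# Without isTRUE(), a mismatch comes back as a description, not FALSE.
diff_report <- all.equal(a, b)
print(diff_report)
```

Here `same_as_self` is `TRUE` and `same_as_b` is `FALSE`, while `diff_report` holds the text explaining which component differs.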

The column classes match as well, as the Price column shows.

#1 HTML file
class(books_html$Price)

[1] "character"

#2 XML file
class(books_xml$Price)

[1] "character"

#3 JSON file
class(books_json$Price)

[1] "character"
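
Since Price is read as character in all three cases, it would need converting before any arithmetic. The sketch below shows one possible way to do that with base R. The vector `price_chr` simply repeats the three prices from the tables above, and the conversion step is an added illustration rather than part of the original exercise.

```r
# The three prices as they appear in the data frames (character strings).
price_chr <- c("$11.39", "$17.95", "$21.00")

# Strip the literal dollar sign, then convert to numeric.
# fixed = TRUE treats "$" as a plain character rather than a regex anchor.
price_num <- as.numeric(sub("$", "", price_chr, fixed = TRUE))

class(price_num)
mean(price_num)
```

The same call could be applied to `books_html$Price`, `books_xml$Price`, or `books_json$Price`, assuming every value follows the `$` plus digits pattern seen here.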

Based on the above, we can conclude that the only difference lies in how the files are parsed and loaded, each with a different set of libraries. The resulting data frames are equivalent.