Assignment 7: Working with XML and JSON

Overview

This is a warm-up exercise to help you get more familiar with the HTML, XML, and JSON file formats, and with using packages to read these formats into R data frames for downstream use. In the next two class weeks, we’ll be loading these file formats from the web, using web scraping and web APIs.

Load required libraries

Step 1 is to install and load the libraries required to read XML, JSON, and HTML files.

knitr::opts_chunk$set(eval = TRUE, results = FALSE)

library(tidyverse)
library(RCurl)
library(XML)
library(rvest)
library(rjson)
library(knitr)
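
The chunk above only loads the packages, so it fails on a machine where one of them has not been installed. A minimal sketch of the install half of this step follows. The helper name `install_if_missing` is hypothetical, and the sketch assumes a CRAN mirror is configured when it is actually called.

```r
# Hypothetical helper: install only the packages that are not yet available.
install_if_missing <- function(pkgs) {
  # Names of requested packages that are absent from the local library
  missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)
  }
  # Return the names that needed installing, without printing them
  invisible(missing_pkgs)
}

# Example call, left commented out so this chunk has no side effects:
# install_if_missing(c("tidyverse", "RCurl", "XML", "rvest", "rjson", "knitr"))
```

Calling the helper before the `library()` lines keeps the report reproducible without reinstalling packages on every knit.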

Load HTML dataset

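# Download and parse the HTML page
books_loc <- read_html(x = "https://raw.githubusercontent.com/BharaniNittala/DATA-607/master/books.html")

# Convert the first <table> element on the page into a data frame
books_html <- html_table(html_nodes(books_loc, "table")[[1]])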

knitr::kable(books_html)

Title	Author	Published_date	Languages_available	Available_on_Amazon	Price
The Heartfulness Way	Kamlesh D Patel and Joshua Pollock	1/16/2018	English, Hindi, Tamil, Malayalam, Gujarati, Kannada, Marathi	Yes	$11.39
The Outliers	Malcolm Gladwell	11/18/2008	English	Yes	$17.95
The Challenger Sale	Brent Adamson and Matthew Dixon	11/10/2011	English	Yes	$21.00
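
To see what the HTML step does without touching the network, the same calls can be run on an inline string. This is only a sketch: `sample_html` is a cut-down, two-column stand-in for `books.html`, not the real file.

```r
library(rvest)

# Cut-down stand-in for books.html (assumed structure: one table with a header row)
sample_html <- '<table>
  <tr><th>Title</th><th>Price</th></tr>
  <tr><td>The Outliers</td><td>$17.95</td></tr>
  <tr><td>The Challenger Sale</td><td>$21.00</td></tr>
</table>'

# read_html() accepts literal HTML as well as a URL or file path
sample_doc <- read_html(sample_html)

# Select every <table> node, then convert the first one to a data frame
sample_tables <- html_nodes(sample_doc, "table")
sample_books_html <- html_table(sample_tables[[1]])

names(sample_books_html)
nrow(sample_books_html)
```

The `<th>` cells become the column names, and each following `<tr>` becomes one row.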

Load XML dataset

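# Download the raw XML text
books_url <- getURL("https://raw.githubusercontent.com/BharaniNittala/DATA-607/master/books.xml")
# Parse the text into an XML tree
books_xml <- xmlParse(books_url)
# Flatten the tree: one row per record, one column per child element
books_xml <- xmlToDataFrame(books_xml)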
knitr::kable(books_xml)

Title	Author	Published_date	Languages_available	Available_on_Amazon	Price
The Heartfulness Way	Kamlesh D Patel and Joshua Pollock	1/16/2018	English, Hindi, Tamil, Malayalam, Gujarati, Kannada, Marathi	Yes	$11.39
The Outliers	Malcolm Gladwell	11/18/2008	English	Yes	$17.95
The Challenger Sale	Brent Adamson and Matthew Dixon	11/10/2011	English	Yes	$21.00
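
The XML step can be sketched the same way on an inline string. The element names `books`, `book`, `Title`, and `Price` are assumptions about the file layout made for illustration; the real `books.xml` may nest its records differently.

```r
library(XML)

# Cut-down stand-in for books.xml (assumed layout: a root with one child per book)
sample_xml <- '<books>
  <book>
    <Title>The Outliers</Title>
    <Price>$17.95</Price>
  </book>
  <book>
    <Title>The Challenger Sale</Title>
    <Price>$21.00</Price>
  </book>
</books>'

# asText = TRUE tells xmlParse() the argument is XML content, not a file name
sample_tree <- xmlParse(sample_xml, asText = TRUE)

# Each child of the root becomes a row; each of its children becomes a column
sample_books_xml <- xmlToDataFrame(sample_tree, stringsAsFactors = FALSE)

names(sample_books_xml)
nrow(sample_books_xml)
```

`xmlToDataFrame()` works best on flat records like these, where every record has the same child elements.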

Load JSON dataset

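# Download the raw JSON text
book_jurl <- getURL("https://raw.githubusercontent.com/BharaniNittala/DATA-607/master/books.json")
# Parse the text into a list with one element per book
book_jloc <- fromJSON(book_jurl)

# Turn each book into a one-row data frame, then stack the rows
books_json <- do.call("rbind", lapply(book_jloc, data.frame, stringsAsFactors = FALSE))
# Drop the row names left over from rbind
rownames(books_json) <- NULL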
knitr::kable(books_json)

Title	Author	Published_date	Languages_available	Available_on_Amazon	Price
The Heartfulness Way	Kamlesh D Patel and Joshua Pollock	1/16/2018	English, Hindi, Tamil, Malayalam, Gujarati, Kannada, Marathi	Yes	$11.39
The Outliers	Malcolm Gladwell	11/18/2008	English	Yes	$17.95
The Challenger Sale	Brent Adamson and Matthew Dixon	11/10/2011	English	Yes	$21.00
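
The `do.call("rbind", lapply(...))` idiom used above is easier to follow on a small inline example. This sketch assumes the JSON is an array of flat records, which is what that idiom requires; `sample_json` is a cut-down stand-in for `books.json`.

```r
library(rjson)

# Cut-down stand-in for books.json (assumed layout: an array of flat objects)
sample_json <- '[
  {"Title": "The Outliers", "Price": "$17.95"},
  {"Title": "The Challenger Sale", "Price": "$21.00"}
]'

# fromJSON() turns the JSON array into a list with one named list per record
sample_list <- fromJSON(json_str = sample_json)

# Convert each record to a one-row data frame, then stack the rows
sample_books_json <- do.call(
  "rbind",
  lapply(sample_list, data.frame, stringsAsFactors = FALSE)
)
rownames(sample_books_json) <- NULL

names(sample_books_json)
nrow(sample_books_json)
```

If a record contained a nested array or object, `data.frame()` would no longer produce a single row for it, so the record would need to be flattened first.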

Comparison

Viewed as output, the books data read from the HTML, XML, and JSON files shows no difference: the three data frames are the same.

#1 Comparing HTML and XML
all.equal(books_html,books_xml)

[1] TRUE

#2 Comparing HTML and JSON
all.equal(books_html,books_json)

[1] TRUE

#3 Comparing XML and JSON
all.equal(books_xml,books_json)

[1] TRUE
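
When reading these results, note that `all.equal()` returns `TRUE` only on a match; otherwise it returns a character vector describing the differences. A small sketch with made-up data frames (`df_a` and `df_b` are illustrative, not the books data) shows both outcomes and how `isTRUE()` reduces them to a single logical value.

```r
# Two small data frames that differ only in the type of the Price column
df_a <- data.frame(Title = "The Outliers", Price = "17.95", stringsAsFactors = FALSE)
df_b <- data.frame(Title = "The Outliers", Price = 17.95, stringsAsFactors = FALSE)

# TRUE on a match, otherwise a character vector describing the differences
same_result <- all.equal(df_a, df_a)
diff_result <- all.equal(df_a, df_b)

# isTRUE() collapses either outcome to a single TRUE or FALSE
isTRUE(same_result)
isTRUE(diff_result)
```

Wrapping the comparison in `isTRUE()` is the safer form inside an `if` statement, because a character vector of differences is not a valid condition.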

The Price column has the same class in all three data frames as well.

#1 HTML file
class(books_html$Price)

[1] "character"

#2 XML file
class(books_xml$Price)

[1] "character"

#3 JSON file
class(books_json$Price)

[1] "character"
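
The checks above look at one column at a time. A sketch of the same idea applied to every column at once follows; `df_one` and `df_two` are hypothetical stand-ins for any two of the parsed data frames.

```r
# Hypothetical stand-ins for two of the parsed data frames
df_one <- data.frame(Title = "The Outliers", Price = "$17.95", stringsAsFactors = FALSE)
df_two <- data.frame(Title = "The Outliers", Price = "$17.95", stringsAsFactors = FALSE)

# Class of every column, returned as a named character vector
classes_one <- sapply(df_one, class)
classes_two <- sapply(df_two, class)

# TRUE only when column names and classes agree position by position
identical(classes_one, classes_two)
```

This catches a mismatch in any column, for example a price parsed as numeric in one format and as character in another.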

Based on the above, we can conclude that the three formats differ only in how the files are parsed and loaded, each with its own set of libraries; the resulting data frames are the same.

Bharani Nittala

2020-10-10