This is a warm-up exercise to help you get more familiar with the HTML, XML, and JSON file formats, and with the packages that read these formats into R data frames for downstream use. In the next two class weeks, we'll be loading these file formats from the web, using web scraping and web APIs.
Step 1 is to install and load the libraries required to read XML, JSON, and HTML files.
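
```r
# Chunk defaults for this document
knitr::opts_chunk$set(eval = TRUE, results = FALSE)
library(tidyverse)
library(RCurl)   # getURL() to download the XML and JSON files
library(XML)     # xmlParse(), xmlToDataFrame()
library(rvest)   # read_html(), html_nodes(), html_table()
library(rjson)   # fromJSON()
library(knitr)   # kable()
```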

```r
# Read the HTML page and convert its first <table> into a data frame
books_loc <- read_html(x = "https://raw.githubusercontent.com/BharaniNittala/DATA-607/master/books.html")
books_html <- html_table(html_nodes(books_loc, "table")[[1]])
knitr::kable(books_html)
```

| Title | Author | Published_date | Languages_available | Available_on_Amazon | Price |
|---|---|---|---|---|---|
| The Heartfulness Way | Kamlesh D Patel and Joshua Pollock | 1/16/2018 | English, Hindi, Tamil, Malayalam, Gujarati, Kannada, Marathi | Yes | $11.39 |
| The Outliers | Malcolm Gladwell | 11/18/2008 | English | Yes | $17.95 |
| The Challenger Sale | Brent Adamson and Matthew Dixon | 11/10/2011 | English | Yes | $21.00 |

```r
# Download the XML as text, parse it, and flatten the records into a data frame
books_url <- getURL("https://raw.githubusercontent.com/BharaniNittala/DATA-607/master/books.xml")
books_xml <- xmlParse(books_url)
books_xml <- xmlToDataFrame(books_xml)
knitr::kable(books_xml)
```

| Title | Author | Published_date | Languages_available | Available_on_Amazon | Price |
|---|---|---|---|---|---|
| The Heartfulness Way | Kamlesh D Patel and Joshua Pollock | 1/16/2018 | English, Hindi, Tamil, Malayalam, Gujarati, Kannada, Marathi | Yes | $11.39 |
| The Outliers | Malcolm Gladwell | 11/18/2008 | English | Yes | $17.95 |
| The Challenger Sale | Brent Adamson and Matthew Dixon | 11/10/2011 | English | Yes | $21.00 |
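
```r
# Download the JSON as text and parse it into a nested list
book_jurl <- getURL("https://raw.githubusercontent.com/BharaniNittala/DATA-607/master/books.json")
book_jloc <- fromJSON(book_jurl)
# Turn each record into a one-row data frame, then stack the rows
books_json <- do.call("rbind", lapply(book_jloc, data.frame, stringsAsFactors = FALSE))
rownames(books_json) <- NULL
knitr::kable(books_json)
```
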
| Title | Author | Published_date | Languages_available | Available_on_Amazon | Price |
|---|---|---|---|---|---|
| The Heartfulness Way | Kamlesh D Patel and Joshua Pollock | 1/16/2018 | English, Hindi, Tamil, Malayalam, Gujarati, Kannada, Marathi | Yes | $11.39 |
| The Outliers | Malcolm Gladwell | 11/18/2008 | English | Yes | $17.95 |
| The Challenger Sale | Brent Adamson and Matthew Dixon | 11/10/2011 | English | Yes | $21.00 |
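
The `do.call("rbind", lapply(...))` step used for the JSON file can be hard to read at first. The sketch below illustrates the pattern in base R only. It assumes the parsed JSON is a list with one named list per book; `book_list`, `rows`, and `books_demo` are hypothetical names, and the two records are copied from the table above with only three of the fields.

```r
# Hypothetical stand-in for the parsed JSON: one named list per book
# (assumption: the real file parses to this shape).
book_list <- list(
  list(Title = "The Outliers",
       Author = "Malcolm Gladwell",
       Price = "$17.95"),
  list(Title = "The Challenger Sale",
       Author = "Brent Adamson and Matthew Dixon",
       Price = "$21.00")
)

# lapply() converts each named list into a one-row data frame.
rows <- lapply(book_list, data.frame, stringsAsFactors = FALSE)

# do.call("rbind", rows) is the same as rbind(rows[[1]], rows[[2]]),
# but works for any number of records.
books_demo <- do.call("rbind", rows)
rownames(books_demo) <- NULL

str(books_demo)
```

Passing `stringsAsFactors = FALSE` keeps the text columns as character, which matters on R versions before 4.0 where the default was to create factors.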
From the perspective of the final output, there is no difference between the books files read from the HTML, XML, and JSON formats: the three data frames are the same.

```r
# 1. Comparing HTML and XML
all.equal(books_html, books_xml)
## [1] TRUE

# 2. Comparing HTML and JSON
all.equal(books_html, books_json)
## [1] TRUE

# 3. Comparing XML and JSON
all.equal(books_xml, books_json)
## [1] TRUE
```

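
One detail about the comparisons above: `all.equal()` returns `TRUE` when two objects match, but when they differ it returns a character vector describing the differences rather than `FALSE`. A minimal base R sketch of that behavior follows; the data frames `a` and `b` are made up for illustration and reuse two titles from the tables above.

```r
# Two small illustrative data frames (hypothetical, not the assignment data)
a <- data.frame(Title = c("The Outliers", "The Challenger Sale"),
                Price = c("$17.95", "$21.00"),
                stringsAsFactors = FALSE)
b <- a
b$Price[2] <- "$20.00"   # introduce one deliberate mismatch

same <- all.equal(a, a)  # TRUE
diff <- all.equal(a, b)  # a character vector describing the mismatch

# isTRUE() gives a safe TRUE/FALSE answer for use in if() conditions
isTRUE(same)
isTRUE(diff)
```

Wrapping the call in `isTRUE()` is the usual way to use `all.equal()` inside a condition, since a character vector in `if()` would raise an error.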
The structure matches as well: for example, the Price column has class character in all three data frames.

```r
# 1. HTML file
class(books_html$Price)
## [1] "character"

# 2. XML file
class(books_xml$Price)
## [1] "character"

# 3. JSON file
class(books_json$Price)
## [1] "character"
```

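
Because Price is stored as character in all three data frames, numeric work downstream would need a conversion first. The following is a minimal sketch of one way to do that in base R, using values copied from the tables above. The vector names are hypothetical, and the date format is an assumption based on how the dates appear in the tables (month/day/year).

```r
# Values copied from the Price and Published_date columns shown above
price_chr     <- c("$11.39", "$17.95", "$21.00")
published_chr <- c("1/16/2018", "11/18/2008", "11/10/2011")

# Strip the dollar sign (and any thousands separators), then convert
price_num <- as.numeric(gsub("[$,]", "", price_chr))

# Parse month/day/year strings into Date objects
published <- as.Date(published_chr, format = "%m/%d/%Y")

class(price_num)
class(published)
```

The same two conversions could be applied to the corresponding columns of any of the three data frames, since they hold identical character values.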
Based on the above, we can conclude that the three formats differ only in how the files are parsed and loaded, each with its own set of libraries. The resulting data frames are the same.