Introduction

In this assignment I used Notepad to write the data for three books in three different formats: HTML, XML, and JSON. Next, I load each of these files into R and place each one in its own data frame.

XML Data

Using the XML and RCurl packages (xmlParse wouldn’t accept my file as XML without RCurl, most likely because it cannot fetch an https URL on its own), I download the .xml file from my GitHub repository, parse it, and put it into a data frame.

library(XML)
library(RCurl)
library(kableExtra)
xml_URL <- "https://raw.githubusercontent.com/rachel-greenlee/data607_assign7/master/book_xml.xml"
xml_link <- getURL(xml_URL)
xml_data <- xmlParse(file = xml_link)
xml_df <- xmlToDataFrame(nodes = getNodeSet(xml_data, "//Book"))
kable(xml_df, format = "markdown")
|Title|Author|Genre|Story_Location|Goodreads_Rating|Year_Published|
|:---|:---|:---|:---|:---|:---|
|The Lowland|Jhumpa Lahiri|Fiction|Calcutta, India|3.85|2013|
|An American Marriage|Tayari Jones|Fiction|Georgia, USA|3.96|2018|
|Leviathan Wakes|Daniel Abraham, Ty Franck|Science Fiction|The Milky Way|4.25|2011|
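
To show the parsing step without depending on the network, here is a minimal sketch that runs the same xmlParse and xmlToDataFrame calls on an inline string. The XML snippet is a hypothetical, shortened stand-in for book_xml.xml (I am assuming one Book element per book with one child element per field), and `xml_text` and `books` are illustrative names only.

```r
library(XML)

# Hypothetical inline XML standing in for book_xml.xml (structure assumed)
xml_text <- '<Books>
  <Book>
    <Title>The Lowland</Title>
    <Goodreads_Rating>3.85</Goodreads_Rating>
    <Year_Published>2013</Year_Published>
  </Book>
  <Book>
    <Title>Leviathan Wakes</Title>
    <Goodreads_Rating>4.25</Goodreads_Rating>
    <Year_Published>2011</Year_Published>
  </Book>
</Books>'

# asText = TRUE says explicitly that the argument is XML content, not a file path
doc <- xmlParse(xml_text, asText = TRUE)

# One row per <Book> node, one column per child element
books <- xmlToDataFrame(nodes = getNodeSet(doc, "//Book"),
                        stringsAsFactors = FALSE)

# XML carries no type information, so every column arrives as text
sapply(books, class)

# Convert the numeric fields explicitly before doing any arithmetic
books$Goodreads_Rating <- as.numeric(books$Goodreads_Rating)
books$Year_Published <- as.integer(books$Year_Published)
```

The explicit conversion at the end matters if the ratings or years are going to be summarised or plotted later.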

HTML Data

Conveniently, the XML package also provides a readHTMLTable function that reads HTML tables straight into data frames. When it is given a whole document it returns a list with one data frame per table it finds.

html_URL <- "https://raw.githubusercontent.com/rachel-greenlee/data607_assign7/master/books_html.html"
html_link <- getURL(html_URL)

html_df <- readHTMLTable(html_link)
kable(html_df, format = "markdown")
|Title|Author|Genre|Story_Location|Goodreads_Rating|Year_Published|
|:---|:---|:---|:---|:---|:---|
|The Lowland|Jhumpa Lahiri|Fiction|Calcutta, India|3.85|2013|
|An American Marriage|Tayari Jones|Fiction|Georgia, USA|3.96|2018|
|Leviathan Wakes|Daniel Abraham, Ty Franck|Science Fiction|The Milky Way|4.25|2011|
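
One detail worth checking is what readHTMLTable actually hands back. The sketch below uses a hypothetical inline table in place of books_html.html (the markup and the names `html_text`, `tables`, and `books` are my own assumptions) to show that the result for a whole document is a list, and that the data frame itself is the first element of that list.

```r
library(XML)

# Hypothetical inline HTML standing in for books_html.html (markup assumed)
html_text <- '<html><body><table>
  <tr><th>Title</th><th>Year_Published</th></tr>
  <tr><td>The Lowland</td><td>2013</td></tr>
  <tr><td>An American Marriage</td><td>2018</td></tr>
</table></body></html>'

# Parse the string as HTML content rather than as a file name
doc <- htmlParse(html_text, asText = TRUE)

# For a whole document, readHTMLTable returns a list: one element per <table>
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
class(tables)
length(tables)

# Pull out the first (here, the only) table to get an actual data frame
books <- tables[[1]]
is.data.frame(books)
```

Extracting the element with `[[1]]` is what makes the result directly comparable with the data frames built from the XML and JSON files.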

JSON Data

The jsonlite package made quick work of reading my JSON file into a data frame, and I didn’t need RCurl this time because fromJSON can read directly from a URL.

library("jsonlite")
json_df <- fromJSON("https://raw.githubusercontent.com/rachel-greenlee/data607_assign7/master/book_json.json")
kable(json_df, format = "markdown")
|Title|Author|Genre|Story_Location|Goodreads_Rating|Year_Published|
|:---|:---|:---|:---|:---|:---|
|The Lowland|Jhumpa Lahiri|Fiction|Calcutta, India|3.85|2013|
|An American Marriage|Tayari Jones|Fiction|Georgia, USA|3.96|2018|
|Leviathan Wakes|Daniel Abraham, Ty Franck|Science Fiction|The Milky Way|4.25|2011|
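
A small sketch of what fromJSON does with an array of objects, using an inline string instead of the file on GitHub. The JSON here is a hypothetical, shortened version of book_json.json, and I am assuming the ratings and years are stored as unquoted numbers; if they were quoted in the real file they would come through as character instead.

```r
library(jsonlite)

# Hypothetical inline JSON standing in for book_json.json (structure assumed)
json_text <- '[
  {"Title": "The Lowland", "Goodreads_Rating": 3.85, "Year_Published": 2013},
  {"Title": "Leviathan Wakes", "Goodreads_Rating": 4.25, "Year_Published": 2011}
]'

# An array of objects with matching keys is simplified to a data frame
books <- fromJSON(json_text)
is.data.frame(books)

# Unlike XML and HTML, JSON distinguishes strings from numbers,
# so unquoted numeric fields keep a numeric type
sapply(books, class)
```

This type difference is one reason data frames that print identically can still fail an equality check.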

Data Frame Comparisons

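By sight these data frames all seem identical, but when we use the all.equal function we see a list of differences. Note that all.equal compares only its first two arguments, so the output below describes html_df against xml_df; json_df is passed along as an extra argument and is not itself compared.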

all.equal(html_df, xml_df, json_df)
##  [1] "Names: 1 string mismatch"                                                   
##  [2] "Attributes: < names for current but not for target >"                       
##  [3] "Attributes: < Length mismatch: comparison on first 0 components >"          
##  [4] "Length mismatch: comparison on first 1 components"                          
##  [5] "Component 1: Modes: list, character"                                        
##  [6] "Component 1: names for target but not for current"                          
##  [7] "Component 1: Attributes: < Modes: list, NULL >"                             
##  [8] "Component 1: Attributes: < names for target but not for current >"          
##  [9] "Component 1: Attributes: < current is not list-like >"                      
## [10] "Component 1: Length mismatch: comparison on first 3 components"             
## [11] "Component 1: Component 1: Lengths (3, 1) differ (string compare on first 1)"
## [12] "Component 1: Component 2: Lengths (3, 1) differ (string compare on first 1)"
## [13] "Component 1: Component 2: 1 string mismatch"                                
## [14] "Component 1: Component 3: Lengths (3, 1) differ (string compare on first 1)"
## [15] "Component 1: Component 3: 1 string mismatch"
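
Since all.equal works on one pair of objects at a time, a pairwise comparison is probably the clearer way to check three data frames. The sketch below uses two small hand-built data frames as hypothetical stand-ins (`xml_like` with text columns, `json_like` with a typed column) rather than the real objects, and shows how aligning the column types makes the check pass.

```r
# Hypothetical stand-ins: same values, but the year is text in one
# and integer in the other
xml_like <- data.frame(Title = c("The Lowland", "Leviathan Wakes"),
                       Year_Published = c("2013", "2011"),
                       stringsAsFactors = FALSE)
json_like <- data.frame(Title = c("The Lowland", "Leviathan Wakes"),
                        Year_Published = c(2013L, 2011L),
                        stringsAsFactors = FALSE)

# all.equal() takes a target and a current object; compare one pair per call.
# Here it returns a description of the type difference rather than TRUE.
all.equal(xml_like, json_like)

# Align the column types, then compare again
xml_typed <- xml_like
xml_typed$Year_Published <- as.integer(xml_typed$Year_Published)

# isTRUE() is needed because all.equal() returns text when objects differ
isTRUE(all.equal(xml_typed, json_like))
```

For three data frames this means three calls, one for each pair.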

Conclusion

Loading these three formats into data frames was not as difficult as I had expected, though I see from some online resources that as the number of nodes grows, the amount of code needed grows as well. In my case, even before using kable, each data frame looks identical to the next, which is pretty amazing! However, when we dig deeper using the all.equal function, there are differences between them that could make further analysis easier or harder depending on the task.