Overview

This assignment covers parsing three data storage formats: HTML, JSON, and XML. I will build a dataframe from each file and compare their structure and ease of use.

JSON

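library(RCurl)     # getURL
library(jsonlite)  # fromJSON (assumed to be jsonlite's, since the result simplifies to a data frame)
booksJsonFile <- getURL("https://raw.githubusercontent.com/Kadaeux/DATA607XMLJSON/master/data/books.json")
booksJson <- fromJSON(booksJsonFile)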

jsondf <- booksJson$books
head(jsondf)
##                                     title                         authors
## 1      Surely You're Joking, Mr. Feynman! Ralph Leighton, Richard Feynman
## 2                              The Hobbit       John Ronald Reuel Tolkien
## 3 A People's History of the United States                     Howard Zinn
##            subject length
## 1          Physics    350
## 2          Fantasy    310
## 3 American History    729
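
The printed authors column looks like a plain string, but with jsonlite a JSON array of uneven length normally comes back as a list-column. Here is a minimal sketch to check that, using a small inline stand-in for books.json. The JSON text, and the names `sampleJson`, `sampledf`, `authorCounts` and `flatAuthors`, are my own assumptions for illustration and not the real file.

```r
library(jsonlite)

# Inline stand-in for books.json (structure assumed, trimmed to two books)
sampleJson <- '{"books": [
  {"title": "Surely You\'re Joking, Mr. Feynman!",
   "authors": ["Ralph Leighton", "Richard Feynman"],
   "subject": "Physics", "length": 350},
  {"title": "The Hobbit",
   "authors": ["John Ronald Reuel Tolkien"],
   "subject": "Fantasy", "length": 310}
]}'

sampledf <- fromJSON(sampleJson)$books

# authors is a list-column: one character vector per book
authorCounts <- lengths(sampledf$authors)

# Collapse each vector into one string to get a flat character column
flatAuthors <- sapply(sampledf$authors, paste, collapse = ", ")
```

Keeping the list-column preserves the individual names; collapsing it is only needed if a flat, HTML-style column is preferred.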

XML

library(XML)  # xmlParse, xmlRoot, xmlSApply, readHTMLTable
booksXmlFile <- getURL("https://raw.githubusercontent.com/Kadaeux/DATA607XMLJSON/master/data/books.xml")
booksXml <- xmlParse(booksXmlFile)

booksXml
## <?xml version="1.0"?>
## <books>
##   <book>
##     <title>Surely You're Joking, Mr. Feynman!</title>
##     <authors>
##       <author>Ralph Leighton</author>
##       <author>Richard Feynman</author>
##     </authors>
##     <subject>Physics</subject>
##     <length>350</length>
##   </book>
##   <book>
##     <title>The Hobbit</title>
##     <authors>John Ronald Reuel Tolkien</authors>
##     <subject>Fantasy</subject>
##     <length>310</length>
##   </book>
##   <book>
##     <title>A People's History of the United States</title>
##     <authors>Howard Zinn</authors>
##     <subject>American History</subject>
##     <length>729</length>
##   </book>
## </books>
## 

Unlike with JSON, the parsed XML object still has to be deconstructed before it becomes a dataframe. I’ll do this by getting the root node, pulling the values out of each book node (which gives a matrix with one column per book), transposing that matrix so each book is a row, and converting it to a dataframe.

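# Inner xmlSApply: text value of each child of a <book>; outer: one column per <book>; t(): one row per <book>
booksMatrix <- t(xmlSApply(xmlRoot(booksXml), function(x) xmlSApply(x, xmlValue)))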

xmldf <- as.data.frame(booksMatrix)
head(xmldf)
##                                          title                       authors
## book        Surely You're Joking, Mr. Feynman! Ralph LeightonRichard Feynman
## book.1                              The Hobbit     John Ronald Reuel Tolkien
## book.2 A People's History of the United States                   Howard Zinn
##                 subject length
## book            Physics    350
## book.1          Fantasy    310
## book.2 American History    729
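
As an aside, the XML package also has `xmlToDataFrame()`, which I believe does roughly the same flattening in one call for simple, record-like documents. This is a sketch on an inline stand-in document, not a claim that it is the better approach; the XML text and the names `sampleXml`, `sampleDoc` and `shortcutdf` are assumptions for illustration.

```r
library(XML)

# Inline stand-in for books.xml (trimmed to two books)
sampleXml <- '<books>
  <book>
    <title>Surely You\'re Joking, Mr. Feynman!</title>
    <authors><author>Ralph Leighton</author><author>Richard Feynman</author></authors>
    <subject>Physics</subject><length>350</length>
  </book>
  <book>
    <title>The Hobbit</title>
    <authors>John Ronald Reuel Tolkien</authors>
    <subject>Fantasy</subject><length>310</length>
  </book>
</books>'
sampleDoc <- xmlParse(sampleXml, asText = TRUE)

# One row per <book>, one column per child element
shortcutdf <- xmlToDataFrame(sampleDoc, stringsAsFactors = FALSE)
```

It still works from the text value of each child element, so I would not expect it to treat the nested author nodes any better than the matrix approach.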

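The authors are still an issue here: xmlValue concatenates the text of all the author nodes inside authors with no separator, so the first book’s authors come out as “Ralph LeightonRichard Feynman”. That column will need to be cleaned.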
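
One possible cleanup, sketched under the assumption that multi-author books always use author child nodes and single-author books put the name directly in authors (as in the output above), is to collect the authors per book with XPath and join them with a delimiter. The helper `getAuthors` and the names `authorXml`, `authorDoc`, `bookNodes` and `cleandf` are hypothetical, and the XML is an inline stand-in for the real file.

```r
library(XML)

# Inline stand-in for books.xml (trimmed to two books)
authorXml <- '<books>
  <book>
    <title>Surely You\'re Joking, Mr. Feynman!</title>
    <authors><author>Ralph Leighton</author><author>Richard Feynman</author></authors>
    <subject>Physics</subject><length>350</length>
  </book>
  <book>
    <title>The Hobbit</title>
    <authors>John Ronald Reuel Tolkien</authors>
    <subject>Fantasy</subject><length>310</length>
  </book>
</books>'
authorDoc <- xmlParse(authorXml, asText = TRUE)

# Hypothetical helper: prefer <author> children, fall back to the
# text of <authors> itself, then join the names with ", "
getAuthors <- function(book) {
  a <- xpathSApply(book, "./authors/author", xmlValue)
  if (length(a) == 0) a <- xpathSApply(book, "./authors", xmlValue)
  paste(unlist(a), collapse = ", ")
}

bookNodes <- getNodeSet(authorDoc, "//book")

cleandf <- data.frame(
  title   = sapply(bookNodes, function(b) xmlValue(b[["title"]])),
  authors = sapply(bookNodes, getAuthors),
  subject = sapply(bookNodes, function(b) xmlValue(b[["subject"]])),
  length  = as.numeric(sapply(bookNodes, function(b) xmlValue(b[["length"]]))),
  stringsAsFactors = FALSE
)
```

This gives the same comma-delimited authors column as the HTML version, at the cost of naming each field explicitly.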

HTML

booksHtmlFile <- getURL("https://raw.githubusercontent.com/Kadaeux/DATA607XMLJSON/master/data/books.html")
booksHtml <- readHTMLTable(booksHtmlFile, header = TRUE)

htmldf <- booksHtml$`NULL`
head(htmldf)
##                                     Title                         Authors
## 1      Surely You're Joking, Mr. Feynman! Ralph Leighton, Richard Feynman
## 2                              The Hobbit       John Ronald Reuel Tolkien
## 3 A People's History of the United States                     Howard Zinn
##            Subject Length
## 1          Physics    350
## 2          Fantasy    310
## 3 American History    729

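readHTMLTable returns a list of dataframes, one per table in the page, and names each entry after its table (as far as I can tell, from something like the table’s id attribute). My table has no such name, so its entry is called NULL, and I have to wrap that name in backticks to use it with $. Apart from that, pulling the dataframe out of the list works essentially the same way as pulling books out of the parsed JSON.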
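
Since the list name depends on how the table happens to be labelled, indexing by position may be less fragile. A minimal sketch on an inline stand-in page follows; the HTML text and the names `sampleHtml`, `htmlDoc`, `tables` and `firstTable` are assumptions for illustration, not the real books.html.

```r
library(XML)

# Inline stand-in for books.html (one unnamed table, one book)
sampleHtml <- '<html><body><table>
<tr><th>Title</th><th>Authors</th><th>Subject</th><th>Length</th></tr>
<tr><td>The Hobbit</td><td>John Ronald Reuel Tolkien</td><td>Fantasy</td><td>310</td></tr>
</table></body></html>'
htmlDoc <- htmlParse(sampleHtml, asText = TRUE)

# A list with one data frame per <table> in the document
tables <- readHTMLTable(htmlDoc, header = TRUE)

# Take the first table by position instead of by its name
firstTable <- tables[[1]]

# Cell values are read as text (or factors in older setups), so
# convert Length explicitly; as.character() makes this safe either way
firstTable$Length <- as.numeric(as.character(firstTable$Length))
```

The same conversion applies to the length column of the XML dataframe, which also comes back as text.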

Conclusion

HTML and JSON parsing produce very similar dataframes. The main difference is the authors column: JSON keeps the nested structure, so the authors array becomes a vector of names in each row, while HTML just gives a comma-delimited string. Both were exceptionally easy to parse.

XML was much more hands-on and annoying to parse, even though I would say it is a more human-readable format than HTML. Its dataframe was also much dirtier: the author names are concatenated without any delimiter, so the column needs extra cleanup before it is usable.