This assignment covers parsing three data storage formats: HTML, JSON, and XML. I build a dataframe from each file, then compare the resulting structures and how easy each format was to work with.
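library(RCurl)     # getURL()
library(jsonlite)  # fromJSON(); assumed, since $books comes back as a dataframe
library(XML)       # xmlParse(), xmlSApply(), readHTMLTable()
booksJsonFile <- getURL("https://raw.githubusercontent.com/Kadaeux/DATA607XMLJSON/master/data/books.json")
booksJson <- fromJSON(booksJsonFile)
jsondf <- booksJson$books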
head(jsondf)
##                                     title                         authors
## 1      Surely You're Joking, Mr. Feynman! Ralph Leighton, Richard Feynman
## 2                              The Hobbit       John Ronald Reuel Tolkien
## 3 A People's History of the United States                     Howard Zinn
##            subject length
## 1          Physics    350
## 2          Fantasy    310
## 3 American History    729
booksXmlFile <- getURL("https://raw.githubusercontent.com/Kadaeux/DATA607XMLJSON/master/data/books.xml")
booksXml <- xmlParse(booksXmlFile)
booksXml
## <?xml version="1.0"?>
## <books>
##   <book>
##     <title>Surely You're Joking, Mr. Feynman!</title>
##     <authors>
##       <author>Ralph Leighton</author>
##       <author>Richard Feynman</author>
##     </authors>
##     <subject>Physics</subject>
##     <length>350</length>
##   </book>
##   <book>
##     <title>The Hobbit</title>
##     <authors>John Ronald Reuel Tolkien</authors>
##     <subject>Fantasy</subject>
##     <length>310</length>
##   </book>
##   <book>
##     <title>A People's History of the United States</title>
##     <authors>Howard Zinn</authors>
##     <subject>American History</subject>
##     <length>729</length>
##   </book>
## </books>
##
Unlike JSON, the parsed XML object still has to be taken apart before I can build a dataframe. I do this by getting the root node, applying xmlValue to the children of each book node (which gives a matrix with one column per book), transposing that matrix so each book becomes a row, and converting the result to a dataframe.
booksMatrix <- t(xmlSApply(xmlRoot(booksXml), function(x) xmlSApply(x, xmlValue)))
xmldf <- as.data.frame(booksMatrix)
head(xmldf)
##                                          title                       authors
## book        Surely You're Joking, Mr. Feynman! Ralph LeightonRichard Feynman
## book.1                              The Hobbit     John Ronald Reuel Tolkien
## book.2 A People's History of the United States                   Howard Zinn
##                 subject length
## book            Physics    350
## book.1          Fantasy    310
## book.2 American History    729
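This works, but the authors of the first book are run together with no separator ("Ralph LeightonRichard Feynman"), because xmlValue concatenates the text of all the nested author nodes. That will need to be cleaned.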
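One possible way to clean this up is to read the author nodes individually and join them with a separator before building the dataframe. The following is only a sketch under assumptions: it uses an inline two-book copy of the XML so it runs offline, and `collapseAuthors` and `cleandf` are hypothetical names introduced for illustration, not part of the XML package.

```r
library(XML)

# Inline stand-in for books.xml (two records) so the sketch runs offline.
xmlText <- "<books>
  <book>
    <title>Surely You're Joking, Mr. Feynman!</title>
    <authors>
      <author>Ralph Leighton</author>
      <author>Richard Feynman</author>
    </authors>
    <subject>Physics</subject>
    <length>350</length>
  </book>
  <book>
    <title>The Hobbit</title>
    <authors>John Ronald Reuel Tolkien</authors>
    <subject>Fantasy</subject>
    <length>310</length>
  </book>
</books>"
doc <- xmlParse(xmlText, asText = TRUE)

# Hypothetical helper: join nested <author> values with ", ".
# If <authors> has no <author> children, fall back to its plain text.
collapseAuthors <- function(book) {
  authorsNode <- book[["authors"]]
  authorNames <- xpathSApply(authorsNode, "./author", xmlValue)
  if (length(authorNames) == 0) {
    trimws(xmlValue(authorsNode))
  } else {
    paste(authorNames, collapse = ", ")
  }
}

# Build the dataframe one column at a time, one row per <book> node.
books <- getNodeSet(doc, "//book")
cleandf <- data.frame(
  title   = sapply(books, function(b) xmlValue(b[["title"]])),
  authors = sapply(books, collapseAuthors),
  subject = sapply(books, function(b) xmlValue(b[["subject"]])),
  length  = as.integer(sapply(books, function(b) xmlValue(b[["length"]]))),
  stringsAsFactors = FALSE
)
print(cleandf)
```

Building the columns explicitly is more verbose than the transpose approach, but it keeps control over how each field is extracted, and it lets length be stored as an integer rather than text.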
booksHtmlFile <- getURL("https://raw.githubusercontent.com/Kadaeux/DATA607XMLJSON/master/data/books.html")
booksHtml <- readHTMLTable(booksHtmlFile,header=TRUE)
htmldf <- booksHtml$`NULL`
head(htmldf)
##                                     Title                         Authors
## 1      Surely You're Joking, Mr. Feynman! Ralph Leighton, Richard Feynman
## 2                              The Hobbit       John Ronald Reuel Tolkien
## 3 A People's History of the United States                     Howard Zinn
##            Subject Length
## 1          Physics    350
## 2          Fantasy    310
## 3 American History    729
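readHTMLTable returns a list of dataframes, one per table on the page, and names each entry after something in the table itself (as far as I can tell, its id attribute). Since my table has no id, its entry ends up under the literal name "NULL", which I access with backticks the same way I would a dataframe column. This is essentially how I pulled the books element out of the parsed JSON.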
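If relying on the "NULL" name feels fragile, selecting the table by position is an alternative. A minimal sketch, assuming a small inline HTML table in place of books.html; `firstTable` is a name introduced here for illustration:

```r
library(XML)

# Inline stand-in for books.html: one table with no id attribute.
htmlText <- "<html><body>
<table>
  <tr><th>Title</th><th>Length</th></tr>
  <tr><td>The Hobbit</td><td>310</td></tr>
</table>
</body></html>"
doc <- htmlParse(htmlText, asText = TRUE)

# readHTMLTable returns a list with one entry per <table> in the document.
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)
print(names(tables))  # shows whatever name the unnamed table was given

# Select by position instead of by name.
firstTable <- tables[[1]]
print(firstTable)
```

Positional access breaks if the page gains another table ahead of this one, so it trades one kind of fragility for another. Giving the table an id in the HTML would be the more robust fix when the file is under your control.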
HTML and JSON parsing produce very similar dataframes. The difference is in nested values: JSON turns the authors array into a vector of names inside the dataframe, while the HTML table simply holds a comma-delimited string. Both were exceptionally easy to parse.
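To make the JSON result line up exactly with the HTML one, the nested author vectors can be collapsed into comma-delimited strings. This is a sketch, assuming jsonlite is the JSON package in use and that authors is stored as an array for every book. The inline JSON is a made-up stand-in for books.json, and `flatdf` is a hypothetical name.

```r
library(jsonlite)

# Inline stand-in for books.json; the real file's exact layout is assumed.
jsonText <- '{"books": [
  {"title": "Surely You\'re Joking, Mr. Feynman!",
   "authors": ["Ralph Leighton", "Richard Feynman"],
   "subject": "Physics", "length": 350},
  {"title": "The Hobbit",
   "authors": ["John Ronald Reuel Tolkien"],
   "subject": "Fantasy", "length": 310}
]}'
flatdf <- fromJSON(jsonText)$books

# authors arrives as a list column: one character vector per book.
str(flatdf$authors)

# Collapse each vector into a single comma-delimited string.
flatdf$authors <- vapply(flatdf$authors, paste, character(1), collapse = ", ")
print(flatdf)
```

Keeping the list column is usually better for analysis, for example when counting authors per book. Collapsing it is mainly useful for display or for writing the dataframe out to CSV.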
XML was much more hands-on and annoying to parse, even though I would say it is a more human-readable format than HTML. Its dataframe was also much dirtier: the authors variable was concatenated without any delimiter, so it takes extra effort to clean up after parsing.