Data607-Week07-Working with XML and JSON in R

Working with XML and JSON in R

Books Selected

I picked the following three books at random from the Barnes & Noble website:

Elon Musk: Tesla, SpaceX, and the Quest for a Fantastic Future; by Ashlee Vance
A Brief History of Time: From the Big Bang to Black Holes; by Stephen Hawking
1066 Turned Upside Down; by Joanna Courtney / Helen Hollick / Richard Dee / Alison Morton


File Generation

I created the following three files by hand:

books.html
books.json
books.xml

Each file captures the same key attributes for every book:

the name or title of the book,
the author(s) of the book,
the ISBN and BN identifiers of the book,
the publishing house,
the date of publishing, and
the number of pages in the book.

R Data Frames

We now load the information from each of the three files into separate R data frames and compare the structures:

HTML File Handling:

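# RCurl supplies getURL(), XML supplies readHTMLTable(), DT supplies datatable()
library(RCurl)
library(XML)
library(DT)

theHtmlUrl <- "https://raw.githubusercontent.com/kamathvk1982/Data607-Week07/master/books.html"
HtmlUrldata <- getURL(theHtmlUrl)
# readHTMLTable() returns a list with one data frame per table found in the page
html.data <- readHTMLTable(HtmlUrldata, header = TRUE, stringsAsFactors = FALSE)
html.df <- as.data.frame(html.data)

datatable(html.df)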

JSON File Handling:

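# fromJSON() is assumed here to come from jsonlite, which turns an array of objects into a data frame
library(jsonlite)

theJsonUrl <- "https://raw.githubusercontent.com/kamathvk1982/Data607-Week07/master/books.json"
JsonUrldata <- getURL(theJsonUrl)
json.data <- fromJSON(JsonUrldata)
# The records sit under the top-level "books" key
json.df <- data.frame(json.data$books)

datatable(json.df)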

XML File Handling:

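theXmlUrl <- "https://raw.githubusercontent.com/kamathvk1982/Data607-Week07/master/books.xml"
XmlUrlData <- getURL(theXmlUrl)
# Parse the XML text and take the root node
xml.data <- xmlParse(XmlUrlData)
xml.root <- xmlRoot(xml.data)

# Extract the text of every field under each child of the root,
# then transpose so that each child becomes one row
xml.df <- data.frame(t(xmlSApply(xml.root, function(x) xmlSApply(x, xmlValue))), row.names = NULL)
datatable(xml.df)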
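
To compare the three structures more systematically than by eye, a small helper can tabulate dimensions and column classes. The sketch below is only an illustration: `describe_structure` is a hypothetical helper, and the two toy data frames are placeholders standing in for the loaded data so that the snippet runs without network access. It also assumes that page counts are stored as numbers in the JSON file, which may not match the actual files.

```r
# Hypothetical helper: summarise the shape and column classes of a data frame.
describe_structure <- function(df) {
  list(
    rows    = nrow(df),
    cols    = ncol(df),
    classes = vapply(df, function(col) class(col)[1], character(1))
  )
}

# Toy stand-ins (placeholder values, not the real book data).
# Text extracted from HTML or XML arrives as character columns.
from_markup <- data.frame(title = c("Book A", "Book B"),
                          pages = c("100", "200"),
                          stringsAsFactors = FALSE)
# Assumption: a JSON parser returns numeric fields as numbers.
from_json   <- data.frame(title = c("Book A", "Book B"),
                          pages = c(100L, 200L),
                          stringsAsFactors = FALSE)

structures <- lapply(list(markup = from_markup, json = from_json),
                     describe_structure)

# Compare shape and column classes between the two sources.
same_shape   <- identical(dim(from_markup), dim(from_json))
same_classes <- identical(structures$markup$classes, structures$json$classes)
print(structures)
```

With the real data, the equivalent call would be `lapply(list(html = html.df, json = json.df, xml = xml.df), describe_structure)`.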

Comments

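Based on the above, the three data structures look very similar. For the HTML file, the column names came through with an additional NULL. prefix, most likely because readHTMLTable() returns a list of tables and the single table, having no id, is stored under the name NULL. During actual processing, we would still need to check how data types are handled for each format.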
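
The NULL. prefix can be reproduced without the HTML file. The sketch below assumes that the parsed HTML arrives as a list holding one table stored under the name `NULL`; the list and its values are made up for illustration. Selecting the first list element, instead of converting the whole list, keeps the original column names, and base R's `type.convert` is one possible way to check how the all-text columns would be typed.

```r
# Stand-in for a list of parsed HTML tables: one table stored under the name "NULL".
# The values are placeholders, not the real book data.
tables <- list("NULL" = data.frame(Title = c("Book A", "Book B"),
                                   Pages = c("100", "200"),
                                   stringsAsFactors = FALSE))

# Converting the whole list prefixes each column with the list element's name.
prefixed <- as.data.frame(tables)
names(prefixed)   # "NULL.Title" "NULL.Pages"

# Taking the first table directly keeps the original column names.
clean <- tables[[1]]
names(clean)      # "Title" "Pages"

# Let R infer column types from the text; as.is = TRUE keeps strings as character.
typed <- type.convert(clean, as.is = TRUE)
sapply(typed, class)
```

Applied to the real data, this would amount to using `html.data[[1]]` in place of `as.data.frame(html.data)`, assuming the page contains a single table.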