Loading the same data from XML, HTML, and JSON files, converting each to a data frame, and comparing the results

============================================================

Libraries required

library(XML)
library(rjson)
library(plyr)
library(dplyr)

============================================================

XML File

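# Parse with the XML library. The file could not be loaded from GitHub, so it is read from another host.
url <- "http://delineator.org/storage/travelogues.xml"
xml <- xmlParse(file = url)
# Each child of the root node becomes one row of the data frame
xml2DF <- xmlToDataFrame(xml)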
xml2DF
##             author      coauthor     genre      location  subgenre
## 1 Meriwether Lewis William Clark Adventure North America    Travel
## 2     Paul Theroux                  Travel       Oceania Adventure
## 3   Alfred Lansing                 History    Antarctica Adventure
##                                        title year
## 1 The Definitive Journals of Lewis and Clark 2002
## 2                 The Happy Isles of Oceania 1992
## 3  Endurance- Shackleton's Incredible Voyage 2015
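
As a minimal sketch of the same conversion with explicit column types, the block below parses a small inline XML string instead of the hosted file. The two records and the names `xml_text`, `doc`, and `sample_df` are made up for illustration and are not the contents of travelogues.xml. `stringsAsFactors = FALSE` is passed so that columns come back as character rather than factor, and `year` is then converted by hand.

```r
library(XML)

# Hypothetical two-record sample, not the contents of travelogues.xml
xml_text <- paste0(
  "<travelogues>",
  "<book><title>Book A</title><author>Author A</author><year>2002</year></book>",
  "<book><title>Book B</title><author>Author B</author><year>1992</year></book>",
  "</travelogues>"
)

# asText = TRUE marks the argument as XML content, not a file name or URL
doc <- xmlParse(xml_text, asText = TRUE)

# Keep strings as character instead of factor
sample_df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Every column is read as text, so convert year explicitly
sample_df$year <- as.integer(sample_df$year)

str(sample_df)
```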

============================================================

JSON File

The conversion function below was adapted from a Stack Overflow answer.

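# Read the file with the rjson library; the result is a list with one element per record
json <- fromJSON(file = "https://raw.githubusercontent.com/RobertSellers/R/master/data/IS607_Homework8/travelogues.json")

# Adapted from a Stack Overflow answer: replace any nested list inside a record
# with NA so that the record converts cleanly to a one-row data frame.
json <- lapply(json, function(j) {
  as.data.frame(replace(j, sapply(j, is.list), NA))
})

# Stack the one-row data frames with rbind.fill from the plyr library
json2DF <- rbind.fill(json)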
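
To see what the two steps above do, here is a small sketch that runs them on an inline JSON string rather than the hosted file. The records, the nested `publisher` field, and the names `json_text`, `records`, `rows`, and `sample_json_df` are hypothetical and chosen only to exercise both cases: a nested object that is replaced with NA, and a field (`coauthor`) that is missing from one record and is filled in by `rbind.fill`.

```r
library(rjson)
library(plyr)

# Hypothetical records, not the contents of travelogues.json.
# The second record lacks "coauthor" and carries a nested "publisher" object.
json_text <- '[
  {"title": "Book A", "author": "Author A", "coauthor": "Author C", "year": "2002"},
  {"title": "Book B", "author": "Author B", "year": "1992",
   "publisher": {"name": "P", "city": "Q"}}
]'

records <- fromJSON(json_str = json_text)

# Nested lists cannot sit in a single data frame cell, so they become NA
rows <- lapply(records, function(j) {
  as.data.frame(replace(j, sapply(j, is.list), NA))
})

# rbind.fill pads columns that are absent from a record with NA
sample_json_df <- rbind.fill(rows)
sample_json_df
```

The nested `publisher` details are discarded by this approach, which is acceptable only when, as here, the nested fields are not needed.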

============================================================

HTML File

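# Read with the XML library; this file is also hosted outside GitHub
url <- "http://delineator.org/storage/travelogues.html"
# readHTMLTable returns a list with one element per table found in the page
html <- readHTMLTable(url)
# Convert the whole list to a single data frame
html2DF <- as.data.frame(html)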
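
The “NULL.” prefix that shows up on the HTML column names in the comparison most likely arises because the whole list of tables is converted, and the list element for a table without an id is labeled NULL. As a hedged sketch of an alternative, the block below extracts the first table from the list instead. The inline table and the names `html_text`, `doc`, `tables`, and `sample_html_df` are invented for illustration and are not the contents of travelogues.html.

```r
library(XML)

# Hypothetical two-row table, not the contents of travelogues.html
html_text <- paste0(
  "<html><body><table>",
  "<tr><th>Title</th><th>Year</th></tr>",
  "<tr><td>Book A</td><td>2002</td></tr>",
  "<tr><td>Book B</td><td>1992</td></tr>",
  "</table></body></html>"
)

# asText = TRUE marks the argument as HTML content, not a file name or URL
doc <- htmlParse(html_text, asText = TRUE)

# One list element per <table>; keep strings as character instead of factor
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Take the table itself rather than converting the whole list,
# so the column names carry no list-element prefix
sample_html_df <- tables[[1]]
names(sample_html_df)
```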

============================================================

Data Comparison

glimpse(html2DF)
## Observations: 3
## Variables: 7
## $ NULL.Title    (fctr) The Definitive Journals of Lewis and Clark, The...
## $ NULL.Author   (fctr) Meriwether Lewis, Paul Theroux, Alfred Lansing
## $ NULL.Coauthor (fctr) William Clark, , 
## $ NULL.Year     (fctr) 2002, 1992, 2015
## $ NULL.Genre    (fctr) Adventure, Travel, History
## $ NULL.Subgenre (fctr) Travel, Adventure, Adventure
## $ NULL.Location (fctr) North America, Oceania, Antarctica
glimpse(json2DF)
## Observations: 3
## Variables: 7
## $ title    (fctr) The Definitive Journals of Lewis and Clark, The Happ...
## $ author   (fctr) Meriwether Lewis, Paul Theroux, Alfred Lansing
## $ coauthor (fctr) William Clark, , 
## $ year     (fctr) 2002, 1992, 2015
## $ genre    (fctr) Adventure, Travel, History
## $ subgenre (fctr) Travel, Adventure, Adventure
## $ location (fctr) North America, Oceania, Antarctica
glimpse(xml2DF)
## Observations: 3
## Variables: 7
## $ author   (fctr) Meriwether Lewis, Paul Theroux, Alfred Lansing
## $ coauthor (fctr) William Clark, , 
## $ genre    (fctr) Adventure, Travel, History
## $ location (fctr) North America, Oceania, Antarctica
## $ subgenre (fctr) Travel, Adventure, Adventure
## $ title    (fctr) The Definitive Journals of Lewis and Clark, The Happ...
## $ year     (fctr) 2002, 1992, 2015

============================================================

Discussion

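The three data frames hold the same values but differ in two visible ways. The column order is not the same: the XML result is ordered alphabetically, while the HTML and JSON results run title, author, coauthor, year, genre, subgenre, location. The HTML result also carries a “NULL.” prefix on every column name, and its names are capitalized. Beyond that, each file type needs its own loading mechanism and its own transformation steps, including the definition of data types: every column, year included, arrives as a factor. Further refinement would be necessary to make the three data frames identical.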
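
One possible refinement is sketched below in base R. It is not part of the original analysis: the `harmonize` function and the three stand-in data frames (`html_df`, `json_df`, `xml_df`) are hypothetical, built with made-up values and only two columns to mimic the shapes seen above. The idea is to strip the “NULL.” prefix, lowercase the names, convert factors to character, set `year` to integer, and put the columns in a common order before comparing.

```r
# Stand-in data frames with hypothetical values, mimicking the three results
html_df <- data.frame(NULL.Title = c("Book A", "Book B"),
                      NULL.Year  = c("2002", "1992"),
                      stringsAsFactors = TRUE)
json_df <- data.frame(title = c("Book A", "Book B"),
                      year  = c("2002", "1992"),
                      stringsAsFactors = TRUE)
xml_df  <- data.frame(year  = c("2002", "1992"),
                      title = c("Book A", "Book B"),
                      stringsAsFactors = TRUE)

harmonize <- function(df) {
  # Drop the "NULL." prefix and lowercase the column names
  names(df) <- tolower(sub("^NULL\\.", "", names(df)))
  # Convert factors to character so values, not level codes, are compared
  df[] <- lapply(df, as.character)
  # Give year a numeric type
  df$year <- as.integer(df$year)
  # Put columns in one common (alphabetical) order
  df[, sort(names(df)), drop = FALSE]
}

html_clean <- harmonize(html_df)
json_clean <- harmonize(json_df)
xml_clean  <- harmonize(xml_df)

# all.equal returns TRUE when two data frames match
isTRUE(all.equal(html_clean, json_clean))
isTRUE(all.equal(json_clean, xml_clean))
```

Applied to `html2DF`, `json2DF`, and `xml2DF`, the same function would be expected to work as long as each has a `year` column, though that has not been run against the original files here.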

============================================================