Let’s start by looking at the XML file.
library(XML)
library(RCurl)
## Loading required package: bitops
library(bitops)
# The file is hosted on my GitHub page
docurl <- "https://raw.githubusercontent.com/mkollontai/DATA607/master/HW7/books.XML"
x <- getURL(docurl)
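# Parse the XML data and isolate the root node.
# (Named `root` so the variable does not shadow the xmlRoot() function.)
root <- xmlRoot(xmlParse(x))
# Apply xmlValue to every field of every BOOK node.
# The result is a character matrix with one column per book.
xmlData <- xmlSApply(root, function(book) xmlSApply(book, xmlValue))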
#Convert this to a dataframe and display it.
xmlDataF <- data.frame(xmlData)
xmlDataF
##           BOOK                              BOOK.1
## TITLE     A Memory of Light                 Words of Radiance
## AUTHOR    Robert Jordan, Brandon Sanderson  Brandon Sanderson
## YEAR      2013                              2014
## PUBLISHER Tor Books                         Tor Books
## SERIES    A Wheel of Time                   The Stormlight Archive
##           BOOK.2                    BOOK.3
## TITLE     Name of the Wind          Lies of Locke Lamora
## AUTHOR    Patrick Rothfuss          Scott Lynch
## YEAR      2007                      2006
## PUBLISHER DAW Books                 Bantam Spectra
## SERIES    The Kingkiller Chronicle  Gentleman Bastard
Now let’s look at the HTML file.
library(rvest)
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:XML':
##
## xml
# Pull the HTML table and convert it to a data frame
htmlURL <- "https://raw.githubusercontent.com/mkollontai/DATA607/master/HW7/books.html"
htmlDataF <- as.data.frame(read_html(htmlURL) %>% html_table(fill = TRUE))
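# The table is transposed (fields in rows, books in columns), so name the
# columns after the books and the rows after the fields.
# Only four names are supplied, so the fifth (empty) column ends up named NA.
names(htmlDataF) <- c("Book1", "Book2", "Book3", "Book4")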
row.names(htmlDataF) <- c("Title", "Author", "Year", "Publisher", "Series")
htmlDataF
##           Book1                              Book2
## Title     A memory of Light                  Words of Radiance
## Author    Robert Jordan & Brandon Sanderson  Brandon Sanderson
## Year      2013                               2014
## Publisher Tor Books                          Tor Books
## Series    A Wheel of Time                    The Stromlight Archive
##           Book3                     Book4                 NA
## Title     Name of the Wind          Lies of Locke Lamora  NA
## Author    Patrick Rothfuss          Scott Lynch           NA
## Year      2007                      2006                  NA
## Publisher DAW Books                 Bantam Spectra        NA
## Series    The Kingkiller Chronicle  Gentleman Bastard     NA
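The trailing NA column in this output could be dropped before further use. The following is only a minimal sketch, assuming a small stand-in data frame (the hypothetical `demo`, with an invented `Extra` column) rather than the real download; it keeps every column that has at least one non-missing value.

```r
# Stand-in for htmlDataF: two populated columns plus one empty column.
# `demo` and `Extra` are hypothetical names used for illustration only.
demo <- data.frame(
  Book1 = c("A Memory of Light", "2013"),
  Book2 = c("Words of Radiance", "2014"),
  Extra = c(NA, NA),
  stringsAsFactors = FALSE
)
row.names(demo) <- c("Title", "Year")

# Count the non-missing values per column and keep columns with at least one.
keep <- colSums(!is.na(demo)) > 0
demo_clean <- demo[, keep, drop = FALSE]
```

The `drop = FALSE` argument keeps the result a data frame even if only one column survives.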
Finally, let’s take a look at the JSON file.
library(jsonlite)
jsonURL <- "https://raw.githubusercontent.com/mkollontai/DATA607/master/HW7/books.json"
j <- getURL(jsonURL)
jsonData <- as.data.frame(fromJSON(j))
jsonData
##   Title                 Author                             Year
## 1 A Memory of Light     Robert Jordan & Brandon Sanderson  2013
## 2 Words of Radiance     Brandon Sanderson                  2014
## 3 Name of the Wind      Patrick Rothfuss                   2007
## 4 Lies of Locke Lamora  Scott Lynch                        2006
##   Publisher       Series
## 1 Tor Books       A Wheel of Time
## 2 Tor Books       The Stormlight Archive
## 3 DAW Books       The Kingkiller Chronicle
## 4 Bantam Spectra  Gentleman Bastard
The three data frames contain the same information, though it took some cleanup to fix the row and column names. The XML and HTML data frames are transposed relative to the JSON one (fields in rows, books in columns), which could be corrected if necessary. The HTML data frame also contains an extra, empty column of NAs.
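As a rough sketch of how the orientation could be reconciled, the example below transposes a fields-by-books data frame into one row per book, matching the JSON layout. It uses a small hypothetical stand-in (`wide`) instead of the real `xmlDataF`, and the conversion of `YEAR` to integer is an added assumption about how the column would be used.

```r
# Stand-in for the XML result: fields in rows, books in columns.
# `wide` and `long` are hypothetical names used for illustration only.
wide <- data.frame(
  BOOK   = c("A Memory of Light", "2013"),
  BOOK.1 = c("Words of Radiance", "2014"),
  row.names = c("TITLE", "YEAR"),
  stringsAsFactors = FALSE
)

# t() returns a character matrix, so convert it back to a data frame.
long <- as.data.frame(t(wide), stringsAsFactors = FALSE)

# Replace the BOOK/BOOK.1 row names with plain row numbers.
row.names(long) <- NULL

# Transposing leaves every column as character; restore the numeric year.
long$YEAR <- as.integer(long$YEAR)
```

After this step each book is a row and each field is a column, so the frame can be compared directly with one built from the JSON file.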