DATA607

Let’s start by looking at the XML file

library(XML)
library(RCurl)

## Loading required package: bitops

library(bitops)

# The file is located on my GitHub page

docurl <- "https://raw.githubusercontent.com/mkollontai/DATA607/master/HW7/books.XML"

x <- getURL(docurl)

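# Parse the XML data and isolate the root node.
# (The variable is named booksRoot so that it does not shadow the xmlRoot() function.)
booksRoot <- xmlRoot(xmlParse(x))

# Apply xmlValue to every field of every book; sapply simplifies the result to a character matrix.
xmlData <- xmlSApply(booksRoot, function(book) xmlSApply(book, xmlValue))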

#Convert this to a dataframe and display it.
xmlDataF <- data.frame(xmlData)
xmlDataF

##                                       BOOK                 BOOK.1
## TITLE                    A Memory of Light      Words of Radiance
## AUTHOR    Robert Jordan, Brandon Sanderson      Brandon Sanderson
## YEAR                                  2013                   2014
## PUBLISHER                        Tor Books              Tor Books
## SERIES                     A Wheel of Time The Stormlight Archive
##                             BOOK.2               BOOK.3
## TITLE             Name of the Wind Lies of Locke Lamora
## AUTHOR            Patrick Rothfuss          Scott Lynch
## YEAR                          2007                 2006
## PUBLISHER                DAW Books       Bantam Spectra
## SERIES    The Kingkiller Chronicle    Gentleman Bastard
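
The matrix built above holds one column per book, and every value is text because xmlValue returns character strings. A minimal sketch of turning that layout into one row per book, using a small hand-typed stand-in rather than the parsed file (the names `book_matrix` and `books_by_row` are illustrative and not part of the original code):

```r
# Stand-in for the character matrix produced by the nested xmlSApply call:
# fields in rows, one column per book (values copied from the output above).
book_matrix <- cbind(
  BOOK   = c(TITLE = "A Memory of Light", YEAR = "2013"),
  BOOK.1 = c(TITLE = "Words of Radiance", YEAR = "2014")
)

# t() swaps rows and columns, giving one row per book.
books_by_row <- as.data.frame(t(book_matrix), stringsAsFactors = FALSE)
row.names(books_by_row) <- NULL

# The values are text, so numeric fields need an explicit conversion.
books_by_row$YEAR <- as.integer(books_by_row$YEAR)
books_by_row
```

Whether this step is worth doing depends on what comes next; the column-per-book layout is fine for display but awkward for filtering or sorting by field.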

Now let’s look at the HTML file

library(rvest)

## Loading required package: xml2

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:XML':
## 
##     xml

# Pull the table from the page and convert it to a data frame.
htmlURL <- "https://raw.githubusercontent.com/mkollontai/DATA607/master/HW7/books.html"
htmlDataF <- as.data.frame(read_html(htmlURL) %>% html_table(fill = TRUE))

# The table is transposed, so name the rows after the fields and the columns after the individual books.
names(htmlDataF) <- c("Book1", "Book2", "Book3", "Book4")
row.names(htmlDataF) <- c("Title", "Author", "Year", "Publisher", "Series")

htmlDataF

##                                       Book1                  Book2
## Title                     A memory of Light      Words of Radiance
## Author    Robert Jordan & Brandon Sanderson      Brandon Sanderson
## Year                                   2013                   2014
## Publisher                         Tor Books              Tor Books
## Series                      A Wheel of Time The Stromlight Archive
##                              Book3                Book4 NA
## Title             Name of the Wind Lies of Locke Lamora NA
## Author            Patrick Rothfuss          Scott Lynch NA
## Year                          2007                 2006 NA
## Publisher                DAW Books       Bantam Spectra NA
## Series    The Kingkiller Chronicle    Gentleman Bastard NA
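
The fifth column in the output above holds only NA values, and its header appears as NA because only four column names were assigned. One possible way to drop such a column is sketched below on a small stand-in data frame; the names `scraped`, `has_data` and `cleaned` are hypothetical, and the approach assumes that a column with no non-missing values carries no information.

```r
# Stand-in for the scraped table: the last column holds only NA values.
scraped <- data.frame(
  Book1 = c("A memory of Light", "2013"),
  Book2 = c("Words of Radiance", "2014"),
  Extra = c(NA, NA),
  stringsAsFactors = FALSE
)

# Keep only the columns that have at least one non-missing value.
has_data <- colSums(!is.na(scraped)) > 0
cleaned <- scraped[, has_data, drop = FALSE]
names(cleaned)
```

Using `drop = FALSE` keeps the result a data frame even if only one column survives the filter.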

Finally, let’s take a look at the JSON file

library(jsonlite)

jsonURL <- "https://raw.githubusercontent.com/mkollontai/DATA607/master/HW7/books.json"
j <- getURL(jsonURL)

jsonData <- as.data.frame(fromJSON(j))
jsonData

##                  Title                            Author Year
## 1    A Memory of Light Robert Jordan & Brandon Sanderson 2013
## 2    Words of Radiance                 Brandon Sanderson 2014
## 3     Name of the Wind                  Patrick Rothfuss 2007
## 4 Lies of Locke Lamora                       Scott Lynch 2006
##        Publisher                   Series
## 1      Tor Books          A Wheel of Time
## 2      Tor Books   The Stormlight Archive
## 3      DAW Books The Kingkiller Chronicle
## 4 Bantam Spectra        Gentleman Bastard
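
The JSON version needed the least work because fromJSON simplifies an array of same-shaped objects into a data frame on its own. The sketch below illustrates this on an inline string; the JSON text is an assumption about the general shape of such a file (a top-level array with one object per book), not a copy of books.json.

```r
library(jsonlite)

# Hypothetical JSON text: a top-level array with one object per book.
json_text <- '[
  {"Title": "A Memory of Light", "Year": 2013},
  {"Title": "Words of Radiance", "Year": 2014}
]'

# fromJSON simplifies an array of same-shaped objects into a data frame,
# with one row per object and one column per key.
books <- fromJSON(json_text)
str(books)
```

With input of this shape the result is already a data frame, so wrapping the call in as.data.frame is harmless but probably not required.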

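The three data frames contain essentially the same information, though it took some cleanup to fix the row and column names. The XML and HTML data frames are transposed relative to the JSON one (one column per book rather than one row per book), which could easily be fixed if necessary. The HTML data frame also contains an extra, empty column of NAs. A few text values differ slightly between the source files: the HTML version has "A memory of Light" and "The Stromlight Archive", and the XML version separates the two authors with a comma rather than "&".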
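
A comparison like the one above can also be done programmatically once two data frames share the same orientation and column names. The following is a rough sketch on hand-typed excerpts of the XML and HTML outputs (first two books, two fields); the names `from_xml`, `from_html` and `mismatches` are illustrative only.

```r
# Values copied from the XML and HTML outputs shown earlier (first two books).
from_xml <- data.frame(
  Title  = c("A Memory of Light", "Words of Radiance"),
  Series = c("A Wheel of Time", "The Stormlight Archive"),
  stringsAsFactors = FALSE
)
from_html <- data.frame(
  Title  = c("A memory of Light", "Words of Radiance"),
  Series = c("A Wheel of Time", "The Stromlight Archive"),
  stringsAsFactors = FALSE
)

# Element-wise comparison gives a logical matrix; which(..., arr.ind = TRUE)
# reports the row and column index of every mismatching cell.
mismatches <- which(from_xml != from_html, arr.ind = TRUE)
mismatches
```

Each row of `mismatches` points at one cell whose text differs between the two sources, which makes small spelling or capitalisation differences easy to spot.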

DATA607_HW7

Misha Kollontai

10/11/2019
