Overview

We’re going to load an HTML, an XML, and a JSON file from GitHub into R data frames. Each file contains information on my top three favorite books. The files were created in Sublime Text.

Once the files are loaded, we will compare the resulting data frames to see how they differ.

Libraries Needed

To do this, the following libraries need to be loaded in R:
* XML
* RCurl
* dplyr
* jsonlite
* kableExtra - used for displaying data frames in a nice way
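As a minimal sketch of the loading step, the snippet below checks which of the listed packages are installed and attaches those that are. The variable names `pkgs` and `not_installed` are illustrative and not part of the original analysis.

```r
# Packages listed above.
pkgs <- c("XML", "RCurl", "dplyr", "jsonlite", "kableExtra")

# requireNamespace() returns FALSE (without an error) when a package is absent.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!is_installed]

if (length(not_installed) > 0) {
  message("Install these first: ", paste(not_installed, collapse = ", "))
}

# Attach every package that is available.
for (p in pkgs[is_installed]) {
  library(p, character.only = TRUE)
}
```

Any package reported as missing can be installed with `install.packages()` before rerunning the snippet.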

Get the Data

url_html <- "https://raw.githubusercontent.com/devinteran/Data607-Assignment7/master/FavoriteBooks.html"
url_xml  <- "https://raw.githubusercontent.com/devinteran/Data607-Assignment7/master/FavoriteBooks.xml"
url_json <- "https://raw.githubusercontent.com/devinteran/Data607-Assignment7/master/FavoriteBooks.json"

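# Download the raw HTML and XML text. The variables are reused, so from here on
# url_html and url_xml hold the file contents rather than the URLs.
url_html <- getURL(url_html)
url_xml  <- getURL(url_xml)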

Parse Data into DataFrames

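First, let’s get our JSON file. We can also confirm that the JSON file is well formed: the function isValidJSON returns TRUE if it is. Note that isValidJSON comes from the RJSONIO package, which is not among the libraries listed above, so the call is commented out here.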

#isValidJSON(url_json)
json <- as.data.frame(jsonlite::fromJSON(url_json))
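The validation and parsing step can be tried without network access. The sketch below uses a small inline JSON string as a stand-in for FavoriteBooks.json (its layout and the book titles are assumptions, since the real file is not shown here) and uses `jsonlite::validate()` as a substitute for isValidJSON.

```r
library(jsonlite)

# Hypothetical stand-in for FavoriteBooks.json; the real file may differ.
json_txt <- '{"book": [
  {"title": "Book A", "genre": ["memoir", "self help"]},
  {"title": "Book B", "genre": ["true crime"]}
]}'

# validate() returns TRUE when the text parses as JSON.
is_valid <- jsonlite::validate(json_txt)

# Same call pattern as above, applied to text instead of a URL.
json_demo <- as.data.frame(jsonlite::fromJSON(json_txt))

# The top-level key "book" is prepended to every column name.
print(names(json_demo))

# Arrays of different lengths are kept as a list column.
print(class(json_demo$book.genre))
```

This reproduces, on a small scale, the `book.` prefix on column names and the list-valued cells.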

Second, let’s get our HTML file using htmlParse and readHTMLTable.

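# url_html holds the downloaded HTML text at this point, not a URL.
html <- htmlParse(file = url_html)
# Extract the <table> elements from the parsed document.
html <- readHTMLTable(html)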
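To see what these two calls return, here is a small offline sketch. The inline HTML is a made-up stand-in for FavoriteBooks.html, and `asText = TRUE` is passed explicitly so that the string is treated as content rather than a file name.

```r
library(XML)

# Hypothetical stand-in for FavoriteBooks.html; the real file may differ.
html_txt <- paste0(
  "<html><body><table>",
  "<tr><th>Title</th><th>Genre</th></tr>",
  "<tr><td>Book A</td><td>memoir</td></tr>",
  "<tr><td>Book B</td><td>true crime</td></tr>",
  "</table></body></html>"
)

# Parse the text into an HTML document object.
doc <- htmlParse(html_txt, asText = TRUE)

# Called on a whole document, readHTMLTable() returns a list with one
# element per <table>, so the first table is picked out explicitly.
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
html_demo <- tables[[1]]
print(html_demo)
```

Because the result is a list, a page with several tables would need an index (or the `which` argument) to select the one of interest.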

Finally, we’ll retrieve our XML file using xmlParse and xmlToDataFrame.

xml <- xmlParse(file = url_xml)
xml <- xmlToDataFrame(xml)
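The same pattern can be sketched offline for XML. The inline document below is an assumed stand-in for FavoriteBooks.xml, with one `<book>` record per child of the root and comma-separated values stored as plain text.

```r
library(XML)

# Hypothetical stand-in for FavoriteBooks.xml; the real file may differ.
xml_txt <- paste0(
  "<books>",
  "<book><title>Book A</title><genre>memoir,self-help</genre></book>",
  "<book><title>Book B</title><genre>true crime</genre></book>",
  "</books>"
)

# Parse the text into an XML document object.
doc <- xmlParse(xml_txt, asText = TRUE)

# Each child of the root becomes one row; each sub-element becomes a column.
xml_demo <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
print(xml_demo)
```

Multi-valued fields stay as a single string here because that is how they are written in the (assumed) source text.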

Look at Data

Below we can see that each file type handles columns with multiple values differently.
* JSON stores the items together in a list column, which is displayed as c(...).
* The HTML data comes from a table, so the additional items appear in a second row of the data frame, with the remaining cells filled with NA.
* XML columns with multiple items are separated by a comma because the original file was formatted this way.

The JSON data frame has the top-level node name, book., prepended to each column name, while the HTML and XML data frames do not.

If we were going to do further analysis using these files, additional data cleaning would be needed.

JSON

| book.title | book.author | book.published | book.pages | book.GoodReadsScore | book.genre |
|---|---|---|---|---|---|
| The Happiness Project | Gretchen Rubin | 2009 | 368 | 3.6 | c("memoir", "self help") |
| All Creatures Great and Small | James Herriot | 1972 | 448 | 4.3 | Auto Biography |
| I’ll Be Gone in the Dark | c("Michelle McNamara", "Patton Oswald") | 2018 | 352 | 4.1 | c("Auto Biography", "True Crime") |

HTML

| Title | Author | Year Published | Pages | Good Reads Score | Genre |
|---|---|---|---|---|---|
| The Happiness Project | Gretchen Rubin | 2009 | 368 | 3.6 | memoir |
| self-help | NA | NA | NA | NA | NA |
| All Creatures Great and Small | James Herriott | 1972 | 448 | 4.3 | Auto Biography |
| Ill Be Gone in the Dark | Michelle McNamara | 2018 | 352 | 4.1 | Auto Biography |
| Patton Oswald | True Crime | NA | NA | NA | NA |

XML

| title | author | published | pages | GoodReadsScore | genre |
|---|---|---|---|---|---|
| The Happiness Project | Gretchen Rubin | 2009 | 368 | 3.6 | memoir,self-help |
| All Creatures Great and Small | James Herriot | 1972 | 448 | 4.3 | Auto Biography |
| Ill Be Gone in the Dark | Michelle McNamara,Patton Oswald | 2018 | 352 | 4.1 | Auto Biography,True Crime |
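As one possible starting point for the cleaning mentioned earlier, the sketch below brings a JSON-style and an XML-style data frame into the same shape using base R only. The two small data frames are hypothetical stand-ins built by hand, not the real parsed files, and the HTML case (values spilled into extra rows) is not handled here.

```r
# Hypothetical stand-in for the JSON result: prefixed names, list column.
json_demo <- data.frame(book.title = c("Book A", "Book B"),
                        stringsAsFactors = FALSE)
json_demo$book.genre <- list(c("memoir", "self help"), "true crime")

# Hypothetical stand-in for the XML result: comma-separated strings.
xml_demo <- data.frame(title = c("Book A", "Book B"),
                       genre = c("memoir,self help", "true crime"),
                       stringsAsFactors = FALSE)

# Drop the "book." prefix so the column names match across sources.
names(json_demo) <- sub("^book\\.", "", names(json_demo))

# Split the comma-separated strings into a list column, like the JSON one.
xml_demo$genre <- strsplit(xml_demo$genre, ",", fixed = TRUE)

# Both genre columns now hold the same list of character vectors.
same_genres <- identical(json_demo$genre, xml_demo$genre)
print(same_genres)
```

With real data, values such as "self help" versus "self-help" would also need to be reconciled before the sources compare as equal.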