Overview

We're going to load an HTML, an XML, and a JSON file from GitHub into R data frames. Each file contains information on my top three favorite books. The files were created in Sublime Text.

Once the files are loaded, we will check whether the resulting data frames differ from one another.

Libraries Needed

To do this, the following packages need to be loaded in R (package names are case-sensitive):
* XML
* RCurl
* dplyr
* jsonlite
* kableExtra - used for displaying data frames in a nice way
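
As a minimal sketch, the packages can be attached as shown here. The requireNamespace check and the variable names installed and not_installed are illustrative additions, not part of the original write-up; the check only reports anything that still needs install.packages().

```r
# Packages listed above; names are case-sensitive in R.
pkgs <- c("XML", "RCurl", "dplyr", "jsonlite", "kableExtra")

# Illustrative step: find out which packages are installed.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]
if (length(not_installed) > 0) {
  message("Install first: ", paste(not_installed, collapse = ", "))
}

# Attach the packages that are available.
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```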

Get the Data

# Raw GitHub URLs for the three files
url_html <- "https://raw.githubusercontent.com/devinteran/Data607-Assignment7/master/FavoriteBooks.html"
url_xml  <- "https://raw.githubusercontent.com/devinteran/Data607-Assignment7/master/FavoriteBooks.xml"
url_json <- "https://raw.githubusercontent.com/devinteran/Data607-Assignment7/master/FavoriteBooks.json"

# Download the HTML and XML contents with RCurl.
# Note: from here on url_html and url_xml hold the document text, not the URLs.
url_html <- getURL(url_html)
url_xml  <- getURL(url_xml)

Parse Data into DataFrames

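First, let's get our JSON file. We can also confirm that the file is valid JSON: if it is, the function isValidJSON returns TRUE.

# Optional validity check. isValidJSON is provided by the RJSONIO package,
# which is not in the list above, so the call is left commented out.
#isValidJSON(url_json)
# fromJSON reads directly from the URL; as.data.frame flattens the result.
json <- as.data.frame(jsonlite::fromJSON(url_json))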
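
To illustrate the JSON step without a network call, here is a small sketch on inline text. The JSON string is a toy stand-in for FavoriteBooks.json (its exact structure is an assumption), and jsonlite's validate() is used as the validity check in place of isValidJSON.

```r
library(jsonlite)

# Toy JSON text (an assumption standing in for FavoriteBooks.json).
json_text <- '{"book": [
  {"title": "The Happiness Project", "genre": ["memoir", "self help"]},
  {"title": "All Creatures Great and Small", "genre": "Auto Biography"}
]}'

# validate() returns TRUE when the text is well-formed JSON.
is_valid <- validate(json_text)

# fromJSON returns a list with one element, "book", holding a data frame.
parsed <- fromJSON(json_text)

# Flattening the list prefixes each column with the node name "book.".
flat <- as.data.frame(parsed)
names(flat)

# The genre column is a list; the first book carries two values.
flat$book.genre[[1]]
```

Because a book can have more than one genre, fromJSON keeps that column as a list rather than a plain character vector, which is why the values print as c(...).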

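Second, let's get our HTML file using htmlParse and readHTMLTable.

# url_html already holds the downloaded HTML text
html <- htmlParse(file = url_html)
# readHTMLTable returns a list with one data frame per table in the document
html <- readHTMLTable(html)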
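
The same two calls can be tried on a small inline document. This is a sketch under the assumption that the real file holds a single table with a header row; the HTML string and the variable names doc, tables, and books are illustrative.

```r
library(XML)

# Toy HTML text (an assumption standing in for FavoriteBooks.html).
html_text <- '<html><body><table>
<tr><th>Title</th><th>Genre</th></tr>
<tr><td>The Happiness Project</td><td>memoir</td></tr>
</table></body></html>'

# asText = TRUE tells htmlParse that the input is content, not a file name.
doc <- htmlParse(html_text, asText = TRUE)

# One list element per table found in the document.
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Pick the first (and here only) table as a data frame.
books <- tables[[1]]
```

Extracting the table with [[1]] is a design choice: it gives a plain data frame that is easier to compare with the JSON and XML results than the list returned by readHTMLTable.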

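Finally, we'll retrieve our XML file using xmlParse and xmlToDataFrame.

# url_xml already holds the downloaded XML text
xml <- xmlParse(file = url_xml)
# Each child node of the root becomes one row; its sub-nodes become columns
xml <- xmlToDataFrame(xml)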
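
A minimal sketch of the XML step on inline text follows. The XML string is a toy stand-in for FavoriteBooks.xml; the element names are assumptions based on the column names shown in the output.

```r
library(XML)

# Toy XML text (an assumption standing in for FavoriteBooks.xml).
xml_text <- '<books>
  <book><title>The Happiness Project</title><genre>memoir,self-help</genre></book>
  <book><title>All Creatures Great and Small</title><genre>Auto Biography</genre></book>
</books>'

# asText = TRUE tells xmlParse that the input is content, not a file name.
doc <- xmlParse(xml_text, asText = TRUE)

# Each <book> node becomes a row; <title> and <genre> become columns.
books <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
```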

Look at Data

The output below shows that each file type handles columns with multiple values differently.
* JSON appears to keep the items together in a single cell as a vector, displayed as c().
* The HTML data comes from a table, so the additional items spill over into a second row of the data frame.
* XML columns with multiple items are separated by a comma because the original file was formatted this way.

The JSON data frame has the node name, book., attached to each column name, while the HTML and XML data frames do not.

If we were going to do further analysis using these files, additional data cleaning would be needed.
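
As one example of such cleaning, the comma-separated XML values could be split so they can be compared with the JSON vectors. This is only a sketch: the two vectors below are toy values copied from the genre columns in the output, and the names xml_genre, json_genre, xml_split, and same are illustrative.

```r
# Genre values as they appear in the XML output (comma-separated strings).
xml_genre <- c("memoir,self-help", "Auto Biography")

# Genre values as they appear in the JSON output (a list of vectors).
json_genre <- list(c("memoir", "self help"), "Auto Biography")

# Split each XML string on commas to get the same list-of-vectors shape.
xml_split <- strsplit(xml_genre, ",", fixed = TRUE)

# Compare book by book; FALSE flags a difference between the two sources.
same <- mapply(identical, xml_split, json_genre)
same
```

The first book is flagged as different because the XML file spells the genre "self-help" while the JSON file spells it "self help".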

JSON

book.title	book.author	book.published	book.pages	book.GoodReadsScore	book.genre
The Happiness Project	Gretchen Rubin	2009	368	3.6	c("memoir", "self help")
All Creatures Great and Small	James Herriot	1972	448	4.3	Auto Biography
I’ll Be Gone in the Dark	c("Michelle McNamara", "Patton Oswald")	2018	352	4.1	c("Auto Biography", "True Crime")

HTML

Title	Author	Year Published	Pages	Good Reads Score	Genre
The Happiness Project	Gretchen Rubin	2009	368	3.6	memoir
self-help	NA	NA	NA	NA	NA
All Creatures Great and Small	James Herriott	1972	448	4.3	Auto Biography
Ill Be Gone in the Dark	Michelle McNamara	2018	352	4.1	Auto Biography
Patton Oswald	True Crime	NA	NA	NA	NA

XML

title	author	published	pages	GoodReadsScore	genre
The Happiness Project	Gretchen Rubin	2009	368	3.6	memoir,self-help
All Creatures Great and Small	James Herriot	1972	448	4.3	Auto Biography
Ill Be Gone in the Dark	Michelle McNamara,Patton Oswald	2018	352	4.1	Auto Biography,True Crime

Data607-Assignment7

Devin Teran

3/15/2020
