We’re going to load an HTML, an XML, and a JSON file from GitHub into R data frames. Each file contains information on my top three favorite books. The files were created in Sublime Text.
Once the files are loaded, we will compare the resulting data frames to see whether they differ.
To do this, the following libraries need to be loaded in R:
* XML
* RCurl
* dplyr
* jsonlite
* kableExtra - used for displaying data frames in a nice way
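Package names in R are case-sensitive, so it can help to check what is installed before attaching anything. The snippet below is a minimal sketch of that check, not the setup chunk used for this post; the variable names `pkgs`, `installed`, and `missing_pkgs` are made up for illustration.

```r
# Packages named in the list above (names are case-sensitive in R).
pkgs <- c("XML", "RCurl", "dplyr", "jsonlite", "kableExtra")

# Check which packages are available without attaching them yet.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Report anything that would still need install.packages().
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach only the packages that are actually available.
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```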
url_html <- "https://raw.githubusercontent.com/devinteran/Data607-Assignment7/master/FavoriteBooks.html"
url_xml <- "https://raw.githubusercontent.com/devinteran/Data607-Assignment7/master/FavoriteBooks.xml"
url_json <- "https://raw.githubusercontent.com/devinteran/Data607-Assignment7/master/FavoriteBooks.json"
url_html <- getURL(url_html)
url_xml <- getURL(url_xml)
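First, let’s get our JSON file. We can also confirm that the file is valid JSON: if it is, the function isValidJSON will return TRUE. Note that isValidJSON comes from the RJSONIO package, which is not in the list above, so the check is left commented out here; jsonlite offers validate() for the same purpose.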
#isValidJSON(url_json)
json <- as.data.frame(jsonlite::fromJSON(url_json))
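To see the validation and parsing steps without a network connection, here is a small sketch using jsonlite only. The inline string `json_txt` is a made-up sample that assumes the file has a top-level `book` array, as the `book.` column prefix in the output suggests; it is not the real FavoriteBooks.json.

```r
library(jsonlite)

# Hypothetical sample shaped like the file described in this post.
json_txt <- '{"book": [
  {"title": "The Happiness Project", "genre": ["memoir", "self help"]},
  {"title": "All Creatures Great and Small", "genre": "Auto Biography"}
]}'

# validate() checks whether a string is well-formed JSON.
is_valid <- jsonlite::validate(json_txt)

# fromJSON() turns the array of objects into a data frame.
parsed <- jsonlite::fromJSON(json_txt)
books <- parsed$book

# Fields holding a mix of arrays and single strings stay a list column.
str(books)

# Converting the whole parsed list prefixes the node name to each column.
flat <- as.data.frame(parsed)
names(flat)
```

Pulling out `parsed$book` directly is one way to avoid the `book.` prefix, while `as.data.frame()` on the whole list keeps it.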
Second, let’s get our HTML file using htmlParse and readHTMLTable.
html <- htmlParse(file = url_html)
html <- readHTMLTable(html)
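One thing worth knowing is that readHTMLTable, when given a whole document, returns a list with one entry per table rather than a single data frame. The sketch below shows this with a tiny inline page; `html_txt` and its layout (a second row for the extra genre) are assumptions made for illustration, not the contents of FavoriteBooks.html.

```r
library(XML)

# Hypothetical one-table page; the real file may be laid out differently.
html_txt <- "<html><body><table>
<tr><th>title</th><th>genre</th></tr>
<tr><td>The Happiness Project</td><td>memoir</td></tr>
<tr><td>The Happiness Project</td><td>self help</td></tr>
</table></body></html>"

# asText = TRUE tells htmlParse the input is markup, not a file path.
doc <- htmlParse(html_txt, asText = TRUE)

# The result is a list of tables found in the document.
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Pick the first (and here only) table to get a data frame.
books_html <- tables[[1]]
print(books_html)
```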
Finally, we’ll retrieve our XML file using xmlParse and xmlToDataFrame.
xml <- xmlParse(file = url_xml)
xml <- xmlToDataFrame(xml)
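The same two calls can be tried on an inline string. This is a rough sketch: `xml_txt` is a made-up sample that assumes each book is a child of the root node and that multi-valued fields are stored as comma-separated text, as the output in this post suggests.

```r
library(XML)

# Hypothetical sample; element names are assumed from the column names.
xml_txt <- "<books>
  <book><title>The Happiness Project</title><genre>memoir,self-help</genre></book>
  <book><title>All Creatures Great and Small</title><genre>Auto Biography</genre></book>
</books>"

# asText = TRUE tells xmlParse the input is markup, not a file path.
doc <- xmlParse(xml_txt, asText = TRUE)

# Each child of the root becomes a row, each grandchild a column.
books_xml <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
print(books_xml)
```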
Below we can see that each file type handles columns with multiple values differently.
* JSON stores multiple values as a vector, displayed as “c()”, inside a list column.
* The HTML data comes from a table, so the duplicated items appear in a second row of the data frame.
* XML columns with multiple items are separated by a comma because the original file was formatted this way.
In the tables below, the JSON column names carry the parent node name as a prefix (book.), while the XML column names do not.
If we were going to do further analysis using these files, additional cleaning would be needed.
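As one idea of what that cleaning might look like, the sketch below strips the `book.` prefix and collapses a list column into comma-separated text using base R only. The small `books` data frame is built by hand to resemble the JSON result; it is an illustration, not the actual loaded data.

```r
# Hand-built stand-in for the JSON data frame (two columns only).
books <- data.frame(
  book.title = c("The Happiness Project", "All Creatures Great and Small"),
  stringsAsFactors = FALSE
)
books$book.genre <- list(c("memoir", "self help"), "Auto Biography")

# Remove the "book." prefix from every column name.
names(books) <- sub("^book\\.", "", names(books))

# Collapse each list element into a single comma-separated string.
books$genre <- vapply(books$genre, paste, character(1), collapse = ", ")

print(books)
```

After this step the JSON result has the same flat, comma-separated shape as the XML one, which would make the data frames easier to compare.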
JSON

| book.title | book.author | book.published | book.pages | book.GoodReadsScore | book.genre |
|---|---|---|---|---|---|
| The Happiness Project | Gretchen Rubin | 2009 | 368 | 3.6 | c("memoir", "self help") |
| All Creatures Great and Small | James Herriot | 1972 | 448 | 4.3 | Auto Biography |
| I’ll Be Gone in the Dark | c("Michelle McNamara", "Patton Oswald") | 2018 | 352 | 4.1 | c("Auto Biography", "True Crime") |

XML

| title | author | published | pages | GoodReadsScore | genre |
|---|---|---|---|---|---|
| The Happiness Project | Gretchen Rubin | 2009 | 368 | 3.6 | memoir,self-help |
| All Creatures Great and Small | James Herriot | 1972 | 448 | 4.3 | Auto Biography |
| Ill Be Gone in the Dark | Michelle McNamara,Patton Oswald | 2018 | 352 | 4.1 | Auto Biography,True Crime |