The purpose of this week’s assignment was to familiarize ourselves with three different file structures (HTML, XML, and JSON) and to see whether reading the same data from each file type produces different data frames.
The basic approach I followed was:
For this assignment I chose three of my favorite titles on the topic of diet / nutrition. Fortunately, one of them had two authors :)
I’ve experimented with more than a half dozen different diets / forms of eating and chose the three titles that most positively impacted my own eating habits: The 4 Hour Body, The Primal Blueprint, and It Starts With Food.
With these three titles in mind, I added the “Publisher” and “Year” (of publishing) attributes to “Title” and “Author”, and sketched out / filled in (with pen and paper) a table where each row was associated with a title and each column was associated with an attribute of this title.
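The pen-and-paper table maps directly onto a data frame. A minimal sketch in base R, using the values shown in the output further down (the name `books` is only an illustrative label, and keeping `Year` as text is an assumption made so that it matches what a file parser would typically return):

```r
# One row per title, one column per attribute.
# `books` is a hypothetical name; Year is stored as text on purpose.
books <- data.frame(
  Title     = c("The 4 Hour Body", "The Primal Blueprint", "It Starts With Food"),
  Author    = c("Tim Ferriss", "Mark Sisson", "Dallas Hartwig, Melissa Hartwig"),
  Publisher = c("Harmony", "Primal Nutrition", "Victory Belt"),
  Year      = c("2010", "2012", "2014"),
  stringsAsFactors = FALSE
)

# Show the structure: 3 observations of 4 character variables.
str(books)
```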
Since I’m a newbie, I searched for some quick and easy tutorials (to complement the text) on creating HTML, XML, and JSON files, applied each of their approaches using Notepad, saved the files with the proper extensions, and then uploaded them to GitHub.
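To make the three file structures concrete, here is a rough sketch of how one record might look in each format, built as plain strings in base R. The tag and key names (`books`, `book`, `Title`, `Year`) are assumptions for illustration and are not copied from the actual files.

```r
# One record written out in each of the three formats.
# Tag and key names are assumptions, not the contents of the real files.
title <- "The 4 Hour Body"
year  <- "2010"

# HTML: a table with a header row and one data row.
html_txt <- sprintf(
  "<table><tr><th>Title</th><th>Year</th></tr><tr><td>%s</td><td>%s</td></tr></table>",
  title, year
)

# XML: one <book> element per record, one child element per attribute.
xml_txt <- sprintf(
  "<books><book><Title>%s</Title><Year>%s</Year></book></books>",
  title, year
)

# JSON: an array of objects, one object per record.
json_txt <- sprintf('[{"Title": "%s", "Year": "%s"}]', title, year)

cat(html_txt, xml_txt, json_txt, sep = "\n")
```

The same two attributes appear in all three, only wrapped differently: table cells in HTML, nested elements in XML, and key/value pairs in JSON.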
Once the files were on GitHub, I read them into R from their raw URLs.
# Packages assumed for the calls below: RCurl for getURL(), XML for
# readHTMLTable() and xmlToDataFrame(), jsonlite for fromJSON()
library(RCurl)
library(XML)
library(jsonlite)
# Load HTML file into a data frame (readHTMLTable() returns a list of tables)
html_file <- getURL("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-607/Assignment-7/books.html")
df_html <- readHTMLTable(html_file, as.data.frame = TRUE)
# Load XML file into a data frame
xml_file <- getURL("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-607/Assignment-7/books.xml")
df_xml <- xmlToDataFrame(xml_file)
# Load JSON file into a data frame
json_file <- getURL("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-607/Assignment-7/books.json")
json_data <- fromJSON(json_file)
df_json <- as.data.frame(json_data)
# Display the data frames corresponding to each data source
df_html
## $`NULL`
##                  Title                          Author        Publisher Year
## 1      The 4 Hour Body                     Tim Ferriss          Harmony 2010
## 2 The Primal Blueprint                     Mark Sisson Primal Nutrition 2012
## 3  It Starts With Food Dallas Hartwig, Melissa Hartwig     Victory Belt 2014
df_xml
##                  Title                          Author        Publisher Year
## 1      The 4 Hour Body                     Tim Ferriss          Harmony 2010
## 2 The Primal Blueprint                     Mark Sisson Primal Nutrition 2012
## 3  It Starts With Food Dallas Hartwig, Melissa Hartwig     Victory Belt 2014
df_json
##                  Title                          Author        Publisher Year
## 1      The 4 Hour Body                     Tim Ferriss          Harmony 2010
## 2 The Primal Blueprint                     Mark Sisson Primal Nutrition 2012
## 3  It Starts With Food Dallas Hartwig, Melissa Hartwig     Victory Belt 2014
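The loading step can also be tried without a network connection. The sketch below parses small inline stand-ins for the three files; the markup layout, the two-column subset, and the use of jsonlite for `fromJSON()` are all assumptions, since the real files are not reproduced here. It also passes `which = 1` to `readHTMLTable()`, which should return the first table directly rather than the one-element list that prints as `` $`NULL` `` above.

```r
library(XML)       # htmlParse(), readHTMLTable(), xmlParse(), xmlToDataFrame()
library(jsonlite)  # fromJSON(); assumed here, the original package is not stated

# Inline stand-ins for the three files (layout is an assumption).
html_txt <- paste0(
  "<table>",
  "<tr><th>Title</th><th>Year</th></tr>",
  "<tr><td>The 4 Hour Body</td><td>2010</td></tr>",
  "<tr><td>The Primal Blueprint</td><td>2012</td></tr>",
  "</table>"
)
xml_txt <- paste0(
  "<books>",
  "<book><Title>The 4 Hour Body</Title><Year>2010</Year></book>",
  "<book><Title>The Primal Blueprint</Title><Year>2012</Year></book>",
  "</books>"
)
json_txt <- '[
  {"Title": "The 4 Hour Body", "Year": "2010"},
  {"Title": "The Primal Blueprint", "Year": "2012"}
]'

# which = 1 asks for the first table as a data frame instead of a list of tables.
df_html <- readHTMLTable(htmlParse(html_txt, asText = TRUE),
                         which = 1, stringsAsFactors = FALSE)
# Each <book> element becomes a row, each child element a column.
df_xml  <- xmlToDataFrame(xmlParse(xml_txt, asText = TRUE),
                          stringsAsFactors = FALSE)
# An array of objects is simplified to a data frame.
df_json <- fromJSON(json_txt)

str(df_html)
str(df_xml)
str(df_json)
```

Parsing from strings keeps the example self-contained; swapping the strings for the text returned by `getURL()` should give the same flow as the original code.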
Yes, the three dataframes were identical.
It took a little surfing through the corresponding R documentation to find the proper function for each format, but it seems to have worked out. Whether read from HTML, XML, or JSON, the resulting data frames are the same.
The organization of the input file mattered here. Since we were trying to see whether the input file type alone results in differing data frames, my aim was to keep the layout of each input file the same. Had there been variation here (i.e. in column headers, use of quotation marks, or general table layout), it would have led to differing outputs, but that’s a whole different conversation :)
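As a small illustration of how quoting alone can change the result, the sketch below parses the same record twice, once with the year in quotation marks and once without. It assumes the jsonlite package as the JSON parser, and the record is a cut-down stand-in rather than the actual file.

```r
library(jsonlite)  # assumed JSON parser for this sketch

# Same record, differing only in whether Year is quoted.
quoted   <- fromJSON('[{"Title": "The 4 Hour Body", "Year": "2010"}]')
unquoted <- fromJSON('[{"Title": "The 4 Hour Body", "Year": 2010}]')

# A quoted value arrives as text, an unquoted one as a number.
class(quoted$Year)
class(unquoted$Year)

# The two data frames print the same but are not identical,
# because the Year columns have different types.
identical(quoted, unquoted)
```

HTML and XML carry every cell as text, so a numeric `Year` in the JSON file would be one way for the three data frames to drift apart.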
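# Programmatically confirm identical data frames
# compare.list() is assumed to come from the useful package
library(useful)
# HTML with XML
compare.list(as.data.frame(df_html), as.data.frame(df_xml))
## [1] TRUE TRUE TRUE TRUE
# HTML with JSON
compare.list(as.data.frame(df_html), as.data.frame(df_json))
## [1] TRUE TRUE TRUE TRUE
# JSON with XML
compare.list(as.data.frame(df_json), as.data.frame(df_xml))
## [1] TRUE TRUE TRUE TRUE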
Programmatically confirmed :) I used the compare.list() function to check the HTML data frame against the XML data frame, HTML against JSON, and JSON against XML. Each call returns one value per column, and since every value above is TRUE, the three data frames match column for column.
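A stricter check is also possible with base R alone. The sketch below runs `identical()` over every pair in a named list of data frames; unlike a column-by-column comparison, `identical()` also takes column names and other attributes into account. The three frames here are stand-in copies of one hypothetical `books` table, so the result is TRUE by construction; the real `df_html`, `df_xml`, and `df_json` would be substituted in practice.

```r
# Stand-in data: three copies of the same table (hypothetical names).
books <- data.frame(
  Title = c("The 4 Hour Body", "The Primal Blueprint", "It Starts With Food"),
  Year  = c("2010", "2012", "2014"),
  stringsAsFactors = FALSE
)
frames <- list(html = books, xml = books, json = books)

# Every unordered pair of sources: html/xml, html/json, xml/json.
pairs <- combn(names(frames), 2, simplify = FALSE)

# identical() checks values, types, column names, and attributes in one go.
results <- vapply(
  pairs,
  function(p) identical(frames[[p[1]]], frames[[p[2]]]),
  logical(1)
)
names(results) <- vapply(pairs, paste, character(1), collapse = " vs ")

print(results)
```

If a pair came back FALSE, `all.equal()` on the same two frames would report where they differ.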
It seems that regardless of the form of the input file (HTML, XML, or JSON), we can lean on R’s parsing functions to output equivalent data frames, at least when the files are laid out consistently. Of course, more experience would be needed to truly confirm this, but it’s a promising start.