Week 7 Homework: HTML, JSON & XML

The Original CSV File of Books

This was a conceptually hard assignment because I struggled with how to represent the multi-author book in a data frame. Should I place all the authors in a single field and separate them later? Or should I repeat entire rows to represent each author (which felt more like an authors data frame)? I decided to start with a CSV and copy each field exactly, to be sure I was getting the same data in the same fields, but to include ALL the authors in a single field for all three output tables.

When it came to working with JSON, handling multiple authors seemed simple and native, but in both XML and HTML I struggled with presentation. Then, in returning to a data frame, JSON became the confounder, forcing me to make the same decision I had made in creating the CSV. I went for consistency and tried to recreate the same table I made in HTML.

This is the original, with the additional authors in otherwise empty subsequent rows.

All tables are presented with knitr::kable(). The only difference in the raw output is that JSON labels the authors as a list, list=(names), on unpacking. You see this when the frame is printed to the console, but it disappears in the rendered JSON data frame.

books <- read.csv("https://raw.githubusercontent.com/bpoulin-CUNY/Data607/master/books.csv", stringsAsFactors = FALSE)
knitr::kable(books)

title	isbn	author	publication.date	pages
Fly Fishing the Mountain Lakes	978-1585747740	Gary Lafontaine	05/01/03	192
Tactics for Trout	978-0811713399	Rick hafale	11/01/14	240
		Dave Hughes		NA
		Skip Morris		NA
Caddisflies	978-0941130981	Gary Lafontaine	04/28/89	336
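
As a rough sketch of the "all authors in one field" choice, the continuation rows of the CSV table above could be collapsed in base R along these lines. The data frame is a hand-typed copy of part of the table (an assumption, not the file on GitHub), and book_id and collapsed are hypothetical names used only for illustration.

```r
# Hand-typed stand-in for the CSV table above (a subset of its columns)
books <- data.frame(
  title  = c("Fly Fishing the Mountain Lakes", "Tactics for Trout", "", "", "Caddisflies"),
  isbn   = c("978-1585747740", "978-0811713399", "", "", "978-0941130981"),
  author = c("Gary Lafontaine", "Rick hafale", "Dave Hughes", "Skip Morris", "Gary Lafontaine"),
  pages  = c(192, 240, NA, NA, 336),
  stringsAsFactors = FALSE
)

# A row with a title starts a new book; cumsum() numbers the books,
# so continuation rows inherit the id of the book above them.
book_id <- cumsum(books$title != "")

# Keep the first row of each book, then paste all of its authors into one field.
collapsed <- books[!duplicated(book_id), ]
collapsed$author <- as.vector(tapply(books$author, book_id, paste, collapse = ", "))
rownames(collapsed) <- NULL

print(collapsed)
```

This keeps the table rectangular with one row per book, at the cost of having to split the author field again if individual authors are needed later.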

HTML

library(rvest)  # provides read_html(), html_nodes(), html_table() and %>%
url_html <- read_html("https://raw.githubusercontent.com/bpoulin-CUNY/Data607/master/books.html")
tabs <- url_html %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
html_frame <- tabs[[1]]
knitr::kable(html_frame)

title	isbn	author	publication date	pages
Fly Fishing the Mountain Lakes	978-1585747740	Gary Lafontaine	05/01/03	192
Tactics for Trout	978-0811713399	Rick Hafale, Dave Hughes, Skip Morris	11/01/14	240
Caddisflies	978-0941130981	Gary Lafontaine	04/28/89	336

JSON

library(jsonlite)  # provides fromJSON()
url_json <- "https://raw.githubusercontent.com/bpoulin81/Data607/ac48db2df3ec30fb48cd2fa18effcc7b428b4f4a/books.json"
json_frame <- fromJSON(url_json)
knitr::kable(json_frame)

title	isbn	authors	publication date	pages
Fly Fishing the Mountain Lakes	978-1585747740	Gary Lafontaine	05/01/2003	192
Tactics for Trout	978-0811713399	Rick Hafale, Dave Hughes, Skip Morris	11/01/2014	240
Caddisflies	978-0941130981	Gary Lafontaine	04/28/1989	336
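
The list label on the authors field is what a list column looks like in R. Below is a minimal sketch of the two ways to flatten one, assuming the JSON stores each book's authors as an array. The frame is built by hand to mimic that shape rather than read with fromJSON(), and json_like, flat and long are hypothetical names.

```r
# Hand-built frame with a list column, mimicking authors stored as arrays
json_like <- data.frame(
  title = c("Fly Fishing the Mountain Lakes", "Tactics for Trout", "Caddisflies"),
  stringsAsFactors = FALSE
)
json_like$authors <- list(
  "Gary Lafontaine",
  c("Rick Hafale", "Dave Hughes", "Skip Morris"),
  "Gary Lafontaine"
)

# Option 1: one row per book, authors pasted into a single string
flat <- json_like
flat$authors <- vapply(json_like$authors, paste, character(1), collapse = ", ")

# Option 2: one row per author, repeating the book title
n_authors <- lengths(json_like$authors)
long <- data.frame(
  title  = rep(json_like$title, times = n_authors),
  author = unlist(json_like$authors),
  stringsAsFactors = FALSE
)

print(flat)
print(long)
```

Option 1 matches the single-field layout used in the tables here; option 2 is the "repeat entire rows" alternative, which reads more like an authors table.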

XML

# XML from GitHub
library(RCurl)  # provides getURL()
library(XML)    # provides xmlParse(), getNodeSet(), xmlToDataFrame()
url_xml <- getURL("https://raw.githubusercontent.com/bpoulin-CUNY/Data607/master/books.xml", ssl.verifypeer = FALSE)
doc <- xmlParse(url_xml)
xml_frame <- xmlToDataFrame(nodes = getNodeSet(doc, "//row"), stringsAsFactors = FALSE)
knitr::kable(xml_frame)

title	isbn	author	publication_date	pages
Fly Fishing the Mountain Lakes	978-1585747740	Gary Lafontaine	05/01/2003	192
Tactics for Trout	978-0811713399	Rick Hafale, Dave Hughes, Skip Morris	11/01/2014	240
Caddisflies	978-0941130981	Gary Lafontaine	04/28/1989	336

Summary

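For the most part, the three tables look the same. The one major exception is the publication date field name: it appears as two separate words in the HTML and JSON tables (and as publication.date in the CSV table as read by read.csv), but as a single underscored string in XML. This is not a presentation or processing issue. XML element names cannot contain spaces, so the only way I could get the XML parsers and the validator I used to work with that field name was to make it a single unbroken string. The dates themselves also differ slightly: the CSV and HTML tables show two-digit years, while the JSON and XML tables show four-digit years.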
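
If matching headers and dates were needed downstream, something like the following could line the tables up. It is only a sketch on hand-typed stand-ins for two of the frames; clean_names is a hypothetical helper, and the date formats are assumed from the tables shown above.

```r
# Hand-typed stand-ins for the date column of the CSV and XML tables
csv_like <- data.frame(publication.date = c("05/01/03", "11/01/14", "04/28/89"),
                       stringsAsFactors = FALSE)
xml_like <- data.frame(publication_date = c("05/01/2003", "11/01/2014", "04/28/1989"),
                       stringsAsFactors = FALSE)

# Hypothetical helper: turn dots, spaces and underscores in headers into one underscore
clean_names <- function(df) {
  names(df) <- gsub("[._ ]+", "_", names(df))
  df
}
csv_like <- clean_names(csv_like)
xml_like <- clean_names(xml_like)

# %y reads a two-digit year (00-68 become 20xx, 69-99 become 19xx); %Y reads four digits
csv_like$publication_date <- as.Date(csv_like$publication_date, format = "%m/%d/%y")
xml_like$publication_date <- as.Date(xml_like$publication_date, format = "%m/%d/%Y")

# Check whether the two columns now agree
print(identical(csv_like$publication_date, xml_like$publication_date))
```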

In general, that is a relatively simple issue to deal with in the final formatting of a data table. The most difficult decision was figuring out how to handle the multiple authors. I chose to merge them into a single field because of what you asked for, and because I had no specific end use to tailor the data's structure to. I made it rectangular.

I can see where each method would have its advantages if you were trying to structure data frames based on tags. But in general, XML and JSON seem like they would be more constrained and consistent formats for sharing data than a CSV, provided the data set is not so large that carrying the tags becomes a burden on the system (CSV is compact). On the other hand, CSV and HTML are easier to write quickly and more forgiving.

They can all be queried the same way, with df$field[position], returning the same results. I think it is a matter of situation: what do you find on your scraping journey, and what is the easiest way to get the data formed and ready to show?
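
To illustrate the df$field[position] pattern, here is a small sketch using hand-typed stand-ins for two of the frames (html_like and xml_like are hypothetical names, not the objects created above). One hedge: column types may not match across parsers, for example if one returns pages as text, so a conversion may be needed before comparing numbers.

```r
# Hand-typed stand-ins; pages is numeric in one frame and text in the other
html_like <- data.frame(
  title = c("Fly Fishing the Mountain Lakes", "Tactics for Trout", "Caddisflies"),
  pages = c(192L, 240L, 336L),
  stringsAsFactors = FALSE
)
xml_like <- data.frame(
  title = c("Fly Fishing the Mountain Lakes", "Tactics for Trout", "Caddisflies"),
  pages = c("192", "240", "336"),
  stringsAsFactors = FALSE
)

# Same query on both frames: field name, then row position
same_title <- html_like$title[2] == xml_like$title[2]

# Convert the text column before comparing it with the integer column
same_pages <- html_like$pages[2] == as.integer(xml_like$pages[2])

print(c(same_title, same_pages))
```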