The Original CSV File of Books

This was a hard assignment conceptually because I struggled with how to present the multi-author book in a data frame. Should I place all authors in a single field and separate them later? Should I repeat entire rows to represent each author (this felt more like an authors data frame)? I decided to start with a CSV and copy each field exactly, to be sure I was getting the same data in the same fields, but to include ALL the authors in a single field for all three output tables.
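
To make the two options concrete, here is a minimal base R sketch of how a single collapsed author field could be separated later into one row per author. The small `books_wide` frame is a hypothetical stand-in typed by hand, not the assignment data, and the comma separator is an assumption.

```r
# Hypothetical two-book table with all authors collapsed into one field.
books_wide <- data.frame(
  title  = c("Tactics for Trout", "Caddisflies"),
  author = c("Rick Hafale, Dave Hughes, Skip Morris", "Gary Lafontaine"),
  stringsAsFactors = FALSE
)

# Split the collapsed field back into individual names.
author_list <- strsplit(books_wide$author, ",\\s*")

# One row per (title, author) pair: repeat each title once per author.
books_long <- data.frame(
  title  = rep(books_wide$title, lengths(author_list)),
  author = unlist(author_list),
  stringsAsFactors = FALSE
)

print(books_long)
```

The long form is closer to an authors table, while the wide form keeps one row per book, which is the shape used in the rest of this report.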

When it came to working with JSON, handling the multiple authors seemed simple and native, but in both XML and HTML I struggled with presentation. Then, in returning to a data frame, JSON became the confounder, forcing me to make the same decision I made in creating the CSV. I went for consistency and tried to recreate the same table I made in HTML.

This is the original, with the additional authors on otherwise empty subsequent rows.

All tables are presented with knitr::kable(). The only difference in the raw data is that JSON labels the authors list=(names) on unpacking. Printed to the console you can see this; in the rendered JSON data frame it disappears.

books<-read.csv("https://raw.githubusercontent.com/bpoulin-CUNY/Data607/master/books.csv", stringsAsFactors = FALSE)
knitr::kable(books)
| title | isbn | author | publication.date | pages |
|:---|:---|:---|:---|---:|
| Fly Fishing the Mountain Lakes | 978-1585747740 | Gary Lafontaine | 05/01/03 | 192 |
| Tactics for Trout | 978-0811713399 | Rick hafale | 11/01/14 | 240 |
| | | Dave Hughes | | NA |
| | | Skip Morris | | NA |
| Caddisflies | 978-0941130981 | Gary Lafontaine | 04/28/89 | 336 |
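
As an illustration of the "all authors in a single field" choice, the continuation rows above could be folded into their parent book row roughly as follows. This is a sketch only: `books_raw` is a hand-typed stand-in with fewer columns than the real file, and it assumes that continuation rows are exactly the rows with an empty title.

```r
# Hypothetical stand-in for the CSV as read: continuation rows have an empty title.
books_raw <- data.frame(
  title  = c("Fly Fishing the Mountain Lakes", "Tactics for Trout", "", "", "Caddisflies"),
  author = c("Gary Lafontaine", "Rick Hafale", "Dave Hughes", "Skip Morris", "Gary Lafontaine"),
  pages  = c(192, 240, NA, NA, 336),
  stringsAsFactors = FALSE
)

# A new book starts wherever the title is non-empty; cumsum() turns that
# into a group id shared by the book row and its continuation rows.
book_id <- cumsum(books_raw$title != "")

# Collapse each group's authors into one comma-separated string.
authors <- tapply(books_raw$author, book_id, paste, collapse = ", ")

# Keep only the first row of each book and attach the merged authors.
books_flat <- books_raw[books_raw$title != "", ]
books_flat$author <- as.vector(authors)
rownames(books_flat) <- NULL

print(books_flat)
```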

HTML

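library(rvest)  # provides read_html(), html_nodes(), html_table() and the %>% pipe

url_html <- read_html("https://raw.githubusercontent.com/bpoulin-CUNY/Data607/master/books.html")
# Collect every <table> on the page as a list of data frames
tabs <- url_html %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
html_frame <- tabs[[1]]  # take the first table on the page
knitr::kable(html_frame)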
| title | isbn | author | publication date | pages |
|:---|:---|:---|:---|---:|
| Fly Fishing the Mountain Lakes | 978-1585747740 | Gary Lafontaine | 05/01/03 | 192 |
| Tactics for Trout | 978-0811713399 | Rick Hafale, Dave Hughes, Skip Morris | 11/01/14 | 240 |
| Caddisflies | 978-0941130981 | Gary Lafontaine | 04/28/89 | 336 |

JSON

library(jsonlite)  # assumption: fromJSON() here is jsonlite's, which can read from a URL
url_json <- "https://raw.githubusercontent.com/bpoulin81/Data607/ac48db2df3ec30fb48cd2fa18effcc7b428b4f4a/books.json"
json_frame <- fromJSON(url_json)
knitr::kable(json_frame)
| title | isbn | authors | publication date | pages |
|:---|:---|:---|:---|---:|
| Fly Fishing the Mountain Lakes | 978-1585747740 | Gary Lafontaine | 05/01/2003 | 192 |
| Tactics for Trout | 978-0811713399 | Rick Hafale, Dave Hughes, Skip Morris | 11/01/2014 | 240 |
| Caddisflies | 978-0941130981 | Gary Lafontaine | 04/28/1989 | 336 |
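
When a JSON array of authors is unpacked into a data frame, it commonly arrives as a list column, where each cell holds a vector of names. The sketch below builds such a column by hand in base R (so `json_like` is a hypothetical stand-in, not the parsed file) and shows one way it could be flattened into a plain character column to match the other tables.

```r
# Hypothetical frame shaped like parsed JSON: authors is a list column.
json_like <- data.frame(
  title = c("Tactics for Trout", "Caddisflies"),
  stringsAsFactors = FALSE
)
# Each cell of a list column can hold a character vector of any length.
json_like$authors <- list(
  c("Rick Hafale", "Dave Hughes", "Skip Morris"),
  "Gary Lafontaine"
)

# Flatten each cell to a single string so the frame is strictly rectangular.
json_like$authors <- vapply(json_like$authors, paste, character(1), collapse = ", ")

print(json_like)
```

Using `vapply()` with `character(1)` rather than `sapply()` is a small safety choice: it fails loudly if any cell does not collapse to exactly one string.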

XML

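# XML from GitHub
library(RCurl)  # getURL()
library(XML)    # xmlParse(), getNodeSet(), xmlToDataFrame()
url_xml <- getURL("https://raw.githubusercontent.com/bpoulin-CUNY/Data607/master/books.xml", ssl.verifyPeer = FALSE)
doc <- xmlParse(url_xml)
# Each <row> node becomes one row of the data frame
xml_frame <- xmlToDataFrame(nodes = getNodeSet(doc, "//row"), stringsAsFactors = FALSE)
knitr::kable(xml_frame)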
| title | isbn | author | publication_date | pages |
|:---|:---|:---|:---|---:|
| Fly Fishing the Mountain Lakes | 978-1585747740 | Gary Lafontaine | 05/01/2003 | 192 |
| Tactics for Trout | 978-0811713399 | Rick Hafale, Dave Hughes, Skip Morris | 11/01/2014 | 240 |
| Caddisflies | 978-0941130981 | Gary Lafontaine | 04/28/1989 | 336 |

Summary

For the most part the three output tables look the same. The one major exception is the publication date field name: in the HTML and JSON tables it appears as two separate words (read.csv() turned it into publication.date in the original CSV table), while in XML it is a single underscored string. This is not a presentation or processing issue. XML element names cannot contain spaces, so the only way I could get the XML parsers and the validator I used to work with that field name was to make it a single unbroken string.
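
The field-name differences can be reproduced and smoothed over in a few lines of base R. This is a sketch under stated assumptions: `csv_text` is a made-up one-row CSV, and `normalise_names()` is a hypothetical helper that is not part of the assignment code.

```r
# Hypothetical one-row CSV whose header contains a space.
csv_text <- "title,publication date\nCaddisflies,04/28/89"

# Default behaviour: check.names = TRUE rewrites the header into a syntactic name.
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the header exactly as written.
kept_names <- names(read.csv(text = csv_text, check.names = FALSE))

# One way to align all the tables: replace runs of spaces or dots with underscores.
normalise_names <- function(df) {
  names(df) <- gsub("[ .]+", "_", names(df))
  df
}

print(default_names)
print(kept_names)
print(names(normalise_names(read.csv(text = csv_text))))
```

Applying the same helper to each frame would give every table the underscored form that the XML version already uses.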

In general that is a relatively simple issue to deal with in the final formatting of a data table. The most difficult decision was figuring out how to handle the multiple authors. I chose to merge them into a single field because of what you asked for and because I had no specific end use that I could tailor the data's structure to. I made it rectangular.

I can see where each method would have its advantages if you were trying to structure data frames based on tags. But in general, XML and JSON seem like they would be more constrained and consistent formats for sharing data than a CSV, provided the data set is not so large that carrying all the tags becomes a burden on the system (CSV is compact). On the other hand, CSV and HTML are easier to write quickly and more forgiving.

They can all be queried the same way, df$field[position], returning the same results. I think it is a matter of situation: what do you find on your scraping journey, and what is the easiest way to get the data formed and ready for show?
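
A short sketch of that query pattern, using a hypothetical two-row frame named `df`. The one wrinkle is that a field name containing a space, like the one in the HTML and JSON tables, needs backticks or `[[ ]]` instead of a bare `$name`.

```r
# Hypothetical frame with one plain and one spaced column name.
df <- data.frame(
  title = c("Fly Fishing the Mountain Lakes", "Tactics for Trout"),
  `publication date` = c("05/01/2003", "11/01/2014"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Plain names work directly with $.
second_title <- df$title[2]

# Names with spaces need backticks, or the [[ ]] form with a quoted string.
second_date_a <- df$`publication date`[2]
second_date_b <- df[["publication date"]][2]

print(second_title)
print(second_date_a)
```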