This was a hard assignment conceptually because I struggled with how to represent the multi-author book in a data frame. Should I place all the authors in a single field and separate them later? Should I repeat entire rows to represent each author (this felt like an authors data frame)? I decided to start with a CSV and copy each field exactly, to be sure I was getting the same data in the same fields, but to include ALL the authors in a single field of the data frame for all three output tables.
When it came to working with JSON, handling the multiple authors seemed simple and native, but in both XML and HTML I struggled with presentation. Then, on returning to a data frame, JSON became the confounder, forcing me to make the same decision I made in creating the CSV. I went for consistency and tried to recreate the same table I made in HTML.
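The two layouts weighed above (all authors in one field versus one row per author) can be sketched in base R. This is only an illustration: `books_wide` and `books_long` are hypothetical names, and the values are retyped from the tables in this post rather than read from the files.

```r
# Hypothetical miniature of the books data (values retyped from this post).
books_wide <- data.frame(
  title  = c("Fly Fishing the Mountain Lakes", "Tactics for Trout", "Caddisflies"),
  author = c("Gary Lafontaine",
             "Rick Hafale, Dave Hughes, Skip Morris",
             "Gary Lafontaine"),
  stringsAsFactors = FALSE
)

# Layout 1: one row per book, all authors in a single field (books_wide).
# Layout 2: one row per (book, author) pair, made by splitting that field.
author_list <- strsplit(books_wide$author, ",\\s*")
books_long <- data.frame(
  title  = rep(books_wide$title, lengths(author_list)),
  author = unlist(author_list),
  stringsAsFactors = FALSE
)

print(books_long)
```

The long layout repeats the title for each author, which is why it starts to feel like an authors table rather than a books table.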
This is the original CSV, with the additional authors in otherwise empty subsequent rows.
All tables are presented with knitr::kable(). The only difference in the raw output is that JSON labels the authors as list=(names) on unpacking: printed to the console you can see this, but in the rendered JSON data frame it disappears.

    books <- read.csv("https://raw.githubusercontent.com/bpoulin-CUNY/Data607/master/books.csv", stringsAsFactors = FALSE)
    knitr::kable(books)

| title | isbn | author | publication.date | pages |
|---|---|---|---|---|
| Fly Fishing the Mountain Lakes | 978-1585747740 | Gary Lafontaine | 05/01/03 | 192 |
| Tactics for Trout | 978-0811713399 | Rick hafale | 11/01/14 | 240 |
| | | Dave Hughes | | NA |
| | | Skip Morris | | NA |
| Caddisflies | 978-0941130981 | Gary Lafontaine | 04/28/89 | 336 |

    # rvest provides read_html(), html_nodes(), html_table(), and the %>% pipe
    library(rvest)

    url_html <- read_html("https://raw.githubusercontent.com/bpoulin-CUNY/Data607/master/books.html")
    tabs <- url_html %>%
      html_nodes("table") %>%
      html_table(fill = TRUE)
    html_frame <- tabs[[1]]
    knitr::kable(html_frame)

| title | isbn | author | publication date | pages |
|---|---|---|---|---|
| Fly Fishing the Mountain Lakes | 978-1585747740 | Gary Lafontaine | 05/01/03 | 192 |
| Tactics for Trout | 978-0811713399 | Rick Hafale, Dave Hughes, Skip Morris | 11/01/14 | 240 |
| Caddisflies | 978-0941130981 | Gary Lafontaine | 04/28/89 | 336 |

    # Assumption: fromJSON() here is jsonlite's, which can read directly from a URL
    library(jsonlite)

    url_json <- "https://raw.githubusercontent.com/bpoulin81/Data607/ac48db2df3ec30fb48cd2fa18effcc7b428b4f4a/books.json"
    json_frame <- fromJSON(url_json)
    knitr::kable(json_frame)

| title | isbn | authors | publication date | pages |
|---|---|---|---|---|
| Fly Fishing the Mountain Lakes | 978-1585747740 | Gary Lafontaine | 05/01/2003 | 192 |
| Tactics for Trout | 978-0811713399 | Rick Hafale, Dave Hughes, Skip Morris | 11/01/2014 | 240 |
| Caddisflies | 978-0941130981 | Gary Lafontaine | 04/28/1989 | 336 |
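The list-labelled authors mentioned earlier can be illustrated without downloading anything. The sketch below builds a hypothetical stand-in (`json_like`) for a parsed JSON frame whose authors column is a list, then collapses it to plain strings. It assumes the parser returns one character vector per book, which is not confirmed by the output shown here.

```r
# Hypothetical stand-in for a parsed JSON result: the authors column
# is a list holding one character vector per book.
json_like <- data.frame(
  title = c("Fly Fishing the Mountain Lakes", "Tactics for Trout", "Caddisflies"),
  stringsAsFactors = FALSE
)
json_like$authors <- list(
  "Gary Lafontaine",
  c("Rick Hafale", "Dave Hughes", "Skip Morris"),
  "Gary Lafontaine"
)

# Collapse each list element into one comma-separated string so the
# column becomes an ordinary character vector, as in the other tables.
json_flat <- json_like
json_flat$authors <- vapply(json_like$authors, paste, character(1), collapse = ", ")

str(json_flat)
```

Collapsing makes the frame rectangular, at the cost of having to split the string again if the individual authors are needed later.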

    # XML from GitHub
    # getURL() comes from RCurl; xmlParse(), getNodeSet(), and xmlToDataFrame() come from XML
    library(RCurl)
    library(XML)

    url_xml <- getURL("https://raw.githubusercontent.com/bpoulin-CUNY/Data607/master/books.xml", ssl.verifyPeer = FALSE)
    doc <- xmlParse(url_xml)
    xml_frame <- xmlToDataFrame(nodes = getNodeSet(doc, "//row"), stringsAsFactors = FALSE)
    knitr::kable(xml_frame)

| title | isbn | author | publication_date | pages |
|---|---|---|---|---|
| Fly Fishing the Mountain Lakes | 978-1585747740 | Gary Lafontaine | 05/01/2003 | 192 |
| Tactics for Trout | 978-0811713399 | Rick Hafale, Dave Hughes, Skip Morris | 11/01/2014 | 240 |
| Caddisflies | 978-0941130981 | Gary Lafontaine | 04/28/1989 | 336 |
For the most part the tables look the same. The one major exception is the name of the publication date column: in the HTML and JSON tables it appears as two separate words, in the CSV table read.csv() has replaced the space with a period, and in XML it is a single underscored string. The XML case is not a presentation or processing issue. XML element names cannot contain spaces, so the only way I could get the XML parsers and the validator I used to accept that field name was to write it as a single unbroken string.
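One possible way to reconcile the column names is a small renaming helper. The sketch below is an assumption about how that could be done, not something used in this assignment; `normalize_names()` is a hypothetical function, and the name vectors are copied from the table headers above.

```r
# Column names as they appear in the four tables above.
csv_names  <- c("title", "isbn", "author",  "publication.date", "pages")
html_names <- c("title", "isbn", "author",  "publication date", "pages")
json_names <- c("title", "isbn", "authors", "publication date", "pages")
xml_names  <- c("title", "isbn", "author",  "publication_date", "pages")

# Hypothetical helper: lower-case each name and turn any run of spaces,
# periods, or other non-alphanumeric characters into a single underscore.
normalize_names <- function(x) {
  gsub("[^a-z0-9]+", "_", tolower(x))
}

print(normalize_names(html_names))
```

Applied to a real frame, this would look like `names(html_frame) <- normalize_names(names(html_frame))`. The author/authors difference in the JSON table is a separate mismatch that this helper does not touch.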
In general, a naming difference like that is relatively simple to deal with in the final formatting of a data table. The most difficult decision was how to handle the multiple authors. I chose to merge them into a single field because of what you asked for and because I had no specific end use that I could better tailor the data's structure to. I made it rectangular.
I can see where each method would have its advantages if you were trying to structure data frames based on tags. But in general, XML and JSON seem like they would be more constrained and consistent formats for sharing data than a CSV, provided the data set was not so large that carrying the tags became a burden on the system (CSV is compact). On the other hand, CSV and HTML are easier to write quickly and more forgiving.
They can all be queried the same way, with df$field[position] returning the same results. I think it is a matter of situation: what do you find on your scraping journey, and what is the easiest way to get the data formed and ready for show?
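To give a rough feel for the tag overhead, the sketch below writes one record by hand in each format and counts characters. The layouts, tag names, and key names are hypothetical and are not the files used in this assignment.

```r
# One record written by hand in three formats (hypothetical layouts;
# the tag and key names are assumptions, not the assignment's files).
csv_text  <- "title,isbn,pages\nCaddisflies,978-0941130981,336"
json_text <- '[{"title":"Caddisflies","isbn":"978-0941130981","pages":336}]'
xml_text  <- paste0(
  "<books><row><title>Caddisflies</title>",
  "<isbn>978-0941130981</isbn><pages>336</pages></row></books>"
)

# Character counts for the same record in each format.
sizes <- c(csv = nchar(csv_text), json = nchar(json_text), xml = nchar(xml_text))
print(sizes)
```

CSV states the field names once in the header, while JSON and XML repeat them for every record, so the gap would be expected to widen as rows are added.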
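The df$field[position] lookup can be tried on two small stand-in frames. `html_like` and `xml_like` are hypothetical, with values retyped from the tables above. The sketch assumes the XML route yields character columns while the HTML route yields numbers for pages; that is an assumption about the parsers, not something shown in the output here.

```r
# Hypothetical stand-ins for the HTML and XML results.
html_like <- data.frame(
  title = c("Fly Fishing the Mountain Lakes", "Tactics for Trout", "Caddisflies"),
  pages = c(192, 240, 336),
  stringsAsFactors = FALSE
)
xml_like <- data.frame(
  title = html_like$title,
  pages = c("192", "240", "336"),  # assumed to arrive as text
  stringsAsFactors = FALSE
)

# The same df$field[position] lookup works on both frames.
print(html_like$title[2])
print(xml_like$title[2])

# The values agree once the assumed text column is converted to numbers.
print(html_like$pages[2] == as.numeric(xml_like$pages[2]))
```

If the column types do differ between sources, a conversion step like `as.numeric()` may be needed before the frames can be compared or combined.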