library(jsonlite)
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:utils':
##
## View
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(XML)
books_JSON<-data.frame(fromJSON("https://raw.githubusercontent.com/pm0kjp/IS607/master/books.json"))
books_JSON
## book.title
## 1 Freakonomics: A Rogue Economist Explores the Hidden Side of Everything
## 2 The Tipping Point: How Little Things Can Make a Big Difference
## 3 Thinking, Fast and Slow
## book.author book.firstPubDate book.amazonVersions
## 1 Stephen J. Dubner, Steven Levitt 2005 37
## 2 Malcolm Gladwell 2000 67
## 3 Daniel Kahneman 2011 24
## book.amazonStars
## 1 4.0
## 2 4.2
## 3 4.4
Notice that every column name in the JSON result carries a “book.” prefix. We won’t see that in the XML. Additionally, the two authors of the co-authored book are comma-separated within the single book.author column. The XML handles this differently!
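The “book.” prefix most likely comes from wrapping the whole parsed object in data.frame(). The sketch below assumes the JSON nests its records under a top-level "book" key; that is a guess based on the column names, since the file itself isn't shown here. The inline JSON string and the names json_text, parsed and books are made up for illustration.

```r
library(jsonlite)

# Hypothetical stand-in for books.json (assumed shape: records under a "book" key)
json_text <- '{"book": [
  {"title": "Thinking, Fast and Slow", "author": "Daniel Kahneman", "firstPubDate": 2011},
  {"title": "The Tipping Point", "author": "Malcolm Gladwell", "firstPubDate": 2000}
]}'

parsed <- fromJSON(json_text)

# Wrapping the whole list prefixes each column with the element name
prefixed_names <- names(data.frame(parsed))
prefixed_names
# "book.title" "book.author" "book.firstPubDate"

# Pulling out the element first keeps the plain names
books <- parsed$book
names(books)
# "title" "author" "firstPubDate"
```

If the real file has that shape, fromJSON(...)$book would give the shorter names directly.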
download.file("https://raw.githubusercontent.com/pm0kjp/IS607/master/books.xml","books.xml", method="curl")
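# Parse the XML file into a nested list with one element per book
books_XML <- xmlToList(xmlParse("books.xml"))
# Turn each book into a one-row data frame, then stack them; bind_rows fills missing columns with NA
books_XML <- data.frame(do.call(bind_rows, lapply(books_XML, data.frame, stringsAsFactors = FALSE)))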
books_XML
## title
## 1 Freakonomics: A Rogue Economist Explores the Hidden Side of Everything
## 2 The Tipping Point: How Little Things Can Make a Big Difference
## 3 Thinking, Fast and Slow
## author author.1 firstPubDate amazonVersions amazonStars
## 1 Stephen J. Dubner Steven Levitt 2005 37 4.0
## 2 Malcolm Gladwell <NA> 2000 67 4.2
## 3 Daniel Kahneman <NA> 2011 24 4.4
OK, in the XML, the column names are shorter (and better!), with the possible exception of author.1, which is easy to rename. Keeping the two authors in separate columns is arguably tidier, so I like this version better.
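As a rough sketch of that renaming, here is one way it could be done with dplyr's rename(). The small books_demo data frame is a hand-built stand-in for books_XML (with shortened titles), and the new names author_1 and author_2 are only a suggestion. Since xmlToList returns every value as text, the sketch also converts the year to an integer.

```r
library(dplyr)

# Hand-built stand-in for part of books_XML (all columns are character, as from XML)
books_demo <- data.frame(
  title = c("Freakonomics", "The Tipping Point"),
  author = c("Stephen J. Dubner", "Malcolm Gladwell"),
  author.1 = c("Steven Levitt", NA),
  firstPubDate = c("2005", "2000"),
  stringsAsFactors = FALSE
)

# rename() takes new_name = old_name pairs
books_demo <- rename(books_demo, author_1 = author, author_2 = author.1)

# Convert the year from text to a number
books_demo <- mutate(books_demo, firstPubDate = as.integer(firstPubDate))

names(books_demo)
```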
download.file("https://raw.githubusercontent.com/pm0kjp/IS607/master/books.html","books.html", method="curl")
books_HTML<-readHTMLTable("books.html", header=TRUE, as.data.frame = TRUE)
books_HTML
## $`NULL`
## Title
## 1 Freakonomics: A Rogue Economist Explores the Hidden Side of Everything
## 2 The Tipping Point: How Little Things Can Make a Big Difference
## 3 Thinking, Fast and Slow
## Author(s) Original Publication Date
## 1 Stephen J. Dubner and Steven Levitt 2005
## 2 Malcolm Gladwell 2000
## 3 Daniel Kahneman 2011
## Number of Formats Available on Amazon
## 1 37
## 2 67
## 3 24
## Amazon Average Review (stars out of 5)
## 1 4.0
## 2 4.2
## 3 4.4
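The HTML is possibly the most self-explanatory, since it’s a friendly, people-facing table with long-form column names (but with spaces, ugh!). It also includes both authors in one column, since that’s how I set up the HTML table. Note that readHTMLTable returns a list of tables rather than a single data frame, which is why the output starts with $`NULL`.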
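To work with the HTML result as a regular data frame, one option is to pull the first table out of the list and make the column names syntactic. The sketch below parses a small inline HTML string that stands in for books.html (trimmed to three columns, so it is not the real file); html_text, tables and books_tbl are illustrative names.

```r
library(XML)

# Hypothetical stand-in for books.html, trimmed to three columns and one row
html_text <- '<html><body><table>
  <tr><th>Title</th><th>Author(s)</th><th>Original Publication Date</th></tr>
  <tr><td>Thinking, Fast and Slow</td><td>Daniel Kahneman</td><td>2011</td></tr>
</table></body></html>'

doc <- htmlParse(html_text, asText = TRUE)

# readHTMLTable returns a list with one element per <table> in the document
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# Take the first (here, only) table as a data frame
books_tbl <- tables[[1]]

# Replace spaces and parentheses with dots so the names work with $ access
names(books_tbl) <- make.names(names(books_tbl))
names(books_tbl)
```

Passing which = 1 to readHTMLTable should return that table directly instead of a list.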
Each data structure has its own strengths and weaknesses, and dealing with multiple occurrences of a single element type within a nested structure can be tricky. But it’s doable!