JSON, XML, HTML Data Retrieval

Loading up Tools

library(jsonlite)

## 
## Attaching package: 'jsonlite'
## 
## The following object is masked from 'package:utils':
## 
##     View

library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(XML)

Get the JSON

books_JSON <- data.frame(fromJSON("https://raw.githubusercontent.com/pm0kjp/IS607/master/books.json"))
books_JSON

##                                                               book.title
## 1 Freakonomics: A Rogue Economist Explores the Hidden Side of Everything
## 2         The Tipping Point: How Little Things Can Make a Big Difference
## 3                                                Thinking, Fast and Slow
##                        book.author book.firstPubDate book.amazonVersions
## 1 Stephen J. Dubner, Steven Levitt              2005                  37
## 2                 Malcolm Gladwell              2000                  67
## 3                  Daniel Kahneman              2011                  24
##   book.amazonStars
## 1              4.0
## 2              4.2
## 3              4.4

Notice that in the JSON version, every column name carries a “book.” prefix, presumably because data.frame() prepends the name of the top-level element that fromJSON() returns. We won’t see that in the XML. Additionally, both authors of the two-author book are comma-separated inside the single book.author column. The XML will be a bit different in this regard!
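If the prefix or the combined author cell gets in the way, both can be handled right after parsing. The following is only a sketch: the inline JSON text is hypothetical (I am assuming a top-level `book` key holding an array of books, with `author` as an array), so the real books.json may need slightly different handling.

```r
library(jsonlite)

# Hypothetical JSON text; the real books.json may be structured differently.
json_text <- '{"book": [
  {"title": "Freakonomics", "author": ["Stephen J. Dubner", "Steven Levitt"]},
  {"title": "Thinking, Fast and Slow", "author": ["Daniel Kahneman"]}
]}'

parsed <- fromJSON(json_text)

# Taking the "book" element directly (instead of wrapping the whole
# list in data.frame()) avoids the "book." prefix on the column names.
books <- parsed$book

# Authors arrive as a list of character vectors here; collapse each
# one into a single comma-separated string.
books$author <- vapply(books$author, paste, character(1), collapse = ", ")

names(books)
books$author
```

Pulling out the element by name keeps the column names short, and collapsing the authors explicitly makes it obvious that the column now holds plain strings.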

Get the XML

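# Download the XML file, then parse it into a nested list (one element per book)
download.file("https://raw.githubusercontent.com/pm0kjp/IS607/master/books.xml", "books.xml", method = "curl")
books_XML <- xmlToList(xmlParse("books.xml"))
# Turn each book into a one-row data frame and stack them; missing fields become NA
books_XML <- data.frame(do.call(bind_rows, lapply(books_XML, data.frame, stringsAsFactors = FALSE)))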
books_XML

##                                                                    title
## 1 Freakonomics: A Rogue Economist Explores the Hidden Side of Everything
## 2         The Tipping Point: How Little Things Can Make a Big Difference
## 3                                                Thinking, Fast and Slow
##              author      author.1 firstPubDate amazonVersions amazonStars
## 1 Stephen J. Dubner Steven Levitt         2005             37         4.0
## 2  Malcolm Gladwell          <NA>         2000             67         4.2
## 3   Daniel Kahneman          <NA>         2011             24         4.4

OK, in the XML, the column names are shorter (and better!), with the possible exception of author.1, which we could easily rename. Keeping the two authors in separate columns is arguably tidier, so I like this one better. The <NA> entries simply mark books that have no second author.
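Renaming that column takes one line. The sketch below works on a small hand-built stand-in for books_XML (shortened titles, fewer columns), and the new name `author2` is just my choice. It also converts the numeric-looking columns, since values pulled out of XML text come through as character strings.

```r
# Hand-built stand-in for books_XML; values are strings, as they are
# after xmlToList() and data.frame(..., stringsAsFactors = FALSE).
books_xml <- data.frame(
  title        = c("Freakonomics", "The Tipping Point"),
  author       = c("Stephen J. Dubner", "Malcolm Gladwell"),
  author.1     = c("Steven Levitt", NA),
  firstPubDate = c("2005", "2000"),
  amazonStars  = c("4.0", "4.2"),
  stringsAsFactors = FALSE
)

# Rename author.1 (the name "author2" is an arbitrary choice).
names(books_xml)[names(books_xml) == "author.1"] <- "author2"

# Convert the numeric-looking text columns to real numbers.
books_xml$firstPubDate <- as.integer(books_xml$firstPubDate)
books_xml$amazonStars  <- as.numeric(books_xml$amazonStars)

str(books_xml)
```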

Get the HTML

download.file("https://raw.githubusercontent.com/pm0kjp/IS607/master/books.html", "books.html", method = "curl")
# readHTMLTable() returns a list with one data frame per <table> in the page
books_HTML <- readHTMLTable("books.html", header = TRUE, as.data.frame = TRUE)
books_HTML

## $`NULL`
##                                                                    Title
## 1 Freakonomics: A Rogue Economist Explores the Hidden Side of Everything
## 2         The Tipping Point: How Little Things Can Make a Big Difference
## 3                                                Thinking, Fast and Slow
##                             Author(s) Original Publication Date
## 1 Stephen J. Dubner and Steven Levitt                      2005
## 2                    Malcolm Gladwell                      2000
## 3                     Daniel Kahneman                      2011
##   Number of Formats Available on Amazon
## 1                                    37
## 2                                    67
## 3                                    24
##   Amazon Average Review (stars out of 5)
## 1                                    4.0
## 2                                    4.2
## 3                                    4.4

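The HTML is probably the most self-explanatory, since it’s a friendly, people-facing table with long-form column names (but with spaces, ugh!). Note that readHTMLTable() returns a list of tables rather than a single data frame, which is why the output starts with $`NULL`. It also includes both authors in one column, since that’s how I set up the HTML table.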
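To work with the table itself, it has to be pulled out of that list, and the column names are easier to type once the spaces are gone. Here is a rough sketch on a hand-built stand-in for books_HTML (fewer columns and rows). The `clean_names()` helper is something I made up for illustration, not a function from the XML package.

```r
# Hand-built stand-in for the list that readHTMLTable() returned above.
books_html <- list(
  `NULL` = data.frame(
    `Title`     = c("Freakonomics", "Thinking, Fast and Slow"),
    `Author(s)` = c("Stephen J. Dubner and Steven Levitt", "Daniel Kahneman"),
    `Original Publication Date` = c("2005", "2011"),
    check.names = FALSE,
    stringsAsFactors = FALSE
  )
)

# Pull the first (and only) table out of the list by position.
books <- books_html[[1]]

# Hypothetical helper: lower-case the names and replace runs of
# non-alphanumeric characters with a single underscore.
clean_names <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_+|_+$", "", x)
}
names(books) <- clean_names(names(books))

# Split the combined author cell back into individual authors.
authors <- strsplit(books$author_s, " and ", fixed = TRUE)

names(books)
authors[[1]]
```

Indexing by position sidesteps the awkward `NULL` element name, and splitting on " and " recovers the separate authors that the XML version gave us for free.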

The takeaway

Each data format has its own strengths and weaknesses, and dealing with multiple occurrences of a single element type within a nested structure (like our two authors) can be tricky. But it’s definitely doable!