Working with XML and JSON in R

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.’

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.

Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical? Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

JSON Format

First we need to load the ‘jsonlite’ library to load the json file into a data frame. I initially tried ‘rjson’ but it put the data into a data frame of 1 row.

library('jsonlite')

Now we need to use fromJSON to load the data into a data frame:

book_df1 <- fromJSON('books.json', simplifyDataFrame = TRUE)
book_df1
##   AuthorFirstName AuthorLastName                                    Title
## 1           Satya        Nadella                              Hit Refresh
## 2          Robert         Lagdon                        The DA Vinci Code
## 3            J.K.        Rowling Harry Potter and the Philosopher's stone
##   Year       Subject  BookType
## 1 2017 Autobiography Hardcover
## 2 2009       Fiction Paperback
## 3 2015 Autobiography    Kindle

I have tried this with both second author data for each book with the first two books empty, or only second author data for the final book. It does not matter, as fromRJSON() fills in ‘NA’ for the missing data.

HTML format

We will use the readHTMLtable() function of the XML package.

library('XML')
## Warning: package 'XML' was built under R version 3.4.4

We simply call the function with the appropriate file called in the readHTMLTable() function.

book_df2 <- readHTMLTable('books.html')
book_df2
## $`NULL`
##   AuthorFirstName AuthorLastName                                    Title
## 1           Satya        Nadella                              Hit Refresh
## 2          Robert         Lagdon                        The DA Vinci Code
## 3            J.K.        Rowling Harry Potter and the Philosopher's stone
##   Year       Subject       BookType
## 1 2017 Autobiography      Hardcover
## 2 2009       Fiction      Paperback
## 3 2015       Fiction Kindle Version

Note that html syntax is different than json syntax in one very important way; for html you need to have your table headers called first. Because of this I had blank second author data for the first two books. For the json file you could simply not put those data at all and fromJSON() would add the appropriate columns with the missing data marked ‘NA’.

XML format

We can use the XML library again this time we use the xmlToDataFrame() function.

book_df3 <- xmlToDataFrame('books.xml')
book_df3
##   AuthorFirstName AuthorLastName                                    Title
## 1           Satya        Nadella                              Hit Refresh
## 2          Robert         Lagdon                        The DA Vinci Code
## 3            J.K.        Rowling Harry Potter and the Philosopher's stone
##   Year       subject       BookType
## 1 2017 Autobiography      Hardcover
## 2 2009       Fiction      Paperback
## 3 2015       Fiction Kindle Version

Much like the JSON file, in this case not entering the author2 data for the first two books was not a problem for xmlToDataFrame() as it marked the missing data as ‘NA’.