Web API Formats

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.

JSON Format

First, we load the ‘jsonlite’ library so that the JSON file can be read into a data frame. I initially tried ‘rjson’, but it put the data into a data frame with only one row.

library('jsonlite')
## Warning: package 'jsonlite' was built under R version 3.4.1

Next, we use fromJSON() to load the data into a data frame:

book_df1 <- fromJSON('books.json', simplifyDataFrame = TRUE)
book_df1
##   firstName1 middleInt1 lastName1
## 1  Nathaniel         J.    Cooper
## 2     Julian         H.    Krolik
## 3     George         B.   Rybicki
##                                                                             Title
## 1                                           The Kilo-Parsec Properties of Blazars
## 2 Active Galactic Nuclei: From the Central Black Hole to the Galactic Environment
## 3                                             Radiative Processes in Astrophysics
##   Year                                         Subject
## 1 2010 Astrophysics - Active Galactic Nuclei - Blazars
## 2 1999 Astrophysics - Active Galactic Nuclei - General
## 3 2004                        Astrophysics - Radiation
##   Shameless Self Promotion firstName2 middleInt2 lastName2
## 1                     TRUE       <NA>       <NA>      <NA>
## 2                    FALSE       <NA>       <NA>      <NA>
## 3                    FALSE       Alan         P.  Lightman

I tried this two ways: with second-author entries for every book (left empty for the first two), and with second-author entries only for the final book. It made no difference, as fromJSON() fills in ‘NA’ for the missing data.
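The NA-filling behavior can be checked without the original file. The following is a minimal sketch, assuming a JSON layout of one object per book inside an array; the actual layout of books.json is not shown here, so the structure and the name sketch_df are illustrative only.

```r
library(jsonlite)

# Assumed layout: an array with one object per book.
# The second book carries lastName2; the first omits the key entirely.
json_txt <- '[
  {"lastName1": "Cooper",  "Title": "The Kilo-Parsec Properties of Blazars"},
  {"lastName1": "Rybicki", "Title": "Radiative Processes in Astrophysics",
   "lastName2": "Lightman"}
]'

# fromJSON() accepts a JSON string as well as a file path.
sketch_df <- fromJSON(json_txt)

# The key omitted from the first object shows up as NA in that row.
is.na(sketch_df$lastName2)
```

Because the column set is the union of the keys found across all objects, a key only needs to appear in the records that actually have a value for it.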

HTML Format

We will use the readHTMLTable() function from the XML package.

library('XML')
## Warning: package 'XML' was built under R version 3.4.1

We simply call readHTMLTable() with the name of the HTML file.

book_df2 <- readHTMLTable('books.html')
book_df2
## $`NULL`
##   firstName1 middleInt1 lastName1 firstName2 middleInt2 lastName2
## 1  Nathaniel         J.    Cooper                                
## 2     Julian         H.    Krolik                                
## 3     George         B.   Rybicki       Alan         P.  Lightman
##                                                                             Title
## 1                                           The Kilo-Parsec Properties of Blazars
## 2 Active Galactic Nuclei: From the Central Black Hole to the Galactic Environment
## 3                                             Radiative Processes in Astrophysics
##   Year                                         Subject
## 1 2010 Astrophysics - Active Galactic Nuclei - Blazars
## 2 1999 Astrophysics - Active Galactic Nuclei - General
## 3 2004                        Astrophysics - Radiation
##   Shameless Self Promotion
## 1                     TRUE
## 2                    FALSE
## 3                    FALSE

Note that HTML syntax differs from JSON syntax in one very important way: in an HTML table, the headers must be declared first, so every row needs a cell for every column. Because of this, I left the second author cells blank for the first two books. In the JSON file, you can simply omit those fields, and fromJSON() adds the appropriate columns with the missing data marked ‘NA’.
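The $`NULL` line at the top of the printed output shows that readHTMLTable() returned a list of tables rather than a single data frame. A minimal sketch of pulling out the data frame itself, using a small inline table in place of books.html (the markup and the names html_txt, tables, and book_tbl are assumptions for illustration):

```r
library(XML)

# Assumed markup: a header row of <th> cells, then one <tr> per book.
# The empty <td></td> stands in for the missing second author.
html_txt <- '<table>
  <tr><th>lastName1</th><th>lastName2</th><th>Title</th></tr>
  <tr><td>Cooper</td><td></td><td>The Kilo-Parsec Properties of Blazars</td></tr>
  <tr><td>Rybicki</td><td>Lightman</td><td>Radiative Processes in Astrophysics</td></tr>
</table>'

# Parse the string as HTML, then read every <table> in the document.
doc <- htmlParse(html_txt, asText = TRUE)
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# The result is a list with one element per table; take the first.
book_tbl <- tables[[1]]
book_tbl
```

The same [[1]] indexing should apply to the result of readHTMLTable('books.html'), since that file contains a single table.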

XML Format

We can use the XML library again; this time we use the xmlToDataFrame() function.

book_df3 <- xmlToDataFrame('books.xml')
book_df3
##               author1
## 1 Nathaniel J. Cooper
## 2    Julian H. Krolik
## 3   George B. Rybicki
##                                                                             title
## 1                                           The Kilo-Parsec Properties of Blazars
## 2 Active Galactic Nuclei: From the Central Black Hole to the Galactic Environment
## 3                                             Radiative Processes in Astrophysics
##   year                                         subject
## 1 2010 Astrophysics - Active Galactic Nuclei - Blazars
## 2 1999 Astrophysics - Active Galactic Nuclei - General
## 3 2004                        Astrophysics - Radiation
##   ShamelessSelfPromotion          author2
## 1                   TRUE             <NA>
## 2                  FALSE             <NA>
## 3                  FALSE Alan P. Lightman

As with the JSON file, omitting the author2 data for the first two books was not a problem: xmlToDataFrame() marked the missing data as ‘NA’.
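A minimal sketch of the same behavior on an inline XML string. The child element names (author1, title, author2) follow the column names in the output above, but the root and record element names (books, book) are assumptions, since the contents of books.xml are not shown.

```r
library(XML)

# Assumed structure: a <books> root with one <book> record per title.
# Only the second record has an <author2> element.
xml_txt <- '<books>
  <book>
    <author1>Nathaniel J. Cooper</author1>
    <title>The Kilo-Parsec Properties of Blazars</title>
  </book>
  <book>
    <author1>George B. Rybicki</author1>
    <title>Radiative Processes in Astrophysics</title>
    <author2>Alan P. Lightman</author2>
  </book>
</books>'

# Parse the string explicitly, then flatten the records into rows.
doc <- xmlParse(xml_txt, asText = TRUE)
sketch_xml_df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# The record without <author2> gets NA in that column.
is.na(sketch_xml_df$author2)
```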