Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.’
Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.
Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical? Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
First we need to load the ‘jsonlite’ library to load the json file into a data frame. I initially tried ‘rjson’ but it put the data into a data frame of 1 row.
library('jsonlite')
Now we need to use fromJSON to load the data into a data frame:
book_df1 <- fromJSON('books.json', simplifyDataFrame = TRUE)
book_df1
## AuthorFirstName AuthorLastName Title
## 1 Satya Nadella Hit Refresh
## 2 Robert Lagdon The DA Vinci Code
## 3 J.K. Rowling Harry Potter and the Philosopher's stone
## Year Subject BookType
## 1 2017 Autobiography Hardcover
## 2 2009 Fiction Paperback
## 3 2015 Autobiography Kindle
I have tried this with both second author data for each book with the first two books empty, or only second author data for the final book. It does not matter, as fromRJSON() fills in ‘NA’ for the missing data.
We will use the readHTMLtable() function of the XML package.
library('XML')
## Warning: package 'XML' was built under R version 3.4.4
We simply call the function with the appropriate file called in the readHTMLTable() function.
book_df2 <- readHTMLTable('books.html')
book_df2
## $`NULL`
## AuthorFirstName AuthorLastName Title
## 1 Satya Nadella Hit Refresh
## 2 Robert Lagdon The DA Vinci Code
## 3 J.K. Rowling Harry Potter and the Philosopher's stone
## Year Subject BookType
## 1 2017 Autobiography Hardcover
## 2 2009 Fiction Paperback
## 3 2015 Fiction Kindle Version
Note that html syntax is different than json syntax in one very important way; for html you need to have your table headers called first. Because of this I had blank second author data for the first two books. For the json file you could simply not put those data at all and fromJSON() would add the appropriate columns with the missing data marked ‘NA’.
We can use the XML library again this time we use the xmlToDataFrame() function.
book_df3 <- xmlToDataFrame('books.xml')
book_df3
## AuthorFirstName AuthorLastName Title
## 1 Satya Nadella Hit Refresh
## 2 Robert Lagdon The DA Vinci Code
## 3 J.K. Rowling Harry Potter and the Philosopher's stone
## Year subject BookType
## 1 2017 Autobiography Hardcover
## 2 2009 Fiction Paperback
## 3 2015 Fiction Kindle Version
Much like the JSON file, in this case not entering the author2 data for the first two books was not a problem for xmlToDataFrame() as it marked the missing data as ‘NA’.