Assignment

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.

Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

Solution

Load Libraries

library(jsonlite)
library(xml2)
library(rvest)

Read Files

JSON

The package I will use for reading JSON files is jsonlite. Its fromJSON function reads a JSON array of records directly into a data.frame. However, the column order follows the order in which the fields first appear, and the first entry in this case is not fully populated, so I will have to reorder the columns for later comparisons. I will also convert the years to integers.

# JSON
BooksJ <- fromJSON("https://raw.githubusercontent.com/aadler/CUNY_MSDS_F2019_DATA607/master/HW%207/Books.json", flatten = TRUE)
# Since first entry has the NA for Author2, will reorder for later comparison
BooksJ <- BooksJ[, c(1, 2, 5, 3, 4)]
BooksJ$Year <- as.integer(BooksJ$Year)
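
The positional reorder above works only if you already know where Author2 ended up. Here is a minimal sketch of the same step with columns selected by name. The inline JSON string and the Sample* names are made up for illustration; the real Books.json may be laid out differently.

```r
library(jsonlite)

# Hypothetical two-record sample; the first record has no Author2 key.
SampleJSON <- '[
  {"Name": "Book A", "Author1": "First, A.", "Year": "1986", "ISBN": "000"},
  {"Name": "Book B", "Author1": "First, B.", "Author2": "Second, B.",
   "Year": "2002", "ISBN": "111"}
]'
SampleJ <- fromJSON(SampleJSON, flatten = TRUE)

# Select columns by name so the result does not depend on the order in which
# the keys were first encountered.
ColOrder <- c("Name", "Author1", "Author2", "Year", "ISBN")
SampleJ <- SampleJ[, ColOrder]
SampleJ$Year <- as.integer(SampleJ$Year)
```

Selecting by name also fails loudly with an "undefined columns" error if a field is missing altogether, which a positional index would not do.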

XML

The package I will use for reading XML files is xml2. This reads both XML and HTML files. For XML files, it reads them into XML objects, so they need to be parsed to be put into a data.frame. I will make the assumption that the books have only one sub-level of information and that at least one entry is completely filled out—which I happen to know to be true in this case. A more general parser would need more logic, but that is outside the scope of this exercise. Each “Book” in the XML object will be traversed and the elements will populate the data.frame.

RawXML <- read_xml("https://raw.githubusercontent.com/aadler/CUNY_MSDS_F2019_DATA607/master/HW%207/Books.xml")
# Number of first-order children (Books)
m <- xml_length(RawXML)
# Number of second-order children (Book information) for each node
nodesize <- xml_length(xml_children(RawXML))
# Largest, for loop purposes
n <- max(nodesize)
# Find a maximally-filled entry, assuming there is one, which is safe here.
npos <- which.max(nodesize)
# Take names from entry sure to have all names
Names <- xml_name(xml_children(xml_child(RawXML, npos)))
# Pre-allocate data frame
BooksX <- setNames(data.frame(
  matrix(ncol = n, nrow = m),
  stringsAsFactors = FALSE),
  Names)
# Filling loop: for each first-level child (book), look up each name in Names
# and store that sub-node's text in the matching cell of the data.frame.
# xml_find_first() returns a missing node when a book lacks a field, and
# xml_text() turns a missing node into NA.
for (i in seq_len(m)) {
  Book <- xml_child(RawXML, i)
  for (j in Names) {
    BooksX[i, j] <- xml_text(xml_find_first(Book, paste0(".//", j)))
  }
}
BooksX$Year <- as.integer(BooksX$Year)
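
For comparison, here is a sketch of the same parse without the explicit double loop. It relies on xml_find_first() returning one result per input node, with a missing node (whose text is NA) wherever a field is absent. The inline XML and the Sample*, Books, Fields and Cols names are hypothetical, and the sketch still assumes a single level of fields under each book.

```r
library(xml2)

# Hypothetical sample; the first Book has no Author2 element.
SampleXML <- read_xml(
  "<Books>
     <Book><Name>Book A</Name><Author1>First, A.</Author1><Year>1986</Year></Book>
     <Book><Name>Book B</Name><Author1>First, B.</Author1>
           <Author2>Second, B.</Author2><Year>2002</Year></Book>
   </Books>")
Books <- xml_children(SampleXML)

# Field names from the fullest entry, as in the loop version
Fields <- xml_name(xml_children(Books[[which.max(xml_length(Books))]]))

# One character vector per field, with one element per Book; a Book that
# lacks the field contributes NA.
Cols <- lapply(Fields, function(f) {
  xml_text(xml_find_first(Books, paste0("./", f)))
})
SampleX <- as.data.frame(setNames(Cols, Fields), stringsAsFactors = FALSE)
SampleX$Year <- as.integer(SampleX$Year)
```

Building whole columns at once avoids pre-allocating the data frame, but it does the same thing as the loop.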

HTML

The package I will use for reading HTML files is still xml2. However, the rvest package is optimized for processing HTML files and will make converting the HTML table data into a data.frame much easier.

RawHTML <- read_html("https://raw.githubusercontent.com/aadler/CUNY_MSDS_F2019_DATA607/master/HW%207/Books.html")
# fill = TRUE pads rows that have fewer cells than the widest row with NA
BooksH <- html_table(html_node(RawHTML, "table"), fill = TRUE)
BooksH$Year <- as.integer(BooksH$Year)
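
To see how an empty table cell comes through in your installed rvest version, it can help to parse a tiny table first. This is a sketch on a made-up inline table, not the real Books.html. The result is wrapped in as.data.frame() because newer rvest releases return a tibble.

```r
library(xml2)
library(rvest)

# Hypothetical sample table; the Author2 cell in the first row is empty.
SampleHTML <- read_html(
  "<table>
     <tr><th>Name</th><th>Author2</th><th>Year</th></tr>
     <tr><td>Book A</td><td></td><td>1986</td></tr>
     <tr><td>Book B</td><td>Second, B.</td><td>2002</td></tr>
   </table>")
SampleH <- as.data.frame(html_table(html_node(SampleHTML, "table")))

# Inspect the column types and how the empty cell was represented
str(SampleH)
```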

Comparison

BooksJ
##                                                                                   Name
## 1                                                           Probability and Statistics
## 2 Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach
## 3   Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables
##               Author1            Author2 Year              ISBN
## 1  DeGroot, Morris H.               <NA> 1986     0-201-11366-X
## 2 Burnham, Kenneth P. Anderson, David R. 2002 978-0-387-95364-9
## 3  Abramowitz, Milton   Stegun, Irene A. 1970     0-4866-1272-4
BooksX
##                                                                                   Name
## 1                                                           Probability and Statistics
## 2 Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach
## 3   Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables
##               Author1            Author2 Year              ISBN
## 1  DeGroot, Morris H.               <NA> 1986     0-201-11366-X
## 2 Burnham, Kenneth P. Anderson, David R. 2002 978-0-387-95364-9
## 3  Abramowitz, Milton   Stegun, Irene A. 1970     0-4866-1272-4
BooksH
##                                                                                   Name
## 1                                                           Probability and Statistics
## 2 Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach
## 3   Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables
##               Author1            Author2 Year              ISBN
## 1  DeGroot, Morris H.                    1986     0-201-11366-X
## 2 Burnham, Kenneth P. Anderson, David R. 2002 978-0-387-95364-9
## 3  Abramowitz, Milton   Stegun, Irene A. 1970     0-4866-1272-4

Printed out, the data frames look almost exactly alike, except that the HTML parsing left the missing field as an empty string rather than NA, despite my passing fill = TRUE. That option only pads rows that are short of cells; presumably the cell here is present but empty, so it is left alone. I could manually set the two NAs in the other data frames to blank, but in my opinion it is preferable to have NAs for missing data, so that the value is known to have been absent from the initial import. I could just set BooksH$Author2[1] <- NA, but where is the fun in that? So I’ll take a slightly crazier approach that converts every empty cell to NA. For the more sane, there is a dplyr command for this, but I’m using neither dplyr nor data.table for this exercise.

# Find empty cells (nothing between string start (^) and string end ($)) and
# replace them with NA. apply() returns a character matrix, so convert it back
# to a data frame, remembering that data frames historically defaulted to
# turning character columns into factors (hence stringsAsFactors = FALSE).
BooksH <- as.data.frame(apply(BooksH, 2, function(x) gsub("^$", NA, x)),
                       stringsAsFactors = FALSE)
# Have to convert back to integers again
BooksH$Year <- as.integer(BooksH$Year)
identical(BooksJ, BooksX)
## [1] TRUE
identical(BooksJ, BooksH)
## [1] TRUE
identical(BooksX, BooksH)
## [1] TRUE

Now that the empty cell has been converted to NA, the data frames are identical.
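
The apply() route above coerces every column to character, which is why Year has to be converted back to integer. A base R alternative, sketched here on two small made-up data frames (ImportH and ImportJ are hypothetical stand-ins for the HTML and JSON imports), replaces only the blank cells and leaves the other column types untouched:

```r
# Hypothetical stand-ins for an HTML import and a JSON import of the same data
ImportH <- data.frame(Author2 = c("", "Second, B."), Year = c(1986L, 2002L),
                      stringsAsFactors = FALSE)
ImportJ <- data.frame(Author2 = c(NA, "Second, B."), Year = c(1986L, 2002L),
                      stringsAsFactors = FALSE)

# TRUE only for cells that are present and equal to the empty string
IsBlank <- !is.na(ImportH) & ImportH == ""
ImportH[IsBlank] <- NA

# Year is still an integer column, so no second conversion is needed
identical(ImportH, ImportJ)
```

If identical() ever returns FALSE, all.equal(ImportH, ImportJ) reports which component differs, which is easier than comparing printed output by eye.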