Author: Romerl Elizes

Load Libraries

library(stringr)
library(XML)
library(RCurl)
library(rlist)
library(RJSONIO)
library(tidyr)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(knitr)
library(kableExtra)

options(knitr.table.format = "html")

Part I. HTML

urlfile <- getURL("https://raw.githubusercontent.com/RommyGraphs/MSDA/master/DATA607/books.html",.opts = list(ssl.verifypeer = FALSE))

htmltable <- readHTMLTable(urlfile)
htmltable <- list.clean(htmltable, fun = is.null, recursive = FALSE)

HTMLdf <- htmltable[[1]]
class(HTMLdf)

## [1] "data.frame"

HTMLdf <- separate(HTMLdf,Author, c("Author1","Author2"), sep = "; ", remove = TRUE)
HTMLdf %>%
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%     
  scroll_box(width="100%",height="200px")

Name	Author1	Author2	Year	Publisher	Genre
Band of Brothers	Stephen Ambrose	NA	1992	Simon and Schuster	World War II History
The Killer Angels: A Novel of the Civil War	Michael Shaara	NA	1974	David McKay Publications	Civil War History
Original Dungeons and Dragons	Gary Gygax	Dave Arneson	1974	TSR, Inc.	Roleplaying Game

HTML Analysis: Reading the HTML was straightforward. readHTMLTable returns each table as a data.frame by default, and the cells of the HTML header row become the column names of the data frame. The only additional work I did was to use the tidyr separate function to split the Author column into Author1 and Author2.
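To make the author split concrete, here is a minimal sketch of separate on a small inline data frame. The data frame `books` is a hypothetical stand-in for the parsed table (two rows copied from the output above), and `fill = "right"` is an addition that is not in the original call: it pads single-author rows with NA without raising a warning.

```r
library(tidyr)

# Hypothetical stand-in for the parsed HTML table (values taken from the output above).
books <- data.frame(
  Name   = c("Band of Brothers", "Original Dungeons and Dragons"),
  Author = c("Stephen Ambrose", "Gary Gygax; Dave Arneson"),
  stringsAsFactors = FALSE
)

# Split Author on "; " into two columns. Rows with a single author get NA
# in Author2; fill = "right" says so explicitly and silences the warning.
split_books <- separate(books, Author, c("Author1", "Author2"),
                        sep = "; ", remove = TRUE, fill = "right")

print(split_books)
```

Because `remove = TRUE`, the original Author column is dropped and only Author1 and Author2 remain.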

Part II. XML

urlfile <- getURL("https://raw.githubusercontent.com/RommyGraphs/MSDA/master/DATA607/books.xml",.opts = list(ssl.verifypeer = FALSE))
booksXML <- xmlParse(urlfile)
class(booksXML)

## [1] "XMLInternalDocument" "XMLAbstractDocument"

root <- xmlRoot(booksXML)
xmlName(root)

## [1] "books"

root[[1]]

## <book>
##   <name>Band of Brothers</name>
##   <author>Stephen Ambrose</author>
##   <year>1992</year>
##   <publisher>Simon and Schuster</publisher>
##   <genre>World War II History</genre>
## </book>

XMLdf <- xmlToDataFrame(root)
XMLdf

##                                          name                   author
## 1                            Band of Brothers          Stephen Ambrose
## 2 The Killer Angels: A Novel of the Civil War           Michael Shaara
## 3               Original Dungeons and Dragons Gary Gygax; Dave Arneson
##   year                publisher                genre
## 1 1992       Simon and Schuster World War II History
## 2 1974 David McKay Publications    Civil War History
## 3 1974                TSR, Inc.     Roleplaying Game

names(XMLdf) <- c("Name","Author","Year", "Publisher", "Genre")

XMLdf <- separate(XMLdf,Author, c("Author1","Author2"), sep = "; ", remove = TRUE)
XMLdf %>%
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%     
  scroll_box(width="100%",height="200px")

Name	Author1	Author2	Year	Publisher	Genre
Band of Brothers	Stephen Ambrose	NA	1992	Simon and Schuster	World War II History
The Killer Angels: A Novel of the Civil War	Michael Shaara	NA	1974	David McKay Publications	Civil War History
Original Dungeons and Dragons	Gary Gygax	Dave Arneson	1974	TSR, Inc.	Roleplaying Game

XML Analysis: Following the textbook example, I confirmed that the XML was being read properly by using the xmlParse and xmlRoot functions to find the root. The xmlName of the root is books, and indexing the root with subscript 1 displays the contents of the first book. I used the xmlToDataFrame function to transform the XML contents into a data frame. The big difference between the resulting HTML and XML data frames is in the headers: the HTML table keeps the capitalized header row, while xmlToDataFrame takes its column names from the XML element names, which are all lowercase in this file. Because of this, I had to rename the columns to match the HTML data frame. Just like HTML, the only additional work I did was to use the tidyr separate function to split the Author column into Author1 and Author2.
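The point about column names can be checked on a small inline document. This is a sketch under assumptions: `xml_text` is a hypothetical, shortened copy of the books file (two books, three fields), and the capitalization step is one possible way to rename the columns, not the assignment's own `names(XMLdf) <- c(...)` call.

```r
library(XML)

# Hypothetical, shortened copy of books.xml; element names are lowercase.
xml_text <- "<books>
  <book><name>Band of Brothers</name><author>Stephen Ambrose</author><year>1992</year></book>
  <book><name>Original Dungeons and Dragons</name><author>Gary Gygax; Dave Arneson</author><year>1974</year></book>
</books>"

# Parse the string (asText = TRUE marks it as content, not a file name).
doc <- xmlParse(xml_text, asText = TRUE)

# One row per <book>; column names come from the child element names.
df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
raw_names <- names(df)
print(raw_names)

# Capitalize the first letter of each column name to match the HTML headers.
names(df) <- paste0(toupper(substring(raw_names, 1, 1)), substring(raw_names, 2))
print(df)
```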

Part III. JSON

urlfile <- "https://raw.githubusercontent.com/RommyGraphs/MSDA/master/DATA607/books.json"
isValidJSON(urlfile)

## [1] TRUE

booksJSON.json <- fromJSON(urlfile, nullValue = NA, simplify = FALSE)

test1 <- unlist(booksJSON.json, recursive = TRUE, use.names = TRUE)
test1[str_detect(names(test1), "name")]

##                                    books.name 
##                            "Band of Brothers" 
##                                    books.name 
## "The Killer Angels: A Novel of the Civil War" 
##                                    books.name 
##               "Original Dungeons and Dragons"

booksJSON.df <- do.call("rbind", lapply(booksJSON.json, data.frame, stringsAsFactors = FALSE))

booksJSON.df  %>%
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%     
  scroll_box(width="100%",height="200px")

	name	author	year	publisher	genre	name.1	author.1	year.1	publisher.1	genre.1	name.2	author.2	year.2	publisher.2	genre.2
books	Band of Brothers	Stephen Ambrose	1992	Simon and Schuster	World War II History	The Killer Angels: A Novel of the Civil War	Michael Shaara	1974	David McKay Publications	Civil War History	Original Dungeons and Dragons	Gary Gygax; Dave Arneson	1974	TSR, Inc.	Roleplaying Game

JSON Analysis 1: Following the textbook example, I used the isValidJSON function to confirm that the file at the given URL contains valid JSON. Next, to check that I was reading the file properly, I used the unlist and str_detect steps from the text to verify that the book names are listed as expected.
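The unlist and str_detect check can be tried without the JSON file. In this sketch, `parsed` is a hypothetical hand-built list that only mimics the shape of the parsed JSON (a top-level books element holding one list per book), and the book values are shortened placeholders.

```r
library(stringr)

# Hypothetical stand-in for the parsed JSON: a "books" element with one list per book.
parsed <- list(
  books = list(
    list(name = "Band of Brothers", year = "1992"),
    list(name = "Original Dungeons and Dragons", year = "1974")
  )
)

# Flatten the nested list into a named character vector.
flat <- unlist(parsed, recursive = TRUE, use.names = TRUE)

# Keep only the entries whose name contains "name", i.e. the book titles.
book_names <- flat[str_detect(names(flat), "name")]
print(book_names)
```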

The big problem with the textbook approach was that it did not keep the fields separate for each book. Applying do.call and lapply directly to the parsed JSON spread every field across a single row: the name field appeared as name, name.1, and name.2 instead of once as a column header, and the same happened for every other field. The result was 15 columns and 1 row instead of 5 columns and 3 rows of data. Unacceptable!
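The shape problem can be reproduced in base R with a small hand-built list. `parsed` below is a hypothetical stand-in for the parsed JSON, cut down to two books and two fields, so the wide result has 4 columns rather than 15. The second call, which iterates over the inner list, is shown only for contrast and is an assumption about one possible alternative.

```r
# Hypothetical stand-in for the parsed JSON: one top-level element, "books",
# holding one list per book.
parsed <- list(
  books = list(
    list(name = "Band of Brothers", year = "1992"),
    list(name = "Original Dungeons and Dragons", year = "1974")
  )
)

# lapply() over the top level visits a single element, so data.frame()
# lays out every field of every book side by side in one row.
wide <- do.call("rbind", lapply(parsed, data.frame, stringsAsFactors = FALSE))
print(dim(wide))

# Iterating over the inner list builds one single-row data frame per book,
# and rbind stacks them into one row per book.
long <- do.call("rbind", lapply(parsed$books, data.frame, stringsAsFactors = FALSE))
print(dim(long))
```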

namelst <- as.vector(sapply(booksJSON.json[[1]], "[[", "name"))
authorlst <- as.vector(sapply(booksJSON.json[[1]], "[[", "author"))
yearlst <- as.vector(sapply(booksJSON.json[[1]], "[[", "year"))
publisherlst <- as.vector(sapply(booksJSON.json[[1]], "[[", "publisher"))
genrelst <- as.vector(sapply(booksJSON.json[[1]], "[[", "genre"))

JSONdf <- data.frame(Name = namelst, Author = authorlst, Year = yearlst, Publisher = publisherlst, Genre = genrelst)
JSONdf <- separate(JSONdf,Author, c("Author1","Author2"), sep = "; ", remove = TRUE)

JSONdf  %>%
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%     
  scroll_box(width="100%",height="200px")

Name	Author1	Author2	Year	Publisher	Genre
Band of Brothers	Stephen Ambrose	NA	1992	Simon and Schuster	World War II History
The Killer Angels: A Novel of the Civil War	Michael Shaara	NA	1974	David McKay Publications	Civil War History
Original Dungeons and Dragons	Gary Gygax	Dave Arneson	1974	TSR, Inc.	Roleplaying Game

JSON Analysis 2: A better way to solve this is to apply the textbook's sapply example to each header field. This ensures that all values with a given field name are collected into one list. Initially, I passed the resulting lists straight to the data.frame function, but when I then tried to use the separate function to create the two Author fields, it did not work and complained of null values. To work around this, I wrapped each sapply call in as.vector when creating the lists. After that, RStudio did not complain when I used separate, and I was able to reproduce the same result as the HTML and XML treatment of the data.
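For comparison, here is a sketch of the same per-field extraction on an inline JSON string. It swaps sapply plus as.vector for vapply, which states the expected result type up front. This is a substitution, not the method used above, and it does not establish what caused the original error. `json_text` and the helper `get_field` are hypothetical names introduced for illustration.

```r
library(RJSONIO)

# Hypothetical, shortened copy of books.json.
json_text <- '{"books": [
  {"name": "Band of Brothers", "author": "Stephen Ambrose"},
  {"name": "Original Dungeons and Dragons", "author": "Gary Gygax; Dave Arneson"}
]}'

# Parse the string into nested lists, as in the assignment code.
parsed <- fromJSON(json_text, asText = TRUE, nullValue = NA, simplify = FALSE)

# Hypothetical helper: pull one field out of every record as a plain,
# unnamed character vector. vapply() errors if a value is not a single string.
get_field <- function(records, field) {
  vapply(records, function(rec) as.character(rec[[field]]),
         character(1), USE.NAMES = FALSE)
}

json_df <- data.frame(
  Name   = get_field(parsed$books, "name"),
  Author = get_field(parsed$books, "author"),
  stringsAsFactors = FALSE
)
print(json_df)
```

The resulting columns are ordinary character vectors, so the data frame can be passed to separate in the same way as the HTML and XML data frames.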

DATA607 - Assignment 7
