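# Packages assumed from the functions called in this report
library(RCurl)       # getURL
library(XML)         # readHTMLTable, xmlParse, xmlRoot, xmlToDataFrame
library(rlist)       # list.clean
library(tidyr)       # separate
library(dplyr)       # %>%
library(knitr)       # kable
library(kableExtra)  # kable_styling, scroll_box
library(RJSONIO)     # isValidJSON, fromJSON
library(stringr)     # str_detect
urlfile <- getURL("https://raw.githubusercontent.com/RommyGraphs/MSDA/master/DATA607/books.html", .opts = list(ssl.verifypeer = FALSE))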
htmltable <- readHTMLTable(urlfile)
htmltable <- list.clean(htmltable, fun = is.null, recursive = FALSE)
HTMLdf <- htmltable[[1]]
class(HTMLdf)
## [1] "data.frame"
HTMLdf <- separate(HTMLdf, Author, c("Author1", "Author2"), sep = "; ", remove = TRUE)
HTMLdf %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width = "100%", height = "200px")
Name | Author1 | Author2 | Year | Publisher | Genre |
---|---|---|---|---|---|
Band of Brothers | Stephen Ambrose | NA | 1992 | Simon and Schuster | World War II History |
The Killer Angels: A Novel of the Civil War | Michael Shaara | NA | 1974 | David McKay Publications | Civil War History |
Original Dungeons and Dragons | Gary Gygax | Dave Arneson | 1974 | TSR, Inc. | Roleplaying Game |
HTML Analysis: HTML was pretty straightforward. readHTMLTable returns a list of tables, and the first element of that list is already a data.frame. The fields in the HTML header row became the column names of the data frame. The only additional work I did was to use the tidyr separate function to split the Author field into Author1 and Author2.
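As a minimal sketch of the author-splitting step, the snippet below runs `separate` on a small in-memory data frame built from two of the books above. The object names `books` and `split_books` are illustrative, and `fill = "right"` is an addition not used in the report: it tells tidyr to pad single-author rows with NA without raising a warning.

```r
library(tidyr)

# Two rows copied from the table above; names here are illustrative only
books <- data.frame(
  Name   = c("Band of Brothers", "Original Dungeons and Dragons"),
  Author = c("Stephen Ambrose", "Gary Gygax; Dave Arneson"),
  stringsAsFactors = FALSE
)

# Split on "; " into two columns; rows with one author get NA in Author2
split_books <- separate(books, Author, into = c("Author1", "Author2"),
                        sep = "; ", fill = "right")
split_books
```

Without `fill = "right"`, the result should be the same, but tidyr warns that missing pieces were filled with NA.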
urlfile <- getURL("https://raw.githubusercontent.com/RommyGraphs/MSDA/master/DATA607/books.xml",.opts = list(ssl.verifypeer = FALSE))
booksXML <- xmlParse(urlfile)
class(booksXML)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
# Calls inferred from the analysis below; only their output survived in the rendered report
root <- xmlRoot(booksXML)
xmlName(root)
## [1] "books"
root[[1]]
## <book>
##   <name>Band of Brothers</name>
##   <author>Stephen Ambrose</author>
##   <year>1992</year>
##   <publisher>Simon and Schuster</publisher>
##   <genre>World War II History</genre>
## </book>
XMLdf <- xmlToDataFrame(booksXML)
XMLdf
##                                          name                   author
## 1                            Band of Brothers          Stephen Ambrose
## 2 The Killer Angels: A Novel of the Civil War           Michael Shaara
## 3               Original Dungeons and Dragons Gary Gygax; Dave Arneson
##   year                publisher                genre
## 1 1992       Simon and Schuster World War II History
## 2 1974 David McKay Publications    Civil War History
## 3 1974                TSR, Inc.     Roleplaying Game
names(XMLdf) <- c("Name", "Author", "Year", "Publisher", "Genre")
XMLdf <- separate(XMLdf, Author, c("Author1", "Author2"), sep = "; ", remove = TRUE)
XMLdf %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width = "100%", height = "200px")
Name | Author1 | Author2 | Year | Publisher | Genre |
---|---|---|---|---|---|
Band of Brothers | Stephen Ambrose | NA | 1992 | Simon and Schuster | World War II History |
The Killer Angels: A Novel of the Civil War | Michael Shaara | NA | 1974 | David McKay Publications | Civil War History |
Original Dungeons and Dragons | Gary Gygax | Dave Arneson | 1974 | TSR, Inc. | Roleplaying Game |
XML Analysis: Following the textbook example, I confirmed that the XML was being read properly by using xmlParse and then xmlRoot to find the root node. xmlName shows that the root is named books, and indexing the root with subscript 1 displays the contents of the first book. I used the xmlToDataFrame function to transform the XML contents into a data frame. The big difference between the resulting HTML and XML data frames is in the headers: the HTML table kept its capitalized header row, while xmlToDataFrame takes its column names from the XML element names, which are lowercase in this file. Because of this, I had to rename the columns to match those from the HTML. Just like HTML, the only additional work I did was to use the tidyr separate function to create fields for two authors.
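The XML steps described above can be sketched on an inline document so that no download is needed. The XML string below is a cut-down stand-in for books.xml (three fields, two books), and the names `xml_text`, `doc`, `xml_df`, and `original_names` are illustrative. Passing the parsed document to `xmlToDataFrame` is one documented way to call it, not necessarily the exact call used in this report.

```r
library(XML)

# Inline stand-in for books.xml, built without whitespace between elements
xml_text <- paste0(
  "<books>",
  "<book><name>Band of Brothers</name><author>Stephen Ambrose</author><year>1992</year></book>",
  "<book><name>Original Dungeons and Dragons</name><author>Gary Gygax; Dave Arneson</author><year>1974</year></book>",
  "</books>"
)

doc  <- xmlParse(xml_text, asText = TRUE)  # parse from a string, not a file
root <- xmlRoot(doc)                       # top-level <books> node
xmlName(root)                              # element name of the root
root[[1]]                                  # first <book> child

# One row per <book>; column names come from the child element names
xml_df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
original_names <- names(xml_df)

# Rename to match the capitalized headers used for the HTML table
names(xml_df) <- c("Name", "Author", "Year")
xml_df
```

Because the column names are copied from the element names, an XML file with capitalized tags would need no renaming step.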
urlfile <- "https://raw.githubusercontent.com/RommyGraphs/MSDA/master/DATA607/books.json"
isValidJSON(urlfile)
## [1] TRUE
booksJSON.json <- fromJSON(urlfile, nullValue = NA, simplify = FALSE)
test1 <- unlist(booksJSON.json, recursive = TRUE, use.names = TRUE)
test1[str_detect(names(test1), "name")]
## books.name
## "Band of Brothers"
## books.name
## "The Killer Angels: A Novel of the Civil War"
## books.name
## "Original Dungeons and Dragons"
booksJSON.df <- do.call("rbind", lapply(booksJSON.json, data.frame, stringsAsFactors = FALSE))
booksJSON.df %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width = "100%", height = "200px")
| | name | author | year | publisher | genre | name.1 | author.1 | year.1 | publisher.1 | genre.1 | name.2 | author.2 | year.2 | publisher.2 | genre.2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| books | Band of Brothers | Stephen Ambrose | 1992 | Simon and Schuster | World War II History | The Killer Angels: A Novel of the Civil War | Michael Shaara | 1974 | David McKay Publications | Civil War History | Original Dungeons and Dragons | Gary Gygax; Dave Arneson | 1974 | TSR, Inc. | Roleplaying Game |
JSON Analysis 1: Following the textbook example, I used the isValidJSON function to make sure the file at my JSON URL contains valid JSON. Next, to make sure that I was reading the JSON file properly, I used the unlist and str_detect steps from the text to verify that the names of the books are listed as expected.
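A small sketch of that validation and spot check, run on an inline JSON string instead of the URL. The string is an assumed, cut-down shape of books.json (a top-level "books" array, consistent with the books.name labels in the output above), and base `grepl` stands in for `str_detect` to keep the example free of extra packages.

```r
library(RJSONIO)

# Assumed shape of the file: one "books" array of objects (cut down to two fields)
json_text <- '{"books": [
  {"name": "Band of Brothers", "year": "1992"},
  {"name": "Original Dungeons and Dragons", "year": "1974"}
]}'

# asText = TRUE tells RJSONIO the argument is JSON content, not a file name
is_valid <- isValidJSON(json_text, asText = TRUE)

# Keep the nested list structure, as in the report
parsed <- fromJSON(json_text, simplify = FALSE, nullValue = NA)

# Flatten to a named character vector and keep only the "name" entries
flat <- unlist(parsed, recursive = TRUE, use.names = TRUE)
book_names <- flat[grepl("name$", names(flat))]
book_names
```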
The big problem with the textbook approach was that it did not produce one column per field. For example, instead of name appearing once as a header, it appeared three times, as name, name.1 and name.2, one for each book. The same happens to every field, so following the do.call and lapply method directly gives 15 header fields and 1 row instead of 5 header fields and 3 rows of data. Unacceptable!
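To see where the single wide row comes from, here is a sketch on a hand-built list shaped like the parsed JSON (a hypothetical `books_json` with two books and two fields). The `lapply` call visits only the top-level "books" element, so `data.frame` receives all books at once and spreads them across columns. Iterating one level down is shown as one possible alternative; it is not the approach taken in this report.

```r
# Hand-built stand-in for the parsed JSON: one "books" element holding the records
books_json <- list(books = list(
  list(name = "Band of Brothers", year = "1992"),
  list(name = "Original Dungeons and Dragons", year = "1974")
))

# lapply sees one element ("books"), so every record lands in the same row
wide <- do.call("rbind", lapply(books_json, data.frame, stringsAsFactors = FALSE))
dim(wide)

# Iterating over the records themselves gives one row per book
long <- do.call("rbind", lapply(books_json[["books"]], data.frame, stringsAsFactors = FALSE))
dim(long)
```

With two books and two fields, `wide` has 1 row and 4 columns while `long` has 2 rows and 2 columns, which mirrors the 1 x 15 versus 3 x 5 shapes described above.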
namelst <- as.vector(sapply(booksJSON.json[[1]], "[[", "name"))
authorlst <- as.vector(sapply(booksJSON.json[[1]], "[[", "author"))
yearlst <- as.vector(sapply(booksJSON.json[[1]], "[[", "year"))
publisherlst <- as.vector(sapply(booksJSON.json[[1]], "[[", "publisher"))
genrelst <- as.vector(sapply(booksJSON.json[[1]], "[[", "genre"))
JSONdf <- data.frame(Name = namelst, Author = authorlst, Year = yearlst, Publisher = publisherlst, Genre = genrelst)
JSONdf <- separate(JSONdf, Author, c("Author1", "Author2"), sep = "; ", remove = TRUE)
JSONdf %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width = "100%", height = "200px")
Name | Author1 | Author2 | Year | Publisher | Genre |
---|---|---|---|---|---|
Band of Brothers | Stephen Ambrose | NA | 1992 | Simon and Schuster | World War II History |
The Killer Angels: A Novel of the Civil War | Michael Shaara | NA | 1974 | David McKay Publications | Civil War History |
Original Dungeons and Dragons | Gary Gygax | Dave Arneson | 1974 | TSR, Inc. | Roleplaying Game |
JSON Analysis 2: A better way to solve this is to use the sapply example in the textbook for each header field. This ensures that all values matching that field name get grouped into one list. Initially, I called the data.frame function to create the data frame directly from the resulting lists. However, when I attempted to use the separate function to create the two Author fields, it did not work and complained of null values. To work around this, I wrapped each sapply call in as.vector when I initially created the lists. With that change, RStudio did not complain when I used the separate function, and I was able to produce the same result as the HTML and XML treatment of the data.
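The per-field extraction can be sketched on a hand-built list shaped like the parsed JSON (the names `books_json`, `records`, and `json_df` are illustrative). The `vapply` line is a stricter variant added here, not part of the report: it declares that every record must yield exactly one character value and stops with an error otherwise, which can surface missing fields earlier than `sapply` would.

```r
# Hand-built stand-in for the parsed JSON, cut down to two fields
books_json <- list(books = list(
  list(name = "Band of Brothers", author = "Stephen Ambrose"),
  list(name = "Original Dungeons and Dragons", author = "Gary Gygax; Dave Arneson")
))
records <- books_json[["books"]]

# "[[" is applied to each record with the field name as its argument
name_vec   <- sapply(records, "[[", "name")
author_vec <- sapply(records, "[[", "author")

# Stricter variant: fails unless each record returns a single string
name_checked <- vapply(records, "[[", character(1), "name")

# One column per field, one row per book
json_df <- data.frame(Name = name_vec, Author = author_vec,
                      stringsAsFactors = FALSE)
json_df
```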