Assignment 6

The purpose of this exercise is to load the same information from three different file types (HTML, XML, and JSON) and to see whether the resulting data frames are identical. I picked mythology books for this assignment and wrote the HTML table, XML, and JSON files by hand; all three are uploaded to my GitHub.

library(XML)
library(plyr)
library(knitr)
library(jsonlite)

html.url <- "https://raw.githubusercontent.com/EyeDen/data607/master/assignment6/books.html"
xml.url <- "https://raw.githubusercontent.com/EyeDen/data607/master/assignment6/books.xml"
json.url <- "https://raw.githubusercontent.com/EyeDen/data607/master/assignment6/books.json"

download.file(html.url, "books.html")
download.file(xml.url, "books.xml")
download.file(json.url, "books.json")

books.html <- as.data.frame(readHTMLTable("books.html"))
kable(books.html)

NULL.Title	NULL.Authors	NULL.Publisher	NULL.Pages	NULL.Date
Norse Mythology	Neil Gaiman	W.W. Norton and Company	304	2/7/17
The Odyssey	Homer, Emily Wilson	W.W. Norton and Company	592	11/7/17
Fairy Tales from the Brothers Grimm: A New English Version	The Brothers Grimm, Philip Pullman	Viking	432	11/8/12

Aside from the NULL. prefix on every column name, this is okay. If we were cleaning this up, we’d also have to split apart the authors for The Odyssey and Fairy Tales from the Brothers Grimm.
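Both cleanup steps can be sketched in base R. This is only a sketch: the data frame below is rebuilt by hand from two columns of the table above so that it runs on its own, and the names parts, width, and padded are placeholders that do not appear in the assignment.

```r
# Rebuild two columns of the HTML result by hand so the sketch is self-contained.
books <- data.frame(
  NULL.Title = c("Norse Mythology", "The Odyssey",
                 "Fairy Tales from the Brothers Grimm: A New English Version"),
  NULL.Authors = c("Neil Gaiman", "Homer, Emily Wilson",
                   "The Brothers Grimm, Philip Pullman"),
  stringsAsFactors = FALSE
)

# Drop the "NULL." prefix that as.data.frame() adds for an unnamed table.
names(books) <- sub("^NULL\\.", "", names(books))

# Split the comma-separated authors and pad single-author books with NA.
parts <- strsplit(books$Authors, ", ", fixed = TRUE)
width <- max(lengths(parts))
padded <- t(vapply(
  parts,
  function(p) c(p, rep(NA_character_, width - length(p))),
  character(width)
))

# Mirror the column names that the XML version produces further down.
books$author <- padded[, 1]
books$author.1 <- padded[, 2]
```

Splitting on ", " assumes no author name contains a comma followed by a space, which holds for these three books but would not hold for names written as "Surname, Given".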

Though xmlToDataFrame exists as part of the XML library, it is too simplistic here: it can’t handle the multiple author tags that two of the books have. We’ll have to try something else. A [StackOverflow question](https://stackoverflow.com/questions/2067098/how-to-transform-xml-data-into-a-data-frame) gave me a solution, which requires the plyr library.

books.xml <- ldply(xmlToList("books.xml"), data.frame)

The xmlToList and ldply combination splits the authors into separate columns on its own, which readHTMLTable couldn’t manage for the HTML version. We can clean the result up a bit by keeping only the columns we need, in a sensible order.

books.xml <- books.xml[, c(2, 3, 8, 4, 5, 6)]
kable(books.xml)

title	author	author.1	publisher	pages	date
Norse Mythology	Neil Gaiman	NA	W.W. Norton and Company	304	2/7/17
The Odyssey	Homer	Emily Wilson	W.W. Norton and Company	592	11/7/17
Fairy Tales from the Brothers Grimm: A New English Version	The Brothers Grimm	Philip Pullman	Viking	432	11/8/12
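To see why the authors land in separate columns, here is a minimal sketch that runs the same xmlToList and ldply pipeline on a small inline document. The XML string is a hypothetical stand-in (the real books.xml may use different tags or attributes), and xml.text, book.list, and sketch are names made up for this example. It also selects columns by name, which may be less fragile than the positional indices used above.

```r
library(XML)
library(plyr)

# Hypothetical two-book document; the real books.xml may be structured differently.
xml.text <- paste0(
  "<books>",
  "<book><title>Norse Mythology</title><author>Neil Gaiman</author></book>",
  "<book><title>The Odyssey</title><author>Homer</author>",
  "<author>Emily Wilson</author></book>",
  "</books>"
)

# Parse from a string instead of a file, then convert to nested lists.
book.list <- xmlToList(xmlParse(xml.text, asText = TRUE))

# data.frame() renames the duplicated "author" entry to "author.1";
# ldply() stacks the per-book frames and fills missing columns with NA.
sketch <- ldply(book.list, data.frame, stringsAsFactors = FALSE)

# Select by name rather than by position.
sketch <- sketch[, c("title", "author", "author.1")]
```

The first book has only one author tag, so its author.1 value is NA, matching the Norse Mythology row in the table above.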

books.json <- as.data.frame(fromJSON("books.json"))
kable(books.json)

books.title	books.authors	books.publisher	books.pages	books.date
Norse Mythology	Neil Gaiman	W.W. Norton and Company	304	2/7/17
The Odyssey	Homer, Emily Wilson	W.W. Norton and Company	592	11/7/17
Fairy Tales from the Brothers Grimm: A New English Version	The Brothers Grimm, Philip Pullman	Viking	432	11/8/12

Reading in the JSON file, we get output that matches the HTML table, although the column names are a little neater. Rather than separating the authors into new columns automatically, as the XML version did, this leaves multiple authors in the same column.

So, no, the three results are not identical.
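The comparison can also be made programmatically rather than by eye. The sketch below uses hand-built one-row stand-ins for the HTML and JSON results, and normalize is a hypothetical helper, not part of any package. With the real data frames, column types might still differ (for example, pages read as text from one format and as a number from another), so treat this as an illustration rather than a full check.

```r
# Hand-built stand-ins for two of the results above (one book, two columns each).
from.html <- data.frame(NULL.Title = "The Odyssey",
                        NULL.Authors = "Homer, Emily Wilson",
                        stringsAsFactors = FALSE)
from.json <- data.frame(books.title = "The Odyssey",
                        books.authors = "Homer, Emily Wilson",
                        stringsAsFactors = FALSE)

# As loaded, the frames differ because of their column names.
same.as.loaded <- identical(from.html, from.json)

# Strip everything up to the first dot and lower-case the rest before comparing.
normalize <- function(df) {
  names(df) <- tolower(sub("^[^.]*\\.", "", names(df)))
  df
}
same.after.cleanup <- identical(normalize(from.html), normalize(from.json))
```

The XML result would need its author and author.1 columns pasted back together (or the other two split apart) before the same comparison could apply to all three.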

Iden Watanabe

March 13, 2018