The purpose of this exercise is to load information from three different file types: HTML, XML, and JSON, and to see if the results are identical. I picked mythology books for this assignment and handmade the HTML table, XML, and JSON files, which are uploaded to my GitHub.

library(XML)
library(plyr)
library(knitr)
library(jsonlite)

html.url <- "https://raw.githubusercontent.com/EyeDen/data607/master/assignment6/books.html"
xml.url <- "https://raw.githubusercontent.com/EyeDen/data607/master/assignment6/books.xml"
json.url <- "https://raw.githubusercontent.com/EyeDen/data607/master/assignment6/books.json"

download.file(html.url, "books.html")
download.file(xml.url, "books.xml")
download.file(json.url, "books.json")
books.html <- as.data.frame(readHTMLTable("books.html"))
kable(books.html)
NULL.Title NULL.Authors NULL.Publisher NULL.Pages NULL.Date
Norse Mythology Neil Gaiman W.W. Norton and Company 304 2/7/17
The Odyssey Homer, Emily Wilson W.W. Norton and Company 592 11/7/17
Fairy Tales from the Brothers Grimm: A New English Version The Brothers Grimm, Philip Pullman Viking 432 11/8/12

Aside from a minor issue with the column names, this is okay. If we were cleaning this up, we’d have to split apart the authors for The Odyssey and Fairy Tales.

Though xmlToDataFrame exists as part of the XML library, it is a bit too simplistic as it can’t handle the multiple author tags for two of the books. We’ll have to try something else. A [StackOverflow][“https://stackoverflow.com/questions/2067098/how-to-transform-xml-data-into-a-data-frame”] question gave me a solution, requiring the plyr library.

books.xml <- ldply(xmlToList("books.xml"), data.frame)

This natively manages to split apart the authors, which readHTMLTable couldn’t manage for HTML. We can clean it up a bit, and remove columns we won’t be needing.

books.xml <- books.xml[, c(2, 3, 8, 4, 5, 6)]
kable(books.xml)
title author author.1 publisher pages date
Norse Mythology Neil Gaiman NA W.W. Norton and Company 304 2/7/17
The Odyssey Homer Emily Wilson W.W. Norton and Company 592 11/7/17
Fairy Tales from the Brothers Grimm: A New English Version The Brothers Grimm Philip Pullman Viking 432 11/8/12
books.json <- as.data.frame(fromJSON("books.json"))
kable(books.json)
books.title books.authors books.publisher books.pages books.date
Norse Mythology Neil Gaiman W.W. Norton and Company 304 2/7/17
The Odyssey Homer, Emily Wilson W.W. Norton and Company 592 11/7/17
Fairy Tales from the Brothers Grimm: A New English Version The Brothers Grimm, Philip Pullman Viking 432 11/8/12

Reading in the JSON file, we get an output exactly like the HTML table, although the column names are a little neater. Rather than separating the authors into new columns automatically, as the XML version had done, this leaves multiple entries in the same column.

So, no, the methods are not identical.