The purpose of this exercise is to load the same information from three file formats (HTML, XML, and JSON) and to see whether the resulting data frames are identical. I picked mythology books for this assignment and wrote the HTML table, XML, and JSON files by hand; they are uploaded to my GitHub.
library(XML)
library(plyr)
library(knitr)
library(jsonlite)
html.url <- "https://raw.githubusercontent.com/EyeDen/data607/master/assignment6/books.html"
xml.url <- "https://raw.githubusercontent.com/EyeDen/data607/master/assignment6/books.xml"
json.url <- "https://raw.githubusercontent.com/EyeDen/data607/master/assignment6/books.json"
download.file(html.url, "books.html")
download.file(xml.url, "books.xml")
download.file(json.url, "books.json")
books.html <- as.data.frame(readHTMLTable("books.html"))
kable(books.html)
| NULL.Title | NULL.Authors | NULL.Publisher | NULL.Pages | NULL.Date |
|---|---|---|---|---|
| Norse Mythology | Neil Gaiman | W.W. Norton and Company | 304 | 2/7/17 |
| The Odyssey | Homer, Emily Wilson | W.W. Norton and Company | 592 | 11/7/17 |
| Fairy Tales from the Brothers Grimm: A New English Version | The Brothers Grimm, Philip Pullman | Viking | 432 | 11/8/12 |
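Aside from the NULL. prefix on the column names, this is okay. The prefix appears because readHTMLTable returns a list of tables, and a table with no id or caption is given the name NULL, which as.data.frame then prepends to every column. If we were cleaning this up, we’d also have to split apart the authors for The Odyssey and Fairy Tales.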
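Both cleanup steps can be sketched in a few lines. This is only an illustration under assumptions: the inline `html.text` is a hypothetical stand-in for books.html (the real markup may differ), and `books.demo` and `author.list` are names made up for this example. The idea is to ask readHTMLTable for the first table directly with `which = 1`, then split the author strings on commas.

```r
library(XML)

# Hypothetical two-row stand-in for books.html; the real file's markup may differ.
html.text <- paste0(
  "<html><body><table>",
  "<tr><th>Title</th><th>Authors</th><th>Pages</th></tr>",
  "<tr><td>Norse Mythology</td><td>Neil Gaiman</td><td>304</td></tr>",
  "<tr><td>The Odyssey</td><td>Homer, Emily Wilson</td><td>592</td></tr>",
  "</table></body></html>"
)
doc <- htmlParse(html.text, asText = TRUE)

# which = 1 returns the first table itself rather than a list of tables,
# so the column names carry no "NULL." prefix.
books.demo <- readHTMLTable(doc, which = 1)

# Split each comma-separated author string into a character vector.
# as.character() guards against the column having been read as a factor.
author.list <- strsplit(as.character(books.demo$Authors), ",\\s*")
```

Splitting on commas is fine for these three books, but it would break on an author whose name itself contains a comma, so treat it as a quick fix rather than a general one.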
Though xmlToDataFrame exists as part of the XML package, it is too simplistic here: it can’t handle the repeated author tags that two of the books have. We’ll have to try something else. A [StackOverflow question](https://stackoverflow.com/questions/2067098/how-to-transform-xml-data-into-a-data-frame) gave me a solution, which requires the plyr package.
books.xml <- ldply(xmlToList("books.xml"), data.frame)
This approach splits the authors into separate columns automatically, which readHTMLTable did not do for the HTML table. We can tidy the result by reordering the columns and dropping the ones we won’t be needing.
books.xml <- books.xml[, c(2, 3, 8, 4, 5, 6)]
kable(books.xml)
| title | author | author.1 | publisher | pages | date |
|---|---|---|---|---|---|
| Norse Mythology | Neil Gaiman | NA | W.W. Norton and Company | 304 | 2/7/17 |
| The Odyssey | Homer | Emily Wilson | W.W. Norton and Company | 592 | 11/7/17 |
| Fairy Tales from the Brothers Grimm: A New English Version | The Brothers Grimm | Philip Pullman | Viking | 432 | 11/8/12 |
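To see where author.1 comes from, here is a small sketch of the same xmlToList and ldply pattern on inline XML. The `xml.text` string is a hypothetical stand-in for books.xml (its element names are guessed from the column names above, and the real file evidently has at least one more field), and `books.demo` is a made-up name. It also selects columns by name instead of by position, which may be a little more robust if the file changes.

```r
library(XML)
library(plyr)

# Hypothetical stand-in for books.xml; element names are assumed from the
# column names in the table, and the real file has additional fields.
xml.text <- paste0(
  "<books>",
  "<book><title>Norse Mythology</title><author>Neil Gaiman</author>",
  "<pages>304</pages></book>",
  "<book><title>The Odyssey</title><author>Homer</author>",
  "<author>Emily Wilson</author><pages>592</pages></book>",
  "</books>"
)
doc <- xmlParse(xml.text, asText = TRUE)

# Each <book> becomes a named list. data.frame() renames the repeated
# "author" entry to "author.1", and ldply() fills missing cells with NA.
books.demo <- ldply(xmlToList(doc), data.frame, stringsAsFactors = FALSE)

# Select by name so the code does not depend on column positions.
books.demo <- books.demo[, c("title", "author", "author.1", "pages")]
```

In this sketch the single-author book ends up with NA in author.1, matching the Norse Mythology row in the table.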
books.json <- as.data.frame(fromJSON("books.json"))
kable(books.json)
| books.title | books.authors | books.publisher | books.pages | books.date |
|---|---|---|---|---|
| Norse Mythology | Neil Gaiman | W.W. Norton and Company | 304 | 2/7/17 |
| The Odyssey | Homer, Emily Wilson | W.W. Norton and Company | 592 | 11/7/17 |
| Fairy Tales from the Brothers Grimm: A New English Version | The Brothers Grimm, Philip Pullman | Viking | 432 | 11/8/12 |
Reading in the JSON file, we get output just like the HTML table, although the column names are a little neater: the books. prefix comes from the top-level key of the JSON file. Rather than separating the authors into new columns automatically, as the XML version did, this leaves multiple authors in the same column.
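The prefix can be avoided by pulling out the books element instead of converting the whole list. The sketch below assumes a particular layout for the JSON: the top-level "books" key is inferred from the column names, but storing authors as arrays is purely an assumption about the file, and `json.text`, `books.demo`, `authors.flat`, and `authors.n` are names invented for this example.

```r
library(jsonlite)

# Hypothetical stand-in for books.json. The top-level "books" key is inferred
# from the "books." prefix; storing authors as arrays is an assumption.
json.text <- '{"books": [
  {"title": "Norse Mythology", "authors": ["Neil Gaiman"], "pages": 304},
  {"title": "The Odyssey", "authors": ["Homer", "Emily Wilson"], "pages": 592}
]}'

# Extract the "books" element directly; its columns keep their plain names.
books.demo <- fromJSON(json.text)$books

# Under the array assumption, authors is a list column holding one character
# vector per book. Collapse it for display, or count the authors per book.
authors.flat <- vapply(books.demo$authors, paste, character(1), collapse = ", ")
authors.n <- lengths(books.demo$authors)
```

If the real file stores the authors as a single string instead, the column would be an ordinary character vector and the strsplit approach would apply.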
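So, no, the results are not identical. The HTML and JSON readers keep multiple authors in a single column, the XML approach spreads them across separate columns, and all three produce different column names.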
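The comparison can also be made programmatically with identical(). The sketch below uses hand-built data frames that mirror two rows and two fields of the tables above rather than the real loaded objects, so `html.like`, `xml.like`, and `xml.norm` are hypothetical names. It shows that the frames differ as loaded, and that they match once the column names and the author layout are normalised.

```r
# Hand-built stand-ins mirroring two rows of the HTML and XML tables above.
html.like <- data.frame(
  Title   = c("Norse Mythology", "The Odyssey"),
  Authors = c("Neil Gaiman", "Homer, Emily Wilson"),
  stringsAsFactors = FALSE
)
xml.like <- data.frame(
  title    = c("Norse Mythology", "The Odyssey"),
  author   = c("Neil Gaiman", "Homer"),
  author.1 = c(NA, "Emily Wilson"),
  stringsAsFactors = FALSE
)

# As loaded, the two frames differ in both column names and shape.
same.raw <- identical(html.like, xml.like)

# Normalise: lower-case the HTML names, and merge the two XML author
# columns into one comma-separated string, leaving single authors as is.
names(html.like) <- tolower(names(html.like))
xml.norm <- data.frame(
  title   = xml.like$title,
  authors = ifelse(is.na(xml.like$author.1),
                   xml.like$author,
                   paste(xml.like$author, xml.like$author.1, sep = ", ")),
  stringsAsFactors = FALSE
)
same.norm <- identical(html.like, xml.norm)
```

Here same.raw is FALSE and same.norm is TRUE, which suggests the differences come from how each reader shapes the data rather than from the data itself.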