Data 607 Week 7 Assignment

Introduction:

For this assignment I created three files, each in a different format (JSON, HTML and XML), describing three books about the Troubles in Northern Ireland (a subject that played a large role in my mom’s younger years). The goal is to load each file into R and display it as a data frame. I tried to represent the data the same way in every format; where a format forced a difference, I point it out, since that will inform how I work with different file formats in the future.

Loading necessary packages

library(XML)
library(methods)
library(rjson)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

XML

bookinfo_xml <- xmlToDataFrame('bookinfo.xml')

bookinfo_xml

##                                                                         Title
## 1 Making Sense of the Troubles: The Story of the Conflict in Northern Ireland
## 2                                                                     Milkman
## 3                                                            Ressurection Man
##                         Authors Length PublishingYear           Publisher
## 1 David McKittrick, David McVea    368           2002 New Amsterdam Books
## 2                    Anna Burns    368           2018         Faber Faber
## 3                  Eoin McNamee    240           1995             Picador
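As a minimal sketch of what xmlToDataFrame does with a flat record layout, the example below parses an inline XML string instead of a file. The element names and nesting are assumptions for illustration only; the real bookinfo.xml may be structured differently. It also shows that every column arrives as text, so numeric fields such as Length need an explicit conversion.

```r
library(XML)

# Hypothetical layout: one <book> element per row, one child element per column.
# This is an assumption for illustration, not the actual bookinfo.xml.
xml_text <- '<books>
  <book>
    <Title>Milkman</Title>
    <Authors>Anna Burns</Authors>
    <Length>368</Length>
    <PublishingYear>2018</PublishingYear>
  </book>
  <book>
    <Title>Resurrection Man</Title>
    <Authors>Eoin McNamee</Authors>
    <Length>240</Length>
    <PublishingYear>1995</PublishingYear>
  </book>
</books>'

# Parse the string as XML content (asText = TRUE) rather than as a file name.
doc <- xmlParse(xml_text, asText = TRUE)

# Each child of the root becomes a row; each grandchild becomes a column.
books <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# All columns come back as character, so convert the numeric ones explicitly.
books$Length <- as.integer(books$Length)
books$PublishingYear <- as.integer(books$PublishingYear)

str(books)
```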

HTML

bookinfo_html <- readHTMLTable('bookinfo.html')

bookinfo_html

## $`NULL`
##                                                                         Title
## 1 Making Sense of the Troubles: The Story of the Conflict in Northern Ireland
## 2                                                                     Milkman
## 3                                                            Resurrection Man
##                          Author Length PublishingYear           Publisher
## 1 David McKittrick, David McVea    368           2002 New Amsterdam Books
## 2                    Anna Burns    368           2018         Faber Faber
## 3                  Eoin McNamee    240           1995             Picador
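The $`NULL` line in the output above shows that readHTMLTable returned a list holding the table, not a bare data frame. A minimal sketch of pulling the data frame out of that list follows, using an inline HTML string with an assumed table layout (the real bookinfo.html may differ).

```r
library(XML)

# Hypothetical table markup for illustration; not the actual bookinfo.html.
html_text <- '<html><body><table>
<tr><th>Title</th><th>Author</th></tr>
<tr><td>Milkman</td><td>Anna Burns</td></tr>
<tr><td>Resurrection Man</td><td>Eoin McNamee</td></tr>
</table></body></html>'

# Parse the string as HTML content rather than as a file name.
doc <- htmlParse(html_text, asText = TRUE)

# readHTMLTable returns a list with one entry per <table> in the document.
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Take the first (here, the only) table to get an ordinary data frame.
books_html <- tables[[1]]

class(tables)
class(books_html)
```

With a real file, the same indexing would apply: bookinfo_html[[1]] is the data frame itself.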

JSON

bookinfo_json <- fromJSON(file = "bookinfo.json") %>% as.data.frame

bookinfo_json

##                                                                         Title
## 1 Making Sense of the Troubles: The Story of the Conflict in Northern Ireland
## 2                                                                     Milkman
## 3                                                            Resurrection Man
##                          Author Length PublishingYear           Publisher
## 1 David McKittrick, David McVea    368           2002 New Amsterdam Books
## 2                    Anna Burns    368           2018         Faber Faber
## 3                  Eoin McNamee    240           1995             Picador
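Calling as.data.frame on the parsed JSON works when every field holds one value per book, which is why the authors are stored as a single string here. As a rough sketch of one way an author array could still end up in a data frame, the example below assumes a record-per-book layout (an assumption, not the structure of bookinfo.json) and collapses each array into one comma-separated string before binding the rows together.

```r
library(rjson)

# Hypothetical record-oriented JSON with authors stored as an array.
# This is an assumed layout, not the actual bookinfo.json.
json_text <- '[
  {"Title": "Making Sense of the Troubles",
   "Authors": ["David McKittrick", "David McVea"],
   "Length": 368},
  {"Title": "Milkman",
   "Authors": ["Anna Burns"],
   "Length": 368}
]'

# fromJSON returns a list with one element per book.
records <- fromJSON(json_str = json_text)

# Build a one-row data frame per book, collapsing the author array to a string.
rows <- lapply(records, function(rec) {
  data.frame(
    Title   = rec$Title,
    Authors = paste(rec$Authors, collapse = ", "),
    Length  = rec$Length,
    stringsAsFactors = FALSE
  )
})

# Stack the one-row data frames into a single table.
books_json <- do.call(rbind, rows)
books_json
```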

Are the three data frames identical?

The three data frames are nearly identical in presentation despite the different storage formats. Two differences are visible in the output: the XML version names its column Authors where the HTML and JSON versions use Author, and the third title in my XML file contains a typo ("Ressurection Man"). The HTML result is also a list holding the table (printed under $`NULL`) rather than a bare data frame. The main challenge with each format was finding a way to store multiple authors. JSON can hold multiple items in an array, which is what I wanted to do, but that approach did not work when I converted the file to a data frame. A similar issue came up with my HTML file, where the two authors are stored as one item. The XML file was different: that format made it easy to store each author as a separate item. Now that the files are displayed as tables, I could go back and separate the authors easily, but this was a learning process for different data structures.
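The comparison can also be done in code instead of by eye. The sketch below uses two small stand-in data frames (hypothetical objects built by hand from values shown in the output above, not the real bookinfo_xml and bookinfo_json) to check for exact equality, find mismatched column names and differing titles, and split a combined author string back into individual names.

```r
# Stand-in data frames for illustration; values copied from the printed output.
xml_like <- data.frame(
  Title   = c("Milkman", "Ressurection Man"),
  Authors = c("Anna Burns", "Eoin McNamee"),
  stringsAsFactors = FALSE
)
json_like <- data.frame(
  Title  = c("Milkman", "Resurrection Man"),
  Author = c("Anna Burns", "Eoin McNamee"),
  stringsAsFactors = FALSE
)

# Strict test: TRUE only if names, types and values all match.
same <- identical(xml_like, json_like)

# Column names present in one data frame but not the other.
name_diff <- setdiff(names(xml_like), names(json_like))

# Row positions where the titles disagree.
title_diff <- which(xml_like$Title != json_like$Title)

# Split a combined author string on commas, dropping the space after each comma.
author_string <- "David McKittrick, David McVea"
author_names <- strsplit(author_string, ",\\s*")[[1]]

same
name_diff
title_diff
author_names
```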