Introduction

For this assignment, I picked three books from one of my favorite genres, horror fiction. I stored each book's information in three files, one per format, and read each file into R:

HTML

XML

JSON

Reading Into R

I loaded the tidyverse package and chained each read step with the pipe operator (`|>`) for readability. The format-specific packages are noted under their corresponding file formats.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Reading HTML

library(htmltab)

file <- "https://raw.githubusercontent.com/josh1den/DATA-607/main/HW/HW7/DATA607_HW7.html"
html <- htmltab(file, which = 1) |>
        as.data.frame()
print(html)
##                                            title                       author
## 2          Other Terrors: An Inclusive Anthology Vince Liaguno and Rena Mason
## 3                          The Book Of Accidents                 Chuck Wendig
## 4 The Rim Of Morning: Two Tales of Cosmic Horror               William Sloane
##             genre published pages
## 2 Horror, Fiction      2022   363
## 3 Horror, Fiction      2021   560
## 4 Horror, Fiction      2015   480
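As an alternative sketch, the same kind of table can be read with the rvest package instead of htmltab. This is not the method used above. To stay self-contained, the example parses an inline HTML string (the variable `html_text_input` and its markup are assumptions about what the file's table looks like) rather than the URL.

```r
library(rvest)

# Assumed markup: a single table with a header row, mirroring the output above.
html_text_input <- '<table>
  <tr><th>title</th><th>author</th><th>genre</th><th>published</th><th>pages</th></tr>
  <tr><td>Other Terrors: An Inclusive Anthology</td><td>Vince Liaguno and Rena Mason</td><td>Horror, Fiction</td><td>2022</td><td>363</td></tr>
  <tr><td>The Book Of Accidents</td><td>Chuck Wendig</td><td>Horror, Fiction</td><td>2021</td><td>560</td></tr>
  <tr><td>The Rim Of Morning: Two Tales of Cosmic Horror</td><td>William Sloane</td><td>Horror, Fiction</td><td>2015</td><td>480</td></tr>
</table>'

# Parse the HTML, select the first table, and convert it to a data frame.
books <- read_html(html_text_input) |>
         html_element("table") |>
         html_table() |>
         as.data.frame()

print(books)
```

With a real file, the inline string would be replaced by the file's URL or path in the call to `read_html()`.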

Reading XML

library(XML)
library(xml2)
file <- "https://raw.githubusercontent.com/josh1den/DATA-607/main/HW/HW7/DATA607_HW7.xml"

xml <- xml2::read_xml(file) |>
       XML::xmlParse() |>
       XML::xmlToDataFrame()

print(xml)
##                                            title                       author
## 1          Other Terrors: An Inclusive Anthology Vince Liaguno and Rena Mason
## 2                          The Book Of Accidents                 Chuck Wendig
## 3 The Rim Of Morning: Two Tales of Cosmic Horror               William Sloane
##             genre published pages
## 1 Horror, Fiction      2022   363
## 2 Horror, Fiction      2021   560
## 3 Horror, Fiction      2015   480
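The pipeline above passes the document through both xml2 and XML. As a rough sketch of what that conversion does, the same data frame can be built with xml2 alone by walking the record nodes. The element names `books` and `book` and the inline string `xml_text_input` are assumptions for illustration; the actual file may use different names.

```r
library(xml2)

# Assumed structure: a root element with one child element per book,
# and one grandchild element per field.
xml_text_input <- '<books>
  <book><title>Other Terrors: An Inclusive Anthology</title><author>Vince Liaguno and Rena Mason</author><genre>Horror, Fiction</genre><published>2022</published><pages>363</pages></book>
  <book><title>The Book Of Accidents</title><author>Chuck Wendig</author><genre>Horror, Fiction</genre><published>2021</published><pages>560</pages></book>
  <book><title>The Rim Of Morning: Two Tales of Cosmic Horror</title><author>William Sloane</author><genre>Horror, Fiction</genre><published>2015</published><pages>480</pages></book>
</books>'

doc <- read_xml(xml_text_input)

# Each child of the root is one record (one book).
records <- xml_children(doc)

# Turn each record into a one-row data frame: field names become columns,
# field text becomes values.
rows <- lapply(records, function(record) {
  fields <- xml_children(record)
  as.data.frame(setNames(as.list(xml_text(fields)), xml_name(fields)))
})

# Stack the rows, then convert numeric-looking text columns to numbers.
books <- do.call(rbind, rows)
books <- type.convert(books, as.is = TRUE)

print(books)
```

XML stores every value as text, so the `type.convert()` step is what turns `published` and `pages` into numeric columns.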

Reading JSON

library(rjson)
file <- "https://raw.githubusercontent.com/josh1den/DATA-607/main/HW/HW7/DATA607_HW7.json"
json <- fromJSON(file=file) |>
        as.data.frame()
print(json)
##                             books.title                 books.author
## 1 Other Terrors: An Inclusive Anthology Vince Liaguno and Rena Mason
##       books.genre books.published books.pages         books.title.1
## 1 Horror, Fiction            2022         363 The Book Of Accidents
##   books.author.1   books.genre.1 books.published.1 books.pages.1
## 1   Chuck Wendig Horror, Fiction              2021           560
##                                    books.title.2 books.author.2   books.genre.2
## 1 The Rim Of Morning: Two Tales of Cosmic Horror William Sloane Horror, Fiction
##   books.published.2 books.pages.2
## 1              2015           480

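The HTML and XML data frames match apart from their row names, but the JSON data frame does not: it has a single row with fifteen columns, one for each field of each book. The column names (books.title, books.title.1, and so on) suggest that as.data.frame() flattened the nested structure of the file into one wide row instead of producing one row per book. I was unable to resolve this with the original file. After restructuring the JSON, I was able to read it into R and obtain output equivalent to that of the HTML and XML formats: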

file_v2 <- "https://raw.githubusercontent.com/josh1den/DATA-607/main/HW/HW7/DATA607_HW7_V2.json"
json_v2 <- fromJSON(file=file_v2) |>
           as.data.frame()
print(json_v2)
##                                            title                       author
## 1          Other Terrors: An Inclusive Anthology Vince Liaguno and Rena Mason
## 2                          The Book Of Accidents                 Chuck Wendig
## 3 The Rim Of Morning: Two Tales of Cosmic Horror               William Sloane
##             genre published pages
## 1 Horror, Fiction      2022   363
## 2 Horror, Fiction      2021   560
## 3 Horror, Fiction      2015   480

Although I did not resolve the output problem for the first version of the JSON file, one insight I gleaned is that the structure of a JSON file strongly affects how it is read into R. Understanding the structure of the source file is essential to writing code that produces the desired output.
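One possible way to handle a file shaped like the first version is sketched here. It assumes that the file holds a top-level "books" array with one object per book, which is what the flattened column names suggest but which I have not confirmed against the file itself. The inline string `json_text` stands in for the file so the example is self-contained.

```r
library(rjson)

# Assumed shape of the first JSON file: a "books" array of objects.
json_text <- '{"books": [
  {"title": "Other Terrors: An Inclusive Anthology", "author": "Vince Liaguno and Rena Mason", "genre": "Horror, Fiction", "published": 2022, "pages": 363},
  {"title": "The Book Of Accidents", "author": "Chuck Wendig", "genre": "Horror, Fiction", "published": 2021, "pages": 560},
  {"title": "The Rim Of Morning: Two Tales of Cosmic Horror", "author": "William Sloane", "genre": "Horror, Fiction", "published": 2015, "pages": 480}
]}'

parsed <- fromJSON(json_str = json_text)

# parsed$books is a list with one named list per book. Converting each
# element separately gives one-row data frames, which rbind then stacks
# into one row per book instead of one wide row.
books <- do.call(rbind, lapply(parsed$books, as.data.frame))

print(books)
```

The key difference from the earlier attempt is that as.data.frame() is applied to each book rather than to the whole parsed list. Packages such as jsonlite can also simplify an array of objects into a data frame automatically.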