Data 607, Assignment 7

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical? Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

Book Choices

I went with a few favorites: George Orwell’s 1984, J.R.R. Tolkien’s The Hobbit, and Good Omens, co-written by Terry Pratchett and Neil Gaiman. From there I built datasets that capture author, title, publication year, genre, and themes.

Load Datasets

library(rvest)    
library(XML)      
library(jsonlite) 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
html_file <- "https://raw.githubusercontent.com/tcgraham-data/data-607-assignment-7/refs/heads/main/books.html"
html_page <- read_html(html_file)

books_html <- html_page %>% 
  html_node("table") %>% 
  html_table()

print("Data from HTML file:")
## [1] "Data from HTML file:"
print(books_html)
## # A tibble: 3 × 5
##   Title      Authors                      `Publication Year` Genre     Themes   
##   <chr>      <chr>                                     <int> <chr>     <chr>    
## 1 1984       George Orwell                              1949 Dystopian Totalita…
## 2 The Hobbit J.R.R. Tolkien                             1937 Fantasy   Adventur…
## 3 Good Omens Neil Gaiman, Terry Pratchett               1990 Fantasy   Apocalyp…
xml_file <- "https://raw.githubusercontent.com/tcgraham-data/data-607-assignment-7/refs/heads/main/books.xml"
xml_content <- paste(readLines(xml_file, warn = FALSE), collapse = "\n")
xml_data <- xmlParse(xml_content)

books_xml <- xmlToDataFrame(nodes = getNodeSet(xml_data, "//book"))
print("Data from XML file:")
## [1] "Data from XML file:"
print(books_xml)
##        title                    authors publication_year     genre
## 1       1984              George Orwell             1949 Dystopian
## 2 The Hobbit             J.R.R. Tolkien             1937   Fantasy
## 3 Good Omens Neil GaimanTerry Pratchett             1990   Fantasy
##                                     themes
## 1 TotalitarianismSurveillanceIndividualism
## 2               AdventureHeroismFriendship
## 3           ApocalypseHumorMoral Ambiguity
json_file <- "https://raw.githubusercontent.com/tcgraham-data/data-607-assignment-7/refs/heads/main/books.json"
json_data <- fromJSON(json_file)

books_json <- as.data.frame(json_data$books)
print("Data from JSON file:")
## [1] "Data from JSON file:"
print(books_json)
##        title                      authors publication_year     genre
## 1       1984                George Orwell             1949 Dystopian
## 2 The Hobbit               J.R.R. Tolkien             1937   Fantasy
## 3 Good Omens Neil Gaiman, Terry Pratchett             1990   Fantasy
##                                         themes
## 1 Totalitarianism, Surveillance, Individualism
## 2               Adventure, Heroism, Friendship
## 3           Apocalypse, Humor, Moral Ambiguity

Comparing Dataframes

Very interesting. The need to tidy data becomes apparent very quickly. The HTML file, in this instance, displays multiple items under a single heading in a comma-delimited format. For example, for Authors it shows “Neil Gaiman, Terry Pratchett”. In the XML version there is no comma delimiter, and we see “Neil GaimanTerry Pratchett”. With JSON, the authors are stored as a vector with 2 entries, which shows up as <chr [2]> when the data frame is viewed as a tibble, even though print() on the base data frame above joins the entries with commas.

It wouldn’t take much tidying in R to make the XML and JSON versions display the way the HTML version does.
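As a rough sketch of that tidying, the code below joins multi-valued fields with ", " for both formats. It does not read the real files: `xml_text` and `json_text` are small hypothetical stand-ins, the nested `<author>` element name is an assumption about how books.xml is laid out, and `join_children` is a helper invented for this illustration.

```r
library(XML)
library(jsonlite)

# Hypothetical stand-in for books.xml; the <author> child element is an assumption.
xml_text <- '<books>
  <book>
    <title>1984</title>
    <authors><author>George Orwell</author></authors>
  </book>
  <book>
    <title>Good Omens</title>
    <authors><author>Neil Gaiman</author><author>Terry Pratchett</author></authors>
  </book>
</books>'

doc <- xmlParse(xml_text, asText = TRUE)
book_nodes <- getNodeSet(doc, "//book")

# Illustrative helper: collect the text of each matching child node and
# join it with ", " so the values no longer run together.
join_children <- function(node, path) {
  paste(xpathSApply(node, path, xmlValue), collapse = ", ")
}

books_xml_tidy <- data.frame(
  title   = vapply(book_nodes, join_children, character(1), path = "./title"),
  authors = vapply(book_nodes, join_children, character(1), path = "./authors/author"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for books.json, with authors stored as arrays.
json_text <- '{"books":[
  {"title":"1984","authors":["George Orwell"]},
  {"title":"Good Omens","authors":["Neil Gaiman","Terry Pratchett"]}
]}'

books_json_tidy <- fromJSON(json_text)$books

# Collapse every list column into a single comma-separated string per row.
is_list_col <- vapply(books_json_tidy, is.list, logical(1))
books_json_tidy[is_list_col] <- lapply(
  books_json_tidy[is_list_col],
  function(col) vapply(col, paste, character(1), collapse = ", ")
)

print(books_xml_tidy)
print(books_json_tidy)
```

Collapsing to a single string matches how the HTML table reads, but it gives up the one-value-per-cell structure. If the authors need to be analyzed individually, keeping the list column (or expanding to one row per author) may be the better choice.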

Are the three data frames identical?

They contain the same data, but calling them identical would be inaccurate. While they hold the same contents, they don’t all natively display in a way that is easily interpreted. To continue with the prior example, in all three cases the authors of “Good Omens” are still captured as Terry Pratchett and Neil Gaiman. However, the HTML version is comma-delimited for easy reading. The XML version is not, making it more difficult to read, and the JSON version simply tells me the authors field is a vector with two elements, which we happen to know (as author of the data) are Terry Pratchett and Neil Gaiman. The column names differ as well: the HTML table uses “Title” and “Publication Year”, while the XML and JSON versions use “title” and “publication_year”.
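The question can also be checked programmatically with base R's `identical()`. The sketch below uses two small hypothetical stand-ins rather than the loaded objects, and it assumes the XML loader returns the publication year as text; `normalise` is a helper made up for this illustration.

```r
# Hypothetical stand-ins for two of the loaded data frames, cut down to two columns.
books_html <- data.frame(
  Title = c("1984", "Good Omens"),
  `Publication Year` = c(1949L, 1990L),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
books_xml <- data.frame(
  title = c("1984", "Good Omens"),
  publication_year = c("1949", "1990"),  # assumption: XML values arrive as text
  stringsAsFactors = FALSE
)

# Strict comparison: column names and column types both differ.
identical(books_html, books_xml)  # FALSE

# Illustrative helper: lower-case snake_case names and an integer year.
normalise <- function(df) {
  names(df) <- tolower(gsub(" ", "_", names(df)))
  df$publication_year <- as.integer(df$publication_year)
  df
}

# After harmonising names and types, the two stand-ins match.
identical(normalise(books_html), normalise(books_xml))  # TRUE
```

`identical()` is deliberately strict, so a FALSE result only says that something differs. `all.equal()` on the same pair returns a description of the differences, which is often more useful when tracking down what needs tidying.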