Introduction

This project stores information about three books in three formats: HTML, XML, and JSON. Each file is loaded into an R data frame, and the resulting data frames are compared to check whether they are identical.

Book Information

  • Book 1: Gödel, Escher, Bach: An Eternal Golden Braid by Douglas Hofstadter
  • Book 2: The Talisman by Stephen King and Peter Straub
  • Book 3: Crime and Punishment by Fyodor Dostoevsky

HTML, XML, and JSON Representations

The same book data is stored in three files, one per format: HTML, XML, and JSON. Each file is read from GitHub into R in the sections that follow.

HTML

library(rvest)

# Load HTML file
html_file <- read_html("https://raw.githubusercontent.com/Amish22/DS607/refs/heads/main/Books.html")
html_table <- html_table(html_nodes(html_file, "table")[[1]])

# Display HTML table
html_table
## # A tibble: 3 × 5
##   Title                                         Year `Author(s)`    Genre Themes
##   <chr>                                        <int> <chr>          <chr> <chr> 
## 1 Gödel, Escher, Bach: An Eternal Golden Braid  1979 Douglas Hofst… Non-… Self-…
## 2 The Talisman                                  1984 Stephen King,… Fant… Paral…
## 3 Crime and Punishment                          1866 Fyodor Dostoe… Fict… Guilt…

XML

# Load necessary library
library(xml2)

# Load XML file using xml2
xml_file <- read_xml("https://raw.githubusercontent.com/Amish22/DS607/refs/heads/main/Books.xml")

# Extract relevant nodes and convert to a data frame manually
titles <- xml_text(xml_find_all(xml_file, "//title"))
authors <- xml_text(xml_find_all(xml_file, "//author"))
years <- xml_text(xml_find_all(xml_file, "//year"))
genres <- xml_text(xml_find_all(xml_file, "//genre"))
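# Themes are nested as individual <theme> nodes under <themes>, so this can
# yield more than one value per book; it is not added to the data frame below
themes <- xml_text(xml_find_all(xml_file, "//themes/theme"))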

# Organize extracted data into a data frame
xml_data <- data.frame(
  Title = titles,
  Author = authors,
  Year = years,
  Genre = genres
)

# Display the XML data
xml_data
##                                          Title                     Author Year
## 1 Gödel, Escher, Bach: An Eternal Golden Braid         Douglas Hofstadter 1979
## 2                                 The Talisman Stephen King, Peter Straub 1984
## 3                         Crime and Punishment          Fyodor Dostoevsky 1866
##                                         Genre
## 1  Non-fiction, Philosophy, Cognitive Science
## 2                             Fantasy, Horror
## 3 Fiction, Psychological, Philosophical novel
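
The XML data frame above leaves out the themes. The following is a minimal sketch of one way they could be included, assuming each <book> element holds its themes as <theme> children of a <themes> element (the layout the //themes/theme query suggests). The inline sample document and the names sample_xml, books, themes_per_book, and sample_data are illustrative and are not taken from Books.xml.

```r
library(xml2)

# Hypothetical sample mirroring the assumed layout of Books.xml
sample_xml <- read_xml(paste0(
  "<books>",
  "<book><title>Book A</title>",
  "<themes><theme>Guilt</theme><theme>Morality</theme></themes></book>",
  "<book><title>Book B</title>",
  "<themes><theme>Good vs. evil</theme></themes></book>",
  "</books>"
))

books <- xml_find_all(sample_xml, "//book")

# Collapse each book's <theme> nodes into one comma-separated string,
# so there is exactly one value per book
themes_per_book <- vapply(
  seq_along(books),
  function(i) {
    theme_nodes <- xml_find_all(books[[i]], "./themes/theme")
    paste(xml_text(theme_nodes), collapse = ", ")
  },
  character(1)
)

sample_data <- data.frame(
  Title = xml_text(xml_find_all(sample_xml, "//book/title")),
  Themes = themes_per_book,
  stringsAsFactors = FALSE
)
sample_data
```

Querying themes relative to each book, rather than across the whole document, keeps the themes vector the same length as the other columns.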

JSON

library(jsonlite)

# Load JSON file
json_file <- fromJSON("https://raw.githubusercontent.com/Amish22/DS607/refs/heads/main/Books.json")
json_data <- as.data.frame(json_file$books)

# Display JSON data
json_data
##                                          title year                     author
## 1 Gödel, Escher, Bach: An Eternal Golden Braid 1979         Douglas Hofstadter
## 2                                 The Talisman 1984 Stephen King, Peter Straub
## 3                         Crime and Punishment 1866          Fyodor Dostoevsky
##                                         genre
## 1  Non-fiction, Philosophy, Cognitive Science
## 2                             Fantasy, Horror
## 3 Fiction, Psychological, Philosophical novel
##                                                                        themes
## 1 Self-reference, Formal systems, Intersection of mathematics, art, and music
## 2          Parallel universes, Hero’s journey, Mother-son bond, Good vs. evil
## 3                              Guilt, Morality, Redemption, Crime and justice
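
fromJSON() simplifies a JSON array of objects that share the same keys into a data frame, which is why json_file$books can be treated as a table. The following is a minimal sketch, assuming Books.json wraps its records in a top-level "books" array (as the json_file$books access implies). The inline sample and its values are made up for illustration.

```r
library(jsonlite)

# Hypothetical sample: a top-level object holding an array of records
sample_json <- '{"books": [
  {"title": "Book A", "year": 1979},
  {"title": "Book B", "year": 1984}
]}'

parsed <- fromJSON(sample_json)

# The array of records is simplified to a data frame
is.data.frame(parsed$books)

# Column names are taken directly from the JSON keys
names(parsed$books)

# Inspect the column types chosen by the parser
sapply(parsed$books, class)
```

Because the column names come straight from the JSON keys, they stay lowercase unless they are renamed after loading.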

Comparison of Data Frames

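identical() returns TRUE only when two objects match exactly, including class, column names, column order, and column types. Each pair of data frames loaded from HTML, XML, and JSON is compared in turn.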

HTML vs XML

identical(html_table, xml_data)
## [1] FALSE

HTML vs JSON

identical(html_table, json_data)
## [1] FALSE

XML vs JSON

identical(xml_data, json_data)
## [1] FALSE
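
identical() reports only that two objects differ, not how. The following is a minimal base R sketch of how the differences could be inspected and then removed. The two small data frames and the helper normalize_books() are hypothetical stand-ins, not the objects loaded above.

```r
# Hypothetical stand-ins: same content, different names, order, and types
df_a <- data.frame(Title = c("Book A", "Book B"), Year = c(1979L, 1984L),
                   stringsAsFactors = FALSE)
df_b <- data.frame(year = c("1979", "1984"), title = c("Book A", "Book B"),
                   stringsAsFactors = FALSE)

identical(df_a, df_b)  # FALSE

# Inspect what differs: column names, column order, and column types
names(df_a)
names(df_b)
sapply(df_a, class)
sapply(df_b, class)

# Hypothetical helper: plain data frame, lowercase names,
# fixed column order, consistent column types
normalize_books <- function(df) {
  df <- as.data.frame(df)
  names(df) <- tolower(names(df))
  df <- df[, c("title", "year")]
  df$title <- as.character(df$title)
  df$year <- as.integer(df$year)
  df
}

identical(normalize_books(df_a), normalize_books(df_b))
```

Applying the same idea to the three loaded data frames would also require agreeing on a shared set of columns, since they do not all carry the same ones.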

Conclusion

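This project demonstrates how the same book data can be represented in HTML, XML, and JSON, loaded into R, and compared. All three comparisons return FALSE. The outputs show why: the HTML table is a tibble with capitalized column names and an Author(s) column, the XML data frame has no themes column, and the JSON data frame uses lowercase column names.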