Intro

In this assignment, I practice reading the same data from three different file formats (HTML, XML, and JSON), using some of my favorite books.

I have placed the files in my GitHub repo. Each one is read in and explored below:

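# Packages used throughout
library(rvest)      # read_html(), html_node(), html_table()
library(xml2)       # read_xml(), xml_find_all(), xml_text()
library(jsonlite)   # fromJSON()
library(dplyr)      # mutate(), select(), %>%
library(knitr)      # kable()
library(kableExtra) # kable_styling()

# 1. Read HTML table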
html_df <- read_html("https://raw.githubusercontent.com/cdube89128/DATA-607/refs/heads/main/week-07/books.html") %>%
  html_node("table") %>%
  html_table(header = TRUE, fill = TRUE, trim = TRUE) %>%
  as_tibble()

# Normalize HTML authors/attributes into list-columns
html_df <- html_df %>%
  # Split on a semicolon plus any trailing whitespace; one character vector per book
  mutate(authors = strsplit(Authors, ";\\s*"),
         attributes = purrr::pmap(list(`Attribute 1`, `Attribute 2`), function(a,b) c(a,b))) %>%
  select(title = Title, authors, attributes)

# 2. Read XML
xml_doc <- read_xml("https://raw.githubusercontent.com/cdube89128/DATA-607/refs/heads/main/week-07/books.xml")
book_nodes <- xml_find_all(xml_doc, ".//book")
xml_list <- lapply(book_nodes, function(b) {
  title <- xml_text(xml_find_first(b, "./title"))
  authors <- xml_text(xml_find_all(b, "./authors/author"))
  attrs <- xml_text(xml_find_all(b, "./attributes/attribute"))
  list(title = title, authors = authors, attributes = attrs)
})
xml_df <- tibble::tibble(
  title = vapply(xml_list, function(x) x$title, FUN.VALUE = character(1)),
  authors = lapply(xml_list, function(x) x$authors),
  attributes = lapply(xml_list, function(x) x$attributes)
)

# 3. Read JSON
json_df_raw <- fromJSON("https://raw.githubusercontent.com/cdube89128/DATA-607/refs/heads/main/week-07/books.json", simplifyDataFrame = FALSE)
json_df <- tibble::tibble(
  title = vapply(json_df_raw, function(x) x$title, FUN.VALUE = character(1)),
  authors = lapply(json_df_raw, function(x) x$authors),
  attributes = lapply(json_df_raw, function(x) x$attributes)
)
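
The HTML step above has to split a single cell into several authors, while the XML and JSON files already keep each author as a separate element. A minimal base-R sketch of that split, assuming the authors are separated by a semicolon as the code implies (the `authors_raw` values are illustrative stand-ins, not read from the actual file):

```r
# Illustrative stand-ins for the Authors column of the HTML table
authors_raw <- c("Frances Hodgson Burnett", "Vincent Bugliosi; Curt Gentry")

# Split on a semicolon followed by optional whitespace.
# strsplit() returns a list with one character vector per input string,
# which is exactly the shape a list-column needs.
authors <- strsplit(authors_raw, ";\\s*")

lengths(authors)   # number of authors per book
authors[[2]]       # the two authors of the second book
```

Because the result is a list of character vectors, it lines up with what `xml_text(xml_find_all(...))` and `fromJSON(..., simplifyDataFrame = FALSE)` produce for the same field.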

Comparing the data from the different sources

flatten <- function(df) {
  df %>% 
    mutate(
      authors_flat = vapply(authors, function(x) paste(x, collapse = "<br>"), FUN.VALUE = character(1)),
      attributes_flat = vapply(attributes, function(x) paste(x, collapse = "<br>"), FUN.VALUE = character(1))
    ) %>%
    select(title, authors_flat, attributes_flat)
}
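
To see what the flattening step does without any of the three files, here is a small base-R sketch of the same idea. The `toy` data and the `collapse_col()` helper are hypothetical names introduced only for this illustration; the real `flatten()` above uses dplyr.

```r
# Toy data with list-columns (values are illustrative, not from the files)
toy <- data.frame(title = c("Book A", "Book B"), stringsAsFactors = FALSE)
toy$authors    <- list("Author One", c("Author Two", "Author Three"))
toy$attributes <- list(c("fact a", "genre a"), c("fact b", "genre b"))

# Hypothetical helper: collapse each list element into one string,
# joining multiple values with "<br>" so they render on separate lines
collapse_col <- function(col) vapply(col, paste, character(1), collapse = "<br>")

toy_flat <- data.frame(
  title           = toy$title,
  authors_flat    = collapse_col(toy$authors),
  attributes_flat = collapse_col(toy$attributes),
  stringsAsFactors = FALSE
)

toy_flat$authors_flat
```

Each row ends up with plain character columns, which is what lets `kable()` print the list-columns and lets `identical()` compare the three data frames value by value.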

# Flatten the data so that it all prints out
html_flat <- flatten(html_df)
xml_flat  <- flatten(xml_df)
json_flat <- flatten(json_df)

# Looking at the data
kable(html_flat, format = "html", escape = FALSE, caption = "HTML Data Frame") %>%
  kable_styling(full_width = FALSE)
HTML Data Frame

title | authors_flat | attributes_flat
A Little Princess | Frances Hodgson Burnett | The book is an expansion of the short story “Sara Crewe: or, What Happened at Miss Minchin’s”.<br>Children’s literature / classic
Helter Skelter: The True Story of the Manson Murders | Vincent Bugliosi<br>Curt Gentry | Vincent Bugliosi was the prosecutor on the Charles Manson case.<br>True crime / historical account
The Radium Girls: The Dark Story of America’s Shining Women | Kate Moore | Kate Moore did extensive research for the book, visiting the homes and graves of some of the women affected.<br>Social history / labor & public health

kable(xml_flat, format = "html", escape = FALSE, caption = "XML Data Frame") %>%
  kable_styling(full_width = FALSE)
XML Data Frame

title | authors_flat | attributes_flat
A Little Princess | Frances Hodgson Burnett | The book is an expansion of the short story “Sara Crewe: or, What Happened at Miss Minchin’s”.<br>Children’s literature / classic
Helter Skelter: The True Story of the Manson Murders | Vincent Bugliosi<br>Curt Gentry | Vincent Bugliosi was the prosecutor on the Charles Manson case.<br>True crime / historical account
The Radium Girls: The Dark Story of America’s Shining Women | Kate Moore | Kate Moore did extensive research for the book, visiting the homes and graves of some of the women affected.<br>Social history / labor & public health

kable(json_flat, format = "html", escape = FALSE, caption = "JSON Data Frame") %>%
  kable_styling(full_width = FALSE)
JSON Data Frame

title | authors_flat | attributes_flat
A Little Princess | Frances Hodgson Burnett | The book is an expansion of the short story “Sara Crewe: or, What Happened at Miss Minchin’s”.<br>Children’s literature / classic
Helter Skelter: The True Story of the Manson Murders | Vincent Bugliosi<br>Curt Gentry | Vincent Bugliosi was the prosecutor on the Charles Manson case.<br>True crime / historical account
The Radium Girls: The Dark Story of America’s Shining Women | Kate Moore | Kate Moore did extensive research for the book, visiting the homes and graves of some of the women affected.<br>Social history / labor & public health

Checking if these are identical

h <- flatten(html_df)
x <- flatten(xml_df)
j <- flatten(json_df)

cat("Are HTML and XML identical? ", identical(h, x), "\n")
## Are HTML and XML identical?  TRUE
cat("Are HTML and JSON identical? ", identical(h, j), "\n")
## Are HTML and JSON identical?  TRUE
cat("Are XML and JSON identical? ", identical(x, j), "\n")
## Are XML and JSON identical?  TRUE
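
`identical()` only answers yes or no. If one of the checks had come back FALSE, something like the sketch below could help locate the difference. It uses hypothetical stand-in data frames (`a`, `b`, `c_df`) rather than the real `h`, `x`, and `j`, and deliberately adds a trailing space to one value so that a mismatch shows up.

```r
# Hypothetical stand-ins for the three flattened data frames
a    <- data.frame(title = "Book A", authors_flat = "Author One",
                   stringsAsFactors = FALSE)
b    <- a
c_df <- transform(a, authors_flat = "Author One ")  # note the trailing space

frames <- list(html = a, xml = b, json = c_df)

# Run identical() on every pair of sources
pairs   <- combn(names(frames), 2, simplify = FALSE)
results <- vapply(pairs,
                  function(p) identical(frames[[p[1]]], frames[[p[2]]]),
                  logical(1))
names(results) <- vapply(pairs, paste, character(1), collapse = " vs ")
results

# all.equal() describes the mismatch instead of just returning FALSE
all.equal(a, c_df)
```

With the real data all three pairs matched, so this kind of diagnosis was not needed here, but small differences such as stray whitespace or a different column type are the usual reasons `identical()` fails across formats.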

Conclusion

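The HTML, XML, and JSON files can all be read into the same structure, and here they were: once authors and attributes were normalized into list-columns and flattened, every pairwise identical() check returned TRUE.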