For this assignment, I selected three of my favorite books on personal growth and self-improvement: The Four Agreements by Don Miguel Ruiz and Janet Mills, The War of Art by Steven Pressfield, and Atomic Habits by James Clear. Each book includes several descriptive attributes, such as the edition, year, publisher, ISBN, number of pages, and a few relevant keywords. The goal of this project is to understand how the same information can be represented in different file formats (HTML, XML, and JSON) and to compare how each structure is interpreted when read back into R.
books <- data.frame(
title = c("The Four Agreements",
"The War of Art",
"Atomic Habits"),
authors = c("Don Miguel Ruiz; Janet Mills (editor)",
"Steven Pressfield",
"James Clear"),
edition = c("1st",
"Reprint",
"1st"),
publication_year = c(1997,
2012,
2018),
publisher = c("Amber-Allen Publishing",
"Black Irish Entertainment LLC",
"Avery"),
isbn = c("978-1878424310",
"978-1936891023",
"978-0735211292"),
pages = c(160,
190,
320),
keywords = c("personal growth, Toltec wisdom, mindset",
"creativity, resistance, discipline",
"habits, behavior change, self-improvement")
)
books
## title authors edition
## 1 The Four Agreements Don Miguel Ruiz; Janet Mills (editor) 1st
## 2 The War of Art Steven Pressfield Reprint
## 3 Atomic Habits James Clear 1st
## publication_year publisher isbn pages
## 1 1997 Amber-Allen Publishing 978-1878424310 160
## 2 2012 Black Irish Entertainment LLC 978-1936891023 190
## 3 2018 Avery 978-0735211292 320
## keywords
## 1 personal growth, Toltec wisdom, mindset
## 2 creativity, resistance, discipline
## 3 habits, behavior change, self-improvement
I built three files that hold the same information, each organized according to its own structural rules: an HTML table, an XML document, and a JSON file. The HTML and XML files are assembled by hand by pasting strings together, which makes the syntactic differences between the two formats visible. The JSON file is written with write_json() from the jsonlite package.
html_header <- "<!DOCTYPE html>
<html>
<head>
<meta charset='UTF-8'>
<title>My Favorite Books</title>
</head>
<body>
<h1>My Favorite Books</h1>
<table border='1'>
<tr>
<th>Title</th>
<th>Authors</th>
<th>Edition</th>
<th>Year</th>
<th>Publisher</th>
<th>ISBN</th>
<th>Pages</th>
<th>Keywords</th>
</tr>"
html_rows <- apply(
books,
1,
function(row) {
paste0(
" <tr><td>",
paste(row, collapse = "</td><td>"),
"</td></tr>"
)
}
)
html_footer <- " </table>
</body>
</html>"
writeLines(
c(html_header, html_rows, html_footer),
"books.html"
)
cat('<?xml version="1.0" encoding="UTF-8"?>\n<books>\n', file = "books.xml")
apply(
books,
1,
function(row) {
cat(" <book>\n", file = "books.xml", append = TRUE)
for (i in seq_along(row)) {
# escape &, < and > so the values stay valid XML (& must be replaced first)
safe_value <- gsub("&", "&amp;", row[i], fixed = TRUE)
safe_value <- gsub("<", "&lt;", safe_value, fixed = TRUE)
safe_value <- gsub(">", "&gt;", safe_value, fixed = TRUE)
cat(
paste0(
" <", names(row)[i], ">", safe_value, "</", names(row)[i], ">\n"
),
file = "books.xml",
append = TRUE
)
}
cat(" </book>\n", file = "books.xml", append = TRUE)
}
)
## NULL
cat("</books>\n", file = "books.xml", append = TRUE)
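The escaping step inside the loop can also be pulled out into a small helper. The sketch below is only an illustration: `escape_xml` is a hypothetical name, not part of the original script or of any package, and the test string is made up. The ampersand is replaced first so that the entities produced by the later substitutions are not escaped a second time.

```r
# Hypothetical helper (not in the original script): escape the three
# characters that are unsafe inside XML element content.
escape_xml <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)  # must run before the others
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x
}

# Made-up test string containing all three special characters
escape_xml("a & b < c > d")
```

None of the values in the books data contain these characters, so the helper leaves them unchanged; it only matters once a title or publisher name includes an ampersand or angle bracket.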
library(jsonlite)
write_json(
books,
"books.json",
pretty = TRUE
)
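To see the shape that write_json() produces, the same kind of data can be serialized to a string with toJSON(), which jsonlite also provides. This is a minimal sketch that uses a cut-down two-column data frame (`mini` is a hypothetical name) rather than the full books object. With its default settings, jsonlite writes a data frame as an array with one object per row.

```r
library(jsonlite)

# Cut-down stand-in for `books` (hypothetical, for illustration only)
mini <- data.frame(
  title = c("The War of Art", "Atomic Habits"),
  pages = c(190, 320),
  stringsAsFactors = FALSE
)

# Serialize to a JSON string instead of a file so the structure can be printed
json_text <- toJSON(mini, pretty = TRUE)
json_text

# Parsing the string again gives back a plain data.frame
round_trip <- fromJSON(json_text)
class(round_trip)
```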
Using the packages rvest, xml2, and jsonlite, I re-loaded the three files into R and converted each one into a data frame. This allowed me to compare whether any differences appeared after importing from different structures.
library(rvest)
library(xml2)
library(dplyr)
library(purrr)
library(stringr)
df_html <- read_html("books.html") %>%
html_table() %>%
.[[1]]
names(df_html) <- names(books)
df_html
## # A tibble: 3 × 8
## title authors edition publication_year publisher isbn pages keywords
## <chr> <chr> <chr> <int> <chr> <chr> <int> <chr>
## 1 The Four Agre… Don Mi… 1st 1997 Amber-Al… 978-… 160 persona…
## 2 The War of Art Steven… Reprint 2012 Black Ir… 978-… 190 creativ…
## 3 Atomic Habits James … 1st 2018 Avery 978-… 320 habits,…
xml_doc <- read_xml("books.xml")
df_xml <- xml_find_all(xml_doc, ".//book") %>%
map_df(function(node) {
# look up each field by name, so a missing element becomes NA instead of shifting columns
tibble(
title = xml_text(xml_find_first(node, "./title")),
authors = xml_text(xml_find_first(node, "./authors")),
edition = xml_text(xml_find_first(node, "./edition")),
publication_year = as.integer(xml_text(xml_find_first(node, "./publication_year"))),
publisher = xml_text(xml_find_first(node, "./publisher")),
isbn = xml_text(xml_find_first(node, "./isbn")),
pages = as.integer(xml_text(xml_find_first(node, "./pages"))),
keywords = xml_text(xml_find_first(node, "./keywords"))
)
})
df_xml
## # A tibble: 3 × 8
## title authors edition publication_year publisher isbn pages keywords
## <chr> <chr> <chr> <int> <chr> <chr> <int> <chr>
## 1 The Four Agre… Don Mi… 1st 1997 Amber-Al… 978-… 160 persona…
## 2 The War of Art Steven… Reprint 2012 Black Ir… 978-… 190 creativ…
## 3 Atomic Habits James … 1st 2018 Avery 978-… 320 habits,…
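The XML import above names every column by hand. As an alternative, the column names can be taken from the element names themselves. The following is a sketch under stated assumptions, not the method used in this report: `sample_xml`, `book_nodes`, `rows`, and `df_generic` are hypothetical names, the two-record document is made up to mirror the layout of books.xml, and every book element is assumed to contain the same child elements.

```r
library(xml2)

# Made-up two-record document with the same layout as books.xml
sample_xml <- "<books>
  <book><title>The War of Art</title><pages>190</pages></book>
  <book><title>Atomic Habits</title><pages>320</pages></book>
</books>"

book_nodes <- xml_find_all(read_xml(sample_xml), ".//book")

# One single-row data frame per <book>, with columns named after the child tags
rows <- lapply(book_nodes, function(node) {
  fields <- xml_children(node)
  values <- as.list(setNames(xml_text(fields), xml_name(fields)))
  as.data.frame(values, stringsAsFactors = FALSE)
})

df_generic <- do.call(rbind, rows)

# Everything arrives as text; let R infer numeric columns such as pages
df_generic <- type.convert(df_generic, as.is = TRUE)
str(df_generic)
```

The trade-off is that this version adapts to new fields automatically but fails if the books do not all share the same elements, whereas the by-name lookup returns NA for a missing field.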
df_json <- fromJSON("books.json")
df_json
## title authors edition
## 1 The Four Agreements Don Miguel Ruiz; Janet Mills (editor) 1st
## 2 The War of Art Steven Pressfield Reprint
## 3 Atomic Habits James Clear 1st
## publication_year publisher isbn pages
## 1 1997 Amber-Allen Publishing 978-1878424310 160
## 2 2012 Black Irish Entertainment LLC 978-1936891023 190
## 3 2018 Avery 978-0735211292 320
## keywords
## 1 personal growth, Toltec wisdom, mindset
## 2 creativity, resistance, discipline
## 3 habits, behavior change, self-improvement
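Finally, I compared all three data frames using the identical() function, which returns TRUE only when two objects match exactly, including column types, class, and other attributes. The HTML and XML imports are identical to each other, while the JSON import differs from both. The printed output above suggests why: the values are the same in all three, but df_html and df_xml are tibbles, whereas fromJSON() returns a base data.frame. The difference therefore most likely comes from the object class rather than from the data, which shows that the import path can change how R represents a dataset even when the underlying information is the same.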
df_html <- df_html[names(books)]
df_xml <- df_xml[names(books)]
df_json <- df_json[names(books)]
comparison_results <- list(
html_vs_xml = identical(df_html, df_xml),
html_vs_json = identical(df_html, df_json),
xml_vs_json = identical(df_xml, df_json)
)
comparison_results
## $html_vs_xml
## [1] TRUE
##
## $html_vs_json
## [1] FALSE
##
## $xml_vs_json
## [1] FALSE
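One way to check where a FALSE from identical() comes from is to compare the classes of the two objects and then repeat the comparison after converting both to a plain data frame. The sketch below uses small stand-in objects instead of the real imports (`tbl_version` and `df_version` are hypothetical names with made-up contents) and assumes the tibble package is installed.

```r
library(tibble)

# Stand-ins: a tibble (like the rvest/xml2 results) and a base data.frame
# (like the jsonlite result), holding the same values and column types
tbl_version <- tibble(
  title = c("The War of Art", "Atomic Habits"),
  pages = c(190L, 320L)
)
df_version <- data.frame(
  title = c("The War of Art", "Atomic Habits"),
  pages = c(190L, 320L),
  stringsAsFactors = FALSE
)

class(tbl_version)  # "tbl_df" "tbl" "data.frame"
class(df_version)   # "data.frame"

# Strict comparison fails because the class attribute differs
identical(tbl_version, df_version)  # FALSE

# After dropping the tibble class, compare the contents
all.equal(as.data.frame(tbl_version), df_version)
```

Applied to the real objects, the same idea would be to call as.data.frame() on df_html and df_xml before comparing them with df_json; whether that comparison then succeeds depends on the column types also matching.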
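This exercise showed me that HTML, XML, and JSON can all hold identical content while organizing it differently: HTML as a table meant for display, XML as nested, named elements, and JSON as a list of records that jsonlite maps directly back to a data frame.
The import step matters as well, because each package returns its own kind of object, so the results may need to be brought to a common class before they can be compared.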
The three source files for this project are publicly available:
Each file represents the same dataset in a different structure (HTML, XML, and JSON).