Introduction

For this assignment, I selected three of my favorite books on personal growth and self-improvement: The Four Agreements by Don Miguel Ruiz and Janet Mills, The War of Art by Steven Pressfield, and Atomic Habits by James Clear. Each book includes several descriptive attributes, such as the edition, year, publisher, ISBN, number of pages, and a few relevant keywords. The goal of this project is to understand how the same information can be represented in different file formats (HTML, XML, and JSON) and to compare how each structure is interpreted when read back into R.

Step 1 - Create the dataset

books <- data.frame(
  title = c("The Four Agreements",
            "The War of Art",
            "Atomic Habits"),
  
  authors = c("Don Miguel Ruiz; Janet Mills (editor)",
              "Steven Pressfield",
              "James Clear"),
  
  edition = c("1st",
              "Reprint",
              "1st"),
  
  publication_year = c(1997,
                       2012,
                       2018),
  
  publisher = c("Amber-Allen Publishing",
                "Black Irish Entertainment LLC",
                "Avery"),
  
  isbn = c("978-1878424310",
           "978-1936891023",
           "978-0735211292"),
  
  pages = c(160,
            190,
            320),
  
  keywords = c("personal growth, Toltec wisdom, mindset",
               "creativity, resistance, discipline",
               "habits, behavior change, self-improvement")
)

books
##                 title                               authors edition
## 1 The Four Agreements Don Miguel Ruiz; Janet Mills (editor)     1st
## 2      The War of Art                     Steven Pressfield Reprint
## 3       Atomic Habits                           James Clear     1st
##   publication_year                     publisher           isbn pages
## 1             1997        Amber-Allen Publishing 978-1878424310   160
## 2             2012 Black Irish Entertainment LLC 978-1936891023   190
## 3             2018                         Avery 978-0735211292   320
##                                    keywords
## 1   personal growth, Toltec wisdom, mindset
## 2        creativity, resistance, discipline
## 3 habits, behavior change, self-improvement

Step 2 - Create and export data files

I created three files: an HTML table, an XML document, and a JSON file. Each file contains the same information, organized according to its own structural rules. The HTML and XML files are assembled by hand from strings to show how the two formats differ syntactically, while the JSON file is written with write_json() from the jsonlite package.
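To make the syntactic differences concrete, the sketch below renders a single record in each of the three notations using only base R string functions. The `record` list and the variable names are assumptions introduced for illustration, and the strings are simplified (no escaping, no document wrapper), so this is a sketch rather than the export code itself.

```r
# Hypothetical single record, reduced to two fields for illustration
record <- list(title = "Atomic Habits", pages = 320)

# HTML: values sit in positional table cells; column names live in a separate header row
html_row <- paste0("<tr><td>", record$title, "</td><td>", record$pages, "</td></tr>")

# XML: each value is wrapped in an element named after its field
xml_node <- paste0(
  "<book><title>", record$title, "</title>",
  "<pages>", record$pages, "</pages></book>"
)

# JSON: key/value pairs; strings are quoted, numbers are not
json_obj <- paste0('{"title": "', record$title, '", "pages": ', record$pages, "}")

cat(html_row, xml_node, json_obj, sep = "\n")
```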

html_header <- "<!DOCTYPE html>
<html>
  <head>
    <meta charset='UTF-8'>
    <title>My Favorite Books</title>
  </head>
  <body>
    <h1>My Favorite Books</h1>
    <table border='1'>
      <tr>
        <th>Title</th>
        <th>Authors</th>
        <th>Edition</th>
        <th>Year</th>
        <th>Publisher</th>
        <th>ISBN</th>
        <th>Pages</th>
        <th>Keywords</th>
      </tr>"

html_rows <- apply(
  books,
  1,
  function(row) {
    paste0(
      "      <tr><td>",
      paste(row, collapse = "</td><td>"),
      "</td></tr>"
    )
  }
)

html_footer <- "    </table>
  </body>
</html>"

writeLines(
  c(html_header, html_rows, html_footer),
  "books.html"
)

cat('<?xml version="1.0" encoding="UTF-8"?>\n<books>\n', file = "books.xml")

apply(
  books,
  1,
  function(row) {
    cat("  <book>\n", file = "books.xml", append = TRUE)
    
    for (i in seq_along(row)) {
      # Escape &, <, and > so the values cannot break the XML markup (& must be replaced first)
      safe_value <- gsub("&", "&amp;", row[i])
      safe_value <- gsub("<", "&lt;", safe_value)
      safe_value <- gsub(">", "&gt;", safe_value)
      
      cat(
        paste0(
          "    <", names(row)[i], ">", safe_value, "</", names(row)[i], ">\n"
        ),
        file = "books.xml",
        append = TRUE
      )
    }
    
    cat("  </book>\n", file = "books.xml", append = TRUE)
  }
)
## NULL

cat("</books>\n", file = "books.xml", append = TRUE)

library(jsonlite)

write_json(
  books,
  "books.json",
  pretty = TRUE
)
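The three `gsub()` calls inside the XML loop could be collected into a small helper, which would also make it easy to apply the same escaping to the HTML cells (the HTML rows are written without escaping). The function name `escape_markup` is an assumption introduced here; it is a minimal sketch that handles only `&`, `<`, and `>`, not quotes inside attribute values.

```r
# Hypothetical helper: replace the characters that are special in XML/HTML text.
# The ampersand is handled first so that entities added later are not re-escaped.
escape_markup <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x
}

# Illustrative input containing all three special characters
escape_markup("R & D <draft>")
```

None of the values in the book table contain these characters, so the helper would leave them unchanged; it matters only once a title or publisher includes an ampersand or angle bracket.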

Step 3 - Read files back into R

Using the packages rvest, xml2, and jsonlite, I re-loaded the three files into R and converted each one into a data frame. This allowed me to compare whether any differences appeared after importing from different structures.

library(rvest)
library(xml2)
library(dplyr)
library(purrr)
library(stringr)

df_html <- read_html("books.html") %>%
  html_table() %>%  # `fill` is deprecated since rvest 1.0.0; tables are filled by default
  .[[1]]

names(df_html) <- names(books)

df_html
## # A tibble: 3 × 8
##   title          authors edition publication_year publisher isbn  pages keywords
##   <chr>          <chr>   <chr>              <int> <chr>     <chr> <int> <chr>   
## 1 The Four Agre… Don Mi… 1st                 1997 Amber-Al… 978-…   160 persona…
## 2 The War of Art Steven… Reprint             2012 Black Ir… 978-…   190 creativ…
## 3 Atomic Habits  James … 1st                 2018 Avery     978-…   320 habits,…

xml_doc <- read_xml("books.xml")

df_xml <- xml_find_all(xml_doc, ".//book") %>%
  map_df(function(node) {
    tibble(
      title = xml_text(xml_find_first(node, "./title")),
      authors = xml_text(xml_find_first(node, "./authors")),
      edition = xml_text(xml_find_first(node, "./edition")),
      publication_year = as.integer(xml_text(xml_find_first(node, "./publication_year"))),
      publisher = xml_text(xml_find_first(node, "./publisher")),
      isbn = xml_text(xml_find_first(node, "./isbn")),
      pages = as.integer(xml_text(xml_find_first(node, "./pages"))),
      keywords = xml_text(xml_find_first(node, "./keywords"))
    )
  })

df_xml
## # A tibble: 3 × 8
##   title          authors edition publication_year publisher isbn  pages keywords
##   <chr>          <chr>   <chr>              <int> <chr>     <chr> <int> <chr>   
## 1 The Four Agre… Don Mi… 1st                 1997 Amber-Al… 978-…   160 persona…
## 2 The War of Art Steven… Reprint             2012 Black Ir… 978-…   190 creativ…
## 3 Atomic Habits  James … 1st                 2018 Avery     978-…   320 habits,…

df_json <- fromJSON("books.json")

df_json
##                 title                               authors edition
## 1 The Four Agreements Don Miguel Ruiz; Janet Mills (editor)     1st
## 2      The War of Art                     Steven Pressfield Reprint
## 3       Atomic Habits                           James Clear     1st
##   publication_year                     publisher           isbn pages
## 1             1997        Amber-Allen Publishing 978-1878424310   160
## 2             2012 Black Irish Entertainment LLC 978-1936891023   190
## 3             2018                         Avery 978-0735211292   320
##                                    keywords
## 1   personal growth, Toltec wisdom, mindset
## 2        creativity, resistance, discipline
## 3 habits, behavior change, self-improvement
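Printed output can hide type differences, so one way to check how each import typed its columns is to tabulate the column classes. The helper name `column_types` and the toy data frame are assumptions for illustration; in this project the same call could be applied to `df_html`, `df_xml`, and `df_json`.

```r
# Hypothetical helper: return the first class of every column as a named character vector
column_types <- function(df) {
  vapply(df, function(col) class(col)[1], character(1))
}

# Toy data frame standing in for an imported table
toy <- data.frame(
  title = "Atomic Habits",
  pages = 320L,
  stringsAsFactors = FALSE
)

column_types(toy)
```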

Step 4 - Compare results

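Finally, I compared the three data frames using the identical() function. The HTML and XML imports are identical to each other, while both differ from the JSON import. The printed values are the same in all three, so the difference is one of representation rather than content: fromJSON() returns a base data frame, whereas the HTML and XML imports are tibbles, and identical() compares attributes such as class as well as values. The same information can therefore fail a strict equality test purely because of how it was read in.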

df_html <- df_html[names(books)]
df_xml  <- df_xml[names(books)]
df_json <- df_json[names(books)]

comparison_results <- list(
  html_vs_xml  = identical(df_html, df_xml),
  html_vs_json = identical(df_html, df_json),
  xml_vs_json  = identical(df_xml, df_json)
)

comparison_results
## $html_vs_xml
## [1] TRUE
## 
## $html_vs_json
## [1] FALSE
## 
## $xml_vs_json
## [1] FALSE
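
Because `identical()` compares attributes such as the class vector as well as the values, two tables with the same cells can still compare as different. The sketch below reproduces that situation with toy data in base R. It mimics a tibble by assigning its class vector by hand, which is an assumption made so the example does not depend on the tibble package.

```r
# Toy data frame (illustrative values only)
plain <- data.frame(
  title = c("A", "B"),
  pages = c(160L, 320L),
  stringsAsFactors = FALSE
)

# Same values, but tagged with the class vector that tibbles carry
tagged <- plain
class(tagged) <- c("tbl_df", "tbl", "data.frame")

# Strict comparison: attributes, including class, must match
strict_match <- identical(plain, tagged)

# Comparison after coercing both sides to a base data frame
coerced_match <- identical(as.data.frame(plain), as.data.frame(tagged))

c(strict = strict_match, coerced = coerced_match)
```

Applied to this project, wrapping each import in `as.data.frame()` before calling `identical()` would be one way to test whether the values themselves agree. It removes only the class difference, so any remaining mismatch in column types would still be reported.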

Reflection

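This exercise showed me that HTML, XML, and JSON can all hold the same content while organizing it differently. The HTML table is laid out for display and relies on cell position, the XML document labels every value with its own element, and the JSON file stores each book as a set of key-value pairs that fromJSON() converts to a data frame in a single call.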


The three source files for this project are publicly available:

Each file represents the same dataset in a different structure (HTML, XML, and JSON).