Assignment 7 HTML and JSON

Author

Mei Qi Ng

Published

March 12, 2026

Approach Overview

The goal of this assignment is to gain experience working with structured data in HTML and JSON formats and to prepare that data for use in R as data frames.

I am using book data centered on the subject of personal growth, written by women authors. The data set consists of three books, one of which has multiple authors. That book is used to demonstrate how each format represents a list of values.

The selected books are:

Girlhood by Melissa Febos (2021)
The High 5 Habit by Mel Robbins (2021)
Burnout: The Secret to Unlocking the Stress Cycle by Emily Nagoski and Amelia Nagoski (2019)

Data Description

Book record attributes: Title, Author, Publication Year, Publisher, Genre

I chose these attributes because they are common details found on websites and because they are represented differently in different file formats.

First, I will manually create:

An HTML file showing a table of book information. Each row will be a book and each column will be a book attribute. If a book has more than one author, the authors will be listed as a single text string separated by semicolons.
A JSON file with the same book information stored as nested objects and arrays in a hierarchical structure. Each book will be stored as an object with named attributes, and the authors will be in an array so that books with multiple authors can be handled.
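As a minimal sketch of the JSON layout described above, the inline JSON string below (an assumption for illustration, not the contents of the actual books.json) stores the authors as an array. When the arrays differ in length, `fromJSON()` from jsonlite returns them as a list column, which can then be collapsed into the same semicolon-separated string used in the HTML table.

```r
library(jsonlite)

# Hypothetical inline JSON; the real books.json may be structured differently.
books_json <- '[
  {"title": "Girlhood", "authors": ["Melissa Febos"], "publication_year": 2021},
  {"title": "Burnout: The Secret to Unlocking the Stress Cycle",
   "authors": ["Emily Nagoski", "Amelia Nagoski"], "publication_year": 2019}
]'

books <- fromJSON(books_json)

# authors arrives as a list column: one character vector per book
str(books$authors)

# Collapse each author vector into one semicolon-separated string
books$authors <- vapply(books$authors, paste, character(1), collapse = "; ")
print(books)
```

Collapsing the array is only one option; keeping the list column may be preferable if the analysis needs one row per author.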

Data Strategy Proposal

I will load two R packages (rvest and jsonlite) to read the HTML and JSON files into data frames in R, and then perform the necessary transformations so that the resulting data frames share the same structure, column names, and data types for smooth analysis and comparison.
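The transformation step in this proposal could look roughly like the sketch below. `harmonize_books()` is a hypothetical helper, not part of rvest or jsonlite, and the two toy data frames are made up to show inputs that differ in column order and in the type of the year column.

```r
# Hypothetical helper: put a book table into one agreed shape.
harmonize_books <- function(df) {
  cols <- c("title", "authors", "publication_year", "publisher", "genre")
  df <- as.data.frame(df, stringsAsFactors = FALSE)  # drop any tibble classes
  df <- df[, cols]                                    # enforce one column order
  # Go through character first so factor input is converted by label
  df$publication_year <- as.integer(as.character(df$publication_year))
  text_cols <- setdiff(cols, "publication_year")
  df[text_cols] <- lapply(df[text_cols], as.character)
  rownames(df) <- NULL                                # reset row names
  df
}

# Toy inputs: same book, different column order and year type
a <- data.frame(title = "Girlhood", authors = "Melissa Febos",
                publication_year = 2021L, publisher = "Bloomsbury",
                genre = "Memoir / Essays")
b <- data.frame(genre = "Memoir / Essays", publisher = "Bloomsbury",
                publication_year = "2021", authors = "Melissa Febos",
                title = "Girlhood")

all.equal(harmonize_books(a), harmonize_books(b))
```

Running both inputs through the same function means the comparison afterwards reflects differences in the data rather than in how each file was parsed.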

Sources:

Codebase

# Load required packages
library(rvest)
library(jsonlite)

# 1. Load the manually created book HTML file from the GitHub repository

html_link <- "https://raw.githubusercontent.com/meiqing39/DATA-607/refs/heads/main/books.html"

# Read HTML file, locate table element, change into a data frame
html_dftable <- read_html(html_link) |>
  html_element("table") |>
  html_table()

print("HTML Data Frame:")

[1] "HTML Data Frame:"

print(html_dftable)

# A tibble: 3 × 5
  title                                 authors publication_year publisher genre
  <chr>                                 <chr>              <int> <chr>     <chr>
1 Girlhood                              Meliss…             2021 Bloomsbu… Memo…
2 The High 5 Habit                      Mel Ro…             2021 Hay House Self…
3 Burnout: The Secret to Unlocking the… Emily …             2019 Ballanti… Heal…
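The same read-locate-convert pipeline can be tried without a network connection. The sketch below assumes a small inline table as a stand-in for books.html (the markup is made up for illustration) and uses `minimal_html()` from rvest to parse it.

```r
library(rvest)

# Hypothetical two-row table standing in for books.html
html_src <- '
<table>
  <tr><th>title</th><th>authors</th><th>publication_year</th></tr>
  <tr><td>Girlhood</td><td>Melissa Febos</td><td>2021</td></tr>
  <tr><td>The High 5 Habit</td><td>Mel Robbins</td><td>2021</td></tr>
</table>'

toy_table <- minimal_html(html_src) |>
  html_element("table") |>
  html_table()

class(toy_table)  # includes "tbl_df", i.e. the result is a tibble
str(toy_table)    # shows the column types that html_table() inferred
```

Because `html_table()` returns a tibble, any later comparison against a base data frame has to account for the extra classes.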

# 2. Load the manually created book JSON file from the GitHub repository

json_link <- "https://raw.githubusercontent.com/meiqing39/DATA-607/refs/heads/main/books.json"

# Read JSON file into a data frame
json_table <- fromJSON(json_link)

print("JSON Data Frame:")

[1] "JSON Data Frame:"

print(json_table)

                                              title
1                                          Girlhood
2                                  The High 5 Habit
3 Burnout: The Secret to Unlocking the Stress Cycle
                        authors publication_year        publisher
1                 Melissa Febos             2021       Bloomsbury
2                   Mel Robbins             2021        Hay House
3 Emily Nagoski; Amelia Nagoski             2019 Ballantine Books
                genre
1     Memoir / Essays
2           Self-Help
3 Health / Psychology

# 3. Prepare and Compare

# Check for differences using all.equal() (which describes differences) 
# and identical() (which gives a strict TRUE/FALSE)
comparing <- all.equal(html_dftable, json_table)
exact_match <- identical(html_dftable, json_table)

print("Are the html and json data frames identical?")

[1] "Are the html and json data frames identical?"

print(exact_match)

[1] FALSE

# If they are not identical, print specific differences
if(!exact_match) {
  print("Differences found:")
  print(comparing)
}

[1] "Differences found:"
[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
[2] "Attributes: < Component \"class\": 1 string mismatch >"
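The class mismatch reported above (lengths 3 versus 1) can be reproduced with a small sketch. The two toy objects below are made up for illustration and hold the same cell values; one is built with `tibble()` and the other with `data.frame()`.

```r
library(tibble)

tb <- tibble(title = c("Girlhood", "The High 5 Habit"),
             publication_year = c(2021L, 2021L))
df <- data.frame(title = c("Girlhood", "The High 5 Habit"),
                 publication_year = c(2021L, 2021L),
                 stringsAsFactors = FALSE)

class(tb)  # three class strings: "tbl_df" "tbl" "data.frame"
class(df)  # one class string: "data.frame"

# Same cell values, but the class attribute makes the objects non-identical
identical(tb, df)

# After dropping the tibble classes, the contents compare as equal
all.equal(as.data.frame(tb), df)
```

This is why `identical()` can return FALSE even when every value in the two tables matches.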

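# The HTML table was read as a tibble; convert it to a standard data frame.
# Pass the tibble itself, not html_link: as.data.frame() on the URL string
# builds a 1 x 1 data frame, which is the source of the length (1, 3)
# mismatches in the output recorded below.
html_dftable <- as.data.frame(html_dftable)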

# Double check
all.equal(html_dftable, json_table)

[1] "Names: 1 string mismatch"                                               
[2] "Attributes: < Component \"row.names\": Numeric: lengths (1, 3) differ >"
[3] "Length mismatch: comparison on first 1 components"                      
[4] "Component 1: Lengths (1, 3) differ (string compare on first 1)"         
[5] "Component 1: 1 string mismatch"

identical(html_dftable, json_table)

[1] FALSE

Conclusion

After loading the manually created HTML and JSON files into R, I compared the two resulting data frames to find out whether they were identical. The identical() function returned FALSE, highlighting a few key technical differences in how R packages parse these distinct file formats:

The rvest package imported the HTML table as a tibble, while the jsonlite package imported the JSON file as a standard data frame.
Header names had to match exactly. The comparison failed on the first render because the headers were formatted differently (e.g., Publish Year vs. publish_year). Because column names in R are case-sensitive, I used the same lower-case headers in both files (e.g., publication_year), which kept everything consistent before loading into R.
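The header fix described above was made by hand in the source files. As an alternative sketch, headers could also be normalized after loading; `normalize_headers()` below is a hypothetical helper written in base R, not a function from rvest or jsonlite.

```r
# Hypothetical helper: lower-case the headers and replace every run of
# non-alphanumeric characters with a single underscore
normalize_headers <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_+|_+$", "", x)  # strip leading or trailing underscores
}

headers <- c("Title", "Publication Year", "publication_year")
normalize_headers(headers)
```

Applied as `names(df) <- normalize_headers(names(df))` to both data frames, this would give "Publication Year" and "publication_year" the same name before any comparison.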

The book information matches in both files in terms of content. However, the underlying data structures and assigned data types differ depending on which parsing package was used and how each file was read.