The goal of this assignment is to gain experience working with structured data in HTML and JSON formats and to prepare these data for use in R as data frames.
I am using book data centered on the subject of personal growth, written by women authors. The data set consists of three books, one of which has multiple authors. It will be used to demonstrate how the two data formats represent the same list of records.
The selected books are:
Girlhood by Melissa Febos (2021)
The High 5 Habit by Mel Robbins (2021)
Burnout: The Secret to Unlocking the Stress Cycle by Emily Nagoski and Amelia Nagoski (2019)
Data Description
Book record attributes: Title, Author, Publication Year, Publisher, Genre
I chose these attributes because they are common details that can be found on book websites and because they are represented differently across file formats.
First, I will manually create:
An HTML file showing a table that contains the book information. Each row will be a book and each column will be a book attribute. If a book has more than one author, the authors will be listed as a single text string separated by semicolons.
A JSON file with the same book information stored as nested objects and arrays in a hierarchical structure. Each book will be stored as an object with named attributes, and the authors will be held in an array so that a book can have multiple authors.
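As a rough illustration of the two layouts described above, the sketch below builds a cut-down version of each file as an inline string and parses it with rvest and jsonlite. The inline strings, the reduced column set, and the variable names (html_text, json_text, html_df, json_df) are assumptions made for illustration; the actual books.html and books.json files may be laid out differently.

```r
library(rvest)
library(jsonlite)

# Hypothetical, cut-down HTML table: one row per book, one column per attribute.
# Multiple authors are written as one semicolon-separated string.
html_text <- '
<table>
  <tr><th>title</th><th>authors</th><th>publication_year</th></tr>
  <tr><td>Girlhood</td><td>Melissa Febos</td><td>2021</td></tr>
  <tr><td>The High 5 Habit</td><td>Mel Robbins</td><td>2021</td></tr>
  <tr><td>Burnout: The Secret to Unlocking the Stress Cycle</td><td>Emily Nagoski; Amelia Nagoski</td><td>2019</td></tr>
</table>'

# Hypothetical, cut-down JSON: one object per book, authors held in an array.
json_text <- '[
  {"title": "Girlhood", "authors": ["Melissa Febos"], "publication_year": 2021},
  {"title": "The High 5 Habit", "authors": ["Mel Robbins"], "publication_year": 2021},
  {"title": "Burnout: The Secret to Unlocking the Stress Cycle",
   "authors": ["Emily Nagoski", "Amelia Nagoski"], "publication_year": 2019}
]'

# Parse the HTML string, pick the table element, and convert it to a tibble.
html_df <- read_html(html_text) |>
  html_element("table") |>
  html_table()

# Parse the JSON string; the array-valued authors field becomes a list column.
json_df <- fromJSON(json_text)

# Collapse each author vector into one string so it matches the HTML layout.
json_df$authors <- vapply(json_df$authors, paste, character(1), collapse = "; ")

str(html_df)
str(json_df)
```

Collapsing the author array is only one option; keeping it as a list column would preserve the structure but would no longer line up with the single-string column from the HTML table.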
Data Strategy Proposal
I will load the R packages rvest and jsonlite to read the HTML and JSON files into data frames and to perform the necessary transformations so that the resulting data frames share the same structure, column names, and data types for smooth analysis and comparison.
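The transformation step can be sketched with base R alone. The helper harmonize_books() and the two small hand-built data frames below are hypothetical and stand in for the parsed files; the sketch assumes every input has a publication year column whose header reduces to publication_year.

```r
# Hypothetical helper: bring a parsed book table into one shared shape.
harmonize_books <- function(df) {
  # Drop any tibble class so both inputs are plain data frames.
  df <- as.data.frame(df)
  # Lowercase the headers and replace runs of other characters with "_".
  names(df) <- gsub("[^a-z0-9]+", "_", tolower(names(df)))
  # Store the year as an integer regardless of how it was parsed.
  df$publication_year <- as.integer(df$publication_year)
  df
}

# Stand-in for a table with title-case headers and the year read as text.
from_html <- data.frame(
  Title = c("Girlhood", "The High 5 Habit"),
  `Publication Year` = c("2021", "2021"),
  check.names = FALSE
)

# Stand-in for a table that already uses lowercase headers and integer years.
from_json <- data.frame(
  title = c("Girlhood", "The High 5 Habit"),
  publication_year = c(2021L, 2021L)
)

books_html <- harmonize_books(from_html)
books_json <- harmonize_books(from_json)

# After harmonizing, the two tables can be compared directly.
same_shape <- identical(books_html, books_json)
print(same_shape)
```

Putting the clean-up in one function means both sources pass through exactly the same rules, which keeps the later comparison about the data and not about formatting.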
# Load required packages
library(rvest)
library(jsonlite)

# 1. Load the manually created book HTML from the GitHub repository
html_link <- "https://raw.githubusercontent.com/meiqing39/DATA-607/refs/heads/main/books.html"

# Read the HTML file, locate the table element, and convert it into a data frame
html_dftable <- read_html(html_link) |>
  html_element("table") |>
  html_table()

print("HTML Data Frame:")
[1] "HTML Data Frame:"
print(html_dftable)
# A tibble: 3 × 5
  title                                 authors publication_year publisher genre
  <chr>                                 <chr>              <int> <chr>     <chr>
1 Girlhood                              Meliss…             2021 Bloomsbu… Memo…
2 The High 5 Habit                      Mel Ro…             2021 Hay House Self…
3 Burnout: The Secret to Unlocking the… Emily …             2019 Ballanti… Heal…
# 2. Load the manually created book JSON from the GitHub repository
json_link <- "https://raw.githubusercontent.com/meiqing39/DATA-607/refs/heads/main/books.json"

# Read the JSON file into a data frame
json_table <- fromJSON(json_link)

print("JSON Data Frame:")
[1] "JSON Data Frame:"
print(json_table)
                                              title
1                                          Girlhood
2                                  The High 5 Habit
3 Burnout: The Secret to Unlocking the Stress Cycle
                        authors publication_year        publisher
1                 Melissa Febos             2021       Bloomsbury
2                   Mel Robbins             2021        Hay House
3 Emily Nagoski; Amelia Nagoski             2019 Ballantine Books
                genre
1     Memoir / Essays
2           Self-Help
3 Health / Psychology
# 3. Prepare and compare
# Check for differences using all.equal() (which describes differences)
# and identical() (which gives a strict TRUE/FALSE)
comparing <- all.equal(html_dftable, json_table)
exact_match <- identical(html_dftable, json_table)

print("Are the html and json data frames identical?")
[1] "Are the html and json data frames identical?"
print(exact_match)
[1] FALSE
# If they are not identical, print the specific differences
if (!exact_match) {
  print("Differences found:")
  print(comparing)
}
After loading the manually created HTML and JSON files into R, I compared the two resulting data frames to find out whether they were identical. The identical() function returned FALSE, highlighting a few key technical differences in how the R packages parse these file formats:
The rvest package imports the HTML table as a tibble, while the jsonlite package imports the JSON file as a standard data frame.
Header names had to match exactly. The comparison failed on the first render because of mismatched header formatting (e.g., Publish Year vs. publish_year). Because R is case-sensitive, I switched to lowercase headers in both files (e.g., publication_year) so that the column names were consistent before loading into R.
The book information matches exactly across both files; however, the underlying data structures and assigned data types differ depending on the parsing package used and how each file was read.
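The class difference noted above can be reproduced in isolation. This is a minimal sketch that assumes the tibble package is installed (rvest depends on it); html_like and json_like are hypothetical stand-ins for the parsed tables, not the actual files.

```r
library(tibble)

# Stand-in for what html_table() returns: a tibble.
html_like <- tibble(
  title = c("Girlhood", "The High 5 Habit"),
  publication_year = c(2021L, 2021L)
)

# Stand-in for what fromJSON() returns: a plain data frame.
json_like <- data.frame(
  title = c("Girlhood", "The High 5 Habit"),
  publication_year = c(2021L, 2021L)
)

# The class vectors differ even though the contents are the same.
print(class(html_like))
print(class(json_like))

# identical() is strict about attributes, so the class difference alone
# is enough to make it return FALSE.
strict_match <- identical(html_like, json_like)

# Dropping the tibble class first lets the comparison focus on the contents.
content_match <- isTRUE(all.equal(as.data.frame(html_like), json_like))

print(strict_match)
print(content_match)
```

Converting with as.data.frame() before comparing is one way to separate a genuine data mismatch from a difference that comes only from the parsing package.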