Assignment HTML and JSON

Author

Khandker Qaiduzzaman

Objective

The objective of this assignment is to create HTML and JSON data files containing information about selected adventure books and import them into R for analysis.

Approach

For this assignment, I created a small dataset of three adventure books by authors whose work I enjoy reading: Alexandre Dumas, James Rollins, and H. Rider Haggard.

The selected books are:

The Three Musketeers
Sandstorm
King Solomon’s Mines

The dataset contains the following variables:

Title – Name of the book
Authors – Author or authors of the book
Publication Year – Year the book was originally published
Publisher – Publishing company responsible for the book
Genre – Literary genre of the book

The same dataset is stored in two different formats:

HTML table – representing structured tabular data often found on websites.
JSON file – representing hierarchical data commonly used by APIs.

Working with both formats will demonstrate how structured information from web sources can be imported and converted into R data frames.
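As a rough illustration of the two formats, the sketch below parses a one-row HTML table and an equivalent JSON document from inline strings rather than from the files used in this assignment. The strings, the variable names, and the choice of a single sample row are assumptions made for the example; only the column names and the book values come from the dataset described above.

```r
library(rvest)
library(jsonlite)

# Hypothetical inline HTML: one header row and one data row
html_text <- "
<table>
  <tr><th>Title</th><th>Authors</th><th>Publication Year</th></tr>
  <tr><td>Sandstorm</td><td>James Rollins</td><td>2004</td></tr>
</table>"

# Hypothetical inline JSON holding the same record under a 'books' key
json_text <- '{"books": [{"title": "Sandstorm", "authors": ["James Rollins"], "publication_year": 2004}]}'

# Given a whole document, html_table() returns a list with one tibble per <table>
html_tables <- minimal_html(html_text) |>
  html_table()
html_df <- as.data.frame(html_tables[[1]])

# fromJSON() simplifies the array of records into a data frame
json_df <- fromJSON(json_text)$books

str(html_df)
str(json_df)
```

The str() calls make the structural difference visible: the HTML table arrives as flat columns of text and numbers, while the JSON version keeps the authors array as a nested value.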

Anticipated Challenges

One of the main challenges when working with web-based data formats is that they represent the same information in different structures. In HTML files, tabular data is stored inside table elements and must be extracted with a package such as rvest. JSON files, on the other hand, store information as nested objects and arrays, which must be parsed with a package such as jsonlite. Another challenge is handling nested values such as lists of authors. In JSON, multiple authors may be stored as an array, which R reads as a list column; it must be converted into a flat format, such as a single string per book, before the data can be analyzed or compared.
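The author-array issue can be seen on a small inline example. This is a sketch under assumptions: the JSON string and variable names are made up for illustration, with one single-author record and one two-author record, and vapply() is used as one possible way to flatten the list column.

```r
library(jsonlite)

# Hypothetical JSON: the second record has two authors in its array
json_text <- '[
  {"title": "Sandstorm", "authors": ["James Rollins"]},
  {"title": "King Solomon\'s Mines", "authors": ["H. Rider Haggard", "Andrew Lang"]}
]'

books <- fromJSON(json_text)

# Arrays of unequal length cannot form an atomic column, so 'authors' is a list
authors_is_list <- is.list(books$authors)
print(authors_is_list)

# Collapse each author vector into one comma-separated string
books$authors <- vapply(books$authors, paste, character(1), collapse = ", ")
print(books$authors)
```

After the collapse, authors is an ordinary character column with one string per book, which matches how an HTML table cell would store the same information.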

Implementation of Data Import

The following code demonstrates how the HTML table and JSON file can be imported into R and converted into data frames for further analysis.

library(rvest)
library(jsonlite)
library(dplyr)
library(gt)
library(stringr)

The HTML file contains a table that lists the books and their attributes. Using the rvest package, the table can be extracted and converted into a data frame.

# HTML file
html_url <- "https://raw.githubusercontent.com/NafeesKhandker/HTML-and-JSON-Data/refs/heads/main/book.html"

books_html <- read_html(html_url) |>
  html_table()  # 'fill' is deprecated in rvest >= 1.0.0; missing cells are always filled with NA

books_html_df <- books_html[[1]]

books_html_df |>
  head() |>
  gt()

Title	Authors	Publication Year	Publisher	Genre
The Three Musketeers	Alexandre Dumas	1844	Baudry's European Library	Historical Adventure
Sandstorm	James Rollins	2004	William Morrow	Action Adventure / Thriller
King Solomon's Mines	H. Rider Haggard, Andrew Lang	1885	Cassell & Company	Adventure Fiction

The JSON file stores the same dataset but in a hierarchical format. The jsonlite package can be used to convert the JSON structure into an R data frame.

# JSON file
json_url <- "https://raw.githubusercontent.com/NafeesKhandker/HTML-and-JSON-Data/refs/heads/main/book.json"

books_json <- fromJSON(json_url)

books_json_df <- as.data.frame(books_json$books)

books_json_df |>
  head() |>
  gt()

title	authors	publication_year	publisher	genre
The Three Musketeers	Alexandre Dumas	1844	Baudry's European Library	Historical Adventure
Sandstorm	James Rollins	2004	William Morrow	Action Adventure / Thriller
King Solomon's Mines	H. Rider Haggard, Andrew Lang	1885	Cassell & Company	Adventure Fiction

# Convert to base data.frame (removes tibble class differences)
books_html_df <- as.data.frame(books_html_df)
books_json_df <- as.data.frame(books_json_df)

# Standardize column names to match the JSON field names
colnames(books_html_df) <- c("title", "authors", "publication_year", "publisher", "genre")

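# Convert the JSON authors list column to one comma-separated string per book
# (vapply guarantees a character vector with one element per row)
books_json_df$authors <- vapply(books_json_df$authors, paste, character(1), collapse = ", ")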

# Final comparison
identical(books_html_df, books_json_df)

[1] TRUE
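When identical() returns FALSE, it helps to find out which column or attribute differs. The sketch below uses two small hypothetical data frames (not the assignment files) whose year columns differ only in storage type, and contrasts identical() with the more tolerant all.equal() from base R.

```r
# Hypothetical data frames: same values, but the year is integer in one and double in the other
df_a <- data.frame(title = "Sandstorm", publication_year = 2004L)
df_b <- data.frame(title = "Sandstorm", publication_year = 2004)

# identical() is strict about storage types and attributes
strict_match <- identical(df_a, df_b)

# all.equal() compares numeric values within a tolerance and does not
# distinguish integer from double storage
loose_match <- isTRUE(all.equal(df_a, df_b))

# Column-by-column check to locate the mismatch
column_match <- mapply(identical, df_a, df_b)

# Storage type of each column in both data frames
type_table <- rbind(a = sapply(df_a, typeof), b = sapply(df_b, typeof))

print(strict_match)
print(loose_match)
print(column_match)
print(type_table)
```

A per-column check like this points to the column that needs a type conversion or cleanup before the strict comparison can succeed.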

Conclusion

The HTML and JSON datasets were imported into R and converted into data frames. Because the JSON file stored authors as an array while the HTML table stored them as a single string, the JSON author field was collapsed into a comma-separated string. After both objects were converted to base data frames and the column names were standardized, identical() returned TRUE, confirming that the two files contain the same data.

Reference

OpenAI. (2026, March 15). Conversation about comparing HTML and JSON datasets in R using identical() and all.equal(). ChatGPT. https://chat.openai.com/