Assignment HTML and JSON

Author

Khandker Qaiduzzaman

Objective

The objective of this assignment is to create HTML and JSON data files containing information about selected adventure books and import them into R for analysis.

Approach

For this assignment, I created a small dataset of three adventure books by authors whose works I enjoy reading: Alexandre Dumas, James Rollins, and H. Rider Haggard.

The selected books are:

  • The Three Musketeers
  • Sandstorm
  • King Solomon’s Mines

The dataset contains the following variables:

  • Title – Name of the book
  • Authors – Author or authors of the book
  • Publication Year – Year the book was originally published
  • Publisher – Publishing company responsible for the book
  • Genre – Literary genre of the book

The same dataset is stored in two different formats:

  1. HTML table – representing structured tabular data often found on websites.
  2. JSON file – representing hierarchical data commonly used by APIs.

Working with both formats will demonstrate how structured information from web sources can be imported and converted into R data frames.
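As a rough illustration of the two formats, the sketch below serializes a single record from the dataset both ways in R. The object names (book, book_json, book_html) and the exact markup are assumptions made for illustration; the actual book.html and book.json files may be laid out differently.

```r
library(jsonlite)

# One record from the dataset, with authors held as a character vector
book <- list(
  title = "King Solomon's Mines",
  authors = c("H. Rider Haggard", "Andrew Lang"),
  publication_year = 1885L,
  publisher = "Cassell & Company",
  genre = "Adventure Fiction"
)

# JSON: records nested under a top-level "books" key; authors stays an array
book_json <- toJSON(list(books = list(book)), auto_unbox = TRUE, pretty = TRUE)
cat(book_json, "\n")

# HTML: one <td> per variable, so authors must be flattened to a single string
cells <- c(book$title, paste(book$authors, collapse = ", "),
           book$publication_year, book$publisher, book$genre)
cells <- gsub("&", "&amp;", cells, fixed = TRUE)  # escape ampersands for HTML

book_html <- paste0(
  "<table>\n",
  "  <tr><th>Title</th><th>Authors</th><th>Publication Year</th>",
  "<th>Publisher</th><th>Genre</th></tr>\n",
  "  <tr>", paste0("<td>", cells, "</td>", collapse = ""), "</tr>\n",
  "</table>"
)
cat(book_html, "\n")
```

The JSON version keeps the two authors as separate array elements, while the HTML version has only one cell per variable. That difference is what has to be reconciled after import.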

Anticipated Challenges

One of the main challenges when working with web-based data formats is that they represent the same information in different structures. In HTML files, tabular data is stored inside table elements (table, tr, th, and td), and a package such as rvest is needed to extract it. JSON files, on the other hand, store information as nested objects and arrays, which are parsed with a package such as jsonlite. Another challenge is handling nested values such as lists of authors. In JSON, multiple authors may be stored as an array, which jsonlite typically imports as a list column. That column must be flattened, for example into a comma-separated string, before it can be compared with the HTML version in R.
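The nested-author problem can be sketched with a small inline JSON string. The structure below (a top-level "books" array with an "authors" array per record) is an assumption that mirrors how the data is accessed later in this report; json_text and books are illustrative names.

```r
library(jsonlite)

# Inline stand-in for book.json (structure assumed)
json_text <- r"({
  "books": [
    {"title": "Sandstorm", "authors": ["James Rollins"]},
    {"title": "King Solomon's Mines", "authors": ["H. Rider Haggard", "Andrew Lang"]}
  ]
})"

books <- fromJSON(json_text)$books

# Arrays of unequal length are kept as a list column
is.list(books$authors)   # TRUE

# Collapse each array into one comma-separated string
books$authors <- vapply(books$authors, paste, character(1), collapse = ", ")
books$authors
```

jsonlite's simplification rules can produce a different structure (such as a matrix) when every array has the same length, so it is worth inspecting the column with str() before collapsing it.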

Implementation of Data Import

The following code demonstrates how the HTML table and JSON file can be imported into R and converted into data frames for further analysis.

library(rvest)
library(jsonlite)
library(dplyr)

library(gt)
library(stringr)

The HTML file contains a table that lists the books and their attributes. Using the rvest package, the table can be extracted and converted into a data frame.

# HTML file
html_url <- "https://raw.githubusercontent.com/NafeesKhandker/HTML-and-JSON-Data/refs/heads/main/book.html"

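# html_table() on a whole document returns a list with one tibble per <table>;
# the fill argument is deprecated in rvest >= 1.0.0, so it is omitted here
books_html <- read_html(html_url) |>
  html_table()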

books_html_df <- books_html[[1]]

books_html_df |>
  head() |>
  gt()

Title                 Authors                        Publication Year  Publisher                  Genre
The Three Musketeers  Alexandre Dumas                1844              Baudry's European Library  Historical Adventure
Sandstorm             James Rollins                  2004              William Morrow             Action Adventure / Thriller
King Solomon's Mines  H. Rider Haggard, Andrew Lang  1885              Cassell & Company          Adventure Fiction
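To see what rvest does with the table without depending on the network, the extraction can be tried on a small inline HTML string. This is a minimal sketch: the markup in html_text is an assumed stand-in for book.html, and selecting the table with html_element() is an alternative to indexing the list with [[1]] as done above.

```r
library(rvest)

# Inline stand-in for book.html (markup assumed)
html_text <- '<table>
  <tr><th>Title</th><th>Publication Year</th></tr>
  <tr><td>The Three Musketeers</td><td>1844</td></tr>
  <tr><td>Sandstorm</td><td>2004</td></tr>
</table>'

page <- read_html(html_text)

# html_element() selects the first <table>; html_table() then returns one tibble
tbl <- page |>
  html_element("table") |>
  html_table()

names(tbl)                      # header cells become column names
class(tbl$`Publication Year`)   # numeric-looking cells are type-converted
```

Because the header text is used verbatim, the column names keep their spaces and capitalization, which is why they need to be standardized before the HTML and JSON data frames can be compared.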

The JSON file stores the same dataset but in a hierarchical format. The jsonlite package can be used to convert the JSON structure into an R data frame.

# JSON file
json_url <- "https://raw.githubusercontent.com/NafeesKhandker/HTML-and-JSON-Data/refs/heads/main/book.json"

books_json <- fromJSON(json_url)

books_json_df <- as.data.frame(books_json$books)

books_json_df |>
  head() |>
  gt()

title                 authors                        publication_year  publisher                  genre
The Three Musketeers  Alexandre Dumas                1844              Baudry's European Library  Historical Adventure
Sandstorm             James Rollins                  2004              William Morrow             Action Adventure / Thriller
King Solomon's Mines  H. Rider Haggard, Andrew Lang  1885              Cassell & Company          Adventure Fiction

# Convert to base data.frame (removes tibble class differences)
books_html_df <- as.data.frame(books_html_df)
books_json_df <- as.data.frame(books_json_df)

# Standardize column names
colnames(books_html_df) <- c("title","authors","publication_year","publisher","genre")

# Convert JSON authors list to string
books_json_df$authors <- vapply(books_json_df$authors, paste, character(1), collapse = ", ")

# Final comparison
identical(books_html_df, books_json_df)
[1] TRUE
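identical() is a strict check: it returns FALSE for any difference in type, names, or attributes, without saying where the difference is. When that happens, all.equal() and a per-column class comparison can help locate the mismatch. The sketch below uses two hypothetical data frames (a and b) that differ only in the storage type of one column; they are not the assignment data.

```r
# Two hypothetical data frames that differ only in column type
a <- data.frame(title = "Sandstorm", publication_year = 2004L)
b <- data.frame(title = "Sandstorm", publication_year = 2004)

identical(a, b)   # FALSE: integer vs. double
all.equal(a, b)   # TRUE: values are numerically equal within tolerance

# Compare column classes to locate the mismatch
sapply(a, class)
sapply(b, class)

# Aligning the type makes the strict check pass
b$publication_year <- as.integer(b$publication_year)
identical(a, b)   # TRUE
```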

Conclusion

The HTML and JSON datasets were imported into R and converted into data frames. Because the JSON file stored authors as an array while the HTML table stored them as a single string, the JSON author field was collapsed into a comma-separated string. After standardizing the column names and data frame structures, the identical() function confirmed that the two datasets were identical.

Reference

  1. OpenAI. (2026, March 15). Conversation about comparing HTML and JSON datasets in R using identical() and all.equal(). ChatGPT. https://chat.openai.com/