Assignment 7 – Working with HTML and JSON: Approach

Author

Muhammad Suffyan Khan

Published

March 12, 2026

Introduction

This assignment focuses on working with two common semi-structured data formats: HTML and JSON. The goal is to better understand how the same dataset can be represented in different file structures and then imported into R for downstream analysis.

For this exercise, I will manually create two files containing the same book dataset: an HTML file containing a table and a JSON file representing the same information as structured objects. I will then write R code to load both files into R data frames and compare them to determine whether they are identical.

Selected Dataset

For this assignment, I selected three books related to data science and machine learning, which aligns with the subject of this course.

The books included in the dataset are:

An Introduction to Statistical Learning – Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow – Aurélien Géron
Python for Data Analysis – Wes McKinney

The first title has four authors, so the dataset includes at least one book with multiple authors, which satisfies the assignment requirement.

Each book record will contain the following attributes:

Title
Authors
Publication Year
Publisher
Genre

Planned Approach

1. Manually Create the HTML File

I will create a file named books.html by hand. This file will contain an HTML table representing the book dataset.

The table will include:

A header row listing the attributes
One row for each book
Columns for title, authors, publication year, publisher, and genre

This step will help reinforce how tabular data is structured in HTML.

2. Manually Create the JSON File

Next, I will create a file named books.json manually. This file will represent the same book dataset in JSON format.

The JSON structure will consist of:

An array containing book objects
One object per book
Fields matching the attributes used in the HTML table

This demonstrates how the same dataset can be represented in a hierarchical structure.

3. Upload Files to GitHub

Both files (books.html and books.json) will be uploaded to a public GitHub repository. This allows the files to be accessed directly from the web.

Using GitHub raw file links ensures that the R code can load the files without relying on local file paths.
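
As a rough sketch of how the raw links might be assembled in R: the account name, repository name, and branch below are hypothetical placeholders, not the actual repository for this assignment. Only the file names books.html and books.json come from the plan above.

```r
# Hypothetical placeholders: replace with the real account, repository, and branch.
github_user <- "your-username"
github_repo <- "your-repo"
branch      <- "main"

# GitHub serves raw file contents from raw.githubusercontent.com.
raw_base <- paste0(
  "https://raw.githubusercontent.com/",
  github_user, "/", github_repo, "/", branch
)

# File names as described in steps 1 and 2.
html_url <- paste0(raw_base, "/books.html")
json_url <- paste0(raw_base, "/books.json")

print(html_url)
print(json_url)
```

Building both URLs from one base keeps the two paths consistent if the repository or branch name changes.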

4. Load the HTML Data into R

Using the rvest package, I will read the HTML table from the GitHub raw URL and convert it into an R data frame.

During this step I will verify:

The table structure
Column names
Data types
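
A minimal sketch of this step, assuming a simplified three-column table: the inline HTML string and its placeholder rows stand in for books.html so the example runs without a network connection. In the real document the argument to read_html() would be the GitHub raw URL instead.

```r
library(rvest)

# Placeholder table (hypothetical rows, fewer columns than the real file).
html_text <- '<table>
  <tr><th>Title</th><th>Authors</th><th>Year</th></tr>
  <tr><td>Example Book A</td><td>Author One, Author Two</td><td>2001</td></tr>
  <tr><td>Example Book B</td><td>Author Three</td><td>2002</td></tr>
</table>'

# read_html() accepts a URL, a file path, or a literal HTML string.
page <- read_html(html_text)

# Select the first <table> element and parse it; html_table() returns a tibble,
# so convert it to a plain data frame for later comparison.
books_html <- as.data.frame(html_table(html_element(page, "table")))

# Verification checks: structure, column names, and column types.
str(books_html)
print(names(books_html))
print(sapply(books_html, class))
```

Because html_table() converts column types by default, a column of whole-number years is expected to come back as integer rather than character.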

5. Load the JSON Data into R

Using the jsonlite package, I will read the JSON file from the GitHub raw URL and convert it into a second R data frame.

I will check that the imported data matches the structure and values from the HTML table.
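
A comparable sketch for the JSON side, again using an inline string with hypothetical placeholder records in place of books.json. It assumes the authors are stored as a single comma-separated string; if they were stored as a JSON array instead, jsonlite would return that field as a list column, which would not match the HTML version directly.

```r
library(jsonlite)

# Placeholder records (hypothetical values, fewer fields than the real file).
json_text <- '[
  {"Title": "Example Book A", "Authors": "Author One, Author Two", "Year": 2001},
  {"Title": "Example Book B", "Authors": "Author Three", "Year": 2002}
]'

# fromJSON() accepts a URL, a file path, or a JSON string; an array of
# objects with the same fields is simplified into a data frame.
books_json <- fromJSON(json_text)

# Same verification checks as for the HTML import.
str(books_json)
print(names(books_json))
print(sapply(books_json, class))
```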

6. Compare the Two Data Frames

After both datasets are loaded, I will compare them using:

identical() to test strict equality
all.equal() to test approximate equality and describe any differences it finds

This step will confirm whether both file formats produce identical data frames when imported into R.
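
The difference between the two checks can be sketched with two small hand-built data frames. These are hypothetical stand-ins for the imported data, constructed so that one column differs only in storage type (integer versus double), which is one kind of mismatch that could plausibly arise between two import paths.

```r
# Stand-in for the HTML import: Year stored as integer.
books_html <- data.frame(
  Title = c("Example Book A", "Example Book B"),
  Year  = c(2001L, 2002L),
  stringsAsFactors = FALSE
)

# Stand-in for the JSON import: same values, but Year stored as double.
books_json <- data.frame(
  Title = c("Example Book A", "Example Book B"),
  Year  = c(2001, 2002),
  stringsAsFactors = FALSE
)

# identical() is strict: a difference in storage type makes it FALSE.
strict_before <- identical(books_html, books_json)

# all.equal() compares numeric values within a tolerance and returns either
# TRUE or a character vector describing differences, so wrap it in isTRUE().
loose <- isTRUE(all.equal(books_html, books_json))

# Aligning the column type removes the only difference.
books_json$Year <- as.integer(books_json$Year)
strict_after <- identical(books_html, books_json)

print(c(strict_before = strict_before, loose = loose, strict_after = strict_after))
```

If identical() returns FALSE on the real data, the output of all.equal() is a reasonable first place to look for the cause before adjusting column names or types.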

R Packages to Be Used

The following packages will be used:

rvest – to read HTML tables
jsonlite – to read JSON data
dplyr – for light data manipulation and comparison

Reproducibility

To ensure the analysis is reproducible:

The HTML and JSON files will be manually created
Both files will be stored in a public GitHub repository
The Quarto document will load the files directly from GitHub raw URLs