Assignment 7 – Working with HTML and JSON: Approach

Author

Muhammad Suffyan Khan

Published

March 12, 2026

Introduction

This assignment focuses on working with two common semi-structured data formats: HTML and JSON. The goal is to better understand how the same dataset can be represented in different file structures and then imported into R for downstream analysis.

For this exercise, I will manually create two files containing the same book dataset: an HTML file containing a table and a JSON file representing the same information as structured objects. I will then write R code to load both files into R data frames and compare them to determine whether they are identical.


Selected Dataset

For this assignment, I selected three books on data science and machine learning, topics that align with the subject of this course.

The books included in the dataset are:

  • An Introduction to Statistical Learning – Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
  • Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow – Aurélien Géron
  • Python for Data Analysis – Wes McKinney

The first book has four authors, so the dataset includes at least one multi-author title, which satisfies the assignment requirement.

Each book record will contain the following attributes:

  • Title
  • Authors
  • Publication Year
  • Publisher
  • Genre

Planned Approach

1. Manually Create the HTML File

I will create a file named books.html by hand. This file will contain an HTML table representing the book dataset.

The table will include:

  • A header row listing the attributes
  • One row for each book
  • Columns for title, authors, publication year, publisher, and genre

This step will help reinforce how tabular data is structured in HTML.


2. Manually Create the JSON File

Next, I will create a file named books.json manually. This file will represent the same book dataset in JSON format.

The JSON structure will consist of:

  • An array containing book objects
  • One object per book
  • Fields matching the attributes used in the HTML table

This demonstrates how the same dataset can be represented in a hierarchical structure.


3. Upload Files to GitHub

Both files (books.html and books.json) will be uploaded to a public GitHub repository. This allows the files to be accessed directly from the web.

Using GitHub raw file links ensures that the R code can load the files without relying on local file paths.
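
As a minimal sketch of how the raw links could be assembled, the helper below builds a URL in the raw.githubusercontent.com layout. The function name github_raw_url and the user, repository, and branch values are hypothetical placeholders, not the actual repository for this assignment.

```r
# Hypothetical helper: builds a GitHub raw-file URL from its parts.
# All argument values used below are placeholders.
github_raw_url <- function(user, repo, branch, path) {
  sprintf("https://raw.githubusercontent.com/%s/%s/%s/%s",
          user, repo, branch, path)
}

html_url <- github_raw_url("your-username", "your-repo", "main", "books.html")
json_url <- github_raw_url("your-username", "your-repo", "main", "books.json")
```

Keeping the URL parts in one place means only the placeholder values need to change once the repository exists.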


4. Load the HTML Data into R

Using the rvest package, I will read the HTML table from the GitHub raw URL and convert it into an R data frame.

During this step I will verify:

  • The table structure
  • Column names
  • Data types
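
The import and the three checks above might look roughly like the sketch below. To keep it self-contained, the HTML is an inline string standing in for books.html; in the actual document, read_html() would be pointed at the GitHub raw URL instead. The year, publisher, and genre values are illustrative placeholders, and joining multiple authors with "; " in one cell is an assumption about how the table will be laid out.

```r
library(rvest)

# Inline stand-in for books.html (assumed layout; year, publisher,
# and genre values are illustrative placeholders).
html_text <- '<table>
  <tr><th>Title</th><th>Authors</th><th>Publication Year</th><th>Publisher</th><th>Genre</th></tr>
  <tr><td>An Introduction to Statistical Learning</td><td>Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani</td><td>2013</td><td>Springer</td><td>Statistics</td></tr>
  <tr><td>Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow</td><td>Aurélien Géron</td><td>2019</td><td>O\'Reilly Media</td><td>Machine Learning</td></tr>
  <tr><td>Python for Data Analysis</td><td>Wes McKinney</td><td>2012</td><td>O\'Reilly Media</td><td>Data Analysis</td></tr>
</table>'

# Parse the document, select the first table, and convert it to a data frame.
page <- read_html(html_text)
books_html <- page |> html_element("table") |> html_table()

# Verify table structure, column names, and data types.
dim(books_html)
names(books_html)
sapply(books_html, class)
```

By default html_table() tries to convert column types, so the year column should come back as a number rather than text; the sapply() call is there to confirm that.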

5. Load the JSON Data into R

Using the jsonlite package, I will read the JSON file from the GitHub raw URL and convert it into a second R data frame.

I will check that the imported data matches the structure and values from the HTML table.
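
A minimal sketch of this step follows, again using an inline string in place of books.json so it runs on its own; in the actual document, fromJSON() would receive the GitHub raw URL. The field names mirror the HTML table headers, the year, publisher, and genre values are illustrative placeholders, and storing authors as one "; "-separated string is an assumption.

```r
library(jsonlite)

# Inline stand-in for books.json (assumed layout; year, publisher,
# and genre values are illustrative placeholders).
json_text <- '[
  {"Title": "An Introduction to Statistical Learning",
   "Authors": "Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani",
   "Publication Year": 2013, "Publisher": "Springer", "Genre": "Statistics"},
  {"Title": "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow",
   "Authors": "Aurélien Géron",
   "Publication Year": 2019, "Publisher": "O\'Reilly Media", "Genre": "Machine Learning"},
  {"Title": "Python for Data Analysis",
   "Authors": "Wes McKinney",
   "Publication Year": 2012, "Publisher": "O\'Reilly Media", "Genre": "Data Analysis"}
]'

# Hand-written JSON is easy to get wrong, so check the syntax first.
json_ok <- isTRUE(validate(json_text))

# An array of flat objects is simplified into a data frame.
books_json <- fromJSON(json_text)

# Inspect structure and column types.
str(books_json)
```

Keeping authors as a single string is a deliberate choice in this sketch: if they were stored as a JSON array, fromJSON() would produce a list column, which would not line up with the plain text cell read from the HTML table.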


6. Compare the Two Data Frames

After both datasets are loaded, I will compare them using:

  • identical() to test strict equality
  • all.equal() to test near-equality and report any differences, such as mismatched column types or attributes

This step will confirm whether both file formats produce identical data frames when imported into R.
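
The difference between the two checks can be illustrated with a toy example. The two small data frames below are hypothetical stand-ins for the imported data, built so that they differ only in how the year is stored (integer versus double); the year values are placeholders.

```r
# Toy stand-ins for the two imported data frames (placeholder values).
books_a <- data.frame(
  Title = c("An Introduction to Statistical Learning", "Python for Data Analysis"),
  Year  = c(2013L, 2012L),          # stored as integer
  stringsAsFactors = FALSE
)
books_b <- data.frame(
  Title = c("An Introduction to Statistical Learning", "Python for Data Analysis"),
  Year  = c(2013, 2012),            # stored as double
  stringsAsFactors = FALSE
)

# Strict comparison: storage types must match exactly.
strict_before <- identical(books_a, books_b)

# Tolerant comparison: returns TRUE or a description of the differences.
loose_before <- all.equal(books_a, books_b)

# Align the column type, then repeat the strict comparison.
books_b$Year <- as.integer(books_b$Year)
strict_after <- identical(books_a, books_b)
```

A similar alignment step may be needed for the real data: html_table() returns a tibble while fromJSON() returns a base data frame, so converting both with as.data.frame() before calling identical() is one way to remove a class-only difference.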


R Packages to Be Used

The following packages will be used:

  • rvest – to read HTML tables
  • jsonlite – to read JSON data
  • dplyr – for light data manipulation and comparison

Reproducibility

To ensure the analysis is reproducible:

  • The HTML and JSON files will be created by hand and committed unchanged, so the inputs are fixed and inspectable
  • Both files will be stored in a public GitHub repository
  • The Quarto document will load the files directly from GitHub raw URLs