Assignment 7 – Working with HTML and JSON: Approach
Introduction
This assignment focuses on working with two common semi-structured data formats: HTML and JSON. The goal is to better understand how the same dataset can be represented in different file structures and then imported into R for downstream analysis.
For this exercise, I will manually create two files containing the same book dataset: an HTML file containing a table and a JSON file representing the same information as structured objects. I will then write R code to load both files into R data frames and compare them to determine whether they are identical.
Selected Dataset
For this assignment I selected three books related to data science and machine learning, which aligns with the subject of this course.
The books included in the dataset are:
- An Introduction to Statistical Learning – Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow – Aurélien Géron
- Python for Data Analysis – Wes McKinney
The first book has four authors, so the dataset includes at least one book with multiple authors, which satisfies the assignment requirement.
Each book record will contain the following attributes:
- Title
- Authors
- Publication Year
- Publisher
- Genre
Planned Approach
1. Manually Create the HTML File
I will create a file named books.html by hand. This file will contain an HTML table representing the book dataset.
The table will include:
- A header row listing the attributes
- One row for each book
- Columns for title, authors, publication year, publisher, and genre
This step will help reinforce how tabular data is structured in HTML.
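As a rough illustration of the layout described above, the sketch below holds a one-row version of the table as a string and parses it with rvest to check that the header cells become column names. The markup is an assumption about how books.html might look, not the final file, and the year, publisher, and genre values are illustrative placeholders.

```r
library(rvest)

# Illustrative markup for books.html; only one data row is shown.
# The year, publisher, and genre values are placeholders.
books_html_text <- '
<table>
  <tr>
    <th>Title</th><th>Authors</th><th>Publication Year</th>
    <th>Publisher</th><th>Genre</th>
  </tr>
  <tr>
    <td>An Introduction to Statistical Learning</td>
    <td>Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani</td>
    <td>2013</td><td>Springer</td><td>Statistics</td>
  </tr>
</table>'

# Wrap the fragment in a minimal page and extract the first <table>.
layout_check <- html_table(html_element(minimal_html(books_html_text), "table"))

# The <th> cells supply the column names.
names(layout_check)
```

Keeping all authors in a single cell, separated by commas, is one possible choice; it keeps the table rectangular at the cost of storing a list as text.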
2. Manually Create the JSON File
Next, I will create a file named books.json manually. This file will represent the same book dataset in JSON format.
The JSON structure will consist of:
- An array containing book objects
- One object per book
- Fields matching the attributes used in the HTML table
This demonstrates how the same dataset can be represented in a hierarchical structure.
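A minimal sketch of that structure, assuming the authors are stored as a nested array inside each book object (the plan above does not fix this detail). The JSON text is held in an R string and parsed with jsonlite without simplification, so the nesting stays visible. The year, publisher, and genre values are illustrative placeholders.

```r
library(jsonlite)

# Illustrative content for books.json: an array with one object per book.
# Storing Authors as an array is an assumption; values are placeholders.
books_json_text <- '[
  {
    "Title": "An Introduction to Statistical Learning",
    "Authors": ["Gareth James", "Daniela Witten", "Trevor Hastie", "Robert Tibshirani"],
    "Publication Year": 2013,
    "Publisher": "Springer",
    "Genre": "Statistics"
  },
  {
    "Title": "Python for Data Analysis",
    "Authors": ["Wes McKinney"],
    "Publication Year": 2012,
    "Publisher": "O\'Reilly Media",
    "Genre": "Programming"
  }
]'

# simplifyVector = FALSE keeps the result as nested lists.
books_nested <- fromJSON(books_json_text, simplifyVector = FALSE)

length(books_nested)               # number of book objects
names(books_nested[[1]])           # fields of the first book
length(books_nested[[1]]$Authors)  # authors nested under the first book
```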
3. Upload Files to GitHub
Both files (books.html and books.json) will be uploaded to a public GitHub repository. This allows the files to be accessed directly from the web.
Loading the files through their GitHub raw URLs means the R code does not depend on local file paths and can be run from any machine with internet access.
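A small sketch of how the raw URLs could be assembled. The user name, repository name, and branch below are hypothetical placeholders, and raw_url() is an illustrative helper rather than part of any package.

```r
# Hypothetical repository coordinates; replace with the real values.
gh_user   <- "your-username"
gh_repo   <- "your-repo"
gh_branch <- "main"

# Build a raw.githubusercontent.com link for a file at the repository root.
raw_url <- function(file) {
  paste("https://raw.githubusercontent.com",
        gh_user, gh_repo, gh_branch, file, sep = "/")
}

html_url <- raw_url("books.html")
json_url <- raw_url("books.json")
```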
4. Load the HTML Data into R
Using the rvest package, I will read the HTML table from the GitHub raw URL and convert it into an R data frame.
During this step I will verify:
- The table structure
- Column names
- Data types
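The loading and verification steps could look roughly like the sketch below. load_books_html() is a hypothetical helper name, and a temporary local file stands in for the GitHub raw URL so the example runs offline; in the actual document the URL would be passed instead. The table values are illustrative.

```r
library(rvest)

# Hypothetical helper: read the first <table> from a file path or URL
# and return it as a plain data frame.
load_books_html <- function(source) {
  page <- read_html(source)
  as.data.frame(html_table(html_element(page, "table")))
}

# Stand-in for the GitHub raw URL so the sketch runs offline.
# The year, publisher, and genre values are illustrative.
demo_file <- tempfile(fileext = ".html")
writeLines(c(
  "<table>",
  "<tr><th>Title</th><th>Authors</th><th>Publication Year</th>",
  "<th>Publisher</th><th>Genre</th></tr>",
  "<tr><td>An Introduction to Statistical Learning</td>",
  "<td>Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani</td>",
  "<td>2013</td><td>Springer</td><td>Statistics</td></tr>",
  "</table>"
), demo_file)

books_html <- load_books_html(demo_file)

str(books_html)            # table structure
names(books_html)          # column names
sapply(books_html, class)  # data types
```

By default html_table() attempts to convert column types, so a column of whole-number years is read as integer rather than text.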
5. Load the JSON Data into R
Using the jsonlite package, I will read the JSON file from the GitHub raw URL and convert it into a second R data frame.
I will check that the imported data matches the structure and values from the HTML table.
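A sketch of the JSON import, assuming the authors were stored as a JSON array. In that case fromJSON() returns a list column, which has to be collapsed to one string per book before it can match the HTML table. load_books_json() is a hypothetical helper name, and an inline JSON string stands in for the GitHub raw URL; the values are illustrative.

```r
library(jsonlite)

# Hypothetical helper: read an array of book objects from JSON text,
# a file path, or a URL, and return a data frame.
load_books_json <- function(source) {
  # simplifyMatrix = FALSE keeps Authors as a list column even when
  # every book happens to have the same number of authors.
  books <- fromJSON(source, simplifyMatrix = FALSE)
  # Collapse an Authors list column to one comma-separated string per book.
  if (is.list(books$Authors)) {
    books$Authors <- vapply(books$Authors, paste, character(1), collapse = ", ")
  }
  books
}

# Stand-in for the GitHub raw URL; values are illustrative.
demo_json <- '[
  {"Title": "An Introduction to Statistical Learning",
   "Authors": ["Gareth James", "Daniela Witten", "Trevor Hastie", "Robert Tibshirani"],
   "Publication Year": 2013},
  {"Title": "Python for Data Analysis",
   "Authors": ["Wes McKinney"],
   "Publication Year": 2012}
]'

books_json <- load_books_json(demo_json)

str(books_json)
```

If the authors are instead written as a single comma-separated string in the JSON file, the collapse step is skipped and the column is read as plain text.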
6. Compare the Two Data Frames
After both datasets are loaded, I will compare them using:
- identical() to test strict equality
- all.equal() to detect minor structural differences, if present
This step will determine whether both file formats produce identical data frames when imported into R.
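The difference between the two checks can be seen on a small stand-in example. The two data frames below are built by hand rather than imported, and they differ only in how the year is stored (integer versus double), which is one kind of mismatch that might appear between the two imports. The year value is illustrative.

```r
# Small stand-ins for the two imported data frames. The only difference is
# that the year is an integer in one and a double in the other.
books_html <- data.frame(
  Title = "Python for Data Analysis",
  Authors = "Wes McKinney",
  Year = 2012L,
  stringsAsFactors = FALSE
)
books_json <- books_html
books_json$Year <- as.numeric(books_json$Year)

# identical() requires the same types, attributes, and values.
strict_match <- identical(books_html, books_json)

# all.equal() returns TRUE or a description of the differences,
# so wrap it in isTRUE() before using it as a logical.
loose_result <- all.equal(books_html, books_json)
loose_match <- isTRUE(loose_result)

strict_match
loose_match
```

Because rvest returns a tibble and jsonlite returns a base data frame, converting both with as.data.frame() before the comparison avoids a mismatch caused only by the object class.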
R Packages to Be Used
The following packages will be used:
- rvest – to read HTML tables
- jsonlite – to read JSON data
- dplyr – for light data manipulation and comparison
Reproducibility
To ensure the analysis is reproducible:
- The HTML and JSON files will be manually created
- Both files will be stored in a public GitHub repository
- The Quarto document will load the files directly from GitHub raw URLs