Books: HTML & JSON

Author

Madina Kudanova

Introduction

This assignment explores how structured data can be represented and imported into R using two common data formats: HTML and JSON. The objective is to manually create small datasets in both formats and then load them into R for analysis.

Approach

The first step was to select a small dataset of three books on a common subject. The chosen subject is data science and data mining, two closely related fields.

The three selected books are Data Science from Scratch by Joel Grus; Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, and Vipin Kumar; and Data Smart by John W. Foreman. They were chosen because they represent different perspectives within the data science field: programming-based data science, machine learning and data mining techniques, and applied business analytics.

For each book, several attributes were recorded, including the title, authors, publication year, publisher, and genre. This information was then used to manually construct two data files: an HTML file containing a table with the book data and a JSON file representing the same information in structured format.

After creating these files, R was used to load the HTML and JSON data into separate data frames. Finally, the two data frames were compared to determine whether they contain identical information.
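The workflow described above can be tried offline with a small sketch. The markup and JSON layout below are assumptions, since the actual books.html and books.json are not reproduced in this report, and only two of the three records are included for brevity. The names html_lines, json_lines, sketch_html_df, and sketch_json_df are hypothetical.

```r
library(rvest)
library(jsonlite)

# Assumed layout of the hand-written HTML file: one header row, one row per book.
html_lines <- c(
  "<table>",
  "<tr><th>title</th><th>authors</th><th>publication_year</th><th>publisher</th><th>genre</th></tr>",
  "<tr><td>Introduction to Data Mining</td><td>Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, Vipin Kumar</td><td>2018</td><td>Pearson</td><td>Data Mining</td></tr>",
  "<tr><td>Data Smart</td><td>John W. Foreman</td><td>2013</td><td>Wiley</td><td>Business Analytics</td></tr>",
  "</table>"
)

# Assumed layout of the hand-written JSON file: an array with one object per book.
json_lines <- c(
  "[",
  '  {"title": "Introduction to Data Mining",',
  '   "authors": "Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, Vipin Kumar",',
  '   "publication_year": 2018, "publisher": "Pearson", "genre": "Data Mining"},',
  '  {"title": "Data Smart", "authors": "John W. Foreman",',
  '   "publication_year": 2013, "publisher": "Wiley", "genre": "Business Analytics"}',
  "]"
)

# Write both files to temporary paths so the sketch leaves nothing behind.
html_path <- tempfile(fileext = ".html")
json_path <- tempfile(fileext = ".json")
writeLines(html_lines, html_path)
writeLines(json_lines, json_path)

# Read each file back into a base data frame.
sketch_html_df <- as.data.frame(html_table(read_html(html_path))[[1]])
sketch_json_df <- fromJSON(json_path)

# Check whether the two imports agree.
all.equal(sketch_html_df, sketch_json_df)
```

Writing the publication year as a bare number in the JSON, rather than a quoted string, is a deliberate choice in this sketch: it lets both importers produce a numeric column instead of a character one.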

Code Base

library(rvest)
library(jsonlite)
library(knitr)

Read the HTML file and convert its first table into a data frame.

html_page <- read_html("https://raw.githubusercontent.com/MKudanova/Data607/refs/heads/main/HTML%26JSON/books.html")
# fill is deprecated in rvest 1.0.0 and later; missing cells are always filled with NA
books_html_df <- html_table(html_page)[[1]]

kable(books_html_df)

title	authors	publication_year	publisher	genre
Data Science from Scratch	Joel Grus	2019	O’Reilly Media	Data Science
Introduction to Data Mining	Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, Vipin Kumar	2018	Pearson	Data Mining
Data Smart	John W. Foreman	2013	Wiley	Business Analytics
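As a rough illustration of what the HTML import step does, the sketch below parses an inline table instead of the remote file. The markup is an assumed stand-in for books.html, the names page and books_tbl are hypothetical, and the behaviour described in the comments assumes rvest 1.0.0 or later.

```r
library(rvest)

# Inline stand-in for books.html (assumed markup, not the real file).
page <- read_html(
  "<table>
     <tr><th>title</th><th>publication_year</th></tr>
     <tr><td>Data Smart</td><td>2013</td></tr>
   </table>"
)

# Select the first <table> node with a CSS selector, then parse that node.
# Parsing a single node returns one table rather than a list of tables.
books_tbl <- html_table(html_element(page, "table"))

# html_table() returns a tibble, and convert = TRUE (the default)
# guesses column types from the cell text.
class(books_tbl)
sapply(books_tbl, class)
```

Selecting the node explicitly is an alternative to indexing the list with [[1]]; it can be easier to adapt when a page contains several tables.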

Read the JSON file and convert its structure into a data frame.

books_json_df <- fromJSON("https://raw.githubusercontent.com/MKudanova/Data607/refs/heads/main/HTML%26JSON/books.json")

kable(books_json_df)

title	authors	publication_year	publisher	genre
Data Science from Scratch	Joel Grus	2019	O’Reilly Media	Data Science
Introduction to Data Mining	Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, Vipin Kumar	2018	Pearson	Data Mining
Data Smart	John W. Foreman	2013	Wiley	Business Analytics
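The mapping from JSON to a data frame can be sketched with inline strings. Both JSON layouts below are assumptions: the first mirrors the flat table shown above, and the second is a hypothetical alternative in which authors are stored as an array. The names flat_json, flat_df, nested_json, and nested_df are illustrative only.

```r
library(jsonlite)

# Flat layout (assumed): an array of objects becomes one row per object
# and one column per key.
flat_json <- '[
  {"title": "Data Smart", "authors": "John W. Foreman", "publication_year": 2013},
  {"title": "Introduction to Data Mining",
   "authors": "Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, Vipin Kumar",
   "publication_year": 2018}
]'
flat_df <- fromJSON(flat_json)
str(flat_df)

# Hypothetical alternative: authors as a JSON array.
# fromJSON() then stores authors as a list column of character vectors.
nested_json <- '[
  {"title": "Data Smart", "authors": ["John W. Foreman"]},
  {"title": "Introduction to Data Mining",
   "authors": ["Pang-Ning Tan", "Michael Steinbach", "Anuj Karpatne", "Vipin Kumar"]}
]'
nested_df <- fromJSON(nested_json)
lengths(nested_df$authors)
```

Keeping authors as a single comma-separated string, as in the table above, keeps the JSON import directly comparable with the HTML table, which has no native way to hold a list inside a cell.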

Comparing the data frames

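# html_table() returns a tibble and fromJSON() a base data frame,
# so convert both to the same class before comparing
books_html_df <- as.data.frame(books_html_df)
books_json_df <- as.data.frame(books_json_df)

identical(books_html_df, books_json_df)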

[1] TRUE

all.equal(books_html_df, books_json_df)

[1] TRUE
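The two checks are not interchangeable: identical() requires an exact match, including storage types, while all.equal() tests for near equality and treats integer and double columns with the same values as equal. A minimal base R sketch, using hypothetical data frames year_int and year_dbl built from the publication years above:

```r
# Same values, different storage types: integer versus double.
year_int <- data.frame(publication_year = c(2019L, 2018L, 2013L))
year_dbl <- data.frame(publication_year = c(2019, 2018, 2013))

# identical() is strict and notices the type difference.
strict <- identical(year_int, year_dbl)

# all.equal() returns TRUE or a description of the differences,
# so wrap it in isTRUE() to get a plain logical.
lenient <- isTRUE(all.equal(year_int, year_dbl))

c(strict = strict, lenient = lenient)
```

Running both checks, as done above, therefore gives slightly more information than either alone: a TRUE from identical() implies the column types match as well as the values.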

Conclusion

Both the HTML and JSON files were successfully imported into R and converted into data frames. After converting them to the same class, identical() and all.equal() both returned TRUE, confirming that the two datasets contain identical information. This exercise demonstrates how data stored in different structured formats can be brought into R and compared directly.