library(rvest)
library(jsonlite)
library(knitr)
Books: HTML & JSON
Introduction
This assignment explores how structured data can be represented and imported into R using two common data formats: HTML and JSON. The objective is to manually create small datasets in both formats and then load them into R for analysis.
Approach
The first step in completing this assignment was selecting a small dataset consisting of three books on a common subject. The chosen subject is data science and data mining, which are closely related fields.
The three selected books are Data Science from Scratch by Joel Grus, Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, and Vipin Kumar, and Data Smart by John W. Foreman. These books were selected because they represent different perspectives within the data science field, including programming-based data science, machine learning and data mining techniques, and applied business analytics.
For each book, several attributes were recorded, including the title, authors, publication year, publisher, and genre. This information was then used to manually construct two data files: an HTML file containing a table with the book data and a JSON file representing the same information in structured format.
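As a rough illustration of the two hand-written formats, the sketch below writes a small HTML table and an equivalent JSON array to temporary files. The exact markup of the real books.html and books.json is not shown in this report, so the layout here (a single table with a header row, and a JSON array with one object per book) is an assumption, and only two of the three books are included to keep it short.

```r
# Hypothetical layout of the two files; the real files may differ.
html_text <- '<html><body>
<table>
  <tr><th>title</th><th>authors</th><th>publication_year</th><th>publisher</th><th>genre</th></tr>
  <tr><td>Introduction to Data Mining</td><td>Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, Vipin Kumar</td><td>2018</td><td>Pearson</td><td>Data Mining</td></tr>
  <tr><td>Data Smart</td><td>John W. Foreman</td><td>2013</td><td>Wiley</td><td>Business Analytics</td></tr>
</table>
</body></html>'

json_text <- '[
  {"title": "Introduction to Data Mining",
   "authors": "Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, Vipin Kumar",
   "publication_year": 2018, "publisher": "Pearson", "genre": "Data Mining"},
  {"title": "Data Smart", "authors": "John W. Foreman",
   "publication_year": 2013, "publisher": "Wiley", "genre": "Business Analytics"}
]'

# Write both to temporary files (stand-ins for books.html and books.json).
html_path <- tempfile(fileext = ".html")
json_path <- tempfile(fileext = ".json")
writeLines(html_text, html_path)
writeLines(json_text, json_path)
```

Storing the authors as one comma-separated string, rather than a JSON array, keeps each record flat, so each object maps onto a single data frame row with plain character columns.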
After creating these files, R was used to load the HTML and JSON data into separate data frames. Finally, the two data frames were compared to determine whether they contain identical information.
Code Base
Load, read and convert the table from the HTML file into a data frame.
html_page <- read_html("https://raw.githubusercontent.com/MKudanova/Data607/refs/heads/main/HTML%26JSON/books.html")
books_html_df <- html_table(html_page, fill = TRUE)[[1]]
kable(books_html_df)
| title | authors | publication_year | publisher | genre |
|---|---|---|---|---|
| Data Science from Scratch | Joel Grus | 2019 | O’Reilly Media | Data Science |
| Introduction to Data Mining | Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, Vipin Kumar | 2018 | Pearson | Data Mining |
| Data Smart | John W. Foreman | 2013 | Wiley | Business Analytics |
Load, read and convert the JSON structure into a data frame.
books_json_df <- fromJSON("https://raw.githubusercontent.com/MKudanova/Data607/refs/heads/main/HTML%26JSON/books.json")
kable(books_json_df)
| title | authors | publication_year | publisher | genre |
|---|---|---|---|---|
| Data Science from Scratch | Joel Grus | 2019 | O’Reilly Media | Data Science |
| Introduction to Data Mining | Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, Vipin Kumar | 2018 | Pearson | Data Mining |
| Data Smart | John W. Foreman | 2013 | Wiley | Business Analytics |
Comparing the data frames
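Before comparing, it can help to look at the class and column types that each import produces, since identical() is sensitive to both. The sketch below uses tiny inline stand-ins for the two files (one book, two columns) instead of the real URLs; the object names html_df and json_df are illustrative only.

```r
library(rvest)
library(jsonlite)

# Inline stand-in for books.html: read_html() also accepts a literal HTML string.
html_df <- html_table(read_html('<table>
  <tr><th>title</th><th>publication_year</th></tr>
  <tr><td>Data Smart</td><td>2013</td></tr>
</table>'))[[1]]

# Inline stand-in for books.json.
json_df <- fromJSON('[{"title": "Data Smart", "publication_year": 2013}]')

# Object classes: recent rvest versions return a tibble from html_table(),
# while fromJSON() returns a base data.frame.
class(html_df)
class(json_df)

# Column types for each import
sapply(html_df, class)
sapply(json_df, class)
```

If the classes differ, converting both objects with as.data.frame() puts them on the same footing before calling identical().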
books_html_df <- as.data.frame(books_html_df)
books_json_df <- as.data.frame(books_json_df)
identical(books_html_df, books_json_df)
[1] TRUE
all.equal(books_html_df, books_json_df)
[1] TRUE
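The two checks above are not interchangeable: identical() requires an exact match, including storage type, while all.equal() compares values within a small numeric tolerance. The following sketch shows the difference on made-up data frames (years_int, years_dbl and years_off are hypothetical names, not objects from this report).

```r
# Two data frames with the same values but different storage types.
years_int <- data.frame(publication_year = c(2019L, 2018L, 2013L))
years_dbl <- data.frame(publication_year = c(2019, 2018, 2013))

identical(years_int, years_dbl)          # FALSE: integer vs double
isTRUE(all.equal(years_int, years_dbl))  # TRUE: values match within tolerance

# all.equal() returns a character description when the objects differ,
# so wrap it in isTRUE() before using it in a condition.
years_off <- data.frame(publication_year = c(2019, 2018, 2014))
diffs <- all.equal(years_int, years_off)
is.character(diffs)                      # TRUE
```

Running both checks, as done in this report, therefore confirms that the values agree and that the two data frames also match in structure and column types.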
Conclusion
Both the HTML and JSON files were successfully imported into R and converted into data frames. After both objects were converted to the same class with as.data.frame(), identical() and all.equal() each returned TRUE, confirming that the two datasets contain the same information. This exercise demonstrates how the same data, stored in different structured formats, can be loaded into R and compared.