# Load rvest to read HTML and jsonlite to read JSON
library(rvest)
library(jsonlite)
Assignment 7: HTML and JSON
Approach
Three books on Middle Eastern/North African art:
Art Across the Arabian Gulf
Authors: Aram Alajaji, Cecelia Ruggeri, Nada Alaradi, Sultan Sood Al Qassemi, Abdulrahman Al-Soliman, Basma Alshathry, Shadin Albulaihed, Safiyah Abaalkhailis
Publication Year: 2025
Publisher: Kaph Books
ISBN: 9786148035999
Amazigh Arts in Morocco: Women Shaping Berber Identity
Author: Cynthia J. Becker
Publication Year: 2014
Publisher: University of Texas Press
ISBN: 9780292756199
Syria Speaks: Art and Culture from the Frontline
Authors: Malu Halasa, Zaher Omareen, Nawara Mahfoud
Publication Year: 2014
Publisher: Saqi Books
ISBN: 9780863567872
For each book, I will record the above information (title, author(s), publication year, publisher, and ISBN) in both an HTML table and a JSON file. I will then read the HTML and JSON files into two separate data frames in R and compare them with the base R function identical(). If it returns FALSE, I will investigate the structures of the data frames to find differences in elements such as data types or formatting.
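The approach can be tried offline with small inline stand-ins for the two files. This is only a sketch: the real books.html and books.json are not reproduced in this write-up, so the markup and field layout below are assumptions inferred from the description (capitalized HTML headers, a top-level "books" key, lowercase JSON keys, authors as a JSON array, ISBN as a JSON string). The names htmlText, jsonText, sketchHtmlDf, and sketchJsonDf are introduced here for illustration.

```r
library(rvest)
library(jsonlite)

# Assumed stand-in for books.html (two of the three books)
htmlText <- '<table>
  <tr><th>Title</th><th>Authors</th><th>Publication Year</th><th>Publisher</th><th>ISBN</th></tr>
  <tr><td>Amazigh Arts in Morocco: Women Shaping Berber Identity</td><td>Cynthia J. Becker</td><td>2014</td><td>University of Texas Press</td><td>9780292756199</td></tr>
  <tr><td>Syria Speaks: Art and Culture from the Frontline</td><td>Malu Halasa, Zaher Omareen, Nawara Mahfoud</td><td>2014</td><td>Saqi Books</td><td>9780863567872</td></tr>
</table>'

# Assumed stand-in for books.json: authors as arrays, ISBN as a string
jsonText <- '{"books": [
  {"title": "Amazigh Arts in Morocco: Women Shaping Berber Identity",
   "authors": ["Cynthia J. Becker"],
   "publication_year": 2014,
   "publisher": "University of Texas Press",
   "isbn": "9780292756199"},
  {"title": "Syria Speaks: Art and Culture from the Frontline",
   "authors": ["Malu Halasa", "Zaher Omareen", "Nawara Mahfoud"],
   "publication_year": 2014,
   "publisher": "Saqi Books",
   "isbn": "9780863567872"}
]}'

# read_html() accepts literal HTML; html_table() returns one table per <table>
sketchHtmlDf <- html_table(read_html(htmlText))[[1]]

# fromJSON() accepts a JSON string as well as a URL or file path
sketchJsonDf <- fromJSON(jsonText)$books

# Inspect the structure of each result before comparing them
str(sketchHtmlDf)
str(sketchJsonDf)
```

Calling str() on both results is a quick way to see column names and types side by side before running identical().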
Code Base
I created separate HTML and JSON files with the information on the three books above and uploaded them to my GitHub repository for this assignment.
#create data frame from HTML
htmlUrl <- "https://raw.githubusercontent.com/emilye5/607-assignment7/refs/heads/main/books.html"
htmlPage <- read_html(htmlUrl)
table <- html_table(htmlPage)
htmlDf <- table[[1]]
#create data frame from JSON
jsonUrl <- "https://raw.githubusercontent.com/emilye5/607-assignment7/refs/heads/main/books.json"
jsonDf <- fromJSON(jsonUrl)$books
Check to see if they are identical:
identical(htmlDf, jsonDf)
[1] FALSE
Since this returns FALSE, the data frames are not identical, and more investigation is needed to see where they differ.
all.equal(htmlDf, jsonDf)
[1] "Names: 5 string mismatches"
[2] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
[3] "Attributes: < Component \"class\": 1 string mismatch >"
[4] "Component 2: Modes: character, list"
[5] "Component 2: target is character, current is list"
[6] "Component 5: Modes: numeric, character"
[7] "Component 5: target is numeric, current is character"
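Because all.equal() refers to columns by position ("Component 2", "Component 5"), a side-by-side listing of column names and classes can make its messages easier to map back to the data. The helper compareColumns() below is a hypothetical name introduced for this sketch, and demoHtml and demoJson are small stand-ins that imitate the reported mismatches rather than the real htmlDf and jsonDf.

```r
# Hypothetical helper: tabulate names and first class of each column, by position
compareColumns <- function(a, b) {
  stopifnot(ncol(a) == ncol(b))
  firstClass <- function(df) {
    unname(vapply(df, function(col) class(col)[1], character(1)))
  }
  data.frame(
    position = seq_len(ncol(a)),
    nameA = names(a),
    classA = firstClass(a),
    nameB = names(b),
    classB = firstClass(b),
    stringsAsFactors = FALSE
  )
}

# Stand-in for the HTML result: authors as one string, ISBN as a number
demoHtml <- data.frame(Title = "A", Authors = "X, Y", ISBN = 9780863567872,
                       stringsAsFactors = FALSE)

# Stand-in for the JSON result: authors as a list column, ISBN as a string
demoJson <- data.frame(title = "A", isbn = "9780863567872",
                       stringsAsFactors = FALSE)
demoJson$authors <- list(c("X", "Y"))
demoJson <- demoJson[, c("title", "authors", "isbn")]

comparison <- compareColumns(demoHtml, demoJson)
print(comparison)
```

The same call, compareColumns(htmlDf, jsonDf), could be run on the real data frames, since both have five columns.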
Conclusion
Running all.equal() lists the specific differences between the data frame derived from the HTML table and the one derived from the JSON file. The first is that all five column names differ. I did this intentionally to see how the comparison would respond: in the HTML table the column names begin with an uppercase letter, while in the JSON file I left them all lowercase and used an underscore in publication_year instead of a space. The second difference is that the two data frames have different classes.
class(htmlDf)
[1] "tbl_df" "tbl" "data.frame"
class(jsonDf)
[1] "data.frame"
This shows that htmlDf is a tibble while jsonDf is a base R data frame. On researching why, I found that html_table() from rvest returns a tibble for compatibility with the tidyverse. (Source: https://rvest.tidyverse.org/articles/rvest.html) The third message goes hand in hand with this: it says that the strings naming the classes differ.
The fourth message is that the author column is a character vector in the HTML data frame but a list in the JSON data frame. This is because the JSON file stores each book's authors as an array, which fromJSON() reads into a list column. The fifth message restates this. The sixth message is that the ISBN column is numeric in the HTML data frame but a character string in the JSON data frame, and the seventh restates this.
To extend this work, one could transform the data frames so that they are identical by giving them the same column names and converting the mismatched columns to the same data types.
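One possible way to reconcile the two data frames is sketched below. It is not part of the original assignment, and it runs on inline stand-ins whose layout is assumed from the description above (the real files are not reproduced here). The design choice is to move both sides toward a plain data frame with lowercase names, one comma-separated author string per book, and ISBN as text; the names htmlSketch, jsonSketch, htmlFixed, and jsonFixed are introduced for illustration.

```r
library(rvest)
library(jsonlite)

# Assumed stand-ins for books.html and books.json (two books each)
htmlText <- '<table>
  <tr><th>Title</th><th>Authors</th><th>Publication Year</th><th>Publisher</th><th>ISBN</th></tr>
  <tr><td>Amazigh Arts in Morocco: Women Shaping Berber Identity</td><td>Cynthia J. Becker</td><td>2014</td><td>University of Texas Press</td><td>9780292756199</td></tr>
  <tr><td>Syria Speaks: Art and Culture from the Frontline</td><td>Malu Halasa, Zaher Omareen, Nawara Mahfoud</td><td>2014</td><td>Saqi Books</td><td>9780863567872</td></tr>
</table>'
jsonText <- '{"books": [
  {"title": "Amazigh Arts in Morocco: Women Shaping Berber Identity",
   "authors": ["Cynthia J. Becker"], "publication_year": 2014,
   "publisher": "University of Texas Press", "isbn": "9780292756199"},
  {"title": "Syria Speaks: Art and Culture from the Frontline",
   "authors": ["Malu Halasa", "Zaher Omareen", "Nawara Mahfoud"], "publication_year": 2014,
   "publisher": "Saqi Books", "isbn": "9780863567872"}
]}'

htmlSketch <- html_table(read_html(htmlText))[[1]]
jsonSketch <- fromJSON(jsonText)$books

# 1. Drop the tibble classes so both objects are plain data frames
htmlFixed <- as.data.frame(htmlSketch)

# 2. Reuse the JSON column names (assumes both files list columns in the same order)
names(htmlFixed) <- names(jsonSketch)

# 3. Turn the numeric ISBN into text, avoiding scientific notation
htmlFixed$isbn <- sprintf("%.0f", htmlFixed$isbn)

# 4. Collapse each book's author vector into one comma-separated string
jsonFixed <- jsonSketch
jsonFixed$authors <- vapply(jsonFixed$authors, paste, character(1), collapse = ", ")

# Re-run both comparisons on the harmonized data frames
print(all.equal(htmlFixed, jsonFixed))
print(identical(htmlFixed, jsonFixed))
```

An alternative for step 3 would be to read the table with html_table(convert = FALSE), which keeps every column as text, at the cost of having to convert the publication year back to a number afterwards.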