Assignment 7: Web Technologies

Authors

Desiree Thomas, Denise Atherley, Kiera Griffiths

Approach: HTML and JSON

For this assignment we have chosen three philosophy books. Since not all group members are familiar with JSON and HTML, we will create both files manually. The HTML file will use a table structure, and the JSON file will focus mostly on data types. Because one book lists multiple authors, the JSON file will nest each book's authors in an array. The files will be saved with the .html and .json extensions. To load the HTML file, we will load the rvest package, read the file with read_html(), and then extract the table components in the next step. For the JSON file, we will use the jsonlite library and read the file with fromJSON().

Anticipated challenges include structuring the files and learning HTML and JSON syntax, since both files are being written by hand. It is important that we follow the correct syntax consistently to avoid issues and bugs later on. The JSON and HTML files will also look different from each other when viewed. We will host our files on GitHub and use the raw file URLs for reproducibility.
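
To make the two structures concrete, here is a minimal sketch that parses a small HTML table and a small JSON array from inline strings. The sample strings and the names html_txt, json_txt, html_sample, and json_sample are hypothetical and are not the contents of our actual files; the sketch only assumes that rvest and jsonlite are installed.

```r
library(rvest)
library(jsonlite)

# Hypothetical two-book sample in HTML: authors are plain text in one cell
html_txt <- '<table>
  <tr><th>Title</th><th>Authors</th><th>Year</th></tr>
  <tr><td>Four Views on Free Will</td><td>John Martin Fischer, Robert Kane</td><td>2007</td></tr>
  <tr><td>Meditations</td><td>Marcus Aurelius</td><td>2018</td></tr>
</table>'

# Hypothetical two-book sample in JSON: authors are nested in an array
json_txt <- '[
  {"Title": "Four Views on Free Will",
   "Authors": ["John Martin Fischer", "Robert Kane"],
   "Year": 2007},
  {"Title": "Meditations",
   "Authors": ["Marcus Aurelius"],
   "Year": 2018}
]'

# minimal_html() wraps the fragment in a complete HTML document
html_sample <- html_table(html_element(minimal_html(html_txt), "table"))
json_sample <- fromJSON(json_txt)

class(html_sample$Authors)  # "character"
class(json_sample$Authors)  # "list"
```

The nested array is what makes the JSON version of the Authors column a list, which matters later when the two data frames are compared.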

Load the HTML data into an R data frame

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(rvest)


Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding

library(jsonlite)


Attaching package: 'jsonlite'

The following object is masked from 'package:purrr':

    flatten

library(knitr)

Load the HTML file into a data frame and display the table

raw_html <- read_html("https://raw.githubusercontent.com/desithomas/607-7/refs/heads/main/books.html")

# Extract the table and place it into a data frame
html_df <- raw_html %>%
  # Look for the table element; the selector must be passed as a string
  html_element("table") %>%
  html_table()

# Display the contents as a knitr table for visual comparison
kable(html_df, caption = "Book data from HTML file")

Book data from HTML file
Title	Authors	Year	Publisher	ISBN-10
Four Views on Free Will	John Martin Fischer, Robert Kane, Derk Pereboom, Manuel Vargas	2007	Blackwell Pub	1405134860
Meditations	Marcus Aurelius	2018	CreateSpace Independent Publishing Platform	1503280462
The Problems of Philosophy	Bertrand Russell	2013	Martino Fine Books	161427486X

Load the JSON file into a data frame and display the table

json_df <- fromJSON("https://raw.githubusercontent.com/desithomas/607-7/refs/heads/main/books.json")

# Display the table contents with knitr for visual comparison
kable(json_df, caption = "Book data from the JSON file")

Book data from the JSON file
Title	Authors	Year	Publisher	ISBN-10
Four Views on Free Will	John Martin Fischer, Robert Kane , Derk Pereboom , Manuel Vargas	2007	Blackwell Pub	1405134860
Meditations	Marcus Aurelius	2018	CreateSpace Independent Publishing Platform	1503280462
The Problems of Philosophy	Bertrand Russell	2013	Martino Fine Books	161427486X
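
The rendered tables look almost the same, but kable() does not show column types. As a minimal sketch, a small helper could list the class of each column side by side. The function compare_col_types() and the toy data frames html_like and json_like are hypothetical and only imitate the shape of our data; the helper assumes both data frames have the same column names.

```r
# Hypothetical helper: report the first class of every column in two data frames
compare_col_types <- function(a, b) {
  first_class <- function(col) class(col)[1]
  data.frame(
    column = names(a),
    a_type = vapply(a, first_class, character(1)),
    # b[names(a)] puts the columns of b in the same order as a
    b_type = vapply(b[names(a)], first_class, character(1)),
    row.names = NULL
  )
}

# Toy data shaped like the HTML result: character authors, integer year
html_like <- data.frame(
  Title   = c("A", "B"),
  Authors = c("X, Y", "Z"),
  Year    = c(2007L, 2018L)
)

# Toy data shaped like the JSON result: authors stored as a list column
json_like <- data.frame(
  Title = c("A", "B"),
  Year  = c("2007", "2018")
)
json_like$Authors <- list(c("X", "Y"), "Z")

types <- compare_col_types(html_like, json_like)
print(types)
```

Rows where a_type and b_type differ point to the columns that need to be converted before the two data frames can match.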

Are the two tables identical?

We use the identical() function to check whether the two objects are exactly the same, including their types and attributes. If they differ in any way, it returns FALSE.

# Checking if tables are identical with identical function below

identical(html_df, json_df)

[1] FALSE

We received FALSE, so the tables are not identical. We then used all.equal(), which describes the specific differences between the two objects instead of returning only TRUE or FALSE.

# Using this function to identify the specific differences between the tables

all.equal(html_df, json_df)

[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
[3] "Component \"Authors\": Modes: character, list"                                         
[4] "Component \"Authors\": target is character, current is list"
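
As a note on reading these results, identical() always returns a single TRUE or FALSE, whereas all.equal() returns either TRUE or a character vector describing the differences. The minimal sketch below uses two hypothetical vectors, chr_col and list_col, that are not taken from our data. It shows why all.equal() is usually wrapped in isTRUE() when a plain logical value is needed.

```r
# Hypothetical columns holding the same names in two different types
chr_col  <- c("Marcus Aurelius", "Bertrand Russell")
list_col <- list("Marcus Aurelius", "Bertrand Russell")

# identical() is strict: a character vector is never identical to a list
same_strict <- identical(chr_col, list_col)

# all.equal() does not return FALSE here; it returns text describing the mismatch
diffs <- all.equal(chr_col, list_col)

# isTRUE() collapses the result to a single logical value
same_loose <- isTRUE(diffs)

print(same_strict)
print(diffs)
print(same_loose)
```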

Neither check passed: identical() returned FALSE, and all.equal() returned a list of differences instead of TRUE. all.equal() initially reported 6 differences, each with a brief description of the cause. Some were caused by mistakes made while manually creating the .html and .json files, which were easily rectified. The others arose because the Authors column from the .json file is read as a list, whereas the .html file yields a character column. The following code was used to make the necessary changes.

Data Transformation

After running the initial checks on whether the tables were identical, we worked to resolve the 6 issues found. We applied the mutate function to flatten the list of authors in the JSON data and standardized the year in the HTML data. We also resolved errors that occurred while we were manually creating the HTML and JSON files, such as accidentally entering too many spaces, and fixed a mistake in our mutate function, where we had initially not matched the case of the column name (year instead of Year).

# Standardize JSON: Flatten the author list and ensure year is character
json_df_final <- json_df %>%
  mutate(
    # Flattens the list of authors into a single comma-separated string 
    Authors = map_chr(Authors, ~ paste(.x, collapse = ", ")), 
    # Ensures the year column is a character type 
    Year = as.character(Year)
  )

# Standardize HTML: Ensure year is character to match JSON 
html_df_final <- html_df %>%
  mutate(
    Year = as.character(Year)
  )

# Remove any metadata/attributes that differ (like row names or class types)
# Converting to tibble ensures both have the same base class
json_df_final <- as_tibble(json_df_final)
html_df_final <- as_tibble(html_df_final)



# Displaying both to confirm visual parity 
kable(html_df_final, caption = "Cleaned HTML Table")

Cleaned HTML Table
Title	Authors	Year	Publisher	ISBN-10
Four Views on Free Will	John Martin Fischer, Robert Kane, Derk Pereboom, Manuel Vargas	2007	Blackwell Pub	1405134860
Meditations	Marcus Aurelius	2018	CreateSpace Independent Publishing Platform	1503280462
The Problems of Philosophy	Bertrand Russell	2013	Martino Fine Books	161427486X

kable(json_df_final, caption = "Cleaned JSON Table")

Cleaned JSON Table
Title	Authors	Year	Publisher	ISBN-10
Four Views on Free Will	John Martin Fischer, Robert Kane, Derk Pereboom, Manuel Vargas	2007	Blackwell Pub	1405134860
Meditations	Marcus Aurelius	2018	CreateSpace Independent Publishing Platform	1503280462
The Problems of Philosophy	Bertrand Russell	2013	Martino Fine Books	161427486X

# Checks whether the tables are identical 
all.equal(html_df_final, json_df_final)

[1] TRUE

identical(html_df_final, json_df_final)

[1] TRUE
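
The flattening step used in the transformation can also be examined in isolation. The sketch below runs the same map_chr() and paste() pattern on a hypothetical list named authors, and adds a base R equivalent for comparison. The trimws() call is an addition that is not in the code above; it is included only to show one way that stray spaces around names could be removed.

```r
library(purrr)

# Hypothetical list column shaped like the Authors column read from JSON;
# the trailing space after "Robert Kane" is deliberate
authors <- list(
  c("John Martin Fischer", "Robert Kane "),
  "Marcus Aurelius"
)

# Trim each name, then join the names of one book into a single string
authors_flat <- map_chr(authors, ~ paste(trimws(.x), collapse = ", "))

# Base R equivalent without purrr
authors_base <- vapply(
  authors,
  function(x) paste(trimws(x), collapse = ", "),
  character(1)
)

print(authors_flat)
print(identical(authors_flat, authors_base))
```

Each list element becomes one string, so the result is a character vector with one entry per book, which is the same type that html_table() produces for the Authors column.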

Conclusion

Both checks now return TRUE, so the tables are identical. For the data transformation of the .json and .html data, we used an LLM (Gemini) to provide a reference and snippets of code to make the tables identical.

Citation

Google DeepMind. (2026). Gemini Pro 3.1 [Large language model]. https://gemini.google.com. Accessed March 14, 2026.