Introduction

This document demonstrates how to read and compare data from HTML, XML, and JSON files in R.
The dataset contains information about three fiction books.

Load Required Libraries

library(XML)      # For XML processing
library(jsonlite) # For JSON processing
library(rvest)    # For HTML scraping
library(dplyr)    # For data manipulation

Load Files from GitHub

html_file <- "https://raw.githubusercontent.com/sheriannmclarty/Spring-2025-Data-607/Assignment-7/black_author_reads.html"
xml_file <- "https://raw.githubusercontent.com/sheriannmclarty/Spring-2025-Data-607/Assignment-7/black_author_reads.xml"
json_file <- "https://raw.githubusercontent.com/sheriannmclarty/Spring-2025-Data-607/Assignment-7/black_author_reads.json"

Read HTML (From GitHub Hosting)

html_text <- readLines(html_file, warn = FALSE) # Read as text
html_data <- read_html(paste(html_text, collapse = "\n")) # Convert to HTML object
# Extract the First Table from the HTML File
html_table <- html_data %>% html_element("table") %>% html_table(fill = TRUE)

# Convert Table to Data Frame
black_author_reads_html <- as.data.frame(html_table)

# Print the Extracted Data
print(black_author_reads_html)
##                        Title       Authors                       Genre
## 1             The Bluest Eye Toni Morrison            Literary Fiction
## 2          Things Fall Apart Chinua Achebe Historical Fiction, Tragedy
## 3 Children of Blood and Bone  Tomi Adeyemi         Young Adult Fantasy
##   Published Year
## 1           1970
## 2           1958
## 3           2018
##                                                     Notable Awards
## 1                                      None (But highly acclaimed)
## 2 None (But widely regarded as one of the greatest African novels)
## 3   Andre Norton Award for Young Adult Science Fiction and Fantasy
##                                                                                                                                                                                                                                  Summary
## 1                          A powerful novel about a young Black girl named Pecola Breedlove, who struggles with self-worth in a society that idolizes whiteness. Morrison’s debut novel explores themes of beauty, racism, and identity.
## 2 The tragic story of Okonkwo, a respected leader in an Igbo village, whose life is upended by colonialism and internal struggles. Achebe’s masterpiece captures the clash between traditional African culture and European imperialism.
## 3                                               A gripping fantasy novel inspired by West African mythology, where Zélie, a young divîner, fights to restore magic to her people while battling oppressive forces in the land of Orïsha.
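
The same extraction can be tried without a network connection. The sketch below is only an illustration: the inline HTML string is a hypothetical stand-in for the hosted file, and it assumes rvest 1.0 or later, where html_element() is available.

```r
library(rvest)

# Hypothetical inline HTML standing in for the hosted file (not the real data)
html_string <- '<html><body>
<table>
  <tr><th>Title</th><th>Published Year</th></tr>
  <tr><td>The Bluest Eye</td><td>1970</td></tr>
  <tr><td>Things Fall Apart</td><td>1958</td></tr>
</table>
</body></html>'

# read_html() accepts literal markup as well as a path or URL
page <- read_html(html_string)

# Take the first <table> and convert it to a plain data frame
books <- as.data.frame(html_table(html_element(page, "table")))

# Header cells become column names, spaces included
print(names(books))
print(books)
```

Because the header text is used verbatim, HTML column names such as "Published Year" will generally not match the tag or key names used in XML and JSON files.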

Read XML (From GitHub Hosting)

xml_text <- readLines(xml_file, warn = FALSE)
xml_data <- xmlParse(paste(xml_text, collapse = "\n"))
black_author_reads_xml <- xmlToDataFrame(xml_data)
# Print to confirm extraction
print(black_author_reads_xml)
##                        title       authors                       genre
## 1             The Bluest Eye Toni Morrison            Literary Fiction
## 2          Things Fall Apart Chinua Achebe Historical Fiction, Tragedy
## 3 Children of Blood and Bone  Tomi Adeyemi         Young Adult Fantasy
##   publishedYear
## 1          1970
## 2          1958
## 3          2018
##                                                             awards
## 1                                      None (But highly acclaimed)
## 2 None (But widely regarded as one of the greatest African novels)
## 3   Andre Norton Award for Young Adult Science Fiction and Fantasy
##                                                                                                                                                                                                                                  summary
## 1                          A powerful novel about a young Black girl named Pecola Breedlove, who struggles with self-worth in a society that idolizes whiteness. Morrison’s debut novel explores themes of beauty, racism, and identity.
## 2 The tragic story of Okonkwo, a respected leader in an Igbo village, whose life is upended by colonialism and internal struggles. Achebe’s masterpiece captures the clash between traditional African culture and European imperialism.
## 3                                               A gripping fantasy novel inspired by West African mythology, where Zélie, a young divîner, fights to restore magic to her people while battling oppressive forces in the land of Orïsha.
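
As a small offline illustration of the same idea, the sketch below parses a hypothetical inline XML string (not the hosted file) with xmlParse() and converts it with xmlToDataFrame(). The element names are assumptions chosen to mirror the output above.

```r
library(XML)

# Hypothetical inline XML standing in for the hosted file (not the real data)
xml_string <- '<books>
  <book><title>The Bluest Eye</title><publishedYear>1970</publishedYear></book>
  <book><title>Things Fall Apart</title><publishedYear>1958</publishedYear></book>
</books>'

# asText = TRUE tells xmlParse() the input is markup rather than a file name
doc <- xmlParse(xml_string, asText = TRUE)

# Each child of the root becomes a row; each grandchild tag becomes a column
books <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Values arrive as text, so the year is a character column here
str(books)
```

Since every value is read as text, a numeric-looking column such as publishedYear needs an explicit conversion before it can match a numeric column from another source.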

Read JSON (From GitHub Hosting)

json_data <- fromJSON(json_file)
# Extract the books data from JSON
black_author_reads_json <- as.data.frame(json_data$books)
# Print to confirm extraction
print(black_author_reads_json)
##                        title       authors                       genre
## 1             The Bluest Eye Toni Morrison            Literary Fiction
## 2          Things Fall Apart Chinua Achebe Historical Fiction, Tragedy
## 3 Children of Blood and Bone  Tomi Adeyemi         Young Adult Fantasy
##   publishedYear
## 1          1970
## 2          1958
## 3          2018
##                                                             awards
## 1                                      None (But highly acclaimed)
## 2 None (But widely regarded as one of the greatest African novels)
## 3   Andre Norton Award for Young Adult Science Fiction and Fantasy
##                                                                                                                                                                                                                                  summary
## 1                          A powerful novel about a young Black girl named Pecola Breedlove, who struggles with self-worth in a society that idolizes whiteness. Morrison’s debut novel explores themes of beauty, racism, and identity.
## 2 The tragic story of Okonkwo, a respected leader in an Igbo village, whose life is upended by colonialism and internal struggles. Achebe’s masterpiece captures the clash between traditional African culture and European imperialism.
## 3                                               A gripping fantasy novel inspired by West African mythology, where Zélie, a young divîner, fights to restore magic to her people while battling oppressive forces in the land of Orïsha.
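
To see how fromJSON() shapes its result, the sketch below parses a hypothetical inline JSON string instead of the hosted file. The "books" key mirrors the one used above; the two records are illustrative only.

```r
library(jsonlite)

# Hypothetical inline JSON standing in for the hosted file (not the real data)
json_string <- '{"books": [
  {"title": "The Bluest Eye", "publishedYear": 1970},
  {"title": "Things Fall Apart", "publishedYear": 1958}
]}'

# An array of records is simplified to a data frame by default
parsed <- fromJSON(json_string)
books <- parsed$books

# Unquoted JSON numbers stay numeric, unlike the text values read from XML
str(books)
```

This difference in column types is one reason a direct comparison against the XML data frame can fail before any cleaning.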

Compare Data Frames

# Convert columns to character type for accurate comparison
black_author_html <- black_author_reads_html %>% mutate(across(everything(), as.character))
black_author_xml <- black_author_reads_xml %>% mutate(across(everything(), as.character))
black_author_json <- black_author_reads_json %>% mutate(across(everything(), as.character))

# Check if all data frames are identical
identical_html_xml <- identical(black_author_html, black_author_xml)
identical_html_json <- identical(black_author_html, black_author_json)
identical_xml_json <- identical(black_author_xml, black_author_json)

# Print results
cat("Are HTML and XML data frames identical? ", identical_html_xml, "\n")
## Are HTML and XML data frames identical?  FALSE
cat("Are HTML and JSON data frames identical? ", identical_html_json, "\n")
## Are HTML and JSON data frames identical?  FALSE
cat("Are XML and JSON data frames identical? ", identical_xml_json, "\n")
## Are XML and JSON data frames identical?  FALSE
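
identical() only reports TRUE or FALSE, so it does not say what differs. One possible way to get a description of the differences is all.equal(), shown here on two small hypothetical data frames that imitate the HTML and XML results (different column names and a different type for the year).

```r
# Hypothetical one-row data frames imitating the HTML and XML results
html_like <- data.frame(
  Title = "The Bluest Eye",
  `Published Year` = 1970L,
  check.names = FALSE,
  stringsAsFactors = FALSE
)
xml_like <- data.frame(
  title = "The Bluest Eye",
  publishedYear = "1970",
  stringsAsFactors = FALSE
)

# identical() is strict about names and types, not just the visible values
same <- identical(html_like, xml_like)
print(same)

# all.equal() returns TRUE or a character vector describing the differences
diffs <- all.equal(html_like, xml_like)
print(diffs)
```

Wrapping the call as isTRUE(all.equal(x, y)) gives a logical result when only a yes/no answer is needed.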

Standardize the Data Frames So All Three Match

## Compare Data Frames with Adjustments

# Ensure consistent column names
colnames(black_author_reads_html) <- colnames(black_author_reads_xml) <- colnames(black_author_reads_json)

# Trim whitespace from all values
black_author_reads_html <- black_author_reads_html %>% mutate(across(everything(), ~ trimws(as.character(.))))
black_author_reads_xml <- black_author_reads_xml %>% mutate(across(everything(), ~ trimws(as.character(.))))
black_author_reads_json <- black_author_reads_json %>% mutate(across(everything(), ~ trimws(as.character(.))))

# Arrange rows to ensure order doesn't affect comparison
black_author_reads_html <- black_author_reads_html %>% arrange_all()
black_author_reads_xml <- black_author_reads_xml %>% arrange_all()
black_author_reads_json <- black_author_reads_json %>% arrange_all()

# Recheck if data frames are now identical
identical_html_xml <- identical(black_author_reads_html, black_author_reads_xml)
identical_html_json <- identical(black_author_reads_html, black_author_reads_json)
identical_xml_json <- identical(black_author_reads_xml, black_author_reads_json)

# Print results
cat("Are HTML and XML data frames identical? ", identical_html_xml, "\n")
## Are HTML and XML data frames identical?  TRUE
cat("Are HTML and JSON data frames identical? ", identical_html_json, "\n")
## Are HTML and JSON data frames identical?  FALSE
cat("Are XML and JSON data frames identical? ", identical_xml_json, "\n")
## Are XML and JSON data frames identical?  FALSE

Inspect the Structure to Find the Difference Between the Three Data Frames

str(black_author_reads_html)
## 'data.frame':    3 obs. of  6 variables:
##  $ title        : chr  "Children of Blood and Bone" "The Bluest Eye" "Things Fall Apart"
##  $ authors      : chr  "Tomi Adeyemi" "Toni Morrison" "Chinua Achebe"
##  $ genre        : chr  "Young Adult Fantasy" "Literary Fiction" "Historical Fiction, Tragedy"
##  $ publishedYear: chr  "2018" "1970" "1958"
##  $ awards       : chr  "Andre Norton Award for Young Adult Science Fiction and Fantasy" "None (But highly acclaimed)" "None (But widely regarded as one of the greatest African novels)"
##  $ summary      : chr  "A gripping fantasy novel inspired by West African mythology, where Zélie, a young divîner, fights to restore ma"| __truncated__ "A powerful novel about a young Black girl named Pecola Breedlove, who struggles with self-worth in a society th"| __truncated__ "The tragic story of Okonkwo, a respected leader in an Igbo village, whose life is upended by colonialism and in"| __truncated__
str(black_author_reads_xml)
## 'data.frame':    3 obs. of  6 variables:
##  $ title        : chr  "Children of Blood and Bone" "The Bluest Eye" "Things Fall Apart"
##  $ authors      : chr  "Tomi Adeyemi" "Toni Morrison" "Chinua Achebe"
##  $ genre        : chr  "Young Adult Fantasy" "Literary Fiction" "Historical Fiction, Tragedy"
##  $ publishedYear: chr  "2018" "1970" "1958"
##  $ awards       : chr  "Andre Norton Award for Young Adult Science Fiction and Fantasy" "None (But highly acclaimed)" "None (But widely regarded as one of the greatest African novels)"
##  $ summary      : chr  "A gripping fantasy novel inspired by West African mythology, where Zélie, a young divîner, fights to restore ma"| __truncated__ "A powerful novel about a young Black girl named Pecola Breedlove, who struggles with self-worth in a society th"| __truncated__ "The tragic story of Okonkwo, a respected leader in an Igbo village, whose life is upended by colonialism and in"| __truncated__
str(black_author_reads_json)
## 'data.frame':    3 obs. of  6 variables:
##  $ title        : chr  "Children of Blood and Bone" "The Bluest Eye" "Things Fall Apart"
##  $ authors      : chr  "Tomi Adeyemi" "Toni Morrison" "Chinua Achebe"
##  $ genre        : chr  "Young Adult Fantasy" "Literary Fiction" "c(\"Historical Fiction\", \"Tragedy\")"
##  $ publishedYear: chr  "2018" "1970" "1958"
##  $ awards       : chr  "Andre Norton Award for Young Adult Science Fiction and Fantasy" "None (But highly acclaimed)" "None (But widely regarded as one of the greatest African novels)"
##  $ summary      : chr  "A gripping fantasy novel inspired by West African mythology, where Zélie, a young divîner, fights to restore ma"| __truncated__ "A powerful novel about a young Black girl named Pecola Breedlove, who struggles with self-worth in a society th"| __truncated__ "The tragic story of Okonkwo, a respected leader in an Igbo village, whose life is upended by colonialism and in"| __truncated__
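
Reading three str() listings side by side works for a small table, but the differing column can also be located programmatically. The sketch below is a minimal illustration on two hypothetical data frames that imitate the XML and JSON results above; it compares them column by column with mapply().

```r
# Hypothetical data frames imitating the XML and JSON results above
xml_like <- data.frame(
  title = c("The Bluest Eye", "Things Fall Apart"),
  genre = c("Literary Fiction", "Historical Fiction, Tragedy"),
  stringsAsFactors = FALSE
)
json_like <- data.frame(
  title = c("The Bluest Eye", "Things Fall Apart"),
  genre = c("Literary Fiction", "c(\"Historical Fiction\", \"Tragedy\")"),
  stringsAsFactors = FALSE
)

# A data frame is a list of columns, so mapply() pairs the columns up in order
same_column <- mapply(identical, xml_like, json_like)
print(same_column)

# Names of the columns that do not match
mismatched <- names(same_column)[!same_column]
print(mismatched)
```

This assumes both data frames already have the same columns in the same order, which is the case here after the column names were aligned.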

The JSON format stores multiple authors and genres as arrays, so flatten those JSON columns to match the HTML and XML data

## Read Data from JSON (Fix Nested Lists)

# Reload the JSON file from the GitHub URL defined earlier in json_file
json_data <- fromJSON(json_file)

# Convert JSON list to dataframe
black_author_reads_json <- json_data$books

# Flatten authors and genres to match HTML/XML formats
black_author_reads_json$authors <- sapply(black_author_reads_json$authors, paste, collapse = ", ")
black_author_reads_json$genre <- sapply(black_author_reads_json$genre, paste, collapse = ", ")

# Convert to dataframe
black_author_reads_json <- as.data.frame(black_author_reads_json)

# Display JSON data after correction
print(black_author_reads_json)
##                        title       authors                       genre
## 1             The Bluest Eye Toni Morrison            Literary Fiction
## 2          Things Fall Apart Chinua Achebe Historical Fiction, Tragedy
## 3 Children of Blood and Bone  Tomi Adeyemi         Young Adult Fantasy
##   publishedYear
## 1          1970
## 2          1958
## 3          2018
##                                                             awards
## 1                                      None (But highly acclaimed)
## 2 None (But widely regarded as one of the greatest African novels)
## 3   Andre Norton Award for Young Adult Science Fiction and Fantasy
##                                                                                                                                                                                                                                  summary
## 1                          A powerful novel about a young Black girl named Pecola Breedlove, who struggles with self-worth in a society that idolizes whiteness. Morrison’s debut novel explores themes of beauty, racism, and identity.
## 2 The tragic story of Okonkwo, a respected leader in an Igbo village, whose life is upended by colonialism and internal struggles. Achebe’s masterpiece captures the clash between traditional African culture and European imperialism.
## 3                                               A gripping fantasy novel inspired by West African mythology, where Zélie, a young divîner, fights to restore magic to her people while battling oppressive forces in the land of Orïsha.
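
The flattening step can be illustrated in isolation. In the sketch below, the inline JSON string is a hypothetical stand-in for the hosted file; because the genre arrays have different lengths, jsonlite is expected to return them as a list column, which vapply() then collapses into one string per row.

```r
library(jsonlite)

# Hypothetical inline JSON with genre stored as arrays (not the real data)
json_string <- '{"books": [
  {"title": "The Bluest Eye", "genre": ["Literary Fiction"]},
  {"title": "Things Fall Apart", "genre": ["Historical Fiction", "Tragedy"]}
]}'

books <- fromJSON(json_string)$books

# Arrays of unequal length cannot become a plain vector, so genre is a list
print(is.list(books$genre))

# Collapse each element to a single comma-separated string
books$genre <- vapply(books$genre, paste, character(1), collapse = ", ")

print(books)
```

vapply() is used here instead of sapply() because it fails loudly if any element does not collapse to exactly one string; this is a design preference, not a requirement.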

Recheck Whether All Three Data Frames Are Identical

# Check if all data frames are identical
identical_html_xml <- identical(black_author_reads_html, black_author_reads_xml)
identical_html_json <- identical(black_author_reads_html, black_author_reads_json)
identical_xml_json <- identical(black_author_reads_xml, black_author_reads_json)

# Print results
cat("Are HTML and XML data frames identical? ", identical_html_xml, "\n")
## Are HTML and XML data frames identical?  TRUE
cat("Are HTML and JSON data frames identical? ", identical_html_json, "\n")
## Are HTML and JSON data frames identical?  FALSE
cat("Are XML and JSON data frames identical? ", identical_xml_json, "\n")
## Are XML and JSON data frames identical?  FALSE

Clean and Sort the JSON Data Frame

## Final Data Cleaning for JSON

# Ensure column names match
colnames(black_author_reads_json) <- colnames(black_author_reads_html)

# Trim whitespace from JSON data
black_author_reads_json <- black_author_reads_json %>% mutate(across(everything(), ~ trimws(as.character(.))))

# publishedYear is already character after the step above; the explicit cast is kept as a safeguard
black_author_reads_json$publishedYear <- as.character(black_author_reads_json$publishedYear)

# Sort rows by title to ensure order does not affect comparison
black_author_reads_html <- black_author_reads_html %>% arrange(title)
black_author_reads_json <- black_author_reads_json %>% arrange(title)
black_author_reads_xml  <- black_author_reads_xml  %>% arrange(title)

# Recheck if all data frames are now identical
identical_html_xml <- identical(black_author_reads_html, black_author_reads_xml)
identical_html_json <- identical(black_author_reads_html, black_author_reads_json)
identical_xml_json <- identical(black_author_reads_xml, black_author_reads_json)

# Print results
cat("Are HTML and XML data frames identical? ", identical_html_xml, "\n")
## Are HTML and XML data frames identical?  TRUE
cat("Are HTML and JSON data frames identical? ", identical_html_json, "\n")
## Are HTML and JSON data frames identical?  TRUE
cat("Are XML and JSON data frames identical? ", identical_xml_json, "\n")
## Are XML and JSON data frames identical?  TRUE
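
The cleaning steps used throughout this document (flatten list columns, align column names, convert to trimmed character, sort by title) can be gathered into one helper. The sketch below is only one possible arrangement in base R; standardize_books() and the sample data frames are hypothetical names and values introduced for illustration.

```r
# Hypothetical helper: apply the same cleaning steps to any of the data frames
standardize_books <- function(df, col_names) {
  df <- as.data.frame(df, stringsAsFactors = FALSE)
  names(df) <- col_names
  df[] <- lapply(df, function(col) {
    # Flatten list columns (such as JSON arrays) to comma-separated strings
    if (is.list(col)) {
      col <- vapply(col, paste, character(1), collapse = ", ")
    }
    # Convert to character and strip surrounding whitespace
    trimws(as.character(col))
  })
  # Sort by the first column and reset row names so order cannot differ
  df <- df[order(df[[1]]), , drop = FALSE]
  rownames(df) <- NULL
  df
}

# Hypothetical inputs that differ in names, types, whitespace, and row order
html_like <- data.frame(
  Title = c("Things Fall Apart", "The Bluest Eye "),
  `Published Year` = c(1958L, 1970L),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
json_like <- data.frame(
  title = c("The Bluest Eye", "Things Fall Apart"),
  publishedYear = c("1970", "1958"),
  stringsAsFactors = FALSE
)

target_names <- c("title", "publishedYear")
clean_html <- standardize_books(html_like, target_names)
clean_json <- standardize_books(json_like, target_names)

all_match <- identical(clean_html, clean_json)
print(all_match)
```

Applying one function to every source keeps the steps in the same order for each data frame, which avoids the situation above where one data frame was cleaned later than the others.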

Conclusion

We extracted book data from HTML, XML, and JSON files and standardized it until identical() confirmed that all three data frames matched.

Key Takeaways

- Extracted the data using rvest (HTML), XML (XML), and jsonlite (JSON).
- Fixed mismatches in column names, whitespace, data types, and row order.
- Flattened JSON lists for authors and genres to align with the other formats.
- Ran identical() checks and confirmed all three data frames were identical after cleaning.

Final Thoughts

Data is messy, and different formats don’t always play by the same rules. Standardization is key: cleaning and wrangling made everything match up.