This document demonstrates how to read and compare data from
HTML, XML, and JSON files in R.
The dataset contains information about three fiction books.
library(XML) # For XML processing
library(jsonlite) # For JSON processing
library(rvest) # For HTML scraping
library(dplyr) # For data manipulation
html_file <- "https://raw.githubusercontent.com/sheriannmclarty/Spring-2025-Data-607/Assignment-7/black_author_reads.html"
xml_file <- "https://raw.githubusercontent.com/sheriannmclarty/Spring-2025-Data-607/Assignment-7/black_author_reads.xml"
json_file <- "https://raw.githubusercontent.com/sheriannmclarty/Spring-2025-Data-607/Assignment-7/black_author_reads.json"
html_text <- readLines(html_file, warn = FALSE) # Read as text
html_data <- read_html(paste(html_text, collapse = "\n")) # Convert to HTML object
# Extract the first table from the HTML file (rvest >= 1.0 always fills ragged rows, so `fill` is no longer needed)
html_table <- html_data %>% html_element("table") %>% html_table()
# Convert Table to Data Frame
black_author_reads_html <- as.data.frame(html_table)
# Print the Extracted Data
print(black_author_reads_html)
## Title Authors Genre
## 1 The Bluest Eye Toni Morrison Literary Fiction
## 2 Things Fall Apart Chinua Achebe Historical Fiction, Tragedy
## 3 Children of Blood and Bone Tomi Adeyemi Young Adult Fantasy
## Published Year
## 1 1970
## 2 1958
## 3 2018
## Notable Awards
## 1 None (But highly acclaimed)
## 2 None (But widely regarded as one of the greatest African novels)
## 3 Andre Norton Award for Young Adult Science Fiction and Fantasy
## Summary
## 1 A powerful novel about a young Black girl named Pecola Breedlove, who struggles with self-worth in a society that idolizes whiteness. Morrison’s debut novel explores themes of beauty, racism, and identity.
## 2 The tragic story of Okonkwo, a respected leader in an Igbo village, whose life is upended by colonialism and internal struggles. Achebe’s masterpiece captures the clash between traditional African culture and European imperialism.
## 3 A gripping fantasy novel inspired by West African mythology, where Zélie, a young divîner, fights to restore magic to her people while battling oppressive forces in the land of Orïsha.
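The same extraction can be tried offline on an inline HTML string, which makes it easy to see how rvest names and types the columns before pointing it at the remote file. This is only a minimal sketch: the two-row table and the names `sample_html` and `sample_table` are made up for illustration.

```r
library(rvest)

# Hypothetical inline HTML standing in for the remote file
sample_html <- '<table>
  <tr><th>Title</th><th>Published Year</th></tr>
  <tr><td>The Bluest Eye</td><td>1970</td></tr>
  <tr><td>Things Fall Apart</td><td>1958</td></tr>
</table>'

# read_html() accepts literal markup as well as a URL or file path
sample_table <- read_html(sample_html) %>%
  html_element("table") %>%
  html_table()

# Header cells become column names verbatim, including spaces and capitals
names(sample_table)

# html_table() also guesses column types, so the year is not read as text
sapply(sample_table, class)
```

Both details matter later: the capitalised, spaced column names and the guessed column types are differences that `identical()` will report against the XML and JSON versions.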
xml_text <- readLines(xml_file, warn = FALSE)
xml_data <- xmlParse(paste(xml_text, collapse = "\n"))
black_author_reads_xml <- xmlToDataFrame(xml_data)
# Print to confirm extraction
print(black_author_reads_xml)
## title authors genre
## 1 The Bluest Eye Toni Morrison Literary Fiction
## 2 Things Fall Apart Chinua Achebe Historical Fiction, Tragedy
## 3 Children of Blood and Bone Tomi Adeyemi Young Adult Fantasy
## publishedYear
## 1 1970
## 2 1958
## 3 2018
## awards
## 1 None (But highly acclaimed)
## 2 None (But widely regarded as one of the greatest African novels)
## 3 Andre Norton Award for Young Adult Science Fiction and Fantasy
## summary
## 1 A powerful novel about a young Black girl named Pecola Breedlove, who struggles with self-worth in a society that idolizes whiteness. Morrison’s debut novel explores themes of beauty, racism, and identity.
## 2 The tragic story of Okonkwo, a respected leader in an Igbo village, whose life is upended by colonialism and internal struggles. Achebe’s masterpiece captures the clash between traditional African culture and European imperialism.
## 3 A gripping fantasy novel inspired by West African mythology, where Zélie, a young divîner, fights to restore magic to her people while battling oppressive forces in the land of Orïsha.
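A small offline sketch shows what `xmlToDataFrame()` does with a simple record structure. The inline XML and the names `sample_xml`, `sample_doc`, and `sample_books` are assumptions for illustration and are not taken from the real file.

```r
library(XML)

# Hypothetical inline XML; the element names are illustrative
sample_xml <- '<books>
  <book><title>The Bluest Eye</title><publishedYear>1970</publishedYear></book>
  <book><title>Things Fall Apart</title><publishedYear>1958</publishedYear></book>
</books>'

# asText = TRUE tells xmlParse() the argument is markup, not a file name
sample_doc <- xmlParse(sample_xml, asText = TRUE)

# Each child of the root becomes a row; its child elements become columns
sample_books <- xmlToDataFrame(sample_doc, stringsAsFactors = FALSE)

# XML carries no type information, so every column is character
sapply(sample_books, class)
```

Because the year arrives as text here, the XML data frame is unlikely to pass an `identical()` check against a source that parsed the year as a number until both are converted to a common type.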
json_data <- fromJSON(json_file)
# Extract the books data from JSON
black_author_reads_json <- as.data.frame(json_data$books)
# Print to confirm extraction
print(black_author_reads_json)
## title authors genre
## 1 The Bluest Eye Toni Morrison Literary Fiction
## 2 Things Fall Apart Chinua Achebe Historical Fiction, Tragedy
## 3 Children of Blood and Bone Tomi Adeyemi Young Adult Fantasy
## publishedYear
## 1 1970
## 2 1958
## 3 2018
## awards
## 1 None (But highly acclaimed)
## 2 None (But widely regarded as one of the greatest African novels)
## 3 Andre Norton Award for Young Adult Science Fiction and Fantasy
## summary
## 1 A powerful novel about a young Black girl named Pecola Breedlove, who struggles with self-worth in a society that idolizes whiteness. Morrison’s debut novel explores themes of beauty, racism, and identity.
## 2 The tragic story of Okonkwo, a respected leader in an Igbo village, whose life is upended by colonialism and internal struggles. Achebe’s masterpiece captures the clash between traditional African culture and European imperialism.
## 3 A gripping fantasy novel inspired by West African mythology, where Zélie, a young divîner, fights to restore magic to her people while battling oppressive forces in the land of Orïsha.
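The printed JSON result looks the same as the other two, but `fromJSON()` can return list columns when a field holds an array. The sketch below uses a made-up JSON string. The array-valued `genre` field is an assumption about how such a file might be laid out, not a copy of the real one.

```r
library(jsonlite)

# Hypothetical inline JSON: genre is an array, so a record can hold several values
sample_json <- '{"books": [
  {"title": "The Bluest Eye", "genre": ["Literary Fiction"], "publishedYear": 1970},
  {"title": "Things Fall Apart", "genre": ["Historical Fiction", "Tragedy"], "publishedYear": 1958}
]}'

sample_books <- fromJSON(sample_json)$books

# Arrays of differing length cannot be simplified to a plain vector,
# so genre comes back as a list column; the year comes back as a number
sapply(sample_books, class)
lengths(sample_books$genre)
```

Checking `sapply(df, class)` right after import is a cheap way to spot list columns before they are coerced to character.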
# Convert columns to character type for accurate comparison
black_author_html <- black_author_reads_html %>% mutate(across(everything(), as.character))
black_author_xml <- black_author_reads_xml %>% mutate(across(everything(), as.character))
black_author_json <- black_author_reads_json %>% mutate(across(everything(), as.character))
# Check if all data frames are identical
identical_html_xml <- identical(black_author_html, black_author_xml)
identical_html_json <- identical(black_author_html, black_author_json)
identical_xml_json <- identical(black_author_xml, black_author_json)
# Print results
cat("Are HTML and XML data frames identical? ", identical_html_xml, "\n")
## Are HTML and XML data frames identical? FALSE
cat("Are HTML and JSON data frames identical? ", identical_html_json, "\n")
## Are HTML and JSON data frames identical? FALSE
cat("Are XML and JSON data frames identical? ", identical_xml_json, "\n")
## Are XML and JSON data frames identical? FALSE
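`identical()` only reports that something differs, not what. Base R's `all.equal()` returns a description of each difference, which helps decide what to clean next. A minimal sketch with made-up frames (`from_html` and `from_xml` are illustrative names):

```r
# Two small frames that hold the same values under different column names and types
from_html <- data.frame(Title = c("The Bluest Eye", "Things Fall Apart"),
                        `Published Year` = c(1970L, 1958L),
                        check.names = FALSE, stringsAsFactors = FALSE)
from_xml <- data.frame(title = c("The Bluest Eye", "Things Fall Apart"),
                       publishedYear = c("1970", "1958"),
                       stringsAsFactors = FALSE)

# identical() gives a single TRUE/FALSE
identical(from_html, from_xml)

# all.equal() returns TRUE or a character vector describing each difference
differences <- all.equal(from_html, from_xml)
print(differences)
```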
Standardize the data frames so that all three match
## Compare Data Frames with Adjustments
# Ensure consistent column names
colnames(black_author_reads_html) <- colnames(black_author_reads_xml) <- colnames(black_author_reads_json)
# Trim whitespace from all values
black_author_reads_html <- black_author_reads_html %>% mutate(across(everything(), ~ trimws(as.character(.))))
black_author_reads_xml <- black_author_reads_xml %>% mutate(across(everything(), ~ trimws(as.character(.))))
black_author_reads_json <- black_author_reads_json %>% mutate(across(everything(), ~ trimws(as.character(.))))
# Arrange rows to ensure order doesn't affect comparison
black_author_reads_html <- black_author_reads_html %>% arrange(across(everything()))
black_author_reads_xml <- black_author_reads_xml %>% arrange(across(everything()))
black_author_reads_json <- black_author_reads_json %>% arrange(across(everything()))
# Recheck if data frames are now identical
identical_html_xml <- identical(black_author_reads_html, black_author_reads_xml)
identical_html_json <- identical(black_author_reads_html, black_author_reads_json)
identical_xml_json <- identical(black_author_reads_xml, black_author_reads_json)
# Print results
cat("Are HTML and XML data frames identical? ", identical_html_xml, "\n")
## Are HTML and XML data frames identical? TRUE
cat("Are HTML and JSON data frames identical? ", identical_html_json, "\n")
## Are HTML and JSON data frames identical? FALSE
cat("Are XML and JSON data frames identical? ", identical_xml_json, "\n")
## Are XML and JSON data frames identical? FALSE
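When `identical()` still returns FALSE after renaming and trimming, a column-by-column check narrows down where the mismatch lives. The sketch below uses made-up frames (`from_xml`, `from_json`) that share column names but disagree in one value.

```r
# Hypothetical frames with matching column names, differing in one column
from_xml <- data.frame(title = c("The Bluest Eye", "Things Fall Apart"),
                       genre = c("Literary Fiction", "Historical Fiction, Tragedy"),
                       stringsAsFactors = FALSE)
from_json <- data.frame(title = c("The Bluest Eye", "Things Fall Apart"),
                        genre = c("Literary Fiction", 'c("Historical Fiction", "Tragedy")'),
                        stringsAsFactors = FALSE)

# Compare column by column; FALSE marks a column that differs
column_matches <- vapply(names(from_xml),
                         function(col) identical(from_xml[[col]], from_json[[col]]),
                         logical(1))
print(column_matches)

# Locate the rows where the mismatching column disagrees
mismatch_rows <- which(from_xml$genre != from_json$genre)
from_json$genre[mismatch_rows]
```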
str(black_author_reads_html)
## 'data.frame': 3 obs. of 6 variables:
## $ title : chr "Children of Blood and Bone" "The Bluest Eye" "Things Fall Apart"
## $ authors : chr "Tomi Adeyemi" "Toni Morrison" "Chinua Achebe"
## $ genre : chr "Young Adult Fantasy" "Literary Fiction" "Historical Fiction, Tragedy"
## $ publishedYear: chr "2018" "1970" "1958"
## $ awards : chr "Andre Norton Award for Young Adult Science Fiction and Fantasy" "None (But highly acclaimed)" "None (But widely regarded as one of the greatest African novels)"
## $ summary : chr "A gripping fantasy novel inspired by West African mythology, where Zélie, a young divîner, fights to restore ma"| __truncated__ "A powerful novel about a young Black girl named Pecola Breedlove, who struggles with self-worth in a society th"| __truncated__ "The tragic story of Okonkwo, a respected leader in an Igbo village, whose life is upended by colonialism and in"| __truncated__
str(black_author_reads_xml)
## 'data.frame': 3 obs. of 6 variables:
## $ title : chr "Children of Blood and Bone" "The Bluest Eye" "Things Fall Apart"
## $ authors : chr "Tomi Adeyemi" "Toni Morrison" "Chinua Achebe"
## $ genre : chr "Young Adult Fantasy" "Literary Fiction" "Historical Fiction, Tragedy"
## $ publishedYear: chr "2018" "1970" "1958"
## $ awards : chr "Andre Norton Award for Young Adult Science Fiction and Fantasy" "None (But highly acclaimed)" "None (But widely regarded as one of the greatest African novels)"
## $ summary : chr "A gripping fantasy novel inspired by West African mythology, where Zélie, a young divîner, fights to restore ma"| __truncated__ "A powerful novel about a young Black girl named Pecola Breedlove, who struggles with self-worth in a society th"| __truncated__ "The tragic story of Okonkwo, a respected leader in an Igbo village, whose life is upended by colonialism and in"| __truncated__
str(black_author_reads_json)
## 'data.frame': 3 obs. of 6 variables:
## $ title : chr "Children of Blood and Bone" "The Bluest Eye" "Things Fall Apart"
## $ authors : chr "Tomi Adeyemi" "Toni Morrison" "Chinua Achebe"
## $ genre : chr "Young Adult Fantasy" "Literary Fiction" "c(\"Historical Fiction\", \"Tragedy\")"
## $ publishedYear: chr "2018" "1970" "1958"
## $ awards : chr "Andre Norton Award for Young Adult Science Fiction and Fantasy" "None (But highly acclaimed)" "None (But widely regarded as one of the greatest African novels)"
## $ summary : chr "A gripping fantasy novel inspired by West African mythology, where Zélie, a young divîner, fights to restore ma"| __truncated__ "A powerful novel about a young Black girl named Pecola Breedlove, who struggles with self-worth in a society th"| __truncated__ "The tragic story of Okonkwo, a respected leader in an Igbo village, whose life is upended by colonialism and in"| __truncated__
# Check if all data frames are identical
identical_html_xml <- identical(black_author_reads_html, black_author_reads_xml)
identical_html_json <- identical(black_author_reads_html, black_author_reads_json)
identical_xml_json <- identical(black_author_reads_xml, black_author_reads_json)
# Print results
cat("Are HTML and XML data frames identical? ", identical_html_xml, "\n")
## Are HTML and XML data frames identical? TRUE
cat("Are HTML and JSON data frames identical? ", identical_html_json, "\n")
## Are HTML and JSON data frames identical? FALSE
cat("Are XML and JSON data frames identical? ", identical_xml_json, "\n")
## Are XML and JSON data frames identical? FALSE
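The `str()` output shows the remaining difference: the JSON `genre` value reads `c("Historical Fiction", "Tragedy")` where the other two sources have `Historical Fiction, Tragedy`. That pattern is what `as.character()` produces when it is applied to a list column. A minimal sketch of the effect and one possible fix, using a made-up `books` frame:

```r
# Hypothetical frame with a list column, as jsonlite produces for array fields
books <- data.frame(title = c("The Bluest Eye", "Things Fall Apart"),
                    stringsAsFactors = FALSE)
books$genre <- list("Literary Fiction", c("Historical Fiction", "Tragedy"))

# as.character() on a list deparses multi-element entries into R code
deparsed <- as.character(books$genre)
print(deparsed)

# Collapse each entry into one comma-separated string instead
books$genre <- vapply(books$genre, paste, character(1), collapse = ", ")
print(books$genre)
```

The flattening has to happen before the column is coerced with `as.character()`. Once the deparsed string exists, recovering the original values is much more awkward.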
## Final Data Cleaning for JSON
# Ensure column names match
colnames(black_author_reads_json) <- colnames(black_author_reads_html)
# Rebuild the JSON data frame from json_data, collapsing list-valued fields (such as a genre array)
# into comma-separated strings; as.character() alone would deparse them as c("...", "...")
black_author_reads_json <- as.data.frame(json_data$books) %>% mutate(across(everything(), ~ trimws(vapply(., paste, character(1), collapse = ", ", USE.NAMES = FALSE))))
# Sort rows by title to ensure order does not affect comparison
black_author_reads_html <- black_author_reads_html %>% arrange(title)
black_author_reads_json <- black_author_reads_json %>% arrange(title)
black_author_reads_xml <- black_author_reads_xml %>% arrange(title)
# Recheck if all data frames are now identical
identical_html_xml <- identical(black_author_reads_html, black_author_reads_xml)
identical_html_json <- identical(black_author_reads_html, black_author_reads_json)
identical_xml_json <- identical(black_author_reads_xml, black_author_reads_json)
# Print results
cat("Are HTML and XML data frames identical? ", identical_html_xml, "\n")
## Are HTML and XML data frames identical? TRUE
cat("Are HTML and JSON data frames identical? ", identical_html_json, "\n")
## Are HTML and JSON data frames identical? TRUE
cat("Are XML and JSON data frames identical? ", identical_xml_json, "\n")
## Are XML and JSON data frames identical? TRUE
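The cleaning steps above were applied to each data frame in turn. One way to guarantee that all three sources get the same treatment is to wrap the steps in a single function. The helper below is a hypothetical sketch in base R, not the code used to produce the results above. `standardize_books()` and its arguments are invented names, and the inputs are made up.

```r
# Hypothetical helper: apply the same cleaning steps to any of the three frames
standardize_books <- function(df, col_names) {
  df <- as.data.frame(df, stringsAsFactors = FALSE)
  names(df) <- col_names
  # Flatten list columns, coerce everything to character, and trim whitespace
  df[] <- lapply(df, function(col) {
    trimws(vapply(col, paste, character(1), collapse = ", ", USE.NAMES = FALSE))
  })
  # Sort by the first column and reset row names so row order cannot cause a mismatch
  df <- df[order(df[[1]]), , drop = FALSE]
  rownames(df) <- NULL
  df
}

# Made-up inputs that differ in names, types, whitespace, and row order
a <- data.frame(Title = c("B ", "A"), Year = c(1958L, 1970L),
                stringsAsFactors = FALSE)
b <- data.frame(title = c("A", "B"), publishedYear = c("1970", "1958"),
                stringsAsFactors = FALSE)

target_names <- c("title", "publishedYear")
same <- identical(standardize_books(a, target_names),
                  standardize_books(b, target_names))
print(same)
```

Passing the target column names explicitly keeps the helper independent of which source is treated as the reference.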
We extracted book data from HTML, XML, and JSON files, standardized it, and confirmed with identical() that the three resulting data frames match exactly.
Key Takeaways
Extracted the data using rvest (HTML), XML (XML), and jsonlite (JSON).
Fixed mismatches in column names, whitespace, and encoding issues.
Flattened JSON lists for authors and genres to align with the other formats.
Ran identical() checks and confirmed all three formats were identical after cleaning.
Final Thoughts
Data is messy, and different formats don’t always play by the same rules. Standardization is key: cleaning and wrangling made everything match up.