This assignment explores how the same data is represented in three different formats (HTML, XML, and JSON) and how each one is loaded into R. I chose three books, at least one of which has multiple authors.
Each book includes: Title, Authors, Year, Genre, and Pages.
We start by reading the book data from an HTML file. This file
contains a simple table with all book details.
We use the rvest and xml2 packages to parse
the HTML and extract the table.
What happens in this step?
- Load the HTML file using read_html().
- Extract the table using html_table().
- Get the first table from the list (since our file only has one).
- Output: A data frame called books_html containing the book info, one row per book.
library(rvest)
library(xml2)
# Path to the HTML file (make sure it's in your working directory)
html_file <- "books.html"
# Read and parse the HTML file
html_doc <- read_html(html_file)
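# Extract all tables from the HTML (returns a list of data frames).
# The fill argument is deprecated in rvest 1.0.0 and later, where missing
# cells are always filled with NA, so it is omitted here.
html_tables <- html_doc %>% html_table()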
# Select the first table (our books table)
books_html <- html_tables[[1]]
# Print the data frame to see what we've got
print(books_html)
## # A tibble: 3 × 5
## Title Authors Year Genre Pages
## <chr> <chr> <int> <chr> <int>
## 1 The Art of Statistics David Spiegelhalter 2019 Stat… 448
## 2 Weapons of Math Destruction Cathy O'Neil 2016 Math… 259
## 3 Data Feminism Catherine D'Ignazio, Lauren F. … 2020 Data… 328
Expected output:
A table (data frame) with columns: Title, Authors, Year, Genre,
Pages.
Each row represents a book and its details.
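Since books.html itself is not shown, here is a minimal sketch of the same steps run on an inline HTML string. The table is a hypothetical stand-in with placeholder values, and the names sample_html, sample_doc, sample_tables, and sample_books are made up for illustration.

```r
library(rvest)

# Hypothetical stand-in for books.html (placeholder values, not the real file)
sample_html <- "<table>
  <tr><th>Title</th><th>Authors</th><th>Year</th><th>Genre</th><th>Pages</th></tr>
  <tr><td>Example Book</td><td>First Author, Second Author</td><td>2020</td><td>Example</td><td>100</td></tr>
</table>"

# read_html() accepts literal markup as well as a file path
sample_doc <- read_html(sample_html)

# html_table() returns a list with one data frame per <table> element
sample_tables <- html_table(sample_doc)
sample_books <- sample_tables[[1]]

print(sample_books)
str(sample_books$Year)  # numeric-looking columns are converted automatically
```

Because an HTML table cell holds plain text, the two authors already sit in one comma-separated string here, which is why no flattening step is needed for this format.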
Next, we read the book data from an XML file. XML allows for nested structures, which is useful for books with multiple authors.
What happens in this step?
- Load the XML file using read_xml().
- Find all <book> nodes in the XML.
- For each book, extract fields: title, authors (combine multiple authors into a single string), year, genre, pages.
- Output: A data frame called books_xml with the same columns as above.
library(xml2)
library(dplyr)
library(tibble)
# Path to the XML file
xml_file <- "books.xml"
# Read and parse the XML file
xml_doc <- read_xml(xml_file)
# Find all book nodes
book_nodes <- xml_find_all(xml_doc, ".//book")
# For each book node, extract details into a data frame
books_xml <- tibble(
  Title = sapply(book_nodes, function(x) xml_text(xml_find_first(x, "title"))),
  Authors = sapply(book_nodes, function(x) paste(xml_text(xml_find_all(x, "authors/author")), collapse = ", ")),
  Year = sapply(book_nodes, function(x) xml_text(xml_find_first(x, "year"))),
  Genre = sapply(book_nodes, function(x) xml_text(xml_find_first(x, "genre"))),
  Pages = sapply(book_nodes, function(x) xml_text(xml_find_first(x, "pages")))
)
# Print the data frame to confirm it loaded correctly
print(books_xml)
## # A tibble: 3 × 5
## Title Authors Year Genre Pages
## <chr> <chr> <chr> <chr> <chr>
## 1 The Art of Statistics David Spiegelhalter 2019 Stat… 448
## 2 Weapons of Math Destruction Cathy O'Neil 2016 Math… 259
## 3 Data Feminism Catherine D'Ignazio, Lauren F. … 2020 Data… 328
Expected output:
A tibble/data frame with columns Title, Authors (authors separated by
comma for multi-author books), Year, Genre, Pages.
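Note that Year and Pages come back as character here, because xml_text() always returns strings. The sketch below shows the layout the XPath expressions above imply and one way to convert a numeric field. The <books> root element, the placeholder values, and the names sample_xml, sample_nodes, sample_year, and sample_authors are assumptions for illustration, not the contents of books.xml.

```r
library(xml2)

# Hypothetical stand-in for books.xml; the <books> root name is an assumption
sample_xml <- read_xml("<books>
  <book>
    <title>Example Book</title>
    <authors>
      <author>First Author</author>
      <author>Second Author</author>
    </authors>
    <year>2020</year>
    <genre>Example</genre>
    <pages>100</pages>
  </book>
</books>")

sample_nodes <- xml_find_all(sample_xml, ".//book")

# xml_text() returns character, so numeric fields need an explicit conversion
sample_year <- as.integer(xml_text(xml_find_first(sample_nodes, "year")))

# Collapse the nested <author> elements of each book into one string
sample_authors <- vapply(
  sample_nodes,
  function(node) paste(xml_text(xml_find_all(node, "authors/author")), collapse = ", "),
  character(1)
)

print(sample_year)
print(sample_authors)
```

Converting to integer at load time is an alternative to converting every data frame to character before comparing them; either works as long as all three frames end up with the same types.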
Now let’s load the book data from a JSON file. JSON is very common for APIs and web data.
What happens in this step?
- Use jsonlite::fromJSON() to load the JSON file into R.
- Each book is an object in a JSON array. By default, fromJSON() simplifies such an array into a data frame with one row per book; authors becomes a list column holding one character vector per book (which can have multiple entries).
- We flatten each authors vector into a single comma-separated string for consistency.
- Output: A data frame called books_json with the same structure.
library(jsonlite)
library(tibble)
# Path to the JSON file
json_file <- "books.json"
# Read JSON data (the array of book objects is simplified to a data frame)
books_list <- fromJSON(json_file)
# Convert to a tibble/data frame, flattening authors
books_json <- tibble(
  Title = books_list$title,
  Authors = sapply(books_list$authors, function(x) paste(x, collapse = ", ")),
  Year = books_list$year,
  Genre = books_list$genre,
  Pages = books_list$pages
)
# Print to inspect the loaded data
print(books_json)
## # A tibble: 3 × 5
## Title Authors Year Genre Pages
## <chr> <chr> <int> <chr> <int>
## 1 The Art of Statistics David Spiegelhalter 2019 Stat… 448
## 2 Weapons of Math Destruction Cathy O'Neil 2016 Math… 259
## 3 Data Feminism Catherine D'Ignazio, Lauren F. … 2020 Data… 328
Expected output:
A tibble/data frame with Title, Authors (comma-separated), Year, Genre,
Pages.
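To make the flattening step concrete, here is a small sketch on an inline JSON string. The two books are hypothetical placeholders, and sample_json, sample_parsed, and sample_authors_flat are illustrative names; the field names mirror the ones used in the code above.

```r
library(jsonlite)

# Hypothetical stand-in for books.json: an array of book objects
sample_json <- '[
  {"title": "Example Book", "authors": ["First Author", "Second Author"],
   "year": 2020, "genre": "Example", "pages": 100},
  {"title": "Another Example", "authors": ["Solo Author"],
   "year": 2021, "genre": "Example", "pages": 200}
]'

# fromJSON() accepts a JSON string as well as a file path
sample_parsed <- fromJSON(sample_json)

# With default simplification the result is a data frame,
# and authors is a list column because the vectors differ in length
print(class(sample_parsed))
print(is.list(sample_parsed$authors))

# Flatten each authors vector into one comma-separated string
sample_authors_flat <- vapply(sample_parsed$authors, paste, character(1), collapse = ", ")
print(sample_authors_flat)
```

Using vapply() with character(1) instead of sapply() is a design choice: it fails loudly if any book does not reduce to exactly one string.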
Are the three data frames identical?
It’s important to standardize column names and data types before
comparing.
What happens in this step?
- Make column names the same for all three tables.
- Convert Year and Pages to character type for comparison (sometimes they’re numeric, sometimes strings).
- Use identical() to check if the data frames are exactly the same.
# Rename columns in HTML for consistency
names(books_html) <- c("Title", "Authors", "Year", "Genre", "Pages")
# Convert Year & Pages columns to character type
books_html$Year <- as.character(books_html$Year)
books_html$Pages <- as.character(books_html$Pages)
books_xml$Year <- as.character(books_xml$Year)
books_xml$Pages <- as.character(books_xml$Pages)
books_json$Year <- as.character(books_json$Year)
books_json$Pages <- as.character(books_json$Pages)
# Compare each pair of data frames
html_vs_xml <- identical(books_html, books_xml)
html_vs_json <- identical(books_html, books_json)
xml_vs_json <- identical(books_xml, books_json)
# Print results
cat("Are HTML and XML data frames identical? ", html_vs_xml, "\n")
## Are HTML and XML data frames identical? TRUE
cat("Are HTML and JSON data frames identical? ", html_vs_json, "\n")
## Are HTML and JSON data frames identical? TRUE
cat("Are XML and JSON data frames identical? ", xml_vs_json, "\n")
## Are XML and JSON data frames identical? TRUE
Expected output:
Three logical values (TRUE or FALSE) showing
whether each pair of data frames is identical.
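To see why the type conversion matters, here is a base-R sketch with two hypothetical one-row data frames (df_int and df_chr are made-up names). identical() is FALSE while Year is an integer in one frame and a string in the other, and all.equal() describes the difference instead of just saying FALSE.

```r
# Same book, but Year is stored as integer in one frame and character in the other
df_int <- data.frame(Title = "Example Book", Year = 2020L, stringsAsFactors = FALSE)
df_chr <- data.frame(Title = "Example Book", Year = "2020", stringsAsFactors = FALSE)

# identical() compares types as well as values
same_before <- identical(df_int, df_chr)

# all.equal() returns a description of the differences when the objects differ
diff_report <- all.equal(df_int, df_chr)

# After converting to a common type the frames match
df_int$Year <- as.character(df_int$Year)
same_after <- identical(df_int, df_chr)

print(same_before)
print(diff_report)
print(same_after)
```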
Are the three data frames identical?
The output above says TRUE for all comparisons, so yes, they match exactly
once Year and Pages are converted to a common type. This shows that careful
formatting across all files yields consistent results.
What did I learn from this assignment?
- HTML tables are easy for simple, flat data, but not ideal for complex nesting (like multiple authors).
- XML excels at complex, hierarchical data. Parsing requires more code, but it’s flexible.
- JSON is web-friendly, handles nested structures well, and is common in APIs. Flattening lists may be needed!
- Data wrangling: Always check for type mismatches and ensure consistency in column names and formats.
- Real-world caveats: Even small differences in formatting can cause data frames to differ (e.g. whitespace, author order, types).
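The whitespace and type caveats can be handled in one place. Below is a rough sketch of a helper, normalize_books(), which is a hypothetical function written for illustration and not part of the assignment code. It converts every column to trimmed character strings before comparison; it does not address author order.

```r
# Hypothetical helper: convert every column to trimmed character strings
normalize_books <- function(df) {
  df <- as.data.frame(df, stringsAsFactors = FALSE)
  df[] <- lapply(df, function(col) trimws(as.character(col)))
  df
}

# Two placeholder frames that differ only in whitespace and column type
raw_a <- data.frame(Title = "Example Book", Pages = 100L, stringsAsFactors = FALSE)
raw_b <- data.frame(Title = " Example Book ", Pages = "100", stringsAsFactors = FALSE)

same_raw <- identical(raw_a, raw_b)
same_normalized <- identical(normalize_books(raw_a), normalize_books(raw_b))

print(same_raw)
print(same_normalized)
```

Converting through as.data.frame() first also means a tibble and a plain data frame are compared on equal footing, since identical() checks the class attribute too.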