Week 7 Assignment: Working with XML and JSON in R

This assignment explores how the same data is represented in three different formats (HTML, XML, and JSON) and how each one is loaded into R. Here are my three chosen books, one of which has multiple authors!

Book Choices

The Art of Statistics by David Spiegelhalter
Weapons of Math Destruction by Cathy O’Neil
Data Feminism by Catherine D’Ignazio & Lauren F. Klein

Each book includes: Title, Authors, Year, Genre, and Pages.

1. Reading the HTML Table

We start by reading the book data from an HTML file. This file contains a simple table with all book details.
We use the rvest and xml2 packages to parse the HTML and extract the table.

What happens in this step?
- Load the HTML file using read_html().
- Extract the table using html_table().
- Get the first table from the list (since our file only has one).
- Output: A data frame called books_html containing the book info, one row per book.

library(rvest)
library(xml2)

## Warning: package 'xml2' was built under R version 4.4.3

# Path to the HTML file (make sure it's in your working directory)
html_file <- "books.html"

# Read and parse the HTML file
html_doc <- read_html(html_file)

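# Extract all tables from the HTML (returns a list of tibbles).
# fill = TRUE is not needed: the argument is deprecated in rvest 1.0.0 and later,
# where missing cells are always filled with NA.
html_tables <- html_doc %>% html_table()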

# Select the first table (our books table)
books_html <- html_tables[[1]]

# Print the data frame to see what we've got
print(books_html)

## # A tibble: 3 × 5
##   Title                       Authors                           Year Genre Pages
##   <chr>                       <chr>                            <int> <chr> <int>
## 1 The Art of Statistics       David Spiegelhalter               2019 Stat…   448
## 2 Weapons of Math Destruction Cathy O'Neil                      2016 Math…   259
## 3 Data Feminism               Catherine D'Ignazio, Lauren F. …  2020 Data…   328

Expected output:
A table (data frame) with columns: Title, Authors, Year, Genre, Pages.
Each row represents a book and its details.
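
Since books.html itself is not shown, here is a minimal sketch of the same steps on an inline table. The markup and the names html_text_sample and sample_books are assumptions made for illustration, not the actual file contents; Genre is left out because its values are truncated in the printed output above.

```r
library(rvest)

# Hypothetical stand-in for books.html: a header row plus one row per book
html_text_sample <- paste(
  "<table>",
  "<tr><th>Title</th><th>Authors</th><th>Year</th><th>Pages</th></tr>",
  "<tr><td>The Art of Statistics</td><td>David Spiegelhalter</td><td>2019</td><td>448</td></tr>",
  "<tr><td>Weapons of Math Destruction</td><td>Cathy O'Neil</td><td>2016</td><td>259</td></tr>",
  "<tr><td>Data Feminism</td><td>Catherine D'Ignazio, Lauren F. Klein</td><td>2020</td><td>328</td></tr>",
  "</table>",
  sep = "\n"
)

# read_html() accepts literal markup as well as a file path
sample_doc <- read_html(html_text_sample)

# html_table() on a whole document returns a list with one element per <table>
sample_tables <- html_table(sample_doc)
sample_books <- sample_tables[[1]]

# Numeric-looking columns are converted automatically (convert = TRUE by default)
print(sapply(sample_books, class))
```

Because html_table() converts column types by default, Year and Pages arrive as integers here, which is why the HTML result later needs a type change before it can be compared with the XML result.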

2. Reading the XML File

Next, we read the book data from an XML file. XML allows for nested structures, which is useful for books with multiple authors.

What happens in this step?
- Load the XML file using read_xml().
- Find all <book> nodes in the XML.
- For each book, extract fields: title, authors (combine multiple authors into a single string), year, genre, pages.
- Output: A data frame called books_xml with the same columns as above.

library(xml2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tibble)

# Path to the XML file
xml_file <- "books.xml"

# Read and parse the XML file
xml_doc <- read_xml(xml_file)

# Find all book nodes
book_nodes <- xml_find_all(xml_doc, ".//book")

# Extract each field across all book nodes and assemble a tibble.
# xml_find_first() is vectorised over the node set; authors need a per-book
# call because a book can have several <author> children.
books_xml <- tibble(
  Title   = xml_text(xml_find_first(book_nodes, "title")),
  Authors = vapply(
    book_nodes,
    function(x) paste(xml_text(xml_find_all(x, "authors/author")), collapse = ", "),
    character(1)
  ),
  Year    = xml_text(xml_find_first(book_nodes, "year")),
  Genre   = xml_text(xml_find_first(book_nodes, "genre")),
  Pages   = xml_text(xml_find_first(book_nodes, "pages"))
)

# Print the data frame to confirm it loaded correctly
print(books_xml)

## # A tibble: 3 × 5
##   Title                       Authors                          Year  Genre Pages
##   <chr>                       <chr>                            <chr> <chr> <chr>
## 1 The Art of Statistics       David Spiegelhalter              2019  Stat… 448  
## 2 Weapons of Math Destruction Cathy O'Neil                     2016  Math… 259  
## 3 Data Feminism               Catherine D'Ignazio, Lauren F. … 2020  Data… 328

Expected output:
A tibble/data frame with columns Title, Authors (authors separated by comma for multi-author books), Year, Genre, Pages.
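
To make the nested author structure concrete, here is a small sketch on an inline XML string. The element names book, title, authors/author, and year come from the XPath expressions used above; the root element name books and the variable names are assumptions, since books.xml is not shown.

```r
library(xml2)

# Hypothetical stand-in for books.xml with one single-author and one two-author book
xml_text_sample <- paste(
  "<books>",
  "  <book>",
  "    <title>The Art of Statistics</title>",
  "    <authors><author>David Spiegelhalter</author></authors>",
  "    <year>2019</year>",
  "  </book>",
  "  <book>",
  "    <title>Data Feminism</title>",
  "    <authors>",
  "      <author>Catherine D'Ignazio</author>",
  "      <author>Lauren F. Klein</author>",
  "    </authors>",
  "    <year>2020</year>",
  "  </book>",
  "</books>",
  sep = "\n"
)

sample_doc <- read_xml(xml_text_sample)
sample_nodes <- xml_find_all(sample_doc, ".//book")

# One title and one year per book: xml_find_first() handles the whole node set
sample_titles <- xml_text(xml_find_first(sample_nodes, "title"))

# xml_text() always returns character, so convert explicitly if integers are wanted
sample_years <- as.integer(xml_text(xml_find_first(sample_nodes, "year")))

# A variable number of authors per book: collapse each book's authors separately
sample_authors <- vapply(
  sample_nodes,
  function(node) paste(xml_text(xml_find_all(node, "authors/author")), collapse = ", "),
  character(1)
)

print(data.frame(Title = sample_titles, Authors = sample_authors, Year = sample_years))
```

Converting with as.integer() at load time is one option; the comparison step in this assignment instead converts every source to character, which works equally well as long as all three data frames end up with the same types.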

3. Reading the JSON File

Now let’s load the book data from a JSON file. JSON is very common for APIs and web data.

What happens in this step?
- Use jsonlite::fromJSON() to load the JSON file into R.
- Each book is an object in a list; authors are stored as a vector (can have multiple).
- We flatten the authors vector into a single comma-separated string for consistency.
- Output: A data frame called books_json with the same structure.

library(jsonlite)

## Warning: package 'jsonlite' was built under R version 4.4.2

library(tibble)

# Path to the JSON file
json_file <- "books.json"

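# Read JSON data. Assuming books.json holds an array of book objects, fromJSON()
# simplifies it into a data frame by default, with authors as a list column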
books_list <- fromJSON(json_file)

# Convert to a tibble/data frame, flattening authors
books_json <- tibble(
  Title   = books_list$title,
  Authors = sapply(books_list$authors, function(x) paste(x, collapse = ", ")),
  Year    = books_list$year,
  Genre   = books_list$genre,
  Pages   = books_list$pages
)

# Print to inspect the loaded data
print(books_json)

## # A tibble: 3 × 5
##   Title                       Authors                           Year Genre Pages
##   <chr>                       <chr>                            <int> <chr> <int>
## 1 The Art of Statistics       David Spiegelhalter               2019 Stat…   448
## 2 Weapons of Math Destruction Cathy O'Neil                      2016 Math…   259
## 3 Data Feminism               Catherine D'Ignazio, Lauren F. …  2020 Data…   328

Expected output:
A tibble/data frame with Title, Authors (comma-separated), Year, Genre, Pages.
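
The flattening step is easier to follow when the parsed structure is visible. Below is a minimal sketch on an inline JSON string; the array-of-objects layout and the variable names are assumptions, since books.json is not shown, though the keys title, authors, year, and pages match the ones accessed above.

```r
library(jsonlite)

# Hypothetical stand-in for books.json: an array with one object per book
json_text_sample <- '[
  {"title": "The Art of Statistics", "authors": ["David Spiegelhalter"],
   "year": 2019, "pages": 448},
  {"title": "Data Feminism", "authors": ["Catherine D\'Ignazio", "Lauren F. Klein"],
   "year": 2020, "pages": 328}
]'

# fromJSON() accepts a JSON string as well as a file path
parsed <- fromJSON(json_text_sample)

# An array of objects is simplified to a data frame; the authors arrays have
# different lengths, so they stay as a list column of character vectors
print(class(parsed))
print(class(parsed$authors))

# Collapse each book's author vector into one comma-separated string
authors_flat <- vapply(
  parsed$authors,
  function(a) paste(a, collapse = ", "),
  character(1)
)
print(authors_flat)
```

Whole numbers such as year and pages are parsed as integers, so the JSON result matches the HTML result in type without any extra conversion.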

4. Comparing the Three Data Frames

Are the three data frames identical?
It’s important to standardize column names and data types before comparing.

What happens in this step?
- Make column names the same for all three tables.
- Convert Year and Pages to character type for comparison (sometimes they’re numeric, sometimes strings).
- Use identical() to check if the data frames are exactly the same.

# Rename columns in HTML for consistency
names(books_html) <- c("Title", "Authors", "Year", "Genre", "Pages")

# Convert Year & Pages columns to character type
books_html$Year <- as.character(books_html$Year)
books_html$Pages <- as.character(books_html$Pages)
books_xml$Year   <- as.character(books_xml$Year)
books_xml$Pages  <- as.character(books_xml$Pages)
books_json$Year  <- as.character(books_json$Year)
books_json$Pages <- as.character(books_json$Pages)

# Compare each pair of data frames
html_vs_xml <- identical(books_html, books_xml)
html_vs_json <- identical(books_html, books_json)
xml_vs_json <- identical(books_xml, books_json)

# Print results
cat("Are HTML and XML data frames identical? ", html_vs_xml, "\n")

## Are HTML and XML data frames identical?  TRUE

cat("Are HTML and JSON data frames identical? ", html_vs_json, "\n")

## Are HTML and JSON data frames identical?  TRUE

cat("Are XML and JSON data frames identical? ", xml_vs_json, "\n")

## Are XML and JSON data frames identical?  TRUE

Expected output:
Three logical values (TRUE or FALSE) showing whether each pair of data frames is identical.
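
When identical() returns FALSE it gives no hint about the cause, so it helps to compare column classes first. The sketch below uses two small made-up tibbles (from_html and from_xml are illustrative names, not the data frames above) that differ only in the type of Year.

```r
library(tibble)

# Same values, different storage: integer versus character
from_html <- tibble(
  Title = c("The Art of Statistics", "Data Feminism"),
  Year  = c(2019L, 2020L)
)
from_xml <- tibble(
  Title = c("The Art of Statistics", "Data Feminism"),
  Year  = c("2019", "2020")
)

# identical() is strict about types, so this comparison fails
before <- identical(from_html, from_xml)
print(before)  # FALSE

# Put the column classes side by side to find the mismatch
col_types <- rbind(
  html = sapply(from_html, class),
  xml  = sapply(from_xml, class)
)
print(col_types)

# Bring both to a common type, then compare again
from_xml$Year <- as.integer(from_xml$Year)
after <- identical(from_html, from_xml)
print(after)  # TRUE
```

Converting to integer or converting to character both work; what matters is that every data frame uses the same type for the same column.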

5. Discussion & Analysis

Are the three data frames identical?
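All three comparisons return TRUE, so yes, they match exactly once Year and Pages are converted to a common type. Before that conversion, the XML data frame stored both columns as character while the HTML and JSON data frames stored them as integer, so identical() would have returned FALSE for any pair involving XML. With consistent formatting across all files and matching column types, the three sources yield the same result.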

What did I learn from this assignment?
- HTML tables are easy for simple, flat data, but not ideal for complex nesting (like multiple authors).
- XML excels at complex, hierarchical data. Parsing requires more code, but it’s flexible.
- JSON is web-friendly, handles nested structures well, and is common in APIs. Flattening lists may be needed!
- Data wrangling: Always check for type mismatches and ensure consistency in column names and formats.
- Real-world caveats: Even small differences in formatting can cause data frames to differ (e.g. whitespace, author order, types).
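
The whitespace and author-order caveats can be illustrated with a short sketch. The helper normalize_authors() is hypothetical, written only for this example; it trims each name and sorts the names alphabetically, which discards the original author order and is therefore only suitable for comparison, not for display.

```r
# Two spellings of the same author list: different order and stray spaces
authors_a <- "Catherine D'Ignazio, Lauren F. Klein"
authors_b <- " Lauren F. Klein,  Catherine D'Ignazio "

# Hypothetical helper: split on commas, trim each name, sort, and re-join
normalize_authors <- function(x) {
  vapply(
    strsplit(x, ","),
    function(parts) paste(sort(trimws(parts)), collapse = ", "),
    character(1)
  )
}

raw_match <- identical(authors_a, authors_b)
clean_match <- identical(normalize_authors(authors_a), normalize_authors(authors_b))

print(raw_match)    # FALSE
print(clean_match)  # TRUE
```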