library(xml2)
library(jsonlite)
library(XML)
library(dplyr)

Introduction

In this assignment we look at three books, namely

The Art of Thinking Clearly,
The Last Wish,
Good Omens

We manually create an HTML, an XML and a JSON file, each holding the same table of books, and import them into this RMD file to compare the resulting data frames.

# html_file <- "https://raw.githubusercontent.com/jerryjerald27/Data-607/refs/heads/main/Week7Assignment/books.html"
# Reading the URL above failed with the error "XML content does not seem to be XML: ''", so a local copy of the file is read instead.
html_file <- "books.html" 
html_data <- readHTMLTable(html_file, stringsAsFactors = FALSE)
df_html <- html_data[[1]]
xml_file <- "https://raw.githubusercontent.com/jerryjerald27/Data-607/refs/heads/main/Week7Assignment/books.xml"
xml_data <- read_xml(xml_file)
df_xml <- data.frame(
  Title = xml_text(xml_find_all(xml_data, "//book/title")),
  Author = xml_text(xml_find_all(xml_data, "//book/author")),
  Rating = as.numeric(xml_text(xml_find_all(xml_data, "//book/rating"))),
  Genre = xml_text(xml_find_all(xml_data, "//book/genre")),
  stringsAsFactors = FALSE
)

json_file <- "https://raw.githubusercontent.com/jerryjerald27/Data-607/refs/heads/main/Week7Assignment/books.json"  
json_data <- fromJSON(json_file)
df_json <- as.data.frame(json_data$books)
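The HTML file is read from a local copy because passing the GitHub URL to readHTMLTable() failed. One possible workaround, sketched here under the assumption that the failure comes from readHTMLTable() not fetching https URLs itself, is to download the file first and parse the downloaded copy. The helper name read_remote_html_table is hypothetical and not part of the original assignment.

```r
# Hypothetical helper: download a remote HTML file, then parse its first table.
# Assumes the XML package is installed and that the page contains at least one table.
read_remote_html_table <- function(url) {
  tmp <- tempfile(fileext = ".html")
  on.exit(unlink(tmp), add = TRUE)          # remove the temporary file afterwards
  # download.file() from base R fetches the file over https
  utils::download.file(url, destfile = tmp, quiet = TRUE)
  tables <- XML::readHTMLTable(tmp, stringsAsFactors = FALSE)
  tables[[1]]                               # first table on the page
}
```

It would be called with the books.html URL in place of the local file name; whether it resolves the error depends on the assumption above.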

Eye test

Now we compare the three data frames. We can first display them with knitr::kable and see how they look.

knitr::kable(df_html, "simple")

Title                        Author                        Rating  Genre
---------------------------  ----------------------------  ------  -----------------------
The Art of Thinking Clearly  Rolf Dobelli                  4.1     Non-fiction, Psychology
The Last Wish                Andrzej Sapkowski             4.5     Fantasy
Good Omens                   Neil Gaiman, Terry Pratchett  4.7     Fantasy, Comedy

knitr::kable(df_json, "simple")

title                        author                        rating  genre
---------------------------  ----------------------------  ------  -----------------------
The Art of Thinking Clearly  Rolf Dobelli                  4.1     Non-fiction, Psychology
The Last Wish                Andrzej Sapkowski             4.5     Fantasy
Good Omens                   Neil Gaiman, Terry Pratchett  4.7     Fantasy, Comedy

knitr::kable(df_xml, "simple")

Title                        Author                        Rating  Genre
---------------------------  ----------------------------  ------  -----------------------
The Art of Thinking Clearly  Rolf Dobelli                  4.1     Non-fiction, Psychology
The Last Wish                Andrzej Sapkowski             4.5     Fantasy
Good Omens                   Neil Gaiman, Terry Pratchett  4.7     Fantasy, Comedy

As we can see, the three tables hold the same rows and values and look essentially identical in a basic eye test.

Using identical()

Now we can use the functions identical() and all.equal() to see what R thinks of the data frames, starting with identical().

identical(df_html, df_xml)  
## [1] FALSE
identical(df_html, df_json) 
## [1] FALSE
identical(df_xml, df_json)  
## [1] FALSE

identical() returns FALSE for all three comparisons. Possible reasons include column data types that differ across the data frames, additional metadata or hidden attributes introduced by the different file types or by the libraries used to read them, or differences in characters or white space.
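To illustrate how strict identical() is, here is a minimal sketch with two toy data frames (not the book data) that print the same values but store them with different types. The helper describe_df is hypothetical; it simply collects the pieces that identical() is sensitive to: names, column classes and attributes.

```r
# Two toy data frames with the same printed values but different column types
a <- data.frame(Rating = c("4.1", "4.5"), stringsAsFactors = FALSE)
b <- data.frame(Rating = c(4.1, 4.5))

# identical() compares types and attributes as well as values
identical(a, b)

# Hypothetical helper: summarize what identical() is sensitive to
describe_df <- function(df) {
  list(
    names      = names(df),
    classes    = vapply(df, function(col) class(col)[1], character(1)),
    attributes = names(attributes(df))
  )
}

describe_df(a)$classes   # Rating is stored as character
describe_df(b)$classes   # Rating is stored as numeric
```

Comparing the output of describe_df() for two data frames is one way to narrow down which of the possible causes applies.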

Using all.equal()

We can also use the base R function all.equal(). It is more useful here because it describes the differences that it finds.

all.equal(df_html, df_xml)  
## [1] "Component \"Rating\": Modes: character, numeric"              
## [2] "Component \"Rating\": target is character, current is numeric"
all.equal(df_html, df_json) 
## [1] "Names: 4 string mismatches"                          
## [2] "Component 3: Modes: character, numeric"              
## [3] "Component 3: target is character, current is numeric"
all.equal(df_xml, df_json)  
## [1] "Names: 4 string mismatches"

Here we can see two kinds of differences: a column data type mismatch and string mismatches in the names.
The data type mismatch does not appear when comparing the XML and the JSON, so it must come from the third component (Rating) of the HTML data frame.
Likewise, the string mismatches do not appear between the HTML and the XML, so they must have been introduced with the JSON.
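One detail worth noting about all.equal(): when it finds differences it returns a character vector describing them rather than FALSE, so its result should not be used directly as a condition. A minimal sketch with toy data frames (not the book data) shows the usual isTRUE() wrapper.

```r
# Toy data frames that differ only in the case of their column names
x <- data.frame(Title = "Good Omens", Rating = 4.7, stringsAsFactors = FALSE)
y <- data.frame(title = "Good Omens", rating = 4.7, stringsAsFactors = FALSE)

result <- all.equal(x, y)

is.character(result)  # the differences come back as text, not as FALSE
isTRUE(result)        # a plain logical that is safe to use inside if()

if (!isTRUE(result)) {
  print(result)       # show what all.equal() found
}
```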

We can verify the data types of each data frame separately.

sapply(df_html, class)
##       Title      Author      Rating       Genre 
## "character" "character" "character" "character"
sapply(df_xml, class)
##       Title      Author      Rating       Genre 
## "character" "character"   "numeric" "character"
sapply(df_json, class)
##       title      author      rating       genre 
## "character" "character"   "numeric" "character"

We can now clearly see that the HTML import stored the Rating field as character, while the other two imports correctly treat it as numeric. Let's see what happens if we force it to be numeric.

df_html$Rating <- as.numeric(df_html$Rating)
all.equal(df_html, df_xml)  
## [1] TRUE
all.equal(df_html, df_json) 
## [1] "Names: 4 string mismatches"
all.equal(df_xml, df_json)  
## [1] "Names: 4 string mismatches"
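The conversion above targets one column by name. As an alternative sketch, base R's type.convert() can guess a suitable type for every column at once; this assumes a reasonably recent R version in which type.convert() accepts a data frame. The toy data frame below stands in for a table whose columns were all read as text.

```r
# Toy data frame where every column was read as character
scraped <- data.frame(
  Title  = c("The Last Wish", "Good Omens"),
  Rating = c("4.5", "4.7"),
  stringsAsFactors = FALSE
)

# as.is = TRUE keeps text columns as character instead of turning them into factors
converted <- type.convert(scraped, as.is = TRUE)

sapply(converted, class)
```

Converting a single named column, as done in this report, is more explicit; type.convert() is mainly convenient when many columns need the same treatment.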

It is at this point that I realized I had written the field names in all lowercase in the JSON file, while both of the other tables start each name with an uppercase letter. That is what causes the string mismatches. Let's correct for that.

colnames(df_json) <- c("Title", "Author", "Rating", "Genre")
all.equal(df_html, df_xml)  
## [1] TRUE
all.equal(df_html, df_json) 
## [1] TRUE
all.equal(df_xml, df_json) 
## [1] TRUE
identical(df_html, df_xml) 
## [1] TRUE
identical(df_html, df_json) 
## [1] TRUE
identical(df_xml, df_json) 
## [1] TRUE

Now all three data frames are equal according to both functions.
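The two fixes applied in this report, converting Rating to numeric and capitalizing the column names, could also be bundled into one function and applied to each data frame before comparing. The sketch below uses toy one-row data frames rather than the imported files, and normalize_books is a hypothetical helper name.

```r
# Hypothetical helper: apply both fixes from this report to one data frame
normalize_books <- function(df) {
  # Capitalize the first letter of every column name, e.g. "title" becomes "Title"
  names(df) <- paste0(toupper(substring(names(df), 1, 1)), substring(names(df), 2))
  # Ratings may arrive as text (HTML) or as numbers (XML, JSON)
  df$Rating <- as.numeric(df$Rating)
  df
}

# Toy stand-ins for the HTML and JSON imports
from_html <- data.frame(Title = "Good Omens", Rating = "4.7", stringsAsFactors = FALSE)
from_json <- data.frame(title = "Good Omens", rating = 4.7, stringsAsFactors = FALSE)

identical(normalize_books(from_html), normalize_books(from_json))
```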