library(xml2)
library(jsonlite)
library(XML)
library(dplyr)
In this assignment we look at three books:
The Art of Thinking Clearly,
The Last Wish,
Good Omens.
We manually create an HTML, an XML, and a JSON table of these books and import each one into this Rmd file to compare the resulting data frames.
# Reading the HTML straight from GitHub failed with "XML content does not seem to be XML: ''",
# most likely because the XML package cannot fetch https URLs itself, so a local copy is used instead.
# html_file <- "https://raw.githubusercontent.com/jerryjerald27/Data-607/refs/heads/main/Week7Assignment/books.html"
html_file <- "books.html"
html_data <- readHTMLTable(html_file, stringsAsFactors = FALSE)
df_html <- html_data[[1]]
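The HTML table could alternatively be parsed with xml2, which is already loaded and is used for the XML file over https just below. The following is only a sketch on an inline HTML string (the string and the `df_demo` name are made up for illustration), mirroring the XPath approach used for the XML file; it is not what this report actually runs.

```r
library(xml2)

# Illustrative inline table; in practice this would be the path or URL of books.html.
html_text <- '<table>
  <tr><th>Title</th><th>Rating</th></tr>
  <tr><td>The Last Wish</td><td>4.5</td></tr>
  <tr><td>Good Omens</td><td>4.7</td></tr>
</table>'

doc <- read_html(html_text)

# Header cells give the column names; rows containing <td> cells give the data.
header <- xml_text(xml_find_all(doc, "//table//th"))
rows   <- xml_find_all(doc, "//table//tr[td]")
cells  <- lapply(rows, function(r) xml_text(xml_find_all(r, "./td")))

# Bind the per-row character vectors into a data frame and name the columns.
df_demo <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
names(df_demo) <- header

# HTML cells are always text, so numeric columns must be converted explicitly.
df_demo$Rating <- as.numeric(df_demo$Rating)
```

Because every HTML cell arrives as text, the explicit conversion of `Rating` is still needed with this approach.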
xml_file <- "https://raw.githubusercontent.com/jerryjerald27/Data-607/refs/heads/main/Week7Assignment/books.xml"
xml_data <- read_xml(xml_file)
df_xml <- data.frame(
Title = xml_text(xml_find_all(xml_data, "//book/title")),
Author = xml_text(xml_find_all(xml_data, "//book/author")),
Rating = as.numeric(xml_text(xml_find_all(xml_data, "//book/rating"))),
Genre = xml_text(xml_find_all(xml_data, "//book/genre")),
stringsAsFactors = FALSE
)
json_file <- "https://raw.githubusercontent.com/jerryjerald27/Data-607/refs/heads/main/Week7Assignment/books.json"
json_data <- fromJSON(json_file)
df_json <- as.data.frame(json_data$books)
Now we compare the three data frames. We first display them with knitr::kable() to see how they look.
knitr::kable(df_html, format = "simple")
| Title | Author | Rating | Genre |
|---|---|---|---|
| The Art of Thinking Clearly | Rolf Dobelli | 4.1 | Non-fiction, Psychology |
| The Last Wish | Andrzej Sapkowski | 4.5 | Fantasy |
| Good Omens | Neil Gaiman, Terry Pratchett | 4.7 | Fantasy, Comedy |
knitr::kable(df_json, format = "simple")
| title | author | rating | genre |
|---|---|---|---|
| The Art of Thinking Clearly | Rolf Dobelli | 4.1 | Non-fiction, Psychology |
| The Last Wish | Andrzej Sapkowski | 4.5 | Fantasy |
| Good Omens | Neil Gaiman, Terry Pratchett | 4.7 | Fantasy, Comedy |
knitr::kable(df_xml, format = "simple")
| Title | Author | Rating | Genre |
|---|---|---|---|
| The Art of Thinking Clearly | Rolf Dobelli | 4.1 | Non-fiction, Psychology |
| The Last Wish | Andrzej Sapkowski | 4.5 | Fantasy |
| Good Omens | Neil Gaiman, Terry Pratchett | 4.7 | Fantasy, Comedy |
As we can see, the three tables look identical at a glance.
Next we use the functions identical() and all.equal() to see what R makes of the data frames.
identical(df_html, df_xml)
## [1] FALSE
identical(df_html, df_json)
## [1] FALSE
identical(df_xml, df_json)
## [1] FALSE
identical() returns FALSE for every comparison. Possible causes include inconsistent data types across the data frames, extra metadata or hidden attributes introduced by the file formats or by the libraries used to read them, and differences in characters or whitespace.
We can also use the base R function all.equal(). It is more useful here because it reports the differences it finds.
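To make the strictness of identical() concrete, here is a small sketch with toy data frames (the values and the `source` attribute name are made up for illustration). Both frames print the same way, yet identical() rejects them.

```r
# Two toy data frames that print the same but store Rating differently.
a <- data.frame(Title = "Good Omens", Rating = "4.7", stringsAsFactors = FALSE)
b <- data.frame(Title = "Good Omens", Rating = 4.7, stringsAsFactors = FALSE)

print(a)
print(b)

# Character vs. numeric column: identical() compares types, not printed values.
same_type <- identical(a, b)

# An extra attribute alone is also enough to break identical().
b2 <- b
attr(b2, "source") <- "json"   # hypothetical attribute, for illustration only
same_attr <- identical(b, b2)

print(c(same_type = same_type, same_attr = same_attr))
```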
all.equal(df_html, df_xml)
## [1] "Component \"Rating\": Modes: character, numeric"
## [2] "Component \"Rating\": target is character, current is numeric"
all.equal(df_html, df_json)
## [1] "Names: 4 string mismatches"
## [2] "Component 3: Modes: character, numeric"
## [3] "Component 3: target is character, current is numeric"
all.equal(df_xml, df_json)
## [1] "Names: 4 string mismatches"
Here we can see two kinds of difference: a column data type mismatch and
mismatched column names.
The data type issue does not appear when comparing the XML and the
JSON, so it must come from component 3 (the Rating column) of the HTML table.
Likewise, the name mismatch does not appear between the HTML and
the XML, so it must have been introduced by the JSON.
We can verify the data types separately:
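One detail worth illustrating: all.equal() returns either TRUE or a character vector describing the differences, so its result should be wrapped in isTRUE() before being used as a condition. The sketch below uses toy data frames (not the book tables themselves) that differ in both column names and column type.

```r
x <- data.frame(Title = "The Last Wish", Rating = "4.5", stringsAsFactors = FALSE)
y <- data.frame(title = "The Last Wish", rating = 4.5, stringsAsFactors = FALSE)

# A character vector of differences, not FALSE, when the frames differ.
report <- all.equal(x, y)
print(report)

# Wrap the result in isTRUE() before branching on it.
if (isTRUE(report)) {
  message("data frames match")
} else {
  message("found ", length(report), " difference(s)")
}

# Comparing the names case-insensitively isolates the naming problem.
names_match <- identical(tolower(names(x)), tolower(names(y)))
print(names_match)
```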
sapply(df_html, class)
## Title Author Rating Genre
## "character" "character" "character" "character"
sapply(df_xml, class)
## Title Author Rating Genre
## "character" "character" "numeric" "character"
sapply(df_json, class)
## title author rating genre
## "character" "character" "numeric" "character"
We can now clearly see that the HTML import stored the Rating field as character, while the other two imports correctly parsed it as numeric. Let's see what happens if we coerce it to numeric.
df_html$Rating <- as.numeric(df_html$Rating)
all.equal(df_html, df_xml)
## [1] TRUE
all.equal(df_html, df_json)
## [1] "Names: 4 string mismatches"
all.equal(df_xml, df_json)
## [1] "Names: 4 string mismatches"
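As an alternative to converting one column by hand, base R's type.convert() can guess a type for every column at once. This is only a sketch on a toy data frame standing in for the all-character table that the HTML import produced; `df_demo` and `df_converted` are illustrative names.

```r
# Toy stand-in for an imported HTML table: every column is character.
df_demo <- data.frame(
  Title  = c("The Last Wish", "Good Omens"),
  Rating = c("4.5", "4.7"),
  stringsAsFactors = FALSE
)

# type.convert() picks a type per column; as.is = TRUE keeps text as character
# instead of turning it into factors.
df_converted <- type.convert(df_demo, as.is = TRUE)

sapply(df_converted, class)
```

This scales better than per-column as.numeric() calls when a table has many numeric columns, at the cost of relying on R's guesses.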
At this point I realized that I had written the field names in all lowercase in the JSON, while the other two tables capitalize the first letter, which causes the string mismatches. Let's correct that.
colnames(df_json) <- c("Title", "Author", "Rating", "Genre")
all.equal(df_html, df_xml)
## [1] TRUE
all.equal(df_html, df_json)
## [1] TRUE
all.equal(df_xml, df_json)
## [1] TRUE
identical(df_html, df_xml)
## [1] TRUE
identical(df_html, df_json)
## [1] TRUE
identical(df_xml, df_json)
## [1] TRUE
Now all three data frames are equal according to both functions.
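The manual renaming of the JSON columns could also be expressed as a small reusable helper. The function below, `capitalize_names()`, is a hypothetical name introduced here for illustration; it capitalizes the first letter of each column name so that names such as `title` and `Title` line up.

```r
# Hypothetical helper: capitalize the first letter of every column name.
capitalize_names <- function(df) {
  nm <- names(df)
  names(df) <- paste0(toupper(substring(nm, 1, 1)), substring(nm, 2))
  df
}

# Toy data frame with lowercase names, like the JSON import.
df_demo <- data.frame(title = "Good Omens", rating = 4.7, stringsAsFactors = FALSE)

df_fixed <- capitalize_names(df_demo)
names(df_fixed)
```

Unlike assigning a fixed vector of names, this does not depend on the number or order of the columns.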