Data607: Working with XML and JSON in R

Overview

The purpose of this assignment is to work with HTML, XML, and JSON files in R. Each file describes the same four books, one of which has multiple authors, and records the title, author, release year, genres, page count, and copies sold worldwide. The source files are available on my GitHub page. The code below assumes the rvest, xml2, jsonlite, and knitr packages are loaded.

HTML

This code block reads the raw HTML file from GitHub into htmlload.

htmlload <- read_html(url("https://raw.githubusercontent.com/spacerome/Data607_Assignment_5/refs/heads/main/books.html"))

This code block extracts the first table from the page and stores it as bookstable:

bookstable <- html_table(htmlload, fill=TRUE)[[1]]
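
To see what html_table() returns without fetching anything, the sketch below parses a small inline table. The HTML snippet and the names demo_html and demo_table are made up for illustration and are not the contents of books.html. With rvest 1.0 or later the result should be a tibble, and purely numeric columns are converted from text by default, which matters later when the data frames are compared.

```r
library(rvest)

# Hypothetical two-row table standing in for books.html
demo_html <- read_html(
  '<table>
     <tr><th>Title</th><th>Release Year</th><th>Pages</th></tr>
     <tr><td>The Book Thief</td><td>2005</td><td>584</td></tr>
     <tr><td>Good Omens</td><td>1990</td><td>412</td></tr>
   </table>'
)

# html_table() on a document returns one data frame per <table>; take the first
demo_table <- html_table(demo_html)[[1]]

class(demo_table)          # class of the returned object
sapply(demo_table, class)  # column types after automatic conversion
```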

Lastly, the code block below displays the table using kable.

kable(bookstable, format = "markdown", col.names = c("Title", "Author", "Release Year", "Genres", "Pages", "Copies Sold Worldwide"))

Title	Author	Release Year	Genres	Pages	Copies Sold Worldwide
The Rising of the Shield Hero Volume 12	Aneko Yusagi	2018	Fantasy, Adventure, Isekai	360	3 million
The Eminence in Shadow, Vol. 4	Daisuke Aizawa	2021	Action, Comedy, Isekai	260	1 million
The Book Thief	Markus Zusak	2005	Historical Fiction	584	16 million
Good Omens	Neil Gaiman, Terry Pratchett	1990	Fantasy, Comedy	412	5 million

XML

This code block reads the raw XML file from GitHub into xmlload.

xmlload <- read_xml(url("https://raw.githubusercontent.com/spacerome/Data607_Assignment_5/refs/heads/main/books.xml"))

This code block uses xml_structure() to print the node hierarchy of xmlload:

xml_structure(xmlload)

## <books>
##   <book>
##     <title>
##       {text}
##     <author>
##       {text}
##     <release_year>
##       {text}
##     <genres>
##       {text}
##     <pages>
##       {text}
##     <copies_sold_worldwide>
##       {text}
##   <book>
##     <title>
##       {text}
##     <author>
##       {text}
##     <release_year>
##       {text}
##     <genres>
##       {text}
##     <pages>
##       {text}
##     <copies_sold_worldwide>
##       {text}
##   <book>
##     <title>
##       {text}
##     <author>
##       {text}
##     <release_year>
##       {text}
##     <genres>
##       {text}
##     <pages>
##       {text}
##     <copies_sold_worldwide>
##       {text}
##   <book>
##     <title>
##       {text}
##     <author>
##       {text}
##     <release_year>
##       {text}
##     <genres>
##       {text}
##     <pages>
##       {text}
##     <copies_sold_worldwide>
##       {text}

We then extract the text of each node and store the results in a data frame, books_df:

titles <- xml_text(xml_find_all(xmlload, "//title"))
authors <- xml_text(xml_find_all(xmlload, "//author"))
release_years <- xml_text(xml_find_all(xmlload, "//release_year"))
genres <- xml_text(xml_find_all(xmlload, "//genres"))
pages <- xml_text(xml_find_all(xmlload, "//pages"))
copies_sold <- xml_text(xml_find_all(xmlload, "//copies_sold_worldwide"))

books_df <- data.frame(
  Title = titles,
  Author = authors,
  Release_Year = release_years,
  Genres = genres,
  Pages = pages,
  Copies_Sold_Worldwide = copies_sold,
  stringsAsFactors = FALSE
)
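
The //title style queries above line up only when every book has every field. The sketch below shows a per-book alternative on a made-up inline document (demo_xml) in which the second book deliberately has no pages element. Because xml_find_first() returns one result per book node, a missing field should become NA instead of shifting the column.

```r
library(xml2)

# Hypothetical document; the second <book> deliberately lacks <pages>
demo_xml <- read_xml(
  '<books>
     <book><title>The Book Thief</title><pages>584</pages></book>
     <book><title>Good Omens</title></book>
   </books>'
)

book_nodes <- xml_find_all(demo_xml, "//book")

# One lookup per <book>, relative to that node
demo_df <- data.frame(
  Title = xml_text(xml_find_first(book_nodes, "./title")),
  Pages = xml_text(xml_find_first(book_nodes, "./pages")),
  stringsAsFactors = FALSE
)

demo_df
```

For the four books in this assignment every field is present, so both approaches should give the same data frame.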

Lastly, the data frame is displayed as a table using kable:

kable(books_df, format = "markdown", col.names = c("Title", "Author", "Release Year", "Genres", "Pages", "Copies Sold Worldwide"))

Title	Author	Release Year	Genres	Pages	Copies Sold Worldwide
The Rising of the Shield Hero Volume 12	Aneko Yusagi	2018	Fantasy, Adventure, Isekai	360	3 million
The Eminence in Shadow, Vol. 4	Daisuke Aizawa	2021	Action, Comedy, Isekai	260	1 million
The Book Thief	Markus Zusak	2005	Historical Fiction	584	16 million
Good Omens	Neil Gaiman, Terry Pratchett	1990	Fantasy, Comedy	412	5 million

JSON

This code block reads the raw JSON file from GitHub into jsonload.

jsonload <- fromJSON(url("https://raw.githubusercontent.com/spacerome/Data607_Assignment_5/refs/heads/main/books.json"))

After loading the JSON file, we store its books element in books_json as a data frame:

books_json <- as.data.frame(jsonload$books)

The author and genres fields come through as lists rather than single strings, so their commas and spacing did not match the other tables. I used paste() with the collapse argument to join each entry into one comma-separated string.

books_json$author <- sapply(books_json$author, function(x) paste(x, collapse = ", "))

books_json$genres <- sapply(books_json$genres, function(x) paste(x, collapse = ", "))
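
As a small illustration of the collapse step, the sketch below builds a list column by hand, mimicking how a JSON array of authors is typically parsed. demo_authors is made-up data, not the contents of books.json. vapply() is used here instead of sapply() only so that the result is guaranteed to be a character vector.

```r
# Hypothetical list column: one character vector per book
demo_authors <- list(
  "Markus Zusak",
  c("Neil Gaiman", "Terry Pratchett")
)

# Join each book's authors into a single comma-separated string
collapsed <- vapply(
  demo_authors,
  function(x) paste(x, collapse = ", "),
  character(1)
)

collapsed
```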

Lastly, I display the JSON table using kable.

kable(books_json, format = "markdown", col.names = c("Title", "Author", "Release Year", "Genres", "Pages", "Copies Sold Worldwide"))

Title	Author	Release Year	Genres	Pages	Copies Sold Worldwide
The Rising of the Shield Hero Volume 12	Aneko Yusagi	2018	Fantasy, Adventure, Isekai	360	3 million
The Eminence in Shadow, Vol. 4	Daisuke Aizawa	2021	Action, Comedy, Isekai	260	1 million
The Book Thief	Markus Zusak	2005	Historical Fiction	584	16 million
Good Omens	Neil Gaiman, Terry Pratchett	1990	Fantasy, Comedy	412	5 million

Check if Dataframes are identical

The following code block checks whether the three data frames are identical:

is_html_xml_identical <- identical(bookstable, books_df)
is_html_json_identical <- identical(bookstable, books_json)
is_xml_json_identical <- identical(books_df, books_json)

is_html_json_identical

## [1] FALSE

is_html_xml_identical

## [1] FALSE

is_xml_json_identical

## [1] FALSE
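
A bare FALSE does not say where two data frames differ. The sketch below uses two made-up data frames (df_a and df_b), not the ones above, to show how all.equal() and a column type check can point at the cause.

```r
# Two hypothetical data frames with the same values but different column types
df_a <- data.frame(Title = "Good Omens", Pages = 412L,
                   stringsAsFactors = FALSE)
df_b <- data.frame(Title = "Good Omens", Pages = "412",
                   stringsAsFactors = FALSE)

# identical() is strict: Pages is integer in one and character in the other
identical(df_a, df_b)

# all.equal() returns TRUE or a character vector describing the differences
all.equal(df_a, df_b)

# Compare column types side by side
rbind(a = sapply(df_a, class), b = sapply(df_b, class))
```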

Conclusion

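Even though the tables look the same when printed, the data frames are not identical. identical() compares the class of the object, the column names, and the column types as well as the values, and these differ between the HTML, XML, and JSON imports. For example, the XML columns are all read as character strings, and in the JSON import the author and genres fields were stored as lists that I had to collapse into strings to make the tables look alike. One way to bring the data frames into agreement would be to convert every column to character with as.character(), strip stray whitespace with trimws(), and give all three the same column names. The all.equal() function may also be more informative than identical() for this check, since it describes the differences it finds instead of returning only FALSE.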
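
The normalisation described above could look roughly like this. normalize_books() is a hypothetical helper, and the two inputs are made-up stand-ins for the real data frames. The sketch assumes the columns are already in the same order and that any list columns have been collapsed to strings beforehand.

```r
# Hypothetical helper: plain data.frame, trimmed character columns, common names
normalize_books <- function(df, col_names) {
  out <- as.data.frame(
    lapply(df, function(col) trimws(as.character(col))),
    stringsAsFactors = FALSE
  )
  names(out) <- col_names
  out
}

cols <- c("Title", "Pages")

# Made-up inputs that differ only in names, types, and whitespace
from_html <- data.frame(Title = "Good Omens", Pages = 412L,
                        stringsAsFactors = FALSE)
from_xml  <- data.frame(title = " Good Omens", pages = "412",
                        stringsAsFactors = FALSE)

identical(normalize_books(from_html, cols),
          normalize_books(from_xml, cols))
```

Converting everything to character is the bluntest option. Converting the numeric columns to a shared numeric type instead would keep them usable for calculations.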