This assignment will use the following packages:

library(bslib)
library(readr)
library(RCurl)
library(stringr)
library(dplyr)
library(tidyr)
library(tidyverse)
library(ggplot2)
library(knitr)
library(kableExtra)
library(xml2)
library(rvest)
library(jsonlite)

Overview

The purpose of this assignment is to work with HTML, XML, and JSON files in R. I created a dataset of four books, one of which has multiple authors, and recorded each book's release year, genres, page count, and copies sold worldwide. The source files are available on my GitHub page.

HTML

In this code block, htmlload reads the raw HTML file from the GitHub page.

htmlload <- read_html(url("https://raw.githubusercontent.com/spacerome/Data607_Assignment_5/refs/heads/main/books.html"))

This code block extracts the first table from the page and stores it as bookstable:

bookstable <- html_table(htmlload, fill=TRUE)[[1]]
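To make the extraction step easier to follow, here is a minimal sketch that runs html_table() on a small inline HTML string instead of the GitHub file. The two-row table and the names html_example and example_table are made up for illustration; the behaviour described in the comments assumes rvest 1.0.0 or later.

```r
library(rvest)

# Hypothetical inline HTML standing in for books.html (two rows only).
html_example <- paste0(
  "<table>",
  "<tr><th>Title</th><th>Pages</th></tr>",
  "<tr><td>The Book Thief</td><td>584</td></tr>",
  "<tr><td>Good Omens</td><td>412</td></tr>",
  "</table>"
)

page <- read_html(html_example)

# html_table() returns one table per <table> element; [[1]] picks the first.
example_table <- html_table(page)[[1]]

# Inspect the class and column types, since these matter later when
# the data frames are compared with identical().
class(example_table)
sapply(example_table, class)
```

Checking the class and column types here is useful because the parser may convert numeric-looking columns such as Pages, while the XML approach used later keeps everything as text.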

Lastly, the code block below displays the table using kable.

kable(bookstable, format = "markdown", col.names = c("Title", "Author", "Release Year", "Genres", "Pages", "Copies Sold Worldwide"))
| Title | Author | Release Year | Genres | Pages | Copies Sold Worldwide |
|:------|:-------|-------------:|:-------|------:|:----------------------|
| The Rising of the Shield Hero Volume 12 | Aneko Yusagi | 2018 | Fantasy, Adventure, Isekai | 360 | 3 million |
| The Eminence in Shadow, Vol. 4 | Daisuke Aizawa | 2021 | Action, Comedy, Isekai | 260 | 1 million |
| The Book Thief | Markus Zusak | 2005 | Historical Fiction | 584 | 16 million |
| Good Omens | Neil Gaiman, Terry Pratchett | 1990 | Fantasy, Comedy | 412 | 5 million |

XML

In this code block, xmlload reads the raw XML file from the GitHub page.

xmlload <- read_xml(url("https://raw.githubusercontent.com/spacerome/Data607_Assignment_5/refs/heads/main/books.xml"))

This code block uses xml_structure() to print the node structure of xmlload:

xml_structure(xmlload)
## <books>
##   <book>
##     <title>
##       {text}
##     <author>
##       {text}
##     <release_year>
##       {text}
##     <genres>
##       {text}
##     <pages>
##       {text}
##     <copies_sold_worldwide>
##       {text}
##   <book>
##     <title>
##       {text}
##     <author>
##       {text}
##     <release_year>
##       {text}
##     <genres>
##       {text}
##     <pages>
##       {text}
##     <copies_sold_worldwide>
##       {text}
##   <book>
##     <title>
##       {text}
##     <author>
##       {text}
##     <release_year>
##       {text}
##     <genres>
##       {text}
##     <pages>
##       {text}
##     <copies_sold_worldwide>
##       {text}
##   <book>
##     <title>
##       {text}
##     <author>
##       {text}
##     <release_year>
##       {text}
##     <genres>
##       {text}
##     <pages>
##       {text}
##     <copies_sold_worldwide>
##       {text}

We then extract the text of each node type and store the results in a data frame, books_df:

titles <- xml_text(xml_find_all(xmlload, "//title"))
authors <- xml_text(xml_find_all(xmlload, "//author"))
release_years <- xml_text(xml_find_all(xmlload, "//release_year"))
genres <- xml_text(xml_find_all(xmlload, "//genres"))
pages <- xml_text(xml_find_all(xmlload, "//pages"))
copies_sold <- xml_text(xml_find_all(xmlload, "//copies_sold_worldwide"))

books_df <- data.frame(
  Title = titles,
  Author = authors,
  Release_Year = release_years,
  Genres = genres,
  Pages = pages,
  Copies_Sold_Worldwide = copies_sold,
  stringsAsFactors = FALSE
)
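As an alternative sketch (not the approach used above), the columns can be built per book node rather than with one document-wide query per field. This keeps rows aligned even if a book were missing an element, and it allows numeric fields to be converted explicitly. The inline XML and the names xml_example, book_nodes, and example_df are hypothetical.

```r
library(xml2)

# Hypothetical inline XML standing in for books.xml (two fields only).
xml_example <- read_xml(paste0(
  "<books>",
  "<book><title>The Book Thief</title><pages>584</pages></book>",
  "<book><title>Good Omens</title><pages>412</pages></book>",
  "</books>"
))

# One node per book; the queries below are evaluated relative to each node.
book_nodes <- xml_find_all(xml_example, "//book")

# xml_find_first() returns exactly one result per book (a missing node if
# the element is absent), so every column has one entry per row.
example_df <- data.frame(
  Title = xml_text(xml_find_first(book_nodes, "./title")),
  Pages = as.integer(xml_text(xml_find_first(book_nodes, "./pages"))),
  stringsAsFactors = FALSE
)

sapply(example_df, class)
```

xml_text() always returns character values, so without a conversion such as as.integer() the numeric fields stay as text.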

Lastly, it will be displayed as a table using kable:

kable(books_df, format = "markdown", col.names = c("Title", "Author", "Release Year", "Genres", "Pages", "Copies Sold Worldwide"))
| Title | Author | Release Year | Genres | Pages | Copies Sold Worldwide |
|:------|:-------|:-------------|:-------|:------|:----------------------|
| The Rising of the Shield Hero Volume 12 | Aneko Yusagi | 2018 | Fantasy, Adventure, Isekai | 360 | 3 million |
| The Eminence in Shadow, Vol. 4 | Daisuke Aizawa | 2021 | Action, Comedy, Isekai | 260 | 1 million |
| The Book Thief | Markus Zusak | 2005 | Historical Fiction | 584 | 16 million |
| Good Omens | Neil Gaiman, Terry Pratchett | 1990 | Fantasy, Comedy | 412 | 5 million |

JSON

In this code block, jsonload reads and parses the raw JSON file from the GitHub page.

jsonload <- fromJSON(url("https://raw.githubusercontent.com/spacerome/Data607_Assignment_5/refs/heads/main/books.json"))

After loading the JSON file, we store its books element in books_json as a data frame.

books_json <- as.data.frame(jsonload$books)

Since the author and genres columns were loaded as lists and displayed with inconsistent spacing around the commas, I used paste() with its collapse argument to join each entry into a single comma-separated string.

books_json$author <- sapply(books_json$author, function(x) paste(x, collapse = ", "))

books_json$genres <- sapply(books_json$genres, function(x) paste(x, collapse = ", "))
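The sketch below illustrates why the collapse step is needed, using a small inline JSON string rather than the GitHub file. The JSON content and the names json_example and parsed are invented for illustration, and it assumes the authors are stored as JSON arrays of different lengths.

```r
library(jsonlite)

# Hypothetical inline JSON standing in for books.json.
json_example <- '{"books": [
  {"title": "The Book Thief", "author": ["Markus Zusak"]},
  {"title": "Good Omens", "author": ["Neil Gaiman", "Terry Pratchett"]}
]}'

parsed <- fromJSON(json_example)$books

# Arrays of differing lengths become a list column:
# one character vector per book.
is.list(parsed$author)

# vapply() is a stricter variant of sapply(): it fails loudly unless
# every book collapses to exactly one string.
parsed$author <- vapply(parsed$author, paste, character(1), collapse = ", ")

parsed$author
```

Using vapply() instead of sapply() is a design choice rather than a requirement; both produce the same character vector here.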

Lastly, I displayed the json table using kable.

kable(books_json, format = "markdown", col.names = c("Title", "Author", "Release Year", "Genres", "Pages", "Copies Sold Worldwide"))
| Title | Author | Release Year | Genres | Pages | Copies Sold Worldwide |
|:------|:-------|-------------:|:-------|------:|:----------------------|
| The Rising of the Shield Hero Volume 12 | Aneko Yusagi | 2018 | Fantasy, Adventure, Isekai | 360 | 3 million |
| The Eminence in Shadow, Vol. 4 | Daisuke Aizawa | 2021 | Action, Comedy, Isekai | 260 | 1 million |
| The Book Thief | Markus Zusak | 2005 | Historical Fiction | 584 | 16 million |
| Good Omens | Neil Gaiman, Terry Pratchett | 1990 | Fantasy, Comedy | 412 | 5 million |

Check if Dataframes are identical

The following code block checks each pair of data frames with identical():

is_html_xml_identical <- identical(bookstable, books_df)
is_html_json_identical <- identical(bookstable, books_json)
is_xml_json_identical <- identical(books_df, books_json)

is_html_json_identical
## [1] FALSE
is_html_xml_identical
## [1] FALSE
is_xml_json_identical
## [1] FALSE
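A result of FALSE does not say what differs. As a rough sketch, all.equal() and a per-column class check can point to the mismatch. The two single-row data frames a and b are hypothetical and only mimic the kind of differences that can arise between parsers.

```r
# Two hypothetical frames with the same visible content but
# different column names and column types.
a <- data.frame(Title = "Good Omens", Pages = 412L, stringsAsFactors = FALSE)
b <- data.frame(title = "Good Omens", pages = "412", stringsAsFactors = FALSE)

# identical() is strict: names, types, and attributes must all match.
identical(a, b)

# all.equal() returns TRUE or a character vector describing the differences.
all.equal(a, b)

# Comparing column classes side by side shows type mismatches directly.
sapply(a, class)
sapply(b, class)
```

Wrapping the call as isTRUE(all.equal(a, b)) gives a single logical value when one is needed in a condition.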

Conclusion

Even though the tables look similar, the data frames are not identical because HTML, XML, and JSON are parsed differently. In the JSON file, some fields were loaded as lists, and I had to collapse them into strings to make the tables look alike. One way to make the data frames comparable would be to convert every column to character with as.character() and strip stray whitespace with trimws(). Because identical() also compares attributes such as column names and classes, these would need to match as well. Note that all.equal() does not change the data; it tests for near equality and reports the differences it finds, which can help locate any remaining mismatches.
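The normalization idea can be sketched as a small helper. The function normalize_books() and the sample frames from_html and from_xml are hypothetical and were not part of the assignment; the helper assumes the columns are already in the same order.

```r
# Hypothetical helper: coerce to a plain data.frame, apply shared column
# names, and turn every column into trimmed character strings.
normalize_books <- function(df, col_names) {
  df <- as.data.frame(df, stringsAsFactors = FALSE)  # drops a tibble class if present
  names(df) <- col_names
  df[] <- lapply(df, function(col) trimws(as.character(col)))
  df
}

# Made-up frames that differ in names, types, and trailing whitespace.
from_html <- data.frame(Title = "Good Omens ", Pages = 412L, stringsAsFactors = FALSE)
from_xml  <- data.frame(title = "Good Omens", pages = "412", stringsAsFactors = FALSE)

cols <- c("Title", "Pages")

# After normalization, both frames share names, types, and values.
identical(normalize_books(from_html, cols), normalize_books(from_xml, cols))
```

Applied to this assignment, the same helper could be called on bookstable, books_df, and books_json with the six display column names, although whether the results match also depends on the cell values agreeing exactly.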