This assignment will use the following packages:
library(bslib)
library(readr)
library(RCurl)
library(stringr)
library(dplyr)
library(tidyr)
library(tidyverse)
library(ggplot2)
library(knitr)
library(kableExtra)
library(xml2)
library(rvest)
library(jsonlite)
The purpose of this assignment is to work with HTML, XML, and JSON files in R. I selected four books, one of which has multiple authors, and recorded additional details for each: release year, copies sold worldwide, page count, and genres. The source files are available on my GitHub page.
The code block below reads the raw HTML file from the GitHub page and stores the parsed document in htmlload.
htmlload <- read_html(url("https://raw.githubusercontent.com/spacerome/Data607_Assignment_5/refs/heads/main/books.html"))
This code block extracts the first table from the page and stores it as bookstable:
bookstable <- html_table(htmlload, fill=TRUE)[[1]]
Lastly, the code block below displays the table using kable().
kable(bookstable, format = "markdown", col.names = c("Title", "Author", "Release Year", "Genres", "Pages", "Copies Sold Worldwide"))
Title | Author | Release Year | Genres | Pages | Copies Sold Worldwide |
---|---|---|---|---|---|
The Rising of the Shield Hero Volume 12 | Aneko Yusagi | 2018 | Fantasy, Adventure, Isekai | 360 | 3 million |
The Eminence in Shadow, Vol. 4 | Daisuke Aizawa | 2021 | Action, Comedy, Isekai | 260 | 1 million |
The Book Thief | Markus Zusak | 2005 | Historical Fiction | 584 | 16 million |
Good Omens | Neil Gaiman, Terry Pratchett | 1990 | Fantasy, Comedy | 412 | 5 million |
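The same read-and-extract steps can be tried without a network connection. The sketch below is only an illustration: the inline HTML string and the names demo_page and demo_books are hypothetical stand-ins, not the contents of books.html.

```r
library(rvest)

# Hypothetical inline HTML standing in for books.html (two rows, three columns)
html_demo <- '<table>
  <tr><th>Title</th><th>Author</th><th>Pages</th></tr>
  <tr><td>The Book Thief</td><td>Markus Zusak</td><td>584</td></tr>
  <tr><td>Good Omens</td><td>Neil Gaiman, Terry Pratchett</td><td>412</td></tr>
</table>'

# read_html() accepts a literal HTML string as well as a URL or file path
demo_page <- read_html(html_demo)

# html_table() returns a list with one entry per <table>; take the first
demo_books <- html_table(demo_page)[[1]]

# By default html_table() converts columns, so Pages is parsed as a number
str(demo_books)
```

Because the Pages column is converted to a numeric type here, a table read this way will not have the same column types as one built purely from text values.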
The code block below reads the raw XML file from the GitHub page and stores the parsed document in xmlload.
xmlload <- read_xml(url("https://raw.githubusercontent.com/spacerome/Data607_Assignment_5/refs/heads/main/books.xml"))
This code block uses xml_structure() to print the node hierarchy of xmlload:
xml_structure(xmlload)
## <books>
##   <book>
##     <title>
##       {text}
##     <author>
##       {text}
##     <release_year>
##       {text}
##     <genres>
##       {text}
##     <pages>
##       {text}
##     <copies_sold_worldwide>
##       {text}
##   <book>
##     <title>
##       {text}
##     <author>
##       {text}
##     <release_year>
##       {text}
##     <genres>
##       {text}
##     <pages>
##       {text}
##     <copies_sold_worldwide>
##       {text}
##   <book>
##     <title>
##       {text}
##     <author>
##       {text}
##     <release_year>
##       {text}
##     <genres>
##       {text}
##     <pages>
##       {text}
##     <copies_sold_worldwide>
##       {text}
##   <book>
##     <title>
##       {text}
##     <author>
##       {text}
##     <release_year>
##       {text}
##     <genres>
##       {text}
##     <pages>
##       {text}
##     <copies_sold_worldwide>
##       {text}
We then extract the text of each node type and store the results in a data frame, books_df:
titles <- xml_text(xml_find_all(xmlload, "//title"))
authors <- xml_text(xml_find_all(xmlload, "//author"))
release_years <- xml_text(xml_find_all(xmlload, "//release_year"))
genres <- xml_text(xml_find_all(xmlload, "//genres"))
pages <- xml_text(xml_find_all(xmlload, "//pages"))
copies_sold <- xml_text(xml_find_all(xmlload, "//copies_sold_worldwide"))
books_df <- data.frame(
  Title = titles,
  Author = authors,
  Release_Year = release_years,
  Genres = genres,
  Pages = pages,
  Copies_Sold_Worldwide = copies_sold,
  stringsAsFactors = FALSE
)
Lastly, it will be displayed as a table using kable:
kable(books_df, format = "markdown", col.names = c("Title", "Author", "Release Year", "Genres", "Pages", "Copies Sold Worldwide"))
Title | Author | Release Year | Genres | Pages | Copies Sold Worldwide |
---|---|---|---|---|---|
The Rising of the Shield Hero Volume 12 | Aneko Yusagi | 2018 | Fantasy, Adventure, Isekai | 360 | 3 million |
The Eminence in Shadow, Vol. 4 | Daisuke Aizawa | 2021 | Action, Comedy, Isekai | 260 | 1 million |
The Book Thief | Markus Zusak | 2005 | Historical Fiction | 584 | 16 million |
Good Omens | Neil Gaiman, Terry Pratchett | 1990 | Fantasy, Comedy | 412 | 5 million |
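Collecting each field with a document-wide XPath such as "//title" works when every book has every field. As a hedged alternative, the sketch below extracts fields one book at a time, so a missing child becomes NA instead of shifting the column. The inline XML and the helper name field are hypothetical and are not part of the original files.

```r
library(xml2)

# Hypothetical inline XML standing in for books.xml; the second book
# deliberately omits <pages> to show how a missing field is handled
xml_demo <- read_xml('<books>
  <book>
    <title>The Book Thief</title>
    <author>Markus Zusak</author>
    <pages>584</pages>
  </book>
  <book>
    <title>Good Omens</title>
    <author>Neil Gaiman, Terry Pratchett</author>
  </book>
</books>')

book_nodes <- xml_find_all(xml_demo, "//book")

# xml_find_first() returns exactly one result per <book>, and xml_text()
# yields NA for a missing child, so all columns keep the same length
field <- function(name) {
  xml_text(xml_find_first(book_nodes, paste0("./", name)))
}

demo_df <- data.frame(
  Title  = field("title"),
  Author = field("author"),
  Pages  = field("pages"),
  stringsAsFactors = FALSE
)

demo_df
```

Note that xml_text() always returns character values, so Pages stays a string here unless it is converted explicitly.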
The code block below reads the raw JSON file from the GitHub page and stores the parsed result in jsonload.
jsonload <- fromJSON(url("https://raw.githubusercontent.com/spacerome/Data607_Assignment_5/refs/heads/main/books.json"))
After loading the JSON file, we store its books element in books_json as a data frame:
books_json <- as.data.frame(jsonload$books)
Because the author and genres values were loaded as lists rather than single strings, I used paste() with the collapse argument to join each entry into one comma-separated string.
books_json$author <- sapply(books_json$author, function(x) paste(x, collapse = ", "))
books_json$genres <- sapply(books_json$genres, function(x) paste(x, collapse = ", "))
Lastly, I displayed the json table using kable.
kable(books_json, format = "markdown", col.names = c("Title", "Author", "Release Year", "Genres", "Pages", "Copies Sold Worldwide"))
Title | Author | Release Year | Genres | Pages | Copies Sold Worldwide |
---|---|---|---|---|---|
The Rising of the Shield Hero Volume 12 | Aneko Yusagi | 2018 | Fantasy, Adventure, Isekai | 360 | 3 million |
The Eminence in Shadow, Vol. 4 | Daisuke Aizawa | 2021 | Action, Comedy, Isekai | 260 | 1 million |
The Book Thief | Markus Zusak | 2005 | Historical Fiction | 584 | 16 million |
Good Omens | Neil Gaiman, Terry Pratchett | 1990 | Fantasy, Comedy | 412 | 5 million |
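To make the list-column behaviour concrete, here is a small sketch with a hypothetical inline JSON string (not the contents of books.json). It assumes the multi-valued field is stored as a JSON array, and it uses vapply() as a type-checked alternative to sapply().

```r
library(jsonlite)

# Hypothetical inline JSON; author is an array so that one book can list two names
json_demo <- '{"books": [
  {"title": "The Book Thief", "author": ["Markus Zusak"], "pages": 584},
  {"title": "Good Omens", "author": ["Neil Gaiman", "Terry Pratchett"], "pages": 412}
]}'

demo_json <- fromJSON(json_demo)$books

# Arrays of different lengths cannot be simplified to a vector,
# so the author column arrives as a list with one character vector per row
is.list(demo_json$author)

# Join each row's vector into a single string; vapply() guarantees
# that the result is a character vector with one element per row
demo_json$author <- vapply(demo_json$author, paste, character(1), collapse = ", ")

demo_json
```

After the collapse step the author column is an ordinary character vector, which is what kable() needs to print one cell per book.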
The following code block checks whether the three data frames are identical:
is_html_xml_identical <- identical(bookstable, books_df)
is_html_json_identical <- identical(bookstable, books_json)
is_xml_json_identical <- identical(books_df, books_json)
is_html_json_identical
## [1] FALSE
is_html_xml_identical
## [1] FALSE
is_xml_json_identical
## [1] FALSE
Even though the tables look the same, the data frames are not identical, because each format is parsed differently. html_table() tries to convert columns to numeric types, xml_text() returns every value as character, and the JSON file had some fields stored as lists that I had to collapse into strings. identical() also compares column names and attributes, so any difference there yields FALSE. One way to make the data frames match would be to give them the same column names, convert all columns to character with as.character(), and strip stray whitespace with trimws(). The all.equal() function is also useful here: it does not change the data frames, but it reports what differs instead of only returning FALSE.
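The normalization idea can be sketched as follows. This is an illustration under stated assumptions rather than part of the original analysis: normalize_books is a hypothetical helper, and the two small data frames are stand-ins that differ in column names, types, and whitespace.

```r
# Hypothetical helper: coerce a table-like object to a plain data frame
# of trimmed character columns with a fixed set of column names
normalize_books <- function(df, col_names) {
  df <- as.data.frame(df, stringsAsFactors = FALSE)
  df[] <- lapply(df, function(col) trimws(as.character(col)))
  names(df) <- col_names
  rownames(df) <- NULL
  df
}

# Stand-ins for two of the parsed tables: different names, types, and spacing
from_html <- data.frame(Title = "Good Omens", Pages = 412L,
                        stringsAsFactors = FALSE)
from_xml  <- data.frame(title = " Good Omens", pages = "412",
                        stringsAsFactors = FALSE)

# Before normalization the comparison fails
identical(from_html, from_xml)

cols <- c("Title", "Pages")
html_norm <- normalize_books(from_html, cols)
xml_norm  <- normalize_books(from_xml, cols)

# After normalization both checks can be repeated
identical(html_norm, xml_norm)
isTRUE(all.equal(html_norm, xml_norm))
```

Wrapping all.equal() in isTRUE() matters because all.equal() returns a character vector describing the differences when the objects do not match, rather than FALSE.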