Before importing the HTML, XML, and JSON files, we had to install several packages, which also required setting a CRAN mirror manually. Once they were installed, we were able to load them.
We load each file in a similar way, but the data is structured differently in each. The HTML file is parsed into a document with “head” and “body” nodes, mirroring how content is organized on a webpage. The XML file is parsed into a single root element whose nested tags and attributes hold the records, so it reads more like structured data. Lastly, the JSON file becomes a list once it is parsed, reflecting its hierarchical key-value format.
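The setup step might look like the sketch below. The package names (rvest, xml2, jsonlite) are an assumption inferred from the functions used later in this report, the mirror URL is one common choice rather than necessarily the one we used, and `missing_packages()` is a hypothetical helper.

```r
# Assumed package list, inferred from read_html(), read_xml(), and fromJSON()
needed <- c("rvest", "xml2", "jsonlite")

# Set a CRAN mirror explicitly so install.packages() does not prompt for one
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Hypothetical helper: return the packages that are not installed yet
missing_packages <- function(pkgs) {
  pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
}

# Install only what is missing, then attach everything
to_install <- missing_packages(needed)
if (length(to_install) > 0) install.packages(to_install)
invisible(lapply(needed, library, character.only = TRUE))
```

Checking for missing packages first avoids reinstalling them every time the document is knitted.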
books.html <- read_html("https://raw.githubusercontent.com/crystaliquezada/week7_data607/refs/heads/main/Books.html") #read html file
books.html #view
## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8 ...
## [2] <body><div class="ritz grid-container" dir="ltr"><table class="waffle" ce ...
books.xml <- "https://raw.githubusercontent.com/crystaliquezada/week7_data607/refs/heads/main/Books.xml"
readbooksxml <- read_xml(books.xml) #read xml file
readbooksxml #view
## {xml_document}
## <data>
## [1] <Sheet1>\n <Harry_Potter_and_the_Philosopher_s_Stone Author="J.K Rowling ...
json.url <- "https://raw.githubusercontent.com/crystaliquezada/week7_data607/refs/heads/main/Books.json" #store the JSON URL (parsed later with fromJSON)
json.url #view
## [1] "https://raw.githubusercontent.com/crystaliquezada/week7_data607/refs/heads/main/Books.json"
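At this point `json.url` only holds the address; nothing has been parsed yet. To illustrate the list structure that JSON produces, the sketch below parses a small inline string with `jsonlite::fromJSON()`. The sample string is an assumption about the file's shape, inferred from the results further down, and is not a copy of Books.json.

```r
library(jsonlite)

# Assumed shape of the file: one "Sheet1" object keyed by book title
sample_json <- '{
  "Sheet1": {
    "Sula": {"Author": "Toni Morrison", "Chapters": 11, "Pages": 192},
    "The Stranger": {"Author": "Albert Camus", "Chapters": 11, "Pages": 123}
  }
}'

parsed <- fromJSON(sample_json)

class(parsed)              # a plain R list
names(parsed$Sheet1)       # the book titles are the keys
parsed$Sheet1$Sula$Author  # nested values are reached with $
```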
Transforming each file into a data frame is another challenge. Because each format is structured differently, each file needs its own approach.
books_df <- books.html %>%
  html_table(fill = TRUE) %>% #parse every <table> into a tibble
  .[[1]] #keep the first table
print(books_df)
## # A tibble: 4 × 5
##      `` A                                        B             C        D
##   <int> <chr>                                    <chr>         <chr>    <chr>
## 1     1 Books                                    Author        Chapters Pages
## 2     2 Harry Potter and the Philosopher's Stone J.K Rowling   17       309
## 3     3 Sula                                     Toni Morrison 11       192
## 4     4 The Stranger                             Albert Camus  11       123
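In the HTML result the real column names sit in the first data row, and the first column is only a spreadsheet row index. One possible cleanup, sketched here on a hand-built copy of the table above (the object names `raw` and `clean` are hypothetical), promotes that row to the header and converts the counts to integers.

```r
# Hand-built copy of the table printed above, so the sketch runs offline
raw <- data.frame(
  idx = 1:4,
  A = c("Books", "Harry Potter and the Philosopher's Stone", "Sula", "The Stranger"),
  B = c("Author", "J.K Rowling", "Toni Morrison", "Albert Camus"),
  C = c("Chapters", "17", "11", "11"),
  D = c("Pages", "309", "192", "123"),
  stringsAsFactors = FALSE
)

# Drop the index column and the header row, then reuse that row as names
clean <- raw[-1, -1]
names(clean) <- as.character(unlist(raw[1, -1]))
rownames(clean) <- NULL

# The counts were read as text because they shared a column with the header
clean$Chapters <- as.integer(clean$Chapters)
clean$Pages <- as.integer(clean$Pages)

print(clean)
```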
xml_df <- xml_find_all(readbooksxml, ".//Sheet1/*") #select each book element under Sheet1
xml_df_nested <- data.frame(
  Book = xml_name(xml_df), #the element name holds the title
  Author = xml_attr(xml_df, "Author"), #the other fields are attributes
  Chapters = as.numeric(xml_attr(xml_df, "Chapters")),
  Pages = as.numeric(xml_attr(xml_df, "Pages")),
  stringsAsFactors = FALSE
)
print(xml_df_nested)
##                                       Book        Author Chapters Pages
## 1 Harry_Potter_and_the_Philosopher_s_Stone   J.K Rowling       17   309
## 2                                     Sula Toni Morrison       11   192
## 3                             The_Stranger  Albert Camus       11   123
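Because XML element names cannot contain spaces, the titles come back with underscores. A minimal sketch of one way to tidy them is shown below on an inline sample; the sample markup is an assumption about the file's layout (book elements under `Sheet1`, fields stored as attributes). Note that this cannot restore the apostrophe in "Philosopher's", which was also replaced by an underscore.

```r
library(xml2)

# Assumed layout: each book is an element under <Sheet1>, fields are attributes
sample_xml <- '<data><Sheet1>
  <Sula Author="Toni Morrison" Chapters="11" Pages="192"/>
  <The_Stranger Author="Albert Camus" Chapters="11" Pages="123"/>
</Sheet1></data>'

nodes <- xml_find_all(read_xml(sample_xml), ".//Sheet1/*")

xml_books <- data.frame(
  Book = gsub("_", " ", xml_name(nodes), fixed = TRUE), # underscores back to spaces
  Author = xml_attr(nodes, "Author"),
  Chapters = as.numeric(xml_attr(nodes, "Chapters")),
  Pages = as.numeric(xml_attr(nodes, "Pages")),
  stringsAsFactors = FALSE
)

print(xml_books)
```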
books_json <- fromJSON(json.url) #download and parse the JSON file
json_df <- as.data.frame(do.call(rbind, lapply(books_json$Sheet1, as.data.frame))) #one row per book; a separate name keeps the HTML books_df intact
colnames(json_df) <- c("Author", "Chapters", "Pages")
print(json_df)
##                                                 Author Chapters Pages
## Harry Potter and the Philosopher's Stone   J.K Rowling       17   309
## Sula                                     Toni Morrison       11   192
## The Stranger                              Albert Camus       11   123
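In the JSON result the titles end up as row names rather than as a column, unlike the other two data frames. A possible follow-up step, sketched on a hand-built list that mimics the assumed shape of `books_json$Sheet1`, moves them into a `Book` column so all three tables share the same layout.

```r
# Hand-built stand-in for books_json$Sheet1 (assumed shape), so this runs offline
sheet <- list(
  "Sula" = list(Author = "Toni Morrison", Chapters = 11, Pages = 192),
  "The Stranger" = list(Author = "Albert Camus", Chapters = 11, Pages = 123)
)

# Same stacking step as in the report: one single-row data frame per book
stacked <- do.call(rbind, lapply(sheet, as.data.frame))

# Move the row names into a proper column and reset the row index
json_books <- data.frame(
  Book = rownames(stacked),
  stacked,
  row.names = NULL,
  stringsAsFactors = FALSE
)

print(json_books)
```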