Introduction

This week's assignment is to load three different file types into R: HTML, XML, and JSON. All three files contain the same data: three books, each with two authors, along with the genre, page count, and publication year. The first file type to import is HTML.

library(rvest)

# Load the HTML file from GitHub
book_html <- read_html("https://raw.githubusercontent.com/mirajpatel289/Data607/refs/heads/main/books.html")

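# Extract the first <table> element and convert it to a tibble
# (html_element() is the current name for html_node() in rvest 1.0 and later)
book_html_table <- book_html %>%
  html_element("table") %>%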
  html_table()

print(book_html_table)
## # A tibble: 3 × 5
##   Title                                       Authors Genre Pages `Publish Year`
##   <chr>                                       <chr>   <chr> <int>          <int>
## 1 Good Omens: The Nice and Accurate Propheci… Terry … Fant…   288           1990
## 2 Enemy of my Enemy                           Meliss… Thri…   294           2016
## 3 Black House                                 Peter … Horr…   672           2001
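
The printed tibble shortens long titles and author lists to fit the console width. A minimal sketch of printing every column in full is shown below. The inline table is a hypothetical stand-in for books.html (its markup is an assumption, not the actual file), and `sample_page` and `sample_table` are illustrative names.

```r
library(rvest)

# Hypothetical stand-in for books.html, built from an inline HTML fragment
sample_page <- minimal_html('<table>
  <tr><th>Title</th><th>Authors</th><th>Pages</th></tr>
  <tr><td>Black House</td><td>Peter Straub, Stephen King</td><td>672</td></tr>
</table>')

sample_table <- sample_page %>%
  html_element("table") %>%
  html_table()

# width = Inf asks the tibble print method not to shorten or drop columns
print(sample_table, width = Inf)
```

The same `print(..., width = Inf)` call should work on `book_html_table`, since `html_table()` returns a tibble there as well.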

The second file type to import is XML.

library(xml2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Load the XML file
xml_data <- read_xml("https://raw.githubusercontent.com/mirajpatel289/Data607/refs/heads/main/books.xml")

# Extract book details
titles <- xml_text(xml_find_all(xml_data, "//title"))
authors <- xml_find_all(xml_data, "//authors/author") %>% xml_text()
genres <- xml_text(xml_find_all(xml_data, "//genre"))
pages <- xml_text(xml_find_all(xml_data, "//pages"))
publish_years <- xml_text(xml_find_all(xml_data, "//publish_year"))

# Create a data frame
book_data <- data.frame(
    Title = titles,
    Authors = authors,
    Genre = genres,
    Pages = as.integer(pages),
    Publish_Year = as.integer(publish_years)
)

print(book_data)
##                                                          Title          Authors
## 1 Good Omens: The Nice and Accurate Prophecies of Agnes Nutter  Terry Pratchett
## 2                                            Enemy of my Enemy      Neil Gaiman
## 3                                                  Black House Melissa Mayberry
## 4 Good Omens: The Nice and Accurate Prophecies of Agnes Nutter     Travis Casey
## 5                                            Enemy of my Enemy     Peter Straub
## 6                                                  Black House     Stephen King
##      Genre Pages Publish_Year
## 1  Fantasy   288         1990
## 2 Thriller   294         2016
## 3   Horror   672         2001
## 4  Fantasy   288         1990
## 5 Thriller   294         2016
## 6   Horror   672         2001
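
In the table above, the query `//authors/author` returns six values while the other queries return three, so `data.frame()` recycles the book fields and the authors no longer line up with their books (row 2 pairs Neil Gaiman with "Enemy of my Enemy"). One possible way around this is to collapse the authors within each `authors` node before building the data frame. The sketch below runs on a small inline document rather than the real file. The `<library>` and `<book>` wrappers are assumptions about the layout of books.xml, the first title is shortened, and `sample_xml`, `author_groups`, `authors_per_book`, and `book_data_by_book` are illustrative names.

```r
library(xml2)

# Hypothetical stand-in for books.xml. The title/authors/author element names
# follow the XPath queries used above; the wrappers are assumptions.
sample_xml <- read_xml('<library>
  <book>
    <title>Good Omens</title>
    <authors><author>Terry Pratchett</author><author>Neil Gaiman</author></authors>
  </book>
  <book>
    <title>Enemy of my Enemy</title>
    <authors><author>Melissa Mayberry</author><author>Travis Casey</author></authors>
  </book>
  <book>
    <title>Black House</title>
    <authors><author>Peter Straub</author><author>Stephen King</author></authors>
  </book>
</library>')

# One <authors> node per book, so the result has one entry per book
author_groups <- xml_find_all(sample_xml, "//authors")
authors_per_book <- vapply(
  author_groups,
  function(group) {
    # Join the <author> children of this node into a single string
    paste(xml_text(xml_find_all(group, "./author")), collapse = ", ")
  },
  character(1)
)

# Both columns now have length three, so nothing is recycled
book_data_by_book <- data.frame(
  Title = xml_text(xml_find_all(sample_xml, "//title")),
  Authors = authors_per_book,
  stringsAsFactors = FALSE
)

print(book_data_by_book)
```

With every column at one value per book, the result has one row per book, matching the shape of the HTML and JSON tables.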

The last file type to import is JSON.

library(jsonlite)

# Load the JSON file
json_data <- fromJSON("https://raw.githubusercontent.com/mirajpatel289/Data607/refs/heads/main/books.json")

# Extract the 'library' list as a data frame
book_json <- as.data.frame(json_data$library)

# Display the results
print(book_json)
##                                                          title
## 1 Good Omens: The Nice and Accurate Prophecies of Agnes Nutter
## 2                                            Enemy of my Enemy
## 3                                                  Black House
##                          authors    genre pages publish_year
## 1   Terry Pratchett, Neil Gaiman  Fantasy   288         1990
## 2 Melissa Mayberry, Travis Casey Thriller   294         2016
## 3     Peter Straub, Stephen King   Horror   672         2001

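After importing the same data from all three file types, the HTML and JSON tables look almost the same: each has one row per book, with both authors in a single field. The XML table is the exception. The query //authors/author returns six authors while the other queries return three values each, so data.frame() recycles the book fields to fill six rows. As a result the table does not simply repeat each book once per author; it pairs some authors with the wrong books. Row 2, for example, lists Neil Gaiman next to "Enemy of my Enemy", whereas the JSON table lists him as a co-author of Good Omens. The HTML and JSON results differ mainly in presentation: the HTML import prints as a compact tibble that shortens long values, while the JSON import prints as a plain data frame with every value in full.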
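
One way to check the agreement between two imported tables, rather than comparing the printouts by eye, is to give them the same column names and compare them directly. The sketch below uses small hand-built stand-ins for the HTML and JSON tables. `html_like`, `json_like`, and `normalize_names()` are hypothetical names, and only a few columns are reproduced.

```r
# Hypothetical stand-ins for the imported HTML and JSON tables
html_like <- data.frame(
  Title = c("Enemy of my Enemy", "Black House"),
  Pages = c(294L, 672L),
  `Publish Year` = c(2016L, 2001L),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

json_like <- data.frame(
  title = c("Enemy of my Enemy", "Black House"),
  pages = c(294L, 672L),
  publish_year = c(2016L, 2001L),
  stringsAsFactors = FALSE
)

# Lower-case the column names and replace spaces or dots with underscores
normalize_names <- function(df) {
  names(df) <- gsub("[ .]+", "_", tolower(names(df)))
  df
}

# all.equal() returns TRUE when names, types, and values all match
tables_match <- all.equal(normalize_names(html_like), normalize_names(json_like))
print(tables_match)
```

On the real tables the column types may differ (for example, a tibble versus a plain data frame), so converting both with `as.data.frame()` first is a reasonable precaution.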