This week’s assignment is to load three file types into R: HTML, XML, and JSON. All three files contain the same data: three books, each with multiple authors, along with each book's genre, page count, and publication year. The first file type to import is HTML.
library(rvest)
# Load the HTML file
book_html <- read_html("https://raw.githubusercontent.com/mirajpatel289/Data607/refs/heads/main/books.html")
# Select the first <table> element and convert it to a tibble
book_html_table <- book_html %>%
  html_node("table") %>%
  html_table()
print(book_html_table)
## # A tibble: 3 × 5
## Title Authors Genre Pages `Publish Year`
## <chr> <chr> <chr> <int> <int>
## 1 Good Omens: The Nice and Accurate Propheci… Terry … Fant… 288 1990
## 2 Enemy of my Enemy Meliss… Thri… 294 2016
## 3 Black House Peter … Horr… 672 2001
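The printed tibble above abbreviates the Title, Authors, and Genre columns to fit the console. The following is a minimal sketch of one way to see every column in full. It parses a small inline table instead of the real file, so `sample_html` and its two rows are hypothetical stand-ins, and the full Authors values are assumed to match the JSON output. It also uses `html_element()`, the current rvest name for `html_node()`.

```r
library(rvest)

# Hypothetical inline stand-in for books.html (two rows only; the full
# Authors strings are an assumption, since the printout above truncates them)
sample_html <- "<table>
  <tr><th>Title</th><th>Authors</th><th>Genre</th><th>Pages</th><th>Publish Year</th></tr>
  <tr><td>Enemy of my Enemy</td><td>Melissa Mayberry, Travis Casey</td><td>Thriller</td><td>294</td><td>2016</td></tr>
  <tr><td>Black House</td><td>Peter Straub, Stephen King</td><td>Horror</td><td>672</td><td>2001</td></tr>
</table>"

# Parse the string, pick the first table, and convert it to a tibble
sample_table <- read_html(sample_html) %>%
  html_element("table") %>%
  html_table()

# width = Inf asks the tibble printer not to shorten the output to the console width
print(sample_table, width = Inf)
```

The same `print(..., width = Inf)` call should work on `book_html_table` once the real file has been loaded.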
The second file type to import is XML.
library(xml2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the XML file
xml_data <- read_xml("https://raw.githubusercontent.com/mirajpatel289/Data607/refs/heads/main/books.xml")
# Extract book details
titles <- xml_text(xml_find_all(xml_data, "//title"))
authors <- xml_find_all(xml_data, "//authors/author") %>% xml_text()
genres <- xml_text(xml_find_all(xml_data, "//genre"))
pages <- xml_text(xml_find_all(xml_data, "//pages"))
publish_years <- xml_text(xml_find_all(xml_data, "//publish_year"))
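# Create a data frame.
# Note: authors has six entries while the other vectors have three, so
# data.frame() recycles the shorter vectors and pairs authors by position,
# not by book.
book_data <- data.frame(
  Title = titles,
  Authors = authors,
  Genre = genres,
  Pages = as.integer(pages),
  Publish_Year = as.integer(publish_years)
)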
print(book_data)
## Title Authors
## 1 Good Omens: The Nice and Accurate Prophecies of Agnes Nutter Terry Pratchett
## 2 Enemy of my Enemy Neil Gaiman
## 3 Black House Melissa Mayberry
## 4 Good Omens: The Nice and Accurate Prophecies of Agnes Nutter Travis Casey
## 5 Enemy of my Enemy Peter Straub
## 6 Black House Stephen King
## Genre Pages Publish_Year
## 1 Fantasy 288 1990
## 2 Thriller 294 2016
## 3 Horror 672 2001
## 4 Fantasy 288 1990
## 5 Thriller 294 2016
## 6 Horror 672 2001
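In the table above, the six author names are matched to the three books by position, so some authors land next to the wrong title. One possible fix, sketched here under assumptions, is to collapse the authors of each book into a single string before building the data frame. The inline XML is a hypothetical stand-in for books.xml: the `<library>` and `<book>` wrapper names are guesses, while the inner element names come from the XPath expressions used above.

```r
library(xml2)

# Hypothetical stand-in for books.xml (two books only; wrapper element
# names <library> and <book> are assumptions)
sample_xml <- read_xml("<library>
  <book>
    <title>Enemy of my Enemy</title>
    <authors><author>Melissa Mayberry</author><author>Travis Casey</author></authors>
    <genre>Thriller</genre><pages>294</pages><publish_year>2016</publish_year>
  </book>
  <book>
    <title>Black House</title>
    <authors><author>Peter Straub</author><author>Stephen King</author></authors>
    <genre>Horror</genre><pages>672</pages><publish_year>2001</publish_year>
  </book>
</library>")

# One <authors> node per book; join the <author> children of each node
authors_nodes <- xml_find_all(sample_xml, "//authors")
authors_per_book <- vapply(
  authors_nodes,
  function(node) paste(xml_text(xml_find_all(node, "./author")), collapse = ", "),
  character(1)
)

# Every column now has one value per book, so nothing is recycled
book_data_fixed <- data.frame(
  Title = xml_text(xml_find_all(sample_xml, "//title")),
  Authors = authors_per_book,
  Genre = xml_text(xml_find_all(sample_xml, "//genre")),
  Pages = as.integer(xml_text(xml_find_all(sample_xml, "//pages"))),
  Publish_Year = as.integer(xml_text(xml_find_all(sample_xml, "//publish_year"))),
  stringsAsFactors = FALSE
)
print(book_data_fixed)
```

Replacing `sample_xml` with `xml_data` should give one row per book, provided each book in the real file has exactly one `<authors>` element.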
The last file type to import is JSON.
library(jsonlite)
# Load the JSON file
json_data <- fromJSON("https://raw.githubusercontent.com/mirajpatel289/Data607/refs/heads/main/books.json")
# Extract the 'library' list as a data frame
book_json <- as.data.frame(json_data$library)
# Display the results
print(book_json)
## title
## 1 Good Omens: The Nice and Accurate Prophecies of Agnes Nutter
## 2 Enemy of my Enemy
## 3 Black House
## authors genre pages publish_year
## 1 Terry Pratchett, Neil Gaiman Fantasy 288 1990
## 2 Melissa Mayberry, Travis Casey Thriller 294 2016
## 3 Peter Straub, Stephen King Horror 672 2001
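If one row per author is wanted, the JSON table is a convenient starting point because each book's authors already sit in the same row as its title. The sketch below assumes the `authors` field is stored as a single comma-separated string, which is how it prints above; if the file stores a JSON array instead, `fromJSON()` would return a list column and the `strsplit()` step would not be needed. The inline `json_text` is a hypothetical stand-in for books.json.

```r
library(jsonlite)

# Hypothetical stand-in for books.json (two books only; authors assumed
# to be a comma-separated string)
json_text <- '{"library": [
  {"title": "Black House", "authors": "Peter Straub, Stephen King",
   "genre": "Horror", "pages": 672, "publish_year": 2001},
  {"title": "Enemy of my Enemy", "authors": "Melissa Mayberry, Travis Casey",
   "genre": "Thriller", "pages": 294, "publish_year": 2016}
]}'

# fromJSON() simplifies an array of objects into a data frame
books <- fromJSON(json_text)$library

# Split each authors string, then repeat the title once per author
author_list <- strsplit(books$authors, ",\\s*")
books_long <- data.frame(
  title = rep(books$title, lengths(author_list)),
  author = unlist(author_list),
  stringsAsFactors = FALSE
)
print(books_long)
```

Because the title is repeated according to the number of authors found in its own row, each author stays attached to the correct book.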
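After importing the same data from all three file types, the HTML and JSON tables look almost the same: each has one row per book. The XML table is the exception. The //authors/author query returns six author names while the other queries return three values each, so data.frame() recycles the shorter vectors and produces six rows in which authors are paired with books by position. As a result, some pairings are wrong: Neil Gaiman appears next to Enemy of my Enemy, although the JSON output lists him as a co-author of Good Omens. The HTML and JSON results are otherwise similar, but the HTML table prints more compactly because html_table() returns a tibble, which shortens long columns to fit the console.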