This assignment helps us practice working with HTML, XML, and JSON
file formats.
We manually created three files with book information and loaded them
into R.
We compare the results and explain the differences.
We chose three books from different subjects: OpenIntro Statistics, R for Data Science, and Sapiens: A Brief History of Humankind.
Each book includes title, authors, and two extra attributes.
We created three files by hand:
books.html: HTML table
books.xml: XML structure
books.json: JSON structure
Each file stores the same book data.
We use R packages to read each file: rvest for HTML, xml2 for XML, and jsonlite for JSON.
Each file is loaded into a separate data frame.
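As a rough illustration of how the three packages are used, the sketch below parses one book from inline strings instead of files. The HTML, XML, and JSON snippets are hypothetical stand-ins and are not the contents of the actual files.

```r
library(rvest)
library(xml2)
library(jsonlite)

# Hypothetical one-book samples (assumed layout, not the real files)
html_txt <- "<table><tr><th>Title</th><th>Year</th></tr><tr><td>R for Data Science</td><td>2017</td></tr></table>"
xml_txt <- "<books><book><title>R for Data Science</title><year>2017</year></book></books>"
json_txt <- '[{"title": "R for Data Science", "year": 2017}]'

# HTML: html_table() returns a list of tables, so take the first one
from_html <- html_table(read_html(html_txt))[[1]]

# XML: pull each field out with an XPath query and build the data frame by hand
xml_doc <- read_xml(xml_txt)
from_xml <- data.frame(
  title = xml_text(xml_find_all(xml_doc, "//book/title")),
  year = as.integer(xml_text(xml_find_all(xml_doc, "//book/year")))
)

# JSON: an array of objects is simplified to a data frame automatically
from_json <- fromJSON(json_txt)

print(from_html)
print(from_xml)
print(from_json)
```

HTML and JSON give a table-like object almost directly, while XML needs one query per field.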
# Load the packages
library(rvest)
library(xml2)
library(jsonlite)
# Read the HTML page and keep the first table it contains
html_url <- "https://raw.githubusercontent.com/arutam-antunish/DATA607/refs/heads/main/books_arutam.html"
html_data <- read_html(html_url)
books_html <- html_data %>% html_table(fill = TRUE) %>% .[[1]]
print(books_html)
## # A tibble: 3 × 4
## Title Authors Year Extra
## <chr> <chr> <int> <chr>
## 1 OpenIntro Statistics "David M. Diez, Christopher… 2019 Edit…
## 2 R for Data Science "Hadley Wickham, Garrett Gr… 2017 Publ…
## 3 Sapiens: A Brief History of Humankind "Yuval Noah Harari" 2015 Genr…
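# Read the XML file and select every <book> node
xml_url <- "https://raw.githubusercontent.com/arutam-antunish/DATA607/refs/heads/main/books_arutam.xml"
xml_data <- read_xml(xml_url)
book_nodes <- xml_find_all(xml_data, "//book")
books_xml <- data.frame(
  title = xml_text(xml_find_all(book_nodes, "title")),
  # "author" matches only direct children of <book>; the output below shows it
  # matched nothing here, so ".//author" may be needed if the authors are nested
  authors = sapply(book_nodes, function(book) {
    paste(xml_text(xml_find_all(book, "author")), collapse = ", ")
  }),
  # year and extra are typed in by hand here rather than read from the XML
  year = c(2019, 2017, 2015),
  extra = c("Edition: 4th", "Publisher: O'Reilly", "Genre: Non-fiction"))
print(books_xml)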
## title authors year extra
## 1 OpenIntro Statistics 2019 Edition: 4th
## 2 R for Data Science 2017 Publisher: O'Reilly
## 3 Sapiens: A Brief History of Humankind 2015 Genre: Non-fiction
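The empty authors column above is what paste() returns when the XPath query matches no nodes. The sketch below shows one way this can happen, assuming (hypothetically) that each `<author>` sits inside an `<authors>` wrapper; the real file may be laid out differently.

```r
library(xml2)

# Hypothetical layout: <author> elements nested inside <authors>
xml_txt <- paste0(
  "<books><book>",
  "<title>R for Data Science</title>",
  "<authors><author>Hadley Wickham</author><author>Garrett Grolemund</author></authors>",
  "</book></books>"
)
book <- xml_find_first(read_xml(xml_txt), "//book")

# "author" only looks at direct children of <book>, so nothing matches
direct <- paste(xml_text(xml_find_all(book, "author")), collapse = ", ")

# ".//author" searches all descendants of <book>
nested <- paste(xml_text(xml_find_all(book, ".//author")), collapse = ", ")

print(direct)
print(nested)
```

If the real file uses a different element name or stores authors as attributes, the query would need to change accordingly.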
# Read the JSON file and collapse each book's authors into one comma-separated string
json_url <- "https://raw.githubusercontent.com/arutam-antunish/DATA607/refs/heads/main/books_arutam.json"
books_json <- fromJSON(json_url)
books_json_df <- data.frame(
  title = books_json$title,
  authors = sapply(books_json$authors, function(x) paste(x, collapse = ", ")),
  year = books_json$year,
  extra = books_json$extra)
print(books_json_df)
## title
## 1 OpenIntro Statistics
## 2 R for Data Science
## 3 Sapiens: A Brief History of Humankind
## authors year
## 1 David M. Diez, Christopher D. Barr, Mine Çetinkaya-Rundel 2019
## 2 Hadley Wickham, Garrett Grolemund 2017
## 3 Yuval Noah Harari 2015
## extra
## 1 Edition: 4th
## 2 Publisher: O'Reilly
## 3 Genre: Non-fiction
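The paste() step is needed because a JSON array inside each record typically comes back from fromJSON() as a list column. Here is a minimal sketch with a hypothetical two-book JSON string, assuming authors are stored as arrays of different lengths.

```r
library(jsonlite)

# Hypothetical sample: authors stored as arrays of different lengths
json_txt <- '[
  {"title": "R for Data Science",
   "authors": ["Hadley Wickham", "Garrett Grolemund"]},
  {"title": "Sapiens: A Brief History of Humankind",
   "authors": ["Yuval Noah Harari"]}
]'
books <- fromJSON(json_txt)

# Each element of books$authors is a character vector of that book's authors
authors_is_list <- is.list(books$authors)

# Collapse every vector into a single string so the column becomes a plain character vector
authors_flat <- vapply(books$authors, paste, character(1), collapse = ", ")

print(authors_is_list)
print(authors_flat)
```

vapply() is used here instead of sapply() only because it guarantees a character vector even for unusual input.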
We compare the three data frames to check whether they are identical.
We use the identical() and all.equal() functions.
We also use str() to inspect the structure of each data frame.
identical(books_html, books_xml)
## [1] FALSE
identical(books_html, books_json_df)
## [1] FALSE
identical(books_xml, books_json_df)
## [1] FALSE
all.equal(books_html, books_xml)
## [1] "Names: 4 string mismatches"
## [2] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
## [3] "Attributes: < Component \"class\": 1 string mismatch >"
## [4] "Component 2: 3 string mismatches"
all.equal(books_html, books_json_df)
## [1] "Names: 4 string mismatches"
## [2] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
## [3] "Attributes: < Component \"class\": 1 string mismatch >"
## [4] "Component 2: 1 string mismatch"
all.equal(books_xml, books_json_df)
## [1] "Component \"authors\": 3 string mismatches"
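The two functions answer slightly different questions: identical() requires an exact match, including storage type, while all.equal() treats integer and double values as equal when the numbers agree. A small sketch with made-up data frames:

```r
# Two made-up frames that differ only in how year is stored
a <- data.frame(title = c("A", "B"), year = c(2019L, 2017L))  # integer
b <- data.frame(title = c("A", "B"), year = c(2019, 2017))    # double

exact_match <- identical(a, b)

# all.equal() returns TRUE or a character vector describing the differences,
# so wrap it in isTRUE() when a single logical value is needed
near_match <- isTRUE(all.equal(a, b))

print(exact_match)
print(near_match)
```

This matches the output above, where the XML and JSON data frames differ in the type of year but all.equal() reports only the authors column.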
str(books_html)
## tibble [3 × 4] (S3: tbl_df/tbl/data.frame)
## $ Title : chr [1:3] "OpenIntro Statistics" "R for Data Science" "Sapiens: A Brief History of Humankind"
## $ Authors: chr [1:3] "David M. Diez, Christopher D. Barr, Mine Ã\u0087etinkaya-Rundel" "Hadley Wickham, Garrett Grolemund" "Yuval Noah Harari"
## $ Year : int [1:3] 2019 2017 2015
## $ Extra : chr [1:3] "Edition: 4th" "Publisher: O'Reilly" "Genre: Non-fiction"
str(books_xml)
## 'data.frame': 3 obs. of 4 variables:
## $ title : chr "OpenIntro Statistics" "R for Data Science" "Sapiens: A Brief History of Humankind"
## $ authors: chr "" "" ""
## $ year : num 2019 2017 2015
## $ extra : chr "Edition: 4th" "Publisher: O'Reilly" "Genre: Non-fiction"
str(books_json_df)
## 'data.frame': 3 obs. of 4 variables:
## $ title : chr "OpenIntro Statistics" "R for Data Science" "Sapiens: A Brief History of Humankind"
## $ authors: chr "David M. Diez, Christopher D. Barr, Mine Çetinkaya-Rundel" "Hadley Wickham, Garrett Grolemund" "Yuval Noah Harari"
## $ year : int 2019 2017 2015
## $ extra : chr "Edition: 4th" "Publisher: O'Reilly" "Genre: Non-fiction"
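The garbled "Ç" in the Authors column of the HTML data frame looks like UTF-8 text decoded with a different encoding. One possible fix, assuming the file is saved as UTF-8, is to pass encoding = "UTF-8" to read_html(). The sketch below uses a hypothetical inline table supplied as raw bytes rather than the real file.

```r
library(rvest)

# Hypothetical one-row table with no charset declaration
html_txt <- paste0(
  "<html><body><table>",
  "<tr><th>Title</th><th>Authors</th></tr>",
  "<tr><td>OpenIntro Statistics</td><td>Mine \u00c7etinkaya-Rundel</td></tr>",
  "</table></body></html>"
)

# Raw UTF-8 bytes stand in for a downloaded file
html_bytes <- charToRaw(enc2utf8(html_txt))

# Tell the parser the bytes are UTF-8 instead of letting it guess
doc <- read_html(html_bytes, encoding = "UTF-8")
authors <- html_table(doc)[[1]]$Authors

print(authors)
```

Declaring the charset inside the HTML file itself, for example with a meta tag, is another option.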
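The three data frames are not identical, but they are very similar.
All of them have the same four columns: title, authors, year, and extra.
The differences are mostly in:
column names: capitalized in the HTML data frame (Title, Authors, Year, Extra), lowercase in the other two
object class: the HTML result is a tibble, while the XML and JSON results are plain data frames
type of year: numeric (double) in the XML data frame, integer in the other two
authors: empty strings in the XML data frame, and a mis-encoded "Ç" in the HTML data frame
The first three do not affect the meaning of the data. The authors differences do, and they point to problems in the XML extraction and the HTML encoding rather than in the formats themselves.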
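The purely structural differences can be removed before comparing. The sketch below is one possible approach using made-up stand-ins for the loaded data; harmonize() is a hypothetical helper, not part of any package, and the tibble package is assumed to be installed (it comes with rvest).

```r
library(tibble)

# Made-up stand-ins: a tibble with capitalized names and a plain data frame
from_html <- tibble(Title = c("A", "B"), Year = c(2019L, 2017L))
from_xml <- data.frame(title = c("A", "B"), year = c(2019, 2017),
                       stringsAsFactors = FALSE)

# Hypothetical helper: same class, lowercase names, integer year
harmonize <- function(df) {
  df <- as.data.frame(df)
  names(df) <- tolower(names(df))
  df$year <- as.integer(df$year)
  df
}

before <- isTRUE(all.equal(from_html, from_xml))
after <- isTRUE(all.equal(harmonize(from_html), harmonize(from_xml)))

print(before)
print(after)
```

Any mismatch that remains after this step reflects a difference in the values themselves.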
This assignment helped us understand how HTML, XML, and JSON store
data.
We manually created files with book information and loaded them into
R.
Each format required different tools and steps to extract the data.
We learned that, even though the formats are different, the final data frames are very similar.