Introduction

This assignment helps us practice working with HTML, XML, and JSON file formats.
We manually created three files with book information and loaded them into R.
We compare the results and explain the differences.

Books selection

We chose three books from different subjects:

  • OpenIntro Statistics (Statistics and Programming for Data Science)
  • R for Data Science (Data Acquisition and Management)
  • Sapiens: A Brief History of Humankind (Anthropology)

Each book includes title, authors, and two extra attributes.

Creating the files manually

We created three files by hand:

  • books.html: HTML table
  • books.xml: XML structure
  • books.json: JSON structure

Each file stores the same book data.

Loading the files into R

We use a different R package to read each file: rvest for HTML, xml2 for XML, and jsonlite for JSON.

Each file is loaded into a separate data frame.

# Load the packages
library(rvest)
library(xml2)
library(jsonlite)

HTML

html_url <- "https://raw.githubusercontent.com/arutam-antunish/DATA607/refs/heads/main/books_arutam.html"
html_data <- read_html(html_url)
books_html <- html_data %>% html_table(fill = TRUE) %>% .[[1]]
print(books_html)
## # A tibble: 3 × 4
##   Title                                 Authors                       Year Extra
##   <chr>                                 <chr>                        <int> <chr>
## 1 OpenIntro Statistics                  "David M. Diez, Christopher…  2019 Edit…
## 2 R for Data Science                    "Hadley Wickham, Garrett Gr…  2017 Publ…
## 3 Sapiens: A Brief History of Humankind "Yuval Noah Harari"           2015 Genr…
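The str() output later in this report shows the "Ç" in Çetinkaya-Rundel as garbled characters in the HTML result, which usually means UTF-8 bytes were decoded with a different encoding. The sketch below shows one possible way to handle this, by passing the encoding argument to read_html(). The small table, the temporary file, and the names html_lines, demo_html, and demo_tbl are made up for illustration. We have not confirmed that this is the cause in books.html.

```r
library(rvest)

# Hypothetical stand-in for books.html: UTF-8 text with no <meta charset> tag
html_lines <- c(
  "<html><body><table>",
  "<tr><th>Title</th><th>Authors</th></tr>",
  "<tr><td>OpenIntro Statistics</td><td>Mine \u00c7etinkaya-Rundel</td></tr>",
  "</table></body></html>"
)
tmp <- tempfile(fileext = ".html")
writeLines(enc2utf8(html_lines), tmp, useBytes = TRUE)

# Declare the encoding explicitly instead of letting the parser guess it
demo_html <- read_html(tmp, encoding = "UTF-8")
demo_tbl <- html_table(demo_html)[[1]]
print(demo_tbl$Authors)
```

Adding a charset declaration to the HTML file itself would be another option, since the parser could then detect the encoding without help from the R code.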

XML

xml_url <- "https://raw.githubusercontent.com/arutam-antunish/DATA607/refs/heads/main/books_arutam.xml"
xml_data <- read_xml(xml_url)
book_nodes <- xml_find_all(xml_data, "//book")

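books_xml <- data.frame(
  title = xml_text(xml_find_all(book_nodes, "title")),
  # "author" matches only <author> elements that are direct children of <book>
  authors = sapply(book_nodes, function(book) {
    paste(xml_text(xml_find_all(book, "author")), collapse = ", ")
  }),
  # year and extra are entered by hand here, not read from the XML file
  year = c(2019, 2017, 2015),
  extra = c("Edition: 4th", "Publisher: O'Reilly", "Genre: Non-fiction"))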
print(books_xml)
##                                   title authors year               extra
## 1                  OpenIntro Statistics         2019        Edition: 4th
## 2                    R for Data Science         2017 Publisher: O'Reilly
## 3 Sapiens: A Brief History of Humankind         2015  Genre: Non-fiction
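The authors column above is empty, so the XPath expression "author" did not match anything. One possible reason is that the author elements are nested inside a wrapper element and are not direct children of each book. The sketch below assumes such a layout (an authors wrapper and a year element, both hypothetical, as are the names xml_text_demo, demo_nodes, and demo_books) and uses the relative path ".//author" to search at any depth. It also reads the year from the XML and does not type it in by hand.

```r
library(xml2)

# Hypothetical layout: <author> elements nested inside an <authors> wrapper
xml_text_demo <- '<books>
  <book>
    <title>R for Data Science</title>
    <authors><author>Hadley Wickham</author><author>Garrett Grolemund</author></authors>
    <year>2017</year>
  </book>
  <book>
    <title>Sapiens: A Brief History of Humankind</title>
    <authors><author>Yuval Noah Harari</author></authors>
    <year>2015</year>
  </book>
</books>'
demo_xml <- read_xml(xml_text_demo)
demo_nodes <- xml_find_all(demo_xml, "//book")

# "author" looks only at direct children of <book>, so it finds nothing here
direct_matches <- length(xml_find_all(demo_nodes[[1]], "author"))

demo_books <- data.frame(
  title = xml_text(xml_find_first(demo_nodes, "title")),
  # ".//author" searches all descendants of the current <book>
  authors = vapply(demo_nodes, function(book) {
    paste(xml_text(xml_find_all(book, ".//author")), collapse = ", ")
  }, character(1)),
  # Read the year from the file and convert it to integer
  year = as.integer(xml_text(xml_find_first(demo_nodes, "year"))),
  stringsAsFactors = FALSE
)
print(demo_books)
```

If the real books.xml uses different element names, the XPath strings would need to change to match.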

JSON

json_url <- "https://raw.githubusercontent.com/arutam-antunish/DATA607/refs/heads/main/books_arutam.json"
books_json <- fromJSON(json_url)

books_json_df <- data.frame(
  title = books_json$title,
  authors = sapply(books_json$authors, function(x) paste(x, collapse = ", ")),
  year = books_json$year,
  extra = books_json$extra)


print(books_json_df)
##                                   title
## 1                  OpenIntro Statistics
## 2                    R for Data Science
## 3 Sapiens: A Brief History of Humankind
##                                                     authors year
## 1 David M. Diez, Christopher D. Barr, Mine Çetinkaya-Rundel 2019
## 2                         Hadley Wickham, Garrett Grolemund 2017
## 3                                         Yuval Noah Harari 2015
##                 extra
## 1        Edition: 4th
## 2 Publisher: O'Reilly
## 3  Genre: Non-fiction
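The sapply() call above is needed because a JSON array of authors does not fit in a single data frame cell. A minimal sketch of this behavior follows, using a small inline JSON string that we made up for illustration (the names json_text_demo, demo_json, and authors_flat are hypothetical and the real books.json may be organized differently). When the author arrays have different lengths, fromJSON() keeps them as a list column, which we then collapse to one string per book.

```r
library(jsonlite)

# Hypothetical JSON: an array of book objects with an array of authors each
json_text_demo <- '[
  {"title": "R for Data Science",
   "authors": ["Hadley Wickham", "Garrett Grolemund"], "year": 2017},
  {"title": "Sapiens: A Brief History of Humankind",
   "authors": ["Yuval Noah Harari"], "year": 2015}
]'
demo_json <- fromJSON(json_text_demo)

# The authors column is a list with one character vector per book
print(is.list(demo_json$authors))

# Collapse each vector of authors into a single comma-separated string
authors_flat <- vapply(demo_json$authors, paste, character(1), collapse = ", ")
print(authors_flat)
```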

Comparing the Data Frames

We compare the three data frames to check if they are identical.
We use the identical() and all.equal() functions: identical() tests for an exact match, and all.equal() reports what differs.
We also print the data frames to see the structure.

identical(books_html, books_xml)
## [1] FALSE
identical(books_html, books_json_df)
## [1] FALSE
identical(books_xml, books_json_df)
## [1] FALSE
all.equal(books_html, books_xml)
## [1] "Names: 4 string mismatches"                                                            
## [2] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
## [3] "Attributes: < Component \"class\": 1 string mismatch >"                                
## [4] "Component 2: 3 string mismatches"
all.equal(books_html, books_json_df)
## [1] "Names: 4 string mismatches"                                                            
## [2] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
## [3] "Attributes: < Component \"class\": 1 string mismatch >"                                
## [4] "Component 2: 1 string mismatch"
all.equal(books_xml, books_json_df)
## [1] "Component \"authors\": 3 string mismatches"
str(books_html)
## tibble [3 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Title  : chr [1:3] "OpenIntro Statistics" "R for Data Science" "Sapiens: A Brief History of Humankind"
##  $ Authors: chr [1:3] "David M. Diez, Christopher D. Barr, Mine Ã\u0087etinkaya-Rundel" "Hadley Wickham, Garrett Grolemund" "Yuval Noah Harari"
##  $ Year   : int [1:3] 2019 2017 2015
##  $ Extra  : chr [1:3] "Edition: 4th" "Publisher: O'Reilly" "Genre: Non-fiction"
str(books_xml)
## 'data.frame':    3 obs. of  4 variables:
##  $ title  : chr  "OpenIntro Statistics" "R for Data Science" "Sapiens: A Brief History of Humankind"
##  $ authors: chr  "" "" ""
##  $ year   : num  2019 2017 2015
##  $ extra  : chr  "Edition: 4th" "Publisher: O'Reilly" "Genre: Non-fiction"
str(books_json_df)
## 'data.frame':    3 obs. of  4 variables:
##  $ title  : chr  "OpenIntro Statistics" "R for Data Science" "Sapiens: A Brief History of Humankind"
##  $ authors: chr  "David M. Diez, Christopher D. Barr, Mine Çetinkaya-Rundel" "Hadley Wickham, Garrett Grolemund" "Yuval Noah Harari"
##  $ year   : int  2019 2017 2015
##  $ extra  : chr  "Edition: 4th" "Publisher: O'Reilly" "Genre: Non-fiction"
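Several of the differences reported above are structural (column names, object class, and the type of the year column). The sketch below shows one way to remove them before comparing. The helper normalize_books() and the two small data frames are hypothetical and only stand in for the real results. The helper does not repair missing or garbled values.

```r
# Hypothetical helper: give every result the same class, names, and year type
normalize_books <- function(df) {
  df <- as.data.frame(df, stringsAsFactors = FALSE)  # also drops the tibble class
  names(df) <- tolower(names(df))                    # Title -> title, and so on
  df$year <- as.integer(df$year)                     # numeric -> integer
  df
}

# Small made-up inputs that differ only in names and in the type of year
demo_a <- data.frame(Title = c("A", "B"), Year = c(2019L, 2017L),
                     stringsAsFactors = FALSE)
demo_b <- data.frame(title = c("A", "B"), year = c(2019, 2017),
                     stringsAsFactors = FALSE)

before <- identical(demo_a, demo_b)
after <- identical(normalize_books(demo_a), normalize_books(demo_b))
print(c(before = before, after = after))
```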

Findings

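The three data frames are not identical. They hold the same four fields (title, authors, year, and extra), and the title and extra values match in all three.
The comparison output shows these differences:

  • Column names: the HTML result uses capitalized names (Title, Authors, Year, Extra), while the XML and JSON results use lowercase names
  • Object class: the HTML result is a tibble, while the XML and JSON results are base data frames
  • Data types: year is numeric in the XML data frame, where it was typed in by hand, and integer in the other two
  • Missing authors: the authors column from XML is empty for all three books, so xml_find_all(book, "author") did not match any elements
  • Character encoding: the HTML result shows the "Ç" in Çetinkaya-Rundel as garbled characters, while the JSON result shows it correctly

The first three differences do not affect the meaning of the data. The last two do, and they need to be fixed before the data frames can be treated as equivalent.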

Conclusion

This assignment helped us understand how HTML, XML, and JSON store data.
We manually created files with book information and loaded them into R.
Each format required different tools and steps to extract the data.

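We learned that even though the formats are different, the final data frames hold nearly the same information, provided that the extraction code matches the structure of each file and that differences in names, types, and encoding are taken into account.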