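The main differences among the data frames are as follows:
Data Types
The XML data frame stores the year as character, while the HTML and JSON data frames store it as integer. The JSON data frame also stores authors as a list column, whereas the other two store it as a character vector.
books_html is a tibble (returned by rvest), while books_xml and books_json are base data frames. Its column names are capitalized (Title, Authors, Year, Genre); the other two use lowercase names.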
# Load necessary libraries
library(XML) # For XML processing
library(jsonlite) # For JSON processing
library(rvest) # For HTML processing
# Load the HTML file
html_data <- read_html("C:/Users/ninan/iCloudDrive/Study/CUNY/607 Data Acquisition and Management/w7/books.html") %>%
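  html_table() # returns a list with one tibble per <table>; the fill argument is deprecated since rvest 1.0.0
# Keep the first table in the list
books_html <- html_data[[1]]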
print(books_html)
## # A tibble: 3 × 4
## Title Authors Year Genre
## <chr> <chr> <int> <chr>
## 1 Sapiens: A Brief History of Humankind Yuval No… 2011 Non-…
## 2 Homo Deus: A Brief History of Tomorrow Yuval No… 2015 Non-…
## 3 Unstoppable Us, Volume 1: How Humans Took Over the World Yuval No… 2022 Chil…
# Load the XML file
xml_data <- xmlParse("C:/Users/ninan/iCloudDrive/Study/CUNY/607 Data Acquisition and Management/w7/books.xml")
xml_root <- xmlRoot(xml_data)
books_xml <- xmlToDataFrame(nodes = getNodeSet(xml_root, "//book"))
print(books_xml)
## title
## 1 Sapiens: A Brief History of Humankind
## 2 Homo Deus: A Brief History of Tomorrow
## 3 Unstoppable Us, Volume 1: How Humans Took Over the World
## authors year genre
## 1 Yuval Noah Harari 2011 Non-fiction
## 2 Yuval Noah Harari 2015 Non-fiction
## 3 Yuval Noah Harari, Ricard Zaplana Ruiz 2022 Children's Non-fiction
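XML text nodes carry no type information, so every column of the XML data frame arrives as character. A minimal sketch of one way the numeric columns could be recovered afterwards, using a hand-built stand-in (`xml_like` and `xml_typed` are hypothetical names, not objects from this analysis):

```r
# Hypothetical stand-in for a data frame returned by xmlToDataFrame():
# every column, including the year, is character.
xml_like <- data.frame(
  title = c("Book A", "Book B"),
  year = c("2011", "2022"),
  stringsAsFactors = FALSE
)

# type.convert() re-guesses the type of each column.
# as.is = TRUE keeps text columns as character instead of turning them into factors.
xml_typed <- utils::type.convert(xml_like, as.is = TRUE)
str(xml_typed)
```

Converting after the fact keeps the parsing step unchanged; the trade-off is that the guessed type depends entirely on the values present in the column.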
# Load the JSON file
books_json <- fromJSON("C:/Users/ninan/iCloudDrive/Study/CUNY/607 Data Acquisition and Management/w7/books.json")
print(books_json)
## title
## 1 Sapiens: A Brief History of Humankind
## 2 Homo Deus: A Brief History of Tomorrow
## 3 Unstoppable Us, Volume 1: How Humans Took Over the World
## authors year genre
## 1 Yuval Noah Harari 2011 Non-fiction
## 2 Yuval Noah Harari 2015 Non-fiction
## 3 Yuval Noah Harari, Ricard Zaplana Ruiz 2022 Children's Non-fiction
# Check if they are identical
identical(books_html, books_xml)
## [1] FALSE
identical(books_html, books_json)
## [1] FALSE
identical(books_xml, books_json)
## [1] FALSE
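identical() only reports that two objects differ, not where. The following is a rough sketch of how the mismatches could be located, using small hand-built stand-ins that only mimic the structures in this document (`html_like`, `xml_like`, `json_like`, `col_classes`, and `names_match` are hypothetical names):

```r
# Hand-built stand-ins that mimic the three structures; not the real files.
html_like <- data.frame(Title = c("A", "B"), Year = c(2011L, 2015L),
                        stringsAsFactors = FALSE)
xml_like  <- data.frame(title = c("A", "B"), year = c("2011", "2015"),
                        stringsAsFactors = FALSE)
json_like <- data.frame(title = c("A", "B"), year = c(2011L, 2015L),
                        stringsAsFactors = FALSE)

# Column classes side by side, one row per data frame.
# Column labels are taken from the first row (the HTML-style names).
col_classes <- rbind(
  html = sapply(html_like, class),
  xml  = sapply(xml_like, class),
  json = sapply(json_like, class)
)
print(col_classes)

# all.equal() describes the differences instead of returning a bare FALSE.
print(all.equal(xml_like, json_like))

# Check whether the column names differ only by case.
names_match <- identical(tolower(names(html_like)), names(xml_like))
print(names_match)
```

Comparing classes and names separately makes it easier to tell a type mismatch from a naming mismatch, which identical() lumps together.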
# Summary of each dataframe
str(books_html)
## tibble [3 × 4] (S3: tbl_df/tbl/data.frame)
## $ Title : chr [1:3] "Sapiens: A Brief History of Humankind" "Homo Deus: A Brief History of Tomorrow" "Unstoppable Us, Volume 1: How Humans Took Over the World"
## $ Authors: chr [1:3] "Yuval Noah Harari" "Yuval Noah Harari" "Yuval Noah Harari, Ricard Zaplana Ruiz"
## $ Year : int [1:3] 2011 2015 2022
## $ Genre : chr [1:3] "Non-fiction" "Non-fiction" "Children's Non-fiction"
str(books_xml)
## 'data.frame': 3 obs. of 4 variables:
## $ title : chr "Sapiens: A Brief History of Humankind" "Homo Deus: A Brief History of Tomorrow" "Unstoppable Us, Volume 1: How Humans Took Over the World"
## $ authors: chr "Yuval Noah Harari" "Yuval Noah Harari" "Yuval Noah Harari, Ricard Zaplana Ruiz"
## $ year : chr "2011" "2015" "2022"
## $ genre : chr "Non-fiction" "Non-fiction" "Children's Non-fiction"
str(books_json)
## 'data.frame': 3 obs. of 4 variables:
## $ title : chr "Sapiens: A Brief History of Humankind" "Homo Deus: A Brief History of Tomorrow" "Unstoppable Us, Volume 1: How Humans Took Over the World"
## $ authors:List of 3
## ..$ : chr "Yuval Noah Harari"
## ..$ : chr "Yuval Noah Harari"
## ..$ : chr "Yuval Noah Harari" "Ricard Zaplana Ruiz"
## $ year : int 2011 2015 2022
## $ genre : chr "Non-fiction" "Non-fiction" "Children's Non-fiction"
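The str() output points to four sources of mismatch: the tibble class, capitalized column names, a character year, and a list column of authors. As an illustration only, the sketch below normalizes hand-built stand-ins so that they compare as identical. The helper `normalize_books()` and the objects `html_like`, `xml_like`, and `json_like` are hypothetical, and the tibble class is set by hand so the example does not depend on the tibble package.

```r
# Hypothetical stand-ins for the three parsed objects (two rows for brevity).
html_like <- data.frame(
  Title = c("Book A", "Book B"),
  Authors = c("Author One", "Author One, Author Two"),
  Year = c(2011L, 2022L),
  stringsAsFactors = FALSE
)
# Mimic a tibble's class vector without requiring the tibble package.
class(html_like) <- c("tbl_df", "tbl", "data.frame")

xml_like <- data.frame(
  title = c("Book A", "Book B"),
  authors = c("Author One", "Author One, Author Two"),
  year = c("2011", "2022"),
  stringsAsFactors = FALSE
)

json_like <- data.frame(
  title = c("Book A", "Book B"),
  year = c(2011L, 2022L),
  stringsAsFactors = FALSE
)
# A list column, as fromJSON() produces for arrays of varying length.
json_like$authors <- list("Author One", c("Author One", "Author Two"))
json_like <- json_like[c("title", "authors", "year")]

normalize_books <- function(df) {
  df <- as.data.frame(df, stringsAsFactors = FALSE)  # drop the tibble classes, if any
  names(df) <- tolower(names(df))                    # Title -> title, etc.
  # Collapse a list column of authors into one comma-separated string per row.
  if (is.list(df$authors)) {
    df$authors <- vapply(df$authors, paste, character(1), collapse = ", ")
  }
  df$year <- as.integer(df$year)                     # "2011" -> 2011L
  df
}

cleaned <- lapply(
  list(html = html_like, xml = xml_like, json = json_like),
  normalize_books
)
all_same <- identical(cleaned$html, cleaned$xml) &&
  identical(cleaned$xml, cleaned$json)
print(all_same)
```

Collapsing the authors list into a single string matches the HTML and XML representation; keeping it as a list and splitting the other two instead would be an equally reasonable choice if individual authors need to be queried.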