Assignment

# Load necessary libraries
library(XML)        # For XML processing
library(jsonlite)   # For JSON processing
library(rvest)      # For HTML processing

# Load the HTML file
html_data <- read_html("C:/Users/ninan/iCloudDrive/Study/CUNY/607 Data Acquisition and Management/w7/books.html") %>%
  html_table(fill = TRUE)
books_html <- html_data[[1]]
print(books_html)

## # A tibble: 3 × 4
##   Title                                                    Authors    Year Genre
##   <chr>                                                    <chr>     <int> <chr>
## 1 Sapiens: A Brief History of Humankind                    Yuval No…  2011 Non-…
## 2 Homo Deus: A Brief History of Tomorrow                   Yuval No…  2015 Non-…
## 3 Unstoppable Us, Volume 1: How Humans Took Over the World Yuval No…  2022 Chil…

# Load the XML file
xml_data <- xmlParse("C:/Users/ninan/iCloudDrive/Study/CUNY/607 Data Acquisition and Management/w7/books.xml")
xml_root <- xmlRoot(xml_data)
books_xml <- xmlToDataFrame(nodes = getNodeSet(xml_root, "//book"))
print(books_xml)

##                                                      title
## 1                    Sapiens: A Brief History of Humankind
## 2                   Homo Deus: A Brief History of Tomorrow
## 3 Unstoppable Us, Volume 1: How Humans Took Over the World
##                                  authors year                  genre
## 1                      Yuval Noah Harari 2011            Non-fiction
## 2                      Yuval Noah Harari 2015            Non-fiction
## 3 Yuval Noah Harari, Ricard Zaplana Ruiz 2022 Children's Non-fiction

# Load the JSON file
books_json <- fromJSON("C:/Users/ninan/iCloudDrive/Study/CUNY/607 Data Acquisition and Management/w7/books.json")
print(books_json)

##                                                      title
## 1                    Sapiens: A Brief History of Humankind
## 2                   Homo Deus: A Brief History of Tomorrow
## 3 Unstoppable Us, Volume 1: How Humans Took Over the World
##                                  authors year                  genre
## 1                      Yuval Noah Harari 2011            Non-fiction
## 2                      Yuval Noah Harari 2015            Non-fiction
## 3 Yuval Noah Harari, Ricard Zaplana Ruiz 2022 Children's Non-fiction

# Check if they are identical
identical(books_html, books_xml)

## [1] FALSE

identical(books_html, books_json)

## [1] FALSE

identical(books_xml, books_json)

## [1] FALSE

# Summary of each dataframe
str(books_html)

## tibble [3 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Title  : chr [1:3] "Sapiens: A Brief History of Humankind" "Homo Deus: A Brief History of Tomorrow" "Unstoppable Us, Volume 1: How Humans Took Over the World"
##  $ Authors: chr [1:3] "Yuval Noah Harari" "Yuval Noah Harari" "Yuval Noah Harari, Ricard Zaplana Ruiz"
##  $ Year   : int [1:3] 2011 2015 2022
##  $ Genre  : chr [1:3] "Non-fiction" "Non-fiction" "Children's Non-fiction"

str(books_xml)

## 'data.frame':    3 obs. of  4 variables:
##  $ title  : chr  "Sapiens: A Brief History of Humankind" "Homo Deus: A Brief History of Tomorrow" "Unstoppable Us, Volume 1: How Humans Took Over the World"
##  $ authors: chr  "Yuval Noah Harari" "Yuval Noah Harari" "Yuval Noah Harari, Ricard Zaplana Ruiz"
##  $ year   : chr  "2011" "2015" "2022"
##  $ genre  : chr  "Non-fiction" "Non-fiction" "Children's Non-fiction"

str(books_json)

## 'data.frame':    3 obs. of  4 variables:
##  $ title  : chr  "Sapiens: A Brief History of Humankind" "Homo Deus: A Brief History of Tomorrow" "Unstoppable Us, Volume 1: How Humans Took Over the World"
##  $ authors:List of 3
##   ..$ : chr "Yuval Noah Harari"
##   ..$ : chr "Yuval Noah Harari"
##   ..$ : chr  "Yuval Noah Harari" "Ricard Zaplana Ruiz"
##  $ year   : int  2011 2015 2022
##  $ genre  : chr  "Non-fiction" "Non-fiction" "Children's Non-fiction"

Main differences among the dataframes are as follows:

Data Types

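books_xml stores the year column as character strings, while books_html and books_json store it as integers. The authors column also differs: books_json holds it as a list of character vectors (one element per author), whereas books_html and books_xml hold it as a single comma-separated string.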
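The year mismatch can be removed with an explicit conversion. The sketch below is only an illustration: `books_xml_demo` is a hypothetical stand-in rebuilt from the printed output above, not the object read from disk.

```r
# Hypothetical stand-in for books_xml, using values from the str() output above
books_xml_demo <- data.frame(
  title = c("Sapiens: A Brief History of Humankind",
            "Homo Deus: A Brief History of Tomorrow",
            "Unstoppable Us, Volume 1: How Humans Took Over the World"),
  year = c("2011", "2015", "2022"),
  stringsAsFactors = FALSE
)

# The XML import returned year as character here, so convert it to integer
books_xml_demo$year <- as.integer(books_xml_demo$year)

# Confirm the new column type
str(books_xml_demo$year)
```

Applying the same conversion to the real books_xml would make its year column match the other two data frames.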

books_html is a tibble (from rvest), whereas books_xml and books_json are base data frames.
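The class difference, together with the capitalised column names visible in the str() output, is enough on its own to make identical() return FALSE. A minimal sketch of how both could be aligned, assuming the tibble package is available (html_table() returned a tibble above, so it should be); `books_html_demo` is a hypothetical two-column stand-in, not the real object:

```r
library(tibble)

# Hypothetical stand-in for books_html (two rows, two columns from the output above)
books_html_demo <- tibble(
  Title = c("Sapiens: A Brief History of Humankind",
            "Homo Deus: A Brief History of Tomorrow"),
  Year  = c(2011L, 2015L)
)

# Drop the tibble classes so only "data.frame" remains
books_html_demo <- as.data.frame(books_html_demo)

# Lower-case the column names to match the XML and JSON versions
names(books_html_demo) <- tolower(names(books_html_demo))

class(books_html_demo)
names(books_html_demo)
```

The data frames would still differ in the other ways shown in the str() output, so this step alone would not be expected to make all three identical.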

Assignment_7

Nwe Oo Mon (Nina)

2024-10-13