Assignment: Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.
Approach: I picked the following books. The last book has two authors.
| Title | Authors | Publisher | Year | Format | Type |
|---|---|---|---|---|---|
| The Night Circus | Erin Morgenstern | New York : Doubleday | 2011 | Book | Fiction |
| Beach Read | Emily Henry | New York : Jove | 2020 | Book | Fiction |
| It’s a Whole Spiel | Katherine Locke;Laura Silverman | Knopf Books for Young Readers | 2019 | Kindle | Nonfiction |
I stored the books' information above by creating an HTML file from scratch and typing in these values. I then uploaded the HTML file to my GitHub repo: https://raw.githubusercontent.com/datanerddhanya/DATA607/main/books_htmlformat.html
# load the libraries for reading HTML files and creating tibbles
library(rvest, warn.conflicts = FALSE)
## Warning: package 'rvest' was built under R version 4.3.3
library(tibble)
# read the HTML document from the raw GitHub URL
book_html <- read_html("https://raw.githubusercontent.com/datanerddhanya/DATA607/main/books_htmlformat.html")
# read the books in the list section
books_html_list <- book_html |> html_elements("li")
books_html_list
## {xml_nodeset (3)}
## [1] <li>A book titled <b><span class="title">The Night Circus</span></b> has ...
## [2] <li>A book titled <b><span class="title">Beach Read</span></b> has author ...
## [3] <li>A book titled <b><span class="title">It’s a Whole Spiel</span></b> ha ...
# create a tibble with the variables: title, authors, publisher, year, format, type
mytibble <- tibble(
  title = books_html_list |>
    html_element(".title") |>
    html_text2(),
  authors = books_html_list |>
    html_element(".authors") |>
    html_text2(),
  publisher = books_html_list |>
    html_element(".publisher") |>
    html_text2(),
  year = books_html_list |>
    html_element(".year") |>
    html_text2(),
  format = books_html_list |>
    html_element(".format") |>
    html_text2(),
  type = books_html_list |>
    html_element(".type") |>
    html_text2()
)
# create a data frame
books_final_from_html <- data.frame(mytibble)
head(books_final_from_html)
## title authors
## 1 The Night Circus Erin Morgenstern
## 2 Beach Read Emily Henry
## 3 It’s a Whole Spiel Katherine Locke;Laura Silverman
## publisher year format type
## 1 New York : Doubleday 2011 Book Fiction
## 2 New York : Jove 2020 Book Fiction
## 3 Knopf Books for Young Readers 2019 Kindle Nonfiction
Conclusion: I created an HTML file from scratch in RStudio, uploaded it to GitHub, accessed the file from an R Markdown file, and read the data into a data frame.
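The class-based extraction used above can be tried without a network connection. The following is a minimal sketch, assuming markup similar to the `<li>` items printed earlier. The inline HTML string is made up for illustration and is not the actual file, and its second item deliberately has no publisher so the handling of a missing field is visible. The name `demo_books` is hypothetical.

```r
library(rvest)

# Hypothetical markup modelled on the <li> items printed above;
# the second item has no publisher on purpose.
page <- minimal_html('
  <ul>
    <li>A book titled <b><span class="title">The Night Circus</span></b>
        has author <span class="authors">Erin Morgenstern</span>,
        published by <span class="publisher">New York : Doubleday</span>.</li>
    <li>A book titled <b><span class="title">Beach Read</span></b>
        has author <span class="authors">Emily Henry</span>.</li>
  </ul>')

items <- page |> html_elements("li")

# html_element() (singular) returns exactly one result per <li>,
# so every column has the same length even when a field is missing.
demo_books <- data.frame(
  title     = items |> html_element(".title")     |> html_text2(),
  authors   = items |> html_element(".authors")   |> html_text2(),
  publisher = items |> html_element(".publisher") |> html_text2()
)
demo_books
```

Selecting the `<li>` items first and then one element per item keeps each row tied to a single book; a missing field shows up as `NA` instead of shifting later values up a row.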
I stored the books' information above by creating an XML file and typing in these values. I then uploaded the XML file to my GitHub repo:
https://raw.githubusercontent.com/datanerddhanya/DATA607/main/book_xmlformat.xml
# load the libraries for reading XML files and creating tibbles
library(XML)
library(xml2)
library(tibble)
# read the XML document from the raw GitHub URL
book_data <- read_xml("https://raw.githubusercontent.com/datanerddhanya/DATA607/main/book_xmlformat.xml")
# parse into an XML-package document so xmlToDataFrame() can be used later
book_xml <- xmlParse(book_data, encoding = "UTF-8")
book_xml
## <?xml version="1.0" encoding="UTF-8"?>
## <MyFavoritebooks>
## <Books>
## <title>The Night Circus</title>
## <authors>Erin Morgenstern</authors>
## <publisher>New York : Doubleday</publisher>
## <year>2011</year>
## <format>Book</format>
## <type>Fiction</type>
## </Books>
## <Books>
## <title>Beach Read</title>
## <authors>Emily Henry</authors>
## <publisher>New York : Jove</publisher>
## <year>2020</year>
## <format>Book</format>
## <type>Fiction</type>
## </Books>
## <Books>
## <title>It’s a Whole Spiel</title>
## <authors>Katherine Locke;Laura Silverman</authors>
## <publisher>Knopf Books for Young Readers</publisher>
## <year>2019</year>
## <format>Kindle</format>
## <type>Nonfiction</type>
## </Books>
## </MyFavoritebooks>
##
# read each book variable into a character vector
title <- xml_text(xml_find_all(book_data, ".//title"))
authors <- xml_text(xml_find_all(book_data, ".//authors"))
publisher <- xml_text(xml_find_all(book_data, ".//publisher"))
year <- xml_text(xml_find_all(book_data, ".//year"))
format <- xml_text(xml_find_all(book_data, ".//format"))
type <- xml_text(xml_find_all(book_data, ".//type"))
# format as a tibble
books <- tibble(title, authors, publisher, year, format, type)
books
## # A tibble: 3 × 6
## title authors publisher year format type
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 The Night Circus Erin Morgenstern New York… 2011 Book Fict…
## 2 Beach Read Emily Henry New York… 2020 Book Fict…
## 3 It’s a Whole Spiel Katherine Locke;Laura Silverm… Knopf Bo… 2019 Kindle Nonf…
# load into a data frame
books_final_from_xml <- xmlToDataFrame(nodes = getNodeSet(book_xml, "//Books"))
head(books_final_from_xml)
## title authors
## 1 The Night Circus Erin Morgenstern
## 2 Beach Read Emily Henry
## 3 It’s a Whole Spiel Katherine Locke;Laura Silverman
## publisher year format type
## 1 New York : Doubleday 2011 Book Fiction
## 2 New York : Jove 2020 Book Fiction
## 3 Knopf Books for Young Readers 2019 Kindle Nonfiction
Conclusion: I created an XML file by hand in RStudio, uploaded it to GitHub, accessed the file from an R Markdown file, and read the data into a data frame.
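The same data frame can also be built with `xml2` alone, without the `XML` package. This is a minimal sketch on an inline copy of two of the records printed above, trimmed to three fields. The helper `get_field()` and the name `demo_xml` are hypothetical and are not part of either package.

```r
library(xml2)

# Inline copy of the first two <Books> records, trimmed to three fields.
doc <- read_xml('<MyFavoritebooks>
  <Books>
    <title>The Night Circus</title>
    <authors>Erin Morgenstern</authors>
    <year>2011</year>
  </Books>
  <Books>
    <title>Beach Read</title>
    <authors>Emily Henry</authors>
    <year>2020</year>
  </Books>
</MyFavoritebooks>')

book_nodes <- xml_find_all(doc, "//Books")

# Hypothetical helper: pull one child field from every <Books> node.
# xml_find_first() returns one result per node (NA if the child is absent).
get_field <- function(nodes, field) {
  xml_text(xml_find_first(nodes, paste0("./", field)))
}

fields <- c("title", "authors", "year")
demo_xml <- as.data.frame(
  lapply(setNames(fields, fields), function(f) get_field(book_nodes, f)),
  stringsAsFactors = FALSE
)
demo_xml
```

Looking up each field relative to its own `<Books>` node keeps the rows aligned even if one record omits a field, which a document-wide `.//title` search cannot guarantee.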
I stored the books' information above by creating a JSON file and typing in these values. I then uploaded the JSON file to my GitHub repo:
https://raw.githubusercontent.com/datanerddhanya/DATA607/main/book_jsonformat.json
library(rjson)
library(tidyjson)
## Warning: package 'tidyjson' was built under R version 4.3.3
##
## Attaching package: 'tidyjson'
## The following object is masked from 'package:stats':
##
## filter
# read the JSON file
# the books are nested as a list of records, so rbind() is used to stack them as rows
result <- fromJSON(file = "https://raw.githubusercontent.com/datanerddhanya/DATA607/main/book_jsonformat.json")
books_final_from_json <- as.data.frame(do.call(rbind, result$MyFavoritebooks$Books))
# the resulting columns are lists, so convert each one to character
books_final_from_json$title <- as.character(books_final_from_json$title)
books_final_from_json$authors <- as.character(books_final_from_json$authors)
books_final_from_json$publisher <- as.character(books_final_from_json$publisher)
books_final_from_json$year <- as.character(books_final_from_json$year)
books_final_from_json$format <- as.character(books_final_from_json$format)
books_final_from_json$type <- as.character(books_final_from_json$type)
head(books_final_from_json)
## title authors
## 1 The Night Circus Erin Morgenstern
## 2 Beach Read Emily Henry
## 3 It’s a Whole Spiel Katherine Locke;Laura Silverman
## publisher year format type
## 1 New York : Doubleday 2011 Book Fiction
## 2 New York : Jove 2020 Book Fiction
## 3 Knopf Books for Young Readers 2019 Kindle Nonfiction
Conclusion: I created a JSON file by hand in Notepad, uploaded it to GitHub, accessed the file from an R Markdown file, and read the data into a data frame. I converted the variable types from list to character.
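The list-to-character conversion can be done for all columns in one pass. This is a minimal sketch on an inline JSON string; the shape of the JSON is an assumption inferred from the `result$MyFavoritebooks$Books` access above, since the real file is not reproduced here, and only three fields are included. The names `json_text`, `result_demo`, and `demo_json` are hypothetical.

```r
library(rjson)

# Assumed shape of the file, inferred from result$MyFavoritebooks$Books.
json_text <- '{"MyFavoritebooks": {"Books": [
  {"title": "The Night Circus", "authors": "Erin Morgenstern", "year": "2011"},
  {"title": "Beach Read",       "authors": "Emily Henry",      "year": "2020"}
]}}'

result_demo <- rjson::fromJSON(json_str = json_text)

# Stack the records row by row, then flatten every column to character
# in one pass instead of converting column by column.
demo_json <- as.data.frame(do.call(rbind, result_demo$MyFavoritebooks$Books))
demo_json[] <- lapply(demo_json, as.character)
str(demo_json)
```

Assigning into `demo_json[]` keeps the data frame structure and column names while replacing the contents, so the result has plain character columns like the HTML and XML versions.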
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# check whether the three data frames are identical
# one option is the identical() function, which requires an exact match of values, types, and attributes
is_equal_1 <- identical(books_final_from_xml, books_final_from_html)
is_equal_2 <- identical(books_final_from_xml, books_final_from_json)
is_equal_3 <- identical(books_final_from_json, books_final_from_html)
print(is_equal_1)
## [1] TRUE
print(is_equal_2)
## [1] TRUE
print(is_equal_3)
## [1] TRUE
# another option is base R's all.equal(), which compares structure and values
# (with a small tolerance for numeric columns) and, on a mismatch,
# returns a description of the differences instead of FALSE
all_equal_1 <- all.equal(books_final_from_xml, books_final_from_html)
all_equal_2 <- all.equal(books_final_from_xml, books_final_from_json)
all_equal_3 <- all.equal(books_final_from_json, books_final_from_html)
print(all_equal_1)
## [1] TRUE
print(all_equal_2)
## [1] TRUE
print(all_equal_3)
## [1] TRUE
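Because all three comparisons above return TRUE, the difference between the two functions is not visible in this output. The following minimal sketch uses small made-up data frames (`a`, `b`, `x`, `y` are hypothetical names) to show where `identical()` and `all.equal()` diverge.

```r
# Same values, but year is character in one frame and integer in the other.
a <- data.frame(title = c("The Night Circus", "Beach Read"),
                year  = c("2011", "2020"))
b <- transform(a, year = as.integer(year))

identical(a, b)           # strict: column types must match too
all.equal(a, b)           # returns a description of the mismatch, not FALSE
isTRUE(all.equal(a, b))   # wrap in isTRUE() to get a single logical for if()

# Tiny floating-point differences: identical() is strict, all.equal() is tolerant.
x <- data.frame(v = 0.1 + 0.2)
y <- data.frame(v = 0.3)
identical(x, y)
isTRUE(all.equal(x, y))
```

Since `all.equal()` never returns FALSE, wrapping it in `isTRUE()` is the usual way to use it in a condition.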
Final conclusion: Yes, all three data frames could be generated in the same format. I confirmed that they are identical using the identical() and all.equal() functions.