Assignment – Working with HTML, XML and JSON in R

Assignment: Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Approach: I picked the following three books. The last book has two authors.

| Title | Authors | Publisher | Year | Format | Type |
|---|---|---|---|---|---|
| The Night Circus | Erin Morgenstern | New York : Doubleday | 2011 | Book | Fiction |
| Beach Read | Emily Henry | New York : Jove | 2020 | Book | Fiction |
| It’s a Whole Spiel | Katherine Locke;Laura Silverman | Knopf Books for Young Readers | 2019 | Kindle | Nonfiction |
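
The last book keeps both authors in a single semicolon-separated string. As a minimal sketch (the variable names `author_list` and `n_authors` are illustrative and not part of the assignment code), the string could be split back into individual authors like this:

```r
# Authors column as stored in the table above
authors <- c("Erin Morgenstern",
             "Emily Henry",
             "Katherine Locke;Laura Silverman")

# Split on the literal ";" separator; fixed = TRUE avoids regex interpretation
author_list <- strsplit(authors, ";", fixed = TRUE)

# Number of authors per book
n_authors <- lengths(author_list)

print(author_list[[3]])
print(n_authors)
```

Keeping one string per book keeps the data frame rectangular; splitting is only needed when the authors have to be handled individually.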

Generating HTML

I stored the books' information above by creating an HTML file from scratch and typing in these values. I then uploaded the HTML file to my GitHub repo: https://raw.githubusercontent.com/datanerddhanya/DATA607/main/books_htmlformat.html

# load the libraries for reading HTML files and creating tibbles
library(rvest, warn.conflicts = FALSE)
## Warning: package 'rvest' was built under R version 4.3.3
library(tibble)

# read the HTML file into a parsed document object
book_html <- read_html("https://raw.githubusercontent.com/datanerddhanya/DATA607/main/books_htmlformat.html")

# read the books in the list section
books_html_list <- book_html |>  html_elements("li")
books_html_list
## {xml_nodeset (3)}
## [1] <li>A book titled <b><span class="title">The Night Circus</span></b> has  ...
## [2] <li>A book titled <b><span class="title">Beach Read</span></b> has author ...
## [3] <li>A book titled <b><span class="title">It’s a Whole Spiel</span></b> ha ...
# create a tibble with the variables: title, authors, publisher, year, format, type
mytibble <- tibble(
  title = books_html_list |> 
    html_element(".title") |> 
    html_text2(),
  authors = books_html_list |> 
    html_element(".authors") |> 
    html_text2(),
  publisher = books_html_list |> 
    html_element(".publisher") |> 
    html_text2(),
  year = books_html_list |> 
    html_element(".year") |> 
    html_text2(),
  format = books_html_list |> 
    html_element(".format") |> 
    html_text2(),
  type = books_html_list |> 
    html_element(".type") |> 
    html_text2()
)

# convert the tibble to a data frame
books_final_from_html <- data.frame(mytibble)
head(books_final_from_html)
##                title                         authors
## 1   The Night Circus                Erin Morgenstern
## 2         Beach Read                     Emily Henry
## 3 It’s a Whole Spiel Katherine Locke;Laura Silverman
##                       publisher year format       type
## 1          New York : Doubleday 2011   Book    Fiction
## 2               New York : Jove 2020   Book    Fiction
## 3 Knopf Books for Young Readers 2019 Kindle Nonfiction

Conclusion: I created an HTML file from scratch in RStudio, uploaded it to GitHub, accessed the file from an R Markdown file, and read the data into a data frame.
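
The six near-identical selector calls can also be written as a loop over the field names. The following is only a sketch under stated assumptions: the inline markup is hypothetical (the printed nodeset shows only the `<li>`, `<b>`, and `span.title` pieces, so the remaining spans and the surrounding sentence are guesses based on the selectors used above), and `html_src`, `items`, `fields`, and `books_sketch` are illustrative names.

```r
library(rvest)

# Hypothetical markup that mimics the structure of the hand-written file
html_src <- '
<ul>
  <li>A book titled <b><span class="title">The Night Circus</span></b> has author
    <span class="authors">Erin Morgenstern</span>, publisher
    <span class="publisher">New York : Doubleday</span>, year
    <span class="year">2011</span>, format
    <span class="format">Book</span>, type
    <span class="type">Fiction</span>.</li>
  <li>A book titled <b><span class="title">Beach Read</span></b> has author
    <span class="authors">Emily Henry</span>, publisher
    <span class="publisher">New York : Jove</span>, year
    <span class="year">2020</span>, format
    <span class="format">Book</span>, type
    <span class="type">Fiction</span>.</li>
</ul>'

# Parse the inline HTML and select one node per book
page  <- minimal_html(html_src)
items <- html_elements(page, "li")

# Extract every field with the same selector pattern: ".<field name>"
fields <- c("title", "authors", "publisher", "year", "format", "type")
cols <- lapply(fields, function(f) {
  html_text2(html_element(items, paste0(".", f)))
})
names(cols) <- fields

# Assemble the character vectors into a data frame
books_sketch <- as.data.frame(cols, stringsAsFactors = FALSE)
print(books_sketch)
```

Because `html_element()` is applied to the nodeset of `<li>` items, each field yields one value per book, which keeps the columns aligned row by row.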

Generating XML

I stored the books' information above by creating an XML file and typing in these values. I then uploaded the XML file to my GitHub repo:

https://raw.githubusercontent.com/datanerddhanya/DATA607/main/book_xmlformat.xml

# load the libraries for reading XML files and creating tibbles
library(XML)
library(xml2)
library(tibble)

# read the XML file into an xml2 document object
book_data <- read_xml("https://raw.githubusercontent.com/datanerddhanya/DATA607/main/book_xmlformat.xml")

# parse into an XML-package document, which xmlToDataFrame() needs later
book_xml <- xmlParse(book_data, encoding = "UTF-8")
book_xml
## <?xml version="1.0" encoding="UTF-8"?>
## <MyFavoritebooks>
##   <Books>
##     <title>The Night Circus</title>
##     <authors>Erin Morgenstern</authors>
##     <publisher>New York : Doubleday</publisher>
##     <year>2011</year>
##     <format>Book</format>
##     <type>Fiction</type>
##   </Books>
##   <Books>
##     <title>Beach Read</title>
##     <authors>Emily Henry</authors>
##     <publisher>New York : Jove</publisher>
##     <year>2020</year>
##     <format>Book</format>
##     <type>Fiction</type>
##   </Books>
##   <Books>
##     <title>It’s a Whole Spiel</title>
##     <authors>Katherine Locke;Laura Silverman</authors>
##     <publisher>Knopf Books for Young Readers</publisher>
##     <year>2019</year>
##     <format>Kindle</format>
##     <type>Nonfiction</type>
##   </Books>
## </MyFavoritebooks>
## 
# read each book variable into its own character vector
title <- xml_text(xml_find_all(book_data, ".//title"))
authors <- xml_text(xml_find_all(book_data, ".//authors"))
publisher <- xml_text(xml_find_all(book_data, ".//publisher"))
year <- xml_text(xml_find_all(book_data, ".//year"))
format <- xml_text(xml_find_all(book_data, ".//format"))
type <- xml_text(xml_find_all(book_data, ".//type"))


# combine the vectors into a tibble
books <- tibble(title, authors, publisher, year, format, type)
books
## # A tibble: 3 × 6
##   title              authors                        publisher year  format type 
##   <chr>              <chr>                          <chr>     <chr> <chr>  <chr>
## 1 The Night Circus   Erin Morgenstern               New York… 2011  Book   Fict…
## 2 Beach Read         Emily Henry                    New York… 2020  Book   Fict…
## 3 It’s a Whole Spiel Katherine Locke;Laura Silverm… Knopf Bo… 2019  Kindle Nonf…
# convert the Books nodes to a data frame with the XML package
books_final_from_xml <- xmlToDataFrame(nodes = getNodeSet(book_xml, "//Books"))

head(books_final_from_xml)
##                title                         authors
## 1   The Night Circus                Erin Morgenstern
## 2         Beach Read                     Emily Henry
## 3 It’s a Whole Spiel Katherine Locke;Laura Silverman
##                       publisher year format       type
## 1          New York : Doubleday 2011   Book    Fiction
## 2               New York : Jove 2020   Book    Fiction
## 3 Knopf Books for Young Readers 2019 Kindle Nonfiction

Conclusion: I created an XML file by hand in RStudio, uploaded it to GitHub, accessed the file from an R Markdown file, and read the data into a data frame.
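
The same data frame can probably be built with xml2 alone, without the XML package. The sketch below is an alternative to the code above, not a replacement for it: the inline string copies the printed document (minus the XML declaration), and `xml_src`, `book_nodes`, `fields`, and `books_df` are illustrative names.

```r
library(xml2)

# Inline copy of the XML structure printed above
xml_src <- '<MyFavoritebooks>
  <Books>
    <title>The Night Circus</title>
    <authors>Erin Morgenstern</authors>
    <publisher>New York : Doubleday</publisher>
    <year>2011</year>
    <format>Book</format>
    <type>Fiction</type>
  </Books>
  <Books>
    <title>Beach Read</title>
    <authors>Emily Henry</authors>
    <publisher>New York : Jove</publisher>
    <year>2020</year>
    <format>Book</format>
    <type>Fiction</type>
  </Books>
  <Books>
    <title>It’s a Whole Spiel</title>
    <authors>Katherine Locke;Laura Silverman</authors>
    <publisher>Knopf Books for Young Readers</publisher>
    <year>2019</year>
    <format>Kindle</format>
    <type>Nonfiction</type>
  </Books>
</MyFavoritebooks>'

doc <- read_xml(xml_src)

# One node per book
book_nodes <- xml_find_all(doc, "//Books")

# For each field, take the first matching child of every book node
fields <- c("title", "authors", "publisher", "year", "format", "type")
cols <- lapply(fields, function(f) {
  xml_text(xml_find_first(book_nodes, paste0("./", f)))
})
names(cols) <- fields

books_df <- as.data.frame(cols, stringsAsFactors = FALSE)
print(books_df)
```

Searching relative to each `Books` node, rather than across the whole document, ties every value to its own book instead of relying on all six vectors happening to have the same length and order.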

Generating JSON

I stored the books' information above by creating a JSON file and typing in these values. I then uploaded the JSON file to my GitHub repo:

https://raw.githubusercontent.com/datanerddhanya/DATA607/main/book_jsonformat.json

library(rjson)
library(tidyjson)
## Warning: package 'tidyjson' was built under R version 4.3.3
## 
## Attaching package: 'tidyjson'
## The following object is masked from 'package:stats':
## 
##     filter
# read the JSON file
# fromJSON() returns a nested list with one list per book,
# so do.call(rbind, ...) stacks the books as rows
result <- fromJSON(file = "https://raw.githubusercontent.com/datanerddhanya/DATA607/main/book_jsonformat.json")
books_final_from_json <- as.data.frame(do.call(rbind, result$MyFavoritebooks$Books))

# the columns come back as lists, so convert each one to character
books_final_from_json$title <- as.character(books_final_from_json$title)
books_final_from_json$authors <- as.character(books_final_from_json$authors)
books_final_from_json$publisher <- as.character(books_final_from_json$publisher)
books_final_from_json$year <- as.character(books_final_from_json$year)
books_final_from_json$format <- as.character(books_final_from_json$format)
books_final_from_json$type <- as.character(books_final_from_json$type)
head(books_final_from_json)
##                title                         authors
## 1   The Night Circus                Erin Morgenstern
## 2         Beach Read                     Emily Henry
## 3 It’s a Whole Spiel Katherine Locke;Laura Silverman
##                       publisher year format       type
## 1          New York : Doubleday 2011   Book    Fiction
## 2               New York : Jove 2020   Book    Fiction
## 3 Knopf Books for Young Readers 2019 Kindle Nonfiction

Conclusion: I created a JSON file by hand in Notepad, uploaded it to GitHub, accessed the file from an R Markdown file, and read the data into a data frame. I then converted the column types from list to character.
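
The list-to-character conversion might be avoidable with a different parser. The sketch below swaps in the jsonlite package, which is not used in the report above, and assumes a file layout inferred from the `result$MyFavoritebooks$Books` access path: an object holding an array of book records, with `year` stored as a string. The names `json_src`, `parsed`, and `books_json` are illustrative.

```r
# Assumed layout of the JSON file (inferred, not copied from the repo)
json_src <- '{
  "MyFavoritebooks": {
    "Books": [
      {"title": "The Night Circus", "authors": "Erin Morgenstern",
       "publisher": "New York : Doubleday", "year": "2011",
       "format": "Book", "type": "Fiction"},
      {"title": "Beach Read", "authors": "Emily Henry",
       "publisher": "New York : Jove", "year": "2020",
       "format": "Book", "type": "Fiction"},
      {"title": "It’s a Whole Spiel", "authors": "Katherine Locke;Laura Silverman",
       "publisher": "Knopf Books for Young Readers", "year": "2019",
       "format": "Kindle", "type": "Nonfiction"}
    ]
  }
}'

# Call jsonlite by namespace so it does not mask rjson::fromJSON()
parsed <- jsonlite::fromJSON(json_src)

# jsonlite simplifies an array of same-shaped records into a data frame
books_json <- parsed$MyFavoritebooks$Books
str(books_json)
```

If the real file stores `year` as a number instead of a string, that column would come back numeric and would still need `as.character()` before comparing it with the HTML and XML results.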

Checking whether the three data frames are identical

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# to check if the three dataframes are identical
# one option is identical() function 
is_equal_1 <- identical(books_final_from_xml,books_final_from_html)
is_equal_2 <- identical(books_final_from_xml,books_final_from_json)
is_equal_3 <- identical(books_final_from_json,books_final_from_html)
print(is_equal_1)
## [1] TRUE
print(is_equal_2)
## [1] TRUE
print(is_equal_3)
## [1] TRUE
# identical() demands an exact match, including attributes and column types.
# A more tolerant option is all.equal() from base R (not dplyr's all_equal()),
# which compares structure and values and allows small numeric differences.

all_equal_1 <- all.equal(books_final_from_xml,books_final_from_html)
all_equal_2 <- all.equal(books_final_from_xml,books_final_from_json)
all_equal_3 <- all.equal(books_final_from_json,books_final_from_html)
print(all_equal_1)
## [1] TRUE
print(all_equal_2)
## [1] TRUE
print(all_equal_3)
## [1] TRUE

Final conclusion: Yes, all three data frames could be generated in the same format. I confirmed that they are identical using the identical() and all.equal() functions.
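
One detail is worth keeping in mind when reusing these checks: `identical()` always returns a single TRUE or FALSE, while `all.equal()` returns TRUE on a match and a character vector describing the differences otherwise, so it should be wrapped in `isTRUE()` inside conditions. A small sketch with made-up data frames (`a`, `b`, and `c_df` are illustrative and unrelated to the book files):

```r
# Two identical data frames and one whose year column has a different type
a <- data.frame(title = c("The Night Circus", "Beach Read"),
                year  = c("2011", "2020"),
                stringsAsFactors = FALSE)
b <- a
c_df <- a
c_df$year <- as.integer(c_df$year)

# Same structure and values
same_ab <- identical(a, b)

# Values look alike when printed, but the column types differ
same_ac <- identical(a, c_df)

# all.equal() reports the mismatch as text instead of FALSE
report_ac <- all.equal(a, c_df)
safe_ac <- isTRUE(report_ac)

print(same_ab)
print(same_ac)
print(report_ac)
print(safe_ac)
```

This is why the character conversion in the JSON section matters: a `year` column of a different type would make the comparison fail even though the printed tables look the same.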