Coffy Andrews-Guo October 10, 2021
Three of my favorite books on ‘Self Development’ were selected this assignment. The assignment involves working with XML and JSON files with these criteria: (1) At least one of the books should have more than one author. (2)For each book, include the title, authors, and two or three other attributes that you find interesting.
The three books and criteria was separately created in three files:_(1) HTML (using an html table), (2) XML, and (3) JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). These files were created in a code editor (Notepad++).
All three file sources were loaded in R as separate data frames. In my observation all three sources when converted to a data frame have a similar structure and parsing (to character class) expression.
library("textreadr") # package to read HTML file
library("XML") # package to read XML file
library("methods") # also load the other required XML package
library("xml2") # alternative package to read XML file##
## Attaching package: 'xml2'
## The following objects are masked from 'package:textreadr':
##
## read_html, read_xml
library("jsonlite") # package to read JSON files
library("rvest") # package to parse html table into a data frame##
## Attaching package: 'rvest'
## The following object is masked from 'package:textreadr':
##
## read_html
library("tidyverse")## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.5 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.0.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x purrr::flatten() masks jsonlite::flatten()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag() masks stats::lag()
## x rvest::read_html() masks xml2::read_html(), textreadr::read_html()
## x xml2::read_xml() masks textreadr::read_xml()
library("dplyr")HTML_file <- read_html("https://raw.githubusercontent.com/candrewxs/D607_Wk7_XMLnJSON/main/D607%20Wk7_HTML.html")
print(HTML_file)## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body><table>\n<tr>\n<th>Book Id</th>\n<th>Title</th>\n<th>Authors</th>\n ...
Convert data type to an R data frame
HTML_df <- HTML_file %>%
html_table() %>%
as.data.frame()
HTML_df## Book.Id Title Authors
## 1 1 everything's an argument Andrea A. Lunsford, John J. Ruszkiewicz
## 2 2 The Richest Man In Babylon George S. Clason
## 3 3 Emotional Intelligence 2.0 Travis Bradberry, Jean Greaves
## Length Review ISBN
## 1 513 Illustrative 9.781458e+12
## 2 90 Insightful 9.781939e+12
## 3 255 Strategic 9.780974e+11
Tidy data
# transform the ISBN entries numerics to character to show entire length
HTML_df$ISBN <- as.character(HTML_df$ISBN)
HTML_df## Book.Id Title Authors
## 1 1 everything's an argument Andrea A. Lunsford, John J. Ruszkiewicz
## 2 2 The Richest Man In Babylon George S. Clason
## 3 3 Emotional Intelligence 2.0 Travis Bradberry, Jean Greaves
## Length Review ISBN
## 1 513 Illustrative 9781457606069
## 2 90 Insightful 9781939438638
## 3 255 Strategic 978097432065
url_xml <- "https://raw.githubusercontent.com/candrewxs/D607_Wk7_XMLnJSON/main/D607%20Wk7_XML.xml"
print(url_xml)## [1] "https://raw.githubusercontent.com/candrewxs/D607_Wk7_XMLnJSON/main/D607%20Wk7_XML.xml"
Convert XML file and tidy data in R
book_xml <- read_xml(url_xml)
book_xml <- xmlParse(book_xml) %>% xmlToDataFrame()
Books.Id <- c(1, 2, 3) # define new column to add
book_xml2 <- cbind(Books.Id, book_xml) # add column called 'Books.Id'
book_xml2## Books.Id Title Authors
## 1 1 everything's an argument Andrea A. Lunsford, John J. Ruszkiewicz
## 2 2 The Richest Man In Babylon George S. Clason
## 3 3 Emotional Intelligence 2.0 Travis Bradberry, Jean Greaves
## Length Review IBSN
## 1 513 Illustrative 9781457606069
## 2 90 Insightful 9781939438638
## 3 255 Strategic 978097432065
JSON_file <- fromJSON("https://raw.githubusercontent.com/candrewxs/D607_Wk7_XMLnJSON/main/D607%20W7_JSON.json")
print(JSON_file)## $Books
## Book Id Title Authors
## 1 1 everything's an argument Andrea A. Lunsford , John J. Ruskiewicz
## 2 2 The Richest Man In Babylon George S. Clason
## 3 3 Emotional Insightful 2.0 Travis Bradberry , Jean Greaves
## Length Review ISBN
## 1 513 Illustrative 9781457606069
## 2 90 Insightful 9781939438638
## 3 255 Strategic 978097432065
Convert JSON file and tidy was unable to remove index from table
JSON_df <- as.data.frame(JSON_file)
colnames(JSON_df) <- c("Books.Id", "Title", "Authors", "Length", "Review", "ISBN")
JSON_df$Authors <- as.character(JSON_df$Authors)
JSON_df$Length <- as.character(JSON_df$Length)
JSON_df$Review <- as.character(JSON_df$Review)
JSON_df$ISBN <- as.character(JSON_df$ISBN)
JSON_df## Books.Id Title Authors
## 1 1 everything's an argument Andrea A. Lunsford , John J. Ruskiewicz
## 2 2 The Richest Man In Babylon George S. Clason
## 3 3 Emotional Insightful 2.0 Travis Bradberry , Jean Greaves
## Length Review ISBN
## 1 513 Illustrative 9781457606069
## 2 90 Insightful 9781939438638
## 3 255 Strategic 978097432065