Overview:
In this assignment, we pick three favorite books, at least one of which must have more than one author, and record a set of attributes for each book.
I picked three books and created one file per format, each with the following attributes: Title, Author, Genre, Publisher, and Year Published.
I wrote the files in Notepad and saved them as .html, .json, and .xml. All three files are stored in a GitHub repository, from which I extract the data.
First, load the libraries needed to extract and parse the HTML file: RCurl to fetch the file from the web, DT to display the result as an interactive table, and XML to read the HTML file:
#load packages
library(RCurl)
library(DT)
library(XML)
#parse file from web
htmlURL <- getURL("https://raw.githubusercontent.com/FarhanaAkther23/DATA607/main/Assignment%205/Books.html")
books_html <- htmlParse(htmlURL)
#create data frame
books_html_table <- readHTMLTable(books_html, stringsAsFactors = FALSE)
books_html_table <- books_html_table[[1]]
books_html_table
## Title Authors
## 1 The Talisman Stephen King, Peter Francis Straub
## 2 The Color of Water James McBride
## 3 The Hunger Games Suzanne Collins
## Genre Publisher Year Published
## 1 Dark fantasy Viking 1984
## 2 Memoir Penguin Group 1996
## 3 Adventure, Science fiction Scholastic 2008
#view
datatable(books_html_table)
#structure of column
str(books_html_table)
## 'data.frame': 3 obs. of 5 variables:
## $ Title : chr "The Talisman" "The Color of Water" "The Hunger Games"
## $ Authors : chr "Stephen King, Peter Francis Straub" "James McBride" "Suzanne Collins"
## $ Genre : chr "Dark fantasy" "Memoir" "Adventure, Science fiction"
## $ Publisher : chr "Viking" "Penguin Group" "Scholastic"
## $ Year Published: chr "1984" "1996" "2008"
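The Authors column stores a multi-author book as a single comma-separated string. As a minimal sketch, assuming the separator is always a comma followed by optional whitespace, the string can be split into one character vector per book with base R. The `authors` vector below is a hypothetical stand-in copied from the printed output, not the parsed table itself.

```r
# Hypothetical stand-in for books_html_table$Authors, copied from the output above
authors <- c("Stephen King, Peter Francis Straub",
             "James McBride",
             "Suzanne Collins")

# Split each entry on a comma followed by optional whitespace
author_list <- strsplit(authors, ",\\s*")

# Number of authors per book
author_counts <- lengths(author_list)
author_counts

# Authors of the first book as separate strings
author_list[[1]]
```

Keeping the result as a list avoids padding single-author books with empty values.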
The XML file uses the same libraries as the previous example: RCurl to fetch the file from the web, DT to display the table, and XML to read the XML file:
#get URL and Parse
xmlURL <- getURL("https://raw.githubusercontent.com/FarhanaAkther23/DATA607/main/Assignment%205/Books.xml")
books_xml <- xmlParse(xmlURL)
#create data frame
books_xml_table <- xmlToDataFrame(books_xml, stringsAsFactors = FALSE)
books_xml_table
## Title Author
## 1 The Talisman Stephen King, Peter Francis Straub
## 2 The Color of Water James McBride
## 3 The Hunger Games Suzanne Collins
## Genre Publisher YearPublished
## 1 Dark fantasy Viking 1984
## 2 Memoir Penguin Group 1996
## 3 Adventure, Science fiction Scholastic 2008
#view
datatable(books_xml_table)
#structure
str(books_xml_table)
## 'data.frame': 3 obs. of 5 variables:
## $ Title : chr "The Talisman" "The Color of Water" "The Hunger Games"
## $ Author : chr "Stephen King, Peter Francis Straub" "James McBride" "Suzanne Collins"
## $ Genre : chr "Dark fantasy" "Memoir" "Adventure, Science fiction"
## $ Publisher : chr "Viking" "Penguin Group" "Scholastic"
## $ YearPublished: chr "1984" "1996" "2008"
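The same parsing steps can be tried without a network connection by passing XML text directly to `xmlParse()`. The snippet below is a sketch only: the element names `books` and `book` and the two-record layout are assumptions for illustration, since the actual contents of Books.xml are not shown here.

```r
library(XML)

# Hypothetical XML text; the real Books.xml may be structured differently
xml_text <- '<?xml version="1.0" encoding="UTF-8"?>
<books>
  <book>
    <Title>The Color of Water</Title>
    <Author>James McBride</Author>
    <YearPublished>1996</YearPublished>
  </book>
  <book>
    <Title>The Hunger Games</Title>
    <Author>Suzanne Collins</Author>
    <YearPublished>2008</YearPublished>
  </book>
</books>'

# asText = TRUE tells the parser the argument is XML content, not a file name
demo_doc <- xmlParse(xml_text, asText = TRUE)

# Each child of the root element becomes one row; every value is read as text
demo_xml_table <- xmlToDataFrame(demo_doc, stringsAsFactors = FALSE)
str(demo_xml_table)
```

Because XML carries no type information for element content, the year arrives as a character column, matching the `str()` output above.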
The JSON file uses the same libraries as the previous examples, plus RJSONIO: RCurl to fetch the file from the web, DT to display the table, and RJSONIO to read the JSON file:
#load package
library(RJSONIO)
#parse file from web
jsonURL <- getURL("https://raw.githubusercontent.com/FarhanaAkther23/DATA607/main/Assignment%205/Books.json")
books_json <- fromJSON(jsonURL)
#create data frame
books_json_table <- do.call("rbind", lapply(books_json[[1]], data.frame, stringsAsFactors = FALSE))
books_json_table
## Title Author
## 1 The Talisman PStephen King, Peter Francis Straub
## 2 The Color of Water James McBride
## 3 The Hunger Games Suzanne Collins
## Genre Publisher Year.Published
## 1 Dark fantasy Viking 1984
## 2 Memoir Penguin Group 1996
## 3 Adventure, Science fiction Scholastic 2008
#view
datatable(books_json_table)
#structure
str(books_json_table)
## 'data.frame': 3 obs. of 5 variables:
## $ Title : chr "The Talisman" "The Color of Water" "The Hunger Games"
## $ Author : chr "PStephen King, Peter Francis Straub" "James McBride" "Suzanne Collins"
## $ Genre : chr "Dark fantasy" "Memoir" "Adventure, Science fiction"
## $ Publisher : chr "Viking" "Penguin Group" "Scholastic"
## $ Year.Published: num 1984 1996 2008
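The column name `Year.Published` in this output comes from `data.frame()`, which by default rewrites names into syntactically valid R names (`check.names = TRUE`). The sketch below reproduces this with base R only. The `books_demo` list is a hypothetical stand-in for the parsed JSON, assuming each book is a list of named fields; the actual structure returned for Books.json may differ.

```r
# Hypothetical parsed-JSON structure: one list of named fields per book
books_demo <- list(
  list(Title = "The Color of Water", Author = "James McBride",
       `Year Published` = 1996),
  list(Title = "The Hunger Games", Author = "Suzanne Collins",
       `Year Published` = 2008)
)

# Default behaviour: the space in the name is replaced with a dot
default_names <- do.call("rbind",
  lapply(books_demo, data.frame, stringsAsFactors = FALSE))

# check.names = FALSE keeps the name exactly as written in the list
kept_names <- do.call("rbind",
  lapply(books_demo, data.frame, stringsAsFactors = FALSE,
         check.names = FALSE))

names(default_names)
names(kept_names)
```

A column name that contains a space has to be accessed with backticks or `[[ ]]`, for example `kept_names[["Year Published"]]`, which is one reason the default conversion exists.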
The three data frames are nearly identical, with a few differences. In the XML file, I had to write “Year Published” as one word (“YearPublished”) for the file to parse, because XML element names cannot contain spaces, so the column is named “YearPublished”. The JSON file allowed “Year Published” with a space and the code ran without issues, but the space was replaced with a dot (“Year.Published”) when R built the data frame. The author column is also named “Authors” in the HTML table but “Author” in the XML and JSON tables. Lastly, the structure of each table shows that every field is read as character except “Year Published” in the JSON file, which is read as a number.
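One way to confirm that the three sources carry the same data is to bring the tables to a common shape before comparing them. The following is a minimal sketch using base R: the three `*_demo` data frames are hypothetical two-row stand-ins rebuilt from the printed output (with fewer columns for brevity), and `standardize_books()` is an illustrative helper, not part of any package.

```r
# Hypothetical stand-ins mirroring the names and types of the three tables
html_demo <- data.frame(
  Title = c("The Color of Water", "The Hunger Games"),
  Authors = c("James McBride", "Suzanne Collins"),
  `Year Published` = c("1996", "2008"),
  check.names = FALSE, stringsAsFactors = FALSE
)
xml_demo <- data.frame(
  Title = c("The Color of Water", "The Hunger Games"),
  Author = c("James McBride", "Suzanne Collins"),
  YearPublished = c("1996", "2008"),
  stringsAsFactors = FALSE
)
json_demo <- data.frame(
  Title = c("The Color of Water", "The Hunger Games"),
  Author = c("James McBride", "Suzanne Collins"),
  Year.Published = c(1996, 2008),
  stringsAsFactors = FALSE
)

# Illustrative helper: apply shared column names and an integer year
standardize_books <- function(df) {
  names(df) <- c("Title", "Author", "YearPublished")
  df$YearPublished <- as.integer(df$YearPublished)
  df
}

tables <- lapply(list(html = html_demo, xml = xml_demo, json = json_demo),
                 standardize_books)

# TRUE when names, types, and values all match
html_vs_xml <- identical(tables$html, tables$xml)
xml_vs_json <- identical(tables$xml, tables$json)
c(html_vs_xml, xml_vs_json)
```

The helper assumes the columns appear in the same order in every table; renaming by position would silently mislabel data otherwise.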
Sources:
https://www.w3schools.com/html/
https://www.w3schools.com/xml/default.asp
https://www.w3schools.com/js/js_json.asp
https://bookdown.org/yihui/rmarkdown/html-document.html (3.1.1)
https://stackoverflow.com/questions/53241565/how-do-i-get-the-geturl-function-to-work-in-r
https://stackoverflow.com/questions/26840020/issues-with-readhtmltable-in-r