Pick three of your favorite books on one of your favorite subjects. Write R code, using your packages of choice, to load the information from three different sources containing information about the books (HTML, XML, and JSON) into separate R data frames.
Are the three data frames identical?
Load the necessary libraries
library(XML)
library(jsonlite)
library(dplyr)
Three files were created to meet the stated format requirements: HTML, XML, and JSON. Each file records the same five variables (title, author(s), ISBN, original publication year, and OCLC number) for three books about the history of American whaling, particularly the stories that influenced Herman Melville’s writing of Moby-Dick.
These files were uploaded to my GitHub repository for this course (dbouquin/IS_607), and the code below reads them from their raw GitHub URLs.
Read in the HTML table as a dataframe
download.file("https://raw.githubusercontent.com/dbouquin/IS_607/master/whaling_books.html","whaling_books.html", method="curl")
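# readHTMLTable() returns a list with one data frame per table found in the file
whales_HTML <- readHTMLTable("whaling_books.html", header=TRUE)
# Flatten the one-element list into a single data frame
whales_HTML <- as.data.frame(whales_HTML)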
str(whales_HTML)
## 'data.frame': 3 obs. of 5 variables:
## $ NULL.Title : Factor w/ 3 levels "In the Heart of the Sea: The Tragedy of the Whaleship Essex",..: 3 1 2
## $ NULL.Author.s. : Factor w/ 3 levels "Eric Jay Dolin",..: 2 3 1
## $ NULL.ISBN : Factor w/ 3 levels "0141001828","0393060578",..: 3 1 2
## $ NULL.OriginalPublication: Factor w/ 3 levels "1851","2001",..: 1 2 3
## $ NULL.OCLC : Factor w/ 3 levels "44541812","608132810",..: 1 2 3
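The NULL. prefix on every column name comes from flattening the list that readHTMLTable() returns. A minimal sketch of one way to avoid it, assuming the table of interest is the first one in the document, is to request that table with which = 1 and pass stringsAsFactors = FALSE so the text columns stay character. The inline HTML and the names html_text, doc, and books are stand-ins for illustration, not the contents of whaling_books.html.

```r
library(XML)

# Stand-in HTML (hypothetical, trimmed to two columns); not the real file
html_text <- paste0(
  "<html><body><table>",
  "<tr><th>Title</th><th>ISBN</th></tr>",
  "<tr><td>Moby-Dick; or, The Whale</td><td>067978327X</td></tr>",
  "<tr><td>Leviathan: The History of Whaling in America</td><td>0393060578</td></tr>",
  "</table></body></html>"
)

# Parse the string as HTML content rather than as a file path
doc <- htmlParse(html_text, asText = TRUE)

# which = 1 asks for the first table only, so a data frame comes back
# instead of a list; stringsAsFactors = FALSE keeps text as character
books <- readHTMLTable(doc, header = TRUE, which = 1, stringsAsFactors = FALSE)
str(books)
```

With the real file, the same call would take "whaling_books.html" in place of doc.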
Read in the XML file as a dataframe
download.file("https://raw.githubusercontent.com/dbouquin/IS_607/master/whaling_books.xml","whaling_books.xml", method="curl")
whales_XML<-xmlToList(xmlParse("whaling_books.xml"))
whales_XML<-data.frame(do.call(bind_rows, lapply(whales_XML, data.frame, stringsAsFactors=FALSE)))
str(whales_XML)
## 'data.frame': 3 obs. of 7 variables:
## $ title : chr "Moby-Dick; or, The Whale" "In the Heart of the Sea: The Tragedy of the Whaleship Essex" "Leviathan: The History of Whaling in Americae"
## $ author : chr "Herman Melville" "Nathaniel Philbrick" "Eric Jay Dolin"
## $ author.1 : chr "Elizabeth Hardwick (Introduction)" NA NA
## $ author.2 : chr "Rockwell Kent (Illustrator)" NA NA
## $ ISBN : chr "067978327X" "0141001828" "0393060578"
## $ OriginalPublication: chr "1851" "2001" "2008"
## $ OCLC : chr "44541812" "608132810" "85018314"
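The author.1 and author.2 columns appear because each repeated author element in the XML became its own column. As a rough sketch of one way to fold them back into a single variable, the snippet below rebuilds only the author columns by hand from the output above (xml_like and authors are illustrative names) and pastes the non-missing values together.

```r
# Hand-built stand-in for the author columns shown in the str() output above
xml_like <- data.frame(
  author   = c("Herman Melville", "Nathaniel Philbrick", "Eric Jay Dolin"),
  author.1 = c("Elizabeth Hardwick (Introduction)", NA, NA),
  author.2 = c("Rockwell Kent (Illustrator)", NA, NA),
  stringsAsFactors = FALSE
)

# Every column whose name starts with "author"
author_cols <- grep("^author", names(xml_like), value = TRUE)

# For each row, drop the NAs and join the remaining names with "; "
xml_like$authors <- unname(apply(
  xml_like[author_cols], 1,
  function(x) paste(x[!is.na(x)], collapse = "; ")
))

# Keep only the combined column
xml_like <- xml_like[setdiff(names(xml_like), author_cols)]
str(xml_like)
```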
Read in the JSON file as a dataframe
whales_JSON<-data.frame(fromJSON("https://raw.githubusercontent.com/dbouquin/IS_607/master/whaling_books.json"))
str(whales_JSON)
## 'data.frame': 3 obs. of 5 variables:
## $ book.title : chr "Moby-Dick; or, The Whale" "In the Heart of the Sea: The Tragedy of the Whaleship Essex" "Leviathan: The History of Whaling in America"
## $ book.author :List of 3
## ..$ : chr "Herman Melville" "Elizabeth Hardwick (Introduction)" "Rockwell Kent (Illustrator)"
## ..$ : chr "Nathaniel Philbrick"
## ..$ : chr "Eric Jay Dolin"
## $ book.ISBN : chr "067978327X" "0141001828" "0393060578"
## $ book.OriginalPublication: int 1851 2001 2008
## $ book.OCLC : int 44541812 608132810 85018314
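The book.author column here is a list, so a single cell can hold several names. A small sketch of collapsing it into a plain character column follows. It uses an inline JSON string as a stand-in for the real file; json_text, books, and the two-book sample are assumptions made for illustration.

```r
library(jsonlite)

# Stand-in JSON (hypothetical, trimmed to two books and three fields)
json_text <- '{"book": [
  {"title": "Moby-Dick; or, The Whale",
   "author": ["Herman Melville", "Elizabeth Hardwick (Introduction)", "Rockwell Kent (Illustrator)"],
   "OCLC": 44541812},
  {"title": "Leviathan: The History of Whaling in America",
   "author": ["Eric Jay Dolin"],
   "OCLC": 85018314}
]}'

# fromJSON() accepts a JSON string as well as a URL or file path
books <- fromJSON(json_text)$book

# author arrives as a list column; join each book's names into one string
books$author <- vapply(
  books$author,
  function(a) paste(unlist(a), collapse = "; "),
  character(1)
)

# Store the OCLC identifier as text rather than as a number
books$OCLC <- as.character(books$OCLC)
str(books)
```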
The three data frames are not identical. They differ in column naming, in the number of variables, and in data types. The HTML table and the JSON file each produced a data frame with 3 observations of 5 variables, whereas the XML file produced 7 variables per observation. The extra columns come from the author variable: Moby-Dick lists three contributors in the XML, and each repeated author element became its own column (author, author.1, author.2), filled with NA for the books that have a single author. The JSON import instead kept the authors together in one list column. The data types differ as well. When reading the XML I set stringsAsFactors=FALSE deliberately, to contrast with the HTML import where I did not: every XML column is a character string, while every HTML column is a factor. The JSON import parsed OriginalPublication and OCLC as integers and left title and ISBN as character strings. It would be simple to rename the columns with the colnames() function, but the structures and types of the variables in each table would still differ.
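To make the comparison concrete, here is a minimal sketch that checks the question with identical(). It uses hand-built stand-ins that mimic two columns of each data frame, with values copied from the str() output above. The names html_like, xml_like, json_like, and normalize are hypothetical helpers, not part of the original analysis.

```r
isbn <- c("067978327X", "0141001828", "0393060578")

# HTML import: prefixed names, every column a factor
html_like <- data.frame(
  NULL.ISBN = factor(isbn),
  NULL.OCLC = factor(c("44541812", "608132810", "85018314"))
)

# XML import: plain names, every column character
xml_like <- data.frame(
  ISBN = isbn,
  OCLC = c("44541812", "608132810", "85018314"),
  stringsAsFactors = FALSE
)

# JSON import: prefixed names, OCLC parsed as integer
json_like <- data.frame(
  book.ISBN = isbn,
  book.OCLC = c(44541812L, 608132810L, 85018314L),
  stringsAsFactors = FALSE
)

# As imported, the frames differ in both names and types
identical(html_like, xml_like)

# Strip the "NULL." / "book." prefixes and coerce every column to character
normalize <- function(df) {
  names(df) <- sub("^(NULL|book)\\.", "", names(df))
  df[] <- lapply(df, as.character)
  df
}

# After harmonizing names and types, the stand-ins compare equal
identical(normalize(html_like), normalize(xml_like))
identical(normalize(json_like), normalize(xml_like))
```

Renaming alone would not be enough here: the type coercion in normalize() is what removes the factor, character, and integer differences.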