IS607_HW6

Pick three of your favorite books on one of your favorite subjects. Write R code, using your packages of choice, to load the information from three different sources containing information about the books (HTML, XML, and JSON) into separate R data frames.

Are the three data frames identical?

Load the necessary libraries

library(XML)
library(jsonlite)
library(dplyr)

Three files were created to meet the stated format requirements: HTML, XML, and JSON. These files contain the following variables on books about the history of American whaling, particularly those stories that influenced Herman Melville’s writing of Moby Dick.

Title
Author(s)
ISBN
Original Publication Year
OCLC number

These files were uploaded to my corresponding GitHub repository for this course and can be found here.

Read in the HTML table as a dataframe

download.file("https://raw.githubusercontent.com/dbouquin/IS_607/master/whaling_books.html","whaling_books.html", method="curl")
whales_HTML<-readHTMLTable("whaling_books.html", header=TRUE)
whales_HTML <- as.data.frame(whales_HTML)
str(whales_HTML)

## 'data.frame':    3 obs. of  5 variables:
##  $ NULL.Title              : Factor w/ 3 levels "In the Heart of the Sea: The Tragedy of the Whaleship Essex",..: 3 1 2
##  $ NULL.Author.s.          : Factor w/ 3 levels "Eric Jay Dolin",..: 2 3 1
##  $ NULL.ISBN               : Factor w/ 3 levels "0141001828","0393060578",..: 3 1 2
##  $ NULL.OriginalPublication: Factor w/ 3 levels "1851","2001",..: 1 2 3
##  $ NULL.OCLC               : Factor w/ 3 levels "44541812","608132810",..: 1 2 3

Read in the XML file as a dataframe

download.file("https://raw.githubusercontent.com/dbouquin/IS_607/master/whaling_books.xml","whaling_books.xml", method="curl")
whales_XML<-xmlToList(xmlParse("whaling_books.xml"))
whales_XML<-data.frame(do.call(bind_rows, lapply(whales_XML, data.frame, stringsAsFactors=FALSE)))
str(whales_XML)

## 'data.frame':    3 obs. of  7 variables:
##  $ title              : chr  "Moby-Dick; or, The Whale" "In the Heart of the Sea: The Tragedy of the Whaleship Essex" "Leviathan: The History of Whaling in Americae"
##  $ author             : chr  "Herman Melville" "Nathaniel Philbrick" "Eric Jay Dolin"
##  $ author.1           : chr  "Elizabeth Hardwick (Introduction)" NA NA
##  $ author.2           : chr  "Rockwell Kent (Illustrator)" NA NA
##  $ ISBN               : chr  "067978327X" "0141001828" "0393060578"
##  $ OriginalPublication: chr  "1851" "2001" "2008"
##  $ OCLC               : chr  "44541812" "608132810" "85018314"

Read in the JSON file as a dataframe

whales_JSON<-data.frame(fromJSON("https://raw.githubusercontent.com/dbouquin/IS_607/master/whaling_books.json"))
str(whales_JSON)

## 'data.frame':    3 obs. of  5 variables:
##  $ book.title              : chr  "Moby-Dick; or, The Whale" "In the Heart of the Sea: The Tragedy of the Whaleship Essex" "Leviathan: The History of Whaling in America"
##  $ book.author             :List of 3
##   ..$ : chr  "Herman Melville" "Elizabeth Hardwick (Introduction)" "Rockwell Kent (Illustrator)"
##   ..$ : chr "Nathaniel Philbrick"
##   ..$ : chr "Eric Jay Dolin"
##  $ book.ISBN               : chr  "067978327X" "0141001828" "0393060578"
##  $ book.OriginalPublication: int  1851 2001 2008
##  $ book.OCLC               : int  44541812 608132810 85018314

You can see that the dataframes are each different in multiple ways, not least of which are how they treat column naming, and the variables themselves. For example, the HTML table and JSON format each resulted in dataframes containing 3 observations of 5 variables, however, the XML format resulted in 7 variables for each observation. This is because of how the author variable was treated by R when reading in the XML format. Additionally, the datatypes between the different formats differ greatly; when reading in the data from the XML I specified that strings should not be treated as factors to show the difference between this call and not specifying this parameter as seen with the HTML table wherein all the variables are treated as factors rather than character strings. It would be simple to rename the columns for the tables using the colnames() function but the structures and treatment of the variables for each table would still be different.

IS607_HW6

Daina Bouquin

October 15, 2015