FavoriteBooks

First, we load the packages required to complete the assignment: XML, RCurl, plyr, RJSONIO, and knitr.

library("XML")
library("RCurl")

## Loading required package: bitops

library("plyr")
library("RJSONIO")
library("knitr")

On my GitHub, I have written out by hand my choice of three books in three different file formats: HTML, XML, and JSON.

The three books are: The Bostonians, Small is Beautiful, and How to Lie with Statistics.

For each book, I gathered the following information and put it into a table by hand:

Title, Author(s), Publication Date, Publisher, Type (i.e., fiction or nonfiction), and Genre.

Let’s start with HTML.

I used this testing page to ensure I had the table I desired in HTML: http://www.w3schools.com/html/tryit.asp?filename=tryhtml_default

My raw HTML code is located here: https://github.com/AsherMeyers/DATA-607/blob/master/Week-8/Books.html

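# Identify the URL where the raw HTML table is located
html.url <- "https://raw.githubusercontent.com/AsherMeyers/DATA-607/master/Week-8/Books.html"

# Download the contents of that page as a character string
books.html <- getURL(html.url)

# Parse the HTML and read its tables into R
# (without a `which` argument, readHTMLTable returns a list of tables)
books.html.table <- readHTMLTable(books.html, header = TRUE)

# View() opens an interactive viewer; kable() renders the table in the knitted output
View(books.html.table)
kable(books.html.table)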

Title	Author(s)	Publisher	Type	Genre	Publication Date
The Bostonians	Henry James	MacMillan	Fiction	Tragicomedy	1886
Small is Beautiful	E.F. Schumacher	Blond & Griggs	Nonfiction	Economics	1973
How to Lie with Statistics	Darrel Huff, Irving Geis	W.W. Norton	Nonfiction	Statistics	1954
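
To make the HTML step easier to try without network access, here is a minimal sketch that reads a table from an inline string. The HTML below is a hypothetical stand-in for Books.html (the real file's markup may differ), and `demo.table` and `html.text` are illustrative names. It also passes `which = 1`, which asks readHTMLTable for the first table as a data frame rather than a list of tables.

```r
library(XML)

# Hypothetical inline HTML standing in for Books.html (two books, three columns)
html.text <- '<html><body><table>
  <tr><th>Title</th><th>Author(s)</th><th>Publication Date</th></tr>
  <tr><td>The Bostonians</td><td>Henry James</td><td>1886</td></tr>
  <tr><td>Small is Beautiful</td><td>E.F. Schumacher</td><td>1973</td></tr>
</table></body></html>'

# asText = TRUE tells htmlParse that the string is HTML content, not a file path
doc <- htmlParse(html.text, asText = TRUE)

# which = 1 returns the first table as a data frame instead of a list of tables
demo.table <- readHTMLTable(doc, header = TRUE, which = 1,
                            stringsAsFactors = FALSE)
print(demo.table)
```

Working with a single data frame rather than a list should make later steps, such as subsetting by column, more direct.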

Now for reading the table in XML.

# Download the raw XML. RCurl failed on SSL certificate verification here,
# so peer verification is turned off.
# Note: despite its name, books.xml.url holds the file contents, not the URL.
books.xml.url <- getURL("https://raw.githubusercontent.com/AsherMeyers/DATA-607/master/Week-8/Books.xml", ssl.verifyPeer=FALSE)

# Parse the XML text into an R structure
books.xml.data <- xmlParse(books.xml.url)

# Convert each element of the parsed list into a data frame and stack the results
books.xml.table <- ldply(xmlToList(books.xml.data), data.frame)
kable(books.xml.table)

.id	TITLE	AUTHOR	GENRE	COMPANY	TYPE	YEAR	.attrs
BOOK	The Bostonians	Henry James	Tragicomedy	MacMillan	Fiction	1886	1
BOOK	Small is Beautiful	Ernst Fritz Schumacher	Economics	Blond Griggs	Nonfiction	1973	2
BOOK	How to Lie with Statistics	Darrel Huff	Statistics	W.W. Norton	Nonfiction	1954	3
BOOK	How to Lie with Statistics	Irving Geis	Statistics	W.W. Norton	Nonfiction	1954	3
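
The table above lists How to Lie with Statistics twice, once per author. One possible way to get a single row per book is to walk the BOOK nodes and join the author names. This is only a sketch: the XML below is a hypothetical stand-in that assumes authors are stored as repeated AUTHOR elements, which may not match the layout of the real Books.xml, and `books.one.row` is an illustrative name.

```r
library(XML)

# Hypothetical XML standing in for Books.xml; the real file's layout may differ
xml.text <- '<CATALOG>
  <BOOK id="1"><TITLE>The Bostonians</TITLE><AUTHOR>Henry James</AUTHOR><YEAR>1886</YEAR></BOOK>
  <BOOK id="3"><TITLE>How to Lie with Statistics</TITLE><AUTHOR>Darrel Huff</AUTHOR><AUTHOR>Irving Geis</AUTHOR><YEAR>1954</YEAR></BOOK>
</CATALOG>'

doc <- xmlParse(xml.text, asText = TRUE)
book.nodes <- getNodeSet(doc, "//BOOK")

# Build one row per BOOK node, joining repeated AUTHOR values with ", "
rows <- lapply(book.nodes, function(node) {
  data.frame(
    TITLE  = xpathSApply(node, "./TITLE", xmlValue),
    AUTHOR = paste(xpathSApply(node, "./AUTHOR", xmlValue), collapse = ", "),
    YEAR   = xpathSApply(node, "./YEAR", xmlValue),
    stringsAsFactors = FALSE
  )
})

# Stack the one-row data frames into a single table
books.one.row <- do.call(rbind, rows)
print(books.one.row)
```

The relative paths (starting with "./") keep each query scoped to the current BOOK node rather than the whole document.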

In JSON:

books.json.url <- "https://raw.githubusercontent.com/AsherMeyers/DATA-607/master/Week-8/Books.json"
books.json.table <- fromJSON(books.json.url)
View(books.json.table)
kable(books.json.table)

title	The Bostonians
author	Henry James
publisher	MacMillan
type	Fiction
genre	Tragicomedy
pubdate	1886

title	Small is Beautiful
author	E.F. Schumacher
publisher	Blond & Griggs
type	Nonfiction
genre	Economics
pubdate	1973

title	How to Lie with Statistics
author	Darrel Huff, Irving Geis
publisher	W.W. Norton
type	Nonfiction
genre	Statistics
pubdate	1954
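
The JSON import above yields one record per book rather than a single rectangular table. A minimal sketch of reshaping such records into one row per book follows. It assumes the JSON is an array of objects with string values; the inline text is a hypothetical stand-in for Books.json, and `books.list` and `books.json.df` are illustrative names.

```r
library(RJSONIO)

# Hypothetical JSON standing in for Books.json (array of objects, string values)
json.text <- '[
  {"title": "The Bostonians", "author": "Henry James", "pubdate": "1886"},
  {"title": "Small is Beautiful", "author": "E.F. Schumacher", "pubdate": "1973"}
]'

# asText = TRUE marks the input as JSON content rather than a file name
books.list <- fromJSON(json.text, asText = TRUE)

# Turn each record into a one-row data frame; as.list() handles records that
# come back as named character vectors as well as records that are lists
rows <- lapply(books.list, function(book) {
  as.data.frame(as.list(book), stringsAsFactors = FALSE)
})

# Stack the rows so that each book occupies one row
books.json.df <- do.call(rbind, rows)
print(books.json.df)
```

The result has the same orientation as the HTML import, with one row per book and one column per field.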

We see that when importing the table through HTML, each book gets its own row, while in JSON, each book gets its own column and the categories become rows; additionally, the field names are repeated for each book. In XML, the presence of two authors causes the same book to appear in two rows.

I consulted my past project partner, Chris Martin, on how to import XML files into R.

Asher Meyers

March 19, 2016