Week 7 Assignment

The Individual Books

I am an avid reader and have been for many years. I’ve always tended towards the fantasy genre, so my choices of “favorites” reflect that preference. Some of these choices I read early in life, some more recently, but each one I have read at least a few times and enjoy them all over again each time I do.

First Book

The first, and probably most influential book, I loaded into an HTML table. With the XML package, I can retreive it and parse it into a data frame:

# GitHub location
loc <- "https://raw.githubusercontent.com/lysanthus/Data607/master/Week7/book1.html"

# Get the HTML
html <- getURL(loc)

# Parse the HTML
book <- htmlParse(html)

# Get the table headers as column names
headers <- xpathSApply(book,"//th",xmlValue)

# Get the table data for actual book info
data <- xpathSApply(book,"//td",xmlValue)

# Combine into a data frame. Note we had to change the data to a single row.
#   If there had been more than one book, we may have had to take a different
#   approach.
book1 <- data.frame(rbind(data), stringsAsFactors = FALSE, row.names="Book1")

colnames(book1) <- headers

book1

##                       Book Title                       Authors
## Book1 Dragons of Autumn Twilight Margaret Weiss, Tracy Hickman
##       Publish Date Pages
## Book1         1984   448

This book, Dragons of Autumn Twilight, was more or less my gateway into the genre of fantasy fiction. I still like to revisit this (and the others in its trilogy) every few years. In, fact, I may be due for another read.

Second Book

For my second book, I chose a true classic - a book likely on the shelves of many, even those not particularly enamoured with the fantasy genre.

This book, I loaded into a simple XML file.

# GitHub Location
loc <- "https://raw.githubusercontent.com/lysanthus/Data607/master/Week7/book2.xml"

# Get the XML
xml <- getURL(loc)

# Parse the HTML
book <- xmlParse(xml)

# Get the fields
data <- xpathSApply(book,"//book/child::*",xmlValue)

# The column names
headers <- c("title","author","published","pages")

book2 <- data.frame(rbind(data), stringsAsFactors = FALSE, row.names="Book2")

colnames(book2) <- headers

book2

##                            title         author published pages
## Book2 The Fellowship Of The Ring J.R.R. Tolkien      1954   423

This book, The Fellowship Of The Ring is a classic and a must-read for any serious fan of fantasy literature.

Third Book

For the third book, I chose the first of a trilogy I only recently read. Few books developed characters quite as engaging as this did, and the world the author created was rather unique from other novels.

This book I placed into a JSON file.

# GitHub Location
loc <- "https://raw.githubusercontent.com/lysanthus/Data607/master/Week7/book3.json"

# Get the JSON
json <- getURL(loc)

# Parse the JSON
book <- fromJSON(content=json)

# Get the fields
data <- unlist(book, recursive = TRUE)

# The column names
headers <- c("title","author","published","pages")

book3 <- data.frame(rbind(data), stringsAsFactors = FALSE, row.names="Book3")

colnames(book3) <- headers

book3

##                       title     author published pages
## Book3 Assassin's Apprentice Robin Hobb      1995   400

When I discovered Robin Hobb, I was surprised that I had not heard of her books before. They instantly drew me in and captured my imagination throughout the two trilogies.

Analysis

HTML and XML use similar approaches to parsing, while JSON is slightly different. Neither is particularly difficult to do (though I suspect multiple entries in each are a bit trickier to coax into a data frame).

A far as the individual data frames, each are slightly different in that they have different column names (which I could have manually fixed), but otherwise the same:

str(book1)

## 'data.frame':    1 obs. of  4 variables:
##  $ Book Title  : chr "Dragons of Autumn Twilight"
##  $ Authors     : chr "Margaret Weiss, Tracy Hickman"
##  $ Publish Date: chr "1984"
##  $ Pages       : chr "448"

str(book2)

## 'data.frame':    1 obs. of  4 variables:
##  $ title    : chr "The Fellowship Of The Ring"
##  $ author   : chr "J.R.R. Tolkien"
##  $ published: chr "1954"
##  $ pages    : chr "423"

str(book3)

## 'data.frame':    1 obs. of  4 variables:
##  $ title    : chr "Assassin's Apprentice"
##  $ author   : chr "Robin Hobb"
##  $ published: chr "1995"
##  $ pages    : chr "400"

Each variable is imported as a character, which is ok for now and can be coerced into its proper type later if we were to combine them into a single dataframe:

# Fix column names for book1
names(book1) <- c("title","author","published","pages")

# Bind into a single data frame
books <- rbind(book1,book2,book3)

# Fix columns
books$published <- as.numeric(books$published)
books$pages <- as.numeric(books$pages)

books

##                            title                        author published
## Book1 Dragons of Autumn Twilight Margaret Weiss, Tracy Hickman      1984
## Book2 The Fellowship Of The Ring                J.R.R. Tolkien      1954
## Book3      Assassin's Apprentice                    Robin Hobb      1995
##       pages
## Book1   448
## Book2   423
## Book3   400