I am an avid reader and have been for many years. I’ve always tended towards the fantasy genre, so my choices of “favorites” reflect that preference. Some of these I read early in life, some more recently, but I have read each one at least a few times and enjoy it all over again every time.
I loaded the first, and probably most influential, book into an HTML table. With the RCurl and XML packages, I can retrieve it and parse it into a data frame:
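# Load the packages used below: XML for the parsers, RCurl for getURL()
library(XML)
library(RCurl)
# GitHub location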
loc <- "https://raw.githubusercontent.com/lysanthus/Data607/master/Week7/book1.html"
# Get the HTML
html <- getURL(loc)
# Parse the HTML
book <- htmlParse(html)
# Get the table headers as column names
headers <- xpathSApply(book,"//th",xmlValue)
# Get the table data for actual book info
data <- xpathSApply(book,"//td",xmlValue)
# Combine into a data frame. Note we had to change the data to a single row.
# If there had been more than one book, we may have had to take a different
# approach.
book1 <- data.frame(rbind(data), stringsAsFactors = FALSE, row.names="Book1")
colnames(book1) <- headers
book1
## Book Title Authors
## Book1 Dragons of Autumn Twilight Margaret Weiss, Tracy Hickman
## Publish Date Pages
## Book1 1984 448
This book, Dragons of Autumn Twilight, was more or less my gateway into the genre of fantasy fiction. I still like to revisit it (and the others in its trilogy) every few years. In fact, I may be due for another read.
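The code comment above notes that a table with more than one book would need a different approach. A minimal sketch of one option, assuming the same kind of table with one `<th>` header row and one `<tr>` per book: the flat vector of `<td>` values can be reshaped row by row. The inline two-column table here is a hypothetical stand-in for a real file, reusing titles and page counts from this post.

```r
library(XML)

# Hypothetical two-row table, inlined so the sketch needs no network access
html <- "<table>
  <tr><th>Book Title</th><th>Pages</th></tr>
  <tr><td>Dragons of Autumn Twilight</td><td>448</td></tr>
  <tr><td>The Fellowship Of The Ring</td><td>423</td></tr>
</table>"

doc <- htmlParse(html, asText = TRUE)
headers <- xpathSApply(doc, "//th", xmlValue)
cells <- xpathSApply(doc, "//td", xmlValue)

# Cells come back in document order (row by row), so fill the matrix by row
books_html <- as.data.frame(
  matrix(cells, ncol = length(headers), byrow = TRUE),
  stringsAsFactors = FALSE
)
colnames(books_html) <- headers
books_html
```

This relies on every row having the same number of cells. For less regular tables, the XML package's `readHTMLTable()` may be a more convenient starting point.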
For my second book, I chose a true classic - a book likely on the shelves of many, even those not particularly enamoured with the fantasy genre.
I loaded this book into a simple XML file.
# GitHub Location
loc <- "https://raw.githubusercontent.com/lysanthus/Data607/master/Week7/book2.xml"
# Get the XML
xml <- getURL(loc)
# Parse the XML
book <- xmlParse(xml)
# Get the fields
data <- xpathSApply(book,"//book/child::*",xmlValue)
# The column names
headers <- c("title","author","published","pages")
book2 <- data.frame(rbind(data), stringsAsFactors = FALSE, row.names="Book2")
colnames(book2) <- headers
book2
## title author published pages
## Book2 The Fellowship Of The Ring J.R.R. Tolkien 1954 423
This book, The Fellowship Of The Ring, is a classic and a must-read for any serious fan of fantasy literature.
For the third book, I chose the first of a trilogy I only recently read. Few books develop characters quite as engaging as this one does, and the world the author created is quite distinct from those of other novels.
This book I placed into a JSON file.
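# fromJSON(content = ...) matches the RJSONIO signature, so that package is assumed here
library(RJSONIO)
# GitHub Location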
loc <- "https://raw.githubusercontent.com/lysanthus/Data607/master/Week7/book3.json"
# Get the JSON
json <- getURL(loc)
# Parse the JSON
book <- fromJSON(content=json)
# Get the fields
data <- unlist(book, recursive = TRUE)
# The column names
headers <- c("title","author","published","pages")
book3 <- data.frame(rbind(data), stringsAsFactors = FALSE, row.names="Book3")
colnames(book3) <- headers
book3
## title author published pages
## Book3 Assassin's Apprentice Robin Hobb 1995 400
When I discovered Robin Hobb, I was surprised that I had not heard of her books before. They instantly drew me in and captured my imagination throughout the two trilogies.
HTML and XML are parsed in much the same way (both through XPath queries), while JSON is slightly different. None of the three is particularly difficult (though I suspect files with multiple entries are a bit trickier to coax into a data frame).
As far as the individual data frames go, book1 takes its column names from the HTML table headers, while book2 and book3 use the names I assigned manually; otherwise the structures are the same:
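For the JSON case, a file with several entries would most likely parse into a list holding one record per book. A rough sketch, assuming that shape: the `records` list below is hypothetical and hard-coded in place of a real parsed file, and each record becomes a one-row data frame before the rows are stacked.

```r
# Hypothetical parsed result: one named list per book, roughly what a
# JSON array of objects might look like after parsing
records <- list(
  list(title = "The Fellowship Of The Ring", author = "J.R.R. Tolkien",
       published = 1954, pages = 423),
  list(title = "Assassin's Apprentice", author = "Robin Hobb",
       published = 1995, pages = 400)
)

# Turn each record into a one-row data frame, then stack the rows
rows <- lapply(records, function(rec) {
  as.data.frame(rec, stringsAsFactors = FALSE)
})
books_json <- do.call(rbind, rows)
books_json
```

Unlike the `unlist()` approach, this keeps numeric fields numeric, at the cost of requiring every record to have the same field names.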
str(book1)
## 'data.frame': 1 obs. of 4 variables:
## $ Book Title : chr "Dragons of Autumn Twilight"
## $ Authors : chr "Margaret Weiss, Tracy Hickman"
## $ Publish Date: chr "1984"
## $ Pages : chr "448"
str(book2)
## 'data.frame': 1 obs. of 4 variables:
## $ title : chr "The Fellowship Of The Ring"
## $ author : chr "J.R.R. Tolkien"
## $ published: chr "1954"
## $ pages : chr "423"
str(book3)
## 'data.frame': 1 obs. of 4 variables:
## $ title : chr "Assassin's Apprentice"
## $ author : chr "Robin Hobb"
## $ published: chr "1995"
## $ pages : chr "400"
Each variable is imported as a character string, which is fine for now; the columns can be coerced to their proper types once the books are combined into a single data frame:
# Fix column names for book1
names(book1) <- c("title","author","published","pages")
# Bind into a single data frame
books <- rbind(book1,book2,book3)
# Fix columns
books$published <- as.numeric(books$published)
books$pages <- as.numeric(books$pages)
books
## title author published
## Book1 Dragons of Autumn Twilight Margaret Weiss, Tracy Hickman 1984
## Book2 The Fellowship Of The Ring J.R.R. Tolkien 1954
## Book3 Assassin's Apprentice Robin Hobb 1995
## pages
## Book1 448
## Book2 423
## Book3 400
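As an alternative to converting each column by hand, base R's `type.convert()` can guess the column types in one pass. This is only a sketch: `books_chr` is a hypothetical name for a character-only data frame rebuilt from the values shown above, and the data frame method of `type.convert()` assumes a reasonably recent version of R.

```r
# Rebuild the combined, character-only data frame from the values shown above
books_chr <- data.frame(
  title = c("Dragons of Autumn Twilight", "The Fellowship Of The Ring",
            "Assassin's Apprentice"),
  author = c("Margaret Weiss, Tracy Hickman", "J.R.R. Tolkien", "Robin Hobb"),
  published = c("1984", "1954", "1995"),
  pages = c("448", "423", "400"),
  stringsAsFactors = FALSE
)

# as.is = TRUE leaves text columns as character instead of making factors
books_typed <- type.convert(books_chr, as.is = TRUE)
str(books_typed)
```

With only two numeric columns the explicit `as.numeric()` calls are just as clear; the automatic conversion mostly pays off when there are many columns.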