In this assignment, we practice loading the same data from three different file formats: HTML, JSON, and XML. First, we load our books data from an HTML table:

library(RCurl)
## Loading required package: bitops
library(XML)
link <- "https://raw.githubusercontent.com/omarp120/DATA607Week7/master/books.html"
# Download the page text, parse its tables, and keep the first one
booksHTML <- readHTMLTable(getURL(link))[[1]]
booksHTML
##                                                                                                                         Title
## 1                                                                                                             Circle K Cycles
## 2                                       Japanese Diasporasâ\u0080¯: Unsung Pasts, Conflicting Presents, and Uncertain Futures
## 3 New Worlds, New Livesâ\u0080¯: Globalization and People of Japanese Descent in the Americas and from Latin America in Japan
##                                   Author(s) Year                 Publisher
## 1                      Karen Tei. Yamashita 2001        Coffee House Press
## 2                             Nobuko Adachi 2006                 Routledge
## 3 Lane Ryo Hirabayashi, Akemi Kikumura-Yano 2002 Stanford University Press
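
The titles in rows 2 and 3 show `â\u0080¯` before the colon. Those three characters correspond to the bytes E2 80 AF, which is the UTF-8 encoding of U+202F (narrow no-break space) read one byte at a time. The following is a minimal sketch of one possible repair, assuming the source file is UTF-8. The sample string is a hypothetical stand-in built by hand rather than read from the URL, and the names `title_bytes`, `decoded`, and `clean` are illustrative only:

```r
# Hypothetical stand-in for one title as raw bytes: ASCII text with the
# UTF-8 bytes of U+202F (E2 80 AF) placed before the colon.
title_bytes <- c(charToRaw("Japanese Diasporas"),
                 as.raw(c(0xE2, 0x80, 0xAF)),
                 charToRaw(": Unsung Pasts"))

# Declare the bytes as UTF-8 so the three bytes are treated as one character.
decoded <- rawToChar(title_bytes)
Encoding(decoded) <- "UTF-8"

# Optionally replace the narrow no-break space with a plain space.
clean <- gsub("\u202f", " ", decoded, fixed = TRUE)
clean

# When downloading, one option to try is getURL(link, .encoding = "UTF-8");
# that call is not run here, so its effect on this file is an assumption.
```

Declaring the encoding at read time is usually preferable to patching strings afterwards, since it leaves the rest of the pipeline unchanged.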

Next, we load these same books from a JSON file:

library(jsonlite)
link2 <- "https://raw.githubusercontent.com/omarp120/DATA607Week7/master/books.json"
# Read the JSON from the URL and keep the first element of the top-level object
booksJSON <- fromJSON(link2)[[1]]
booksJSON
##                                                                                                                  title
## 1                                                                                                      Circle K Cycles
## 2                                       Japanese Diasporas : Unsung Pasts, Conflicting Presents, and Uncertain Futures
## 3 New Worlds, New Lives : Globalization and People of Japanese Descent in the Americas and from Latin America in Japan
##                                     authors year                 publisher
## 1                      Karen Tei. Yamashita 2001        Coffee House Press
## 2                             Nobuko Adachi 2006                 Routledge
## 3 Lane Ryo Hirabayashi, Akemi Kikumura-Yano 2002 Stanford University Press

Finally, we load these books from an XML file:

library(XML)
library(plyr)
link3 <- "https://raw.githubusercontent.com/omarp120/DATA607Week7/master/books.xml"
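# Parse the downloaded XML text into an internal document tree
books <- xmlParse(getURL(link3), useInternalNodes = TRUE, validate = FALSE)
# Convert each list element to a data frame and stack them row by row
booksXML <- ldply(xmlToList(books), data.frame)
booksXML <- booksXML[, 2:5] # drop the first column, which is an id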
booksXML
##                                                                                                                  Title
## 1                                                                                                      Circle K Cycles
## 2                                       Japanese Diasporas : Unsung Pasts, Conflicting Presents, and Uncertain Futures
## 3 New Worlds, New Lives : Globalization and People of Japanese Descent in the Americas and from Latin America in Japan
##                                     Authors Year                 Publisher
## 1                      Karen Tei. Yamashita 2001        Coffee House Press
## 2                             Nobuko Adachi 2006                 Routledge
## 3 Lane Ryo Hirabayashi, Akemi Kikumura-Yano 2002 Stanford University Press

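The tables are slightly different. The column names vary in capitalization and wording (for example, ‘Author(s)’, ‘authors’, and ‘Authors’), and the XML table initially had one more column than the others, an id that we dropped above. Also, the XML and HTML tables’ elements have data type ‘fctr’ while the JSON table’s elements are ‘chr’.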
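
The type difference can be checked and reconciled with base R. This is a minimal sketch using two small hypothetical stand-in data frames (`books_fctr` and `books_chr`, holding only the first row and two columns) rather than the tables loaded above. It inspects the column classes, converts factor columns to character, and lowercases the column names so the two frames line up:

```r
# Hypothetical stand-ins: one table with factor columns, one with character.
books_fctr <- data.frame(Title = "Circle K Cycles", Year = "2001",
                         stringsAsFactors = TRUE)
books_chr <- data.frame(title = "Circle K Cycles", year = "2001",
                        stringsAsFactors = FALSE)

# Inspect the class of every column in each table.
sapply(books_fctr, class)
sapply(books_chr, class)

# Convert factor columns to character; leave other columns unchanged.
books_fctr[] <- lapply(books_fctr, function(col) {
  if (is.factor(col)) as.character(col) else col
})

# Harmonize the column names before comparing.
names(books_fctr) <- tolower(names(books_fctr))
all.equal(books_fctr, books_chr)
```

The same steps could be applied to `booksHTML` and `booksXML`, although the ‘Author(s)’ column would need an explicit rename, since lowercasing alone does not turn it into ‘authors’.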