In this assignment, we practice loading the same books data from three different file formats: HTML, JSON, and XML. First, we load the books data from an HTML table:
library(RCurl)
## Loading required package: bitops
library(XML)
link <- "https://raw.githubusercontent.com/omarp120/DATA607Week7/master/books.html"
booksHTML <- readHTMLTable(getURL(link))[[1]]
booksHTML
## Title
## 1 Circle K Cycles
## 2 Japanese Diasporas : Unsung Pasts, Conflicting Presents, and Uncertain Futures
## 3 New Worlds, New Lives : Globalization and People of Japanese Descent in the Americas and from Latin America in Japan
## Author(s) Year Publisher
## 1 Karen Tei. Yamashita 2001 Coffee House Press
## 2 Nobuko Adachi 2006 Routledge
## 3 Lane Ryo Hirabayashi, Akemi Kikumura-Yano 2002 Stanford University Press
Next, we load these same books from a JSON file:
library(jsonlite)
link2 <- "https://raw.githubusercontent.com/omarp120/DATA607Week7/master/books.json"
booksJSON <- fromJSON(link2)[[1]]
booksJSON
## title
## 1 Circle K Cycles
## 2 Japanese Diasporas : Unsung Pasts, Conflicting Presents, and Uncertain Futures
## 3 New Worlds, New Lives : Globalization and People of Japanese Descent in the Americas and from Latin America in Japan
## authors year publisher
## 1 Karen Tei. Yamashita 2001 Coffee House Press
## 2 Nobuko Adachi 2006 Routledge
## 3 Lane Ryo Hirabayashi, Akemi Kikumura-Yano 2002 Stanford University Press
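The [[1]] in the call above suggests that fromJSON returns a list whose first element holds the table. A minimal sketch of that behaviour follows, assuming the file wraps an array of book records under a single top-level key. The key name "books" and the inline json_text are illustrative assumptions, not the contents of the actual file:

```r
library(jsonlite)

# Hypothetical inline JSON; the real books.json may be laid out differently.
json_text <- '{
  "books": [
    {"title": "Circle K Cycles", "authors": "Karen Tei. Yamashita",
     "year": "2001", "publisher": "Coffee House Press"},
    {"title": "Japanese Diasporas: Unsung Pasts, Conflicting Presents, and Uncertain Futures",
     "authors": "Nobuko Adachi", "year": "2006", "publisher": "Routledge"}
  ]
}'

# fromJSON simplifies an array of uniform objects into a data frame,
# but the enclosing object is still returned as a named list.
parsed <- fromJSON(json_text)
names(parsed)

# Extracting the first (and only) element gives the data frame itself.
books_df <- parsed[[1]]
str(books_df)
```

Because every value in this sketch is a JSON string, each column comes back as a character vector rather than a factor.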
Finally, we load these books from an XML file:
library(XML)
library(plyr)
link3 <- "https://raw.githubusercontent.com/omarp120/DATA607Week7/master/books.xml"
books <- xmlParse(getURL(link3), useInternalNodes = TRUE, validate = FALSE)
booksXML <- ldply(xmlToList(books), data.frame)
booksXML <- booksXML[, 2:5]  # drop the first column, which is an id
booksXML
## Title
## 1 Circle K Cycles
## 2 Japanese Diasporas : Unsung Pasts, Conflicting Presents, and Uncertain Futures
## 3 New Worlds, New Lives : Globalization and People of Japanese Descent in the Americas and from Latin America in Japan
## Authors Year Publisher
## 1 Karen Tei. Yamashita 2001 Coffee House Press
## 2 Nobuko Adachi 2006 Routledge
## 3 Lane Ryo Hirabayashi, Akemi Kikumura-Yano 2002 Stanford University Press
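As a possible alternative to combining xmlToList with ldply, the XML package also provides xmlToDataFrame, which flattens simple record-style XML directly and should not produce a leading id column. The sketch below runs on an inline document. Its element names (books, book, Title, Authors, Year, Publisher) are assumptions about the shape of the real file, not a copy of it:

```r
library(XML)

# Hypothetical inline XML; the real books.xml may use different element names.
xml_text <- "<books>
  <book>
    <Title>Circle K Cycles</Title>
    <Authors>Karen Tei. Yamashita</Authors>
    <Year>2001</Year>
    <Publisher>Coffee House Press</Publisher>
  </book>
  <book>
    <Title>Japanese Diasporas: Unsung Pasts, Conflicting Presents, and Uncertain Futures</Title>
    <Authors>Nobuko Adachi</Authors>
    <Year>2006</Year>
    <Publisher>Routledge</Publisher>
  </book>
</books>"

# Parse the string rather than a file path or URL.
doc <- xmlParse(xml_text, asText = TRUE)

# Each <book> becomes a row and each child element becomes a column;
# stringsAsFactors = FALSE keeps the values as character vectors.
books_from_xml <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
str(books_from_xml)
```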
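The three tables hold the same records but differ slightly. The column names vary by source (‘Title’ and ‘Author(s)’ in the HTML table, ‘title’ and ‘authors’ in the JSON table, ‘Title’ and ‘Authors’ in the XML table), and the XML table starts with an extra id column, which we drop. Also, the columns of the XML and HTML tables have data type ‘fctr’ (factor), while the columns of the JSON table are ‘chr’ (character).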
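The differences in column types and names can be checked and reconciled with base R. The sketch below uses two small hand-built data frames as stand-ins for the loaded tables. The names books_factor, books_char, and harmonise are hypothetical, and the helper assumes the columns always appear in the order title, authors, year, publisher:

```r
# Stand-in for the HTML/XML result: every column is a factor.
books_factor <- data.frame(
  Title = "Circle K Cycles",
  `Author(s)` = "Karen Tei. Yamashita",
  Year = "2001",
  Publisher = "Coffee House Press",
  check.names = FALSE,
  stringsAsFactors = TRUE
)

# Stand-in for the JSON result: every column is a character vector.
books_char <- data.frame(
  title = "Circle K Cycles",
  authors = "Karen Tei. Yamashita",
  year = "2001",
  publisher = "Coffee House Press",
  stringsAsFactors = FALSE
)

# Inspect the class of each column in both tables.
sapply(books_factor, class)
sapply(books_char, class)

# Hypothetical helper: convert every column to character and apply one
# common set of column names (assumes a fixed column order).
harmonise <- function(df) {
  df[] <- lapply(df, as.character)
  names(df) <- c("title", "authors", "year", "publisher")
  df
}

# After harmonising, the two tables can be compared directly.
all.equal(harmonise(books_factor), harmonise(books_char))
```

Converting factors with as.character keeps the original labels, so the year values stay as "2001" rather than turning into internal integer codes.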