For this assignment I have created a listing of several of my favorite science fiction books in three formats - HTML table, XML and JSON. All files were created by hand in a text editor. All have just one author; however, I have added the awards information to satisfy the requirement of having multiple values in a field. The task is to import all files and convert them to data frames. Original file content and all R code is below.
library(dplyr)
library(stringr)
library(XML) # For HTML and XML manipulation
library(RCurl)
library(tidyjson) # For JSON manipulation
<HTML>
<HEAD />
<BODY>
<table border=1, cellpadding=5>
<caption>Ilya's Favorite Science Fiction Books</caption>
<tr bgcolor=silver>
<th id="title">Book Title</th>
<th id="author">Author</th>
<th id="year">Year</th>
<th id="isbn">ISBN</th>
<th id="award">Awards</th>
<th id="film">Film Adaptation</th>
</tr><tr>
<td headers="title">2001: A Space Odyssey</td>
<td headers="author">Arthur C. Clarke</td>
<td headers="year">1968</td>
<td headers="isbn">0-453-00269-2</td>
<td headers="award"></td>
<td headers="film">2001: A Space Odyssey (Stanley Kubrick; 1968)</td>
</tr><tr>
<td headers="title">Fahrenheit 451</td>
<td headers="author">Ray Bradbury</td>
<td headers="year">1953</td>
<td headers="isbn">978-0-7432-4722-1</td>
<td headers="award">Retro Hugo (2004)<BR>
Prometheus Award (1984)</td>
<td headers="film">Fahrenheit 451 (Francois Truffaut; 1966)</td>
</tr><tr>
<td headers="title">The Gods Themselves</td>
<td headers="author">Isaac Asimov</td>
<td headers="year">1972</td>
<td headers="isbn">0-385-02701-X</td>
<td headers="award">Nebula Award (1972)<BR>
Hugo Award (1973)<BR>
Locus Award (1973)</td>
<td headers="film"></td>
</tr>
</table>
</BODY>
</HTML>
# Import HTML file
htmlFile <- getURL("https://raw.githubusercontent.com/ilyakats/CUNY-DATA607/master/books.html")
# Parse HTML table
booksHTML <- readHTMLTable(htmlFile, header = TRUE)
# Convert to data frame and adjust column names
booksHTML <- as.data.frame(booksHTML)
colnames(booksHTML) <- c("title", "author", "year", "isbn", "award", "film")
# Adjust award column
booksHTML$award <- str_replace_all(booksHTML$award, "\t", "")
booksHTML$award <- str_replace_all(booksHTML$award, "\n", "; ")
| title | author | year | isbn | award | film |
|---|---|---|---|---|---|
| 2001: A Space Odyssey | Arthur C. Clarke | 1968 | 0-453-00269-2 | 2001: A Space Odyssey (Stanley Kubrick; 1968) | |
| Fahrenheit 451 | Ray Bradbury | 1953 | 978-0-7432-4722-1 | Retro Hugo (2004); Prometheus Award (1984) | Fahrenheit 451 (Francois Truffaut; 1966) |
| The Gods Themselves | Isaac Asimov | 1972 | 0-385-02701-X | Nebula Award (1972); Hugo Award (1973); Locus Award (1973) |
The data was loaded and converted to a data frame with minimal manipulation (just minor clean-up). Of course, there is still a question about what to do with multiple award values for a given book. Film attributes can also be split accross multiple columns.
<?xml version="1.0" encoding="windows-1252"?>
<scifi_books>
<book id="1">
<title>2001: A Space Odyssey</title>
<author>Arthur C. Clarke</author>
<year>1968</year>
<isbn>0-453-00269-2</isbn>
<film name="2001: A Space Odyssey" director="Stanley Kubrick" year="1968" />
</book>
<book id="2">
<title>Fahrenheit 451</title>
<author>Ray Bradbury</author>
<year>1953</year>
<isbn>978-0-7432-4722-1</isbn>
<awardlist>
<award>
<name>Retro Hugo</name>
<year>2004</year>
</award>
<award>
<name>Prometheus Award</name>
<year>1984</year>
</award>
</awardlist>
<film name="Fahrenheit 451" director="Francois Truffaut" year="1966" />
</book>
<book id="3">
<title>The Gods Themselves</title>
<author>Isaac Asimov</author>
<year>1972</year>
<isbn>0-385-02701-X</isbn>
<awardlist>
<award>
<name>Nebula Award</name>
<year>1972</year>
</award>
<award>
<name>Hugo Award</name>
<year>1973</year>
</award>
<award>
<name>Locus Award</name>
<year>1973</year>
</award>
</awardlist>
</book>
</scifi_books>
# Import XML file
xmlFile <- getURL("https://raw.githubusercontent.com/ilyakats/CUNY-DATA607/master/books.xml")
# Parse XML and get root
booksXML <- xmlParse(xmlFile)
root <- xmlRoot(booksXML)
# Convert main fields to data frame
booksXML <- xmlToDataFrame(root)
# Loop through XML nodes to get film attributes
for (i in 1:xmlSize(root)) {
if (!is.null(root[[i]][["film"]])) {
booksXML[i, "film"] <- paste(xmlAttrs(root[[i]][["film"]]), collapse = " - ")
}
}
# Loop thorugh XML nodes to collect award data
for (i in 1:xmlSize(root)) {
if (!is.null(root[[i]][["awardlist"]])) {
awards <- xmlChildren(root[[i]][["awardlist"]])
awardlist <- ""
for (j in 1:xmlSize(awards)) {
jaward <- paste(xmlValue(awards[[j]][["name"]]), xmlValue(awards[[j]][["year"]]), sep = " - ")
if (j==1) {sep <- ""} else {sep <- "; "}
awardlist <- paste(awardlist, jaward, sep = sep)
}
booksXML[i, "awardlist"] <- awardlist
}
}
| title | author | year | isbn | film | awardlist |
|---|---|---|---|---|---|
| 2001: A Space Odyssey | Arthur C. Clarke | 1968 | 0-453-00269-2 | 2001: A Space Odyssey - Stanley Kubrick - 1968 | NA |
| Fahrenheit 451 | Ray Bradbury | 1953 | 978-0-7432-4722-1 | Fahrenheit 451 - Francois Truffaut - 1966 | Retro Hugo - 2004; Prometheus Award - 1984 |
| The Gods Themselves | Isaac Asimov | 1972 | 0-385-02701-X | NA | Nebula Award - 1972; Hugo Award - 1973; Locus Award - 1973 |
XML format required more work. Main fields were easily converted, but it took a couple of loops to traverse the XML tree structure to gather film attributes and award information. For this exercise, multiple awards were merged into a single entry similar to HTML output.
{"scifi_books" : [
{
"title" : "2001: A Space Odyssey",
"author" : "Arthur C. Clarke",
"year" : "1968",
"isbn" : "0-453-00269-2",
"film" : [{
"name" : "2001: A Space Odyssey",
"director" : "Stanley Kubrick",
"year" : "1968"
}]
},
{
"title" : "Fahrenheit 451",
"author" : "Ray Bradbury",
"year" : "1953",
"isbn" : "978-0-7432-4722-1",
"award" : [{
"name" : "Retro Hugo",
"year" : "2004"
},
{
"name" : "Prometheus Award",
"year" : "1984"
}],
"film" : [{
"name" : "Fahrenheit 451",
"director" : "Francois Truffaut",
"year" : "1966"
}]
},
{
"title" : "The Gods Themselves",
"author" : "Isaac Asimov",
"year" : "1972",
"isbn" : "0-385-02701-X",
"award" : [{
"name" : "Nebula Award",
"year" : "1972"
},
{
"name" : "Hugo Award",
"year" : "1973"
},
{
"name" : "Locus Award",
"year" : "1973"
}]
}]
}
# Import JSON file
jsonFile <- getURL("https://raw.githubusercontent.com/ilyakats/CUNY-DATA607/master/books.json")
# Parse book data
booksJSON <- jsonFile %>%
enter_object("scifi_books") %>%
gather_array("book.id") %>%
spread_values(title = jstring("title"),
author = jstring("author"),
year = jnumber("year"),
isbn = jstring("isbn"))
# Parse award data
awards <- jsonFile %>%
enter_object("scifi_books") %>%
gather_array("book.id") %>% # Get id for all books to join later
enter_object("award") %>% # Get a listing of all awards
gather_array() %>%
spread_values(award.name = jstring("name"),
award.year = jnumber("year"))
# Parse film data
films <- jsonFile %>%
enter_object("scifi_books") %>%
gather_array("book.id") %>% # Get id for all books to join later
enter_object("film") %>% # Get a listing of all films
gather_array() %>%
spread_values(film.title = jstring("name"),
film.director = jstring("director"),
film.year = jnumber("year"))
# Merge book, award and film data
booksJSONflat <- booksJSON %>%
left_join(awards, by = "book.id") %>%
left_join(films, by = "book.id") %>%
select(book.id, title, author, year, isbn, award.name, award.year,
film.title, film.director, film.year)
| book.id | title | author | year | isbn | award.name | award.year | film.title | film.director | film.year |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2001: A Space Odyssey | Arthur C. Clarke | 1968 | 0-453-00269-2 | NA | NA | 2001: A Space Odyssey | Stanley Kubrick | 1968 |
| 2 | Fahrenheit 451 | Ray Bradbury | 1953 | 978-0-7432-4722-1 | Retro Hugo | 2004 | Fahrenheit 451 | Francois Truffaut | 1966 |
| 2 | Fahrenheit 451 | Ray Bradbury | 1953 | 978-0-7432-4722-1 | Prometheus Award | 1984 | Fahrenheit 451 | Francois Truffaut | 1966 |
| 3 | The Gods Themselves | Isaac Asimov | 1972 | 0-385-02701-X | Nebula Award | 1972 | NA | NA | NA |
| 3 | The Gods Themselves | Isaac Asimov | 1972 | 0-385-02701-X | Hugo Award | 1973 | NA | NA | NA |
| 3 | The Gods Themselves | Isaac Asimov | 1972 | 0-385-02701-X | Locus Award | 1973 | NA | NA | NA |
With some research tidyjson package turned out to be very powerful in manipulating JSON data. For this exercise I went with a long structure for the data frame based on award data.
It took quite a bit of time to figure out how to traverse different data structures. XML and JSON allow for the most flexibility in manipulating stored data. HTML is simplier to handle, but much less flexible.