Task

For this assignment I have created a listing of several of my favorite science fiction books in three formats - HTML table, XML and JSON. All files were created by hand in a text editor. All have just one author; however, I have added the awards information to satisfy the requirement of having multiple values in a field. The task is to import all files and convert them to data frames. Original file content and all R code is below.

Required Libraries

library(dplyr)
library(stringr)
library(XML)       # For HTML and XML manipulation
library(RCurl)
library(tidyjson)  # For JSON manipulation

HTML

Original File

<HTML>
<HEAD />
<BODY>
<table border=1, cellpadding=5>
    <caption>Ilya's Favorite Science Fiction Books</caption>
    <tr bgcolor=silver>
        <th id="title">Book Title</th>
        <th id="author">Author</th>
        <th id="year">Year</th>
        <th id="isbn">ISBN</th>
        <th id="award">Awards</th>
        <th id="film">Film Adaptation</th>
    </tr><tr>
        <td headers="title">2001: A Space Odyssey</td>
        <td headers="author">Arthur C. Clarke</td>
        <td headers="year">1968</td>
        <td headers="isbn">0-453-00269-2</td>
        <td headers="award"></td>
        <td headers="film">2001: A Space Odyssey (Stanley Kubrick; 1968)</td>
    </tr><tr>
        <td headers="title">Fahrenheit 451</td>
        <td headers="author">Ray Bradbury</td>
        <td headers="year">1953</td>
        <td headers="isbn">978-0-7432-4722-1</td>
        <td headers="award">Retro Hugo (2004)<BR>
            Prometheus Award (1984)</td>
        <td headers="film">Fahrenheit 451 (Francois Truffaut; 1966)</td>
    </tr><tr>
        <td headers="title">The Gods Themselves</td>
        <td headers="author">Isaac Asimov</td>
        <td headers="year">1972</td>
        <td headers="isbn">0-385-02701-X</td>
        <td headers="award">Nebula Award (1972)<BR>
            Hugo Award (1973)<BR>
            Locus Award (1973)</td>
        <td headers="film"></td>
    </tr>
</table>
</BODY>
</HTML>

Data Import and Manipulation

# Import HTML file
htmlFile <- getURL("https://raw.githubusercontent.com/ilyakats/CUNY-DATA607/master/books.html")

# Parse HTML table
booksHTML <- readHTMLTable(htmlFile, header = TRUE)

# Convert to data frame and adjust column names
booksHTML <- as.data.frame(booksHTML)
colnames(booksHTML) <- c("title", "author", "year", "isbn", "award", "film")

# Adjust award column
booksHTML$award <- str_replace_all(booksHTML$award, "\t", "")
booksHTML$award <- str_replace_all(booksHTML$award, "\n", "; ")

Converted Data Frame

title author year isbn award film
2001: A Space Odyssey Arthur C. Clarke 1968 0-453-00269-2 2001: A Space Odyssey (Stanley Kubrick; 1968)
Fahrenheit 451 Ray Bradbury 1953 978-0-7432-4722-1 Retro Hugo (2004); Prometheus Award (1984) Fahrenheit 451 (Francois Truffaut; 1966)
The Gods Themselves Isaac Asimov 1972 0-385-02701-X Nebula Award (1972); Hugo Award (1973); Locus Award (1973)

The data was loaded and converted to a data frame with minimal manipulation (just minor clean-up). Of course, there is still a question about what to do with multiple award values for a given book. Film attributes can also be split accross multiple columns.

XML

Original File

<?xml version="1.0" encoding="windows-1252"?>
<scifi_books>
    <book id="1">
        <title>2001: A Space Odyssey</title>
        <author>Arthur C. Clarke</author>
        <year>1968</year>
        <isbn>0-453-00269-2</isbn>
        <film name="2001: A Space Odyssey" director="Stanley Kubrick" year="1968" />
    </book>
    <book id="2">
        <title>Fahrenheit 451</title>
        <author>Ray Bradbury</author>
        <year>1953</year>
        <isbn>978-0-7432-4722-1</isbn>
        <awardlist>
            <award>
                <name>Retro Hugo</name>
                <year>2004</year>
            </award>
            <award>
                <name>Prometheus Award</name>
                <year>1984</year>
            </award>
        </awardlist>
        <film name="Fahrenheit 451" director="Francois Truffaut" year="1966" />
    </book>
    <book id="3">
        <title>The Gods Themselves</title>
        <author>Isaac Asimov</author>
        <year>1972</year>
        <isbn>0-385-02701-X</isbn>
        <awardlist>
            <award>
                <name>Nebula Award</name>
                <year>1972</year>
            </award>
            <award>
                <name>Hugo Award</name>
                <year>1973</year>
            </award>
            <award>
                <name>Locus Award</name>
                <year>1973</year>
            </award>
        </awardlist>
    </book>
</scifi_books>

Data Import and Manipulation

# Import XML file
xmlFile <- getURL("https://raw.githubusercontent.com/ilyakats/CUNY-DATA607/master/books.xml")

# Parse XML and get root
booksXML <- xmlParse(xmlFile)
root <- xmlRoot(booksXML)

# Convert main fields to data frame
booksXML <- xmlToDataFrame(root)

# Loop through XML nodes to get film attributes
for (i in 1:xmlSize(root)) {
  if (!is.null(root[[i]][["film"]])) { 
    booksXML[i, "film"] <- paste(xmlAttrs(root[[i]][["film"]]), collapse = " - ")
  }
}

# Loop thorugh XML nodes to collect award data
for (i in 1:xmlSize(root)) {
  if (!is.null(root[[i]][["awardlist"]])) { 
    awards <- xmlChildren(root[[i]][["awardlist"]])
    awardlist <- ""
    for (j in 1:xmlSize(awards)) {
      jaward <- paste(xmlValue(awards[[j]][["name"]]), xmlValue(awards[[j]][["year"]]), sep = " - ")
      if (j==1) {sep <- ""} else {sep <- "; "}
      awardlist <- paste(awardlist, jaward, sep = sep)
    }
    booksXML[i, "awardlist"] <- awardlist
  }
}

Converted Data Frame

title author year isbn film awardlist
2001: A Space Odyssey Arthur C. Clarke 1968 0-453-00269-2 2001: A Space Odyssey - Stanley Kubrick - 1968 NA
Fahrenheit 451 Ray Bradbury 1953 978-0-7432-4722-1 Fahrenheit 451 - Francois Truffaut - 1966 Retro Hugo - 2004; Prometheus Award - 1984
The Gods Themselves Isaac Asimov 1972 0-385-02701-X NA Nebula Award - 1972; Hugo Award - 1973; Locus Award - 1973

XML format required more work. Main fields were easily converted, but it took a couple of loops to traverse the XML tree structure to gather film attributes and award information. For this exercise, multiple awards were merged into a single entry similar to HTML output.

JSON

Original File

{"scifi_books" : [
    {
    "title" : "2001: A Space Odyssey",
    "author" : "Arthur C. Clarke",
    "year" : "1968",
    "isbn" : "0-453-00269-2",
    "film" : [{
        "name" : "2001: A Space Odyssey",
        "director" : "Stanley Kubrick",
        "year" : "1968"
        }]
    },
    {
    "title" : "Fahrenheit 451",
    "author" : "Ray Bradbury",
    "year" : "1953",
    "isbn" : "978-0-7432-4722-1",
    "award" : [{
        "name" : "Retro Hugo",
        "year" : "2004"
        },
        {
        "name" : "Prometheus Award",
        "year" : "1984"
        }],
    "film" : [{
        "name" : "Fahrenheit 451",
        "director" : "Francois Truffaut",
        "year" : "1966"
        }]
    },
    {
    "title" : "The Gods Themselves",
    "author" : "Isaac Asimov",
    "year" : "1972",
    "isbn" : "0-385-02701-X",
    "award" : [{
        "name" : "Nebula Award",
        "year" : "1972"
        },
        {
        "name" : "Hugo Award",
        "year" : "1973"
        },
        {
        "name" : "Locus Award",
        "year" : "1973"
        }]
    }]
}

Data Import and Manipulation

# Import JSON file
jsonFile <- getURL("https://raw.githubusercontent.com/ilyakats/CUNY-DATA607/master/books.json")

# Parse book data
booksJSON <- jsonFile %>% 
  enter_object("scifi_books") %>% 
  gather_array("book.id") %>% 
  spread_values(title = jstring("title"), 
                author = jstring("author"),
                year = jnumber("year"), 
                isbn = jstring("isbn"))

# Parse award data
awards <- jsonFile %>% 
  enter_object("scifi_books") %>% 
  gather_array("book.id") %>%                    # Get id for all books to join later
  enter_object("award") %>%                      # Get a listing of all awards
  gather_array() %>% 
  spread_values(award.name = jstring("name"),
                award.year = jnumber("year"))

# Parse film data
films <- jsonFile %>% 
  enter_object("scifi_books") %>% 
  gather_array("book.id") %>%                    # Get id for all books to join later
  enter_object("film") %>%                       # Get a listing of all films
  gather_array() %>% 
  spread_values(film.title = jstring("name"),
                film.director = jstring("director"),
                film.year = jnumber("year"))

# Merge book, award and film data
booksJSONflat <- booksJSON %>% 
  left_join(awards, by = "book.id") %>% 
  left_join(films, by = "book.id") %>% 
  select(book.id, title, author, year, isbn, award.name, award.year, 
         film.title, film.director, film.year)

Converted Data Frame

book.id title author year isbn award.name award.year film.title film.director film.year
1 2001: A Space Odyssey Arthur C. Clarke 1968 0-453-00269-2 NA NA 2001: A Space Odyssey Stanley Kubrick 1968
2 Fahrenheit 451 Ray Bradbury 1953 978-0-7432-4722-1 Retro Hugo 2004 Fahrenheit 451 Francois Truffaut 1966
2 Fahrenheit 451 Ray Bradbury 1953 978-0-7432-4722-1 Prometheus Award 1984 Fahrenheit 451 Francois Truffaut 1966
3 The Gods Themselves Isaac Asimov 1972 0-385-02701-X Nebula Award 1972 NA NA NA
3 The Gods Themselves Isaac Asimov 1972 0-385-02701-X Hugo Award 1973 NA NA NA
3 The Gods Themselves Isaac Asimov 1972 0-385-02701-X Locus Award 1973 NA NA NA

With some research tidyjson package turned out to be very powerful in manipulating JSON data. For this exercise I went with a long structure for the data frame based on award data.

Conclusion

It took quite a bit of time to figure out how to traverse different data structures. XML and JSON allow for the most flexibility in manipulating stored data. HTML is simplier to handle, but much less flexible.