DATA 607 Week 7 Assignment: HTML, XML, JSON

Task

For this assignment I have created a listing of several of my favorite science fiction books in three formats - HTML table, XML and JSON. All files were created by hand in a text editor. All have just one author; however, I have added the awards information to satisfy the requirement of having multiple values in a field. The task is to import all files and convert them to data frames. Original file content and all R code is below.

Required Libraries

library(dplyr)
library(stringr)
library(XML)       # For HTML and XML manipulation
library(RCurl)
library(tidyjson)  # For JSON manipulation

HTML

Original File

<HTML>
<HEAD />
<BODY>
<table border=1, cellpadding=5>
    <caption>Ilya's Favorite Science Fiction Books</caption>
    <tr bgcolor=silver>
        <th id="title">Book Title</th>
        <th id="author">Author</th>
        <th id="year">Year</th>
        <th id="isbn">ISBN</th>
        <th id="award">Awards</th>
        <th id="film">Film Adaptation</th>
    </tr><tr>
        <td headers="title">2001: A Space Odyssey</td>
        <td headers="author">Arthur C. Clarke</td>
        <td headers="year">1968</td>
        <td headers="isbn">0-453-00269-2</td>
        <td headers="award"></td>
        <td headers="film">2001: A Space Odyssey (Stanley Kubrick; 1968)</td>
    </tr><tr>
        <td headers="title">Fahrenheit 451</td>
        <td headers="author">Ray Bradbury</td>
        <td headers="year">1953</td>
        <td headers="isbn">978-0-7432-4722-1</td>
        <td headers="award">Retro Hugo (2004)<BR>
            Prometheus Award (1984)</td>
        <td headers="film">Fahrenheit 451 (Francois Truffaut; 1966)</td>
    </tr><tr>
        <td headers="title">The Gods Themselves</td>
        <td headers="author">Isaac Asimov</td>
        <td headers="year">1972</td>
        <td headers="isbn">0-385-02701-X</td>
        <td headers="award">Nebula Award (1972)<BR>
            Hugo Award (1973)<BR>
            Locus Award (1973)</td>
        <td headers="film"></td>
    </tr>
</table>
</BODY>
</HTML>

Data Import and Manipulation

# Import HTML file
htmlFile <- getURL("https://raw.githubusercontent.com/ilyakats/CUNY-DATA607/master/books.html")

# Parse HTML table
booksHTML <- readHTMLTable(htmlFile, header = TRUE)

# Convert to data frame and adjust column names
booksHTML <- as.data.frame(booksHTML)
colnames(booksHTML) <- c("title", "author", "year", "isbn", "award", "film")

# Adjust award column
booksHTML$award <- str_replace_all(booksHTML$award, "\t", "")
booksHTML$award <- str_replace_all(booksHTML$award, "\n", "; ")

Converted Data Frame

title	author	year	isbn	award	film
2001: A Space Odyssey	Arthur C. Clarke	1968	0-453-00269-2		2001: A Space Odyssey (Stanley Kubrick; 1968)
Fahrenheit 451	Ray Bradbury	1953	978-0-7432-4722-1	Retro Hugo (2004); Prometheus Award (1984)	Fahrenheit 451 (Francois Truffaut; 1966)
The Gods Themselves	Isaac Asimov	1972	0-385-02701-X	Nebula Award (1972); Hugo Award (1973); Locus Award (1973)

The data was loaded and converted to a data frame with minimal manipulation (just minor clean-up). Of course, there is still a question about what to do with multiple award values for a given book. Film attributes can also be split accross multiple columns.

XML

Original File

<?xml version="1.0" encoding="windows-1252"?>
<scifi_books>
    <book id="1">
        <title>2001: A Space Odyssey</title>
        <author>Arthur C. Clarke</author>
        <year>1968</year>
        <isbn>0-453-00269-2</isbn>
        <film name="2001: A Space Odyssey" director="Stanley Kubrick" year="1968" />
    </book>
    <book id="2">
        <title>Fahrenheit 451</title>
        <author>Ray Bradbury</author>
        <year>1953</year>
        <isbn>978-0-7432-4722-1</isbn>
        <awardlist>
            <award>
                <name>Retro Hugo</name>
                <year>2004</year>
            </award>
            <award>
                <name>Prometheus Award</name>
                <year>1984</year>
            </award>
        </awardlist>
        <film name="Fahrenheit 451" director="Francois Truffaut" year="1966" />
    </book>
    <book id="3">
        <title>The Gods Themselves</title>
        <author>Isaac Asimov</author>
        <year>1972</year>
        <isbn>0-385-02701-X</isbn>
        <awardlist>
            <award>
                <name>Nebula Award</name>
                <year>1972</year>
            </award>
            <award>
                <name>Hugo Award</name>
                <year>1973</year>
            </award>
            <award>
                <name>Locus Award</name>
                <year>1973</year>
            </award>
        </awardlist>
    </book>
</scifi_books>

Data Import and Manipulation

# Import XML file
xmlFile <- getURL("https://raw.githubusercontent.com/ilyakats/CUNY-DATA607/master/books.xml")

# Parse XML and get root
booksXML <- xmlParse(xmlFile)
root <- xmlRoot(booksXML)

# Convert main fields to data frame
booksXML <- xmlToDataFrame(root)

# Loop through XML nodes to get film attributes
for (i in 1:xmlSize(root)) {
  if (!is.null(root[[i]][["film"]])) { 
    booksXML[i, "film"] <- paste(xmlAttrs(root[[i]][["film"]]), collapse = " - ")
  }
}

# Loop thorugh XML nodes to collect award data
for (i in 1:xmlSize(root)) {
  if (!is.null(root[[i]][["awardlist"]])) { 
    awards <- xmlChildren(root[[i]][["awardlist"]])
    awardlist <- ""
    for (j in 1:xmlSize(awards)) {
      jaward <- paste(xmlValue(awards[[j]][["name"]]), xmlValue(awards[[j]][["year"]]), sep = " - ")
      if (j==1) {sep <- ""} else {sep <- "; "}
      awardlist <- paste(awardlist, jaward, sep = sep)
    }
    booksXML[i, "awardlist"] <- awardlist
  }
}

Converted Data Frame

title	author	year	isbn	film	awardlist
2001: A Space Odyssey	Arthur C. Clarke	1968	0-453-00269-2	2001: A Space Odyssey - Stanley Kubrick - 1968	NA
Fahrenheit 451	Ray Bradbury	1953	978-0-7432-4722-1	Fahrenheit 451 - Francois Truffaut - 1966	Retro Hugo - 2004; Prometheus Award - 1984
The Gods Themselves	Isaac Asimov	1972	0-385-02701-X	NA	Nebula Award - 1972; Hugo Award - 1973; Locus Award - 1973

XML format required more work. Main fields were easily converted, but it took a couple of loops to traverse the XML tree structure to gather film attributes and award information. For this exercise, multiple awards were merged into a single entry similar to HTML output.

JSON

Original File

{"scifi_books" : [
    {
    "title" : "2001: A Space Odyssey",
    "author" : "Arthur C. Clarke",
    "year" : "1968",
    "isbn" : "0-453-00269-2",
    "film" : [{
        "name" : "2001: A Space Odyssey",
        "director" : "Stanley Kubrick",
        "year" : "1968"
        }]
    },
    {
    "title" : "Fahrenheit 451",
    "author" : "Ray Bradbury",
    "year" : "1953",
    "isbn" : "978-0-7432-4722-1",
    "award" : [{
        "name" : "Retro Hugo",
        "year" : "2004"
        },
        {
        "name" : "Prometheus Award",
        "year" : "1984"
        }],
    "film" : [{
        "name" : "Fahrenheit 451",
        "director" : "Francois Truffaut",
        "year" : "1966"
        }]
    },
    {
    "title" : "The Gods Themselves",
    "author" : "Isaac Asimov",
    "year" : "1972",
    "isbn" : "0-385-02701-X",
    "award" : [{
        "name" : "Nebula Award",
        "year" : "1972"
        },
        {
        "name" : "Hugo Award",
        "year" : "1973"
        },
        {
        "name" : "Locus Award",
        "year" : "1973"
        }]
    }]
}

Data Import and Manipulation

# Import JSON file
jsonFile <- getURL("https://raw.githubusercontent.com/ilyakats/CUNY-DATA607/master/books.json")

# Parse book data
booksJSON <- jsonFile %>% 
  enter_object("scifi_books") %>% 
  gather_array("book.id") %>% 
  spread_values(title = jstring("title"), 
                author = jstring("author"),
                year = jnumber("year"), 
                isbn = jstring("isbn"))

# Parse award data
awards <- jsonFile %>% 
  enter_object("scifi_books") %>% 
  gather_array("book.id") %>%                    # Get id for all books to join later
  enter_object("award") %>%                      # Get a listing of all awards
  gather_array() %>% 
  spread_values(award.name = jstring("name"),
                award.year = jnumber("year"))

# Parse film data
films <- jsonFile %>% 
  enter_object("scifi_books") %>% 
  gather_array("book.id") %>%                    # Get id for all books to join later
  enter_object("film") %>%                       # Get a listing of all films
  gather_array() %>% 
  spread_values(film.title = jstring("name"),
                film.director = jstring("director"),
                film.year = jnumber("year"))

# Merge book, award and film data
booksJSONflat <- booksJSON %>% 
  left_join(awards, by = "book.id") %>% 
  left_join(films, by = "book.id") %>% 
  select(book.id, title, author, year, isbn, award.name, award.year, 
         film.title, film.director, film.year)

Converted Data Frame

book.id	title	author	year	isbn	award.name	award.year	film.title	film.director	film.year
1	2001: A Space Odyssey	Arthur C. Clarke	1968	0-453-00269-2	NA	NA	2001: A Space Odyssey	Stanley Kubrick	1968
2	Fahrenheit 451	Ray Bradbury	1953	978-0-7432-4722-1	Retro Hugo	2004	Fahrenheit 451	Francois Truffaut	1966
2	Fahrenheit 451	Ray Bradbury	1953	978-0-7432-4722-1	Prometheus Award	1984	Fahrenheit 451	Francois Truffaut	1966
3	The Gods Themselves	Isaac Asimov	1972	0-385-02701-X	Nebula Award	1972	NA	NA	NA
3	The Gods Themselves	Isaac Asimov	1972	0-385-02701-X	Hugo Award	1973	NA	NA	NA
3	The Gods Themselves	Isaac Asimov	1972	0-385-02701-X	Locus Award	1973	NA	NA	NA

With some research tidyjson package turned out to be very powerful in manipulating JSON data. For this exercise I went with a long structure for the data frame based on award data.

Conclusion

It took quite a bit of time to figure out how to traverse different data structures. XML and JSON allow for the most flexibility in manipulating stored data. HTML is simplier to handle, but much less flexible.

DATA 607 Week 7 Assignment: HTML, XML, JSON

Ilya Kats

March 18, 2017

Task

Required Libraries

HTML

Original File

Data Import and Manipulation

Converted Data Frame

XML

Original File

Data Import and Manipulation

Converted Data Frame

JSON

Original File

Data Import and Manipulation

Converted Data Frame

Conclusion