DATA 607 Week 8 Assignment

Loading Packages

For this assignment, I used the XML package as the book recommends, but ran into some issues reading the hosted files on GitHub. I found the xml2 package and was able to use the two in combination to successfully read the HTML and XML files.

packages <- c("XML", "xml2", "jsonlite")
lapply(packages, library, character.only = TRUE)

Creating the Files

All three file types were created “by hand” in my IDE using the appropriate structure for each type, and saved as the appropriate file type needed for the assignment (.html, .json, and .xml). For the attributes included in the files, I used the book title and author(s) as required for the assignment, as well as number of pages, the publisher, and the two different ISBN numbers for the books.

Reading an HTML Table to Data Frame

I believe there is an isue with using the XML package functions and https:// URLs. xml2’s read_html was able to read both the html and XML files by just passing the address as a string. I was then able to combine the functions from XML with the read html file.

Since we are just trying to create a data frame from the whole HTML table, I simply passed the read URL into htmlParse, stored it in an object, and then cascaded this down through readHTMLTable and data.frame to create the data frame.

html_url <- "https://raw.githubusercontent.com/Logan213/DATA607_Week8/master/book.html"
html_url <- read_html(html_url)

parsed_books1 <- htmlParse(html_url)
books_table <- readHTMLTable(parsed_books1)

html.df <- data.frame(books_table, stringsAsFactors = FALSE)

# split up data frame for readability:
html.df[1]

##   NULL.ID
## 1     001
## 2     002
## 3     003

html.df[2]

##                                                                                   NULL.Title
## 1 Understanding Audio: Getting the Most Out of Your Project of Professional Recording Studio
## 2                    Here, There, and Everywhere: My Life Recording the Music of The Beatles
## 3                                                        Mixing Secrets for the Small Studio

html.df[3:8]

##         NULL.Author1  NULL.Author2 NULL.Pages NULL.Publisher NULL.ISBN.10
## 1 Daniel M. Thompson                      368  Berklee Press   0634009591
## 2      Geoff Emerick Howard Massey        400         Gotham  15924017791
## 3        Mike Senior                      352    Focal Press   0240815807
##     NULL.ISBN.13
## 1 978-0634009594
## 2               
## 3 978-0240815800

Reading a JSON File to Data Frame

To parse over the JSON file, I knew about another package other than the one mentioned in the text (RJSONIO), whihc is jsonlite. This package will read a file from a given URL, parse over it, and then turn it into a nested data frame all in one step (fromJSON function).

In order to turn the returned data frame into a format we are familiar with, the flatten function from jsonlite will turn any of the nested data frames from a column into a regular 2-dimensional data frame.

json_url <- "https://raw.githubusercontent.com/Logan213/DATA607_Week8/master/book.json"
json_file <- fromJSON(json_url)

json.df <- flatten(data.frame(json_file, stringsAsFactors = FALSE))
json.df

##   Books.id
## 1      001
## 2      002
## 3      003
##                                                                                  Books.title
## 1 Understanding Audio: Getting the Most Out of Your Project of Professional Recording Studio
## 2                    Here, There, and Everywhere: My Life Recording the Music of The Beatles
## 3                                                        Mixing Secrets for the Small Studio
##   Books.pages Books.publisher Books.ISBN.10  Books.ISBN.13
## 1         368   Berklee Press    0634009591 978-0634009594
## 2         400          Gotham   15924017791               
## 3         352     Focal Press    0240815807 978-0240815800
##   Books.authors.author1 Books.authors.author2
## 1    Daniel M. Thompson                  <NA>
## 2         Geoff Emerick         Howard Massey
## 3           Mike Senior                  <NA>

Reading an XML File to Data Frame

Similar to the HTML table above, I had to use a combination of xml2’s read_xml function and the other functions in the XML package. This time, I followed the steps outlined in the text, which consists of parsing over the file, then storing the top-level node in an object.

# open XML file
xml_url <- "https://raw.githubusercontent.com/Logan213/DATA607_Week8/master/book.xml"
xml_url <- read_xml(xml_url)

parsed_books2 <- xmlParse(xml_url)
root <- xmlRoot(parsed_books2)

Just to check how to access different levels below are the returns for using the method using the root object storing the top-level node, and also using xml2’s xml_children function on the un-parsed URL:

#using XML package
root[[1]]

## <BOOK>
##   <ID>001</ID>
##   <TITLE>Understanding Audio: Getting the Most Out of Your Project of Professional Recording Studio</TITLE>
##   <AUTHOR1>Daniel M. Thompson</AUTHOR1>
##   <AUTHOR2> </AUTHOR2>
##   <PAGES>368</PAGES>
##   <PUBLISHER>Berklee Press</PUBLISHER>
##   <ISBN-10>0634009591</ISBN-10>
##   <ISBN-13>978-0634009594</ISBN-13>
## </BOOK>

# same thing in xml2
xml_children(xml_url)[1]

## {xml_nodeset (1)}
## [1] <BOOK>\n        <ID>001</ID>\n        <TITLE>Understanding Audio: Ge ...

There are some other functions in xml2 that I probably could have used in combination with other base-R functions to create the dataframe, but XML’s xmlToDataFrame seems wrap everything up in an easy-to-use package. Again, as demonstrated in the text, I passed the root object into this function and stored the result in an object.

xml.df <- xmlToDataFrame(root, stringsAsFactors = FALSE)
xml.df

##    ID
## 1 001
## 2 002
## 3 003
##                                                                                        TITLE
## 1 Understanding Audio: Getting the Most Out of Your Project of Professional Recording Studio
## 2                    Here, There, and Everywhere: My Life Recording the Music of The Beatles
## 3                                                        Mixing Secrets for the Small Studio
##              AUTHOR1       AUTHOR2 PAGES     PUBLISHER     ISBN-10
## 1 Daniel M. Thompson                 368 Berklee Press  0634009591
## 2      Geoff Emerick Howard Massey   400        Gotham 15924017791
## 3        Mike Senior                 352   Focal Press  0240815807
##          ISBN-13
## 1 978-0634009594
## 2               
## 3 978-0240815800

Conclusion

The files I created for this assignment had fairly basic structure, so the methods above worked fine. If the files were a little more complex, I might have needed to use some of the other functions included in the pacakges to get the same end result.

Since we wanted to compare the different data frames, I kept the parsing and creation of the data frames straight-forward. All three data frames have the same dimensions of 3 rows and 8 columns. The main differences are how empty nodes were treated (the json data frame has NAs where there no information for the authors or ISBN), the type of data in each column (the json and XML data frames consist of characters, not factors like the html data frame) and how the names for the data frames were created:

names(c(html.df, json.df, xml.df))

##  [1] "NULL.ID"               "NULL.Title"           
##  [3] "NULL.Author1"          "NULL.Author2"         
##  [5] "NULL.Pages"            "NULL.Publisher"       
##  [7] "NULL.ISBN.10"          "NULL.ISBN.13"         
##  [9] "Books.id"              "Books.title"          
## [11] "Books.pages"           "Books.publisher"      
## [13] "Books.ISBN.10"         "Books.ISBN.13"        
## [15] "Books.authors.author1" "Books.authors.author2"
## [17] "ID"                    "TITLE"                
## [19] "AUTHOR1"               "AUTHOR2"              
## [21] "PAGES"                 "PUBLISHER"            
## [23] "ISBN-10"               "ISBN-13"

The differences may be the result of how the files were read, parsed, and then fed into the data frame.