Data 607 Week 7 Assignment

Assignment Overview

This week’s assignment required creating three files, one each in HTML table, XML, and JSON format, containing information about three of our favorite books in a particular subject area, with at least one book having multiple authors. I chose three works of interactional sociology: two relatively well-known classics from the tradition and one more recent example. For each book I recorded the title, authors, year of publication, publisher, page count of the first edition as listed by Google Books, and number of citations according to Google Scholar.

Load required packages

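I used getURL() from the RCurl package to download each file from GitHub, the XML package to parse the XML and HTML files, and the jsonlite package to parse the JSON file. DT, tidyr, and dplyr are used further down to display and reshape the resulting data frames.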

library(RCurl)
library(XML)
library(jsonlite)
library(DT)
library(stringr)
library(tidyr)
library(dplyr)

Parse XML

# getURL() returns the file contents as a single string (not a URL, despite the object name)
xml.URL <- 
  getURL("https://raw.githubusercontent.com/juddanderman/cuny-data-607/master/Week7_Assignment/books.xml")
books.xml <- xmlParse(xml.URL)
root <- xmlRoot(books.xml)
xmlName(root)

## [1] "Sociology_Books"

xmlSize(root)

## [1] 3

I used xmlValue() in nested calls to xmlSApply() to retrieve the values of the grandchildren of the root node, which hold the relevant data about each of the selected books. The resulting matrix has one column per book, so I transposed it to get one row per book before storing it in a data frame.

xmlSApply(root, function(x) xmlSApply(x, xmlValue))

##                Book                                       
## Title          "The Presentation of Self in Everyday Life"
## Author         "Erving Goffman"                           
## Author         ""                                         
## Year_Published "1959"                                     
## Publisher      "Doubleday"                                
## Pages          "259"                                      
## Citations      "43536"                                    
##                Book                         
## Title          "Studies in Ethnomethodology"
## Author         "Harold Garfinkel"           
## Author         ""                           
## Year_Published "1967"                       
## Publisher      "Prentice-Hall"              
## Pages          "288"                        
## Citations      "3508"                       
##                Book                                                                            
## Title          "The Spectacle of History: Speech, Text, and Memory at the Iran-Contra Hearings"
## Author         "Michael E. Lynch"                                                              
## Author         "David Bogen"                                                                   
## Year_Published "1996"                                                                          
## Publisher      "Duke University Press"                                                         
## Pages          "368"                                                                           
## Citations      "378"

class(xmlSApply(root, function(x) xmlSApply(x, xmlValue)))

## [1] "matrix"

xml.df <- data.frame(t(xmlSApply(root, function(x) xmlSApply(x, xmlValue))), row.names = NULL)
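
The nested xmlSApply() pattern can be tried on a small inline document. The sketch below is illustrative only: the XML string, the placeholder titles and names, and the objects root.demo, values, and demo.df are assumptions made for the example and are not part of the assignment files.

```r
library(XML)

# Hypothetical inline document that mirrors the shape of books.xml
xml.text <- '<Sociology_Books>
  <Book><Title>Book A</Title><Author>First Author</Author><Author>Second Author</Author><Pages>100</Pages></Book>
  <Book><Title>Book B</Title><Author>Third Author</Author><Author>Fourth Author</Author><Pages>200</Pages></Book>
</Sociology_Books>'

doc <- xmlParse(xml.text, asText = TRUE)
root.demo <- xmlRoot(doc)

# Inner call: one value per child element of a <Book>
# Outer call: one column per <Book>, giving a character matrix
values <- xmlSApply(root.demo, function(x) xmlSApply(x, xmlValue))

# Transpose so that each book becomes a row; data.frame() then makes the
# duplicated "Author" column name unique
demo.df <- data.frame(t(values), row.names = NULL, stringsAsFactors = FALSE)
names(demo.df)
```

Because every Book element here has the same number of children, the result simplifies to a matrix; books with differing numbers of child elements would produce a list instead and would need separate handling.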

Parse HTML Table

html.URL <- 
  getURL("https://raw.githubusercontent.com/juddanderman/cuny-data-607/master/Week7_Assignment/books.html")
books.html <- readHTMLTable(html.URL, header = TRUE)
books.html

## $`Sociology Books`
##                                                                            Title
## 1                                      The Presentation of Self in Everyday Life
## 2                                                    Studies in Ethnomethodology
## 3 The Spectacle of History: Speech, Text, and Memory at the Iran-Contra Hearings
##              Author      Author Year Published             Publisher Pages
## 1    Erving Goffman                       1959             Doubleday   259
## 2 Herbert Garfinkel                       1967         Prentice-Hall   288
## 3  Michael E. Lynch David Bogen           1996 Duke University Press   368
##   Citations
## 1     43536
## 2      3508
## 3       378

class(books.html)

## [1] "list"

html.df <- data.frame(books.html$`Sociology Books`)
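
The printed table above keeps the original headers ("Author" twice, "Year Published" with a space), but data.frame() by default checks column names and rewrites them to be unique and syntactically valid. A minimal base R sketch of that renaming, using hypothetical objects headers and demo:

```r
# Headers as they appear in the HTML table
headers <- c("Title", "Author", "Author", "Year Published")

# make.names() is what data.frame(check.names = TRUE) relies on:
# spaces become periods and duplicates get a numeric suffix
fixed <- make.names(headers, unique = TRUE)
fixed

# An underscore is already valid in an R name, so it is left alone
make.names("Year_Published")

# The same renaming happens when an existing data frame is passed to data.frame()
demo <- data.frame(`Year Published` = 1959, check.names = FALSE)
names(data.frame(demo))
```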

Parse JSON

json.URL <- 
  getURL("https://raw.githubusercontent.com/juddanderman/cuny-data-607/master/Week7_Assignment/books.json")
books.json <- fromJSON(json.URL)
books.json

## $`Sociology Books`
##                                                                            Title
## 1                                      The Presentation of Self in Everyday Life
## 2                                                    Studies in Ethnomethodology
## 3 The Spectacle of History: Speech, Text, and Memory at the Iran-Contra Hearings
##                          Author Year Published             Publisher Pages
## 1                Erving Goffman           1959             Doubleday   259
## 2              Harold Garfinkel           1967         Prentice-Hall   288
## 3 Michael E. Lynch, David Bogen           1996 Duke University Press   368
##   Citations
## 1     43536
## 2      3508
## 3       378

class(books.json)

## [1] "list"

json.df <- data.frame(books.json$`Sociology Books`)

Output Contents of R Data Frames

options(DT.options = list(dom = 't', scrollX = TRUE))

datatable(xml.df)

datatable(html.df)

datatable(json.df)

Without additional processing, the data frames generated from the three files are similar but not identical. The data frames derived from the XML and HTML table files have the same structure but differ in two respects. First, the column name for year of publication is Year_Published in xml.df and Year.Published in html.df; this could have been avoided by using a period instead of an underscore in the relevant element names of the original XML file. Second, the HTML file lists the author of Studies in Ethnomethodology as Herbert Garfinkel, whereas the XML and JSON files give Harold Garfinkel, which is the correct name, so the HTML source contains a data-entry error. The json.df data frame has a slightly different structure from the other two because I used an array to store the two author names for the third book. As a result, the Author column was parsed as a list rather than as an atomic vector.

is.atomic(books.json$`Sociology Books`$Author)

## [1] FALSE

is.atomic(books.json$`Sociology Books`$Title)

## [1] TRUE
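
To see why a JSON array produces a list column, here is a small self-contained sketch. The JSON string, the key "Books", and the object demo are hypothetical stand-ins for the assignment file, chosen only to reproduce the mix of a single author string and an array of authors.

```r
library(jsonlite)

# Hypothetical JSON: one book with a single author string,
# one book whose authors are stored as an array
json.text <- '{"Books": [
  {"Title": "Book A", "Author": "First Author"},
  {"Title": "Book B", "Author": ["Second Author", "Third Author"]}
]}'

demo <- fromJSON(json.text)$Books

# Title simplifies to a character vector; Author cannot, because its
# entries have different lengths, so it stays a list
is.atomic(demo$Title)
is.list(demo$Author)
lengths(demo$Author)
```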

The json.df data frame can be made to resemble the other two by collapsing each book's authors into a single string and then separating that string into two columns, as below.

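# Collapse each book's author vector into one comma-delimited string, then split it
# into two columns; fill = "right" pads single-author rows with NA without a warning
json.df <- json.df %>% 
  mutate(Author = sapply(Author, paste, collapse = ",")) %>%
  separate(Author, c("Author", "Author.1"), sep = ",", fill = "right")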

datatable(json.df)
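
One remaining difference is how a missing second author is represented: xml.df stores it as an empty string, while separate() leaves NA. The sketch below, on a hypothetical toy data frame demo, shows the collapse-and-split step followed by an optional replace_na() call that would bring the two representations in line. It assumes author names contain no commas, as is the case for the three books here.

```r
library(dplyr)
library(tidyr)

# Hypothetical data frame with a list column, shaped like the parsed JSON
demo <- data.frame(Title = c("Book A", "Book B"), stringsAsFactors = FALSE)
demo$Author <- list("First Author", c("Second Author", "Third Author"))

flat <- demo %>%
  # One comma-delimited string per book
  mutate(Author = sapply(Author, paste, collapse = ",")) %>%
  # Split into two columns; single-author rows get NA in Author.1
  separate(Author, c("Author", "Author.1"), sep = ",", fill = "right") %>%
  # Optional: use "" instead of NA for a missing second author
  replace_na(list(Author.1 = ""))

flat
```

Whether to keep NA or an empty string is a design choice: NA is the more conventional marker for missing data in R, while the empty string matches what the XML parse produced.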