This week’s assignment was to create three files, one each in HTML table, XML, and JSON format, describing three favorite books in a single subject area, with at least one book having multiple authors. I chose three works of interactional sociology: two relatively well-known classics from the tradition and one more recent example. For each book I recorded the title, authors, year of publication, publisher, page count of the first edition as listed by Google Books, and citation count according to Google Scholar.
I used the XML package to parse the XML and HTML files, and the jsonlite package to parse the JSON file.
library(RCurl)
library(XML)
library(jsonlite)
library(DT)
library(stringr)
library(tidyr)
library(dplyr)
xml.URL <-
getURL("https://raw.githubusercontent.com/juddanderman/cuny-data-607/master/Week7_Assignment/books.xml")
books.xml <- xmlParse(xml.URL)
root <- xmlRoot(books.xml)
xmlName(root)
## [1] "Sociology_Books"
xmlSize(root)
## [1] 3
I applied xmlValue() through nested calls to xmlSApply() to retrieve the values of the root node’s grandchildren, which hold the data for each book. Because every Book element has the same number of children (single-author books carry an empty second Author element), the result simplifies to a character matrix, which I then transposed and stored in a data frame.
xmlSApply(root, function(x) xmlSApply(x, xmlValue))
## Book
## Title "The Presentation of Self in Everyday Life"
## Author "Erving Goffman"
## Author ""
## Year_Published "1959"
## Publisher "Doubleday"
## Pages "259"
## Citations "43536"
## Book
## Title "Studies in Ethnomethodology"
## Author "Harold Garfinkel"
## Author ""
## Year_Published "1967"
## Publisher "Prentice-Hall"
## Pages "288"
## Citations "3508"
## Book
## Title "The Spectacle of History: Speech, Text, and Memory at the Iran-Contra Hearings"
## Author "Michael E. Lynch"
## Author "David Bogen"
## Year_Published "1996"
## Publisher "Duke University Press"
## Pages "368"
## Citations "378"
class(xmlSApply(root, function(x) xmlSApply(x, xmlValue)))
## [1] "matrix"
xml.df <- data.frame(t(xmlSApply(root, function(x) xmlSApply(x, xmlValue))), row.names = NULL)
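The nested call is easier to follow on a small in-memory document. The sketch below uses a hypothetical two-book XML string (xml.text and its contents are placeholders, not the assignment file) and assumes the same layout as above, in which a single-author book carries an empty second Author element so that every Book has the same number of children.

```r
library(XML)

# Hypothetical two-book document; the titles and authors are placeholders.
xml.text <- paste0(
  "<Sociology_Books>",
  "<Book><Title>Book A</Title><Author>Author One</Author><Author/></Book>",
  "<Book><Title>Book B</Title><Author>Author Two</Author>",
  "<Author>Author Three</Author></Book>",
  "</Sociology_Books>"
)

toy.root <- xmlRoot(xmlParse(xml.text, asText = TRUE))

# Inner call: one named character vector per Book (Title, Author, Author).
# Outer call: sapply-style simplification binds those vectors into a matrix
# with one column per Book, which only happens when all lengths match.
toy.matrix <- xmlSApply(toy.root, function(x) xmlSApply(x, xmlValue))

# Transpose so that each row is a book, then convert to a data frame.
# data.frame() makes the duplicated "Author" name unique as "Author.1".
toy.df <- data.frame(t(toy.matrix), row.names = NULL)
names(toy.df)
```

If one Book had fewer child elements than the others, the outer call would return a list instead of a matrix and the transpose step would no longer apply.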
html.URL <-
getURL("https://raw.githubusercontent.com/juddanderman/cuny-data-607/master/Week7_Assignment/books.html")
books.html <- readHTMLTable(html.URL, header = TRUE)
books.html
## $`Sociology Books`
## Title
## 1 The Presentation of Self in Everyday Life
## 2 Studies in Ethnomethodology
## 3 The Spectacle of History: Speech, Text, and Memory at the Iran-Contra Hearings
## Author Author Year Published Publisher Pages
## 1 Erving Goffman 1959 Doubleday 259
## 2 Herbert Garfinkel 1967 Prentice-Hall 288
## 3 Michael E. Lynch David Bogen 1996 Duke University Press 368
## Citations
## 1 43536
## 2 3508
## 3 378
class(books.html)
## [1] "list"
html.df <- data.frame(books.html$`Sociology Books`)
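readHTMLTable() returns a list because an HTML page can contain several tables, which is why the data frame has to be extracted from books.html before use. A minimal sketch on a hypothetical inline table follows (html.text and its contents are placeholders, and the table is selected by position rather than by name).

```r
library(XML)

# Hypothetical page with a single two-row table; contents are placeholders.
html.text <- paste0(
  "<html><body><table>",
  "<tr><th>Title</th><th>Author</th><th>Author</th></tr>",
  "<tr><td>Book A</td><td>Author One</td><td></td></tr>",
  "<tr><td>Book B</td><td>Author Two</td><td>Author Three</td></tr>",
  "</table></body></html>"
)

toy.tables <- readHTMLTable(htmlParse(html.text, asText = TRUE),
                            header = TRUE, stringsAsFactors = FALSE)

class(toy.tables)             # a list with one element per <table>
toy.table <- toy.tables[[1]]  # select the first (here, only) table
dim(toy.table)
```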
json.URL <-
getURL("https://raw.githubusercontent.com/juddanderman/cuny-data-607/master/Week7_Assignment/books.json")
books.json <- fromJSON(json.URL)
books.json
## $`Sociology Books`
## Title
## 1 The Presentation of Self in Everyday Life
## 2 Studies in Ethnomethodology
## 3 The Spectacle of History: Speech, Text, and Memory at the Iran-Contra Hearings
## Author Year Published Publisher Pages
## 1 Erving Goffman 1959 Doubleday 259
## 2 Harold Garfinkel 1967 Prentice-Hall 288
## 3 Michael E. Lynch, David Bogen 1996 Duke University Press 368
## Citations
## 1 43536
## 2 3508
## 3 378
class(books.json)
## [1] "list"
json.df <- data.frame(books.json$`Sociology Books`)
options(DT.options = list(dom = 't', scrollX = TRUE))
datatable(xml.df)
datatable(html.df)
datatable(json.df)
Without additional processing, the data frames generated from the three files are similar but not identical. The data frames derived from the XML and HTML table files have the same structure and differ only in the column name for year of publication (Year_Published in xml.df versus Year.Published in html.df); replacing the underscore with a period in the corresponding element names of the original XML file would have removed that difference. The printed output also shows one discrepancy in the data themselves: the HTML table gives Garfinkel’s first name as Herbert, where the XML and JSON files have Harold. The json.df data frame has a slightly different structure from the other two because I used an array to store the two author names for the third book. As a result, the Author values were parsed as a list rather than as an atomic vector.
is.atomic(books.json$`Sociology Books`$Author)
## [1] FALSE
is.atomic(books.json$`Sociology Books`$Title)
## [1] TRUE
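The same behaviour can be reproduced on a small inline document. The sketch below uses a hypothetical JSON string (json.text and its placeholder contents are not the assignment file) in which the Author field is an array whose length varies between records; jsonlite cannot simplify such a field to an atomic vector, so it becomes a list column.

```r
library(jsonlite)

# Hypothetical two-record document; titles and authors are placeholders.
json.text <- '{"Sociology Books": [
  {"Title": "Book A", "Author": ["Author One"]},
  {"Title": "Book B", "Author": ["Author Two", "Author Three"]}
]}'

toy.books <- fromJSON(json.text)[["Sociology Books"]]

is.atomic(toy.books$Title)   # plain strings simplify to a character vector
is.atomic(toy.books$Author)  # arrays of unequal length stay a list
lengths(toy.books$Author)    # number of authors recorded for each book
```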
This data frame could be made to resemble the other two by separating its Author column into two columns as below.
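# Collapse each book's author vector into one comma-separated string,
# then split it into two columns; fill = "right" pads single-author rows with NA.
json.df <- json.df %>%
  mutate(Author = sapply(Author, paste, collapse = ",")) %>%
  separate(Author, c("Author", "Author.1"), sep = ",", fill = "right")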
datatable(json.df)
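One caveat is worth checking: separate() fills the missing second author with NA, whereas the XML output above shows an empty string for single-author books. The base R sketch below is one alternative that pads with empty strings instead. It works on a stand-alone list holding the author names from the output above (toy.authors and toy.split are illustrative names, not objects from the assignment code).

```r
# Stand-alone copy of the parsed author lists, taken from the output above.
toy.authors <- list("Erving Goffman",
                    "Harold Garfinkel",
                    c("Michael E. Lynch", "David Bogen"))

# Take the first author, and the second where present ("" otherwise),
# so the result follows the empty-string convention of the XML data frame.
toy.split <- data.frame(
  Author   = vapply(toy.authors, function(x) x[1], character(1)),
  Author.1 = vapply(toy.authors,
                    function(x) if (length(x) > 1) x[2] else "",
                    character(1)),
  stringsAsFactors = FALSE
)
toy.split
```

This version handles at most two authors per book; a book with three or more would need additional columns.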