DATA 607 Week 8: XML and JSON

For this assignment, information about my favorite physics textbooks from my undergraduate studies is read in from files in three different formats: HTML, XML, and JSON.

The files are parsed, and the data is stored in data frames.

HTML

The HTML file is parsed and its table extracted using the XML package. The year and pages columns are converted from text to integers.

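library(XML)

# Parse the HTML file and keep the first (only) table as a data frame
my_html <- htmlParse("./books.html")
books_html <- readHTMLTable(my_html, stringsAsFactors = FALSE)[[1]]
# Table cells are read as text, so convert the numeric columns
books_html$year <- as.integer(books_html$year)
books_html$pages <- as.integer(books_html$pages)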

title	authors	year	pages
An Introduction to Modern Astrophysics (2nd Edition)	Carroll, Bradley W.; Ostlie, Dale A.	2006	1400
Spacetime and Geometry: An Introduction to General Relativity	Carroll, Sean	2003	513
Introduction to Electrodynamics (3rd Edition)	Griffiths, David J.	1999	576
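
The contents of books.html are not shown here, so the following is only a minimal sketch of the same steps run on an inline string. The table layout in `html_text` is an assumption about how such a file might look, and `header = TRUE` is passed explicitly so the first row is used for column names.

```r
library(XML)

# Hypothetical stand-in for books.html (the real file layout is assumed)
html_text <- '<html><body>
<table>
  <tr><th>title</th><th>authors</th><th>year</th><th>pages</th></tr>
  <tr><td>Introduction to Electrodynamics (3rd Edition)</td><td>Griffiths, David J.</td><td>1999</td><td>576</td></tr>
</table>
</body></html>'

# asText = TRUE tells the parser the argument is content, not a file path
doc <- htmlParse(html_text, asText = TRUE)
tbl <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)[[1]]

# Cell values arrive as text, so the numeric columns are converted
tbl$year <- as.integer(tbl$year)
tbl$pages <- as.integer(tbl$pages)
```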

XML

The XML file is parsed, again using the XML package. The title, year, and pages columns are extracted using XPath.

my_xml <- xmlParse("./books.xml")
title <- xpathSApply(my_xml, "//title", xmlValue)
year <- as.integer(xpathSApply(my_xml, "//year", xmlValue))
pages <- as.integer(xpathSApply(my_xml, "//pages", xmlValue))
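
As an illustration of how the XPath queries above behave, here is a small sketch on an inline document. The element layout in `xml_text` is an assumption, since the real books.xml is not reproduced here. The expression `//title` matches every `title` element at any depth, and `xmlValue` returns its text content.

```r
library(XML)

# Hypothetical stand-in for books.xml (element names follow the queries above)
xml_text <- '<books>
  <book>
    <title>Spacetime and Geometry: An Introduction to General Relativity</title>
    <year>2003</year>
    <pages>513</pages>
  </book>
  <book>
    <title>Introduction to Electrodynamics (3rd Edition)</title>
    <year>1999</year>
    <pages>576</pages>
  </book>
</books>'

doc <- xmlParse(xml_text, asText = TRUE)

# One result per matching node, returned in document order
title <- xpathSApply(doc, "//title", xmlValue)
year <- as.integer(xpathSApply(doc, "//year", xmlValue))
pages <- as.integer(xpathSApply(doc, "//pages", xmlValue))
```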

Because the authors are stored as attribute values, they have to be handled differently. The author attributes are extracted into a list using XPath. A loop then joins the first and second authors with a semicolon where both exist.

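# One character vector of attribute values per <authors> node
authors_list <- xpathSApply(my_xml, "//authors", xmlAttrs)
authors <- rep('', length(authors_list))
# seq_along() avoids iterating over c(1, 0) if the list is empty
for (i in seq_along(authors_list)) {
  # A single author is used as-is; two authors are joined with "; "
  authors[i] <- ifelse(length(authors_list[[i]]) == 1, authors_list[[i]][[1]],
                       paste(authors_list[[i]][[1]], authors_list[[i]][[2]], sep = "; "))
}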
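
A possible alternative to the loop, shown here only as a sketch: collapse each node's attribute vector with `paste(collapse = "; ")`, which handles any number of authors. It uses `xpathApply` rather than `xpathSApply` so the result is always a list, even when every book has the same number of authors. The attribute names `first` and `second` in `xml_text` are assumptions about the file.

```r
library(XML)

# Hypothetical fragment; attribute names are assumed for illustration
xml_text <- '<books>
  <book><authors first="Carroll, Bradley W." second="Ostlie, Dale A."/></book>
  <book><authors first="Carroll, Sean"/></book>
</books>'

doc <- xmlParse(xml_text, asText = TRUE)

# xpathApply always returns a list: one attribute vector per node
authors_list <- xpathApply(doc, "//authors", xmlAttrs)

# Join each node's attribute values into a single string
authors <- vapply(authors_list, function(a) paste(a, collapse = "; "),
                  character(1), USE.NAMES = FALSE)
```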

Finally, the data is stored in a data frame:

books_xml <- data.frame(title, authors, year, pages, stringsAsFactors = FALSE)

title	authors	year	pages
An Introduction to Modern Astrophysics (2nd Edition)	Carroll, Bradley W.; Ostlie, Dale A.	2006	1400
Spacetime and Geometry: An Introduction to General Relativity	Carroll, Sean	2003	513
Introduction to Electrodynamics (3rd Edition)	Griffiths, David J.	1999	576

JSON

The JSON file is read in using the jsonlite package. Because fromJSON maps JSON numbers to R numeric types, the year and pages columns do not need to be converted from text.

library(jsonlite)

my_json <- fromJSON("./books.json")
# The book records are the first element of the parsed result
books_json <- my_json[[1]]

title	authors	year	pages
An Introduction to Modern Astrophysics (2nd Edition)	Carroll, Bradley W., Ostlie, Dale A.	2006	1400
Spacetime and Geometry: An Introduction to General Relativity	Carroll, Sean	2003	513
Introduction to Electrodynamics (3rd Edition)	Griffiths, David J.	1999	576
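
To make the type handling concrete, here is a small sketch on an inline string. The layout of `json_text` is an assumption about books.json: a top-level object whose first member is an array of records, with authors stored as a JSON array.

```r
library(jsonlite)

# Hypothetical stand-in for books.json (layout assumed)
json_text <- '{"books": [
  {"title": "A", "authors": ["X", "Y"], "year": 2006, "pages": 1400},
  {"title": "B", "authors": ["Z"], "year": 2003, "pages": 513}
]}'

parsed <- fromJSON(json_text)
books <- parsed[[1]]

# Whole JSON numbers become integers, with no text conversion needed
year_class <- class(books$year)

# Arrays of differing lengths are kept as a list column
authors_class <- class(books$authors)
```

Under these assumptions the authors column is a list with one character vector per book, which is why it prints with a plain comma between the two authors.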

Analyzing the Data Frames

The data frames created from the HTML and XML files are compared and found to be identical:

identical(books_html, books_xml)

## [1] TRUE

The data frames created from the HTML and JSON files are then compared and found to be different:

identical(books_html, books_json)

## [1] FALSE

For thoroughness, the difference between these data frames is investigated. Looking at the printed data frames above, there appears to be a difference in the authors column for the first book (the one with two authors). It is first verified that the two data frames are otherwise identical, and then the classes of the two authors columns are compared.

identical(books_html[, -2], books_json[, -2])

## [1] TRUE
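
Another way to locate the mismatch, sketched here on small stand-in data frames rather than the real ones, is to compare the columns pairwise. The names `html_like` and `json_like` are hypothetical.

```r
# Stand-ins for the two data frames (values are illustrative only)
html_like <- data.frame(title = c("A", "B"),
                        authors = c("X; Y", "Z"),
                        year = c(2006L, 2003L),
                        stringsAsFactors = FALSE)
json_like <- html_like
json_like$authors <- list(c("X", "Y"), "Z")  # list column, one entry per row

# Compare column by column; the result is named after the columns
same <- mapply(identical, html_like, json_like)
differing <- names(same)[!same]
```

Here `differing` holds the names of the columns that are not identical, which for these stand-ins is only the authors column.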

class(books_html$authors)

## [1] "character"

class(books_json$authors)

## [1] "list"
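
The list class suggests that each entry of the JSON authors column is a vector of author names rather than a single string. One way to reconcile the two representations, sketched on stand-in data frames with hypothetical names, is to collapse each list entry with the same "; " separator used in the other formats.

```r
# Stand-ins: one frame with a character column, one with a list column
books_chr <- data.frame(title = c("A", "B"),
                        authors = c("X; Y", "Z"),
                        stringsAsFactors = FALSE)
books_lst <- data.frame(title = c("A", "B"), stringsAsFactors = FALSE)
books_lst$authors <- list(c("X", "Y"), "Z")

# Collapse each list entry into a single "; "-separated string
books_lst$authors <- vapply(books_lst$authors, paste, character(1),
                            collapse = "; ")

now_identical <- identical(books_chr, books_lst)
```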


Dan Smilowitz

March 20, 2016
