0.1 Creating the Source Files

The source files, in HTML, XML, and JSON formats, were created manually for 3 books.

urlhtml = "https://raw.githubusercontent.com/completegraph/DataStore/master/book.html"
urljson = "https://raw.githubusercontent.com/completegraph/DataStore/master/book.json"
urlxml  = "https://raw.githubusercontent.com/completegraph/DataStore/master/book.xml"

1 HTML Version

The original table uses rowspan to convey the 1-to-many relationship of a book to its co-authors. This is illustrated below.

HTML with table

However, when the table is parsed with the XML and htmltab packages, converting the rowspan cells into a data frame produces one row per co-author, because the book-level columns are repeated for each co-author. Where the original HTML table shows 3 books, the parsed data frame has 6 rows.

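library(RCurl); library(XML); library(htmltab)
url1 = getURL(urlhtml)               # download the raw HTML as text
hobj = htmlParse(url1)               # parse the text into an HTML document
booksTable = htmltab(hobj, which=1)  # first table; rowspan cells become repeated rows
knitr::kable(booksTable)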
|   |Title |ISBN |Publisher |Copyright |LastName |FirstName |MiddleName |
|:--|:-----|:----|:---------|:---------|:--------|:---------|:----------|
|2 |The econometrics of financial markets |9780691043012 |Princeton University Press |1997 |Campbell |John |Y. |
|3 |The econometrics of financial markets |9780691043012 |Princeton University Press |1997 |Lo |Andrew |W. |
|4 |The econometrics of financial markets |9780691043012 |Princeton University Press |1997 |MacKinlay |A. |Craig |
|5 |Convex optimization |9780521833783 |Cambridge University Press |2004 |Boyd |Stephen |NA |
|6 |Convex optimization |9780521833783 |Cambridge University Press |2004 |Vanderberghe |Lieven |NA |
|7 |The shifts and the shocks: what we’ve learned - and have still to learn from the financial crisis |9781594205446 |Penguin Press |2014 |Wolf |Martin |NA |
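The repeated book-level columns can be collapsed back to one row per book. The following is a minimal sketch in base R, assuming a small hand-built stand-in for booksTable (the names books_long, books_wide, and n_books are hypothetical and only three of the columns are reproduced):

```r
# Hypothetical stand-in for booksTable, reduced to three columns.
books_long <- data.frame(
  Title    = c(rep("The econometrics of financial markets", 3),
               rep("Convex optimization", 2)),
  ISBN     = c(rep("9780691043012", 3), rep("9780521833783", 2)),
  LastName = c("Campbell", "Lo", "MacKinlay", "Boyd", "Vanderberghe"),
  stringsAsFactors = FALSE
)

# Count distinct books by dropping the per-author column.
n_books <- nrow(unique(books_long[c("Title", "ISBN")]))

# One row per book: paste the co-author names into a single string.
books_wide <- aggregate(LastName ~ Title + ISBN, data = books_long,
                        FUN = paste, collapse = "; ")
```

Pasting the names into one string is only one possible choice; it restores the book count but still loses the separate name fields.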

1.1 XML Version

We can use the xmlParse() command to load the document. The nested xmlSApply() calls visit each book and apply xmlValue() to each of its child elements. Because xmlValue() concatenates the text of all descendant nodes, the authors are compressed into a single text string. Additional work is required to separate the authors.

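url2 = getURL(urlxml)                # download the raw XML as text

xobj = xmlParse(url2)
root = xmlRoot(xobj)

# for each book, take the text value of each child element
dfxml = xmlSApply(root, function(x) xmlSApply(x, xmlValue))
df2  = data.frame(t(dfxml), row.names=NULL)   # transpose to one row per book
knitr::kable(df2)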
|Title |ISBN |Publisher |Copyright |Authors |
|:-----|:----|:---------|:---------|:-------|
|The econometrics of financial markets |9780691043012 |Princeton University Press |1997 |CampbellJohnY.LoAndrewW.MacKinlayA.Craig |
|Convex optimization |9780521833783 |Cambridge University Press |2004 |BoydStephenVanderbergheLieven |
|The shifts and the shocks: what we’ve learned - and have still to learn from the financial crisis |9781594205446 |Penguin Press |2014 |WolfMartin |
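The run-together author strings can be avoided by querying each author element separately with XPath instead of taking the value of the whole authors node. This is a minimal sketch on an inline document; the element names (Books, Book, Authors, Author, LastName, FirstName) are assumptions and may not match the actual layout of book.xml.

```r
library(XML)

# Hypothetical inline document; element names are assumed, not read from book.xml.
xml_text <- paste0(
  "<Books>",
  "<Book><Title>Convex optimization</Title><Authors>",
  "<Author><LastName>Boyd</LastName><FirstName>Stephen</FirstName></Author>",
  "<Author><LastName>Vanderberghe</LastName><FirstName>Lieven</FirstName></Author>",
  "</Authors></Book>",
  "<Book><Title>Another book</Title><Authors>",
  "<Author><LastName>Wolf</LastName><FirstName>Martin</FirstName></Author>",
  "</Authors></Book>",
  "</Books>"
)
doc <- xmlParse(xml_text, asText = TRUE)

# xmlValue() on an <Authors> node joins the text of all its descendants.
collapsed <- xpathSApply(doc, "//Book/Authors", xmlValue)

# Query the name fields per book so each author keeps a separate row.
book_nodes <- getNodeSet(doc, "//Book")
authors <- do.call(rbind, lapply(book_nodes, function(b) {
  data.frame(
    Title     = xpathSApply(b, "./Title", xmlValue),
    LastName  = xpathSApply(b, "./Authors/Author/LastName", xmlValue),
    FirstName = xpathSApply(b, "./Authors/Author/FirstName", xmlValue),
    stringsAsFactors = FALSE
  )
}))
```

The sketch assumes every author has both name fields; an optional field such as a middle name would need a per-author lookup that fills in NA when the element is absent.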

1.2 JSON Version

The JSON format is the only one of the three whose parsed result preserves the structure of the co-authorship. The jsonlite library converts the raw JSON file directly into a data frame.

library(jsonlite)
jobj = fromJSON(urljson)   # reads directly from the URL

knitr::kable(jobj)
|Title |ISBN |Publisher |Copyright |Author |
|:-----|:----|:---------|:---------|:------|
|The econometrics of financial markets |9780691043012 |Princeton University Press |1997 |list(Last Name = c("Campbell", "Lo", "MacKinlay"), First Name = c("John", "Andrew", "A."), Middle Name = c("Y.", "W.", "Craig")) |
|Convex optimization |9780521833783 |Cambridge University Press |2004 |list(Last Name = c("Boyd", "Vanderberghe"), First Name = c("Stephen", "Lieven"), Middle Name = c(NA, NA)) |
|The shifts and the shocks: what we’ve learned - and have still to learn from the financial crisis |9781594205446 |Penguin Press |2014 |list(Last Name = "Wolf", First Name = "Martin", Middle Name = NA) |

Inspecting the data frame more closely shows that each book's co-authors are stored in a nested data frame with the name fields mapped correctly, as seen here for the first book.

knitr::kable(jobj$books$Author[1])
|Last Name |First Name |Middle Name |
|:---------|:----------|:-----------|
|Campbell |John |Y. |
|Lo |Andrew |W. |
|MacKinlay |A. |Craig |
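If a flat, one-row-per-author table is wanted, the nested data frames can be stacked and the book titles repeated to match. The following is a minimal sketch on an inline JSON string whose layout (a top-level books array with an Author array per book) is inferred from the output above; only two books and two name fields are included, and the names json_text, n_authors, and long are hypothetical.

```r
library(jsonlite)

# Inline sample; layout inferred from the parsed output, fields reduced for brevity.
json_text <- '{"books": [
  {"Title": "The econometrics of financial markets",
   "Author": [{"Last Name": "Campbell",  "First Name": "John"},
              {"Last Name": "Lo",        "First Name": "Andrew"},
              {"Last Name": "MacKinlay", "First Name": "A."}]},
  {"Title": "Convex optimization",
   "Author": [{"Last Name": "Boyd",         "First Name": "Stephen"},
              {"Last Name": "Vanderberghe", "First Name": "Lieven"}]}
]}'

books <- fromJSON(json_text)$books

# Author is a list column holding one data frame per book.
n_authors <- vapply(books$Author, nrow, integer(1))

# Repeat each title once per author, then stack the nested data frames.
long <- cbind(
  Title = rep(books$Title, times = n_authors),
  do.call(rbind, books$Author),
  stringsAsFactors = FALSE
)
```

Because the nested column names contain spaces, they are accessed as long[["Last Name"]] rather than with the $ shorthand.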

1.3 Conclusion

The data frames produced from the 3 formats are quite different. JSON does the best job of preserving the logical structure of the original input data. XML is next best, but the available libraries do a mediocre job of displaying the data. HTML does not encode the logical structure of the information; instead, presentational details such as rowspan are built into the design of its table.