1 HTML Version

The original table uses rowspan to convey the 1-to-many relationship of a book to its co-authors. This is illustrated below.

[Figure: HTML with table]

However, when the page is parsed with the XML and htmltab packages, expanding the rowspan attributes into a data frame produces one row per co-author, with the book-level columns (Title, ISBN, Publisher, Copyright) duplicated on each row. Where the original HTML table shows 3 books, the parsed data frame has 6 rows.

library(RCurl)    # getURL()
library(XML)      # htmlParse()
library(htmltab)  # htmltab()

url1 = getURL(urlhtml)
hobj = htmlParse(url1)
booksTable = htmltab(hobj, which=1)
knitr::kable(booksTable)

	Title	ISBN	Publisher	Copyright	LastName	FirstName	MiddleName
2	The econometrics of financial markets	9780691043012	Princeton University Press	1997	Campbell	John	Y.
3	The econometrics of financial markets	9780691043012	Princeton University Press	1997	Lo	Andrew	W.
4	The econometrics of financial markets	9780691043012	Princeton University Press	1997	MacKinlay	A.	Craig
5	Convex optimization	9780521833783	Cambridge University Press	2004	Boyd	Stephen	NA
6	Convex optimization	9780521833783	Cambridge University Press	2004	Vanderberghe	Lieven	NA
7	The shifts and the shocks: what we’ve learned - and have still to learn from the financial crisis	9781594205446	Penguin Press	2014	Wolf	Martin	NA
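The duplication described above can be undone after parsing. The following is a minimal sketch in base R, assuming a small hand-typed subset of the parsed table; the names booksLong, booksOnly and authorsPerBook are illustrative and do not appear in the original analysis.

```r
# Illustrative subset of the parsed table (values copied from the output above).
booksLong <- data.frame(
  Title    = c(rep("The econometrics of financial markets", 3),
               rep("Convex optimization", 2)),
  ISBN     = c(rep("9780691043012", 3), rep("9780521833783", 2)),
  LastName = c("Campbell", "Lo", "MacKinlay", "Boyd", "Vanderberghe"),
  stringsAsFactors = FALSE
)

# Book-level columns repeat once per co-author, so unique() recovers the books.
booksOnly <- unique(booksLong[, c("Title", "ISBN")])

# One row per book, with the co-author surnames collapsed into a single field.
authorsPerBook <- aggregate(LastName ~ ISBN, data = booksLong,
                            FUN = paste, collapse = "; ")
```

Keeping the ISBN as the grouping key is a design choice for this sketch: it is the one book-level column that is guaranteed to be unique per book.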

1.1 XML Version

We can use xmlParse() to load the document. The nested xmlSApply() calls visit each book under the root and apply xmlValue() to each of its fields. Because xmlValue() concatenates the text of all descendant nodes, the authors of each book are collapsed into a single string with no separators. Additional work is required to separate the authors.

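url2 = getURL(urlxml)

xobj = xmlParse(url2)
root = xmlRoot(xobj)

# Outer xmlSApply: one call per book; inner xmlSApply: xmlValue of each field
dfxml = xmlSApply(root, function(x) xmlSApply(x, xmlValue))
df2 = data.frame(t(dfxml), row.names=NULL)  # transpose so each book is a row
knitr::kable(df2)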

Title	ISBN	Publisher	Copyright	Authors
The econometrics of financial markets	9780691043012	Princeton University Press	1997	CampbellJohnY.LoAndrewW.MacKinlayA.Craig
Convex optimization	9780521833783	Cambridge University Press	2004	BoydStephenVanderbergheLieven
The shifts and the shocks: what we’ve learned - and have still to learn from the financial crisis	9781594205446	Penguin Press	2014	WolfMartin
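One possible way to separate the authors is to query the author nodes individually instead of taking xmlValue() of their parent. The sketch below assumes a hypothetical XML layout (books/book/Authors/Author with LastName and FirstName children); the element names in the actual source file may differ and would need to be adjusted.

```r
library(XML)

# Hypothetical XML layout; the real file's element names may differ.
xml_text <- '<books>
  <book>
    <Title>Convex optimization</Title>
    <Authors>
      <Author><LastName>Boyd</LastName><FirstName>Stephen</FirstName></Author>
      <Author><LastName>Vanderberghe</LastName><FirstName>Lieven</FirstName></Author>
    </Authors>
  </book>
</books>'

doc  <- xmlParse(xml_text, asText = TRUE)
book <- getNodeSet(doc, "//book")[[1]]

# Relative XPath queries return one value per author node,
# so the name fields stay separate instead of being concatenated.
author_df <- data.frame(
  LastName  = xpathSApply(book, "./Authors/Author/LastName", xmlValue),
  FirstName = xpathSApply(book, "./Authors/Author/FirstName", xmlValue),
  stringsAsFactors = FALSE
)
```

Repeating this for every node returned by getNodeSet(doc, "//book") would give one small author table per book, similar to the nested structure that jsonlite produces.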

1.2 JSON Version

Of the three formats, only JSON preserves the structure of the co-authorship once parsed. The jsonlite library converts the raw file directly from JSON into a data frame.

library(jsonlite)  # fromJSON()

jobj = fromJSON(urljson)

knitr::kable(jobj)

Title	ISBN	Publisher	Copyright	Author
The econometrics of financial markets	9780691043012	Princeton University Press	1997	list(`Last Name` = c("Campbell", "Lo", "MacKinlay"), `First Name` = c("John", "Andrew", "A."), `Middle Name` = c("Y.", "W.", "Craig"))
Convex optimization	9780521833783	Cambridge University Press	2004	list(`Last Name` = c("Boyd", "Vanderberghe"), `First Name` = c("Stephen", "Lieven"), `Middle Name` = c(NA, NA))
The shifts and the shocks: what we’ve learned - and have still to learn from the financial crisis	9781594205446	Penguin Press	2014	list(`Last Name` = "Wolf", `First Name` = "Martin", `Middle Name` = NA)

Inspecting the data frame in more detail shows that each book's co-authors are stored in a nested data frame with the name fields mapped correctly, as shown here for the first book.

knitr::kable(jobj$books$Author[1])

Last Name	First Name	Middle Name
Campbell	John	Y.
Lo	Andrew	W.
MacKinlay	A.	Craig
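If a flat table with one row per co-author is needed (the shape produced from the HTML source), the nested author data frames can be expanded. This is a rough sketch using jsonlite and base R on an inline JSON string; the string is a shortened, hand-written stand-in for the real file, and the names json_text, books and long are assumptions made for illustration.

```r
library(jsonlite)

# Shortened stand-in for the real JSON file (assumed layout, abbreviated title).
json_text <- '{"books": [
  {"Title": "Convex optimization", "ISBN": "9780521833783",
   "Author": [{"Last Name": "Boyd", "First Name": "Stephen"},
              {"Last Name": "Vanderberghe", "First Name": "Lieven"}]},
  {"Title": "The shifts and the shocks", "ISBN": "9781594205446",
   "Author": [{"Last Name": "Wolf", "First Name": "Martin"}]}
]}'

books <- fromJSON(json_text)$books

# Author is a list column: one small data frame per book.
n_authors <- vapply(books$Author, nrow, integer(1))

# Repeat each book row once per author, then bind the author rows alongside.
book_cols <- books[rep(seq_len(nrow(books)), n_authors), c("Title", "ISBN"),
                   drop = FALSE]
long <- cbind(book_cols, do.call(rbind, books$Author))
rownames(long) <- NULL
```

The sketch relies on every author record carrying the same fields, so that the per-book data frames can be stacked with rbind().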

1.3 Conclusion

The data frames produced from the three formats are quite different. JSON does the best job of preserving the logical structure of the original input data. XML is next best, but the available libraries do a mediocre job of displaying the data. HTML does not encode the logical structure of the information; instead, the design of its table is driven by presentation-layer details.

Assignment 7 DATA 607 Books

Alexander Ng

March 17, 2019
