The source files in HTML, XML, and JSON formats were created manually for three books.
urlhtml = "https://raw.githubusercontent.com/completegraph/DataStore/master/book.html"
urljson = "https://raw.githubusercontent.com/completegraph/DataStore/master/book.json"
urlxml = "https://raw.githubusercontent.com/completegraph/DataStore/master/book.xml"
The original table uses rowspan to convey the 1-to-many relationship of a book to its co-authors. This is illustrated below.
[Figure: HTML with table]
However, once the page is parsed with the XML library, translating the rowspan attributes into a data frame creates one row per co-author, because the columns shared by a book (title, ISBN, publisher, copyright) are duplicated for each of its authors. Where the original HTML table shows 3 books, the XML and htmltab packages produce a table of 6 rows.
library(RCurl)    # getURL()
library(XML)      # htmlParse()
library(htmltab)  # htmltab()
url1 = getURL(urlhtml)
hobj = htmlParse(url1)
booksTable = htmltab(hobj, which=1)
knitr::kable(booksTable)

| | Title | ISBN | Publisher | Copyright | LastName | FirstName | MiddleName |
|---|---|---|---|---|---|---|---|
| 2 | The econometrics of financial markets | 9780691043012 | Princeton University Press | 1997 | Campbell | John | Y. |
| 3 | The econometrics of financial markets | 9780691043012 | Princeton University Press | 1997 | Lo | Andrew | W. |
| 4 | The econometrics of financial markets | 9780691043012 | Princeton University Press | 1997 | MacKinlay | A. | Craig |
| 5 | Convex optimization | 9780521833783 | Cambridge University Press | 2004 | Boyd | Stephen | NA |
| 6 | Convex optimization | 9780521833783 | Cambridge University Press | 2004 | Vanderberghe | Lieven | NA |
| 7 | The shifts and the shocks: what we’ve learned - and have still to learn from the financial crisis | 9781594205446 | Penguin Press | 2014 | Wolf | Martin | NA |
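If one row per book is wanted again, the duplicated rows can be collapsed after parsing. The following is only a minimal sketch in base R: `books_long` is a hypothetical, hand-built miniature of the table above (a subset of its columns, with one title shortened), and joining the names with "; " is an assumption about how the co-authors should be displayed.

```r
# Hypothetical miniature of the parsed table: one row per (book, co-author).
books_long <- data.frame(
  Title     = c("Convex optimization", "Convex optimization",
                "The shifts and the shocks"),
  ISBN      = c("9780521833783", "9780521833783", "9781594205446"),
  LastName  = c("Boyd", "Vanderberghe", "Wolf"),
  FirstName = c("Stephen", "Lieven", "Martin"),
  stringsAsFactors = FALSE
)

# Build one display name per author row.
books_long$Author <- paste(books_long$FirstName, books_long$LastName)

# Collapse to one row per book, joining that book's co-authors with "; ".
books_wide <- aggregate(Author ~ Title + ISBN, data = books_long,
                        FUN = paste, collapse = "; ")

print(books_wide)
```

Grouping on both Title and ISBN keeps the shared columns in the result; any other book-level column (Publisher, Copyright) could be added to the right-hand side of the formula in the same way.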
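We can use the xmlParse() command to load the document. The nested xmlSApply() calls visit each child element of every book and extract its text with xmlValue(). Note that the authors have been compressed into a single text string: xmlValue() concatenates the text of all descendant nodes, so the names of every co-author run together. Additional work is required to separate the authors.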
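url2 = getURL(urlxml)
xobj = xmlParse(url2)
root = xmlRoot(xobj)
# For each book (outer loop), take the text value of each child element (inner loop)
dfxml = xmlSApply(root, function(x) xmlSApply(x, xmlValue))
# Transpose so that each book becomes one row
df2 = data.frame(t(dfxml), row.names=NULL)
knitr::kable(df2)

| Title | ISBN | Publisher | Copyright | Authors |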
|---|---|---|---|---|
| The econometrics of financial markets | 9780691043012 | Princeton University Press | 1997 | CampbellJohnY.LoAndrewW.MacKinlayA.Craig |
| Convex optimization | 9780521833783 | Cambridge University Press | 2004 | BoydStephenVanderbergheLieven |
| The shifts and the shocks: what we’ve learned - and have still to learn from the financial crisis | 9781594205446 | Penguin Press | 2014 | WolfMartin |
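One way to separate the authors is to visit each author node individually instead of taking the text value of the whole Authors element. The sketch below is illustrative only: the inline XML and its element names (`Books`, `Book`, `Authors`, `Author`, `LastName`, `FirstName`) are assumptions about how book.xml might be laid out, not a copy of the real file.

```r
library(XML)

# Hypothetical fragment; the element names are assumed, not taken from book.xml.
xml_text <- '<Books>
  <Book>
    <Title>Convex optimization</Title>
    <Authors>
      <Author><LastName>Boyd</LastName><FirstName>Stephen</FirstName></Author>
      <Author><LastName>Vanderberghe</LastName><FirstName>Lieven</FirstName></Author>
    </Authors>
  </Book>
</Books>'

doc <- xmlParse(xml_text, asText = TRUE)

# Select every Author node rather than calling xmlValue() on <Authors>.
author_nodes <- getNodeSet(doc, "//Book/Authors/Author")

# One row per author; the title is read from the enclosing <Book> element.
authors <- data.frame(
  Title     = sapply(author_nodes,
                     function(a) xmlValue(xmlParent(xmlParent(a))[["Title"]])),
  LastName  = sapply(author_nodes, function(a) xmlValue(a[["LastName"]])),
  FirstName = sapply(author_nodes, function(a) xmlValue(a[["FirstName"]])),
  stringsAsFactors = FALSE
)

print(authors)
```

This yields the same long layout as the HTML result (one row per co-author), but the name parts stay in separate columns instead of being fused into one string.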
The JSON format is the only one of the three that preserves the nested structure of the co-authorship. The jsonlite library converts the raw JSON file directly into a data frame.
library(jsonlite)  # fromJSON()
jobj = fromJSON(urljson)
knitr::kable(jobj)
Inspecting the data frame in more detail, we see that the co-authors of each book are stored in a nested data frame with the data elements mapped correctly, as shown here for the first book.
knitr::kable(jobj$books$Author[1])
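When a flat table is needed, the nested author data frames can be expanded into one row per co-author. The sketch below assumes a JSON layout with a top-level `books` array and an `Author` array per book, which matches the `jobj$books$Author` access used above; the inline JSON text and the `LastName` and `FirstName` field names are hypothetical stand-ins for the real file.

```r
library(jsonlite)

# Hypothetical JSON text; the field names inside Author are assumed.
json_text <- '{"books": [
  {"Title": "Convex optimization",
   "Author": [{"LastName": "Boyd", "FirstName": "Stephen"},
              {"LastName": "Vanderberghe", "FirstName": "Lieven"}]},
  {"Title": "The shifts and the shocks",
   "Author": [{"LastName": "Wolf", "FirstName": "Martin"}]}
]}'

jobj <- fromJSON(json_text)
books <- jobj$books

# books$Author is a list column holding one data frame per book.
# Pair each title with its author data frame, then stack the pieces.
pieces <- Map(function(title, authors) {
  data.frame(Title = title, authors, stringsAsFactors = FALSE)
}, books$Title, books$Author)

flat <- do.call(rbind, unname(pieces))
rownames(flat) <- NULL

print(flat)
```

Keeping the nested form and flattening only on demand means the one-to-many relationship is not lost; the flat table is just one view of it.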
The data frames produced from the three formats are quite different. JSON does the best job of preserving the logical structure of the original input data. XML is next best, but the available libraries do a mediocre job of displaying the data. HTML does not encode the logical structure of the information; instead, it bakes presentation-layer details into the design of its table.