HTML

booksHTML <- getURL(booksHTML, .opts = list(ssl.verifypeer = FALSE))
dfHTML <- readHTMLTable(booksHTML, header=T)
dfHTML <- dfHTML[[1]]
knitr::kable(dfHTML)

Title	Author(s)	ISBN	Pages
Leafcutter Ants: Civilization by Instinct	Bert Hölldobler, E.O. Wilson	978-0-393-33868-3	192
Honeybee Democracy	Thomas D. Seeley	978-0-691-14721-5	280
Sweetness and Light: The Mysterious History of the Honeybee	Hattie Ellis	978-0-307-54786-6	258

XML

booksXML <- getURL(booksXML)
booksXML <- xmlTreeParse(booksXML, useInternalNodes = T, encoding = "UTF-8")
dfXML <- xmlToList(booksXML)
dfXMLtemp <- as.data.frame(dfXML[[1]], stringsAsFactors = F)
dfXMLtemp["Author"] <- paste(dfXMLtemp["Author"], sep=", ", dfXMLtemp["Author.1"])
dfXMLtemp <- dfXMLtemp[, -which(names(dfXMLtemp) %in% c("Author.1"))]
dfXMLtemp <- rbind(dfXMLtemp, dfXML[[2]], dfXML[[3]])
dfXML <- dfXMLtemp
rm(dfXMLtemp)
knitr::kable(dfXML)

Title	Author	ISBN	Pages
Leafcutter Ants: Civilization by Instinct	Bert Hölldobler, E.O. Wilson	978-0-393-33868-3	192
Honeybee Democracy	Thomas D. Seeley	978-0-691-14721-5	280
Sweetness and Light: The Mysterious History of the Honeybee	Hattie Ellis	978-0-307-54786-6	258

JSON

dfJSON <- fromJSON(booksJSON)
dfJSON <- dfJSON[[1]][[1]]
knitr::kable(dfJSON)

Title	Author	ISBN	Pages
Leafcutter Ants: Civilization by Instinct	Bert Hölldobler, E.O. Wilson	978-0-393-33868-3	192
Honeybee Democracy	Thomas D. Seeley	978-0-691-14721-5	280
Sweetness and Light: The Mysterious History of the Honeybee	Hattie Ellis	978-0-307-54786-6	258

Similarities versus Differences

Coding

When it came to coding, the easiest to get in was the JSON file, by far. It was four lines of code - one of which being a single library! - and was fairly intuitive. Second easiest was the HTML file, although I admit it would have been more difficult if I hadn’t combined the authors for the first book together in a cell. Lastly was the XML file. Importing XML… it doesn’t seem that the current packages out there really like there being duplicate tags.

Final Product

All of the tables have the same titles, authors, ISBNs, and Pages. The HTML has a different column name for the authors, “Author(s)”, while the other two are “Author”. When it comes to the content in the author columns, the multiple authors are where the differences exist. For the HTML table, I manually put them together in the HTML file, so when it was brought over to R, that was maintained. I had to use the paste function when dealing with the XML import to combine the authors, which results in a near-manual way to get the same as the HTML column. Lastly, JSON took care of everything on its own.

Conclusion

The most comfortable data type of the three is, easily, JSON, when it comes to R.

gabartomeoWeek07

Gabrielle Bartomeo

March 18, 2018