This requires rvest, jsonlite, xml2, purrr, RCurl, and XML, plus knitr for kable.


The book data set

The author admits that dinosaurs can seem rather juvenile, but there is some fascinating work out there, and much more change and turmoil than one would expect for a field focused on the long ago. Dinosaurs Without Bones is highly recommended.

JSON

jsonTable <- fromJSON(getURL("http://www.yasth.org/data607/books.json"))
kable(jsonTable)
| Authors | Published | ISBN-13 | Subjects | Title | Subtitle | Illustrators |
|---|---|---|---|---|---|---|
| Anthony J Martin | 18-03-13 | 978-1605984995 | c(“Fossils”, “Dinosaurs”, “Icythyology”) | Dinosaurs Without Bones | Dinosaur Lives Revealed by their Trace Fossils | NULL |
| Anthony J Martin | 18-03-06 | 978-1681776569 | c(“Dinosaurs”, “Biology & Life Sciences”, “Paleontolgy”) | The Evolution Underground | Burrows, Bunkers, and the Marvelous Subterranean World Beneath our Feet | NULL |
| c(“Robert Saruda”, “Matthew Reinhart”) | 12-07-05 | 978-0763622282 | c(“Children’s Exploration”, “Veterinary Encyclopedia”, “Children’s Fossil”) | Encyclopedia Prehistorica Dinosaurs | The Definitive Pop-Up | c(“Robert Saruda”, “Matthew Reinhart”) |

XML

rawXML <- XML::xmlParse(getURL("http://www.yasth.org/data607/books.xml"))
xmlTable <- XML::xmlToDataFrame(rawXML)
kable(xmlTable)
| Authors | Published | ISBN-13 | Subjects | Title | Subtitle | Illustrators |
|---|---|---|---|---|---|---|
| Anthony J Martin | 18-03-13 | 978-1605984995 | FossilsDinosaursIcythyology | Dinosaurs Without Bones | Dinosaur Lives Revealed by their Trace Fossils | NA |
| Anthony J Martin | 18-03-06 | 978-1681776569 | DinosaursBiology & Life SciencesPaleontolgy | The Evolution Underground | Burrows, Bunkers, and the Marvelous Subterranean World Beneath our Feet | NA |
| Robert SarudaMatthew Reinhart | 12-07-05 | 978-0763622282 | Children’s ExplorationVeterinary EncyclopediaChildren’s Fossil | Encyclopedia Prehistorica Dinosaurs | The Definitive Pop-Up | Robert SarudaMatthew Reinhart |

HTML

rawHtml <- xml2::read_html("http://www.yasth.org/data607/books.html")
htmlTable <- rvest::html_table(rawHtml)[[1]]
kable(htmlTable)
| Title | Subtitle | Authors | Illustrators | Published | ISBN-13 | Subjects |
|---|---|---|---|---|---|---|
| Dinosaurs Without Bones | Dinosaur Lives Revealed by their Trace Fossils | Anthony J Martin | | 13-Mar-18 | 978-1605984995 | Fossils, Dinosaurs, Icythyology |
| The Evolution Underground | Burrows, Bunkers, and the Marvelous Subterranean World Beneath our Feet | Anthony J Martin | | 6-Mar-14 | 978-1681776569 | Dinosaurs, Biology & Life Sciences, Paleontolgy |
| Encyclopedia Prehistorica Dinosaurs | The Definitive Pop-Up | Robert Saruda, Matthew Reinhart | Robert Saruda, Matthew Reinhart | 12-Jul-05 | 978-0763622282 | Children’s Exploration, Veterinary Encyclopedia, Children’s Fossil |

Analysis

For simple types, such as the Title column, all three formats give the same result.

htmlTable$Title
## [1] "Dinosaurs Without Bones"            
## [2] "The Evolution Underground"          
## [3] "Encyclopedia Prehistorica Dinosaurs"
jsonTable$Title
## [1] "Dinosaurs Without Bones"            
## [2] "The Evolution Underground"          
## [3] "Encyclopedia Prehistorica Dinosaurs"
xmlTable$Title
## [1] "Dinosaurs Without Bones"            
## [2] "The Evolution Underground"          
## [3] "Encyclopedia Prehistorica Dinosaurs"

The HTML table as constructed has lost structure: Authors and Subjects are concatenated into comma-separated strings. This is obviously a contrived example, but it shows that data loss can occur. We can correct for this to some extent, but with real-world data sets it will be error prone (some professionals have commas in their authorial names, for example). The XML table is also concatenated, and without any separator, because of the default data frame conversion. One can, of course, query the underlying XML document directly to fetch the values in a more accurate form.

xmlTable$Authors
## [1] "Anthony J Martin"              "Anthony J Martin"             
## [3] "Robert SarudaMatthew Reinhart"
jsonTable$Authors
## [[1]]
## [1] "Anthony J Martin"
## 
## [[2]]
## [1] "Anthony J Martin"
## 
## [[3]]
## [1] "Robert Saruda"    "Matthew Reinhart"
htmlTable$Authors
## [1] "Anthony J Martin"                "Anthony J Martin"               
## [3] "Robert Saruda, Matthew Reinhart"
unlist(strsplit(htmlTable$Authors,","))
## [1] "Anthony J Martin"  "Anthony J Martin"  "Robert Saruda"    
## [4] " Matthew Reinhart"
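The split above flattens all three books into one vector and leaves a leading space on the second author. A minimal sketch of a per-row split follows, assuming the separator is always a plain comma. The vector `authors_raw` is a hypothetical stand-in for `htmlTable$Authors`, with values copied from the output above so that the block runs without fetching the page.

```r
# Hypothetical stand-in for htmlTable$Authors (values copied from the output above)
authors_raw <- c("Anthony J Martin",
                 "Anthony J Martin",
                 "Robert Saruda, Matthew Reinhart")

# strsplit() returns one character vector per input row, so each book keeps
# its own authors; trimws() removes the space left behind by the ", " separator.
authors_by_book <- lapply(strsplit(authors_raw, ",", fixed = TRUE), trimws)

authors_by_book[[3]]
```

This gives a list shaped like `jsonTable$Authors`, but an author whose name itself contains a comma would still be split incorrectly, so it remains a partial fix.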
purrr::map(XML::getNodeSet(rawXML, "//Books//Authors"), ~ XML::xmlToList(.x))
## [[1]]
## [[1]]$Author
## [1] "Anthony J Martin"
## 
## 
## [[2]]
## [[2]]$Author
## [1] "Anthony J Martin"
## 
## 
## [[3]]
## [[3]]$Author
## [1] "Robert Saruda"
## 
## [[3]]$Author
## [1] "Matthew Reinhart"
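The nested list above can be flattened to one character vector per book, which matches the shape of `jsonTable$Authors`. The sketch below uses xml2 rather than the XML package used above, and runs against a hypothetical miniature document. The element layout (`Books` > `Book` > `Authors` > `Author`) is an assumption inferred from the XPath expression, not taken from the real books.xml.

```r
library(xml2)

# Hypothetical miniature of books.xml; the layout is assumed from the
# XPath "//Books//Authors" and may differ from the real file.
xml_source <- '<Books>
  <Book>
    <Title>Dinosaurs Without Bones</Title>
    <Authors><Author>Anthony J Martin</Author></Authors>
  </Book>
  <Book>
    <Title>Encyclopedia Prehistorica Dinosaurs</Title>
    <Authors><Author>Robert Saruda</Author><Author>Matthew Reinhart</Author></Authors>
  </Book>
</Books>'

doc <- read_xml(xml_source)

# One <Authors> node per book
author_nodes <- xml_find_all(doc, "//Books//Authors")

# For each book, collect the text of its <Author> children as a character vector
authors_by_book <- lapply(author_nodes, function(node) {
  xml_text(xml_find_all(node, "./Author"))
})

authors_by_book
```

Because each author is read from its own element, names that contain commas survive intact, which is the advantage over splitting the HTML strings.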