This requires rvest, jsonlite, xml2, purrr, XML, knitr, and RCurl.
The author admits that dinosaurs can seem rather juvenile, but there is some fascinating work out there, and much more change and turmoil than one would expect for a field focused on the long ago. Dinosaurs Without Bones is highly recommended.
jsonTable <- fromJSON(getURL("http://www.yasth.org/data607/books.json"))
kable(jsonTable)
| Authors | Published | ISBN-13 | Subjects | Title | Subtitle | Illustrators |
|---|---|---|---|---|---|---|
| Anthony J Martin | 18-03-13 | 978-1605984995 | c(“Fossils”, “Dinosaurs”, “Icythyology”) | Dinosaurs Without Bones | Dinosaur Lives Revealed by their Trace Fossils | NULL |
| Anthony J Martin | 18-03-06 | 978-1681776569 | c(“Dinosaurs”, “Biology & Life Sciences”, “Paleontolgy”) | The Evolution Underground | Burrows, Bunkers, and the Marvelous Subterranean World Beneath our Feet | NULL |
| c(“Robert Saruda”, “Matthew Reinhart”) | 12-07-05 | 978-0763622282 | c(“Children’s Exploration”, “Veterinary Encyclopedia”, “Children’s Fossil”) | Encyclopedia Prehistorica Dinosaurs | The Definitive Pop-Up | c(“Robert Saruda”, “Matthew Reinhart”) |
rawXML <- XML::xmlParse(getURL("http://www.yasth.org/data607/books.xml"))
xmlTable <- XML::xmlToDataFrame(rawXML)
kable(xmlTable)
| Authors | Published | ISBN-13 | Subjects | Title | Subtitle | Illustrators |
|---|---|---|---|---|---|---|
| Anthony J Martin | 18-03-13 | 978-1605984995 | FossilsDinosaursIcythyology | Dinosaurs Without Bones | Dinosaur Lives Revealed by their Trace Fossils | NA |
| Anthony J Martin | 18-03-06 | 978-1681776569 | DinosaursBiology & Life SciencesPaleontolgy | The Evolution Underground | Burrows, Bunkers, and the Marvelous Subterranean World Beneath our Feet | NA |
| Robert SarudaMatthew Reinhart | 12-07-05 | 978-0763622282 | Children’s ExplorationVeterinary EncyclopediaChildren’s Fossil | Encyclopedia Prehistorica Dinosaurs | The Definitive Pop-Up | Robert SarudaMatthew Reinhart |
rawHtml <- xml2::read_html("http://www.yasth.org/data607/books.html")
htmlTable <- rvest::html_table(rawHtml)[[1]]
kable(htmlTable)
| Title | Subtitle | Authors | Illustrators | Published | ISBN-13 | Subjects |
|---|---|---|---|---|---|---|
| Dinosaurs Without Bones | Dinosaur Lives Revealed by their Trace Fossils | Anthony J Martin | | 13-Mar-18 | 978-1605984995 | Fossils, Dinosaurs, Icythyology |
| The Evolution Underground | Burrows, Bunkers, and the Marvelous Subterranean World Beneath our Feet | Anthony J Martin | | 6-Mar-14 | 978-1681776569 | Dinosaurs, Biology & Life Sciences, Paleontolgy |
| Encyclopedia Prehistorica Dinosaurs | The Definitive Pop-Up | Robert Saruda, Matthew Reinhart | Robert Saruda, Matthew Reinhart | 12-Jul-05 | 978-0763622282 | Children’s Exploration, Veterinary Encyclopedia, Children’s Fossil |
For simple types, such as the Title column, all three formats produce the same result.
htmlTable$Title
## [1] "Dinosaurs Without Bones"
## [2] "The Evolution Underground"
## [3] "Encyclopedia Prehistorica Dinosaurs"
jsonTable$Title
## [1] "Dinosaurs Without Bones"
## [2] "The Evolution Underground"
## [3] "Encyclopedia Prehistorica Dinosaurs"
xmlTable$Title
## [1] "Dinosaurs Without Bones"
## [2] "The Evolution Underground"
## [3] "Encyclopedia Prehistorica Dinosaurs"
The HTML table, as constructed, has lost information: authors and subjects are concatenated into comma-separated strings. This is obviously a contrived example, but it shows that data loss can occur. We can correct for this to some extent, but with real-world data sets it will be error prone (some professionals have commas in their authorial names, for example). The XML table is also concatenated, in its case with no separator at all, because of the default data frame conversion. One can, of course, go back to the underlying representation of the data to fetch it in a more accurate form.
xmlTable$Authors
## [1] "Anthony J Martin" "Anthony J Martin"
## [3] "Robert SarudaMatthew Reinhart"
jsonTable$Authors
## [[1]]
## [1] "Anthony J Martin"
##
## [[2]]
## [1] "Anthony J Martin"
##
## [[3]]
## [1] "Robert Saruda" "Matthew Reinhart"
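Because jsonlite keeps the multi-valued field as a list column, the book-to-author relationship can be flattened without any string parsing. The following is a minimal sketch in base R; it uses a small inline JSON string as a stand-in for books.json, and the field layout shown is an assumption rather than a copy of the real file.

```r
library(jsonlite)

# Inline stand-in for books.json; the field layout is an assumption
books_json <- '[
  {"Title": "Dinosaurs Without Bones",
   "Authors": ["Anthony J Martin"]},
  {"Title": "Encyclopedia Prehistorica Dinosaurs",
   "Authors": ["Robert Saruda", "Matthew Reinhart"]}
]'
books <- fromJSON(books_json)

# Authors is a list column: one character vector per book
n_authors <- lengths(books$Authors)

# Build one row per (book, author) pair
book_authors <- data.frame(
  Title  = rep(books$Title, n_authors),
  Author = unlist(books$Authors),
  stringsAsFactors = FALSE
)
```

The resulting `book_authors` data frame has one author per row, which is usually easier to filter or join than a list column.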
htmlTable$Authors
## [1] "Anthony J Martin" "Anthony J Martin"
## [3] "Robert Saruda, Matthew Reinhart"
unlist(strsplit(htmlTable$Authors,","))
## [1] "Anthony J Martin" "Anthony J Martin" "Robert Saruda"
## [4] " Matthew Reinhart"
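The split above leaves a leading space on the second author and merges the authors of every book into a single vector. A slightly more careful version might look like the sketch below. It works on a hard-coded copy of the column, and the last entry ("Jane Doe, PhD") is a hypothetical value added only to show where comma splitting breaks down.

```r
# Hard-coded copy of the Authors column, plus one hypothetical name
# that legitimately contains a comma
authors <- c(
  "Anthony J Martin",
  "Robert Saruda, Matthew Reinhart",
  "Jane Doe, PhD"  # hypothetical: a single author
)

# Split on a comma and any whitespace after it,
# keeping one character vector per book
split_authors <- strsplit(authors, ",\\s*")

# Number of pieces per entry; the last entry is split in two
# even though it names one person
pieces <- lengths(split_authors)
```

Keeping the result as a list preserves which authors belong to which book, but no separator-based rule can tell a second author from a suffix such as "PhD".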
purrr::map(XML::getNodeSet(rawXML, "//Books//Authors"), ~ XML::xmlToList(.x))
## [[1]]
## [[1]]$Author
## [1] "Anthony J Martin"
##
##
## [[2]]
## [[2]]$Author
## [1] "Anthony J Martin"
##
##
## [[3]]
## [[3]]$Author
## [1] "Robert Saruda"
##
## [[3]]$Author
## [1] "Matthew Reinhart"
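The nested list above can be reduced to one plain character vector per book by querying the Author nodes directly. This is a minimal sketch using xml2 on an inline document; the element name Book and the overall layout are assumptions inferred from the XPath used above, not taken from books.xml itself.

```r
library(xml2)

# Inline stand-in for books.xml; the <Book> element name is an assumption
xml_src <- "<Books>
  <Book>
    <Title>Dinosaurs Without Bones</Title>
    <Authors><Author>Anthony J Martin</Author></Authors>
  </Book>
  <Book>
    <Title>Encyclopedia Prehistorica Dinosaurs</Title>
    <Authors>
      <Author>Robert Saruda</Author>
      <Author>Matthew Reinhart</Author>
    </Authors>
  </Book>
</Books>"
doc <- read_xml(xml_src)

# For each <Authors> node, collect the text of its <Author> children
authors_by_book <- lapply(
  xml_find_all(doc, "//Authors"),
  function(node) xml_text(xml_find_all(node, "./Author"))
)
```

Each element of `authors_by_book` is a character vector, so the structure matches the list column that jsonlite produces for the same data.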