The assignment involved creating tables of our favorite books in HTML, XML and JSON formats. A few books on my shelf that have some meaning to me are Bill Bryson’s A Short History of Nearly Everything, Stephen Hawking’s A Brief History of Time: From the Big Bang to Black Holes and one of my first programming books, Data Structures and Other Objects Using C++ by Michael Main and Walter Savitch.
Data from the web comes in a wide array of formats, e.g., CSV, HTML, XML and JSON. Handling data from different sources and coercing it into a table within R is a fundamental skill.
The chunk below reads the HTML file from GitHub and coerces it into a data frame with help from readHTMLTable from the XML package; getURL from the RCurl package fetches the page.
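library(XML)    # readHTMLTable, xmlToList
library(RCurl)  # getURL
# Download the page as text, then parse its first table into a data frame
books_html <- readHTMLTable(
  getURL("https://raw.githubusercontent.com/Liam-O/DATA607/master/HW7/books.html"),
  header = TRUE, which = 1)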
class(books_html)
## [1] "data.frame"
knitr::kable(books_html)
| bookID | Title | Author | ISBN-13 | Edition | Pages |
|---|---|---|---|---|---|
| 1 | A Short History of Nearly Everything | Bill Bryson | 978-0767908184 | 1 | 544 |
| 2 | A Brief History of Time: From the Big Bang to Black Holes | Stephen Hawking | 978-0553053401 | 1 | 198 |
| 3 | Data Structures and Other Objects Using C++ | Micahel Main, Walter Savitch | 978-0132129480 | 4 | 848 |
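The XML is a little trickier to work with. We use xmlToList (XML package) to coerce the document into a nested list, then ldply (plyr package) to turn each list element into a row and bind the rows into a data frame. ldply adds an .id column holding the element names, which we drop.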
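library(plyr)   # ldply
library(dplyr)  # select and %>%
# Parse the XML into a nested list, build one row per book, then drop .id
books_xml <- ldply(
  xmlToList(
    getURL("https://raw.githubusercontent.com/Liam-O/DATA607/master/HW7/books.xml")),
  data.frame) %>%
  select(-.id)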
class(books_xml)
## [1] "data.frame"
knitr::kable(books_xml)
| bookID | title | author | ISBN.13 | Edition | Pages |
|---|---|---|---|---|---|
| 1 | A Short History of Nearly Everything | Bill Bryson | 978-0767908184 | 1 | 544 |
| 2 | A Brief History of Time: From the Big Bang to Black Holes | Stephen Hawking | 978-0553053401 | 1 | 198 |
| 3 | Data Structures and Other Objects Using C++ | Micahel Main, Walter Savitch | 978-0132129480 | 4 | 848 |
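To make the list-to-rows step more concrete, here is a minimal base R sketch of roughly what ldply with data.frame does. The list `xml_list` is a hypothetical stand-in, typed by hand, for the kind of nested list xmlToList might return; it is not read from the actual books.xml, and the names `xml_list` and `xml_df` are made up for illustration.

```r
# Hypothetical nested list: one named element per <book> node (an assumption,
# not the real parsed file). XML values arrive as character strings.
xml_list <- list(
  book = list(bookID = "1", title = "A Short History of Nearly Everything", Pages = "544"),
  book = list(bookID = "2", title = "A Brief History of Time", Pages = "198")
)

# Turn each book into a one-row data frame
rows <- lapply(xml_list, function(b) as.data.frame(b, stringsAsFactors = FALSE))

# Stack the rows; unname() keeps the list names out of the row names
xml_df <- do.call(rbind, unname(rows))
```

Unlike ldply, this version adds no .id column, so nothing has to be removed afterwards. Every column is still character here; numeric columns such as Pages would need an explicit as.integer() if they are to be used as numbers.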
The JSON object ended up being the trickiest. JSON made it easiest to store several authors for one book as a nested list, but extracting that nested list to build a flat data table took some effort. I could not find an elegant method, so I resorted to brute force: collapsing each book's authors into a single string and casting the result to a list. A better solution would be needed if the JSON data were more complex.
# Parse the JSON text downloaded from GitHub
books_json <- fromJSON(
  getURL("https://raw.githubusercontent.com/Liam-O/DATA607/master/HW7/books.json"))
# Collapse each book's authors into one comma-separated string
books_json$Author <- as.list(
  ldply(books_json$Author,
        function(x) ifelse(
          length(unlist(x)) > 1, paste(unlist(x), collapse = ", "), x)))
books_json <- as.data.frame(books_json)
# Name the third column "Author"
colnames(books_json)[3] <- "Author"
knitr::kable(books_json)
| bookID | Title | Author | ISBN.13 | Edition | Pages |
|---|---|---|---|---|---|
| 1 | A Short History of Nearly Everything | Bill Bryson | 978-0767908184 | 1 | 544 |
| 2 | A Brief History of Time: From the Big Bang to Black Holes | Stephen Hawking | 978-0553053401 | 1 | 198 |
| 3 | Data Structures and Other Objects Using C++ | Micahel Main, Walter Savitch | 978-0132129480 | 4 | 848 |
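One possible way to tidy the author handling is sketched below in base R. It assumes the parsed JSON is a list with one element per book, where Author holds either a single name or a nested list of names; `json_list`, `flatten_authors` and `json_df` are hypothetical names, and the data is typed by hand rather than read from books.json. Treat it as an illustration of the idea, not as a drop-in replacement for the chunk above.

```r
# Hypothetical parsed JSON (an assumption about its shape, not the real file)
json_list <- list(
  list(bookID = 1, Title = "A Short History of Nearly Everything",
       Author = "Bill Bryson"),
  list(bookID = 3, Title = "Data Structures and Other Objects Using C++",
       Author = list("Michael Main", "Walter Savitch"))
)

# Collapse a book's authors into one string and return a one-row data frame
flatten_authors <- function(book) {
  book$Author <- paste(unlist(book$Author), collapse = ", ")
  as.data.frame(book, stringsAsFactors = FALSE)
}

# Apply to every book and stack the rows
json_df <- do.call(rbind, lapply(json_list, flatten_authors))
```

Because paste() with collapse returns a single string for one name as well as for several, the length check and ifelse from the original chunk are not needed, and the Author column keeps its name without a later rename.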