Introduction

The assignment involved creating tables of our favorite books in HTML, XML, and JSON formats. A few books on my shelf that have some meaning to me are Bill Bryson’s A Short History of Nearly Everything, Stephen Hawking’s A Brief History of Time: From the Big Bang to Black Holes, and one of my first programming books, Data Structures and Other Objects Using C++ by Michael Main and Walter Savitch.

Relevance

Data from the web comes in a wide array of formats, e.g., CSV, HTML, XML, and JSON. Handling data from different sources and coercing it into a table within R is a fundamental skill.


HTML

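The chunk below reads the HTML file from GitHub and coerces it into a data frame with readHTMLTable from the XML package; getURL from the RCurl package fetches the raw file.

library(XML)    # readHTMLTable
library(RCurl)  # getURL

# Download the page and parse its first table, using the first row as the header
books_html <- readHTMLTable(
    getURL(
        "https://raw.githubusercontent.com/Liam-O/DATA607/master/HW7/books.html"),
    header = TRUE, which = 1)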

class(books_html)
## [1] "data.frame"
knitr::kable(books_html)
| bookID | Title | Author | ISBN-13 | Edition | Pages |
|:-------|:------|:-------|:--------|:--------|:------|
| 1 | A Short History of Nearly Everything | Bill Bryson | 978-0767908184 | 1 | 544 |
| 2 | A Brief History of Time: From the Big Bang to Black Holes | Stephen Hawking | 978-0553053401 | 1 | 198 |
| 3 | Data Structures and Other Objects Using C++ | Micahel Main, Walter Savitch | 978-0132129480 | 4 | 848 |
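
To see what the header and which arguments do without depending on the GitHub file, the same call can be tried on a small inline table. This is a minimal sketch: the html_text string is made-up sample markup, not the contents of books.html, and htmlParse with asText = TRUE is used here in place of getURL.

```r
library(XML)

# Hypothetical two-row table standing in for books.html
html_text <- paste0(
  "<html><body><table>",
  "<tr><th>bookID</th><th>Title</th></tr>",
  "<tr><td>1</td><td>A Short History of Nearly Everything</td></tr>",
  "<tr><td>2</td><td>A Brief History of Time</td></tr>",
  "</table></body></html>")

# Parse the string as HTML content rather than as a file name or URL
doc <- htmlParse(html_text, asText = TRUE)

# which = 1 returns the first table as a data frame instead of a list of tables;
# header = TRUE takes the column names from the first row
sample_html <- readHTMLTable(doc, header = TRUE, which = 1,
                             stringsAsFactors = FALSE)
str(sample_html)
```

Leaving out which would return a list with one data frame per table on the page, which is useful when a page holds more than one table.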

XML

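XML is a little trickier to work with. We use xmlToList to coerce it into a list and then ldply to turn that list into a data frame; these come from the XML and plyr packages, respectively. The pipe and select come from dplyr.

library(plyr)   # ldply; load before dplyr to avoid masking problems
library(dplyr)  # %>% and select

# Parse the XML into a list, bind one row per book, and drop ldply's .id column
books_xml <- ldply(
    xmlToList(
        getURL(
            "https://raw.githubusercontent.com/Liam-O/DATA607/master/HW7/books.xml")), data.frame) %>%
    select(-.id)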

class(books_xml)
## [1] "data.frame"
knitr::kable(books_xml)
| bookID | title | author | ISBN.13 | Edition | Pages |
|:-------|:------|:-------|:--------|:--------|:------|
| 1 | A Short History of Nearly Everything | Bill Bryson | 978-0767908184 | 1 | 544 |
| 2 | A Brief History of Time: From the Big Bang to Black Holes | Stephen Hawking | 978-0553053401 | 1 | 198 |
| 3 | Data Structures and Other Objects Using C++ | Micahel Main, Walter Savitch | 978-0132129480 | 4 | 848 |
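
When every record has the same flat set of child elements, xmlToDataFrame from the XML package is another way to get a data frame in a single call. The sketch below assumes a layout of one book element per record; the sample string is illustrative and may not match the structure of books.xml.

```r
library(XML)

# Hypothetical document: a root element holding one <book> per record
xml_text <- paste0(
  "<books>",
  "<book><bookID>1</bookID><title>A Short History of Nearly Everything</title></book>",
  "<book><bookID>2</bookID><title>A Brief History of Time</title></book>",
  "</books>")

# Parse the string as XML content rather than as a file name
doc <- xmlParse(xml_text, asText = TRUE)

# Each child of the root becomes a row; each of its children becomes a column
sample_xml <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
str(sample_xml)
```

Columns are read as text unless colClasses is supplied, so a numeric field such as bookID would still need converting afterwards.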

JSON

The JSON file ended up being the trickiest. JSON made it easiest to store a book's authors as a nested list, but extracting that nested list to build a data frame took some effort. I found no elegant method, so I fell back on brute force: collapsing each author list into a single comma-separated string and casting the result to a list. A better solution would be needed if the JSON data were more complex.

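# Download the raw JSON text and parse it into an R object
books_json <- fromJSON(
    getURL(
        "https://raw.githubusercontent.com/Liam-O/DATA607/master/HW7/books.json"))

# Collapse each book's author list into one comma-separated string
books_json$Author <- as.list(
    ldply(books_json$Author,
          function(x) ifelse(
              length(unlist(x)) > 1, paste(unlist(x), collapse = ", "), x)))
# Convert to a data frame and restore the name of the author column
books_json <- as.data.frame(books_json)
colnames(books_json)[3] <- "Author"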
knitr::kable(books_json)
| bookID | Title | Author | ISBN.13 | Edition | Pages |
|:-------|:------|:-------|:--------|:--------|:------|
| 1 | A Short History of Nearly Everything | Bill Bryson | 978-0767908184 | 1 | 544 |
| 2 | A Brief History of Time: From the Big Bang to Black Holes | Stephen Hawking | 978-0553053401 | 1 | 198 |
| 3 | Data Structures and Other Objects Using C++ | Micahel Main, Walter Savitch | 978-0132129480 | 4 | 848 |
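
The flattening step can also be written in base R without ldply. The sketch below is one possible alternative, not the method used above: it assumes the parsed JSON is a list with one element per field and that Author holds one character vector per book. The books list is hand-built sample data, and collapse_authors is a hypothetical helper.

```r
# Hypothetical stand-in for the parsed JSON: one list element per field,
# with Author holding a character vector per book
books <- list(
  bookID = 1:3,
  Title  = c("A Short History of Nearly Everything",
             "A Brief History of Time: From the Big Bang to Black Holes",
             "Data Structures and Other Objects Using C++"),
  Author = list("Bill Bryson",
                "Stephen Hawking",
                c("Michael Main", "Walter Savitch"))
)

# Collapse each book's authors into one comma-separated string;
# vapply guarantees exactly one string per book
collapse_authors <- function(authors) {
  vapply(authors,
         function(a) paste(unlist(a), collapse = ", "),
         character(1))
}

books$Author <- collapse_authors(books$Author)

# Every element is now a plain vector of length 3, so the list converts cleanly
books_df <- as.data.frame(books, stringsAsFactors = FALSE)
print(books_df$Author)
```

Because paste with collapse also handles a single author, no length check is needed, and the Author column keeps its name, so no renaming step follows.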