Introduction

In this assignment we will connect to and load 3 different data formats into an R dataframe: HTML, XML, and JSON.

library(XML)
library(RCurl)
library(jsonlite)
library(plyr)
library(dplyr)
library(knitr)

HTML

Here we will load an HTML page containing a table into R. This is accomplished using the readHTMLTable function in the XML package. For each of the below examples, we’ll use the getURL function from RCurl package to create the connection via http.

file <- getURL("https://raw.githubusercontent.com/jreznyc/DATA607/master/HW/HW7/data/books.html")
books_html <- readHTMLTable(file,header = TRUE, which = 1)
kable(books_html)
bookID Title Author ISBN-13 Genre Pages
1 The Lean Startup Eric Ries 978-0307887894 Business 336
2 Astrophysics for People in a Hurry Neil deGrasse Tyson 978-0393609394 Astrophysics & Space Science 224
3 The Balanced Scorecard Robert Kaplan, David Norton 978-0875846514 Business 336

XML

Below we’ll perform the same task, but connecting to an XML file. I’ve found the most elegant solution to be using the xmlToList function together with ldply to join the list elements into a dataframe. Finally, we use the select function to remove the “.id” column that gets added by the xmlToList function.

file <- getURL("https://raw.githubusercontent.com/jreznyc/DATA607/master/HW/HW7/data/books.xml")
books_xml <- ldply(xmlToList(file),data.frame) %>% select(-.id)
kable(books_xml)
bookID title author ISBN.13 Genre Pages
1 The Lean Startup Eric Ries 978-0307887894 Business 336
2 Astrophysics for People in a Hurry Neil deGrasse Tyson 978-0393609394 Astrophysics & Space Science 224
3 The Balanced Scorecard Robert Kaplan, David Norton 978-0875846514 Business 336

JSON

Finally, we load in a JSON. R kept interpreting the column containing the two authors as a column of lists, so to turn that into a character column we use the mutate function.

file <- getURL("https://raw.githubusercontent.com/jreznyc/DATA607/master/HW/HW7/data/books.json")
books_json <- fromJSON(file)
books_json_df <- books_json$books %>% mutate(author = sapply(author, toString)) #To handle the book with two authors
kable(books_json_df)
bookID title author ISBN-13 Genre Pages
1 The Lean Startup Eric Ries 978-0307887894 Business 336
2 Astrophysics for People in a Hurry Neil deGrasse Tyson 978-0393609394 Astrophysics & Space Science 224
3 The Balanced Scorecard Robert Kaplan, David Norton 978-0875846514 Business 336

https://stackoverflow.com/questions/2067098/how-to-transform-xml-data-into-a-data-frame