In this assignment we will connect to and load 3 different data formats into an R dataframe: HTML, XML, and JSON.
library(XML)
library(RCurl)
library(jsonlite)
library(plyr)
library(dplyr)
library(knitr)
Here we will load an HTML page containing a table into R. This is accomplished using the readHTMLTable function in the XML package. For each of the below examples, we’ll use the getURL function from RCurl package to create the connection via http.
file <- getURL("https://raw.githubusercontent.com/jreznyc/DATA607/master/HW/HW7/data/books.html")
books_html <- readHTMLTable(file,header = TRUE, which = 1)
kable(books_html)
bookID | Title | Author | ISBN-13 | Genre | Pages |
---|---|---|---|---|---|
1 | The Lean Startup | Eric Ries | 978-0307887894 | Business | 336 |
2 | Astrophysics for People in a Hurry | Neil deGrasse Tyson | 978-0393609394 | Astrophysics & Space Science | 224 |
3 | The Balanced Scorecard | Robert Kaplan, David Norton | 978-0875846514 | Business | 336 |
Below we’ll perform the same task, but connecting to an XML file. I’ve found the most elegant solution to be using the xmlToList function together with ldply to join the list elements into a dataframe. Finally, we use the select function to remove the “.id” column that gets added by the xmlToList function.
file <- getURL("https://raw.githubusercontent.com/jreznyc/DATA607/master/HW/HW7/data/books.xml")
books_xml <- ldply(xmlToList(file),data.frame) %>% select(-.id)
kable(books_xml)
bookID | title | author | ISBN.13 | Genre | Pages |
---|---|---|---|---|---|
1 | The Lean Startup | Eric Ries | 978-0307887894 | Business | 336 |
2 | Astrophysics for People in a Hurry | Neil deGrasse Tyson | 978-0393609394 | Astrophysics & Space Science | 224 |
3 | The Balanced Scorecard | Robert Kaplan, David Norton | 978-0875846514 | Business | 336 |
Finally, we load in a JSON. R kept interpreting the column containing the two authors as a column of lists, so to turn that into a character column we use the mutate function.
file <- getURL("https://raw.githubusercontent.com/jreznyc/DATA607/master/HW/HW7/data/books.json")
books_json <- fromJSON(file)
books_json_df <- books_json$books %>% mutate(author = sapply(author, toString)) #To handle the book with two authors
kable(books_json_df)
bookID | title | author | ISBN-13 | Genre | Pages |
---|---|---|---|---|---|
1 | The Lean Startup | Eric Ries | 978-0307887894 | Business | 336 |
2 | Astrophysics for People in a Hurry | Neil deGrasse Tyson | 978-0393609394 | Astrophysics & Space Science | 224 |
3 | The Balanced Scorecard | Robert Kaplan, David Norton | 978-0875846514 | Business | 336 |
https://stackoverflow.com/questions/2067098/how-to-transform-xml-data-into-a-data-frame