The assignment is to load 3 different data structures (HTML, JSON and XML) into R and read the tables from these. They should all contain the same information: favorite books with title, authors and things we like about them. However, we need to identify any differences in data structure of the resulting data frame.
Load needed libraries
library(RCurl)
library(XML)
library(methods)
library(jsonlite)
library(rlist)
library(knitr)
create my github path
urlRemote <- "https://raw.githubusercontent.com/"
pathGithub <- "chilleundso/DATA607/master/Assignment7/"
We start of by downloading our HTML format from our Github account and saving it into a dataframe format:
#create HTML URL
fileNameHTML <- "HTMLtable.html"
HTML_URL <- paste0(urlRemote, pathGithub, fileNameHTML)
#We get and read HTML
HTML <- getURLContent(HTML_URL)
HTML <- readHTMLTable(HTML)
#make HTML into dataframe
HTML <- list.clean(HTML, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(HTML, function(t) dim(t)[1]))
HTML_table <- HTML[[which.max(n.rows)]]
#print HTML table
kable(HTML_table)
| Book Name | Book Author | What I like |
|---|---|---|
| Freakonomics | Steven D. Levitt, Stephen J. Dubner | fun examples, educational, thought provoking |
| The Trial | Franz Kafka | dystopia, justice and judgment, isolation |
| The Stranger | Albert Camus | “philosophy of the absurd”, existentialism |
Next we do the same with the JSON format:
#create JSON URL
fileNameJSON <- "JSONtable.json"
JSON_URL <- paste0(urlRemote, pathGithub, fileNameJSON)
#We get and read JSON
JSON <- fromJSON(JSON_URL)
#make JSON into dataframe
JSON_table <- JSON[[1]]
JSON_table <- as.data.frame(JSON_table)
#print JSON table
kable(JSON_table)
| Book Name | Book Author | What I like |
|---|---|---|
| Freakonomics | c(“Steven D. Levitt”, “Stephen J. Dubner”) | c(“fun examples”, “educational”, “thought provoking”) |
| The Trial | Franz Kafka | c(“dystopia”, “justice and judgment”, “isolation”) |
| The Stranger | Albert Camus | c(“philosophy of the absurd”, “existentialism”) |
And finally we do the same with the XML format:
#create XML URL
fileNameXML <- "XMLtable.xml"
XML_URL <- paste0(urlRemote, pathGithub, fileNameXML)
#We get and parse JSON
XML_data <- getURL(XML_URL)
XML_table <- xmlParse(XML_data)
#make XML into dataframe
XML_table <- xmlToDataFrame(XML_table)
#print XML table
kable(XML_table)
| Name | Author | What_I_Like |
|---|---|---|
| Freakonomics | Steven D. Levitt, Stephen J. Dubner | fun examples, educational, thought provoking |
| The Trial | Franz Kafka | dystopia, justice and judgment, isolation |
| The Stranger | Albert Camus | philosophy of the absurd, existentialism |
We can see that the HTML and XML result in the same data frame, with all fields being “simple” strings. The JSON file on the other hand has vectors for the fields where there are multiple inputs (multiple authors and multiple things I like about each book). This makes working with the JSON file easier down the road since we will not need to do further tidying if we would like to for example count how many books one specific author has written.
GitHub: https://github.com/chilleundso/DATA607/blob/master/Assignment7/Data607_Assignment7.Rmd