library(rvest)
library(jsonlite)
library(xml2)
library(XML)
library(stringr)
HTML, XML and JSON tables in R
The purpose of this assignment was to learn how to load html, xml and json files into r. HTML and JSON are straightforward, XML needs tweaking if there is any nesting.
1. HTML - with a couple of lines we can load an html table into a dataframe:
h <- read_html("D:\\RStudio\\CUNY_607\\Assigments\\Assignment 7\\Books.html.txt")
dfHtml <- data.frame(html_table(h))
knitr::kable(dfHtml)
| Zen Flesh, Zen Bones |
Paul Reps, Nyogen Senzaki |
Yes |
Zen |
| Awakening Compassion |
Pema Chodron |
Yes |
Tibetan |
| The Way of the Bodhisattva |
Shantideva |
No |
Indian |
| True Love |
Thich Nhat Hanh |
Yes |
Zen |
2. XML - the xml packages in r will, under certain circumstances, concatenate nested elements (what should be ‘a, b’ becomes ‘ab’). I wrote code to correct this in my table - however, we would need a generalizable function if we were to read xml tables more often.
library(XML)
x <- xmlParse("D:\\RStudio\\CUNY_607\\Assigments\\Assignment 7\\Books.nested.xml.txt")
dfXML <- xmlToDataFrame(x)
dfAuthors <- as.data.frame(t(xpathSApply(x,"//Authors",function(x) xmlSApply(x,xmlValue))))
auth <- as.data.frame(dfAuthors$V1)
dfXML[1,2] <- str_c(auth[1,1], ", ", auth[2,1])
knitr::kable(dfXML)
| Zen Flesh, Zen Bones |
Paul Reps, Nyogen Senzaki |
Yes |
Zen |
| Awakening Compassion |
Pema Chodron |
Yes |
Tibetan |
| The Way of the Bodhisattva |
Shantideva |
No |
Indian |
| True Love |
Thich Nhat Hanh |
Yes |
Zen |
3. JSON needs a line of code to bind the elements and one to remove a rownames column, but otherwise is straightfoward
j <- fromJSON("D:\\RStudio\\CUNY_607\\Assigments\\Assignment 7\\Books.nested.json.txt")
dfJSON <- do.call("rbind", lapply(j, data.frame, stringsAsFactors=FALSE))
rownames(dfJSON) <- NULL
knitr::kable(dfJSON)
| Zen Flesh, Zen Bones |
Paul Reps , Nyogen Senzaki |
Yes |
Zen |
| Awakening Compassion |
Pema Chodron |
Yes |
Tibetan |
| The Way of the Bodhisattva |
Shantideva |
no |
Indian |
| True Love |
Thich Nhat Hanh |
Yes |
Zen |
Are the tables the same? They look the same, but the all.equal function can help us discover differences.
all.equal(dfHtml, dfXML)
## [1] "Names: 1 string mismatch"
all.equal(dfHtml, dfJSON)
## [1] "Component \"Author\": Modes: character, list"
## [2] "Component \"Author\": target is character, current is list"
## [3] "Component \"Contemporary\": 1 string mismatch"
all.equal(dfJSON, dfXML)
## [1] "Names: 1 string mismatch"
## [2] "Component 2: Modes: list, character"
## [3] "Component 2: Component 1: Modes: list, character"
## [4] "Component 2: Component 1: names for target but not for current"
## [5] "Component 2: Component 1: Length mismatch: comparison on first 1 components"
## [6] "Component 2: Component 1: Component 1: 1 string mismatch"
## [7] "Component \"Contemporary\": 1 string mismatch"
We see a lot of differences! One is obvious - the HTML table column name is “Author” while for the others it’s Authors. For the other mismatches, this is because the multiple authors in JSN are being stored as a list, not characters.
Conclusion
HTML, XML, and JSON differ in the ease with which we can import them into r. But we need to learn them all since we may not have a choice.