In this lesson we store the same list of books in three formats: an HTML table, an XML file, and a JSON file. We then use R to read and parse each file into its own data frame, review the three resulting data frames, and note any differences.
if (!require('rvest')) { install.packages('rvest'); library('rvest') }
if (!require('XML')) { install.packages('XML'); library('XML') }
if (!require('jsonlite')) { install.packages('jsonlite'); library('jsonlite') }
h <- read_html("books.html")
h.df <- data.frame(html_table(h))
knitr::kable(h.df)
| title | year | authors | publisher | numpages | goodreadsrank |
|---|---|---|---|---|---|
| Advanced R | 2015 | Hadley Wickham | CRC Press | 476 | 4.7 |
| R in Action | 2011 | Robert I. Kabacoff | Manning Publications | 472 | 4.1 |
| Automated Data Collection in R | 2015 | Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis | Wiley Press | 480 | 4.0 |
x <- xmlParse("books.xml")
x.df <- xmlToDataFrame(x)
knitr::kable(x.df)
| title | year | authors | publisher | numpages | goodreadsrank |
|---|---|---|---|---|---|
| Advanced R | 2015 | Hadley Wickham | CRC Press | 476 | 4.7 |
| R in Action | 2011 | Robert I. Kabacoff | Manning Publications | 472 | 4.1 |
| Automated Data Collection in R | 2015 | Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis | Wiley Press | 480 | 4.0 |
j <- fromJSON("books.json")
j.df <- do.call("rbind", lapply(j, data.frame, stringsAsFactors=FALSE))
rownames(j.df) <- NULL
knitr::kable(j.df)
| title | year | authors | publisher | numpages | goodreadsrank |
|---|---|---|---|---|---|
| Advanced R | 2015 | Hadley Wickham | CRC Press | 476 | 4.7 |
| R in Action | 2011 | Robert I. Kabacoff | Manning Publications | 472 | 4.1 |
| Automated Data Collection in R | 2015 | Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis | Wiley Press | 480 | 4.0 |
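They all look equivalent to the eye, but are they? We can use the base R function `all.equal()` to test. It returns `TRUE` when two objects are nearly equal and otherwise returns a character vector describing the differences.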
all.equal(h.df, x.df)
## [1] "Component \"title\": Modes: character, numeric"
## [2] "Component \"title\": Attributes: < target is NULL, current is list >"
## [3] "Component \"title\": target is character, current is factor"
## [4] "Component \"year\": Attributes: < target is NULL, current is list >"
## [5] "Component \"year\": target is numeric, current is factor"
## [6] "Component \"authors\": Modes: character, numeric"
## [7] "Component \"authors\": Attributes: < target is NULL, current is list >"
## [8] "Component \"authors\": target is character, current is factor"
## [9] "Component \"publisher\": Modes: character, numeric"
## [10] "Component \"publisher\": Attributes: < target is NULL, current is list >"
## [11] "Component \"publisher\": target is character, current is factor"
## [12] "Component \"numpages\": Attributes: < target is NULL, current is list >"
## [13] "Component \"numpages\": target is numeric, current is factor"
## [14] "Component \"goodreadsrank\": Attributes: < target is NULL, current is list >"
## [15] "Component \"goodreadsrank\": target is numeric, current is factor"
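A report like the one above can be hard to scan. As a minimal sketch, the column classes of two data frames can be placed side by side instead. The data frames `a` and `b` below are small hypothetical stand-ins, not the lesson's `h.df` and `x.df`, so the example runs without the book files.

```r
# Hypothetical stand-ins for two imports of the same table
a <- data.frame(title = c("Advanced R", "R in Action"),
                year = c(2015, 2011),
                stringsAsFactors = FALSE)
b <- data.frame(title = factor(c("Advanced R", "R in Action")),
                year = factor(c("2015", "2011")))

# all.equal() returns TRUE or a character vector of differences,
# so wrap it in isTRUE() when a single logical value is needed
same <- isTRUE(all.equal(a, b))

# One row per column, one column per data frame, holding each class
classes <- data.frame(a = sapply(a, class),
                      b = sapply(b, class),
                      stringsAsFactors = FALSE)
print(same)
print(classes)
```

The same two calls should work on the real data frames by substituting `h.df` and `x.df` for `a` and `b`.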
all.equal(h.df, j.df)
## [1] "Component \"authors\": Modes: character, list"
## [2] "Component \"authors\": target is character, current is list"
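The only difference reported here is that `authors` is a list in the JSON result. One possible way to flatten such a column, sketched on a hypothetical stand-in `j` rather than on the lesson's `j.df`, is to paste each list element into a single comma-separated string.

```r
# Hypothetical stand-in for a data frame whose authors column is a list
j <- data.frame(title = c("Advanced R", "Automated Data Collection in R"),
                stringsAsFactors = FALSE)
j$authors <- list("Hadley Wickham",
                  c("Simon Munzert", "Christian Rubba"))

# Collapse each list element into one string; vapply() guarantees
# that the result is a character vector with one value per row
j$authors <- vapply(j$authors, paste, character(1), collapse = ", ")

str(j)
```

Whether this matches the HTML column exactly depends on how the author names are separated in the HTML table, so the separator is an assumption to check against the data.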
all.equal(x.df, j.df)
## [1] "Component \"title\": 'current' is not a factor"
## [2] "Component \"year\": 'current' is not a factor"
## [3] "Component \"authors\": 'current' is not a factor"
## [4] "Component \"publisher\": 'current' is not a factor"
## [5] "Component \"numpages\": 'current' is not a factor"
## [6] "Component \"goodreadsrank\": 'current' is not a factor"
While the values in each of the data frames are essentially the same, it would take some additional work to make them completely equivalent. The column classes differ: the XML import stored every variable, including the numeric ones, as a factor, and the JSON import stored `authors` as a list rather than a character vector.
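As a rough sketch of that additional work, factor columns can be converted back to text and then to numbers where the text allows it. The data frame `x` is a hypothetical stand-in for an all-factor import, not the lesson's `x.df`, and this is one base R approach among several.

```r
# Hypothetical stand-in for an import where every column is a factor
x <- data.frame(title = factor(c("Advanced R", "R in Action")),
                year = factor(c("2015", "2011")),
                goodreadsrank = factor(c("4.7", "4.1")))

# Step 1: replace each factor with its character labels
# (as.numeric() on a factor would return the internal codes instead)
x[] <- lapply(x, as.character)

# Step 2: let type.convert() choose integer or numeric where possible;
# as.is = TRUE leaves the remaining columns as character
x[] <- lapply(x, type.convert, as.is = TRUE)

str(x)
```

An alternative is to avoid the factors at import time: `xmlToDataFrame()` has a `stringsAsFactors` argument, though its default value may depend on the installed version of the XML package.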