Our goal is to store a list of books in three formats (an HTML table, an XML file, and a JSON file), then use R to read and parse each file into its own data frame, giving three data frames in total.
# Install each package if it is missing, then attach it.
for (pkg in c('rvest', 'XML', 'jsonlite')) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
  library(pkg, character.only = TRUE)
}
library('DT')
h <- read_html("books.html")
h.df <- data.frame(html_table(h))
knitr::kable(h.df)
| title | year | authors | publisher | numpages | goodreadsrank |
|---|---|---|---|---|---|
| Advanced R | 2015 | Hadley Wickham | CRC Press | 476 | 4.7 |
| R in Action | 2011 | Robert I. Kabacoff | Manning Publications | 472 | 4.1 |
| Automated Data Collection in R | 2015 | Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis | Wiley Press | 480 | 4.0 |
x <- xmlParse("books.xml")
x.df <- xmlToDataFrame(x)
knitr::kable(x.df)
| title | year | authors | publisher | numpages | goodreadsrank |
|---|---|---|---|---|---|
| Advanced R | 2015 | Hadley Wickham | CRC Press | 476 | 4.7 |
| R in Action | 2011 | Robert I. Kabacoff | Manning Publications | 472 | 4.1 |
| Automated Data Collection in R | 2015 | Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis | Wiley Press | 480 | 4.0 |
j <- fromJSON("books.json")
j.df <- do.call("rbind", lapply(j, data.frame, stringsAsFactors=FALSE))
rownames(j.df) <- NULL
knitr::kable(j.df)
| title | year | authors | publisher | numpages | goodreadsrank |
|---|---|---|---|---|---|
| Advanced R | 2015 | Hadley Wickham | CRC Press | 476 | 4.7 |
| R in Action | 2011 | Robert I. Kabacoff | Manning Publications | 472 | 4.1 |
| Automated Data Collection in R | 2015 | Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis | Wiley Press | 480 | 4.0 |
At a glance, all three data frames look the same, but we need a deterministic test to confirm whether they really are identical. We can use the base R function ‘all.equal’ for this.
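As a quick aside on how to read the results that follow: ‘all.equal’ returns TRUE when two objects match and otherwise returns a character vector describing each difference, so it is usually wrapped in ‘isTRUE’ when a plain yes/no answer is needed. The sketch below illustrates this on toy data frames; the objects a, b, c1, and c2 are hypothetical and are not the book data.

```r
# Toy data frames (hypothetical values, not the book data).
a <- data.frame(id = 1:2, name = c("x", "y"), stringsAsFactors = FALSE)
b <- data.frame(id = 1:2, name = c("x", "y"), stringsAsFactors = TRUE)

# all.equal() returns TRUE on a match, otherwise a character vector
# describing the differences, so wrap it in isTRUE() for a yes/no answer.
same_ab <- isTRUE(all.equal(a, b))  # FALSE: character column vs factor column
same_aa <- isTRUE(all.equal(a, a))  # TRUE

# identical() is stricter: it rejects tiny numeric differences that
# all.equal() accepts under its default tolerance.
c1 <- data.frame(v = 1)
c2 <- data.frame(v = 1 + 1e-10)
tol_equal    <- isTRUE(all.equal(c1, c2))  # TRUE
strict_equal <- identical(c1, c2)          # FALSE
```

For comparing imported data, the tolerance in ‘all.equal’ is usually what we want, since it reports class and attribute mismatches without failing on floating-point noise.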
all.equal(h.df, x.df)
## [1] "Component \"title\": Modes: character, numeric"
## [2] "Component \"title\": Attributes: < target is NULL, current is list >"
## [3] "Component \"title\": target is character, current is factor"
## [4] "Component \"year\": Attributes: < target is NULL, current is list >"
## [5] "Component \"year\": target is numeric, current is factor"
## [6] "Component \"authors\": Modes: character, numeric"
## [7] "Component \"authors\": Attributes: < target is NULL, current is list >"
## [8] "Component \"authors\": target is character, current is factor"
## [9] "Component \"publisher\": Modes: character, numeric"
## [10] "Component \"publisher\": Attributes: < target is NULL, current is list >"
## [11] "Component \"publisher\": target is character, current is factor"
## [12] "Component \"numpages\": Attributes: < target is NULL, current is list >"
## [13] "Component \"numpages\": target is numeric, current is factor"
## [14] "Component \"goodreadsrank\": Attributes: < target is NULL, current is list >"
## [15] "Component \"goodreadsrank\": target is numeric, current is factor"
all.equal(h.df, j.df)
## [1] "Component \"authors\": Modes: character, list"
## [2] "Component \"authors\": target is character, current is list"
all.equal(x.df, j.df)
## [1] "Component \"title\": 'current' is not a factor"
## [2] "Component \"year\": 'current' is not a factor"
## [3] "Component \"authors\": 'current' is not a factor"
## [4] "Component \"publisher\": 'current' is not a factor"
## [5] "Component \"numpages\": 'current' is not a factor"
## [6] "Component \"goodreadsrank\": 'current' is not a factor"
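Although the data frames look the same when printed, it would take some additional work to make them fully equivalent, because the column classes differ. Most importantly, the XML import method made every variable a factor. The JSON import is much closer to the HTML import: the only difference is that it stored authors as a list rather than a character vector.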
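One possible way to do that additional work is sketched below, using base R only. It is a minimal sketch under stated assumptions, not the definitive fix: h_df, x_df, and j_df are small stand-ins with made-up values that only reproduce the class differences reported above, and normalize_df is a hypothetical helper introduced for illustration.

```r
# Stand-in data frames (toy values, not the book data) that mimic the
# class differences reported by all.equal() above.
h_df <- data.frame(title = c("A", "B"), year = c(2015, 2011),
                   authors = c("P, Q", "R"), stringsAsFactors = FALSE)
x_df <- data.frame(title = factor(c("A", "B")),
                   year = factor(c("2015", "2011")),
                   authors = factor(c("P, Q", "R")))
j_df <- data.frame(title = c("A", "B"), year = c(2015, 2011),
                   stringsAsFactors = FALSE)
j_df$authors <- list(c("P", "Q"), "R")  # list column, as in the JSON import

# Hypothetical helper: turn every column into plain character (collapsing
# list columns into comma-separated strings), then let type.convert()
# re-infer numeric columns so all three sources end up with the same classes.
normalize_df <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.list(col)) {
      vapply(col, paste, character(1), collapse = ", ")
    } else {
      as.character(col)
    }
  })
  df[] <- lapply(df, type.convert, as.is = TRUE)
  rownames(df) <- NULL
  df
}

h_n <- normalize_df(h_df)
x_n <- normalize_df(x_df)
j_n <- normalize_df(j_df)

all.equal(h_n, x_n)  # TRUE
all.equal(h_n, j_n)  # TRUE
```

Collapsing the authors list into one string is a design choice that matches the HTML table; keeping authors as a list in all three data frames would be an equally valid target. The factor problem can also be avoided at import time: the versions of the XML package I am aware of let xmlToDataFrame take a stringsAsFactors argument, though that is worth checking against the installed version.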