Objective

Our goal is to create a list of books stored in three formats (an HTML table, an XML file, and a JSON file), then use R to read and parse each file into its own data frame.

Environment Prep

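# Install each package if it is missing, then load it.
# (require() alone does not load a package that was just installed.)
if (!require('rvest')) { install.packages('rvest'); library('rvest') }
if (!require('XML')) { install.packages('XML'); library('XML') }
if (!require('jsonlite')) { install.packages('jsonlite'); library('jsonlite') }
library('DT')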

Read Data

HTML Table

# Parse the HTML file, then pull its table into a data frame
h <- read_html("books.html")
h.df <- data.frame(html_table(h))
knitr::kable(h.df)
| title | year | authors | publisher | numpages | goodreadsrank |
|:------|-----:|:--------|:----------|---------:|--------------:|
| Advanced R | 2015 | Hadley Wickham | CRC Press | 476 | 4.7 |
| R in Action | 2011 | Robert I. Kabacoff | Manning Publications | 472 | 4.1 |
| Automated Data Collection in R | 2015 | Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis | Wiley Press | 480 | 4.0 |

XML

# Parse the XML file, then convert its child nodes into data frame rows
x <- xmlParse("books.xml")
x.df <- xmlToDataFrame(x)
knitr::kable(x.df)
| title | year | authors | publisher | numpages | goodreadsrank |
|:------|:-----|:--------|:----------|:---------|:--------------|
| Advanced R | 2015 | Hadley Wickham | CRC Press | 476 | 4.7 |
| R in Action | 2011 | Robert I. Kabacoff | Manning Publications | 472 | 4.1 |
| Automated Data Collection in R | 2015 | Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis | Wiley Press | 480 | 4.0 |

JSON

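# fromJSON() returns a list; turn each element into a data frame
# and stack the results row-wise
j <- fromJSON("books.json")
j.df <- do.call("rbind", lapply(j, data.frame, stringsAsFactors = FALSE))
rownames(j.df) <- NULL
knitr::kable(j.df)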
| title | year | authors | publisher | numpages | goodreadsrank |
|:------|-----:|:--------|:----------|---------:|--------------:|
| Advanced R | 2015 | Hadley Wickham | CRC Press | 476 | 4.7 |
| R in Action | 2011 | Robert I. Kabacoff | Manning Publications | 472 | 4.1 |
| Automated Data Collection in R | 2015 | Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis | Wiley Press | 480 | 4.0 |

Similarities

At a glance, all three data frames look the same, but we need a deterministic test to confirm whether they really are. We can use the base R function ‘all.equal’ to check.
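
As a quick illustration of how ‘all.equal’ reports its result, here is a minimal sketch on two toy data frames. The names a and b and their contents are made up for the example and are not the book data. The function returns TRUE on a match and otherwise a character vector describing the differences, so wrapping it in isTRUE() gives a plain yes/no answer.

```r
# Two hypothetical data frames with the same values but different
# column classes (character vs. factor). Not the books data.
a <- data.frame(id = c(1, 2), name = c("x", "y"), stringsAsFactors = FALSE)
b <- data.frame(id = c(1, 2), name = c("x", "y"), stringsAsFactors = TRUE)

# all.equal() returns TRUE when the objects match, otherwise a
# character vector of differences; isTRUE() reduces that to TRUE/FALSE.
same_as_self <- isTRUE(all.equal(a, a))
same_as_b    <- isTRUE(all.equal(a, b))

# Keep the raw result to inspect what differs
diffs <- all.equal(a, b)

print(same_as_self)
print(same_as_b)
print(diffs)
```

Because the non-matching result is a character vector rather than FALSE, using all.equal() directly inside if() can fail; isTRUE() is the usual guard.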

HTML to XML

all.equal(h.df, x.df)
##  [1] "Component \"title\": Modes: character, numeric"                              
##  [2] "Component \"title\": Attributes: < target is NULL, current is list >"        
##  [3] "Component \"title\": target is character, current is factor"                 
##  [4] "Component \"year\": Attributes: < target is NULL, current is list >"         
##  [5] "Component \"year\": target is numeric, current is factor"                    
##  [6] "Component \"authors\": Modes: character, numeric"                            
##  [7] "Component \"authors\": Attributes: < target is NULL, current is list >"      
##  [8] "Component \"authors\": target is character, current is factor"               
##  [9] "Component \"publisher\": Modes: character, numeric"                          
## [10] "Component \"publisher\": Attributes: < target is NULL, current is list >"    
## [11] "Component \"publisher\": target is character, current is factor"             
## [12] "Component \"numpages\": Attributes: < target is NULL, current is list >"     
## [13] "Component \"numpages\": target is numeric, current is factor"                
## [14] "Component \"goodreadsrank\": Attributes: < target is NULL, current is list >"
## [15] "Component \"goodreadsrank\": target is numeric, current is factor"

HTML to JSON

all.equal(h.df, j.df)
## [1] "Component \"authors\": Modes: character, list"              
## [2] "Component \"authors\": target is character, current is list"
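
The only difference reported here is that authors is a list column on the JSON side. One possible fix, sketched below on a hypothetical data frame (j_toy and its values are invented for illustration, and the real structure of books.json may differ), is to collapse each list element into a single comma-separated string.

```r
# Hypothetical data frame whose 'authors' column is a list,
# standing in for the JSON import. Not the real books data.
j_toy <- data.frame(title = c("Book A", "Book B"), stringsAsFactors = FALSE)
j_toy$authors <- list(c("Ann", "Bob"), "Cy")

# Collapse every list element into one string so the column
# becomes an ordinary character vector.
j_toy$authors <- vapply(j_toy$authors, paste, character(1), collapse = ", ")

print(class(j_toy$authors))
print(j_toy$authors)
```

vapply() is used instead of sapply() so the result is guaranteed to be a character vector with one string per row.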

XML to JSON

all.equal(x.df, j.df)
## [1] "Component \"title\": 'current' is not a factor"        
## [2] "Component \"year\": 'current' is not a factor"         
## [3] "Component \"authors\": 'current' is not a factor"      
## [4] "Component \"publisher\": 'current' is not a factor"    
## [5] "Component \"numpages\": 'current' is not a factor"     
## [6] "Component \"goodreadsrank\": 'current' is not a factor"

Conclusion

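Although the data frames look the same when printed, it would take some additional work to make them completely equivalent, because the column classes differ. Most importantly, the XML import method made every variable a factor, and the JSON import stored authors as a list rather than a character vector.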
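
A minimal sketch of that additional work for the factor case, using toy data frames in place of the real imports (h_toy, x_toy, and the column names are assumptions made for the example): convert every factor to character first, then convert the numeric columns, and re-run the comparison. Going through as.character() matters because as.numeric() on a factor returns the internal level codes rather than the printed values.

```r
# Hypothetical stand-ins: h_toy mimics the HTML import (character and
# numeric columns), x_toy mimics the XML import (all factors).
h_toy <- data.frame(title = c("Book A", "Book B"),
                    year = c(2015, 2011),
                    rank = c(4.7, 4.1),
                    stringsAsFactors = FALSE)
x_toy <- data.frame(title = c("Book A", "Book B"),
                    year = c("2015", "2011"),
                    rank = c("4.7", "4.1"),
                    stringsAsFactors = TRUE)

# Step 1: factor -> character for every column
x_toy[] <- lapply(x_toy, as.character)

# Step 2: character -> numeric for the columns that hold numbers
num_cols <- c("year", "rank")
x_toy[num_cols] <- lapply(x_toy[num_cols], as.numeric)

# Step 3: re-run the comparison
now_equal <- isTRUE(all.equal(h_toy, x_toy))
print(now_equal)
```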