Abstract

The goal of this lesson is to store a list of books in three formats (an HTML table, an XML file, and a JSON file), use R to read and parse each file into its own data frame, and then compare the three data frames and note any differences.

Environment Prep

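# Load each package; if it is missing, install it first and then load it
if (!require('rvest')) { install.packages('rvest'); library('rvest') }
if (!require('XML')) { install.packages('XML'); library('XML') }
if (!require('jsonlite')) { install.packages('jsonlite'); library('jsonlite') }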

Importing

HTML Table

# Parse the HTML file, then pull its table into a data frame
h <- read_html("books.html")
h.df <- data.frame(html_table(h))
knitr::kable(h.df)
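|title                          | year|authors                                                        |publisher            | numpages| goodreadsrank|
|:------------------------------|----:|:--------------------------------------------------------------|:--------------------|--------:|-------------:|
|Advanced R                     | 2015|Hadley Wickham                                                 |CRC Press            |      476|           4.7|
|R in Action                    | 2011|Robert I. Kabacoff                                             |Manning Publications |      472|           4.1|
|Automated Data Collection in R | 2015|Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis |Wiley Press          |      480|           4.0|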

XML

x <- xmlParse("books.xml")
x.df <- xmlToDataFrame(x)
knitr::kable(x.df)
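|title                          |year |authors                                                        |publisher            |numpages |goodreadsrank |
|:------------------------------|:----|:--------------------------------------------------------------|:--------------------|:--------|:-------------|
|Advanced R                     |2015 |Hadley Wickham                                                 |CRC Press            |476      |4.7           |
|R in Action                    |2011 |Robert I. Kabacoff                                             |Manning Publications |472      |4.1           |
|Automated Data Collection in R |2015 |Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis |Wiley Press          |480      |4.0           |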

JSON

j <- fromJSON("books.json")
j.df <- do.call("rbind", lapply(j, data.frame, stringsAsFactors=FALSE))
rownames(j.df) <- NULL
knitr::kable(j.df)
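|title                          | year|authors                                                        |publisher            | numpages| goodreadsrank|
|:------------------------------|----:|:--------------------------------------------------------------|:--------------------|--------:|-------------:|
|Advanced R                     | 2015|Hadley Wickham                                                 |CRC Press            |      476|           4.7|
|R in Action                    | 2011|Robert I. Kabacoff                                             |Manning Publications |      472|           4.1|
|Automated Data Collection in R | 2015|Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis |Wiley Press          |      480|           4.0|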

Testing Similarity

They all look equivalent to the eye, but are they? We can use the base package function ‘all.equal’ to test.
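
As a quick illustration of how all.equal reports its result, here is a minimal sketch using two toy data frames (hypothetical values, not the book data). The function returns TRUE when the objects match and a character vector of differences otherwise, so wrapping it in isTRUE() is the usual way to get a single logical.

```r
# Toy data frames (hypothetical, not the book data) that print identically
a <- data.frame(year = c(2015, 2011), stringsAsFactors = FALSE)
b <- data.frame(year = c("2015", "2011"), stringsAsFactors = FALSE)

# all.equal() returns TRUE or a character vector describing the differences
res_same <- all.equal(a, a)
res_diff <- all.equal(a, b)

# isTRUE() reduces either kind of result to one logical, safe to use in if()
same_ok <- isTRUE(res_same)
diff_ok <- isTRUE(res_diff)
```

Because a failed comparison is a character vector rather than FALSE, using all.equal directly inside if() would raise an error; isTRUE avoids that.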

HTML to XML

all.equal(h.df, x.df)
##  [1] "Component \"title\": Modes: character, numeric"                              
##  [2] "Component \"title\": Attributes: < target is NULL, current is list >"        
##  [3] "Component \"title\": target is character, current is factor"                 
##  [4] "Component \"year\": Attributes: < target is NULL, current is list >"         
##  [5] "Component \"year\": target is numeric, current is factor"                    
##  [6] "Component \"authors\": Modes: character, numeric"                            
##  [7] "Component \"authors\": Attributes: < target is NULL, current is list >"      
##  [8] "Component \"authors\": target is character, current is factor"               
##  [9] "Component \"publisher\": Modes: character, numeric"                          
## [10] "Component \"publisher\": Attributes: < target is NULL, current is list >"    
## [11] "Component \"publisher\": target is character, current is factor"             
## [12] "Component \"numpages\": Attributes: < target is NULL, current is list >"     
## [13] "Component \"numpages\": target is numeric, current is factor"                
## [14] "Component \"goodreadsrank\": Attributes: < target is NULL, current is list >"
## [15] "Component \"goodreadsrank\": target is numeric, current is factor"
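
The output above shows that every column of the XML data frame is a factor. One possible way to bring it closer to the HTML data frame is sketched below, using a small hypothetical stand-in (x_toy) rather than the real x.df: convert each factor to character, then let type.convert guess which columns are numeric.

```r
# Hypothetical stand-in for x.df: every column stored as a factor
x_toy <- data.frame(
  title    = factor(c("Advanced R", "R in Action")),
  year     = factor(c("2015", "2011")),
  numpages = factor(c("476", "472"))
)

# Convert factors to character first; calling as.numeric() directly on a
# factor would return the internal level codes, not the printed values
x_chr <- data.frame(lapply(x_toy, as.character), stringsAsFactors = FALSE)

# type.convert() picks a suitable type per column;
# as.is = TRUE keeps text columns as character instead of factor
x_fixed <- type.convert(x_chr, as.is = TRUE)

str(x_fixed)
```

This is only a sketch under the assumption that the factor levels hold clean text or numbers; columns with stray characters would stay as character.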

HTML to JSON

all.equal(h.df, j.df)
## [1] "Component \"authors\": Modes: character, list"              
## [2] "Component \"authors\": target is character, current is list"
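
Here the only difference is that authors is a list column in the JSON data frame. A minimal sketch of one way to flatten such a column follows; j_toy is a hypothetical stand-in for j.df, and it assumes each list element is a character vector of names.

```r
# Hypothetical stand-in for j.df: authors is a list column
j_toy <- data.frame(
  title = c("Advanced R", "Automated Data Collection in R"),
  stringsAsFactors = FALSE
)
j_toy$authors <- list(
  "Hadley Wickham",
  c("Simon Munzert", "Christian Rubba")
)

# Collapse each list element into a single comma-separated string,
# which turns the list column into an ordinary character column
j_toy$authors <- vapply(j_toy$authors, paste, character(1), collapse = ", ")
```

vapply is used instead of sapply so that the result is guaranteed to be a character vector with one string per row.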

XML to JSON

all.equal(x.df, j.df)
## [1] "Component \"title\": 'current' is not a factor"        
## [2] "Component \"year\": 'current' is not a factor"         
## [3] "Component \"authors\": 'current' is not a factor"      
## [4] "Component \"publisher\": 'current' is not a factor"    
## [5] "Component \"numpages\": 'current' is not a factor"     
## [6] "Component \"goodreadsrank\": 'current' is not a factor"

Conclusion

While the values in each of the data frames are essentially the same, it would take some additional work to make the data frames completely equivalent. The column classes differ: most notably, the XML import stored every variable as a factor, and the JSON import stored the authors column as a list.