Abstract

Our goal in this lesson is to store the same list of books in three formats (an HTML table, an XML file, and a JSON file), use R to read and parse each file into its own data frame, and then compare the three resulting data frames and note any differences.

Environment Prep

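# Install each package if it is missing, then attach it.
# (require() alone does not attach a package that was just installed.)
for (pkg in c("rvest", "XML", "jsonlite")) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
  library(pkg, character.only = TRUE)
}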

Importing

HTML Table

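# Parse the HTML document
h <- read_html("books.html")
# html_table() extracts the tables it finds; books.html is assumed to hold exactly one
h.df <- data.frame(html_table(h))
knitr::kable(h.df)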

title	year	authors	publisher	numpages	goodreadsrank
Advanced R	2015	Hadley Wickham	CRC Press	476	4.7
R in Action	2011	Robert I. Kabacoff	Manning Publications	472	4.1
Automated Data Collection in R	2015	Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis	Wiley Press	480	4.0

XML

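# Parse the XML document into an internal tree
x <- xmlParse("books.xml")
# Turn each record node into a row, with child nodes as columns
x.df <- xmlToDataFrame(x)
knitr::kable(x.df)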

title	year	authors	publisher	numpages	goodreadsrank
Advanced R	2015	Hadley Wickham	CRC Press	476	4.7
R in Action	2011	Robert I. Kabacoff	Manning Publications	472	4.1
Automated Data Collection in R	2015	Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis	Wiley Press	480	4.0

JSON

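# Parse the JSON document into R objects
j <- fromJSON("books.json")
# Convert each parsed element to a data frame and stack the results by row
j.df <- do.call("rbind", lapply(j, data.frame, stringsAsFactors = FALSE))
# Drop the row names that rbind generates
rownames(j.df) <- NULL
knitr::kable(j.df)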

title	year	authors	publisher	numpages	goodreadsrank
Advanced R	2015	Hadley Wickham	CRC Press	476	4.7
R in Action	2011	Robert I. Kabacoff	Manning Publications	472	4.1
Automated Data Collection in R	2015	Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis	Wiley Press	480	4.0

Testing Similarity

They all look equivalent to the eye, but are they? We can test with the base R function all.equal, which returns TRUE when two objects match and otherwise returns a description of the differences it found.
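As a quick illustration of how all.equal behaves before applying it to the data frames, here is a minimal sketch on toy vectors. The vectors and variable names are hypothetical and are not part of the lesson's data. Because all.equal returns a character vector rather than FALSE when objects differ, it is usually wrapped in isTRUE() when a plain logical is needed.

```r
# Toy vectors (hypothetical; not the lesson's data)
a <- c(476, 472, 480)
b <- a + 1e-10            # differs only by floating-point noise
a_chr <- as.character(a)  # same values, stored as text

exact     <- identical(a, b)       # exact comparison, no tolerance
res_close <- all.equal(a, b)       # comparison within a numeric tolerance
res_type  <- all.equal(a, a_chr)   # describes the type mismatch

# Reduce each result to a single logical
same_close <- isTRUE(res_close)
same_type  <- isTRUE(res_type)
```

Here all.equal treats a and b as equal even though identical does not, while the numeric and character versions are reported as different.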

HTML to XML

all.equal(h.df, x.df)

##  [1] "Component \"title\": Modes: character, numeric"                              
##  [2] "Component \"title\": Attributes: < target is NULL, current is list >"        
##  [3] "Component \"title\": target is character, current is factor"                 
##  [4] "Component \"year\": Attributes: < target is NULL, current is list >"         
##  [5] "Component \"year\": target is numeric, current is factor"                    
##  [6] "Component \"authors\": Modes: character, numeric"                            
##  [7] "Component \"authors\": Attributes: < target is NULL, current is list >"      
##  [8] "Component \"authors\": target is character, current is factor"               
##  [9] "Component \"publisher\": Modes: character, numeric"                          
## [10] "Component \"publisher\": Attributes: < target is NULL, current is list >"    
## [11] "Component \"publisher\": target is character, current is factor"             
## [12] "Component \"numpages\": Attributes: < target is NULL, current is list >"     
## [13] "Component \"numpages\": target is numeric, current is factor"                
## [14] "Component \"goodreadsrank\": Attributes: < target is NULL, current is list >"
## [15] "Component \"goodreadsrank\": target is numeric, current is factor"
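The messages above can be read more easily with a look at how R stores a factor. The sketch below uses small hypothetical columns (not the lesson's data frames) to show why all.equal reports a mode of "numeric" and a non-NULL attribute list for a factor, and how sapply can summarize the class of every column at once.

```r
# Hypothetical stand-ins for two imports of the same column
chr_col <- c("Advanced R", "R in Action")
fct_col <- factor(chr_col)

chr_class <- class(chr_col)   # character vector: plain text
fct_class <- class(fct_col)   # factor
fct_mode  <- mode(fct_col)    # a factor is integer codes, so its mode is numeric
fct_attrs <- names(attributes(fct_col))  # levels and class; a plain vector has none

# Per-column class summary for a whole data frame
toy <- data.frame(title = fct_col, year = c(2015, 2011))
col_classes <- sapply(toy, class)
```

Running sapply(h.df, class) and sapply(x.df, class) on the imported data frames would give the same kind of summary for the real data.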

HTML to JSON

all.equal(h.df, j.df)

## [1] "Component \"authors\": Modes: character, list"              
## [2] "Component \"authors\": target is character, current is list"
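The only difference reported here is that authors is a list column in the JSON import. One possible way to flatten such a column is sketched below on a toy data frame; the data frame is hypothetical and only mirrors the shape reported above, and joining names with a comma is an assumption about the desired result.

```r
# Toy data frame with a list column (hypothetical; mirrors the reported shape)
toy <- data.frame(title = c("Advanced R", "Automated Data Collection in R"),
                  stringsAsFactors = FALSE)
toy$authors <- list("Hadley Wickham",
                    c("Simon Munzert", "Christian Rubba"))

was_list <- is.list(toy$authors)

# Collapse each list element into one comma-separated string
toy$authors <- vapply(toy$authors, paste, character(1), collapse = ", ")
```

After this step the column is an ordinary character vector with one string per book.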

XML to JSON

all.equal(x.df, j.df)

## [1] "Component \"title\": 'current' is not a factor"        
## [2] "Component \"year\": 'current' is not a factor"         
## [3] "Component \"authors\": 'current' is not a factor"      
## [4] "Component \"publisher\": 'current' is not a factor"    
## [5] "Component \"numpages\": 'current' is not a factor"     
## [6] "Component \"goodreadsrank\": 'current' is not a factor"

Conclusion

While the contents of the three data frames are essentially the same, making them completely equivalent would take some additional work. The column classes differ: the XML import stored every variable as a factor, and the JSON import stored authors as a list column rather than a character vector.
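The additional work could look something like the following sketch, which converts an all-factor data frame back to typed columns and then checks the result with all.equal. The data frames here are small hypothetical stand-ins, not the lesson's imports, and the approach (factor to character, then type.convert) is one option among several.

```r
# Toy frames (hypothetical): one typed, one all-factor like the XML import
typed <- data.frame(title = c("Advanced R", "R in Action"),
                    year = c(2015L, 2011L),
                    goodreadsrank = c(4.7, 4.1),
                    stringsAsFactors = FALSE)
all_factor <- data.frame(lapply(typed, function(col) factor(as.character(col))))

# Step 1: turn every factor back into plain text
as_text <- data.frame(lapply(all_factor, as.character), stringsAsFactors = FALSE)

# Step 2: let R infer numeric columns; as.is = TRUE keeps text as character
restored <- as_text
restored[] <- lapply(as_text, type.convert, as.is = TRUE)

result <- all.equal(typed, restored)
```

The same two steps could be applied to x.df before comparing it with h.df, though the real data may need further adjustments.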

DATA 607: HTML, XML, JSON; Week 7

Walt Wells, Fall 2016