Objective

Our goal is to store the same list of books in three formats (an HTML table, an XML file, and a JSON file), then use R to read and parse each file into its own data frame, giving three data frames to compare.

Environment Prep

# Install each package only if it is missing, then load it
for (pkg in c('rvest', 'XML', 'jsonlite', 'knitr')) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
  library(pkg, character.only = TRUE)
}

Read Data

HTML Table

# Parse the HTML page; html_table() returns the tables it finds as a list
h <- read_html("books.html")
h.df <- data.frame(html_table(h))
knitr::kable(h.df)

title	year	authors	publisher	numpages	goodreadsrank
Advanced R	2015	Hadley Wickham	CRC Press	476	4.7
R in Action	2011	Robert I. Kabacoff	Manning Publications	472	4.1
Automated Data Collection in R	2015	Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis	Wiley Press	480	4.0

XML

# Parse the XML file and flatten its records into a data frame
x <- xmlParse("books.xml")
x.df <- xmlToDataFrame(x)
knitr::kable(x.df)

title	year	authors	publisher	numpages	goodreadsrank
Advanced R	2015	Hadley Wickham	CRC Press	476	4.7
R in Action	2011	Robert I. Kabacoff	Manning Publications	472	4.1
Automated Data Collection in R	2015	Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis	Wiley Press	480	4.0

JSON

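# Parse the JSON file, then bind the parsed elements row-wise into one data frame
j <- fromJSON("books.json")
j.df <- do.call("rbind", lapply(j, data.frame, stringsAsFactors = FALSE))
rownames(j.df) <- NULL  # drop the row names left over from rbind
knitr::kable(j.df)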

title	year	authors	publisher	numpages	goodreadsrank
Advanced R	2015	Hadley Wickham	CRC Press	476	4.7
R in Action	2011	Robert I. Kabacoff	Manning Publications	472	4.1
Automated Data Collection in R	2015	Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis	Wiley Press	480	4.0

Similarities

At a glance, all three data frames look the same, but we need a deterministic test to confirm whether they really are. The base R function ‘all.equal’ provides one: it returns TRUE when two objects match and otherwise lists the differences it found.
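As a quick illustration of how ‘all.equal’ reports its result, here is a minimal sketch on two toy data frames. The names `a` and `b` and their contents are made up for this example and are not the book data. Because a mismatch comes back as a character vector rather than FALSE, wrapping the call in `isTRUE` is a common way to get a strict yes/no answer.

```r
# Toy data frames; names and values are illustrative only
a <- data.frame(title = c("Advanced R", "R in Action"),
                year = c(2015, 2011),
                stringsAsFactors = FALSE)
b <- a
b$year <- as.character(b$year)  # same values, stored as text instead of numbers

same_aa <- all.equal(a, a)  # TRUE: identical objects
same_ab <- all.equal(a, b)  # character vector describing the mismatch in "year"

isTRUE(same_aa)  # TRUE
isTRUE(same_ab)  # FALSE
```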

HTML to XML

all.equal(h.df, x.df)

##  [1] "Component \"title\": Modes: character, numeric"                              
##  [2] "Component \"title\": Attributes: < target is NULL, current is list >"        
##  [3] "Component \"title\": target is character, current is factor"                 
##  [4] "Component \"year\": Attributes: < target is NULL, current is list >"         
##  [5] "Component \"year\": target is numeric, current is factor"                    
##  [6] "Component \"authors\": Modes: character, numeric"                            
##  [7] "Component \"authors\": Attributes: < target is NULL, current is list >"      
##  [8] "Component \"authors\": target is character, current is factor"               
##  [9] "Component \"publisher\": Modes: character, numeric"                          
## [10] "Component \"publisher\": Attributes: < target is NULL, current is list >"    
## [11] "Component \"publisher\": target is character, current is factor"             
## [12] "Component \"numpages\": Attributes: < target is NULL, current is list >"     
## [13] "Component \"numpages\": target is numeric, current is factor"                
## [14] "Component \"goodreadsrank\": Attributes: < target is NULL, current is list >"
## [15] "Component \"goodreadsrank\": target is numeric, current is factor"

HTML to JSON

all.equal(h.df, j.df)

## [1] "Component \"authors\": Modes: character, list"              
## [2] "Component \"authors\": target is character, current is list"
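The only difference reported here is that the authors column is a list in the JSON data frame. One possible way to flatten such a column is sketched below on toy data. The frame `j_toy` is a hypothetical stand-in for j.df, and the sketch assumes each list element is a character vector of author names, which may not match the actual structure of books.json.

```r
# Toy stand-in for j.df with a list column of author names (illustrative only)
j_toy <- data.frame(title = c("Advanced R", "Automated Data Collection in R"),
                    stringsAsFactors = FALSE)
j_toy$authors <- list("Hadley Wickham",
                      c("Simon Munzert", "Christian Rubba"))

# Collapse each list element into a single comma-separated string
j_toy$authors <- vapply(j_toy$authors, paste, character(1), collapse = ", ")

class(j_toy$authors)  # "character"
```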

XML to JSON

all.equal(x.df, j.df)

## [1] "Component \"title\": 'current' is not a factor"        
## [2] "Component \"year\": 'current' is not a factor"         
## [3] "Component \"authors\": 'current' is not a factor"      
## [4] "Component \"publisher\": 'current' is not a factor"    
## [5] "Component \"numpages\": 'current' is not a factor"     
## [6] "Component \"goodreadsrank\": 'current' is not a factor"

Conclusion

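Although the data frames look the same when printed, it would take some additional work to make them completely equivalent, because the column classes differ. Most importantly, the XML import method made every variable a factor, and the JSON import stored the authors column as a list rather than a character vector.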
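One way to close the gap for the factor columns is sketched below on toy data. The frames `x_toy` and `h_toy` are hypothetical stand-ins for x.df and h.df with fewer columns, and this conversion is only one of several possible approaches, not necessarily the one best suited to the real files.

```r
# Toy stand-in for x.df: every column is a factor, as in the XML import
x_toy <- data.frame(title = c("Advanced R", "R in Action"),
                    year = c("2015", "2011"),
                    goodreadsrank = c("4.7", "4.1"),
                    stringsAsFactors = TRUE)

# Toy stand-in for h.df: character and numeric columns
h_toy <- data.frame(title = c("Advanced R", "R in Action"),
                    year = c(2015L, 2011L),
                    goodreadsrank = c(4.7, 4.1),
                    stringsAsFactors = FALSE)

# Step 1: turn every factor into plain text
x_fixed <- x_toy
x_fixed[] <- lapply(x_fixed, as.character)

# Step 2: let type.convert() turn numeric-looking text into numbers,
# keeping genuine text as character (as.is = TRUE)
x_fixed[] <- lapply(x_fixed, type.convert, as.is = TRUE)

result <- all.equal(h_toy, x_fixed)
isTRUE(result)
```

Assigning into `x_fixed[]` rather than `x_fixed` keeps the object a data frame while replacing its columns.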

Work_with_html_xml_json 7

anjal hussan

3/18/2018