DATA607_Home_Work_7

Dilip Ganesan
03/18/2017

DATA 607 Home Work

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.

Environment Setup

if (!require('rvest')) install.packages('rvest')

## Loading required package: rvest

## Loading required package: xml2

if (!require('XML')) install.packages('XML')

## Loading required package: XML

## 
## Attaching package: 'XML'

## The following object is masked from 'package:rvest':
## 
##     xml

if (!require('jsonlite')) install.packages('jsonlite')

## Loading required package: jsonlite

Loading files into R

# Loading Html into R using rvest.
html=read_html("books.html")
html_tab=html_table(html)
html_df = data.frame(html_tab)
knitr::kable(html_df)

title	year	authors	publisher	numpages	goodreadsrank
R for Data Sciene	2016	Hadley Wickham	O Reilly	492	4.7
Data Structures and Algorithms	2003	Robert Lafore	SAMS	780	4.1
Automated Data Collection in R	2015	Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis	Wiley Press	480	4.0
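
One detail worth noting about the HTML step: html_table() applied to a whole document returns a list with one element per table, which is why the result is wrapped in data.frame() above. A minimal sketch of selecting the table explicitly is shown here. The inline HTML string and the names page, tables and books are hypothetical stand-ins for illustration, not the contents of books.html.

```r
library(rvest)

# Hypothetical one-row table standing in for books.html (an assumption).
page <- read_html(
  "<table>
     <tr><th>title</th><th>year</th></tr>
     <tr><td>Book A</td><td>2016</td></tr>
   </table>"
)

# html_table() on a document returns a list, one element per <table>.
tables <- html_table(page)

# Pick the first (here, only) table and coerce it to a plain data frame.
books <- as.data.frame(tables[[1]])
str(books)
```

Indexing with [[1]] makes the choice of table explicit, which may matter if a page ever contains more than one table.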

# Loading JSON into R using jsonlite.
json=fromJSON("books.json")

# Approach taken from a reference found via a stack trace:
# replace NULL entries with NA, then flatten each list element.
json = lapply(json, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
})
json_df=do.call("rbind", json)
json_df=t(json_df)

knitr::kable(json_df)

	rbooklist
title1	R for Data Sciene
title2	Data Structures and Algorithms
title3	Automated Data Collection in R
year1	2016
year2	2003
year3	2015
authors1	Hadley Wickham
authors2	Robert Lafore
authors3	Simon Munzert
authors4	Christian Rubba
authors5	Peter Meissner
authors6	Dominic Nyhuis
publisher1	O Reilly
publisher2	SAMS
publisher3	Wiley Press
numpages1	492
numpages2	780
numpages3	480
goodreadsrank1	4.7
goodreadsrank2	4.1
goodreadsrank3	4
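
The flattened output above (title1 ... authors6) suggests that unlist() spread the nested authors field across separate rows. As a hedged alternative, the sketch below keeps one row per book by collapsing each book's authors into a single string. The JSON text is a hypothetical stand-in, and its layout (an rbooklist array of objects with an authors array) is an assumption inferred from the output, not a copy of books.json.

```r
library(jsonlite)

# Hypothetical stand-in for books.json; the real file's layout is assumed.
json_txt <- '{"rbooklist": [
  {"title": "Book A", "year": 2016, "authors": ["Author One"]},
  {"title": "Book B", "year": 2015, "authors": ["Author Two", "Author Three"]}
]}'

# simplifyMatrix = FALSE keeps the nested authors arrays as a list column.
books <- fromJSON(json_txt, simplifyMatrix = FALSE)$rbooklist

# Collapse each book's authors into one comma-separated string.
books$authors <- vapply(books$authors, paste, character(1), collapse = ", ")

str(books)
```

With the authors collapsed, the result is an ordinary data frame with one row per book, which is closer in shape to the HTML and XML versions.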

# Loading XML into R using the XML package.
xml_file=xmlParse("books.xml")
xml_df=xmlToDataFrame(xml_file)

knitr::kable(xml_df)

title	year	authors	publisher	numpages	goodreadsrank
R for Data Sciene	2016	Hadley Wickham	O Reilly	492	4.7
Data Structures and Algorithms	2003	Robert Lafore	SAMS	780	4.1
Automated Data Collection in R	2015	Simon Munzert, Christian Rubba, Peter Meissner, Dominic Nyhuis	Wiley Press	480	4.0

Comparing Data Frames

# Test Case 1 : HTML vs JSON

all.equal(html_df,json_df)

##  [1] "Modes: list, character"                                              
##  [2] "Lengths: 6, 21"                                                      
##  [3] "names for target but not for current"                                
##  [4] "Attributes: < Names: 2 string mismatches >"                          
##  [5] "Attributes: < Component 1: Modes: character, numeric >"              
##  [6] "Attributes: < Component 1: Lengths: 1, 2 >"                          
##  [7] "Attributes: < Component 1: target is character, current is numeric >"
##  [8] "Attributes: < Component 2: Modes: numeric, list >"                   
##  [9] "Attributes: < Component 2: Lengths: 3, 2 >"                          
## [10] "Attributes: < Component 2: target is numeric, current is list >"     
## [11] "current is not list-like"

# Test Case 2 : HTML vs XML

all.equal(html_df,xml_df)

##  [1] "Component \"title\": Modes: character, numeric"                              
##  [2] "Component \"title\": Attributes: < target is NULL, current is list >"        
##  [3] "Component \"title\": target is character, current is factor"                 
##  [4] "Component \"year\": Attributes: < target is NULL, current is list >"         
##  [5] "Component \"year\": target is numeric, current is factor"                    
##  [6] "Component \"authors\": Modes: character, numeric"                            
##  [7] "Component \"authors\": Attributes: < target is NULL, current is list >"      
##  [8] "Component \"authors\": target is character, current is factor"               
##  [9] "Component \"publisher\": Modes: character, numeric"                          
## [10] "Component \"publisher\": Attributes: < target is NULL, current is list >"    
## [11] "Component \"publisher\": target is character, current is factor"             
## [12] "Component \"numpages\": Attributes: < target is NULL, current is list >"     
## [13] "Component \"numpages\": target is numeric, current is factor"                
## [14] "Component \"goodreadsrank\": Attributes: < target is NULL, current is list >"
## [15] "Component \"goodreadsrank\": target is numeric, current is factor"
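
Every message in the HTML vs XML comparison is a type mismatch (factor against character or numeric), not a difference in values. A minimal sketch of normalising the column types before comparing is given below, using base R only. The two small data frames, html_like and xml_like, are hypothetical stand-ins for html_df and xml_df, and the assumption is that the XML columns arrive as factors, as the output above indicates.

```r
# Hypothetical stand-ins for html_df and xml_df (assumed shapes).
html_like <- data.frame(
  title = c("Book A", "Book B"),
  year  = c(2016, 2015),
  stringsAsFactors = FALSE
)
xml_like <- data.frame(
  title = c("Book A", "Book B"),
  year  = c("2016", "2015"),
  stringsAsFactors = TRUE   # mimic all-factor columns
)

# Step 1: turn every factor column back into character.
xml_fixed <- xml_like
xml_fixed[] <- lapply(xml_fixed, as.character)

# Step 2: let R re-detect numeric columns; as.is = TRUE keeps text as character.
xml_fixed[] <- lapply(xml_fixed, type.convert, as.is = TRUE)

same <- isTRUE(all.equal(html_like, xml_fixed))
print(same)
```

Assigning into xml_fixed[] keeps the data frame structure while replacing each column, so only the column types change.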

# Test Case 3 : JSON vs XML

all.equal(json_df,xml_df)

##  [1] "Modes: character, list"                                                           
##  [2] "Lengths: 21, 6"                                                                   
##  [3] "names for current but not for target"                                             
##  [4] "Attributes: < Names: 2 string mismatches >"                                       
##  [5] "Attributes: < Component 1: Modes: numeric, character >"                           
##  [6] "Attributes: < Component 1: Lengths: 2, 1 >"                                       
##  [7] "Attributes: < Component 1: target is numeric, current is character >"             
##  [8] "Attributes: < Component 2: Modes: list, numeric >"                                
##  [9] "Attributes: < Component 2: Length mismatch: comparison on first 2 components >"   
## [10] "Attributes: < Component 2: Component 1: Modes: character, numeric >"              
## [11] "Attributes: < Component 2: Component 1: Lengths: 21, 1 >"                         
## [12] "Attributes: < Component 2: Component 1: target is character, current is numeric >"
## [13] "Attributes: < Component 2: Component 2: Modes: character, numeric >"              
## [14] "Attributes: < Component 2: Component 2: target is character, current is numeric >"
## [15] "target is matrix, current is data.frame"

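Test Result

The three objects hold the same values but differ in structure, so all.equal() reports mismatches in every comparison. html_df is a data frame with character and numeric columns, xml_df is a data frame whose columns are all factors, and json_df is a 21-element character matrix rather than a data frame. Test Cases 1 and 3 both involve json_df, so their outputs are near mirror images of each other.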