Problem Statement

Working with XML and JSON in R - Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). ——————————————————————————–

Acceptance Criteria

  • Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames.
  • Are the three data frames identical?
  • Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com

Analysis

  • In the first record, Player Number, Player Name, Total Points, and the Round 1 - 7 results can be found.
  • In the second record, Player’s State, USCF ID, Pre-Rating and Post-Rating values can be found.
  • Remaining other fields not relevant for this project

Approach

  • Source .CSV, .HTML and .XML files from AWS S3
  • Parse each file separately to extract desired data
  • Compare each data frame for data analysis and conclusion

library(DT)
library(dplyr)
library(htmltools)
library (readr)
library(aws.s3)
library(jsonlite)
library(rvest)
library(XML)

Using AWS S3

  • Using AWS S3 to export .CSV, .HTML and .XML files
  • AWS S3 is available to access data files from anywhere on internet

Using laresbernardo/lares

  • Using laresbernardo/lares R package to protect AWS S3 credentials
  • A config.yaml file maintained to retain AWS S3 access keys
library(lares)
bucket<-get_creds()$`aws.s3`$bucket
Sys.setenv(
  "AWS_ACCESS_KEY_ID" = get_creds()$`aws.s3`$accessKeyId,
  "AWS_SECRET_ACCESS_KEY" = get_creds()$`aws.s3`$accessKey,
  "AWS_DEFAULT_REGION" = get_creds()$`aws.s3`$region)

Book Catalog

Following 3 books with respective writers are used for this assignment

# CSV
books.csv<-s3read_using(FUN = read.csv, object = "Books.csv", bucket = bucket)
datatable(books.csv, 
          options = list(pageLength = 10,searching = FALSE,filter=FALSE, 
                         pageLength = FALSE),rownames = FALSE)

JSON (JavaScript Object Notation)

Load Book.json from S3

# JSON
books.vector <-get_object(bucket = bucket,object = "Books.json")
books.json <- fromJSON(rawToChar(books.vector))

XML (Extensible Markup Language )

Load Book.xml from S3

# XML
books.vector <-get_object(bucket = bucket,object = "Books.xml")
books.xml <- xmlParse(rawToChar(books.vector)) 
books.xml<- xmlSApply(xmlRoot(books.xml), function(x) xmlSApply(x, xmlValue))

HTML (Hypertext Markup Language )

Load HTML from S3

# HTML
books.vector <-get_object(bucket = bucket,object = "Books.html")
books.vector<-rawToChar(books.vector)
# ignore 0th element (as NULL)
dataframe_html <- (readHTMLTable(books.vector))[[1]]

Compare for similarities

To find different resulting data frames were identical

  • Compare JSON and XML
  • Compare JSON and HTML
  • Compare XML to HTML

Compare JSON and XML

all.equal(dataframe_json, dataframe_xml)
##  [1] "Modes: list, character"                                              
##  [2] "Lengths: 6, 18"                                                      
##  [3] "names for target but not for current"                                
##  [4] "Attributes: < Names: 2 string mismatches >"                          
##  [5] "Attributes: < Component 1: Modes: character, numeric >"              
##  [6] "Attributes: < Component 1: Lengths: 1, 2 >"                          
##  [7] "Attributes: < Component 1: target is character, current is numeric >"
##  [8] "Attributes: < Component 2: Modes: numeric, list >"                   
##  [9] "Attributes: < Component 2: Lengths: 3, 2 >"                          
## [10] "Attributes: < Component 2: target is numeric, current is list >"     
## [11] "current is not list-like"

Findings

  • There are mismatch on data types - list, character
  • XML file retain a list of the authors if book is written by more than one author
  • Differences on Publisher mentioned due to ASCII character for O’Reilly Media
  • As per programmatic comparison - these data frames are not identical.
  • However, visual display of data in table shows data frames are identical

Compare JSON and HTML

all.equal(dataframe_json, dataframe_html)
## [1] "Component \"Author\": Modes: list, character"                                        
## [2] "Component \"Author\": Component 2: Lengths (2, 1) differ (string compare on first 1)"
## [3] "Component \"Author\": Component 2: 1 string mismatch"                                
## [4] "Component \"Author\": Component 3: Lengths (3, 1) differ (string compare on first 1)"
## [5] "Component \"Author\": Component 3: 1 string mismatch"                                
## [6] "Component \"Publisher\": 1 string mismatch"                                          
## [7] "Component \"Year\": Modes: numeric, character"                                       
## [8] "Component \"Year\": target is numeric, current is character"

Findings

  • There are mismatch on data types - author, year and publisher are not compatible
  • As per programmatic comparison - these data frames are not identical.
  • However, Vsual display of data in table shows data frames are identical

Compare xml and HTML

all.equal(dataframe_xml, dataframe_html)
##  [1] "Modes: character, list"                                                           
##  [2] "Lengths: 18, 6"                                                                   
##  [3] "names for current but not for target"                                             
##  [4] "Attributes: < Names: 2 string mismatches >"                                       
##  [5] "Attributes: < Component 1: Modes: numeric, character >"                           
##  [6] "Attributes: < Component 1: Lengths: 2, 1 >"                                       
##  [7] "Attributes: < Component 1: target is numeric, current is character >"             
##  [8] "Attributes: < Component 2: Modes: list, numeric >"                                
##  [9] "Attributes: < Component 2: Length mismatch: comparison on first 2 components >"   
## [10] "Attributes: < Component 2: Component 1: Modes: character, numeric >"              
## [11] "Attributes: < Component 2: Component 1: Lengths: 3, 1 >"                          
## [12] "Attributes: < Component 2: Component 1: target is character, current is numeric >"
## [13] "Attributes: < Component 2: Component 2: Modes: character, numeric >"              
## [14] "Attributes: < Component 2: Component 2: Lengths: 6, 1 >"                          
## [15] "Attributes: < Component 2: Component 2: target is character, current is numeric >"
## [16] "target is matrix, current is data.frame"

Findings

  • There are mismatch on data types - list, character
  • XML file retain a list of the authors if book is written by more than one author
  • Differences on Publisher mentioned due to ASCII character for O’Reilly Media
  • As per programmatic comparison - these data frames are not identical
  • HTML data set is matrix, that can be a factor of this difference

Conclusion

  • Despite the very different file types, HTML and JSON files were read into R using the same R code and functions
  • After all of the scraping was complete, all of the files produced data tables that were identical
  • While the characters in each of the data frames are essentially the same, it would take some additional work to make them completely equivalent