Working with XML and JSON in R

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical? Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web]


INDEX (Step by Step)

STEP 1. Load Libraries
STEP 2. Load the file
STEP 3. Create HTML Table
STEP 4. Create XML Table
STEP 5. Create JSON Table
STEP 6. Compare the dataframes
STEP 7. Conclusion

STEP 0 : Good Practise

##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 482134 25.8     940480 50.3   750400 40.1
## Vcells 864339  6.6    1650153 12.6  1113040  8.5

STEP 1 : Load your libraries

# Load the libraries
library(RCurl)      #For File Operations
## Loading required package: bitops
library(XML)        #For XML Operations
library(jsonlite)   #For JSon operations
library(rjson)      #For JSon Operations - This one did not work well
## 
## Attaching package: 'rjson'
## The following objects are masked from 'package:jsonlite':
## 
##     fromJSON, toJSON
library(DT)         #For Data table package
library(plyr)       #For lplyr operation

STEP 2 : Load the File

# Good Practise: Set up the Working Directory when working with a file system
setwd("C:\\CUNY\\607Data\\Assignments")

# Set the File URL
html.url <- "https://raw.githubusercontent.com/rajk11040/CUNY607/master/07assignment_favbooks.html"

xml.url <- "https://raw.githubusercontent.com/rajk11040/CUNY607/master/07assignment_favbooks.xml"

json.url <- "https://raw.githubusercontent.com/rajk11040/CUNY607/master/07assignment_favbooks.json"

STEP 3. Create HTML Table

  1. HTML Table Create a HTML Table.

# Read the File
html.book <- getURL(html.url)
html.book <- readHTMLTable(html.book, header = TRUE)
html.book <- data.frame(html.book)
names(html.book)
## [1] "NULL.Name"      "NULL.Author"    "NULL.Publisher" "NULL.Year"
names(html.book) <- c("Book", "Author", "Publisher", "Year")

datatable(html.book)

STEP 4. Create XML Table

  1. XML Table Create a XML Table.

# Read the File
xml.book <- getURL(xml.url)
xml.book <- xmlTreeParse(xml.book, useInternal = TRUE)
xml.book <- xmlToDataFrame(xml.book)
names(xml.book) <- c("Book", "Author", "Publisher", "Year")

datatable(xml.book)

STEP 5. Create JSON Table

  1. JSON Table Create a JSON Table.

# Read the File
json.book <- getURL(json.url)

# If had issues with rjson as it did not readh the elements correctly into data.frame 
# I switched to jsonlite as it seemed to wokr better
json.book <- jsonlite::fromJSON(json.book)
json.book <- ldply(json.book, data.frame)

#remove the .id column
json.book <- json.book %>% dplyr::select(2:5)
names(json.book) <- c("Book", "Author", "Publisher", "Year")
datatable(json.book)

STEP 6. Compare the data frames

library(diffobj)
## Warning: package 'diffobj' was built under R version 3.4.4
diffObj(html.book, xml.book)
## < html.book                                                    
## > xml.book                                                     
## @@ 7,4 / 7,4 @@                                                
## ~                         Author                 Publisher Year
##   1                   James Alen Value Publishing Reprints 2017
##   2                 Trina Paulus         Paulus Publishing 1973
## < 3 Richard Bach, Russell Munson                  Scribner 2014
## > 3 Richard BachRussell Munson                  Scribner 2014  
##   4                 Paulo Coelho                 HarperOne 2014
diffObj(xml.book, json.book)
## < xml.book                                                     
## > json.book                                                    
## @@ 7,4 / 7,4 @@                                                
## ~                       Author                 Publisher Year  
##   1                 James Alen Value Publishing Reprints 2017  
##   2               Trina Paulus         Paulus Publishing 1973  
## < 3 Richard BachRussell Munson                  Scribner 2014  
## > 3 Richard Bach, Russell Munson                  Scribner 2014
##   4               Paulo Coelho                 HarperOne 2014
diffObj(json.book, html.book)
## < str(json.book)                                                           
## > str(html.book)                                                           
## @@ 1,9 / 1,5 @@                                                            
##   'data.frame':   4 obs. of  4 variables:                                  
## <  $ Book     : chr  "As a Man Thinketh" "Hope for the Flowers" "Jonathan L
## : ivingston Seagull: The Complete Edition" "The Alchemist"                 
## <  $ Author   :List of 4                                                   
## <   ..$ : chr "James Alen"                                                 
## <   ..$ : chr "Trina Paulus"                                               
## <   ..$ : chr  "Richard Bach" "Russell Munson"                             
## <   ..$ : chr "Paulo Coelho"                                               
## <  $ Publisher: chr  "Value Publishing Reprints" "Paulus Publishing" "Scrib
## : ner" "HarperOne"                                                         
## <  $ Year     : int  2017 1973 2014 2014                                   
## >  $ Book     : Factor w/ 4 levels "As a Man Thinketh",..: 1 2 3 4         
## >  $ Author   : Factor w/ 4 levels "James Alen","Paulo Coelho",..: 1 4 3 2 
## >  $ Publisher: Factor w/ 4 levels "HarperOne","Paulus Publishing",..: 4 2 
## : 3 1                                                                      
## >  $ Year     : Factor w/ 3 levels "1973","2014",..: 3 1 2 2

STEP 7. Conclusion

Conclusion - Differences

1. **Overall:**  We can see mostly the dataframes were similar except the XML added 2 elements together causing a difference. Also Json added .id field which was removed. 


1. **HTML Table:**  readHTMLTable returns a list and it needs to be converted to data frame. Converting to dataframe added NULL to the column names, which had to be cleaned. 


2. **XML Table:** This seemed the easiest way to handle data as XML is the best way to tag and handle data. XML does add additional size to the data as it has ending tags. This was the cleanest way to handle multiple authors, as the result dataframe combined these elements automatically. 


3. **JSON Table:** This is similar to XML but JSON does not have any eding tags. This has fast become a widely used way in web api's to communicate data. This did required bit more lines of code and added a .id wrapper column to data which I removed later. I also had issue with RJSON library as it did not read my files correctly.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.