RK_607_Assignment07

Working with XML and JSON in R

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical? Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web]

INDEX (Step by Step)

STEP 1. Load Libraries

STEP 2. Load the file

STEP 3. Create HTML Table

STEP 4. Create XML Table

STEP 5. Create JSON Table

STEP 6. Compare the dataframes

STEP 7. Conclusion

STEP 0 : Good Practise

##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 482134 25.8     940480 50.3   750400 40.1
## Vcells 864339  6.6    1650153 12.6  1113040  8.5

STEP 1 : Load your libraries

# Load the libraries
library(RCurl)      #For File Operations

## Loading required package: bitops

library(XML)        #For XML Operations
library(jsonlite)   #For JSon operations
library(rjson)      #For JSon Operations - This one did not work well

## 
## Attaching package: 'rjson'

## The following objects are masked from 'package:jsonlite':
## 
##     fromJSON, toJSON

library(DT)         #For Data table package
library(plyr)       #For lplyr operation

STEP 2 : Load the File

# Good Practise: Set up the Working Directory when working with a file system
setwd("C:\\CUNY\\607Data\\Assignments")

# Set the File URL
html.url <- "https://raw.githubusercontent.com/rajk11040/CUNY607/master/07assignment_favbooks.html"

xml.url <- "https://raw.githubusercontent.com/rajk11040/CUNY607/master/07assignment_favbooks.xml"

json.url <- "https://raw.githubusercontent.com/rajk11040/CUNY607/master/07assignment_favbooks.json"

STEP 3. Create HTML Table

HTML Table Create a HTML Table.

# Read the File
html.book <- getURL(html.url)
html.book <- readHTMLTable(html.book, header = TRUE)
html.book <- data.frame(html.book)
names(html.book)

## [1] "NULL.Name"      "NULL.Author"    "NULL.Publisher" "NULL.Year"

names(html.book) <- c("Book", "Author", "Publisher", "Year")

datatable(html.book)

STEP 4. Create XML Table

XML Table Create a XML Table.

# Read the File
xml.book <- getURL(xml.url)
xml.book <- xmlTreeParse(xml.book, useInternal = TRUE)
xml.book <- xmlToDataFrame(xml.book)
names(xml.book) <- c("Book", "Author", "Publisher", "Year")

datatable(xml.book)

STEP 5. Create JSON Table

JSON Table Create a JSON Table.

# Read the File
json.book <- getURL(json.url)

# If had issues with rjson as it did not readh the elements correctly into data.frame 
# I switched to jsonlite as it seemed to wokr better
json.book <- jsonlite::fromJSON(json.book)
json.book <- ldply(json.book, data.frame)

#remove the .id column
json.book <- json.book %>% dplyr::select(2:5)
names(json.book) <- c("Book", "Author", "Publisher", "Year")
datatable(json.book)

STEP 6. Compare the data frames

library(diffobj)

## Warning: package 'diffobj' was built under R version 3.4.4

diffObj(html.book, xml.book)

## < html.book                                                    
## > xml.book                                                     
## @@ 7,4 / 7,4 @@                                                
## ~                         Author                 Publisher Year
##   1                   James Alen Value Publishing Reprints 2017
##   2                 Trina Paulus         Paulus Publishing 1973
## < 3 Richard Bach, Russell Munson                  Scribner 2014
## > 3 Richard BachRussell Munson                  Scribner 2014  
##   4                 Paulo Coelho                 HarperOne 2014

diffObj(xml.book, json.book)

## < xml.book                                                     
## > json.book                                                    
## @@ 7,4 / 7,4 @@                                                
## ~                       Author                 Publisher Year  
##   1                 James Alen Value Publishing Reprints 2017  
##   2               Trina Paulus         Paulus Publishing 1973  
## < 3 Richard BachRussell Munson                  Scribner 2014  
## > 3 Richard Bach, Russell Munson                  Scribner 2014
##   4               Paulo Coelho                 HarperOne 2014

diffObj(json.book, html.book)

## < str(json.book)                                                           
## > str(html.book)                                                           
## @@ 1,9 / 1,5 @@                                                            
##   'data.frame':   4 obs. of  4 variables:                                  
## <  $ Book     : chr  "As a Man Thinketh" "Hope for the Flowers" "Jonathan L
## : ivingston Seagull: The Complete Edition" "The Alchemist"                 
## <  $ Author   :List of 4                                                   
## <   ..$ : chr "James Alen"                                                 
## <   ..$ : chr "Trina Paulus"                                               
## <   ..$ : chr  "Richard Bach" "Russell Munson"                             
## <   ..$ : chr "Paulo Coelho"                                               
## <  $ Publisher: chr  "Value Publishing Reprints" "Paulus Publishing" "Scrib
## : ner" "HarperOne"                                                         
## <  $ Year     : int  2017 1973 2014 2014                                   
## >  $ Book     : Factor w/ 4 levels "As a Man Thinketh",..: 1 2 3 4         
## >  $ Author   : Factor w/ 4 levels "James Alen","Paulo Coelho",..: 1 4 3 2 
## >  $ Publisher: Factor w/ 4 levels "HarperOne","Paulus Publishing",..: 4 2 
## : 3 1                                                                      
## >  $ Year     : Factor w/ 3 levels "1973","2014",..: 3 1 2 2

STEP 7. Conclusion

Conclusion - Differences

1. **Overall:**  We can see mostly the dataframes were similar except the XML added 2 elements together causing a difference. Also Json added .id field which was removed.

1. **HTML Table:**  readHTMLTable returns a list and it needs to be converted to data frame. Converting to dataframe added NULL to the column names, which had to be cleaned.

2. **XML Table:** This seemed the easiest way to handle data as XML is the best way to tag and handle data. XML does add additional size to the data as it has ending tags. This was the cleanest way to handle multiple authors, as the result dataframe combined these elements automatically.

3. **JSON Table:** This is similar to XML but JSON does not have any eding tags. This has fast become a widely used way in web api's to communicate data. This did required bit more lines of code and added a .id wrapper column to data which I removed later. I also had issue with RJSON library as it did not read my files correctly.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

RK_607_Assignment07

Raj Kumar

March 17th, 2018

Working with XML and JSON in R

INDEX (Step by Step)

STEP 1. Load Libraries

STEP 2. Load the file

STEP 3. Create HTML Table

STEP 4. Create XML Table

STEP 5. Create JSON Table

STEP 6. Compare the dataframes

STEP 7. Conclusion

STEP 0 : Good Practise

STEP 1 : Load your libraries

STEP 2 : Load the File

STEP 3. Create HTML Table

STEP 4. Create XML Table

STEP 5. Create JSON Table

STEP 6. Compare the data frames

STEP 7. Conclusion