Working with XML and JSON in R

CUNY MSDS - DATA607 - Homework 7

James Kuruvilla

October 13, 2017

Assignment

————————————————————————————————————
————————————————————————————————————

Library Definition

library(knitr)
library(XML)
library(RCurl)
library(jsonlite)

The following link is a good resource for this assignment:

https://www.datacamp.com/community/tutorials/r-data-import-tutorial#data

Read URLs for XML, JSON, and HTML from GitHub

xml_url <- "https://raw.githubusercontent.com/jameskuruvilla/DATA607/master/books.xml"
json_url <- "https://raw.githubusercontent.com/jameskuruvilla/DATA607/master/books.json"
html_url <- "https://raw.githubusercontent.com/jameskuruvilla/DATA607/master/books.html"

HTML

html_file <- getURL(html_url)

html_df <- readHTMLTable(html_file, which = 1)

html_df

XML

xml_file <- getURL(xml_url)

xml_df <- xmlToDataFrame(xml_file)

xml_df
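
As a small offline illustration of what xmlToDataFrame does with one record per child element, the sketch below parses a made-up XML string instead of the real books.xml. The element names and values are assumptions for illustration only, and stringsAsFactors = FALSE is passed explicitly so the column types do not depend on the package default.

```r
library(XML)

# Hypothetical XML text, not the contents of books.xml
xml_text <- '<books>
  <book><ID>01</ID><Title>Book A</Title><Pages>386</Pages></book>
  <book><ID>02</ID><Title>Book B</Title><Pages>432</Pages></book>
</books>'

# Parse the string into a document, then flatten each <book> into one row
doc <- xmlParse(xml_text, asText = TRUE)
books_from_xml <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

str(books_from_xml)
```

Every column comes back as text here, so numeric fields such as Pages would still need an explicit conversion before doing arithmetic on them.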

JSON

# fromJSON from the RJSONIO package behaves differently from the one in jsonlite
# For the jsonlite documentation, see https://cran.r-project.org/web/packages/jsonlite/jsonlite.pdf

json_df <- as.data.frame(fromJSON(json_url))

# Rename the columns to match the other data frames
names(json_df) <- c("ID", "Title", "Author", "ISBN-13", "Publisher", "Publication_date", "Pages", "Related_Subject")

json_df
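
To see how jsonlite turns a JSON array of objects into a data frame without touching the network, here is a minimal sketch. The JSON text is invented for illustration and may not match the layout of the real books.json.

```r
library(jsonlite)

# Hypothetical JSON text: an array with one object per book
json_text <- '[
  {"ID": "01", "Title": "Book A", "Pages": "386"},
  {"ID": "02", "Title": "Book B", "Pages": "432"}
]'

# fromJSON simplifies an array of same-shaped objects into a data frame
books_from_json <- fromJSON(json_text)

str(books_from_json)
```

Because the values are quoted strings in the JSON text, jsonlite keeps them as character columns rather than converting them to factors.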

Compare Data Frames made out of JSON, HTML and XML Files

all.equal(html_df, xml_df)
## [1] TRUE

The data frames built from the HTML and XML files are identical.

all.equal(html_df, json_df)
## [1] "Component \"ID\": 'current' is not a factor"              
## [2] "Component \"Title\": 'current' is not a factor"           
## [3] "Component \"Author\": 'current' is not a factor"          
## [4] "Component \"ISBN-13\": 'current' is not a factor"         
## [5] "Component \"Publisher\": 'current' is not a factor"       
## [6] "Component \"Publication_date\": 'current' is not a factor"
## [7] "Component \"Pages\": 'current' is not a factor"           
## [8] "Component \"Related_Subject\": 'current' is not a factor"

Let's look at the difference between the data frames created from the HTML and JSON files.

str(html_df)
## 'data.frame':    3 obs. of  8 variables:
##  $ ID              : Factor w/ 3 levels "01","02","03": 1 2 3
##  $ Title           : Factor w/ 3 levels "Data Science For Business",..: 1 3 2
##  $ Author          : Factor w/ 3 levels "Foster Provost, Tom Fawcett",..: 1 3 2
##  $ ISBN-13         : Factor w/ 3 levels "9780321888037",..: 2 1 3
##  $ Publisher       : Factor w/ 2 levels "O'Reilly Media",..: 1 2 1
##  $ Publication_date: Factor w/ 3 levels "09/23/2013","12/06/2016",..: 3 1 2
##  $ Pages           : Factor w/ 3 levels "386","432","492": 1 2 3
##  $ Related_Subject : Factor w/ 2 levels "Data Science",..: 1 2 2
str(json_df)
## 'data.frame':    3 obs. of  8 variables:
##  $ ID              : chr  "01" "02" "03"
##  $ Title           : chr  "Data Science For Business" "R for Everyone" "R for Data Science"
##  $ Author          : chr  "Foster Provost, Tom Fawcett" "Jared P. Lander" "Hadley Wickham and Garrett Grolemund"
##  $ ISBN-13         : chr  "9781449361327" "9780321888037" "9781491910399"
##  $ Publisher       : chr  "O'Reilly Media" "Pearson Education" "O'Reilly Media"
##  $ Publication_date: chr  "7/25/2013" "09/23/2013" "12/06/2016"
##  $ Pages           : chr  "386" "432" "492"
##  $ Related_Subject : chr  "Data Science" "R Programming" "R Programming"

The default column types differ between the two: every column of the HTML data frame is a factor, while every column of the JSON data frame is character. Also, the column names match only because I renamed the columns of the JSON data frame.
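
One way to check whether the factor/character difference is the only mismatch is to convert the factor columns to character and compare again. The sketch below uses two tiny made-up data frames (factor_df and char_df are hypothetical stand-ins, not html_df and json_df), so it only illustrates the idea.

```r
# Hypothetical miniature data frames with the same content but different column types
factor_df <- data.frame(ID = c("01", "02"), Pages = c("386", "432"),
                        stringsAsFactors = TRUE)
char_df <- data.frame(ID = c("01", "02"), Pages = c("386", "432"),
                      stringsAsFactors = FALSE)

# all.equal() returns a character vector describing differences when the
# objects differ, so wrap it in isTRUE() to get a plain logical
equal_before <- isTRUE(all.equal(factor_df, char_df))  # FALSE

# Convert every factor column to character, leaving other columns untouched
factor_df[] <- lapply(factor_df, function(col) {
  if (is.factor(col)) as.character(col) else col
})

equal_after <- isTRUE(all.equal(factor_df, char_df))   # TRUE

print(c(before = equal_before, after = equal_after))
```

If the same conversion were applied to html_df, any differences that all.equal still reported against json_df would come from the content rather than from the column types.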

Conclusion

The data frames created from the HTML and XML files are identical. The data frame created from the JSON file has a different structure, even though its content looks the same. The default data type of the columns in the JSON case is ‘chr’, whereas the columns in both the HTML and XML cases are of type ‘Factor’.