Assignment
————————————————————————————————————
Library Definition
library(knitr)
library(XML)
library(RCurl)
library(jsonlite)
The following link is a good resource for this assignment:
https://www.datacamp.com/community/tutorials/r-data-import-tutorial#data
Define the GitHub URLs of the XML, JSON and HTML files
xml_url <- "https://raw.githubusercontent.com/jameskuruvilla/DATA607/master/books.xml"
json_url <- "https://raw.githubusercontent.com/jameskuruvilla/DATA607/master/books.json"
html_url <- "https://raw.githubusercontent.com/jameskuruvilla/DATA607/master/books.html"
HTML
html_file <- getURL(html_url)
html_df <- readHTMLTable(html_file, which = 1)
html_df
XML
xml_file <- getURL(xml_url)
xml_df <- xmlToDataFrame(xml_file)
xml_df
JSON
# fromJSON from the RJSONIO package behaves differently from fromJSON in jsonlite
# For the jsonlite documentation, go to https://cran.r-project.org/web/packages/jsonlite/jsonlite.pdf
json_df <- as.data.frame(fromJSON(json_url))
# Rename the columns to match the other data frames
names(json_df) <- c("ID","Title","Author","ISBN-13","Publisher","Publication_date","Pages","Related_Subject")
json_df
Compare the Data Frames Built from the JSON, HTML and XML Files
all.equal(html_df,xml_df)
## [1] TRUE
According to all.equal(), the data frames built from the HTML and XML files are equal.
all.equal(html_df,json_df)
## [1] "Component \"ID\": 'current' is not a factor"
## [2] "Component \"Title\": 'current' is not a factor"
## [3] "Component \"Author\": 'current' is not a factor"
## [4] "Component \"ISBN-13\": 'current' is not a factor"
## [5] "Component \"Publisher\": 'current' is not a factor"
## [6] "Component \"Publication_date\": 'current' is not a factor"
## [7] "Component \"Pages\": 'current' is not a factor"
## [8] "Component \"Related_Subject\": 'current' is not a factor"
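When the inputs differ, all.equal() returns a character vector describing the differences rather than FALSE, so the result is usually wrapped in isTRUE() when a single logical value is needed. A minimal sketch, using two small made-up data frames (`a` and `b` are hypothetical stand-ins, not the book data):

```r
# Hypothetical toy data: same values, but one column is a factor
# and the other is a character vector.
a <- data.frame(ID = factor(c("01", "02")))
b <- data.frame(ID = c("01", "02"), stringsAsFactors = FALSE)

diffs <- all.equal(a, b)  # character vector of differences, not FALSE
same  <- isTRUE(diffs)    # collapses the result to a single logical

print(diffs)
print(same)
```

Using `isTRUE(all.equal(x, y))` inside an `if` avoids the error that a multi-element character vector would otherwise cause.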
Let's look at the differences between the data frames created from the HTML and JSON files
str(html_df)
## 'data.frame': 3 obs. of 8 variables:
## $ ID : Factor w/ 3 levels "01","02","03": 1 2 3
## $ Title : Factor w/ 3 levels "Data Science For Business",..: 1 3 2
## $ Author : Factor w/ 3 levels "Foster Provost, Tom Fawcett",..: 1 3 2
## $ ISBN-13 : Factor w/ 3 levels "9780321888037",..: 2 1 3
## $ Publisher : Factor w/ 2 levels "O'Reilly Media",..: 1 2 1
## $ Publication_date: Factor w/ 3 levels "09/23/2013","12/06/2016",..: 3 1 2
## $ Pages : Factor w/ 3 levels "386","432","492": 1 2 3
## $ Related_Subject : Factor w/ 2 levels "Data Science",..: 1 2 2
str(json_df)
## 'data.frame': 3 obs. of 8 variables:
## $ ID : chr "01" "02" "03"
## $ Title : chr "Data Science For Business" "R for Everyone" "R for Data Science"
## $ Author : chr "Foster Provost, Tom Fawcett" "Jared P. Lander" "Hadley Wickham and Garrett Grolemund"
## $ ISBN-13 : chr "9781449361327" "9780321888037" "9781491910399"
## $ Publisher : chr "O'Reilly Media" "Pearson Education" "O'Reilly Media"
## $ Publication_date: chr "7/25/2013" "09/23/2013" "12/06/2016"
## $ Pages : chr "386" "432" "492"
## $ Related_Subject : chr "Data Science" "R Programming" "R Programming"
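The default column types differ between the two data frames: the columns read from the HTML file are factors, while the columns read from the JSON file are character vectors. I also renamed the columns of the JSON data frame so that its names match the others.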
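If only the column types differ, one option is to convert the factor columns to character before comparing. The sketch below is an illustration under stated assumptions: `html_like` and `json_like` are hypothetical two-column stand-ins for html_df and json_df, built by hand so the example runs without downloading anything.

```r
# Hypothetical stand-ins for html_df (factor columns) and
# json_df (character columns).
html_like <- data.frame(
  ID    = factor(c("01", "02", "03")),
  Pages = factor(c("386", "432", "492"))
)
json_like <- data.frame(
  ID    = c("01", "02", "03"),
  Pages = c("386", "432", "492"),
  stringsAsFactors = FALSE
)

# Convert every column to character; assigning into `[]` keeps the
# data frame structure and the original column names.
html_chr <- html_like
html_chr[] <- lapply(html_chr, as.character)

print(isTRUE(all.equal(html_chr, json_like)))
```

The same `[] <- lapply(...)` pattern could be applied to html_df or xml_df before comparing them with json_df.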
Conclusion
The data frames created from the HTML and XML files are identical. The data frame created from the JSON file has a different structure, even though its content looks the same: its columns default to type ‘chr’, whereas the columns created from both the HTML and XML files are of type ‘Factor’.
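The type mismatch can also be avoided at import time by asking for character columns directly. The sketch below assumes the stringsAsFactors argument of xmlToDataFrame() in the XML package; readHTMLTable() documents a similar argument. The inline XML string is a hypothetical stand-in for books.xml, not the real file.

```r
# Hypothetical inline XML standing in for books.xml (not the real file).
xml_text <- paste0(
  "<books>",
  "<book><ID>01</ID><Title>Data Science For Business</Title><Pages>386</Pages></book>",
  "<book><ID>02</ID><Title>R for Everyone</Title><Pages>432</Pages></book>",
  "</books>"
)

# The guard lets the snippet run even where the XML package is not installed.
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::xmlParse(xml_text, asText = TRUE)
  # Ask for character columns instead of factors at import time.
  xml_chr <- XML::xmlToDataFrame(doc, stringsAsFactors = FALSE)
  str(xml_chr)
}
```

Whether factors are the default depends on the versions of R and the XML package in use, so passing the argument explicitly makes the result easier to reproduce.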