For this assignment I created three files that store information about my three current favorite books in HTML, XML, and JSON formats and uploaded them to GitHub. I then load the information from these three sources into three separate data frames and, at the end, test whether they are identical.

Install required packages

##install.packages("XML")
##install.packages("RJSONIO")
##install.packages("RCurl")
##install.packages("plyr")

Load libraries

library(XML)
library(RJSONIO)
library(RCurl)
## Loading required package: bitops
library(plyr)

Load HTML, XML, and JSON data from GitHub

html.url <- getURL("https://raw.githubusercontent.com/choudhury1023/Data-607/gh-pages/books.html")
xml.url <- getURL("https://raw.githubusercontent.com/choudhury1023/Data-607/gh-pages/books.xml")
json.url <- getURL("https://raw.githubusercontent.com/choudhury1023/Data-607/gh-pages/books.json")

Data frame from HTML file

books.html <- readHTMLTable(html.url, header = TRUE, as.data.frame = TRUE, stringsAsFactors = FALSE)
books.html
## $`NULL`
##                                            Title
## 1                            The Language of SQL
## 2 R for Everyone Advanced Analytics and Graphics
## 3                      Data Science for Business
##                        Author              ISBN         Publisher
## 1               Larry Rockoff 978-1-4354-5751-5 Course Technology
## 2             Jared P. Lander 978-0-321-88803-7    Addison-Wesley
## 3 Foster Provost, Tom Fawcett 978-1-449-36132-7          O'Reilly
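
The `$`NULL`` line at the top of the output shows that readHTMLTable returns a list of tables rather than a single data frame. The sketch below shows one way to pull the data frame out of that list. It uses a small inline HTML string, made up for illustration, in place of the GitHub file; the `sample.*` names are hypothetical and not part of the assignment code.

```r
library(XML)

# Hypothetical inline HTML standing in for books.html
sample.html <- paste0(
  "<html><body><table>",
  "<tr><th>Title</th><th>Author</th></tr>",
  "<tr><td>The Language of SQL</td><td>Larry Rockoff</td></tr>",
  "</table></body></html>"
)

# Parse the string explicitly as text, then read every table in the document
sample.doc <- htmlParse(sample.html, asText = TRUE)
sample.tables <- readHTMLTable(sample.doc, header = TRUE, stringsAsFactors = FALSE)

# The result is a list; its first element is the data frame itself
sample.df <- sample.tables[[1]]
class(sample.tables)
class(sample.df)
```

Applied to the real data, `books.html[[1]]` would be the object to compare against the other data frames.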

Data frame from XML file

books.parse <- xmlParse(xml.url)
books.root <- xmlRoot(books.parse)
books.xml <- xmlToDataFrame(books.root, stringsAsFactors = FALSE)
books.xml
##                                             Title
## 1                             The Language of SQL
## 2  R for Everyone Advanced Analytics and Graphics
## 3                       Data Science for Business
##                        Author              ISBN         Publisher
## 1               Larry Rockoff 978-1-4354-5751-5 Course Technology
## 2             Jared P. Lander 978-0-321-88803-7    Addison-Wesley
## 3 Foster Provost, Tom Fawcett 978-1-449-36132-7          O'Reilly
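
To illustrate how xmlToDataFrame maps XML onto rows and columns, here is a minimal sketch on an inline document. The XML layout is an assumption made for this example (one `<book>` element per row, one child element per column); the actual books.xml may be structured differently.

```r
library(XML)

# Hypothetical XML layout; the real books.xml may differ
sample.xml <- paste0(
  "<books>",
  "<book><Title>The Language of SQL</Title>",
  "<Author>Larry Rockoff</Author></book>",
  "<book><Title>Data Science for Business</Title>",
  "<Author>Foster Provost, Tom Fawcett</Author></book>",
  "</books>"
)

# Each child of the root becomes a row; each grandchild becomes a column
sample.doc <- xmlParse(sample.xml, asText = TRUE)
sample.books <- xmlToDataFrame(sample.doc, stringsAsFactors = FALSE)
sample.books
```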

Data frame from JSON file

raw.json <- fromJSON(json.url, simplifyVector = FALSE, as.data.frame=TRUE)
unlist.json <- sapply(raw.json[[1]], unlist)
books.json <- do.call("rbind.fill", lapply(lapply(unlist.json, t), data.frame, stringsAsFactors = FALSE))
books.json
##                                            Title          Author
## 1                            The Language of SQL   Larry Rockoff
## 2 R for Everyone Advanced Analytics and Graphics Jared P. Lander
## 3                      Data Science for Business            <NA>
##                ISBN         Publisher        Author1     Author2
## 1 978-1-4354-5751-5 Course Technology           <NA>        <NA>
## 2 978-0-321-88803-7    Addison-Wesley           <NA>        <NA>
## 3 978-1-449-36132-7          O'Reilly Foster Provost Tom Fawcett
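
The extra Author1 and Author2 columns appear because the third book's authors are unlisted into separate fields. One possible way to avoid this, sketched below in base R, is to collapse every multi-valued field into a single comma-separated string before building the data frame. The `records` list is a hand-written stand-in for parsed JSON, and `flatten.record` is a hypothetical helper, not part of the original code.

```r
# Stand-in for parsed JSON: a list of records where Author may hold several values
records <- list(
  list(Title = "The Language of SQL", Author = "Larry Rockoff"),
  list(Title = "Data Science for Business",
       Author = c("Foster Provost", "Tom Fawcett"))
)

# Hypothetical helper: collapse each field to one string, then build a one-row data frame
flatten.record <- function(rec) {
  collapsed <- lapply(rec, function(field) paste(unlist(field), collapse = ", "))
  data.frame(collapsed, stringsAsFactors = FALSE)
}

# Stack the one-row data frames; every record now has the same columns
books.flat <- do.call(rbind, lapply(records, flatten.record))
books.flat
```

With this approach the multi-author book ends up with "Foster Provost, Tom Fawcett" in a single Author column, matching the form shown in the HTML and XML output.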

Are they identical?

identical(books.html, books.xml)
## [1] FALSE
identical(books.html, books.json)
## [1] FALSE
identical(books.xml, books.json)
## [1] FALSE

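None of the three results are identical. The HTML and XML results display the same values, but readHTMLTable returns a list that contains the data frame (note the $`NULL` element in its output), so the comparison with the XML data frame fails. The JSON data frame is quite different. The difference is apparent in how it handles the book with multiple authors: the two authors are placed in separate Author1 and Author2 columns, which leaves NA in the Author column for that book and NA in Author1 and Author2 for the other two.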
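
Since identical only reports TRUE or FALSE, all.equal can help locate where two data frames differ. The sketch below uses two small hand-made data frames; the trailing space in `b` is a hypothetical difference introduced for illustration, not something confirmed in the book files.

```r
a <- data.frame(Title = c("The Language of SQL", "Data Science for Business"),
                stringsAsFactors = FALSE)

# Hypothetical difference: a trailing space in the first title
b <- data.frame(Title = c("The Language of SQL ", "Data Science for Business"),
                stringsAsFactors = FALSE)

identical(a, b)                  # strict comparison of values and attributes
differences <- all.equal(a, b)   # character description of what differs
differences

# Trim whitespace in every column, keeping the data frame structure intact
b.clean <- b
b.clean[] <- lapply(b, trimws)
identical(a, b.clean)
```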