For this assignment I created three files that store information about my three current favorite books in HTML, XML, and JSON formats and uploaded them to GitHub. I will load the information from these three sources into three separate data frames and then test whether they are identical.
Install required packages
##install.packages("XML")
##install.packages("RJSONIO")
##install.packages("RCurl")
##install.packages("plyr")
Load libraries
library(XML)
library(RJSONIO)
library(RCurl)
## Loading required package: bitops
library(plyr)
Load HTML, XML, and JSON data from GitHub
html.url <- getURL("https://raw.githubusercontent.com/choudhury1023/Data-607/gh-pages/books.html")
xml.url <- getURL("https://raw.githubusercontent.com/choudhury1023/Data-607/gh-pages/books.xml")
json.url <- getURL("https://raw.githubusercontent.com/choudhury1023/Data-607/gh-pages/books.json")
Data frame from HTML file
books.html <- readHTMLTable(html.url, header = TRUE, as.data.frame = TRUE, stringsAsFactors = FALSE)
books.html
## $`NULL`
##                                            Title
## 1                            The Language of SQL
## 2 R for Everyone Advanced Analytics and Graphics
## 3                      Data Science for Business
##                        Author              ISBN         Publisher
## 1               Larry Rockoff 978-1-4354-5751-5 Course Technology
## 2             Jared P. Lander 978-0-321-88803-7    Addison-Wesley
## 3 Foster Provost, Tom Fawcett 978-1-449-36132-7          O'Reilly
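The $`NULL` line at the top of the output suggests that the result is a list holding one table rather than a bare data frame. The following is a minimal sketch of that behavior, assuming a small inline HTML string (html.text, a hypothetical stand-in for books.html) so that it runs without network access:

```r
library(XML)

# Hypothetical inline table standing in for the downloaded books.html
html.text <- "<html><body><table>
<tr><th>Title</th><th>Author</th></tr>
<tr><td>The Language of SQL</td><td>Larry Rockoff</td></tr>
</table></body></html>"

# Parse the string explicitly as text, then read every table in the document
doc <- htmlParse(html.text, asText = TRUE)
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

class(tables)           # the result is a list of tables

# Pull out the first (and only) table to get an actual data frame
books.df <- tables[[1]]
class(books.df)
```

Applied to the data above, the same idea would be books.html[[1]], which yields a data frame that can be compared directly with the XML and JSON results.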
Data frame from XML file
books.parse <- xmlParse(xml.url)
books.root <- xmlRoot(books.parse)
books.xml <- xmlToDataFrame(books.root, stringsAsFactors = FALSE)
books.xml
##                                            Title
## 1                            The Language of SQL
## 2 R for Everyone Advanced Analytics and Graphics
## 3                      Data Science for Business
##                        Author              ISBN         Publisher
## 1               Larry Rockoff 978-1-4354-5751-5 Course Technology
## 2             Jared P. Lander 978-0-321-88803-7    Addison-Wesley
## 3 Foster Provost, Tom Fawcett 978-1-449-36132-7          O'Reilly
Data frame from JSON file
raw.json <- fromJSON(json.url, simplifyVector = FALSE, as.data.frame=TRUE)
unlist.json <- sapply(raw.json[[1]], unlist)
books.json <- do.call("rbind.fill", lapply(lapply(unlist.json, t), data.frame, stringsAsFactors = FALSE))
books.json
##                                            Title          Author
## 1                            The Language of SQL   Larry Rockoff
## 2 R for Everyone Advanced Analytics and Graphics Jared P. Lander
## 3                      Data Science for Business            <NA>
##                ISBN         Publisher        Author1     Author2
## 1 978-1-4354-5751-5 Course Technology           <NA>        <NA>
## 2 978-0-321-88803-7    Addison-Wesley           <NA>        <NA>
## 3 978-1-449-36132-7          O'Reilly Foster Provost Tom Fawcett
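The Author1 and Author2 columns appear because unlist() gives each element of a multi-valued field its own numbered name. One possible way to keep a single Author column is to collapse each field to one string before building the rows. This is only a sketch: the records list and the flatten_record helper are hypothetical, and it assumes the JSON stores multiple authors as an array, which the output above suggests but does not confirm.

```r
# Hypothetical parsed JSON records: each book is a list,
# and Author may hold more than one name
records <- list(
  list(Title = "The Language of SQL", Author = "Larry Rockoff"),
  list(Title = "Data Science for Business",
       Author = c("Foster Provost", "Tom Fawcett"))
)

flatten_record <- function(rec) {
  # Collapse every field to a single string so each book yields one row
  fields <- lapply(rec, function(x) paste(unlist(x), collapse = ", "))
  data.frame(fields, stringsAsFactors = FALSE)
}

# Stack the one-row data frames; all rows now share the same columns
books <- do.call(rbind, lapply(records, flatten_record))
books$Author
```

With the authors joined by ", ", the Author value for the two-author book takes the same form as in the HTML and XML data frames.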
Are they identical?
identical(books.html, books.xml)
## [1] FALSE
identical(books.html, books.json)
## [1] FALSE
identical(books.xml, books.json)
## [1] FALSE
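The results from HTML and XML print the same rows, but they are not identical: as the $`NULL` line in the HTML output shows, readHTMLTable returned a list containing the table rather than a bare data frame. The JSON data frame is quite different. The difference lies in how it handles the book with multiple authors: each author lands in a separate column (Author1 and Author2), and the Author column is left as NA for that book.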
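Because identical() only reports TRUE or FALSE, it can help to look at where two data frames differ. The sketch below uses two small hypothetical data frames, x and y, that mimic the single-column and split-column author layouts; it is an illustration of the kind of check one could run, not part of the original analysis.

```r
# Hypothetical stand-ins: x mimics the XML layout, y the JSON layout
x <- data.frame(Title = "Data Science for Business",
                Author = "Foster Provost, Tom Fawcett",
                stringsAsFactors = FALSE)
y <- data.frame(Title = "Data Science for Business",
                Author = NA_character_,
                Author1 = "Foster Provost",
                Author2 = "Tom Fawcett",
                stringsAsFactors = FALSE)

same <- identical(x, y)                  # strict, all-or-nothing comparison
extra.cols <- setdiff(names(y), names(x)) # columns that exist only in y

same
extra.cols
all.equal(x, y)  # TRUE, or a character vector describing the differences
```

Comparing column names and using all.equal() points to the extra author columns as the source of the mismatch, which a bare FALSE from identical() does not reveal.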