Different Data Formats


Summary

We acquire the same data in three different formats (HTML, JSON and XML) and use an appropriate package for each to load and process it into an R data frame.


R Code :

Environment Setup

Loading Packages Used

knitr::opts_chunk$set(message = FALSE, echo = TRUE)

# Library to fetch file contents from a URL (getURL)
library(RCurl)

library(knitr)

# For loading XML, JSON and HTML data files
library(XML)
library(jsonlite)
library(htmltab)

# Library for data display in tabular format
library(DT)

Different Data Formats

The same book data is available in HTML, JSON and XML formats, and each file is loaded in turn.

The data files are hosted on GitHub; links to them are listed below.

Data Files

HTML

JSON

XML


HTML Data Acquisition

html.giturl <- "https://raw.githubusercontent.com/DataDriven-MSDA/DATA607/master/Week7A/myfavbookshtml.html"

# htmltab() from the htmltab package returns a data frame directly.

book.html <- htmltab::htmltab(html.giturl)

class(book.html)
## [1] "data.frame"
# Verifying records and variables

nrow(book.html)
## [1] 5
ncol(book.html)
## [1] 8
# Renaming Columns
colnames(book.html) <- c("Title", "Authors", "Genre", "Language", "Publisher", "Pages", 
    "Rating", "Price(INR)")

# Display data frame content
datatable(book.html, rownames = FALSE)

JSON Data Acquisition

json.giturl <- "https://raw.githubusercontent.com/DataDriven-MSDA/DATA607/master/Week7A/myfavbooksjson.json"

json.baseurlstr <- paste(readLines(json.giturl), collapse = "")


book.jsondata <- fromJSON(json.baseurlstr)

class(book.jsondata)
## [1] "list"
# converting to Data Frame.

book.json <- as.data.frame(book.jsondata)

# Verifying records and variables
nrow(book.json)
## [1] 5
ncol(book.json)
## [1] 8
# Renaming Columns
colnames(book.json) <- c("Title", "Authors", "Genre", "Language", "Publisher", "Pages", 
    "Rating", "Price(INR)")

# Display data frame content
datatable(book.json, rownames = FALSE)
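
The "list" class reported for `book.jsondata` is what `fromJSON()` returns when the top level of the JSON is an object rather than an array. The sketch below illustrates this with a small inline string. The layout and the key name `favbooks` are assumptions for illustration only, since the actual structure of the file is not shown here.

```r
library(jsonlite)

# Assumed structure: a top-level object whose single key holds an array of
# records. The key name "favbooks" is hypothetical; the real file may differ.
json.text <- '{"favbooks": [
  {"Title": "Oliver Twist", "Pages": 736},
  {"Title": "R In Action",  "Pages": 608}
]}'

parsed <- fromJSON(json.text)
class(parsed)   # a list, because the top level is a JSON object

# The array of records is simplified to a data frame inside the list
books <- parsed$favbooks
class(books)
dim(books)
```

Extracting the inner element by name is an alternative to `as.data.frame()` on the whole list, and it keeps the original field names without a prefix.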

XML Data Acquisition

xml.giturl <- "https://raw.githubusercontent.com/DataDriven-MSDA/DATA607/master/Week7A/myfavbooksxml.xml"

book.xmldata <- xmlParse(getURL(xml.giturl))  # get XML file contents

class(book.xmldata)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
xmlSize(book.xmldata)
## [1] 1
# Converting to data frame

book.xml <- xmlToDataFrame(book.xmldata)

# Verifying records and variables

nrow(book.xml)
## [1] 5
ncol(book.xml)
## [1] 8
head(book.xml)
##                         Title                                      Authors
## 1            Birbal the Witty                           Kamala Chandrakant
## 2 Tales from the Panchatantra                Anant Pai, Kamala Chandrakant
## 3                Oliver Twist                              Charles Dickens
## 4              The Road Ahead Bill Gates, Nathan Myhrvold, Peter Rinearson
## 5                 R In Action                           Robert L. Kabacoff
##                   Genre Language                   Publisher Pages Rating
## 1      Children's Books  English           Amar Chitra Katha    32      5
## 2      Children's Books  English           Amar Chitra Katha    96      5
## 3           Young Adult  English Vintage Children's Classics   736      5
## 4 Young Adult, Business  English               Penguin Books   352    4.5
## 5      Computer Science  English                     Manning   608      4
##   PriceINR
## 1       67
## 2      250
## 3      150
## 4     1250
## 5      575
# Renaming Columns

colnames(book.xml) <- c("Title", "Authors", "Genre", "Language", "Publisher", "Pages", 
    "Rating", "Price(INR)")

# Display data frame content
datatable(book.xml, rownames = FALSE)
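
The XML steps can be tried without network access on a small inline document. This is a minimal sketch, assuming a flat layout with one child element per record. The element names `books` and `book` are hypothetical, and only two of the eight fields are included.

```r
library(XML)

# Hypothetical two-record document; Title and Pages mirror columns shown above
xml.text <- paste0(
  "<books>",
  "<book><Title>Oliver Twist</Title><Pages>736</Pages></book>",
  "<book><Title>R In Action</Title><Pages>608</Pages></book>",
  "</books>"
)

doc <- xmlParse(xml.text, asText = TRUE)
books <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Values are read as text, so numeric fields are converted explicitly
books$Pages <- as.integer(books$Pages)
str(books)
```

Because the values arrive as text, numeric columns such as Pages, Rating and Price would need an explicit conversion before any arithmetic.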

All three sources yield a data frame with 5 rows and 8 columns holding the same book records, even though the source formats differ.

Data in different formats can thus be loaded into a common structure by choosing the appropriate package and processing steps for each format.
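
One way to check that the frames really agree is to compare them after coercing every column to character, since each parser may infer different column types. The sketch below uses small hand-built stand-ins (`from.html`, `from.json`, `from.xml`) instead of the real `book.*` objects, and `as.char.df` is a hypothetical helper name.

```r
# Hypothetical stand-ins for book.html, book.json and book.xml
from.html <- data.frame(Title = c("Oliver Twist", "R In Action"),
                        Pages = c("736", "608"),
                        stringsAsFactors = FALSE)
from.json <- data.frame(Title = c("Oliver Twist", "R In Action"),
                        Pages = c(736L, 608L),
                        stringsAsFactors = FALSE)
from.xml <- data.frame(Title = c("Oliver Twist", "R In Action"),
                       Pages = c("736", "608"),
                       stringsAsFactors = TRUE)

# Coerce every column to character so type differences are ignored
as.char.df <- function(df) {
  df[] <- lapply(df, as.character)
  df
}

# A strict comparison fails here because Pages is character in one frame
# and integer in the other
strict.match <- identical(from.html, from.json)

# After coercion the contents can be compared value by value
same.html.json <- identical(as.char.df(from.html), as.char.df(from.json))
same.html.xml  <- identical(as.char.df(from.html), as.char.df(from.xml))
```

Applied to the real objects, the same helper could be used as `identical(as.char.df(book.html), as.char.df(book.json))` once the columns have been renamed.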