Big Data Analytics with R

A Statistical, Analytical and Visualisation Tool

Dhaneshwar Lal Batheja, Mohit Prem Dialani
Masters of Information Technology

Agenda

  • Introduction
  • What is R?
  • Capabilities of R
  • R Packages
  • How we used R in our project
  • R Visualisations
  • A Bit About Shiny
  • A Bit About Slidify
  • Questions

Introduction

About Ourselves



About our Project

What is R?

Definition


  • "R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS" - The R Project Organisation

  • "R is a data analysis software; R is a programming language; R is an environment for statistical analysis; R is open-source; R is a Community" - www.inside-r.org

  • "While R was initially a statistical computing language, in 2012 you could call it a complete analytical environment" - R for Business Analytics by Ajay Ohri

  • "R is extremely powerful" - Revolution Analytics (creators of Revolution R)

What is R?

The Original GUI

What is R?

The Advanced GUI

Capabilities of R

R Features


  • The R language is widely used among statisticians and data miners for developing statistical software and analysing data.

  • The capabilities of R are extended through user-created packages, which add specialised statistical techniques, graphical devices, import/export capabilities, reporting tools, etc.

Feature Type                   Feature Name
Analytics                      Basic Mathematics; Basic Statistics; Probability Distribution; Big Data Analytics; Machine Learning; Statistical Modelling
Graphics and Visualisation     Static Graphics; Dynamic Graphics
Programming Language Features  Input/Output; Object-Oriented Programming; Distributed Computing; Included R Packages

Capabilities of R

R Data Types


R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices, data frames, and lists.

Data Type    Description
Vectors      Numerical, character or logical values.
Matrices     All columns in a matrix must have the same mode (numeric, character, etc.) and the same length.
Data Frames  More general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.).
Arrays       Similar to matrices, but can have more than two dimensions.
Lists        An ordered collection of objects (components).

Capabilities of R

R Data Types - Vectors

a <- c(1, 2, 5.3, 6, -2, 4)  # numeric vector
a
## [1]  1.0  2.0  5.3  6.0 -2.0  4.0
b <- c("one", "two", "three")  # character vector
b
## [1] "one"   "two"   "three"
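
The data types table also lists logical vectors, which this slide does not show. The sketch below adds one and illustrates basic indexing. The variable names `passed` and `mixed` are made up for illustration, and `a` is redefined so the block runs on its own.

```r
# Logical vector, the third vector type
passed <- c(TRUE, FALSE, TRUE)

# Same numeric vector as on the slide, repeated so this block is self-contained
a <- c(1, 2, 5.3, 6, -2, 4)

a[3]        # third element (indexing starts at 1)
a[c(1, 4)]  # first and fourth elements
a[a > 2]    # elements greater than 2 (logical indexing)

# A vector holds a single mode, so mixed input is coerced to character
mixed <- c(1, "two", TRUE)
class(mixed)
```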

Capabilities of R

R Data Types - Matrices

y <- matrix(1:8, nrow = 2, ncol = 4)  # generates 2 x 4 numeric matrix, filled column by column
y
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
z <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # generates 2 x 3 numeric matrix, filled row by row
z
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

Capabilities of R

R Data Types - Data Frames

d <- c(1, 2, 3, 4)  # Assigning numerical vector to d
e <- c("red", "white", "red", NA)  # Assigning character vector to e
f <- c(TRUE, TRUE, TRUE, FALSE)  # Assigning logical vector to f
mydata <- data.frame(d, e, f)  # Creating Data Frame from vectors
names(mydata) <- c("ID", "Color", "Passed")  # Renaming the columns
mydata
##   ID Color Passed
## 1  1   red   TRUE
## 2  2 white   TRUE
## 3  3   red   TRUE
## 4  4  <NA>  FALSE
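
Arrays and lists appear in the data types table but have no slide of their own. A minimal sketch of both follows. The object names and values (`arr`, `w`, "Fred") are invented for illustration only.

```r
# Array: like a matrix, but with more than two dimensions (here 2 x 3 x 2)
arr <- array(1:12, dim = c(2, 3, 2))
dim(arr)
arr[2, 3, 1]  # row 2, column 3 of the first slice

# List: an ordered collection of components that may differ in type
w <- list(name = "Fred", mynumbers = c(1, 2, 5.3), passed = TRUE)
w$name            # access a component by name
w[["mynumbers"]]  # the same kind of access with double brackets
length(w)         # number of components
```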

Capabilities of R

R Functions


Almost everything in R is done through functions. Some of the built-in functions are:

Category         Functions
Numerical        abs(x); sqrt(x); trunc(x); log(x); etc.
Character        substr(x, start = n1, stop = n2); paste(..., sep = ""); toupper(x); etc.
Statistical      mean(x); median(x); dnorm(x); rnorm(n, mean = 0, sd = 1); etc.
Data Frame       fix(x); dim(x); rbind(df1, df2); data.frame(df1, df2); aggregate(df, by = list(), FUN = sum); etc.
Other Functions  setwd('path'); getwd(); install.packages("package_name"); update.packages(); ?anything; ??anything (searches the installed help pages); etc.
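
A short sketch of a few of these functions in use. The sample values and the `claims` data frame are hypothetical and are not taken from the project data.

```r
x <- c(-4, 9, 2.7)
abs(x)        # absolute values
sqrt(abs(x))  # square roots of the absolute values
trunc(x)      # drop the fractional part

s <- "big data analytics"
substr(s, start = 1, stop = 3)  # first three characters
toupper(s)                      # upper case
paste("Big", "Data", sep = " ") # join strings with a separator

mean(x)
median(x)

# aggregate(): sum a value column within each group (made-up data)
claims <- data.frame(state = c("QLD", "NSW", "QLD"), total = c(10, 5, 7))
by_state <- aggregate(claims["total"], by = list(state = claims$state), FUN = sum)
by_state
```

Note that the argument is spelled `FUN` in upper case; R argument names are case-sensitive.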

Questions

R Packages

Introduction


  • More than 5000 packages available today.

  • There is often more than one package for a given type of analytical task.

  • There is a high probability that the algorithm you are looking for is already in the repository.

  • You can build your own algorithm into a package.

  • Packages: Packages are collections of R functions, data and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages. Others are available for download and installation.
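
The library and package mechanics described above can be explored from the console. This is a rough sketch that uses only functions shipped with R. The install line is commented out because it needs network access, and googleVis is used there only as an example name.

```r
# Directories that make up the library on this machine
.libPaths()

# Names of the packages currently installed
installed <- rownames(installed.packages())
head(installed)

# Load a package that ships with R
library(stats)

# Install from CRAN only if missing (commented out: needs network access)
# if (!requireNamespace("googleVis", quietly = TRUE)) install.packages("googleVis")

# Check that a package is available without attaching it
has_stats <- requireNamespace("stats", quietly = TRUE)
has_stats
```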

R Packages

Most Common Packages

Some of the most used packages.

Package                   Package Function
sqldf                     Selecting from data frames using SQL
forecast                  Easy forecasting of time series
plyr                      Data aggregation
RMySQL, ROracle, RSQLite  Database connection packages
rattle                    A simple GUI to perform analytical tasks
ggplot2                   Data visualisation
JGR                       Java GUI for R
randomForest              Random forest predictive models
googleVis                 Visualisation of data through the Google Chart Tools API
shiny                     Building web applications
slidify                   Creating interactive HTML5 slideshows

R Packages

What We Used


  • Rattle
  • JGR
  • RMySQL and ROracle
  • googleVis
  • shiny
  • slidify
  • Dependent Packages

R in our project

Why?


We chose R over other analytical software because:

  • R is an open-source project.

  • Good integration with other programming languages.

  • Graphics and Data Visualisation.

  • New and Upcoming.

  • Frequent Package Releases and Upgrades.

R in our project

How?

R Visualisation

Introduction - googleVis


  • googleVis is a package for R and provides an interface between R and the Google Chart Tools

  • The functions of the package allow users to visualise data with the Google Chart Tools without uploading their data to Google

  • The output of googleVis functions is HTML code that contains the data and references to JavaScript functions hosted by Google

  • The package's wrapper functions in R generate HTML files with references to Google's Chart Tools API

  • Run demo(googleVis) to see examples of all charts and read the vignette for more details.

R Visualisation

Example - gvisTable - Tabular Chart


A simple Table with Data.

require(googleVis)  ##Load Library and Data Sets
## Loading required package: googleVis
## Welcome to googleVis version 0.4.3
## 
## Please read the Google API Terms of Use before you use the package:
## https://developers.google.com/terms/
## 
## Type ?googleVis to access the overall documentation and
## vignette('googleVis') for the package vignette. You can execute a demo of
## the package via: demo(googleVis)
## 
## More information is available on the googleVis project web-site:
## http://code.google.com/p/google-motion-charts-with-r/
## 
## Contact: <rvisualisation@gmail.com>
## 
## To suppress the this message use:
## suppressPackageStartupMessages(library(googleVis))
table1 <- gvisTable(Population, options = list(width = 1000, height = 250))  ## Assign gvisTable function to table1
print(table1, tag = "chart")  ## Plot table1 (only for slidify)

R Visualisation

Example - gvisColumnChart - Column Chart


A Simple Column Chart - Suncorp Case Study

print(tableRC, "chart")
aa <- read.csv("Minmaxriskperstate.csv", header = TRUE, sep = ",")

R Visualisation

Example - gvisColumnChart - Column Chart

print(RiskChart, "chart")

This chart shows the minimum and maximum risk in each state.

R Visualisation

Example - gvisScatterChart - Scatter Chart

ScatterWomen <- gvisScatterChart(women, options = list(pointSize = 4, vAxis = "{title:'weight (lbs)'}", 
    hAxis = "{title:'height (in)'}", width = 500, height = 430))

R Visualisation

Example - gvisMotionChart - Motion Chart (Time Variance)

M5 <- gvisMotionChart(Fruits, "Fruit", timevar = "Year", options = list(height = 350))

R Visualisation

Example - gvisTreeMap - Tree Map Chart (Hierarchical View)

treeRegions <- gvisTreeMap(Regions, idvar = "Region", parentvar = "Parent", 
    sizevar = "Val", colorvar = "Fac", options = list(showScale = TRUE, width = 600, 
        height = 350))

R Visualisation

Example - gvisOrgChart - Organisational Chart (Hierarchical View)

Org <- gvisOrgChart(Regions, idvar = "Region", parentvar = "Parent", tipvar = "TipVal", 
    options = list(allowCollapse = TRUE, allowHTML = TRUE))

R Visualisation

Example - gvisGeoMap - Map Chart (Markers)


R Visualisation

Example - gvisGeoChart - Map Chart (Regions)

GeoMap <- gvisGeoChart(am, locationvar = "STATE", colorvar = "TOTALCOUNT", options = list(region = "AU", 
    displayMode = "regions", resolution = "provinces", height = 300, width = 1000, 
    gvis.editor = "Edit Me!"))

This map shows the total claims found in each state.

R Visualisation

Example - gvisMerge - Merging Charts

PieC <- gvisPieChart(ac, labelvar = "STATE", numvar = "TOTAL", options = list(width = 500, 
    height = 300))
table2 <- gvisTable(ac, options = list(width = 500, height = 300))
M1 <- gvisMerge(PieC, table2, horizontal = TRUE)

A Bit About Shiny

Build Web Applications


  • Easily build your reports into a dynamic web application.

  • Customize your reports.

  • Let users choose input parameters using sliders, drop-downs etc.

  • No HTML or JavaScript necessary. Only a little bit of R knowledge is required to turn your analysis into interactive applications.
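
To give a feel for the points above, here is a minimal sketch of a Shiny application with one slider and one drop-down. It assumes the shiny package is installed. The input names (`n`, `col`), the titles and the random data are invented for illustration and are not part of our project.

```r
library(shiny)

# UI: a slider for the sample size and a drop-down for the bar colour
ui <- fluidPage(
  titlePanel("Random sample explorer"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("n", "Number of observations:", min = 10, max = 500, value = 100),
      selectInput("col", "Bar colour:", choices = c("steelblue", "tomato", "grey40"))
    ),
    mainPanel(plotOutput("hist"))
  )
)

# Server: the histogram is redrawn whenever one of the inputs changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), col = input$col,
         main = "Histogram of a random normal sample", xlab = "Value")
  })
}

# Bundle UI and server into an application object
app <- shinyApp(ui = ui, server = server)
```

In an interactive R session, `runApp(app)` starts the application in the browser.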

A Bit About Slidify

A package to create HTML5 Slides from R


  • Create Dynamic Powerful Slideshows.

  • Automatically include your dynamic charts and maps without using another tool.

  • Publish directly on the web.

  • HTML Coding

  • Still in development, but already a very powerful tool.


    This presentation was built entirely in R

Credits


Big thanks to:

Ms. Richi Nayak (Project Supervisor, Motivator)

Suncorp Board (Industry Partners)

Eric Tang (Big Data Lab Expert)

Ajay Ohri (Author of R for Business Analytics)

Ramnath Vaidyanathan (Creator of Package Slidify)

Rest of the Team: Lin Chen, Sejung (Group Members)

Questions

Thank You

Thank you for attending our presentation