The dataset consists of companies, some of which have manipulated their books.
We will try to distinguish a manipulator from a non-manipulator based on the financial ratios reported by the companies.
There are a total of 1239 observations in the dataset.
Out of these 1239 observations, 1200 are non-manipulators and 39 are manipulators.
Libraries Required
library(caret)
library(ROCR)
library(ggplot2)
library(readxl)
library(Hmisc)
library(AppliedPredictiveModeling)
library(corrplot)
library(DMwR)
library(ROSE)
library(reprtree)
library(randomForest)
library(rattle)
library(inTrees) #THE GOD
Read Data Set
fraud <- read_excel("C:/Users/aaum/Desktop/Projects/Fraud Analytics/Fraud Data.xlsx",
sheet = "DATA")
fraud <-as.data.frame(fraud[,2:10]) # Remove Company ID
fraud$Manipulator <- factor(fraud$Manipulator, levels = c(0,1), labels = c("No","Yes"))
str(fraud)
## 'data.frame': 1239 obs. of 9 variables:
## $ DSRI : num 1.62 1 1 1.49 1 ...
## $ GMI : num 1.13 1.61 1.02 1 1.37 ...
## $ AQI : num 7.185 1.005 1.241 0.466 0.637 ...
## $ SGI : num 0.366 13.081 1.475 0.673 0.861 ...
## $ DEPI : num 1.38 0.4 1.17 2 1.45 ...
## $ SGAI : num 1.6241 5.1982 0.6477 0.0929 1.7415 ...
## $ ACCR : num -0.1668 0.0605 0.0367 0.2734 0.123 ...
## $ LEVI : num 1.161 0.987 1.264 0.681 0.939 ...
## $ Manipulator: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
Box Plots
transparentTheme(trans = .4)
featurePlot(x = fraud[, 1:8],
y = fraud$Manipulator,
plot = "box",
## Pass in options to bwplot()
scales = list(y = list(relation="free"),
x = list(rot = 90)),
layout = c(2,1),
auto.key = list(columns = 2))
Histograms
par(mfrow=c(1,2))
for(i in 1:8) {
hist(fraud[,i], main=names(fraud)[i])
}
Identifying Variables which have Zero Variance or Near Zero Variance
zeroNrzerovar <- nearZeroVar(fraud[,1:8], saveMetrics= TRUE)
zeroNrzerovar
## freqRatio percentUnique zeroVar nzv
## DSRI 16 98.78935 FALSE FALSE
## GMI 92 92.65537 FALSE FALSE
## AQI 2 99.91929 FALSE FALSE
## SGI 1 100.00000 FALSE FALSE
## DEPI 13 99.03148 FALSE FALSE
## SGAI 6 97.90153 FALSE FALSE
## ACCR 1 100.00000 FALSE FALSE
## LEVI 2 99.91929 FALSE FALSE
Identifying Collinearity and Multicollinearity
## Correlated Variables
corVariables <- cor(fraud[,1:8])
corrplot(corVariables)
## Multicollinearity
highlyCorVariables <- findCorrelation(corVariables, cutoff = .75)
highlyCorVariables
## integer(0)
Findings
* None of the variables have zero or near zero variance
* None of the variables are highly correlated or show linear dependencies
Create Data Partition
set.seed(1408)
splitIndex <- createDataPartition(fraud$Manipulator, p = .6,list = FALSE, times = 1)
trainData <- fraud[splitIndex,]
testData <- fraud[-splitIndex,]
Check for Proportion of Fraudulent Companies in Train and Test Data
describe(trainData$Manipulator)
## trainData$Manipulator
## n missing distinct
## 744 0 2
##
## Value No Yes
## Frequency 720 24
## Proportion 0.968 0.032
describe(testData$Manipulator)
## testData$Manipulator
## n missing distinct
## 495 0 2
##
## Value No Yes
## Frequency 480 15
## Proportion 0.97 0.03
The proportion of fraudulent companies is maintained across the train and test data.
Create Balanced Data using different Sampling Methods
ML algorithms struggle here because of the unequal class distribution in the dependent variable: classifier performance gets biased towards the majority class. Hence we first need to create balanced datasets.
set.seed(1408)
over_train <- ovun.sample(Manipulator ~ ., data = trainData, method = "over")$data
describe(over_train$Manipulator)
## over_train$Manipulator
## n missing distinct
## 1443 0 2
##
## Value No Yes
## Frequency 720 723
## Proportion 0.499 0.501
We have upsampled the number of positive observations from 24 to 723. Let's expect overfitting!
set.seed(1408)
under_train <- ovun.sample(Manipulator ~ ., data = trainData, method = "under")$data
describe(under_train$Manipulator)
## under_train$Manipulator
## n missing distinct
## 50 0 2
##
## Value No Yes
## Frequency 26 24
## Proportion 0.52 0.48
set.seed(1408)
smote_train <- SMOTE(Manipulator ~.,trainData,perc.over = 1000,perc.under = 100)
describe(smote_train$Manipulator)
## smote_train$Manipulator
## n missing distinct
## 504 0 2
##
## Value No Yes
## Frequency 240 264
## Proportion 0.476 0.524
Under each sampling method, we will try different machine learning algorithms and come up with the best model.
Build Models and Assess Them Using the Evaluation Metric “ROC”
seed <- 1408 # Keeping seed constant for all models
metric <- "ROC" # Evaluation metric for Model selection
**Setting control parameters and tuning grid**
rfControl <- trainControl(method = "boot", # Bootstrap sampling with 25 repetitions (caret default)
returnResamp='final',
summaryFunction = twoClassSummary,
savePredictions = TRUE,
classProbs = TRUE)
rfGrid <- expand.grid(.mtry=c(1:7))
# mtry : number of variables randomly sampled as candidates at each split
# The default number of trees is 500 in caret's rf method.
# Hence we wouldn't be able to find the optimal number of trees using a direct grid search.
# However (if required) we can extend caret with a custom random forest method that supports a grid search over the number of trees, as sketched below.
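A rough sketch of that extension (not run here). It assumes caret's standard custom-model list structure; the object name customRF and the candidate ntree values are illustrative.
customRF <- getModelInfo("rf", regex = FALSE)[[1]]   # start from caret's built-in rf method
customRF$parameters <- data.frame(parameter = c("mtry", "ntree"),
                                  class = rep("numeric", 2),
                                  label = c("mtry", "ntree"))
customRF$grid <- function(x, y, len = NULL, search = "grid")
  expand.grid(mtry = 1:7, ntree = c(250, 500, 1000))   # ntree now part of the tuning grid
customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...)
  randomForest(x, y, mtry = param$mtry, ntree = param$ntree, ...)
# It could then be passed to train() as: train(Manipulator ~ ., trainData, method = customRF, ...)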
**Models created using the oversampling and undersampling methods were overfitting.**
**Hence we will restrict the analysis to the SMOTEd train data. We will also work with the original train data, just for comparison.**
## Random Forest Model from Original Train Data
set.seed(seed)
rfOrigFit <- train(Manipulator ~.,trainData,
method='rf',
trControl= rfControl,
metric = metric,
tuneGrid = rfGrid,
prox=TRUE,allowParallel=TRUE)
print(rfOrigFit)
## Random Forest
##
## 744 samples
## 8 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 744, 744, 744, 744, 744, 744, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 1 0.8931678 0.9996916 0.01763636
## 2 0.8656787 0.9975628 0.08300289
## 3 0.8409601 0.9945242 0.08700289
## 4 0.8380801 0.9919491 0.08933622
## 5 0.8091521 0.9917846 0.10979798
## 6 0.8190316 0.9905881 0.13510390
## 7 0.8065067 0.9901140 0.12899278
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 1.
## Random Forest Model from Oversampled Train Data
#set.seed(seed)
#rfUpFit <- train(Manipulator~.,over_train,
# method='rf',
# trControl=rfControl,
# metric = metric,
# tuneGrid = rfGrid,
# prox=TRUE,allowParallel=TRUE)
#print(rfUpFit)
## Random Forest Model from Undersampled Train Data
#set.seed(seed)
#rfdownFit <- train(Manipulator~.,under_train,
# method='rf',
# trControl=rfControl,
# metric = metric,
# tuneGrid = rfGrid,
# prox=TRUE,allowParallel=TRUE)
#print(rfdownFit)
## Random Forest Model from Synthetic Train Data (SMOTE)
set.seed(seed)
rfSMOTEFit <- train(Manipulator~.,smote_train,
method='rf',
trControl=rfControl,
metric = "ROC",
tuneGrid = rfGrid,
prox=TRUE,allowParallel=TRUE)
print(rfSMOTEFit)
## Random Forest
##
## 504 samples
## 8 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 504, 504, 504, 504, 504, 504, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 1 0.9895389 0.9164035 0.9795607
## 2 0.9879765 0.9117107 0.9766618
## 3 0.9864202 0.9112393 0.9728689
## 4 0.9844702 0.9094768 0.9712027
## 5 0.9831152 0.9080661 0.9673813
## 6 0.9812527 0.9040850 0.9658996
## 7 0.9797642 0.9007457 0.9609499
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 1.
**Comparison of Random Forest Models**
rfModels <- list(Original_Model = rfOrigFit,
#Undersampled_Train = rfdownFit,
#Oversampled_Train = rfUpFit,
SMOTEd_Train = rfSMOTEFit)
rfResamples <- resamples(rfModels)
summary(rfResamples)
##
## Call:
## summary.resamples(object = rfResamples)
##
## Models: Original_Model, SMOTEd_Train
## Number of resamples: 25
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Original_Model 0.7821 0.8818 0.9101 0.8932 0.9278 0.9498 0
## SMOTEd_Train 0.9783 0.9855 0.9900 0.9895 0.9926 0.9990 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Original_Model 0.9961 1.0000 1.0000 0.9997 1.000 1.0000 0
## SMOTEd_Train 0.8721 0.8889 0.9176 0.9164 0.931 0.9759 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Original_Model 0.0000 0.0000 0.0000 0.01764 0.0000 0.125 0
## SMOTEd_Train 0.9167 0.9714 0.9798 0.97960 0.9907 1.000 0
scales <- list(x=list(relation="free"), y=list(relation="free"))
dotplot(rfResamples, scales=scales)
As expected, the model built on the original train data suffers on one of the evaluation parameters (specificity).
SMOTE has given us much better results.
Let's see how the SMOTEd model performs on the test data!
rfSMOTEPred <- predict(rfSMOTEFit, testData, type = "raw")
confusionMatrix(rfSMOTEPred, testData$Manipulator)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 430 4
## Yes 50 11
##
## Accuracy : 0.8909
## 95% CI : (0.8601, 0.917)
## No Information Rate : 0.9697
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2531
## Mcnemar's Test P-Value : 9.141e-10
##
## Sensitivity : 0.8958
## Specificity : 0.7333
## Pos Pred Value : 0.9908
## Neg Pred Value : 0.1803
## Prevalence : 0.9697
## Detection Rate : 0.8687
## Detection Prevalence : 0.8768
## Balanced Accuracy : 0.8146
##
## 'Positive' Class : No
##
Not bad at all! The model performs quite well.
Next, let's plot an ROC curve for good measure.
rfSMOTEPred <- predict(rfSMOTEFit, testData, type = "prob")[,2]
# Calculating the values for ROC curve
rfSMOTEprediction <- prediction(rfSMOTEPred,testData$Manipulator)
rfSMOTEperf <- performance(rfSMOTEprediction, "tpr","fpr")
# Plotting the ROC curve
plot(rfSMOTEperf,main="ROC Curve for SMOTEd Random Forest",col=2,lwd=2)
abline(a=0,b=1,lwd=2,lty=3,col="black")
# calculating AUC
rfSMOTEfit.auc <- performance(rfSMOTEprediction, "auc")
# Converting S4 class to vector
rfSMOTEfit.auc <- unlist(slot(rfSMOTEfit.auc, "y.values"))
rfSMOTEfit.auc
## [1] 0.8950694
The Best Model is
print(rfSMOTEFit)
## Random Forest
##
## 504 samples
## 8 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 504, 504, 504, 504, 504, 504, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 1 0.9895389 0.9164035 0.9795607
## 2 0.9879765 0.9117107 0.9766618
## 3 0.9864202 0.9112393 0.9728689
## 4 0.9844702 0.9094768 0.9712027
## 5 0.9831152 0.9080661 0.9673813
## 6 0.9812527 0.9040850 0.9658996
## 7 0.9797642 0.9007457 0.9609499
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 1.
Let's also plot the variable importance.
rfSMOTEvarImp <- varImp(rfSMOTEFit)
plot(rfSMOTEvarImp)
So what does inTrees do?
It helps us interpret a tree ensemble (Random Forest, GBM, etc.) by extracting the decision rules from all of its trees, measuring and pruning them, and selecting a compact, non-redundant set. Brilliant!
How does inTrees work?
Let's get the juice out of our model using inTrees.
Extract a Single Tree From Random Forest
Tree_Number <- 5 # Enter Tree Number
rf_singletree <- getTree(rfSMOTEFit$finalModel,k=Tree_Number, labelVar = TRUE)
print(rf_singletree)
## left daughter right daughter split var split point status prediction
## 1 2 3 DEPI 0.77252572 1 <NA>
## 2 4 5 ACCR -0.03877641 1 <NA>
## 3 6 7 SGAI 0.79843644 1 <NA>
## 4 8 9 GMI 0.79615469 1 <NA>
## 5 10 11 GMI 0.81415599 1 <NA>
## 6 12 13 ACCR -0.02063350 1 <NA>
## 7 14 15 AQI 3.19034748 1 <NA>
## 8 0 0 <NA> 0.00000000 -1 Yes
## 9 16 17 DEPI 0.71593420 1 <NA>
## 10 18 19 SGI 1.74930483 1 <NA>
## 11 20 21 AQI 0.74707600 1 <NA>
## 12 22 23 DEPI 0.98563919 1 <NA>
## 13 24 25 DSRI 1.60316630 1 <NA>
## 14 26 27 DEPI 1.09285066 1 <NA>
## 15 28 29 LEVI 1.03457363 1 <NA>
## 16 0 0 <NA> 0.00000000 -1 No
## 17 30 31 SGI 1.60852019 1 <NA>
## 18 32 33 LEVI 1.00702704 1 <NA>
## 19 0 0 <NA> 0.00000000 -1 Yes
## 20 0 0 <NA> 0.00000000 -1 Yes
## 21 34 35 SGI 2.39803609 1 <NA>
## 22 0 0 <NA> 0.00000000 -1 No
## 23 36 37 AQI 1.10330111 1 <NA>
## 24 38 39 ACCR 0.01055985 1 <NA>
## 25 40 41 LEVI 0.77810226 1 <NA>
## 26 42 43 GMI 0.88214988 1 <NA>
## 27 44 45 GMI 1.20678268 1 <NA>
## 28 46 47 DSRI 0.96292337 1 <NA>
## 29 0 0 <NA> 0.00000000 -1 Yes
## 30 0 0 <NA> 0.00000000 -1 Yes
## 31 0 0 <NA> 0.00000000 -1 No
## 32 0 0 <NA> 0.00000000 -1 Yes
## 33 0 0 <NA> 0.00000000 -1 No
## 34 0 0 <NA> 0.00000000 -1 No
## 35 0 0 <NA> 0.00000000 -1 Yes
## 36 48 49 SGAI 0.27315450 1 <NA>
## 37 0 0 <NA> 0.00000000 -1 No
## 38 50 51 DEPI 0.99006982 1 <NA>
## 39 52 53 DSRI 0.68987732 1 <NA>
## 40 0 0 <NA> 0.00000000 -1 Yes
## 41 0 0 <NA> 0.00000000 -1 No
## 42 54 55 LEVI 0.72345263 1 <NA>
## 43 56 57 AQI 1.16222724 1 <NA>
## 44 58 59 GMI 1.08397011 1 <NA>
## 45 60 61 GMI 1.51017743 1 <NA>
## 46 0 0 <NA> 0.00000000 -1 No
## 47 0 0 <NA> 0.00000000 -1 Yes
## 48 0 0 <NA> 0.00000000 -1 No
## 49 0 0 <NA> 0.00000000 -1 Yes
## 50 0 0 <NA> 0.00000000 -1 Yes
## 51 62 63 SGI 1.06747106 1 <NA>
## 52 64 65 DSRI 0.58778369 1 <NA>
## 53 66 67 DSRI 1.06169573 1 <NA>
## 54 0 0 <NA> 0.00000000 -1 Yes
## 55 68 69 LEVI 1.00697650 1 <NA>
## 56 70 71 DSRI 1.31772676 1 <NA>
## 57 72 73 AQI 1.45309813 1 <NA>
## 58 0 0 <NA> 0.00000000 -1 No
## 59 74 75 SGI 1.18234918 1 <NA>
## 60 76 77 GMI 1.32030003 1 <NA>
## 61 0 0 <NA> 0.00000000 -1 No
## 62 78 79 DSRI 0.99169944 1 <NA>
## 63 80 81 DSRI 0.56854110 1 <NA>
## 64 0 0 <NA> 0.00000000 -1 Yes
## 65 0 0 <NA> 0.00000000 -1 No
## 66 0 0 <NA> 0.00000000 -1 Yes
## 67 82 83 DSRI 1.07178969 1 <NA>
## 68 0 0 <NA> 0.00000000 -1 No
## 69 84 85 DEPI 0.96943950 1 <NA>
## 70 86 87 DSRI 0.87661870 1 <NA>
## 71 88 89 AQI -0.33606078 1 <NA>
## 72 90 91 LEVI 1.05642704 1 <NA>
## 73 92 93 SGI 1.28054787 1 <NA>
## 74 94 95 DSRI 1.52103848 1 <NA>
## 75 0 0 <NA> 0.00000000 -1 No
## 76 96 97 LEVI 0.87832471 1 <NA>
## 77 98 99 LEVI 1.66703384 1 <NA>
## 78 0 0 <NA> 0.00000000 -1 No
## 79 100 101 AQI 1.16069738 1 <NA>
## 80 0 0 <NA> 0.00000000 -1 Yes
## 81 0 0 <NA> 0.00000000 -1 No
## 82 0 0 <NA> 0.00000000 -1 No
## 83 0 0 <NA> 0.00000000 -1 Yes
## 84 0 0 <NA> 0.00000000 -1 Yes
## 85 102 103 DEPI 0.99794376 1 <NA>
## 86 104 105 DEPI 0.88864699 1 <NA>
## 87 106 107 AQI 0.90879805 1 <NA>
## 88 0 0 <NA> 0.00000000 -1 Yes
## 89 108 109 SGAI 0.85982065 1 <NA>
## 90 0 0 <NA> 0.00000000 -1 No
## 91 110 111 SGI 1.10826917 1 <NA>
## 92 0 0 <NA> 0.00000000 -1 No
## 93 0 0 <NA> 0.00000000 -1 Yes
## 94 0 0 <NA> 0.00000000 -1 Yes
## 95 0 0 <NA> 0.00000000 -1 No
## 96 0 0 <NA> 0.00000000 -1 Yes
## 97 112 113 GMI 1.29349896 1 <NA>
## 98 0 0 <NA> 0.00000000 -1 Yes
## 99 0 0 <NA> 0.00000000 -1 No
## 100 0 0 <NA> 0.00000000 -1 Yes
## 101 114 115 DEPI 1.17255006 1 <NA>
## 102 0 0 <NA> 0.00000000 -1 No
## 103 116 117 DSRI 1.14240373 1 <NA>
## 104 118 119 ACCR 0.09549754 1 <NA>
## 105 120 121 ACCR 0.09115667 1 <NA>
## 106 122 123 DEPI 0.98941302 1 <NA>
## 107 124 125 DEPI 1.03185302 1 <NA>
## 108 126 127 DSRI 1.66446562 1 <NA>
## 109 128 129 LEVI 1.27484562 1 <NA>
## 110 0 0 <NA> 0.00000000 -1 No
## 111 0 0 <NA> 0.00000000 -1 Yes
## 112 0 0 <NA> 0.00000000 -1 Yes
## 113 0 0 <NA> 0.00000000 -1 No
## 114 0 0 <NA> 0.00000000 -1 No
## 115 0 0 <NA> 0.00000000 -1 Yes
## 116 130 131 LEVI 1.08149188 1 <NA>
## 117 0 0 <NA> 0.00000000 -1 Yes
## 118 132 133 ACCR -0.02134921 1 <NA>
## 119 0 0 <NA> 0.00000000 -1 No
## 120 0 0 <NA> 0.00000000 -1 No
## 121 0 0 <NA> 0.00000000 -1 Yes
## 122 0 0 <NA> 0.00000000 -1 No
## 123 134 135 SGI 1.30859395 1 <NA>
## 124 136 137 DSRI 0.99504151 1 <NA>
## 125 138 139 AQI 0.95646039 1 <NA>
## 126 0 0 <NA> 0.00000000 -1 No
## 127 0 0 <NA> 0.00000000 -1 Yes
## 128 140 141 AQI 0.87474790 1 <NA>
## 129 142 143 DEPI 1.01301468 1 <NA>
## 130 0 0 <NA> 0.00000000 -1 Yes
## 131 0 0 <NA> 0.00000000 -1 No
## 132 0 0 <NA> 0.00000000 -1 No
## 133 0 0 <NA> 0.00000000 -1 Yes
## 134 0 0 <NA> 0.00000000 -1 No
## 135 0 0 <NA> 0.00000000 -1 Yes
## 136 0 0 <NA> 0.00000000 -1 No
## 137 144 145 SGAI 1.01291285 1 <NA>
## 138 146 147 AQI 0.94793768 1 <NA>
## 139 0 0 <NA> 0.00000000 -1 No
## 140 0 0 <NA> 0.00000000 -1 Yes
## 141 148 149 DSRI 1.46266027 1 <NA>
## 142 150 151 AQI 0.83062774 1 <NA>
## 143 0 0 <NA> 0.00000000 -1 No
## 144 152 153 GMI 0.99897290 1 <NA>
## 145 154 155 DSRI 1.19857455 1 <NA>
## 146 0 0 <NA> 0.00000000 -1 No
## 147 0 0 <NA> 0.00000000 -1 Yes
## 148 0 0 <NA> 0.00000000 -1 Yes
## 149 156 157 LEVI 0.97533355 1 <NA>
## 150 0 0 <NA> 0.00000000 -1 No
## 151 158 159 DEPI 0.96767678 1 <NA>
## 152 0 0 <NA> 0.00000000 -1 No
## 153 160 161 SGI 1.23380978 1 <NA>
## 154 162 163 ACCR -0.03744539 1 <NA>
## 155 0 0 <NA> 0.00000000 -1 Yes
## 156 164 165 ACCR -0.01972793 1 <NA>
## 157 166 167 SGAI 1.34360663 1 <NA>
## 158 0 0 <NA> 0.00000000 -1 No
## 159 0 0 <NA> 0.00000000 -1 Yes
## 160 0 0 <NA> 0.00000000 -1 No
## 161 168 169 SGAI 0.95319057 1 <NA>
## 162 0 0 <NA> 0.00000000 -1 No
## 163 0 0 <NA> 0.00000000 -1 Yes
## 164 0 0 <NA> 0.00000000 -1 Yes
## 165 0 0 <NA> 0.00000000 -1 No
## 166 170 171 GMI 1.04256095 1 <NA>
## 167 0 0 <NA> 0.00000000 -1 Yes
## 168 0 0 <NA> 0.00000000 -1 Yes
## 169 0 0 <NA> 0.00000000 -1 No
## 170 172 173 SGI 1.13482843 1 <NA>
## 171 0 0 <NA> 0.00000000 -1 Yes
## 172 0 0 <NA> 0.00000000 -1 No
## 173 0 0 <NA> 0.00000000 -1 Yes
Extracting all the Rules from all the trees built using Random Forest
# Transforming Random Forest object to a list of trees
rfSMOTEtreelist <- RF2List(rfSMOTEFit$finalModel)
# Extracting rules from a list of trees
rfSMOTErules <- extractRules(rfSMOTEtreelist, smote_train[,c(1:8)])
## 2841 rules (length<=6) were extracted from the first 100 trees.
# Assigning outcomes to the rule conditions and measuring the rules (frequency, error)
rfSMOTErulesmetric <- getRuleMetric(rfSMOTErules, smote_train[,c(1:8)], smote_train[,9])
# Pruning irrelevant variable-value pairs from the rule conditions
rfSMOTErulesmetric <- pruneRule(rfSMOTErulesmetric, smote_train[,c(1:8)], smote_train[,9])
# Selecting a set of relevant and non-redundant rules
rfSMOTErulesmetric <- selectRuleRRF(rfSMOTErulesmetric, smote_train[,c(1:8)], smote_train[,9])
# Present a learner using column names instead of X[i,]
presentablerules <- presentRules(rfSMOTErulesmetric, colnames(smote_train[,c(1:8)]))
print(as.data.frame(presentablerules))
## len freq err
## 1 4 0.353 0.0669999999999999
## 2 4 0.179 0
## 3 4 0.111 0.018
## 4 3 0.119 0.0669999999999999
## 5 3 0.218 0.00900000000000001
## 6 3 0.026 0
## 7 3 0.016 0
## condition
## 1 DSRI<=1.29990117249215 & GMI<=1.15465350645863 & AQI<=5.23533496331984 & SGI<=1.34901480793138
## 2 GMI<=1.3045359804137 & DEPI<=0.996429590132862 & SGAI<=1.17686793901743 & LEVI>0.760934628603765
## 3 SGI>1.20182269540422 & DEPI>0.99204493039678 & SGAI>0.931651687982791 & ACCR>-0.0298248401569786
## 4 AQI<=3.25882092035874 & DEPI>1.08738102250767 & ACCR<=0.0297860563181046
## 5 DSRI<=1.31341511560337 & ACCR<=-0.0174723081573545 & LEVI<=1.45486331869587
## 6 AQI<=3.7800924277716 & SGI<=0.652829488100357 & DEPI>0.778331242394198
## 7 GMI>1.01635374518407 & AQI>2.22899015137713 & AQI<=3.46258785002568
## pred impRRF
## 1 No 1
## 2 No 0.353142114916692
## 3 Yes 0.153915696130407
## 4 No 0.149982486819147
## 5 No 0.0995218204642667
## 6 No 0.0979456761710641
## 7 Yes 0.0481375932870851
What is impRRF? It is the rule importance score assigned by selectRuleRRF (which uses a regularized random forest to select rules); we don't need to dig any deeper into it here.
Building a Simplified Tree Ensemble Learner (STEL) from the Rules we got
rfSMOTElearner <- buildLearner(rfSMOTErulesmetric,smote_train[,c(1:8)],smote_train[,9])
## Print Rules in Presentable Format
print(as.data.frame(presentRules(rfSMOTElearner, colnames(smote_train[,c(1:8)]))))
## len freq err
## 1 4 0.178571428571429 0
## 2 3 0.0257936507936508 0
## 3 3 0.0158730158730159 0
## 4 3 0.101190476190476 0
## 5 4 0.107142857142857 0.0185185185185185
## 6 4 0.132936507936508 0.0149253731343284
## 7 1 0.438492063492063 0.0859728506787331
## condition
## 1 GMI<=1.3045359804137 & DEPI<=0.996429590132862 & SGAI<=1.17686793901743 & LEVI>0.760934628603765
## 2 AQI<=3.7800924277716 & SGI<=0.652829488100357 & DEPI>0.778331242394198
## 3 GMI>1.01635374518407 & AQI>2.22899015137713 & AQI<=3.46258785002568
## 4 AQI<=3.25882092035874 & DEPI>1.08738102250767 & ACCR<=0.0297860563181046
## 5 SGI>1.20182269540422 & DEPI>0.99204493039678 & SGAI>0.931651687982791 & ACCR>-0.0298248401569786
## 6 DSRI<=1.29990117249215 & GMI<=1.15465350645863 & AQI<=5.23533496331984 & SGI<=1.34901480793138
## 7 Else
## pred
## 1 No
## 2 No
## 3 Yes
## 4 No
## 5 Yes
## 6 No
## 7 Yes
**Now that we have the STEL, we can simply apply it to any data, as sketched below. We can create multiple STELs (for GBM, XGBoost, etc.) and maybe come up with a stacked STEL. It can work.**
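For instance, a minimal sketch of scoring the test data with the rule learner, assuming inTrees' applyLearner() (the object name stelPred is illustrative):
stelPred <- applyLearner(rfSMOTElearner, testData[, 1:8])    # predicted class for each test row
table(Predicted = stelPred, Actual = testData$Manipulator)   # quick confusion table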
We will keep that for later. But for now, let's do GBM.
gbmControl <- trainControl(method='boot',
returnResamp='final',
summaryFunction = twoClassSummary,
savePredictions = TRUE,
classProbs = TRUE)
gbmGrid <- expand.grid(interaction.depth = c(1,2,3),
n.trees = (1:20)*50,
shrinkage = 0.1,
n.minobsinnode = 2)
## Gradient Boosting with Original Train Data
set.seed(seed)
gbmOrigFit <- train(Manipulator ~., trainData,
method='gbm',
trControl=gbmControl,
verbose=FALSE,
tuneGrid = gbmGrid,
metric = metric)
## Loading required package: gbm
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
##
## Attaching package: 'plyr'
## The following object is masked from 'package:DMwR':
##
## join
## The following objects are masked from 'package:Hmisc':
##
## is.discrete, summarize
## Gradient Boosting with Upsampled Train Data
#set.seed(seed)
#gbmUpFit <- train(Manipulator ~., over_train,
# method='gbm',
# trControl=gbmControl,
# verbose=FALSE,
# tuneGrid = gbmGrid,
# metric = metric)
## Gradient Boosting with Downsampled Data
#set.seed(seed)
#gbmDownFit <- train(Manipulator ~., under_train,
# method='gbm',
# trControl=gbmControl,
# verbose=FALSE,
# tuneGrid = gbmGrid,
# metric = metric)
## Gradient Boosting with Synthetically Generated Data (SMOTE)
set.seed(seed)
gbmSMOTEFit <- train(Manipulator ~., smote_train,
method='gbm',
trControl=gbmControl,
verbose=FALSE,
tuneGrid = gbmGrid,
metric = metric)
gbmModels <- list(Original = gbmOrigFit,
#Undersample = gbmDownFit,
#Oversample = gbmUpFit,
SMOTE = gbmSMOTEFit)
gbmResamples <- resamples(gbmModels)
summary(gbmResamples)
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(gbmResamples, scales=scales)
gbmSMOTEPred <- predict(gbmSMOTEFit, testData, type = "raw")
confusionMatrix(gbmSMOTEPred, testData$Manipulator)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 422 4
## Yes 58 11
##
## Accuracy : 0.8747
## 95% CI : (0.8423, 0.9026)
## No Information Rate : 0.9697
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2232
## Mcnemar's Test P-Value : 1.685e-11
##
## Sensitivity : 0.8792
## Specificity : 0.7333
## Pos Pred Value : 0.9906
## Neg Pred Value : 0.1594
## Prevalence : 0.9697
## Detection Rate : 0.8525
## Detection Prevalence : 0.8606
## Balanced Accuracy : 0.8062
##
## 'Positive' Class : No
##
The SMOTEd GBM is pretty good, but so far the SMOTEd RF has performed better.
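For completeness, a sketch of the test-set ROC/AUC for the SMOTEd GBM, mirroring the ROCR steps used for the random forest above (object names are illustrative; output not shown):
gbmSMOTEProb <- predict(gbmSMOTEFit, testData, type = "prob")[,2]
gbmSMOTEprediction <- prediction(gbmSMOTEProb, testData$Manipulator)
gbmSMOTEperf <- performance(gbmSMOTEprediction, "tpr", "fpr")
plot(gbmSMOTEperf, main = "ROC Curve for SMOTEd GBM", col = 2, lwd = 2)
abline(a = 0, b = 1, lwd = 2, lty = 3, col = "black")
unlist(slot(performance(gbmSMOTEprediction, "auc"), "y.values"))   # AUC on the test data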
Let's build a Simplified Tree Ensemble Learner from the SMOTEd GBM model we got.
# Transforming the GBM object to a list of trees
gbmSMOTEtreelist <- GBM2List(gbmSMOTEFit$finalModel, smote_train[,c(1:8)])
# Extracting rules from a list of trees
gbmSMOTErules <- extractRules(gbmSMOTEtreelist, smote_train[,c(1:8)])
## 400 rules (length<=6) were extracted from the first 100 trees.
# Assigning outcomes to the rule conditions and measuring the rules (frequency, error)
gbmSMOTErulesmetric <- getRuleMetric(gbmSMOTErules, smote_train[,c(1:8)], smote_train[,9])
# Pruning irrelevant variable-value pairs from the rule conditions
gbmSMOTErulesmetric <- pruneRule(gbmSMOTErulesmetric, smote_train[,c(1:8)], smote_train[,9])
# Selecting a set of relevant and non-redundant rules
gbmSMOTErulesmetric <- selectRuleRRF(gbmSMOTErulesmetric, smote_train[,c(1:8)], smote_train[,9])
# Present a learner using column names instead of X[i,]
presentablerules <- presentRules(gbmSMOTErulesmetric, colnames(smote_train[,c(1:8)]))
#print(as.data.frame(presentablerules))
gbmSMOTElearner <- buildLearner(gbmSMOTErulesmetric,smote_train[,c(1:8)],smote_train[,9])
## Print Rules in Presentable Format
print(as.data.frame(presentRules(gbmSMOTElearner, colnames(smote_train[,c(1:8)]))))
## len freq err
## 1 2 0.202380952380952 0.0196078431372549
## 2 1 0.0595238095238095 0.0333333333333333
## 3 2 0.109126984126984 0.0363636363636364
## 4 1 0.180555555555556 0.021978021978022
## 5 3 0.0734126984126984 0.027027027027027
## 6 1 0.0178571428571429 0
## 7 2 0.136904761904762 0.072463768115942
## 8 3 0.0436507936507936 0
## 9 2 0.101190476190476 0.156862745098039
## 10 1 0.0198412698412698 0.2
## 11 3 0.0158730158730159 0
## 12 1 0.0396825396825397 0.15
## condition
## 1 DSRI<=0.99342606241954 & SGI<=1.4591091339807
## 2 AQI>4.15844822075123
## 3 DSRI<=1.27488438307794 & ACCR<=-0.0119870060493451
## 4 SGI>1.39598241824885
## 5 DSRI<=1.3178234185123 & GMI<=1.08019408549228 & SGI<=1.25693022253482
## 6 AQI>2.27688688422796
## 7 SGI>1.16677602090122 & DEPI>0.992595550556234
## 8 GMI<=1.66923007083038 & DEPI<=0.996642080536026 & SGAI<=1.2817691727504
## 9 DSRI>0.932115157471951 & GMI>1.15405500798293
## 10 SGAI>1.30338880786323
## 11 GMI<=1.3045359804137 & ACCR<=0.0536334948233885 & LEVI>0.822185436946245
## 12 Else
## pred
## 1 No
## 2 Yes
## 3 No
## 4 Yes
## 5 No
## 6 Yes
## 7 Yes
## 8 No
## 9 Yes
## 10 Yes
## 11 No
## 12 Yes
**That's enough for now… the rest, next time!**