The dataset consists of companies, some of which have manipulated their books.
We will try to distinguish a manipulator from a non-manipulator based on the financial ratios reported by the companies.
There are a total of 1239 observations in the dataset.
Out of these 1239 observations, 1200 are non-manipulators and 39 are manipulators.
Libraries Required
library(caret)
library(ROCR)
library(ggplot2)
library(readxl)
library(Hmisc)
library(AppliedPredictiveModeling)
library(corrplot)
library(DMwR)
library(ROSE)
library(reprtree)
library(randomForest)
library(rattle)
library(inTrees) #THE GOD
Read Data Set
fraud <- read_excel("C:/Users/aaum/Desktop/Projects/Fraud Analytics/Fraud Data.xlsx",
sheet = "DATA")
fraud <-as.data.frame(fraud[,2:10]) # Remove Company ID
fraud$Manipulator <- factor(fraud$Manipulator, levels = c(0,1), labels = c("No","Yes"))
str(fraud)
## 'data.frame': 1239 obs. of 9 variables:
## $ DSRI : num 1.62 1 1 1.49 1 ...
## $ GMI : num 1.13 1.61 1.02 1 1.37 ...
## $ AQI : num 7.185 1.005 1.241 0.466 0.637 ...
## $ SGI : num 0.366 13.081 1.475 0.673 0.861 ...
## $ DEPI : num 1.38 0.4 1.17 2 1.45 ...
## $ SGAI : num 1.6241 5.1982 0.6477 0.0929 1.7415 ...
## $ ACCR : num -0.1668 0.0605 0.0367 0.2734 0.123 ...
## $ LEVI : num 1.161 0.987 1.264 0.681 0.939 ...
## $ Manipulator: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
Box Plots
transparentTheme(trans = .4)
featurePlot(x = fraud[, 1:8],
y = fraud$Manipulator,
plot = "box",
## Pass in options to bwplot()
scales = list(y = list(relation="free"),
x = list(rot = 90)),
layout = c(2,1),
auto.key = list(columns = 2))
Histograms
par(mfrow=c(1,2))
for(i in 1:8) {
hist(fraud[,i], main=names(fraud)[i])
}
Identifying Variables which have Zero Variance or Near Zero Variance
zeroNrzerovar <- nearZeroVar(fraud[,1:8], saveMetrics= TRUE)
zeroNrzerovar
## freqRatio percentUnique zeroVar nzv
## DSRI 16 98.78935 FALSE FALSE
## GMI 92 92.65537 FALSE FALSE
## AQI 2 99.91929 FALSE FALSE
## SGI 1 100.00000 FALSE FALSE
## DEPI 13 99.03148 FALSE FALSE
## SGAI 6 97.90153 FALSE FALSE
## ACCR 1 100.00000 FALSE FALSE
## LEVI 2 99.91929 FALSE FALSE
Identifying Collinearity and Multicollinearity
## Correlated Variables
corVariables <- cor(fraud[,1:8])
corrplot(corVariables)
## Multicollinearity
highlyCorVariables <- findCorrelation(corVariables, cutoff = .75)
highlyCorVariables
## integer(0)
Findings
* None of the variables have zero or near zero variance
* None of the variables are highly correlated or show linear dependencies
Create Data Partition
set.seed(1408)
splitIndex <- createDataPartition(fraud$Manipulator, p = .6,list = FALSE, times = 1)
trainData <- fraud[splitIndex,]
testData <- fraud[-splitIndex,]
Check for Proportion of Fraudulent Companies in Train and Test Data
describe(trainData$Manipulator)
## trainData$Manipulator
## n missing distinct
## 744 0 2
##
## Value No Yes
## Frequency 720 24
## Proportion 0.968 0.032
describe(testData$Manipulator)
## testData$Manipulator
## n missing distinct
## 495 0 2
##
## Value No Yes
## Frequency 480 15
## Proportion 0.97 0.03
The proportion of fraudulent companies is maintained across the train and test data.
Create Balanced Data using different Sampling Methods
ML algorithms struggle here because of the unequal class distribution in the dependent variable: classifier performance gets biased towards the majority class. Hence we first need to create balanced datasets.
set.seed(1408)
over_train <- ovun.sample(Manipulator ~ ., data = trainData, method = "over")$data
describe(over_train$Manipulator)
## over_train$Manipulator
## n missing distinct
## 1443 0 2
##
## Value No Yes
## Frequency 720 723
## Proportion 0.499 0.501
We have upsampled the number of positive observations from 24 to 723. Let's expect overfitting!
set.seed(1408)
under_train <- ovun.sample(Manipulator ~ ., data = trainData, method = "under")$data
describe(under_train$Manipulator)
## under_train$Manipulator
## n missing distinct
## 50 0 2
##
## Value No Yes
## Frequency 26 24
## Proportion 0.52 0.48
set.seed(1408)
smote_train <- SMOTE(Manipulator ~.,trainData,perc.over = 1000,perc.under = 100)
describe(smote_train$Manipulator)
## smote_train$Manipulator
## n missing distinct
## 504 0 2
##
## Value No Yes
## Frequency 240 264
## Proportion 0.476 0.524
Under each sampling method, we will try different machine learning algorithms and come up with the best model.
Build Models and Assess Them Using the Evaluation Metric “ROC”
seed <- 1408 # Keeping seed constant for all models
metric <- "ROC" # Evaluation metric for Model selection
**Setting control parameters and tuning grid**
rfControl <- trainControl(method = "boot", # Bootstrap sampling with 25 repetitions (caret default)
returnResamp='final',
summaryFunction = twoClassSummary,
savePredictions = TRUE,
classProbs = TRUE)
rfGrid <- expand.grid(.mtry=c(1:7))
# mtry : number of variables randomly sampled as candidates at each split
# The default number of trees is 500 in caret's rf method.
# Hence we wouldn't be able to find the optimal number of trees using a direct grid search.
# However (if required) we can extend caret with a custom random forest method that supports a grid search over the number of trees, as sketched below.
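A rough sketch of that extension (not run here). It assumes caret's standard custom-model list structure; the object name customRF and the candidate ntree values are illustrative.
customRF <- getModelInfo("rf", regex = FALSE)[[1]]   # start from caret's built-in rf method
customRF$parameters <- data.frame(parameter = c("mtry", "ntree"),
                                  class = rep("numeric", 2),
                                  label = c("mtry", "ntree"))
customRF$grid <- function(x, y, len = NULL, search = "grid")
  expand.grid(mtry = 1:7, ntree = c(250, 500, 1000))   # ntree now part of the tuning grid
customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...)
  randomForest(x, y, mtry = param$mtry, ntree = param$ntree, ...)
# It could then be passed to train() as: train(Manipulator ~ ., trainData, method = customRF, ...)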
**Models created using the oversampling and undersampling methods were overfitting.**
**Hence we will restrict the analysis to the SMOTEd train data. We will also work with the original train data, just for comparison.**
## Random Forest Model from Original Train Data
set.seed(seed)
rfOrigFit <- train(Manipulator ~.,trainData,
method='rf',
trControl= rfControl,
metric = metric,
tuneGrid = rfGrid,
prox=TRUE,allowParallel=TRUE)
print(rfOrigFit)
## Random Forest
##
## 744 samples
## 8 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 744, 744, 744, 744, 744, 744, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 1 0.8931678 0.9996916 0.01763636
## 2 0.8656787 0.9975628 0.08300289
## 3 0.8409601 0.9945242 0.08700289
## 4 0.8380801 0.9919491 0.08933622
## 5 0.8091521 0.9917846 0.10979798
## 6 0.8190316 0.9905881 0.13510390
## 7 0.8065067 0.9901140 0.12899278
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 1.
## Random Forest Model from Oversampled Train Data
#set.seed(seed)
#rfUpFit <- train(Manipulator~.,over_train,
# method='rf',
# trControl=rfControl,
# metric = metric,
# tuneGrid = rfGrid,
# prox=TRUE,allowParallel=TRUE)
#print(rfUpFit)
## Random Forest Model from Undersampled Train Data
#set.seed(seed)
#rfdownFit <- train(Manipulator~.,under_train,
# method='rf',
# trControl=rfControl,
# metric = metric,
# tuneGrid = rfGrid,
# prox=TRUE,allowParallel=TRUE)
#print(rfdownFit)
## Random Forest Model from Synthetic Train Data (SMOTE)
set.seed(seed)
rfSMOTEFit <- train(Manipulator~.,smote_train,
method='rf',
trControl=rfControl,
metric = "ROC",
tuneGrid = rfGrid,
prox=TRUE,allowParallel=TRUE)
print(rfSMOTEFit)
## Random Forest
##
## 504 samples
## 8 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 504, 504, 504, 504, 504, 504, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 1 0.9895389 0.9164035 0.9795607
## 2 0.9879765 0.9117107 0.9766618
## 3 0.9864202 0.9112393 0.9728689
## 4 0.9844702 0.9094768 0.9712027
## 5 0.9831152 0.9080661 0.9673813
## 6 0.9812527 0.9040850 0.9658996
## 7 0.9797642 0.9007457 0.9609499
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 1.
**Comparison of Random Forest Models**
rfModels <- list(Original_Model = rfOrigFit,
#Undersampled_Train = rfdownFit,
#Oversampled_Train = rfUpFit,
SMOTEd_Train = rfSMOTEFit)
rfResamples <- resamples(rfModels)
summary(rfResamples)
##
## Call:
## summary.resamples(object = rfResamples)
##
## Models: Original_Model, SMOTEd_Train
## Number of resamples: 25
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Original_Model 0.7821 0.8818 0.9101 0.8932 0.9278 0.9498 0
## SMOTEd_Train 0.9783 0.9855 0.9900 0.9895 0.9926 0.9990 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Original_Model 0.9961 1.0000 1.0000 0.9997 1.000 1.0000 0
## SMOTEd_Train 0.8721 0.8889 0.9176 0.9164 0.931 0.9759 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Original_Model 0.0000 0.0000 0.0000 0.01764 0.0000 0.125 0
## SMOTEd_Train 0.9167 0.9714 0.9798 0.97960 0.9907 1.000 0
scales <- list(x=list(relation="free"), y=list(relation="free"))
dotplot(rfResamples, scales=scales)
As expected, the model built on the original train data suffers on one of the evaluation parameters (specificity).
SMOTE has given us much better results.
Let's see how the SMOTEd model performs on the test data!
rfSMOTEPred <- predict(rfSMOTEFit, testData, type = "raw")
confusionMatrix(rfSMOTEPred, testData$Manipulator)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 430 4
## Yes 50 11
##
## Accuracy : 0.8909
## 95% CI : (0.8601, 0.917)
## No Information Rate : 0.9697
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2531
## Mcnemar's Test P-Value : 9.141e-10
##
## Sensitivity : 0.8958
## Specificity : 0.7333
## Pos Pred Value : 0.9908
## Neg Pred Value : 0.1803
## Prevalence : 0.9697
## Detection Rate : 0.8687
## Detection Prevalence : 0.8768
## Balanced Accuracy : 0.8146
##
## 'Positive' Class : No
##
Not bad at all! The model performs quite well.
Next, let's plot an ROC curve for good measure.
rfSMOTEPred <- predict(rfSMOTEFit, testData, type = "prob")[,2]
# Calculating the values for ROC curve
rfSMOTEprediction <- prediction(rfSMOTEPred,testData$Manipulator)
rfSMOTEperf <- performance(rfSMOTEprediction, "tpr","fpr")
# Plotting the ROC curve
plot(rfSMOTEperf,main="ROC Curve for SMOTEd Random Forest",col=2,lwd=2)
abline(a=0,b=1,lwd=2,lty=3,col="black")
# calculating AUC
rfSMOTEfit.auc <- performance(rfSMOTEprediction, "auc")
# Converting S4 class to vector
rfSMOTEfit.auc <- unlist(slot(rfSMOTEfit.auc, "y.values"))
rfSMOTEfit.auc
## [1] 0.8950694
The Best Model is
print(rfSMOTEFit)
## Random Forest
##
## 504 samples
## 8 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 504, 504, 504, 504, 504, 504, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 1 0.9895389 0.9164035 0.9795607
## 2 0.9879765 0.9117107 0.9766618
## 3 0.9864202 0.9112393 0.9728689
## 4 0.9844702 0.9094768 0.9712027
## 5 0.9831152 0.9080661 0.9673813
## 6 0.9812527 0.9040850 0.9658996
## 7 0.9797642 0.9007457 0.9609499
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 1.
Let's also plot the variable importance.
rfSMOTEvarImp <- varImp(rfSMOTEFit)
plot(rfSMOTEvarImp)
So what does inTrees do?
It helps us interpret a tree ensemble (Random Forest, GBM, etc.) by extracting the decision rules from all of its trees, measuring and pruning them, and selecting a compact, non-redundant set. Brilliant!
How does inTrees work?
Let's get the juice out of our model using inTrees.
Extract a Single Tree From Random Forest
Tree_Number <- 5 # Enter Tree Number
rf_singletree <- getTree(rfSMOTEFit$finalModel,k=Tree_Number, labelVar = TRUE)
print(rf_singletree)
## left daughter right daughter split var split point status prediction
## 1 2 3 DEPI 0.77252572 1 <NA>
## 2 4 5 ACCR -0.03877641 1 <NA>
## 3 6 7 SGAI 0.79843644 1 <NA>
## 4 8 9 GMI 0.79615469 1 <NA>
## 5 10 11 GMI 0.81415599 1 <NA>
## 6 12 13 ACCR -0.02063350 1 <NA>
## 7 14 15 AQI 3.19034748 1 <NA>
## 8 0 0 <NA> 0.00000000 -1 Yes
## 9 16 17 DEPI 0.71593420 1 <NA>
## 10 18 19 SGI 1.74930483 1 <NA>
## 11 20 21 AQI 0.74707600 1 <NA>
## 12 22 23 DEPI 0.98563919 1 <NA>
## 13 24 25 DSRI 1.60316630 1 <NA>
## 14 26 27 DEPI 1.09285066 1 <NA>
## 15 28 29 LEVI 1.03457363 1 <NA>
## 16 0 0 <NA> 0.00000000 -1 No
## 17 30 31 SGI 1.60852019 1 <NA>
## 18 32 33 LEVI 1.00702704 1 <NA>
## 19 0 0 <NA> 0.00000000 -1 Yes
## 20 0 0 <NA> 0.00000000 -1 Yes
## 21 34 35 SGI 2.39803609 1 <NA>
## 22 0 0 <NA> 0.00000000 -1 No
## 23 36 37 AQI 1.10330111 1 <NA>
## 24 38 39 ACCR 0.01055985 1 <NA>
## 25 40 41 LEVI 0.77810226 1 <NA>
## 26 42 43 GMI 0.88214988 1 <NA>
## 27 44 45 GMI 1.20678268 1 <NA>
## 28 46 47 DSRI 0.96292337 1 <NA>
## 29 0 0 <NA> 0.00000000 -1 Yes
## 30 0 0 <NA> 0.00000000 -1 Yes
## 31 0 0 <NA> 0.00000000 -1 No
## 32 0 0 <NA> 0.00000000 -1 Yes
## 33 0 0 <NA> 0.00000000 -1 No
## 34 0 0 <NA> 0.00000000 -1 No
## 35 0 0 <NA> 0.00000000 -1 Yes
## 36 48 49 SGAI 0.27315450 1 <NA>
## 37 0 0 <NA> 0.00000000 -1 No
## 38 50 51 DEPI 0.99006982 1 <NA>
## 39 52 53 DSRI 0.68987732 1 <NA>
## 40 0 0 <NA> 0.00000000 -1 Yes
## 41 0 0 <NA> 0.00000000 -1 No
## 42 54 55 LEVI 0.72345263 1 <NA>
## 43 56 57 AQI 1.16222724 1 <NA>
## 44 58 59 GMI 1.08397011 1 <NA>
## 45 60 61 GMI 1.51017743 1 <NA>
## 46 0 0 <NA> 0.00000000 -1 No
## 47 0 0 <NA> 0.00000000 -1 Yes
## 48 0 0 <NA> 0.00000000 -1 No
## 49 0 0 <NA> 0.00000000 -1 Yes
## 50 0 0 <NA> 0.00000000 -1 Yes
## 51 62 63 SGI 1.06747106 1 <NA>
## 52 64 65 DSRI 0.58778369 1 <NA>
## 53 66 67 DSRI 1.06169573 1 <NA>
## 54 0 0 <NA> 0.00000000 -1 Yes
## 55 68 69 LEVI 1.00697650 1 <NA>
## 56 70 71 DSRI 1.31772676 1 <NA>
## 57 72 73 AQI 1.45309813 1 <NA>
## 58 0 0 <NA> 0.00000000 -1 No
## 59 74 75 SGI 1.18234918 1 <NA>
## 60 76 77 GMI 1.32030003 1 <NA>
## 61 0 0 <NA> 0.00000000 -1 No
## 62 78 79 DSRI 0.99169944 1 <NA>
## 63 80 81 DSRI 0.56854110 1 <NA>
## 64 0 0 <NA> 0.00000000 -1 Yes
## 65 0 0 <NA> 0.00000000 -1 No
## 66 0 0 <NA> 0.00000000 -1 Yes
## 67 82 83 DSRI 1.07178969 1 <NA>
## 68 0 0 <NA> 0.00000000 -1 No
## 69 84 85 DEPI 0.96943950 1 <NA>
## 70 86 87 DSRI 0.87661870 1 <NA>
## 71 88 89 AQI -0.33606078 1 <NA>
## 72 90 91 LEVI 1.05642704 1 <NA>
## 73 92 93 SGI 1.28054787 1 <NA>
## 74 94 95 DSRI 1.52103848 1 <NA>
## 75 0 0 <NA> 0.00000000 -1 No
## 76 96 97 LEVI 0.87832471 1 <NA>
## 77 98 99 LEVI 1.66703384 1 <NA>
## 78 0 0 <NA> 0.00000000 -1 No
## 79 100 101 AQI 1.16069738 1 <NA>
## 80 0 0 <NA> 0.00000000 -1 Yes
## 81 0 0 <NA> 0.00000000 -1 No
## 82 0 0 <NA> 0.00000000 -1 No
## 83 0 0 <NA> 0.00000000 -1 Yes
## 84 0 0 <NA> 0.00000000 -1 Yes
## 85 102 103 DEPI 0.99794376 1 <NA>
## 86 104 105 DEPI 0.88864699 1 <NA>
## 87 106 107 AQI 0.90879805 1 <NA>
## 88 0 0 <NA> 0.00000000 -1 Yes
## 89 108 109 SGAI 0.85982065 1 <NA>
## 90 0 0 <NA> 0.00000000 -1 No
## 91 110 111 SGI 1.10826917 1 <NA>
## 92 0 0 <NA> 0.00000000 -1 No
## 93 0 0 <NA> 0.00000000 -1 Yes
## 94 0 0 <NA> 0.00000000 -1 Yes
## 95 0 0 <NA> 0.00000000 -1 No
## 96 0 0 <NA> 0.00000000 -1 Yes
## 97 112 113 GMI 1.29349896 1 <NA>
## 98 0 0 <NA> 0.00000000 -1 Yes
## 99 0 0 <NA> 0.00000000 -1 No
## 100 0 0 <NA> 0.00000000 -1 Yes
## 101 114 115 DEPI 1.17255006 1 <NA>
## 102 0 0 <NA> 0.00000000 -1 No
## 103 116 117 DSRI 1.14240373 1 <NA>
## 104 118 119 ACCR 0.09549754 1 <NA>
## 105 120 121 ACCR 0.09115667 1 <NA>
## 106 122 123 DEPI 0.98941302 1 <NA>
## 107 124 125 DEPI 1.03185302 1 <NA>
## 108 126 127 DSRI 1.66446562 1 <NA>
## 109 128 129 LEVI 1.27484562 1 <NA>
## 110 0 0 <NA> 0.00000000 -1 No
## 111 0 0 <NA> 0.00000000 -1 Yes
## 112 0 0 <NA> 0.00000000 -1 Yes
## 113 0 0 <NA> 0.00000000 -1 No
## 114 0 0 <NA> 0.00000000 -1 No
## 115 0 0 <NA> 0.00000000 -1 Yes
## 116 130 131 LEVI 1.08149188 1 <NA>
## 117 0 0 <NA> 0.00000000 -1 Yes
## 118 132 133 ACCR -0.02134921 1 <NA>
## 119 0 0 <NA> 0.00000000 -1 No
## 120 0 0 <NA> 0.00000000 -1 No
## 121 0 0 <NA> 0.00000000 -1 Yes
## 122 0 0 <NA> 0.00000000 -1 No
## 123 134 135 SGI 1.30859395 1 <NA>
## 124 136 137 DSRI 0.99504151 1 <NA>
## 125 138 139 AQI 0.95646039 1 <NA>
## 126 0 0 <NA> 0.00000000 -1 No
## 127 0 0 <NA> 0.00000000 -1 Yes
## 128 140 141 AQI 0.87474790 1 <NA>
## 129 142 143 DEPI 1.01301468 1 <NA>
## 130 0 0 <NA> 0.00000000 -1 Yes
## 131 0 0 <NA> 0.00000000 -1 No
## 132 0 0 <NA> 0.00000000 -1 No
## 133 0 0 <NA> 0.00000000 -1 Yes
## 134 0 0 <NA> 0.00000000 -1 No
## 135 0 0 <NA> 0.00000000 -1 Yes
## 136 0 0 <NA> 0.00000000 -1 No
## 137 144 145 SGAI 1.01291285 1 <NA>
## 138 146 147 AQI 0.94793768 1 <NA>
## 139 0 0 <NA> 0.00000000 -1 No
## 140 0 0 <NA> 0.00000000 -1 Yes
## 141 148 149 DSRI 1.46266027 1 <NA>
## 142 150 151 AQI 0.83062774 1 <NA>
## 143 0 0 <NA> 0.00000000 -1 No
## 144 152 153 GMI 0.99897290 1 <NA>
## 145 154 155 DSRI 1.19857455 1 <NA>
## 146 0 0 <NA> 0.00000000 -1 No
## 147 0 0 <NA> 0.00000000 -1 Yes
## 148 0 0 <NA> 0.00000000 -1 Yes
## 149 156 157 LEVI 0.97533355 1 <NA>
## 150 0 0 <NA> 0.00000000 -1 No
## 151 158 159 DEPI 0.96767678 1 <NA>
## 152 0 0 <NA> 0.00000000 -1 No
## 153 160 161 SGI 1.23380978 1 <NA>
## 154 162 163 ACCR -0.03744539 1 <NA>
## 155 0 0 <NA> 0.00000000 -1 Yes
## 156 164 165 ACCR -0.01972793 1 <NA>
## 157 166 167 SGAI 1.34360663 1 <NA>
## 158 0 0 <NA> 0.00000000 -1 No
## 159 0 0 <NA> 0.00000000 -1 Yes
## 160 0 0 <NA> 0.00000000 -1 No
## 161 168 169 SGAI 0.95319057 1 <NA>
## 162 0 0 <NA> 0.00000000 -1 No
## 163 0 0 <NA> 0.00000000 -1 Yes
## 164 0 0 <NA> 0.00000000 -1 Yes
## 165 0 0 <NA> 0.00000000 -1 No
## 166 170 171 GMI 1.04256095 1 <NA>
## 167 0 0 <NA> 0.00000000 -1 Yes
## 168 0 0 <NA> 0.00000000 -1 Yes
## 169 0 0 <NA> 0.00000000 -1 No
## 170 172 173 SGI 1.13482843 1 <NA>
## 171 0 0 <NA> 0.00000000 -1 Yes
## 172 0 0 <NA> 0.00000000 -1 No
## 173 0 0 <NA> 0.00000000 -1 Yes
Extracting all the Rules from all the trees built using Random Forest
# Transforming Random Forest object to a list of trees
rfSMOTEtreelist <- RF2List(rfSMOTEFit$finalModel)
# Extracting rules from a list of trees
rfSMOTErules <- extractRules(rfSMOTEtreelist, smote_train[,c(1:8)])
## 2841 rules (length<=6) were extracted from the first 100 trees.
# Assigning outcomes to the rule conditions and measuring the rules (frequency, error)
rfSMOTErulesmetric <- getRuleMetric(rfSMOTErules, smote_train[,c(1:8)], smote_train[,9])
# Pruning irrelevant variable-value pairs from the rule conditions
rfSMOTErulesmetric <- pruneRule(rfSMOTErulesmetric, smote_train[,c(1:8)], smote_train[,9])
# Selecting a set of relevant and non-redundant rules
rfSMOTErulesmetric <- selectRuleRRF(rfSMOTErulesmetric, smote_train[,c(1:8)], smote_train[,9])
# Present a learner using column names instead of X[i,]
presentablerules <- presentRules(rfSMOTErulesmetric, colnames(smote_train[,c(1:8)]))
print(as.data.frame(presentablerules))
## len freq err
## 1 4 0.353 0.0669999999999999
## 2 4 0.179 0
## 3 4 0.111 0.018
## 4 3 0.119 0.0669999999999999
## 5 3 0.218 0.00900000000000001
## 6 3 0.026 0
## 7 3 0.016 0
## condition
## 1 DSRI<=1.29990117249215 & GMI<=1.15465350645863 & AQI<=5.23533496331984 & SGI<=1.34901480793138
## 2 GMI<=1.3045359804137 & DEPI<=0.996429590132862 & SGAI<=1.17686793901743 & LEVI>0.760934628603765
## 3 SGI>1.20182269540422 & DEPI>0.99204493039678 & SGAI>0.931651687982791 & ACCR>-0.0298248401569786
## 4 AQI<=3.25882092035874 & DEPI>1.08738102250767 & ACCR<=0.0297860563181046
## 5 DSRI<=1.31341511560337 & ACCR<=-0.0174723081573545 & LEVI<=1.45486331869587
## 6 AQI<=3.7800924277716 & SGI<=0.652829488100357 & DEPI>0.778331242394198
## 7 GMI>1.01635374518407 & AQI>2.22899015137713 & AQI<=3.46258785002568
## pred impRRF
## 1 No 1
## 2 No 0.353142114916692
## 3 Yes 0.153915696130407
## 4 No 0.149982486819147
## 5 No 0.0995218204642667
## 6 No 0.0979456761710641
## 7 Yes 0.0481375932870851
What is impRRF? It is the rule importance score assigned by selectRuleRRF (which uses a regularized random forest to select rules); we don't need to dig any deeper into it here.
Building a Simplified Tree Ensemble Learner (STEL) from the Rules we got
rfSMOTElearner <- buildLearner(rfSMOTErulesmetric,smote_train[,c(1:8)],smote_train[,9])
## Print Rules in Presentable Format
print(as.data.frame(presentRules(rfSMOTElearner, colnames(smote_train[,c(1:8)]))))
## len freq err
## 1 4 0.178571428571429 0
## 2 3 0.0257936507936508 0
## 3 3 0.0158730158730159 0
## 4 3 0.101190476190476 0
## 5 4 0.107142857142857 0.0185185185185185
## 6 4 0.132936507936508 0.0149253731343284
## 7 1 0.438492063492063 0.0859728506787331
## condition
## 1 GMI<=1.3045359804137 & DEPI<=0.996429590132862 & SGAI<=1.17686793901743 & LEVI>0.760934628603765
## 2 AQI<=3.7800924277716 & SGI<=0.652829488100357 & DEPI>0.778331242394198
## 3 GMI>1.01635374518407 & AQI>2.22899015137713 & AQI<=3.46258785002568
## 4 AQI<=3.25882092035874 & DEPI>1.08738102250767 & ACCR<=0.0297860563181046
## 5 SGI>1.20182269540422 & DEPI>0.99204493039678 & SGAI>0.931651687982791 & ACCR>-0.0298248401569786
## 6 DSRI<=1.29990117249215 & GMI<=1.15465350645863 & AQI<=5.23533496331984 & SGI<=1.34901480793138
## 7 Else
## pred
## 1 No
## 2 No
## 3 Yes
## 4 No
## 5 Yes
## 6 No
## 7 Yes
**Now that we have the STEL, we can simply apply it to any data, as sketched below. We can create multiple STELs (for GBM, XGBoost, etc.) and maybe come up with a stacked STEL. It can work.**
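For instance, a minimal sketch of scoring the test data with the rule learner, assuming inTrees' applyLearner() (the object name stelPred is illustrative):
stelPred <- applyLearner(rfSMOTElearner, testData[, 1:8])    # predicted class for each test row
table(Predicted = stelPred, Actual = testData$Manipulator)   # quick confusion table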
We will keep that for later. But for now, let's do GBM.
gbmControl <- trainControl(method='boot',
returnResamp='final',
summaryFunction = twoClassSummary,
savePredictions = TRUE,
classProbs = TRUE)
gbmGrid <- expand.grid(interaction.depth = c(1,2,3),
n.trees = (1:20)*50,
shrinkage = 0.1,
n.minobsinnode = 2)
## Gradient Boosting with Original Train Data
set.seed(seed)
gbmOrigFit <- train(Manipulator ~., trainData,
method='gbm',
trControl=gbmControl,
verbose=FALSE,
tuneGrid = gbmGrid,
metric = metric)
## Loading required package: gbm
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
##
## Attaching package: 'plyr'
## The following object is masked from 'package:DMwR':
##
## join
## The following objects are masked from 'package:Hmisc':
##
## is.discrete, summarize
## Gradient Boosting with Upsampled Train Data
#set.seed(seed)
#gbmUpFit <- train(Manipulator ~., over_train,
# method='gbm',
# trControl=gbmControl,
# verbose=FALSE,
# tuneGrid = gbmGrid,
# metric = metric)
## Gradient Boosting with Downsampled Data
#set.seed(seed)
#gbmDownFit <- train(Manipulator ~., under_train,
# method='gbm',
# trControl=gbmControl,
# verbose=FALSE,
# tuneGrid = gbmGrid,
# metric = metric)
## Gradient Boosting with Synthetically Generated Data (SMOTE)
set.seed(seed)
gbmSMOTEFit <- train(Manipulator ~., smote_train,
method='gbm',
trControl=gbmControl,
verbose=FALSE,
tuneGrid = gbmGrid,
metric = metric)
gbmModels <- list(Original = gbmOrigFit,
#Undersample = gbmDownFit,
#Oversample = gbmUpFit,
SMOTE = gbmSMOTEFit)
gbmResamples <- resamples(gbmModels)
summary(gbmResamples)
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(gbmResamples, scales=scales)
gbmSMOTEPred <- predict(gbmSMOTEFit, testData, type = "raw")
confusionMatrix(gbmSMOTEPred, testData$Manipulator)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 422 4
## Yes 58 11
##
## Accuracy : 0.8747
## 95% CI : (0.8423, 0.9026)
## No Information Rate : 0.9697
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2232
## Mcnemar's Test P-Value : 1.685e-11
##
## Sensitivity : 0.8792
## Specificity : 0.7333
## Pos Pred Value : 0.9906
## Neg Pred Value : 0.1594
## Prevalence : 0.9697
## Detection Rate : 0.8525
## Detection Prevalence : 0.8606
## Balanced Accuracy : 0.8062
##
## 'Positive' Class : No
##
The SMOTEd GBM is pretty good, but so far the SMOTEd RF has performed better.
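For completeness, a sketch of the test-set ROC/AUC for the SMOTEd GBM, mirroring the ROCR steps used for the random forest above (object names are illustrative; output not shown):
gbmSMOTEProb <- predict(gbmSMOTEFit, testData, type = "prob")[,2]
gbmSMOTEprediction <- prediction(gbmSMOTEProb, testData$Manipulator)
gbmSMOTEperf <- performance(gbmSMOTEprediction, "tpr", "fpr")
plot(gbmSMOTEperf, main = "ROC Curve for SMOTEd GBM", col = 2, lwd = 2)
abline(a = 0, b = 1, lwd = 2, lty = 3, col = "black")
unlist(slot(performance(gbmSMOTEprediction, "auc"), "y.values"))   # AUC on the test data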
Let's build a Simplified Tree Ensemble Learner from the SMOTEd GBM model we got.
# Transforming the GBM object to a list of trees
gbmSMOTEtreelist <- GBM2List(gbmSMOTEFit$finalModel, smote_train[,c(1:8)])
# Extracting rules from a list of trees
gbmSMOTErules <- extractRules(gbmSMOTEtreelist, smote_train[,c(1:8)])
## 400 rules (length<=6) were extracted from the first 100 trees.
# Assigning outcomes to the rule conditions and measuring the rules (frequency, error)
gbmSMOTErulesmetric <- getRuleMetric(gbmSMOTErules, smote_train[,c(1:8)], smote_train[,9])
# Pruning irrelevant variable-value pairs from the rule conditions
gbmSMOTErulesmetric <- pruneRule(gbmSMOTErulesmetric, smote_train[,c(1:8)], smote_train[,9])
# Selecting a set of relevant and non-redundant rules
gbmSMOTErulesmetric <- selectRuleRRF(gbmSMOTErulesmetric, smote_train[,c(1:8)], smote_train[,9])
# Present a learner using column names instead of X[i,]
presentablerules <- presentRules(gbmSMOTErulesmetric, colnames(smote_train[,c(1:8)]))
#print(as.data.frame(presentablerules))
gbmSMOTElearner <- buildLearner(gbmSMOTErulesmetric,smote_train[,c(1:8)],smote_train[,9])
## Print Rules in Presentable Format
print(as.data.frame(presentRules(gbmSMOTElearner, colnames(smote_train[,c(1:8)]))))
## len freq err
## 1 2 0.202380952380952 0.0196078431372549
## 2 1 0.0595238095238095 0.0333333333333333
## 3 2 0.109126984126984 0.0363636363636364
## 4 1 0.180555555555556 0.021978021978022
## 5 3 0.0734126984126984 0.027027027027027
## 6 1 0.0178571428571429 0
## 7 2 0.136904761904762 0.072463768115942
## 8 3 0.0436507936507936 0
## 9 2 0.101190476190476 0.156862745098039
## 10 1 0.0198412698412698 0.2
## 11 3 0.0158730158730159 0
## 12 1 0.0396825396825397 0.15
## condition
## 1 DSRI<=0.99342606241954 & SGI<=1.4591091339807
## 2 AQI>4.15844822075123
## 3 DSRI<=1.27488438307794 & ACCR<=-0.0119870060493451
## 4 SGI>1.39598241824885
## 5 DSRI<=1.3178234185123 & GMI<=1.08019408549228 & SGI<=1.25693022253482
## 6 AQI>2.27688688422796
## 7 SGI>1.16677602090122 & DEPI>0.992595550556234
## 8 GMI<=1.66923007083038 & DEPI<=0.996642080536026 & SGAI<=1.2817691727504
## 9 DSRI>0.932115157471951 & GMI>1.15405500798293
## 10 SGAI>1.30338880786323
## 11 GMI<=1.3045359804137 & ACCR<=0.0536334948233885 & LEVI>0.822185436946245
## 12 Else
## pred
## 1 No
## 2 Yes
## 3 No
## 4 Yes
## 5 No
## 6 Yes
## 7 Yes
## 8 No
## 9 Yes
## 10 Yes
## 11 No
## 12 Yes
**That's enough for now… the rest, next time!**