The dataset provides details of a Personal Loan Campaign executed by MyBank. 20,000 customers were targeted with an offer of a personal loan at a 10% interest rate, of which 2,512 customers responded positively. The data is to be used to build classification model(s) that predict the response of a new set of customers in the future, based on the attributes available in the data.
Classification models are built using the following supervised machine learning techniques:
1. Classification and Regression Tree
2. Random Forest
3. Artificial Neural Network
Once the classification models are built, compare the three models with each other and provide observations/recommendations.
#install.packages("caret, repos = http://cran.us.r-project.org")
#install.packages("rpart, repos = http://cran.us.r-project.org")
#install.packages("rpart.plot, repos = http://cran.us.r-project.org")
#install.packages("randomForest, repos = http://cran.us.r-project.org")
library(caret)
library(rpart)
library(rpart.plot)
library(randomForest)
Load Data
PL_X_SELL = read.table("D:/Analytics/BACP-Dec2017/10_MachineLearning/PL_X_SELL.csv", sep = ",", header = T)
attach(PL_X_SELL)
# Find out Total Number of Rows and Columns
dim(PL_X_SELL)
## [1] 20000 40
# Find out Names of the Columns (Features)
names(PL_X_SELL)
## [1] "CUST_ID" "TARGET"
## [3] "AGE" "GENDER"
## [5] "BALANCE" "OCCUPATION"
## [7] "AGE_BKT" "SCR"
## [9] "HOLDING_PERIOD" "ACC_TYPE"
## [11] "ACC_OP_DATE" "LEN_OF_RLTN_IN_MNTH"
## [13] "NO_OF_L_CR_TXNS" "NO_OF_L_DR_TXNS"
## [15] "TOT_NO_OF_L_TXNS" "NO_OF_BR_CSH_WDL_DR_TXNS"
## [17] "NO_OF_ATM_DR_TXNS" "NO_OF_NET_DR_TXNS"
## [19] "NO_OF_MOB_DR_TXNS" "NO_OF_CHQ_DR_TXNS"
## [21] "FLG_HAS_CC" "AMT_ATM_DR"
## [23] "AMT_BR_CSH_WDL_DR" "AMT_CHQ_DR"
## [25] "AMT_NET_DR" "AMT_MOB_DR"
## [27] "AMT_L_DR" "FLG_HAS_ANY_CHGS"
## [29] "AMT_OTH_BK_ATM_USG_CHGS" "AMT_MIN_BAL_NMC_CHGS"
## [31] "NO_OF_IW_CHQ_BNC_TXNS" "NO_OF_OW_CHQ_BNC_TXNS"
## [33] "AVG_AMT_PER_ATM_TXN" "AVG_AMT_PER_CSH_WDL_TXN"
## [35] "AVG_AMT_PER_CHQ_TXN" "AVG_AMT_PER_NET_TXN"
## [37] "AVG_AMT_PER_MOB_TXN" "FLG_HAS_NOMINEE"
## [39] "FLG_HAS_OLD_LOAN" "random"
# Find out Class of each Feature, along with internal structure
str(PL_X_SELL)
## 'data.frame': 20000 obs. of 40 variables:
## $ CUST_ID : Factor w/ 20000 levels "C1","C10","C100",..: 17699 16532 11027 17984 2363 11747 18115 15556 15216 12494 ...
## $ TARGET : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AGE : int 27 47 40 53 36 42 30 53 42 30 ...
## $ GENDER : Factor w/ 3 levels "F","M","O": 2 2 2 2 2 1 2 1 1 2 ...
## $ BALANCE : num 3384 287489 18217 71720 1671623 ...
## $ OCCUPATION : Factor w/ 4 levels "PROF","SAL","SELF-EMP",..: 3 2 3 2 1 1 1 2 3 1 ...
## $ AGE_BKT : Factor w/ 7 levels "<25",">50","26-30",..: 3 7 5 2 5 6 3 2 6 3 ...
## $ SCR : int 776 324 603 196 167 493 479 562 105 170 ...
## $ HOLDING_PERIOD : int 30 28 2 13 24 26 14 25 15 13 ...
## $ ACC_TYPE : Factor w/ 2 levels "CA","SA": 2 2 2 1 2 2 2 1 2 2 ...
## $ ACC_OP_DATE : Factor w/ 4869 levels "01-01-00","01-01-01",..: 3270 1806 3575 993 2861 862 4533 3160 257 334 ...
## $ LEN_OF_RLTN_IN_MNTH : int 146 104 61 107 185 192 177 99 88 111 ...
## $ NO_OF_L_CR_TXNS : int 7 8 10 36 20 5 6 14 18 14 ...
## $ NO_OF_L_DR_TXNS : int 3 2 5 14 1 2 6 3 14 8 ...
## $ TOT_NO_OF_L_TXNS : int 10 10 15 50 21 7 12 17 32 22 ...
## $ NO_OF_BR_CSH_WDL_DR_TXNS: int 0 0 1 4 1 1 0 3 6 3 ...
## $ NO_OF_ATM_DR_TXNS : int 1 1 1 2 0 1 1 0 2 1 ...
## $ NO_OF_NET_DR_TXNS : int 2 1 1 3 0 0 1 0 4 0 ...
## $ NO_OF_MOB_DR_TXNS : int 0 0 0 1 0 0 0 0 1 0 ...
## $ NO_OF_CHQ_DR_TXNS : int 0 0 2 4 0 0 4 0 1 4 ...
## $ FLG_HAS_CC : int 0 0 0 0 0 1 0 0 1 0 ...
## $ AMT_ATM_DR : int 13100 6600 11200 26100 0 18500 6200 0 35400 18000 ...
## $ AMT_BR_CSH_WDL_DR : int 0 0 561120 673590 808480 379310 0 945160 198430 869880 ...
## $ AMT_CHQ_DR : int 0 0 49320 60780 0 0 10580 0 51490 32610 ...
## $ AMT_NET_DR : num 973557 799813 997570 741506 0 ...
## $ AMT_MOB_DR : int 0 0 0 71388 0 0 0 0 170332 0 ...
## $ AMT_L_DR : num 986657 806413 1619210 1573364 808480 ...
## $ FLG_HAS_ANY_CHGS : int 0 1 1 0 0 0 1 0 0 0 ...
## $ AMT_OTH_BK_ATM_USG_CHGS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AMT_MIN_BAL_NMC_CHGS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NO_OF_IW_CHQ_BNC_TXNS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NO_OF_OW_CHQ_BNC_TXNS : int 0 0 1 0 0 0 0 0 0 0 ...
## $ AVG_AMT_PER_ATM_TXN : num 13100 6600 11200 13050 0 ...
## $ AVG_AMT_PER_CSH_WDL_TXN : num 0 0 561120 168398 808480 ...
## $ AVG_AMT_PER_CHQ_TXN : num 0 0 24660 15195 0 ...
## $ AVG_AMT_PER_NET_TXN : num 486779 799813 997570 247169 0 ...
## $ AVG_AMT_PER_MOB_TXN : num 0 0 0 71388 0 ...
## $ FLG_HAS_NOMINEE : int 1 1 1 1 1 1 0 1 1 0 ...
## $ FLG_HAS_OLD_LOAN : int 1 0 1 0 0 1 1 1 1 0 ...
## $ random : num 1.14e-05 1.11e-04 1.20e-04 1.37e-04 1.74e-04 ...
# Check the top 5 and bottom 5 rows of the dataset
head(PL_X_SELL,5)
## CUST_ID TARGET AGE GENDER BALANCE OCCUPATION AGE_BKT SCR
## 1 C7927 0 27 M 3383.75 SELF-EMP 26-30 776
## 2 C6877 0 47 M 287489.04 SAL 46-50 324
## 3 C19922 0 40 M 18216.88 SELF-EMP 36-40 603
## 4 C8183 0 53 M 71720.48 SAL >50 196
## 5 C12123 0 36 M 1671622.89 PROF 36-40 167
## HOLDING_PERIOD ACC_TYPE ACC_OP_DATE LEN_OF_RLTN_IN_MNTH NO_OF_L_CR_TXNS
## 1 30 SA 3/23/2005 146 7
## 2 28 SA 10-11-08 104 8
## 3 2 SA 4/26/2012 61 10
## 4 13 CA 07-04-08 107 36
## 5 24 SA 12/29/2001 185 20
## NO_OF_L_DR_TXNS TOT_NO_OF_L_TXNS NO_OF_BR_CSH_WDL_DR_TXNS
## 1 3 10 0
## 2 2 10 0
## 3 5 15 1
## 4 14 50 4
## 5 1 21 1
## NO_OF_ATM_DR_TXNS NO_OF_NET_DR_TXNS NO_OF_MOB_DR_TXNS NO_OF_CHQ_DR_TXNS
## 1 1 2 0 0
## 2 1 1 0 0
## 3 1 1 0 2
## 4 2 3 1 4
## 5 0 0 0 0
## FLG_HAS_CC AMT_ATM_DR AMT_BR_CSH_WDL_DR AMT_CHQ_DR AMT_NET_DR AMT_MOB_DR
## 1 0 13100 0 0 973557 0
## 2 0 6600 0 0 799813 0
## 3 0 11200 561120 49320 997570 0
## 4 0 26100 673590 60780 741506 71388
## 5 0 0 808480 0 0 0
## AMT_L_DR FLG_HAS_ANY_CHGS AMT_OTH_BK_ATM_USG_CHGS AMT_MIN_BAL_NMC_CHGS
## 1 986657 0 0 0
## 2 806413 1 0 0
## 3 1619210 1 0 0
## 4 1573364 0 0 0
## 5 808480 0 0 0
## NO_OF_IW_CHQ_BNC_TXNS NO_OF_OW_CHQ_BNC_TXNS AVG_AMT_PER_ATM_TXN
## 1 0 0 13100
## 2 0 0 6600
## 3 0 1 11200
## 4 0 0 13050
## 5 0 0 0
## AVG_AMT_PER_CSH_WDL_TXN AVG_AMT_PER_CHQ_TXN AVG_AMT_PER_NET_TXN
## 1 0.0 0 486778.5
## 2 0.0 0 799813.0
## 3 561120.0 24660 997570.0
## 4 168397.5 15195 247168.7
## 5 808480.0 0 0.0
## AVG_AMT_PER_MOB_TXN FLG_HAS_NOMINEE FLG_HAS_OLD_LOAN random
## 1 0 1 1 0.000011400
## 2 0 1 0 0.000111373
## 3 0 1 1 0.000119954
## 4 71388 1 0 0.000136825
## 5 0 1 0 0.000173976
#tail(PL_X_SELL,5)
# Check for Missing Values
colSums(is.na(PL_X_SELL))
## CUST_ID TARGET AGE
## 0 0 0
## GENDER BALANCE OCCUPATION
## 0 0 0
## AGE_BKT SCR HOLDING_PERIOD
## 0 0 0
## ACC_TYPE ACC_OP_DATE LEN_OF_RLTN_IN_MNTH
## 0 0 0
## NO_OF_L_CR_TXNS NO_OF_L_DR_TXNS TOT_NO_OF_L_TXNS
## 0 0 0
## NO_OF_BR_CSH_WDL_DR_TXNS NO_OF_ATM_DR_TXNS NO_OF_NET_DR_TXNS
## 0 0 0
## NO_OF_MOB_DR_TXNS NO_OF_CHQ_DR_TXNS FLG_HAS_CC
## 0 0 0
## AMT_ATM_DR AMT_BR_CSH_WDL_DR AMT_CHQ_DR
## 0 0 0
## AMT_NET_DR AMT_MOB_DR AMT_L_DR
## 0 0 0
## FLG_HAS_ANY_CHGS AMT_OTH_BK_ATM_USG_CHGS AMT_MIN_BAL_NMC_CHGS
## 0 0 0
## NO_OF_IW_CHQ_BNC_TXNS NO_OF_OW_CHQ_BNC_TXNS AVG_AMT_PER_ATM_TXN
## 0 0 0
## AVG_AMT_PER_CSH_WDL_TXN AVG_AMT_PER_CHQ_TXN AVG_AMT_PER_NET_TXN
## 0 0 0
## AVG_AMT_PER_MOB_TXN FLG_HAS_NOMINEE FLG_HAS_OLD_LOAN
## 0 0 0
## random
## 0
table(TARGET)
## TARGET
## 0 1
## 17488 2512
prop.table(table(TARGET))
## TARGET
## 0 1
## 0.8744 0.1256
Number of Rows and Columns:
> The number of rows in the dataset is 20,000
> The number of columns (Features) in the dataset is 40
Missing values check:
There are no missing values in the dataset.
Proportion of Responders vs Non-Responders:
> Total Responder Records: 2,512 (12.56%)
> Total Non-Responder Records: 17,488 (87.44%)
Inference:
The non-responder class dominates heavily. The dataset may need to be balanced to achieve better classification results.
# install.packages("plotrix, repos = http://cran.us.r-project.org"" )
library(plotrix)
Data Visualization
pie3D(prop.table((table(PL_X_SELL$TARGET))),
main='Resp Vs Non Resp in Input Data set',
#explode=0.1,
labels=c("Non Resp", "Resp"),
col = c("Turquoise", "Medium Sea Green")
)
pie3D(prop.table((table(PL_X_SELL$GENDER))),
main='Gender in Input Data set',
#explode=0.1,
labels=c("F", "M", "O"),
col = c("Turquoise", "Medium Sea Green", "Dark Sea Green") # one colour per gender level
)
Summary statistics
summary(PL_X_SELL)
## CUST_ID TARGET AGE GENDER
## C1 : 1 Min. :0.0000 Min. :21.00 F: 5433
## C10 : 1 1st Qu.:0.0000 1st Qu.:30.00 M:14376
## C100 : 1 Median :0.0000 Median :38.00 O: 191
## C1000 : 1 Mean :0.1256 Mean :38.42
## C10000 : 1 3rd Qu.:0.0000 3rd Qu.:46.00
## C10001 : 1 Max. :1.0000 Max. :55.00
## (Other):19994
## BALANCE OCCUPATION AGE_BKT SCR
## Min. : 0 PROF :5417 <25 :1753 Min. :100.0
## 1st Qu.: 64754 SAL :5855 >50 :3035 1st Qu.:227.0
## Median : 231676 SELF-EMP:3568 26-30:3434 Median :364.0
## Mean : 511362 SENP :5160 31-35:3404 Mean :440.2
## 3rd Qu.: 653877 36-40:2814 3rd Qu.:644.0
## Max. :8360431 41-45:3067 Max. :999.0
## 46-50:2493
## HOLDING_PERIOD ACC_TYPE ACC_OP_DATE LEN_OF_RLTN_IN_MNTH
## Min. : 1.00 CA: 4241 11/16/2010: 24 Min. : 29.0
## 1st Qu.: 7.00 SA:15759 04-03-09 : 23 1st Qu.: 79.0
## Median :15.00 7/25/2010 : 22 Median :125.0
## Mean :14.96 05-06-13 : 21 Mean :125.2
## 3rd Qu.:22.00 02-07-07 : 20 3rd Qu.:172.0
## Max. :31.00 8/24/2010 : 20 Max. :221.0
## (Other) :19870
## NO_OF_L_CR_TXNS NO_OF_L_DR_TXNS TOT_NO_OF_L_TXNS
## Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 6.00 1st Qu.: 2.000 1st Qu.: 9.00
## Median :10.00 Median : 5.000 Median : 14.00
## Mean :12.35 Mean : 6.634 Mean : 18.98
## 3rd Qu.:14.00 3rd Qu.: 7.000 3rd Qu.: 21.00
## Max. :75.00 Max. :74.000 Max. :149.00
##
## NO_OF_BR_CSH_WDL_DR_TXNS NO_OF_ATM_DR_TXNS NO_OF_NET_DR_TXNS
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 1.000 Median : 1.000 Median : 0.000
## Mean : 1.883 Mean : 1.029 Mean : 1.172
## 3rd Qu.: 2.000 3rd Qu.: 1.000 3rd Qu.: 1.000
## Max. :15.000 Max. :25.000 Max. :22.000
##
## NO_OF_MOB_DR_TXNS NO_OF_CHQ_DR_TXNS FLG_HAS_CC AMT_ATM_DR
## Min. : 0.0000 Min. : 0.000 Min. :0.0000 Min. : 0
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.: 0
## Median : 0.0000 Median : 2.000 Median :0.0000 Median : 6900
## Mean : 0.4118 Mean : 2.138 Mean :0.3054 Mean : 10990
## 3rd Qu.: 0.0000 3rd Qu.: 4.000 3rd Qu.:1.0000 3rd Qu.: 15800
## Max. :25.0000 Max. :15.000 Max. :1.0000 Max. :199300
##
## AMT_BR_CSH_WDL_DR AMT_CHQ_DR AMT_NET_DR AMT_MOB_DR
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 2990 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median :340150 Median : 23840 Median : 0 Median : 0
## Mean :378475 Mean : 124520 Mean :237308 Mean : 22425
## 3rd Qu.:674675 3rd Qu.: 72470 3rd Qu.:473971 3rd Qu.: 0
## Max. :999930 Max. :4928640 Max. :999854 Max. :199667
##
## AMT_L_DR FLG_HAS_ANY_CHGS AMT_OTH_BK_ATM_USG_CHGS
## Min. : 0 Min. :0.0000 Min. : 0.000
## 1st Qu.: 237936 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 695115 Median :0.0000 Median : 0.000
## Mean : 773717 Mean :0.1106 Mean : 1.099
## 3rd Qu.:1078927 3rd Qu.:0.0000 3rd Qu.: 0.000
## Max. :6514921 Max. :1.0000 Max. :250.000
##
## AMT_MIN_BAL_NMC_CHGS NO_OF_IW_CHQ_BNC_TXNS NO_OF_OW_CHQ_BNC_TXNS
## Min. : 0.000 Min. :0.00000 Min. :0.0000
## 1st Qu.: 0.000 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 0.000 Median :0.00000 Median :0.0000
## Mean : 1.292 Mean :0.04275 Mean :0.0444
## 3rd Qu.: 0.000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :170.000 Max. :2.00000 Max. :2.0000
##
## AVG_AMT_PER_ATM_TXN AVG_AMT_PER_CSH_WDL_TXN AVG_AMT_PER_CHQ_TXN
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 1266 1st Qu.: 0
## Median : 6000 Median :147095 Median : 8645
## Mean : 7409 Mean :242237 Mean : 25093
## 3rd Qu.:13500 3rd Qu.:385000 3rd Qu.: 28605
## Max. :25000 Max. :999640 Max. :537842
##
## AVG_AMT_PER_NET_TXN AVG_AMT_PER_MOB_TXN FLG_HAS_NOMINEE FLG_HAS_OLD_LOAN
## Min. : 0 Min. : 0 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0 1st Qu.:1.0000 1st Qu.:0.0000
## Median : 0 Median : 0 Median :1.0000 Median :0.0000
## Mean :179059 Mean : 20304 Mean :0.9012 Mean :0.4929
## 3rd Qu.:257699 3rd Qu.: 0 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :999854 Max. :199667 Max. :1.0000 Max. :1.0000
##
## random
## Min. :0.0000114
## 1st Qu.:0.2481866
## Median :0.5061214
## Mean :0.5019330
## 3rd Qu.:0.7535712
## Max. :0.9999471
##
Creating Training and Testing Dataset
The given dataset is divided into training and testing sets in a 70:30 proportion.
The distribution of the responder and non-responder classes is verified in both sets to ensure it is close to equal.
set.seed(111)
trainIndex <- createDataPartition(TARGET,
p=0.7,
list = FALSE,
times = 1)
train.data <- PL_X_SELL[trainIndex, ]
test.data <- PL_X_SELL[-trainIndex,]
dim(train.data)
## [1] 14000 40
dim(test.data)
## [1] 6000 40
Check whether the class distribution of the partitioned data is consistent
table(train.data$TARGET)
##
## 0 1
## 12272 1728
table(test.data$TARGET)
##
## 0 1
## 5216 784
prop.table((table(train.data$TARGET)))
##
## 0 1
## 0.8765714 0.1234286
prop.table((table(test.data$TARGET)))
##
## 0 1
## 0.8693333 0.1306667
par(mfrow=c(1,2))
pie3D(prop.table((table(train.data$TARGET))),
main='Resp Vs Non Resp in Training Data set',
#explode=0.1,
labels=c("Non Resp", "Resp"),
col = c("Turquoise", "Medium Sea Green")
)
pie3D(prop.table((table(test.data$TARGET))),
main='Resp Vs Non Resp in Testing Data set',
#explode=0.1,
labels=c("Non Resp", "Resp"),
col = c("Aquamarine", "Dark Sea Green")
)
Decision Trees are commonly used in data mining with the objective of creating a model that predicts the value of a target (or dependent) variable based on the values of several input (or independent) variables.
As an umbrella term, Classification and Regression Tree (CART) refers to the following decision trees:
> Classification Trees: where the target variable is categorical and the tree is used to identify the "class" into which the target variable would most likely fall.
> Regression Trees: where the target variable is continuous and the tree is used to predict its value.
The CART algorithm is structured as a sequence of questions, the answers to which determine what the next question, if any, should be. The result of these questions is a tree-like structure whose ends are terminal nodes, at which point there are no more questions.
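As a minimal illustration of this question-and-answer structure (a sketch on the built-in iris data, not the campaign data):
library(rpart)
library(rpart.plot)
# Each internal node asks a yes/no question about a predictor;
# the terminal nodes carry the predicted class.
toy.tree <- rpart(Species ~ ., data = iris, method = "class")
rpart.plot(toy.tree) # the first question splits on Petal.Length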
Model Building - CART (Unbalanced Dataset)
Setting the control parameter inputs for rpart
r.ctrl <- rpart.control(minsplit = 100, # minimum observations in a node for a split to be attempted
minbucket = 10, # minimum observations in any terminal node
cp = 0,         # complexity parameter; 0 grows the full tree (pruned later)
xval = 10       # number of cross-validations
)
Build the model on Training Dataset (Unbalanced)
#Exclude columns - "Customer ID" and "Acct Opening Date"
cart.train <- train.data
names(cart.train)
## [1] "CUST_ID" "TARGET"
## [3] "AGE" "GENDER"
## [5] "BALANCE" "OCCUPATION"
## [7] "AGE_BKT" "SCR"
## [9] "HOLDING_PERIOD" "ACC_TYPE"
## [11] "ACC_OP_DATE" "LEN_OF_RLTN_IN_MNTH"
## [13] "NO_OF_L_CR_TXNS" "NO_OF_L_DR_TXNS"
## [15] "TOT_NO_OF_L_TXNS" "NO_OF_BR_CSH_WDL_DR_TXNS"
## [17] "NO_OF_ATM_DR_TXNS" "NO_OF_NET_DR_TXNS"
## [19] "NO_OF_MOB_DR_TXNS" "NO_OF_CHQ_DR_TXNS"
## [21] "FLG_HAS_CC" "AMT_ATM_DR"
## [23] "AMT_BR_CSH_WDL_DR" "AMT_CHQ_DR"
## [25] "AMT_NET_DR" "AMT_MOB_DR"
## [27] "AMT_L_DR" "FLG_HAS_ANY_CHGS"
## [29] "AMT_OTH_BK_ATM_USG_CHGS" "AMT_MIN_BAL_NMC_CHGS"
## [31] "NO_OF_IW_CHQ_BNC_TXNS" "NO_OF_OW_CHQ_BNC_TXNS"
## [33] "AVG_AMT_PER_ATM_TXN" "AVG_AMT_PER_CSH_WDL_TXN"
## [35] "AVG_AMT_PER_CHQ_TXN" "AVG_AMT_PER_NET_TXN"
## [37] "AVG_AMT_PER_MOB_TXN" "FLG_HAS_NOMINEE"
## [39] "FLG_HAS_OLD_LOAN" "random"
m1 <- rpart(formula = TARGET~.,
data = cart.train[,-c(1,11)],
method = "class",
control = r.ctrl
)
#m1
#install.packages("rattle")
#install.packages("RColorBrewer")
library(rattle)
library(RColorBrewer)
fancyRpartPlot(m1)
printcp(m1)
##
## Classification tree:
## rpart(formula = TARGET ~ ., data = cart.train[, -c(1, 11)], method = "class",
## control = r.ctrl)
##
## Variables actually used in tree construction:
## [1] ACC_TYPE AGE_BKT
## [3] AMT_ATM_DR AMT_BR_CSH_WDL_DR
## [5] AMT_L_DR AMT_NET_DR
## [7] AVG_AMT_PER_ATM_TXN AVG_AMT_PER_CSH_WDL_TXN
## [9] AVG_AMT_PER_MOB_TXN AVG_AMT_PER_NET_TXN
## [11] BALANCE FLG_HAS_CC
## [13] GENDER HOLDING_PERIOD
## [15] LEN_OF_RLTN_IN_MNTH NO_OF_ATM_DR_TXNS
## [17] NO_OF_BR_CSH_WDL_DR_TXNS NO_OF_L_CR_TXNS
## [19] NO_OF_L_DR_TXNS NO_OF_OW_CHQ_BNC_TXNS
## [21] OCCUPATION SCR
##
## Root node error: 1728/14000 = 0.12343
##
## n= 14000
##
## CP nsplit rel error xerror xstd
## 1 0.00593171 0 1.00000 1.00000 0.022523
## 2 0.00405093 4 0.97627 0.99942 0.022517
## 3 0.00173611 10 0.95139 0.98669 0.022393
## 4 0.00166377 12 0.94792 0.99884 0.022512
## 5 0.00154321 26 0.91725 0.99826 0.022506
## 6 0.00135031 30 0.91088 0.99884 0.022512
## 7 0.00101273 41 0.89294 1.00347 0.022556
## 8 0.00096451 45 0.88889 1.00926 0.022612
## 9 0.00086806 48 0.88600 1.00926 0.022612
## 10 0.00057870 50 0.88426 1.01215 0.022640
## 11 0.00000000 55 0.88021 1.02546 0.022767
plotcp(m1)
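Rather than reading the cp value off the table by eye, the cp with the lowest cross-validated error can also be picked programmatically; a sketch using rpart's cptable:
# CP with the minimum cross-validated error (xerror) from the cp table
best.cp <- m1$cptable[which.min(m1$cptable[, "xerror"]), "CP"]
best.cp
# ptree.auto <- prune(m1, cp = best.cp) # alternative to the manual choice below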
Pruning the Tree
ptree <- prune(m1, cp = 0.0017)
printcp(ptree)
##
## Classification tree:
## rpart(formula = TARGET ~ ., data = cart.train[, -c(1, 11)], method = "class",
## control = r.ctrl)
##
## Variables actually used in tree construction:
## [1] AGE_BKT AMT_ATM_DR BALANCE HOLDING_PERIOD
## [5] NO_OF_L_CR_TXNS NO_OF_L_DR_TXNS OCCUPATION SCR
##
## Root node error: 1728/14000 = 0.12343
##
## n= 14000
##
## CP nsplit rel error xerror xstd
## 1 0.0059317 0 1.00000 1.00000 0.022523
## 2 0.0040509 4 0.97627 0.99942 0.022517
## 3 0.0017361 10 0.95139 0.98669 0.022393
## 4 0.0017000 12 0.94792 0.99884 0.022512
Plotting the final CART model (Unbalanced Dataset)
fancyRpartPlot(ptree,
uniform = TRUE,
main = "Final Tree",
palettes = c("Blues", "Oranges")
)
The following model performance measures will be calculated on the training set to gauge the goodness of the model:
> Rank Ordering
> KS
> Area Under Curve (AUC)
> Gini Coefficient
> Classification Error
Predict Training Data Set
cart.train$predict.class = predict(ptree, cart.train, type = "class")
cart.train$predict.score = predict(ptree, cart.train, type = "prob")
#head(cart.train)
Deciling Code
# Assign each observation to a decile (1-10) based on the quantiles of x
decile <- function(x)
{
deciles <- vector(length=10)
for (i in seq(0.1,1,.1))
{
deciles[i*10] <- quantile(x, i, na.rm=T)
}
return (
ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10
))))))))))
}
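A more compact alternative buckets the scores with base R's cut() over the quantile breaks (a sketch; heavy ties in the score distribution can collapse some deciles, as seen in the rank tables below):
# Decile assignment via quantile breaks; labels = FALSE returns integer bucket IDs
decile.cut <- function(x) {
  brks <- unique(quantile(x, probs = seq(0, 1, 0.1), na.rm = TRUE))
  cut(x, breaks = brks, include.lowest = TRUE, labels = FALSE)
}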
#class(cart.train$predict.score)
## deciling
cart.train$deciles <- decile(cart.train$predict.score[,2])
#head(cart.train)
Ranking Code
#install.packages("data.table")
#install.packages("scales")
library(data.table)
library(scales)
tmp_DT = data.table(cart.train)
rank <- tmp_DT[, list(cnt=length(TARGET),
cnt_resp=sum(TARGET==1),
cnt_non_resp=sum(TARGET==0)
), by=deciles][order(-deciles)]
rank$rrate <- round(rank$cnt_resp / rank$cnt,4);
rank$cum_resp <- cumsum(rank$cnt_resp)
rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),4);
rank$cum_rel_non_resp <- round(rank$cum_non_resp / sum(rank$cnt_non_resp),4);
rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100;
rank$rrate <- percent(rank$rrate)
rank$cum_rel_resp <- percent(rank$cum_rel_resp)
rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
rank
## deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp
## 1: 10 2670 654 2016 24.5% 654 2016
## 2: 9 2911 310 2601 10.6% 964 4617
## 3: 7 8419 764 7655 9.1% 1728 12272
## cum_rel_resp cum_rel_non_resp ks
## 1: 37.8% 16.4% 21.42
## 2: 55.8% 37.6% 18.17
## 3: 100.0% 100.0% 0.00
Interpretation:
> The response rate in the top decile is 24.5%.
> The maximum KS is around 21.42%, indicating a weak model.
KS and Area under Curve
#install.packages("ROCR")
#install.packages("ineq")
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:plotrix':
##
## plotCI
## The following object is masked from 'package:stats':
##
## lowess
library(ineq)
pred <- prediction(cart.train$predict.score[,2], cart.train$TARGET)
perf <- performance(pred, "tpr", "fpr")
KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
auc <- performance(pred,"auc");
auc <- as.numeric(auc@y.values)
gini = ineq(cart.train$predict.score[,2], type="Gini")
with(cart.train, table(TARGET, predict.class))
## predict.class
## TARGET 0 1
## 0 12226 46
## 1 1592 136
plot(perf)
KS
## [1] 0.2142056
auc
## [1] 0.624329
gini
## [1] 0.2179665
Summary: CART - Model Performance (Training, Unbalanced Dataset):
The KS = 21.4% and the AUC = 62.4%, which indicates that the model is not good.
The Gini Coefficient = 21.8% also indicates that the model is not good.
Confusion matrix:
1. Accuracy = (12226+136)/(12226+1592+46+136) = 88.3%
2. Classification Error Rate = 1 - Accuracy = 11.7%
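These figures can also be computed directly from the confusion matrix instead of by hand; a short sketch using the table produced above:
# Accuracy and classification error straight from the confusion matrix
cm <- with(cart.train, table(TARGET, predict.class))
accuracy <- sum(diag(cm)) / sum(cm)
accuracy # ~0.883
1 - accuracy # classification error rate, ~0.117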
Predict Test Data Set
## Scoring Holdout sample
cart.test <- test.data
cart.test$predict.class = predict(ptree, cart.test, type = "class")
cart.test$predict.score = predict(ptree, cart.test, type = "prob")
cart.test$deciles <- decile(cart.test$predict.score[,2])
#head(cart.test)
Ranking Code - Test Data
tmp_DT = data.table(cart.test)
rank <- tmp_DT[, list(cnt=length(TARGET),
cnt_resp=sum(TARGET==1),
cnt_non_resp=sum(TARGET==0)
), by=deciles][order(-deciles)]
rank$rrate <- round(rank$cnt_resp / rank$cnt,4);
rank$cum_resp <- cumsum(rank$cnt_resp)
rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),4);
rank$cum_rel_non_resp <- round(rank$cum_non_resp / sum(rank$cnt_non_resp),4);
rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100;
rank$rrate <- percent(rank$rrate)
rank$cum_rel_resp <- percent(rank$cum_rel_resp)
rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
rank
## deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp
## 1: 10 1230 324 906 26.3% 324 906
## 2: 8 1212 129 1083 10.6% 453 1989
## 3: 6 3558 331 3227 9.3% 784 5216
## cum_rel_resp cum_rel_non_resp ks
## 1: 41.3% 17.4% 23.96
## 2: 57.8% 38.1% 19.65
## 3: 100.0% 100.0% 0.00
Interpretation:
> The response rate in the top decile is 26.3%.
> The maximum KS is around 24%, indicating a weak model.
KS and Area under Curve
pred <- prediction(cart.test$predict.score[,2], cart.test$TARGET)
perf <- performance(pred, "tpr", "fpr")
KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
auc <- performance(pred,"auc");
auc <- as.numeric(auc@y.values)
gini = ineq(cart.test$predict.score[,2], type="Gini")
with(cart.test, table(TARGET, predict.class))
## predict.class
## TARGET 0 1
## 0 5193 23
## 1 738 46
plot(perf)
KS
## [1] 0.2456805
auc
## [1] 0.6350559
gini
## [1] 0.2192173
CART - Model Performance (Test Dataset):
The KS = 24.6% and the AUC = 63.5%, which indicates that the model is not good.
The Gini Coefficient = 21.9% also indicates that the model is not good.
Confusion matrix:
1. Accuracy = (5193 + 46)/(5193+738+23+46) = 87.32%
2. Classification Error Rate = 1 - Accuracy = 12.68%
#install.packages("ROSE", repos = "http://cran.us.r-project.org")
library(ROSE)
N = 12272*2 # target size: responders oversampled to match the 12,272 non-responders
N
## [1] 24544
cart.train.over <- ovun.sample(TARGET~.,data=cart.train,
method="over",N=24544)$data
table(cart.train.over$TARGET)
##
## 0 1
## 12272 12272
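ROSE also supports undersampling the majority class or mixing both strategies; a sketch of the alternatives (using the class counts from the training tables above):
# Undersample non-responders down to twice the 1,728 responders
cart.train.under <- ovun.sample(TARGET ~ ., data = cart.train,
                                method = "under", N = 1728 * 2, seed = 111)$data
# Or combine over- and under-sampling around a 50:50 mix at the original size
cart.train.both <- ovun.sample(TARGET ~ ., data = cart.train,
                               method = "both", p = 0.5, N = 14000, seed = 111)$data
table(cart.train.under$TARGET)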
Setting the control parameter
r.ctrl = rpart.control(minsplit=100,
minbucket = 20,
cp = 0,
xval = 10)
Building the CART Model on Oversampled Training Dataset
# Exclude Columns Customer ID and Acct Opening Date.
names(cart.train.over)
## [1] "CUST_ID" "TARGET"
## [3] "AGE" "GENDER"
## [5] "BALANCE" "OCCUPATION"
## [7] "AGE_BKT" "SCR"
## [9] "HOLDING_PERIOD" "ACC_TYPE"
## [11] "ACC_OP_DATE" "LEN_OF_RLTN_IN_MNTH"
## [13] "NO_OF_L_CR_TXNS" "NO_OF_L_DR_TXNS"
## [15] "TOT_NO_OF_L_TXNS" "NO_OF_BR_CSH_WDL_DR_TXNS"
## [17] "NO_OF_ATM_DR_TXNS" "NO_OF_NET_DR_TXNS"
## [19] "NO_OF_MOB_DR_TXNS" "NO_OF_CHQ_DR_TXNS"
## [21] "FLG_HAS_CC" "AMT_ATM_DR"
## [23] "AMT_BR_CSH_WDL_DR" "AMT_CHQ_DR"
## [25] "AMT_NET_DR" "AMT_MOB_DR"
## [27] "AMT_L_DR" "FLG_HAS_ANY_CHGS"
## [29] "AMT_OTH_BK_ATM_USG_CHGS" "AMT_MIN_BAL_NMC_CHGS"
## [31] "NO_OF_IW_CHQ_BNC_TXNS" "NO_OF_OW_CHQ_BNC_TXNS"
## [33] "AVG_AMT_PER_ATM_TXN" "AVG_AMT_PER_CSH_WDL_TXN"
## [35] "AVG_AMT_PER_CHQ_TXN" "AVG_AMT_PER_NET_TXN"
## [37] "AVG_AMT_PER_MOB_TXN" "FLG_HAS_NOMINEE"
## [39] "FLG_HAS_OLD_LOAN" "random"
## [41] "predict.class" "predict.score"
## [43] "deciles"
m1 <- rpart(formula = TARGET ~ .,
data = cart.train.over[,-c(1,11,41,42,43)],
method = "class",
control = r.ctrl)
#m1
printcp(m1)
##
## Classification tree:
## rpart(formula = TARGET ~ ., data = cart.train.over[, -c(1, 11,
## 41, 42, 43)], method = "class", control = r.ctrl)
##
## Variables actually used in tree construction:
## [1] ACC_TYPE AGE
## [3] AGE_BKT AMT_ATM_DR
## [5] AMT_BR_CSH_WDL_DR AMT_CHQ_DR
## [7] AMT_L_DR AMT_MOB_DR
## [9] AMT_NET_DR AVG_AMT_PER_ATM_TXN
## [11] AVG_AMT_PER_CHQ_TXN AVG_AMT_PER_CSH_WDL_TXN
## [13] AVG_AMT_PER_NET_TXN BALANCE
## [15] FLG_HAS_ANY_CHGS FLG_HAS_CC
## [17] FLG_HAS_NOMINEE FLG_HAS_OLD_LOAN
## [19] GENDER HOLDING_PERIOD
## [21] LEN_OF_RLTN_IN_MNTH NO_OF_ATM_DR_TXNS
## [23] NO_OF_BR_CSH_WDL_DR_TXNS NO_OF_CHQ_DR_TXNS
## [25] NO_OF_IW_CHQ_BNC_TXNS NO_OF_L_CR_TXNS
## [27] NO_OF_L_DR_TXNS NO_OF_NET_DR_TXNS
## [29] NO_OF_OW_CHQ_BNC_TXNS OCCUPATION
## [31] SCR TOT_NO_OF_L_TXNS
##
## Root node error: 12272/24544 = 0.5
##
## n= 24544
##
## CP nsplit rel error xerror xstd
## 1 1.8106e-01 0 1.00000 1.02591 0.0063809
## 2 2.6239e-02 1 0.81894 0.83572 0.0062963
## 3 1.9883e-02 2 0.79270 0.80256 0.0062574
## 4 1.3174e-02 3 0.77282 0.77705 0.0062224
## 5 1.1367e-02 6 0.73330 0.74389 0.0061701
## 6 1.0308e-02 8 0.71056 0.72849 0.0061433
## 7 8.2301e-03 10 0.68994 0.69712 0.0060832
## 8 6.4782e-03 11 0.68171 0.68970 0.0060680
## 9 6.1930e-03 13 0.66876 0.68326 0.0060544
## 10 4.5361e-03 14 0.66257 0.67031 0.0060261
## 11 4.2373e-03 17 0.64896 0.65271 0.0059857
## 12 3.5039e-03 19 0.64048 0.63722 0.0059482
## 13 3.2595e-03 24 0.61954 0.61514 0.0058914
## 14 3.2323e-03 26 0.61302 0.60495 0.0058638
## 15 3.1780e-03 29 0.60332 0.60373 0.0058605
## 16 3.0557e-03 33 0.58841 0.59884 0.0058469
## 17 3.0150e-03 37 0.57309 0.59037 0.0058229
## 18 2.9878e-03 38 0.57008 0.58336 0.0058026
## 19 2.9335e-03 41 0.56111 0.57863 0.0057887
## 20 2.8928e-03 42 0.55818 0.57668 0.0057829
## 21 2.8113e-03 45 0.54808 0.56454 0.0057461
## 22 2.7705e-03 47 0.54245 0.55590 0.0057191
## 23 2.6890e-03 49 0.53691 0.55386 0.0057126
## 24 2.6076e-03 52 0.52885 0.54319 0.0056781
## 25 2.5261e-03 53 0.52624 0.53382 0.0056470
## 26 2.4853e-03 55 0.52119 0.52893 0.0056304
## 27 2.4446e-03 62 0.49756 0.52306 0.0056103
## 28 2.3631e-03 63 0.49511 0.51263 0.0055736
## 29 2.2816e-03 73 0.46537 0.50301 0.0055389
## 30 2.2545e-03 78 0.45396 0.50196 0.0055351
## 31 2.2409e-03 82 0.44492 0.49641 0.0055146
## 32 2.2001e-03 85 0.43783 0.49641 0.0055146
## 33 2.1186e-03 88 0.42911 0.49047 0.0054923
## 34 1.9964e-03 91 0.42275 0.47906 0.0054485
## 35 1.9557e-03 93 0.41876 0.47531 0.0054338
## 36 1.9149e-03 97 0.41094 0.47376 0.0054277
## 37 1.8742e-03 99 0.40711 0.46855 0.0054070
## 38 1.7927e-03 103 0.39920 0.46545 0.0053945
## 39 1.7384e-03 104 0.39741 0.45681 0.0053593
## 40 1.7112e-03 109 0.38649 0.45331 0.0053447
## 41 1.6705e-03 114 0.37793 0.44524 0.0053108
## 42 1.6297e-03 116 0.37459 0.44043 0.0052902
## 43 1.5890e-03 119 0.36970 0.43530 0.0052679
## 44 1.5482e-03 121 0.36653 0.42699 0.0052312
## 45 1.4668e-03 122 0.36498 0.42080 0.0052033
## 46 1.4260e-03 127 0.35764 0.41387 0.0051716
## 47 1.3853e-03 129 0.35479 0.41354 0.0051701
## 48 1.2223e-03 133 0.34925 0.40238 0.0051178
## 49 1.1951e-03 139 0.34192 0.38828 0.0050495
## 50 1.1408e-03 142 0.33833 0.38095 0.0050129
## 51 1.0593e-03 145 0.33491 0.37744 0.0049952
## 52 9.7784e-04 147 0.33279 0.37036 0.0049589
## 53 8.9635e-04 151 0.32888 0.36685 0.0049407
## 54 8.5561e-04 168 0.31274 0.36294 0.0049201
## 55 8.1486e-04 175 0.30525 0.36278 0.0049193
## 56 7.3338e-04 176 0.30443 0.35927 0.0049007
## 57 6.9263e-04 178 0.30297 0.35528 0.0048793
## 58 6.5189e-04 182 0.29979 0.35292 0.0048666
## 59 5.7040e-04 186 0.29718 0.34998 0.0048506
## 60 5.2966e-04 190 0.29490 0.34958 0.0048484
## 61 4.8892e-04 192 0.29384 0.34892 0.0048448
## 62 4.4817e-04 207 0.28471 0.34868 0.0048435
## 63 4.3459e-04 209 0.28382 0.34762 0.0048377
## 64 4.0743e-04 212 0.28251 0.34754 0.0048372
## 65 3.6669e-04 217 0.28048 0.34672 0.0048327
## 66 2.4446e-04 219 0.27974 0.34306 0.0048124
## 67 2.0372e-04 220 0.27950 0.34208 0.0048070
## 68 1.9013e-04 224 0.27836 0.34249 0.0048093
## 69 1.6297e-04 227 0.27779 0.34224 0.0048079
## 70 8.1486e-05 228 0.27762 0.34192 0.0048061
## 71 5.4324e-05 230 0.27746 0.34241 0.0048088
## 72 4.0743e-05 233 0.27730 0.34241 0.0048088
## 73 0.0000e+00 235 0.27722 0.34265 0.0048102
plotcp(m1)
Pruning the Tree
# cp value chosen from the cp table above
ptree <- prune(m1, cp = 0.00029)
printcp(ptree)
##
## Classification tree:
## rpart(formula = TARGET ~ ., data = cart.train.over[, -c(1, 11,
## 41, 42, 43)], method = "class", control = r.ctrl)
##
## Variables actually used in tree construction:
## [1] ACC_TYPE AGE
## [3] AGE_BKT AMT_ATM_DR
## [5] AMT_BR_CSH_WDL_DR AMT_CHQ_DR
## [7] AMT_L_DR AMT_MOB_DR
## [9] AMT_NET_DR AVG_AMT_PER_ATM_TXN
## [11] AVG_AMT_PER_CHQ_TXN AVG_AMT_PER_CSH_WDL_TXN
## [13] AVG_AMT_PER_NET_TXN BALANCE
## [15] FLG_HAS_ANY_CHGS FLG_HAS_CC
## [17] FLG_HAS_NOMINEE FLG_HAS_OLD_LOAN
## [19] GENDER HOLDING_PERIOD
## [21] LEN_OF_RLTN_IN_MNTH NO_OF_ATM_DR_TXNS
## [23] NO_OF_BR_CSH_WDL_DR_TXNS NO_OF_CHQ_DR_TXNS
## [25] NO_OF_IW_CHQ_BNC_TXNS NO_OF_L_CR_TXNS
## [27] NO_OF_L_DR_TXNS NO_OF_NET_DR_TXNS
## [29] NO_OF_OW_CHQ_BNC_TXNS OCCUPATION
## [31] SCR TOT_NO_OF_L_TXNS
##
## Root node error: 12272/24544 = 0.5
##
## n= 24544
##
## CP nsplit rel error xerror xstd
## 1 0.18106258 0 1.00000 1.02591 0.0063809
## 2 0.02623859 1 0.81894 0.83572 0.0062963
## 3 0.01988266 2 0.79270 0.80256 0.0062574
## 4 0.01317362 3 0.77282 0.77705 0.0062224
## 5 0.01136734 6 0.73330 0.74389 0.0061701
## 6 0.01030802 8 0.71056 0.72849 0.0061433
## 7 0.00823012 10 0.68994 0.69712 0.0060832
## 8 0.00647816 11 0.68171 0.68970 0.0060680
## 9 0.00619296 13 0.66876 0.68326 0.0060544
## 10 0.00453607 14 0.66257 0.67031 0.0060261
## 11 0.00423729 17 0.64896 0.65271 0.0059857
## 12 0.00350391 19 0.64048 0.63722 0.0059482
## 13 0.00325945 24 0.61954 0.61514 0.0058914
## 14 0.00323229 26 0.61302 0.60495 0.0058638
## 15 0.00317797 29 0.60332 0.60373 0.0058605
## 16 0.00305574 33 0.58841 0.59884 0.0058469
## 17 0.00301499 37 0.57309 0.59037 0.0058229
## 18 0.00298783 38 0.57008 0.58336 0.0058026
## 19 0.00293351 41 0.56111 0.57863 0.0057887
## 20 0.00289276 42 0.55818 0.57668 0.0057829
## 21 0.00281128 45 0.54808 0.56454 0.0057461
## 22 0.00277053 47 0.54245 0.55590 0.0057191
## 23 0.00268905 49 0.53691 0.55386 0.0057126
## 24 0.00260756 52 0.52885 0.54319 0.0056781
## 25 0.00252608 53 0.52624 0.53382 0.0056470
## 26 0.00248533 55 0.52119 0.52893 0.0056304
## 27 0.00244459 62 0.49756 0.52306 0.0056103
## 28 0.00236310 63 0.49511 0.51263 0.0055736
## 29 0.00228162 73 0.46537 0.50301 0.0055389
## 30 0.00225445 78 0.45396 0.50196 0.0055351
## 31 0.00224087 82 0.44492 0.49641 0.0055146
## 32 0.00220013 85 0.43783 0.49641 0.0055146
## 33 0.00211864 88 0.42911 0.49047 0.0054923
## 34 0.00199641 91 0.42275 0.47906 0.0054485
## 35 0.00195567 93 0.41876 0.47531 0.0054338
## 36 0.00191493 97 0.41094 0.47376 0.0054277
## 37 0.00187419 99 0.40711 0.46855 0.0054070
## 38 0.00179270 103 0.39920 0.46545 0.0053945
## 39 0.00173837 104 0.39741 0.45681 0.0053593
## 40 0.00171121 109 0.38649 0.45331 0.0053447
## 41 0.00167047 114 0.37793 0.44524 0.0053108
## 42 0.00162973 116 0.37459 0.44043 0.0052902
## 43 0.00158898 119 0.36970 0.43530 0.0052679
## 44 0.00154824 121 0.36653 0.42699 0.0052312
## 45 0.00146675 122 0.36498 0.42080 0.0052033
## 46 0.00142601 127 0.35764 0.41387 0.0051716
## 47 0.00138527 129 0.35479 0.41354 0.0051701
## 48 0.00122229 133 0.34925 0.40238 0.0051178
## 49 0.00119513 139 0.34192 0.38828 0.0050495
## 50 0.00114081 142 0.33833 0.38095 0.0050129
## 51 0.00105932 145 0.33491 0.37744 0.0049952
## 52 0.00097784 147 0.33279 0.37036 0.0049589
## 53 0.00089635 151 0.32888 0.36685 0.0049407
## 54 0.00085561 168 0.31274 0.36294 0.0049201
## 55 0.00081486 175 0.30525 0.36278 0.0049193
## 56 0.00073338 176 0.30443 0.35927 0.0049007
## 57 0.00069263 178 0.30297 0.35528 0.0048793
## 58 0.00065189 182 0.29979 0.35292 0.0048666
## 59 0.00057040 186 0.29718 0.34998 0.0048506
## 60 0.00052966 190 0.29490 0.34958 0.0048484
## 61 0.00048892 192 0.29384 0.34892 0.0048448
## 62 0.00044817 207 0.28471 0.34868 0.0048435
## 63 0.00043459 209 0.28382 0.34762 0.0048377
## 64 0.00040743 212 0.28251 0.34754 0.0048372
## 65 0.00036669 217 0.28048 0.34672 0.0048327
## 66 0.00029000 219 0.27974 0.34306 0.0048124
fancyRpartPlot(ptree,
uniform = TRUE,
main = "Final Tree",
palettes = c("Blues", "Oranges"))
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
Measure CART Model Performance (Oversampled Training Dataset)
cart.train.over$predict.class <- predict(ptree, cart.train.over, type = "class")
cart.train.over$predict.score <- predict(ptree, cart.train.over, type = "prob")
#head(cart.train.over)
Deciling Code
decile <- function(x)
{
deciles <- vector(length=10)
for (i in seq(0.1,1,.1))
{
deciles[i*10] <- quantile(x, i, na.rm=T)
}
return (
ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10
))))))))))
}
#class(cart.train.over$predict.score)
## deciling
cart.train.over$deciles <- decile(cart.train.over$predict.score[,2])
#View(cart.train.over)
Ranking Code
tmp_DT = data.table(cart.train.over)
rank <- tmp_DT[, list(cnt=length(TARGET),
cnt_resp=sum(TARGET==1),
cnt_non_resp=sum(TARGET==0)
), by=deciles][order(-deciles)]
rank$rrate <- round(rank$cnt_resp / rank$cnt,4);
rank$cum_resp <- cumsum(rank$cnt_resp)
rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),4);
rank$cum_rel_non_resp <- round(rank$cum_non_resp / sum(rank$cnt_non_resp),4);
rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100;
rank$rrate <- percent(rank$rrate)
rank$cum_rel_resp <- percent(rank$cum_rel_resp)
rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
rank
## deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp
## 1: 10 2463 2318 145 94.1% 2318 145
## 2: 9 2644 2384 260 90.2% 4702 405
## 3: 8 2346 2009 337 85.6% 6711 742
## 4: 7 2530 2085 445 82.4% 8796 1187
## 5: 6 2334 1750 584 75.0% 10546 1771
## 6: 5 2413 976 1437 40.4% 11522 3208
## 7: 4 2668 451 2217 16.9% 11973 5425
## 8: 3 2297 218 2079 9.5% 12191 7504
## 9: 2 2489 81 2408 3.2% 12272 9912
## 10: 1 2360 0 2360 0.0% 12272 12272
## cum_rel_resp cum_rel_non_resp ks
## 1: 18.9% 1.2% 17.71
## 2: 38.3% 3.3% 35.01
## 3: 54.7% 6.0% 48.64
## 4: 71.7% 9.7% 62.01
## 5: 85.9% 14.4% 71.51
## 6: 93.9% 26.1% 67.75
## 7: 97.6% 44.2% 53.35
## 8: 99.3% 61.2% 38.19
## 9: 100.0% 80.8% 19.23
## 10: 100.0% 100.0% 0.00
Interpretation from the rank order table:
The response rate in the top five deciles is 75% or higher.
The maximum KS is above 71%, indicating a good model.
KS and Area under Curve
pred <- prediction(cart.train.over$predict.score[,2], cart.train.over$TARGET)
perf <- performance(pred, "tpr", "fpr")
KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
auc <- performance(pred,"auc");
auc <- as.numeric(auc@y.values)
gini = ineq(cart.train.over$predict.score[,2], type="Gini")
with(cart.train.over, table(TARGET, predict.class))
## predict.class
## TARGET 0 1
## 0 10235 2037
## 1 1396 10876
plot(perf)
KS
## [1] 0.7202575
auc
## [1] 0.9162399
gini
## [1] 0.4162399
Summary - CART Model Performance (Oversampled Training Dataset):
The KS = 72.03% and the AUC = 91.62%, indicating a good model.
The Gini Coefficient = 41.62% also indicates a good model, with scope for improvement.
Confusion matrix:
1. Accuracy = (10235 + 10876)/(10235 + 1396 + 2037 + 10876) = 86.01%
2. Classification Error Rate = 1 - Accuracy = 13.99%
#Syntax to get the node path
#tree.path<-path.rpart(ptree,node =c(2,12))
#nrow(cart.test)
Scoring Holdout Sample
cart.test$predict.class <- predict(ptree, cart.test, type = "class")
cart.test$predict.score <- predict(ptree, cart.test, type = "prob")
cart.test$deciles <- decile(cart.test$predict.score[,2])
#View(cart.test)
Ranking Code
tmp_DT = data.table(cart.test)
h_rank <- tmp_DT[, list(cnt=length(TARGET),
cnt_resp=sum(TARGET==1),
cnt_non_resp=sum(TARGET==0)
), by=deciles][order(-deciles)]
h_rank$rrate <- round(h_rank$cnt_resp / h_rank$cnt,4);
h_rank$cum_resp <- cumsum(h_rank$cnt_resp)
h_rank$cum_non_resp <- cumsum(h_rank$cnt_non_resp)
h_rank$cum_rel_resp <- round(h_rank$cum_resp / sum(h_rank$cnt_resp),4);
h_rank$cum_rel_non_resp <- round(h_rank$cum_non_resp / sum(h_rank$cnt_non_resp),4);
h_rank$ks <- abs(h_rank$cum_rel_resp - h_rank$cum_rel_non_resp) * 100;
h_rank$rrate <- percent(h_rank$rrate)
h_rank$cum_rel_resp <- percent(h_rank$cum_rel_resp)
h_rank$cum_rel_non_resp <- percent(h_rank$cum_rel_non_resp)
h_rank
## deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp
## 1: 10 637 270 367 42.4% 270 367
## 2: 9 605 199 406 32.9% 469 773
## 3: 8 573 108 465 18.8% 577 1238
## 4: 7 621 66 555 10.6% 643 1793
## 5: 6 664 33 631 5.0% 676 2424
## 6: 5 519 41 478 7.9% 717 2902
## 7: 4 759 26 733 3.4% 743 3635
## 8: 3 563 9 554 1.6% 752 4189
## 9: 2 1059 32 1027 3.0% 784 5216
## cum_rel_resp cum_rel_non_resp ks
## 1: 34.4% 7.0% 27.40
## 2: 59.8% 14.8% 45.00
## 3: 73.6% 23.7% 49.87
## 4: 82.0% 34.4% 47.64
## 5: 86.2% 46.5% 39.75
## 6: 91.4% 55.6% 35.81
## 7: 94.8% 69.7% 25.08
## 8: 95.9% 80.3% 15.61
## 9: 100.0% 100.0% 0.00
Interpretation from the rank order table: The response rate in the top three deciles is 42.4%, 32.9% and 18.8% respectively.
Within the top three deciles, the KS peaks at about 49.9%, indicating a good model.
KS and Area under Curve
pred <- prediction(cart.test$predict.score[,2], cart.test$TARGET)
perf <- performance(pred, "tpr", "fpr")
KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
auc <- performance(pred,"auc");
auc <- as.numeric(auc@y.values)
gini = ineq(cart.test$predict.score[,2], type="Gini")
with(cart.test, table(TARGET, predict.class))
## predict.class
## TARGET 0 1
## 0 4228 988
## 1 255 529
plot(perf)
KS
## [1] 0.5004147
auc
## [1] 0.8027087
gini
## [1] 0.5933811
Summary - CART Model Performance (Test Dataset):
The KS = 50.04% and the AUC = 80.27%, indicating a good model.
The Gini Coefficient = 59.34% also indicates a good model, with scope for improvement.
Confusion matrix:
1. Accuracy = (4228 + 529)/(4228 + 255 + 988 + 529) = 79.28%
2. Classification Error Rate = 1 - Accuracy = 20.72%
Comparative Summary of the CART Model on the Training and Testing Datasets:
Measures | Train | Test | %Deviation |
---|---|---|---|
KS | 72.03 | 50.04 | 31% |
AUC | 91.62 | 80.27 | 12% |
Gini | 41.62 | 59.34 | -43% |
Accuracy | 86.01 | 79.28 | 8% |
CeR | 13.99 | 20.72 | -48% |
CeR = Classification Error Rate
Most of the model performance measures deviate between the training and testing sets by more than the maximum tolerance of +/- 10%. Hence, the model is over-fitting.
That said, the absolute values of the performance measures still indicate reasonable predictive capability for the developed CART model.
Random Forest is a supervised classification algorithm. As the name suggests, it builds a forest of many randomly grown trees. In general, the more trees in the forest, the more robust it is; in a random forest classifier, a higher number of trees tends to yield more accurate results.
Some advantages of using Random Forest are as follows:
> The same random forest algorithm can be used for both classification and regression tasks.
> The random forest classifier can handle missing values.
> With more trees in the forest, the random forest classifier is unlikely to over-fit the model.
> The random forest classifier can also model categorical variables.
Creating Training and Testing Dataset for RF Model
PL_X_SELL = read.table("D:/Analytics/BACP-Dec2017/10_MachineLearning/PL_X_SELL.csv", sep = ",", header = T)
attach(PL_X_SELL)
set.seed(111)
trainIndex <- createDataPartition(TARGET,
p=0.7,
list = FALSE,
times = 1)
rf.train <- PL_X_SELL[trainIndex, ]
rf.test <- PL_X_SELL[-trainIndex,]
dim(rf.train)
## [1] 14000 40
dim(rf.test)
## [1] 6000 40
#names(rf.train)
#str(rf.train)
#install.packages("randomForest")
library(randomForest)
Random Forest Model - Train Dataset
The model is built with TARGET as the dependent variable, using all independent variables except Customer ID and Account Opening Date.
RF=randomForest(as.factor(TARGET)~.,
data = rf.train[,-c(1,11)],
ntree = 501, mtry = 3, nodesize = 10,
importance=TRUE)
print(RF)
##
## Call:
## randomForest(formula = as.factor(TARGET) ~ ., data = rf.train[, -c(1, 11)], ntree = 501, mtry = 3, nodesize = 10, importance = TRUE)
## Type of random forest: classification
## Number of trees: 501
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 7.75%
## Confusion matrix:
## 0 1 class.error
## 0 12271 1 8.148631e-05
## 1 1084 644 6.273148e-01
Out of Bag Error Rate:
The Random Forest algorithm is a classifier based primarily on two methods: bagging and the random subspace method.
Suppose we decide to have S trees in our forest. We first create S datasets of the same size as the original by randomly resampling the data with replacement. Each of these datasets is called a bootstrap dataset.
Due to sampling with replacement, every bootstrap dataset can contain duplicate records and, at the same time, can be missing several records from the original dataset. This is called bagging.
The algorithm uses m (= sqrt(M)) random sub-features out of the M possible features to create each tree. This is called the random subspace method.
After creating the S classifier trees, for each record there is a subset of trees whose bootstrap datasets do not contain that record; these are the out-of-bag examples for that record. There are n such subsets (one for each record in the original dataset T). The OOB classifier is the aggregation of predictions over all such trees.
The out-of-bag estimate of the generalization error is the error rate of the out-of-bag classifier on the training set (compared against the known yi's).
Out-of-bag (OOB) error, also called the out-of-bag estimate, is a method of measuring the prediction error of random forests, boosted decision trees, and other machine learning models that use bootstrap aggregation to subsample the training data.
Out-of-bag estimates avoid the need for an independent validation dataset.
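Because every record has trees that never saw it, the OOB error can be verified directly from the fitted forest; a sketch (predict() on a randomForest object with no newdata returns the OOB predictions):
# OOB predictions: each record is predicted only by trees that did not train on it
oob.pred <- predict(RF)
mean(oob.pred != as.factor(rf.train$TARGET)) # should match the printed OOB error (~7.75%)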
The graphical output for the OOB estimate of error rate.
#dev.off()
plot(RF, main="")
legend("topright", c("OOB", "0", "1"), text.col=1:6, lty=1:3, col=1:3)
title(main="Error Rates Random Forest PL_X_SELL Training data")
The output in tabular form for the OOB estimate of error rate.
#RF$err.rate
It is observed that as the number of trees increases, the OOB error rate decreases until around the 155th tree, where OOB = 0.0757 (the minimum value). Beyond this point the OOB error does not decrease further and remains largely steady. Hence, the optimal number of trees would be around 155.
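The tree count at which the OOB error bottoms out can be read off err.rate programmatically; a sketch (err.rate holds one row per tree, with columns OOB, 0 and 1):
# Number of trees at the minimum cumulative OOB error
which.min(RF$err.rate[, "OOB"]) # ~155, per the observation above
min(RF$err.rate[, "OOB"]) # ~0.0757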
Variable Importance
To understand the important variables in Random Forest, the following measures are generally used:
> Mean Decrease in Accuracy is based on permutation:
>> Randomly permute the values of the variable whose importance is to be computed in the OOB sample
>> Compute the error rate with the permuted values
>> Compute the decrease in OOB error rate (permuted - not permuted)
>> Average the decrease over all the trees
> Mean Decrease in Gini is computed as the "total decrease in node impurities from splitting on the variable, averaged over all trees"
# List the importance of the variables
impVar <- round(randomForest::importance(RF), 2)
impVar[order(impVar[,3], decreasing=TRUE),]
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## LEN_OF_RLTN_IN_MNTH 34.95 51.92 50.20 98.80
## AGE_BKT 35.40 47.16 46.81 88.60
## OCCUPATION 38.57 43.34 45.44 66.11
## BALANCE 35.40 44.26 43.18 124.62
## SCR 34.33 43.85 42.03 116.92
## AGE 31.46 39.66 41.08 73.73
## AMT_L_DR 32.93 31.95 39.27 105.82
## NO_OF_L_CR_TXNS 32.00 28.38 37.55 93.59
## HOLDING_PERIOD 29.40 38.55 36.72 98.03
## FLG_HAS_CC 30.26 34.22 34.61 30.22
## TOT_NO_OF_L_TXNS 28.20 22.28 33.42 95.35
## AMT_BR_CSH_WDL_DR 26.38 32.95 32.48 86.82
## AVG_AMT_PER_CSH_WDL_TXN 25.01 33.86 31.92 86.31
## AMT_ATM_DR 23.60 22.21 28.24 82.45
## AVG_AMT_PER_CHQ_TXN 22.65 21.94 27.10 75.33
## NO_OF_BR_CSH_WDL_DR_TXNS 19.35 28.20 24.86 44.06
## FLG_HAS_OLD_LOAN 17.95 24.99 24.86 13.71
## AVG_AMT_PER_ATM_TXN 20.31 23.13 24.75 80.67
## AMT_CHQ_DR 21.10 24.95 24.52 75.79
## NO_OF_L_DR_TXNS 21.75 23.99 24.05 58.82
## GENDER 20.35 21.13 22.56 21.71
## AVG_AMT_PER_NET_TXN 19.03 20.92 22.51 61.53
## AMT_NET_DR 18.38 22.15 21.65 61.92
## NO_OF_CHQ_DR_TXNS 18.40 21.46 20.66 36.37
## FLG_HAS_ANY_CHGS 14.98 21.39 20.40 12.47
## AMT_MOB_DR 15.95 14.57 20.13 39.82
## FLG_HAS_NOMINEE 12.28 19.08 19.20 9.88
## AVG_AMT_PER_MOB_TXN 14.34 15.96 18.22 37.73
## NO_OF_ATM_DR_TXNS 17.11 6.50 17.80 26.76
## NO_OF_IW_CHQ_BNC_TXNS 7.26 17.83 16.71 7.67
## NO_OF_NET_DR_TXNS 14.04 18.44 16.41 21.32
## ACC_TYPE 14.21 13.34 16.30 12.67
## NO_OF_OW_CHQ_BNC_TXNS 8.96 17.20 15.91 8.38
## AMT_MIN_BAL_NMC_CHGS 5.50 9.75 10.03 2.79
## NO_OF_MOB_DR_TXNS 7.75 8.93 8.79 9.70
## AMT_OTH_BK_ATM_USG_CHGS 2.00 3.07 3.69 1.29
## random -0.86 -0.48 -0.99 58.39
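The same two importance measures can also be visualised with randomForest's built-in helper; a sketch:
# Dotcharts of MeanDecreaseAccuracy and MeanDecreaseGini for the top variables
randomForest::varImpPlot(RF, sort = TRUE, n.var = 15,
                         main = "Variable Importance - RF")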
Optimal mtry value
In random forests, the number of variables available for splitting at each tree node is referred to as the mtry parameter. The optimal number of variables is obtained using the tuneRF function, whose main arguments are:
x = predictor variables
y = target variable
mtryStart = starting value of mtry
ntreeTry = number of trees used at the tuning step
stepFactor = factor by which mtry is inflated (or deflated) at each step
improve = the relative improvement in OOB error must be at least this much for the search to continue
trace = whether to print the search progress
plot = whether to plot the OOB error vs mtry
doBest = whether to finally build the RF using the optimal mtry
nodesize = minimum terminal node size
importance = whether to compute variable importance
#Tuning Random Forest
tRF<- tuneRF(x = rf.train[,-c(1,2,11)],
y=as.factor(rf.train$TARGET),
mtryStart = 6, # approx. sqrt of total no. of variables
ntreeTry = 155,
stepFactor = 1.5,
improve = 0.0001,
trace = TRUE,
plot = TRUE,
doBest = TRUE,
nodesize = 10,
importance = TRUE
)
## mtry = 6 OOB error = 6.44%
## Searching left ...
## mtry = 4 OOB error = 7.1%
## -0.1032186 1e-04
## Searching right ...
## mtry = 9 OOB error = 6.14%
## 0.04661487 1e-04
## mtry = 13 OOB error = 5.9%
## 0.03841676 1e-04
## mtry = 19 OOB error = 5.55%
## 0.05932203 1e-04
## mtry = 28 OOB error = 5.54%
## 0.002574003 1e-04
## mtry = 37 OOB error = 5.36%
## 0.03225806 1e-04
As can be seen from the trace, the OOB error is lowest (5.36%) at mtry = 37; since doBest = TRUE, the returned forest is built with this optimal mtry.
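Since doBest = TRUE, the object returned by tuneRF is itself a fitted randomForest using the best mtry; printing it confirms the settings actually used (output not reproduced here):
# tRF is a randomForest object refit with the optimal mtry
print(tRF)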
Scoring
rf.train$predict.class <- predict(tRF, rf.train, type = "class")
rf.train$predict.score <- predict(tRF, rf.train, type = "prob")
#head(rf.train)
#class(rf.train$predict.score)
Deciling code
decile <- function(x)
{
deciles <- vector(length=10)
for (i in seq(0.1,1,.1)){
deciles[i*10] <- quantile(x, i, na.rm=T)
}
return (
ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10
))))))))))
}
rf.train$deciles <- decile(rf.train$predict.score[,2])
Rank order table
library(data.table)
tmp_DT = data.table(rf.train)
rank <- tmp_DT[, list(
cnt = length(TARGET),
cnt_resp = sum(TARGET==1),
cnt_non_resp = sum(TARGET == 0)) ,
by=deciles][order(-deciles)]
rank$rrate <- round (rank$cnt_resp / rank$cnt,2);
rank$cum_resp <- cumsum(rank$cnt_resp)
rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),2);
rank$cum_rel_non_resp <- round(rank$cum_non_resp / sum(rank$cnt_non_resp),2);
rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp); library(scales)
rank$rrate <- percent(rank$rrate)
rank$cum_rel_resp <- percent(rank$cum_rel_resp)
rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
rank
## deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp
## 1: 10 1404 1404 0 100% 1404 0
## 2: 9 1435 324 1111 23% 1728 1111
## 3: 8 1431 0 1431 0% 1728 2542
## 4: 7 1402 0 1402 0% 1728 3944
## 5: 6 1446 0 1446 0% 1728 5390
## 6: 5 1522 0 1522 0% 1728 6912
## 7: 4 1493 0 1493 0% 1728 8405
## 8: 3 1495 0 1495 0% 1728 9900
## 9: 2 1327 0 1327 0% 1728 11227
## 10: 1 1045 0 1045 0% 1728 12272
## cum_rel_resp cum_rel_non_resp ks
## 1: 81% 0% 0.81
## 2: 100% 9% 0.91
## 3: 100% 21% 0.79
## 4: 100% 32% 0.68
## 5: 100% 44% 0.56
## 6: 100% 56% 0.44
## 7: 100% 68% 0.32
## 8: 100% 81% 0.19
## 9: 100% 91% 0.09
## 10: 100% 100% 0.00
# Baseline Response Rate
sum(rf.train$TARGET) / nrow(rf.train)
## [1] 0.1234286
Interpretation:
> The baseline response rate is 12.34%, whereas the response rate in the top two deciles is 100% and 23% respectively.
> Within the top two deciles, the KS peaks at 91%, indicating a good model.
KS and Area under Curve
library(ROCR)
pred <- prediction(rf.train$predict.score[,2], rf.train$TARGET)
perf <- performance(pred, "tpr", "fpr")
#plot(perf)
KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
KS
## [1] 0.9975471
# Area Under Curve
auc <- performance(pred,"auc");
auc <- as.numeric(auc@y.values)
auc
## [1] 0.9999683
# Gini Coefficient
library(ineq)
gini = ineq(rf.train$predict.score[,2], type="Gini")
gini
## [1] 0.7485589
# Classification Error
with(rf.train, table(TARGET, predict.class))
## predict.class
## TARGET 0 1
## 0 12270 2
## 1 171 1557
KS = 99.75%, AUC = 99.99%, Gini = 74.85%, Accuracy = 98.76%, CeR = 1.24%
Model Performance on Testing Data Set
# Rank Order
# Scoring syntax
rf.test$predict.class <- predict(tRF, rf.test, type="class")
rf.test$predict.score <- predict(tRF, rf.test, type="prob")
rf.test$deciles <- decile(rf.test$predict.score[,2])
tmp_DT = data.table(rf.test)
h_rank <- tmp_DT[, list(
cnt = length(TARGET),
cnt_resp = sum(TARGET),
cnt_non_resp = sum(TARGET == 0)) ,
by=deciles][order(-deciles)]
h_rank$rrate <- round (h_rank$cnt_resp / h_rank$cnt,2);
h_rank$cum_resp <- cumsum(h_rank$cnt_resp)
h_rank$cum_non_resp <- cumsum(h_rank$cnt_non_resp)
h_rank$cum_rel_resp <- round(h_rank$cum_resp / sum(h_rank$cnt_resp),2);
h_rank$cum_rel_non_resp <- round(h_rank$cum_non_resp / sum(h_rank$cnt_non_resp),2);
h_rank$ks <- abs(h_rank$cum_rel_resp - h_rank$cum_rel_non_resp);
library(scales)
h_rank$rrate <- percent(h_rank$rrate)
h_rank$cum_rel_resp <- percent(h_rank$cum_rel_resp)
h_rank$cum_rel_non_resp <- percent(h_rank$cum_rel_non_resp)
h_rank
## deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp
## 1: 10 602 542 60 90% 542 60
## 2: 9 607 139 468 23% 681 528
## 3: 8 591 51 540 9% 732 1068
## 4: 7 627 27 600 4% 759 1668
## 5: 6 592 10 582 2% 769 2250
## 6: 5 592 12 580 2% 781 2830
## 7: 4 657 1 656 0% 782 3486
## 8: 3 647 0 647 0% 782 4133
## 9: 2 616 0 616 0% 782 4749
## 10: 1 469 2 467 0% 784 5216
## cum_rel_resp cum_rel_non_resp ks
## 1: 69% 1% 0.68
## 2: 87% 10% 0.77
## 3: 93% 20% 0.73
## 4: 97% 32% 0.65
## 5: 98% 43% 0.55
## 6: 100% 54% 0.46
## 7: 100% 67% 0.33
## 8: 100% 79% 0.21
## 9: 100% 91% 0.09
## 10: 100% 100% 0.00
Interpretation:
> The baseline response rate is 12.34%, whereas the response rate in the top three deciles is 90%, 23% and 9% respectively.
> Within the top two deciles, the KS peaks at 77%, indicating a good model.
KS and AUC
pred1 <- prediction(rf.test$predict.score[,2], rf.test$TARGET)
perf1 <- performance(pred1, "tpr", "fpr")
plot(perf1)
KS1 <- max(attr(perf1, 'y.values')[[1]]-attr(perf1, 'x.values')[[1]])
KS1
## [1] 0.774446
# Area Under Curve
auc1 <- performance(pred1,"auc");
auc1 <- as.numeric(auc1@y.values)
auc1
## [1] 0.9584023
# Gini Coefficient
library(ineq)
gini1 = ineq(rf.test$predict.score[,2], type="Gini")
gini1
## [1] 0.6463728
# Classification Error
with(rf.test, table(TARGET, predict.class))
## predict.class
## TARGET 0 1
## 0 5211 5
## 1 343 441
KS = 77.44%, AUC = 95.84%, Gini = 64.63%, Accuracy = 94.20%, CeR = 5.80%
Random Forest Conclusion
Measures | Train | Test | %Deviation |
---|---|---|---|
KS | 99.75% | 77.44% | 22% |
AUC | 99.99% | 95.84% | 04% |
Gini | 74.85% | 64.63% | 14% |
Accuracy | 98.76% | 94.20% | 05% |
CeR | 01.24% | 05.80% | -368% |
Artificial neural networks (ANNs) are statistical models directly inspired by, and partially modeled on, biological neural networks. They are capable of modeling and processing nonlinear relationships between inputs and outputs in parallel.
Artificial neural networks are characterized by adaptive weights along the paths between neurons, which are tuned by a learning algorithm that learns from observed data in order to improve the model. In addition to the learning algorithm itself, one must choose an appropriate cost function.
The cost function is used to learn the optimal solution to the problem being solved. This involves determining the best values for all of the tunable model parameters, with the adaptive weights on neuron paths being the primary target, along with algorithm tuning parameters such as the learning rate. The optimization is usually done through techniques such as gradient descent or stochastic gradient descent.
These optimization techniques try to bring the ANN solution as close as possible to the optimal solution; when this succeeds, the ANN is able to solve the intended problem with high performance.
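As a toy illustration of how such an optimizer tunes a weight (a hedged sketch, not part of the original analysis: one weight, one observation, squared-error cost):
# Minimise J(w) = (yi - w*xi)^2 for a single observation by gradient descent
xi <- 2; yi <- 3
w <- 0     # initial weight
lr <- 0.05 # learning rate
for (i in 1:50) {
  grad <- -2 * xi * (yi - w * xi) # dJ/dw
  w <- w - lr * grad              # step against the gradient
}
w # converges towards the optimum yi/xi = 1.5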
NNInput = read.table("D:/Analytics/BACP-Dec2017/10_MachineLearning/PL_X_SELL.csv", sep = ",", header = T)
attach(NNInput)
## The following objects are masked from PL_X_SELL (pos = 3):
##
## ACC_OP_DATE, ACC_TYPE, AGE, AGE_BKT, AMT_ATM_DR,
## AMT_BR_CSH_WDL_DR, AMT_CHQ_DR, AMT_L_DR, AMT_MIN_BAL_NMC_CHGS,
## AMT_MOB_DR, AMT_NET_DR, AMT_OTH_BK_ATM_USG_CHGS,
## AVG_AMT_PER_ATM_TXN, AVG_AMT_PER_CHQ_TXN,
## AVG_AMT_PER_CSH_WDL_TXN, AVG_AMT_PER_MOB_TXN,
## AVG_AMT_PER_NET_TXN, BALANCE, CUST_ID, FLG_HAS_ANY_CHGS,
## FLG_HAS_CC, FLG_HAS_NOMINEE, FLG_HAS_OLD_LOAN, GENDER,
## HOLDING_PERIOD, LEN_OF_RLTN_IN_MNTH, NO_OF_ATM_DR_TXNS,
## NO_OF_BR_CSH_WDL_DR_TXNS, NO_OF_CHQ_DR_TXNS,
## NO_OF_IW_CHQ_BNC_TXNS, NO_OF_L_CR_TXNS, NO_OF_L_DR_TXNS,
## NO_OF_MOB_DR_TXNS, NO_OF_NET_DR_TXNS, NO_OF_OW_CHQ_BNC_TXNS,
## OCCUPATION, random, SCR, TARGET, TOT_NO_OF_L_TXNS
## The following objects are masked from PL_X_SELL (pos = 13):
##
## ACC_OP_DATE, ACC_TYPE, AGE, AGE_BKT, AMT_ATM_DR,
## AMT_BR_CSH_WDL_DR, AMT_CHQ_DR, AMT_L_DR, AMT_MIN_BAL_NMC_CHGS,
## AMT_MOB_DR, AMT_NET_DR, AMT_OTH_BK_ATM_USG_CHGS,
## AVG_AMT_PER_ATM_TXN, AVG_AMT_PER_CHQ_TXN,
## AVG_AMT_PER_CSH_WDL_TXN, AVG_AMT_PER_MOB_TXN,
## AVG_AMT_PER_NET_TXN, BALANCE, CUST_ID, FLG_HAS_ANY_CHGS,
## FLG_HAS_CC, FLG_HAS_NOMINEE, FLG_HAS_OLD_LOAN, GENDER,
## HOLDING_PERIOD, LEN_OF_RLTN_IN_MNTH, NO_OF_ATM_DR_TXNS,
## NO_OF_BR_CSH_WDL_DR_TXNS, NO_OF_CHQ_DR_TXNS,
## NO_OF_IW_CHQ_BNC_TXNS, NO_OF_L_CR_TXNS, NO_OF_L_DR_TXNS,
## NO_OF_MOB_DR_TXNS, NO_OF_NET_DR_TXNS, NO_OF_OW_CHQ_BNC_TXNS,
## OCCUPATION, random, SCR, TARGET, TOT_NO_OF_L_TXNS
dim(NNInput)
## [1] 20000 40
str(NNInput)
## 'data.frame': 20000 obs. of 40 variables:
## $ CUST_ID : Factor w/ 20000 levels "C1","C10","C100",..: 17699 16532 11027 17984 2363 11747 18115 15556 15216 12494 ...
## $ TARGET : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AGE : int 27 47 40 53 36 42 30 53 42 30 ...
## $ GENDER : Factor w/ 3 levels "F","M","O": 2 2 2 2 2 1 2 1 1 2 ...
## $ BALANCE : num 3384 287489 18217 71720 1671623 ...
## $ OCCUPATION : Factor w/ 4 levels "PROF","SAL","SELF-EMP",..: 3 2 3 2 1 1 1 2 3 1 ...
## $ AGE_BKT : Factor w/ 7 levels "<25",">50","26-30",..: 3 7 5 2 5 6 3 2 6 3 ...
## $ SCR : int 776 324 603 196 167 493 479 562 105 170 ...
## $ HOLDING_PERIOD : int 30 28 2 13 24 26 14 25 15 13 ...
## $ ACC_TYPE : Factor w/ 2 levels "CA","SA": 2 2 2 1 2 2 2 1 2 2 ...
## $ ACC_OP_DATE : Factor w/ 4869 levels "01-01-00","01-01-01",..: 3270 1806 3575 993 2861 862 4533 3160 257 334 ...
## $ LEN_OF_RLTN_IN_MNTH : int 146 104 61 107 185 192 177 99 88 111 ...
## $ NO_OF_L_CR_TXNS : int 7 8 10 36 20 5 6 14 18 14 ...
## $ NO_OF_L_DR_TXNS : int 3 2 5 14 1 2 6 3 14 8 ...
## $ TOT_NO_OF_L_TXNS : int 10 10 15 50 21 7 12 17 32 22 ...
## $ NO_OF_BR_CSH_WDL_DR_TXNS: int 0 0 1 4 1 1 0 3 6 3 ...
## $ NO_OF_ATM_DR_TXNS : int 1 1 1 2 0 1 1 0 2 1 ...
## $ NO_OF_NET_DR_TXNS : int 2 1 1 3 0 0 1 0 4 0 ...
## $ NO_OF_MOB_DR_TXNS : int 0 0 0 1 0 0 0 0 1 0 ...
## $ NO_OF_CHQ_DR_TXNS : int 0 0 2 4 0 0 4 0 1 4 ...
## $ FLG_HAS_CC : int 0 0 0 0 0 1 0 0 1 0 ...
## $ AMT_ATM_DR : int 13100 6600 11200 26100 0 18500 6200 0 35400 18000 ...
## $ AMT_BR_CSH_WDL_DR : int 0 0 561120 673590 808480 379310 0 945160 198430 869880 ...
## $ AMT_CHQ_DR : int 0 0 49320 60780 0 0 10580 0 51490 32610 ...
## $ AMT_NET_DR : num 973557 799813 997570 741506 0 ...
## $ AMT_MOB_DR : int 0 0 0 71388 0 0 0 0 170332 0 ...
## $ AMT_L_DR : num 986657 806413 1619210 1573364 808480 ...
## $ FLG_HAS_ANY_CHGS : int 0 1 1 0 0 0 1 0 0 0 ...
## $ AMT_OTH_BK_ATM_USG_CHGS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AMT_MIN_BAL_NMC_CHGS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NO_OF_IW_CHQ_BNC_TXNS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NO_OF_OW_CHQ_BNC_TXNS : int 0 0 1 0 0 0 0 0 0 0 ...
## $ AVG_AMT_PER_ATM_TXN : num 13100 6600 11200 13050 0 ...
## $ AVG_AMT_PER_CSH_WDL_TXN : num 0 0 561120 168398 808480 ...
## $ AVG_AMT_PER_CHQ_TXN : num 0 0 24660 15195 0 ...
## $ AVG_AMT_PER_NET_TXN : num 486779 799813 997570 247169 0 ...
## $ AVG_AMT_PER_MOB_TXN : num 0 0 0 71388 0 ...
## $ FLG_HAS_NOMINEE : int 1 1 1 1 1 1 0 1 1 0 ...
## $ FLG_HAS_OLD_LOAN : int 1 0 1 0 0 1 1 1 1 0 ...
## $ random : num 1.14e-05 1.11e-04 1.20e-04 1.37e-04 1.74e-04 ...
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 3.4.4
##
## Attaching package: 'neuralnet'
## The following object is masked from 'package:ROCR':
##
## prediction
Create Dummy Variables for Categorical Variables
# Gender
GEN.matrix <- model.matrix(~ GENDER - 1, data = NNInput)
NNInput <- data.frame(NNInput, GEN.matrix)
# Occupation
occ.matrix <- model.matrix(~ OCCUPATION - 1, data = NNInput)
NNInput <- data.frame(NNInput, occ.matrix)
# AGE_BKT
AGEBKT.matrix <- model.matrix(~ AGE_BKT - 1, data = NNInput)
NNInput <- data.frame(NNInput, AGEBKT.matrix)
# ACC_TYPE
ACCTYP.matrix <- model.matrix(~ ACC_TYPE - 1, data = NNInput)
NNInput <- data.frame(NNInput, ACCTYP.matrix)
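For reference, caret (already loaded) offers a more compact route to the same one-hot encoding in a single call; a sketch, left commented out so the columns created above are not duplicated (column names may differ slightly from the model.matrix versions):
# dv <- dummyVars(~ GENDER + OCCUPATION + AGE_BKT + ACC_TYPE, data = NNInput)
# NNInput <- data.frame(NNInput, predict(dv, newdata = NNInput))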
names(NNInput)
## [1] "CUST_ID" "TARGET"
## [3] "AGE" "GENDER"
## [5] "BALANCE" "OCCUPATION"
## [7] "AGE_BKT" "SCR"
## [9] "HOLDING_PERIOD" "ACC_TYPE"
## [11] "ACC_OP_DATE" "LEN_OF_RLTN_IN_MNTH"
## [13] "NO_OF_L_CR_TXNS" "NO_OF_L_DR_TXNS"
## [15] "TOT_NO_OF_L_TXNS" "NO_OF_BR_CSH_WDL_DR_TXNS"
## [17] "NO_OF_ATM_DR_TXNS" "NO_OF_NET_DR_TXNS"
## [19] "NO_OF_MOB_DR_TXNS" "NO_OF_CHQ_DR_TXNS"
## [21] "FLG_HAS_CC" "AMT_ATM_DR"
## [23] "AMT_BR_CSH_WDL_DR" "AMT_CHQ_DR"
## [25] "AMT_NET_DR" "AMT_MOB_DR"
## [27] "AMT_L_DR" "FLG_HAS_ANY_CHGS"
## [29] "AMT_OTH_BK_ATM_USG_CHGS" "AMT_MIN_BAL_NMC_CHGS"
## [31] "NO_OF_IW_CHQ_BNC_TXNS" "NO_OF_OW_CHQ_BNC_TXNS"
## [33] "AVG_AMT_PER_ATM_TXN" "AVG_AMT_PER_CSH_WDL_TXN"
## [35] "AVG_AMT_PER_CHQ_TXN" "AVG_AMT_PER_NET_TXN"
## [37] "AVG_AMT_PER_MOB_TXN" "FLG_HAS_NOMINEE"
## [39] "FLG_HAS_OLD_LOAN" "random"
## [41] "GENDERF" "GENDERM"
## [43] "GENDERO" "OCCUPATIONPROF"
## [45] "OCCUPATIONSAL" "OCCUPATIONSELF.EMP"
## [47] "OCCUPATIONSENP" "AGE_BKT.25"
## [49] "AGE_BKT.50" "AGE_BKT26.30"
## [51] "AGE_BKT31.35" "AGE_BKT36.40"
## [53] "AGE_BKT41.45" "AGE_BKT46.50"
## [55] "ACC_TYPECA" "ACC_TYPESA"
dim(NNInput)
## [1] 20000 56
Creating Training and Test Datasets
library(caret)
set.seed(111)
trainIndex <- createDataPartition(TARGET,
p = .7,
list = FALSE,
times = 1)
NN.train.data <- NNInput[trainIndex,]
NN.test.data <- NNInput[-trainIndex,]
dim(NN.train.data)
## [1] 14000 56
dim(NN.test.data)
## [1] 6000 56
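As a quick sanity check (a sketch; output not shown), the response rate should be broadly comparable across the two partitions, since createDataPartition() samples with reference to TARGET:
round(prop.table(table(NN.train.data$TARGET)), 4)
round(prop.table(table(NN.test.data$TARGET)), 4)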
Scaling the Dataset and Variables
#names(NN.train.data)
x <- subset(NN.train.data,
select = c("AGE",
"BALANCE",
"SCR",
"HOLDING_PERIOD",
"LEN_OF_RLTN_IN_MNTH",
"NO_OF_L_CR_TXNS",
"NO_OF_L_DR_TXNS",
"TOT_NO_OF_L_TXNS",
"NO_OF_BR_CSH_WDL_DR_TXNS",
"NO_OF_ATM_DR_TXNS",
"NO_OF_NET_DR_TXNS",
"NO_OF_MOB_DR_TXNS",
"NO_OF_CHQ_DR_TXNS",
"FLG_HAS_CC",
"AMT_ATM_DR",
"AMT_BR_CSH_WDL_DR",
"AMT_CHQ_DR",
"AMT_NET_DR",
"AMT_MOB_DR",
"AMT_L_DR",
"FLG_HAS_ANY_CHGS",
"AMT_OTH_BK_ATM_USG_CHGS",
"AMT_MIN_BAL_NMC_CHGS",
"NO_OF_IW_CHQ_BNC_TXNS",
"NO_OF_OW_CHQ_BNC_TXNS",
"AVG_AMT_PER_ATM_TXN",
"AVG_AMT_PER_CSH_WDL_TXN",
"AVG_AMT_PER_CHQ_TXN",
"AVG_AMT_PER_NET_TXN",
"AVG_AMT_PER_MOB_TXN",
"FLG_HAS_NOMINEE",
"FLG_HAS_OLD_LOAN",
"random",
"GENDERF",
"GENDERM",
"GENDERO",
"OCCUPATIONPROF",
"OCCUPATIONSAL",
"OCCUPATIONSELF.EMP",
"OCCUPATIONSENP",
"AGE_BKT.25",
"AGE_BKT.50",
"AGE_BKT26.30",
"AGE_BKT31.35",
"AGE_BKT36.40",
"AGE_BKT41.45",
"AGE_BKT46.50",
"ACC_TYPECA",
"ACC_TYPESA" )
)
nn.devscaled <- scale(x)
nn.devscaled <- cbind(NN.train.data[2], nn.devscaled)
#names(nn.devscaled)
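scale() standardizes each column to mean 0 and standard deviation 1 (z-scores), so that large-magnitude variables such as BALANCE or AMT_NET_DR do not dominate the weight updates. A quick sanity check on one column (a sketch; output not shown):
round(mean(nn.devscaled$AGE), 6)   # should be ~0
round(sd(nn.devscaled$AGE), 6)     # should be ~1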
Building the NN Model
nn2 <- neuralnet(formula = TARGET ~
AGE +
BALANCE +
SCR +
HOLDING_PERIOD +
LEN_OF_RLTN_IN_MNTH +
NO_OF_L_CR_TXNS +
NO_OF_L_DR_TXNS +
TOT_NO_OF_L_TXNS +
NO_OF_BR_CSH_WDL_DR_TXNS +
NO_OF_ATM_DR_TXNS +
NO_OF_NET_DR_TXNS +
NO_OF_MOB_DR_TXNS +
NO_OF_CHQ_DR_TXNS +
FLG_HAS_CC +
AMT_ATM_DR +
AMT_BR_CSH_WDL_DR +
AMT_CHQ_DR +
AMT_NET_DR +
AMT_MOB_DR +
AMT_L_DR +
FLG_HAS_ANY_CHGS +
AMT_OTH_BK_ATM_USG_CHGS +
AMT_MIN_BAL_NMC_CHGS +
NO_OF_IW_CHQ_BNC_TXNS +
NO_OF_OW_CHQ_BNC_TXNS +
AVG_AMT_PER_ATM_TXN +
AVG_AMT_PER_CSH_WDL_TXN +
AVG_AMT_PER_CHQ_TXN +
AVG_AMT_PER_NET_TXN +
AVG_AMT_PER_MOB_TXN +
FLG_HAS_NOMINEE +
FLG_HAS_OLD_LOAN +
random +
GENDERF +
GENDERM +
GENDERO +
OCCUPATIONPROF +
OCCUPATIONSAL +
OCCUPATIONSELF.EMP +
OCCUPATIONSENP +
AGE_BKT.25 +
AGE_BKT.50 +
AGE_BKT26.30 +
AGE_BKT31.35 +
AGE_BKT36.40 +
AGE_BKT41.45 +
AGE_BKT46.50 +
ACC_TYPECA +
ACC_TYPESA ,
data = nn.devscaled,
hidden = 3,
err.fct = "sse",
linear.output = FALSE,
lifesign = "full",
lifesign.step = 10,
threshold = 0.1,
stepmax = 2000
)
## hidden: 3 thresh: 0.1 rep: 1/1 steps: 10 min thresh: 3.379522801
## 20 min thresh: 1.891409102
## 30 min thresh: 0.862865376
## (intermediate lifesign lines elided: min thresh falls stepwise through
## roughly 0.48, 0.38, 0.29, 0.22, 0.15 and 0.12 between steps 130 and 1030)
## 1030 min thresh: 0.1018929215
## 1034 error: 649.5383 time: 15.71 secs
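Rather than reading the full lifesign trace, the headline figures can be pulled from the fitted object; in the neuralnet result object, the first rows of result.matrix hold the final error, the threshold reached and the step count (output not shown):
nn2$result.matrix[1:3, ]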
#print(nn2)
The Artificial Neural Network - Graphical Representation
plot(nn2)
Measuring Model Performance
# Prediction on Train Data
NN.train.data$Prob = nn2$net.result[[1]]
#head(NN.train.data)
Probabilities in Training Dataset
# The distribution of the estimated probabilities
quantile(NN.train.data$Prob, c(0,1,5,10,25,50,75,90,95,98,99,100)/100)
## 0% 1% 5% 10% 25%
## 0.005340550102 0.005340550123 0.005341659118 0.005787379101 0.031114204547
## 50% 75% 90% 95% 98%
## 0.036584905250 0.180627475781 0.188810014317 0.513345635301 0.581857453306
## 99% 100%
## 0.581921404836 0.581921674484
hist(NN.train.data$Prob)
Performance Measures on Training Data Set
The following model performance measures will be calculated on the training set to gauge the goodness of the model:
> Rank Ordering
> KS
> Area Under Curve (AUC)
> Gini Coefficient
> Classification Error
Deciling code
decile <- function(x)
{
  # Equal-frequency binning: quantile() computes the nine inner decile
  # break points, and findInterval() assigns each value to its bin
  # (1 = lowest decile, 10 = highest).
  breaks <- quantile(x, probs = seq(0.1, 0.9, 0.1), na.rm = TRUE)
  findInterval(x, breaks) + 1
}
# deciling
NN.train.data$deciles <- decile(NN.train.data$Prob)
#
# Ranking code
##install.packages("data.table")
library(data.table)
library(scales)
tmp_DT = data.table(NN.train.data)
rank <- tmp_DT[, list(
cnt = length(TARGET),
cnt_resp = sum(TARGET==1),
cnt_non_resp = sum(TARGET == 0)) ,
by=deciles][order(-deciles)]
rank$rrate <- round (rank$cnt_resp / rank$cnt,2);
rank$cum_resp <- cumsum(rank$cnt_resp)
rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),2);
rank$cum_rel_non_resp <- round(rank$cum_non_resp / sum(rank$cnt_non_resp),2);
rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp)
rank$rrate <- percent(rank$rrate)
rank$cum_rel_resp <- percent(rank$cum_rel_resp)
rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
rank
## deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp
## 1: 10 1400 549 851 39% 549 851
## 2: 9 1400 251 1149 18% 800 2000
## 3: 8 1400 251 1149 18% 1051 3149
## 4: 7 1400 228 1172 16% 1279 4321
## 5: 6 1400 62 1338 4% 1341 5659
## 6: 5 1400 74 1326 5% 1415 6985
## 7: 4 1400 72 1328 5% 1487 8313
## 8: 3 1400 96 1304 7% 1583 9617
## 9: 2 1400 85 1315 6% 1668 10932
## 10: 1 1400 60 1340 4% 1728 12272
## cum_rel_resp cum_rel_non_resp ks
## 1: 32% 7% 0.25
## 2: 46% 16% 0.30
## 3: 61% 26% 0.35
## 4: 74% 35% 0.39
## 5: 78% 46% 0.32
## 6: 82% 57% 0.25
## 7: 86% 68% 0.18
## 8: 92% 78% 0.14
## 9: 97% 89% 0.08
## 10: 100% 100% 0.00
Interpretation:
The baseline response rate is 12.34%, whereas the response rates in the top three deciles are 39%, 18% and 18% respectively.
Over the top four deciles the KS reaches 39%, which is close to the benchmark usually taken to indicate a well-fitting model.
Assigning 0/1 class based on a chosen threshold (0.5 here)
NN.train.data$Class = ifelse(NN.train.data$Prob>0.5,1,0)
with( NN.train.data, table(TARGET, as.factor(Class) ))
##
## TARGET 0 1
## 0 11981 291
## 1 1296 432
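Accuracy and the classification error rate follow directly from this confusion matrix; a short sketch of the computation (the counts match the printed table, giving roughly 88.7% accuracy):
conf.train <- with(NN.train.data, table(TARGET, Class))
accuracy.train <- sum(diag(conf.train)) / sum(conf.train)   # (11981 + 432) / 14000
cer.train <- 1 - accuracy.train                             # classification error rate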
Error Computation
The half sum of squared errors on the training data reproduces the final training error reported by neuralnet above (649.5383), since the model was fit with err.fct = "sse".
sum((NN.train.data$TARGET - NN.train.data$Prob)^2)/2
## [1] 649.5382956
Other Model Performance Measures
library(ROCR)
#names(NN.train.data)
#str(NN.train.data)
detach(package:neuralnet)
pred3 <- prediction(NN.train.data$Prob, NN.train.data$TARGET)
perf3 <- performance(pred3, "tpr", "fpr")
plot(perf3)
KS3 <- max(attr(perf3, 'y.values')[[1]]-attr(perf3, 'x.values')[[1]])
KS3
## [1] 0.3913591313
auc3 <- performance(pred3,"auc");
auc3 <- as.numeric(auc3@y.values)
auc3
## [1] 0.7290062405
library(ineq)
gini3 = ineq(NN.train.data$Prob, type="Gini")
auc3
## [1] 0.7290062405
KS3
## [1] 0.3913591313
gini3
## [1] 0.5473745079
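For comparison with the Lorenz-based figure above, a ROC-based Gini can be derived directly from the AUC (a sketch):
gini3.roc <- 2 * auc3 - 1   # ~ 0.458, given AUC ~ 0.729
gini3.roc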
Summary: Model Performance Measures (Training Dataset)
Measure | Value
---|---
KS | 39.13%
AUC | 72.90%
Gini | 54.73%
Accuracy | 88.66%
CeR | 11.00%
The Gini Coefficient is the ratio of the area between the line of perfect equality and the observed Lorenz curve to the area between the line of perfect equality and the line of perfect inequality.
The higher the coefficient, the more unequal the distribution. Note that the Gini reported above is computed by ineq() on the distribution of predicted probabilities; a related but not identical quantity can be derived straight from the ROC AUC as Gini = 2*AUC - 1 (about 45.8% here).
A Gini above 60% is generally taken to indicate a good model.
The lower the classification error rate, the higher the model accuracy and the better the model. The error rate could potentially be reduced further if more informative independent variables were available for modeling.
The model performs broadly at par with expectations on a majority of the performance measures, indicating a good model with scope for improvement.
Scoring the Test Dataset using the Neural Net Model Object
x <- subset(NN.test.data,
select = c("AGE",
"BALANCE",
"SCR",
"HOLDING_PERIOD",
"LEN_OF_RLTN_IN_MNTH",
"NO_OF_L_CR_TXNS",
"NO_OF_L_DR_TXNS",
"TOT_NO_OF_L_TXNS",
"NO_OF_BR_CSH_WDL_DR_TXNS",
"NO_OF_ATM_DR_TXNS",
"NO_OF_NET_DR_TXNS",
"NO_OF_MOB_DR_TXNS",
"NO_OF_CHQ_DR_TXNS",
"FLG_HAS_CC",
"AMT_ATM_DR",
"AMT_BR_CSH_WDL_DR",
"AMT_CHQ_DR",
"AMT_NET_DR",
"AMT_MOB_DR",
"AMT_L_DR",
"FLG_HAS_ANY_CHGS",
"AMT_OTH_BK_ATM_USG_CHGS",
"AMT_MIN_BAL_NMC_CHGS",
"NO_OF_IW_CHQ_BNC_TXNS",
"NO_OF_OW_CHQ_BNC_TXNS",
"AVG_AMT_PER_ATM_TXN",
"AVG_AMT_PER_CSH_WDL_TXN",
"AVG_AMT_PER_CHQ_TXN",
"AVG_AMT_PER_NET_TXN",
"AVG_AMT_PER_MOB_TXN",
"FLG_HAS_NOMINEE",
"FLG_HAS_OLD_LOAN",
"random",
"GENDERF",
"GENDERM",
"GENDERO",
"OCCUPATIONPROF",
"OCCUPATIONSAL",
"OCCUPATIONSELF.EMP",
"OCCUPATIONSENP",
"AGE_BKT.25",
"AGE_BKT.50",
"AGE_BKT26.30",
"AGE_BKT31.35",
"AGE_BKT36.40",
"AGE_BKT41.45",
"AGE_BKT46.50",
"ACC_TYPECA",
"ACC_TYPESA" )
)
x.scaled <- scale(x)
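Strictly speaking, the test features should be standardized with the training set's centers and scales, so the network scores inputs on the scale it was trained on; with 14000 training rows the two scalings are close, but a sketch of the stricter approach is shown below (x.train is a hypothetical name for the training subset built earlier, which this script overwrote when reusing the name x):
# train.scaled <- scale(x.train)
# x.scaled <- scale(x,
#                   center = attr(train.scaled, "scaled:center"),
#                   scale  = attr(train.scaled, "scaled:scale"))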
# Re-attaching the neuralnet package for its compute() function, since it was detached earlier
library(neuralnet)
#Scoring the Test Dataset
#To score we will use the compute function
compute.output <- compute(nn2, x.scaled)
NN.test.data$Predict.score <- compute.output$net.result
#head(NN.test.data)
Estimated Probabilities
# The distribution of the estimated probabilities
quantile(NN.test.data$Predict.score, c(0,1,5,10,25,50,75,90,95,98,99,100)/100)
## 0% 1% 5% 10% 25%
## 0.005340550102 0.005340550137 0.005341313796 0.005684425855 0.031114934658
## 50% 75% 90% 95% 98%
## 0.036585067251 0.183669290563 0.188810016694 0.494376372589 0.581827903967
## 99% 100%
## 0.581920926121 0.581921674484
hist(NN.test.data$Predict.score)
Deciling
# deciling
NN.test.data$deciles <- decile(NN.test.data$Predict.score)
#
Rank ordering
library(data.table)
tmp_DT = data.table(NN.test.data)
h_rank <- tmp_DT[, list(
cnt = length(TARGET),
cnt_resp = sum(TARGET),
cnt_non_resp = sum(TARGET == 0)) ,
by=deciles][order(-deciles)]
h_rank$rrate <- round (h_rank$cnt_resp / h_rank$cnt,2);
h_rank$cum_resp <- cumsum(h_rank$cnt_resp)
h_rank$cum_non_resp <- cumsum(h_rank$cnt_non_resp)
h_rank$cum_rel_resp <- round(h_rank$cum_resp / sum(h_rank$cnt_resp),2);
h_rank$cum_rel_non_resp <- round(h_rank$cum_non_resp / sum(h_rank$cnt_non_resp),2)
h_rank$ks <- abs(h_rank$cum_rel_resp - h_rank$cum_rel_non_resp)
# scales is already loaded from the training-set ranking above
h_rank$rrate <- percent(h_rank$rrate)
h_rank$cum_rel_resp <- percent(h_rank$cum_rel_resp)
h_rank$cum_rel_non_resp <- percent(h_rank$cum_rel_non_resp)
h_rank
## deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp
## 1: 10 600 226 374 38% 226 374
## 2: 9 600 118 482 20% 344 856
## 3: 8 600 103 497 17% 447 1353
## 4: 7 600 103 497 17% 550 1850
## 5: 6 600 53 547 9% 603 2397
## 6: 5 600 48 552 8% 651 2949
## 7: 4 600 42 558 7% 693 3507
## 8: 3 600 33 567 6% 726 4074
## 9: 2 600 35 565 6% 761 4639
## 10: 1 600 23 577 4% 784 5216
## cum_rel_resp cum_rel_non_resp ks
## 1: 29% 7% 0.22
## 2: 44% 16% 0.28
## 3: 57% 26% 0.31
## 4: 70% 35% 0.35
## 5: 77% 46% 0.31
## 6: 83% 57% 0.26
## 7: 88% 67% 0.21
## 8: 93% 78% 0.15
## 9: 97% 89% 0.08
## 10: 100% 100% 0.00
Interpretation:
The baseline response rate is 12.34%, whereas the response rates in the top three deciles are 38%, 20% and 17% respectively.
Over the top four deciles the KS reaches 35%, slightly below the training value, indicating some scope for improvement.
Assigning the Probabilities to the Test Dataset
NN.test.data$Prob = compute.output$net.result
Assigning 0/1 class based on a chosen threshold (0.5 here)
NN.test.data$Class = ifelse(NN.test.data$Prob>0.5,1,0)
with( NN.test.data, table(TARGET, as.factor(Class) ))
##
## TARGET 0 1
## 0 5065 151
## 1 639 145
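The same computation on the test confusion matrix (a sketch; the counts above give roughly 86.8% accuracy):
conf.test <- with(NN.test.data, table(TARGET, Class))
accuracy.test <- sum(diag(conf.test)) / sum(conf.test)   # (5065 + 145) / 6000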
Error Computation
sum((NN.test.data$TARGET - NN.test.data$Prob)^2)/2
## [1] 308.6009414
Other Model Performance Measures
library(ROCR)
# str(NN.test.data)
detach(package:neuralnet)
pred4 <- prediction(NN.test.data$Prob, NN.test.data$TARGET)
perf4 <- performance(pred4, "tpr", "fpr")
plot(perf4)
KS4 <- max(attr(perf4, 'y.values')[[1]]-attr(perf4, 'x.values')[[1]])
KS4
## [1] 0.3545409728
auc4 <- performance(pred4,"auc");
auc4 <- as.numeric(auc4@y.values)
auc4
## [1] 0.7213823293
library(ineq)
gini4 = ineq(NN.test.data$Prob, type="Gini")
auc4
## [1] 0.7213823293
KS4
## [1] 0.3545409728
gini4
## [1] 0.5462225246
Summary: Model Performance Measures (Test Dataset)
Measure | Value
---|---
KS | 35.45%
AUC | 72.13%
Gini | 54.62%
Accuracy | 86.83%
CeR | 13.16%
On the test set, the model performs in line with its training performance on a majority of the measures, again indicating a good model with scope for improvement.
Artificial Neural Network - Conclusion
Comparative Summary of the Neural Network Model on Training and Testing Dataset is as follows:
Measures | Train | Test | %Deviation |
---|---|---|---|
KS | 39.00% | 35.45% | 9.10% |
AUC | 73.00% | 72.13% | 1.19% |
Gini | 54.70% | 54.62% | 0.14% |
Accuracy | 88.66% | 86.83% | 2.06% |
CeR | 11.00% | 13.16% | -19.6% |
Most of the model performance values for the training and test sets are within a maximum tolerance of +/- 10% deviation. Hence, the model is not over-fitting.
Overall, the model performs reasonably on the various performance measures (KS close to 40%, Gini coefficient above 50%, and so on), indicating that the developed Neural Network model has good predictive capability.
The main objective of the project was to develop a predictive model, using Machine Learning techniques, to predict whether MyBank customers will respond positively to a promotion or an offer. In this context, the key parameter for model evaluation was Accuracy: the proportion of all predictions that were correct, i.e. the percentage of customers whose response was correctly predicted.
The predictive models were developed using the following Machine Learning techniques:
> Classification Tree - CART
> Random Forest
> Neural Network
A snapshot of the performance of all the models on accuracy, over-fitting and other performance measures is provided below:
CART
Measures | Train | Test | %Deviation |
---|---|---|---|
KS | 72.18% | 47.40% | 35% |
AUC | 91.46% | 79.23% | 13% |
Gini | 41.46% | 60.42% | -46% |
Accuracy | 86.9% | 78.35% | 10% |
CeR | 13.91% | 21.65% | -56% |
Random Forest
Measures | Train | Test | %Deviation |
---|---|---|---|
KS | 99.75% | 77.44% | 22% |
AUC | 99.99% | 95.84% | 04% |
Gini | 74.85% | 64.63% | 14% |
Accuracy | 98.76% | 94.20% | 05% |
CeR | 01.24% | 05.80% | -368% |
Artificial Neural Network
Measures | Train | Test | %Deviation |
---|---|---|---|
KS | 39.00% | 35.45% | 9.10% |
AUC | 73.00% | 72.13% | 1.19% |
Gini | 54.70% | 54.62% | 0.14% |
Accuracy | 88.66% | 86.83% | 2.06% |
CeR | 11.00% | 13.16% | -19.6% |
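For a side-by-side view, the test-set figures can be assembled into a single frame (a sketch; the values are copied from the tables above):
model.comp <- data.frame(
  Measure = c("KS", "AUC", "Gini", "Accuracy", "CeR"),
  CART = c("47.40%", "79.23%", "60.42%", "78.35%", "21.65%"),
  RF   = c("77.44%", "95.84%", "64.63%", "94.20%", "05.80%"),
  ANN  = c("35.45%", "72.13%", "54.62%", "86.83%", "13.16%")
)
model.comp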
Interpretation:
> The CART model performs poorly compared to Random Forest and the ANN. The large percentage deviations between the training and test datasets suggest the model is over-fit.
> The Random Forest model has the best performance (highest accuracy) of the three. Its percentage deviation between training and test datasets is also reasonably under control, suggesting a robust model.
> The Neural Network performs second to Random Forest, though better than CART; its deviation between training and test datasets is the smallest of the three models, indicating the most stable fit.
Random Forest is the overall winner, combining the best accuracy with reasonably controlled deviations.