The dataset provides details of a Personal Loan Campaign executed by MyBank. 20,000 customers were targeted with an offer of a personal loan at a 10% interest rate, of which 2,512 customers responded positively. The data is to be used to build classification model(s) that predict the response of a new set of customers in the future, based on the attributes available in the data.
Classification models are built using the following supervised machine learning techniques:
1. Classification and Regression Tree
2. Random Forest
3. Artificial Neural Network
Once the classification models are built, compare the three models with each other and provide observations/recommendations.
#install.packages("caret, repos = http://cran.us.r-project.org")
#install.packages("rpart, repos = http://cran.us.r-project.org")
#install.packages("rpart.plot, repos = http://cran.us.r-project.org")
#install.packages("randomForest, repos = http://cran.us.r-project.org")
library(caret)
library(rpart)
library(rpart.plot)
library(randomForest)
Load Data
PL_X_SELL = read.table("D:/Analytics/BACP-Dec2017/10_MachineLearning/PL_X_SELL.csv", sep = ",", header = T)
attach(PL_X_SELL)
# Find out Total Number of Rows and Columns
dim(PL_X_SELL)
## [1] 20000 40
# Find out Names of the Columns (Features)
names(PL_X_SELL)
## [1] "CUST_ID" "TARGET"
## [3] "AGE" "GENDER"
## [5] "BALANCE" "OCCUPATION"
## [7] "AGE_BKT" "SCR"
## [9] "HOLDING_PERIOD" "ACC_TYPE"
## [11] "ACC_OP_DATE" "LEN_OF_RLTN_IN_MNTH"
## [13] "NO_OF_L_CR_TXNS" "NO_OF_L_DR_TXNS"
## [15] "TOT_NO_OF_L_TXNS" "NO_OF_BR_CSH_WDL_DR_TXNS"
## [17] "NO_OF_ATM_DR_TXNS" "NO_OF_NET_DR_TXNS"
## [19] "NO_OF_MOB_DR_TXNS" "NO_OF_CHQ_DR_TXNS"
## [21] "FLG_HAS_CC" "AMT_ATM_DR"
## [23] "AMT_BR_CSH_WDL_DR" "AMT_CHQ_DR"
## [25] "AMT_NET_DR" "AMT_MOB_DR"
## [27] "AMT_L_DR" "FLG_HAS_ANY_CHGS"
## [29] "AMT_OTH_BK_ATM_USG_CHGS" "AMT_MIN_BAL_NMC_CHGS"
## [31] "NO_OF_IW_CHQ_BNC_TXNS" "NO_OF_OW_CHQ_BNC_TXNS"
## [33] "AVG_AMT_PER_ATM_TXN" "AVG_AMT_PER_CSH_WDL_TXN"
## [35] "AVG_AMT_PER_CHQ_TXN" "AVG_AMT_PER_NET_TXN"
## [37] "AVG_AMT_PER_MOB_TXN" "FLG_HAS_NOMINEE"
## [39] "FLG_HAS_OLD_LOAN" "random"
# Find out Class of each Feature, along with internal structure
str(PL_X_SELL)
## 'data.frame': 20000 obs. of 40 variables:
## $ CUST_ID : Factor w/ 20000 levels "C1","C10","C100",..: 17699 16532 11027 17984 2363 11747 18115 15556 15216 12494 ...
## $ TARGET : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AGE : int 27 47 40 53 36 42 30 53 42 30 ...
## $ GENDER : Factor w/ 3 levels "F","M","O": 2 2 2 2 2 1 2 1 1 2 ...
## $ BALANCE : num 3384 287489 18217 71720 1671623 ...
## $ OCCUPATION : Factor w/ 4 levels "PROF","SAL","SELF-EMP",..: 3 2 3 2 1 1 1 2 3 1 ...
## $ AGE_BKT : Factor w/ 7 levels "<25",">50","26-30",..: 3 7 5 2 5 6 3 2 6 3 ...
## $ SCR : int 776 324 603 196 167 493 479 562 105 170 ...
## $ HOLDING_PERIOD : int 30 28 2 13 24 26 14 25 15 13 ...
## $ ACC_TYPE : Factor w/ 2 levels "CA","SA": 2 2 2 1 2 2 2 1 2 2 ...
## $ ACC_OP_DATE : Factor w/ 4869 levels "01-01-00","01-01-01",..: 3270 1806 3575 993 2861 862 4533 3160 257 334 ...
## $ LEN_OF_RLTN_IN_MNTH : int 146 104 61 107 185 192 177 99 88 111 ...
## $ NO_OF_L_CR_TXNS : int 7 8 10 36 20 5 6 14 18 14 ...
## $ NO_OF_L_DR_TXNS : int 3 2 5 14 1 2 6 3 14 8 ...
## $ TOT_NO_OF_L_TXNS : int 10 10 15 50 21 7 12 17 32 22 ...
## $ NO_OF_BR_CSH_WDL_DR_TXNS: int 0 0 1 4 1 1 0 3 6 3 ...
## $ NO_OF_ATM_DR_TXNS : int 1 1 1 2 0 1 1 0 2 1 ...
## $ NO_OF_NET_DR_TXNS : int 2 1 1 3 0 0 1 0 4 0 ...
## $ NO_OF_MOB_DR_TXNS : int 0 0 0 1 0 0 0 0 1 0 ...
## $ NO_OF_CHQ_DR_TXNS : int 0 0 2 4 0 0 4 0 1 4 ...
## $ FLG_HAS_CC : int 0 0 0 0 0 1 0 0 1 0 ...
## $ AMT_ATM_DR : int 13100 6600 11200 26100 0 18500 6200 0 35400 18000 ...
## $ AMT_BR_CSH_WDL_DR : int 0 0 561120 673590 808480 379310 0 945160 198430 869880 ...
## $ AMT_CHQ_DR : int 0 0 49320 60780 0 0 10580 0 51490 32610 ...
## $ AMT_NET_DR : num 973557 799813 997570 741506 0 ...
## $ AMT_MOB_DR : int 0 0 0 71388 0 0 0 0 170332 0 ...
## $ AMT_L_DR : num 986657 806413 1619210 1573364 808480 ...
## $ FLG_HAS_ANY_CHGS : int 0 1 1 0 0 0 1 0 0 0 ...
## $ AMT_OTH_BK_ATM_USG_CHGS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AMT_MIN_BAL_NMC_CHGS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NO_OF_IW_CHQ_BNC_TXNS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NO_OF_OW_CHQ_BNC_TXNS : int 0 0 1 0 0 0 0 0 0 0 ...
## $ AVG_AMT_PER_ATM_TXN : num 13100 6600 11200 13050 0 ...
## $ AVG_AMT_PER_CSH_WDL_TXN : num 0 0 561120 168398 808480 ...
## $ AVG_AMT_PER_CHQ_TXN : num 0 0 24660 15195 0 ...
## $ AVG_AMT_PER_NET_TXN : num 486779 799813 997570 247169 0 ...
## $ AVG_AMT_PER_MOB_TXN : num 0 0 0 71388 0 ...
## $ FLG_HAS_NOMINEE : int 1 1 1 1 1 1 0 1 1 0 ...
## $ FLG_HAS_OLD_LOAN : int 1 0 1 0 0 1 1 1 1 0 ...
## $ random : num 1.14e-05 1.11e-04 1.20e-04 1.37e-04 1.74e-04 ...
# Check the top 5 and bottom 5 rows of the dataset
head(PL_X_SELL,5)
## CUST_ID TARGET AGE GENDER BALANCE OCCUPATION AGE_BKT SCR
## 1 C7927 0 27 M 3383.75 SELF-EMP 26-30 776
## 2 C6877 0 47 M 287489.04 SAL 46-50 324
## 3 C19922 0 40 M 18216.88 SELF-EMP 36-40 603
## 4 C8183 0 53 M 71720.48 SAL >50 196
## 5 C12123 0 36 M 1671622.89 PROF 36-40 167
## HOLDING_PERIOD ACC_TYPE ACC_OP_DATE LEN_OF_RLTN_IN_MNTH NO_OF_L_CR_TXNS
## 1 30 SA 3/23/2005 146 7
## 2 28 SA 10-11-08 104 8
## 3 2 SA 4/26/2012 61 10
## 4 13 CA 07-04-08 107 36
## 5 24 SA 12/29/2001 185 20
## NO_OF_L_DR_TXNS TOT_NO_OF_L_TXNS NO_OF_BR_CSH_WDL_DR_TXNS
## 1 3 10 0
## 2 2 10 0
## 3 5 15 1
## 4 14 50 4
## 5 1 21 1
## NO_OF_ATM_DR_TXNS NO_OF_NET_DR_TXNS NO_OF_MOB_DR_TXNS NO_OF_CHQ_DR_TXNS
## 1 1 2 0 0
## 2 1 1 0 0
## 3 1 1 0 2
## 4 2 3 1 4
## 5 0 0 0 0
## FLG_HAS_CC AMT_ATM_DR AMT_BR_CSH_WDL_DR AMT_CHQ_DR AMT_NET_DR AMT_MOB_DR
## 1 0 13100 0 0 973557 0
## 2 0 6600 0 0 799813 0
## 3 0 11200 561120 49320 997570 0
## 4 0 26100 673590 60780 741506 71388
## 5 0 0 808480 0 0 0
## AMT_L_DR FLG_HAS_ANY_CHGS AMT_OTH_BK_ATM_USG_CHGS AMT_MIN_BAL_NMC_CHGS
## 1 986657 0 0 0
## 2 806413 1 0 0
## 3 1619210 1 0 0
## 4 1573364 0 0 0
## 5 808480 0 0 0
## NO_OF_IW_CHQ_BNC_TXNS NO_OF_OW_CHQ_BNC_TXNS AVG_AMT_PER_ATM_TXN
## 1 0 0 13100
## 2 0 0 6600
## 3 0 1 11200
## 4 0 0 13050
## 5 0 0 0
## AVG_AMT_PER_CSH_WDL_TXN AVG_AMT_PER_CHQ_TXN AVG_AMT_PER_NET_TXN
## 1 0.0 0 486778.5
## 2 0.0 0 799813.0
## 3 561120.0 24660 997570.0
## 4 168397.5 15195 247168.7
## 5 808480.0 0 0.0
## AVG_AMT_PER_MOB_TXN FLG_HAS_NOMINEE FLG_HAS_OLD_LOAN random
## 1 0 1 1 0.000011400
## 2 0 1 0 0.000111373
## 3 0 1 1 0.000119954
## 4 71388 1 0 0.000136825
## 5 0 1 0 0.000173976
#tail(PL_X_SELL,5)
# Check for Missing Values
colSums(is.na(PL_X_SELL))
## CUST_ID TARGET AGE
## 0 0 0
## GENDER BALANCE OCCUPATION
## 0 0 0
## AGE_BKT SCR HOLDING_PERIOD
## 0 0 0
## ACC_TYPE ACC_OP_DATE LEN_OF_RLTN_IN_MNTH
## 0 0 0
## NO_OF_L_CR_TXNS NO_OF_L_DR_TXNS TOT_NO_OF_L_TXNS
## 0 0 0
## NO_OF_BR_CSH_WDL_DR_TXNS NO_OF_ATM_DR_TXNS NO_OF_NET_DR_TXNS
## 0 0 0
## NO_OF_MOB_DR_TXNS NO_OF_CHQ_DR_TXNS FLG_HAS_CC
## 0 0 0
## AMT_ATM_DR AMT_BR_CSH_WDL_DR AMT_CHQ_DR
## 0 0 0
## AMT_NET_DR AMT_MOB_DR AMT_L_DR
## 0 0 0
## FLG_HAS_ANY_CHGS AMT_OTH_BK_ATM_USG_CHGS AMT_MIN_BAL_NMC_CHGS
## 0 0 0
## NO_OF_IW_CHQ_BNC_TXNS NO_OF_OW_CHQ_BNC_TXNS AVG_AMT_PER_ATM_TXN
## 0 0 0
## AVG_AMT_PER_CSH_WDL_TXN AVG_AMT_PER_CHQ_TXN AVG_AMT_PER_NET_TXN
## 0 0 0
## AVG_AMT_PER_MOB_TXN FLG_HAS_NOMINEE FLG_HAS_OLD_LOAN
## 0 0 0
## random
## 0
table(TARGET)
## TARGET
## 0 1
## 17488 2512
prop.table(table(TARGET))
## TARGET
## 0 1
## 0.8744 0.1256
Number of Rows and Columns:
> The number of rows in the dataset is 20,000
> The number of columns (Features) in the dataset is 40
Missing values check:
There are no missing values in the dataset.
Proportion of Responders vs Non-Responders:
> Total Responder Records: 2,512 (12.56%)
> Total Non-Responder Records: 17,488 (87.44%)
Inference:
The non-responder class dominates heavily. The dataset may need to be balanced to achieve better classification results.
# install.packages("plotrix, repos = http://cran.us.r-project.org"" )
library(plotrix)
Data Visualization
pie3D(prop.table((table(PL_X_SELL$TARGET))),
main='Resp Vs Non Resp in Input Data set',
#explode=0.1,
labels=c("Non Resp", "Resp"),
col = c("Turquoise", "Medium Sea Green")
)
pie3D(prop.table((table(PL_X_SELL$GENDER))),
main='Gender in Input Data set',
#explode=0.1,
labels=c("F", "M", "O"),
col = c("Turquoise", "Medium Sea Green", "Dark Sea Green") # one colour per gender level
)
Summary statistics
summary(PL_X_SELL)
## CUST_ID TARGET AGE GENDER
## C1 : 1 Min. :0.0000 Min. :21.00 F: 5433
## C10 : 1 1st Qu.:0.0000 1st Qu.:30.00 M:14376
## C100 : 1 Median :0.0000 Median :38.00 O: 191
## C1000 : 1 Mean :0.1256 Mean :38.42
## C10000 : 1 3rd Qu.:0.0000 3rd Qu.:46.00
## C10001 : 1 Max. :1.0000 Max. :55.00
## (Other):19994
## BALANCE OCCUPATION AGE_BKT SCR
## Min. : 0 PROF :5417 <25 :1753 Min. :100.0
## 1st Qu.: 64754 SAL :5855 >50 :3035 1st Qu.:227.0
## Median : 231676 SELF-EMP:3568 26-30:3434 Median :364.0
## Mean : 511362 SENP :5160 31-35:3404 Mean :440.2
## 3rd Qu.: 653877 36-40:2814 3rd Qu.:644.0
## Max. :8360431 41-45:3067 Max. :999.0
## 46-50:2493
## HOLDING_PERIOD ACC_TYPE ACC_OP_DATE LEN_OF_RLTN_IN_MNTH
## Min. : 1.00 CA: 4241 11/16/2010: 24 Min. : 29.0
## 1st Qu.: 7.00 SA:15759 04-03-09 : 23 1st Qu.: 79.0
## Median :15.00 7/25/2010 : 22 Median :125.0
## Mean :14.96 05-06-13 : 21 Mean :125.2
## 3rd Qu.:22.00 02-07-07 : 20 3rd Qu.:172.0
## Max. :31.00 8/24/2010 : 20 Max. :221.0
## (Other) :19870
## NO_OF_L_CR_TXNS NO_OF_L_DR_TXNS TOT_NO_OF_L_TXNS
## Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 6.00 1st Qu.: 2.000 1st Qu.: 9.00
## Median :10.00 Median : 5.000 Median : 14.00
## Mean :12.35 Mean : 6.634 Mean : 18.98
## 3rd Qu.:14.00 3rd Qu.: 7.000 3rd Qu.: 21.00
## Max. :75.00 Max. :74.000 Max. :149.00
##
## NO_OF_BR_CSH_WDL_DR_TXNS NO_OF_ATM_DR_TXNS NO_OF_NET_DR_TXNS
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 1.000 Median : 1.000 Median : 0.000
## Mean : 1.883 Mean : 1.029 Mean : 1.172
## 3rd Qu.: 2.000 3rd Qu.: 1.000 3rd Qu.: 1.000
## Max. :15.000 Max. :25.000 Max. :22.000
##
## NO_OF_MOB_DR_TXNS NO_OF_CHQ_DR_TXNS FLG_HAS_CC AMT_ATM_DR
## Min. : 0.0000 Min. : 0.000 Min. :0.0000 Min. : 0
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.: 0
## Median : 0.0000 Median : 2.000 Median :0.0000 Median : 6900
## Mean : 0.4118 Mean : 2.138 Mean :0.3054 Mean : 10990
## 3rd Qu.: 0.0000 3rd Qu.: 4.000 3rd Qu.:1.0000 3rd Qu.: 15800
## Max. :25.0000 Max. :15.000 Max. :1.0000 Max. :199300
##
## AMT_BR_CSH_WDL_DR AMT_CHQ_DR AMT_NET_DR AMT_MOB_DR
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 2990 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median :340150 Median : 23840 Median : 0 Median : 0
## Mean :378475 Mean : 124520 Mean :237308 Mean : 22425
## 3rd Qu.:674675 3rd Qu.: 72470 3rd Qu.:473971 3rd Qu.: 0
## Max. :999930 Max. :4928640 Max. :999854 Max. :199667
##
## AMT_L_DR FLG_HAS_ANY_CHGS AMT_OTH_BK_ATM_USG_CHGS
## Min. : 0 Min. :0.0000 Min. : 0.000
## 1st Qu.: 237936 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 695115 Median :0.0000 Median : 0.000
## Mean : 773717 Mean :0.1106 Mean : 1.099
## 3rd Qu.:1078927 3rd Qu.:0.0000 3rd Qu.: 0.000
## Max. :6514921 Max. :1.0000 Max. :250.000
##
## AMT_MIN_BAL_NMC_CHGS NO_OF_IW_CHQ_BNC_TXNS NO_OF_OW_CHQ_BNC_TXNS
## Min. : 0.000 Min. :0.00000 Min. :0.0000
## 1st Qu.: 0.000 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 0.000 Median :0.00000 Median :0.0000
## Mean : 1.292 Mean :0.04275 Mean :0.0444
## 3rd Qu.: 0.000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :170.000 Max. :2.00000 Max. :2.0000
##
## AVG_AMT_PER_ATM_TXN AVG_AMT_PER_CSH_WDL_TXN AVG_AMT_PER_CHQ_TXN
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 1266 1st Qu.: 0
## Median : 6000 Median :147095 Median : 8645
## Mean : 7409 Mean :242237 Mean : 25093
## 3rd Qu.:13500 3rd Qu.:385000 3rd Qu.: 28605
## Max. :25000 Max. :999640 Max. :537842
##
## AVG_AMT_PER_NET_TXN AVG_AMT_PER_MOB_TXN FLG_HAS_NOMINEE FLG_HAS_OLD_LOAN
## Min. : 0 Min. : 0 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0 1st Qu.:1.0000 1st Qu.:0.0000
## Median : 0 Median : 0 Median :1.0000 Median :0.0000
## Mean :179059 Mean : 20304 Mean :0.9012 Mean :0.4929
## 3rd Qu.:257699 3rd Qu.: 0 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :999854 Max. :199667 Max. :1.0000 Max. :1.0000
##
## random
## Min. :0.0000114
## 1st Qu.:0.2481866
## Median :0.5061214
## Mean :0.5019330
## 3rd Qu.:0.7535712
## Max. :0.9999471
##
Creating Training and Testing Dataset
The given dataset is divided into training and testing sets in a 70:30 proportion.
The distribution of the responder and non-responder classes is verified in both sets to ensure it is close to equal.
set.seed(111)
trainIndex <- createDataPartition(TARGET,
p=0.7,
list = FALSE,
times = 1)
train.data <- PL_X_SELL[trainIndex, ]
test.data <- PL_X_SELL[-trainIndex,]
dim(train.data)
## [1] 14000 40
dim(test.data)
## [1] 6000 40
Check whether the class distribution of the partitioned data is consistent
table(train.data$TARGET)
##
## 0 1
## 12272 1728
table(test.data$TARGET)
##
## 0 1
## 5216 784
prop.table((table(train.data$TARGET)))
##
## 0 1
## 0.8765714 0.1234286
prop.table((table(test.data$TARGET)))
##
## 0 1
## 0.8693333 0.1306667
par(mfrow=c(1,2))
pie3D(prop.table((table(train.data$TARGET))),
main='Resp Vs Non Resp in Training Data set',
#explode=0.1,
labels=c("Non Resp", "Resp"),
col = c("Turquoise", "Medium Sea Green")
)
pie3D(prop.table((table(test.data$TARGET))),
main='Resp Vs Non Resp in Testing Data set',
#explode=0.1,
labels=c("Non Resp", "Resp"),
col = c("Aquamarine", "Dark Sea Green")
)
Decision Trees are commonly used in data mining with the objective of creating a model that predicts the value of a target (or dependent) variable based on the values of several input (or independent) variables.
As an umbrella term, Classification and Regression Tree (CART) refers to the following decision trees:
> Classification Trees: where the target variable is categorical and the tree is used to identify the "class" into which the target variable would most likely fall.
> Regression Trees: where the target variable is continuous and the tree is used to predict its value.
The CART algorithm is structured as a sequence of questions, the answers to which determine what the next question, if any, should be. The result of these questions is a tree-like structure whose ends are terminal nodes, at which point there are no more questions.
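As a minimal illustration of this question-and-answer structure (a sketch on the built-in iris data, not the campaign data):
library(rpart)
library(rpart.plot)
# Each internal node asks a yes/no question about a predictor;
# the terminal nodes carry the predicted class.
toy.tree <- rpart(Species ~ ., data = iris, method = "class")
rpart.plot(toy.tree) # the first question splits on Petal.Length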
Model Building - CART (Unbalanced Dataset)
Setting the control parameter inputs for rpart
r.ctrl <- rpart.control(minsplit = 100, # minimum observations in a node for a split to be attempted
minbucket = 10, # minimum observations in any terminal node
cp = 0,         # complexity parameter; 0 grows the full tree (pruned later)
xval = 10       # number of cross-validations
)
Build the model on Training Dataset (Unbalanced)
#Exclude columns - "Customer ID" and "Acct Opening Date"
cart.train <- train.data
names(cart.train)
## [1] "CUST_ID" "TARGET"
## [3] "AGE" "GENDER"
## [5] "BALANCE" "OCCUPATION"
## [7] "AGE_BKT" "SCR"
## [9] "HOLDING_PERIOD" "ACC_TYPE"
## [11] "ACC_OP_DATE" "LEN_OF_RLTN_IN_MNTH"
## [13] "NO_OF_L_CR_TXNS" "NO_OF_L_DR_TXNS"
## [15] "TOT_NO_OF_L_TXNS" "NO_OF_BR_CSH_WDL_DR_TXNS"
## [17] "NO_OF_ATM_DR_TXNS" "NO_OF_NET_DR_TXNS"
## [19] "NO_OF_MOB_DR_TXNS" "NO_OF_CHQ_DR_TXNS"
## [21] "FLG_HAS_CC" "AMT_ATM_DR"
## [23] "AMT_BR_CSH_WDL_DR" "AMT_CHQ_DR"
## [25] "AMT_NET_DR" "AMT_MOB_DR"
## [27] "AMT_L_DR" "FLG_HAS_ANY_CHGS"
## [29] "AMT_OTH_BK_ATM_USG_CHGS" "AMT_MIN_BAL_NMC_CHGS"
## [31] "NO_OF_IW_CHQ_BNC_TXNS" "NO_OF_OW_CHQ_BNC_TXNS"
## [33] "AVG_AMT_PER_ATM_TXN" "AVG_AMT_PER_CSH_WDL_TXN"
## [35] "AVG_AMT_PER_CHQ_TXN" "AVG_AMT_PER_NET_TXN"
## [37] "AVG_AMT_PER_MOB_TXN" "FLG_HAS_NOMINEE"
## [39] "FLG_HAS_OLD_LOAN" "random"
m1 <- rpart(formula = TARGET~.,
data = cart.train[,-c(1,11)],
method = "class",
control = r.ctrl
)
#m1
#install.packages("rattle")
#install.packages("RColorBrewer")
library(rattle)
library(RColorBrewer)
fancyRpartPlot(m1)
printcp(m1)
##
## Classification tree:
## rpart(formula = TARGET ~ ., data = cart.train[, -c(1, 11)], method = "class",
## control = r.ctrl)
##
## Variables actually used in tree construction:
## [1] ACC_TYPE AGE_BKT
## [3] AMT_ATM_DR AMT_BR_CSH_WDL_DR
## [5] AMT_L_DR AMT_NET_DR
## [7] AVG_AMT_PER_ATM_TXN AVG_AMT_PER_CSH_WDL_TXN
## [9] AVG_AMT_PER_MOB_TXN AVG_AMT_PER_NET_TXN
## [11] BALANCE FLG_HAS_CC
## [13] GENDER HOLDING_PERIOD
## [15] LEN_OF_RLTN_IN_MNTH NO_OF_ATM_DR_TXNS
## [17] NO_OF_BR_CSH_WDL_DR_TXNS NO_OF_L_CR_TXNS
## [19] NO_OF_L_DR_TXNS NO_OF_OW_CHQ_BNC_TXNS
## [21] OCCUPATION SCR
##
## Root node error: 1728/14000 = 0.12343
##
## n= 14000
##
## CP nsplit rel error xerror xstd
## 1 0.00593171 0 1.00000 1.00000 0.022523
## 2 0.00405093 4 0.97627 0.99942 0.022517
## 3 0.00173611 10 0.95139 0.98669 0.022393
## 4 0.00166377 12 0.94792 0.99884 0.022512
## 5 0.00154321 26 0.91725 0.99826 0.022506
## 6 0.00135031 30 0.91088 0.99884 0.022512
## 7 0.00101273 41 0.89294 1.00347 0.022556
## 8 0.00096451 45 0.88889 1.00926 0.022612
## 9 0.00086806 48 0.88600 1.00926 0.022612
## 10 0.00057870 50 0.88426 1.01215 0.022640
## 11 0.00000000 55 0.88021 1.02546 0.022767
plotcp(m1)
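Rather than reading the cp value off the table by eye, the cp with the lowest cross-validated error can also be picked programmatically; a sketch using rpart's cptable:
# CP with the minimum cross-validated error (xerror) from the cp table
best.cp <- m1$cptable[which.min(m1$cptable[, "xerror"]), "CP"]
best.cp
# ptree.auto <- prune(m1, cp = best.cp) # alternative to the manual choice below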
Pruning the Tree
ptree <- prune(m1, cp = 0.0017)
printcp(ptree)
##
## Classification tree:
## rpart(formula = TARGET ~ ., data = cart.train[, -c(1, 11)], method = "class",
## control = r.ctrl)
##
## Variables actually used in tree construction:
## [1] AGE_BKT AMT_ATM_DR BALANCE HOLDING_PERIOD
## [5] NO_OF_L_CR_TXNS NO_OF_L_DR_TXNS OCCUPATION SCR
##
## Root node error: 1728/14000 = 0.12343
##
## n= 14000
##
## CP nsplit rel error xerror xstd
## 1 0.0059317 0 1.00000 1.00000 0.022523
## 2 0.0040509 4 0.97627 0.99942 0.022517
## 3 0.0017361 10 0.95139 0.98669 0.022393
## 4 0.0017000 12 0.94792 0.99884 0.022512
Plotting the final CART model (Unbalanced Dataset)
fancyRpartPlot(ptree,
uniform = TRUE,
main = "Final Tree",
palettes = c("Blues", "Oranges")
)
The following model performance measures will be calculated on the training set to gauge the goodness of the model:
> Rank Ordering
> KS
> Area Under Curve (AUC)
> Gini Coefficient
> Classification Error
Predict Training Data Set
cart.train$predict.class = predict(ptree, cart.train, type = "class")
cart.train$predict.score = predict(ptree, cart.train, type = "prob")
#head(cart.train)
Deciling Code
# Assign each observation to a decile (1-10) based on the quantiles of x
decile <- function(x)
{
deciles <- vector(length=10)
for (i in seq(0.1,1,.1))
{
deciles[i*10] <- quantile(x, i, na.rm=T)
}
return (
ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10
))))))))))
}
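A more compact alternative buckets the scores with base R's cut() over the quantile breaks (a sketch; heavy ties in the score distribution can collapse some deciles, as seen in the rank tables below):
# Decile assignment via quantile breaks; labels = FALSE returns integer bucket IDs
decile.cut <- function(x) {
  brks <- unique(quantile(x, probs = seq(0, 1, 0.1), na.rm = TRUE))
  cut(x, breaks = brks, include.lowest = TRUE, labels = FALSE)
}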
#class(cart.train$predict.score)
## deciling
cart.train$deciles <- decile(cart.train$predict.score[,2])
#head(cart.train)
Ranking Code
#install.packages("data.table")
#install.packages("scales")
library(data.table)
library(scales)
tmp_DT = data.table(cart.train)
rank <- tmp_DT[, list(cnt=length(TARGET),
cnt_resp=sum(TARGET==1),
cnt_non_resp=sum(TARGET==0)
), by=deciles][order(-deciles)]
rank$rrate <- round(rank$cnt_resp / rank$cnt,4);
rank$cum_resp <- cumsum(rank$cnt_resp)
rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),4);
rank$cum_rel_non_resp <- round(rank$cum_non_resp / sum(rank$cnt_non_resp),4);
rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100;
rank$rrate <- percent(rank$rrate)
rank$cum_rel_resp <- percent(rank$cum_rel_resp)
rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
rank
## deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp
## 1: 10 2670 654 2016 24.5% 654 2016
## 2: 9 2911 310 2601 10.6% 964 4617
## 3: 7 8419 764 7655 9.1% 1728 12272
## cum_rel_resp cum_rel_non_resp ks
## 1: 37.8% 16.4% 21.42
## 2: 55.8% 37.6% 18.17
## 3: 100.0% 100.0% 0.00
Interpretation:
> The response rate in the top decile is 24.5%.
> The maximum KS is around 21.42%, indicating a weak model.
KS and Area under Curve
#install.packages("ROCR")
#install.packages("ineq")
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:plotrix':
##
## plotCI
## The following object is masked from 'package:stats':
##
## lowess
library(ineq)
pred <- prediction(cart.train$predict.score[,2], cart.train$TARGET)
perf <- performance(pred, "tpr", "fpr")
KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
auc <- performance(pred,"auc");
auc <- as.numeric(auc@y.values)
gini = ineq(cart.train$predict.score[,2], type="Gini")
with(cart.train, table(TARGET, predict.class))
## predict.class
## TARGET 0 1
## 0 12226 46
## 1 1592 136
plot(perf)
KS
## [1] 0.2142056
auc
## [1] 0.624329
gini
## [1] 0.2179665
Summary: CART - Model Performance (Training, Unbalanced Dataset):
The KS = 21.4% and the AUC = 62.4%, which indicates that the model is not good.
The Gini Coefficient = 21.8% also indicates that the model is not good.
Confusion matrix:
1. Accuracy = (12226+136)/(12226+1592+46+136) = 88.3%
2. Classification Error Rate = 1 - Accuracy = 11.7%
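These figures can also be computed directly from the confusion matrix instead of by hand; a short sketch using the table produced above:
# Accuracy and classification error straight from the confusion matrix
cm <- with(cart.train, table(TARGET, predict.class))
accuracy <- sum(diag(cm)) / sum(cm)
accuracy # ~0.883
1 - accuracy # classification error rate, ~0.117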
Predict Test Data Set
## Scoring Holdout sample
cart.test <- test.data
cart.test$predict.class = predict(ptree, cart.test, type = "class")
cart.test$predict.score = predict(ptree, cart.test, type = "prob")
cart.test$deciles <- decile(cart.test$predict.score[,2])
#head(cart.test)
Ranking Code - Test Data
tmp_DT = data.table(cart.test)
rank <- tmp_DT[, list(cnt=length(TARGET),
cnt_resp=sum(TARGET==1),
cnt_non_resp=sum(TARGET==0)
), by=deciles][order(-deciles)]
rank$rrate <- round(rank$cnt_resp / rank$cnt,4);
rank$cum_resp <- cumsum(rank$cnt_resp)
rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),4);
rank$cum_rel_non_resp <- round(rank$cum_non_resp / sum(rank$cnt_non_resp),4);
rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100;
rank$rrate <- percent(rank$rrate)
rank$cum_rel_resp <- percent(rank$cum_rel_resp)
rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
rank
## deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp
## 1: 10 1230 324 906 26.3% 324 906
## 2: 8 1212 129 1083 10.6% 453 1989
## 3: 6 3558 331 3227 9.3% 784 5216
## cum_rel_resp cum_rel_non_resp ks
## 1: 41.3% 17.4% 23.96
## 2: 57.8% 38.1% 19.65
## 3: 100.0% 100.0% 0.00
Interpretation:
> The response rate in the top decile is 26.3%.
> The maximum KS is around 24%, indicating a weak model.
KS and Area under Curve
pred <- prediction(cart.test$predict.score[,2], cart.test$TARGET)
perf <- performance(pred, "tpr", "fpr")
KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
auc <- performance(pred,"auc");
auc <- as.numeric(auc@y.values)
gini = ineq(cart.test$predict.score[,2], type="Gini")
with(cart.test, table(TARGET, predict.class))
## predict.class
## TARGET 0 1
## 0 5193 23
## 1 738 46
plot(perf)
KS
## [1] 0.2456805
auc
## [1] 0.6350559
gini
## [1] 0.2192173
CART - Model Performance (Test Dataset):
The KS = 24.6% and the AUC = 63.5%, which indicates that the model is not good.
The Gini Coefficient = 21.9% also indicates that the model is not good.
Confusion matrix:
1. Accuracy = (5193 + 46)/(5193+738+23+46) = 87.32%
2. Classification Error Rate = 1 - Accuracy = 12.68%
#install.packages("ROSE", repos = "http://cran.us.r-project.org")
library(ROSE)
N = 12272*2 # target size: responders oversampled to match the 12,272 non-responders
N
## [1] 24544
cart.train.over <- ovun.sample(TARGET~.,data=cart.train,
method="over",N=24544)$data
table(cart.train.over$TARGET)
##
## 0 1
## 12272 12272
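ROSE also supports undersampling the majority class or mixing both strategies; a sketch of the alternatives (using the class counts from the training tables above):
# Undersample non-responders down to twice the 1,728 responders
cart.train.under <- ovun.sample(TARGET ~ ., data = cart.train,
                                method = "under", N = 1728 * 2, seed = 111)$data
# Or combine over- and under-sampling around a 50:50 mix at the original size
cart.train.both <- ovun.sample(TARGET ~ ., data = cart.train,
                               method = "both", p = 0.5, N = 14000, seed = 111)$data
table(cart.train.under$TARGET)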
Setting the control parameter
r.ctrl = rpart.control(minsplit=100,
minbucket = 20,
cp = 0,
xval = 10)
Building the CART Model on Oversampled Training Dataset
# Exclude Columns Customer ID and Acct Opening Date.
names(cart.train.over)
## [1] "CUST_ID" "TARGET"
## [3] "AGE" "GENDER"
## [5] "BALANCE" "OCCUPATION"
## [7] "AGE_BKT" "SCR"
## [9] "HOLDING_PERIOD" "ACC_TYPE"
## [11] "ACC_OP_DATE" "LEN_OF_RLTN_IN_MNTH"
## [13] "NO_OF_L_CR_TXNS" "NO_OF_L_DR_TXNS"
## [15] "TOT_NO_OF_L_TXNS" "NO_OF_BR_CSH_WDL_DR_TXNS"
## [17] "NO_OF_ATM_DR_TXNS" "NO_OF_NET_DR_TXNS"
## [19] "NO_OF_MOB_DR_TXNS" "NO_OF_CHQ_DR_TXNS"
## [21] "FLG_HAS_CC" "AMT_ATM_DR"
## [23] "AMT_BR_CSH_WDL_DR" "AMT_CHQ_DR"
## [25] "AMT_NET_DR" "AMT_MOB_DR"
## [27] "AMT_L_DR" "FLG_HAS_ANY_CHGS"
## [29] "AMT_OTH_BK_ATM_USG_CHGS" "AMT_MIN_BAL_NMC_CHGS"
## [31] "NO_OF_IW_CHQ_BNC_TXNS" "NO_OF_OW_CHQ_BNC_TXNS"
## [33] "AVG_AMT_PER_ATM_TXN" "AVG_AMT_PER_CSH_WDL_TXN"
## [35] "AVG_AMT_PER_CHQ_TXN" "AVG_AMT_PER_NET_TXN"
## [37] "AVG_AMT_PER_MOB_TXN" "FLG_HAS_NOMINEE"
## [39] "FLG_HAS_OLD_LOAN" "random"
## [41] "predict.class" "predict.score"
## [43] "deciles"
m1 <- rpart(formula = TARGET ~ .,
data = cart.train.over[,-c(1,11,41,42,43)],
method = "class",
control = r.ctrl)
#m1
printcp(m1)
##
## Classification tree:
## rpart(formula = TARGET ~ ., data = cart.train.over[, -c(1, 11,
## 41, 42, 43)], method = "class", control = r.ctrl)
##
## Variables actually used in tree construction:
## [1] ACC_TYPE AGE
## [3] AGE_BKT AMT_ATM_DR
## [5] AMT_BR_CSH_WDL_DR AMT_CHQ_DR
## [7] AMT_L_DR AMT_MOB_DR
## [9] AMT_NET_DR AVG_AMT_PER_ATM_TXN
## [11] AVG_AMT_PER_CHQ_TXN AVG_AMT_PER_CSH_WDL_TXN
## [13] AVG_AMT_PER_NET_TXN BALANCE
## [15] FLG_HAS_ANY_CHGS FLG_HAS_CC
## [17] FLG_HAS_NOMINEE FLG_HAS_OLD_LOAN
## [19] GENDER HOLDING_PERIOD
## [21] LEN_OF_RLTN_IN_MNTH NO_OF_ATM_DR_TXNS
## [23] NO_OF_BR_CSH_WDL_DR_TXNS NO_OF_CHQ_DR_TXNS
## [25] NO_OF_IW_CHQ_BNC_TXNS NO_OF_L_CR_TXNS
## [27] NO_OF_L_DR_TXNS NO_OF_NET_DR_TXNS
## [29] NO_OF_OW_CHQ_BNC_TXNS OCCUPATION
## [31] SCR TOT_NO_OF_L_TXNS
##
## Root node error: 12272/24544 = 0.5
##
## n= 24544
##
## CP nsplit rel error xerror xstd
## 1 1.8106e-01 0 1.00000 1.02591 0.0063809
## 2 2.6239e-02 1 0.81894 0.83572 0.0062963
## 3 1.9883e-02 2 0.79270 0.80256 0.0062574
## 4 1.3174e-02 3 0.77282 0.77705 0.0062224
## 5 1.1367e-02 6 0.73330 0.74389 0.0061701
## 6 1.0308e-02 8 0.71056 0.72849 0.0061433
## 7 8.2301e-03 10 0.68994 0.69712 0.0060832
## 8 6.4782e-03 11 0.68171 0.68970 0.0060680
## 9 6.1930e-03 13 0.66876 0.68326 0.0060544
## 10 4.5361e-03 14 0.66257 0.67031 0.0060261
## 11 4.2373e-03 17 0.64896 0.65271 0.0059857
## 12 3.5039e-03 19 0.64048 0.63722 0.0059482
## 13 3.2595e-03 24 0.61954 0.61514 0.0058914
## 14 3.2323e-03 26 0.61302 0.60495 0.0058638
## 15 3.1780e-03 29 0.60332 0.60373 0.0058605
## 16 3.0557e-03 33 0.58841 0.59884 0.0058469
## 17 3.0150e-03 37 0.57309 0.59037 0.0058229
## 18 2.9878e-03 38 0.57008 0.58336 0.0058026
## 19 2.9335e-03 41 0.56111 0.57863 0.0057887
## 20 2.8928e-03 42 0.55818 0.57668 0.0057829
## 21 2.8113e-03 45 0.54808 0.56454 0.0057461
## 22 2.7705e-03 47 0.54245 0.55590 0.0057191
## 23 2.6890e-03 49 0.53691 0.55386 0.0057126
## 24 2.6076e-03 52 0.52885 0.54319 0.0056781
## 25 2.5261e-03 53 0.52624 0.53382 0.0056470
## 26 2.4853e-03 55 0.52119 0.52893 0.0056304
## 27 2.4446e-03 62 0.49756 0.52306 0.0056103
## 28 2.3631e-03 63 0.49511 0.51263 0.0055736
## 29 2.2816e-03 73 0.46537 0.50301 0.0055389
## 30 2.2545e-03 78 0.45396 0.50196 0.0055351
## 31 2.2409e-03 82 0.44492 0.49641 0.0055146
## 32 2.2001e-03 85 0.43783 0.49641 0.0055146
## 33 2.1186e-03 88 0.42911 0.49047 0.0054923
## 34 1.9964e-03 91 0.42275 0.47906 0.0054485
## 35 1.9557e-03 93 0.41876 0.47531 0.0054338
## 36 1.9149e-03 97 0.41094 0.47376 0.0054277
## 37 1.8742e-03 99 0.40711 0.46855 0.0054070
## 38 1.7927e-03 103 0.39920 0.46545 0.0053945
## 39 1.7384e-03 104 0.39741 0.45681 0.0053593
## 40 1.7112e-03 109 0.38649 0.45331 0.0053447
## 41 1.6705e-03 114 0.37793 0.44524 0.0053108
## 42 1.6297e-03 116 0.37459 0.44043 0.0052902
## 43 1.5890e-03 119 0.36970 0.43530 0.0052679
## 44 1.5482e-03 121 0.36653 0.42699 0.0052312
## 45 1.4668e-03 122 0.36498 0.42080 0.0052033
## 46 1.4260e-03 127 0.35764 0.41387 0.0051716
## 47 1.3853e-03 129 0.35479 0.41354 0.0051701
## 48 1.2223e-03 133 0.34925 0.40238 0.0051178
## 49 1.1951e-03 139 0.34192 0.38828 0.0050495
## 50 1.1408e-03 142 0.33833 0.38095 0.0050129
## 51 1.0593e-03 145 0.33491 0.37744 0.0049952
## 52 9.7784e-04 147 0.33279 0.37036 0.0049589
## 53 8.9635e-04 151 0.32888 0.36685 0.0049407
## 54 8.5561e-04 168 0.31274 0.36294 0.0049201
## 55 8.1486e-04 175 0.30525 0.36278 0.0049193
## 56 7.3338e-04 176 0.30443 0.35927 0.0049007
## 57 6.9263e-04 178 0.30297 0.35528 0.0048793
## 58 6.5189e-04 182 0.29979 0.35292 0.0048666
## 59 5.7040e-04 186 0.29718 0.34998 0.0048506
## 60 5.2966e-04 190 0.29490 0.34958 0.0048484
## 61 4.8892e-04 192 0.29384 0.34892 0.0048448
## 62 4.4817e-04 207 0.28471 0.34868 0.0048435
## 63 4.3459e-04 209 0.28382 0.34762 0.0048377
## 64 4.0743e-04 212 0.28251 0.34754 0.0048372
## 65 3.6669e-04 217 0.28048 0.34672 0.0048327
## 66 2.4446e-04 219 0.27974 0.34306 0.0048124
## 67 2.0372e-04 220 0.27950 0.34208 0.0048070
## 68 1.9013e-04 224 0.27836 0.34249 0.0048093
## 69 1.6297e-04 227 0.27779 0.34224 0.0048079
## 70 8.1486e-05 228 0.27762 0.34192 0.0048061
## 71 5.4324e-05 230 0.27746 0.34241 0.0048088
## 72 4.0743e-05 233 0.27730 0.34241 0.0048088
## 73 0.0000e+00 235 0.27722 0.34265 0.0048102
plotcp(m1)
Pruning the Tree
# cp value chosen from the cp table above
ptree <- prune(m1, cp = 0.00029)
printcp(ptree)
##
## Classification tree:
## rpart(formula = TARGET ~ ., data = cart.train.over[, -c(1, 11,
## 41, 42, 43)], method = "class", control = r.ctrl)
##
## Variables actually used in tree construction:
## [1] ACC_TYPE AGE
## [3] AGE_BKT AMT_ATM_DR
## [5] AMT_BR_CSH_WDL_DR AMT_CHQ_DR
## [7] AMT_L_DR AMT_MOB_DR
## [9] AMT_NET_DR AVG_AMT_PER_ATM_TXN
## [11] AVG_AMT_PER_CHQ_TXN AVG_AMT_PER_CSH_WDL_TXN
## [13] AVG_AMT_PER_NET_TXN BALANCE
## [15] FLG_HAS_ANY_CHGS FLG_HAS_CC
## [17] FLG_HAS_NOMINEE FLG_HAS_OLD_LOAN
## [19] GENDER HOLDING_PERIOD
## [21] LEN_OF_RLTN_IN_MNTH NO_OF_ATM_DR_TXNS
## [23] NO_OF_BR_CSH_WDL_DR_TXNS NO_OF_CHQ_DR_TXNS
## [25] NO_OF_IW_CHQ_BNC_TXNS NO_OF_L_CR_TXNS
## [27] NO_OF_L_DR_TXNS NO_OF_NET_DR_TXNS
## [29] NO_OF_OW_CHQ_BNC_TXNS OCCUPATION
## [31] SCR TOT_NO_OF_L_TXNS
##
## Root node error: 12272/24544 = 0.5
##
## n= 24544
##
## CP nsplit rel error xerror xstd
## 1 0.18106258 0 1.00000 1.02591 0.0063809
## 2 0.02623859 1 0.81894 0.83572 0.0062963
## 3 0.01988266 2 0.79270 0.80256 0.0062574
## 4 0.01317362 3 0.77282 0.77705 0.0062224
## 5 0.01136734 6 0.73330 0.74389 0.0061701
## 6 0.01030802 8 0.71056 0.72849 0.0061433
## 7 0.00823012 10 0.68994 0.69712 0.0060832
## 8 0.00647816 11 0.68171 0.68970 0.0060680
## 9 0.00619296 13 0.66876 0.68326 0.0060544
## 10 0.00453607 14 0.66257 0.67031 0.0060261
## 11 0.00423729 17 0.64896 0.65271 0.0059857
## 12 0.00350391 19 0.64048 0.63722 0.0059482
## 13 0.00325945 24 0.61954 0.61514 0.0058914
## 14 0.00323229 26 0.61302 0.60495 0.0058638
## 15 0.00317797 29 0.60332 0.60373 0.0058605
## 16 0.00305574 33 0.58841 0.59884 0.0058469
## 17 0.00301499 37 0.57309 0.59037 0.0058229
## 18 0.00298783 38 0.57008 0.58336 0.0058026
## 19 0.00293351 41 0.56111 0.57863 0.0057887
## 20 0.00289276 42 0.55818 0.57668 0.0057829
## 21 0.00281128 45 0.54808 0.56454 0.0057461
## 22 0.00277053 47 0.54245 0.55590 0.0057191
## 23 0.00268905 49 0.53691 0.55386 0.0057126
## 24 0.00260756 52 0.52885 0.54319 0.0056781
## 25 0.00252608 53 0.52624 0.53382 0.0056470
## 26 0.00248533 55 0.52119 0.52893 0.0056304
## 27 0.00244459 62 0.49756 0.52306 0.0056103
## 28 0.00236310 63 0.49511 0.51263 0.0055736
## 29 0.00228162 73 0.46537 0.50301 0.0055389
## 30 0.00225445 78 0.45396 0.50196 0.0055351
## 31 0.00224087 82 0.44492 0.49641 0.0055146
## 32 0.00220013 85 0.43783 0.49641 0.0055146
## 33 0.00211864 88 0.42911 0.49047 0.0054923
## 34 0.00199641 91 0.42275 0.47906 0.0054485
## 35 0.00195567 93 0.41876 0.47531 0.0054338
## 36 0.00191493 97 0.41094 0.47376 0.0054277
## 37 0.00187419 99 0.40711 0.46855 0.0054070
## 38 0.00179270 103 0.39920 0.46545 0.0053945
## 39 0.00173837 104 0.39741 0.45681 0.0053593
## 40 0.00171121 109 0.38649 0.45331 0.0053447
## 41 0.00167047 114 0.37793 0.44524 0.0053108
## 42 0.00162973 116 0.37459 0.44043 0.0052902
## 43 0.00158898 119 0.36970 0.43530 0.0052679
## 44 0.00154824 121 0.36653 0.42699 0.0052312
## 45 0.00146675 122 0.36498 0.42080 0.0052033
## 46 0.00142601 127 0.35764 0.41387 0.0051716
## 47 0.00138527 129 0.35479 0.41354 0.0051701
## 48 0.00122229 133 0.34925 0.40238 0.0051178
## 49 0.00119513 139 0.34192 0.38828 0.0050495
## 50 0.00114081 142 0.33833 0.38095 0.0050129
## 51 0.00105932 145 0.33491 0.37744 0.0049952
## 52 0.00097784 147 0.33279 0.37036 0.0049589
## 53 0.00089635 151 0.32888 0.36685 0.0049407
## 54 0.00085561 168 0.31274 0.36294 0.0049201
## 55 0.00081486 175 0.30525 0.36278 0.0049193
## 56 0.00073338 176 0.30443 0.35927 0.0049007
## 57 0.00069263 178 0.30297 0.35528 0.0048793
## 58 0.00065189 182 0.29979 0.35292 0.0048666
## 59 0.00057040 186 0.29718 0.34998 0.0048506
## 60 0.00052966 190 0.29490 0.34958 0.0048484
## 61 0.00048892 192 0.29384 0.34892 0.0048448
## 62 0.00044817 207 0.28471 0.34868 0.0048435
## 63 0.00043459 209 0.28382 0.34762 0.0048377
## 64 0.00040743 212 0.28251 0.34754 0.0048372
## 65 0.00036669 217 0.28048 0.34672 0.0048327
## 66 0.00029000 219 0.27974 0.34306 0.0048124
fancyRpartPlot(ptree,
uniform = TRUE,
main = "Final Tree",
palettes = c("Blues", "Oranges"))
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
Measure CART Model Performance (Oversampled Training Dataset)
cart.train.over$predict.class <- predict(ptree, cart.train.over, type = "class")
cart.train.over$predict.score <- predict(ptree, cart.train.over, type = "prob")
#head(cart.train.over)
Deciling Code
decile <- function(x)
{
deciles <- vector(length=10)
for (i in seq(0.1,1,.1))
{
deciles[i*10] <- quantile(x, i, na.rm=T)
}
return (
ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10
))))))))))
}
#class(cart.train.over$predict.score)
## deciling
cart.train.over$deciles <- decile(cart.train.over$predict.score[,2])
#View(cart.train.over)
Ranking Code
tmp_DT = data.table(cart.train.over)
rank <- tmp_DT[, list(cnt=length(TARGET),
cnt_resp=sum(TARGET==1),
cnt_non_resp=sum(TARGET==0)
), by=deciles][order(-deciles)]
rank$rrate <- round(rank$cnt_resp / rank$cnt,4);
rank$cum_resp <- cumsum(rank$cnt_resp)
rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),4);
rank$cum_rel_non_resp <- round(rank$cum_non_resp / sum(rank$cnt_non_resp),4);
rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100;
rank$rrate <- percent(rank$rrate)
rank$cum_rel_resp <- percent(rank$cum_rel_resp)
rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
rank
## deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp
## 1: 10 2463 2318 145 94.1% 2318 145
## 2: 9 2644 2384 260 90.2% 4702 405
## 3: 8 2346 2009 337 85.6% 6711 742
## 4: 7 2530 2085 445 82.4% 8796 1187
## 5: 6 2334 1750 584 75.0% 10546 1771
## 6: 5 2413 976 1437 40.4% 11522 3208
## 7: 4 2668 451 2217 16.9% 11973 5425
## 8: 3 2297 218 2079 9.5% 12191 7504
## 9: 2 2489 81 2408 3.2% 12272 9912
## 10: 1 2360 0 2360 0.0% 12272 12272
## cum_rel_resp cum_rel_non_resp ks
## 1: 18.9% 1.2% 17.71
## 2: 38.3% 3.3% 35.01
## 3: 54.7% 6.0% 48.64
## 4: 71.7% 9.7% 62.01
## 5: 85.9% 14.4% 71.51
## 6: 93.9% 26.1% 67.75
## 7: 97.6% 44.2% 53.35
## 8: 99.3% 61.2% 38.19
## 9: 100.0% 80.8% 19.23
## 10: 100.0% 100.0% 0.00
Interpretation from the rank order table:
The response rate in the top five deciles is 75% or higher.
The maximum KS is above 71%, indicating a good model.
KS and Area under Curve
pred <- prediction(cart.train.over$predict.score[,2], cart.train.over$TARGET)
perf <- performance(pred, "tpr", "fpr")
KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
auc <- performance(pred,"auc");
auc <- as.numeric(auc@y.values)
gini = ineq(cart.train.over$predict.score[,2], type="Gini")
with(cart.train.over, table(TARGET, predict.class))
## predict.class
## TARGET 0 1
## 0 10235 2037
## 1 1396 10876
plot(perf)
KS
## [1] 0.7202575
auc
## [1] 0.9162399
gini
## [1] 0.4162399
Summary - CART Model Performance (Oversampled Training Dataset):
The KS = 72.03% and the AUC = 91.62%, indicating a good model.
The Gini Coefficient = 41.62% also indicates a good model, with scope for improvement.
Confusion matrix:
1. Accuracy = (10235 + 10876)/(10235 + 1396 + 2037 + 10876) = 86.01%
2. Classification Error Rate = 1 - Accuracy = 13.99%
#Syntax to get the node path
#tree.path<-path.rpart(ptree,node =c(2,12))
#nrow(cart.test)
Scoring Holdout Sample
cart.test$predict.class <- predict(ptree, cart.test, type = "class")
cart.test$predict.score <- predict(ptree, cart.test, type = "prob")
cart.test$deciles <- decile(cart.test$predict.score[,2])
#View(cart.test)
Ranking Code
tmp_DT = data.table(cart.test)
h_rank <- tmp_DT[, list(cnt=length(TARGET),
cnt_resp=sum(TARGET==1),
cnt_non_resp=sum(TARGET==0)
), by=deciles][order(-deciles)]
h_rank$rrate <- round(h_rank$cnt_resp / h_rank$cnt,4);
h_rank$cum_resp <- cumsum(h_rank$cnt_resp)
h_rank$cum_non_resp <- cumsum(h_rank$cnt_non_resp)
h_rank$cum_rel_resp <- round(h_rank$cum_resp / sum(h_rank$cnt_resp),4);
h_rank$cum_rel_non_resp <- round(h_rank$cum_non_resp / sum(h_rank$cnt_non_resp),4);
h_rank$ks <- abs(h_rank$cum_rel_resp - h_rank$cum_rel_non_resp) * 100;
h_rank$rrate <- percent(h_rank$rrate)
h_rank$cum_rel_resp <- percent(h_rank$cum_rel_resp)
h_rank$cum_rel_non_resp <- percent(h_rank$cum_rel_non_resp)
h_rank
## deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp
## 1: 10 637 270 367 42.4% 270 367
## 2: 9 605 199 406 32.9% 469 773
## 3: 8 573 108 465 18.8% 577 1238
## 4: 7 621 66 555 10.6% 643 1793
## 5: 6 664 33 631 5.0% 676 2424
## 6: 5 519 41 478 7.9% 717 2902
## 7: 4 759 26 733 3.4% 743 3635
## 8: 3 563 9 554 1.6% 752 4189
## 9: 2 1059 32 1027 3.0% 784 5216
## cum_rel_resp cum_rel_non_resp ks
## 1: 34.4% 7.0% 27.40
## 2: 59.8% 14.8% 45.00
## 3: 73.6% 23.7% 49.87
## 4: 82.0% 34.4% 47.64
## 5: 86.2% 46.5% 39.75
## 6: 91.4% 55.6% 35.81
## 7: 94.8% 69.7% 25.08
## 8: 95.9% 80.3% 15.61
## 9: 100.0% 100.0% 0.00
Interpretation from the rank order table: The response rate in the top three deciles is 42.4%, 32.9% and 18.8% respectively.
Within the top three deciles, the KS peaks at about 49.9%, indicating a good model.
KS and Area under Curve
pred <- prediction(cart.test$predict.score[,2], cart.test$TARGET)
perf <- performance(pred, "tpr", "fpr")
KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
auc <- performance(pred,"auc");
auc <- as.numeric(auc@y.values)
gini = ineq(cart.test$predict.score[,2], type="Gini")
with(cart.test, table(TARGET, predict.class))
## predict.class
## TARGET 0 1
## 0 4228 988
## 1 255 529
plot(perf)
KS
## [1] 0.5004147
auc
## [1] 0.8027087
gini
## [1] 0.5933811
Summary - CART Model Performance (Test Dataset):
The KS = 50.04% and the AUC = 80.27%, indicating a good model.
The Gini Coefficient = 59.34% also indicates a good model, with scope for improvement.
Confusion matrix:
1. Accuracy = (4228 + 529)/(4228 + 255 + 988 + 529) = 79.28%
2. Classification Error Rate = 1 - Accuracy = 20.72%
Comparative Summary of the CART Model on the Training and Testing Datasets:
Measures | Train | Test | %Deviation |
---|---|---|---|
KS | 72.03 | 50.04 | 31% |
AUC | 91.62 | 80.27 | 12% |
Gini | 41.62 | 59.34 | -43% |
Accuracy | 86.01 | 79.28 | 8% |
CeR | 13.99 | 20.72 | -48% |
CeR = Classification Error Rate
Most of the model performance measures deviate between the training and testing sets by more than the maximum tolerance of +/- 10%. Hence, the model is over-fitting.
That said, the absolute values of the performance measures still indicate reasonable predictive capability for the developed CART model.
Random Forest is a supervised classification algorithm. As the name suggests, it builds a forest of many randomly grown trees. In general, the more trees in the forest, the more robust it is; in a random forest classifier, a higher number of trees tends to yield more accurate results.
Some advantages of using Random Forest are as follows:
> The same random forest algorithm can be used for both classification and regression tasks.
> The random forest classifier can handle missing values.
> With more trees in the forest, the random forest classifier is unlikely to over-fit the model.
> The random forest classifier can also model categorical variables.
Creating Training and Testing Dataset for RF Model
PL_X_SELL = read.table("D:/Analytics/BACP-Dec2017/10_MachineLearning/PL_X_SELL.csv", sep = ",", header = T)
attach(PL_X_SELL)
set.seed(111)
trainIndex <- createDataPartition(TARGET,
p=0.7,
list = FALSE,
times = 1)
rf.train <- PL_X_SELL[trainIndex, ]
rf.test <- PL_X_SELL[-trainIndex,]
dim(rf.train)
## [1] 14000 40
dim(rf.test)
## [1] 6000 40
#names(rf.train)
#str(rf.train)
#install.packages("randomForest")
library(randomForest)
Random Forest Model - Train Dataset
The model is built with TARGET as the dependent variable, using all independent variables except Customer ID and Account Opening Date.
RF=randomForest(as.factor(TARGET)~.,
data = rf.train[,-c(1,11)],
ntree = 501, mtry = 3, nodesize = 10,
importance=TRUE)
print(RF)
##
## Call:
## randomForest(formula = as.factor(TARGET) ~ ., data = rf.train[, -c(1, 11)], ntree = 501, mtry = 3, nodesize = 10, importance = TRUE)
## Type of random forest: classification
## Number of trees: 501
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 7.75%
## Confusion matrix:
## 0 1 class.error
## 0 12271 1 8.148631e-05
## 1 1084 644 6.273148e-01
Out of Bag Error Rate:
The Random Forest algorithm is a classifier based primarily on two methods: bagging and the random subspace method.
Suppose we decide to have S trees in our forest. We first create S datasets of the same size as the original by randomly resampling the data with replacement. Each of these datasets is called a bootstrap dataset.
Due to sampling with replacement, every bootstrap dataset can contain duplicate records and, at the same time, can be missing several records from the original dataset. This is called bagging.
The algorithm uses m (= sqrt(M)) random sub-features out of the M possible features to create each tree. This is called the random subspace method.
After creating the S classifier trees, for each record there is a subset of trees whose bootstrap datasets do not contain that record; these are the out-of-bag examples for that record. There are n such subsets (one for each record in the original dataset T). The OOB classifier is the aggregation of predictions over all such trees.
The out-of-bag estimate of the generalization error is the error rate of the out-of-bag classifier on the training set (compared against the known yi's).
Out-of-bag (OOB) error, also called the out-of-bag estimate, is a method of measuring the prediction error of random forests, boosted decision trees, and other machine learning models that use bootstrap aggregation to subsample the training data.
Out-of-bag estimates avoid the need for an independent validation dataset.
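Because every record has trees that never saw it, the OOB error can be verified directly from the fitted forest; a sketch (predict() on a randomForest object with no newdata returns the OOB predictions):
# OOB predictions: each record is predicted only by trees that did not train on it
oob.pred <- predict(RF)
mean(oob.pred != as.factor(rf.train$TARGET)) # should match the printed OOB error (~7.75%)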
The graphical output for the OOB estimate of error rate.
#dev.off()
plot(RF, main="")
legend("topright", c("OOB", "0", "1"), text.col=1:6, lty=1:3, col=1:3)
title(main="Error Rates Random Forest PL_X_SELL Training data")
The output in tabular form for the OOB estimate of error rate.
#RF$err.rate
It is observed that as the number of trees increases, the OOB error rate decreases until around the 155th tree, where OOB = 0.0757 (the minimum value). Beyond this point the OOB error does not decrease further and remains largely steady. Hence, the optimal number of trees would be around 155.
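The tree count at which the OOB error bottoms out can be read off err.rate programmatically; a sketch (err.rate holds one row per tree, with columns OOB, 0 and 1):
# Number of trees at the minimum cumulative OOB error
which.min(RF$err.rate[, "OOB"]) # ~155, per the observation above
min(RF$err.rate[, "OOB"]) # ~0.0757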
Variable Importance
To understand the important variables in Random Forest, the following measures are generally used:
> Mean Decrease in Accuracy is based on permutation:
>> Randomly permute the values of the variable whose importance is to be computed in the OOB sample
>> Compute the error rate with the permuted values
>> Compute the decrease in OOB error rate (permuted - not permuted)
>> Average the decrease over all the trees
> Mean Decrease in Gini is computed as the "total decrease in node impurities from splitting on the variable, averaged over all trees"
# List the importance of the variables
impVar <- round(randomForest::importance(RF), 2)
impVar[order(impVar[,3], decreasing=TRUE),]
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## LEN_OF_RLTN_IN_MNTH 34.95 51.92 50.20 98.80
## AGE_BKT 35.40 47.16 46.81 88.60
## OCCUPATION 38.57 43.34 45.44 66.11
## BALANCE 35.40 44.26 43.18 124.62
## SCR 34.33 43.85 42.03 116.92
## AGE 31.46 39.66 41.08 73.73
## AMT_L_DR 32.93 31.95 39.27 105.82
## NO_OF_L_CR_TXNS 32.00 28.38 37.55 93.59
## HOLDING_PERIOD 29.40 38.55 36.72 98.03
## FLG_HAS_CC 30.26 34.22 34.61 30.22
## TOT_NO_OF_L_TXNS 28.20 22.28 33.42 95.35
## AMT_BR_CSH_WDL_DR 26.38 32.95 32.48 86.82
## AVG_AMT_PER_CSH_WDL_TXN 25.01 33.86 31.92 86.31
## AMT_ATM_DR 23.60 22.21 28.24 82.45
## AVG_AMT_PER_CHQ_TXN 22.65 21.94 27.10 75.33
## NO_OF_BR_CSH_WDL_DR_TXNS 19.35 28.20 24.86 44.06
## FLG_HAS_OLD_LOAN 17.95 24.99 24.86 13.71
## AVG_AMT_PER_ATM_TXN 20.31 23.13 24.75 80.67
## AMT_CHQ_DR 21.10 24.95 24.52 75.79
## NO_OF_L_DR_TXNS 21.75 23.99 24.05 58.82
## GENDER 20.35 21.13 22.56 21.71
## AVG_AMT_PER_NET_TXN 19.03 20.92 22.51 61.53
## AMT_NET_DR 18.38 22.15 21.65 61.92
## NO_OF_CHQ_DR_TXNS 18.40 21.46 20.66 36.37
## FLG_HAS_ANY_CHGS 14.98 21.39 20.40 12.47
## AMT_MOB_DR 15.95 14.57 20.13 39.82
## FLG_HAS_NOMINEE 12.28 19.08 19.20 9.88
## AVG_AMT_PER_MOB_TXN 14.34 15.96 18.22 37.73
## NO_OF_ATM_DR_TXNS 17.11 6.50 17.80 26.76
## NO_OF_IW_CHQ_BNC_TXNS 7.26 17.83 16.71 7.67
## NO_OF_NET_DR_TXNS 14.04 18.44 16.41 21.32
## ACC_TYPE 14.21 13.34 16.30 12.67
## NO_OF_OW_CHQ_BNC_TXNS 8.96 17.20 15.91 8.38
## AMT_MIN_BAL_NMC_CHGS 5.50 9.75 10.03 2.79
## NO_OF_MOB_DR_TXNS 7.75 8.93 8.79 9.70
## AMT_OTH_BK_ATM_USG_CHGS 2.00 3.07 3.69 1.29
## random -0.86 -0.48 -0.99 58.39
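The same two importance measures can also be visualised with randomForest's built-in helper; a sketch:
# Dotcharts of MeanDecreaseAccuracy and MeanDecreaseGini for the top variables
randomForest::varImpPlot(RF, sort = TRUE, n.var = 15,
                         main = "Variable Importance - RF")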
Optimal mtry value
In random forests, the number of variables available for splitting at each tree node is referred to as the mtry parameter. The optimal number of variables is obtained using the tuneRF function, whose main arguments are:
x = predictor variables
y = target variable
mtryStart = starting value of mtry
ntreeTry = number of trees used at the tuning step
stepFactor = factor by which mtry is inflated (or deflated) at each step
improve = the relative improvement in OOB error must be at least this much for the search to continue
trace = whether to print the search progress
plot = whether to plot the OOB error vs mtry
doBest = whether to finally build the RF using the optimal mtry
nodesize = minimum terminal node size
importance = whether to compute variable importance
#Tuning Random Forest
tRF<- tuneRF(x = rf.train[,-c(1,2,11)],
y=as.factor(rf.train$TARGET),
mtryStart = 6, # approx. sqrt of total no. of variables
ntreeTry = 155,
stepFactor = 1.5,
improve = 0.0001,
trace = TRUE,
plot = TRUE,
doBest = TRUE,
nodesize = 10,
importance = TRUE
)
## mtry = 6 OOB error = 6.44%
## Searching left ...
## mtry = 4 OOB error = 7.1%
## -0.1032186 1e-04
## Searching right ...
## mtry = 9 OOB error = 6.14%
## 0.04661487 1e-04
## mtry = 13 OOB error = 5.9%
## 0.03841676 1e-04
## mtry = 19 OOB error = 5.55%
## 0.05932203 1e-04
## mtry = 28 OOB error = 5.54%
## 0.002574003 1e-04
## mtry = 37 OOB error = 5.36%
## 0.03225806 1e-04
As can be seen from the trace, the OOB error is lowest (5.36%) at mtry = 37; since doBest = TRUE, the returned forest is built with this optimal mtry.
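Since doBest = TRUE, the object returned by tuneRF is itself a fitted randomForest using the best mtry; printing it confirms the settings actually used (output not reproduced here):
# tRF is a randomForest object refit with the optimal mtry
print(tRF)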
Scoring
rf.train$predict.class <- predict(tRF, rf.train, type = "class")
rf.train$predict.score <- predict(tRF, rf.train, type = "prob")
#head(rf.train)
#class(rf.train$predict.score)
Deciling code
decile <- function(x)
{
deciles <- vector(length=10)
for (i in seq(0.1,1,.1)){
deciles[i*10] <- quantile(x, i, na.rm=T)
}
return (
ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10
))))))))))
}
rf.train$deciles <- decile(rf.train$predict.score[,2])
Rank order table
library(data.table)
tmp_DT = data.table(rf.train)
rank <- tmp_DT[, list(
cnt = length(TARGET),
cnt_resp = sum(TARGET==1),
cnt_non_resp = sum(TARGET == 0)) ,
by=deciles][order(-deciles)]
rank$rrate <- round (rank$cnt_resp / rank$cnt,2);
rank$cum_resp <- cumsum(rank$cnt_resp)
rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),2);
rank$cum_rel_non_resp <- round(rank$cum_non_resp / sum(rank$cnt_non_resp),2);
rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp); library(scales)
rank$rrate <- percent(rank$rrate)
rank$cum_rel_resp <- percent(rank$cum_rel_resp)
rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
rank
## deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp
## 1: 10 1404 1404 0 100% 1404 0
## 2: 9 1435 324 1111 23% 1728 1111
## 3: 8 1431 0 1431 0% 1728 2542
## 4: 7 1402 0 1402 0% 1728 3944
## 5: 6 1446 0 1446 0% 1728 5390
## 6: 5 1522 0 1522 0% 1728 6912
## 7: 4 1493 0 1493 0% 1728 8405
## 8: 3 1495 0 1495 0% 1728 9900
## 9: 2 1327 0 1327 0% 1728 11227
## 10: 1 1045 0 1045 0% 1728 12272
## cum_rel_resp cum_rel_non_resp ks
## 1: 81% 0% 0.81
## 2: 100% 9% 0.91
## 3: 100% 21% 0.79
## 4: 100% 32% 0.68
## 5: 100% 44% 0.56
## 6: 100% 56% 0.44
## 7: 100% 68% 0.32
## 8: 100% 81% 0.19
## 9: 100% 91% 0.09
## 10: 100% 100% 0.00
# Baseline Response Rate
sum(rf.train$TARGET) / nrow(rf.train)
## [1] 0.1234286
Interpretation:
> The baseline response rate is 12.34%, whereas the response rate in the top two deciles is 100% and 23% respectively.
> Within the top two deciles, the KS peaks at 91%, indicating a good model.
KS and Area under Curve
library(ROCR)
pred <- prediction(rf.train$predict.score[,2], rf.train$TARGET)
perf <- performance(pred, "tpr", "fpr")
#plot(perf)
KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
KS
## [1] 0.9975471
# Area Under Curve
auc <- performance(pred,"auc");
auc <- as.numeric(auc@y.values)
auc
## [1] 0.9999683
# Gini Coefficient
library(ineq)
gini = ineq(rf.train$predict.score[,2], type="Gini")
gini
## [1] 0.7485589
# Classification Error
with(rf.train, table(TARGET, predict.class))
## predict.class
## TARGET 0 1
## 0 12270 2
## 1 171 1557
KS = 99.75%, AUC = 99.99%, Gini = 74.85%, Accuracy = 98.76%, CeR = 1.24%
Model Performance on Testing Data Set
# Rank Order
# Scoring syntax
rf.test$predict.class <- predict(tRF, rf.test, type="class")
rf.test$predict.score <- predict(tRF, rf.test, type="prob")
rf.test$deciles <- decile(rf.test$predict.score[,2])
tmp_DT = data.table(rf.test)
h_rank <- tmp_DT[, list(
cnt = length(TARGET),
cnt_resp = sum(TARGET),
cnt_non_resp = sum(TARGET == 0)) ,
by=deciles][order(-deciles)]
h_rank$rrate <- round (h_rank$cnt_resp / h_rank$cnt,2);
h_rank$cum_resp <- cumsum(h_rank$cnt_resp)
h_rank$cum_non_resp <- cumsum(h_rank$cnt_non_resp)
h_rank$cum_rel_resp <- round(h_rank$cum_resp / sum(h_rank$cnt_resp),2);
h_rank$cum_rel_non_resp <- round(h_rank$cum_non_resp / sum(h_rank$cnt_non_resp),2);
h_rank$ks <- abs(h_rank$cum_rel_resp - h_rank$cum_rel_non_resp);
library(scales)
h_rank$rrate <- percent(h_rank$rrate)
h_rank$cum_rel_resp <- percent(h_rank$cum_rel_resp)
h_rank$cum_rel_non_resp <- percent(h_rank$cum_rel_non_resp)
h_rank
## deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp
## 1: 10 602 542 60 90% 542 60
## 2: 9 607 139 468 23% 681 528
## 3: 8 591 51 540 9% 732 1068
## 4: 7 627 27 600 4% 759 1668
## 5: 6 592 10 582 2% 769 2250
## 6: 5 592 12 580 2% 781 2830
## 7: 4 657 1 656 0% 782 3486
## 8: 3 647 0 647 0% 782 4133
## 9: 2 616 0 616 0% 782 4749
## 10: 1 469 2 467 0% 784 5216
## cum_rel_resp cum_rel_non_resp ks
## 1: 69% 1% 0.68
## 2: 87% 10% 0.77
## 3: 93% 20% 0.73
## 4: 97% 32% 0.65
## 5: 98% 43% 0.55
## 6: 100% 54% 0.46
## 7: 100% 67% 0.33
## 8: 100% 79% 0.21
## 9: 100% 91% 0.09
## 10: 100% 100% 0.00
Interpretation:
> The baseline response rate is 12.34%, whereas the response rate in the top three deciles is 90%, 23% and 9% respectively.
> Within the top two deciles, the KS peaks at 77%, indicating a good model.
KS and AUC
pred1 <- prediction(rf.test$predict.score[,2], rf.test$TARGET)
perf1 <- performance(pred1, "tpr", "fpr")
plot(perf1)
KS1 <- max(attr(perf1, 'y.values')[[1]]-attr(perf1, 'x.values')[[1]])
KS1
## [1] 0.774446
# Area Under Curve
auc1 <- performance(pred1,"auc");
auc1 <- as.numeric(auc1@y.values)
auc1
## [1] 0.9584023
# Gini Coefficient
library(ineq)
gini1 = ineq(rf.test$predict.score[,2], type="Gini")
gini1
## [1] 0.6463728
# Classification Error
with(rf.test, table(TARGET, predict.class))
## predict.class
## TARGET 0 1
## 0 5211 5
## 1 343 441
KS = 77.44%, AUC = 95.84%, Gini = 64.63%, Accuracy = 94.20%, CeR = 5.80%
Random Forest Conclusion
Measures | Train | Test | %Deviation |
---|---|---|---|
KS | 99.75% | 77.44% | 22% |
AUC | 99.99% | 95.84% | 04% |
Gini | 74.85% | 64.63% | 14% |
Accuracy | 98.76% | 94.20% | 05% |
CeR | 01.24% | 05.80% | -368% |
Artificial neural networks (ANNs) are statistical models directly inspired by, and partially modeled on, biological neural networks. They are capable of modeling and processing nonlinear relationships between inputs and outputs in parallel.
Artificial neural networks are characterized by adaptive weights along the paths between neurons, which are tuned by a learning algorithm that learns from observed data in order to improve the model. In addition to the learning algorithm itself, one must choose an appropriate cost function.
The cost function is used to learn the optimal solution to the problem being solved. This involves determining the best values for all of the tunable model parameters, with the adaptive weights on neuron paths being the primary target, along with algorithm tuning parameters such as the learning rate. The optimization is usually done through techniques such as gradient descent or stochastic gradient descent.
These optimization techniques try to bring the ANN solution as close as possible to the optimal solution; when this succeeds, the ANN is able to solve the intended problem with high performance.
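As a toy illustration of how such an optimizer tunes a weight (a hedged sketch, not part of the original analysis: one weight, one observation, squared-error cost):
# Minimise J(w) = (yi - w*xi)^2 for a single observation by gradient descent
xi <- 2; yi <- 3
w <- 0     # initial weight
lr <- 0.05 # learning rate
for (i in 1:50) {
  grad <- -2 * xi * (yi - w * xi) # dJ/dw
  w <- w - lr * grad              # step against the gradient
}
w # converges towards the optimum yi/xi = 1.5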
NNInput = read.table("D:/Analytics/BACP-Dec2017/10_MachineLearning/PL_X_SELL.csv", sep = ",", header = T)
attach(NNInput)
## The following objects are masked from PL_X_SELL (pos = 3):
##
## ACC_OP_DATE, ACC_TYPE, AGE, AGE_BKT, AMT_ATM_DR,
## AMT_BR_CSH_WDL_DR, AMT_CHQ_DR, AMT_L_DR, AMT_MIN_BAL_NMC_CHGS,
## AMT_MOB_DR, AMT_NET_DR, AMT_OTH_BK_ATM_USG_CHGS,
## AVG_AMT_PER_ATM_TXN, AVG_AMT_PER_CHQ_TXN,
## AVG_AMT_PER_CSH_WDL_TXN, AVG_AMT_PER_MOB_TXN,
## AVG_AMT_PER_NET_TXN, BALANCE, CUST_ID, FLG_HAS_ANY_CHGS,
## FLG_HAS_CC, FLG_HAS_NOMINEE, FLG_HAS_OLD_LOAN, GENDER,
## HOLDING_PERIOD, LEN_OF_RLTN_IN_MNTH, NO_OF_ATM_DR_TXNS,
## NO_OF_BR_CSH_WDL_DR_TXNS, NO_OF_CHQ_DR_TXNS,
## NO_OF_IW_CHQ_BNC_TXNS, NO_OF_L_CR_TXNS, NO_OF_L_DR_TXNS,
## NO_OF_MOB_DR_TXNS, NO_OF_NET_DR_TXNS, NO_OF_OW_CHQ_BNC_TXNS,
## OCCUPATION, random, SCR, TARGET, TOT_NO_OF_L_TXNS
## The following objects are masked from PL_X_SELL (pos = 13):
##
## ACC_OP_DATE, ACC_TYPE, AGE, AGE_BKT, AMT_ATM_DR,
## AMT_BR_CSH_WDL_DR, AMT_CHQ_DR, AMT_L_DR, AMT_MIN_BAL_NMC_CHGS,
## AMT_MOB_DR, AMT_NET_DR, AMT_OTH_BK_ATM_USG_CHGS,
## AVG_AMT_PER_ATM_TXN, AVG_AMT_PER_CHQ_TXN,
## AVG_AMT_PER_CSH_WDL_TXN, AVG_AMT_PER_MOB_TXN,
## AVG_AMT_PER_NET_TXN, BALANCE, CUST_ID, FLG_HAS_ANY_CHGS,
## FLG_HAS_CC, FLG_HAS_NOMINEE, FLG_HAS_OLD_LOAN, GENDER,
## HOLDING_PERIOD, LEN_OF_RLTN_IN_MNTH, NO_OF_ATM_DR_TXNS,
## NO_OF_BR_CSH_WDL_DR_TXNS, NO_OF_CHQ_DR_TXNS,
## NO_OF_IW_CHQ_BNC_TXNS, NO_OF_L_CR_TXNS, NO_OF_L_DR_TXNS,
## NO_OF_MOB_DR_TXNS, NO_OF_NET_DR_TXNS, NO_OF_OW_CHQ_BNC_TXNS,
## OCCUPATION, random, SCR, TARGET, TOT_NO_OF_L_TXNS
dim(NNInput)
## [1] 20000 40
str(NNInput)
## 'data.frame': 20000 obs. of 40 variables:
## $ CUST_ID : Factor w/ 20000 levels "C1","C10","C100",..: 17699 16532 11027 17984 2363 11747 18115 15556 15216 12494 ...
## $ TARGET : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AGE : int 27 47 40 53 36 42 30 53 42 30 ...
## $ GENDER : Factor w/ 3 levels "F","M","O": 2 2 2 2 2 1 2 1 1 2 ...
## $ BALANCE : num 3384 287489 18217 71720 1671623 ...
## $ OCCUPATION : Factor w/ 4 levels "PROF","SAL","SELF-EMP",..: 3 2 3 2 1 1 1 2 3 1 ...
## $ AGE_BKT : Factor w/ 7 levels "<25",">50","26-30",..: 3 7 5 2 5 6 3 2 6 3 ...
## $ SCR : int 776 324 603 196 167 493 479 562 105 170 ...
## $ HOLDING_PERIOD : int 30 28 2 13 24 26 14 25 15 13 ...
## $ ACC_TYPE : Factor w/ 2 levels "CA","SA": 2 2 2 1 2 2 2 1 2 2 ...
## $ ACC_OP_DATE : Factor w/ 4869 levels "01-01-00","01-01-01",..: 3270 1806 3575 993 2861 862 4533 3160 257 334 ...
## $ LEN_OF_RLTN_IN_MNTH : int 146 104 61 107 185 192 177 99 88 111 ...
## $ NO_OF_L_CR_TXNS : int 7 8 10 36 20 5 6 14 18 14 ...
## $ NO_OF_L_DR_TXNS : int 3 2 5 14 1 2 6 3 14 8 ...
## $ TOT_NO_OF_L_TXNS : int 10 10 15 50 21 7 12 17 32 22 ...
## $ NO_OF_BR_CSH_WDL_DR_TXNS: int 0 0 1 4 1 1 0 3 6 3 ...
## $ NO_OF_ATM_DR_TXNS : int 1 1 1 2 0 1 1 0 2 1 ...
## $ NO_OF_NET_DR_TXNS : int 2 1 1 3 0 0 1 0 4 0 ...
## $ NO_OF_MOB_DR_TXNS : int 0 0 0 1 0 0 0 0 1 0 ...
## $ NO_OF_CHQ_DR_TXNS : int 0 0 2 4 0 0 4 0 1 4 ...
## $ FLG_HAS_CC : int 0 0 0 0 0 1 0 0 1 0 ...
## $ AMT_ATM_DR : int 13100 6600 11200 26100 0 18500 6200 0 35400 18000 ...
## $ AMT_BR_CSH_WDL_DR : int 0 0 561120 673590 808480 379310 0 945160 198430 869880 ...
## $ AMT_CHQ_DR : int 0 0 49320 60780 0 0 10580 0 51490 32610 ...
## $ AMT_NET_DR : num 973557 799813 997570 741506 0 ...
## $ AMT_MOB_DR : int 0 0 0 71388 0 0 0 0 170332 0 ...
## $ AMT_L_DR : num 986657 806413 1619210 1573364 808480 ...
## $ FLG_HAS_ANY_CHGS : int 0 1 1 0 0 0 1 0 0 0 ...
## $ AMT_OTH_BK_ATM_USG_CHGS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AMT_MIN_BAL_NMC_CHGS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NO_OF_IW_CHQ_BNC_TXNS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NO_OF_OW_CHQ_BNC_TXNS : int 0 0 1 0 0 0 0 0 0 0 ...
## $ AVG_AMT_PER_ATM_TXN : num 13100 6600 11200 13050 0 ...
## $ AVG_AMT_PER_CSH_WDL_TXN : num 0 0 561120 168398 808480 ...
## $ AVG_AMT_PER_CHQ_TXN : num 0 0 24660 15195 0 ...
## $ AVG_AMT_PER_NET_TXN : num 486779 799813 997570 247169 0 ...
## $ AVG_AMT_PER_MOB_TXN : num 0 0 0 71388 0 ...
## $ FLG_HAS_NOMINEE : int 1 1 1 1 1 1 0 1 1 0 ...
## $ FLG_HAS_OLD_LOAN : int 1 0 1 0 0 1 1 1 1 0 ...
## $ random : num 1.14e-05 1.11e-04 1.20e-04 1.37e-04 1.74e-04 ...
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 3.4.4
##
## Attaching package: 'neuralnet'
## The following object is masked from 'package:ROCR':
##
## prediction
Create Dummy Variables for Categorical Variables
# Gender
GEN.matrix <- model.matrix(~ GENDER - 1, data = NNInput)
NNInput <- data.frame(NNInput, GEN.matrix)
# Occupation
occ.matrix <- model.matrix(~ OCCUPATION - 1, data = NNInput)
NNInput <- data.frame(NNInput, occ.matrix)
# AGE_BKT
AGEBKT.matrix <- model.matrix(~ AGE_BKT - 1, data = NNInput)
NNInput <- data.frame(NNInput, AGEBKT.matrix)
# ACC_TYPE
ACCTYP.matrix <- model.matrix(~ ACC_TYPE - 1, data = NNInput)
NNInput <- data.frame(NNInput, ACCTYP.matrix)
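For reference, caret (already loaded) offers a more compact route to the same one-hot encoding in a single call; a sketch, left commented out so the columns created above are not duplicated (column names may differ slightly from the model.matrix versions):
# dv <- dummyVars(~ GENDER + OCCUPATION + AGE_BKT + ACC_TYPE, data = NNInput)
# NNInput <- data.frame(NNInput, predict(dv, newdata = NNInput))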
names(NNInput)
## [1] "CUST_ID" "TARGET"
## [3] "AGE" "GENDER"
## [5] "BALANCE" "OCCUPATION"
## [7] "AGE_BKT" "SCR"
## [9] "HOLDING_PERIOD" "ACC_TYPE"
## [11] "ACC_OP_DATE" "LEN_OF_RLTN_IN_MNTH"
## [13] "NO_OF_L_CR_TXNS" "NO_OF_L_DR_TXNS"
## [15] "TOT_NO_OF_L_TXNS" "NO_OF_BR_CSH_WDL_DR_TXNS"
## [17] "NO_OF_ATM_DR_TXNS" "NO_OF_NET_DR_TXNS"
## [19] "NO_OF_MOB_DR_TXNS" "NO_OF_CHQ_DR_TXNS"
## [21] "FLG_HAS_CC" "AMT_ATM_DR"
## [23] "AMT_BR_CSH_WDL_DR" "AMT_CHQ_DR"
## [25] "AMT_NET_DR" "AMT_MOB_DR"
## [27] "AMT_L_DR" "FLG_HAS_ANY_CHGS"
## [29] "AMT_OTH_BK_ATM_USG_CHGS" "AMT_MIN_BAL_NMC_CHGS"
## [31] "NO_OF_IW_CHQ_BNC_TXNS" "NO_OF_OW_CHQ_BNC_TXNS"
## [33] "AVG_AMT_PER_ATM_TXN" "AVG_AMT_PER_CSH_WDL_TXN"
## [35] "AVG_AMT_PER_CHQ_TXN" "AVG_AMT_PER_NET_TXN"
## [37] "AVG_AMT_PER_MOB_TXN" "FLG_HAS_NOMINEE"
## [39] "FLG_HAS_OLD_LOAN" "random"
## [41] "GENDERF" "GENDERM"
## [43] "GENDERO" "OCCUPATIONPROF"
## [45] "OCCUPATIONSAL" "OCCUPATIONSELF.EMP"
## [47] "OCCUPATIONSENP" "AGE_BKT.25"
## [49] "AGE_BKT.50" "AGE_BKT26.30"
## [51] "AGE_BKT31.35" "AGE_BKT36.40"
## [53] "AGE_BKT41.45" "AGE_BKT46.50"
## [55] "ACC_TYPECA" "ACC_TYPESA"
dim(NNInput)
## [1] 20000 56
Creating Training and Test Datasets
library(caret)
set.seed(111)
trainIndex <- createDataPartition(TARGET,
p = .7,
list = FALSE,
times = 1)
NN.train.data <- NNInput[trainIndex,]
NN.test.data <- NNInput[-trainIndex,]
dim(NN.train.data)
## [1] 14000 56
dim(NN.test.data)
## [1] 6000 56
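As a quick sanity check (a sketch; output not shown), the response rate should be broadly comparable across the two partitions, since createDataPartition() samples with reference to TARGET:
round(prop.table(table(NN.train.data$TARGET)), 4)
round(prop.table(table(NN.test.data$TARGET)), 4)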
Scaling the Dataset and Variables
#names(NN.train.data)
x <- subset(NN.train.data,
select = c("AGE",
"BALANCE",
"SCR",
"HOLDING_PERIOD",
"LEN_OF_RLTN_IN_MNTH",
"NO_OF_L_CR_TXNS",
"NO_OF_L_DR_TXNS",
"TOT_NO_OF_L_TXNS",
"NO_OF_BR_CSH_WDL_DR_TXNS",
"NO_OF_ATM_DR_TXNS",
"NO_OF_NET_DR_TXNS",
"NO_OF_MOB_DR_TXNS",
"NO_OF_CHQ_DR_TXNS",
"FLG_HAS_CC",
"AMT_ATM_DR",
"AMT_BR_CSH_WDL_DR",
"AMT_CHQ_DR",
"AMT_NET_DR",
"AMT_MOB_DR",
"AMT_L_DR",
"FLG_HAS_ANY_CHGS",
"AMT_OTH_BK_ATM_USG_CHGS",
"AMT_MIN_BAL_NMC_CHGS",
"NO_OF_IW_CHQ_BNC_TXNS",
"NO_OF_OW_CHQ_BNC_TXNS",
"AVG_AMT_PER_ATM_TXN",
"AVG_AMT_PER_CSH_WDL_TXN",
"AVG_AMT_PER_CHQ_TXN",
"AVG_AMT_PER_NET_TXN",
"AVG_AMT_PER_MOB_TXN",
"FLG_HAS_NOMINEE",
"FLG_HAS_OLD_LOAN",
"random",
"GENDERF",
"GENDERM",
"GENDERO",
"OCCUPATIONPROF",
"OCCUPATIONSAL",
"OCCUPATIONSELF.EMP",
"OCCUPATIONSENP",
"AGE_BKT.25",
"AGE_BKT.50",
"AGE_BKT26.30",
"AGE_BKT31.35",
"AGE_BKT36.40",
"AGE_BKT41.45",
"AGE_BKT46.50",
"ACC_TYPECA",
"ACC_TYPESA" )
)
nn.devscaled <- scale(x)
nn.devscaled <- cbind(NN.train.data[2], nn.devscaled)
#names(nn.devscaled)
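scale() standardizes each column to mean 0 and standard deviation 1 (z-scores), so that large-magnitude variables such as BALANCE or AMT_NET_DR do not dominate the weight updates. A quick sanity check on one column (a sketch; output not shown):
round(mean(nn.devscaled$AGE), 6)   # should be ~0
round(sd(nn.devscaled$AGE), 6)     # should be ~1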
Building the NN Model
nn2 <- neuralnet(formula = TARGET ~
AGE +
BALANCE +
SCR +
HOLDING_PERIOD +
LEN_OF_RLTN_IN_MNTH +
NO_OF_L_CR_TXNS +
NO_OF_L_DR_TXNS +
TOT_NO_OF_L_TXNS +
NO_OF_BR_CSH_WDL_DR_TXNS +
NO_OF_ATM_DR_TXNS +
NO_OF_NET_DR_TXNS +
NO_OF_MOB_DR_TXNS +
NO_OF_CHQ_DR_TXNS +
FLG_HAS_CC +
AMT_ATM_DR +
AMT_BR_CSH_WDL_DR +
AMT_CHQ_DR +
AMT_NET_DR +
AMT_MOB_DR +
AMT_L_DR +
FLG_HAS_ANY_CHGS +
AMT_OTH_BK_ATM_USG_CHGS +
AMT_MIN_BAL_NMC_CHGS +
NO_OF_IW_CHQ_BNC_TXNS +
NO_OF_OW_CHQ_BNC_TXNS +
AVG_AMT_PER_ATM_TXN +
AVG_AMT_PER_CSH_WDL_TXN +
AVG_AMT_PER_CHQ_TXN +
AVG_AMT_PER_NET_TXN +
AVG_AMT_PER_MOB_TXN +
FLG_HAS_NOMINEE +
FLG_HAS_OLD_LOAN +
random +
GENDERF +
GENDERM +
GENDERO +
OCCUPATIONPROF +
OCCUPATIONSAL +
OCCUPATIONSELF.EMP +
OCCUPATIONSENP +
AGE_BKT.25 +
AGE_BKT.50 +
AGE_BKT26.30 +
AGE_BKT31.35 +
AGE_BKT36.40 +
AGE_BKT41.45 +
AGE_BKT46.50 +
ACC_TYPECA +
ACC_TYPESA ,
data = nn.devscaled,
hidden = 3,
err.fct = "sse",
linear.output = FALSE,
lifesign = "full",
lifesign.step = 10,
threshold = 0.1,
stepmax = 2000
)
## hidden: 3 thresh: 0.1 rep: 1/1 steps: 10 min thresh: 3.379522801
## 20 min thresh: 1.891409102
## 30 min thresh: 0.862865376
## (intermediate lifesign lines elided: min thresh falls stepwise through
## roughly 0.48, 0.38, 0.29, 0.22, 0.15 and 0.12 between steps 130 and 1030)
## 1030 min thresh: 0.1018929215
## 1034 error: 649.5383 time: 15.71 secs
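Rather than reading the full lifesign trace, the headline figures can be pulled from the fitted object; in the neuralnet result object, the first rows of result.matrix hold the final error, the threshold reached and the step count (output not shown):
nn2$result.matrix[1:3, ]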
#print(nn2)
The Artificial Neural Network - Graphical Representation
plot(nn2)
Measuring Model Performance
# Prediction on Train Data
NN.train.data$Prob = nn2$net.result[[1]]
#head(NN.train.data)
Probabilities in Training Dataset
# The distribution of the estimated probabilities
quantile(NN.train.data$Prob, c(0,1,5,10,25,50,75,90,95,98,99,100)/100)
## 0% 1% 5% 10% 25%
## 0.005340550102 0.005340550123 0.005341659118 0.005787379101 0.031114204547
## 50% 75% 90% 95% 98%
## 0.036584905250 0.180627475781 0.188810014317 0.513345635301 0.581857453306
## 99% 100%
## 0.581921404836 0.581921674484
hist(NN.train.data$Prob)
Performance Measures on Training Data Set
The following model performance measures will be calculated on the training set to gauge the goodness of the model:
> Rank Ordering
> KS
> Area Under Curve (AUC)
> Gini Coefficient
> Classification Error
Deciling code
decile <- function(x)
{
  # Equal-frequency binning: quantile() computes the nine inner decile
  # break points, and findInterval() assigns each value to its bin
  # (1 = lowest decile, 10 = highest).
  breaks <- quantile(x, probs = seq(0.1, 0.9, 0.1), na.rm = TRUE)
  findInterval(x, breaks) + 1
}
# deciling
NN.train.data$deciles <- decile(NN.train.data$Prob)
#
# Ranking code
##install.packages("data.table")
library(data.table)
library(scales)
tmp_DT = data.table(NN.train.data)
rank <- tmp_DT[, list(
cnt = length(TARGET),
cnt_resp = sum(TARGET==1),
cnt_non_resp = sum(TARGET == 0)) ,
by=deciles][order(-deciles)]
rank$rrate <- round (rank$cnt_resp / rank$cnt,2);
rank$cum_resp <- cumsum(rank$cnt_resp)
rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),2);
rank$cum_rel_non_resp <- round(rank$cum_non_resp / sum(rank$cnt_non_resp),2);
rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp)
rank$rrate <- percent(rank$rrate)
rank$cum_rel_resp <- percent(rank$cum_rel_resp)
rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
rank
## deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp
## 1: 10 1400 549 851 39% 549 851
## 2: 9 1400 251 1149 18% 800 2000
## 3: 8 1400 251 1149 18% 1051 3149
## 4: 7 1400 228 1172 16% 1279 4321
## 5: 6 1400 62 1338 4% 1341 5659
## 6: 5 1400 74 1326 5% 1415 6985
## 7: 4 1400 72 1328 5% 1487 8313
## 8: 3 1400 96 1304 7% 1583 9617
## 9: 2 1400 85 1315 6% 1668 10932
## 10: 1 1400 60 1340 4% 1728 12272
## cum_rel_resp cum_rel_non_resp ks
## 1: 32% 7% 0.25
## 2: 46% 16% 0.30
## 3: 61% 26% 0.35
## 4: 74% 35% 0.39
## 5: 78% 46% 0.32
## 6: 82% 57% 0.25
## 7: 86% 68% 0.18
## 8: 92% 78% 0.14
## 9: 97% 89% 0.08
## 10: 100% 100% 0.00
Interpretation:
The baseline response rate is 12.34%, whereas the response rates in the top three deciles are 39%, 18% and 18% respectively.
Over the top four deciles the KS reaches 39%, which is close to the benchmark usually taken to indicate a well-fitting model.
Assigning 0/1 class based on a chosen threshold (0.5 here)
NN.train.data$Class = ifelse(NN.train.data$Prob>0.5,1,0)
with( NN.train.data, table(TARGET, as.factor(Class) ))
##
## TARGET 0 1
## 0 11981 291
## 1 1296 432
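Accuracy and the classification error rate follow directly from this confusion matrix; a short sketch of the computation (the counts match the printed table, giving roughly 88.7% accuracy):
conf.train <- with(NN.train.data, table(TARGET, Class))
accuracy.train <- sum(diag(conf.train)) / sum(conf.train)   # (11981 + 432) / 14000
cer.train <- 1 - accuracy.train                             # classification error rate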
Error Computation
The half sum of squared errors on the training data reproduces the final training error reported by neuralnet above (649.5383), since the model was fit with err.fct = "sse".
sum((NN.train.data$TARGET - NN.train.data$Prob)^2)/2
## [1] 649.5382956
Other Model Performance Measures
library(ROCR)
#names(NN.train.data)
#str(NN.train.data)
detach(package:neuralnet)
pred3 <- prediction(NN.train.data$Prob, NN.train.data$TARGET)
perf3 <- performance(pred3, "tpr", "fpr")
plot(perf3)
KS3 <- max(attr(perf3, 'y.values')[[1]]-attr(perf3, 'x.values')[[1]])
KS3
## [1] 0.3913591313
auc3 <- performance(pred3,"auc");
auc3 <- as.numeric(auc3@y.values)
auc3
## [1] 0.7290062405
library(ineq)
gini3 = ineq(NN.train.data$Prob, type="Gini")
auc3
## [1] 0.7290062405
KS3
## [1] 0.3913591313
gini3
## [1] 0.5473745079
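For comparison with the Lorenz-based figure above, a ROC-based Gini can be derived directly from the AUC (a sketch):
gini3.roc <- 2 * auc3 - 1   # ~ 0.458, given AUC ~ 0.729
gini3.roc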
Summary: Model Performance Measures (Training Dataset)
Measure | Value
---|---
KS | 39.13%
AUC | 72.90%
Gini | 54.73%
Accuracy | 88.66%
CeR | 11.00%
The Gini Coefficient is the ratio of the area between the line of perfect equality and the observed Lorenz curve to the area between the line of perfect equality and the line of perfect inequality.
The higher the coefficient, the more unequal the distribution. Note that the Gini reported above is computed by ineq() on the distribution of predicted probabilities; a related but not identical quantity can be derived straight from the ROC AUC as Gini = 2*AUC - 1 (about 45.8% here).
A Gini above 60% is generally taken to indicate a good model.
The lower the classification error rate, the higher the model accuracy and the better the model. The error rate could potentially be reduced further if more informative independent variables were available for modeling.
The model performs broadly at par with expectations on a majority of the performance measures, indicating a good model with scope for improvement.
Scoring the Test Dataset using the Neural Net Model Object
x <- subset(NN.test.data,
select = c("AGE",
"BALANCE",
"SCR",
"HOLDING_PERIOD",
"LEN_OF_RLTN_IN_MNTH",
"NO_OF_L_CR_TXNS",
"NO_OF_L_DR_TXNS",
"TOT_NO_OF_L_TXNS",
"NO_OF_BR_CSH_WDL_DR_TXNS",
"NO_OF_ATM_DR_TXNS",
"NO_OF_NET_DR_TXNS",
"NO_OF_MOB_DR_TXNS",
"NO_OF_CHQ_DR_TXNS",
"FLG_HAS_CC",
"AMT_ATM_DR",
"AMT_BR_CSH_WDL_DR",
"AMT_CHQ_DR",
"AMT_NET_DR",
"AMT_MOB_DR",
"AMT_L_DR",
"FLG_HAS_ANY_CHGS",
"AMT_OTH_BK_ATM_USG_CHGS",
"AMT_MIN_BAL_NMC_CHGS",
"NO_OF_IW_CHQ_BNC_TXNS",
"NO_OF_OW_CHQ_BNC_TXNS",
"AVG_AMT_PER_ATM_TXN",
"AVG_AMT_PER_CSH_WDL_TXN",
"AVG_AMT_PER_CHQ_TXN",
"AVG_AMT_PER_NET_TXN",
"AVG_AMT_PER_MOB_TXN",
"FLG_HAS_NOMINEE",
"FLG_HAS_OLD_LOAN",
"random",
"GENDERF",
"GENDERM",
"GENDERO",
"OCCUPATIONPROF",
"OCCUPATIONSAL",
"OCCUPATIONSELF.EMP",
"OCCUPATIONSENP",
"AGE_BKT.25",
"AGE_BKT.50",
"AGE_BKT26.30",
"AGE_BKT31.35",
"AGE_BKT36.40",
"AGE_BKT41.45",
"AGE_BKT46.50",
"ACC_TYPECA",
"ACC_TYPESA" )
)
x.scaled <- scale(x)
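Strictly speaking, the test features should be standardized with the training set's centers and scales, so the network scores inputs on the scale it was trained on; with 14000 training rows the two scalings are close, but a sketch of the stricter approach is shown below (x.train is a hypothetical name for the training subset built earlier, which this script overwrote when reusing the name x):
# train.scaled <- scale(x.train)
# x.scaled <- scale(x,
#                   center = attr(train.scaled, "scaled:center"),
#                   scale  = attr(train.scaled, "scaled:scale"))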
# Re-attaching the neuralnet package for its compute() function, since it was detached earlier
library(neuralnet)
#Scoring the Test Dataset
#To score we will use the compute function
compute.output <- compute(nn2, x.scaled)
NN.test.data$Predict.score <- compute.output$net.result
#head(NN.test.data)
Estimated Probabilities
# The distribution of the estimated probabilities
quantile(NN.test.data$Predict.score, c(0,1,5,10,25,50,75,90,95,98,99,100)/100)
## 0% 1% 5% 10% 25%
## 0.005340550102 0.005340550137 0.005341313796 0.005684425855 0.031114934658
## 50% 75% 90% 95% 98%
## 0.036585067251 0.183669290563 0.188810016694 0.494376372589 0.581827903967
## 99% 100%
## 0.581920926121 0.581921674484
hist(NN.test.data$Predict.score)
Deciling
# deciling
NN.test.data$deciles <- decile(NN.test.data$Predict.score)
#
Rank ordering
library(data.table)
tmp_DT = data.table(NN.test.data)
h_rank <- tmp_DT[, list(
cnt = length(TARGET),
cnt_resp = sum(TARGET),
cnt_non_resp = sum(TARGET == 0)) ,
by=deciles][order(-deciles)]
h_rank$rrate <- round (h_rank$cnt_resp / h_rank$cnt,2);
h_rank$cum_resp <- cumsum(h_rank$cnt_resp)
h_rank$cum_non_resp <- cumsum(h_rank$cnt_non_resp)
h_rank$cum_rel_resp <- round(h_rank$cum_resp / sum(h_rank$cnt_resp),2);
h_rank$cum_rel_non_resp <- round(h_rank$cum_non_resp / sum(h_rank$cnt_non_resp),2)
h_rank$ks <- abs(h_rank$cum_rel_resp - h_rank$cum_rel_non_resp)
# scales is already loaded from the training-set ranking above
h_rank$rrate <- percent(h_rank$rrate)
h_rank$cum_rel_resp <- percent(h_rank$cum_rel_resp)
h_rank$cum_rel_non_resp <- percent(h_rank$cum_rel_non_resp)
h_rank
## deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp
## 1: 10 600 226 374 38% 226 374
## 2: 9 600 118 482 20% 344 856
## 3: 8 600 103 497 17% 447 1353
## 4: 7 600 103 497 17% 550 1850
## 5: 6 600 53 547 9% 603 2397
## 6: 5 600 48 552 8% 651 2949
## 7: 4 600 42 558 7% 693 3507
## 8: 3 600 33 567 6% 726 4074
## 9: 2 600 35 565 6% 761 4639
## 10: 1 600 23 577 4% 784 5216
## cum_rel_resp cum_rel_non_resp ks
## 1: 29% 7% 0.22
## 2: 44% 16% 0.28
## 3: 57% 26% 0.31
## 4: 70% 35% 0.35
## 5: 77% 46% 0.31
## 6: 83% 57% 0.26
## 7: 88% 67% 0.21
## 8: 93% 78% 0.15
## 9: 97% 89% 0.08
## 10: 100% 100% 0.00
Interpretation:
The baseline response rate is 12.34%, whereas the response rates in the top three deciles are 38%, 20% and 17% respectively.
Over the top four deciles the KS reaches 35%, slightly below the training value, indicating some scope for improvement.
Assigning the Probabilities to the Test Dataset
NN.test.data$Prob = compute.output$net.result
Assigning 0/1 class based on a chosen threshold (0.5 here)
NN.test.data$Class = ifelse(NN.test.data$Prob>0.5,1,0)
with( NN.test.data, table(TARGET, as.factor(Class) ))
##
## TARGET 0 1
## 0 5065 151
## 1 639 145
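The same computation on the test confusion matrix (a sketch; the counts above give roughly 86.8% accuracy):
conf.test <- with(NN.test.data, table(TARGET, Class))
accuracy.test <- sum(diag(conf.test)) / sum(conf.test)   # (5065 + 145) / 6000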
Error Computation
sum((NN.test.data$TARGET - NN.test.data$Prob)^2)/2
## [1] 308.6009414
Other Model Performance Measures
library(ROCR)
# str(NN.test.data)
detach(package:neuralnet)
pred4 <- prediction(NN.test.data$Prob, NN.test.data$TARGET)
perf4 <- performance(pred4, "tpr", "fpr")
plot(perf4)
KS4 <- max(attr(perf4, 'y.values')[[1]]-attr(perf4, 'x.values')[[1]])
KS4
## [1] 0.3545409728
auc4 <- performance(pred4,"auc");
auc4 <- as.numeric(auc4@y.values)
auc4
## [1] 0.7213823293
library(ineq)
gini4 = ineq(NN.test.data$Prob, type="Gini")
auc4
## [1] 0.7213823293
KS4
## [1] 0.3545409728
gini4
## [1] 0.5462225246
Summary: Model Performance Measures (Test Dataset)
Measure | Value
---|---
KS | 35.45%
AUC | 72.13%
Gini | 54.62%
Accuracy | 86.83%
CeR | 13.16%
On the test set, the model performs in line with its training performance on a majority of the measures, again indicating a good model with scope for improvement.
Artificial Neural Network - Conclusion
Comparative Summary of the Neural Network Model on Training and Testing Dataset is as follows:
Measures | Train | Test | %Deviation |
---|---|---|---|
KS | 39.00% | 35.45% | 9.10% |
AUC | 73.00% | 72.13% | 1.19% |
Gini | 54.70% | 54.62% | 0.14% |
Accuracy | 88.66% | 86.83% | 2.06% |
CeR | 11.00% | 13.16% | -19.6% |
Most of the model performance values for the training and test sets are within a maximum tolerance of +/- 10% deviation. Hence, the model is not over-fitting.
Overall, the model performs reasonably on the various performance measures (KS close to 40%, Gini coefficient above 50%, and so on), indicating that the developed Neural Network model has good predictive capability.
The main objective of the project was to develop a predictive model, using Machine Learning techniques, to predict whether MyBank customers will respond positively to a promotion or an offer. In this context, the key parameter for model evaluation was Accuracy: the proportion of all predictions that were correct, i.e. the percentage of customers whose response was correctly predicted.
The predictive models were developed using the following Machine Learning techniques:
> Classification Tree - CART
> Random Forest
> Neural Network
A snapshot of the performance of all the models on accuracy, over-fitting and other performance measures is provided below:
CART
Measures | Train | Test | %Deviation |
---|---|---|---|
KS | 72.18% | 47.40% | 35% |
AUC | 91.46% | 79.23% | 13% |
Gini | 41.46% | 60.42% | -46% |
Accuracy | 86.9% | 78.35% | 10% |
CeR | 13.91% | 21.65% | -56% |
Random Forest
Measures | Train | Test | %Deviation |
---|---|---|---|
KS | 99.75% | 77.44% | 22% |
AUC | 99.99% | 95.84% | 04% |
Gini | 74.85% | 64.63% | 14% |
Accuracy | 98.76% | 94.20% | 05% |
CeR | 01.24% | 05.80% | -368% |
Artificial Neural Network
Measures | Train | Test | %Deviation |
---|---|---|---|
KS | 39.00% | 35.45% | 9.10% |
AUC | 73.00% | 72.13% | 1.19% |
Gini | 54.70% | 54.62% | 0.14% |
Accuracy | 88.66% | 86.83% | 2.06% |
CeR | 11.00% | 13.16% | -19.6% |
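For a side-by-side view, the test-set figures can be assembled into a single frame (a sketch; the values are copied from the tables above):
model.comp <- data.frame(
  Measure = c("KS", "AUC", "Gini", "Accuracy", "CeR"),
  CART = c("47.40%", "79.23%", "60.42%", "78.35%", "21.65%"),
  RF   = c("77.44%", "95.84%", "64.63%", "94.20%", "05.80%"),
  ANN  = c("35.45%", "72.13%", "54.62%", "86.83%", "13.16%")
)
model.comp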
Interpretation:
> The CART model performs poorly compared to Random Forest and the ANN. The large percentage deviations between the training and test datasets suggest the model is over-fit.
> The Random Forest model has the best performance (highest accuracy) of the three. Its percentage deviation between training and test datasets is also reasonably under control, suggesting a robust model.
> The Neural Network performs second to Random Forest, though better than CART; its deviation between training and test datasets is the smallest of the three models, indicating the most stable fit.
Random Forest is the overall winner, combining the best accuracy with reasonably controlled deviations.