decision_tree

Load the data into data frames and inspect their structure with basic descriptive statistics. We also check whether the target variable is imbalanced.

myloaddata <- read.csv('data.csv')
mybankdata <- read.csv('g.csv')
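The factor columns in the `str()` output that follows rely on `read.csv` converting strings to factors, which was the default in the pre-4.0 R session used here. From R 4.0.0 onward the default is `stringsAsFactors = FALSE`. A minimal sketch of how to keep the same behaviour, using a small made-up inline table in place of the real CSV files:

```r
# Hypothetical three-row extract using some of the g.csv column names;
# the values are illustrative and do not come from the real file.
csv_text <- "CHECK_A,DURATION,CREDIT
A11,6,0
A12,48,1
A14,12,0"

# stringsAsFactors = TRUE reproduces the pre-4.0 default of read.csv
demo <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# CHECK_A is read as a factor, DURATION and CREDIT as integers
str(demo)
```

For the real files the same argument would be added to the existing calls, for example `read.csv('g.csv', stringsAsFactors = TRUE)`.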

str(myloaddata)

## 'data.frame':    11548 obs. of  7 variables:
##  $ ID         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ delinquent : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sdelinquent: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ term       : Factor w/ 2 levels "36 months","60 months": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gender     : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age        : Factor w/ 2 levels ">25","20-25": 2 2 2 2 2 2 2 2 2 2 ...
##  $ FICO       : Factor w/ 2 levels ">500","300-500": 2 2 2 2 2 2 2 2 2 2 ...

summary(myloaddata)

##        ID        delinquent  Sdelinquent            term      
##  Min.   :    1   No :3827   Min.   :0.0000   36 months:10589  
##  1st Qu.: 2888   Yes:7721   1st Qu.:0.0000   60 months:  959  
##  Median : 5774              Median :1.0000                    
##  Mean   : 5774              Mean   :0.6686                    
##  3rd Qu.: 8661              3rd Qu.:1.0000                    
##  Max.   :11548              Max.   :1.0000                    
##     gender        age            FICO     
##  Female:4993   >25  :5660   >500   :5178  
##  Male  :6555   20-25:5888   300-500:6370  
##                                           
##                                           
##                                           
##

str(mybankdata)  ## shows the columns of the data structure

## 'data.frame':    1000 obs. of  21 variables:
##  $ CHECK_A  : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
##  $ DURATION : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ C_HIST   : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
##  $ PURPOSE  : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
##  $ AMOUNT   : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ SAVE_A   : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
##  $ EMPLOY   : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
##  $ INSTALL_R: int  4 2 2 2 3 2 3 2 2 4 ...
##  $ PERSONAL : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
##  $ GUARANTEE: Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
##  $ RESIDENCE: int  4 2 3 4 4 4 4 2 4 2 ...
##  $ PROPERTY : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
##  $ AGE      : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ INSTALL_P: Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ HOUSING  : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
##  $ N_EXIST  : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ JOB      : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
##  $ N_PEOPLE : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ TEL      : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
##  $ FOREIGN  : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
##  $ CREDIT   : int  0 1 0 0 1 0 0 0 0 1 ...

summary(mybankdata)  ## summary

##  CHECK_A      DURATION    C_HIST       PURPOSE        AMOUNT     
##  A11:274   Min.   : 4.0   A30: 40   A43    :280   Min.   :  250  
##  A12:269   1st Qu.:12.0   A31: 49   A40    :234   1st Qu.: 1366  
##  A13: 63   Median :18.0   A32:530   A42    :181   Median : 2320  
##  A14:394   Mean   :20.9   A33: 88   A41    :103   Mean   : 3271  
##            3rd Qu.:24.0   A34:293   A49    : 97   3rd Qu.: 3972  
##            Max.   :72.0             A46    : 50   Max.   :18424  
##                                     (Other): 55                  
##  SAVE_A    EMPLOY      INSTALL_R     PERSONAL  GUARANTEE    RESIDENCE    
##  A61:603   A71: 62   Min.   :1.000   A91: 50   A101:907   Min.   :1.000  
##  A62:103   A72:172   1st Qu.:2.000   A92:310   A102: 41   1st Qu.:2.000  
##  A63: 63   A73:339   Median :3.000   A93:548   A103: 52   Median :3.000  
##  A64: 48   A74:174   Mean   :2.973   A94: 92              Mean   :2.845  
##  A65:183   A75:253   3rd Qu.:4.000                        3rd Qu.:4.000  
##                      Max.   :4.000                        Max.   :4.000  
##                                                                          
##  PROPERTY        AGE        INSTALL_P  HOUSING       N_EXIST     
##  A121:282   Min.   :19.00   A141:139   A151:179   Min.   :1.000  
##  A122:232   1st Qu.:27.00   A142: 47   A152:713   1st Qu.:1.000  
##  A123:332   Median :33.00   A143:814   A153:108   Median :1.000  
##  A124:154   Mean   :35.55                         Mean   :1.407  
##             3rd Qu.:42.00                         3rd Qu.:2.000  
##             Max.   :75.00                         Max.   :4.000  
##                                                                  
##    JOB         N_PEOPLE       TEL      FOREIGN        CREDIT   
##  A171: 22   Min.   :1.000   A191:596   A201:963   Min.   :0.0  
##  A172:200   1st Qu.:1.000   A192:404   A202: 37   1st Qu.:0.0  
##  A173:630   Median :1.000                         Median :0.0  
##  A174:148   Mean   :1.155                         Mean   :0.3  
##             3rd Qu.:1.000                         3rd Qu.:1.0  
##             Max.   :2.000                         Max.   :1.0  
##

nrow(mybankdata)  # number of rows

## [1] 1000

ncol(mybankdata)

## [1] 21

table(mybankdata$CREDIT) ## counts of 0 and 1 in the target column, to check whether the classes are imbalanced

## 
##   0   1 
## 700 300
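To make the imbalance concrete, the counts above can be turned into proportions. The sketch below hard-codes the 700/300 counts from the output rather than reading `g.csv`; the variable names are illustrative.

```r
# Counts copied from the table(mybankdata$CREDIT) output above
credit_counts <- c("0" = 700, "1" = 300)

# Share of each class; on the real data this corresponds to
# prop.table(table(mybankdata$CREDIT))
class_share <- credit_counts / sum(credit_counts)
print(class_share)

# Accuracy obtained by always predicting the majority class,
# a baseline that any fitted model should beat
baseline_accuracy <- max(class_share)
print(baseline_accuracy)
```

With 70% of the cases in class 0, a tree has to exceed 70% accuracy before it adds anything over a constant prediction.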

The data should be split into training and test sets so that the decision tree can be evaluated on observations it has not seen. We do this by random sampling with sample.split, which preserves the ratio of the two CREDIT classes in both subsets and so avoids a biased split. The indicator column created for the split must be removed from both data frames afterwards so that it does not influence the model.

set.seed(123)
mybankdata$spl <- sample.split(mybankdata$CREDIT, SplitRatio = 0.8)  ## logical indicator: TRUE for the 80% training rows
mybankdata_train <- subset(mybankdata, spl == TRUE)
nrow(mybankdata_train)

## [1] 800

ncol(mybankdata_train)

## [1] 22

mybankdata_test <- subset(mybankdata, spl == FALSE)
mybankdata_train$spl <- NULL  ## remove the logical indicator column (column 22) that was added for the split
mybankdata_test$spl <- NULL  ## remove the logical indicator column (column 22) that was added for the split
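sample.split draws its sample within each class of CREDIT, so both subsets keep the 70/30 class ratio. The following base-R sketch illustrates that idea on a synthetic label vector with the same class counts as the data. `stratified_split` is a hypothetical helper written for this illustration; it is not the caTools implementation.

```r
# Hypothetical helper: mark `ratio` of the rows of each class as training rows
stratified_split <- function(y, ratio = 0.8) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)                      # positions of this class
    n_train <- round(ratio * length(idx))       # training rows for this class
    chosen <- idx[sample.int(length(idx), n_train)]
    in_train[chosen] <- TRUE
  }
  in_train
}

set.seed(123)
# Synthetic labels with the same counts as mybankdata$CREDIT (700 zeros, 300 ones)
y <- rep(c(0, 1), times = c(700, 300))
spl <- stratified_split(y, ratio = 0.8)

# Rows: class, columns: FALSE = test, TRUE = train
print(table(y, spl))
```

Each class contributes exactly 80% of its rows to the training set, which is consistent with the 560/240 class counts that rpart later reports for the root node.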

We now fit the decision tree and plot it. Observations from the tree diagram:

Root node: class 0 is dominant, so the node is coloured blue and labelled 0. 30% of the observations in the node have a Y value of 1. The shade of the node colour reflects how strongly class 0 dominates the node. 100% means that all the training data are present in this node. Using the Gini index, the algorithm found that the best question to ask first is whether the CHECK_A value is "A13" or "A14", and it split the root node on that question.

mct <- rpart(mybankdata_train$CREDIT ~., method = "class",data=mybankdata_train, control = rpart.control(cp = 0.005))  ## grows a classification tree using the Gini index (the rpart default); cp is lowered from its default of 0.01 to get a bigger tree

rpart.plot(mct,shadow.col="gray",cex=0.6)

fancyRpartPlot(mct,cex=0.5)
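The choice of CHECK_A at the root can be checked by hand. The sketch below recomputes the Gini impurity and the split improvement from the class counts that summary(mct) reports for nodes 1, 2 and 3 in this document. `gini` is a small helper defined here for illustration, and the improvement is taken to be the parent's count-weighted impurity minus that of the two children, which reproduces the value rpart prints for this split.

```r
# Gini impurity of a node from its class counts: 1 - sum(p_k^2)
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Class counts (class 0, class 1) taken from the summary(mct) output
root  <- c(560, 240)   # node 1: all 800 training rows
left  <- c(332, 50)    # node 2: CHECK_A is A13 or A14
right <- c(228, 190)   # node 3: CHECK_A is A11 or A12

# Count-weighted impurity reduction achieved by the split
improve <- sum(root) * gini(root) -
  sum(left) * gini(left) -
  sum(right) * gini(right)

print(round(gini(root), 3))   # impurity of the root node
print(round(improve, 5))      # compare with improve=41.81628 for CHECK_A
```

The left child is much purer than the root (about 87% class 0), which is why this split gives the largest improvement among the candidates listed for node 1.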

Let's now look at the tree details and its parameters. summary() shows all the tree parameters and the per-node output, while printcp() shows the complexity parameter details.
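The complexity parameter table is what is normally used to decide how far to prune the tree. As a sketch, the rows below are copied from the summary(mct) output in this document, and two common selection rules are applied: the minimum cross-validated error and the one-standard-error rule. The `cp_table` data frame and the variable names are illustrative; on a fitted tree the same columns are available in `mct$cptable`.

```r
# CP table copied from the summary(mct) output (rel error column omitted)
cp_table <- data.frame(
  CP     = c(0.077083333, 0.033333333, 0.029166667, 0.025000000,
             0.020833333, 0.018750000, 0.016666667, 0.012500000,
             0.010416667, 0.008333333, 0.006250000, 0.005000000),
  nsplit = c(0, 2, 3, 4, 5, 8, 10, 11, 12, 14, 24, 26),
  xerror = c(1.0000000, 0.8916667, 0.8916667, 0.8958333,
             0.8958333, 0.8833333, 0.8625000, 0.8583333,
             0.8833333, 0.8916667, 0.8875000, 0.9000000),
  xstd   = c(0.05400617, 0.05216743, 0.05216743, 0.05224454,
             0.05224454, 0.05201162, 0.05161266, 0.05153124,
             0.05201162, 0.05216743, 0.05208979, 0.05232112)
)

# Rule 1: the row with the smallest cross-validated error
best <- which.min(cp_table$xerror)

# Rule 2 (one standard error): the smallest tree whose xerror is within
# one xstd of the minimum; rows are ordered from smallest to largest tree
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se <- which(cp_table$xerror <= threshold)[1]

print(cp_table[c(best, one_se), ])
```

The selected CP value would then typically be passed to `prune(mct, cp = ...)`. Because xerror comes from random cross-validation folds, the selected rows can change between runs unless the seed is fixed.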

summary(mct)

## Call:
## rpart(formula = mybankdata_train$CREDIT ~ ., data = mybankdata_train, 
##     method = "class", control = rpart.control(cp = 0.005))
##   n= 800 
## 
##             CP nsplit rel error    xerror       xstd
## 1  0.077083333      0 1.0000000 1.0000000 0.05400617
## 2  0.033333333      2 0.8458333 0.8916667 0.05216743
## 3  0.029166667      3 0.8125000 0.8916667 0.05216743
## 4  0.025000000      4 0.7833333 0.8958333 0.05224454
## 5  0.020833333      5 0.7583333 0.8958333 0.05224454
## 6  0.018750000      8 0.6958333 0.8833333 0.05201162
## 7  0.016666667     10 0.6583333 0.8625000 0.05161266
## 8  0.012500000     11 0.6416667 0.8583333 0.05153124
## 9  0.010416667     12 0.6291667 0.8833333 0.05201162
## 10 0.008333333     14 0.6083333 0.8916667 0.05216743
## 11 0.006250000     24 0.5000000 0.8875000 0.05208979
## 12 0.005000000     26 0.4875000 0.9000000 0.05232112
## 
## Variable importance
##   CHECK_A   PURPOSE    AMOUNT  DURATION    C_HIST    SAVE_A    EMPLOY 
##        20        14        10         8         8         8         6 
##       AGE RESIDENCE INSTALL_P  PROPERTY   HOUSING       JOB  PERSONAL 
##         5         5         4         3         3         2         1 
## INSTALL_R 
##         1 
## 
## Node number 1: 800 observations,    complexity param=0.07708333
##   predicted class=0  expected loss=0.3  P(node) =1
##     class counts:   560   240
##    probabilities: 0.700 0.300 
##   left son=2 (382 obs) right son=3 (418 obs)
##   Primary splits:
##       CHECK_A  splits as  RRLL,        improve=41.81628, (0 missing)
##       DURATION < 15.5    to the left,  improve=15.33350, (0 missing)
##       C_HIST   splits as  RRLLL,       improve=14.23095, (0 missing)
##       SAVE_A   splits as  RRLLL,       improve=12.69991, (0 missing)
##       AMOUNT   < 3913.5  to the left,  improve=10.86037, (0 missing)
##   Surrogate splits:
##       SAVE_A   splits as  RRLLL,       agree=0.604, adj=0.170, (0 split)
##       PURPOSE  splits as  RLRRLRRRRR,  agree=0.575, adj=0.110, (0 split)
##       C_HIST   splits as  RRRRL,       agree=0.566, adj=0.092, (0 split)
##       DURATION < 15.5    to the left,  agree=0.551, adj=0.060, (0 split)
##       EMPLOY   splits as  RRLRL,       agree=0.549, adj=0.055, (0 split)
## 
## Node number 2: 382 observations,    complexity param=0.008333333
##   predicted class=0  expected loss=0.1308901  P(node) =0.4775
##     class counts:   332    50
##    probabilities: 0.869 0.131 
##   left son=4 (323 obs) right son=5 (59 obs)
##   Primary splits:
##       INSTALL_P splits as  RRL,         improve=4.234603, (0 missing)
##       PURPOSE   splits as  LLLLLLRLLR,  improve=3.313501, (0 missing)
##       C_HIST    splits as  RRLRL,       improve=3.290513, (0 missing)
##       EMPLOY    splits as  RRLLL,       improve=2.732240, (0 missing)
##       AMOUNT    < 7447    to the left,  improve=2.099077, (0 missing)
##   Surrogate splits:
##       C_HIST   splits as  LRLLL,       agree=0.856, adj=0.068, (0 split)
##       DURATION < 45      to the left,  agree=0.851, adj=0.034, (0 split)
##       PURPOSE  splits as  LLRLLLLLLL,  agree=0.848, adj=0.017, (0 split)
## 
## Node number 3: 418 observations,    complexity param=0.07708333
##   predicted class=0  expected loss=0.4545455  P(node) =0.5225
##     class counts:   228   190
##    probabilities: 0.545 0.455 
##   left son=6 (225 obs) right son=7 (193 obs)
##   Primary splits:
##       DURATION < 22.5    to the left,  improve=14.319360, (0 missing)
##       PROPERTY splits as  LLRR,        improve= 7.243790, (0 missing)
##       C_HIST   splits as  RRLLL,       improve= 6.714510, (0 missing)
##       SAVE_A   splits as  RRLLL,       improve= 6.521208, (0 missing)
##       AMOUNT   < 3998    to the left,  improve= 5.222644, (0 missing)
##   Surrogate splits:
##       AMOUNT   < 2665    to the left,  agree=0.742, adj=0.440, (0 split)
##       PROPERTY splits as  LLRR,        agree=0.639, adj=0.218, (0 split)
##       PURPOSE  splits as  LRRLLLLLLR,  agree=0.622, adj=0.181, (0 split)
##       C_HIST   splits as  RRLRL,       agree=0.605, adj=0.145, (0 split)
##       HOUSING  splits as  LLR,         agree=0.593, adj=0.119, (0 split)
## 
## Node number 4: 323 observations
##   predicted class=0  expected loss=0.09907121  P(node) =0.40375
##     class counts:   291    32
##    probabilities: 0.901 0.099 
## 
## Node number 5: 59 observations,    complexity param=0.008333333
##   predicted class=0  expected loss=0.3050847  P(node) =0.07375
##     class counts:    41    18
##    probabilities: 0.695 0.305 
##   left son=10 (37 obs) right son=11 (22 obs)
##   Primary splits:
##       PURPOSE  splits as  RLLLL--R-R,  improve=5.731937, (0 missing)
##       EMPLOY   splits as  RLRLL,       improve=3.538688, (0 missing)
##       AGE      < 44.5    to the right, improve=1.481488, (0 missing)
##       AMOUNT   < 1784.5  to the left,  improve=1.423926, (0 missing)
##       PROPERTY splits as  RLRR,        improve=1.186646, (0 missing)
##   Surrogate splits:
##       INSTALL_R < 1.5     to the right, agree=0.695, adj=0.182, (0 split)
##       DURATION  < 42      to the left,  agree=0.678, adj=0.136, (0 split)
##       HOUSING   splits as  LLR,         agree=0.678, adj=0.136, (0 split)
##       C_HIST    splits as  RLLLL,       agree=0.661, adj=0.091, (0 split)
##       PERSONAL  splits as  RLLL,        agree=0.661, adj=0.091, (0 split)
## 
## Node number 6: 225 observations,    complexity param=0.03333333
##   predicted class=0  expected loss=0.3333333  P(node) =0.28125
##     class counts:   150    75
##    probabilities: 0.667 0.333 
##   left son=12 (215 obs) right son=13 (10 obs)
##   Primary splits:
##       C_HIST   splits as  LRLLL,       improve=6.720930, (0 missing)
##       PURPOSE  splits as  LLLLLLLRLL,  improve=5.437553, (0 missing)
##       PROPERTY splits as  LLLR,        improve=5.147059, (0 missing)
##       AMOUNT   < 1281    to the right, improve=3.439803, (0 missing)
##       JOB      splits as  RLRR,        improve=3.123750, (0 missing)
## 
## Node number 7: 193 observations,    complexity param=0.02916667
##   predicted class=1  expected loss=0.4041451  P(node) =0.24125
##     class counts:    78   115
##    probabilities: 0.404 0.596 
##   left son=14 (81 obs) right son=15 (112 obs)
##   Primary splits:
##       PURPOSE   splits as  RLRLRLRL-R,  improve=5.398694, (0 missing)
##       SAVE_A    splits as  RLLLL,       improve=5.023956, (0 missing)
##       AMOUNT    < 1381.5  to the right, improve=3.342035, (0 missing)
##       INSTALL_R < 2.5     to the left,  improve=2.766270, (0 missing)
##       GUARANTEE splits as  LRL,         improve=2.289032, (0 missing)
##   Surrogate splits:
##       CHECK_A splits as  LR--,        agree=0.622, adj=0.099, (0 split)
##       AMOUNT  < 2691.5  to the right, agree=0.611, adj=0.074, (0 split)
##       C_HIST  splits as  RRRRL,       agree=0.606, adj=0.062, (0 split)
##       HOUSING splits as  RRL,         agree=0.606, adj=0.062, (0 split)
##       SAVE_A  splits as  RRLLR,       agree=0.601, adj=0.049, (0 split)
## 
## Node number 10: 37 observations
##   predicted class=0  expected loss=0.1351351  P(node) =0.04625
##     class counts:    32     5
##    probabilities: 0.865 0.135 
## 
## Node number 11: 22 observations,    complexity param=0.008333333
##   predicted class=1  expected loss=0.4090909  P(node) =0.0275
##     class counts:     9    13
##    probabilities: 0.409 0.591 
##   left son=22 (9 obs) right son=23 (13 obs)
##   Primary splits:
##       EMPLOY   splits as  RRRLL,       improve=2.0209790, (0 missing)
##       PERSONAL splits as  LLR-,        improve=1.1720780, (0 missing)
##       AMOUNT   < 3761.5  to the left,  improve=1.0637140, (0 missing)
##       C_HIST   splits as  RLLRR,       improve=0.8181818, (0 missing)
##       JOB      splits as  -LRL,        improve=0.6534577, (0 missing)
##   Surrogate splits:
##       AMOUNT    < 3195.5  to the left,  agree=0.773, adj=0.444, (0 split)
##       C_HIST    splits as  RRLRR,       agree=0.727, adj=0.333, (0 split)
##       AGE       < 45      to the right, agree=0.727, adj=0.333, (0 split)
##       DURATION  < 19.5    to the left,  agree=0.682, adj=0.222, (0 split)
##       RESIDENCE < 2.5     to the right, agree=0.682, adj=0.222, (0 split)
## 
## Node number 12: 215 observations,    complexity param=0.02083333
##   predicted class=0  expected loss=0.3069767  P(node) =0.26875
##     class counts:   149    66
##    probabilities: 0.693 0.307 
##   left son=24 (206 obs) right son=25 (9 obs)
##   Primary splits:
##       PURPOSE  splits as  LLLLLLLRLL,  improve=4.164075, (0 missing)
##       JOB      splits as  RLRR,        improve=3.447757, (0 missing)
##       PROPERTY splits as  LLLR,        improve=2.920543, (0 missing)
##       AMOUNT   < 1281    to the right, improve=2.831182, (0 missing)
##       C_HIST   splits as  R-RLL,       improve=2.450693, (0 missing)
## 
## Node number 13: 10 observations
##   predicted class=1  expected loss=0.1  P(node) =0.0125
##     class counts:     1     9
##    probabilities: 0.100 0.900 
## 
## Node number 14: 81 observations,    complexity param=0.025
##   predicted class=0  expected loss=0.4567901  P(node) =0.10125
##     class counts:    44    37
##    probabilities: 0.543 0.457 
##   left son=28 (67 obs) right son=29 (14 obs)
##   Primary splits:
##       AMOUNT    < 8015.5  to the left,  improve=2.244439, (0 missing)
##       DURATION  < 31.5    to the left,  improve=2.136331, (0 missing)
##       SAVE_A    splits as  RRRLL,       improve=1.582146, (0 missing)
##       PURPOSE   splits as  -L-R-R-R--,  improve=1.568151, (0 missing)
##       INSTALL_R < 2.5     to the left,  improve=1.351868, (0 missing)
##   Surrogate splits:
##       N_EXIST < 2.5     to the left,  agree=0.852, adj=0.143, (0 split)
## 
## Node number 15: 112 observations,    complexity param=0.02083333
##   predicted class=1  expected loss=0.3035714  P(node) =0.14
##     class counts:    34    78
##    probabilities: 0.304 0.696 
##   left son=30 (41 obs) right son=31 (71 obs)
##   Primary splits:
##       SAVE_A    splits as  RLLRL,       improve=7.023237, (0 missing)
##       CHECK_A   splits as  RL--,        improve=3.296115, (0 missing)
##       RESIDENCE < 1.5     to the left,  improve=2.044449, (0 missing)
##       EMPLOY    splits as  LRRLL,       improve=2.009774, (0 missing)
##       AMOUNT    < 1381.5  to the right, improve=1.724490, (0 missing)
##   Surrogate splits:
##       DURATION < 57      to the right, agree=0.652, adj=0.049, (0 split)
##       C_HIST   splits as  RLRLR,       agree=0.652, adj=0.049, (0 split)
##       AMOUNT   < 1464    to the left,  agree=0.643, adj=0.024, (0 split)
##       HOUSING  splits as  RRL,         agree=0.643, adj=0.024, (0 split)
##       JOB      splits as  LRRR,        agree=0.643, adj=0.024, (0 split)
## 
## Node number 22: 9 observations
##   predicted class=0  expected loss=0.3333333  P(node) =0.01125
##     class counts:     6     3
##    probabilities: 0.667 0.333 
## 
## Node number 23: 13 observations
##   predicted class=1  expected loss=0.2307692  P(node) =0.01625
##     class counts:     3    10
##    probabilities: 0.231 0.769 
## 
## Node number 24: 206 observations,    complexity param=0.008333333
##   predicted class=0  expected loss=0.2864078  P(node) =0.2575
##     class counts:   147    59
##    probabilities: 0.714 0.286 
##   left son=48 (56 obs) right son=49 (150 obs)
##   Primary splits:
##       JOB      splits as  RLRR,        improve=3.169598, (0 missing)
##       DURATION < 11.5    to the left,  improve=2.963252, (0 missing)
##       C_HIST   splits as  R-RLL,       improve=2.690141, (0 missing)
##       PROPERTY splits as  LLLR,        improve=2.644673, (0 missing)
##       AMOUNT   < 632     to the left,  improve=2.276422, (0 missing)
##   Surrogate splits:
##       AMOUNT  < 742.5   to the left,  agree=0.743, adj=0.054, (0 split)
##       FOREIGN splits as  RL,          agree=0.738, adj=0.036, (0 split)
##       PURPOSE splits as  RRRRRRR-LR,  agree=0.733, adj=0.018, (0 split)
## 
## Node number 25: 9 observations
##   predicted class=1  expected loss=0.2222222  P(node) =0.01125
##     class counts:     2     7
##    probabilities: 0.222 0.778 
## 
## Node number 28: 67 observations,    complexity param=0.01875
##   predicted class=0  expected loss=0.4029851  P(node) =0.08375
##     class counts:    40    27
##    probabilities: 0.597 0.403 
##   left son=56 (20 obs) right son=57 (47 obs)
##   Primary splits:
##       PURPOSE   splits as  -L-R-R-R--,  improve=3.649444, (0 missing)
##       DURATION  < 46.5    to the left,  improve=2.187959, (0 missing)
##       RESIDENCE < 2.5     to the left,  improve=1.750434, (0 missing)
##       INSTALL_R < 2.5     to the left,  improve=1.520232, (0 missing)
##       C_HIST    splits as  RLLRL,       improve=1.445703, (0 missing)
##   Surrogate splits:
##       AGE    < 57.5    to the right, agree=0.746, adj=0.15, (0 split)
##       JOB    splits as  RRRL,        agree=0.731, adj=0.10, (0 split)
##       AMOUNT < 5432    to the right, agree=0.716, adj=0.05, (0 split)
## 
## Node number 29: 14 observations
##   predicted class=1  expected loss=0.2857143  P(node) =0.0175
##     class counts:     4    10
##    probabilities: 0.286 0.714 
## 
## Node number 30: 41 observations,    complexity param=0.02083333
##   predicted class=0  expected loss=0.4634146  P(node) =0.05125
##     class counts:    22    19
##    probabilities: 0.537 0.463 
##   left son=60 (34 obs) right son=61 (7 obs)
##   Primary splits:
##       AMOUNT    < 1381.5  to the right, improve=4.860832, (0 missing)
##       EMPLOY    splits as  LRRLR,       improve=3.282552, (0 missing)
##       INSTALL_R < 2.5     to the left,  improve=2.638921, (0 missing)
##       RESIDENCE < 3.5     to the left,  improve=1.954346, (0 missing)
##       C_HIST    splits as  RRRLR,       improve=1.835405, (0 missing)
##   Surrogate splits:
##       CHECK_A splits as  RL--, agree=0.927, adj=0.571, (0 split)
## 
## Node number 31: 71 observations
##   predicted class=1  expected loss=0.1690141  P(node) =0.08875
##     class counts:    12    59
##    probabilities: 0.169 0.831 
## 
## Node number 48: 56 observations
##   predicted class=0  expected loss=0.1428571  P(node) =0.07
##     class counts:    48     8
##    probabilities: 0.857 0.143 
## 
## Node number 49: 150 observations,    complexity param=0.008333333
##   predicted class=0  expected loss=0.34  P(node) =0.1875
##     class counts:    99    51
##    probabilities: 0.660 0.340 
##   left son=98 (25 obs) right son=99 (125 obs)
##   Primary splits:
##       PURPOSE splits as  RLLRRLR-LL,  improve=4.056000, (0 missing)
##       SAVE_A  splits as  RLLLR,       improve=3.881538, (0 missing)
##       AMOUNT  < 1373    to the right, improve=1.976863, (0 missing)
##       HOUSING splits as  RLR,         improve=1.976863, (0 missing)
##       EMPLOY  splits as  RRRLR,       improve=1.898073, (0 missing)
## 
## Node number 56: 20 observations
##   predicted class=0  expected loss=0.15  P(node) =0.025
##     class counts:    17     3
##    probabilities: 0.850 0.150 
## 
## Node number 57: 47 observations,    complexity param=0.01875
##   predicted class=1  expected loss=0.4893617  P(node) =0.05875
##     class counts:    23    24
##    probabilities: 0.489 0.511 
##   left son=114 (20 obs) right son=115 (27 obs)
##   Primary splits:
##       RESIDENCE < 2.5     to the left,  improve=3.089362, (0 missing)
##       EMPLOY    splits as  RLLLR,       improve=1.846505, (0 missing)
##       PROPERTY  splits as  LLRL,        improve=1.846505, (0 missing)
##       SAVE_A    splits as  RRRLL,       improve=1.309875, (0 missing)
##       HOUSING   splits as  RLL,         improve=1.104746, (0 missing)
##   Surrogate splits:
##       AMOUNT    < 2382.5  to the left,  agree=0.681, adj=0.25, (0 split)
##       AGE       < 31.5    to the right, agree=0.681, adj=0.25, (0 split)
##       PERSONAL  splits as  LRRL,        agree=0.660, adj=0.20, (0 split)
##       PROPERTY  splits as  RLRR,        agree=0.660, adj=0.20, (0 split)
##       GUARANTEE splits as  RRL,         agree=0.617, adj=0.10, (0 split)
## 
## Node number 60: 34 observations,    complexity param=0.01041667
##   predicted class=0  expected loss=0.3529412  P(node) =0.0425
##     class counts:    22    12
##    probabilities: 0.647 0.353 
##   left son=120 (14 obs) right son=121 (20 obs)
##   Primary splits:
##       EMPLOY    splits as  LRRLR,       improve=2.100840, (0 missing)
##       RESIDENCE < 3.5     to the left,  improve=1.968806, (0 missing)
##       SAVE_A    splits as  -RR-L,       improve=1.812745, (0 missing)
##       AGE       < 25      to the right, improve=1.548643, (0 missing)
##       DURATION  < 25.5    to the left,  improve=1.431634, (0 missing)
##   Surrogate splits:
##       N_PEOPLE < 1.5     to the right, agree=0.676, adj=0.214, (0 split)
##       CHECK_A  splits as  LR--,        agree=0.647, adj=0.143, (0 split)
##       DURATION < 46.5    to the right, agree=0.647, adj=0.143, (0 split)
##       C_HIST   splits as  RLRRR,       agree=0.647, adj=0.143, (0 split)
##       PURPOSE  splits as  R-L-R----L,  agree=0.647, adj=0.143, (0 split)
## 
## Node number 61: 7 observations
##   predicted class=1  expected loss=0  P(node) =0.00875
##     class counts:     0     7
##    probabilities: 0.000 1.000 
## 
## Node number 98: 25 observations
##   predicted class=0  expected loss=0.08  P(node) =0.03125
##     class counts:    23     2
##    probabilities: 0.920 0.080 
## 
## Node number 99: 125 observations,    complexity param=0.008333333
##   predicted class=0  expected loss=0.392  P(node) =0.15625
##     class counts:    76    49
##    probabilities: 0.608 0.392 
##   left son=198 (111 obs) right son=199 (14 obs)
##   Primary splits:
##       INSTALL_P splits as  RLL,         improve=3.275120, (0 missing)
##       SAVE_A    splits as  RRLLR,       improve=2.980552, (0 missing)
##       EMPLOY    splits as  RRRLR,       improve=1.828009, (0 missing)
##       DURATION  < 15.5    to the left,  improve=1.762862, (0 missing)
##       RESIDENCE < 3.5     to the right, improve=1.650773, (0 missing)
## 
## Node number 114: 20 observations,    complexity param=0.01666667
##   predicted class=0  expected loss=0.3  P(node) =0.025
##     class counts:    14     6
##    probabilities: 0.700 0.300 
##   left son=228 (12 obs) right son=229 (8 obs)
##   Primary splits:
##       EMPLOY    splits as  RLLRR,       improve=5.400000, (0 missing)
##       AGE       < 32.5    to the left,  improve=2.137374, (0 missing)
##       AMOUNT    < 3471    to the right, improve=1.600000, (0 missing)
##       PROPERTY  splits as  LRLL,        improve=1.167677, (0 missing)
##       INSTALL_R < 3.5     to the left,  improve=0.356044, (0 missing)
##   Surrogate splits:
##       AGE      < 31      to the left,  agree=0.75, adj=0.375, (0 split)
##       JOB      splits as  LLLR,        agree=0.75, adj=0.375, (0 split)
##       PURPOSE  splits as  ---L-L-R--,  agree=0.70, adj=0.250, (0 split)
##       AMOUNT   < 2303    to the right, agree=0.70, adj=0.250, (0 split)
##       DURATION < 33      to the left,  agree=0.65, adj=0.125, (0 split)
## 
## Node number 115: 27 observations,    complexity param=0.0125
##   predicted class=1  expected loss=0.3333333  P(node) =0.03375
##     class counts:     9    18
##    probabilities: 0.333 0.667 
##   left son=230 (11 obs) right son=231 (16 obs)
##   Primary splits:
##       PROPERTY splits as  LLRR,        improve=3.409091, (0 missing)
##       C_HIST   splits as  RLLRR,       improve=1.729412, (0 missing)
##       DURATION < 27      to the left,  improve=1.200000, (0 missing)
##       TEL      splits as  RL,          improve=1.200000, (0 missing)
##       SAVE_A   splits as  RLLLR,       improve=1.071429, (0 missing)
##   Surrogate splits:
##       C_HIST    splits as  RRLRR,       agree=0.741, adj=0.364, (0 split)
##       INSTALL_R < 3.5     to the right, agree=0.704, adj=0.273, (0 split)
##       INSTALL_P splits as  RLR,         agree=0.704, adj=0.273, (0 split)
##       AMOUNT    < 2614.5  to the left,  agree=0.667, adj=0.182, (0 split)
##       SAVE_A    splits as  RRRLL,       agree=0.667, adj=0.182, (0 split)
## 
## Node number 120: 14 observations
##   predicted class=0  expected loss=0.1428571  P(node) =0.0175
##     class counts:    12     2
##    probabilities: 0.857 0.143 
## 
## Node number 121: 20 observations,    complexity param=0.01041667
##   predicted class=0  expected loss=0.5  P(node) =0.025
##     class counts:    10    10
##    probabilities: 0.500 0.500 
##   left son=242 (13 obs) right son=243 (7 obs)
##   Primary splits:
##       RESIDENCE < 3.5     to the left,  improve=2.7472530, (0 missing)
##       C_HIST    splits as  LRRLL,       improve=1.6666670, (0 missing)
##       SAVE_A    splits as  -RL-L,       improve=1.6666670, (0 missing)
##       PURPOSE   splits as  L-R-R----L,  improve=0.9090909, (0 missing)
##       PROPERTY  splits as  LRLR,        improve=0.9090909, (0 missing)
##   Surrogate splits:
##       AGE       < 41.5    to the left,  agree=0.80, adj=0.429, (0 split)
##       HOUSING   splits as  RLR,         agree=0.80, adj=0.429, (0 split)
##       AMOUNT    < 12296.5 to the left,  agree=0.75, adj=0.286, (0 split)
##       PERSONAL  splits as  LRLL,        agree=0.75, adj=0.286, (0 split)
##       GUARANTEE splits as  LRR,         agree=0.75, adj=0.286, (0 split)
## 
## Node number 198: 111 observations,    complexity param=0.008333333
##   predicted class=0  expected loss=0.3513514  P(node) =0.13875
##     class counts:    72    39
##    probabilities: 0.649 0.351 
##   left son=396 (12 obs) right son=397 (99 obs)
##   Primary splits:
##       SAVE_A    splits as  RLLLR,       improve=3.321867, (0 missing)
##       RESIDENCE < 3.5     to the right, improve=3.142084, (0 missing)
##       GUARANTEE splits as  RRL,         improve=2.418124, (0 missing)
##       C_HIST    splits as  R-RRL,       improve=2.341963, (0 missing)
##       AMOUNT    < 1373    to the right, improve=1.459979, (0 missing)
## 
## Node number 199: 14 observations
##   predicted class=1  expected loss=0.2857143  P(node) =0.0175
##     class counts:     4    10
##    probabilities: 0.286 0.714 
## 
## Node number 228: 12 observations
##   predicted class=0  expected loss=0  P(node) =0.015
##     class counts:    12     0
##    probabilities: 1.000 0.000 
## 
## Node number 229: 8 observations
##   predicted class=1  expected loss=0.25  P(node) =0.01
##     class counts:     2     6
##    probabilities: 0.250 0.750 
## 
## Node number 230: 11 observations
##   predicted class=0  expected loss=0.3636364  P(node) =0.01375
##     class counts:     7     4
##    probabilities: 0.636 0.364 
## 
## Node number 231: 16 observations
##   predicted class=1  expected loss=0.125  P(node) =0.02
##     class counts:     2    14
##    probabilities: 0.125 0.875 
## 
## Node number 242: 13 observations
##   predicted class=0  expected loss=0.3076923  P(node) =0.01625
##     class counts:     9     4
##    probabilities: 0.692 0.308 
## 
## Node number 243: 7 observations
##   predicted class=1  expected loss=0.1428571  P(node) =0.00875
##     class counts:     1     6
##    probabilities: 0.143 0.857 
## 
## Node number 396: 12 observations
##   predicted class=0  expected loss=0  P(node) =0.015
##     class counts:    12     0
##    probabilities: 1.000 0.000 
## 
## Node number 397: 99 observations,    complexity param=0.008333333
##   predicted class=0  expected loss=0.3939394  P(node) =0.12375
##     class counts:    60    39
##    probabilities: 0.606 0.394 
##   left son=794 (40 obs) right son=795 (59 obs)
##   Primary splits:
##       RESIDENCE < 3.5     to the right, improve=3.831202, (0 missing)
##       C_HIST    splits as  R-RLL,       improve=2.344156, (0 missing)
##       GUARANTEE splits as  RRL,         improve=2.337945, (0 missing)
##       AMOUNT    < 1373    to the right, improve=1.900782, (0 missing)
##       EMPLOY    splits as  RRRLR,       improve=1.329870, (0 missing)
##   Surrogate splits:
##       AGE      < 38      to the right, agree=0.707, adj=0.275, (0 split)
##       EMPLOY   splits as  LRRRL,       agree=0.677, adj=0.200, (0 split)
##       HOUSING  splits as  LRL,         agree=0.677, adj=0.200, (0 split)
##       C_HIST   splits as  R-RLL,       agree=0.657, adj=0.150, (0 split)
##       PROPERTY splits as  RRRL,        agree=0.646, adj=0.125, (0 split)
## 
## Node number 794: 40 observations,    complexity param=0.00625
##   predicted class=0  expected loss=0.225  P(node) =0.05
##     class counts:    31     9
##    probabilities: 0.775 0.225 
##   left son=1588 (19 obs) right son=1589 (21 obs)
##   Primary splits:
##       HOUSING splits as  RLR,         improve=2.150501, (0 missing)
##       AGE     < 41.5    to the left,  improve=1.470000, (0 missing)
##       AMOUNT  < 1179.5  to the left,  improve=1.175806, (0 missing)
##       JOB     splits as  R-LR,        improve=1.118459, (0 missing)
##       CHECK_A splits as  RL--,        improve=1.015934, (0 missing)
##   Surrogate splits:
##       CHECK_A splits as  RL--,        agree=0.725, adj=0.421, (0 split)
##       EMPLOY  splits as  RRRLL,       agree=0.675, adj=0.316, (0 split)
##       AGE     < 36.5    to the right, agree=0.675, adj=0.316, (0 split)
##       C_HIST  splits as  R-RLL,       agree=0.650, adj=0.263, (0 split)
##       AMOUNT  < 1979.5  to the left,  agree=0.650, adj=0.263, (0 split)
## 
## Node number 795: 59 observations,    complexity param=0.008333333
##   predicted class=1  expected loss=0.4915254  P(node) =0.07375
##     class counts:    29    30
##    probabilities: 0.492 0.508 
##   left son=1590 (38 obs) right son=1591 (21 obs)
##   Primary splits:
##       AMOUNT   < 1373    to the right, improve=2.762202, (0 missing)
##       C_HIST   splits as  R-RRL,       improve=2.571793, (0 missing)
##       EMPLOY   splits as  RLRLR,       improve=1.912702, (0 missing)
##       DURATION < 11.5    to the left,  improve=1.686293, (0 missing)
##       PROPERTY splits as  LLRR,        improve=1.556743, (0 missing)
##   Surrogate splits:
##       FOREIGN   splits as  LR,         agree=0.695, adj=0.143, (0 split)
##       PURPOSE   splits as  L--LL-R---, agree=0.678, adj=0.095, (0 split)
##       PERSONAL  splits as  RLLL,       agree=0.661, adj=0.048, (0 split)
##       INSTALL_P splits as  -RL,        agree=0.661, adj=0.048, (0 split)
## 
## Node number 1588: 19 observations
##   predicted class=0  expected loss=0.05263158  P(node) =0.02375
##     class counts:    18     1
##    probabilities: 0.947 0.053 
## 
## Node number 1589: 21 observations,    complexity param=0.00625
##   predicted class=0  expected loss=0.3809524  P(node) =0.02625
##     class counts:    13     8
##    probabilities: 0.619 0.381 
##   left son=3178 (14 obs) right son=3179 (7 obs)
##   Primary splits:
##       AGE     < 38      to the left,  improve=2.3333330, (0 missing)
##       C_HIST  splits as  R-LLR,       improve=1.5393770, (0 missing)
##       EMPLOY  splits as  LRRLR,       improve=1.1904760, (0 missing)
##       AMOUNT  < 3057.5  to the left,  improve=0.7619048, (0 missing)
##       N_EXIST < 1.5     to the left,  improve=0.7619048, (0 missing)
##   Surrogate splits:
##       PURPOSE  splits as  L--LL-R---,  agree=0.714, adj=0.143, (0 split)
##       AMOUNT   < 2544.5  to the left,  agree=0.714, adj=0.143, (0 split)
##       PERSONAL splits as  RLLL,        agree=0.714, adj=0.143, (0 split)
##       PROPERTY splits as  LLLR,        agree=0.714, adj=0.143, (0 split)
##       HOUSING  splits as  L-R,         agree=0.714, adj=0.143, (0 split)
## 
## Node number 1590: 38 observations,    complexity param=0.008333333
##   predicted class=0  expected loss=0.3947368  P(node) =0.0475
##     class counts:    23    15
##    probabilities: 0.605 0.395 
##   left son=3180 (20 obs) right son=3181 (18 obs)
##   Primary splits:
##       AGE    < 29.5    to the right, improve=3.202339, (0 missing)
##       TEL    splits as  LR,          improve=2.088330, (0 missing)
##       AMOUNT < 4064.5  to the left,  improve=1.752365, (0 missing)
##       EMPLOY splits as  RLRLR,       improve=1.564057, (0 missing)
##       C_HIST splits as  R-RRL,       improve=1.474561, (0 missing)
##   Surrogate splits:
##       PERSONAL  splits as  LRLL,        agree=0.684, adj=0.333, (0 split)
##       C_HIST    splits as  R-RLL,       agree=0.658, adj=0.278, (0 split)
##       PURPOSE   splits as  L--LR-----,  agree=0.658, adj=0.278, (0 split)
##       RESIDENCE < 2.5     to the right, agree=0.658, adj=0.278, (0 split)
##       AMOUNT    < 1541.5  to the right, agree=0.632, adj=0.222, (0 split)
## 
## Node number 1591: 21 observations
##   predicted class=1  expected loss=0.2857143  P(node) =0.02625
##     class counts:     6    15
##    probabilities: 0.286 0.714 
## 
## Node number 3178: 14 observations
##   predicted class=0  expected loss=0.2142857  P(node) =0.0175
##     class counts:    11     3
##    probabilities: 0.786 0.214 
## 
## Node number 3179: 7 observations
##   predicted class=1  expected loss=0.2857143  P(node) =0.00875
##     class counts:     2     5
##    probabilities: 0.286 0.714 
## 
## Node number 3180: 20 observations
##   predicted class=0  expected loss=0.2  P(node) =0.025
##     class counts:    16     4
##    probabilities: 0.800 0.200 
## 
## Node number 3181: 18 observations
##   predicted class=1  expected loss=0.3888889  P(node) =0.0225
##     class counts:     7    11
##    probabilities: 0.389 0.611
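
The node summaries above follow a few conventions that can be checked by hand. The sketch below uses base R only, with numbers copied from nodes 99, 1590 and 3181. The variable and helper names are made up for illustration, and the expected-loss formula assumes the default 0/1 loss, where the loss is simply the share of the minority class in the node.

```r
# Reading rpart node summaries by hand (values copied from the output above).
n_train <- 800                      # "n= 800" for the training data

# Node 99: 125 observations, class counts 76 (class 0) and 49 (class 1)
node_counts <- c(`0` = 76, `1` = 49)
n_node <- sum(node_counts)

p_node <- n_node / n_train                    # P(node): share of training rows in the node
expected_loss <- min(node_counts) / n_node    # minority share, assuming default 0/1 loss
predicted_class <- names(which.max(node_counts))

# rpart numbers the children of node k as 2k (left) and 2k + 1 (right),
# so the depth of node k below the root is floor(log2(k)).
children <- function(k) c(left = 2 * k, right = 2 * k + 1)
depth <- function(k) floor(log2(k))

print(c(p_node = p_node, expected_loss = expected_loss))
print(predicted_class)
print(children(1590))
print(depth(3181))
```

The same arithmetic applies to every node in the summary, which makes it easy to spot-check a leaf before trusting its predicted class.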

Let's now look at the complexity parameter (CP) and the cross-validated error (xerror). CP decreases as the number of splits grows. Looking carefully at xerror, it falls to a minimum of 0.85833 at row 8 of the table (CP 0.0125, 11 splits) and then rises again to 0.88333 at row 9. The last row, row 12, has an xerror of 0.90000. So we will prune the tree back to the rows around that minimum rather than keep all 26 splits.

printcp(mct) ## rpart's default cp is 0.01; a split that does not improve the fit by at least cp is not attempted, so cp was lowered to 0.005 when fitting

## 
## Classification tree:
## rpart(formula = mybankdata_train$CREDIT ~ ., data = mybankdata_train, 
##     method = "class", control = rpart.control(cp = 0.005))
## 
## Variables actually used in tree construction:
##  [1] AGE       AMOUNT    C_HIST    CHECK_A   DURATION  EMPLOY    HOUSING  
##  [8] INSTALL_P JOB       PROPERTY  PURPOSE   RESIDENCE SAVE_A   
## 
## Root node error: 240/800 = 0.3
## 
## n= 800 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.0770833      0   1.00000 1.00000 0.054006
## 2  0.0333333      2   0.84583 0.89167 0.052167
## 3  0.0291667      3   0.81250 0.89167 0.052167
## 4  0.0250000      4   0.78333 0.89583 0.052245
## 5  0.0208333      5   0.75833 0.89583 0.052245
## 6  0.0187500      8   0.69583 0.88333 0.052012
## 7  0.0166667     10   0.65833 0.86250 0.051613
## 8  0.0125000     11   0.64167 0.85833 0.051531
## 9  0.0104167     12   0.62917 0.88333 0.052012
## 10 0.0083333     14   0.60833 0.89167 0.052167
## 11 0.0062500     24   0.50000 0.88750 0.052090
## 12 0.0050000     26   0.48750 0.90000 0.052321
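
Picking a row from this table can also be done mechanically. The sketch below retypes the printed CP table into a data frame (the name cp_table is made up here) and finds the row with the lowest xerror. It also applies the common one-standard-error rule, which takes the smallest tree whose xerror is within one xstd of the minimum. That rule is an alternative shown for comparison, not the choice made in this analysis.

```r
# CP table copied from the printcp(mct) output above.
cp_table <- data.frame(
  CP     = c(0.0770833, 0.0333333, 0.0291667, 0.0250000, 0.0208333, 0.0187500,
             0.0166667, 0.0125000, 0.0104167, 0.0083333, 0.0062500, 0.0050000),
  nsplit = c(0, 2, 3, 4, 5, 8, 10, 11, 12, 14, 24, 26),
  xerror = c(1.00000, 0.89167, 0.89167, 0.89583, 0.89583, 0.88333,
             0.86250, 0.85833, 0.88333, 0.89167, 0.88750, 0.90000),
  xstd   = c(0.054006, 0.052167, 0.052167, 0.052245, 0.052245, 0.052012,
             0.051613, 0.051531, 0.052012, 0.052167, 0.052090, 0.052321)
)

# Row with the lowest cross-validated error.
best_row <- which.min(cp_table$xerror)

# One-standard-error rule: rows are ordered from smallest to largest tree,
# so the first row under the threshold is the smallest acceptable tree.
threshold <- cp_table$xerror[best_row] + cp_table$xstd[best_row]
one_se_row <- which(cp_table$xerror <= threshold)[1]

print(cp_table[c(best_row, one_se_row), ])
```

Because xerror comes from random cross-validation folds, the rows selected this way can change from one run to the next unless a seed is set before fitting.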

We now prune the tree. The lowest xerror is at row 8 (CP 0.0125, 11 splits). The call below passes cp = 0.0104167, which keeps the CP table through row 9 (12 splits), one row past that minimum.

mct_final <- prune(mct,cp = 0.0104167)
rpart.plot(mct_final,shadow.col="gray",cex=0.7)

printcp(mct_final)

## 
## Classification tree:
## rpart(formula = mybankdata_train$CREDIT ~ ., data = mybankdata_train, 
##     method = "class", control = rpart.control(cp = 0.005))
## 
## Variables actually used in tree construction:
## [1] AMOUNT    C_HIST    CHECK_A   DURATION  EMPLOY    PROPERTY  PURPOSE  
## [8] RESIDENCE SAVE_A   
## 
## Root node error: 240/800 = 0.3
## 
## n= 800 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.077083      0   1.00000 1.00000 0.054006
## 2 0.033333      2   0.84583 0.89167 0.052167
## 3 0.029167      3   0.81250 0.89167 0.052167
## 4 0.025000      4   0.78333 0.89583 0.052245
## 5 0.020833      5   0.75833 0.89583 0.052245
## 6 0.018750      8   0.69583 0.88333 0.052012
## 7 0.016667     10   0.65833 0.86250 0.051613
## 8 0.012500     11   0.64167 0.85833 0.051531
## 9 0.010417     12   0.62917 0.88333 0.052012

plotcp(mct_final)
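
Rather than typing the cp value by hand, it can be read from the fitted object. The following is a minimal sketch under stated assumptions: it uses the kyphosis data that ships with rpart instead of the bank data, which is not available here, and the names fit, cp_tab, best_cp and pruned are illustrative.

```r
library(rpart)

set.seed(42)  # cross-validation folds are random, so fix the seed for repeatability

# Grow a deliberately large tree on the kyphosis data bundled with rpart.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", control = rpart.control(cp = 0.005))

# The CP table printed by printcp() is stored as a matrix in fit$cptable.
cp_tab <- fit$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]

# Prune back to the subtree with the lowest cross-validated error.
pruned <- prune(fit, cp = best_cp)
printcp(pruned)
```

Applied to this analysis, the same pattern would read the table from mct$cptable and pass the selected value to prune(mct, cp = ...).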

Pruning is how we guard against overfitting, which is quite common with decision trees, so the size of the tree has to be chosen with care. First we predict on the test data to see the probability output: each row has a probability for class 0 and a probability for class 1. We then store the predicted class in a new column of the test data frame and compare it with the actual CREDIT column by building a confusion matrix. Note that the class predictions below are made with mct, the unpruned tree, while the probabilities come from mct_final. The accuracy here is 65%, which is below the no-information rate of 68%.

mybankdata_predict <- predict(mct_final,mybankdata_test)

View(mybankdata_predict)

mybankdata_test$CREDIT_PREDICTED <- predict(mct,mybankdata_test, type = 'class')
xlab <- table(actualclass=mybankdata_test$CREDIT,predictedclass=mybankdata_test$CREDIT_PREDICTED)  ## cross-tabulate actual vs predicted class
confusionMatrix(xlab) # generating the confusion matrix

## Confusion Matrix and Statistics
## 
##            predictedclass
## actualclass   0   1
##           0 103  37
##           1  33  27
##                                           
##                Accuracy : 0.65            
##                  95% CI : (0.5795, 0.7159)
##     No Information Rate : 0.68            
##     P-Value [Acc > NIR] : 0.8379          
##                                           
##                   Kappa : 0.1822          
##  Mcnemar's Test P-Value : 0.7199          
##                                           
##             Sensitivity : 0.7574          
##             Specificity : 0.4219          
##          Pos Pred Value : 0.7357          
##          Neg Pred Value : 0.4500          
##              Prevalence : 0.6800          
##          Detection Rate : 0.5150          
##    Detection Prevalence : 0.7000          
##       Balanced Accuracy : 0.5896          
##                                           
##        'Positive' Class : 0               
##
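
The headline figures can be recomputed from the four counts. One detail to watch: the report shows a Prevalence of 0.68, which is 136/200, the share of rows predicted as class 0, whereas the actual share of class 0 is 140/200 = 0.70. This suggests the columns of the table (predictedclass) were treated as the reference. The sketch below, in base R with illustrative names, recomputes the metrics with the actual class as the reference and class 0 as the positive class.

```r
# Counts copied from the confusion matrix above (rows = actual, columns = predicted).
cm <- matrix(c(103, 33, 37, 27), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Accuracy does not depend on which margin is the reference.
accuracy <- sum(diag(cm)) / sum(cm)

# With the actual class as reference and "0" as the positive class:
sensitivity <- cm["0", "0"] / sum(cm["0", ])   # actual 0 correctly predicted as 0
specificity <- cm["1", "1"] / sum(cm["1", ])   # actual 1 correctly predicted as 1
prevalence  <- sum(cm["0", ]) / sum(cm)        # actual share of class 0

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, prevalence = prevalence), 4))
```

Under this orientation the sensitivity and specificity correspond to the Pos Pred Value and Neg Pred Value lines of the report, which is consistent with the two margins having been swapped.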

decision_tree_demo

Amit Kayal

November 29, 2017