Load the data into data frames and inspect them with basic descriptive statistics. We also check how skewed the target variable is.
myloaddata <- read.csv('data.csv')
mybankdata <- read.csv('g.csv')
str(myloaddata)
## 'data.frame': 11548 obs. of 7 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ delinquent : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Sdelinquent: int 0 0 0 0 0 0 0 0 0 0 ...
## $ term : Factor w/ 2 levels "36 months","60 months": 1 1 1 1 1 1 1 1 1 1 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
## $ age : Factor w/ 2 levels ">25","20-25": 2 2 2 2 2 2 2 2 2 2 ...
## $ FICO : Factor w/ 2 levels ">500","300-500": 2 2 2 2 2 2 2 2 2 2 ...
summary(myloaddata)
## ID delinquent Sdelinquent term
## Min. : 1 No :3827 Min. :0.0000 36 months:10589
## 1st Qu.: 2888 Yes:7721 1st Qu.:0.0000 60 months: 959
## Median : 5774 Median :1.0000
## Mean : 5774 Mean :0.6686
## 3rd Qu.: 8661 3rd Qu.:1.0000
## Max. :11548 Max. :1.0000
## gender age FICO
## Female:4993 >25 :5660 >500 :5178
## Male :6555 20-25:5888 300-500:6370
##
##
##
##
str(mybankdata) ## shows the columns of the data structure
## 'data.frame': 1000 obs. of 21 variables:
## $ CHECK_A : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
## $ DURATION : int 6 48 12 42 24 36 24 36 12 30 ...
## $ C_HIST : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
## $ PURPOSE : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
## $ AMOUNT : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ SAVE_A : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
## $ EMPLOY : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
## $ INSTALL_R: int 4 2 2 2 3 2 3 2 2 4 ...
## $ PERSONAL : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
## $ GUARANTEE: Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
## $ RESIDENCE: int 4 2 3 4 4 4 4 2 4 2 ...
## $ PROPERTY : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
## $ AGE : int 67 22 49 45 53 35 53 35 61 28 ...
## $ INSTALL_P: Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ HOUSING : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
## $ N_EXIST : int 2 1 1 1 2 1 1 1 1 2 ...
## $ JOB : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
## $ N_PEOPLE : int 1 1 2 2 2 2 1 1 1 1 ...
## $ TEL : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
## $ FOREIGN : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
## $ CREDIT : int 0 1 0 0 1 0 0 0 0 1 ...
summary(mybankdata) ## summary
## CHECK_A DURATION C_HIST PURPOSE AMOUNT
## A11:274 Min. : 4.0 A30: 40 A43 :280 Min. : 250
## A12:269 1st Qu.:12.0 A31: 49 A40 :234 1st Qu.: 1366
## A13: 63 Median :18.0 A32:530 A42 :181 Median : 2320
## A14:394 Mean :20.9 A33: 88 A41 :103 Mean : 3271
## 3rd Qu.:24.0 A34:293 A49 : 97 3rd Qu.: 3972
## Max. :72.0 A46 : 50 Max. :18424
## (Other): 55
## SAVE_A EMPLOY INSTALL_R PERSONAL GUARANTEE RESIDENCE
## A61:603 A71: 62 Min. :1.000 A91: 50 A101:907 Min. :1.000
## A62:103 A72:172 1st Qu.:2.000 A92:310 A102: 41 1st Qu.:2.000
## A63: 63 A73:339 Median :3.000 A93:548 A103: 52 Median :3.000
## A64: 48 A74:174 Mean :2.973 A94: 92 Mean :2.845
## A65:183 A75:253 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :4.000 Max. :4.000
##
## PROPERTY AGE INSTALL_P HOUSING N_EXIST
## A121:282 Min. :19.00 A141:139 A151:179 Min. :1.000
## A122:232 1st Qu.:27.00 A142: 47 A152:713 1st Qu.:1.000
## A123:332 Median :33.00 A143:814 A153:108 Median :1.000
## A124:154 Mean :35.55 Mean :1.407
## 3rd Qu.:42.00 3rd Qu.:2.000
## Max. :75.00 Max. :4.000
##
## JOB N_PEOPLE TEL FOREIGN CREDIT
## A171: 22 Min. :1.000 A191:596 A201:963 Min. :0.0
## A172:200 1st Qu.:1.000 A192:404 A202: 37 1st Qu.:0.0
## A173:630 Median :1.000 Median :0.0
## A174:148 Mean :1.155 Mean :0.3
## 3rd Qu.:1.000 3rd Qu.:1.0
## Max. :2.000 Max. :1.0
##
nrow(mybankdata) # number of rows
## [1] 1000
ncol(mybankdata)
## [1] 21
table(mybankdata$CREDIT) ## counts of 0 and 1 in the target; shows whether the classes are skewed
##
## 0 1
## 700 300
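The raw counts can also be read as proportions, which makes the imbalance easier to judge. A minimal sketch, using a stand-in vector `credit` (a hypothetical name) rebuilt from the counts printed above rather than the real `mybankdata$CREDIT` column:

```r
# Stand-in for mybankdata$CREDIT, rebuilt from the counts shown above
# (700 zeros, 300 ones); the name `credit` is illustrative only
credit <- c(rep(0, 700), rep(1, 300))

# Share of each class in the target
class_share <- prop.table(table(credit))
print(class_share)

# Ratio of the majority class to the minority class
imbalance_ratio <- max(table(credit)) / min(table(credit))
print(imbalance_ratio)
```

With the real data the same idea would be `prop.table(table(mybankdata$CREDIT))`. A 70/30 split is moderately imbalanced, so accuracy alone should be read against the 70% baseline of always predicting 0.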
The data should be split into training and test sets so that the decision tree can be tested on data it has not seen. We do this through random sampling, which helps to avoid bias in the split. The indicator column created for the random split should be removed from the data frames after splitting so that it does not influence the model.
set.seed(123)
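mybankdata$spl <- sample.split(mybankdata$CREDIT, SplitRatio = 0.8) ## TRUE marks rows that go to the training set
mybankdata_train <- subset(mybankdata, spl == TRUE) ## compare with the logical TRUE, not the string "TRUE"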
nrow(mybankdata_train)
## [1] 800
ncol(mybankdata_train)
## [1] 22
mybankdata_test <- subset(mybankdata, spl == FALSE)
mybankdata_train$spl <- NULL ## removing the logical column that was set for sampling
mybankdata_test$spl <- NULL ## removing the logical column that was set for sampling
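The idea of a split that keeps the class ratio in both subsets can be sketched in base R. This is an illustration under assumptions, not the internals of `sample.split`: the helper `stratified_split` and the toy vector `y` are hypothetical names introduced here.

```r
# Hypothetical helper: returns a logical vector, TRUE for rows assigned
# to the training set. Sampling is done within each class, so both
# subsets keep (approximately) the original class ratio.
stratified_split <- function(y, ratio = 0.8) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)                    # rows belonging to this class
    n_train <- round(length(idx) * ratio)     # how many of them to train on
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

set.seed(123)
y <- c(rep(0, 700), rep(1, 300))   # toy target with the same counts as CREDIT
spl <- stratified_split(y, 0.8)

print(table(y[spl]))    # class counts in the training part
print(table(y[!spl]))   # class counts in the test part
```

Sampling within each class is what keeps the 70/30 ratio intact; a plain `sample()` over all rows would only match it on average.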
We now fit the decision tree and plot it. Observations from the tree diagram:
Root node: class 0 dominates the root node, so it is coloured blue and labelled 0. 30% of the observations in the node have a Y value of 1. The shade of a node reflects how strongly one class dominates; here it indicates a higher percentage of 0. The 100% means that all observations are present in this node. Using the Gini index, the algorithm found that the best question to ask is whether the CHECK_A value is "A13" or "A14", and it split the root node on that question.
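The Gini-based choice of the root split can be checked by hand. The sketch below recomputes the improvement for the CHECK_A split from the class counts that `summary(mct)` reports for nodes 1, 2 and 3. The helper name `gini` and the variable names are hypothetical, and the formula (decrease in size-weighted Gini impurity) is an assumption about how the reported `improve` value is scaled, which the matching result supports.

```r
# Gini impurity of a node, computed from its class counts
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Class counts (class 0, class 1) taken from the summary output
parent <- c(560, 240)   # node 1, the root
left   <- c(332, 50)    # node 2, CHECK_A in A13 or A14
right  <- c(228, 190)   # node 3, CHECK_A in A11 or A12

# Decrease in size-weighted Gini impurity caused by the split
improve <- sum(parent) * gini(parent) -
  (sum(left) * gini(left) + sum(right) * gini(right))

print(round(improve, 5))   # 41.81628, the improve value reported for CHECK_A
```

The left child is much purer (about 87% class 0) than the root (70% class 0), which is why this split scores far above the next candidate, DURATION, at about 15.3.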
mct <- rpart(mybankdata_train$CREDIT ~., method = "class",data=mybankdata_train, control = rpart.control(cp = 0.005)) ## grows a classification tree using the Gini index; cp is lowered from the default of 0.01 to 0.005 to get a bigger tree
rpart.plot(mct,shadow.col="gray",cex=0.6)
fancyRpartPlot(mct,cex=0.5)
Let's now look at the tree details and its parameters. summary() shows all the tree parameters and the node-level output, while printcp() shows the complexity parameter (CP) details.
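One way to read a CP table is to look for the row with the lowest cross-validated error (`xerror`), or for the smallest tree whose `xerror` lies within one standard error of that minimum. A minimal sketch, with the values transcribed from the CP table that `summary(mct)` prints into a hypothetical data frame `cp_table`; on a fitted model the same columns should be available in `mct$cptable`:

```r
# CP table values transcribed from the summary(mct) output;
# `cp_table` is an illustrative name, not an object from the analysis
cp_table <- data.frame(
  CP     = c(0.077083333, 0.033333333, 0.029166667, 0.025000000,
             0.020833333, 0.018750000, 0.016666667, 0.012500000,
             0.010416667, 0.008333333, 0.006250000, 0.005000000),
  nsplit = c(0, 2, 3, 4, 5, 8, 10, 11, 12, 14, 24, 26),
  xerror = c(1.0000000, 0.8916667, 0.8916667, 0.8958333,
             0.8958333, 0.8833333, 0.8625000, 0.8583333,
             0.8833333, 0.8916667, 0.8875000, 0.9000000),
  xstd   = c(0.05400617, 0.05216743, 0.05216743, 0.05224454,
             0.05224454, 0.05201162, 0.05161266, 0.05153124,
             0.05201162, 0.05216743, 0.05208979, 0.05232112)
)

# Row with the lowest cross-validated error
best <- which.min(cp_table$xerror)

# One-standard-error rule: first (smallest) tree within one xstd of the minimum
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se <- min(which(cp_table$xerror <= threshold))

print(cp_table[c(best, one_se), ])
```

Here the minimum is at 11 splits (CP 0.0125), while the one-standard-error rule points to the much smaller tree with 2 splits (CP of about 0.033). Which of the two to prefer is a judgment call between accuracy and simplicity.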
summary(mct)
## Call:
## rpart(formula = mybankdata_train$CREDIT ~ ., data = mybankdata_train,
## method = "class", control = rpart.control(cp = 0.005))
## n= 800
##
## CP nsplit rel error xerror xstd
## 1 0.077083333 0 1.0000000 1.0000000 0.05400617
## 2 0.033333333 2 0.8458333 0.8916667 0.05216743
## 3 0.029166667 3 0.8125000 0.8916667 0.05216743
## 4 0.025000000 4 0.7833333 0.8958333 0.05224454
## 5 0.020833333 5 0.7583333 0.8958333 0.05224454
## 6 0.018750000 8 0.6958333 0.8833333 0.05201162
## 7 0.016666667 10 0.6583333 0.8625000 0.05161266
## 8 0.012500000 11 0.6416667 0.8583333 0.05153124
## 9 0.010416667 12 0.6291667 0.8833333 0.05201162
## 10 0.008333333 14 0.6083333 0.8916667 0.05216743
## 11 0.006250000 24 0.5000000 0.8875000 0.05208979
## 12 0.005000000 26 0.4875000 0.9000000 0.05232112
##
## Variable importance
## CHECK_A PURPOSE AMOUNT DURATION C_HIST SAVE_A EMPLOY
## 20 14 10 8 8 8 6
## AGE RESIDENCE INSTALL_P PROPERTY HOUSING JOB PERSONAL
## 5 5 4 3 3 2 1
## INSTALL_R
## 1
##
## Node number 1: 800 observations, complexity param=0.07708333
## predicted class=0 expected loss=0.3 P(node) =1
## class counts: 560 240
## probabilities: 0.700 0.300
## left son=2 (382 obs) right son=3 (418 obs)
## Primary splits:
## CHECK_A splits as RRLL, improve=41.81628, (0 missing)
## DURATION < 15.5 to the left, improve=15.33350, (0 missing)
## C_HIST splits as RRLLL, improve=14.23095, (0 missing)
## SAVE_A splits as RRLLL, improve=12.69991, (0 missing)
## AMOUNT < 3913.5 to the left, improve=10.86037, (0 missing)
## Surrogate splits:
## SAVE_A splits as RRLLL, agree=0.604, adj=0.170, (0 split)
## PURPOSE splits as RLRRLRRRRR, agree=0.575, adj=0.110, (0 split)
## C_HIST splits as RRRRL, agree=0.566, adj=0.092, (0 split)
## DURATION < 15.5 to the left, agree=0.551, adj=0.060, (0 split)
## EMPLOY splits as RRLRL, agree=0.549, adj=0.055, (0 split)
##
## Node number 2: 382 observations, complexity param=0.008333333
## predicted class=0 expected loss=0.1308901 P(node) =0.4775
## class counts: 332 50
## probabilities: 0.869 0.131
## left son=4 (323 obs) right son=5 (59 obs)
## Primary splits:
## INSTALL_P splits as RRL, improve=4.234603, (0 missing)
## PURPOSE splits as LLLLLLRLLR, improve=3.313501, (0 missing)
## C_HIST splits as RRLRL, improve=3.290513, (0 missing)
## EMPLOY splits as RRLLL, improve=2.732240, (0 missing)
## AMOUNT < 7447 to the left, improve=2.099077, (0 missing)
## Surrogate splits:
## C_HIST splits as LRLLL, agree=0.856, adj=0.068, (0 split)
## DURATION < 45 to the left, agree=0.851, adj=0.034, (0 split)
## PURPOSE splits as LLRLLLLLLL, agree=0.848, adj=0.017, (0 split)
##
## Node number 3: 418 observations, complexity param=0.07708333
## predicted class=0 expected loss=0.4545455 P(node) =0.5225
## class counts: 228 190
## probabilities: 0.545 0.455
## left son=6 (225 obs) right son=7 (193 obs)
## Primary splits:
## DURATION < 22.5 to the left, improve=14.319360, (0 missing)
## PROPERTY splits as LLRR, improve= 7.243790, (0 missing)
## C_HIST splits as RRLLL, improve= 6.714510, (0 missing)
## SAVE_A splits as RRLLL, improve= 6.521208, (0 missing)
## AMOUNT < 3998 to the left, improve= 5.222644, (0 missing)
## Surrogate splits:
## AMOUNT < 2665 to the left, agree=0.742, adj=0.440, (0 split)
## PROPERTY splits as LLRR, agree=0.639, adj=0.218, (0 split)
## PURPOSE splits as LRRLLLLLLR, agree=0.622, adj=0.181, (0 split)
## C_HIST splits as RRLRL, agree=0.605, adj=0.145, (0 split)
## HOUSING splits as LLR, agree=0.593, adj=0.119, (0 split)
##
## Node number 4: 323 observations
## predicted class=0 expected loss=0.09907121 P(node) =0.40375
## class counts: 291 32
## probabilities: 0.901 0.099
##
## Node number 5: 59 observations, complexity param=0.008333333
## predicted class=0 expected loss=0.3050847 P(node) =0.07375
## class counts: 41 18
## probabilities: 0.695 0.305
## left son=10 (37 obs) right son=11 (22 obs)
## Primary splits:
## PURPOSE splits as RLLLL--R-R, improve=5.731937, (0 missing)
## EMPLOY splits as RLRLL, improve=3.538688, (0 missing)
## AGE < 44.5 to the right, improve=1.481488, (0 missing)
## AMOUNT < 1784.5 to the left, improve=1.423926, (0 missing)
## PROPERTY splits as RLRR, improve=1.186646, (0 missing)
## Surrogate splits:
## INSTALL_R < 1.5 to the right, agree=0.695, adj=0.182, (0 split)
## DURATION < 42 to the left, agree=0.678, adj=0.136, (0 split)
## HOUSING splits as LLR, agree=0.678, adj=0.136, (0 split)
## C_HIST splits as RLLLL, agree=0.661, adj=0.091, (0 split)
## PERSONAL splits as RLLL, agree=0.661, adj=0.091, (0 split)
##
## Node number 6: 225 observations, complexity param=0.03333333
## predicted class=0 expected loss=0.3333333 P(node) =0.28125
## class counts: 150 75
## probabilities: 0.667 0.333
## left son=12 (215 obs) right son=13 (10 obs)
## Primary splits:
## C_HIST splits as LRLLL, improve=6.720930, (0 missing)
## PURPOSE splits as LLLLLLLRLL, improve=5.437553, (0 missing)
## PROPERTY splits as LLLR, improve=5.147059, (0 missing)
## AMOUNT < 1281 to the right, improve=3.439803, (0 missing)
## JOB splits as RLRR, improve=3.123750, (0 missing)
##
## Node number 7: 193 observations, complexity param=0.02916667
## predicted class=1 expected loss=0.4041451 P(node) =0.24125
## class counts: 78 115
## probabilities: 0.404 0.596
## left son=14 (81 obs) right son=15 (112 obs)
## Primary splits:
## PURPOSE splits as RLRLRLRL-R, improve=5.398694, (0 missing)
## SAVE_A splits as RLLLL, improve=5.023956, (0 missing)
## AMOUNT < 1381.5 to the right, improve=3.342035, (0 missing)
## INSTALL_R < 2.5 to the left, improve=2.766270, (0 missing)
## GUARANTEE splits as LRL, improve=2.289032, (0 missing)
## Surrogate splits:
## CHECK_A splits as LR--, agree=0.622, adj=0.099, (0 split)
## AMOUNT < 2691.5 to the right, agree=0.611, adj=0.074, (0 split)
## C_HIST splits as RRRRL, agree=0.606, adj=0.062, (0 split)
## HOUSING splits as RRL, agree=0.606, adj=0.062, (0 split)
## SAVE_A splits as RRLLR, agree=0.601, adj=0.049, (0 split)
##
## Node number 10: 37 observations
## predicted class=0 expected loss=0.1351351 P(node) =0.04625
## class counts: 32 5
## probabilities: 0.865 0.135
##
## Node number 11: 22 observations, complexity param=0.008333333
## predicted class=1 expected loss=0.4090909 P(node) =0.0275
## class counts: 9 13
## probabilities: 0.409 0.591
## left son=22 (9 obs) right son=23 (13 obs)
## Primary splits:
## EMPLOY splits as RRRLL, improve=2.0209790, (0 missing)
## PERSONAL splits as LLR-, improve=1.1720780, (0 missing)
## AMOUNT < 3761.5 to the left, improve=1.0637140, (0 missing)
## C_HIST splits as RLLRR, improve=0.8181818, (0 missing)
## JOB splits as -LRL, improve=0.6534577, (0 missing)
## Surrogate splits:
## AMOUNT < 3195.5 to the left, agree=0.773, adj=0.444, (0 split)
## C_HIST splits as RRLRR, agree=0.727, adj=0.333, (0 split)
## AGE < 45 to the right, agree=0.727, adj=0.333, (0 split)
## DURATION < 19.5 to the left, agree=0.682, adj=0.222, (0 split)
## RESIDENCE < 2.5 to the right, agree=0.682, adj=0.222, (0 split)
##
## Node number 12: 215 observations, complexity param=0.02083333
## predicted class=0 expected loss=0.3069767 P(node) =0.26875
## class counts: 149 66
## probabilities: 0.693 0.307
## left son=24 (206 obs) right son=25 (9 obs)
## Primary splits:
## PURPOSE splits as LLLLLLLRLL, improve=4.164075, (0 missing)
## JOB splits as RLRR, improve=3.447757, (0 missing)
## PROPERTY splits as LLLR, improve=2.920543, (0 missing)
## AMOUNT < 1281 to the right, improve=2.831182, (0 missing)
## C_HIST splits as R-RLL, improve=2.450693, (0 missing)
##
## Node number 13: 10 observations
## predicted class=1 expected loss=0.1 P(node) =0.0125
## class counts: 1 9
## probabilities: 0.100 0.900
##
## Node number 14: 81 observations, complexity param=0.025
## predicted class=0 expected loss=0.4567901 P(node) =0.10125
## class counts: 44 37
## probabilities: 0.543 0.457
## left son=28 (67 obs) right son=29 (14 obs)
## Primary splits:
## AMOUNT < 8015.5 to the left, improve=2.244439, (0 missing)
## DURATION < 31.5 to the left, improve=2.136331, (0 missing)
## SAVE_A splits as RRRLL, improve=1.582146, (0 missing)
## PURPOSE splits as -L-R-R-R--, improve=1.568151, (0 missing)
## INSTALL_R < 2.5 to the left, improve=1.351868, (0 missing)
## Surrogate splits:
## N_EXIST < 2.5 to the left, agree=0.852, adj=0.143, (0 split)
##
## Node number 15: 112 observations, complexity param=0.02083333
## predicted class=1 expected loss=0.3035714 P(node) =0.14
## class counts: 34 78
## probabilities: 0.304 0.696
## left son=30 (41 obs) right son=31 (71 obs)
## Primary splits:
## SAVE_A splits as RLLRL, improve=7.023237, (0 missing)
## CHECK_A splits as RL--, improve=3.296115, (0 missing)
## RESIDENCE < 1.5 to the left, improve=2.044449, (0 missing)
## EMPLOY splits as LRRLL, improve=2.009774, (0 missing)
## AMOUNT < 1381.5 to the right, improve=1.724490, (0 missing)
## Surrogate splits:
## DURATION < 57 to the right, agree=0.652, adj=0.049, (0 split)
## C_HIST splits as RLRLR, agree=0.652, adj=0.049, (0 split)
## AMOUNT < 1464 to the left, agree=0.643, adj=0.024, (0 split)
## HOUSING splits as RRL, agree=0.643, adj=0.024, (0 split)
## JOB splits as LRRR, agree=0.643, adj=0.024, (0 split)
##
## Node number 22: 9 observations
## predicted class=0 expected loss=0.3333333 P(node) =0.01125
## class counts: 6 3
## probabilities: 0.667 0.333
##
## Node number 23: 13 observations
## predicted class=1 expected loss=0.2307692 P(node) =0.01625
## class counts: 3 10
## probabilities: 0.231 0.769
##
## Node number 24: 206 observations, complexity param=0.008333333
## predicted class=0 expected loss=0.2864078 P(node) =0.2575
## class counts: 147 59
## probabilities: 0.714 0.286
## left son=48 (56 obs) right son=49 (150 obs)
## Primary splits:
## JOB splits as RLRR, improve=3.169598, (0 missing)
## DURATION < 11.5 to the left, improve=2.963252, (0 missing)
## C_HIST splits as R-RLL, improve=2.690141, (0 missing)
## PROPERTY splits as LLLR, improve=2.644673, (0 missing)
## AMOUNT < 632 to the left, improve=2.276422, (0 missing)
## Surrogate splits:
## AMOUNT < 742.5 to the left, agree=0.743, adj=0.054, (0 split)
## FOREIGN splits as RL, agree=0.738, adj=0.036, (0 split)
## PURPOSE splits as RRRRRRR-LR, agree=0.733, adj=0.018, (0 split)
##
## Node number 25: 9 observations
## predicted class=1 expected loss=0.2222222 P(node) =0.01125
## class counts: 2 7
## probabilities: 0.222 0.778
##
## Node number 28: 67 observations, complexity param=0.01875
## predicted class=0 expected loss=0.4029851 P(node) =0.08375
## class counts: 40 27
## probabilities: 0.597 0.403
## left son=56 (20 obs) right son=57 (47 obs)
## Primary splits:
## PURPOSE splits as -L-R-R-R--, improve=3.649444, (0 missing)
## DURATION < 46.5 to the left, improve=2.187959, (0 missing)
## RESIDENCE < 2.5 to the left, improve=1.750434, (0 missing)
## INSTALL_R < 2.5 to the left, improve=1.520232, (0 missing)
## C_HIST splits as RLLRL, improve=1.445703, (0 missing)
## Surrogate splits:
## AGE < 57.5 to the right, agree=0.746, adj=0.15, (0 split)
## JOB splits as RRRL, agree=0.731, adj=0.10, (0 split)
## AMOUNT < 5432 to the right, agree=0.716, adj=0.05, (0 split)
##
## Node number 29: 14 observations
## predicted class=1 expected loss=0.2857143 P(node) =0.0175
## class counts: 4 10
## probabilities: 0.286 0.714
##
## Node number 30: 41 observations, complexity param=0.02083333
## predicted class=0 expected loss=0.4634146 P(node) =0.05125
## class counts: 22 19
## probabilities: 0.537 0.463
## left son=60 (34 obs) right son=61 (7 obs)
## Primary splits:
## AMOUNT < 1381.5 to the right, improve=4.860832, (0 missing)
## EMPLOY splits as LRRLR, improve=3.282552, (0 missing)
## INSTALL_R < 2.5 to the left, improve=2.638921, (0 missing)
## RESIDENCE < 3.5 to the left, improve=1.954346, (0 missing)
## C_HIST splits as RRRLR, improve=1.835405, (0 missing)
## Surrogate splits:
## CHECK_A splits as RL--, agree=0.927, adj=0.571, (0 split)
##
## Node number 31: 71 observations
## predicted class=1 expected loss=0.1690141 P(node) =0.08875
## class counts: 12 59
## probabilities: 0.169 0.831
##
## Node number 48: 56 observations
## predicted class=0 expected loss=0.1428571 P(node) =0.07
## class counts: 48 8
## probabilities: 0.857 0.143
##
## Node number 49: 150 observations, complexity param=0.008333333
## predicted class=0 expected loss=0.34 P(node) =0.1875
## class counts: 99 51
## probabilities: 0.660 0.340
## left son=98 (25 obs) right son=99 (125 obs)
## Primary splits:
## PURPOSE splits as RLLRRLR-LL, improve=4.056000, (0 missing)
## SAVE_A splits as RLLLR, improve=3.881538, (0 missing)
## AMOUNT < 1373 to the right, improve=1.976863, (0 missing)
## HOUSING splits as RLR, improve=1.976863, (0 missing)
## EMPLOY splits as RRRLR, improve=1.898073, (0 missing)
##
## Node number 56: 20 observations
## predicted class=0 expected loss=0.15 P(node) =0.025
## class counts: 17 3
## probabilities: 0.850 0.150
##
## Node number 57: 47 observations, complexity param=0.01875
## predicted class=1 expected loss=0.4893617 P(node) =0.05875
## class counts: 23 24
## probabilities: 0.489 0.511
## left son=114 (20 obs) right son=115 (27 obs)
## Primary splits:
## RESIDENCE < 2.5 to the left, improve=3.089362, (0 missing)
## EMPLOY splits as RLLLR, improve=1.846505, (0 missing)
## PROPERTY splits as LLRL, improve=1.846505, (0 missing)
## SAVE_A splits as RRRLL, improve=1.309875, (0 missing)
## HOUSING splits as RLL, improve=1.104746, (0 missing)
## Surrogate splits:
## AMOUNT < 2382.5 to the left, agree=0.681, adj=0.25, (0 split)
## AGE < 31.5 to the right, agree=0.681, adj=0.25, (0 split)
## PERSONAL splits as LRRL, agree=0.660, adj=0.20, (0 split)
## PROPERTY splits as RLRR, agree=0.660, adj=0.20, (0 split)
## GUARANTEE splits as RRL, agree=0.617, adj=0.10, (0 split)
##
## Node number 60: 34 observations, complexity param=0.01041667
## predicted class=0 expected loss=0.3529412 P(node) =0.0425
## class counts: 22 12
## probabilities: 0.647 0.353
## left son=120 (14 obs) right son=121 (20 obs)
## Primary splits:
## EMPLOY splits as LRRLR, improve=2.100840, (0 missing)
## RESIDENCE < 3.5 to the left, improve=1.968806, (0 missing)
## SAVE_A splits as -RR-L, improve=1.812745, (0 missing)
## AGE < 25 to the right, improve=1.548643, (0 missing)
## DURATION < 25.5 to the left, improve=1.431634, (0 missing)
## Surrogate splits:
## N_PEOPLE < 1.5 to the right, agree=0.676, adj=0.214, (0 split)
## CHECK_A splits as LR--, agree=0.647, adj=0.143, (0 split)
## DURATION < 46.5 to the right, agree=0.647, adj=0.143, (0 split)
## C_HIST splits as RLRRR, agree=0.647, adj=0.143, (0 split)
## PURPOSE splits as R-L-R----L, agree=0.647, adj=0.143, (0 split)
##
## Node number 61: 7 observations
## predicted class=1 expected loss=0 P(node) =0.00875
## class counts: 0 7
## probabilities: 0.000 1.000
##
## Node number 98: 25 observations
## predicted class=0 expected loss=0.08 P(node) =0.03125
## class counts: 23 2
## probabilities: 0.920 0.080
##
## Node number 99: 125 observations, complexity param=0.008333333
## predicted class=0 expected loss=0.392 P(node) =0.15625
## class counts: 76 49
## probabilities: 0.608 0.392
## left son=198 (111 obs) right son=199 (14 obs)
## Primary splits:
## INSTALL_P splits as RLL, improve=3.275120, (0 missing)
## SAVE_A splits as RRLLR, improve=2.980552, (0 missing)
## EMPLOY splits as RRRLR, improve=1.828009, (0 missing)
## DURATION < 15.5 to the left, improve=1.762862, (0 missing)
## RESIDENCE < 3.5 to the right, improve=1.650773, (0 missing)
##
## Node number 114: 20 observations, complexity param=0.01666667
## predicted class=0 expected loss=0.3 P(node) =0.025
## class counts: 14 6
## probabilities: 0.700 0.300
## left son=228 (12 obs) right son=229 (8 obs)
## Primary splits:
## EMPLOY splits as RLLRR, improve=5.400000, (0 missing)
## AGE < 32.5 to the left, improve=2.137374, (0 missing)
## AMOUNT < 3471 to the right, improve=1.600000, (0 missing)
## PROPERTY splits as LRLL, improve=1.167677, (0 missing)
## INSTALL_R < 3.5 to the left, improve=0.356044, (0 missing)
## Surrogate splits:
## AGE < 31 to the left, agree=0.75, adj=0.375, (0 split)
## JOB splits as LLLR, agree=0.75, adj=0.375, (0 split)
## PURPOSE splits as ---L-L-R--, agree=0.70, adj=0.250, (0 split)
## AMOUNT < 2303 to the right, agree=0.70, adj=0.250, (0 split)
## DURATION < 33 to the left, agree=0.65, adj=0.125, (0 split)
##
## Node number 115: 27 observations, complexity param=0.0125
## predicted class=1 expected loss=0.3333333 P(node) =0.03375
## class counts: 9 18
## probabilities: 0.333 0.667
## left son=230 (11 obs) right son=231 (16 obs)
## Primary splits:
## PROPERTY splits as LLRR, improve=3.409091, (0 missing)
## C_HIST splits as RLLRR, improve=1.729412, (0 missing)
## DURATION < 27 to the left, improve=1.200000, (0 missing)
## TEL splits as RL, improve=1.200000, (0 missing)
## SAVE_A splits as RLLLR, improve=1.071429, (0 missing)
## Surrogate splits:
## C_HIST splits as RRLRR, agree=0.741, adj=0.364, (0 split)
## INSTALL_R < 3.5 to the right, agree=0.704, adj=0.273, (0 split)
## INSTALL_P splits as RLR, agree=0.704, adj=0.273, (0 split)
## AMOUNT < 2614.5 to the left, agree=0.667, adj=0.182, (0 split)
## SAVE_A splits as RRRLL, agree=0.667, adj=0.182, (0 split)
##
## Node number 120: 14 observations
## predicted class=0 expected loss=0.1428571 P(node) =0.0175
## class counts: 12 2
## probabilities: 0.857 0.143
##
## Node number 121: 20 observations, complexity param=0.01041667
## predicted class=0 expected loss=0.5 P(node) =0.025
## class counts: 10 10
## probabilities: 0.500 0.500
## left son=242 (13 obs) right son=243 (7 obs)
## Primary splits:
## RESIDENCE < 3.5 to the left, improve=2.7472530, (0 missing)
## C_HIST splits as LRRLL, improve=1.6666670, (0 missing)
## SAVE_A splits as -RL-L, improve=1.6666670, (0 missing)
## PURPOSE splits as L-R-R----L, improve=0.9090909, (0 missing)
## PROPERTY splits as LRLR, improve=0.9090909, (0 missing)
## Surrogate splits:
## AGE < 41.5 to the left, agree=0.80, adj=0.429, (0 split)
## HOUSING splits as RLR, agree=0.80, adj=0.429, (0 split)
## AMOUNT < 12296.5 to the left, agree=0.75, adj=0.286, (0 split)
## PERSONAL splits as LRLL, agree=0.75, adj=0.286, (0 split)
## GUARANTEE splits as LRR, agree=0.75, adj=0.286, (0 split)
##
## Node number 198: 111 observations, complexity param=0.008333333
## predicted class=0 expected loss=0.3513514 P(node) =0.13875
## class counts: 72 39
## probabilities: 0.649 0.351
## left son=396 (12 obs) right son=397 (99 obs)
## Primary splits:
## SAVE_A splits as RLLLR, improve=3.321867, (0 missing)
## RESIDENCE < 3.5 to the right, improve=3.142084, (0 missing)
## GUARANTEE splits as RRL, improve=2.418124, (0 missing)
## C_HIST splits as R-RRL, improve=2.341963, (0 missing)
## AMOUNT < 1373 to the right, improve=1.459979, (0 missing)
##
## Node number 199: 14 observations
## predicted class=1 expected loss=0.2857143 P(node) =0.0175
## class counts: 4 10
## probabilities: 0.286 0.714
##
## Node number 228: 12 observations
## predicted class=0 expected loss=0 P(node) =0.015
## class counts: 12 0
## probabilities: 1.000 0.000
##
## Node number 229: 8 observations
## predicted class=1 expected loss=0.25 P(node) =0.01
## class counts: 2 6
## probabilities: 0.250 0.750
##
## Node number 230: 11 observations
## predicted class=0 expected loss=0.3636364 P(node) =0.01375
## class counts: 7 4
## probabilities: 0.636 0.364
##
## Node number 231: 16 observations
## predicted class=1 expected loss=0.125 P(node) =0.02
## class counts: 2 14
## probabilities: 0.125 0.875
##
## Node number 242: 13 observations
## predicted class=0 expected loss=0.3076923 P(node) =0.01625
## class counts: 9 4
## probabilities: 0.692 0.308
##
## Node number 243: 7 observations
## predicted class=1 expected loss=0.1428571 P(node) =0.00875
## class counts: 1 6
## probabilities: 0.143 0.857
##
## Node number 396: 12 observations
## predicted class=0 expected loss=0 P(node) =0.015
## class counts: 12 0
## probabilities: 1.000 0.000
##
## Node number 397: 99 observations, complexity param=0.008333333
## predicted class=0 expected loss=0.3939394 P(node) =0.12375
## class counts: 60 39
## probabilities: 0.606 0.394
## left son=794 (40 obs) right son=795 (59 obs)
## Primary splits:
## RESIDENCE < 3.5 to the right, improve=3.831202, (0 missing)
## C_HIST splits as R-RLL, improve=2.344156, (0 missing)
## GUARANTEE splits as RRL, improve=2.337945, (0 missing)
## AMOUNT < 1373 to the right, improve=1.900782, (0 missing)
## EMPLOY splits as RRRLR, improve=1.329870, (0 missing)
## Surrogate splits:
## AGE < 38 to the right, agree=0.707, adj=0.275, (0 split)
## EMPLOY splits as LRRRL, agree=0.677, adj=0.200, (0 split)
## HOUSING splits as LRL, agree=0.677, adj=0.200, (0 split)
## C_HIST splits as R-RLL, agree=0.657, adj=0.150, (0 split)
## PROPERTY splits as RRRL, agree=0.646, adj=0.125, (0 split)
##
## Node number 794: 40 observations, complexity param=0.00625
## predicted class=0 expected loss=0.225 P(node) =0.05
## class counts: 31 9
## probabilities: 0.775 0.225
## left son=1588 (19 obs) right son=1589 (21 obs)
## Primary splits:
## HOUSING splits as RLR, improve=2.150501, (0 missing)
## AGE < 41.5 to the left, improve=1.470000, (0 missing)
## AMOUNT < 1179.5 to the left, improve=1.175806, (0 missing)
## JOB splits as R-LR, improve=1.118459, (0 missing)
## CHECK_A splits as RL--, improve=1.015934, (0 missing)
## Surrogate splits:
## CHECK_A splits as RL--, agree=0.725, adj=0.421, (0 split)
## EMPLOY splits as RRRLL, agree=0.675, adj=0.316, (0 split)
## AGE < 36.5 to the right, agree=0.675, adj=0.316, (0 split)
## C_HIST splits as R-RLL, agree=0.650, adj=0.263, (0 split)
## AMOUNT < 1979.5 to the left, agree=0.650, adj=0.263, (0 split)
##
## Node number 795: 59 observations, complexity param=0.008333333
## predicted class=1 expected loss=0.4915254 P(node) =0.07375
## class counts: 29 30
## probabilities: 0.492 0.508
## left son=1590 (38 obs) right son=1591 (21 obs)
## Primary splits:
## AMOUNT < 1373 to the right, improve=2.762202, (0 missing)
## C_HIST splits as R-RRL, improve=2.571793, (0 missing)
## EMPLOY splits as RLRLR, improve=1.912702, (0 missing)
## DURATION < 11.5 to the left, improve=1.686293, (0 missing)
## PROPERTY splits as LLRR, improve=1.556743, (0 missing)
## Surrogate splits:
## FOREIGN splits as LR, agree=0.695, adj=0.143, (0 split)
## PURPOSE splits as L--LL-R---, agree=0.678, adj=0.095, (0 split)
## PERSONAL splits as RLLL, agree=0.661, adj=0.048, (0 split)
## INSTALL_P splits as -RL, agree=0.661, adj=0.048, (0 split)
##
## Node number 1588: 19 observations
## predicted class=0 expected loss=0.05263158 P(node) =0.02375
## class counts: 18 1
## probabilities: 0.947 0.053
##
## Node number 1589: 21 observations, complexity param=0.00625
## predicted class=0 expected loss=0.3809524 P(node) =0.02625
## class counts: 13 8
## probabilities: 0.619 0.381
## left son=3178 (14 obs) right son=3179 (7 obs)
## Primary splits:
## AGE < 38 to the left, improve=2.3333330, (0 missing)
## C_HIST splits as R-LLR, improve=1.5393770, (0 missing)
## EMPLOY splits as LRRLR, improve=1.1904760, (0 missing)
## AMOUNT < 3057.5 to the left, improve=0.7619048, (0 missing)
## N_EXIST < 1.5 to the left, improve=0.7619048, (0 missing)
## Surrogate splits:
## PURPOSE splits as L--LL-R---, agree=0.714, adj=0.143, (0 split)
## AMOUNT < 2544.5 to the left, agree=0.714, adj=0.143, (0 split)
## PERSONAL splits as RLLL, agree=0.714, adj=0.143, (0 split)
## PROPERTY splits as LLLR, agree=0.714, adj=0.143, (0 split)
## HOUSING splits as L-R, agree=0.714, adj=0.143, (0 split)
##
## Node number 1590: 38 observations, complexity param=0.008333333
## predicted class=0 expected loss=0.3947368 P(node) =0.0475
## class counts: 23 15
## probabilities: 0.605 0.395
## left son=3180 (20 obs) right son=3181 (18 obs)
## Primary splits:
## AGE < 29.5 to the right, improve=3.202339, (0 missing)
## TEL splits as LR, improve=2.088330, (0 missing)
## AMOUNT < 4064.5 to the left, improve=1.752365, (0 missing)
## EMPLOY splits as RLRLR, improve=1.564057, (0 missing)
## C_HIST splits as R-RRL, improve=1.474561, (0 missing)
## Surrogate splits:
## PERSONAL splits as LRLL, agree=0.684, adj=0.333, (0 split)
## C_HIST splits as R-RLL, agree=0.658, adj=0.278, (0 split)
## PURPOSE splits as L--LR-----, agree=0.658, adj=0.278, (0 split)
## RESIDENCE < 2.5 to the right, agree=0.658, adj=0.278, (0 split)
## AMOUNT < 1541.5 to the right, agree=0.632, adj=0.222, (0 split)
##
## Node number 1591: 21 observations
## predicted class=1 expected loss=0.2857143 P(node) =0.02625
## class counts: 6 15
## probabilities: 0.286 0.714
##
## Node number 3178: 14 observations
## predicted class=0 expected loss=0.2142857 P(node) =0.0175
## class counts: 11 3
## probabilities: 0.786 0.214
##
## Node number 3179: 7 observations
## predicted class=1 expected loss=0.2857143 P(node) =0.00875
## class counts: 2 5
## probabilities: 0.286 0.714
##
## Node number 3180: 20 observations
## predicted class=0 expected loss=0.2 P(node) =0.025
## class counts: 16 4
## probabilities: 0.800 0.200
##
## Node number 3181: 18 observations
## predicted class=1 expected loss=0.3888889 P(node) =0.0225
## class counts: 7 11
## probabilities: 0.389 0.611
Let's now look at the complexity parameter (CP) table and the cross-validated error (xerror). CP decreases as the tree grows, that is, as nsplit increases. The xerror column falls to a minimum of 0.85833 at row 8 (CP = 0.0125, 11 splits) and then rises again, to 0.88333 at row 9 and 0.90000 at row 12 (26 splits). Splits beyond row 8 therefore do not improve the cross-validated error, so we will prune the tree back to roughly that size.
printcp(mct) ## rpart's default cp is 0.01; a split that does not improve the fit by at least cp is not attempted, so cp was lowered to 0.005 here to grow a larger tree
##
## Classification tree:
## rpart(formula = mybankdata_train$CREDIT ~ ., data = mybankdata_train,
## method = "class", control = rpart.control(cp = 0.005))
##
## Variables actually used in tree construction:
## [1] AGE AMOUNT C_HIST CHECK_A DURATION EMPLOY HOUSING
## [8] INSTALL_P JOB PROPERTY PURPOSE RESIDENCE SAVE_A
##
## Root node error: 240/800 = 0.3
##
## n= 800
##
## CP nsplit rel error xerror xstd
## 1 0.0770833 0 1.00000 1.00000 0.054006
## 2 0.0333333 2 0.84583 0.89167 0.052167
## 3 0.0291667 3 0.81250 0.89167 0.052167
## 4 0.0250000 4 0.78333 0.89583 0.052245
## 5 0.0208333 5 0.75833 0.89583 0.052245
## 6 0.0187500 8 0.69583 0.88333 0.052012
## 7 0.0166667 10 0.65833 0.86250 0.051613
## 8 0.0125000 11 0.64167 0.85833 0.051531
## 9 0.0104167 12 0.62917 0.88333 0.052012
## 10 0.0083333 14 0.60833 0.89167 0.052167
## 11 0.0062500 24 0.50000 0.88750 0.052090
## 12 0.0050000 26 0.48750 0.90000 0.052321
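The reading of the CP table above can be checked with a few lines of R. This is a minimal sketch that retypes the printed values into a data frame (the names `cp_tab`, `best`, and `one_se` are ours, not from the original analysis). It also shows the one-standard-error rule, a common alternative convention that this analysis does not use, for comparison only.

```r
# CP table copied from the printcp(mct) output above.
# With the fitted model in memory, mct$cptable holds the same values.
cp_tab <- data.frame(
  CP     = c(0.0770833, 0.0333333, 0.0291667, 0.0250000, 0.0208333, 0.0187500,
             0.0166667, 0.0125000, 0.0104167, 0.0083333, 0.0062500, 0.0050000),
  nsplit = c(0, 2, 3, 4, 5, 8, 10, 11, 12, 14, 24, 26),
  xerror = c(1.00000, 0.89167, 0.89167, 0.89583, 0.89583, 0.88333,
             0.86250, 0.85833, 0.88333, 0.89167, 0.88750, 0.90000),
  xstd   = c(0.054006, 0.052167, 0.052167, 0.052245, 0.052245, 0.052012,
             0.051613, 0.051531, 0.052012, 0.052167, 0.052090, 0.052321)
)

# Row with the lowest cross-validated error
best <- cp_tab[which.min(cp_tab$xerror), ]

# One-standard-error rule (assumption: shown for comparison only):
# the smallest tree whose xerror is within one xstd of the minimum
threshold <- best$xerror + best$xstd
one_se <- cp_tab[which(cp_tab$xerror <= threshold)[1], ]

# xerror is relative to the root node error (240/800 = 0.3),
# so multiplying the two gives an absolute cross-validated error rate
root_error <- 240 / 800
best_cv_error_rate <- best$xerror * root_error

print(best)
print(one_se)
print(best_cv_error_rate)
```

The minimum-xerror row is the 11-split tree, while the one-standard-error rule would settle for a much smaller tree; which one to prefer is a judgment call between accuracy and simplicity.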
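Next we prune the tree. The minimum xerror (0.85833) is at row 8 of the table (CP = 0.0125, 11 splits). The call below passes cp = 0.0104167, the CP of row 9, so the pruned tree keeps the first nine rows of the CP table and has 12 splits.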
mct_final <- prune(mct,cp = 0.0104167)
rpart.plot(mct_final,shadow.col="gray",cex=0.7)
printcp(mct_final)
##
## Classification tree:
## rpart(formula = mybankdata_train$CREDIT ~ ., data = mybankdata_train,
## method = "class", control = rpart.control(cp = 0.005))
##
## Variables actually used in tree construction:
## [1] AMOUNT C_HIST CHECK_A DURATION EMPLOY PROPERTY PURPOSE
## [8] RESIDENCE SAVE_A
##
## Root node error: 240/800 = 0.3
##
## n= 800
##
## CP nsplit rel error xerror xstd
## 1 0.077083 0 1.00000 1.00000 0.054006
## 2 0.033333 2 0.84583 0.89167 0.052167
## 3 0.029167 3 0.81250 0.89167 0.052167
## 4 0.025000 4 0.78333 0.89583 0.052245
## 5 0.020833 5 0.75833 0.89583 0.052245
## 6 0.018750 8 0.69583 0.88333 0.052012
## 7 0.016667 10 0.65833 0.86250 0.051613
## 8 0.012500 11 0.64167 0.85833 0.051531
## 9 0.010417 12 0.62917 0.88333 0.052012
plotcp(mct_final)
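Instead of retyping a rounded CP value, the threshold can be read straight from the fitted model's CP table. The sketch below is illustrative only: it uses the `kyphosis` data that ships with rpart as a stand-in for the bank data, and `fit`, `best_cp`, `pruned`, and `n_splits` are hypothetical names.

```r
library(rpart)

set.seed(42)  # cross-validation folds are random, so fix the seed

# Stand-in model on the kyphosis data bundled with rpart (not the bank data)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", control = rpart.control(cp = 0.001))

# CP of the row with the lowest cross-validated error
best_row <- which.min(fit$cptable[, "xerror"])
best_cp  <- fit$cptable[best_row, "CP"]

# Trim the tree back to that complexity
pruned <- prune(fit, cp = best_cp)

# Count internal (non-leaf) nodes, i.e. splits, before and after pruning
n_splits <- function(tree) sum(tree$frame$var != "<leaf>")
cat("splits before:", n_splits(fit), " after:", n_splits(pruned), "\n")
```

Applied to this analysis, the same pattern would read the CP from `mct$cptable` rather than from the printed, rounded table.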
Pruning is how we guard against overfitting, which is common with decision trees: the tree should be large enough to capture the signal but not so large that it memorizes the training data. First we predict on the test data to see the probability output; each row holds the probability of class 0 and the probability of class 1. The predicted class is then stored in the CREDIT_PREDICTED column of the test data frame and compared with the actual CREDIT column by building a confusion matrix. The accuracy here is 65% ((103 + 27) / 200), which is below the no-information rate of 0.68 reported in the output.
mybankdata_predict <- predict(mct_final,mybankdata_test)
View(mybankdata_predict)
mybankdata_test$CREDIT_PREDICTED <- predict(mct,mybankdata_test, type = 'class') ## note: this scores the unpruned tree mct; pass mct_final instead to score the pruned tree
xlab <- table(actualclass=mybankdata_test$CREDIT,predictedclass=mybankdata_test$CREDIT_PREDICTED) ## generating the table format
confusionMatrix(xlab) # generating the confusion matrix
## Confusion Matrix and Statistics
##
## predictedclass
## actualclass 0 1
## 0 103 37
## 1 33 27
##
## Accuracy : 0.65
## 95% CI : (0.5795, 0.7159)
## No Information Rate : 0.68
## P-Value [Acc > NIR] : 0.8379
##
## Kappa : 0.1822
## Mcnemar's Test P-Value : 0.7199
##
## Sensitivity : 0.7574
## Specificity : 0.4219
## Pos Pred Value : 0.7357
## Neg Pred Value : 0.4500
## Prevalence : 0.6800
## Detection Rate : 0.5150
## Detection Prevalence : 0.7000
## Balanced Accuracy : 0.5896
##
## 'Positive' Class : 0
##
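The headline numbers can be recomputed by hand from the four counts in the confusion matrix. The sketch below retypes those counts (the names `cm`, `recall_0`, and so on are ours) and derives per-class recall and precision using the table's own orientation, with actual classes in rows and predicted classes in columns.

```r
# Counts copied from the confusion matrix above: rows = actual, columns = predicted
cm <- matrix(c(103, 33, 37, 27), nrow = 2,
             dimnames = list(actualclass = c("0", "1"),
                             predictedclass = c("0", "1")))

# Accuracy: correct predictions (the diagonal) over all test rows
accuracy <- sum(diag(cm)) / sum(cm)

# Recall per class: correct predictions over the actual count of that class (row sums)
recall_0 <- cm["0", "0"] / sum(cm["0", ])
recall_1 <- cm["1", "1"] / sum(cm["1", ])

# Precision per class: correct predictions over the predicted count (column sums)
precision_0 <- cm["0", "0"] / sum(cm[, "0"])
precision_1 <- cm["1", "1"] / sum(cm[, "1"])

print(round(c(accuracy = accuracy, recall_0 = recall_0, recall_1 = recall_1,
              precision_0 = precision_0, precision_1 = precision_1), 4))
```

Here `recall_0` (0.7357) matches the value the output labels Pos Pred Value, and `precision_0` (0.7574) matches the value labelled Sensitivity. This suggests that `confusionMatrix()` read the table's rows as predictions, so the sensitivity and predictive-value labels above should be interpreted with the table orientation in mind. Accuracy is unaffected by orientation.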