DEFINE PROBLEM STATEMENT
Banks play a vital role in providing financial services to their customers, and lending carries substantial risk, so it is very important to assess the risk of payment default.
Using the 'ccdefault' dataset, can we predict whether a customer's repayment will be successful or not?
Variable selection
This study reviewed the literature and shortlisted 21 explanatory variables, obtained by dropping the remaining columns:
ccdefault<-ccdefault[,c(-1,-3,-5)]  # drop columns 1, 3 and 5 (ID, SEX and MARRIAGE in the standard layout of this dataset)
DATASET (displaying only 5 observations)

| LIMIT_BAL | EDUCATION | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20000 | 2 | 24 | 2 | 2 | -1 | -1 | -2 | -2 | 3913 | 3102 | 689 | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 120000 | 2 | 26 | -1 | 2 | 0 | 0 | 0 | 2 | 2682 | 1725 | 2682 | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 90000 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | 0 | 29239 | 14027 | 13559 | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 50000 | 2 | 37 | 0 | 0 | 0 | 0 | 0 | 0 | 46990 | 48233 | 49291 | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 50000 | 2 | 57 | -1 | 0 | -1 | 0 | 0 | 0 | 8617 | 5670 | 35835 | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |
Check for missing values
sum(is.na(ccdefault))
[1] 0
summary(ccdefault)
LIMIT_BAL EDUCATION AGE PAY_0
Min. : 10000 Min. :0.000 Min. :21.00 Min. :-2.0000
1st Qu.: 50000 1st Qu.:1.000 1st Qu.:28.00 1st Qu.:-1.0000
Median : 140000 Median :2.000 Median :34.00 Median : 0.0000
Mean : 167484 Mean :1.853 Mean :35.49 Mean :-0.0167
3rd Qu.: 240000 3rd Qu.:2.000 3rd Qu.:41.00 3rd Qu.: 0.0000
Max. :1000000 Max. :6.000 Max. :79.00 Max. : 8.0000
PAY_2 PAY_3 PAY_4 PAY_5
Min. :-2.0000 Min. :-2.0000 Min. :-2.0000 Min. :-2.0000
1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000
Median : 0.0000 Median : 0.0000 Median : 0.0000 Median : 0.0000
Mean :-0.1338 Mean :-0.1662 Mean :-0.2207 Mean :-0.2662
3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
Max. : 8.0000 Max. : 8.0000 Max. : 8.0000 Max. : 8.0000
PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3
Min. :-2.0000 Min. :-165580 Min. :-69777 Min. :-157264
1st Qu.:-1.0000 1st Qu.: 3559 1st Qu.: 2985 1st Qu.: 2666
Median : 0.0000 Median : 22382 Median : 21200 Median : 20089
Mean :-0.2911 Mean : 51223 Mean : 49179 Mean : 47013
3rd Qu.: 0.0000 3rd Qu.: 67091 3rd Qu.: 64006 3rd Qu.: 60165
Max. : 8.0000 Max. : 964511 Max. :983931 Max. :1664089
BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1
Min. :-170000 Min. :-81334 Min. :-339603 Min. : 0
1st Qu.: 2327 1st Qu.: 1763 1st Qu.: 1256 1st Qu.: 1000
Median : 19052 Median : 18105 Median : 17071 Median : 2100
Mean : 43263 Mean : 40311 Mean : 38872 Mean : 5664
3rd Qu.: 54506 3rd Qu.: 50191 3rd Qu.: 49198 3rd Qu.: 5006
Max. : 891586 Max. :927171 Max. : 961664 Max. :873552
PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5
Min. : 0 Min. : 0 Min. : 0 Min. : 0.0
1st Qu.: 833 1st Qu.: 390 1st Qu.: 296 1st Qu.: 252.5
Median : 2009 Median : 1800 Median : 1500 Median : 1500.0
Mean : 5921 Mean : 5226 Mean : 4826 Mean : 4799.4
3rd Qu.: 5000 3rd Qu.: 4505 3rd Qu.: 4013 3rd Qu.: 4031.5
Max. :1684259 Max. :896040 Max. :621000 Max. :426529.0
PAY_AMT6 default payment next month
Min. : 0.0 Min. :0.0000
1st Qu.: 117.8 1st Qu.:0.0000
Median : 1500.0 Median :0.0000
Mean : 5215.5 Mean :0.2212
3rd Qu.: 4000.0 3rd Qu.:0.0000
Max. :528666.0 Max. :1.0000
The 'sum' function above reports zero 'NA' values, and the summary output confirms it, so we can conclude that there are no missing values.
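As an extra check (a minimal base-R sketch, not part of the original output), the missing-value count can also be broken down per column:
# per-column NA counts; every entry should be 0 for this dataset
colSums(is.na(ccdefault))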
Check for outliers
boxplot(ccdefault)

dim(ccdefault)
[1] 30000 22
From the boxplots, we can infer that some values lie far from the bulk of the data, and a few are extreme even relative to the other outliers; these extreme values can be removed, as they may affect model accuracy.
Remove extreme outliers
ccdefault<-ccdefault[ccdefault$LIMIT_BAL<1000000,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT1<0,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT1>7e+05,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT2>6e+05,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT3>6e+05,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT3<0,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT4>6e+05,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT5>6e+05,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT5<0,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT6>6e+05,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT6<0,]
ccdefault<-ccdefault[!ccdefault$PAY_AMT1>3e+05,]
ccdefault<-ccdefault[!ccdefault$PAY_AMT2>5e+05,]
ccdefault<-ccdefault[!ccdefault$PAY_AMT3>4e+05,]
ccdefault<-ccdefault[!ccdefault$PAY_AMT4>3e+05,]
ccdefault<-ccdefault[!ccdefault$PAY_AMT5>350000,]
ccdefault<-ccdefault[!ccdefault$PAY_AMT6>3e+05,]
boxplot(ccdefault)

dim(ccdefault)
[1] 28337 22
Here, after eliminating some extreme outliers, we can see that the remaining outliers are close to one another; these shouldn't be removed, since dropping them would bias the sample.
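For comparison, a sketch of how such cut-offs could be derived from percentiles rather than read off the boxplots (illustrative only; the analysis keeps the hand-picked thresholds above):
# e.g. take the 99.9th percentile of a bill amount as the upper cut-off
cut_bill1<-quantile(ccdefault$BILL_AMT1,probs=0.999)
# ccdefault<-ccdefault[ccdefault$BILL_AMT1<=cut_bill1,]  # would replace the manual filter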
Now let's use this preprocessed data, with 28337 observations and 22 variables, for the analysis, but first let's check the multicollinearity among the predictors.
Check for multicollinearity
library(corrplot)
corrplot 0.84 loaded
corrplot(cor(ccdefault[1:21]),method = "number")

Multicollinearity among predictors should be avoided. From the correlation matrix plot above we see that the payment-status variables are highly correlated among themselves, and likewise the bill-amount variables, as expected, since they reflect the cumulative status and amounts of the preceding months; apart from this, the correlation among the other variables is low, which is satisfactory.
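As a complementary check (a sketch assuming the car package is installed; this was not run in the original analysis), variance inflation factors quantify multicollinearity numerically:
library(car)
# VIF above ~10 is a common warning sign of problematic multicollinearity
vif(lm(`default payment next month` ~ ., data=ccdefault))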
Resampling the pre-processed data
Let's split the data into a train set of 11335 observations, a test1 set of 8501 observations and a test2 set of 8501 observations, using the validation-set (hold-out) approach.
library(caret)
Loading required package: lattice
Loading required package: ggplot2
splitccdefault<-createDataPartition(ccdefault$`default payment next month`,p=0.4,list = FALSE)
train<-ccdefault[splitccdefault,]
test<-ccdefault[-splitccdefault,]
splittest<-createDataPartition(test$`default payment next month`,p=0.5,list = FALSE)
test1<-test[splittest,]
test2<-test[-splittest,]
dim(train)
[1] 11335 22
dim(test1)
[1] 8501 22
dim(test2)
[1] 8501 22
Now models are built on the train data using several algorithms; the best model is chosen among them based on various parameters, and the outcomes are then predicted with that best model.
BINARY LOGISTIC REGRESSION
Model Building
model_bin<-glm(`default payment next month` ~ ., data=train,family = binomial(link="logit"))
exp(model_bin$coefficients)
(Intercept) LIMIT_BAL EDUCATION AGE PAY_0 PAY_2
0.2711845 0.9999993 0.9394406 1.0120411 1.8252271 1.0652141
PAY_3 PAY_4 PAY_5 PAY_6 BILL_AMT1 BILL_AMT2
1.0810926 1.0412221 1.0463908 1.0002311 0.9999952 0.9999979
BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2
1.0000043 0.9999970 1.0000024 1.0000026 0.9999907 0.9999879
PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
0.9999921 0.9999947 0.9999910 1.0000013
Here, the model is built as a generalised linear model (binary logistic regression).
Exponentiating the coefficients converts them into odds ratios for interpretation: when all predictors are zero, the odds of default are about 0.27; when LIMIT_BAL increases by one unit, the odds of default are multiplied by 0.9999993, i.e. they decrease very slightly per unit of credit limit; the other variables are interpreted similarly through their respective odds ratios.
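A short optional sketch (not part of the original output) that attaches 95% profile-likelihood confidence intervals to these odds ratios:
# odds ratios with 95% confidence intervals; profiling can take a moment on ~11k rows
exp(cbind(OR=coef(model_bin),confint(model_bin)))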
Predicting On Train Data
pred_train<-ifelse(model_bin$fitted.values>0.5,1,0)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 8533 262
1 1862 678
Accuracy : 0.8126
95% CI : (0.8053, 0.8198)
No Information Rate : 0.9171
P-Value [Acc > NIR] : 1
Kappa : 0.3056
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8209
Specificity : 0.7213
Pos Pred Value : 0.9702
Neg Pred Value : 0.2669
Prevalence : 0.9171
Detection Rate : 0.7528
Detection Prevalence : 0.7759
Balanced Accuracy : 0.7711
'Positive' Class : 0
library(ROCR)
Loading required package: gplots
Attaching package: 'gplots'
The following object is masked from 'package:stats':
lowess
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6185697
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

library(DMwR)
Loading required package: grid
regr.eval(train$`default payment next month`,pred_train)
mae mse rmse mape
0.1873842 0.1873842 0.4328790 NaN
Here, from the contingency table, the model predicts 9211 observations correctly, i.e. with 81% accuracy, but with kappa at 31% (< 70%) the inter-rater reliability is not satisfactory (i.e. we can't rely fully on this model).
The true positive rate is 82% and the true negative rate is 72%, which is fairly satisfactory, and 77% of the time the classes '0' and '1' are predicted correctly (from the balanced accuracy).
The area under the curve, obtained from the receiver operating characteristic curve, is 62%, which is only modest.
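Note that the ROC curve above ranks the hard 0/1 predictions; a sketch (an addition, not in the original analysis) that ranks the fitted probabilities instead gives the conventional, smoother curve and typically a higher AUC:
# use the raw fitted probabilities rather than thresholded class labels
pr_prob<-prediction(model_bin$fitted.values,as.numeric(train$`default payment next month`))
performance(pr_prob,measure="auc")@y.values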
NAIVE BAYES METHOD
Model Building
train$`default payment next month`<-as.factor(train$`default payment next month`)
class(train$`default payment next month`)
[1] "factor"
library(naivebayes)
model_nb<-naive_bayes(`default payment next month` ~ ., data=train)
Predicting On Train Data
pred_train<-predict(model_nb,train,type="class")
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 7048 1747
1 1006 1534
Accuracy : 0.7571
95% CI : (0.7491, 0.765)
No Information Rate : 0.7105
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3672
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.8751
Specificity : 0.4675
Pos Pred Value : 0.8014
Neg Pred Value : 0.6039
Prevalence : 0.7105
Detection Rate : 0.6218
Detection Prevalence : 0.7759
Balanced Accuracy : 0.6713
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7026507
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Naive Bayes is a probabilistic model: it computes a MAP (maximum a posteriori) class probability by combining the prior class probabilities with the likelihood of the observed features.
Here it predicts 8582 values correctly, so the accuracy is 76%; kappa is low at 37%, the specificity is only 47%, and the binary classes are predicted correctly 67% of the time (balanced accuracy), with an area under the curve of 70%. Only the true positive rate of 88% is satisfactory, making this the weakest model so far.
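To make the MAP idea concrete, here is a tiny worked sketch with made-up numbers (purely illustrative; these values are not taken from the dataset):
prior<-c(no=0.78,yes=0.22)            # hypothetical class priors
likelihood<-c(no=0.10,yes=0.40)       # hypothetical P(feature value | class)
posterior<-prior*likelihood/sum(prior*likelihood)
posterior                             # the larger entry is the MAP class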
CART(CLASSIFICATION AND REGRESSION TREES)
Model Building
library(rpart)
model_cart<-rpart(`default payment next month` ~ ., data=train)
summary(model_cart)
Call:
rpart(formula = `default payment next month` ~ ., data = train)
n= 11335
CP nsplit rel error xerror xstd
1 0.1964567 0 1.0000000 1.0000000 0.01747794
2 0.0100000 1 0.8035433 0.8035433 0.01610565
Variable importance
PAY_0 PAY_4 PAY_5 PAY_6 PAY_3 PAY_2
85 4 3 3 3 2
Node number 1: 11335 observations, complexity param=0.1964567
predicted class=0 expected loss=0.2240847 P(node) =1
class counts: 8795 2540
probabilities: 0.776 0.224
left son=2 (10106 obs) right son=3 (1229 obs)
Primary splits:
PAY_0 < 1.5 to the left, improve=632.3547, (0 missing)
PAY_2 < 1.5 to the left, improve=464.2895, (0 missing)
PAY_3 < 1.5 to the left, improve=363.1352, (0 missing)
PAY_4 < 1 to the left, improve=330.6260, (0 missing)
PAY_5 < 1 to the left, improve=315.4737, (0 missing)
Surrogate splits:
PAY_4 < 2.5 to the left, agree=0.896, adj=0.044, (0 split)
PAY_5 < 2.5 to the left, agree=0.896, adj=0.041, (0 split)
PAY_6 < 2.5 to the left, agree=0.895, adj=0.033, (0 split)
PAY_3 < 3.5 to the left, agree=0.895, adj=0.031, (0 split)
PAY_2 < 2.5 to the left, agree=0.894, adj=0.025, (0 split)
Node number 2: 10106 observations
predicted class=0 expected loss=0.1658421 P(node) =0.8915748
class counts: 8430 1676
probabilities: 0.834 0.166
Node number 3: 1229 observations
predicted class=1 expected loss=0.2969894 P(node) =0.1084252
class counts: 365 864
probabilities: 0.297 0.703
Predicting On Train Data
pred_train<-predict(model_cart,train,type="class")
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 8430 365
1 1676 864
Accuracy : 0.8199
95% CI : (0.8127, 0.827)
No Information Rate : 0.8916
P-Value [Acc > NIR] : 1
Kappa : 0.3658
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8342
Specificity : 0.7030
Pos Pred Value : 0.9585
Neg Pred Value : 0.3402
Prevalence : 0.8916
Detection Rate : 0.7437
Detection Prevalence : 0.7759
Balanced Accuracy : 0.7686
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6493283
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

In the CART model the splitting variable at each node is selected by the Gini index, i.e. the split with the smallest Gini impurity is chosen; the tree is pruned with a complexity parameter of 0.01.
Here it predicts 9294 values correctly with 82% accuracy, an inter-rater reliability (kappa) of 37%, a true positive rate of 83%, a true negative rate of 70%, a balanced accuracy of 77% and an area under the curve of 65%.
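A minimal sketch of the Gini impurity underlying the split choice (illustrative only):
# Gini impurity from class proportions; 0 = pure node, 0.5 = worst case for two classes
gini<-function(p) 1-sum(p^2)
gini(c(0.776,0.224))  # root node of the tree above
gini(c(0.297,0.703))  # right child after the PAY_0 < 1.5 split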
C5.0 DECISION TREE
Model Building
library(C50)
train$`default payment next month`<-as.factor(train$`default payment next month`)
class(train$`default payment next month`)
[1] "factor"
model_c50<-C5.0(`default payment next month` ~ ., data=train)
summary(model_c50)
Call:
C5.0.formula(formula = `default payment next month` ~ ., data = train)
C5.0 [Release 2.07 GPL Edition] Wed Aug 15 08:37:56 2018
-------------------------------
Class specified by attribute `outcome'
Read 11335 cases (22 attributes) from undefined.data
Decision tree:
PAY_0 > 1: 1 (1229/365)
PAY_0 <= 1:
:...PAY_2 <= 1: 0 (9258/1321)
PAY_2 > 1:
:...PAY_6 <= 0: 0 (590/210)
PAY_6 > 0: 1 (258/113)
Evaluation on training data (11335 cases):
Decision Tree
----------------
Size Errors
4 2009(17.7%) <<
(a) (b) <-classified as
---- ----
8317 478 (a): class 0
1531 1009 (b): class 1
Attribute usage:
100.00% PAY_0
89.16% PAY_2
7.48% PAY_6
Time: 0.5 secs
Predicting On Train Data
pred_train<-predict(model_c50,train)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 8317 478
1 1531 1009
Accuracy : 0.8228
95% CI : (0.8156, 0.8298)
No Information Rate : 0.8688
P-Value [Acc > NIR] : 1
Kappa : 0.4022
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8445
Specificity : 0.6785
Pos Pred Value : 0.9457
Neg Pred Value : 0.3972
Prevalence : 0.8688
Detection Rate : 0.7337
Detection Prevalence : 0.7759
Balanced Accuracy : 0.7615
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6714475
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here, in C5.0, the node is selected based on the information gain ratio, and pruning is done with a confidence threshold of 0.25.
It predicts 9326 values correctly with 82% accuracy, a kappa of 40%, a true positive rate of 84%, a true negative rate of 68%, a balanced accuracy of 76% and an area under the curve of 67%.
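For comparison with CART's Gini criterion, a sketch of the entropy behind C5.0's information gain ratio (illustrative only):
# Shannon entropy; information gain = parent entropy minus weighted child entropy
entropy<-function(p) -sum(p[p>0]*log2(p[p>0]))
entropy(c(0.776,0.224))  # impurity of the root before the PAY_0 split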
SVM (SUPPORT VECTOR MACHINES) ADVANCED MODEL
Model Building
library(e1071)
model_svm<-svm(`default payment next month` ~ ., data=train,cost=100,gamma=1)
summary(model_svm)
Call:
svm(formula = `default payment next month` ~ ., data = train,
cost = 100, gamma = 1)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 100
gamma: 1
Number of Support Vectors: 7723
( 2310 5413 )
Number of Classes: 2
Levels:
0 1
Predicting On Train Data
pred_train<-predict(model_svm,train)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 8780 15
1 163 2377
Accuracy : 0.9843
95% CI : (0.9818, 0.9865)
No Information Rate : 0.789
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9539
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9818
Specificity : 0.9937
Pos Pred Value : 0.9983
Neg Pred Value : 0.9358
Prevalence : 0.7890
Detection Rate : 0.7746
Detection Prevalence : 0.7759
Balanced Accuracy : 0.9878
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.9670606
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here, the model is built with a radial basis function kernel with the parameter gamma set to 1, and the classes of the outcome variable are separated using the hyperplane concept.
It predicts 11157 values correctly with an accuracy of 98%, a kappa of 95%, a true positive rate of 98% and a true negative rate of 99%; the binary classes are predicted correctly 99% of the time (balanced accuracy) and the area under the curve is 97%, so on the train data it shows very high performance.
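For reference, a sketch of the radial basis kernel used above, with gamma = 1 (illustrative only):
# K(x, y) = exp(-gamma * ||x - y||^2); larger gamma means a more local, flexible boundary
rbf<-function(x,y,gamma=1) exp(-gamma*sum((x-y)^2))
rbf(c(0,0),c(1,1))
With cost = 100 and gamma = 1 the decision boundary is extremely flexible, which is consistent with the near-perfect fit on the train data and, as the test sections below show, with poor generalisation.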
CART BAGGING ENSEMBLE MODEL
Model Building
library(ipred)
model_cartbag<-bagging(`default payment next month` ~ ., data=train)
varImp(model_cartbag)
Overall
AGE 910.9295
BILL_AMT1 1011.4888
BILL_AMT2 922.3781
BILL_AMT3 813.0373
BILL_AMT4 754.9019
BILL_AMT5 710.5565
BILL_AMT6 656.7559
EDUCATION 322.6778
LIMIT_BAL 810.4052
PAY_0 927.7674
PAY_2 777.1373
PAY_3 690.3627
PAY_4 646.8209
PAY_5 582.1123
PAY_6 214.4319
PAY_AMT1 636.2980
PAY_AMT2 648.5776
PAY_AMT3 621.9875
PAY_AMT4 574.8440
PAY_AMT5 573.7682
PAY_AMT6 535.8119
Predicting On Train Data
pred_train<-predict(model_cartbag,train,type="class")
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 8789 6
1 59 2481
Accuracy : 0.9943
95% CI : (0.9927, 0.9956)
No Information Rate : 0.7806
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9834
Mcnemar's Test P-Value : 1.12e-10
Sensitivity : 0.9933
Specificity : 0.9976
Pos Pred Value : 0.9993
Neg Pred Value : 0.9768
Prevalence : 0.7806
Detection Rate : 0.7754
Detection Prevalence : 0.7759
Balanced Accuracy : 0.9955
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.9880447
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here, in CART bagging, many trees of the same type are fitted to bootstrap resamples of the train data and their predictions are aggregated (ipred fits 25 bootstrap trees by default), which reduces variance.
It predicts 11270 values correctly with an accuracy of 99%; with inter-rater reliability (kappa) of 98% the model appears reliable, the true positive rate is 99%, the true negative rate is nearly 100%, the balanced accuracy is 99.6% and the area under the curve is 99%, so on the train data it looks like a very good model.
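A minimal sketch of what bagging does conceptually (assuming rpart; this mirrors the idea, not the ipred internals):
library(rpart)
set.seed(42)
B<-25  # number of bootstrap trees
votes<-replicate(B,{
  idx<-sample(nrow(train),replace=TRUE)                      # bootstrap resample
  fit<-rpart(`default payment next month` ~ .,data=train[idx,],method="class")
  as.integer(as.character(predict(fit,train,type="class")))  # 0/1 votes
})
pred_bag<-ifelse(rowMeans(votes)>0.5,1,0)                    # majority vote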
ADAPTIVE BOOSTING ENSEMBLE MODEL
Model Building
library(ada)
model_ada<-ada(`default payment next month` ~ ., data=train,loss="exponential",type="discrete",iter=100)
model_ada
Call:
ada(`default payment next month` ~ ., data = train, loss = "exponential",
type = "discrete", iter = 100)
Loss: exponential Method: discrete Iteration: 100
Final Confusion Matrix for Data:
Final Prediction
True value 0 1
0 8381 414
1 1564 976
Train Error: 0.175
Out-Of-Bag Error: 0.175 iteration= 82
Additional Estimates of number of iterations:
train.err1 train.kap1
100 23
plot(model_ada)

Predicting On Train Data
pred_train<-predict(model_ada,train)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 8381 414
1 1564 976
Accuracy : 0.8255
95% CI : (0.8184, 0.8324)
No Information Rate : 0.8774
P-Value [Acc > NIR] : 1
Kappa : 0.4019
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8427
Specificity : 0.7022
Pos Pred Value : 0.9529
Neg Pred Value : 0.3843
Prevalence : 0.8774
Detection Rate : 0.7394
Detection Prevalence : 0.7759
Balanced Accuracy : 0.7724
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6685899
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here, in boosting, weak learners (decision stumps) are combined into a strong learner, with each iteration re-weighting the misclassified observations; 100 iterations are run with the exponential loss, giving a train error and an out-of-bag error of about 17.5%.
It predicts 9357 values correctly with an accuracy of 83%, a kappa of 40%, a true positive rate of 84%, a true negative rate of 70%, a balanced accuracy of 77% and an area under the curve of 67%; it is quite a good model, though the kappa is still low.
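A tiny sketch of the discrete AdaBoost re-weighting step that drives this (toy numbers, purely illustrative):
err<-0.3                               # hypothetical weighted error of one weak learner
alpha<-0.5*log((1-err)/err)            # that learner's voting weight
w<-rep(1/5,5)                          # five observations with equal starting weights
miss<-c(TRUE,FALSE,FALSE,TRUE,FALSE)   # which observations it misclassified
w<-w*exp(alpha*miss); w<-w/sum(w)      # misclassified points gain weight
w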
XTREME GRADIENT BOOSTING ENSEMBLE MODEL
Model Building
library(xgboost)
x<-train[,1:21]
y<-train[,22]
model_xgb<-xgboost(data=as.matrix(x),label = as.matrix(y),nrounds = 100)
[1] train-rmse:0.434421
[2] train-rmse:0.396022
[3] train-rmse:0.374531
... (the train RMSE falls steadily over the intermediate rounds) ...
[99] train-rmse:0.231134
[100] train-rmse:0.230896
Predicting On Train Data
pred<-predict(model_xgb,as.matrix(x))
pred_train<-round(pred)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 8722 73
1 686 1854
Accuracy : 0.933
95% CI : (0.9283, 0.9376)
No Information Rate : 0.83
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7894
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9271
Specificity : 0.9621
Pos Pred Value : 0.9917
Neg Pred Value : 0.7299
Prevalence : 0.8300
Detection Rate : 0.7695
Detection Prevalence : 0.7759
Balanced Accuracy : 0.9446
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.8608105
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here, in the extreme gradient boosting model, the train root mean square error is reported at each of the 100 boosting rounds, decreasing steadily to about 0.231 at the final round (the default xgboost objective is squared-error regression, which is why an RMSE is reported; the continuous predictions are then rounded into classes).
The model attains 93% accuracy by predicting 10576 values correctly; with inter-rater reliability (kappa) of 79% we can place more trust in it, the true positive rate is 93%, the true negative rate is 96%, the balanced accuracy is 94% and the area under the curve is 86%, so it is a good model.
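Because the response is binary, a sketch of an alternative configuration (not what was run above) uses xgboost's logistic objective so the model outputs class probabilities directly instead of regression scores:
library(xgboost)
ylab<-as.numeric(as.character(train$`default payment next month`))  # ensure 0/1 numeric labels
model_xgb2<-xgboost(data=as.matrix(train[,1:21]),label=ylab,nrounds=100,
                    objective="binary:logistic",verbose=0)
pred_prob<-predict(model_xgb2,as.matrix(train[,1:21]))  # P(default)
pred_class<-ifelse(pred_prob>0.5,1,0)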
Let's compare the different algorithmic models on the various parameters discussed so far and decide which is the better model for our research.
SUMMARY OF DIFFERENT MODELS ON TRAIN DATA
SUMMARY
| Model | Accuracy | Kappa | Sensitivity | Specificity | Balanced Accuracy | AUC |
|---|---|---|---|---|---|---|
| cart.bagging_train | 99 | 99 | 99 | 99 | 99.6 | 99 |
| svm_train | 98 | 94 | 98 | 99 | 98.4 | 96 |
| xgb.boosting_train | 93 | 78 | 92 | 96 | 94.3 | 85 |
| ada.boosting_train | 82 | 38 | 84 | 69 | 76.3 | 66 |
| c50_train | 82 | 37 | 83 | 71 | 77.0 | 65 |
| cart_train | 81 | 35 | 83 | 69 | 76.1 | 64 |
| binary.logistic.regression_train | 81 | 28 | 82 | 70 | 76.0 | 61 |
| naive.bayes_train | 66 | 27 | 88 | 37 | 62.6 | 68 |

(all values in %)

From the model accuracy plot, the naive Bayes model is eliminated because of its low accuracy. The CART bagging, SVM and XG boosting models have high accuracy, and the summary shows that all the parameters are high for these models; so let's predict the payment defaulters on the test1 and test2 data and check whether the models suffer from high bias or variance.
Therefore, let's predict the outcome with CART bagging, SVM, XG boosting, adaptive boosting and the C5.0 decision tree, and compare their performance in terms of bias and variance.
PREDICTING FUTURE DATA (i.e on test1 and on test2 data)
Using Cart Bagging Ensemble Model
library(caret)
pred_test1<-predict(model_cartbag,test1)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6151 469
1 1152 729
Accuracy : 0.8093
95% CI : (0.8008, 0.8176)
No Information Rate : 0.8591
P-Value [Acc > NIR] : 1
Kappa : 0.364
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8423
Specificity : 0.6085
Pos Pred Value : 0.9292
Neg Pred Value : 0.3876
Prevalence : 0.8591
Detection Rate : 0.7236
Detection Prevalence : 0.7787
Balanced Accuracy : 0.7254
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6583569
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

pred_test2<-predict(model_cartbag,test2)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6087 476
1 1188 750
Accuracy : 0.8043
95% CI : (0.7957, 0.8126)
No Information Rate : 0.8558
P-Value [Acc > NIR] : 1
Kappa : 0.3612
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8367
Specificity : 0.6117
Pos Pred Value : 0.9275
Neg Pred Value : 0.3870
Prevalence : 0.8558
Detection Rate : 0.7160
Detection Prevalence : 0.7720
Balanced Accuracy : 0.7242
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6572345
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

From the above analysis, all parameters except kappa still have reasonable values on the test data, but they are far below the training values, which points to overfitting.
Using Support Vector Machines Advanced Model
pred_test1<-predict(model_svm,test1)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 5839 781
1 1378 503
Accuracy : 0.746
95% CI : (0.7366, 0.7553)
No Information Rate : 0.849
P-Value [Acc > NIR] : 1
Kappa : 0.1686
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8091
Specificity : 0.3917
Pos Pred Value : 0.8820
Neg Pred Value : 0.2674
Prevalence : 0.8490
Detection Rate : 0.6869
Detection Prevalence : 0.7787
Balanced Accuracy : 0.6004
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.5747176
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

pred_test2<-predict(model_svm,test2)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 5797 766
1 1383 555
Accuracy : 0.7472
95% CI : (0.7378, 0.7564)
No Information Rate : 0.8446
P-Value [Acc > NIR] : 1
Kappa : 0.1911
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8074
Specificity : 0.4201
Pos Pred Value : 0.8833
Neg Pred Value : 0.2864
Prevalence : 0.8446
Detection Rate : 0.6819
Detection Prevalence : 0.7720
Balanced Accuracy : 0.6138
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.5848314
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

From the above analysis, all parameters except accuracy and sensitivity have poor values on the test data, far below the near-perfect training performance, so overall the model generalizes badly.
Using Xtreme Gradient Boosting Ensemble Model
x1<-test1[,1:21]
pred<-predict(model_xgb,as.matrix(x1))
pred_test1<-round(pred)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6159 461
1 1177 704
Accuracy : 0.8073
95% CI : (0.7988, 0.8157)
No Information Rate : 0.863
P-Value [Acc > NIR] : 1
Kappa : 0.3527
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8396
Specificity : 0.6043
Pos Pred Value : 0.9304
Neg Pred Value : 0.3743
Prevalence : 0.8630
Detection Rate : 0.7245
Detection Prevalence : 0.7787
Balanced Accuracy : 0.7219
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6523158
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

x2<-test2[,1:21]
pred<-predict(model_xgb,as.matrix(x2))
pred_test2<-round(pred)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6087 476
1 1218 720
Accuracy : 0.8007
95% CI : (0.7921, 0.8092)
No Information Rate : 0.8593
P-Value [Acc > NIR] : 1
Kappa : 0.3456
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8333
Specificity : 0.6020
Pos Pred Value : 0.9275
Neg Pred Value : 0.3715
Prevalence : 0.8593
Detection Rate : 0.7160
Detection Prevalence : 0.7720
Balanced Accuracy : 0.7176
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6494946
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

From the above analysis, all parameters except kappa, specificity and area under the curve have good values, but the clear drop from the training performance makes this model a somewhat poor choice.
Using Adaptive Boosting Ensemble Model
pred_test1<-predict(model_ada,test1)
library(caret)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6275 345
1 1200 681
Accuracy : 0.8183
95% CI : (0.8099, 0.8264)
No Information Rate : 0.8793
P-Value [Acc > NIR] : 1
Kappa : 0.3701
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8395
Specificity : 0.6637
Pos Pred Value : 0.9479
Neg Pred Value : 0.3620
Prevalence : 0.8793
Detection Rate : 0.7381
Detection Prevalence : 0.7787
Balanced Accuracy : 0.7516
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6549633
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

pred_test2<-predict(model_ada,test2)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6206 357
1 1220 718
Accuracy : 0.8145
95% CI : (0.8061, 0.8227)
No Information Rate : 0.8735
P-Value [Acc > NIR] : 1
Kappa : 0.3749
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8357
Specificity : 0.6679
Pos Pred Value : 0.9456
Neg Pred Value : 0.3705
Prevalence : 0.8735
Detection Rate : 0.7300
Detection Prevalence : 0.7720
Balanced Accuracy : 0.7518
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6580446
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

From the above analysis, the test parameters closely match the training parameters: accuracy and sensitivity are good, while kappa and specificity remain moderate, so overall the model generalizes well.
Using C5.0 Decision Tree
pred_test1<-predict(model_c50,test1)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6251 369
1 1154 727
Accuracy : 0.8208
95% CI : (0.8125, 0.8289)
No Information Rate : 0.8711
P-Value [Acc > NIR] : 1
Kappa : 0.3888
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8442
Specificity : 0.6633
Pos Pred Value : 0.9443
Neg Pred Value : 0.3865
Prevalence : 0.8711
Detection Rate : 0.7353
Detection Prevalence : 0.7787
Balanced Accuracy : 0.7537
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6653782
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

pred_test2<-predict(model_c50,test2)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6172 391
1 1183 755
Accuracy : 0.8148
95% CI : (0.8064, 0.8231)
No Information Rate : 0.8652
P-Value [Acc > NIR] : 1
Kappa : 0.3855
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8392
Specificity : 0.6588
Pos Pred Value : 0.9404
Neg Pred Value : 0.3896
Prevalence : 0.8652
Detection Rate : 0.7260
Detection Prevalence : 0.7720
Balanced Accuracy : 0.7490
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6650002
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

From the above analysis, the test parameters again closely match the training parameters, so like adaptive boosting the C5.0 model generalizes consistently, though kappa and specificity remain moderate.
SUMMARY ON TEST1 AND TEST2 DATA
SUMMARY
(all values in %; "_" indicates the model was not evaluated on the test sets)

| Model | Accuracy train | Accuracy test1 | Accuracy test2 | Kappa train | Kappa test1 | Kappa test2 | Sensitivity train | Sensitivity test1 | Sensitivity test2 | Specificity train | Specificity test1 | Specificity test2 | Balanced Accuracy train | Balanced Accuracy test1 | Balanced Accuracy test2 | AUC train | AUC test1 | AUC test2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cart.bagging | 99.7 | 81 | 81 | 99 | 37 | 37 | 99.6 | 84 | 84 | 99.8 | 61 | 63 | 99.6 | 72 | 73 | 99 | 66 | 66 |
| svm | 98 | 75 | 76 | 94 | 19 | 21 | 98 | 81 | 81 | 99 | 41 | 45 | 98.4 | 61 | 63 | 96 | 58 | 59 |
| xgb.boosting | 93 | 81 | 81 | 78 | 37 | 37 | 92 | 84 | 84 | 96 | 61 | 63 | 94 | 73 | 73 | 85 | 66 | 66 |
| ada.boosting | 82 | 82 | 82 | 38 | 38 | 38 | 84 | 84 | 84 | 69 | 68 | 69 | 76 | 76 | 76 | 66 | 66 | 66 |
| c50 | 82 | 82 | 82 | 37 | 38 | 37 | 83 | 84 | 83 | 71 | 70 | 71 | 77 | 77 | 77 | 65 | 66 | 65 |
| cart | 81 | _ | _ | 35 | _ | _ | 83 | _ | _ | 69 | _ | _ | 76 | _ | _ | 64 | _ | _ |
| binary.logistic.regression | 81 | _ | _ | 28 | _ | _ | 82 | _ | _ | 70 | _ | _ | 76 | _ | _ | 61 | _ | _ |
| naive.bayes | 66 | _ | _ | 27 | _ | _ | 88 | _ | _ | 37 | _ | _ | 63 | _ | _ | 68 | _ | _ |

From the above summary, checking the consistency of all the parameters, we can infer that CART bagging, support vector machines and extreme gradient boosting are more biased towards the train data (performance is good on the train data but comparatively bad on the test1 and test2 data) and have high variance (the error rate differs substantially between train, test1 and test2).
Adaptive boosting and the C5.0 decision tree, however, show low bias and variance and good accuracy, so let's choose adaptive boosting for this analysis and further check the consistency of its performance on different train, test1 and test2 splits by reshuffling the same ccdefault dataset.
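The reshuffling below is done manually with set.seed(1) and set.seed(2); the same consistency check could be automated by looping over seeds, as in this sketch (a compact restatement of the steps that follow, not the code that was actually run):
library(caret); library(ada)
cc<-ccdefault
cc$`default payment next month`<-as.factor(cc$`default payment next month`)  # ada expects a 2-level outcome
accs<-sapply(1:3,function(s){
  set.seed(s)
  idx<-createDataPartition(cc$`default payment next month`,p=0.4,list=FALSE)
  fit<-ada(`default payment next month` ~ .,data=cc[idx,],
           loss="exponential",type="discrete",iter=100)
  mean(predict(fit,cc[-idx,])==cc[-idx,]$`default payment next month`)       # hold-out accuracy
})
accs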
SHUFFLE THE DATA
set.seed(1)
RESAMPLING THE DATA
library(caret)
splitccdefault<-createDataPartition(ccdefault$`default payment next month`,p=0.4,list = FALSE)
train<-ccdefault[splitccdefault,]
test<-ccdefault[-splitccdefault,]
splittest<-createDataPartition(test$`default payment next month`,p=0.5,list = FALSE)
test1<-test[splittest,]
test2<-test[-splittest,]
dim(train)
[1] 11335 22
dim(test1)
[1] 8501 22
dim(test2)
[1] 8501 22
MODEL BUILDING USING ADAPTIVE BOOSTING
library(ada)
model_ada<-ada(`default payment next month` ~ ., data=train,loss="exponential",type="discrete",iter=100)
model_ada
Call:
ada(`default payment next month` ~ ., data = train, loss = "exponential",
type = "discrete", iter = 100)
Loss: exponential Method: discrete Iteration: 100
Final Confusion Matrix for Data:
Final Prediction
True value 0 1
0 8385 404
1 1604 942
Train Error: 0.177
Out-Of-Bag Error: 0.177 iteration= 27
Additional Estimates of number of iterations:
train.err1 train.kap1
21 34
plot(model_ada)

PREDICTING ON TRAIN DATA
pred_train<-predict(model_ada,train)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 8385 404
1 1604 942
Accuracy : 0.8228
95% CI : (0.8157, 0.8298)
No Information Rate : 0.8813
P-Value [Acc > NIR] : 1
Kappa : 0.3892
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8394
Specificity : 0.6999
Pos Pred Value : 0.9540
Neg Pred Value : 0.3700
Prevalence : 0.8813
Detection Rate : 0.7397
Detection Prevalence : 0.7754
Balanced Accuracy : 0.7696
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6620128
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICTING ON TEST1 AND TEST2 DATA
pred_test1<-predict(model_ada,test1)
library(caret)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6228 349
1 1218 706
Accuracy : 0.8157
95% CI : (0.8073, 0.8239)
No Information Rate : 0.8759
P-Value [Acc > NIR] : 1
Kappa : 0.3736
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8364
Specificity : 0.6692
Pos Pred Value : 0.9469
Neg Pred Value : 0.3669
Prevalence : 0.8759
Detection Rate : 0.7326
Detection Prevalence : 0.7737
Balanced Accuracy : 0.7528
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6569401
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

pred_test2<-predict(model_ada,test2)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6277 335
1 1207 682
Accuracy : 0.8186
95% CI : (0.8103, 0.8267)
No Information Rate : 0.8804
P-Value [Acc > NIR] : 1
Kappa : 0.3716
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8387
Specificity : 0.6706
Pos Pred Value : 0.9493
Neg Pred Value : 0.3610
Prevalence : 0.8804
Detection Rate : 0.7384
Detection Prevalence : 0.7778
Balanced Accuracy : 0.7547
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6551861
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

RESHUFFLE THE DATA
set.seed(2)
RESAMPLING THE DATA
library(caret)
splitccdefault<-createDataPartition(ccdefault$`default payment next month`,p=0.4,list = FALSE)
train<-ccdefault[splitccdefault,]
test<-ccdefault[-splitccdefault,]
splittest<-createDataPartition(test$`default payment next month`,p=0.5,list = FALSE)
test1<-test[splittest,]
test2<-test[-splittest,]
dim(train)
[1] 11335 22
dim(test1)
[1] 8501 22
dim(test2)
[1] 8501 22
MODEL BUILDING USING ADAPTIVE BOOSTING
library(ada)
model_ada<-ada(`default payment next month` ~ ., data=train,loss="exponential",type="discrete",iter=100)
model_ada
Call:
ada(`default payment next month` ~ ., data = train, loss = "exponential",
type = "discrete", iter = 100)
Loss: exponential Method: discrete Iteration: 100
Final Confusion Matrix for Data:
Final Prediction
True value 0 1
0 8338 441
1 1559 997
Train Error: 0.176
Out-Of-Bag Error: 0.176 iteration= 100
Additional Estimates of number of iterations:
train.err1 train.kap1
78 10
plot(model_ada)

PREDICTING ON TRAIN DATA
pred_train<-predict(model_ada,train)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 8338 441
1 1559 997
Accuracy : 0.8236
95% CI : (0.8164, 0.8305)
No Information Rate : 0.8731
P-Value [Acc > NIR] : 1
Kappa : 0.4022
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8425
Specificity : 0.6933
Pos Pred Value : 0.9498
Neg Pred Value : 0.3901
Prevalence : 0.8731
Detection Rate : 0.7356
Detection Prevalence : 0.7745
Balanced Accuracy : 0.7679
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6699145
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICTING ON TEST1 AND TEST2 DATA
pred_test1<-predict(model_ada,test1)
library(caret)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6222 324
1 1220 735
Accuracy : 0.8184
95% CI : (0.81, 0.8265)
No Information Rate : 0.8754
P-Value [Acc > NIR] : 1
Kappa : 0.389
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8361
Specificity : 0.6941
Pos Pred Value : 0.9505
Neg Pred Value : 0.3760
Prevalence : 0.8754
Detection Rate : 0.7319
Detection Prevalence : 0.7700
Balanced Accuracy : 0.7651
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6632316
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

pred_test2<-predict(model_ada,test2)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6278 375
1 1160 688
Accuracy : 0.8194
95% CI : (0.8111, 0.8276)
No Information Rate : 0.875
P-Value [Acc > NIR] : 1
Kappa : 0.3732
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8440
Specificity : 0.6472
Pos Pred Value : 0.9436
Neg Pred Value : 0.3723
Prevalence : 0.8750
Detection Rate : 0.7385
Detection Prevalence : 0.7826
Balanced Accuracy : 0.7456
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6579644
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

SUMMARY ON DIFFERENT SHUFFLED DATA
Here is a summary of the adaptive boosting model's performance on the shuffled datasets.
FINAL SUMMARY
| Data | Split | Accuracy | Kappa | Sensitivity | Specificity | Balanced Accuracy | AUC |
|---|---|---|---|---|---|---|---|
| original split | train | 82 | 37.8 | 83.8 | 68.7 | 76.3 | 65.7 |
| original split | test1 | 82 | 38.1 | 84 | 67.7 | 76 | 65.9 |
| original split | test2 | 81.7 | 37.6 | 83.5 | 68.6 | 76 | 65.7 |
| reshuffle_data1 | train | 82.3 | 39 | 84 | 70 | 77 | 66.3 |
| reshuffle_data1 | test1 | 81.6 | 37.4 | 83.6 | 66.9 | 75.3 | 65.7 |
| reshuffle_data1 | test2 | 81.9 | 37.2 | 83.9 | 67 | 75.5 | 65.5 |
| reshuffle_data2 | train | 82.3 | 38.9 | 83.9 | 69.9 | 76.9 | 66.2 |
| reshuffle_data2 | test1 | 81.5 | 37.3 | 83.6 | 66.9 | 75.2 | 65.6 |
| reshuffle_data2 | test2 | 81.8 | 37.1 | 83.8 | 67 | 75.4 | 65.5 |
| reshuffle_data3 | train | 82.3 | 40.2 | 84.2 | 69.3 | 76.8 | 67 |
| reshuffle_data3 | test1 | 81.8 | 39 | 83.6 | 69.4 | 76.5 | 66.3 |
| reshuffle_data3 | test2 | 81.9 | 37.3 | 84.4 | 64.7 | 74.5 | 65.8 |

(all values in %)
From the above summary we can conclude that the model performs consistently well on all the parameters across the reshuffles, so adaptive boosting is the best model for this dataset.
FINAL DATA WITH PREDICTED VALUES
library(caret)
predicted<-predict(model_ada,ccdefault)
library(dplyr)
Attaching package: 'dplyr'
The following object is masked from 'package:xgboost':
slice
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
ccdefault<-mutate(ccdefault,expected_default.payment.next.month=predicted)
FINAL OBSERVED AND EXPECTED PAYMENT DEFAULTERS (displaying only 33 observations)
| LIMIT_BAL | EDUCATION | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month | expected_default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20000 | 2 | 24 | 2 | 2 | -1 | -1 | -2 | -2 | 3913 | 3102 | 689 | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 | 1 |
| 120000 | 2 | 26 | -1 | 2 | 0 | 0 | 0 | 2 | 2682 | 1725 | 2682 | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 | 0 |
| 90000 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | 0 | 29239 | 14027 | 13559 | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 | 0 |
| 50000 | 2 | 37 | 0 | 0 | 0 | 0 | 0 | 0 | 46990 | 48233 | 49291 | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 | 0 |
| 50000 | 2 | 57 | -1 | 0 | -1 | 0 | 0 | 0 | 8617 | 5670 | 35835 | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 | 0 |
| 50000 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | 0 | 64400 | 57069 | 57608 | 19394 | 19619 | 20024 | 2500 | 1815 | 657 | 1000 | 1000 | 800 | 0 | 0 |
| 500000 | 1 | 29 | 0 | 0 | 0 | 0 | 0 | 0 | 367965 | 412023 | 445007 | 542653 | 483003 | 473944 | 55000 | 40000 | 38000 | 20239 | 13750 | 13770 | 0 | 0 |
| 140000 | 3 | 28 | 0 | 0 | 2 | 0 | 0 | 0 | 11285 | 14096 | 12108 | 12211 | 11793 | 3719 | 3329 | 0 | 432 | 1000 | 1000 | 1000 | 0 | 0 |
| 20000 | 3 | 35 | -2 | -2 | -2 | -2 | -1 | -1 | 0 | 0 | 0 | 0 | 13007 | 13912 | 0 | 0 | 0 | 13007 | 1122 | 0 | 0 | 0 |
| 200000 | 3 | 34 | 0 | 0 | 2 | 0 | 0 | -1 | 11073 | 9787 | 5535 | 2513 | 1828 | 3731 | 2306 | 12 | 50 | 300 | 3738 | 66 | 0 | 0 |
| 260000 | 1 | 51 | -1 | -1 | -1 | -1 | -1 | 2 | 12261 | 21670 | 9966 | 8517 | 22287 | 13668 | 21818 | 9966 | 8583 | 22301 | 0 | 3640 | 0 | 0 |
| 630000 | 2 | 41 | -1 | 0 | -1 | -1 | -1 | -1 | 12137 | 6500 | 6500 | 6500 | 6500 | 2870 | 1000 | 6500 | 6500 | 6500 | 2870 | 0 | 0 | 0 |
| 70000 | 2 | 30 | 1 | 2 | 2 | 0 | 0 | 2 | 65802 | 67369 | 65701 | 66782 | 36137 | 36894 | 3200 | 0 | 3000 | 3000 | 1500 | 0 | 1 | 0 |
| 250000 | 1 | 29 | 0 | 0 | 0 | 0 | 0 | 0 | 70887 | 67060 | 63561 | 59696 | 56875 | 55512 | 3000 | 3000 | 3000 | 3000 | 3000 | 3000 | 0 | 0 |
| 50000 | 3 | 23 | 1 | 2 | 0 | 0 | 0 | 0 | 50614 | 29173 | 28116 | 28771 | 29531 | 30211 | 0 | 1500 | 1100 | 1200 | 1300 | 1100 | 0 | 0 |
| 20000 | 1 | 24 | 0 | 0 | 2 | 2 | 2 | 2 | 15376 | 18010 | 17428 | 18338 | 17905 | 19104 | 3200 | 0 | 1500 | 0 | 1650 | 0 | 1 | 0 |
| 320000 | 1 | 49 | 0 | 0 | 0 | -1 | -1 | -1 | 253286 | 246536 | 194663 | 70074 | 5856 | 195599 | 10358 | 10000 | 75940 | 20000 | 195599 | 50000 | 0 | 0 |
| 360000 | 1 | 49 | 1 | -2 | -2 | -2 | -2 | -2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 180000 | 1 | 29 | 1 | -2 | -2 | -2 | -2 | -2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 130000 | 3 | 39 | 0 | 0 | 0 | 0 | 0 | -1 | 38358 | 27688 | 24489 | 20616 | 11802 | 930 | 3000 | 1537 | 1000 | 2000 | 930 | 33764 | 0 | 0 |
| 120000 | 2 | 39 | -1 | -1 | -1 | -1 | -1 | -1 | 316 | 316 | 316 | 0 | 632 | 316 | 316 | 316 | 0 | 632 | 316 | 0 | 1 | 0 |
| 70000 | 2 | 26 | 2 | 0 | 0 | 2 | 2 | 2 | 41087 | 42445 | 45020 | 44006 | 46905 | 46012 | 2007 | 3582 | 0 | 3601 | 0 | 1820 | 1 | 1 |
| 450000 | 1 | 40 | -2 | -2 | -2 | -2 | -2 | -2 | 5512 | 19420 | 1473 | 560 | 0 | 0 | 19428 | 1473 | 560 | 0 | 0 | 1128 | 1 | 0 |
| 90000 | 1 | 23 | 0 | 0 | 0 | -1 | 0 | 0 | 4744 | 7070 | 0 | 5398 | 6360 | 8292 | 5757 | 0 | 5398 | 1200 | 2045 | 2000 | 0 | 0 |
| 50000 | 3 | 23 | 0 | 0 | 0 | 0 | 0 | 0 | 47620 | 41810 | 36023 | 28967 | 29829 | 30046 | 1973 | 1426 | 1001 | 1432 | 1062 | 997 | 0 | 0 |
| 50000 | 3 | 30 | 0 | 0 | 0 | 0 | 0 | 0 | 22541 | 16138 | 17163 | 17878 | 18931 | 19617 | 1300 | 1300 | 1000 | 1500 | 1000 | 1012 | 0 | 0 |
| 50000 | 3 | 47 | -1 | -1 | -1 | -1 | -1 | -1 | 650 | 3415 | 3416 | 2040 | 30430 | 257 | 3415 | 3421 | 2044 | 30430 | 257 | 0 | 0 | 0 |
| 50000 | 1 | 26 | 0 | 0 | 0 | 0 | 0 | 0 | 15329 | 16575 | 17496 | 17907 | 18375 | 11400 | 1500 | 1500 | 1000 | 1000 | 1600 | 0 | 0 | 0 |
| 230000 | 1 | 27 | -1 | -1 | -1 | -1 | -1 | -1 | 16646 | 17265 | 13266 | 15339 | 14307 | 36923 | 17270 | 13281 | 15339 | 14307 | 37292 | 0 | 0 | 0 |
| 50000 | 2 | 33 | 2 | 0 | 0 | 0 | 0 | 0 | 30518 | 29618 | 22102 | 22734 | 23217 | 23680 | 1718 | 1500 | 1000 | 1000 | 1000 | 716 | 1 | 1 |
| 100000 | 1 | 32 | 0 | 0 | 0 | 0 | 0 | 0 | 93036 | 84071 | 82880 | 80958 | 78703 | 75589 | 3023 | 3511 | 3302 | 3204 | 3200 | 2504 | 0 | 0 |
| 500000 | 2 | 54 | -2 | -2 | -2 | -2 | -2 | -2 | 10929 | 4152 | 22722 | 7521 | 71439 | 8981 | 4152 | 22827 | 7521 | 71439 | 981 | 51582 | 0 | 0 |
| 500000 | 1 | 58 | -2 | -2 | -2 | -2 | -2 | -2 | 13709 | 5006 | 31130 | 3180 | 0 | 5293 | 5006 | 31178 | 3180 | 0 | 5293 | 768 | 0 | 0 |
CONCLUSION
As banks play a vital role in providing financial services to their customers, they should avoid lending to customers who are likely to default and cause losses to the banks and other financial institutions.
To support that goal, we can conclude that the model developed here uses all the available factors and data and shows good potential for predicting whether a customer will fail or succeed in making the next payment, which will help banks and financial institutions in their decision making; hence the model can be deployed.