DEFINE PROBLEM STATEMENT

    Banks play a vital role in providing financial services to their customers, so there will be a high risk involved and it is very important to assess risk of payment defaulters.
    Using 'ccdefault' dataset can we predict whether the customers repayment is successful or not? 
    

DATASET INFORMATION

 This research aimed at the case of customer default payments in Taiwan which includes credible and non credible clients which employed payment defaulters as the response/outcome/dependent variable(y) which has the binary outcome with '1' indicating the payment defaulter and '0' as the non-payment defaulter and remaining 24 variables as the explanatory/predictor/independent variables(x).i.e, the role of these 24 variables in causation of defaulters where payment defaulters will be the effect due to the cause of these 24 variables.
CREDIT DEFAULTERS DATASET (displaying only 5 observations)
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
1 20000 2 2 1 24 2 2 -1 -1 -2 -2 3913 3102 689 0 0 0 0 689 0 0 0 0 1
2 120000 2 2 2 26 -1 2 0 0 0 2 2682 1725 2682 3272 3455 3261 0 1000 1000 1000 0 2000 1
3 90000 2 2 2 34 0 0 0 0 0 0 29239 14027 13559 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
4 50000 2 2 1 37 0 0 0 0 0 0 46990 48233 49291 28314 28959 29547 2000 2019 1200 1100 1069 1000 0
5 50000 1 2 1 57 -1 0 -1 0 0 0 8617 5670 35835 20940 19146 19131 2000 36681 10000 9000 689 679 0

Attributes information

LIMIT_BAL is the amount of the given credit which includes both the individual consumer credit and his/her family (supplementary) credit in New Taiwan dollar.
SEX is the gender indicating whether he is the male or female ('1' being male and '2' being female).
EDUCATION indicates whether he/she is a school graduate/university/high school graduate/others.
MARRIAGE indicates whether he/she are married/not married/others.
AGE indicates the age of a person in years.
PAY_0,PAY2,...,PAY6 indicates the payment status for the previous months from april to september for the year 2005 with PAY_0 in september to PAY_6 in april.
BILL_AMT1,BILL_AMT2,..,BILL_AMT6  indicates the amount of bill statement from september,august,...,april respectively for the year 2005 in New Taiwan dollar.
PAY_AMT1,PAY_AMT2,...,PAY_AMT6 indicates the amount paid in september,august,...,april respectively in the year 2005 in New Taiwan dollar.

DATA PREPROCESSING

Variable selection

This study reviewed the literature and shortlisted the following 21 variables as explanatory variables:
ccdefault<-ccdefault[,c(-1,-3,-5)]
DATASET (displaying only 5 observations)
LIMIT_BAL EDUCATION AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
20000 2 24 2 2 -1 -1 -2 -2 3913 3102 689 0 0 0 0 689 0 0 0 0 1
120000 2 26 -1 2 0 0 0 2 2682 1725 2682 3272 3455 3261 0 1000 1000 1000 0 2000 1
90000 2 34 0 0 0 0 0 0 29239 14027 13559 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
50000 2 37 0 0 0 0 0 0 46990 48233 49291 28314 28959 29547 2000 2019 1200 1100 1069 1000 0
50000 2 57 -1 0 -1 0 0 0 8617 5670 35835 20940 19146 19131 2000 36681 10000 9000 689 679 0

Check for missing values

sum(is.na(ccdefault))
[1] 0
summary(ccdefault)
   LIMIT_BAL         EDUCATION          AGE            PAY_0        
 Min.   :  10000   Min.   :0.000   Min.   :21.00   Min.   :-2.0000  
 1st Qu.:  50000   1st Qu.:1.000   1st Qu.:28.00   1st Qu.:-1.0000  
 Median : 140000   Median :2.000   Median :34.00   Median : 0.0000  
 Mean   : 167484   Mean   :1.853   Mean   :35.49   Mean   :-0.0167  
 3rd Qu.: 240000   3rd Qu.:2.000   3rd Qu.:41.00   3rd Qu.: 0.0000  
 Max.   :1000000   Max.   :6.000   Max.   :79.00   Max.   : 8.0000  
     PAY_2             PAY_3             PAY_4             PAY_5        
 Min.   :-2.0000   Min.   :-2.0000   Min.   :-2.0000   Min.   :-2.0000  
 1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:-1.0000  
 Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000  
 Mean   :-0.1338   Mean   :-0.1662   Mean   :-0.2207   Mean   :-0.2662  
 3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000  
 Max.   : 8.0000   Max.   : 8.0000   Max.   : 8.0000   Max.   : 8.0000  
     PAY_6           BILL_AMT1         BILL_AMT2        BILL_AMT3      
 Min.   :-2.0000   Min.   :-165580   Min.   :-69777   Min.   :-157264  
 1st Qu.:-1.0000   1st Qu.:   3559   1st Qu.:  2985   1st Qu.:   2666  
 Median : 0.0000   Median :  22382   Median : 21200   Median :  20089  
 Mean   :-0.2911   Mean   :  51223   Mean   : 49179   Mean   :  47013  
 3rd Qu.: 0.0000   3rd Qu.:  67091   3rd Qu.: 64006   3rd Qu.:  60165  
 Max.   : 8.0000   Max.   : 964511   Max.   :983931   Max.   :1664089  
   BILL_AMT4         BILL_AMT5        BILL_AMT6          PAY_AMT1     
 Min.   :-170000   Min.   :-81334   Min.   :-339603   Min.   :     0  
 1st Qu.:   2327   1st Qu.:  1763   1st Qu.:   1256   1st Qu.:  1000  
 Median :  19052   Median : 18105   Median :  17071   Median :  2100  
 Mean   :  43263   Mean   : 40311   Mean   :  38872   Mean   :  5664  
 3rd Qu.:  54506   3rd Qu.: 50191   3rd Qu.:  49198   3rd Qu.:  5006  
 Max.   : 891586   Max.   :927171   Max.   : 961664   Max.   :873552  
    PAY_AMT2          PAY_AMT3         PAY_AMT4         PAY_AMT5       
 Min.   :      0   Min.   :     0   Min.   :     0   Min.   :     0.0  
 1st Qu.:    833   1st Qu.:   390   1st Qu.:   296   1st Qu.:   252.5  
 Median :   2009   Median :  1800   Median :  1500   Median :  1500.0  
 Mean   :   5921   Mean   :  5226   Mean   :  4826   Mean   :  4799.4  
 3rd Qu.:   5000   3rd Qu.:  4505   3rd Qu.:  4013   3rd Qu.:  4031.5  
 Max.   :1684259   Max.   :896040   Max.   :621000   Max.   :426529.0  
    PAY_AMT6        default payment next month
 Min.   :     0.0   Min.   :0.0000            
 1st Qu.:   117.8   1st Qu.:0.0000            
 Median :  1500.0   Median :0.0000            
 Mean   :  5215.5   Mean   :0.2212            
 3rd Qu.:  4000.0   3rd Qu.:0.0000            
 Max.   :528666.0   Max.   :1.0000            
By using the above 'sum' function which says there are no 'NA' values or from summary we can easily say that there are no available missing values.

Check for outliers

boxplot(ccdefault)

dim(ccdefault)
[1] 30000    22
From boxplot, we can infer that some values are different from most of the values in which some are extreme from the outliers itself which can be removed as these values may effect the model accuracy.

Remove extreme outliers

ccdefault<-ccdefault[ccdefault$LIMIT_BAL<1000000,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT1<0,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT1>7e+05,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT2>6e+05,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT3>6e+05,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT3<0,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT4>6e+05,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT5>6e+05,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT5<0,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT6>6e+05,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT6<0,]
ccdefault<-ccdefault[!ccdefault$PAY_AMT1>3e+05,]
ccdefault<-ccdefault[!ccdefault$PAY_AMT2>5e+05,]
ccdefault<-ccdefault[!ccdefault$PAY_AMT3>4e+05,]
ccdefault<-ccdefault[!ccdefault$PAY_AMT4>3e+05,]
ccdefault<-ccdefault[!ccdefault$PAY_AMT5>350000,]
ccdefault<-ccdefault[!ccdefault$PAY_AMT6>3e+05,]
boxplot(ccdefault)

dim(ccdefault)
[1] 28337    22
Here, after eliminating some extreme outliers we can see that all the remaining outliers are close to each other which shoudn't be removed as it will be more biased if removed.
Now lets use this preprocessed data which has 28337 observations and 22 variables to perform the research but before that lets check the multicollinearity among predectors.

Check for multicollinearity

library(corrplot)
corrplot 0.84 loaded
corrplot(cor(ccdefault[1:21]),method = "number")

The multicollinearity among predictors shouldn't exist so from above correlation matrix plot we see that payment status and amount bill statement has a high correlation among themselves as expected since the they are reflecting the cumulative status and amounts respectively of the previous months,except this there is a low correlation among other variables which is satisfactory.

Resampling the pre-processed data

 Lets split the data into train data with 11335 observations, test1 data with 8501 observations and test2 data with 8501 observations by using validation set/hold out approach.
library(caret)
Loading required package: lattice
Loading required package: ggplot2
splitccdefault<-createDataPartition(ccdefault$`default payment next month`,p=0.4,list = FALSE)
train<-ccdefault[splitccdefault,]
test<-ccdefault[-splitccdefault,]
splittest<-createDataPartition(test$`default payment next month`,p=0.5,list = FALSE)
test1<-test[splittest,]
test2<-test[-splittest,]
dim(train)
[1] 11335    22
dim(test1)
[1] 8501   22
dim(test2)
[1] 8501   22
Now the model has to be built for train data by using different algorithmic models and the best model is choosen out of it based on different parameters and the outcomes are predicted using this best model.

BINARY LOGISTIC REGRESSION

Model Building

model_bin<-glm(`default payment next month` ~ ., data=train,family = binomial(link="logit"))
exp(model_bin$coefficients)
(Intercept)   LIMIT_BAL   EDUCATION         AGE       PAY_0       PAY_2 
  0.2711845   0.9999993   0.9394406   1.0120411   1.8252271   1.0652141 
      PAY_3       PAY_4       PAY_5       PAY_6   BILL_AMT1   BILL_AMT2 
  1.0810926   1.0412221   1.0463908   1.0002311   0.9999952   0.9999979 
  BILL_AMT3   BILL_AMT4   BILL_AMT5   BILL_AMT6    PAY_AMT1    PAY_AMT2 
  1.0000043   0.9999970   1.0000024   1.0000026   0.9999907   0.9999879 
   PAY_AMT3    PAY_AMT4    PAY_AMT5    PAY_AMT6 
  0.9999921   0.9999947   0.9999910   1.0000013 
Here, the model is built using the generalised linear model(binary logistic regression).
By using exponential function we convert the values into odds ratio for interpretation, when all coefficients are zero the odds ratio of payment defaulters will be 0.29,when LIMIT_BAL increases by a unit the payment defaulters increases by 0.99, similarly for other variables with their respective odds ratio.  

Predicting On Train Data

pred_train<-ifelse(model_bin$fitted.values>0.5,1,0)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 8533  262
         1 1862  678
                                          
               Accuracy : 0.8126          
                 95% CI : (0.8053, 0.8198)
    No Information Rate : 0.9171          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.3056          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8209          
            Specificity : 0.7213          
         Pos Pred Value : 0.9702          
         Neg Pred Value : 0.2669          
             Prevalence : 0.9171          
         Detection Rate : 0.7528          
   Detection Prevalence : 0.7759          
      Balanced Accuracy : 0.7711          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
Loading required package: gplots

Attaching package: 'gplots'
The following object is masked from 'package:stats':

    lowess
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6185697
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

library(DMwR)
Loading required package: grid
regr.eval(train$`default payment next month`,pred_train)
      mae       mse      rmse      mape 
0.1873842 0.1873842 0.4328790       NaN 
Here,from the contingency table obtained model is predicting 9159 observations correctly i.e,with 81% accuracy but kappa being 28% < 70% inter rater reliability is not satisfactory (i.e, we can't rely on this model).
The true positive rate is 82% and the true negative rate is 70% which is bit satisfactory and 76% of the time the classes '0' and '1' are predicting correctly(from balanced accuracy).
The area under curve is 61% obtained from reciever optimistic characteristic curve which is quite a good model.

NAIVE BAYES METHOD

Model Building

train$`default payment next month`<-as.factor(train$`default payment next month`)
class(train$`default payment next month`)
[1] "factor"
library(naivebayes)
model_nb<-naive_bayes(`default payment next month` ~ ., data=train)

Predicting On Train Data

pred_train<-predict(model_nb,train,type="class")
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 7048 1747
         1 1006 1534
                                         
               Accuracy : 0.7571         
                 95% CI : (0.7491, 0.765)
    No Information Rate : 0.7105         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.3672         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.8751         
            Specificity : 0.4675         
         Pos Pred Value : 0.8014         
         Neg Pred Value : 0.6039         
             Prevalence : 0.7105         
         Detection Rate : 0.6218         
   Detection Prevalence : 0.7759         
      Balanced Accuracy : 0.6713         
                                         
       'Positive' Class : 0              
                                         
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7026507
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Naive bayes is probabilistic model where it calculates a MAP(Maximum A Posterior)/future probabilities based on prior probabilities.
Here, as it is predicting 7491 values correctly the accuracy is only 66% and kappa being very less which is 27% and specificity is only 37% and 63% of the time model is predicting the binary classes correctly it is a bad model with area under curve 68%.Only the true positive rate is satisfactory which is 88%.

CART(CLASSIFICATION AND REGRESSION TREES)

Model Building

library(rpart)
model_cart<-rpart(`default payment next month` ~ ., data=train)
summary(model_cart)
Call:
rpart(formula = `default payment next month` ~ ., data = train)
  n= 11335 

         CP nsplit rel error    xerror       xstd
1 0.1964567      0 1.0000000 1.0000000 0.01747794
2 0.0100000      1 0.8035433 0.8035433 0.01610565

Variable importance
PAY_0 PAY_4 PAY_5 PAY_6 PAY_3 PAY_2 
   85     4     3     3     3     2 

Node number 1: 11335 observations,    complexity param=0.1964567
  predicted class=0  expected loss=0.2240847  P(node) =1
    class counts:  8795  2540
   probabilities: 0.776 0.224 
  left son=2 (10106 obs) right son=3 (1229 obs)
  Primary splits:
      PAY_0 < 1.5 to the left,  improve=632.3547, (0 missing)
      PAY_2 < 1.5 to the left,  improve=464.2895, (0 missing)
      PAY_3 < 1.5 to the left,  improve=363.1352, (0 missing)
      PAY_4 < 1   to the left,  improve=330.6260, (0 missing)
      PAY_5 < 1   to the left,  improve=315.4737, (0 missing)
  Surrogate splits:
      PAY_4 < 2.5 to the left,  agree=0.896, adj=0.044, (0 split)
      PAY_5 < 2.5 to the left,  agree=0.896, adj=0.041, (0 split)
      PAY_6 < 2.5 to the left,  agree=0.895, adj=0.033, (0 split)
      PAY_3 < 3.5 to the left,  agree=0.895, adj=0.031, (0 split)
      PAY_2 < 2.5 to the left,  agree=0.894, adj=0.025, (0 split)

Node number 2: 10106 observations
  predicted class=0  expected loss=0.1658421  P(node) =0.8915748
    class counts:  8430  1676
   probabilities: 0.834 0.166 

Node number 3: 1229 observations
  predicted class=1  expected loss=0.2969894  P(node) =0.1084252
    class counts:   365   864
   probabilities: 0.297 0.703 

Predicting On Train Data

pred_train<-predict(model_cart,train,type="class")
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 8430  365
         1 1676  864
                                         
               Accuracy : 0.8199         
                 95% CI : (0.8127, 0.827)
    No Information Rate : 0.8916         
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.3658         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.8342         
            Specificity : 0.7030         
         Pos Pred Value : 0.9585         
         Neg Pred Value : 0.3402         
             Prevalence : 0.8916         
         Detection Rate : 0.7437         
   Detection Prevalence : 0.7759         
      Balanced Accuracy : 0.7686         
                                         
       'Positive' Class : 0              
                                         
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6493283
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

In CART model the node(splitting variable) is selected based on GINI index i.e,the attribute with smallest GINI index is selected.It is pruned with the complexity parameter of 0.01.
Here, it is predicting 9254 values correctly with accuracy 82%, inter rate reliability of 35%,true positive rate is 83%,true negative rate is 69%, balanced accuracy is 76% and area under curve is 64%.

C5.0 DECISION TREE

Model Building

library(C50)
train$`default payment next month`<-as.factor(train$`default payment next month`)
class(train$`default payment next month`)
[1] "factor"
model_c50<-C5.0(`default payment next month` ~ ., data=train)
summary(model_c50)

Call:
C5.0.formula(formula = `default payment next month` ~ ., data = train)


C5.0 [Release 2.07 GPL Edition]     Wed Aug 15 08:37:56 2018
-------------------------------

Class specified by attribute `outcome'

Read 11335 cases (22 attributes) from undefined.data

Decision tree:

PAY_0 > 1: 1 (1229/365)
PAY_0 <= 1:
:...PAY_2 <= 1: 0 (9258/1321)
    PAY_2 > 1:
    :...PAY_6 <= 0: 0 (590/210)
        PAY_6 > 0: 1 (258/113)


Evaluation on training data (11335 cases):

        Decision Tree   
      ----------------  
      Size      Errors  

         4 2009(17.7%)   <<


       (a)   (b)    <-classified as
      ----  ----
      8317   478    (a): class 0
      1531  1009    (b): class 1


    Attribute usage:

    100.00% PAY_0
     89.16% PAY_2
      7.48% PAY_6


Time: 0.5 secs

Predicting On Train Data

pred_train<-predict(model_c50,train)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 8317  478
         1 1531 1009
                                          
               Accuracy : 0.8228          
                 95% CI : (0.8156, 0.8298)
    No Information Rate : 0.8688          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.4022          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8445          
            Specificity : 0.6785          
         Pos Pred Value : 0.9457          
         Neg Pred Value : 0.3972          
             Prevalence : 0.8688          
         Detection Rate : 0.7337          
   Detection Prevalence : 0.7759          
      Balanced Accuracy : 0.7615          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6714475
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here in c50 the node is selected based on information gain ratio and pruning is done by confidence threshold of 0.25.
It is predicting 9297 values correctly with accuracy 82%, kappa being 36%,true positive rate being 83%,true negative rate being 71%,balanced accuracy is 77% and area under curve is 65%.

SVM (SUPPORT VECTOR MACHINES) ADVANCED MODEL

Model Building

library(e1071)
model_svm<-svm(`default payment next month` ~ ., data=train,cost=100,gamma=1)
summary(model_svm)

Call:
svm(formula = `default payment next month` ~ ., data = train, 
    cost = 100, gamma = 1)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  100 
      gamma:  1 

Number of Support Vectors:  7723

 ( 2310 5413 )


Number of Classes:  2 

Levels: 
 0 1

Predicting On Train Data

pred_train<-predict(model_svm,train)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 8780   15
         1  163 2377
                                          
               Accuracy : 0.9843          
                 95% CI : (0.9818, 0.9865)
    No Information Rate : 0.789           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9539          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9818          
            Specificity : 0.9937          
         Pos Pred Value : 0.9983          
         Neg Pred Value : 0.9358          
             Prevalence : 0.7890          
         Detection Rate : 0.7746          
   Detection Prevalence : 0.7759          
      Balanced Accuracy : 0.9878          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.9670606
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here,the model is built using radial basis function where parameter gamma of 1 is taken and the classes of outcome variable is separated using hyperplane concept. 
Here, it is predicting 11117 values correctly with accuracy of 98%,kappa being 94%, true positive rate being 98%,true negative rate being 99%,98% of time the binary classes are predicted correctly and area under curve being 96%, it is a very good model with vey high performance.

CART BAGGING ENSEMBLE MODEL

Model Building

library(ipred)
model_cartbag<-bagging(`default payment next month` ~ ., data=train)
varImp(model_cartbag)
            Overall
AGE        910.9295
BILL_AMT1 1011.4888
BILL_AMT2  922.3781
BILL_AMT3  813.0373
BILL_AMT4  754.9019
BILL_AMT5  710.5565
BILL_AMT6  656.7559
EDUCATION  322.6778
LIMIT_BAL  810.4052
PAY_0      927.7674
PAY_2      777.1373
PAY_3      690.3627
PAY_4      646.8209
PAY_5      582.1123
PAY_6      214.4319
PAY_AMT1   636.2980
PAY_AMT2   648.5776
PAY_AMT3   621.9875
PAY_AMT4   574.8440
PAY_AMT5   573.7682
PAY_AMT6   535.8119

Predicting On Train Data

pred_train<-predict(model_cartbag,train,type="class")
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 8789    6
         1   59 2481
                                          
               Accuracy : 0.9943          
                 95% CI : (0.9927, 0.9956)
    No Information Rate : 0.7806          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9834          
 Mcnemar's Test P-Value : 1.12e-10        
                                          
            Sensitivity : 0.9933          
            Specificity : 0.9976          
         Pos Pred Value : 0.9993          
         Neg Pred Value : 0.9768          
             Prevalence : 0.7806          
         Detection Rate : 0.7754          
   Detection Prevalence : 0.7759          
      Balanced Accuracy : 0.9955          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.9880447
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here,in cart bagging different models on same algorithm is generated and takes the average accuracy/error by default it builds 500 pruned trees which reduces the variance.
It predicts 11294 values correctly with an accuracy of 99%,inter rater reliability being 99% model is reliable,true positive rate being 99%,true negative rate being nearl 100%,balanced accuracy of 99% and area under curve being 99%, it is a very good model.

ADAPTIVE BOOSTING ENSEMBLE MODEL

Model Building

library(ada)
model_ada<-ada(`default payment next month` ~ ., data=train,loss="exponential",type="discrete",iter=100)
model_ada
Call:
ada(`default payment next month` ~ ., data = train, loss = "exponential", 
    type = "discrete", iter = 100)

Loss: exponential Method: discrete   Iteration: 100 

Final Confusion Matrix for Data:
          Final Prediction
True value    0    1
         0 8381  414
         1 1564  976

Train Error: 0.175 

Out-Of-Bag Error:  0.175  iteration= 82 

Additional Estimates of number of iterations:

train.err1 train.kap1 
       100         23 
plot(model_ada)

Predicting On Train Data

pred_train<-predict(model_ada,train)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 8381  414
         1 1564  976
                                          
               Accuracy : 0.8255          
                 95% CI : (0.8184, 0.8324)
    No Information Rate : 0.8774          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.4019          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8427          
            Specificity : 0.7022          
         Pos Pred Value : 0.9529          
         Neg Pred Value : 0.3843          
             Prevalence : 0.8774          
         Detection Rate : 0.7394          
   Detection Prevalence : 0.7759          
      Balanced Accuracy : 0.7724          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6685899
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here in boosting it classifies weak learners and makes them strong learners using decision stump.It performs 100 iterations by exponential method .It has a train and test error of 18% with automatically splitting the data into train data(68% of data) and test data(32% of data). 
It predicts 9290 values correctly with accuracy of 82%, kappa being 38%,true positive rate of 84%,true negative rate of 68%,balanced accuracy of 76% and area under curve of 66%,it is quite a good model but kappa is very low.

XTREME GRADIENT BOOSTING ENSEMBLE MODEL

Model Building

library(xgboost)
x<-train[,1:21]
y<-train[,22]
model_xgb<-xgboost(data=as.matrix(x),label = as.matrix(y),nrounds = 100)
[1] train-rmse:0.434421 
[2] train-rmse:0.396022 
[3] train-rmse:0.374531 
[4] train-rmse:0.361911 
[5] train-rmse:0.353549 
[6] train-rmse:0.347930 
[7] train-rmse:0.345045 
[8] train-rmse:0.342586 
[9] train-rmse:0.339969 
[10]    train-rmse:0.337570 
[11]    train-rmse:0.335436 
[12]    train-rmse:0.334377 
[13]    train-rmse:0.331047 
[14]    train-rmse:0.330220 
[15]    train-rmse:0.328446 
[16]    train-rmse:0.326319 
[17]    train-rmse:0.324078 
[18]    train-rmse:0.323263 
[19]    train-rmse:0.322829 
[20]    train-rmse:0.320912 
[21]    train-rmse:0.319064 
[22]    train-rmse:0.317763 
[23]    train-rmse:0.316315 
[24]    train-rmse:0.314133 
[25]    train-rmse:0.312600 
[26]    train-rmse:0.311811 
[27]    train-rmse:0.311520 
[28]    train-rmse:0.310819 
[29]    train-rmse:0.310101 
[30]    train-rmse:0.307327 
[31]    train-rmse:0.304829 
[32]    train-rmse:0.303094 
[33]    train-rmse:0.300805 
[34]    train-rmse:0.300008 
[35]    train-rmse:0.298665 
[36]    train-rmse:0.296838 
[37]    train-rmse:0.295095 
[38]    train-rmse:0.294165 
[39]    train-rmse:0.292532 
[40]    train-rmse:0.292150 
[41]    train-rmse:0.291544 
[42]    train-rmse:0.289300 
[43]    train-rmse:0.288217 
[44]    train-rmse:0.286501 
[45]    train-rmse:0.284888 
[46]    train-rmse:0.283152 
[47]    train-rmse:0.282668 
[48]    train-rmse:0.282393 
[49]    train-rmse:0.281483 
[50]    train-rmse:0.280589 
[51]    train-rmse:0.279338 
[52]    train-rmse:0.279012 
[53]    train-rmse:0.277776 
[54]    train-rmse:0.276287 
[55]    train-rmse:0.275341 
[56]    train-rmse:0.274861 
[57]    train-rmse:0.273163 
[58]    train-rmse:0.272174 
[59]    train-rmse:0.271597 
[60]    train-rmse:0.269858 
[61]    train-rmse:0.269611 
[62]    train-rmse:0.267767 
[63]    train-rmse:0.267065 
[64]    train-rmse:0.265849 
[65]    train-rmse:0.264567 
[66]    train-rmse:0.263567 
[67]    train-rmse:0.262096 
[68]    train-rmse:0.261212 
[69]    train-rmse:0.260519 
[70]    train-rmse:0.259512 
[71]    train-rmse:0.257833 
[72]    train-rmse:0.256280 
[73]    train-rmse:0.255731 
[74]    train-rmse:0.255442 
[75]    train-rmse:0.253063 
[76]    train-rmse:0.252047 
[77]    train-rmse:0.250519 
[78]    train-rmse:0.249685 
[79]    train-rmse:0.248708 
[80]    train-rmse:0.248013 
[81]    train-rmse:0.246565 
[82]    train-rmse:0.246429 
[83]    train-rmse:0.245796 
[84]    train-rmse:0.244609 
[85]    train-rmse:0.243741 
[86]    train-rmse:0.243224 
[87]    train-rmse:0.242179 
[88]    train-rmse:0.241366 
[89]    train-rmse:0.240682 
[90]    train-rmse:0.239672 
[91]    train-rmse:0.238898 
[92]    train-rmse:0.238148 
[93]    train-rmse:0.237244 
[94]    train-rmse:0.236181 
[95]    train-rmse:0.235219 
[96]    train-rmse:0.234477 
[97]    train-rmse:0.233583 
[98]    train-rmse:0.232551 
[99]    train-rmse:0.231134 
[100]   train-rmse:0.230896 

Predicting On Train Data

pred<-predict(model_xgb,as.matrix(x))
pred_train<-round(pred)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 8722   73
         1  686 1854
                                          
               Accuracy : 0.933           
                 95% CI : (0.9283, 0.9376)
    No Information Rate : 0.83            
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.7894          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9271          
            Specificity : 0.9621          
         Pos Pred Value : 0.9917          
         Neg Pred Value : 0.7299          
             Prevalence : 0.8300          
         Detection Rate : 0.7695          
   Detection Prevalence : 0.7759          
      Balanced Accuracy : 0.9446          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.8608105
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here in xtreme gradient boosting model on train data root mean square error is calculated for all 100 iterations and with lowest rmse is selected with 23.8%.
It builded the model with 93% accuracy by predicting 10548 values correctly,inter rater reliability being 78% we can trust our model,true positive rate being 92%,true negative rate being 96%,balanced accuracy of 94% and area under curve of 85%, it is a good model.

Let's, compare the different algorithmic models based on various parameters discussed till now and decide which is a better model to perform our research.

SUMMARY OF DIFFERENT MODELS ON TRAIN DATA

SUMMARY
model accuracy kappa sensitivity specificity balanced_accuracy auc
cart.bagging_train 99 99 99 99 99.6 99
svm_train 98 94 98 99 98.4 96
xgb.boosting_train 93 78 92 96 94.3 85
ada.boosting_train 82 38 84 69 76.3 66
c50_train 82 37 83 71 77.0 65
cart_train 81 35 83 69 76.1 64
binary.logistic.regression_train 81 28 82 70 76.0 61
naive.bayes_train 66 27 88 37 62.6 68

From the model accuracy plot, we can conclude that naive bayes model is eliminated due to its low accuracy .CART bagging model,svm model and xg boosting model have high accuracy and from summary also we can infer that all these parameters are high for these models,so lets predict the payment defaulters on test1 and test2 data and check whether the models are biased/varianced or not.
So,therefore lets predict the outcome on CART bagging,svm,xg boosting,adaptive boosting and c50 decision tree and compare their performance related to bias and variance.

PREDICTING FUTURE DATA (i.e on test1 and on test2 data)

Using Cart Bagging Ensemble Model

library(caret)
pred_test1<-predict(model_cartbag,test1)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 6151  469
         1 1152  729
                                          
               Accuracy : 0.8093          
                 95% CI : (0.8008, 0.8176)
    No Information Rate : 0.8591          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.364           
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8423          
            Specificity : 0.6085          
         Pos Pred Value : 0.9292          
         Neg Pred Value : 0.3876          
             Prevalence : 0.8591          
         Detection Rate : 0.7236          
   Detection Prevalence : 0.7787          
      Balanced Accuracy : 0.7254          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6583569
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

pred_test2<-predict(model_cartbag,test2)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 6087  476
         1 1188  750
                                          
               Accuracy : 0.8043          
                 95% CI : (0.7957, 0.8126)
    No Information Rate : 0.8558          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.3612          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8367          
            Specificity : 0.6117          
         Pos Pred Value : 0.9275          
         Neg Pred Value : 0.3870          
             Prevalence : 0.8558          
         Detection Rate : 0.7160          
   Detection Prevalence : 0.7720          
      Balanced Accuracy : 0.7242          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6572345
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

From above analysis, except kappa remaining all parameters have good values, so overall model is good.

Using Support Vector Machines Advanced Model

pred_test1<-predict(model_svm,test1)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 5839  781
         1 1378  503
                                          
               Accuracy : 0.746           
                 95% CI : (0.7366, 0.7553)
    No Information Rate : 0.849           
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1686          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8091          
            Specificity : 0.3917          
         Pos Pred Value : 0.8820          
         Neg Pred Value : 0.2674          
             Prevalence : 0.8490          
         Detection Rate : 0.6869          
   Detection Prevalence : 0.7787          
      Balanced Accuracy : 0.6004          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.5747176
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

pred_test2<-predict(model_svm,test2)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 5797  766
         1 1383  555
                                          
               Accuracy : 0.7472          
                 95% CI : (0.7378, 0.7564)
    No Information Rate : 0.8446          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1911          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8074          
            Specificity : 0.4201          
         Pos Pred Value : 0.8833          
         Neg Pred Value : 0.2864          
             Prevalence : 0.8446          
         Detection Rate : 0.6819          
   Detection Prevalence : 0.7720          
      Balanced Accuracy : 0.6138          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.5848314
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

From above analysis, except accuracy and sensitivity remaining all parameters have bad values, so overall model is bad.

Using Xtreme Gradient Boosting Ensemble Model

x1<-test1[,1:21]
pred<-predict(model_xgb,as.matrix(x1))
pred_test1<-round(pred)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 6159  461
         1 1177  704
                                          
               Accuracy : 0.8073          
                 95% CI : (0.7988, 0.8157)
    No Information Rate : 0.863           
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.3527          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8396          
            Specificity : 0.6043          
         Pos Pred Value : 0.9304          
         Neg Pred Value : 0.3743          
             Prevalence : 0.8630          
         Detection Rate : 0.7245          
   Detection Prevalence : 0.7787          
      Balanced Accuracy : 0.7219          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6523158
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

x2<-test2[,1:21]
pred<-predict(model_xgb,as.matrix(x2))
pred_test2<-round(pred)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 6087  476
         1 1218  720
                                          
               Accuracy : 0.8007          
                 95% CI : (0.7921, 0.8092)
    No Information Rate : 0.8593          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.3456          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8333          
            Specificity : 0.6020          
         Pos Pred Value : 0.9275          
         Neg Pred Value : 0.3715          
             Prevalence : 0.8593          
         Detection Rate : 0.7160          
   Detection Prevalence : 0.7720          
      Balanced Accuracy : 0.7176          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6494946
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

  From above analysis, except kappa, specificity and area under curve remaining all parameters have good values, so overall model is bit bad one.
  

Using Adaptive Boosting Ensemble Model

pred_test1<-predict(model_ada,test1)
library(caret)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 6275  345
         1 1200  681
                                          
               Accuracy : 0.8183          
                 95% CI : (0.8099, 0.8264)
    No Information Rate : 0.8793          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.3701          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8395          
            Specificity : 0.6637          
         Pos Pred Value : 0.9479          
         Neg Pred Value : 0.3620          
             Prevalence : 0.8793          
         Detection Rate : 0.7381          
   Detection Prevalence : 0.7787          
      Balanced Accuracy : 0.7516          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6549633
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

pred_test2<-predict(model_ada,test2)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 6206  357
         1 1220  718
                                          
               Accuracy : 0.8145          
                 95% CI : (0.8061, 0.8227)
    No Information Rate : 0.8735          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.3749          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8357          
            Specificity : 0.6679          
         Pos Pred Value : 0.9456          
         Neg Pred Value : 0.3705          
             Prevalence : 0.8735          
         Detection Rate : 0.7300          
   Detection Prevalence : 0.7720          
      Balanced Accuracy : 0.7518          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6580446
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

From above analysis, except sensitivity and accuracy remaining all parameters have quite good values, so overall model is good.    

Using C5.0 Decision Tree

pred_test1<-predict(model_c50,test1)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 6251  369
         1 1154  727
                                          
               Accuracy : 0.8208          
                 95% CI : (0.8125, 0.8289)
    No Information Rate : 0.8711          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.3888          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8442          
            Specificity : 0.6633          
         Pos Pred Value : 0.9443          
         Neg Pred Value : 0.3865          
             Prevalence : 0.8711          
         Detection Rate : 0.7353          
   Detection Prevalence : 0.7787          
      Balanced Accuracy : 0.7537          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6653782
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

pred_test2<-predict(model_c50,test2)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 6172  391
         1 1183  755
                                          
               Accuracy : 0.8148          
                 95% CI : (0.8064, 0.8231)
    No Information Rate : 0.8652          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.3855          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8392          
            Specificity : 0.6588          
         Pos Pred Value : 0.9404          
         Neg Pred Value : 0.3896          
             Prevalence : 0.8652          
         Detection Rate : 0.7260          
   Detection Prevalence : 0.7720          
      Balanced Accuracy : 0.7490          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6650002
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

From above analysis, except sensitivity and accuracy remaining all parameters have bad values, so overall model is bad.

SUMMARY ON TEST1 AND TEST2 DATA

SUMMARY
model accuracy accuracy_t1 accuracy_t2 kappa kappa_t1 kappa_t2 sensitivity sensitivity_t1 sensitivity_t2 specificity specificity_t1 specificity_t2 balanced_acc balanced_acc_t1 balanced_acc_t2 auc auc_t1 auc_t2
_ train test1 test2 train test1 test2 train test1 test2 train test1 test2 train test1 test2 train test1 test2
cart.bagging 99.7 81 81 99 37 37 99.6 84 84 99.8 61 63 99.6 72 73 99 66 66
svm 98 75 76 94 19 21 98 81 81 99 41 45 98.4 61 63 96 58 59
xgb.boosting 93 81 81 78 37 37 92 84 84 96 61 63 94 73 73 85 66 66
ada.boosting 82 82 82 38 38 38 84 84 84 69 68 69 76 76 76 66 66 66
c50 82 82 82 37 38 37 83 84 83 71 70 71 77 77 77 65 66 65
cart 81 _ _ 35 _ _ 83 _ _ 69 _ _ 76 _ _ 64 _ _
binary.logistic.regression 81 _ _ 28 _ _ 82 _ _ 70 _ _ 76 _ _ 61 _ _
naive.bayes 66 _ _ 27 _ _ 88 _ _ 37 _ _ 63 _ _ 68 _ _

  From above summary, checking the consistency of all the parameters,we can infer that the CART bagging,support vector machines and xtreme gradient boosting are more biased(performance is good on train data but performance is comparatively bad on test1 and test2 data) and varianced(error rate on train,test1 and test2 data is high).
  But adaptive boosting and c50 decision tree is less biased and varianced and has a good accuracy too, so therefore lets consider adaptive boosting in this analysis and still check its consistency of its performance on differnt train, test1, test2 data by shuffling the same ccdefault dataset.  
  

SHUFFLE THE DATA

set.seed(1)

RESAMPLING THE DATA

library(caret)
splitccdefault<-createDataPartition(ccdefault$`default payment next month`,p=0.4,list = FALSE)
train<-ccdefault[splitccdefault,]
test<-ccdefault[-splitccdefault,]
splittest<-createDataPartition(test$`default payment next month`,p=0.5,list = FALSE)
test1<-test[splittest,]
test2<-test[-splittest,]
dim(train)
[1] 11335    22
dim(test1)
[1] 8501   22
dim(test2)
[1] 8501   22

MODEL BUILDING USING ADAPTIVE BOOSTING

library(ada)
model_ada<-ada(`default payment next month` ~ ., data=train,loss="exponential",type="discrete",iter=100)
model_ada
Call:
ada(`default payment next month` ~ ., data = train, loss = "exponential", 
    type = "discrete", iter = 100)

Loss: exponential Method: discrete   Iteration: 100 

Final Confusion Matrix for Data:
          Final Prediction
True value    0    1
         0 8385  404
         1 1604  942

Train Error: 0.177 

Out-Of-Bag Error:  0.177  iteration= 27 

Additional Estimates of number of iterations:

train.err1 train.kap1 
        21         34 
plot(model_ada)

PREDICTING ON TRAIN DATA

pred_train<-predict(model_ada,train)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 8385  404
         1 1604  942
                                          
               Accuracy : 0.8228          
                 95% CI : (0.8157, 0.8298)
    No Information Rate : 0.8813          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.3892          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8394          
            Specificity : 0.6999          
         Pos Pred Value : 0.9540          
         Neg Pred Value : 0.3700          
             Prevalence : 0.8813          
         Detection Rate : 0.7397          
   Detection Prevalence : 0.7754          
      Balanced Accuracy : 0.7696          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6620128
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICTING ON TEST1 AND TEST2 DATA

pred_test1<-predict(model_ada,test1)
library(caret)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 6228  349
         1 1218  706
                                          
               Accuracy : 0.8157          
                 95% CI : (0.8073, 0.8239)
    No Information Rate : 0.8759          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.3736          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8364          
            Specificity : 0.6692          
         Pos Pred Value : 0.9469          
         Neg Pred Value : 0.3669          
             Prevalence : 0.8759          
         Detection Rate : 0.7326          
   Detection Prevalence : 0.7737          
      Balanced Accuracy : 0.7528          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6569401
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

pred_test2<-predict(model_ada,test2)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 6277  335
         1 1207  682
                                          
               Accuracy : 0.8186          
                 95% CI : (0.8103, 0.8267)
    No Information Rate : 0.8804          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.3716          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8387          
            Specificity : 0.6706          
         Pos Pred Value : 0.9493          
         Neg Pred Value : 0.3610          
             Prevalence : 0.8804          
         Detection Rate : 0.7384          
   Detection Prevalence : 0.7778          
      Balanced Accuracy : 0.7547          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6551861
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

RESHUFFLE THE DATA

set.seed(2)

RESAMPLING THE DATA

library(caret)
splitccdefault<-createDataPartition(ccdefault$`default payment next month`,p=0.4,list = FALSE)
train<-ccdefault[splitccdefault,]
test<-ccdefault[-splitccdefault,]
splittest<-createDataPartition(test$`default payment next month`,p=0.5,list = FALSE)
test1<-test[splittest,]
test2<-test[-splittest,]
dim(train)
[1] 11335    22
dim(test1)
[1] 8501   22
dim(test2)
[1] 8501   22

MODEL BUILDING USING ADAPTIVE BOOSTING

library(ada)
model_ada<-ada(`default payment next month` ~ ., data=train,loss="exponential",type="discrete",iter=100)
model_ada
Call:
ada(`default payment next month` ~ ., data = train, loss = "exponential", 
    type = "discrete", iter = 100)

Loss: exponential Method: discrete   Iteration: 100 

Final Confusion Matrix for Data:
          Final Prediction
True value    0    1
         0 8338  441
         1 1559  997

Train Error: 0.176 

Out-Of-Bag Error:  0.176  iteration= 100 

Additional Estimates of number of iterations:

train.err1 train.kap1 
        78         10 
plot(model_ada)

PREDICTING ON TRAIN DATA

pred_train<-predict(model_ada,train)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 8338  441
         1 1559  997
                                          
               Accuracy : 0.8236          
                 95% CI : (0.8164, 0.8305)
    No Information Rate : 0.8731          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.4022          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8425          
            Specificity : 0.6933          
         Pos Pred Value : 0.9498          
         Neg Pred Value : 0.3901          
             Prevalence : 0.8731          
         Detection Rate : 0.7356          
   Detection Prevalence : 0.7745          
      Balanced Accuracy : 0.7679          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6699145
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICTING ON TEST1 AND TEST2 DATA

pred_test1<-predict(model_ada,test1)
library(caret)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 6222  324
         1 1220  735
                                        
               Accuracy : 0.8184        
                 95% CI : (0.81, 0.8265)
    No Information Rate : 0.8754        
    P-Value [Acc > NIR] : 1             
                                        
                  Kappa : 0.389         
 Mcnemar's Test P-Value : <2e-16        
                                        
            Sensitivity : 0.8361        
            Specificity : 0.6941        
         Pos Pred Value : 0.9505        
         Neg Pred Value : 0.3760        
             Prevalence : 0.8754        
         Detection Rate : 0.7319        
   Detection Prevalence : 0.7700        
      Balanced Accuracy : 0.7651        
                                        
       'Positive' Class : 0             
                                        
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6632316
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

pred_test2<-predict(model_ada,test2)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 6278  375
         1 1160  688
                                          
               Accuracy : 0.8194          
                 95% CI : (0.8111, 0.8276)
    No Information Rate : 0.875           
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.3732          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8440          
            Specificity : 0.6472          
         Pos Pred Value : 0.9436          
         Neg Pred Value : 0.3723          
             Prevalence : 0.8750          
         Detection Rate : 0.7385          
   Detection Prevalence : 0.7826          
      Balanced Accuracy : 0.7456          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6579644
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

SUMMARY ON DIFFERENT SHUFFLED DATA

Here is the summary of model performance on shuffled datasets by using adaptive boosting.
FINAL SUMMARY
ada.boosting.model accuracy kappa sensitivity specificity balanced_accuracy auc
train 82 37.8 83.8 68.7 76.3 65.7
test1 82 38.1 84 67.7 76 65.9
test2 81.7 37.6 83.5 68.6 76 65.7
reshuffle_data1 . . . . . .
train 82.3 39 84 70 77 66.3
test1 81.6 37.4 83.6 66.9 75.3 65.7
test2 81.9 37.2 83.9 67 75.5 65.5
reshuffle_data2 . . . . . .
train 82.3 38.9 83.9 69.9 76.9 66.2
test1 81.5 37.3 83.6 66.9 75.2 65.6
test2 81.8 37.1 83.8 67 75.4 65.5
reshuffle_data3 . . . . . .
train 82.3 40.2 84.2 69.3 76.8 67
test1 81.8 39 83.6 69.4 76.5 66.3
test2 81.9 37.3 84.4 64.7 74.5 65.8
From the above summary we can conclude that the model is consistently performing good on all the parameters so hence adaptive boosting is the best model on this dataset.

FINAL DATA WITH PREDICTED VALUES

library(caret)
predicted<-predict(model_ada,ccdefault)
library(dplyr)

Attaching package: 'dplyr'
The following object is masked from 'package:xgboost':

    slice
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
ccdefault<-mutate(ccdefault,expected_default.payment.next.month=predicted)
FINAL OBSERVED AND EXPECTED PAYMENT DEFAULTERS (displaying only 33 observations)
LIMIT_BAL EDUCATION AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month expected_default.payment.next.month
20000 2 24 2 2 -1 -1 -2 -2 3913 3102 689 0 0 0 0 689 0 0 0 0 1 1
120000 2 26 -1 2 0 0 0 2 2682 1725 2682 3272 3455 3261 0 1000 1000 1000 0 2000 1 0
90000 2 34 0 0 0 0 0 0 29239 14027 13559 14331 14948 15549 1518 1500 1000 1000 1000 5000 0 0
50000 2 37 0 0 0 0 0 0 46990 48233 49291 28314 28959 29547 2000 2019 1200 1100 1069 1000 0 0
50000 2 57 -1 0 -1 0 0 0 8617 5670 35835 20940 19146 19131 2000 36681 10000 9000 689 679 0 0
50000 1 37 0 0 0 0 0 0 64400 57069 57608 19394 19619 20024 2500 1815 657 1000 1000 800 0 0
500000 1 29 0 0 0 0 0 0 367965 412023 445007 542653 483003 473944 55000 40000 38000 20239 13750 13770 0 0
140000 3 28 0 0 2 0 0 0 11285 14096 12108 12211 11793 3719 3329 0 432 1000 1000 1000 0 0
20000 3 35 -2 -2 -2 -2 -1 -1 0 0 0 0 13007 13912 0 0 0 13007 1122 0 0 0
200000 3 34 0 0 2 0 0 -1 11073 9787 5535 2513 1828 3731 2306 12 50 300 3738 66 0 0
260000 1 51 -1 -1 -1 -1 -1 2 12261 21670 9966 8517 22287 13668 21818 9966 8583 22301 0 3640 0 0
630000 2 41 -1 0 -1 -1 -1 -1 12137 6500 6500 6500 6500 2870 1000 6500 6500 6500 2870 0 0 0
70000 2 30 1 2 2 0 0 2 65802 67369 65701 66782 36137 36894 3200 0 3000 3000 1500 0 1 0
250000 1 29 0 0 0 0 0 0 70887 67060 63561 59696 56875 55512 3000 3000 3000 3000 3000 3000 0 0
50000 3 23 1 2 0 0 0 0 50614 29173 28116 28771 29531 30211 0 1500 1100 1200 1300 1100 0 0
20000 1 24 0 0 2 2 2 2 15376 18010 17428 18338 17905 19104 3200 0 1500 0 1650 0 1 0
320000 1 49 0 0 0 -1 -1 -1 253286 246536 194663 70074 5856 195599 10358 10000 75940 20000 195599 50000 0 0
360000 1 49 1 -2 -2 -2 -2 -2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
180000 1 29 1 -2 -2 -2 -2 -2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
130000 3 39 0 0 0 0 0 -1 38358 27688 24489 20616 11802 930 3000 1537 1000 2000 930 33764 0 0
120000 2 39 -1 -1 -1 -1 -1 -1 316 316 316 0 632 316 316 316 0 632 316 0 1 0
70000 2 26 2 0 0 2 2 2 41087 42445 45020 44006 46905 46012 2007 3582 0 3601 0 1820 1 1
450000 1 40 -2 -2 -2 -2 -2 -2 5512 19420 1473 560 0 0 19428 1473 560 0 0 1128 1 0
90000 1 23 0 0 0 -1 0 0 4744 7070 0 5398 6360 8292 5757 0 5398 1200 2045 2000 0 0
50000 3 23 0 0 0 0 0 0 47620 41810 36023 28967 29829 30046 1973 1426 1001 1432 1062 997 0 0
50000 3 30 0 0 0 0 0 0 22541 16138 17163 17878 18931 19617 1300 1300 1000 1500 1000 1012 0 0
50000 3 47 -1 -1 -1 -1 -1 -1 650 3415 3416 2040 30430 257 3415 3421 2044 30430 257 0 0 0
50000 1 26 0 0 0 0 0 0 15329 16575 17496 17907 18375 11400 1500 1500 1000 1000 1600 0 0 0
230000 1 27 -1 -1 -1 -1 -1 -1 16646 17265 13266 15339 14307 36923 17270 13281 15339 14307 37292 0 0 0
50000 2 33 2 0 0 0 0 0 30518 29618 22102 22734 23217 23680 1718 1500 1000 1000 1000 716 1 1
100000 1 32 0 0 0 0 0 0 93036 84071 82880 80958 78703 75589 3023 3511 3302 3204 3200 2504 0 0
500000 2 54 -2 -2 -2 -2 -2 -2 10929 4152 22722 7521 71439 8981 4152 22827 7521 71439 981 51582 0 0
500000 1 58 -2 -2 -2 -2 -2 -2 13709 5006 31130 3180 0 5293 5006 31178 3180 0 5293 768 0 0

CONCLUSION

As banks play a vital role in providing financial services to their customers,banks should avoid wrong customers who can default and cause loss to the banks and other financial institutions.
In order to achieve their goals we can conclude that the model developed has used all the factors and data and has a high potential in predicting whether the customer would fail/succeed in making the next payment which will benefit the bank/financial institutions in making their decisions so hence we can deploy the model.