DEFINE PROBLEM STATEMENT
Banks play a vital role in providing financial services to their customers, and lending carries substantial risk, so it is very important to assess the risk of payment default.
Using the 'ccdefault' dataset, can we predict whether a customer's repayment will be successful or not?
Variable selection
This study reviewed the literature and shortlisted 21 explanatory variables, obtained by dropping the remaining columns:
ccdefault<-ccdefault[,c(-1,-3,-5)]  # drop columns 1, 3 and 5 (ID, SEX and MARRIAGE in the standard layout of this dataset)
DATASET (displaying only 5 observations)

| LIMIT_BAL | EDUCATION | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20000 | 2 | 24 | 2 | 2 | -1 | -1 | -2 | -2 | 3913 | 3102 | 689 | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 120000 | 2 | 26 | -1 | 2 | 0 | 0 | 0 | 2 | 2682 | 1725 | 2682 | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 90000 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | 0 | 29239 | 14027 | 13559 | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 50000 | 2 | 37 | 0 | 0 | 0 | 0 | 0 | 0 | 46990 | 48233 | 49291 | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 50000 | 2 | 57 | -1 | 0 | -1 | 0 | 0 | 0 | 8617 | 5670 | 35835 | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |
Check for missing values
sum(is.na(ccdefault))
[1] 0
summary(ccdefault)
LIMIT_BAL EDUCATION AGE PAY_0
Min. : 10000 Min. :0.000 Min. :21.00 Min. :-2.0000
1st Qu.: 50000 1st Qu.:1.000 1st Qu.:28.00 1st Qu.:-1.0000
Median : 140000 Median :2.000 Median :34.00 Median : 0.0000
Mean : 167484 Mean :1.853 Mean :35.49 Mean :-0.0167
3rd Qu.: 240000 3rd Qu.:2.000 3rd Qu.:41.00 3rd Qu.: 0.0000
Max. :1000000 Max. :6.000 Max. :79.00 Max. : 8.0000
PAY_2 PAY_3 PAY_4 PAY_5
Min. :-2.0000 Min. :-2.0000 Min. :-2.0000 Min. :-2.0000
1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000
Median : 0.0000 Median : 0.0000 Median : 0.0000 Median : 0.0000
Mean :-0.1338 Mean :-0.1662 Mean :-0.2207 Mean :-0.2662
3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
Max. : 8.0000 Max. : 8.0000 Max. : 8.0000 Max. : 8.0000
PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3
Min. :-2.0000 Min. :-165580 Min. :-69777 Min. :-157264
1st Qu.:-1.0000 1st Qu.: 3559 1st Qu.: 2985 1st Qu.: 2666
Median : 0.0000 Median : 22382 Median : 21200 Median : 20089
Mean :-0.2911 Mean : 51223 Mean : 49179 Mean : 47013
3rd Qu.: 0.0000 3rd Qu.: 67091 3rd Qu.: 64006 3rd Qu.: 60165
Max. : 8.0000 Max. : 964511 Max. :983931 Max. :1664089
BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1
Min. :-170000 Min. :-81334 Min. :-339603 Min. : 0
1st Qu.: 2327 1st Qu.: 1763 1st Qu.: 1256 1st Qu.: 1000
Median : 19052 Median : 18105 Median : 17071 Median : 2100
Mean : 43263 Mean : 40311 Mean : 38872 Mean : 5664
3rd Qu.: 54506 3rd Qu.: 50191 3rd Qu.: 49198 3rd Qu.: 5006
Max. : 891586 Max. :927171 Max. : 961664 Max. :873552
PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5
Min. : 0 Min. : 0 Min. : 0 Min. : 0.0
1st Qu.: 833 1st Qu.: 390 1st Qu.: 296 1st Qu.: 252.5
Median : 2009 Median : 1800 Median : 1500 Median : 1500.0
Mean : 5921 Mean : 5226 Mean : 4826 Mean : 4799.4
3rd Qu.: 5000 3rd Qu.: 4505 3rd Qu.: 4013 3rd Qu.: 4031.5
Max. :1684259 Max. :896040 Max. :621000 Max. :426529.0
PAY_AMT6 default payment next month
Min. : 0.0 Min. :0.0000
1st Qu.: 117.8 1st Qu.:0.0000
Median : 1500.0 Median :0.0000
Mean : 5215.5 Mean :0.2212
3rd Qu.: 4000.0 3rd Qu.:0.0000
Max. :528666.0 Max. :1.0000
The 'sum' function above reports zero 'NA' values, and the summary output confirms it, so we can conclude that there are no missing values.
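As an extra check (a minimal base-R sketch, not part of the original output), the missing-value count can also be broken down per column:
# per-column NA counts; every entry should be 0 for this dataset
colSums(is.na(ccdefault))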
Check for outliers
boxplot(ccdefault)

dim(ccdefault)
[1] 30000 22
From the boxplots, we can infer that some values lie far from the bulk of the data, and a few are extreme even relative to the other outliers; these extreme values can be removed, as they may affect model accuracy.
Remove extreme outliers
ccdefault<-ccdefault[ccdefault$LIMIT_BAL<1000000,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT1<0,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT1>7e+05,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT2>6e+05,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT3>6e+05,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT3<0,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT4>6e+05,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT5>6e+05,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT5<0,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT6>6e+05,]
ccdefault<-ccdefault[!ccdefault$BILL_AMT6<0,]
ccdefault<-ccdefault[!ccdefault$PAY_AMT1>3e+05,]
ccdefault<-ccdefault[!ccdefault$PAY_AMT2>5e+05,]
ccdefault<-ccdefault[!ccdefault$PAY_AMT3>4e+05,]
ccdefault<-ccdefault[!ccdefault$PAY_AMT4>3e+05,]
ccdefault<-ccdefault[!ccdefault$PAY_AMT5>350000,]
ccdefault<-ccdefault[!ccdefault$PAY_AMT6>3e+05,]
boxplot(ccdefault)

dim(ccdefault)
[1] 28337 22
Here, after eliminating some extreme outliers, we can see that the remaining outliers are close to one another; these shouldn't be removed, since dropping them would bias the sample.
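For comparison, a sketch of how such cut-offs could be derived from percentiles rather than read off the boxplots (illustrative only; the analysis keeps the hand-picked thresholds above):
# e.g. take the 99.9th percentile of a bill amount as the upper cut-off
cut_bill1<-quantile(ccdefault$BILL_AMT1,probs=0.999)
# ccdefault<-ccdefault[ccdefault$BILL_AMT1<=cut_bill1,]  # would replace the manual filter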
Now let's use this preprocessed data, with 28337 observations and 22 variables, for the analysis, but first let's check the multicollinearity among the predictors.
Check for multicollinearity
library(corrplot)
corrplot 0.84 loaded
corrplot(cor(ccdefault[1:21]),method = "number")

Multicollinearity among predictors should be avoided. From the correlation matrix plot above we see that the payment-status variables are highly correlated among themselves, and likewise the bill-amount variables, as expected, since they reflect the cumulative status and amounts of the preceding months; apart from this, the correlation among the other variables is low, which is satisfactory.
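As a complementary check (a sketch assuming the car package is installed; this was not run in the original analysis), variance inflation factors quantify multicollinearity numerically:
library(car)
# VIF above ~10 is a common warning sign of problematic multicollinearity
vif(lm(`default payment next month` ~ ., data=ccdefault))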
Resampling the pre-processed data
Let's split the data into a train set of 11335 observations, a test1 set of 8501 observations and a test2 set of 8501 observations, using the validation-set (hold-out) approach.
library(caret)
Loading required package: lattice
Loading required package: ggplot2
splitccdefault<-createDataPartition(ccdefault$`default payment next month`,p=0.4,list = FALSE)
train<-ccdefault[splitccdefault,]
test<-ccdefault[-splitccdefault,]
splittest<-createDataPartition(test$`default payment next month`,p=0.5,list = FALSE)
test1<-test[splittest,]
test2<-test[-splittest,]
dim(train)
[1] 11335 22
dim(test1)
[1] 8501 22
dim(test2)
[1] 8501 22
Now models are built on the train data using several algorithms; the best model is chosen among them based on various parameters, and the outcomes are then predicted with that best model.
BINARY LOGISTIC REGRESSION
Model Building
model_bin<-glm(`default payment next month` ~ ., data=train,family = binomial(link="logit"))
exp(model_bin$coefficients)
(Intercept) LIMIT_BAL EDUCATION AGE PAY_0 PAY_2
0.2711845 0.9999993 0.9394406 1.0120411 1.8252271 1.0652141
PAY_3 PAY_4 PAY_5 PAY_6 BILL_AMT1 BILL_AMT2
1.0810926 1.0412221 1.0463908 1.0002311 0.9999952 0.9999979
BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2
1.0000043 0.9999970 1.0000024 1.0000026 0.9999907 0.9999879
PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
0.9999921 0.9999947 0.9999910 1.0000013
Here, the model is built as a generalised linear model (binary logistic regression).
Exponentiating the coefficients converts them into odds ratios for interpretation: when all predictors are zero, the odds of default are about 0.27; when LIMIT_BAL increases by one unit, the odds of default are multiplied by 0.9999993, i.e. they decrease very slightly per unit of credit limit; the other variables are interpreted similarly through their respective odds ratios.
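A short optional sketch (not part of the original output) that attaches 95% profile-likelihood confidence intervals to these odds ratios:
# odds ratios with 95% confidence intervals; profiling can take a moment on ~11k rows
exp(cbind(OR=coef(model_bin),confint(model_bin)))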
Predicting On Train Data
pred_train<-ifelse(model_bin$fitted.values>0.5,1,0)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 8533 262
1 1862 678
Accuracy : 0.8126
95% CI : (0.8053, 0.8198)
No Information Rate : 0.9171
P-Value [Acc > NIR] : 1
Kappa : 0.3056
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8209
Specificity : 0.7213
Pos Pred Value : 0.9702
Neg Pred Value : 0.2669
Prevalence : 0.9171
Detection Rate : 0.7528
Detection Prevalence : 0.7759
Balanced Accuracy : 0.7711
'Positive' Class : 0
library(ROCR)
Loading required package: gplots
Attaching package: 'gplots'
The following object is masked from 'package:stats':
lowess
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6185697
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

library(DMwR)
Loading required package: grid
regr.eval(train$`default payment next month`,pred_train)
mae mse rmse mape
0.1873842 0.1873842 0.4328790 NaN
Here, from the contingency table, the model predicts 9211 observations correctly, i.e. with 81% accuracy, but with kappa at 31% (< 70%) the inter-rater reliability is not satisfactory (i.e. we can't rely fully on this model).
The true positive rate is 82% and the true negative rate is 72%, which is fairly satisfactory, and 77% of the time the classes '0' and '1' are predicted correctly (from the balanced accuracy).
The area under the curve, obtained from the receiver operating characteristic curve, is 62%, which is only modest.
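Note that the ROC curve above ranks the hard 0/1 predictions; a sketch (an addition, not in the original analysis) that ranks the fitted probabilities instead gives the conventional, smoother curve and typically a higher AUC:
# use the raw fitted probabilities rather than thresholded class labels
pr_prob<-prediction(model_bin$fitted.values,as.numeric(train$`default payment next month`))
performance(pr_prob,measure="auc")@y.values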
NAIVE BAYES METHOD
Model Building
train$`default payment next month`<-as.factor(train$`default payment next month`)
class(train$`default payment next month`)
[1] "factor"
library(naivebayes)
model_nb<-naive_bayes(`default payment next month` ~ ., data=train)
Predicting On Train Data
pred_train<-predict(model_nb,train,type="class")
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 7048 1747
1 1006 1534
Accuracy : 0.7571
95% CI : (0.7491, 0.765)
No Information Rate : 0.7105
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3672
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.8751
Specificity : 0.4675
Pos Pred Value : 0.8014
Neg Pred Value : 0.6039
Prevalence : 0.7105
Detection Rate : 0.6218
Detection Prevalence : 0.7759
Balanced Accuracy : 0.6713
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7026507
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Naive Bayes is a probabilistic model: it computes a MAP (maximum a posteriori) class probability by combining the prior class probabilities with the likelihood of the observed features.
Here it predicts 8582 values correctly, so the accuracy is 76%; kappa is low at 37%, the specificity is only 47%, and the binary classes are predicted correctly 67% of the time (balanced accuracy), with an area under the curve of 70%. Only the true positive rate of 88% is satisfactory, making this the weakest model so far.
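To make the MAP idea concrete, here is a tiny worked sketch with made-up numbers (purely illustrative; these values are not taken from the dataset):
prior<-c(no=0.78,yes=0.22)            # hypothetical class priors
likelihood<-c(no=0.10,yes=0.40)       # hypothetical P(feature value | class)
posterior<-prior*likelihood/sum(prior*likelihood)
posterior                             # the larger entry is the MAP class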
CART(CLASSIFICATION AND REGRESSION TREES)
Model Building
library(rpart)
model_cart<-rpart(`default payment next month` ~ ., data=train)
summary(model_cart)
Call:
rpart(formula = `default payment next month` ~ ., data = train)
n= 11335
CP nsplit rel error xerror xstd
1 0.1964567 0 1.0000000 1.0000000 0.01747794
2 0.0100000 1 0.8035433 0.8035433 0.01610565
Variable importance
PAY_0 PAY_4 PAY_5 PAY_6 PAY_3 PAY_2
85 4 3 3 3 2
Node number 1: 11335 observations, complexity param=0.1964567
predicted class=0 expected loss=0.2240847 P(node) =1
class counts: 8795 2540
probabilities: 0.776 0.224
left son=2 (10106 obs) right son=3 (1229 obs)
Primary splits:
PAY_0 < 1.5 to the left, improve=632.3547, (0 missing)
PAY_2 < 1.5 to the left, improve=464.2895, (0 missing)
PAY_3 < 1.5 to the left, improve=363.1352, (0 missing)
PAY_4 < 1 to the left, improve=330.6260, (0 missing)
PAY_5 < 1 to the left, improve=315.4737, (0 missing)
Surrogate splits:
PAY_4 < 2.5 to the left, agree=0.896, adj=0.044, (0 split)
PAY_5 < 2.5 to the left, agree=0.896, adj=0.041, (0 split)
PAY_6 < 2.5 to the left, agree=0.895, adj=0.033, (0 split)
PAY_3 < 3.5 to the left, agree=0.895, adj=0.031, (0 split)
PAY_2 < 2.5 to the left, agree=0.894, adj=0.025, (0 split)
Node number 2: 10106 observations
predicted class=0 expected loss=0.1658421 P(node) =0.8915748
class counts: 8430 1676
probabilities: 0.834 0.166
Node number 3: 1229 observations
predicted class=1 expected loss=0.2969894 P(node) =0.1084252
class counts: 365 864
probabilities: 0.297 0.703
Predicting On Train Data
pred_train<-predict(model_cart,train,type="class")
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 8430 365
1 1676 864
Accuracy : 0.8199
95% CI : (0.8127, 0.827)
No Information Rate : 0.8916
P-Value [Acc > NIR] : 1
Kappa : 0.3658
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8342
Specificity : 0.7030
Pos Pred Value : 0.9585
Neg Pred Value : 0.3402
Prevalence : 0.8916
Detection Rate : 0.7437
Detection Prevalence : 0.7759
Balanced Accuracy : 0.7686
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6493283
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

In the CART model the splitting variable at each node is selected by the Gini index, i.e. the split with the smallest Gini impurity is chosen; the tree is pruned with a complexity parameter of 0.01.
Here it predicts 9294 values correctly with 82% accuracy, an inter-rater reliability (kappa) of 37%, a true positive rate of 83%, a true negative rate of 70%, a balanced accuracy of 77% and an area under the curve of 65%.
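A minimal sketch of the Gini impurity underlying the split choice (illustrative only):
# Gini impurity from class proportions; 0 = pure node, 0.5 = worst case for two classes
gini<-function(p) 1-sum(p^2)
gini(c(0.776,0.224))  # root node of the tree above
gini(c(0.297,0.703))  # right child after the PAY_0 < 1.5 split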
C5.0 DECISION TREE
Model Building
library(C50)
train$`default payment next month`<-as.factor(train$`default payment next month`)
class(train$`default payment next month`)
[1] "factor"
model_c50<-C5.0(`default payment next month` ~ ., data=train)
summary(model_c50)
Call:
C5.0.formula(formula = `default payment next month` ~ ., data = train)
C5.0 [Release 2.07 GPL Edition] Wed Aug 15 08:37:56 2018
-------------------------------
Class specified by attribute `outcome'
Read 11335 cases (22 attributes) from undefined.data
Decision tree:
PAY_0 > 1: 1 (1229/365)
PAY_0 <= 1:
:...PAY_2 <= 1: 0 (9258/1321)
PAY_2 > 1:
:...PAY_6 <= 0: 0 (590/210)
PAY_6 > 0: 1 (258/113)
Evaluation on training data (11335 cases):
Decision Tree
----------------
Size Errors
4 2009(17.7%) <<
(a) (b) <-classified as
---- ----
8317 478 (a): class 0
1531 1009 (b): class 1
Attribute usage:
100.00% PAY_0
89.16% PAY_2
7.48% PAY_6
Time: 0.5 secs
Predicting On Train Data
pred_train<-predict(model_c50,train)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 8317 478
1 1531 1009
Accuracy : 0.8228
95% CI : (0.8156, 0.8298)
No Information Rate : 0.8688
P-Value [Acc > NIR] : 1
Kappa : 0.4022
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8445
Specificity : 0.6785
Pos Pred Value : 0.9457
Neg Pred Value : 0.3972
Prevalence : 0.8688
Detection Rate : 0.7337
Detection Prevalence : 0.7759
Balanced Accuracy : 0.7615
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6714475
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here, in C5.0, the node is selected based on the information gain ratio, and pruning is done with a confidence threshold of 0.25.
It predicts 9326 values correctly with 82% accuracy, a kappa of 40%, a true positive rate of 84%, a true negative rate of 68%, a balanced accuracy of 76% and an area under the curve of 67%.
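For comparison with CART's Gini criterion, a sketch of the entropy behind C5.0's information gain ratio (illustrative only):
# Shannon entropy; information gain = parent entropy minus weighted child entropy
entropy<-function(p) -sum(p[p>0]*log2(p[p>0]))
entropy(c(0.776,0.224))  # impurity of the root before the PAY_0 split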
SVM (SUPPORT VECTOR MACHINES) ADVANCED MODEL
Model Building
library(e1071)
model_svm<-svm(`default payment next month` ~ ., data=train,cost=100,gamma=1)
summary(model_svm)
Call:
svm(formula = `default payment next month` ~ ., data = train,
cost = 100, gamma = 1)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 100
gamma: 1
Number of Support Vectors: 7723
( 2310 5413 )
Number of Classes: 2
Levels:
0 1
Predicting On Train Data
pred_train<-predict(model_svm,train)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 8780 15
1 163 2377
Accuracy : 0.9843
95% CI : (0.9818, 0.9865)
No Information Rate : 0.789
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9539
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9818
Specificity : 0.9937
Pos Pred Value : 0.9983
Neg Pred Value : 0.9358
Prevalence : 0.7890
Detection Rate : 0.7746
Detection Prevalence : 0.7759
Balanced Accuracy : 0.9878
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.9670606
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here, the model is built with a radial basis function kernel with the parameter gamma set to 1, and the classes of the outcome variable are separated using the hyperplane concept.
It predicts 11157 values correctly with an accuracy of 98%, a kappa of 95%, a true positive rate of 98% and a true negative rate of 99%; the binary classes are predicted correctly 99% of the time (balanced accuracy) and the area under the curve is 97%, so on the train data it shows very high performance.
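For reference, a sketch of the radial basis kernel used above, with gamma = 1 (illustrative only):
# K(x, y) = exp(-gamma * ||x - y||^2); larger gamma means a more local, flexible boundary
rbf<-function(x,y,gamma=1) exp(-gamma*sum((x-y)^2))
rbf(c(0,0),c(1,1))
With cost = 100 and gamma = 1 the decision boundary is extremely flexible, which is consistent with the near-perfect fit on the train data and, as the test sections below show, with poor generalisation.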
CART BAGGING ENSEMBLE MODEL
Model Building
library(ipred)
model_cartbag<-bagging(`default payment next month` ~ ., data=train)
varImp(model_cartbag)
Overall
AGE 910.9295
BILL_AMT1 1011.4888
BILL_AMT2 922.3781
BILL_AMT3 813.0373
BILL_AMT4 754.9019
BILL_AMT5 710.5565
BILL_AMT6 656.7559
EDUCATION 322.6778
LIMIT_BAL 810.4052
PAY_0 927.7674
PAY_2 777.1373
PAY_3 690.3627
PAY_4 646.8209
PAY_5 582.1123
PAY_6 214.4319
PAY_AMT1 636.2980
PAY_AMT2 648.5776
PAY_AMT3 621.9875
PAY_AMT4 574.8440
PAY_AMT5 573.7682
PAY_AMT6 535.8119
Predicting On Train Data
pred_train<-predict(model_cartbag,train,type="class")
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 8789 6
1 59 2481
Accuracy : 0.9943
95% CI : (0.9927, 0.9956)
No Information Rate : 0.7806
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9834
Mcnemar's Test P-Value : 1.12e-10
Sensitivity : 0.9933
Specificity : 0.9976
Pos Pred Value : 0.9993
Neg Pred Value : 0.9768
Prevalence : 0.7806
Detection Rate : 0.7754
Detection Prevalence : 0.7759
Balanced Accuracy : 0.9955
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.9880447
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here, in CART bagging, many trees of the same type are fitted to bootstrap resamples of the train data and their predictions are aggregated (ipred fits 25 bootstrap trees by default), which reduces variance.
It predicts 11270 values correctly with an accuracy of 99%; with inter-rater reliability (kappa) of 98% the model appears reliable, the true positive rate is 99%, the true negative rate is nearly 100%, the balanced accuracy is 99.6% and the area under the curve is 99%, so on the train data it looks like a very good model.
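A minimal sketch of what bagging does conceptually (assuming rpart; this mirrors the idea, not the ipred internals):
library(rpart)
set.seed(42)
B<-25  # number of bootstrap trees
votes<-replicate(B,{
  idx<-sample(nrow(train),replace=TRUE)                      # bootstrap resample
  fit<-rpart(`default payment next month` ~ .,data=train[idx,],method="class")
  as.integer(as.character(predict(fit,train,type="class")))  # 0/1 votes
})
pred_bag<-ifelse(rowMeans(votes)>0.5,1,0)                    # majority vote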
ADAPTIVE BOOSTING ENSEMBLE MODEL
Model Building
library(ada)
model_ada<-ada(`default payment next month` ~ ., data=train,loss="exponential",type="discrete",iter=100)
model_ada
Call:
ada(`default payment next month` ~ ., data = train, loss = "exponential",
type = "discrete", iter = 100)
Loss: exponential Method: discrete Iteration: 100
Final Confusion Matrix for Data:
Final Prediction
True value 0 1
0 8381 414
1 1564 976
Train Error: 0.175
Out-Of-Bag Error: 0.175 iteration= 82
Additional Estimates of number of iterations:
train.err1 train.kap1
100 23
plot(model_ada)

Predicting On Train Data
pred_train<-predict(model_ada,train)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 8381 414
1 1564 976
Accuracy : 0.8255
95% CI : (0.8184, 0.8324)
No Information Rate : 0.8774
P-Value [Acc > NIR] : 1
Kappa : 0.4019
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8427
Specificity : 0.7022
Pos Pred Value : 0.9529
Neg Pred Value : 0.3843
Prevalence : 0.8774
Detection Rate : 0.7394
Detection Prevalence : 0.7759
Balanced Accuracy : 0.7724
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6685899
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here, in boosting, weak learners (decision stumps) are combined into a strong learner, with each iteration re-weighting the misclassified observations; 100 iterations are run with the exponential loss, giving a train error and an out-of-bag error of about 17.5%.
It predicts 9357 values correctly with an accuracy of 83%, a kappa of 40%, a true positive rate of 84%, a true negative rate of 70%, a balanced accuracy of 77% and an area under the curve of 67%; it is quite a good model, though the kappa is still low.
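A tiny sketch of the discrete AdaBoost re-weighting step that drives this (toy numbers, purely illustrative):
err<-0.3                               # hypothetical weighted error of one weak learner
alpha<-0.5*log((1-err)/err)            # that learner's voting weight
w<-rep(1/5,5)                          # five observations with equal starting weights
miss<-c(TRUE,FALSE,FALSE,TRUE,FALSE)   # which observations it misclassified
w<-w*exp(alpha*miss); w<-w/sum(w)      # misclassified points gain weight
w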
XTREME GRADIENT BOOSTING ENSEMBLE MODEL
Model Building
library(xgboost)
x<-train[,1:21]
y<-train[,22]
model_xgb<-xgboost(data=as.matrix(x),label = as.matrix(y),nrounds = 100)
[1] train-rmse:0.434421
[2] train-rmse:0.396022
[3] train-rmse:0.374531
... (the train RMSE falls steadily over the intermediate rounds) ...
[99] train-rmse:0.231134
[100] train-rmse:0.230896
Predicting On Train Data
pred<-predict(model_xgb,as.matrix(x))
pred_train<-round(pred)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 8722 73
1 686 1854
Accuracy : 0.933
95% CI : (0.9283, 0.9376)
No Information Rate : 0.83
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7894
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9271
Specificity : 0.9621
Pos Pred Value : 0.9917
Neg Pred Value : 0.7299
Prevalence : 0.8300
Detection Rate : 0.7695
Detection Prevalence : 0.7759
Balanced Accuracy : 0.9446
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.8608105
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here, in the extreme gradient boosting model, the train root mean square error is reported at each of the 100 boosting rounds, decreasing steadily to about 0.231 at the final round (the default xgboost objective is squared-error regression, which is why an RMSE is reported; the continuous predictions are then rounded into classes).
The model attains 93% accuracy by predicting 10576 values correctly; with inter-rater reliability (kappa) of 79% we can place more trust in it, the true positive rate is 93%, the true negative rate is 96%, the balanced accuracy is 94% and the area under the curve is 86%, so it is a good model.
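Because the response is binary, a sketch of an alternative configuration (not what was run above) uses xgboost's logistic objective so the model outputs class probabilities directly instead of regression scores:
library(xgboost)
ylab<-as.numeric(as.character(train$`default payment next month`))  # ensure 0/1 numeric labels
model_xgb2<-xgboost(data=as.matrix(train[,1:21]),label=ylab,nrounds=100,
                    objective="binary:logistic",verbose=0)
pred_prob<-predict(model_xgb2,as.matrix(train[,1:21]))  # P(default)
pred_class<-ifelse(pred_prob>0.5,1,0)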
Let's compare the different algorithmic models on the various parameters discussed so far and decide which is the better model for our research.
SUMMARY OF DIFFERENT MODELS ON TRAIN DATA
SUMMARY
| Model | Accuracy | Kappa | Sensitivity | Specificity | Balanced Accuracy | AUC |
|---|---|---|---|---|---|---|
| cart.bagging_train | 99 | 99 | 99 | 99 | 99.6 | 99 |
| svm_train | 98 | 94 | 98 | 99 | 98.4 | 96 |
| xgb.boosting_train | 93 | 78 | 92 | 96 | 94.3 | 85 |
| ada.boosting_train | 82 | 38 | 84 | 69 | 76.3 | 66 |
| c50_train | 82 | 37 | 83 | 71 | 77.0 | 65 |
| cart_train | 81 | 35 | 83 | 69 | 76.1 | 64 |
| binary.logistic.regression_train | 81 | 28 | 82 | 70 | 76.0 | 61 |
| naive.bayes_train | 66 | 27 | 88 | 37 | 62.6 | 68 |

(all values in %)

From the model accuracy plot, the naive Bayes model is eliminated because of its low accuracy. The CART bagging, SVM and XG boosting models have high accuracy, and the summary shows that all the parameters are high for these models; so let's predict the payment defaulters on the test1 and test2 data and check whether the models suffer from high bias or variance.
Therefore, let's predict the outcome with CART bagging, SVM, XG boosting, adaptive boosting and the C5.0 decision tree, and compare their performance in terms of bias and variance.
PREDICTING FUTURE DATA (i.e on test1 and on test2 data)
Using Cart Bagging Ensemble Model
library(caret)
pred_test1<-predict(model_cartbag,test1)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6151 469
1 1152 729
Accuracy : 0.8093
95% CI : (0.8008, 0.8176)
No Information Rate : 0.8591
P-Value [Acc > NIR] : 1
Kappa : 0.364
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8423
Specificity : 0.6085
Pos Pred Value : 0.9292
Neg Pred Value : 0.3876
Prevalence : 0.8591
Detection Rate : 0.7236
Detection Prevalence : 0.7787
Balanced Accuracy : 0.7254
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6583569
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

pred_test2<-predict(model_cartbag,test2)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6087 476
1 1188 750
Accuracy : 0.8043
95% CI : (0.7957, 0.8126)
No Information Rate : 0.8558
P-Value [Acc > NIR] : 1
Kappa : 0.3612
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8367
Specificity : 0.6117
Pos Pred Value : 0.9275
Neg Pred Value : 0.3870
Prevalence : 0.8558
Detection Rate : 0.7160
Detection Prevalence : 0.7720
Balanced Accuracy : 0.7242
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6572345
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

From the above analysis, all parameters except kappa still have reasonable values on the test data, but they are far below the training values, which points to overfitting.
Using Support Vector Machines Advanced Model
pred_test1<-predict(model_svm,test1)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 5839 781
1 1378 503
Accuracy : 0.746
95% CI : (0.7366, 0.7553)
No Information Rate : 0.849
P-Value [Acc > NIR] : 1
Kappa : 0.1686
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8091
Specificity : 0.3917
Pos Pred Value : 0.8820
Neg Pred Value : 0.2674
Prevalence : 0.8490
Detection Rate : 0.6869
Detection Prevalence : 0.7787
Balanced Accuracy : 0.6004
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.5747176
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

pred_test2<-predict(model_svm,test2)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 5797 766
1 1383 555
Accuracy : 0.7472
95% CI : (0.7378, 0.7564)
No Information Rate : 0.8446
P-Value [Acc > NIR] : 1
Kappa : 0.1911
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8074
Specificity : 0.4201
Pos Pred Value : 0.8833
Neg Pred Value : 0.2864
Prevalence : 0.8446
Detection Rate : 0.6819
Detection Prevalence : 0.7720
Balanced Accuracy : 0.6138
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.5848314
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

From the above analysis, all parameters except accuracy and sensitivity have poor values on the test data, far below the near-perfect training performance, so overall the model generalizes badly.
Using Xtreme Gradient Boosting Ensemble Model
x1<-test1[,1:21]
pred<-predict(model_xgb,as.matrix(x1))
pred_test1<-round(pred)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6159 461
1 1177 704
Accuracy : 0.8073
95% CI : (0.7988, 0.8157)
No Information Rate : 0.863
P-Value [Acc > NIR] : 1
Kappa : 0.3527
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8396
Specificity : 0.6043
Pos Pred Value : 0.9304
Neg Pred Value : 0.3743
Prevalence : 0.8630
Detection Rate : 0.7245
Detection Prevalence : 0.7787
Balanced Accuracy : 0.7219
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6523158
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

x2<-test2[,1:21]
pred<-predict(model_xgb,as.matrix(x2))
pred_test2<-round(pred)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6087 476
1 1218 720
Accuracy : 0.8007
95% CI : (0.7921, 0.8092)
No Information Rate : 0.8593
P-Value [Acc > NIR] : 1
Kappa : 0.3456
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8333
Specificity : 0.6020
Pos Pred Value : 0.9275
Neg Pred Value : 0.3715
Prevalence : 0.8593
Detection Rate : 0.7160
Detection Prevalence : 0.7720
Balanced Accuracy : 0.7176
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6494946
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

From the above analysis, all parameters except kappa, specificity and area under the curve have good values, but the clear drop from the training performance makes this model a somewhat poor choice.
Using Adaptive Boosting Ensemble Model
pred_test1<-predict(model_ada,test1)
library(caret)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6275 345
1 1200 681
Accuracy : 0.8183
95% CI : (0.8099, 0.8264)
No Information Rate : 0.8793
P-Value [Acc > NIR] : 1
Kappa : 0.3701
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8395
Specificity : 0.6637
Pos Pred Value : 0.9479
Neg Pred Value : 0.3620
Prevalence : 0.8793
Detection Rate : 0.7381
Detection Prevalence : 0.7787
Balanced Accuracy : 0.7516
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6549633
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

pred_test2<-predict(model_ada,test2)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6206 357
1 1220 718
Accuracy : 0.8145
95% CI : (0.8061, 0.8227)
No Information Rate : 0.8735
P-Value [Acc > NIR] : 1
Kappa : 0.3749
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8357
Specificity : 0.6679
Pos Pred Value : 0.9456
Neg Pred Value : 0.3705
Prevalence : 0.8735
Detection Rate : 0.7300
Detection Prevalence : 0.7720
Balanced Accuracy : 0.7518
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6580446
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

From the above analysis, the test parameters closely match the training parameters: accuracy and sensitivity are good, while kappa and specificity remain moderate, so overall the model generalizes well.
Using C5.0 Decision Tree
pred_test1<-predict(model_c50,test1)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6251 369
1 1154 727
Accuracy : 0.8208
95% CI : (0.8125, 0.8289)
No Information Rate : 0.8711
P-Value [Acc > NIR] : 1
Kappa : 0.3888
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8442
Specificity : 0.6633
Pos Pred Value : 0.9443
Neg Pred Value : 0.3865
Prevalence : 0.8711
Detection Rate : 0.7353
Detection Prevalence : 0.7787
Balanced Accuracy : 0.7537
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6653782
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

pred_test2<-predict(model_c50,test2)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6172 391
1 1183 755
Accuracy : 0.8148
95% CI : (0.8064, 0.8231)
No Information Rate : 0.8652
P-Value [Acc > NIR] : 1
Kappa : 0.3855
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8392
Specificity : 0.6588
Pos Pred Value : 0.9404
Neg Pred Value : 0.3896
Prevalence : 0.8652
Detection Rate : 0.7260
Detection Prevalence : 0.7720
Balanced Accuracy : 0.7490
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6650002
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

From the above analysis, the test parameters again closely match the training parameters, so like adaptive boosting the C5.0 model generalizes consistently, though kappa and specificity remain moderate.
SUMMARY ON TEST1 AND TEST2 DATA
SUMMARY
(all values in %; "_" indicates the model was not evaluated on the test sets)

| Model | Accuracy train | Accuracy test1 | Accuracy test2 | Kappa train | Kappa test1 | Kappa test2 | Sensitivity train | Sensitivity test1 | Sensitivity test2 | Specificity train | Specificity test1 | Specificity test2 | Balanced Accuracy train | Balanced Accuracy test1 | Balanced Accuracy test2 | AUC train | AUC test1 | AUC test2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cart.bagging | 99.7 | 81 | 81 | 99 | 37 | 37 | 99.6 | 84 | 84 | 99.8 | 61 | 63 | 99.6 | 72 | 73 | 99 | 66 | 66 |
| svm | 98 | 75 | 76 | 94 | 19 | 21 | 98 | 81 | 81 | 99 | 41 | 45 | 98.4 | 61 | 63 | 96 | 58 | 59 |
| xgb.boosting | 93 | 81 | 81 | 78 | 37 | 37 | 92 | 84 | 84 | 96 | 61 | 63 | 94 | 73 | 73 | 85 | 66 | 66 |
| ada.boosting | 82 | 82 | 82 | 38 | 38 | 38 | 84 | 84 | 84 | 69 | 68 | 69 | 76 | 76 | 76 | 66 | 66 | 66 |
| c50 | 82 | 82 | 82 | 37 | 38 | 37 | 83 | 84 | 83 | 71 | 70 | 71 | 77 | 77 | 77 | 65 | 66 | 65 |
| cart | 81 | _ | _ | 35 | _ | _ | 83 | _ | _ | 69 | _ | _ | 76 | _ | _ | 64 | _ | _ |
| binary.logistic.regression | 81 | _ | _ | 28 | _ | _ | 82 | _ | _ | 70 | _ | _ | 76 | _ | _ | 61 | _ | _ |
| naive.bayes | 66 | _ | _ | 27 | _ | _ | 88 | _ | _ | 37 | _ | _ | 63 | _ | _ | 68 | _ | _ |

From the above summary, checking the consistency of all the parameters, we can infer that CART bagging, support vector machines and extreme gradient boosting are more biased towards the train data (performance is good on the train data but comparatively bad on the test1 and test2 data) and have high variance (the error rate differs substantially between train, test1 and test2).
Adaptive boosting and the C5.0 decision tree, however, show low bias and variance and good accuracy, so let's choose adaptive boosting for this analysis and further check the consistency of its performance on different train, test1 and test2 splits by reshuffling the same ccdefault dataset.
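The reshuffling below is done manually with set.seed(1) and set.seed(2); the same consistency check could be automated by looping over seeds, as in this sketch (a compact restatement of the steps that follow, not the code that was actually run):
library(caret); library(ada)
cc<-ccdefault
cc$`default payment next month`<-as.factor(cc$`default payment next month`)  # ada expects a 2-level outcome
accs<-sapply(1:3,function(s){
  set.seed(s)
  idx<-createDataPartition(cc$`default payment next month`,p=0.4,list=FALSE)
  fit<-ada(`default payment next month` ~ .,data=cc[idx,],
           loss="exponential",type="discrete",iter=100)
  mean(predict(fit,cc[-idx,])==cc[-idx,]$`default payment next month`)       # hold-out accuracy
})
accs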
SHUFFLE THE DATA
set.seed(1)
RESAMPLING THE DATA
library(caret)
splitccdefault<-createDataPartition(ccdefault$`default payment next month`,p=0.4,list = FALSE)
train<-ccdefault[splitccdefault,]
test<-ccdefault[-splitccdefault,]
splittest<-createDataPartition(test$`default payment next month`,p=0.5,list = FALSE)
test1<-test[splittest,]
test2<-test[-splittest,]
dim(train)
[1] 11335 22
dim(test1)
[1] 8501 22
dim(test2)
[1] 8501 22
MODEL BUILDING USING ADAPTIVE BOOSTING
library(ada)
model_ada<-ada(`default payment next month` ~ ., data=train,loss="exponential",type="discrete",iter=100)
model_ada
Call:
ada(`default payment next month` ~ ., data = train, loss = "exponential",
type = "discrete", iter = 100)
Loss: exponential Method: discrete Iteration: 100
Final Confusion Matrix for Data:
Final Prediction
True value 0 1
0 8385 404
1 1604 942
Train Error: 0.177
Out-Of-Bag Error: 0.177 iteration= 27
Additional Estimates of number of iterations:
train.err1 train.kap1
21 34
plot(model_ada)

PREDICTING ON TRAIN DATA
pred_train<-predict(model_ada,train)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 8385 404
1 1604 942
Accuracy : 0.8228
95% CI : (0.8157, 0.8298)
No Information Rate : 0.8813
P-Value [Acc > NIR] : 1
Kappa : 0.3892
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8394
Specificity : 0.6999
Pos Pred Value : 0.9540
Neg Pred Value : 0.3700
Prevalence : 0.8813
Detection Rate : 0.7397
Detection Prevalence : 0.7754
Balanced Accuracy : 0.7696
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6620128
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICTING ON TEST1 AND TEST2 DATA
pred_test1<-predict(model_ada,test1)
library(caret)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6228 349
1 1218 706
Accuracy : 0.8157
95% CI : (0.8073, 0.8239)
No Information Rate : 0.8759
P-Value [Acc > NIR] : 1
Kappa : 0.3736
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8364
Specificity : 0.6692
Pos Pred Value : 0.9469
Neg Pred Value : 0.3669
Prevalence : 0.8759
Detection Rate : 0.7326
Detection Prevalence : 0.7737
Balanced Accuracy : 0.7528
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6569401
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

pred_test2<-predict(model_ada,test2)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6277 335
1 1207 682
Accuracy : 0.8186
95% CI : (0.8103, 0.8267)
No Information Rate : 0.8804
P-Value [Acc > NIR] : 1
Kappa : 0.3716
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8387
Specificity : 0.6706
Pos Pred Value : 0.9493
Neg Pred Value : 0.3610
Prevalence : 0.8804
Detection Rate : 0.7384
Detection Prevalence : 0.7778
Balanced Accuracy : 0.7547
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6551861
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

RESHUFFLE THE DATA
set.seed(2)
RESAMPLING THE DATA
library(caret)
splitccdefault<-createDataPartition(ccdefault$`default payment next month`,p=0.4,list = FALSE)
train<-ccdefault[splitccdefault,]
test<-ccdefault[-splitccdefault,]
splittest<-createDataPartition(test$`default payment next month`,p=0.5,list = FALSE)
test1<-test[splittest,]
test2<-test[-splittest,]
dim(train)
[1] 11335 22
dim(test1)
[1] 8501 22
dim(test2)
[1] 8501 22
MODEL BUILDING USING ADAPTIVE BOOSTING
library(ada)
model_ada<-ada(`default payment next month` ~ ., data=train,loss="exponential",type="discrete",iter=100)
model_ada
Call:
ada(`default payment next month` ~ ., data = train, loss = "exponential",
type = "discrete", iter = 100)
Loss: exponential Method: discrete Iteration: 100
Final Confusion Matrix for Data:
Final Prediction
True value 0 1
0 8338 441
1 1559 997
Train Error: 0.176
Out-Of-Bag Error: 0.176 iteration= 100
Additional Estimates of number of iterations:
train.err1 train.kap1
78 10
plot(model_ada)

PREDICTING ON TRAIN DATA
pred_train<-predict(model_ada,train)
library(caret)
confusionMatrix(as.factor(train$`default payment next month`),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 8338 441
1 1559 997
Accuracy : 0.8236
95% CI : (0.8164, 0.8305)
No Information Rate : 0.8731
P-Value [Acc > NIR] : 1
Kappa : 0.4022
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8425
Specificity : 0.6933
Pos Pred Value : 0.9498
Neg Pred Value : 0.3901
Prevalence : 0.8731
Detection Rate : 0.7356
Detection Prevalence : 0.7745
Balanced Accuracy : 0.7679
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6699145
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICTING ON TEST1 AND TEST2 DATA
pred_test1<-predict(model_ada,test1)
library(caret)
confusionMatrix(as.factor(test1$`default payment next month`),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6222 324
1 1220 735
Accuracy : 0.8184
95% CI : (0.81, 0.8265)
No Information Rate : 0.8754
P-Value [Acc > NIR] : 1
Kappa : 0.389
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8361
Specificity : 0.6941
Pos Pred Value : 0.9505
Neg Pred Value : 0.3760
Prevalence : 0.8754
Detection Rate : 0.7319
Detection Prevalence : 0.7700
Balanced Accuracy : 0.7651
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6632316
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

pred_test2<-predict(model_ada,test2)
confusionMatrix(as.factor(test2$`default payment next month`),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 6278 375
1 1160 688
Accuracy : 0.8194
95% CI : (0.8111, 0.8276)
No Information Rate : 0.875
P-Value [Acc > NIR] : 1
Kappa : 0.3732
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8440
Specificity : 0.6472
Pos Pred Value : 0.9436
Neg Pred Value : 0.3723
Prevalence : 0.8750
Detection Rate : 0.7385
Detection Prevalence : 0.7826
Balanced Accuracy : 0.7456
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$`default payment next month`))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6579644
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

SUMMARY ON DIFFERENT SHUFFLED DATA
Here is a summary of the adaptive boosting model's performance on the shuffled datasets.
FINAL SUMMARY
| Data | Split | Accuracy | Kappa | Sensitivity | Specificity | Balanced Accuracy | AUC |
|---|---|---|---|---|---|---|---|
| original split | train | 82 | 37.8 | 83.8 | 68.7 | 76.3 | 65.7 |
| original split | test1 | 82 | 38.1 | 84 | 67.7 | 76 | 65.9 |
| original split | test2 | 81.7 | 37.6 | 83.5 | 68.6 | 76 | 65.7 |
| reshuffle_data1 | train | 82.3 | 39 | 84 | 70 | 77 | 66.3 |
| reshuffle_data1 | test1 | 81.6 | 37.4 | 83.6 | 66.9 | 75.3 | 65.7 |
| reshuffle_data1 | test2 | 81.9 | 37.2 | 83.9 | 67 | 75.5 | 65.5 |
| reshuffle_data2 | train | 82.3 | 38.9 | 83.9 | 69.9 | 76.9 | 66.2 |
| reshuffle_data2 | test1 | 81.5 | 37.3 | 83.6 | 66.9 | 75.2 | 65.6 |
| reshuffle_data2 | test2 | 81.8 | 37.1 | 83.8 | 67 | 75.4 | 65.5 |
| reshuffle_data3 | train | 82.3 | 40.2 | 84.2 | 69.3 | 76.8 | 67 |
| reshuffle_data3 | test1 | 81.8 | 39 | 83.6 | 69.4 | 76.5 | 66.3 |
| reshuffle_data3 | test2 | 81.9 | 37.3 | 84.4 | 64.7 | 74.5 | 65.8 |

(all values in %)
From the above summary we can conclude that the model performs consistently well on all the parameters across the reshuffles, so adaptive boosting is the best model for this dataset.
FINAL DATA WITH PREDICTED VALUES
library(caret)
predicted<-predict(model_ada,ccdefault)
library(dplyr)
Attaching package: 'dplyr'
The following object is masked from 'package:xgboost':
slice
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
ccdefault<-mutate(ccdefault,expected_default.payment.next.month=predicted)
FINAL OBSERVED AND EXPECTED PAYMENT DEFAULTERS (displaying only 33 observations)
| LIMIT_BAL | EDUCATION | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month | expected_default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20000 | 2 | 24 | 2 | 2 | -1 | -1 | -2 | -2 | 3913 | 3102 | 689 | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 | 1 |
| 120000 | 2 | 26 | -1 | 2 | 0 | 0 | 0 | 2 | 2682 | 1725 | 2682 | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 | 0 |
| 90000 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | 0 | 29239 | 14027 | 13559 | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 | 0 |
| 50000 | 2 | 37 | 0 | 0 | 0 | 0 | 0 | 0 | 46990 | 48233 | 49291 | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 | 0 |
| 50000 | 2 | 57 | -1 | 0 | -1 | 0 | 0 | 0 | 8617 | 5670 | 35835 | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 | 0 |
| 50000 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | 0 | 64400 | 57069 | 57608 | 19394 | 19619 | 20024 | 2500 | 1815 | 657 | 1000 | 1000 | 800 | 0 | 0 |
| 500000 | 1 | 29 | 0 | 0 | 0 | 0 | 0 | 0 | 367965 | 412023 | 445007 | 542653 | 483003 | 473944 | 55000 | 40000 | 38000 | 20239 | 13750 | 13770 | 0 | 0 |
| 140000 | 3 | 28 | 0 | 0 | 2 | 0 | 0 | 0 | 11285 | 14096 | 12108 | 12211 | 11793 | 3719 | 3329 | 0 | 432 | 1000 | 1000 | 1000 | 0 | 0 |
| 20000 | 3 | 35 | -2 | -2 | -2 | -2 | -1 | -1 | 0 | 0 | 0 | 0 | 13007 | 13912 | 0 | 0 | 0 | 13007 | 1122 | 0 | 0 | 0 |
| 200000 | 3 | 34 | 0 | 0 | 2 | 0 | 0 | -1 | 11073 | 9787 | 5535 | 2513 | 1828 | 3731 | 2306 | 12 | 50 | 300 | 3738 | 66 | 0 | 0 |
| 260000 | 1 | 51 | -1 | -1 | -1 | -1 | -1 | 2 | 12261 | 21670 | 9966 | 8517 | 22287 | 13668 | 21818 | 9966 | 8583 | 22301 | 0 | 3640 | 0 | 0 |
| 630000 | 2 | 41 | -1 | 0 | -1 | -1 | -1 | -1 | 12137 | 6500 | 6500 | 6500 | 6500 | 2870 | 1000 | 6500 | 6500 | 6500 | 2870 | 0 | 0 | 0 |
| 70000 | 2 | 30 | 1 | 2 | 2 | 0 | 0 | 2 | 65802 | 67369 | 65701 | 66782 | 36137 | 36894 | 3200 | 0 | 3000 | 3000 | 1500 | 0 | 1 | 0 |
| 250000 | 1 | 29 | 0 | 0 | 0 | 0 | 0 | 0 | 70887 | 67060 | 63561 | 59696 | 56875 | 55512 | 3000 | 3000 | 3000 | 3000 | 3000 | 3000 | 0 | 0 |
| 50000 | 3 | 23 | 1 | 2 | 0 | 0 | 0 | 0 | 50614 | 29173 | 28116 | 28771 | 29531 | 30211 | 0 | 1500 | 1100 | 1200 | 1300 | 1100 | 0 | 0 |
| 20000 | 1 | 24 | 0 | 0 | 2 | 2 | 2 | 2 | 15376 | 18010 | 17428 | 18338 | 17905 | 19104 | 3200 | 0 | 1500 | 0 | 1650 | 0 | 1 | 0 |
| 320000 | 1 | 49 | 0 | 0 | 0 | -1 | -1 | -1 | 253286 | 246536 | 194663 | 70074 | 5856 | 195599 | 10358 | 10000 | 75940 | 20000 | 195599 | 50000 | 0 | 0 |
| 360000 | 1 | 49 | 1 | -2 | -2 | -2 | -2 | -2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 180000 | 1 | 29 | 1 | -2 | -2 | -2 | -2 | -2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 130000 | 3 | 39 | 0 | 0 | 0 | 0 | 0 | -1 | 38358 | 27688 | 24489 | 20616 | 11802 | 930 | 3000 | 1537 | 1000 | 2000 | 930 | 33764 | 0 | 0 |
| 120000 | 2 | 39 | -1 | -1 | -1 | -1 | -1 | -1 | 316 | 316 | 316 | 0 | 632 | 316 | 316 | 316 | 0 | 632 | 316 | 0 | 1 | 0 |
| 70000 | 2 | 26 | 2 | 0 | 0 | 2 | 2 | 2 | 41087 | 42445 | 45020 | 44006 | 46905 | 46012 | 2007 | 3582 | 0 | 3601 | 0 | 1820 | 1 | 1 |
| 450000 | 1 | 40 | -2 | -2 | -2 | -2 | -2 | -2 | 5512 | 19420 | 1473 | 560 | 0 | 0 | 19428 | 1473 | 560 | 0 | 0 | 1128 | 1 | 0 |
| 90000 | 1 | 23 | 0 | 0 | 0 | -1 | 0 | 0 | 4744 | 7070 | 0 | 5398 | 6360 | 8292 | 5757 | 0 | 5398 | 1200 | 2045 | 2000 | 0 | 0 |
| 50000 | 3 | 23 | 0 | 0 | 0 | 0 | 0 | 0 | 47620 | 41810 | 36023 | 28967 | 29829 | 30046 | 1973 | 1426 | 1001 | 1432 | 1062 | 997 | 0 | 0 |
| 50000 | 3 | 30 | 0 | 0 | 0 | 0 | 0 | 0 | 22541 | 16138 | 17163 | 17878 | 18931 | 19617 | 1300 | 1300 | 1000 | 1500 | 1000 | 1012 | 0 | 0 |
| 50000 | 3 | 47 | -1 | -1 | -1 | -1 | -1 | -1 | 650 | 3415 | 3416 | 2040 | 30430 | 257 | 3415 | 3421 | 2044 | 30430 | 257 | 0 | 0 | 0 |
| 50000 | 1 | 26 | 0 | 0 | 0 | 0 | 0 | 0 | 15329 | 16575 | 17496 | 17907 | 18375 | 11400 | 1500 | 1500 | 1000 | 1000 | 1600 | 0 | 0 | 0 |
| 230000 | 1 | 27 | -1 | -1 | -1 | -1 | -1 | -1 | 16646 | 17265 | 13266 | 15339 | 14307 | 36923 | 17270 | 13281 | 15339 | 14307 | 37292 | 0 | 0 | 0 |
| 50000 | 2 | 33 | 2 | 0 | 0 | 0 | 0 | 0 | 30518 | 29618 | 22102 | 22734 | 23217 | 23680 | 1718 | 1500 | 1000 | 1000 | 1000 | 716 | 1 | 1 |
| 100000 | 1 | 32 | 0 | 0 | 0 | 0 | 0 | 0 | 93036 | 84071 | 82880 | 80958 | 78703 | 75589 | 3023 | 3511 | 3302 | 3204 | 3200 | 2504 | 0 | 0 |
| 500000 | 2 | 54 | -2 | -2 | -2 | -2 | -2 | -2 | 10929 | 4152 | 22722 | 7521 | 71439 | 8981 | 4152 | 22827 | 7521 | 71439 | 981 | 51582 | 0 | 0 |
| 500000 | 1 | 58 | -2 | -2 | -2 | -2 | -2 | -2 | 13709 | 5006 | 31130 | 3180 | 0 | 5293 | 5006 | 31178 | 3180 | 0 | 5293 | 768 | 0 | 0 |
CONCLUSION
As banks play a vital role in providing financial services to their customers, they should avoid lending to customers who are likely to default and cause losses to the banks and other financial institutions.
To support that goal, we can conclude that the model developed here uses all the available factors and data and shows good potential for predicting whether a customer will fail or succeed in making the next payment, which will help banks and financial institutions in their decision making; hence the model can be deployed.