PROBLEM STATEMENT

Can we predict whether an applicant requesting a home equity loan will repay the loan or become delinquent?

DATASET INFORMATION

This is a modified version of the HMEQ dataset. The HMEQ data set reports characteristics and delinquency information for 5,960 home equity loans. A home equity loan is a loan where the obligor uses the equity of his or her home as the underlying collateral.
library(readxl)
hmeq1<-read_excel("D:/projects/hmeq.xlsx")
View(hmeq1)
HMEQ DATASET (displaying first 11 observations)
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
1 1100 25860 39025 HomeImp Other 10.5 0 0 94.36667 1 9 NA
1 1300 70053 68400 HomeImp Other 7.0 0 2 121.83333 0 14 NA
1 1500 13500 16700 HomeImp Other 4.0 0 0 149.46667 1 10 NA
1 1500 NA NA NA NA NA NA NA NA NA NA NA
0 1700 97800 112000 HomeImp Office 3.0 0 0 93.33333 0 14 NA
1 1700 30548 40320 HomeImp Other 9.0 0 0 101.46600 1 8 37.11361
1 1800 48649 57037 HomeImp Other 5.0 3 2 77.10000 1 17 NA
1 1800 28502 43034 HomeImp Other 11.0 0 0 88.76603 0 8 36.88489
1 2000 32700 46740 HomeImp Other 3.0 0 2 216.93333 1 12 NA
1 2000 NA 62250 HomeImp Sales 16.0 0 0 115.80000 0 13 NA
1 2000 22608 NA NA NA 18.0 NA NA NA NA NA NA

ATTRIBUTES INFORMATION

  BAD: '1' refers to an applicant who defaulted on the loan (seriously delinquent) and '0' refers to an applicant who repaid the loan (not delinquent).
  LOAN: Amount of the loan request
  MORTDUE: Amount due on existing mortgage
  VALUE: Value of current property
  REASON: DebtCon = debt consolidation; HomeImp = home improvement
  JOB: Occupational categories (job categories)
  YOJ: Years at present job
  DEROG: Number of major derogatory reports
  DELINQ: Number of delinquent credit lines
  CLAGE: Age of oldest credit line in months
  NINQ: Number of recent credit inquiries
  CLNO: Number of credit lines
  DEBTINC: Debt-to-income ratio
Here, 'BAD' is the binary outcome variable and all remaining variables are predictors.

DATA PRE-PROCESSING

Convert The Classes

hmeq1$REASON<-as.factor(hmeq1$REASON)
hmeq1$JOB<-as.factor(hmeq1$JOB)
str(hmeq1)
Classes 'tbl_df', 'tbl' and 'data.frame':   5960 obs. of  13 variables:
 $ BAD    : num  1 1 1 1 0 1 1 1 1 1 ...
 $ LOAN   : num  1100 1300 1500 1500 1700 1700 1800 1800 2000 2000 ...
 $ MORTDUE: num  25860 70053 13500 NA 97800 ...
 $ VALUE  : num  39025 68400 16700 NA 112000 ...
 $ REASON : Factor w/ 2 levels "DebtCon","HomeImp": 2 2 2 NA 2 2 2 2 2 2 ...
 $ JOB    : Factor w/ 6 levels "Mgr","Office",..: 3 3 3 NA 2 3 3 3 3 5 ...
 $ YOJ    : num  10.5 7 4 NA 3 9 5 11 3 16 ...
 $ DEROG  : num  0 0 0 NA 0 0 3 0 0 0 ...
 $ DELINQ : num  0 2 0 NA 0 0 2 0 2 0 ...
 $ CLAGE  : num  94.4 121.8 149.5 NA 93.3 ...
 $ NINQ   : num  1 0 1 NA 0 1 1 0 1 0 ...
 $ CLNO   : num  9 14 10 NA 14 8 17 8 12 13 ...
 $ DEBTINC: num  NA NA NA NA NA ...
Here, 'REASON' and 'JOB' are categorical variables that are read in as text, so they are converted to factors before the analysis proceeds.

Check For Missing Values

summary(hmeq1)
      BAD              LOAN          MORTDUE           VALUE       
 Min.   :0.0000   Min.   : 1100   Min.   :  2063   Min.   :  8000  
 1st Qu.:0.0000   1st Qu.:11100   1st Qu.: 46276   1st Qu.: 66076  
 Median :0.0000   Median :16300   Median : 65019   Median : 89236  
 Mean   :0.1995   Mean   :18608   Mean   : 73761   Mean   :101776  
 3rd Qu.:0.0000   3rd Qu.:23300   3rd Qu.: 91488   3rd Qu.:119824  
 Max.   :1.0000   Max.   :89900   Max.   :399550   Max.   :855909  
                                  NA's   :518      NA's   :112     
     REASON          JOB            YOJ             DEROG        
 DebtCon:3928   Mgr    : 767   Min.   : 0.000   Min.   : 0.0000  
 HomeImp:1780   Office : 948   1st Qu.: 3.000   1st Qu.: 0.0000  
 NA's   : 252   Other  :2388   Median : 7.000   Median : 0.0000  
                ProfExe:1276   Mean   : 8.922   Mean   : 0.2546  
                Sales  : 109   3rd Qu.:13.000   3rd Qu.: 0.0000  
                Self   : 193   Max.   :41.000   Max.   :10.0000  
                NA's   : 279   NA's   :515      NA's   :708      
     DELINQ            CLAGE             NINQ             CLNO     
 Min.   : 0.0000   Min.   :   0.0   Min.   : 0.000   Min.   : 0.0  
 1st Qu.: 0.0000   1st Qu.: 115.1   1st Qu.: 0.000   1st Qu.:15.0  
 Median : 0.0000   Median : 173.5   Median : 1.000   Median :20.0  
 Mean   : 0.4494   Mean   : 179.8   Mean   : 1.186   Mean   :21.3  
 3rd Qu.: 0.0000   3rd Qu.: 231.6   3rd Qu.: 2.000   3rd Qu.:26.0  
 Max.   :15.0000   Max.   :1168.2   Max.   :17.000   Max.   :71.0  
 NA's   :580       NA's   :308      NA's   :510      NA's   :222   
    DEBTINC        
 Min.   :  0.5245  
 1st Qu.: 29.1400  
 Median : 34.8183  
 Mean   : 33.7799  
 3rd Qu.: 39.0031  
 Max.   :203.3121  
 NA's   :1267      
table(is.na(hmeq1))

FALSE  TRUE 
72209  5271 
dim(hmeq1)
[1] 5960   13
hmeq1<-na.omit(hmeq1)
dim(hmeq1)
[1] 3364   13
From the summary we can see that every predictor except LOAN has missing values, from 112 (VALUE) to 1267 (DEBTINC); in total 5271 of the 77480 cells are missing. Individually these counts are modest, so no variable is dropped. Instead, the incomplete rows are removed with na.omit(): 2596 of the 5960 observations (nearly 44%) contain at least one missing value, which leaves 3364 complete observations.
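The difference between missing cells and incomplete rows can be checked directly. The following is a minimal base R sketch on a small made-up data frame (the object `toy` and its values are illustrative assumptions, not the HMEQ data):

```r
# Toy data frame standing in for hmeq1 (values are made up for illustration)
toy <- data.frame(
  LOAN    = c(1100, 1300, 1500, 1700),
  MORTDUE = c(25860, NA, 13500, 97800),
  DEBTINC = c(NA, NA, 37.1, 36.9)
)

na_per_column <- colSums(is.na(toy))        # missing cells in each variable
rows_with_na  <- sum(!complete.cases(toy))  # rows with at least one NA
share_dropped <- rows_with_na / nrow(toy)   # fraction of rows na.omit() removes

toy_complete <- na.omit(toy)

print(na_per_column)
print(share_dropped)
print(nrow(toy_complete))
```

A few missing cells per column can still add up to many incomplete rows, which is why the share of dropped rows is worth reporting alongside the per-variable counts.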

Check For Outliers

boxplot(hmeq1)

dim(hmeq1)
[1] 3364   13
hmeq1<-hmeq1[hmeq1$MORTDUE<4e+05,]
hmeq1<-hmeq1[hmeq1$VALUE<3e+05,]
boxplot(hmeq1)

dim(hmeq1)
[1] 3351   13
View(hmeq1)
Here the boxplot shows outliers in several variables. Only the most extreme ones, which lie far beyond the rest (MORTDUE of 400,000 or more, VALUE of 300,000 or more), are omitted: 13 observations in total.
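The cut-offs above were chosen by eye from the boxplot. As a rough sketch of the same idea, the snippet below flags points beyond the boxplot whiskers with boxplot.stats() and then applies a hard threshold; the vector `value` is made up for illustration and is not taken from hmeq1:

```r
# Made-up property values; the last one is an extreme outlier
value <- c(39025, 68400, 57037, 62250, 112000, 89236, 101776, 855909)

# Points beyond the boxplot whiskers (1.5 * IQR rule)
flagged <- boxplot.stats(value)$out

# Hard cut-off, analogous to hmeq1$VALUE < 3e+05 above
keep    <- value < 3e+05
trimmed <- value[keep]

print(flagged)
print(length(value) - length(trimmed))  # number of observations dropped
```

Note that subsetting with a logical condition returns NA rows when the column contains NA; this is not an issue above because na.omit() has already been applied.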

Therefore, after data preprocessing the dataset has 3351 observations.

RESAMPLING THE PRE-PROCESSED DATA

library(caret)
Loading required package: lattice
Loading required package: ggplot2
split<-createDataPartition(hmeq1$BAD,p=0.6,list = FALSE)
train<-hmeq1[split,]
test<-hmeq1[-split,]
dim(train)
[1] 2011   13
dim(test)
[1] 1340   13
Here, the hmeq1 dataset is split into train data with 2011 observations (60% of hmeq1) and test data with 1340 observations (40% of hmeq1).
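For intuition, a stratified 60/40 split can be sketched in base R as below. This is only an illustration under stated assumptions: the data frame `toy`, the seed value and the helper logic are hypothetical, and this is not how createDataPartition() is implemented internally.

```r
set.seed(42)  # arbitrary seed, chosen only so the sketch is reproducible

# Made-up outcome with 9% defaulters
toy <- data.frame(id = 1:1000, BAD = rep(c(0, 1), times = c(910, 90)))

# Sample 60% of the row indices within each class of BAD
idx_by_class <- split(seq_len(nrow(toy)), toy$BAD)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  sample(i, size = round(0.6 * length(i)))
}))

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

print(dim(train_toy))
print(mean(train_toy$BAD))  # class share is preserved in the training part
print(mean(test_toy$BAD))
```

Sampling within each class keeps the share of defaulters the same in both parts, and setting a seed before the split makes the partition repeatable across runs.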

Now let's build several models on the train data, compare them to select the best one, and later predict on the test data.

BINARY LOGISTIC REGRESSION

MODEL BUILDING

model_bin<-glm(BAD ~ ., data=train,family = binomial(link = "logit"))
exp(model_bin$coefficients)
  (Intercept)          LOAN       MORTDUE         VALUE REASONHomeImp 
    0.0144330     0.9999823     0.9999942     1.0000017     0.9653005 
    JOBOffice      JOBOther    JOBProfExe      JOBSales       JOBSelf 
    0.5024415     0.7350595     0.6821242     2.0913331     0.9547095 
          YOJ         DEROG        DELINQ         CLAGE          NINQ 
    0.9935786     1.8228026     2.0282456     0.9939706     1.0980477 
         CLNO       DEBTINC 
    0.9914871     1.1008042 
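After building the binary logistic regression model, the exponentiated coefficients are interpreted as odds ratios, not probabilities. The exponentiated intercept (0.0144) is the odds of default when all numeric predictors are zero and the factors are at their reference levels. When LOAN increases by one unit, the odds of default are multiplied by 0.99998; each additional delinquent credit line (DELINQ) roughly doubles the odds (2.03), and similarly for the other predictors.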
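To make the odds-ratio reading concrete, here is a small sketch that reuses three of the exponentiated coefficients printed above. The variable names are illustrative; only the numbers come from the output.

```r
# Exponentiated coefficients copied from the output above
odds_ratio <- c(intercept = 0.0144330, LOAN = 0.9999823, DELINQ = 2.0282456)

# Odds -> probability for the baseline case (all numeric predictors zero,
# factors at their reference levels)
baseline_prob <- odds_ratio["intercept"] / (1 + odds_ratio["intercept"])

# Effect of a 10,000-unit increase in LOAN on the odds of default
loan_10k_effect <- odds_ratio["LOAN"]^10000

# Odds of default for the baseline case with one delinquent credit line
odds_one_delinq <- odds_ratio["intercept"] * odds_ratio["DELINQ"]

print(baseline_prob)
print(loan_10k_effect)
print(odds_one_delinq)
```

Odds ratios multiply, so the per-unit effect of LOAN looks negligible but compounds over realistic loan differences of several thousand units.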

PREDICT ON TRAIN DATA

library(caret)
pred<-predict(model_bin,train)
pred_train<-ifelse(pred>0.5,1,0)
confusionMatrix(as.factor(train$BAD),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1828    5
         1  149   29
                                          
               Accuracy : 0.9234          
                 95% CI : (0.9109, 0.9347)
    No Information Rate : 0.9831          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.2524          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.9246          
            Specificity : 0.8529          
         Pos Pred Value : 0.9973          
         Neg Pred Value : 0.1629          
             Prevalence : 0.9831          
         Detection Rate : 0.9090          
   Detection Prevalence : 0.9115          
      Balanced Accuracy : 0.8888          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
Loading required package: gplots

Attaching package: 'gplots'
The following object is masked from 'package:stats':

    lowess
pr<-prediction(as.numeric(pred_train),as.numeric(train$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.5800968
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

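Here 1857 of the 2011 training observations are classified correctly (92% accuracy). Kappa is low at 0.25 and the area under the curve is only 0.58, while sensitivity (0.92), specificity (0.85) and balanced accuracy (0.89) are good, so the model is only moderately satisfactory. Two caveats apply: predict() is called without type = "response", so the 0.5 cut-off is applied to log-odds rather than to probabilities; and here, as in the later confusion matrices, the actual labels are passed as the first argument of confusionMatrix(), so the 'Prediction' rows hold the actual classes and the 'Reference' columns hold the predicted classes.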
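The effect of the prediction scale can be seen on simulated data. The sketch below is not a rerun of the model above; the data, seed and object names are assumptions made purely for illustration.

```r
set.seed(1)  # arbitrary seed for a reproducible toy example

# Simulated data: one predictor, binary outcome
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-1 + 1.5 * x))
fit <- glm(y ~ x, family = binomial(link = "logit"))

log_odds <- predict(fit)                     # default: link (log-odds) scale
prob     <- predict(fit, type = "response")  # probability scale

# A 0.5 cut-off on probabilities corresponds to a 0 cut-off on log-odds
class_from_prob     <- ifelse(prob > 0.5, 1, 0)
class_from_log_odds <- ifelse(log_odds > 0, 1, 0)

# Cutting the log-odds at 0.5 is stricter: it equals a probability cut-off
# of plogis(0.5), roughly 0.62
class_strict <- ifelse(log_odds > 0.5, 1, 0)

print(all(class_from_prob == class_from_log_odds))
print(plogis(0.5))
print(table(actual = y, predicted = class_from_prob))
```

Thresholding log-odds at 0.5 therefore labels fewer applicants as defaulters than a 0.5 probability cut-off would.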

NAIVE BAYES CLASSIFIER

MODEL BUILDING

library(naivebayes)
train$BAD<-as.factor(train$BAD)
class(train$BAD)
[1] "factor"
model_nb<-naive_bayes(BAD ~ ., data=train)
model_nb$prior

         0          1 
0.91148682 0.08851318 
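The model starts from the prior probabilities of the outcome classes (about 91% non-defaulters and 9% defaulters in the train data), models each numeric predictor with a class-conditional Gaussian distribution, and combines these through Bayes' rule to estimate the posterior probabilities.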
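The calculation can be sketched by hand for a single numeric predictor. The data below are made up, and the code is a simplified illustration of the Gaussian naive Bayes idea rather than the internals of naive_bayes():

```r
# Made-up training data: one numeric predictor and a binary outcome
debtinc <- c(30, 32, 35, 33, 31, 45, 48, 50)
bad     <- factor(c(0, 0, 0, 0, 0, 1, 1, 1))
classes <- levels(bad)

# Class proportions (the priors) and class-conditional Gaussian parameters
prior <- sapply(classes, function(k) mean(bad == k))
mu    <- sapply(classes, function(k) mean(debtinc[bad == k]))
sigma <- sapply(classes, function(k) sd(debtinc[bad == k]))

# Posterior for a new applicant with DEBTINC = 40 via Bayes' rule
x_new      <- 40
likelihood <- dnorm(x_new, mean = mu, sd = sigma)
posterior  <- prior * likelihood / sum(prior * likelihood)

print(prior)
print(round(posterior, 3))
```

With several predictors, the class-conditional likelihoods are multiplied together, which is the "naive" independence assumption.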

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_nb,train)
confusionMatrix(as.factor(train$BAD),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1713  120
         1  112   66
                                          
               Accuracy : 0.8846          
                 95% CI : (0.8699, 0.8983)
    No Information Rate : 0.9075          
    P-Value [Acc > NIR] : 0.9997          
                                          
                  Kappa : 0.2992          
 Mcnemar's Test P-Value : 0.6458          
                                          
            Sensitivity : 0.9386          
            Specificity : 0.3548          
         Pos Pred Value : 0.9345          
         Neg Pred Value : 0.3708          
             Prevalence : 0.9075          
         Detection Rate : 0.8518          
   Detection Prevalence : 0.9115          
      Balanced Accuracy : 0.6467          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.65266
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here 232 of the 2011 training observations are misclassified. Accuracy (88%) and sensitivity (0.94) are good, but specificity (0.35), kappa (0.30) and balanced accuracy (0.65) are low, so this is a weak model.
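Kappa corrects accuracy for the agreement expected by chance, which matters with a 91/9 class split. As a check, the sketch below recomputes accuracy and kappa from the counts in the confusion matrix above (the object names are illustrative):

```r
# Counts copied from the naive Bayes confusion matrix above
# (rows and columns in the same 0/1 order as the printed table)
cm <- matrix(c(1713, 112, 120, 66), nrow = 2,
             dimnames = list(row = c("0", "1"), col = c("0", "1")))

n         <- sum(cm)
observed  <- sum(diag(cm)) / n                     # accuracy
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
kappa_hat <- (observed - expected) / (1 - expected)

print(round(observed, 4))
print(round(kappa_hat, 4))
```

Because about 84% agreement is expected by chance alone here, an accuracy of 88% translates into a kappa of only about 0.30.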

CART DECISION TREE

MODEL BUILDING

library(rpart)
model_cart<-rpart(BAD ~ ., data=train)
model_cart$cptable
          CP nsplit rel error    xerror       xstd
1 0.20786517      0 1.0000000 1.0000000 0.07155915
2 0.03932584      1 0.7921348 0.8258427 0.06557761
3 0.02808989      2 0.7528090 0.8370787 0.06598678
4 0.02059925      3 0.7247191 0.7977528 0.06453908
5 0.01966292      6 0.6629213 0.7977528 0.06453908
6 0.01685393      8 0.6235955 0.8258427 0.06557761
7 0.01000000     10 0.5898876 0.8089888 0.06495721
model_cart$variable.importance
   DEBTINC      CLAGE     DELINQ       CLNO       LOAN      VALUE 
76.9271006 22.6624597 18.5424709 14.2202477 12.9673186 12.6965240 
     DEROG        JOB    MORTDUE       NINQ        YOJ 
11.5440371  3.3831320  2.0838004  1.9725113  0.6210879 

In this decision tree model, the error is reported for complexity parameters from 0.208 down to 0.01. The relative training error falls steadily to 0.59 at 10 splits, while the cross-validated error (xerror) is lowest, at 0.798, for 3 to 6 splits. The variable importance for estimating defaulters is shown above, with DEBTINC being by far the most important.
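One common way to read such a table is to pick the row with the lowest xerror, or the simplest tree within one standard error of it. The sketch below applies both rules to the values printed above; the data frame `cptable` is rebuilt by hand for illustration:

```r
# Cross-validation columns copied from model_cart$cptable above
cptable <- data.frame(
  CP     = c(0.20786517, 0.03932584, 0.02808989, 0.02059925,
             0.01966292, 0.01685393, 0.01000000),
  nsplit = c(0, 1, 2, 3, 6, 8, 10),
  xerror = c(1.0000000, 0.8258427, 0.8370787, 0.7977528,
             0.7977528, 0.8258427, 0.8089888),
  xstd   = c(0.07155915, 0.06557761, 0.06598678, 0.06453908,
             0.06453908, 0.06557761, 0.06495721)
)

# Row with the lowest cross-validated error (first one in case of ties)
best <- which.min(cptable$xerror)

# One-standard-error rule: simplest tree within one xstd of that minimum
threshold <- cptable$xerror[best] + cptable$xstd[best]
one_se    <- min(which(cptable$xerror <= threshold))

print(cptable[best, ])
print(cptable[one_se, ])
```

The chosen CP value could then be passed to rpart's prune() to obtain a smaller tree; whether that improves test performance here has not been checked.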

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_cart,train,type="class")
confusionMatrix(as.factor(train$BAD),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1812   21
         1   84   94
                                          
               Accuracy : 0.9478          
                 95% CI : (0.9371, 0.9571)
    No Information Rate : 0.9428          
    P-Value [Acc > NIR] : 0.1814          
                                          
                  Kappa : 0.6149          
 Mcnemar's Test P-Value : 1.443e-09       
                                          
            Sensitivity : 0.9557          
            Specificity : 0.8174          
         Pos Pred Value : 0.9885          
         Neg Pred Value : 0.5281          
             Prevalence : 0.9428          
         Detection Rate : 0.9010          
   Detection Prevalence : 0.9115          
      Balanced Accuracy : 0.8865          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7583166
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here only 105 of the 2011 training observations are misclassified, and accuracy (95%), kappa (0.61) and area under the curve (0.76) are comparatively good, so the model performs well.

SUPPORT VECTOR MACHINES

MODEL BUILDING

library(e1071)
model_svm<-train(BAD ~ ., data=train,method="svmLinear",preProcess=c("center","scale"),tuneLength=10)
model_svm
Support Vector Machines with Linear Kernel 

2011 samples
  12 predictor
   2 classes: '0', '1' 

Pre-processing: centered (16), scaled (16) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 2011, 2011, 2011, 2011, 2011, 2011, ... 
Resampling results:

  Accuracy   Kappa     
  0.9153093  0.08143762

Tuning parameter 'C' was held constant at a value of 1
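Here a linear SVM model is built: among the hyperplanes that separate the classes of the outcome variable, it selects the one with the largest margin, that is, the one farthest from the nearest observations of either class. Under bootstrap resampling the accuracy is 0.92 but kappa is only 0.08, and as the output notes, C was held constant at 1, so no other cost values were compared.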
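The notion of a margin can be illustrated with a hand-picked hyperplane on made-up points. In this sketch `w` and `b` are hypothetical values chosen for illustration, not coefficients fitted by an SVM:

```r
# Made-up, linearly separable 2-D points with labels -1 / +1
X <- rbind(c(1, 1), c(2, 1), c(1, 2),   # class -1
           c(4, 4), c(5, 4), c(4, 5))   # class +1
y <- c(-1, -1, -1, 1, 1, 1)

# A hypothetical separating hyperplane w . x + b = 0 (not fitted by an SVM)
w <- c(1, 1)
b <- -5.5

score     <- as.vector(X %*% w + b)  # signed decision values
predicted <- sign(score)

# Geometric distance of every point to the hyperplane
distance <- abs(score) / sqrt(sum(w^2))

# The margin is set by the closest points (the support vectors)
margin <- min(distance)

print(predicted)
print(round(margin, 3))
```

An SVM searches over `w` and `b` for the hyperplane that makes this smallest distance as large as possible, subject to the cost parameter C.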

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_svm,train)
confusionMatrix(as.factor(train$BAD),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1833    0
         1  174    4
                                          
               Accuracy : 0.9135          
                 95% CI : (0.9003, 0.9254)
    No Information Rate : 0.998           
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.0402          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.91330         
            Specificity : 1.00000         
         Pos Pred Value : 1.00000         
         Neg Pred Value : 0.02247         
             Prevalence : 0.99801         
         Detection Rate : 0.91149         
   Detection Prevalence : 0.91149         
      Balanced Accuracy : 0.95665         
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.511236
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

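Here accuracy, sensitivity and specificity look good, but the inter-rater reliability (kappa 0.04) and the area under the curve (0.51) are close to chance: the model labels only 4 of the 178 defaulters as defaulters, so it is barely better than always predicting 'no default'.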
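The AUC values in this report are computed from hard 0/1 predictions rather than from continuous scores. The sketch below uses a rank-based AUC (the helper `auc_rank` is a hypothetical function written for this illustration) to show that, with hard labels, the AUC reduces to the average of the true positive and true negative rates:

```r
# Rank-based (Mann-Whitney) AUC; ties receive half credit
auc_rank <- function(score, label) {
  r  <- rank(score)  # average ranks for ties
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Rebuild the hard 0/1 predictions from the SVM confusion matrix above:
# 1833 non-defaulters all predicted 0; 178 defaulters, of which 4 predicted 1
actual    <- c(rep(0, 1833), rep(1, 178))
predicted <- c(rep(0, 1833), rep(0, 174), rep(1, 4))

auc_hard <- auc_rank(predicted, actual)
print(round(auc_hard, 6))

# With hard labels the AUC collapses to (TPR + TNR) / 2
tpr <- 4 / 178
tnr <- 1833 / 1833
print((tpr + tnr) / 2)
```

Passing predicted probabilities or decision values to the AUC calculation instead of class labels would rank the applicants and usually gives a more informative curve.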

RANDOM FOREST

MODEL BUILDING

library(randomForest)
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'
The following object is masked from 'package:ggplot2':

    margin
model_rf<-randomForest(BAD ~ ., data=train,mtry=3,ntree=700)
model_rf

Call:
 randomForest(formula = BAD ~ ., data = train, mtry = 3, ntree = 700) 
               Type of random forest: classification
                     Number of trees: 700
No. of variables tried at each split: 3

        OOB estimate of  error rate: 5.82%
Confusion matrix:
     0  1 class.error
0 1828  5 0.002727769
1  112 66 0.629213483
varImpPlot(model_rf)

Here the random forest algorithm builds 700 unpruned trees; pruning is not required because averaging over many trees controls overfitting. The out-of-bag (OOB) error, estimated for each tree on the roughly 37% of training rows left out of its bootstrap sample, is 5.82%. The variable importance plot is also shown, where a variable with a larger mean decrease in Gini is more important.
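The share of out-of-bag rows follows from bootstrap sampling itself. A short simulation, with an arbitrary seed and number of repetitions chosen only for illustration, shows where the figure of about 37% comes from:

```r
set.seed(7)  # arbitrary seed for a reproducible simulation

n <- 2011  # training-set size from above

# Theoretical probability that a given row is left out of one bootstrap sample
oob_theory <- (1 - 1 / n)^n

# Simulated share of out-of-bag rows over 200 bootstrap samples
oob_sim <- mean(replicate(200, {
  in_bag <- sample(n, size = n, replace = TRUE)
  1 - length(unique(in_bag)) / n
}))

print(round(oob_theory, 4))
print(round(oob_sim, 4))
```

Both values are close to exp(-1), so each tree is evaluated on a little over a third of the training rows it never saw.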

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_rf,train)
confusionMatrix(as.factor(train$BAD),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1833    0
         1    0  178
                                     
               Accuracy : 1          
                 95% CI : (0.9982, 1)
    No Information Rate : 0.9115     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.9115     
         Detection Rate : 0.9115     
   Detection Prevalence : 0.9115     
      Balanced Accuracy : 1.0000     
                                     
       'Positive' Class : 0          
                                     
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

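Here all the training observations are predicted correctly, with 100% accuracy. This is measured on the same data the forest was trained on, so the OOB error of 5.82% above is the more realistic estimate.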

ADAPTIVE BOOSTING

MODEL BUILDING

library(ada)
model_ada<-ada(BAD ~ ., data=train,loss='exponential',type='discrete',iter=150)
model_ada
Call:
ada(BAD ~ ., data = train, loss = "exponential", type = "discrete", 
    iter = 150)

Loss: exponential Method: discrete   Iteration: 150 

Final Confusion Matrix for Data:
          Final Prediction
True value    0    1
         0 1833    0
         1   13  165

Train Error: 0.006 

Out-Of-Bag Error:  0.021  iteration= 146 

Additional Estimates of number of iterations:

train.err1 train.kap1 
       143        143 
plot(model_ada)

  Under the discrete adaptive boosting method with exponential loss, a model with 150 iterations is built. The OOB error is only 0.021, and from the plot we can see that the error gradually decreases as the number of iterations increases.
  

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_ada,train)
confusionMatrix(as.factor(train$BAD),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1833    0
         1   13  165
                                         
               Accuracy : 0.9935         
                 95% CI : (0.989, 0.9966)
    No Information Rate : 0.918          
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.9586         
 Mcnemar's Test P-Value : 0.0008741      
                                         
            Sensitivity : 0.9930         
            Specificity : 1.0000         
         Pos Pred Value : 1.0000         
         Neg Pred Value : 0.9270         
             Prevalence : 0.9180         
         Detection Rate : 0.9115         
   Detection Prevalence : 0.9115         
      Balanced Accuracy : 0.9965         
                                         
       'Positive' Class : 0              
                                         
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.9634831
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

  Here, the model combines many weak learners (small decision trees) into a strong learner, and all but 13 of the 2011 training observations are predicted correctly (99% accuracy).
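The reweighting idea behind discrete AdaBoost can be sketched with decision stumps on a made-up one-dimensional data set. This is a simplified illustration: the helper `fit_stump` and the data are hypothetical, and the ada package fits rpart trees with its own options rather than this code.

```r
# Made-up 1-D data with labels in {-1, +1}; no single threshold separates them
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(1, 1, 1, -1, -1, -1, 1, 1, 1, -1)

# Best decision stump under observation weights w:
# predicts `dir` when x < cut and -dir otherwise
fit_stump <- function(x, y, w) {
  best <- list(err = Inf)
  for (cut in seq(min(x) - 0.5, max(x) + 0.5, by = 1)) {
    for (dir in c(1, -1)) {
      pred <- ifelse(x < cut, dir, -dir)
      err  <- sum(w[pred != y])
      if (err < best$err) best <- list(err = err, cut = cut, dir = dir)
    }
  }
  best
}

n_rounds     <- 3
w            <- rep(1 / length(x), length(x))  # start with equal weights
score        <- rep(0, length(x))              # weighted vote of all stumps
err_by_round <- numeric(n_rounds)

for (m in seq_len(n_rounds)) {
  stump <- fit_stump(x, y, w)
  pred  <- ifelse(x < stump$cut, stump$dir, -stump$dir)
  alpha <- 0.5 * log((1 - stump$err) / stump$err)  # weight of this stump
  score <- score + alpha * pred
  w     <- w * exp(-alpha * y * pred)  # up-weight misclassified points
  w     <- w / sum(w)
  err_by_round[m] <- mean(sign(score) != y)
}

print(err_by_round)
```

Each round concentrates weight on the points the previous stumps got wrong, so the weighted vote can classify patterns that no single stump can separate.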
  Now let's compare the performance of all the above models on train data and decide on the best model.
  

SUMMARY

SUMMARY ON TRAIN DATA
(all values in %, taken from the train-data outputs above)
MODEL                        ACCURACY  KAPPA  SENSITIVITY  SPECIFICITY  BALANCED_ACC  AREA_UNDER_CURVE
BINARY_LOGISTIC_REGRESSION       92.3   25.2         92.5         85.3          88.9              58.0
NAIVE_BAYES                      88.5   29.9         93.9         35.5          64.7              65.3
CART(DECISION_TREE)              94.8   61.5         95.6         81.7          88.7              75.8
SVM                              91.3    4.0         91.3        100.0          95.7              51.1
RANDOM_FOREST                   100.0  100.0        100.0        100.0         100.0             100.0
ADAPTIVE_BOOSTING                99.4   95.9         99.3        100.0          99.6              96.3
  By observing the summary we can infer that the CART decision tree, random forest and adaptive boosting models performed very well, whereas the other models performed comparatively poorly.
  However, in the CART decision tree model the inter-rater reliability is a bit low (we generally prefer kappa above 70%), so let's predict on test data using the adaptive boosting model and check its bias and variance.

USING ADAPTIVE BOOSTING

PREDICTING ON TEST DATA

pred_test<-predict(model_ada,test)
confusionMatrix(as.factor(test$BAD),as.factor(pred_test))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1219    3
         1   69   49
                                          
               Accuracy : 0.9463          
                 95% CI : (0.9328, 0.9577)
    No Information Rate : 0.9612          
    P-Value [Acc > NIR] : 0.9971          
                                          
                  Kappa : 0.5524          
 Mcnemar's Test P-Value : 1.855e-14       
                                          
            Sensitivity : 0.9464          
            Specificity : 0.9423          
         Pos Pred Value : 0.9975          
         Neg Pred Value : 0.4153          
             Prevalence : 0.9612          
         Detection Rate : 0.9097          
   Detection Prevalence : 0.9119          
      Balanced Accuracy : 0.9444          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_test),as.numeric(test$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7063996
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

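The above are the values predicted on test data using the adaptive boosting method. Accuracy is good at about 95%, although kappa (0.55) and area under the curve (0.71) are well below their values on train data. Now let's fix a random seed, redraw the train and test partitions with different split ratios, and repeat the analysis on the resampled data.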

SHUFFLE THE DATA

set.seed(123)

RESAMPLE THE DATA

library(caret)
split<-createDataPartition(hmeq1$BAD,p=0.55,list = FALSE)
train<-hmeq1[split,]
test<-hmeq1[-split,]
dim(train)
[1] 1844   13
dim(test)
[1] 1507   13

USING ADAPTIVE BOOSTING

MODEL BUILDING

library(ada)
model_ada<-ada(BAD ~ ., data=train,loss='exponential',type='discrete',iter=150)
model_ada
Call:
ada(BAD ~ ., data = train, loss = "exponential", type = "discrete", 
    iter = 150)

Loss: exponential Method: discrete   Iteration: 150 

Final Confusion Matrix for Data:
          Final Prediction
True value    0    1
         0 1667    0
         1   12  165

Train Error: 0.007 

Out-Of-Bag Error:  0.026  iteration= 147 

Additional Estimates of number of iterations:

train.err1 train.kap1 
       140        140 
plot(model_ada)

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_ada,train)
confusionMatrix(as.factor(train$BAD),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1667    0
         1   12  165
                                          
               Accuracy : 0.9935          
                 95% CI : (0.9887, 0.9966)
    No Information Rate : 0.9105          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9613          
 Mcnemar's Test P-Value : 0.001496        
                                          
            Sensitivity : 0.9929          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.9322          
             Prevalence : 0.9105          
         Detection Rate : 0.9040          
   Detection Prevalence : 0.9040          
      Balanced Accuracy : 0.9964          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.9661017
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICTING ON TEST DATA

pred_test<-predict(model_ada,test)
confusionMatrix(as.factor(test$BAD),as.factor(pred_test))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1384    4
         1   75   44
                                          
               Accuracy : 0.9476          
                 95% CI : (0.9351, 0.9583)
    No Information Rate : 0.9681          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.5045          
 Mcnemar's Test P-Value : 3.391e-15       
                                          
            Sensitivity : 0.9486          
            Specificity : 0.9167          
         Pos Pred Value : 0.9971          
         Neg Pred Value : 0.3697          
             Prevalence : 0.9681          
         Detection Rate : 0.9184          
   Detection Prevalence : 0.9210          
      Balanced Accuracy : 0.9326          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_test),as.numeric(test$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.683433
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

RESHUFFLE THE DATA

set.seed(1234)

RESAMPLE THE DATA

library(caret)
split<-createDataPartition(hmeq1$BAD,p=0.7,list = FALSE)
train<-hmeq1[split,]
test<-hmeq1[-split,]
dim(train)
[1] 2346   13
dim(test)
[1] 1005   13

MODEL BUILDING

library(ada)
model_ada<-ada(BAD ~ ., data=train,loss='exponential',type='discrete',iter=150)
model_ada
Call:
ada(BAD ~ ., data = train, loss = "exponential", type = "discrete", 
    iter = 150)

Loss: exponential Method: discrete   Iteration: 150 

Final Confusion Matrix for Data:
          Final Prediction
True value    0    1
         0 2147    0
         1    8  191

Train Error: 0.003 

Out-Of-Bag Error:  0.02  iteration= 149 

Additional Estimates of number of iterations:

train.err1 train.kap1 
       148        148 
plot(model_ada)

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_ada,train)
confusionMatrix(as.factor(train$BAD),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 2147    0
         1    8  191
                                          
               Accuracy : 0.9966          
                 95% CI : (0.9933, 0.9985)
    No Information Rate : 0.9186          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.9776          
 Mcnemar's Test P-Value : 0.01333         
                                          
            Sensitivity : 0.9963          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.9598          
             Prevalence : 0.9186          
         Detection Rate : 0.9152          
   Detection Prevalence : 0.9152          
      Balanced Accuracy : 0.9981          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.9798995
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICT ON TEST DATA

pred_test<-predict(model_ada,test)
confusionMatrix(as.factor(test$BAD),as.factor(pred_test))
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 907   1
         1  63  34
                                          
               Accuracy : 0.9363          
                 95% CI : (0.9194, 0.9506)
    No Information Rate : 0.9652          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.489           
 Mcnemar's Test P-Value : 2.44e-14        
                                          
            Sensitivity : 0.9351          
            Specificity : 0.9714          
         Pos Pred Value : 0.9989          
         Neg Pred Value : 0.3505          
             Prevalence : 0.9652          
         Detection Rate : 0.9025          
   Detection Prevalence : 0.9035          
      Balanced Accuracy : 0.9532          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_test),as.numeric(test$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6747071
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

RESHUFFLE THE DATA

set.seed(12345)

RESAMPLE THE DATA

library(caret)
split<-createDataPartition(hmeq1$BAD,p=0.8,list = FALSE)
train<-hmeq1[split,]
test<-hmeq1[-split,]
dim(train)
[1] 2681   13
dim(test)
[1] 670  13

MODEL BUILDING

library(ada)
model_ada<-ada(BAD ~ ., data=train,loss='exponential',type='discrete',iter=150)
model_ada
Call:
ada(BAD ~ ., data = train, loss = "exponential", type = "discrete", 
    iter = 150)

Loss: exponential Method: discrete   Iteration: 150 

Final Confusion Matrix for Data:
          Final Prediction
True value    0    1
         0 2444    0
         1   10  227

Train Error: 0.004 

Out-Of-Bag Error:  0.021  iteration= 150 

Additional Estimates of number of iterations:

train.err1 train.kap1 
       135        135 
plot(model_ada)

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_ada,train)
confusionMatrix(as.factor(train$BAD),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 2444    0
         1   10  227
                                          
               Accuracy : 0.9963          
                 95% CI : (0.9932, 0.9982)
    No Information Rate : 0.9153          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9764          
 Mcnemar's Test P-Value : 0.004427        
                                          
            Sensitivity : 0.9959          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.9578          
             Prevalence : 0.9153          
         Detection Rate : 0.9116          
   Detection Prevalence : 0.9116          
      Balanced Accuracy : 0.9980          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.978903
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICT ON TEST DATA

pred_test<-predict(model_ada,test)
confusionMatrix(as.factor(test$BAD),as.factor(pred_test))
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 611   0
         1  31  28
                                         
               Accuracy : 0.9537         
                 95% CI : (0.935, 0.9683)
    No Information Rate : 0.9582         
    P-Value [Acc > NIR] : 0.7555         
                                         
                  Kappa : 0.6223         
 Mcnemar's Test P-Value : 7.118e-08      
                                         
            Sensitivity : 0.9517         
            Specificity : 1.0000         
         Pos Pred Value : 1.0000         
         Neg Pred Value : 0.4746         
             Prevalence : 0.9582         
         Detection Rate : 0.9119         
   Detection Prevalence : 0.9119         
      Balanced Accuracy : 0.9759         
                                         
       'Positive' Class : 0              
                                         
library(ROCR)
pr<-prediction(as.numeric(pred_test),as.numeric(test$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7372881
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

FINAL SUMMARY

SUMMARY ON DIFFERENT TRAIN AND TEST DATA
(all values in %, taken from the adaptive boosting outputs above; split = train/test share)
DATA (SPLIT)     ACCURACY  KAPPA  SENSITIVITY  SPECIFICITY  BALANCED_ACCURACY  AREA_UNDER_CURVE
TRAIN  (60/40)       99.4   95.9         99.3        100.0               99.6              96.3
TRAIN1 (55/45)       99.3   96.1         99.3        100.0               99.6              96.6
TRAIN2 (70/30)       99.7   97.8         99.6        100.0               99.8              98.0
TRAIN3 (80/20)       99.6   97.6         99.6        100.0               99.8              97.9
TEST   (60/40)       94.6   55.2         94.6         94.2               94.4              70.6
TEST1  (55/45)       94.8   50.4         94.9         91.7               93.3              68.3
TEST2  (70/30)       93.6   48.9         93.5         97.1               95.3              67.5
TEST3  (80/20)       95.4   62.2         95.2        100.0               97.6              73.7

CONCLUSION

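From the above summary we can see that the adaptive boosting model performed well on the different train and test combinations drawn from the hmeq1 dataset: accuracy, sensitivity, specificity and balanced accuracy stay within a few points between train and test data. Kappa and area under the curve, however, are lower on test data than on train data (kappa about 49-62% versus 96-98%, area under the curve about 67-74% versus 96-98%), so some variance remains. Judged on the stability of accuracy across splits, bias is low and variance is acceptable, and therefore we can predict customer delinquency (whether or not the customer repays the loan) for applicants requesting home equity loans.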

Hence, we can trust this model, and it is ready to be deployed.

FINAL DATA WITH PREDICTED VALUES

library(caret)
predicted<-predict(model_ada,hmeq1)
library(dplyr)

Attaching package: 'dplyr'
The following object is masked from 'package:randomForest':

    combine
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
hmeq1<-mutate(hmeq1,expected_BAD=predicted)
library(dplyr)
hmeq1<-rename(hmeq1,observed_BAD=BAD)
FINAL DATASET WITH OBSERVED AND EXPECTED DEFAULTERS (displaying only first 20 observations)
observed_BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC expected_BAD
1 1700 30548 40320 HomeImp Other 9 0 0 101.46600 1 8 37.11361 1
1 1800 28502 43034 HomeImp Other 11 0 0 88.76603 0 8 36.88489 1
0 2300 102370 120953 HomeImp Office 2 0 0 90.99253 0 13 31.58850 0
1 2400 34863 47471 HomeImp Mgr 12 0 0 70.49108 1 21 38.26360 1
0 2400 98449 117195 HomeImp Office 4 0 0 93.81177 0 13 29.68183 0
0 2900 103949 112505 HomeImp Office 1 0 0 96.10233 0 13 30.05114 0
0 2900 104373 120702 HomeImp Office 2 0 0 101.54030 0 13 29.91586 0
1 2900 7750 67996 HomeImp Other 16 3 0 122.20466 2 8 36.21135 1
1 2900 61962 70915 DebtCon Mgr 2 0 0 282.80166 3 37 49.20640 1
0 3000 104570 121729 HomeImp Office 2 0 0 85.88437 0 14 32.05978 0
0 3200 74864 87266 HomeImp ProfExe 7 0 0 250.63127 0 12 42.91000 0
1 3300 130518 164317 DebtCon Other 9 0 6 192.28915 0 33 35.73056 1
0 3600 100693 114743 HomeImp Office 6 0 0 88.47045 0 14 29.39354 0
0 3600 52337 63989 HomeImp Office 20 0 0 204.27250 0 20 20.47092 0
1 3700 17857 21144 HomeImp Other 5 0 0 129.71732 1 9 26.63435 1
0 3800 51180 63459 HomeImp Office 20 0 0 203.75153 0 20 20.06704 0
1 3900 29896 45960 HomeImp Other 11 0 0 146.12324 0 14 24.47888 1
0 3900 102143 118742 HomeImp Office 2 0 0 85.27737 0 13 29.34392 0
0 4000 105164 112774 HomeImp Office 1 0 0 94.72487 0 13 29.39093 0
0 4000 54543 61777 HomeImp Office 21 0 0 205.58668 0 19 21.80656 0