PROBLEM STATEMENT

Can we predict whether an applicant requesting a home equity loan will repay the loan or become delinquent?

DATASET INFORMATION

This is a modified version of the HMEQ dataset, which reports characteristics and delinquency information for 5,960 home equity loans. A home equity loan is a loan where the obligor uses the equity of his or her home as the underlying collateral.

library(readxl)
hmeq1<-read_excel("D:/projects/hmeq.xlsx")
View(hmeq1)

HMEQ DATASET (displaying first 11 observations)
BAD	LOAN	MORTDUE	VALUE	REASON	JOB	YOJ	DEROG	DELINQ	CLAGE	NINQ	CLNO	DEBTINC
1	1100	25860	39025	HomeImp	Other	10.5	0	0	94.36667	1	9	NA
1	1300	70053	68400	HomeImp	Other	7.0	0	2	121.83333	0	14	NA
1	1500	13500	16700	HomeImp	Other	4.0	0	0	149.46667	1	10	NA
1	1500	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
0	1700	97800	112000	HomeImp	Office	3.0	0	0	93.33333	0	14	NA
1	1700	30548	40320	HomeImp	Other	9.0	0	0	101.46600	1	8	37.11361
1	1800	48649	57037	HomeImp	Other	5.0	3	2	77.10000	1	17	NA
1	1800	28502	43034	HomeImp	Other	11.0	0	0	88.76603	0	8	36.88489
1	2000	32700	46740	HomeImp	Other	3.0	0	2	216.93333	1	12	NA
1	2000	NA	62250	HomeImp	Sales	16.0	0	0	115.80000	0	13	NA
1	2000	22608	NA	NA	NA	18.0	NA	NA	NA	NA	NA	NA

ATTRIBUTES INFORMATION

  BAD: '1' refers to an applicant who defaulted on the loan (seriously delinquent) and '0' refers to an applicant who repaid the loan (non-delinquent).
  LOAN: Amount of the loan request
  MORTDUE: Amount due on existing mortgage
  VALUE: Value of current property
  REASON: DebtCon = debt consolidation; HomeImp = home improvement
  JOB: Occupational category
  YOJ: Years at present job
  DEROG: Number of major derogatory reports
  DELINQ: Number of delinquent credit lines
  CLAGE: Age of oldest credit line in months
  NINQ: Number of recent credit inquiries
  CLNO: Number of credit lines
  DEBTINC: Debt-to-income ratio
Here, 'BAD' is the binary outcome variable and all remaining variables are predictors.

DATA PRE-PROCESSING

Convert The Classes

hmeq1$REASON<-as.factor(hmeq1$REASON)
hmeq1$JOB<-as.factor(hmeq1$JOB)
str(hmeq1)

Classes 'tbl_df', 'tbl' and 'data.frame':   5960 obs. of  13 variables:
 $ BAD    : num  1 1 1 1 0 1 1 1 1 1 ...
 $ LOAN   : num  1100 1300 1500 1500 1700 1700 1800 1800 2000 2000 ...
 $ MORTDUE: num  25860 70053 13500 NA 97800 ...
 $ VALUE  : num  39025 68400 16700 NA 112000 ...
 $ REASON : Factor w/ 2 levels "DebtCon","HomeImp": 2 2 2 NA 2 2 2 2 2 2 ...
 $ JOB    : Factor w/ 6 levels "Mgr","Office",..: 3 3 3 NA 2 3 3 3 3 5 ...
 $ YOJ    : num  10.5 7 4 NA 3 9 5 11 3 16 ...
 $ DEROG  : num  0 0 0 NA 0 0 3 0 0 0 ...
 $ DELINQ : num  0 2 0 NA 0 0 2 0 2 0 ...
 $ CLAGE  : num  94.4 121.8 149.5 NA 93.3 ...
 $ NINQ   : num  1 0 1 NA 0 1 1 0 1 0 ...
 $ CLNO   : num  9 14 10 NA 14 8 17 8 12 13 ...
 $ DEBTINC: num  NA NA NA NA NA ...

Here, 'REASON' and 'JOB' hold category labels rather than numbers, so they are converted to factors before modelling. The str() output above shows them as factors with 2 and 6 levels respectively.
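As a small illustration of what the conversion does, the sketch below applies as.factor() to a toy vector. The vector `job` is an assumption made up for this example and only stands in for hmeq1$JOB.

```r
# Toy column standing in for hmeq1$JOB (values are illustrative)
job <- c("Other", "Office", "Other", NA, "Sales")

job_f <- as.factor(job)

levels(job_f)                   # distinct categories, sorted alphabetically
as.integer(job_f)               # integer codes stored internally; NA stays NA
table(job_f, useNA = "ifany")   # counts per category, including missing values
```

Modelling functions such as glm() expand a factor into one indicator column per non-reference level, which is why the logistic regression output later lists JOBOffice, JOBOther and so on.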

Check For Missing Values

summary(hmeq1)

      BAD              LOAN          MORTDUE           VALUE       
 Min.   :0.0000   Min.   : 1100   Min.   :  2063   Min.   :  8000  
 1st Qu.:0.0000   1st Qu.:11100   1st Qu.: 46276   1st Qu.: 66076  
 Median :0.0000   Median :16300   Median : 65019   Median : 89236  
 Mean   :0.1995   Mean   :18608   Mean   : 73761   Mean   :101776  
 3rd Qu.:0.0000   3rd Qu.:23300   3rd Qu.: 91488   3rd Qu.:119824  
 Max.   :1.0000   Max.   :89900   Max.   :399550   Max.   :855909  
                                  NA's   :518      NA's   :112     
     REASON          JOB            YOJ             DEROG        
 DebtCon:3928   Mgr    : 767   Min.   : 0.000   Min.   : 0.0000  
 HomeImp:1780   Office : 948   1st Qu.: 3.000   1st Qu.: 0.0000  
 NA's   : 252   Other  :2388   Median : 7.000   Median : 0.0000  
                ProfExe:1276   Mean   : 8.922   Mean   : 0.2546  
                Sales  : 109   3rd Qu.:13.000   3rd Qu.: 0.0000  
                Self   : 193   Max.   :41.000   Max.   :10.0000  
                NA's   : 279   NA's   :515      NA's   :708      
     DELINQ            CLAGE             NINQ             CLNO     
 Min.   : 0.0000   Min.   :   0.0   Min.   : 0.000   Min.   : 0.0  
 1st Qu.: 0.0000   1st Qu.: 115.1   1st Qu.: 0.000   1st Qu.:15.0  
 Median : 0.0000   Median : 173.5   Median : 1.000   Median :20.0  
 Mean   : 0.4494   Mean   : 179.8   Mean   : 1.186   Mean   :21.3  
 3rd Qu.: 0.0000   3rd Qu.: 231.6   3rd Qu.: 2.000   3rd Qu.:26.0  
 Max.   :15.0000   Max.   :1168.2   Max.   :17.000   Max.   :71.0  
 NA's   :580       NA's   :308      NA's   :510      NA's   :222   
    DEBTINC        
 Min.   :  0.5245  
 1st Qu.: 29.1400  
 Median : 34.8183  
 Mean   : 33.7799  
 3rd Qu.: 39.0031  
 Max.   :203.3121  
 NA's   :1267

table(is.na(hmeq1))


FALSE  TRUE 
72209  5271

dim(hmeq1)

[1] 5960   13

hmeq1<-na.omit(hmeq1)
dim(hmeq1)

[1] 3364   13

The summary shows missing values in every predictor except LOAN. Each variable on its own is missing only a modest share of its values (the largest is DEBTINC with 1,267 of 5,960), so no single variable is dropped. In total 5,271 cells are missing. na.omit() removes every row that contains at least one missing value, which discards 2,596 of the 5,960 observations (nearly 44%) and leaves 3,364 complete cases.
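The difference between missing cells and dropped rows can be checked directly. The following is a minimal sketch on a toy data frame; `df` and its values are assumptions for illustration, not the HMEQ data.

```r
# Toy data frame standing in for hmeq1 (values are illustrative)
df <- data.frame(
  LOAN    = c(1100, 1300, 1500, 1700),
  MORTDUE = c(25860, NA, 13500, 97800),
  DEBTINC = c(NA, NA, 37.1, 36.9)
)

na_per_column <- colSums(is.na(df))          # missing cells in each column
na_total      <- sum(is.na(df))              # missing cells overall
rows_with_na  <- sum(!complete.cases(df))    # rows that na.omit() would drop

na_per_column
c(total_cells_missing = na_total, rows_dropped = rows_with_na)
nrow(na.omit(df))                            # rows kept
```

Here three cells are missing but only two rows are dropped, because one row has two missing values.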

Check For Outliers

boxplot(hmeq1)

dim(hmeq1)

[1] 3364   13

hmeq1<-hmeq1[hmeq1$MORTDUE<4e+05,]
hmeq1<-hmeq1[hmeq1$VALUE<3e+05,]
boxplot(hmeq1)

dim(hmeq1)

[1] 3351   13

View(hmeq1)

The boxplot shows outliers in several variables, but only the most extreme ones are removed. The two filters above keep rows with MORTDUE below 400,000 and VALUE below 300,000, which drops 13 observations (3,364 to 3,351).

Therefore, after data preprocessing the dataset has 3351 observations.
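Before dropping rows it can help to count how many each cut-off removes. This is a minimal sketch with the same two thresholds applied to a toy data frame; `df` and its values are assumptions for illustration.

```r
# Toy data frame standing in for hmeq1 (values are illustrative)
df <- data.frame(
  MORTDUE = c(25860, 70053, 13500, 97800, 150000),
  VALUE   = c(39025, 68400, 512650, 112000, 855909)
)

keep <- df$MORTDUE < 4e5 & df$VALUE < 3e5   # same cut-offs as above
sum(!keep)                                  # rows that the filter removes
df_clean <- df[keep, ]

# Points beyond the boxplot whiskers (1.5 * IQR rule), for comparison
boxplot.stats(df$VALUE)$out
```

Fixed cut-offs and the boxplot rule need not agree; the thresholds used in this report were chosen by eye from the plot.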

RESAMPLING THE PRE-PROCESSED DATA

library(caret)

Loading required package: lattice

Loading required package: ggplot2

split<-createDataPartition(hmeq1$BAD,p=0.6,list = FALSE)
train<-hmeq1[split,]
test<-hmeq1[-split,]
dim(train)

[1] 2011   13

dim(test)

[1] 1340   13

Here, the hmeq1 dataset is split into training data with 2,011 observations (60% of hmeq1) and test data with 1,340 observations (40% of hmeq1).

Now let's build several models on the training data, compare them to select the best one, and then predict on the test data.
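No random seed is set before this first split, so the rows that land in train and test change from run to run. The sketch below shows one way to make a class-stratified split reproducible in base R. The helper `stratified_split` and the toy outcome `y` are hypothetical and are not part of caret.

```r
# Hypothetical helper: stratified split that keeps the class ratio in both parts
stratified_split <- function(y, p, seed) {
  set.seed(seed)                               # fixes the random draw
  idx_by_class <- split(seq_along(y), y)       # row numbers grouped by class
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

# Toy outcome: 90 non-defaulters and 10 defaulters (illustrative)
y <- rep(c(0, 1), times = c(90, 10))

train_idx <- stratified_split(y, p = 0.6, seed = 123)
table(y[train_idx])    # 54 zeros and 6 ones
table(y[-train_idx])   # 36 zeros and 4 ones
```

Calling the helper again with the same seed returns the same row numbers, which makes the downstream confusion matrices repeatable.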

BINARY LOGISTIC REGRESSION

MODEL BUILDING

model_bin<-glm(BAD ~ ., data=train,family = binomial(link = "logit"))
exp(model_bin$coefficients)

  (Intercept)          LOAN       MORTDUE         VALUE REASONHomeImp 
    0.0144330     0.9999823     0.9999942     1.0000017     0.9653005 
    JOBOffice      JOBOther    JOBProfExe      JOBSales       JOBSelf 
    0.5024415     0.7350595     0.6821242     2.0913331     0.9547095 
          YOJ         DEROG        DELINQ         CLAGE          NINQ 
    0.9935786     1.8228026     2.0282456     0.9939706     1.0980477 
         CLNO       DEBTINC 
    0.9914871     1.1008042

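The exponentiated coefficients are odds ratios, not probabilities. The intercept (about 0.014) is the odds of default when every numeric predictor is zero and the factors are at their reference levels (REASON = DebtCon, JOB = Mgr). A one-unit increase in LOAN multiplies the odds of default by 0.99998, which leaves them almost unchanged, while each additional delinquent credit line (DELINQ) roughly doubles the odds (2.03). The other coefficients are read the same way.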
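To make the odds-ratio reading concrete, the sketch below converts a few of the values printed above into probabilities. The helper `odds_to_prob` is a name introduced here for illustration, and the 1,000-unit LOAN change is an arbitrary example.

```r
# Odds ratios copied from the output above
or_intercept <- 0.0144330
or_delinq    <- 2.0282456
or_loan      <- 0.9999823

# Odds -> probability: p = odds / (1 + odds)
odds_to_prob <- function(odds) odds / (1 + odds)

baseline_prob <- odds_to_prob(or_intercept)              # all predictors at 0 / reference level
one_delinq    <- odds_to_prob(or_intercept * or_delinq)  # same applicant with DELINQ = 1

# Multiplicative effect on the odds of a 1,000-unit increase in LOAN
loan_1000 <- or_loan^1000

round(c(baseline_prob, one_delinq, loan_1000), 4)
```

Odds ratios multiply, so the effect of several units is the per-unit ratio raised to that power rather than a sum.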

PREDICT ON TRAIN DATA

library(caret)
pred<-predict(model_bin,train)
pred_train<-ifelse(pred>0.5,1,0)
confusionMatrix(as.factor(train$BAD),as.factor(pred_train))

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1828    5
         1  149   29
                                          
               Accuracy : 0.9234          
                 95% CI : (0.9109, 0.9347)
    No Information Rate : 0.9831          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.2524          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.9246          
            Specificity : 0.8529          
         Pos Pred Value : 0.9973          
         Neg Pred Value : 0.1629          
             Prevalence : 0.9831          
         Detection Rate : 0.9090          
   Detection Prevalence : 0.9115          
      Balanced Accuracy : 0.8888          
                                          
       'Positive' Class : 0

library(ROCR)

Loading required package: gplots


Attaching package: 'gplots'

The following object is masked from 'package:stats':

    lowess

pr<-prediction(as.numeric(pred_train),as.numeric(train$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.5800968

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

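From the matrix above, 1,857 of the 2,011 training observations (1,828 + 29) are classified correctly, an accuracy of 92.3%. Kappa is low at 0.25 and the area under the curve is only 0.58, so the model is only moderately satisfactory despite the high accuracy. Three details affect how these numbers should be read. First, predict() on a glm returns log-odds unless type = "response" is given, so the 0.5 cut-off is applied on the log-odds scale. Second, confusionMatrix() expects the predictions first and the observed values second, so the rows labelled 'Prediction' here actually hold the observed classes. Third, the AUC is computed from the hard 0/1 predictions rather than from predicted probabilities.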
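The difference between the two prediction scales of a logistic regression can be seen on simulated data. This is a minimal sketch; the data frame `toy` and its coefficients are made up for illustration and are unrelated to the HMEQ data.

```r
set.seed(1)  # illustrative toy data, not the HMEQ data
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 1.5 * x))
toy <- data.frame(x = x, y = y)

fit <- glm(y ~ x, data = toy, family = binomial(link = "logit"))

log_odds <- predict(fit, toy)                     # default: type = "link"
prob     <- predict(fit, toy, type = "response")  # probabilities in [0, 1]

# The two scales are linked by the logistic function
all.equal(unname(plogis(log_odds)), unname(prob))

# A 0.5 cut-off on log-odds equals a probability cut-off of about 0.62
plogis(0.5)

pred_class <- ifelse(prob > 0.5, 1, 0)
table(observed = toy$y, predicted = pred_class)
```

Thresholding the log-odds at 0.5 therefore labels fewer cases as defaulters than thresholding the probability at 0.5 would.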

NAIVE BAYES CLASSIFIER

MODEL BUILDING

library(naivebayes)
train$BAD<-as.factor(train$BAD)
class(train$BAD)

[1] "factor"

model_nb<-naive_bayes(BAD ~ ., data=train)
model_nb$prior


         0          1 
0.91148682 0.08851318

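The algorithm estimates the prior probabilities from the class frequencies of the outcome variable (about 91% non-defaulters and 9% defaulters here), models each numeric predictor with a class-specific Gaussian distribution, and combines the two through Bayes' rule to estimate the posterior probabilities.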
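The prior-times-likelihood calculation can be written out by hand for a single numeric predictor. The sketch below uses made-up values (`x`, `cls`) and a hypothetical helper `posterior`. It illustrates the Gaussian naive Bayes idea and is not the internals of the naivebayes package.

```r
# Illustrative toy data: one numeric predictor, binary class
x   <- c(20, 25, 30, 35, 40, 45, 50, 55)
cls <- c(0, 0, 0, 0, 0, 1, 1, 1)

classes <- sort(unique(cls))
prior   <- as.numeric(table(cls)) / length(cls)   # class frequencies
mu      <- as.numeric(tapply(x, cls, mean))       # class-wise mean
sigma   <- as.numeric(tapply(x, cls, sd))         # class-wise standard deviation

# Posterior for a new value: prior * Gaussian likelihood, normalised to sum to 1
posterior <- function(x_new) {
  unnorm <- prior * dnorm(x_new, mean = mu, sd = sigma)
  setNames(unnorm / sum(unnorm), classes)
}

posterior(48)
```

With several predictors the per-variable likelihoods are multiplied together, which is the "naive" independence assumption.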

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_nb,train)
confusionMatrix(as.factor(train$BAD),as.factor(pred_train))

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1713  120
         1  112   66
                                          
               Accuracy : 0.8846          
                 95% CI : (0.8699, 0.8983)
    No Information Rate : 0.9075          
    P-Value [Acc > NIR] : 0.9997          
                                          
                  Kappa : 0.2992          
 Mcnemar's Test P-Value : 0.6458          
                                          
            Sensitivity : 0.9386          
            Specificity : 0.3548          
         Pos Pred Value : 0.9345          
         Neg Pred Value : 0.3708          
             Prevalence : 0.9075          
         Detection Rate : 0.8518          
   Detection Prevalence : 0.9115          
      Balanced Accuracy : 0.6467          
                                          
       'Positive' Class : 0

library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.65266

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

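Here 232 training observations (120 + 112) are misclassified. Accuracy (88.5%) is below the no-information rate, kappa is about 0.30, and only 66 of the 178 observed defaulters are identified, so this is a weak model even though the headline accuracy looks acceptable.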
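Because the observed classes were passed as the first argument to confusionMatrix(), the rows of the matrix above hold observed values and the columns hold predictions. The sketch below recomputes the usual metrics from those four counts with the defaulter class (BAD = 1) treated as the positive class. The variable names are introduced here for illustration.

```r
# Counts copied from the naive Bayes matrix above.
# Rows hold the observed class (first argument), columns the predicted class.
cm <- matrix(c(1713, 112, 120, 66), nrow = 2,
             dimnames = list(observed = c("0", "1"), predicted = c("0", "1")))

tp <- cm["1", "1"]; fn <- cm["1", "0"]
tn <- cm["0", "0"]; fp <- cm["0", "1"]

recall_defaulters <- tp / (tp + fn)   # share of observed defaulters that were caught
specificity       <- tn / (tn + fp)   # share of non-defaulters correctly cleared
precision         <- tp / (tp + fp)   # share of flagged applicants who defaulted
accuracy          <- (tp + tn) / sum(cm)

round(c(recall = recall_defaulters, specificity = specificity,
        precision = precision, accuracy = accuracy), 4)
```

Read this way, the model catches about 37% of the defaulters, a figure that the printed output labels 'Neg Pred Value'.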

CART DECISION TREE

MODEL BUILDING

library(rpart)
model_cart<-rpart(BAD ~ ., data=train)
model_cart$cptable

          CP nsplit rel error    xerror       xstd
1 0.20786517      0 1.0000000 1.0000000 0.07155915
2 0.03932584      1 0.7921348 0.8258427 0.06557761
3 0.02808989      2 0.7528090 0.8370787 0.06598678
4 0.02059925      3 0.7247191 0.7977528 0.06453908
5 0.01966292      6 0.6629213 0.7977528 0.06453908
6 0.01685393      8 0.6235955 0.8258427 0.06557761
7 0.01000000     10 0.5898876 0.8089888 0.06495721

model_cart$variable.importance

   DEBTINC      CLAGE     DELINQ       CLNO       LOAN      VALUE 
76.9271006 22.6624597 18.5424709 14.2202477 12.9673186 12.6965240 
     DEROG        JOB    MORTDUE       NINQ        YOJ 
11.5440371  3.3831320  2.0838004  1.9725113  0.6210879

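In this decision tree model, the cptable lists the error at each complexity parameter from 0.208 (no splits) down to the default limit of 0.01 (10 splits). The training error (rel error) keeps falling to 0.59, while the cross-validated error (xerror) is lowest, about 0.80, at 3 and 6 splits. The variable importance shows that DEBTINC is by far the most important variable for estimating defaulters.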
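The cptable can also be used to choose a pruning level. The sketch below applies two common rules to the values printed above: the row with the lowest cross-validated error, and the one-standard-error rule. The data frame `cptable` is retyped here by hand, and the pruning call in the final comment is a suggestion that was not run in this report.

```r
# cptable values copied from the output above
cptable <- data.frame(
  CP     = c(0.20786517, 0.03932584, 0.02808989, 0.02059925,
             0.01966292, 0.01685393, 0.01000000),
  nsplit = c(0, 1, 2, 3, 6, 8, 10),
  xerror = c(1.0000000, 0.8258427, 0.8370787, 0.7977528,
             0.7977528, 0.8258427, 0.8089888),
  xstd   = c(0.07155915, 0.06557761, 0.06598678, 0.06453908,
             0.06453908, 0.06557761, 0.06495721)
)

best      <- which.min(cptable$xerror)               # first row with the lowest xerror
threshold <- cptable$xerror[best] + cptable$xstd[best]
one_se    <- which(cptable$xerror <= threshold)[1]   # smallest tree within one SE

cptable[c(best, one_se), ]
# With rpart loaded, a pruned tree could then be obtained with something like
# prune(model_cart, cp = cptable$CP[one_se])
```

The one-standard-error rule prefers the simpler tree when its cross-validated error is statistically indistinguishable from the minimum.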

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_cart,train,type="class")
confusionMatrix(as.factor(train$BAD),as.factor(pred_train))

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1812   21
         1   84   94
                                          
               Accuracy : 0.9478          
                 95% CI : (0.9371, 0.9571)
    No Information Rate : 0.9428          
    P-Value [Acc > NIR] : 0.1814          
                                          
                  Kappa : 0.6149          
 Mcnemar's Test P-Value : 1.443e-09       
                                          
            Sensitivity : 0.9557          
            Specificity : 0.8174          
         Pos Pred Value : 0.9885          
         Neg Pred Value : 0.5281          
             Prevalence : 0.9428          
         Detection Rate : 0.9010          
   Detection Prevalence : 0.9115          
      Balanced Accuracy : 0.8865          
                                          
       'Positive' Class : 0

library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.7583166

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here only 105 training observations (21 + 84) are misclassified, so 1,906 of 2,011 are predicted correctly (94.8% accuracy), with a kappa of 0.61 and an AUC of 0.76, so the model is good.
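The AUC values in this report are computed from hard class labels, which gives a ROC curve with a single interior point. In that case the AUC equals the average of the true positive and true negative rates. The sketch below checks this against the CART matrix above. The helper `auc_rank` is a hypothetical base-R implementation of the rank-based AUC, not the ROCR function.

```r
# Rank-based AUC (Mann-Whitney statistic); helper name is hypothetical
auc_rank <- function(score, label) {
  r  <- rank(score)                 # average ranks handle ties
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Rebuild observed and predicted classes from the CART matrix above
observed  <- rep(c(0, 0, 1, 1), times = c(1812, 21, 84, 94))
predicted <- rep(c(0, 1, 0, 1), times = c(1812, 21, 84, 94))

auc_hard <- auc_rank(predicted, observed)

tpr <- 94 / (94 + 84)        # observed defaulters predicted as defaulters
tnr <- 1812 / (1812 + 21)    # observed non-defaulters predicted as such
c(auc_hard = auc_hard, balanced = (tpr + tnr) / 2)
```

Passing predicted probabilities instead of class labels as the score would trace the full ROC curve and usually gives a different, more informative AUC.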

SUPPORT VECTOR MACHINES

MODEL BUILDING

library(e1071)
model_svm<-train(BAD ~ ., data=train,method="svmLinear",preProcess=c("center","scale"),tuneLength=10)
model_svm

Support Vector Machines with Linear Kernel 

2011 samples
  12 predictor
   2 classes: '0', '1' 

Pre-processing: centered (16), scaled (16) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 2011, 2011, 2011, 2011, 2011, 2011, ... 
Resampling results:

  Accuracy   Kappa     
  0.9153093  0.08143762

Tuning parameter 'C' was held constant at a value of 1

Here, a linear SVM is fitted: among the hyperplanes that separate the classes of the outcome variable, it selects the one with the largest margin, i.e. the one farthest from the nearest training points. As the output notes, the cost parameter C was held constant at 1.
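The margin idea can be illustrated with a fixed hyperplane and a few points. In the sketch below the weights `w`, the offset `b` and the points in `X` are all made-up assumptions. A fitted SVM would choose `w` and `b` so that the smallest distance is as large as possible.

```r
# Hypothetical hyperplane w . x + b = 0 in two dimensions
w <- c(2, -1)
b <- -1

# Toy points (rows) with class labels coded as -1 / +1
X <- rbind(c(3, 1), c(2, 0), c(0, 2), c(-1, 1))
y <- c(1, 1, -1, -1)

# Signed distance of each point from the hyperplane
signed_dist <- as.numeric(X %*% w + b) / sqrt(sum(w^2))

predicted <- ifelse(signed_dist >= 0, 1, -1)   # side of the hyperplane
margin    <- min(y * signed_dist)              # distance of the closest point

data.frame(signed_dist, predicted, y)
margin
```

The points that attain this minimum distance are the support vectors; moving any other point slightly does not change the fitted hyperplane.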

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_svm,train)
confusionMatrix(as.factor(train$BAD),as.factor(pred_train))

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1833    0
         1  174    4
                                          
               Accuracy : 0.9135          
                 95% CI : (0.9003, 0.9254)
    No Information Rate : 0.998           
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.0402          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.91330         
            Specificity : 1.00000         
         Pos Pred Value : 1.00000         
         Neg Pred Value : 0.02247         
             Prevalence : 0.99801         
         Detection Rate : 0.91149         
   Detection Prevalence : 0.91149         
      Balanced Accuracy : 0.95665         
                                          
       'Positive' Class : 0

library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.511236

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

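Here accuracy (91.3%) is high, but the model assigns only 4 of the 2,011 training observations to class 1, so it is essentially predicting the majority class. Kappa (0.04) and the area under the curve (0.51) show that it barely improves on chance, so the model is not satisfactory for identifying defaulters.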
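Kappa corrects accuracy for the agreement expected by chance, which is why it can be near zero when accuracy is above 90%. The sketch below recomputes it from the four counts in the SVM matrix above. The variable names are introduced here for illustration.

```r
# Counts copied from the SVM matrix above (rows = first argument, columns = second)
cm <- matrix(c(1833, 174, 0, 4), nrow = 2)

n  <- sum(cm)
po <- sum(diag(cm)) / n                          # observed agreement (accuracy)
pe <- sum(rowSums(cm) * colSums(cm)) / n^2       # agreement expected by chance
kappa_hat <- (po - pe) / (1 - pe)

round(c(accuracy = po, chance = pe, kappa = kappa_hat), 4)
```

With about 91% of applicants in one class, chance agreement is already about 0.91, so an accuracy of 0.913 leaves almost nothing above chance.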

RANDOM FOREST

MODEL BUILDING

library(randomForest)

randomForest 4.6-14

Type rfNews() to see new features/changes/bug fixes.


Attaching package: 'randomForest'

The following object is masked from 'package:ggplot2':

    margin

model_rf<-randomForest(BAD ~ ., data=train,mtry=3,ntree=700)
model_rf


Call:
 randomForest(formula = BAD ~ ., data = train, mtry = 3, ntree = 700) 
               Type of random forest: classification
                     Number of trees: 700
No. of variables tried at each split: 3

        OOB estimate of  error rate: 5.82%
Confusion matrix:
     0  1 class.error
0 1828  5 0.002727769
1  112 66 0.629213483

varImpPlot(model_rf)

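Here the random forest grows 700 unpruned trees. Pruning is not required because the final prediction is a majority vote over all 700 trees. Each tree is fitted on a bootstrap sample, and the observations left out of that sample (roughly 37% of the training data on average) act as an internal test set. The resulting out-of-bag (OOB) error is only 5.82%, although the class error for defaulters is much higher (0.63). In the variable importance plot, variables with a larger mean decrease in Gini are the more important ones.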
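The share of out-of-bag rows follows from sampling with replacement. The sketch below compares the exact probability for 2,011 rows with its limit and with one simulated bootstrap draw. The seed is arbitrary.

```r
n <- 2011                      # training rows, as in the split above

# Probability that a given row is left out of one bootstrap sample
p_oob_exact <- (1 - 1 / n)^n
p_oob_limit <- exp(-1)

# Empirical check with one bootstrap draw (seed chosen arbitrarily)
set.seed(42)
boot_idx  <- sample.int(n, size = n, replace = TRUE)
p_oob_sim <- 1 - length(unique(boot_idx)) / n

round(c(exact = p_oob_exact, limit = p_oob_limit, simulated = p_oob_sim), 3)
```

Each training row is therefore scored only by the trees that never saw it, which is what makes the OOB error an honest estimate.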

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_rf,train)
confusionMatrix(as.factor(train$BAD),as.factor(pred_train))

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1833    0
         1    0  178
                                     
               Accuracy : 1          
                 95% CI : (0.9982, 1)
    No Information Rate : 0.9115     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.9115     
         Detection Rate : 0.9115     
   Detection Prevalence : 0.9115     
      Balanced Accuracy : 1.0000     
                                     
       'Positive' Class : 0

library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 1

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

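Here every training observation is predicted correctly (100% accuracy). This is expected when a random forest predicts the data it was grown on, so the OOB error of 5.82% reported above is the more realistic estimate of its performance.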

ADAPTIVE BOOSTING

MODEL BUILDING

library(ada)
model_ada<-ada(BAD ~ ., data=train,loss='exponential',type='discrete',iter=150)
model_ada

Call:
ada(BAD ~ ., data = train, loss = "exponential", type = "discrete", 
    iter = 150)

Loss: exponential Method: discrete   Iteration: 150 

Final Confusion Matrix for Data:
          Final Prediction
True value    0    1
         0 1833    0
         1   13  165

Train Error: 0.006 

Out-Of-Bag Error:  0.021  iteration= 146 

Additional Estimates of number of iterations:

train.err1 train.kap1 
       143        143

plot(model_ada)

  With exponential loss and discrete adaptive boosting, a model with 150 iterations is built. The out-of-bag error is only 0.021, and the plot shows the training error decreasing gradually as the number of iterations increases.
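The reweighting step behind discrete boosting can be shown for a single round. The sketch below follows the textbook discrete AdaBoost update on made-up labels and predictions. It illustrates the idea and is not the internals of the ada package.

```r
# One round of textbook discrete AdaBoost on toy values (illustrative only)
y    <- c( 1,  1, -1, -1,  1, -1,  1, -1)   # true labels
pred <- c( 1, -1, -1, -1,  1,  1,  1, -1)   # weak learner's predictions

w    <- rep(1 / length(y), length(y))       # start with uniform weights
miss <- pred != y

err   <- sum(w[miss])                        # weighted error of the weak learner
alpha <- 0.5 * log((1 - err) / err)          # learner's vote in the ensemble

# Exponential-loss update: misclassified points gain weight, others lose it
w_new <- w * exp(-alpha * y * pred)
w_new <- w_new / sum(w_new)

round(rbind(before = w, after = w_new), 3)
sum(w_new[miss])   # the same learner now has weighted error 0.5
```

After the update the current learner is no better than chance on the reweighted data, which forces the next learner to focus on the cases that were missed.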

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_ada,train)
confusionMatrix(as.factor(train$BAD),as.factor(pred_train))

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1833    0
         1   13  165
                                         
               Accuracy : 0.9935         
                 95% CI : (0.989, 0.9966)
    No Information Rate : 0.918          
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.9586         
 Mcnemar's Test P-Value : 0.0008741      
                                         
            Sensitivity : 0.9930         
            Specificity : 1.0000         
         Pos Pred Value : 1.0000         
         Neg Pred Value : 0.9270         
             Prevalence : 0.9180         
         Detection Rate : 0.9115         
   Detection Prevalence : 0.9115         
      Balanced Accuracy : 0.9965         
                                         
       'Positive' Class : 0

library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.9634831

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

  Here, the model combines many weak learners into one strong learner, and all but 13 training observations are predicted correctly (99.4% accuracy).
  Now let's compare the performance of all the above models on the training data and decide on the best one.

SUMMARY

SUMMARY ON TRAIN DATA (all values in %)
MODEL	ACCURACY	KAPPA	SENSITIVITY	SPECIFICITY	BALANCED_ACC	AREA_UNDER_CURVE
BINARY_LOGISTIC_REGRESSION	92.34	25.24	92.46	85.29	88.88	58.01
NAIVE_BAYES	88.46	29.92	93.86	35.48	64.67	65.27
CART(DECISION_TREE)	94.78	61.49	95.57	81.74	88.65	75.83
SVM	91.35	4.02	91.33	100.00	95.67	51.12
RANDOM_FOREST	100.00	100.00	100.00	100.00	100.00	100.00
ADAPTIVE_BOOSTING	99.35	95.86	99.30	100.00	99.65	96.35

  From the summary we can infer that the CART decision tree, random forest and adaptive boosting models performed very well, whereas the other models performed comparatively poorly.
  However, kappa for the CART decision tree is somewhat low (we generally prefer a kappa above 70%), so let's predict on the test data with the adaptive boosting model and check its bias and variance.

USING ADAPTIVE BOOSTING

PREDICTING ON TEST DATA

pred_test<-predict(model_ada,test)
confusionMatrix(as.factor(test$BAD),as.factor(pred_test))

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1219    3
         1   69   49
                                          
               Accuracy : 0.9463          
                 95% CI : (0.9328, 0.9577)
    No Information Rate : 0.9612          
    P-Value [Acc > NIR] : 0.9971          
                                          
                  Kappa : 0.5524          
 Mcnemar's Test P-Value : 1.855e-14       
                                          
            Sensitivity : 0.9464          
            Specificity : 0.9423          
         Pos Pred Value : 0.9975          
         Neg Pred Value : 0.4153          
             Prevalence : 0.9612          
         Detection Rate : 0.9097          
   Detection Prevalence : 0.9119          
      Balanced Accuracy : 0.9444          
                                          
       'Positive' Class : 0

library(ROCR)
pr<-prediction(as.numeric(pred_test),as.numeric(test$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.7063996

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

The values above are predicted on the test data with the adaptive boosting model. Accuracy remains high (94.6%), while kappa (0.55) and the AUC (0.71) are clearly lower than on the training data. Now let's fix the random seed and repeat the analysis on differently resampled data.

SHUFFLE THE DATA

set.seed(123)

RESAMPLE THE DATA

library(caret)
split<-createDataPartition(hmeq1$BAD,p=0.55,list = FALSE)
train<-hmeq1[split,]
test<-hmeq1[-split,]
dim(train)

[1] 1844   13

dim(test)

[1] 1507   13

USING ADAPTIVE BOOSTING

MODEL BUILDING

library(ada)
model_ada<-ada(BAD ~ ., data=train,loss='exponential',type='discrete',iter=150)
model_ada

Call:
ada(BAD ~ ., data = train, loss = "exponential", type = "discrete", 
    iter = 150)

Loss: exponential Method: discrete   Iteration: 150 

Final Confusion Matrix for Data:
          Final Prediction
True value    0    1
         0 1667    0
         1   12  165

Train Error: 0.007 

Out-Of-Bag Error:  0.026  iteration= 147 

Additional Estimates of number of iterations:

train.err1 train.kap1 
       140        140

plot(model_ada)

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_ada,train)
confusionMatrix(as.factor(train$BAD),as.factor(pred_train))

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1667    0
         1   12  165
                                          
               Accuracy : 0.9935          
                 95% CI : (0.9887, 0.9966)
    No Information Rate : 0.9105          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9613          
 Mcnemar's Test P-Value : 0.001496        
                                          
            Sensitivity : 0.9929          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.9322          
             Prevalence : 0.9105          
         Detection Rate : 0.9040          
   Detection Prevalence : 0.9040          
      Balanced Accuracy : 0.9964          
                                          
       'Positive' Class : 0

library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.9661017

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICTING ON TEST DATA

pred_test<-predict(model_ada,test)
confusionMatrix(as.factor(test$BAD),as.factor(pred_test))

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1384    4
         1   75   44
                                          
               Accuracy : 0.9476          
                 95% CI : (0.9351, 0.9583)
    No Information Rate : 0.9681          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.5045          
 Mcnemar's Test P-Value : 3.391e-15       
                                          
            Sensitivity : 0.9486          
            Specificity : 0.9167          
         Pos Pred Value : 0.9971          
         Neg Pred Value : 0.3697          
             Prevalence : 0.9681          
         Detection Rate : 0.9184          
   Detection Prevalence : 0.9210          
      Balanced Accuracy : 0.9326          
                                          
       'Positive' Class : 0

library(ROCR)
pr<-prediction(as.numeric(pred_test),as.numeric(test$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.683433

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

RESHUFFLE THE DATA

set.seed(1234)

RESAMPLE THE DATA

library(caret)
split<-createDataPartition(hmeq1$BAD,p=0.7,list = FALSE)
train<-hmeq1[split,]
test<-hmeq1[-split,]
dim(train)

[1] 2346   13

dim(test)

[1] 1005   13

MODEL BUILDING

library(ada)
model_ada<-ada(BAD ~ ., data=train,loss='exponential',type='discrete',iter=150)
model_ada

Call:
ada(BAD ~ ., data = train, loss = "exponential", type = "discrete", 
    iter = 150)

Loss: exponential Method: discrete   Iteration: 150 

Final Confusion Matrix for Data:
          Final Prediction
True value    0    1
         0 2147    0
         1    8  191

Train Error: 0.003 

Out-Of-Bag Error:  0.02  iteration= 149 

Additional Estimates of number of iterations:

train.err1 train.kap1 
       148        148

plot(model_ada)

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_ada,train)
confusionMatrix(as.factor(train$BAD),as.factor(pred_train))

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 2147    0
         1    8  191
                                          
               Accuracy : 0.9966          
                 95% CI : (0.9933, 0.9985)
    No Information Rate : 0.9186          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.9776          
 Mcnemar's Test P-Value : 0.01333         
                                          
            Sensitivity : 0.9963          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.9598          
             Prevalence : 0.9186          
         Detection Rate : 0.9152          
   Detection Prevalence : 0.9152          
      Balanced Accuracy : 0.9981          
                                          
       'Positive' Class : 0

library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.9798995

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICT ON TEST DATA

pred_test<-predict(model_ada,test)
confusionMatrix(as.factor(test$BAD),as.factor(pred_test))

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 907   1
         1  63  34
                                          
               Accuracy : 0.9363          
                 95% CI : (0.9194, 0.9506)
    No Information Rate : 0.9652          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.489           
 Mcnemar's Test P-Value : 2.44e-14        
                                          
            Sensitivity : 0.9351          
            Specificity : 0.9714          
         Pos Pred Value : 0.9989          
         Neg Pred Value : 0.3505          
             Prevalence : 0.9652          
         Detection Rate : 0.9025          
   Detection Prevalence : 0.9035          
      Balanced Accuracy : 0.9532          
                                          
       'Positive' Class : 0

library(ROCR)
pr<-prediction(as.numeric(pred_test),as.numeric(test$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.6747071

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

RESHUFFLE THE DATA

set.seed(12345)

RESAMPLE THE DATA

library(caret)
split<-createDataPartition(hmeq1$BAD,p=0.8,list = FALSE)
train<-hmeq1[split,]
test<-hmeq1[-split,]
dim(train)

[1] 2681   13

dim(test)

[1] 670  13

MODEL BUILDING

library(ada)
model_ada<-ada(BAD ~ ., data=train,loss='exponential',type='discrete',iter=150)
model_ada

Call:
ada(BAD ~ ., data = train, loss = "exponential", type = "discrete", 
    iter = 150)

Loss: exponential Method: discrete   Iteration: 150 

Final Confusion Matrix for Data:
          Final Prediction
True value    0    1
         0 2444    0
         1   10  227

Train Error: 0.004 

Out-Of-Bag Error:  0.021  iteration= 150 

Additional Estimates of number of iterations:

train.err1 train.kap1 
       135        135

plot(model_ada)

PREDICT ON TRAIN DATA

library(caret)
pred_train<-predict(model_ada,train)
confusionMatrix(as.factor(train$BAD),as.factor(pred_train))

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 2444    0
         1   10  227
                                          
               Accuracy : 0.9963          
                 95% CI : (0.9932, 0.9982)
    No Information Rate : 0.9153          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9764          
 Mcnemar's Test P-Value : 0.004427        
                                          
            Sensitivity : 0.9959          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.9578          
             Prevalence : 0.9153          
         Detection Rate : 0.9116          
   Detection Prevalence : 0.9116          
      Balanced Accuracy : 0.9980          
                                          
       'Positive' Class : 0

library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.978903

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICT ON TEST DATA

pred_test<-predict(model_ada,test)
confusionMatrix(as.factor(test$BAD),as.factor(pred_test))

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 611   0
         1  31  28
                                         
               Accuracy : 0.9537         
                 95% CI : (0.935, 0.9683)
    No Information Rate : 0.9582         
    P-Value [Acc > NIR] : 0.7555         
                                         
                  Kappa : 0.6223         
 Mcnemar's Test P-Value : 7.118e-08      
                                         
            Sensitivity : 0.9517         
            Specificity : 1.0000         
         Pos Pred Value : 1.0000         
         Neg Pred Value : 0.4746         
             Prevalence : 0.9582         
         Detection Rate : 0.9119         
   Detection Prevalence : 0.9119         
      Balanced Accuracy : 0.9759         
                                         
       'Positive' Class : 0

library(ROCR)
pr<-prediction(as.numeric(pred_test),as.numeric(test$BAD))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc

[[1]]
[1] 0.7372881

prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

FINAL SUMMARY

SUMMARY ON DIFFERENT TRAIN AND TEST DATA (all values in %)
PARAMETERS	ACCURACY	KAPPA	SENSITIVITY	SPECIFICITY	BALANCED_ACCURACY	AREA_UNDER_CURVE
TRAIN	99.35	95.86	99.30	100.00	99.65	96.35
TRAIN1	99.35	96.13	99.29	100.00	99.64	96.61
TRAIN2	99.66	97.76	99.63	100.00	99.81	97.99
TRAIN3	99.63	97.64	99.59	100.00	99.80	97.89
TEST	94.63	55.24	94.64	94.23	94.44	70.64
TEST1	94.76	50.45	94.86	91.67	93.26	68.34
TEST2	93.63	48.90	93.51	97.14	95.32	67.47
TEST3	95.37	62.23	95.17	100.00	97.59	73.73
TRAIN/TEST: 60/40 split; TRAIN1/TEST1: 55/45 split; TRAIN2/TEST2: 70/30 split; TRAIN3/TEST3: 80/20 split.

CONCLUSION

From the summary above we can see that the adaptive boosting model performed consistently across the different train/test splits of the hmeq1 dataset: test accuracy stays at roughly 94% to 95%. Kappa and the area under the curve, however, are noticeably lower on the test data (kappa about 49% to 62%, AUC about 67% to 74%) than on the training data (above 95%), which indicates some overfitting. Overall the results are stable from split to split, so the model can be used to predict customer delinquency (whether or not an applicant for a home equity loan will repay it).

Hence, we consider this model trustworthy and ready to be deployed.
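The size of the train/test gap can be tabulated per split. The sketch below retypes the kappa and AUC values from the confusionMatrix and AUC outputs above into a data frame. The data frame `splits` and its column names are introduced here for illustration.

```r
# Kappa and AUC (in %) for the four splits, copied from the outputs above
splits <- data.frame(
  train_share = c(0.60, 0.55, 0.70, 0.80),
  kappa_train = c(95.86, 96.13, 97.76, 97.64),
  kappa_test  = c(55.24, 50.45, 48.90, 62.23),
  auc_train   = c(96.35, 96.61, 97.99, 97.89),
  auc_test    = c(70.64, 68.34, 67.47, 73.73)
)

# Gap between training and test performance, in percentage points
splits$kappa_gap <- splits$kappa_train - splits$kappa_test
splits$auc_gap   <- splits$auc_train - splits$auc_test

splits[, c("train_share", "kappa_gap", "auc_gap")]
range(splits$kappa_gap)
```

A gap that stays similar across splits suggests the split-to-split variation is small, while its size shows how much the training figures overstate performance on unseen applicants.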

FINAL DATA WITH PREDICTED VALUES

library(caret)
predicted<-predict(model_ada,hmeq1)
library(dplyr)


Attaching package: 'dplyr'

The following object is masked from 'package:randomForest':

    combine

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

hmeq1<-mutate(hmeq1,expected_BAD=predicted)
hmeq1<-rename(hmeq1,observed_BAD=BAD)

FINAL DATASET WITH OBSERVED AND EXPECTED DEFAULTERS (displaying only first 20 observations)
observed_BAD	LOAN	MORTDUE	VALUE	REASON	JOB	YOJ	DEROG	DELINQ	CLAGE	NINQ	CLNO	DEBTINC	expected_BAD
1	1700	30548	40320	HomeImp	Other	9	0	0	101.46600	1	8	37.11361	1
1	1800	28502	43034	HomeImp	Other	11	0	0	88.76603	0	8	36.88489	1
0	2300	102370	120953	HomeImp	Office	2	0	0	90.99253	0	13	31.58850	0
1	2400	34863	47471	HomeImp	Mgr	12	0	0	70.49108	1	21	38.26360	1
0	2400	98449	117195	HomeImp	Office	4	0	0	93.81177	0	13	29.68183	0
0	2900	103949	112505	HomeImp	Office	1	0	0	96.10233	0	13	30.05114	0
0	2900	104373	120702	HomeImp	Office	2	0	0	101.54030	0	13	29.91586	0
1	2900	7750	67996	HomeImp	Other	16	3	0	122.20466	2	8	36.21135	1
1	2900	61962	70915	DebtCon	Mgr	2	0	0	282.80166	3	37	49.20640	1
0	3000	104570	121729	HomeImp	Office	2	0	0	85.88437	0	14	32.05978	0
0	3200	74864	87266	HomeImp	ProfExe	7	0	0	250.63127	0	12	42.91000	0
1	3300	130518	164317	DebtCon	Other	9	0	6	192.28915	0	33	35.73056	1
0	3600	100693	114743	HomeImp	Office	6	0	0	88.47045	0	14	29.39354	0
0	3600	52337	63989	HomeImp	Office	20	0	0	204.27250	0	20	20.47092	0
1	3700	17857	21144	HomeImp	Other	5	0	0	129.71732	1	9	26.63435	1
0	3800	51180	63459	HomeImp	Office	20	0	0	203.75153	0	20	20.06704	0
1	3900	29896	45960	HomeImp	Other	11	0	0	146.12324	0	14	24.47888	1
0	3900	102143	118742	HomeImp	Office	2	0	0	85.27737	0	13	29.34392	0
0	4000	105164	112774	HomeImp	Office	1	0	0	94.72487	0	13	29.39093	0
0	4000	54543	61777	HomeImp	Office	21	0	0	205.58668	0	19	21.80656	0

HMEQ

shekar

17 August 2018
