PROBLEM STATEMENT

Can a customer be classified as a defaulter or non-defaulter using the lgd1 dataset?

DATASET INFORMATION

    The dataset is provided by a European bank and has been slightly modified and anonymized. It includes 2,545 observations on loans and LGDs.
library(readxl)
lgd1<-read_excel("D:/projects/lgd/lgd.xlsx")
View(lgd1)
Loss given default dataset (first 10 observations):

      LTV  Recovery_rate   lgd_time   y_logistic        lnrr    Y_probit  purpose1  event
0.2140781      0.6980155  0.3019845   -0.8378657  -0.3595139  -0.5187014         0      1
0.2140781      0.7800841  0.2199159   -1.2661566  -0.2483535  -0.7724773         0      1
0.2140781      0.7022869  0.2977131   -0.8582118  -0.3534132  -0.5309894         0      1
0.2140781      0.7539889  0.2460111   -1.1200010  -0.2823776  -0.6870961         0      1
0.2140781      0.8028126  0.1971874   -1.4039668  -0.2196340  -0.8517105         0      1
0.0273835      0.9921424  0.0078576   -4.8383828  -0.0078887  -2.4154625         0      1
0.0626695      0.9999900  0.0000100  -11.5129155   0.0000000  -4.2648908         0      0
0.2261305      0.9806656  0.0193344   -3.9263453  -0.0195238  -2.0676939         0      1
0.0073802      0.9920494  0.0079506   -4.8265298  -0.0079823  -2.4111769         0      1
0.0629311      0.9999900  0.0000100  -11.5129155   0.0000000  -4.2648908         0      0

ATTRIBUTES INFORMATION

LTV (loan-to-value ratio) is a risk-assessment ratio: the amount of the mortgage loan divided by the appraised value of the property. Higher LTV ratios are generally seen as higher risk. (Expressed as a ratio.)

Recovery_rate is the value a security retains when it emerges from default, i.e. 1 minus the loss arising in the event of default (LGD). (Expressed as a ratio.)

lgd_time (loss given default) is the share of the exposure that is not expected to be recovered in the event of default. (Expressed as a ratio.)

y_logistic is the logistic (logit) transformation of the loss given default (LGD).

lnrr is the natural logarithm of the recovery rate.

Y_probit is the probit transformation of the loss given default (LGD).

purpose1 indicates the purpose of the loan ('1' for rent purposes and '0' for others).

event indicates default status ('1' for default and '0' for non-default).

Here, event is the outcome variable: whether a customer is a defaulter or non-defaulter is predicted from the remaining 7 variables, which serve as predictors.
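As a quick sanity check, the transformed columns can be reproduced directly from lgd_time and Recovery_rate. A minimal sketch (assuming lgd1 is loaded as above; the *_check names are our own):
# logit of LGD: log(lgd / (1 - lgd))
y_logistic_check<-log(lgd1$lgd_time/(1-lgd1$lgd_time))
# natural logarithm of the recovery rate
lnrr_check<-log(lgd1$Recovery_rate)
# probit of LGD: the standard normal quantile function
Y_probit_check<-qnorm(lgd1$lgd_time)
head(cbind(y_logistic_check,lnrr_check,Y_probit_check))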

DATA PREPROCESSING

Check for missing values

str(lgd1)
Classes 'tbl_df', 'tbl' and 'data.frame':   2545 obs. of  8 variables:
 $ LTV          : num  0.214 0.214 0.214 0.214 0.214 ...
 $ Recovery_rate: num  0.698 0.78 0.702 0.754 0.803 ...
 $ lgd_time     : num  0.302 0.22 0.298 0.246 0.197 ...
 $ y_logistic   : num  -0.838 -1.266 -0.858 -1.12 -1.404 ...
 $ lnrr         : num  -0.36 -0.248 -0.353 -0.282 -0.22 ...
 $ Y_probit     : num  -0.519 -0.772 -0.531 -0.687 -0.852 ...
 $ purpose1     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ event        : num  1 1 1 1 1 1 0 1 1 0 ...
sum(is.na(lgd1))
[1] 0
summary(lgd1)
      LTV           Recovery_rate        lgd_time         y_logistic      
 Min.   :0.001359   Min.   :0.00001   Min.   :0.00001   Min.   :-11.5129  
 1st Qu.:0.399180   1st Qu.:0.60215   1st Qu.:0.00001   1st Qu.:-11.5129  
 Median :0.659417   Median :0.96793   Median :0.03207   Median : -3.4074  
 Mean   :0.676556   Mean   :0.77187   Mean   :0.22813   Mean   : -3.9413  
 3rd Qu.:0.923548   3rd Qu.:0.99999   3rd Qu.:0.39785   3rd Qu.: -0.4144  
 Max.   :1.984065   Max.   :0.99999   Max.   :0.99999   Max.   : 11.5129  
      lnrr              Y_probit          purpose1           event       
 Min.   :-11.51292   Min.   :-4.2649   Min.   :0.00000   Min.   :0.0000  
 1st Qu.: -0.50726   1st Qu.:-4.2649   1st Qu.:0.00000   1st Qu.:0.0000  
 Median : -0.03259   Median :-1.8513   Median :0.00000   Median :1.0000  
 Mean   : -0.99665   Mean   :-1.6508   Mean   :0.07269   Mean   :0.7139  
 3rd Qu.:  0.00000   3rd Qu.:-0.2589   3rd Qu.:0.00000   3rd Qu.:1.0000  
 Max.   :  0.00000   Max.   : 4.2649   Max.   :1.00000   Max.   :1.0000  
The structure and summary confirm that the dataset contains no missing values.

Check for outliers

boxplot(lgd1)

dim(lgd1)
[1] 2545    8
The boxplot shows outliers in several variables. For 'lnrr' a few values lie far below the rest, and for 'purpose1' the flagged points are simply the minority class (purpose1 = 1). A few extreme values are removed below to limit their influence on the models.

Remove extreme outliers

lgd1<-lgd1[lgd1$lnrr>-6,]
dim(lgd1)
[1] 2396    8
lgd1<-lgd1[!lgd1$purpose1==1,]
dim(lgd1)
[1] 2234    8
boxplot(lgd1)

Removing the extreme values of 'lnrr' and all rows with purpose1 = 1 reduces the dataset from 2,545 to 2,234 observations; the boxplot of the filtered data shows no remaining extreme outliers. Note that dropping purpose1 = 1 removes that loan-purpose category entirely, so purpose1 carries no variation afterwards.
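The cutoffs used above (-6 for lnrr, dropping purpose1 == 1) are ad hoc. A common, more general alternative is the 1.5 x IQR rule; a minimal sketch on the lnrr column (illustrative only, not run here):
q<-quantile(lgd1$lnrr,c(0.25,0.75))
# keep rows within 1.5 * IQR of the quartiles
keep<-lgd1$lnrr>=q[1]-1.5*diff(q) & lgd1$lnrr<=q[2]+1.5*diff(q)
lgd1_iqr<-lgd1[keep,]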

Resampling the pre-processed data

Now the data is split into a training set of 1,117 observations and a test set, which is further divided into test1 (559 observations) and test2 (558 observations).
library(caret)
Loading required package: lattice
Loading required package: ggplot2
splitlgd1<-createDataPartition(lgd1$event,p=0.5,list = FALSE)
train<-lgd1[splitlgd1,]
test<-lgd1[-splitlgd1,]
dim(train)
[1] 1117    8
dim(test)
[1] 1117    8
splittest<-createDataPartition(test$event,p=0.5,list = FALSE)
test1<-test[splittest,]
test2<-test[-splittest,]
dim(test1)
[1] 559   8
dim(test2)
[1] 558   8
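Note that createDataPartition draws random row indices, so this split is not reproducible from run to run. A minimal sketch of a reproducible version (the seed value 42 is arbitrary):
set.seed(42)  # fix the RNG so the partition is the same on every run
splitlgd1<-createDataPartition(lgd1$event,p=0.5,list = FALSE)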
Now let's build models with several different algorithms, choose the best one based on the evaluation metrics, and use it to predict outcomes. Note that at this stage we predict only on the training data in order to compare the models.

BINARY LOGISTIC REGRESSION

Model Building

train$event<-as.factor(train$event)
class(train$event)
[1] "factor"
model_bin<-glm(event ~ ., data=train,family = binomial(link = "logit"))
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Predict On Train Data

pred_train_bin<-ifelse(model_bin$fitted.values>0.5,1,0)
library(caret)
confusionMatrix(as.factor(train$event),as.factor(pred_train_bin))
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 357   0
         1   0 760
                                     
               Accuracy : 1          
                 95% CI : (0.9967, 1)
    No Information Rate : 0.6804     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.3196     
         Detection Rate : 0.3196     
   Detection Prevalence : 0.3196     
      Balanced Accuracy : 1.0000     
                                     
       'Positive' Class : 0          
                                     
library(ROCR)
Loading required package: gplots

Attaching package: 'gplots'
The following object is masked from 'package:stats':

    lowess
pr<-prediction(as.numeric(pred_train_bin),as.numeric(train$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

After fitting the binary logistic regression we predict on the training data: all 1,117 observations are classified correctly (100% accuracy). The convergence warnings above indicate complete separation: at least one predictor (here, the transformed LGD variables) splits the two classes perfectly, so the fitted probabilities collapse to 0 or 1 and the coefficient estimates are unreliable even though the predictions are perfect.
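The same ROCR boilerplate (prediction, AUC, ROC plot) recurs after every model below, so it can be wrapped in a small helper; a sketch (the function name plot_roc_auc is our own). One caveat on the confusion matrices throughout: caret expects confusionMatrix(data = predictions, reference = truth), and here the arguments are passed in the reverse order; accuracy and kappa are unaffected, but the sensitivity/specificity labels are effectively swapped.
# helper: compute AUC and draw the ROC curve for a vector of predictions
plot_roc_auc<-function(pred,truth){
  pr<-prediction(as.numeric(pred),as.numeric(truth))
  auc<-performance(pr,measure = "auc")@y.values[[1]]
  prf<-performance(pr,measure = "tpr",x.measure = "fpr")
  plot(prf)
  auc
}
# usage: plot_roc_auc(pred_train_bin,train$event)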

NAIVE BAYES

Model Building

metric<-"Accuracy"
library(naivebayes)
model_naive<-naive_bayes(event ~ ., data=train,metric=metric)
model_naive
===================== Naive Bayes ===================== 
Call: 
naive_bayes.formula(formula = event ~ ., data = train, metric = metric)

A priori probabilities: 

        0         1 
0.3196061 0.6803939 

Tables: 
      
LTV            0         1
  mean 0.4908501 0.7462954
  sd   0.2732269 0.3606723

             
Recovery_rate         0         1
         mean 0.9999900 0.7504959
         sd   0.0000000 0.2903258

        
lgd_time         0         1
    mean 0.0000100 0.2495041
    sd   0.0000000 0.2903258

          
y_logistic          0          1
      mean -11.512915  -2.048721
      sd     0.000000   2.529703

      
lnrr            0          1
  mean  0.0000000 -0.4885591
  sd    0.0000000  0.9083414

# ... and 2 more tables
Here, the model stores the a priori class probabilities together with the per-class mean and standard deviation of each variable; posterior probabilities for new observations are obtained from these via Bayes' rule, separately for defaulters and non-defaulters.
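Under the hood, the posterior is proportional to the class prior times the product of per-feature Gaussian densities. A minimal single-feature sketch using the LTV parameters printed above (the value x = 0.6 is hypothetical):
x<-0.6  # a hypothetical LTV value
# prior * Gaussian density for each class, using the printed means/sds
p0<-0.3196061*dnorm(x,mean = 0.4908501,sd = 0.2732269)
p1<-0.6803939*dnorm(x,mean = 0.7462954,sd = 0.3606723)
p1/(p0+p1)  # posterior probability of class 1 (default) given LTV alone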

Predicting On Train Data

library(caret)
pred_train_naive<-predict(model_naive,train)
confusionMatrix(as.factor(train$event),as.factor(pred_train_naive))
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 357   0
         1  39 721
                                          
               Accuracy : 0.9651          
                 95% CI : (0.9526, 0.9751)
    No Information Rate : 0.6455          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.922           
 Mcnemar's Test P-Value : 1.166e-09       
                                          
            Sensitivity : 0.9015          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.9487          
             Prevalence : 0.3545          
         Detection Rate : 0.3196          
   Detection Prevalence : 0.3196          
      Balanced Accuracy : 0.9508          
                                          
       'Positive' Class : 0               
                                          
library(ROCR)
pr<-prediction(as.numeric(pred_train_naive),as.numeric(train$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.9743421
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here, 39 observations that are actually defaulters are wrongly predicted as non-defaulters; the remaining values are predicted correctly. With 96.5% accuracy, 90% sensitivity, 100% specificity, 95% balanced accuracy, a kappa of 0.92, and an area under the ROC curve of 97%, the model still performs strongly.

KNN CLASSIFICATION

Model Building

library(caret)
repeatedcv<-trainControl(method = "repeatedcv",number = 5,repeats = 3)
model_knn<-train(event ~ ., data=train,method="knn",trControl=repeatedcv)
model_knn
k-Nearest Neighbors 

1117 samples
   7 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 893, 894, 893, 894, 894, 894, ... 
Resampling results across tuning parameters:

  k  Accuracy   Kappa    
  5  0.9991031  0.9979414
  7  0.9991031  0.9979414
  9  0.9991031  0.9979414

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
For k-nearest neighbours, accuracy and kappa are estimated by repeated cross-validation for each candidate k; the three values tried (5, 7, 9) tie on accuracy, and caret settles on k = 9 for the final model.
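By default caret tries only three k values; the search can be widened via tuneGrid. A sketch (the grid bounds are arbitrary):
# try odd k from 3 to 25 instead of caret's default grid
grid<-expand.grid(k = seq(3,25,by = 2))
model_knn2<-train(event ~ ., data=train,method="knn",trControl=repeatedcv,tuneGrid=grid)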

Predict On Train Data

library(caret)
pred_train_knn<-predict(model_knn,train)
confusionMatrix(as.factor(train$event),as.factor(pred_train_knn))
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 357   0
         1   1 759
                                    
               Accuracy : 0.9991    
                 95% CI : (0.995, 1)
    No Information Rate : 0.6795    
    P-Value [Acc > NIR] : <2e-16    
                                    
                  Kappa : 0.9979    
 Mcnemar's Test P-Value : 1         
                                    
            Sensitivity : 0.9972    
            Specificity : 1.0000    
         Pos Pred Value : 1.0000    
         Neg Pred Value : 0.9987    
             Prevalence : 0.3205    
         Detection Rate : 0.3196    
   Detection Prevalence : 0.3196    
      Balanced Accuracy : 0.9986    
                                    
       'Positive' Class : 0         
                                    
library(ROCR)
pr<-prediction(as.numeric(pred_train_knn),as.numeric(train$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.9993421
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here only one observation is misclassified, giving 99.9% accuracy, and the other metrics are similarly strong, so kNN ranks among the best-performing models.

C5.0 DECISION TREE

Model Building

library(C50)
model_c50<-C5.0(event ~ ., data=train)
summary(model_c50)

Call:
C5.0.formula(formula = event ~ ., data = train)


C5.0 [Release 2.07 GPL Edition]     Fri Aug 17 22:00:38 2018
-------------------------------

Class specified by attribute `outcome'

Read 1117 cases (8 attributes) from undefined.data

Decision tree:

y_logistic <= -11.51292: 0 (357)
y_logistic > -11.51292: 1 (760)


Evaluation on training data (1117 cases):

        Decision Tree   
      ----------------  
      Size      Errors  

         2    0( 0.0%)   <<


       (a)   (b)    <-classified as
      ----  ----
       357          (a): class 0
             760    (b): class 1


    Attribute usage:

    100.00% y_logistic


Time: 0.0 secs
plot(model_c50)

Here the root node splits on y_logistic, selected by the information gain ratio computed over all variables, and pruning uses the default confidence factor of 0.25. Notably, the single split at y_logistic = -11.51292 separates the classes perfectly, which makes explicit the complete separation behind the earlier glm warnings.
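The pruning confidence factor can be adjusted through C5.0Control if more aggressive pruning is wanted; a sketch (CF = 0.10 is an arbitrary example, smaller values prune harder):
# lower CF => more aggressive pruning than the 0.25 default
model_c50_pruned<-C5.0(event ~ ., data=train,control = C5.0Control(CF = 0.10))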
  

Predict On Train Data

pred_train_c50<-predict(model_c50,train)
library(caret)
confusionMatrix(as.factor(train$event),as.factor(pred_train_c50))
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 357   0
         1   0 760
                                     
               Accuracy : 1          
                 95% CI : (0.9967, 1)
    No Information Rate : 0.6804     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.3196     
         Detection Rate : 0.3196     
   Detection Prevalence : 0.3196     
      Balanced Accuracy : 1.0000     
                                     
       'Positive' Class : 0          
                                     
library(ROCR)
pr<-prediction(as.numeric(pred_train_c50),as.numeric(train$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here as well, every observation is predicted correctly, for 100% accuracy.
  

RANDOM FOREST(BAGGING)

Model Building

library(randomForest)
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'
The following object is masked from 'package:ggplot2':

    margin
model_rforest<-randomForest(event ~ ., data=train,mtry=3,ntree=500)
model_rforest

Call:
 randomForest(formula = event ~ ., data = train, mtry = 3, ntree = 500) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 0.09%
Confusion matrix:
    0   1 class.error
0 357   0 0.000000000
1   1 759 0.001315789
Here the model is built with 500 trees, which stabilizes the predictions by averaging; the out-of-bag (OOB) error estimate, a built-in proxy for test error, is only 0.09%.
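Rather than fixing mtry at 3, it can be tuned on the OOB error with tuneRF from the same package. A sketch (the step factor and improvement threshold are arbitrary choices):
x<-train[,setdiff(names(train),"event")]  # predictor columns only
tuned<-tuneRF(x,train$event,ntreeTry = 500,stepFactor = 2,improve = 0.01)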

Predict On Train Data

pred_train_rf<-predict(model_rforest,train)
library(caret)
confusionMatrix(as.factor(train$event),as.factor(pred_train_rf))
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 357   0
         1   0 760
                                     
               Accuracy : 1          
                 95% CI : (0.9967, 1)
    No Information Rate : 0.6804     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.3196     
         Detection Rate : 0.3196     
   Detection Prevalence : 0.3196     
      Balanced Accuracy : 1.0000     
                                     
       'Positive' Class : 0          
                                     
library(ROCR)
pr<-prediction(as.numeric(pred_train_rf),as.numeric(train$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here too, every observation is predicted correctly with full accuracy.

ADAPTIVE BOOSTING

Model Building

library(ada)
Loading required package: rpart
model_ada<-ada(event ~ ., data=train,loss='exponential',type='discrete',iter=100)
model_ada
Call:
ada(event ~ ., data = train, loss = "exponential", type = "discrete", 
    iter = 100)

Loss: exponential Method: discrete   Iteration: 100 

Final Confusion Matrix for Data:
          Final Prediction
True value   0   1
         0 357   0
         1   0 760

Train Error: 0 

Out-Of-Bag Error:  0  iteration= 18 

Additional Estimates of number of iterations:

train.err1 train.kap1 
         1          1 
plot(model_ada)

Using discrete AdaBoost with the exponential loss, 100 boosting iterations were run; the training error reaches 0, and the out-of-bag error is already 0 by iteration 18.
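For intuition, discrete AdaBoost reweights the training cases after each round so that the next weak learner focuses on the mistakes; the exponential loss named above drives this update. A self-contained toy sketch of one round (labels coded as -1/+1, data made up):
y<-c(1,1,-1,-1,1)              # toy true labels
h<-c(1,-1,-1,-1,1)             # toy weak-learner predictions (one mistake)
w<-rep(1/length(y),length(y))  # start with uniform case weights
err<-sum(w[h!=y])              # weighted error of the weak learner
alpha<-0.5*log((1-err)/err)    # weight of this learner in the ensemble
w<-w*exp(-alpha*y*h)           # up-weight mistakes, down-weight correct cases
w<-w/sum(w)                    # renormalize to a distribution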

Predict On Train Data

pred_train_ada<-predict(model_ada,train)
library(caret)
confusionMatrix(as.factor(train$event),as.factor(pred_train_ada))
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 357   0
         1   0 760
                                     
               Accuracy : 1          
                 95% CI : (0.9967, 1)
    No Information Rate : 0.6804     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.3196     
         Detection Rate : 0.3196     
   Detection Prevalence : 0.3196     
      Balanced Accuracy : 1.0000     
                                     
       'Positive' Class : 0          
                                     
library(ROCR)
pr<-prediction(as.numeric(pred_train_ada),as.numeric(train$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Predicting on the training data, accuracy and all remaining metrics are perfect, making this another top-performing model.

Now let's compare the performance of all the models side by side and choose the best one.

SUMMARY

Model comparison summary (all values in %):

Model                       Accuracy  Kappa  Sensitivity  Specificity  Balanced_Accuracy  Area_Under_Curve
Binary_logistic_regression     100.0  100.0        100.0          100              100.0             100.0
Naive_bayes                     97.0   93.0         91.0          100               96.0              98.0
knn_classification              99.8   99.6         99.0          100               99.7              99.8
C5.0                           100.0  100.0        100.0          100              100.0             100.0
Random_forest                  100.0  100.0        100.0          100              100.0             100.0
Adaptive_boosting              100.0  100.0        100.0          100              100.0             100.0
All models perform essentially perfectly here, so any of them could be chosen and evaluated on the held-out data to check for bias and variance. Let's choose the random forest (bagging) model to predict on unseen data.

Using Random Forest Bagging

Predicting on test1 and on test2 data

pred_test1<-predict(model_rforest,test1)
library(caret)
confusionMatrix(as.factor(test1$event),as.factor(pred_test1))
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 182   0
         1   1 376
                                     
               Accuracy : 0.9982     
                 95% CI : (0.9901, 1)
    No Information Rate : 0.6726     
    P-Value [Acc > NIR] : <2e-16     
                                     
                  Kappa : 0.9959     
 Mcnemar's Test P-Value : 1          
                                     
            Sensitivity : 0.9945     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 0.9973     
             Prevalence : 0.3274     
         Detection Rate : 0.3256     
   Detection Prevalence : 0.3256     
      Balanced Accuracy : 0.9973     
                                     
       'Positive' Class : 0          
                                     
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.9986737
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here only one observation is misclassified, for 99.8% accuracy, with the remaining metrics close to perfect.
pred_test2<-predict(model_rforest,test2)
library(caret)
confusionMatrix(as.factor(test2$event),as.factor(pred_test2))
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 165   0
         1   0 393
                                     
               Accuracy : 1          
                 95% CI : (0.9934, 1)
    No Information Rate : 0.7043     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.2957     
         Detection Rate : 0.2957     
   Detection Prevalence : 0.2957     
      Balanced Accuracy : 1.0000     
                                     
       'Positive' Class : 0          
                                     
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

Here every observation is classified correctly (100% accuracy).

Let's shuffle the 'lgd1' dataset, split it into train, test1, and test2 sets, and repeat the analysis.

SHUFFLE THE DATASET

set.seed(123)

RESAMPLE THE DATA

library(caret)
splitlgd1<-createDataPartition(lgd1$event,p=0.7,list = FALSE)
train<-lgd1[splitlgd1,]
test<-lgd1[-splitlgd1,]
dim(train)
[1] 1564    8
dim(test)
[1] 670   8
splittest<-createDataPartition(test$event,p=0.6,list = FALSE)
test1<-test[splittest,]
test2<-test[-splittest,]
dim(test1)
[1] 402   8
Here 70% of the lgd1 data goes to the training set and the remaining 30% to the test set; the test set is further split into test1 (60% of the test observations) and test2 (the remaining 40%).

BUILD THE MODEL

library(randomForest)
model_rforest1<-randomForest(event ~ ., data=train,mtry=5,ntree=300)
Warning in randomForest.default(m, y, ...): The response has five or fewer
unique values. Are you sure you want to do regression?
model_rforest1

Call:
 randomForest(formula = event ~ ., data = train, mtry = 5, ntree = 300) 
               Type of random forest: regression
                     Number of trees: 300
No. of variables tried at each split: 5

          Mean of squared residuals: 0.0006394821
                    % Var explained: 99.71
The random forest is rebuilt with 300 trees and mtry = 5. Note the warning above: because event is left numeric in this run (it is not converted to a factor), randomForest operates in regression mode, which is why the predictions below are thresholded at 0.5.
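To get classification mode (and silence the warning), the response can be converted to a factor before fitting, as was done for the earlier models; a sketch:
train$event<-as.factor(train$event)  # factor response => classification forest
model_rf_class<-randomForest(event ~ ., data=train,mtry=5,ntree=300)
pred_class<-predict(model_rf_class,train)  # factor predictions, no 0.5 threshold needed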

PREDICT ON TRAIN DATA

pred<-predict(model_rforest1,train,type="response")
pred_train<-ifelse(pred>0.5,1,0)

library(caret)
confusionMatrix(as.factor(train$event),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0  500    0
         1    0 1064
                                     
               Accuracy : 1          
                 95% CI : (0.9976, 1)
    No Information Rate : 0.6803     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.3197     
         Detection Rate : 0.3197     
   Detection Prevalence : 0.3197     
      Balanced Accuracy : 1.0000     
                                     
       'Positive' Class : 0          
                                     
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

The model again fits the reshuffled training data perfectly.

PREDICT ON TEST1 AND TEST2 DATA

pred<-predict(model_rforest1,test1,type="response")
pred_test1<-ifelse(pred>0.5,1,0)
library(caret)
confusionMatrix(as.factor(test1$event),as.factor(pred_test1))
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 122   0
         1   0 280
                                     
               Accuracy : 1          
                 95% CI : (0.9909, 1)
    No Information Rate : 0.6965     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.3035     
         Detection Rate : 0.3035     
   Detection Prevalence : 0.3035     
      Balanced Accuracy : 1.0000     
                                     
       'Positive' Class : 0          
                                     
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

pred<-predict(model_rforest1,test2,type="response")
pred_test2<-ifelse(pred>0.5,1,0)
library(caret)
confusionMatrix(as.factor(test2$event),as.factor(pred_test2))
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0  82   0
         1   0 186
                                     
               Accuracy : 1          
                 95% CI : (0.9863, 1)
    No Information Rate : 0.694      
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.000      
            Specificity : 1.000      
         Pos Pred Value : 1.000      
         Neg Pred Value : 1.000      
             Prevalence : 0.306      
         Detection Rate : 0.306      
   Detection Prevalence : 0.306      
      Balanced Accuracy : 1.000      
                                     
       'Positive' Class : 0          
                                     
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

It performs equally well on the test1 and test2 data.

Let's shuffle the lgd1 dataset once more, split it into train, test1, and test2 sets, and check the performance again.

RESHUFFLE THE DATASET

set.seed(1234)

RESAMPLE THE DATA

library(caret)
splitlgd1<-createDataPartition(lgd1$event,p=0.4,list = FALSE)
train<-lgd1[splitlgd1,]
test<-lgd1[-splitlgd1,]
dim(train)
[1] 894   8
dim(test)
[1] 1340    8
splittest<-createDataPartition(test$event,p=0.6,list = FALSE)
test1<-test[splittest,]
test2<-test[-splittest,]
dim(test1)
[1] 804   8
Here 40% of the lgd1 data is taken as the training set; of the remaining test data, 60% goes to test1 and the other 40% to test2.

BUILD THE MODEL

library(randomForest)
model_rforest2<-randomForest(event ~ ., data=train,mtry=3,ntree=200)
Warning in randomForest.default(m, y, ...): The response has five or fewer
unique values. Are you sure you want to do regression?
model_rforest2

Call:
 randomForest(formula = event ~ ., data = train, mtry = 3, ntree = 200) 
               Type of random forest: regression
                     Number of trees: 200
No. of variables tried at each split: 3

          Mean of squared residuals: 0.0002460305
                    % Var explained: 99.88
Here the model is built with 200 trees and mtry = 3; as before, event is numeric, so the forest runs in regression mode and predictions are thresholded at 0.5.

PREDICT ON TRAIN DATA

pred<-predict(model_rforest2,train,type="response")
pred_train<-ifelse(pred>0.5,1,0)
library(caret)
confusionMatrix(as.factor(train$event),as.factor(pred_train))
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 262   0
         1   0 632
                                     
               Accuracy : 1          
                 95% CI : (0.9959, 1)
    No Information Rate : 0.7069     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.2931     
         Detection Rate : 0.2931     
   Detection Prevalence : 0.2931     
      Balanced Accuracy : 1.0000     
                                     
       'Positive' Class : 0          
                                     
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

PREDICT ON TEST1 AND TEST2 DATA

pred<-predict(model_rforest2,test1,type="response")
pred_test1<-ifelse(pred>0.5,1,0)
library(caret)
confusionMatrix(as.factor(test1$event),as.factor(pred_test1))
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 260   0
         1   0 544
                                     
               Accuracy : 1          
                 95% CI : (0.9954, 1)
    No Information Rate : 0.6766     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.3234     
         Detection Rate : 0.3234     
   Detection Prevalence : 0.3234     
      Balanced Accuracy : 1.0000     
                                     
       'Positive' Class : 0          
                                     
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

pred<-predict(model_rforest2,test2,type = "response")
pred_test2<-ifelse(pred>0.5,1,0)
library(caret)
confusionMatrix(as.factor(test2$event),as.factor(pred_test2))
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 182   0
         1   1 353
                                     
               Accuracy : 0.9981     
                 95% CI : (0.9896, 1)
    No Information Rate : 0.6586     
    P-Value [Acc > NIR] : <2e-16     
                                     
                  Kappa : 0.9958     
 Mcnemar's Test P-Value : 1          
                                     
            Sensitivity : 0.9945     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 0.9972     
             Prevalence : 0.3414     
         Detection Rate : 0.3396     
   Detection Prevalence : 0.3396     
      Balanced Accuracy : 0.9973     
                                     
       'Positive' Class : 0          
                                     
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.9985876
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)

This model too performs well on the train, test1, and test2 data.

Now let's summarize the random forest's performance across the different splits created from the lgd1 data.

SUMMARY

FINAL SUMMARY USING RANDOM FOREST (all values in %)

DATA    ACCURACY  KAPPA  SENSITIVITY  SPECIFICITY  BALANCED_ACCURACY  AREA_UNDER_CURVE

First split (50/50):
Train      100.0  100.0        100.0          100              100.0             100.0
Test1       99.6   99.1         98.8          100               99.4              99.7
Test2       99.8   99.6         99.4          100               99.7              99.8

Second split (seed 123, 70/30):
Train      100.0  100.0        100.0          100              100.0             100.0
Test1      100.0  100.0        100.0          100              100.0             100.0
Test2      100.0  100.0        100.0          100              100.0             100.0

Third split (seed 1234, 40/60):
Train      100.0  100.0        100.0          100              100.0             100.0
Test1      100.0  100.0        100.0          100              100.0             100.0
Test2       99.8   99.5         99.4          100               99.7              99.8

Performance is consistently high across all three resampling runs, which suggests the model generalizes well (low variance) rather than merely memorizing the training data.

CONCLUSION

All the models performed extremely well on this dataset; we used the random forest (bagging) algorithm and validated it across multiple train/test splits to keep bias and variance in check. It is worth noting that the near-perfect accuracy largely reflects the fact that the transformed LGD variables (y_logistic, Y_probit) separate defaulters from non-defaulters exactly, as the single-split C5.0 tree makes explicit.
Within this dataset, defaulters can therefore be identified reliably using the random forest model.