Can a customer be classified as a defaulter or non-defaulter using the lgd1 dataset?
The dataset is provided by a European bank and has been slightly modified and anonymized. It includes 2,545 observations on loans and LGDs.
library(readxl)
lgd1<-read_excel("D:/projects/lgd/lgd.xlsx")
View(lgd1)
| LTV | Recovery_rate | lgd_time | y_logistic | lnrr | Y_probit | purpose1 | event |
|---|---|---|---|---|---|---|---|
| 0.2140781 | 0.6980155 | 0.3019845 | -0.8378657 | -0.3595139 | -0.5187014 | 0 | 1 |
| 0.2140781 | 0.7800841 | 0.2199159 | -1.2661566 | -0.2483535 | -0.7724773 | 0 | 1 |
| 0.2140781 | 0.7022869 | 0.2977131 | -0.8582118 | -0.3534132 | -0.5309894 | 0 | 1 |
| 0.2140781 | 0.7539889 | 0.2460111 | -1.1200010 | -0.2823776 | -0.6870961 | 0 | 1 |
| 0.2140781 | 0.8028126 | 0.1971874 | -1.4039668 | -0.2196340 | -0.8517105 | 0 | 1 |
| 0.0273835 | 0.9921424 | 0.0078576 | -4.8383828 | -0.0078887 | -2.4154625 | 0 | 1 |
| 0.0626695 | 0.9999900 | 0.0000100 | -11.5129155 | 0.0000000 | -4.2648908 | 0 | 0 |
| 0.2261305 | 0.9806656 | 0.0193344 | -3.9263453 | -0.0195238 | -2.0676939 | 0 | 1 |
| 0.0073802 | 0.9920494 | 0.0079506 | -4.8265298 | -0.0079823 | -2.4111769 | 0 | 1 |
| 0.0629311 | 0.9999900 | 0.0000100 | -11.5129155 | 0.0000000 | -4.2648908 | 0 | 0 |
LTV (loan-to-value ratio) is a risk-assessment measure: the amount of the mortgage (loan) divided by the appraised (estimated) value of the property. Generally, high LTV ratios are associated with higher risk (expressed as a ratio).
Recovery_rate is the value of a security when it emerges from default, i.e. 1 minus the loss given default (LGD) (expressed as a ratio).
lgd_time (loss given default) is the share of the exposure that is not expected to be recovered in the event of default (expressed as a ratio).
y_logistic is the logistic transformation of the loss given default (LGD).
lnrr is the natural logarithm of the recovery rate.
Y_probit is the probit transformation of the loss given default (LGD).
purpose1 indicates the purpose of the loan ('1' for rent purposes and '0' for others).
event indicates whether the customer defaulted ('1' for default and '0' for non-default).
Here, event will be the outcome variable, since whether a customer is a defaulter or non-defaulter is determined by the remaining seven variables, which serve as predictors.
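Before modelling, a quick look at the class balance of the outcome (a minimal sketch):
table(lgd1$event)              # counts of non-defaulters (0) and defaulters (1)
prop.table(table(lgd1$event))  # the same counts as proportions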
str(lgd1)
Classes 'tbl_df', 'tbl' and 'data.frame': 2545 obs. of 8 variables:
$ LTV : num 0.214 0.214 0.214 0.214 0.214 ...
$ Recovery_rate: num 0.698 0.78 0.702 0.754 0.803 ...
$ lgd_time : num 0.302 0.22 0.298 0.246 0.197 ...
$ y_logistic : num -0.838 -1.266 -0.858 -1.12 -1.404 ...
$ lnrr : num -0.36 -0.248 -0.353 -0.282 -0.22 ...
$ Y_probit : num -0.519 -0.772 -0.531 -0.687 -0.852 ...
$ purpose1 : num 0 0 0 0 0 0 0 0 0 0 ...
$ event : num 1 1 1 1 1 1 0 1 1 0 ...
sum(is.na(lgd1))
[1] 0
summary(lgd1)
LTV Recovery_rate lgd_time y_logistic
Min. :0.001359 Min. :0.00001 Min. :0.00001 Min. :-11.5129
1st Qu.:0.399180 1st Qu.:0.60215 1st Qu.:0.00001 1st Qu.:-11.5129
Median :0.659417 Median :0.96793 Median :0.03207 Median : -3.4074
Mean :0.676556 Mean :0.77187 Mean :0.22813 Mean : -3.9413
3rd Qu.:0.923548 3rd Qu.:0.99999 3rd Qu.:0.39785 3rd Qu.: -0.4144
Max. :1.984065 Max. :0.99999 Max. :0.99999 Max. : 11.5129
lnrr Y_probit purpose1 event
Min. :-11.51292 Min. :-4.2649 Min. :0.00000 Min. :0.0000
1st Qu.: -0.50726 1st Qu.:-4.2649 1st Qu.:0.00000 1st Qu.:0.0000
Median : -0.03259 Median :-1.8513 Median :0.00000 Median :1.0000
Mean : -0.99665 Mean :-1.6508 Mean :0.07269 Mean :0.7139
3rd Qu.: 0.00000 3rd Qu.:-0.2589 3rd Qu.:0.00000 3rd Qu.:1.0000
Max. : 0.00000 Max. : 4.2649 Max. :1.00000 Max. :1.0000
The checks above clearly show that there are no missing values.
boxplot(lgd1)
dim(lgd1)
[1] 2545 8
From the boxplot we can see that several variables contain outliers. For 'lnrr' and 'purpose1' in particular, a few values lie far even from the other outliers, so we remove these extreme points, since they may add variance and bias to the model; the sketch below shows what the standard boxplot rule flags before the cutoff is applied.
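Before dropping rows, it helps to see what the standard boxplot rule actually flags (a minimal sketch; the -6 cutoff used next is a judgment call, not a standard threshold):
out_lnrr <- boxplot.stats(lgd1$lnrr)$out  # values flagged by the 1.5*IQR rule
length(out_lnrr)                          # how many points are flagged
sum(lgd1$lnrr <= -6)                      # how many rows the chosen cutoff removes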
lgd1<-lgd1[lgd1$lnrr>-6,]
dim(lgd1)
[1] 2396 8
lgd1<-lgd1[!lgd1$purpose1==1,]
dim(lgd1)
[1] 2234 8
boxplot(lgd1)
Here the dataset is reduced by removing the extreme outliers from 'lnrr' and the purpose1 == 1 rows; a new boxplot of this data shows no remaining extreme outliers. (Note that dropping purpose1 == 1 removes that loan-purpose category entirely, leaving purpose1 constant at 0.)
Now the data is split into a train set with 1,117 observations, a test1 set with 559 observations, and a test2 set with 558 observations.
library(caret)
Loading required package: lattice
Loading required package: ggplot2
splitlgd1<-createDataPartition(lgd1$event,p=0.5,list = FALSE)
train<-lgd1[splitlgd1,]
test<-lgd1[-splitlgd1,]
dim(train)
[1] 1117 8
dim(test)
[1] 1117 8
splittest<-createDataPartition(test$event,p=0.5,list = FALSE)
test1<-test[splittest,]
test2<-test[-splittest,]
dim(test1)
[1] 559 8
dim(test2)
[1] 558 8
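Note that no seed was set before this first partition, so the exact split will differ between runs (the later splits in this analysis do set seeds). A sketch of making it reproducible, assuming any fixed seed value:
set.seed(42)  # hypothetical seed value; any constant makes the partition repeatable
splitlgd1 <- createDataPartition(lgd1$event, p = 0.5, list = FALSE)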
Now let's build models with several different algorithms, choose the best of them based on various parameters, and predict the outcomes with that best model. Note that at this stage we predict only on the train data, in order to compare the models.
train$event<-as.factor(train$event)
class(train$event)
[1] "factor"
model_bin<-glm(event ~ ., data=train,family = binomial(link = "logit"))
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
pred_train_bin<-ifelse(model_bin$fitted.values>0.5,1,0)
library(caret)
confusionMatrix(as.factor(train$event),as.factor(pred_train_bin))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 357 0
1 0 760
Accuracy : 1
95% CI : (0.9967, 1)
No Information Rate : 0.6804
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.3196
Detection Rate : 0.3196
Detection Prevalence : 0.3196
Balanced Accuracy : 1.0000
'Positive' Class : 0
library(ROCR)
Loading required package: gplots
Attaching package: 'gplots'
The following object is masked from 'package:stats':
lowess
pr<-prediction(as.numeric(pred_train_bin),as.numeric(train$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
After building the binary logistic regression model we predict the outcomes on the train data; all 1,117 training observations are predicted correctly, for 100% accuracy. (The warnings above indicate complete separation: at least one predictor splits the two classes perfectly, which is why the fit is perfect and the algorithm did not converge.)
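The separation behind those warnings can be confirmed directly (a sketch; the -11.51292 boundary is taken from the C5.0 tree fitted later in this analysis):
# cross-tabulate the outcome against the suspected separating condition;
# complete separation shows up as zeros off the diagonal
table(train$event, train$y_logistic <= -11.51292)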
metric<-"Accuracy"
library(naivebayes)
model_naive<-naive_bayes(event ~ ., data=train,metric=metric)
model_naive
===================== Naive Bayes =====================
Call:
naive_bayes.formula(formula = event ~ ., data = train, metric = metric)
A priori probabilities:
0 1
0.3196061 0.6803939
Tables:
LTV 0 1
mean 0.4908501 0.7462954
sd 0.2732269 0.3606723
Recovery_rate 0 1
mean 0.9999900 0.7504959
sd 0.0000000 0.2903258
lgd_time 0 1
mean 0.0000100 0.2495041
sd 0.0000000 0.2903258
y_logistic 0 1
mean -11.512915 -2.048721
sd 0.000000 2.529703
lnrr 0 1
mean 0.0000000 -0.4885591
sd 0.0000000 0.9083414
# ... and 2 more tables
Here, the naive Bayes method combines the prior class probabilities with per-variable likelihoods to produce posterior probabilities, and we obtain the mean and standard deviation of each variable separately for defaulters and non-defaulters.
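The a priori probabilities printed above are simply the class proportions in the training data, which can be verified directly:
prop.table(table(train$event))  # should reproduce the a priori probabilities above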
library(caret)
pred_train_naive<-predict(model_naive,train)
confusionMatrix(as.factor(train$event),as.factor(pred_train_naive))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 357 0
1 39 721
Accuracy : 0.9651
95% CI : (0.9526, 0.9751)
No Information Rate : 0.6455
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.922
Mcnemar's Test P-Value : 1.166e-09
Sensitivity : 0.9015
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.9487
Prevalence : 0.3545
Detection Rate : 0.3196
Detection Prevalence : 0.3196
Balanced Accuracy : 0.9508
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train_naive),as.numeric(train$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.9743421
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
Here, 39 actual defaulters are wrongly predicted as non-defaulters; all remaining values are predicted correctly. Accuracy is about 96.5%, the true positive rate about 90%, the true negative rate 100%, balanced accuracy about 95%, and the area under the ROC curve about 97%; with a kappa of 0.92, the model agrees with the true labels well beyond chance.
library(caret)
repeatedcv<-trainControl(method = "repeatedcv",number = 5,repeats = 3)
model_knn<-train(event ~ ., data=train,method="knn",trControl=repeatedcv)
model_knn
k-Nearest Neighbors
1117 samples
7 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times)
Summary of sample sizes: 893, 894, 893, 894, 894, 894, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.9991031 0.9979414
7 0.9991031 0.9979414
9 0.9991031 0.9979414
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
In k-nearest neighbours, accuracy and kappa are calculated for several values of k; all three candidate values tie on accuracy, and the final model uses k = 9, the largest of the tied values.
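By default caret tried only k = 5, 7, and 9; a wider grid can be supplied explicitly through the tuneGrid argument if larger neighbourhoods are worth exploring (a sketch):
grid <- expand.grid(k = seq(3, 21, by = 2))  # odd values of k avoid voting ties
model_knn_wide <- train(event ~ ., data = train, method = "knn",
                        trControl = repeatedcv, tuneGrid = grid)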
library(caret)
pred_train_knn<-predict(model_knn,train)
confusionMatrix(as.factor(train$event),as.factor(pred_train_knn))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 357 0
1 1 759
Accuracy : 0.9991
95% CI : (0.995, 1)
No Information Rate : 0.6795
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9979
Mcnemar's Test P-Value : 1
Sensitivity : 0.9972
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.9987
Prevalence : 0.3205
Detection Rate : 0.3196
Detection Prevalence : 0.3196
Balanced Accuracy : 0.9986
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train_knn),as.numeric(train$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.9993421
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
Here all but one of the training values are predicted correctly (99.9% accuracy), and with the other parameters also near their best we can consider this one of the best models.
library(C50)
model_c50<-C5.0(event ~ ., data=train)
summary(model_c50)
Call:
C5.0.formula(formula = event ~ ., data = train)
C5.0 [Release 2.07 GPL Edition] Fri Aug 17 22:00:38 2018
-------------------------------
Class specified by attribute `outcome'
Read 1117 cases (8 attributes) from undefined.data
Decision tree:
y_logistic <= -11.51292: 0 (357)
y_logistic > -11.51292: 1 (760)
Evaluation on training data (1117 cases):
Decision Tree
----------------
Size Errors
2 0( 0.0%) <<
(a) (b) <-classified as
---- ----
357 (a): class 0
760 (b): class 1
Attribute usage:
100.00% y_logistic
Time: 0.0 secs
plot(model_c50)
Here, y_logistic is chosen as the root node based on the information gain ratio calculated over all the variables, and pruning uses a confidence factor of 0.25 (the C5.0 default).
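The 0.25 confidence factor can also be set explicitly via C5.0Control if stronger or weaker pruning is wanted (a sketch):
# a lower CF prunes more aggressively; 0.25 reproduces the default behaviour
model_c50_cf <- C5.0(event ~ ., data = train, control = C5.0Control(CF = 0.25))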
pred_train_c50<-predict(model_c50,train)
library(caret)
confusionMatrix(as.factor(train$event),as.factor(pred_train_c50))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 357 0
1 0 760
Accuracy : 1
95% CI : (0.9967, 1)
No Information Rate : 0.6804
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.3196
Detection Rate : 0.3196
Detection Prevalence : 0.3196
Balanced Accuracy : 1.0000
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train_c50),as.numeric(train$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
Here too we can see that all values are predicted correctly, with 100% accuracy.
library(randomForest)
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:ggplot2':
margin
model_rforest<-randomForest(event ~ ., data=train,mtry=3,ntree=500)
model_rforest
Call:
randomForest(formula = event ~ ., data = train, mtry = 3, ntree = 500)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 0.09%
Confusion matrix:
0 1 class.error
0 357 0 0.000000000
1 1 759 0.001315789
Here the model is built with 500 trees, which reduces variance, and the out-of-bag (OOB) error estimate, a built-in proxy for test error, is only 0.09%.
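Given that the C5.0 tree needed only y_logistic, it is worth checking which variables the forest relies on (a sketch using randomForest's built-in importance measures):
importance(model_rforest)  # mean decrease in Gini impurity per variable
varImpPlot(model_rforest)  # the same importances, plotted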
pred_train_rf<-predict(model_rforest,train)
library(caret)
confusionMatrix(as.factor(train$event),as.factor(pred_train_rf))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 357 0
1 0 760
Accuracy : 1
95% CI : (0.9967, 1)
No Information Rate : 0.6804
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.3196
Detection Rate : 0.3196
Detection Prevalence : 0.3196
Balanced Accuracy : 1.0000
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train_rf),as.numeric(train$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
Here as well, all values are predicted correctly with full accuracy.
library(ada)
Loading required package: rpart
model_ada<-ada(event ~ ., data=train,loss='exponential',type='discrete',iter=100)
model_ada
Call:
ada(event ~ ., data = train, loss = "exponential", type = "discrete",
iter = 100)
Loss: exponential Method: discrete Iteration: 100
Final Confusion Matrix for Data:
Final Prediction
True value 0 1
0 357 0
1 0 760
Train Error: 0
Out-Of-Bag Error: 0 iteration= 18
Additional Estimates of number of iterations:
train.err1 train.kap1
1 1
plot(model_ada)
Using exponential adaptive boosting, 100 iterations were executed; the training error is zero throughout, and the out-of-bag error reaches zero by iteration 18.
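The ada package also offers a variable importance plot for the boosted ensemble (a sketch):
varplot(model_ada)  # importance scores of the predictors in the boosted model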
pred_train_ada<-predict(model_ada,train)
library(caret)
confusionMatrix(as.factor(train$event),as.factor(pred_train_ada))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 357 0
1 0 760
Accuracy : 1
95% CI : (0.9967, 1)
No Information Rate : 0.6804
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.3196
Detection Rate : 0.3196
Detection Prevalence : 0.3196
Balanced Accuracy : 1.0000
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train_ada),as.numeric(train$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
Predicting on the train data, the accuracy and all remaining parameters are perfect, which makes this another candidate for the best model.
Now let's summarize the performance of all the models (values in %) and choose the best among them.
| parameters | Binary_logistic_regression | Naive_bayes | knn_classification | C5.0 | Random_forest | Adaptive_boosting |
|---|---|---|---|---|---|---|
| Accuracy | 100 | 96.5 | 99.9 | 100 | 100 | 100 |
| Kappa | 100 | 92.2 | 99.8 | 100 | 100 | 100 |
| Sensitivity | 100 | 90.2 | 99.7 | 100 | 100 | 100 |
| Specificity | 100 | 100 | 100.0 | 100 | 100 | 100 |
| Balanced_Accuracy | 100 | 95.1 | 99.9 | 100 | 100 | 100 |
| Area_Under_Curve | 100 | 97.4 | 99.9 | 100 | 100 | 100 |
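Since the confusionMatrix/AUC block is repeated for every model, a small helper keeps the comparison consistent (a sketch; note that caret::confusionMatrix expects the predictions as its first argument and the truth as reference, whereas the calls above pass them in the opposite order, which transposes the printed table):
evaluate <- function(truth, pred) {
  # confusion-matrix statistics, with arguments in the order caret documents
  cm <- caret::confusionMatrix(as.factor(pred), reference = as.factor(truth))
  # AUC via ROCR, mirroring the calls used throughout this analysis
  pr <- ROCR::prediction(as.numeric(pred), as.numeric(truth))
  auc <- ROCR::performance(pr, measure = "auc")@y.values[[1]]
  list(accuracy = unname(cm$overall["Accuracy"]),
       kappa = unname(cm$overall["Kappa"]),
       auc = auc)
}
evaluate(train$event, pred_train_knn)  # e.g. the kNN results summarized above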
Here we can infer that all models perform nearly perfectly on the training data, so any of them could be chosen and evaluated on the held-out data to check for overfitting. Let's choose the random forest bagging model to predict on future data.
pred_test1<-predict(model_rforest,test1)
library(caret)
confusionMatrix(as.factor(test1$event),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 182 0
1 1 376
Accuracy : 0.9982
95% CI : (0.9901, 1)
No Information Rate : 0.6726
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9959
Mcnemar's Test P-Value : 1
Sensitivity : 0.9945
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.9973
Prevalence : 0.3274
Detection Rate : 0.3256
Detection Prevalence : 0.3256
Balanced Accuracy : 0.9973
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.9986737
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
Here only one value is misclassified, for an accuracy of 99.8%, and the remaining parameters are at their best.
pred_test2<-predict(model_rforest,test2)
library(caret)
confusionMatrix(as.factor(test2$event),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 165 0
1 0 393
Accuracy : 1
95% CI : (0.9934, 1)
No Information Rate : 0.7043
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.2957
Detection Rate : 0.2957
Detection Prevalence : 0.2957
Balanced Accuracy : 1.0000
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
Here every value is classified correctly, with 100% accuracy.
Let's re-split the 'lgd1' dataset (this time with a fixed seed) into train, test1, and test2 sets and repeat the analysis on them.
set.seed(123)
library(caret)
splitlgd1<-createDataPartition(lgd1$event,p=0.7,list = FALSE)
train<-lgd1[splitlgd1,]
test<-lgd1[-splitlgd1,]
dim(train)
[1] 1564 8
dim(test)
[1] 670 8
splittest<-createDataPartition(test$event,p=0.6,list = FALSE)
test1<-test[splittest,]
test2<-test[-splittest,]
dim(test1)
[1] 402 8
Here 70% of the lgd1 data goes to the train set and the remaining 30% to the test set; the test set is further split into test1 (60% of the test values) and test2 (the remaining 40%).
library(randomForest)
model_rforest1<-randomForest(event ~ ., data=train,mtry=5,ntree=300)
Warning in randomForest.default(m, y, ...): The response has five or fewer
unique values. Are you sure you want to do regression?
model_rforest1
Call:
randomForest(formula = event ~ ., data = train, mtry = 5, ntree = 300)
Type of random forest: regression
Number of trees: 300
No. of variables tried at each split: 5
Mean of squared residuals: 0.0006394821
% Var explained: 99.71
Because event was left numeric in this run, randomForest fit a regression forest with 300 trees and mtry = 5 (hence the warning above), so it outputs continuous scores in [0, 1] that are thresholded at 0.5 below.
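To obtain a classification forest instead (and avoid both the warning and the manual threshold), the response can be converted to a factor before fitting (a sketch on a copy of the data, so the code below is unaffected):
train_cls <- train
train_cls$event <- as.factor(train_cls$event)  # factor response => classification forest
model_rforest1_cls <- randomForest(event ~ ., data = train_cls, mtry = 5, ntree = 300)
pred_cls <- predict(model_rforest1_cls, train_cls)  # class labels directly, no thresholding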
pred<-predict(model_rforest1,train,type="response")
pred_train<-ifelse(pred>0.5,1,0)
library(caret)
confusionMatrix(as.factor(train$event),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 500 0
1 0 1064
Accuracy : 1
95% CI : (0.9976, 1)
No Information Rate : 0.6803
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.3197
Detection Rate : 0.3197
Detection Prevalence : 0.3197
Balanced Accuracy : 1.0000
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
It performs perfectly on the re-split train data.
pred<-predict(model_rforest1,test1,type="response")
pred_test1<-ifelse(pred>0.5,1,0)
library(caret)
confusionMatrix(as.factor(test1$event),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 122 0
1 0 280
Accuracy : 1
95% CI : (0.9909, 1)
No Information Rate : 0.6965
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.3035
Detection Rate : 0.3035
Detection Prevalence : 0.3035
Balanced Accuracy : 1.0000
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
pred<-predict(model_rforest1,test2,type="response")
pred_test2<-ifelse(pred>0.5,1,0)
library(caret)
confusionMatrix(as.factor(test2$event),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 82 0
1 0 186
Accuracy : 1
95% CI : (0.9863, 1)
No Information Rate : 0.694
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.000
Specificity : 1.000
Pos Pred Value : 1.000
Neg Pred Value : 1.000
Prevalence : 0.306
Detection Rate : 0.306
Detection Prevalence : 0.306
Balanced Accuracy : 1.000
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
It performed well on the test1 and test2 data too.
Let's re-split the lgd1 dataset once more into train, test1, and test2 sets and check the performance.
set.seed(1234)
library(caret)
splitlgd1<-createDataPartition(lgd1$event,p=0.4,list = FALSE)
train<-lgd1[splitlgd1,]
test<-lgd1[-splitlgd1,]
dim(train)
[1] 894 8
dim(test)
[1] 1340 8
splittest<-createDataPartition(test$event,p=0.6,list = FALSE)
test1<-test[splittest,]
test2<-test[-splittest,]
dim(test1)
[1] 804 8
Here 40% of the lgd1 data is taken as the train set; of the remaining test data, 60% goes to test1 and 40% to test2.
library(randomForest)
model_rforest2<-randomForest(event ~ ., data=train,mtry=3,ntree=200)
Warning in randomForest.default(m, y, ...): The response has five or fewer
unique values. Are you sure you want to do regression?
model_rforest2
Call:
randomForest(formula = event ~ ., data = train, mtry = 3, ntree = 200)
Type of random forest: regression
Number of trees: 200
No. of variables tried at each split: 3
Mean of squared residuals: 0.0002460305
% Var explained: 99.88
Here the model is built with 200 trees and mtry = 3; as before, the numeric response makes it a regression forest.
pred<-predict(model_rforest2,train,type="response")
pred_train<-ifelse(pred>0.5,1,0)
library(caret)
confusionMatrix(as.factor(train$event),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 262 0
1 0 632
Accuracy : 1
95% CI : (0.9959, 1)
No Information Rate : 0.7069
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.2931
Detection Rate : 0.2931
Detection Prevalence : 0.2931
Balanced Accuracy : 1.0000
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
pred<-predict(model_rforest2,test1,type="response")
pred_test1<-ifelse(pred>0.5,1,0)
library(caret)
confusionMatrix(as.factor(test1$event),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 260 0
1 0 544
Accuracy : 1
95% CI : (0.9954, 1)
No Information Rate : 0.6766
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.3234
Detection Rate : 0.3234
Detection Prevalence : 0.3234
Balanced Accuracy : 1.0000
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 1
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
pred<-predict(model_rforest2,test2,type = "response")
pred_test2<-ifelse(pred>0.5,1,0)
library(caret)
confusionMatrix(as.factor(test2$event),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 182 0
1 1 353
Accuracy : 0.9981
95% CI : (0.9896, 1)
No Information Rate : 0.6586
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9958
Mcnemar's Test P-Value : 1
Sensitivity : 0.9945
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.9972
Prevalence : 0.3414
Detection Rate : 0.3396
Detection Prevalence : 0.3396
Balanced Accuracy : 0.9973
'Positive' Class : 0
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$event))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.9985876
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
This model too performed well on the train, test1, and test2 data.
Now let's compare the performance of the random forest models across the three splits of the lgd1 data (values in %).
| DATA | ACCURACY | KAPPA | SENSITIVITY | SPECIFICITY | BALANCED_ACCURACY | AREA_UNDER_CURVE |
|---|---|---|---|---|---|---|
| Split 1 (p=0.5): Train | 100.0 | 100.0 | 100.0 | 100 | 100.0 | 100.0 |
| Split 1: Test1 | 99.8 | 99.6 | 99.4 | 100 | 99.7 | 99.9 |
| Split 1: Test2 | 100.0 | 100.0 | 100.0 | 100 | 100.0 | 100.0 |
| Split 2 (p=0.7): Train | 100.0 | 100.0 | 100.0 | 100 | 100.0 | 100.0 |
| Split 2: Test1 | 100.0 | 100.0 | 100.0 | 100 | 100.0 | 100.0 |
| Split 2: Test2 | 100.0 | 100.0 | 100.0 | 100 | 100.0 | 100.0 |
| Split 3 (p=0.4): Train | 100.0 | 100.0 | 100.0 | 100 | 100.0 | 100.0 |
| Split 3: Test1 | 100.0 | 100.0 | 100.0 | 100 | 100.0 | 100.0 |
| Split 3: Test2 | 99.8 | 99.6 | 99.4 | 100 | 99.7 | 99.9 |
The performance is very good: all parameters stay consistent across the splits, which indicates that the model generalizes with low variance.
All of the models performed very well on this dataset; we chose the random forest bagging algorithm and it predicted the held-out data accurately while keeping bias and variance in check.
We can now identify defaulters reliably using the random forest model.