Importing data

library(readxl)
# Read the HMEQ workbook; adjust the path to your local copy
hmeq <- read_excel("d:/ds/hmeq.xlsx")
loans <- hmeq

The data set HMEQ reports characteristics and delinquency information for 5,960 home equity loans. A home equity loan is a loan where the obligor uses the equity of his or her home as the underlying collateral.

Data Preprocessing

# Keep only rows without any missing value
x <- na.omit(loans)
class(x$BAD)

## [1] "numeric"

# Treat the target and the categorical predictors as factors
x$BAD <- as.factor(x$BAD)
x$JOB <- as.factor(x$JOB)
x$REASON <- as.factor(x$REASON)
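
Because na.omit() drops every row that has at least one missing value, it can be worth checking how many values each column is missing before relying on it. A minimal sketch on a made-up data frame (the toy values are invented for illustration and are not taken from HMEQ):

```r
# Toy data standing in for `loans`; values are invented for illustration.
toy <- data.frame(
  BAD     = c(0, 1, 0, 0, 1),
  DEROG   = c(0, NA, 1, 0, NA),
  DEBTINC = c(35.2, NA, 28.1, 40.3, 31.0)
)

# Number of missing values per column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Rows that survive complete-case filtering
complete_rows <- nrow(na.omit(toy))
cat("rows kept:", complete_rows, "of", nrow(toy), "\n")
```

On the real data the same check would be colSums(is.na(loans)). The training and test sizes reported later (2,691 and 673) imply that 3,364 of the 5,960 rows remain after na.omit().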

Splitting the data

library(caret)

## Warning: package 'caret' was built under R version 3.5.1

## Loading required package: lattice

## Loading required package: ggplot2

# Random 80/20 split; no seed is set, so the split differs on every run
split <- sample(nrow(x), nrow(x) * 0.8)
train <- x[split, ]
test <- x[-split, ]
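
The split above is neither seeded nor stratified, so results change between runs and the share of defaults can differ between the two sets. Below is a sketch of a seeded, stratified 80/20 split in base R on invented data; the seed value and the toy data frame are assumptions made for the example. caret's createDataPartition() offers a comparable stratified split.

```r
set.seed(42)  # arbitrary seed, chosen only for reproducibility

# Toy data: 100 rows, 20% in class "1" (invented for illustration)
toy <- data.frame(id = 1:100,
                  BAD = factor(rep(c("0", "1"), times = c(80, 20))))

# Sample 80% of the row indices within each class separately
rows_by_class <- split(seq_len(nrow(toy)), toy$BAD)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = floor(0.8 * length(idx)))]
}))

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# Class shares are the same in both parts
print(prop.table(table(train_toy$BAD)))
print(prop.table(table(test_toy$BAD)))
```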

Exploratory data analysis

ggplot(loans,aes(BAD))+geom_bar()

Comparing clients who repaid their home equity loans with those who did not.

# Convert BAD to a factor so it can be used as the fill colour
loans$BAD <- as.factor(loans$BAD)
ggplot(loans, aes(JOB)) + geom_bar(aes(fill = BAD)) + ggtitle("Job description vs defaulted loans")

ggplot(loans,aes(DEROG))+geom_bar(aes(fill=BAD))

## Warning: Removed 708 rows containing non-finite values (stat_count).

Clients with few derogatory reports account for more defaulted loans, in absolute numbers, than clients with many derogatory reports. Note that the chart shows counts, not default rates.
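
A stacked bar chart of counts can hide the default rate within each group, because the groups differ in size. Below is a small sketch that computes the share of defaults per DEROG level with prop.table(), on invented data (the toy values are assumptions, not HMEQ figures). In ggplot2, geom_bar(position = "fill") draws the same proportions.

```r
# Invented toy data: DEROG level and default flag
toy <- data.frame(
  DEROG = c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2),
  BAD   = c(0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1)
)

# Counts per DEROG level and outcome
counts <- table(DEROG = toy$DEROG, BAD = toy$BAD)
print(counts)

# Row-wise proportions: share of each outcome within a DEROG level
rates <- prop.table(counts, margin = 1)
print(round(rates, 2))

# Default rate per DEROG level
default_rate <- rates[, "1"]
```

In this toy example the level with the most defaults by count (DEROG = 0) also has the lowest default rate, which is the pattern a count-based chart cannot show.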

ggplot(loans,aes(NINQ))+geom_bar(aes(fill=BAD))+ggtitle("Number of recent credit inquiries vs clients")

## Warning: Removed 510 rows containing non-finite values (stat_count).

ggplot(x,aes(DEBTINC,BAD))+geom_point()+ggtitle("Debt-income ratio vs defaulted loans")

Beyond a certain debt-to-income ratio, all clients in the plot have defaulted. A high ratio means low income relative to debt, which makes default more likely.

Model building

library(caret)
# Logistic regression on all predictors, fitted through caret
model <- train(BAD ~ ., data = train, method = "glm", family = binomial(link = "logit"))
summary(model)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7875  -0.4122  -0.2864  -0.1980   4.0481  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -4.780e+00  5.165e-01  -9.255  < 2e-16 ***
## LOAN          -1.766e-05  9.053e-06  -1.951  0.05111 .  
## MORTDUE       -4.702e-06  4.092e-06  -1.149  0.25056    
## VALUE          5.832e-06  3.504e-06   1.664  0.09603 .  
## REASONHomeImp -1.831e-01  1.784e-01  -1.026  0.30473    
## JOBOffice     -5.974e-01  2.975e-01  -2.008  0.04461 *  
## JOBOther       1.578e-02  2.304e-01   0.069  0.94539    
## JOBProfExe    -1.197e-01  2.687e-01  -0.445  0.65598    
## JOBSales       1.531e+00  4.921e-01   3.112  0.00186 ** 
## JOBSelf        8.750e-01  4.392e-01   1.992  0.04633 *  
## YOJ           -3.345e-03  1.101e-02  -0.304  0.76122    
## DEROG          6.631e-01  1.093e-01   6.065 1.32e-09 ***
## DELINQ         8.029e-01  7.944e-02  10.107  < 2e-16 ***
## CLAGE         -6.247e-03  1.200e-03  -5.205 1.94e-07 ***
## NINQ           1.313e-01  4.045e-02   3.245  0.00117 ** 
## CLNO          -1.638e-02  9.114e-03  -1.798  0.07223 .  
## DEBTINC        9.334e-02  1.115e-02   8.369  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1641.2  on 2690  degrees of freedom
## Residual deviance: 1261.2  on 2674  degrees of freedom
## AIC: 1295.2
## 
## Number of Fisher Scoring iterations: 6
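
The estimates in the summary are on the log-odds scale, which is hard to read directly. Exponentiating a coefficient gives an odds ratio: the multiplicative change in the odds of default for a one-unit increase in that predictor, holding the others fixed. Here is a short sketch using three estimates copied from the output above. With the fitted object, exp(coef(model$finalModel)) should give the full set, assuming `model` is the caret object from this section.

```r
# Log-odds estimates copied from the summary above
log_odds <- c(DEROG = 0.6631, DELINQ = 0.8029, DEBTINC = 0.09334)

# Odds ratio per one-unit increase in each predictor
odds_ratio <- exp(log_odds)
print(round(odds_ratio, 3))

# Percentage change in the odds of default per unit
pct_change <- (odds_ratio - 1) * 100
print(round(pct_change, 1))
```

For example, each additional unit of DELINQ multiplies the estimated odds of default by roughly 2.2, while one extra point of DEBTINC raises them by about 10%.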

Predicting

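pred <- predict(model, test)
# confusionMatrix() expects (predictions, actual); this call passes them reversed,
# so the rows labelled "Prediction" below are the actual classes
confusionMatrix(test$BAD, pred)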

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 610   8
##          1  42  13
##                                           
##                Accuracy : 0.9257          
##                  95% CI : (0.9032, 0.9444)
##     No Information Rate : 0.9688          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.311           
##  Mcnemar's Test P-Value : 3.058e-06       
##                                           
##             Sensitivity : 0.9356          
##             Specificity : 0.6190          
##          Pos Pred Value : 0.9871          
##          Neg Pred Value : 0.2364          
##              Prevalence : 0.9688          
##          Detection Rate : 0.9064          
##    Detection Prevalence : 0.9183          
##       Balanced Accuracy : 0.7773          
##                                           
##        'Positive' Class : 0               
##
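
The printed statistics are easy to misread for two reasons. confusionMatrix() takes the predictions as its first argument and the true labels as its second, whereas the call above passes them the other way round, so the rows labelled "Prediction" hold the actual classes. It also treats class 0 (no default) as the positive class by default. The sketch below rebuilds the table from the counts above and computes recall and precision for the default class (1) by hand:

```r
# Counts copied from the table above.
# Rows = actual class, columns = predicted class.
cm <- matrix(c(610, 8,
               42, 13),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy  <- sum(diag(cm)) / sum(cm)
recall    <- cm["1", "1"] / sum(cm["1", ])   # share of actual defaults caught
precision <- cm["1", "1"] / sum(cm[, "1"])   # share of predicted defaults that are real

print(round(c(accuracy = accuracy, recall = recall, precision = precision), 4))
```

Under this reading the logistic model flags 13 of the 55 actual defaults in the test set. With the fitted objects, confusionMatrix(pred, test$BAD, positive = "1") would report these two quantities as sensitivity and positive predictive value.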

# SVM with a radial kernel; caret tunes C over its default grid using bootstrap resampling
model2 <- train(BAD ~ ., data = train, method = "svmRadial")
model2

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 2691 samples
##   12 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2691, 2691, 2691, 2691, 2691, 2691, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.25  0.9195170  0.1820640
##   0.50  0.9289100  0.3309801
##   1.00  0.9337635  0.4037946
## 
## Tuning parameter 'sigma' was held constant at a value of 0.05339373
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.05339373 and C = 1.

Prediction

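pred1 <- predict(model2, test)
# Arguments are again in (actual, predictions) order,
# so the rows labelled "Prediction" below are the actual classes
confusionMatrix(test$BAD, pred1)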

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 618   0
##          1  38  17
##                                           
##                Accuracy : 0.9435          
##                  95% CI : (0.9233, 0.9597)
##     No Information Rate : 0.9747          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.451           
##  Mcnemar's Test P-Value : 1.947e-09       
##                                           
##             Sensitivity : 0.9421          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.3091          
##              Prevalence : 0.9747          
##          Detection Rate : 0.9183          
##    Detection Prevalence : 0.9183          
##       Balanced Accuracy : 0.9710          
##                                           
##        'Positive' Class : 0               
##

The SVM classifies most test observations correctly: its accuracy is about 94%. The confusion matrix shows that only 38 of the 673 test observations are misclassified. All 38 are defaulted loans predicted as non-defaults, so the model identifies only 17 of the 55 actual defaults.
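
Accuracy alone is a weak yardstick here because defaults are rare in the test set: a rule that predicts "no default" for everyone would already be right most of the time. Below is a sketch comparing both models against that majority-class baseline, using the counts from the two confusion matrices above (rows read as actual classes, columns as predicted classes). The helper name summarise_cm is hypothetical.

```r
# Helper (hypothetical name): accuracy and recall for the default class,
# assuming rows = actual (0, 1) and columns = predicted (0, 1)
summarise_cm <- function(cm) {
  c(accuracy       = sum(diag(cm)) / sum(cm),
    recall_default = cm[2, 2] / sum(cm[2, ]))
}

logit_cm <- matrix(c(610, 8, 42, 13), nrow = 2, byrow = TRUE)
svm_cm   <- matrix(c(618, 0, 38, 17), nrow = 2, byrow = TRUE)

# Accuracy of always predicting the majority class (no default)
baseline_accuracy <- sum(logit_cm[1, ]) / sum(logit_cm)

results <- rbind(logistic = summarise_cm(logit_cm),
                 svm      = summarise_cm(svm_cm))
print(round(results, 3))
cat("majority-class baseline accuracy:", round(baseline_accuracy, 3), "\n")
```

Both models beat the baseline on accuracy by only a few percentage points, and neither catches more than about a third of the actual defaults, so recall on the default class is the more informative number for this problem.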

hmeq loans

PAVAN

19 July 2018