library(readxl)
hmeq<-read_excel("d:/ds/hmeq.xlsx")
loans<-hmeq
The data set HMEQ reports characteristics and delinquency information for 5,960 home equity loans. A home equity loan is a loan where the obligor uses the equity of his or her home as the underlying collateral.
x<-na.omit(loans)
class(x$BAD)
## [1] "numeric"
x$BAD<-as.factor(x$BAD)
x$JOB<-as.factor(x$JOB)
x$REASON<-as.factor(x$REASON)
library(caret)
## Warning: package 'caret' was built under R version 3.5.1
## Loading required package: lattice
## Loading required package: ggplot2
split<-sample(nrow(x),nrow(x)*0.8)
train<-x[split,]
test<-x[-split,]
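The split above is random and unseeded, so the confusion matrices further down will differ from run to run, and because defaulters are a small minority a purely random split can shift the class balance between the two parts. A minimal sketch of a seeded, stratified 80/20 split in base R follows; the toy data frame `toy`, the seed value, and the helper names are assumptions made for illustration, not part of the original analysis.

```r
# Hypothetical toy data standing in for the cleaned loans data frame `x`
set.seed(42)  # arbitrary seed, chosen only so the example is repeatable
toy <- data.frame(
  BAD  = factor(rep(c(0, 1), times = c(90, 10))),
  LOAN = round(runif(100, 1000, 50000))
)

# Sample 80% of the row indices within each class of BAD
idx_by_class <- split(seq_len(nrow(toy)), toy$BAD)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), size = floor(0.8 * length(i)))]
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions are the same in both parts
prop.table(table(toy_train$BAD))
prop.table(table(toy_test$BAD))
```

caret's `createDataPartition()` offers a comparable stratified split and may be the more natural choice here, since caret is already loaded.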
ggplot(loans,aes(BAD))+geom_bar()
Comparing clients who repaid their home equity loans with clients who defaulted.
loans$BAD<-as.factor(loans$BAD)
ggplot(loans,aes(JOB))+geom_bar(aes(fill=BAD))+ggtitle("job description vs defaulted loan")
ggplot(loans,aes(DEROG))+geom_bar(aes(fill=BAD))
## Warning: Removed 708 rows containing non-finite values (stat_count).
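In absolute counts, clients with few derogatory reports account for more defaulted loans than clients with many derogatory reports. The bars show counts, not rates, so this largely reflects how many clients fall into each group.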
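Because `geom_bar()` plots counts, a tall bar at zero derogatory reports mostly reflects how many clients are in that group. A small sketch of comparing default rates rather than counts is below; the `toy` data frame is made up for illustration and does not come from HMEQ.

```r
# Hypothetical toy data: many clients with DEROG = 0, few with higher values
toy <- data.frame(
  DEROG = c(rep(0, 80), rep(1, 15), rep(2, 5)),
  BAD   = c(rep(c(0, 1), times = c(72, 8)),   # DEROG = 0: 8 of 80 default
            rep(c(0, 1), times = c(10, 5)),   # DEROG = 1: 5 of 15 default
            rep(c(0, 1), times = c(2, 3)))    # DEROG = 2: 3 of 5 default
)

# Number of defaulted loans per DEROG level (what the stacked bars show)
default_counts <- tapply(toy$BAD, toy$DEROG, sum)

# Share of loans that defaulted per DEROG level
default_rates <- tapply(toy$BAD, toy$DEROG, mean)

default_counts
default_rates
```

In the toy data the group with no derogatory reports has the most defaults but the lowest default rate. On the real data, adding `position = "fill"` to `geom_bar()` would plot shares instead of counts.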
ggplot(loans,aes(NINQ))+geom_bar(aes(fill=BAD))+ggtitle("Number of recent credit inquiries vs clients")
## Warning: Removed 510 rows containing non-finite values (stat_count).
ggplot(x,aes(DEBTINC,BAD))+geom_point()+ggtitle("Debt-income ratio vs defaulted loans")
Beyond a certain debt-to-income ratio, all clients in the plot have defaulted. A high ratio means high debt relative to income, which plausibly explains these defaults.
library(caret)
model<-train(BAD ~ .,data=train,method="glm",family=binomial(link="logit"))
summary(model)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7875 -0.4122 -0.2864 -0.1980 4.0481
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.780e+00 5.165e-01 -9.255 < 2e-16 ***
## LOAN -1.766e-05 9.053e-06 -1.951 0.05111 .
## MORTDUE -4.702e-06 4.092e-06 -1.149 0.25056
## VALUE 5.832e-06 3.504e-06 1.664 0.09603 .
## REASONHomeImp -1.831e-01 1.784e-01 -1.026 0.30473
## JOBOffice -5.974e-01 2.975e-01 -2.008 0.04461 *
## JOBOther 1.578e-02 2.304e-01 0.069 0.94539
## JOBProfExe -1.197e-01 2.687e-01 -0.445 0.65598
## JOBSales 1.531e+00 4.921e-01 3.112 0.00186 **
## JOBSelf 8.750e-01 4.392e-01 1.992 0.04633 *
## YOJ -3.345e-03 1.101e-02 -0.304 0.76122
## DEROG 6.631e-01 1.093e-01 6.065 1.32e-09 ***
## DELINQ 8.029e-01 7.944e-02 10.107 < 2e-16 ***
## CLAGE -6.247e-03 1.200e-03 -5.205 1.94e-07 ***
## NINQ 1.313e-01 4.045e-02 3.245 0.00117 **
## CLNO -1.638e-02 9.114e-03 -1.798 0.07223 .
## DEBTINC 9.334e-02 1.115e-02 8.369 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1641.2 on 2690 degrees of freedom
## Residual deviance: 1261.2 on 2674 degrees of freedom
## AIC: 1295.2
##
## Number of Fisher Scoring iterations: 6
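The coefficients in the summary are on the log-odds scale. As a rough reading aid, exponentiating an estimate gives the multiplicative change in the odds of default (BAD = 1) for a one-unit increase in that predictor, holding the others fixed. The sketch below does this for a few estimates copied by hand from the output above; the vector names `est` and `odds_ratios` are assumptions for illustration.

```r
# Estimates copied from the summary output above (log-odds scale)
est <- c(DEROG = 0.6631, DELINQ = 0.8029, NINQ = 0.1313, DEBTINC = 0.09334)

# Odds ratios: multiplicative change in the odds of BAD = 1 per one-unit increase
odds_ratios <- exp(est)
round(odds_ratios, 2)
```

On the fitted object, `exp(coef(model$finalModel))` should give the same quantities for every term.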
pred<-predict(model,test)
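# Note: confusionMatrix() expects (data = predictions, reference = actual values).
# Here the actual values are passed first, so in the output below the rows
# labelled "Prediction" are the actual classes and the columns labelled
# "Reference" are the predicted classes.
confusionMatrix(test$BAD,pred)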
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 610 8
## 1 42 13
##
## Accuracy : 0.9257
## 95% CI : (0.9032, 0.9444)
## No Information Rate : 0.9688
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.311
## Mcnemar's Test P-Value : 3.058e-06
##
## Sensitivity : 0.9356
## Specificity : 0.6190
## Pos Pred Value : 0.9871
## Neg Pred Value : 0.2364
## Prevalence : 0.9688
## Detection Rate : 0.9064
## Detection Prevalence : 0.9183
## Balanced Accuracy : 0.7773
##
## 'Positive' Class : 0
##
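Since the call passes `test$BAD` first, the rows of the table above hold the actual classes and the columns the predicted classes. Under that reading, the metrics for the defaulter class can be recomputed by hand from the four counts. A sketch with the counts typed in from the output; the object names are assumptions:

```r
# Counts copied from the output above; rows = actual BAD, columns = predicted BAD
cm <- matrix(c(610, 8,
               42, 13),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)        # share of test loans classified correctly
recall_1    <- cm["1", "1"] / sum(cm["1", ])  # share of actual defaulters that were caught
precision_1 <- cm["1", "1"] / sum(cm[, "1"])  # share of predicted defaulters that did default

round(c(accuracy = accuracy, recall_1 = recall_1, precision_1 = precision_1), 4)
```

Under this reading the logistic model catches 13 of the 55 actual defaulters in the test set, even though overall accuracy is about 93%.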
model2<-train(BAD ~ .,data=train,method="svmRadial")
model2
## Support Vector Machines with Radial Basis Function Kernel
##
## 2691 samples
## 12 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2691, 2691, 2691, 2691, 2691, 2691, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.9195170 0.1820640
## 0.50 0.9289100 0.3309801
## 1.00 0.9337635 0.4037946
##
## Tuning parameter 'sigma' was held constant at a value of 0.05339373
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.05339373 and C = 1.
Prediction
pred1<-predict(model2,test)
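# Note: the actual values are passed first here as well, so rows labelled
# "Prediction" are actual classes and columns labelled "Reference" are predictions.
confusionMatrix(test$BAD,pred1)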
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 618 0
## 1 38 17
##
## Accuracy : 0.9435
## 95% CI : (0.9233, 0.9597)
## No Information Rate : 0.9747
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.451
## Mcnemar's Test P-Value : 1.947e-09
##
## Sensitivity : 0.9421
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.3091
## Prevalence : 0.9747
## Detection Rate : 0.9183
## Detection Prevalence : 0.9183
## Balanced Accuracy : 0.9710
##
## 'Positive' Class : 0
##
The SVM classifies most test loans correctly: accuracy is about 94% (635 of 673). The confusion matrix shows that 38 test observations are misclassified. All 38 are actual defaulters predicted as non-defaulters, so the model identifies only 17 of the 55 defaulters in the test set.
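Accuracy alone is hard to judge when one class dominates. A quick check, using only the counts printed in the SVM confusion matrix above and reading its rows as actual classes, compares the model against always predicting "no default"; the variable names are illustrative.

```r
# Counts from the SVM confusion matrix above (rows = actual, columns = predicted)
actual_0 <- c(pred_0 = 618, pred_1 = 0)
actual_1 <- c(pred_0 = 38,  pred_1 = 17)

n_test <- sum(actual_0) + sum(actual_1)

svm_accuracy      <- (actual_0[["pred_0"]] + actual_1[["pred_1"]]) / n_test
baseline_accuracy <- sum(actual_0) / n_test   # always predict BAD = 0
n_errors          <- actual_0[["pred_1"]] + actual_1[["pred_0"]]

round(c(svm = svm_accuracy, baseline = baseline_accuracy), 4)
n_errors
```

The gap between the two accuracies is only a few percentage points, which suggests class-specific metrics such as recall for defaulters are the more informative yardstick here.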