Introduction

This project models credit defaults. Credit risk is a major concern for any lending organisation, so companies need to monitor their clients and anticipate defaults. The project analyses a loan data set and applies several statistical techniques to it.

The statistical techniques used in this project are:

  1. Logistic Regression
  2. Area Under the Curve (AUC)
  3. Decision Tree

About the Data

The data set contains the following information about each client:

  1. loan_status: It is an integer variable that reflects the loan status: 0 if the client is a non-defaulter and 1 if the client is a defaulter.
  2. loan_amnt: It is an integer variable that shows the loan amount taken by the client.
  3. int_rate: It is a numerical variable that shows the interest rate applicable to the loan amount.
  4. grade: It is a categorical variable that shows the credit rating of the client, from A (most creditworthy) to G (least creditworthy).
  5. annual_inc: It is a numerical variable that shows the annual income of the client.

Visualising the Data

Let’s start by viewing the data set:

loandata<-read.csv("D:/R/loandata.csv")
summary(loandata)
##        X          loan_status       loan_amnt        int_rate     grade   
##  Min.   :    1   Min.   :0.0000   Min.   :  500   Min.   : 5.42   A:9649  
##  1st Qu.: 7274   1st Qu.:0.0000   1st Qu.: 5000   1st Qu.: 7.90   B:9329  
##  Median :14546   Median :0.0000   Median : 8000   Median :10.99   C:5748  
##  Mean   :14546   Mean   :0.1109   Mean   : 9594   Mean   :11.00   D:3231  
##  3rd Qu.:21819   3rd Qu.:0.0000   3rd Qu.:12250   3rd Qu.:13.47   E: 868  
##  Max.   :29092   Max.   :1.0000   Max.   :35000   Max.   :23.22   F: 211  
##                                                   NA's   :2776    G:  56  
##    annual_inc     
##  Min.   :   4000  
##  1st Qu.:  40000  
##  Median :  56424  
##  Mean   :  67169  
##  3rd Qu.:  80000  
##  Max.   :6000000  
## 
#Viewing the Structure
str(loandata)
## 'data.frame':    29092 obs. of  6 variables:
##  $ X          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ loan_status: int  0 0 0 0 0 0 1 0 1 0 ...
##  $ loan_amnt  : int  5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
##  $ int_rate   : num  10.6 NA 13.5 NA NA ...
##  $ grade      : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 3 2 2 4 ...
##  $ annual_inc : num  24000 12252 49200 36000 48000 ...
library(gmodels)
#Probability of defaults and non-defaults
CrossTable(loandata$loan_status)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  29092 
## 
##  
##           |         0 |         1 | 
##           |-----------|-----------|
##           |     25865 |      3227 | 
##           |     0.889 |     0.111 | 
##           |-----------|-----------|
## 
## 
## 
## 
#Probability of loan status across grades
cross1<-CrossTable(loandata$grade, loandata$loan_status,prop.r = TRUE,prop.c = FALSE,prop.t = FALSE,prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  29092 
## 
##  
##                | loandata$loan_status 
## loandata$grade |         0 |         1 | Row Total | 
## ---------------|-----------|-----------|-----------|
##              A |      9084 |       565 |      9649 | 
##                |     0.941 |     0.059 |     0.332 | 
## ---------------|-----------|-----------|-----------|
##              B |      8344 |       985 |      9329 | 
##                |     0.894 |     0.106 |     0.321 | 
## ---------------|-----------|-----------|-----------|
##              C |      4904 |       844 |      5748 | 
##                |     0.853 |     0.147 |     0.198 | 
## ---------------|-----------|-----------|-----------|
##              D |      2651 |       580 |      3231 | 
##                |     0.820 |     0.180 |     0.111 | 
## ---------------|-----------|-----------|-----------|
##              E |       692 |       176 |       868 | 
##                |     0.797 |     0.203 |     0.030 | 
## ---------------|-----------|-----------|-----------|
##              F |       155 |        56 |       211 | 
##                |     0.735 |     0.265 |     0.007 | 
## ---------------|-----------|-----------|-----------|
##              G |        35 |        21 |        56 | 
##                |     0.625 |     0.375 |     0.002 | 
## ---------------|-----------|-----------|-----------|
##   Column Total |     25865 |      3227 |     29092 | 
## ---------------|-----------|-----------|-----------|
## 
## 
#Plotting the graph to see the proportion of defaulters across grades
plot(cross1$prop.tbl,col=c("green","red"),main="Proportion of Defaulters/Non-defaulters with respect to grade",xlab = "Grade of customer",ylab = "1-Default 0-Non-Default")

Checking for NA Values

For some clients the interest rate is recorded as NA. Missing values like these would distort the analysis, so we remove the rows where int_rate is NA.

#Checking for NA values
anyNA(loandata)
## [1] TRUE
#Identify the NAs in the variable
summary(loandata$int_rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    5.42    7.90   10.99   11.00   13.47   23.22    2776
#Deleting rows containing NAs
na_index<-which(is.na(loandata$int_rate))
loandata_delrow_na<-loandata[-c(na_index),]
#Verify that the NA values have been eliminated
summary(loandata_delrow_na$int_rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.42    7.90   10.99   11.00   13.47   23.22
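
An alternative to deleting rows, not used in this analysis, is to impute the missing rates, for example with the median; a minimal sketch (the loandata_imputed name is illustrative):

#Hypothetical alternative: median imputation instead of row deletion
median_rate <- median(loandata$int_rate, na.rm = TRUE)
loandata_imputed <- loandata
loandata_imputed$int_rate[is.na(loandata_imputed$int_rate)] <- median_rate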

Creating Test set and Train set

To test the accuracy of any model, the data set should be split into two parts:

  1. A train set (about 2/3 of the observations)
  2. A test set (the remaining 1/3)

Here, the statistical models are fitted on the train set, and their accuracy is checked on the test set.

set.seed(1234)
# Split the data set into train and test sets
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
index_train<-createDataPartition(y=loandata_delrow_na$loan_status,p=2/3,list=FALSE)
training_data<-loandata_delrow_na[index_train,]
testing_data<-loandata_delrow_na[-index_train,]
#Check the dimensions of the train and test sets
dim(training_data); dim(testing_data)
## [1] 17544     6
## [1] 8772    6

Logistic Regression

The logistic model is a widely used statistical model in which a logistic function models a binary dependent variable. Logistic regression estimates the parameters of this model. The outcome takes one of two possible values, in this case defaulter (1) or non-defaulter (0).

The code below runs a logistic regression. A variable is considered significant if its p-value is less than the chosen significance level (alpha).

Further, the accuracy of the model is tested using a confusion matrix. For this, the predictions and the true labels must be factors with the same levels, so the integer columns are first converted to factors. One caveat: the formula loan_status_fac ~ . leaves the original loan_status column (and the row index X) among the predictors, so the outcome leaks into the model. This is why glm.fit fails to converge below (note the huge coefficient on loan_status) and why the confusion matrix reports perfect accuracy; a leakage-free fit is sketched at the end of this section.

#Run Logistic regression on train data
training_data$loan_status_fac<-as.factor(training_data$loan_status)
modLog<-train(loan_status_fac~., data=training_data,method="glm")
## Warning: glm.fit: algorithm did not converge
## (warning repeated for each resampling fit)
summary(modLog)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -2.409e-06  -2.409e-06  -2.409e-06  -2.409e-06   2.409e-06  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.657e+01  2.173e+04  -0.001    0.999
## X           -5.267e-17  3.297e-01   0.000    1.000
## loan_status  5.313e+01  8.668e+03   0.006    0.995
## loan_amnt   -4.724e-17  4.463e-01   0.000    1.000
## int_rate    -2.235e-13  2.709e+03   0.000    1.000
## gradeB       5.408e-13  1.196e+04   0.000    1.000
## gradeC       1.112e-12  1.845e+04   0.000    1.000
## gradeD       1.636e-12  2.373e+04   0.000    1.000
## gradeE       2.293e-12  3.113e+04   0.000    1.000
## gradeF       1.656e-12  4.430e+04   0.000    1.000
## gradeG       8.139e-12  7.264e+04   0.000    1.000
## annual_inc   1.231e-18  3.958e-02   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.2226e+04  on 17543  degrees of freedom
## Residual deviance: 1.0178e-07  on 17532  degrees of freedom
## AIC: 24
## 
## Number of Fisher Scoring iterations: 25
#Predictions from the logistic regression model
prediction<-predict(modLog,newdata = testing_data)
#Check the underlying storage types (factors are stored as integers, so both report "integer")
typeof(prediction)
## [1] "integer"
typeof(testing_data$loan_status)
## [1] "integer"
#Convert to factor
prediction_fac<-as.factor(prediction)
testing_data$loan_status_fac<-as.factor(testing_data$loan_status)
#Checking the structure
str(prediction_fac)
##  Factor w/ 2 levels "0","1": 1 1 1 2 1 2 1 1 1 2 ...
str(testing_data$loan_status_fac)
##  Factor w/ 2 levels "0","1": 1 1 1 2 1 2 1 1 1 2 ...
#Check accuracy
library(e1071)
confusionMatrix(table(prediction_fac,testing_data$loan_status_fac))
## Confusion Matrix and Statistics
## 
##               
## prediction_fac    0    1
##              0 7784    0
##              1    0  988
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9996, 1)
##     No Information Rate : 0.8874     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.8874     
##          Detection Rate : 0.8874     
##    Detection Prevalence : 0.8874     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 
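
For reference, here is a minimal leakage-free sketch (the names predictors and modLog2 are illustrative, not part of the original analysis): the row index X and the raw loan_status column are dropped before fitting, so the model sees only genuine predictors.

#Hypothetical refit without the leaking columns
predictors <- setdiff(names(training_data), c("X", "loan_status"))
modLog2 <- train(loan_status_fac ~ ., data = training_data[, predictors], method = "glm")
summary(modLog2)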

Area Under the Curve

The Area Under the Curve (AUC) is the area under the Receiver Operating Characteristic (ROC) curve, which measures the tradeoff between sensitivity and specificity in a binary classifier. Turning scores into binary predictions requires setting a threshold: values above the threshold are classified as positive and those below it as negative.

Different threshold values give different levels of sensitivity and specificity: a high threshold labels few cases as positive, giving a low true positive rate (TPR) but also a low false positive rate (FPR), while a low threshold labels many cases as positive, raising both rates. The ROC curve plots the true positive rate against the false positive rate across all thresholds, giving a picture of the whole spectrum of such tradeoffs.

Since this is a binary outcome, the labels vector is a series of TRUE and FALSE values (ones and zeros in our case), sorted by decreasing score. The graph starts at the origin and traces a path dictated by that sequence: on a one (TRUE) it takes a step northward (the positive y direction); on a zero (FALSE) it takes a step to the east (the positive x direction). The path across the graph is determined by the order of the ones and zeros, and it always finishes in the upper right corner. An ROC “curve” computed in this way is actually a step function.
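
This construction can be coded directly; here is a minimal sketch (the name roc_points is illustrative), assuming a numeric vector of 0/1 labels and a numeric vector of scores:

#Manual ROC step function: sort the labels by decreasing score,
#then step north on each 1 and east on each 0
roc_points <- function(labels, scores) {
  lab <- labels[order(scores, decreasing = TRUE)]
  data.frame(fpr = c(0, cumsum(1 - lab) / sum(1 - lab)),
             tpr = c(0, cumsum(lab) / sum(lab)))
}
#Example: plot(roc_points(c(1, 0, 1, 1, 0), c(0.9, 0.8, 0.7, 0.4, 0.2)), type = "s")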

In the code below, the AUC is computed on the train set with the widely used “pROC” package. Note that the score vector used here is only a placeholder sequence, not the model’s predicted probabilities, which is why the resulting AUC is close to 0.5 (no better than chance); a sketch using actual model scores follows the plots.

#AREA UNDER THE CURVE
library(gplots)
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(ROCR)
#Store the dependent variable of the train set
category <- training_data$loan_status
summary(category)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1109  0.0000  1.0000
#Generate a placeholder score vector (unrelated to the model's predictions)
prediction2 <- rev(seq_along(category))
#Find out AUC
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following object is masked from 'package:gmodels':
## 
##     ci
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
roc_obj <- roc(category, prediction2)
auc(roc_obj)
## Area under the curve: 0.5175
#Plot the ROC curve
library(rpart)
library("ROSE")
## Loaded ROSE 0.0-3
library(Epi)
roc.curve(training_data$loan_status,prediction2)

## Area under the curve (AUC): 0.518
ROC(form=training_data$loan_status~prediction2,plot="ROC")
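
For completeness, here is a hedged sketch of computing the AUC from actual model scores rather than the placeholder sequence (modLog2 is the hypothetical leakage-free model sketched at the end of the logistic regression section; type = "prob" asks caret for class probabilities):

#Sketch: AUC from real model scores on the test set
prob_default <- predict(modLog2, newdata = testing_data, type = "prob")[, "1"]
roc_model <- roc(testing_data$loan_status, prob_default)
auc(roc_model)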

Decision Tree

A decision tree builds classification or regression models in the form of a tree structure, breaking a dataset down into smaller and smaller subsets. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches; a leaf node represents a classification or decision. The topmost decision node, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

There are two common splitting criteria for forming a decision tree: information gain and the Gini index. The classic algorithm builds the tree using entropy (a measure of how mixed a group is) and information gain. The information gain of a split is the decrease in entropy after the dataset is split on an attribute, and constructing the tree is all about repeatedly finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).
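
A toy illustration of the entropy and information gain calculation (the child default rates below are made up for the example; the roughly 11% parent default rate matches the data):

#Entropy of a class distribution, in bits
entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
#Parent node: roughly 11% defaulters overall
parent <- entropy(c(0.889, 0.111))
#Hypothetical split into two equal-sized children with 5% and 17% default rates
children <- 0.5 * entropy(c(0.95, 0.05)) + 0.5 * entropy(c(0.83, 0.17))
parent - children  #information gain: the attribute with the largest gain wins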

The following steps fit the decision tree using information gain and then test its accuracy; the accuracy-testing steps mirror those used for the logistic regression above. As with the logistic model, loan_status is left among the predictors here, so the tree can split on the outcome itself; this is why the resampled accuracy below is perfect. The tree that is plotted further down drops the X and loan_status columns before fitting.

#DECISION TREE USING INFORMATION GAIN
#Train the decision tree classifier with 'information gain' as criteria 
trctrl<-trainControl(method="repeatedcv",number=10,repeats=3)
library(e1071)
set.seed(3333)
training_data$loan_status_fac<-as.factor(training_data$loan_status)
dtree_fit<-train(loan_status_fac~.,data=training_data,method="rpart",parms=list(split="information"),trControl=trctrl,tuneLength=10)
#Check the result of the "train" method
dtree_fit
## CART 
## 
## 17544 samples
##     6 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 15790, 15790, 15789, 15789, 15790, 15790, ... 
## Resampling results across tuning parameters:
## 
##   cp         Accuracy   Kappa
##   0.0000000  1.0000000  1    
##   0.1111111  1.0000000  1    
##   0.2222222  1.0000000  1    
##   0.3333333  1.0000000  1    
##   0.4444444  1.0000000  1    
##   0.5555556  1.0000000  1    
##   0.6666667  1.0000000  1    
##   0.7777778  1.0000000  1    
##   0.8888889  1.0000000  1    
##   1.0000000  0.8890789  0    
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.8888889.
#Plot the decision tree
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart.plot)
library(RColorBrewer)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
library(partykit)
## Loading required package: libcoin
## 
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner,
##     node_surv, node_terminal, varimp
library(caret)
form <- as.formula(loan_status_fac ~ .)
data <- training_data[,-c(1,2)]
tree.1 <- rpart(form,data=data,control=rpart.control(minsplit=50,cp=0))
prp(tree.1)

#Predict the target variable for the test set
prediction3<-predict(dtree_fit,newdata = testing_data)
#Check the underlying storage types (factors are stored as integers, so both report "integer")
typeof(prediction3)
## [1] "integer"
typeof(testing_data$loan_status)
## [1] "integer"
#Convert to factor
prediction3_fac<-as.factor(prediction3)
testing_data$loan_status_fac<-as.factor(testing_data$loan_status)
#Checking the structure
str(prediction3_fac)
##  Factor w/ 2 levels "0","1": 1 1 1 2 1 2 1 1 1 2 ...
str(testing_data$loan_status_fac)
##  Factor w/ 2 levels "0","1": 1 1 1 2 1 2 1 1 1 2 ...
#Check accuracy
confusionMatrix(prediction3_fac,testing_data$loan_status_fac)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7784    0
##          1    0  988
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9996, 1)
##     No Information Rate : 0.8874     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.8874     
##          Detection Rate : 0.8874     
##    Detection Prevalence : 0.8874     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
##