Business Problem
Banks incur losses when customers do not repay their loans on time. Every year these losses run into crores, which also weighs heavily on the country's economic growth. In this hackathon, we look at attributes such as funded amount, location, loan balance, etc., to predict whether a person will default on a loan.
To solve this problem, MachineHack has created a training dataset of 67,463 rows and 35 columns and a testing dataset of 28,913 rows and 34 columns. The hackathon demands a few prerequisite skills: handling a large dataset, managing the underfitting vs. overfitting trade-off, and optimising "log_loss" so that the model generalises well to unseen data.
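Since the competition metric is log loss, it is worth having the formula on hand. A minimal sketch (assuming actual is a 0/1 vector and predicted a vector of probabilities, clipped to avoid log(0)):
# log loss: penalises confident wrong predictions heavily
log_loss <- function(actual, predicted, eps = 1e-15) {
  predicted <- pmin(pmax(predicted, eps), 1 - eps)  # clip away exact 0 and 1
  -mean(actual * log(predicted) + (1 - actual) * log(1 - predicted))
}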
Data is provided by MachineHack x Deloitte: https://www.kaggle.com/datasets/ankitkalauni/bank-loan-defaulter-prediction-hackathon
Data Dictionary
Variable | Definition
- ID: unique ID of the representative
- Loan Amount: loan amount applied for
- Funded Amount: loan amount funded
- Funded Amount Investor: loan amount approved by the investors
- Term: term of the loan (in months)
- Batch Enrolled: batch numbers to representatives
- Interest Rate: interest rate (%) on loan
- Grade: grade by the bank
- Sub Grade: sub-grade by the bank
- Employment Duration: duration of employment
- Home Ownership: ownership of home
- Verification Status: Income verification by the bank
- Payment Plan: if any payment plan has started against loan
- Loan Title: loan title provided
- Debit to Income: ratio of the representative's total monthly debt repayment to self-reported monthly income, excluding mortgage
- Delinquency - two years: number of 30+ day delinquencies in the past 2 years
- Inquires - six months: total number of inquiries in the last 6 months
- Open Account: number of open credit lines in the representative's credit file
- Public Record: number of derogatory public records
- Revolving Balance: total credit revolving balance
- Revolving Utilities: amount of credit the representative is using relative to the revolving balance
- Total Accounts: total number of credit lines in the representative's credit file
- Initial List Status: unique listing status of the loan - W(Waiting), F(Forwarded)
- Total Received Interest: total interest received till date
- Total Received Late Fee: total late fee received till date
- Recoveries: post charge off gross recovery
- Collection Recovery Fee: post charge off collection fee
- Collection 12 months Medical: total collections in last 12 months excluding medical collections
- Application Type: indicates whether the representative's application is individual or joint
- Last week Pay: indicates how long (in weeks) a representative has paid EMI after batch enrolled
- Accounts Delinquent: number of accounts on which the representative is delinquent
- Total Collection Amount: total collection amount ever owed
- Total Current Balance: total current balance from all accounts
- Total Revolving Credit Limit: total revolving credit limit
- Loan Status: 1 = Defaulter, 0 = Non-Defaulter
Data Processing
Load Libraries
library(caret)
library(pROC)
library(ggplot2)
library(corrplot)
library(tibble)
library(glue)
library(plotly)
library(keras)
library(neuralnet)
library(tidyverse)
library(nnet)
library(randomForest)
library(DMwR)
library(party)
library(e1071)
library(psych)
library(knitr)
library(reshape2)
library(expss)
Read Data
# read data
train <- read.csv("data/train.csv", stringsAsFactors = T)
Reduce the amount of data to speed up processing
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)
trainIndex <- createDataPartition(train$Loan.Status, p = 0.05,
list = FALSE,
times = 1)
train <- train[trainIndex, ,drop=FALSE]
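As a quick sanity check (not part of the original notebook), we can confirm that the stratified 5% sample preserves the class proportions:
# class balance should be roughly unchanged after stratified sampling
prop.table(table(train$Loan.Status))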
The sampled training data retains all 35 variables described in the data dictionary above.
To ensure the data is fully prepared, it must be cleaned and transformed before analysis and modeling. Depending on the data, this can involve type conversions, scaling, handling outliers, or other statistical strategies; the following preprocessing steps cover the ones needed here.
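For instance, numeric predictors could be centred and scaled with caret's preProcess. This is a sketch only; the models below are fitted on the untransformed values:
# hypothetical scaling step (not applied in the analysis that follows)
pp <- preProcess(train %>% select_if(is.numeric) %>% select(-ID),
                 method = c("center", "scale"))
train_scaled <- predict(pp, train %>% select_if(is.numeric) %>% select(-ID))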
Pre-processing
# data structure
glimpse(train)
## Rows: 3,374
## Columns: 35
## $ ID <int> 1786725, 3910168, 63629135, 5777207, 1428…
## $ Loan.Amount <int> 30665, 11480, 15033, 18874, 20108, 19498,…
## $ Funded.Amount <int> 10765, 6287, 22997, 16334, 21864, 31990, …
## $ Funded.Amount.Investor <dbl> 20266.841, 13873.335, 31357.055, 9289.247…
## $ Term <int> 59, 59, 59, 59, 58, 36, 59, 58, 58, 59, 5…
## $ Batch.Enrolled <fct> BAT2078974, BAT4271519, BAT3873588, BAT38…
## $ Interest.Rate <dbl> 5.755122, 8.721857, 11.184074, 22.019769,…
## $ Grade <fct> B, A, C, C, C, A, A, C, B, A, A, E, B, E,…
## $ Sub.Grade <fct> A2, D2, B1, F4, B3, D4, D3, B2, C2, B4, C…
## $ Employment.Duration <fct> RENT, RENT, RENT, MORTGAGE, RENT, MORTGAG…
## $ Home.Ownership <dbl> 33045.79, 118616.11, 76458.39, 74408.14, …
## $ Verification.Status <fct> Verified, Not Verified, Source Verified, …
## $ Payment.Plan <fct> n, n, n, n, n, n, n, n, n, n, n, n, n, n,…
## $ Loan.Title <fct> Credit card refinancing, Credit card refi…
## $ Debit.to.Income <dbl> 18.663944, 37.987319, 17.236560, 8.678503…
## $ Delinquency...two.years <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 2, 0, 0, 0,…
## $ Inquires...six.months <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,…
## $ Open.Account <int> 15, 7, 16, 12, 12, 10, 11, 20, 34, 10, 27…
## $ Public.Record <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Revolving.Balance <int> 1426, 3394, 2652, 4249, 4173, 95, 3315, 1…
## $ Revolving.Utilities <dbl> 68.37336, 65.46295, 100.03446, 66.29889, …
## $ Total.Accounts <int> 22, 21, 16, 5, 20, 11, 14, 26, 10, 32, 10…
## $ Initial.List.Status <fct> f, f, w, f, f, f, f, f, w, f, f, f, w, w,…
## $ Total.Received.Interest <dbl> 1238.0518, 2094.7213, 926.5085, 744.5339,…
## $ Total.Received.Late.Fee <dbl> 0.098743645, 0.023595359, 0.045956855, 0.…
## $ Recoveries <dbl> 2.0186990, 8.9910775, 5.1422376, 3.901512…
## $ Collection.Recovery.Fee <dbl> 0.4745522, 1.1390663, 1.2014657, 0.680021…
## $ Collection.12.months.Medical <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ Application.Type <fct> INDIVIDUAL, INDIVIDUAL, INDIVIDUAL, INDIV…
## $ Last.week.Pay <int> 55, 90, 13, 153, 147, 71, 58, 135, 13, 13…
## $ Accounts.Delinquent <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Total.Collection.Amount <int> 56, 45, 613, 50, 36, 25, 20, 9, 18, 47, 3…
## $ Total.Current.Balance <int> 78057, 52369, 4451, 198127, 90387, 37732,…
## $ Total.Revolving.Credit.Limit <int> 9671, 68542, 18598, 19431, 4262, 11317, 9…
## $ Loan.Status <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Create new categories from the loan title and fix the data types
train <- train %>% mutate(Batch.Enrolled = as.factor(Batch.Enrolled),
Loan.Title = case_when(
Loan.Title %in% c('Debt Consolidation 2013','Debt Reduction','Debt','conso', 'debt loan','get out of debt','Consolidated','Debt Payoff', 'Loan Consolidation','relief','Lending Club','Debt Loan', 'CONSOLIDATION','Get Debt Free','Freedom','consolidation','Lending loan', 'Debt Free','Debt Consolidation Loan','Debt payoff','debt consolidation loan', 'Consolidation','Debt Consolidation','Debt consolidation','CONSOLIDATE', 'DEBT CONSOLIDATION','Consolidation Loan','consolidation loan','debt consolidation', 'consolidate','Consolidate','debt','Dept consolidation') ~ 'debt_consolidation',
Loan.Title %in% c('Credit Card' ,'Credit Card consolidation' , 'CC Refi' ,'pay off bills' ,'CC-Refinance' ,'Refinance Loan' , 'bills' ,'Credit' ,'Credit Card Paydown' ,'Credit Card Refinance Loan' , 'credit pay off' ,'Credit card refinancing' ,'CC Loan' ,'credit card refinance' , 'Credit Card Loan' ,'Credit Card Consolidation' ,'Card Consolidation' ,'CC consolidation' , 'Bill Payoff' ,'Credit Consolidation' ,'credit card consolidation' ,'Refinance' , 'refi' ,'Credit Card Debt' ,'Credit card payoff' ,'Credit Card Refinance' ,'CC Consolidation' , 'payoff' ,'Credit Cards' ,'CC Refinance' ,'Credit card pay off' ,'cards' , 'credit card' ,'Credit Card Refi' ,'Cards' ,'Pay Off' , 'Credit Loan' ,'Credit Card Payoff' ,'CC' ,'Payoff' , 'Credit card refinance' ,'Bill Consolidation' ,'Credit payoff') ~ 'credit_card',
Loan.Title %in% c('Home improvement', 'Home Improvement', 'Home Improvement Loan', 'home improvement', 'Pool', 'Home buying', 'House', 'Home', 'Bathroom', 'Home loan', 'Getting Ahead', 'Moving and relocation') ~ "home_and_improvement",
Loan.Title %in% c('vacation', 'Vacation') ~ "vacation",
Loan.Title %in% c('MYLOAN' ,'Personal' , 'Personal Loan' ,'loan1' ,'Loan 1' ,'My Loan' , 'Loan' ,'Personal loan' ,'Major purchase', 'personal') ~ "personal",
Loan.Title %in% c('Medical expenses', 'Medical loan', 'Medical') ~ "medical_loan",
Loan.Title %in% c('Car Loan', 'Car financing' ,'car') ~ "car",
Loan.Title == "Wedding Loan" ~ "wedding_loan",
Loan.Title == "Business" ~ "business",
Loan.Title == "Green loan" ~ "renewable_energy",
Loan.Title == 'Other' ~ 'other'),
Loan.Title = as.factor(Loan.Title)) %>%
rename('Inquires' = 'Inquires...six.months' ,
'Deliquency' = 'Delinquency...two.years',
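# note: in the raw data these two columns appear mislabeled (the glimpse above
# shows Employment.Duration holding RENT/MORTGAGE values), so the two renames
# below swap the labels back: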
'Home.Ownership' = 'Employment.Duration',
'Employment.Duration' = 'Home.Ownership',
'DTI' = 'Debit.to.Income',
'FAI' = 'Funded.Amount.Investor',
'RU' = 'Revolving.Utilities',
'ILS' = 'Initial.List.Status',
'TRI' = 'Total.Received.Interest',
'TRLF' = 'Total.Received.Late.Fee',
'CRF' = 'Collection.Recovery.Fee',
'C12' = 'Collection.12.months.Medical',
'TCA' = 'Total.Collection.Amount',
'TCB' = 'Total.Current.Balance',
'TRCL' = 'Total.Revolving.Credit.Limit',
'VS' = 'Verification.Status') %>%
select(-c(Accounts.Delinquent, Payment.Plan))
Check for missing data
# check missing value
colSums(is.na(train))
## ID Loan.Amount Funded.Amount FAI
## 0 0 0 0
## Term Batch.Enrolled Interest.Rate Grade
## 0 0 0 0
## Sub.Grade Home.Ownership Employment.Duration VS
## 0 0 0 0
## Loan.Title DTI Deliquency Inquires
## 0 0 0 0
## Open.Account Public.Record Revolving.Balance RU
## 0 0 0 0
## Total.Accounts ILS TRI TRLF
## 0 0 0 0
## Recoveries CRF C12 Application.Type
## 0 0 0 0
## Last.week.Pay TCA TCB TRCL
## 0 0 0 0
## Loan.Status
## 0
Remove duplicated and missing data
# remove duplicate rows
train <- unique(train)
# remove row containing NA value
train <- train %>% filter(complete.cases(.))
Data Distribution
Numeric variables:
There are 15 numeric variables. A density plot of these variables follows.
Ordinal variables:
There are 11 ordinal variables, one of which is the target variable. A barplot of these variables follows, together with a barplot of the target variable Loan.Status.
Proportion table of the target variable
succ_tab <- data.frame(prop.table((table(train$Loan.Status)))*100)
colnames(succ_tab) <- c("response","perc(%)")
succ_tab
## response perc(%)
## 1 0 90.367516
## 2 1 9.632484
The data is highly imbalanced between the two classes.
Correlation between features:
We will check the correlation between all the variables; the factor variables are first label-encoded so they can be included.
library(superml)
lbl = LabelEncoder$new()
train_encode <- train %>% mutate(Batch.Enrolled = lbl$fit_transform(Batch.Enrolled),
Grade = lbl$fit_transform(Grade),
Sub.Grade = lbl$fit_transform(Sub.Grade),
Home.Ownership = lbl$fit_transform(Home.Ownership),
VS = lbl$fit_transform(VS),
Loan.Title = lbl$fit_transform(Loan.Title),
ILS = lbl$fit_transform(ILS),
Application.Type = lbl$fit_transform(Application.Type))
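If superml is unavailable, an equivalent 0-based integer encoding can be sketched with base factors (the exact integer assigned to each level may differ from superml's):
# base-R alternative: map factor levels to 0-based integers
train_encode_alt <- train %>%
  mutate(across(c(Batch.Enrolled, Grade, Sub.Grade, Home.Ownership, VS,
                  Loan.Title, ILS, Application.Type),
                ~ as.integer(as.factor(.x)) - 1L))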
First, we look at each feature's correlation with the target, and then plot the full correlation heatmap:
train_corr <- train_encode %>% select(-ID) %>% select_if(is.numeric) %>% cor() %>% as.data.frame() %>% arrange(-Loan.Status)
train_corr %>% select(-c(1:31)) %>% round(4) %>% filter(Loan.Status != 1)
## Loan.Status
## TCA 0.0716
## Funded.Amount 0.0336
## TRLF 0.0317
## RU 0.0234
## Grade 0.0228
## Loan.Title 0.0207
## Employment.Duration 0.0188
## Recoveries 0.0176
## TCB 0.0176
## C12 0.0175
## Revolving.Balance 0.0128
## Last.week.Pay 0.0114
## FAI 0.0109
## Home.Ownership 0.0100
## Sub.Grade 0.0070
## Interest.Rate 0.0063
## DTI 0.0057
## Deliquency 0.0057
## Batch.Enrolled 0.0041
## TRCL 0.0019
## TRI 0.0010
## Public.Record -0.0001
## Total.Accounts -0.0011
## VS -0.0025
## Term -0.0035
## Open.Account -0.0067
## CRF -0.0073
## ILS -0.0080
## Application.Type -0.0097
## Inquires -0.0262
## Loan.Amount -0.0384
# creating correlation matrix
corr_train <- train_encode %>% select(-ID) %>% select_if(is.numeric) %>% cor() %>% as.data.frame()
corr_train <- round(corr_train,4)
# reduce the size of correlation matrix
melted_corr_train <- reshape2::melt(as.matrix(corr_train))
# melted_corr_train[is.na(melted_corr_train$value),] - Accounts.Delinquent has no correlation with others features
# plotting the correlation heatmap
ggplot(data = melted_corr_train, aes(x=Var1, y=Var2,
fill=value)) +
geom_tile() +
geom_text(aes(Var2, Var1, label = value),
color = "pink", size = 3) +
labs( x = NULL,
y = NULL,
fill = NULL) +
theme(legend.title = element_blank(),
axis.text.x = element_text(hjust = 1, angle = 45),
plot.title = element_text(face = "bold"),
panel.background = element_rect(fill = "#ffffff"),
axis.line.y = element_line(colour = "grey"),
axis.line.x = element_line(colour = "grey"))
The variables do not appear highly correlated; no values are close to dark blue or dark red. To be sure, we check the maximum and minimum correlations, excluding the diagonal values of one, since each variable is perfectly correlated with itself:
| Maximum correlation | Minimum correlation |
|---|---|
| 0.076 | -0.0846 |
With a maximum correlation of 0.076 and a minimum of -0.0846, no variable needs to be omitted on collinearity grounds.
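These extremes can be reproduced by masking the diagonal of the correlation matrix; a short sketch:
# off-diagonal extremes (the diagonal of 1s is masked out first)
cm <- train_encode %>% select(-ID) %>% select_if(is.numeric) %>% cor()
diag(cm) <- NA
c(max = max(cm, na.rm = TRUE), min = min(cm, na.rm = TRUE))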
Check data distribution of each predictor
train %>%
select_if(is.numeric) %>%
select(-ID) %>%
boxplot(main = 'Distribution of Each Predictor', xlab = 'Predictor', ylab = 'Values')
Our data can be examined visually to identify whether any outliers are present. Outliers lie outside the majority of the data and, by forcing the model to accommodate them, can reduce its resolving power. The boxplot shows that several predictors have noticeable outliers.
Model Fitting and Evaluation
Data Splitting
We now split the data into training and validation sets. The training set is used to fit the model, which is then evaluated on the validation set.
Split the Data and Balance Classes with SMOTE
library(performanceEstimation)
# To SMOTE train_set
train_clean <- train_encode %>% select(-c(ID ,Home.Ownership, Loan.Title, Sub.Grade, VS, Batch.Enrolled, ILS, Deliquency, Application.Type)) %>% mutate(Loan.Status = as.factor(Loan.Status))
index <- createDataPartition(train_clean$Loan.Status, p = 0.80,
list = FALSE,
times = 1)
# Split the dataset using the defined partition
data_train <- train_clean[index, ,drop=FALSE]
data_test <- train_clean[-index, ,drop=FALSE]
set.seed(1234)
train_smote <- smote(Loan.Status ~ ., data_train , perc.over = 4, perc.under = 1.5)
prop.table((table(train_smote$Loan.Status)))
##
## 0 1
## 0.5454545 0.4545455
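These proportions follow from the smote() arguments: perc.over = 4 generates four synthetic cases per original minority case (growing the minority to 5x its size), and perc.under = 1.5 keeps 1.5 majority cases per synthetic minority case (4 x 1.5 = 6x the minority count), giving the 6/11 ≈ 0.545 vs 5/11 ≈ 0.455 split above.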
Model Fitting
Model NeuralNet
model_nn1 <- neuralnet(formula = Loan.Status ~ .,
data = train_smote,
hidden = c(5,3),
err.fct = "ce", # rumus perhitungan error yang diinginkan (ce = cross-entropy)
act.fct = "logistic", # fungsi untuk mengubah menjadi probabiliti dengan menggunakan fungsi logistik
linear.output = F) # pada bagian output bukan angka (khusus untuk kasus klasifikasi)
plot(model_nn1, rep = "best")
pred_nn1 <- neuralnet::compute(model_nn1, data_test)
p1 <- pred_nn1$net.result
pred1 <- ifelse(p1>0.5, 1, 0)
head(pred1)
## [,1] [,2]
## 7 1 0
## 15 1 0
## 18 1 0
## 34 1 0
## 35 1 0
## 38 1 0
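The two output columns correspond to the two levels of Loan.Status, so hard class labels can be recovered by taking the larger column. A sketch (assuming column 1 is class 0 and column 2 is class 1):
# convert the two-column network output to 0/1 labels
pred_nn1_class <- max.col(p1) - 1
table(True = data_test$Loan.Status, Predicted = pred_nn1_class)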
Logistic Regression Model
model_glm <- glm(Loan.Status~., family = binomial, train_smote)
summary(model_glm)
##
## Call:
## glm(formula = Loan.Status ~ ., family = binomial, data = train_smote)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3126 -1.0915 -0.8546 1.2174 1.7526
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.543e-01 6.922e-01 -0.945 0.344524
## Loan.Amount -2.361e-05 4.901e-06 -4.817 1.46e-06 ***
## Funded.Amount 8.003e-06 4.759e-06 1.682 0.092658 .
## FAI 1.317e-06 5.818e-06 0.226 0.820920
## Term 1.404e-02 1.091e-02 1.287 0.198070
## Interest.Rate 4.479e-04 1.077e-02 0.042 0.966813
## Grade 2.823e-02 2.853e-02 0.989 0.322432
## Employment.Duration 1.518e-06 8.541e-07 1.777 0.075511 .
## DTI -1.412e-02 4.714e-03 -2.996 0.002739 **
## Inquires -2.633e-01 1.088e-01 -2.419 0.015542 *
## Open.Account -2.260e-02 6.688e-03 -3.379 0.000728 ***
## Public.Record -2.789e-02 1.276e-01 -0.219 0.826953
## Revolving.Balance 1.765e-05 5.599e-06 3.152 0.001620 **
## RU 5.630e-03 1.834e-03 3.070 0.002142 **
## Total.Accounts -3.288e-03 4.784e-03 -0.687 0.491831
## TRI -4.375e-05 1.850e-05 -2.365 0.018018 *
## TRLF 6.363e-03 7.818e-03 0.814 0.415726
## Recoveries -1.380e-05 1.127e-04 -0.122 0.902516
## CRF -6.966e-03 1.386e-02 -0.503 0.615152
## C12 3.104e-01 2.749e-01 1.129 0.258796
## Last.week.Pay 1.911e-03 9.293e-04 2.057 0.039718 *
## TCA 2.135e-04 6.495e-05 3.287 0.001013 **
## TCB 4.392e-08 2.910e-07 0.151 0.880009
## TRCL -4.416e-06 2.119e-06 -2.084 0.037194 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3941.1 on 2859 degrees of freedom
## Residual deviance: 3841.7 on 2836 degrees of freedom
## AIC: 3889.7
##
## Number of Fisher Scoring iterations: 4
Summary of the predicted probabilities
pred_model_glm <-predict(model_glm, type="response")
summary(pred_model_glm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1727 0.3970 0.4556 0.4545 0.5134 0.9315
Naive Choice of Cut-off probability
pcut1<- mean(as.numeric(as.character(train_smote$Loan.Status)))
class.glm.train<- (as.numeric(pred_model_glm)>pcut1)*1
summary(class.glm.train)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 1.0000 0.5049 1.0000 1.0000
Confusion Matrix for the Logistic Regression Model
# get confusion matrix; as.numeric() on a factor returns level indices,
# so the "True" rows 1 and 2 below correspond to classes 0 and 1
table(as.numeric(train_smote$Loan.Status), class.glm.train, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## 1 872 688
## 2 544 756
# define a cost function with input "obs" being observed response
# and "pi" being predicted probability, and "pcut" being the threshold.
costfunc = function(obs, pred.p, pcut){
weight1 = 5 # define the weight for "true=1 but pred=0" (FN)
weight0 = 1 # define the weight for "true=0 but pred=1" (FP)
c1 = (obs==1)&(pred.p<pcut) # count for "true=1 but pred=0" (FN)
c0 = (obs==0)&(pred.p>=pcut) # count for "true=0 but pred=1" (FP)
cost = mean(weight1*c1 + weight0*c0) # misclassification with weight
return(cost) # return the weighted misclassification cost
} # end of the function
# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01)
# write a loop for all p-cut to see which one provides the smallest cost
# first, need to define a 0 vector in order to save the value of cost from all pcut
cost = rep(0, length(p.seq))
for(i in 1:length(p.seq)){
cost[i] = costfunc(obs = train_smote$Loan.Status, pred.p = as.numeric(pred_model_glm), pcut = p.seq[i])
} # end of the loop
optimal.pcut = p.seq[which(cost==min(cost))][1]
pcut<-optimal.pcut
optimal.pcut.asymmetric<-optimal.pcut
optimal.pcut
## [1] 0.29
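Because weight1 = 5 makes a false negative five times as costly as a false positive, the cost-minimising cutoff is pulled well below the naive ~0.5 threshold, down to 0.29 here.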
Plotting the weighted misclassification cost against the range of probability cutoffs
plot(p.seq, cost)
ROC curve
pred_model_glm_test <- ROCR::prediction(pred_model_glm, train_smote$Loan.Status)
pref_model_glm_test <- ROCR::performance(pred_model_glm_test, "tpr", "fpr")
plot(pref_model_glm_test, colorize=TRUE)
Precision-Recall curve - another way to compute the AUC
unlist(slot(ROCR::performance(pred_model_glm_test, "auc"), "y.values"))
## [1] 0.6017668
score1= pred_model_glm[train_smote$Loan.Status==1]
score0= pred_model_glm[train_smote$Loan.Status==0]
roc= PRROC::roc.curve(score1, score0, curve = T)
roc$auc
## [1] 0.6017668
Plotting PR curve
pr= PRROC::pr.curve(score1, score0, curve = T)
plot(pr)
Model NNET
model_nnet1 <- nnet(Loan.Status ~ . , data = train_smote, linout=F, trace = F, size = 20, maxit = 500)
ROC curve
pred_model_nnet1_test <- ROCR::prediction(predict(model_nnet1, newdata=data_test, type="raw"), data_test$Loan.Status)
perf_model_nnet1_test <- ROCR::performance(pred_model_nnet1_test, "tpr", "fpr")
plot(perf_model_nnet1_test, colorize=TRUE, lwd=2, main="model_nnet1:ROC - Neural Network")
abline(a=0,b=1)
table(data_test$Loan.Status,predict(model_nnet1,newdata=data_test, type="class"))
##
## 0 1
## 0 373 236
## 1 39 26
plot(model_nnet1$residuals)
Model Neural Network (caret package)
set.seed(123)
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
train_control <- trainControl(method = "repeatedcv", number = 10, repeats=5)
par(mfrow=c(1,1))
model_nnet2 <- train(Loan.Status~., train_smote,
method = "nnet", trace = FALSE,
trControl= train_control)
model_nnet2
## Neural Network
##
## 2860 samples
## 23 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 2574, 2574, 2574, 2574, 2574, 2574, ...
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa
## 1 0e+00 0.5454545 0.0000000000
## 1 1e-04 0.5453846 -0.0001397624
## 1 1e-01 0.5440559 0.0148953040
## 3 0e+00 0.5451049 0.0015127636
## 3 1e-04 0.5448951 0.0068549095
## 3 1e-01 0.5518182 0.0494164668
## 5 0e+00 0.5453147 0.0147395523
## 5 1e-04 0.5493007 0.0330137891
## 5 1e-01 0.5562937 0.0628183736
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.1.
pred_nnet2_train <- predict(model_nnet2, type = "prob")[,2]
pred_nnet2_test <- predict(model_nnet2, newdata = data_test ,type = "prob")[,2]
pred_nnet2_train <- as.numeric(pred_nnet2_train > optimal.pcut)
pred_nnet2_test <- as.numeric(pred_nnet2_test > optimal.pcut)
cm_nnet2_train <- table(train_smote$Loan.Status, pred_nnet2_train)
cm_nnet2_test <- table(data_test$Loan.Status, pred_nnet2_test)
misscl_nnet2_train <- round((cm_nnet2_train[2]+cm_nnet2_train[3])/sum(cm_nnet2_train), 2)
misscl_nnet2_test <- round((cm_nnet2_test[2]+cm_nnet2_test[3])/sum(cm_nnet2_test), 2)
cat("train misclassfication rate:", misscl_nnet2_train, "| test misclassfication rate:", misscl_nnet2_test)
## train misclassfication rate: 0.49 | test misclassfication rate: 0.79
ROC curve
pred_nnet2_test_roc <- ROCR::prediction(pred_nnet2_test, data_test$Loan.Status)
pred_nnet2_test_roc <- ROCR::performance(pred_nnet2_test_roc, "tpr", "fpr")
plot(pred_nnet2_test_roc, colorize=TRUE)
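Note that pred_nnet2_test was already thresholded to 0/1 above, so this curve collapses to a single operating point (hence the near-0.5 AUC below). A more informative curve would use the raw class-1 probabilities, e.g.:
# ROC on raw probabilities rather than thresholded labels
prob_nnet2 <- predict(model_nnet2, newdata = data_test, type = "prob")[, 2]
roc_nnet2 <- ROCR::prediction(prob_nnet2, data_test$Loan.Status)
plot(ROCR::performance(roc_nnet2, "tpr", "fpr"), colorize = TRUE)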
Precision-Recall curve - another way to compute the AUC
score1= pred_nnet2_test[data_test$Loan.Status==1]
score0= pred_nnet2_test[data_test$Loan.Status==0]
roc= PRROC::roc.curve(score1, score0, curve = T)
roc$auc
## [1] 0.4945055
Plotting PR curve
pr= PRROC::pr.curve(score1, score0, curve = T)
plot(pr)
Model Improvement
Random Forest
Create a random forest model as model_rf:
set.seed(123)
model_rf <- randomForest(x = train_smote %>% select(-Loan.Status),
y = train_smote$Loan.Status,
ntree = 500)
model_rf
##
## Call:
## randomForest(x = train_smote %>% select(-Loan.Status), y = train_smote$Loan.Status, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 6.01%
## Confusion matrix:
## 0 1 class.error
## 0 1518 42 0.02692308
## 1 130 1170 0.10000000
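Note that this ~6% OOB error is computed on the SMOTE-balanced training data; since synthetic minority cases are interpolations of real ones, the estimate is likely optimistic, and the untouched validation split below gives a fairer picture.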
Check each predictor's contribution to the target variable
# note: model_rf$finalModel would be NULL here, since finalModel exists only
# on caret::train objects; varImp() works on the randomForest object directly
varImp(model_rf)
## Overall
## Loan.Amount 60.878598
## Funded.Amount 63.463375
## FAI 56.080875
## Term 158.884273
## Interest.Rate 54.385025
## Grade 78.523177
## Employment.Duration 62.664074
## DTI 58.428727
## Inquires 9.727607
## Open.Account 54.640021
## Public.Record 10.531676
## Revolving.Balance 68.690633
## RU 60.263150
## Total.Accounts 56.990086
## TRI 77.185770
## TRLF 63.703020
## Recoveries 64.206673
## CRF 56.089286
## C12 3.304977
## Last.week.Pay 62.236061
## TCA 100.292971
## TCB 67.346601
## TRCL 68.982525
Model Random Forest - Evaluation
set.seed(66)
pred_rf_val <- predict(object = model_rf, newdata = data_test, type="class")
tab_rf <- table(Predicted = pred_rf_val, Actual = data_test$Loan.Status)
tab_rf
## Actual
## Predicted 0 1
## 0 577 61
## 1 32 4
pred_rf_val <- as.factor(pred_rf_val)
paste0("Accuracy : ",round(sum(diag(tab_rf))/sum(tab_rf)*100,2))
## [1] "Accuracy : 86.2"
Support Vector Machine
library(e1071)
model_svm <- svm(Loan.Status ~ ., data = train_smote)
We also fit a support vector machine (SVM) as a further comparison; as the evaluation below shows, its accuracy falls well short of the random forest's.
set.seed(66)
pred_svm_val <- predict(object = model_svm, newdata = data_test, type="class")
tab_svm <- table(Predicted = pred_svm_val, Actual = data_test$Loan.Status)
tab_svm
## Actual
## Predicted 0 1
## 0 442 41
## 1 167 24
pred_svm_val <- as.factor(pred_svm_val)
paste0("Accuracy : ",round(sum(diag(tab_svm))/sum(tab_svm)*100,2))
## [1] "Accuracy : 69.14"
Conclusion
In this project, we examined loan applications with a variety of attributes and developed models that align with the available information, using them to predict whether a loan will default.
Throughout the project, we primarily employed a random forest model. Compared to standard logistic regression, it describes the data better, and despite being more complicated it can still be interpreted through its variable importances. The prediction model model_rf obtained an accuracy of about 86.2%, although given the class imbalance this figure should be read alongside the per-class results above.