Business Problem
Banks incur losses when customers do not repay their loans on time. Every year these losses run into crores, which also weighs heavily on the country's economic growth. In this hackathon, we look at attributes such as funded amount, location, loan balance, etc., to predict whether a person will default on a loan.
To solve this problem, MachineHack has created a training dataset of 67,463 rows and 35 columns and a testing dataset of 28,913 rows and 34 columns. The hackathon demands a few prerequisite skills: handling a large dataset, managing the underfitting vs. overfitting trade-off, and optimising "log_loss" so that the model generalises well to unseen data.
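Since the competition metric is log loss, it is worth having the formula on hand. A minimal sketch (assuming actual is a 0/1 vector and predicted a vector of probabilities, clipped to avoid log(0)):
# log loss: penalises confident wrong predictions heavily
log_loss <- function(actual, predicted, eps = 1e-15) {
  predicted <- pmin(pmax(predicted, eps), 1 - eps)  # clip away exact 0 and 1
  -mean(actual * log(predicted) + (1 - actual) * log(1 - predicted))
}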
Data is provided by MachineHack x Deloitte: https://www.kaggle.com/datasets/ankitkalauni/bank-loan-defaulter-prediction-hackathon
Data Dictionary
Variable | Definition
- ID: unique ID of the representative
- Loan Amount: loan amount applied for
- Funded Amount: loan amount funded
- Funded Amount Investor: loan amount approved by the investors
- Term: term of the loan (in months)
- Batch Enrolled: batch numbers to representatives
- Interest Rate: interest rate (%) on loan
- Grade: grade by the bank
- Sub Grade: sub-grade by the bank
- Employment Duration: duration of employment
- Home Ownership: ownership of home
- Verification Status: Income verification by the bank
- Payment Plan: if any payment plan has started against loan
- Loan Title: loan title provided
- Debit to Income: ratio of the representative's total monthly debt repayment to self-reported monthly income, excluding mortgage
- Delinquency - two years: number of 30+ day delinquencies in the past 2 years
- Inquires - six months: total number of inquiries in the last 6 months
- Open Account: number of open credit lines in the representative's credit file
- Public Record: number of derogatory public records
- Revolving Balance: total credit revolving balance
- Revolving Utilities: amount of credit the representative is using relative to the revolving balance
- Total Accounts: total number of credit lines in the representative's credit file
- Initial List Status: unique listing status of the loan - W(Waiting), F(Forwarded)
- Total Received Interest: total interest received till date
- Total Received Late Fee: total late fee received till date
- Recoveries: post charge off gross recovery
- Collection Recovery Fee: post charge off collection fee
- Collection 12 months Medical: total collections in last 12 months excluding medical collections
- Application Type: indicates whether the representative's application is individual or joint
- Last week Pay: indicates how long (in weeks) a representative has paid EMI after batch enrolled
- Accounts Delinquent: number of accounts on which the representative is delinquent
- Total Collection Amount: total collection amount ever owed
- Total Current Balance: total current balance from all accounts
- Total Revolving Credit Limit: total revolving credit limit
- Loan Status: 1 = Defaulter, 0 = Non-Defaulter
Data Processing
Load Libraries
library(caret)
library(pROC)
library(ggplot2)
library(corrplot)
library(tibble)
library(glue)
library(plotly)
library(keras)
library(neuralnet)
library(tidyverse)
library(nnet)
library(randomForest)
library(DMwR)
library(party)
library(e1071)
library(psych)
library(knitr)
library(reshape2)
library(expss)
Read Data
# read data
train <- read.csv("data/train.csv", stringsAsFactors = T)
Reduce the amount of data to speed up processing
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)
trainIndex <- createDataPartition(train$Loan.Status, p = 0.05,
list = FALSE,
times = 1)
train <- train[trainIndex, ,drop=FALSE]
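As a quick sanity check (not part of the original notebook), we can confirm that the stratified 5% sample preserves the class proportions:
# class balance should be roughly unchanged after stratified sampling
prop.table(table(train$Loan.Status))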
The sampled training data retains all 35 variables described in the data dictionary above.
To ensure the data is fully prepared, it must be cleaned and transformed before analysis and modeling. Depending on the data, this can involve type conversions, scaling, handling outliers, or other statistical strategies; the following preprocessing steps cover the ones needed here.
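For instance, numeric predictors could be centred and scaled with caret's preProcess. This is a sketch only; the models below are fitted on the untransformed values:
# hypothetical scaling step (not applied in the analysis that follows)
pp <- preProcess(train %>% select_if(is.numeric) %>% select(-ID),
                 method = c("center", "scale"))
train_scaled <- predict(pp, train %>% select_if(is.numeric) %>% select(-ID))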
Pre-processing
# data structure
glimpse(train)
## Rows: 3,374
## Columns: 35
## $ ID <int> 1786725, 3910168, 63629135, 5777207, 1428…
## $ Loan.Amount <int> 30665, 11480, 15033, 18874, 20108, 19498,…
## $ Funded.Amount <int> 10765, 6287, 22997, 16334, 21864, 31990, …
## $ Funded.Amount.Investor <dbl> 20266.841, 13873.335, 31357.055, 9289.247…
## $ Term <int> 59, 59, 59, 59, 58, 36, 59, 58, 58, 59, 5…
## $ Batch.Enrolled <fct> BAT2078974, BAT4271519, BAT3873588, BAT38…
## $ Interest.Rate <dbl> 5.755122, 8.721857, 11.184074, 22.019769,…
## $ Grade <fct> B, A, C, C, C, A, A, C, B, A, A, E, B, E,…
## $ Sub.Grade <fct> A2, D2, B1, F4, B3, D4, D3, B2, C2, B4, C…
## $ Employment.Duration <fct> RENT, RENT, RENT, MORTGAGE, RENT, MORTGAG…
## $ Home.Ownership <dbl> 33045.79, 118616.11, 76458.39, 74408.14, …
## $ Verification.Status <fct> Verified, Not Verified, Source Verified, …
## $ Payment.Plan <fct> n, n, n, n, n, n, n, n, n, n, n, n, n, n,…
## $ Loan.Title <fct> Credit card refinancing, Credit card refi…
## $ Debit.to.Income <dbl> 18.663944, 37.987319, 17.236560, 8.678503…
## $ Delinquency...two.years <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 2, 0, 0, 0,…
## $ Inquires...six.months <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,…
## $ Open.Account <int> 15, 7, 16, 12, 12, 10, 11, 20, 34, 10, 27…
## $ Public.Record <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Revolving.Balance <int> 1426, 3394, 2652, 4249, 4173, 95, 3315, 1…
## $ Revolving.Utilities <dbl> 68.37336, 65.46295, 100.03446, 66.29889, …
## $ Total.Accounts <int> 22, 21, 16, 5, 20, 11, 14, 26, 10, 32, 10…
## $ Initial.List.Status <fct> f, f, w, f, f, f, f, f, w, f, f, f, w, w,…
## $ Total.Received.Interest <dbl> 1238.0518, 2094.7213, 926.5085, 744.5339,…
## $ Total.Received.Late.Fee <dbl> 0.098743645, 0.023595359, 0.045956855, 0.…
## $ Recoveries <dbl> 2.0186990, 8.9910775, 5.1422376, 3.901512…
## $ Collection.Recovery.Fee <dbl> 0.4745522, 1.1390663, 1.2014657, 0.680021…
## $ Collection.12.months.Medical <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ Application.Type <fct> INDIVIDUAL, INDIVIDUAL, INDIVIDUAL, INDIV…
## $ Last.week.Pay <int> 55, 90, 13, 153, 147, 71, 58, 135, 13, 13…
## $ Accounts.Delinquent <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Total.Collection.Amount <int> 56, 45, 613, 50, 36, 25, 20, 9, 18, 47, 3…
## $ Total.Current.Balance <int> 78057, 52369, 4451, 198127, 90387, 37732,…
## $ Total.Revolving.Credit.Limit <int> 9671, 68542, 18598, 19431, 4262, 11317, 9…
## $ Loan.Status <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Create new categories from the loan title and fix the data types
train <- train %>% mutate(Batch.Enrolled = as.factor(Batch.Enrolled),
Loan.Title = case_when(
Loan.Title %in% c('Debt Consolidation 2013','Debt Reduction','Debt','conso', 'debt loan','get out of debt','Consolidated','Debt Payoff', 'Loan Consolidation','relief','Lending Club','Debt Loan', 'CONSOLIDATION','Get Debt Free','Freedom','consolidation','Lending loan', 'Debt Free','Debt Consolidation Loan','Debt payoff','debt consolidation loan', 'Consolidation','Debt Consolidation','Debt consolidation','CONSOLIDATE', 'DEBT CONSOLIDATION','Consolidation Loan','consolidation loan','debt consolidation', 'consolidate','Consolidate','debt','Dept consolidation') ~ 'debt_consolidation',
Loan.Title %in% c('Credit Card' ,'Credit Card consolidation' , 'CC Refi' ,'pay off bills' ,'CC-Refinance' ,'Refinance Loan' , 'bills' ,'Credit' ,'Credit Card Paydown' ,'Credit Card Refinance Loan' , 'credit pay off' ,'Credit card refinancing' ,'CC Loan' ,'credit card refinance' , 'Credit Card Loan' ,'Credit Card Consolidation' ,'Card Consolidation' ,'CC consolidation' , 'Bill Payoff' ,'Credit Consolidation' ,'credit card consolidation' ,'Refinance' , 'refi' ,'Credit Card Debt' ,'Credit card payoff' ,'Credit Card Refinance' ,'CC Consolidation' , 'payoff' ,'Credit Cards' ,'CC Refinance' ,'Credit card pay off' ,'cards' , 'credit card' ,'Credit Card Refi' ,'Cards' ,'Pay Off' , 'Credit Loan' ,'Credit Card Payoff' ,'CC' ,'Payoff' , 'Credit card refinance' ,'Bill Consolidation' ,'Credit payoff') ~ 'credit_card',
Loan.Title %in% c('Home improvement', 'Home Improvement', 'Home Improvement Loan', 'home improvement', 'Pool', 'Home buying', 'House', 'Home', 'Bathroom', 'Home loan', 'Getting Ahead', 'Moving and relocation') ~ "home_and_improvement",
Loan.Title %in% c('vacation', 'Vacation') ~ "vacation",
Loan.Title %in% c('MYLOAN' ,'Personal' , 'Personal Loan' ,'loan1' ,'Loan 1' ,'My Loan' , 'Loan' ,'Personal loan' ,'Major purchase', 'personal') ~ "personal",
Loan.Title %in% c('Medical expenses', 'Medical loan', 'Medical') ~ "medical_loan",
Loan.Title %in% c('Car Loan', 'Car financing' ,'car') ~ "car",
Loan.Title == "Wedding Loan" ~ "wedding_loan",
Loan.Title == "Business" ~ "business",
Loan.Title == "Green loan" ~ "renewable_energy",
Loan.Title == 'Other' ~ 'other'),
Loan.Title = as.factor(Loan.Title)) %>%
rename('Inquires' = 'Inquires...six.months' ,
'Deliquency' = 'Delinquency...two.years',
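# note: in the raw data these two columns appear mislabeled (the glimpse above
# shows Employment.Duration holding RENT/MORTGAGE values), so the two renames
# below swap the labels back: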
'Home.Ownership' = 'Employment.Duration',
'Employment.Duration' = 'Home.Ownership',
'DTI' = 'Debit.to.Income',
'FAI' = 'Funded.Amount.Investor',
'RU' = 'Revolving.Utilities',
'ILS' = 'Initial.List.Status',
'TRI' = 'Total.Received.Interest',
'TRLF' = 'Total.Received.Late.Fee',
'CRF' = 'Collection.Recovery.Fee',
'C12' = 'Collection.12.months.Medical',
'TCA' = 'Total.Collection.Amount',
'TCB' = 'Total.Current.Balance',
'TRCL' = 'Total.Revolving.Credit.Limit',
'VS' = 'Verification.Status') %>%
select(-c(Accounts.Delinquent, Payment.Plan))
Check for missing data
# check missing value
colSums(is.na(train))
## ID Loan.Amount Funded.Amount FAI
## 0 0 0 0
## Term Batch.Enrolled Interest.Rate Grade
## 0 0 0 0
## Sub.Grade Home.Ownership Employment.Duration VS
## 0 0 0 0
## Loan.Title DTI Deliquency Inquires
## 0 0 0 0
## Open.Account Public.Record Revolving.Balance RU
## 0 0 0 0
## Total.Accounts ILS TRI TRLF
## 0 0 0 0
## Recoveries CRF C12 Application.Type
## 0 0 0 0
## Last.week.Pay TCA TCB TRCL
## 0 0 0 0
## Loan.Status
## 0
Remove duplicated and missing data
# remove duplicate rows
train <- unique(train)
# remove row containing NA value
train <- train %>% filter(complete.cases(.))
Data Distribution
Numeric variables:
There are 15 numeric variables. A density plot of these variables follows.
Ordinal variables:
There are 11 ordinal variables, one of which is the target variable. A barplot of these variables follows, together with a barplot of the target variable Loan.Status.
Proportion table of the target variable
succ_tab <- data.frame(prop.table((table(train$Loan.Status)))*100)
colnames(succ_tab) <- c("response","perc(%)")
succ_tab
## response perc(%)
## 1 0 90.367516
## 2 1 9.632484
The data is highly imbalanced between the two classes.
Correlation between features:
We will check the correlation between all the variables; the factor variables are first label-encoded so they can be included.
library(superml)
lbl = LabelEncoder$new()
train_encode <- train %>% mutate(Batch.Enrolled = lbl$fit_transform(Batch.Enrolled),
Grade = lbl$fit_transform(Grade),
Sub.Grade = lbl$fit_transform(Sub.Grade),
Home.Ownership = lbl$fit_transform(Home.Ownership),
VS = lbl$fit_transform(VS),
Loan.Title = lbl$fit_transform(Loan.Title),
ILS = lbl$fit_transform(ILS),
Application.Type = lbl$fit_transform(Application.Type))
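If superml is unavailable, an equivalent 0-based integer encoding can be sketched with base factors (the exact integer assigned to each level may differ from superml's):
# base-R alternative: map factor levels to 0-based integers
train_encode_alt <- train %>%
  mutate(across(c(Batch.Enrolled, Grade, Sub.Grade, Home.Ownership, VS,
                  Loan.Title, ILS, Application.Type),
                ~ as.integer(as.factor(.x)) - 1L))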
First, we look at each feature's correlation with the target, and then plot the full correlation heatmap:
train_corr <- train_encode %>% select(-ID) %>% select_if(is.numeric) %>% cor() %>% as.data.frame() %>% arrange(-Loan.Status)
train_corr %>% select(-c(1:31)) %>% round(4) %>% filter(Loan.Status != 1)
## Loan.Status
## TCA 0.0716
## Funded.Amount 0.0336
## TRLF 0.0317
## RU 0.0234
## Grade 0.0228
## Loan.Title 0.0207
## Employment.Duration 0.0188
## Recoveries 0.0176
## TCB 0.0176
## C12 0.0175
## Revolving.Balance 0.0128
## Last.week.Pay 0.0114
## FAI 0.0109
## Home.Ownership 0.0100
## Sub.Grade 0.0070
## Interest.Rate 0.0063
## DTI 0.0057
## Deliquency 0.0057
## Batch.Enrolled 0.0041
## TRCL 0.0019
## TRI 0.0010
## Public.Record -0.0001
## Total.Accounts -0.0011
## VS -0.0025
## Term -0.0035
## Open.Account -0.0067
## CRF -0.0073
## ILS -0.0080
## Application.Type -0.0097
## Inquires -0.0262
## Loan.Amount -0.0384
# creating correlation matrix
corr_train <- train_encode %>% select(-ID) %>% select_if(is.numeric) %>% cor() %>% as.data.frame()
corr_train <- round(corr_train,4)
# reduce the size of correlation matrix
melted_corr_train <- reshape2::melt(as.matrix(corr_train))
# melted_corr_train[is.na(melted_corr_train$value),] - Accounts.Delinquent has no correlation with others features
# plotting the correlation heatmap
ggplot(data = melted_corr_train, aes(x=Var1, y=Var2,
fill=value)) +
geom_tile() +
geom_text(aes(Var2, Var1, label = value),
color = "pink", size = 3) +
labs( x = NULL,
y = NULL,
fill = NULL) +
theme(legend.title = element_blank(),
axis.text.x = element_text(hjust = 1, angle = 45),
plot.title = element_text(face = "bold"),
panel.background = element_rect(fill = "#ffffff"),
axis.line.y = element_line(colour = "grey"),
axis.line.x = element_line(colour = "grey"))
The variables do not appear highly correlated; no values are close to dark blue or dark red. To be sure, we check the maximum and minimum correlations, excluding the diagonal values of one, since each variable is perfectly correlated with itself:
| Maximum correlation | Minimum correlation |
|---|---|
| 0.076 | -0.0846 |
With a maximum correlation of 0.076 and a minimum of -0.0846, no variable needs to be omitted on collinearity grounds.
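These extremes can be reproduced by masking the diagonal of the correlation matrix; a short sketch:
# off-diagonal extremes (the diagonal of 1s is masked out first)
cm <- train_encode %>% select(-ID) %>% select_if(is.numeric) %>% cor()
diag(cm) <- NA
c(max = max(cm, na.rm = TRUE), min = min(cm, na.rm = TRUE))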
Check data distribution of each predictor
train %>%
select_if(is.numeric) %>%
select(-ID) %>%
boxplot(main = 'Distribution of Each Predictor', xlab = 'Predictor', ylab = 'Values')
Our data can be examined visually to identify whether any outliers are present. Outliers lie outside the majority of the data and, by forcing the model to accommodate them, can reduce its resolving power. The boxplot shows that several predictors have noticeable outliers.
Model Fitting and Evaluation
Data Splitting
We now split the data into training and validation sets. The training set is used to fit the model, which is then evaluated on the validation set.
Split the Data and Balance Classes with SMOTE
library(performanceEstimation)
# To SMOTE train_set
train_clean <- train_encode %>% select(-c(ID ,Home.Ownership, Loan.Title, Sub.Grade, VS, Batch.Enrolled, ILS, Deliquency, Application.Type)) %>% mutate(Loan.Status = as.factor(Loan.Status))
index <- createDataPartition(train_clean$Loan.Status, p = 0.80,
list = FALSE,
times = 1)
# Split the dataset using the defined partition
data_train <- train_clean[index, ,drop=FALSE]
data_test <- train_clean[-index, ,drop=FALSE]
set.seed(1234)
train_smote <- smote(Loan.Status ~ ., data_train , perc.over = 4, perc.under = 1.5)
prop.table((table(train_smote$Loan.Status)))
##
## 0 1
## 0.5454545 0.4545455
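These proportions follow from the smote() arguments: perc.over = 4 generates four synthetic cases per original minority case (growing the minority to 5x its size), and perc.under = 1.5 keeps 1.5 majority cases per synthetic minority case (4 x 1.5 = 6x the minority count), giving the 6/11 ≈ 0.545 vs 5/11 ≈ 0.455 split above.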
Model Fitting
Model NeuralNet
model_nn1 <- neuralnet(formula = Loan.Status ~ .,
data = train_smote,
hidden = c(5,3),
err.fct = "ce", # rumus perhitungan error yang diinginkan (ce = cross-entropy)
act.fct = "logistic", # fungsi untuk mengubah menjadi probabiliti dengan menggunakan fungsi logistik
linear.output = F) # pada bagian output bukan angka (khusus untuk kasus klasifikasi)
plot(model_nn1, rep = "best")
pred_nn1 <- neuralnet::compute(model_nn1, data_test)
p1 <- pred_nn1$net.result
pred1 <- ifelse(p1>0.5, 1, 0)
head(pred1)
## [,1] [,2]
## 7 1 0
## 15 1 0
## 18 1 0
## 34 1 0
## 35 1 0
## 38 1 0
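The two output columns correspond to the two levels of Loan.Status, so hard class labels can be recovered by taking the larger column. A sketch (assuming column 1 is class 0 and column 2 is class 1):
# convert the two-column network output to 0/1 labels
pred_nn1_class <- max.col(p1) - 1
table(True = data_test$Loan.Status, Predicted = pred_nn1_class)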
Logistic Regression Model
model_glm <- glm(Loan.Status~., family = binomial, train_smote)
summary(model_glm)
##
## Call:
## glm(formula = Loan.Status ~ ., family = binomial, data = train_smote)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3126 -1.0915 -0.8546 1.2174 1.7526
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.543e-01 6.922e-01 -0.945 0.344524
## Loan.Amount -2.361e-05 4.901e-06 -4.817 1.46e-06 ***
## Funded.Amount 8.003e-06 4.759e-06 1.682 0.092658 .
## FAI 1.317e-06 5.818e-06 0.226 0.820920
## Term 1.404e-02 1.091e-02 1.287 0.198070
## Interest.Rate 4.479e-04 1.077e-02 0.042 0.966813
## Grade 2.823e-02 2.853e-02 0.989 0.322432
## Employment.Duration 1.518e-06 8.541e-07 1.777 0.075511 .
## DTI -1.412e-02 4.714e-03 -2.996 0.002739 **
## Inquires -2.633e-01 1.088e-01 -2.419 0.015542 *
## Open.Account -2.260e-02 6.688e-03 -3.379 0.000728 ***
## Public.Record -2.789e-02 1.276e-01 -0.219 0.826953
## Revolving.Balance 1.765e-05 5.599e-06 3.152 0.001620 **
## RU 5.630e-03 1.834e-03 3.070 0.002142 **
## Total.Accounts -3.288e-03 4.784e-03 -0.687 0.491831
## TRI -4.375e-05 1.850e-05 -2.365 0.018018 *
## TRLF 6.363e-03 7.818e-03 0.814 0.415726
## Recoveries -1.380e-05 1.127e-04 -0.122 0.902516
## CRF -6.966e-03 1.386e-02 -0.503 0.615152
## C12 3.104e-01 2.749e-01 1.129 0.258796
## Last.week.Pay 1.911e-03 9.293e-04 2.057 0.039718 *
## TCA 2.135e-04 6.495e-05 3.287 0.001013 **
## TCB 4.392e-08 2.910e-07 0.151 0.880009
## TRCL -4.416e-06 2.119e-06 -2.084 0.037194 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3941.1 on 2859 degrees of freedom
## Residual deviance: 3841.7 on 2836 degrees of freedom
## AIC: 3889.7
##
## Number of Fisher Scoring iterations: 4
Summary of the predicted probabilities
pred_model_glm <-predict(model_glm, type="response")
summary(pred_model_glm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1727 0.3970 0.4556 0.4545 0.5134 0.9315
Naive Choice of Cut-off probability
pcut1<- mean(as.numeric(as.character(train_smote$Loan.Status)))
class.glm.train<- (as.numeric(pred_model_glm)>pcut1)*1
summary(class.glm.train)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 1.0000 0.5049 1.0000 1.0000
Confusion Matrix for the Logistic Regression Model
# get confusion matrix; as.numeric() on a factor returns level indices,
# so the "True" rows 1 and 2 below correspond to classes 0 and 1
table(as.numeric(train_smote$Loan.Status), class.glm.train, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## 1 872 688
## 2 544 756
# define a cost function with input "obs" being observed response
# and "pi" being predicted probability, and "pcut" being the threshold.
costfunc = function(obs, pred.p, pcut){
weight1 = 5 # define the weight for "true=1 but pred=0" (FN)
weight0 = 1 # define the weight for "true=0 but pred=1" (FP)
c1 = (obs==1)&(pred.p<pcut) # count for "true=1 but pred=0" (FN)
c0 = (obs==0)&(pred.p>=pcut) # count for "true=0 but pred=1" (FP)
cost = mean(weight1*c1 + weight0*c0) # misclassification with weight
return(cost) # return the weighted misclassification cost
} # end of the function
# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01)
# write a loop for all p-cut to see which one provides the smallest cost
# first, need to define a 0 vector in order to save the value of cost from all pcut
cost = rep(0, length(p.seq))
for(i in 1:length(p.seq)){
cost[i] = costfunc(obs = train_smote$Loan.Status, pred.p = as.numeric(pred_model_glm), pcut = p.seq[i])
} # end of the loop
optimal.pcut = p.seq[which(cost==min(cost))][1]
pcut<-optimal.pcut
optimal.pcut.asymmetric<-optimal.pcut
optimal.pcut
## [1] 0.29
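Because weight1 = 5 makes a false negative five times as costly as a false positive, the cost-minimising cutoff is pulled well below the naive ~0.5 threshold, down to 0.29 here.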
Plotting the weighted misclassification cost against the range of probability cutoffs
plot(p.seq, cost)
ROC curve
pred_model_glm_test <- ROCR::prediction(pred_model_glm, train_smote$Loan.Status)
pref_model_glm_test <- ROCR::performance(pred_model_glm_test, "tpr", "fpr")
plot(pref_model_glm_test, colorize=TRUE)
Precision-Recall curve - another way to compute the AUC
unlist(slot(ROCR::performance(pred_model_glm_test, "auc"), "y.values"))
## [1] 0.6017668
score1= pred_model_glm[train_smote$Loan.Status==1]
score0= pred_model_glm[train_smote$Loan.Status==0]
roc= PRROC::roc.curve(score1, score0, curve = T)
roc$auc
## [1] 0.6017668
Plotting PR curve
pr= PRROC::pr.curve(score1, score0, curve = T)
plot(pr)
Model NNET
model_nnet1 <- nnet(Loan.Status ~ . , data = train_smote, linout=F, trace = F, size = 20, maxit = 500)
ROC curve
pred_model_nnet1_test <- ROCR::prediction(predict(model_nnet1, newdata=data_test, type="raw"), data_test$Loan.Status)
perf_model_nnet1_test <- ROCR::performance(pred_model_nnet1_test, "tpr", "fpr")
plot(perf_model_nnet1_test, colorize=TRUE, lwd=2, main="model_nnet1:ROC - Neural Network")
abline(a=0,b=1)
table(data_test$Loan.Status,predict(model_nnet1,newdata=data_test, type="class"))
##
## 0 1
## 0 373 236
## 1 39 26
plot(model_nnet1$residuals)
Model Neural Network (caret package)
set.seed(123)
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
train_control <- trainControl(method = "repeatedcv", number = 10, repeats=5)
par(mfrow=c(1,1))
model_nnet2 <- train(Loan.Status~., train_smote,
method = "nnet", trace = FALSE,
trControl= train_control)
model_nnet2
## Neural Network
##
## 2860 samples
## 23 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 2574, 2574, 2574, 2574, 2574, 2574, ...
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa
## 1 0e+00 0.5454545 0.0000000000
## 1 1e-04 0.5453846 -0.0001397624
## 1 1e-01 0.5440559 0.0148953040
## 3 0e+00 0.5451049 0.0015127636
## 3 1e-04 0.5448951 0.0068549095
## 3 1e-01 0.5518182 0.0494164668
## 5 0e+00 0.5453147 0.0147395523
## 5 1e-04 0.5493007 0.0330137891
## 5 1e-01 0.5562937 0.0628183736
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.1.
pred_nnet2_train <- predict(model_nnet2, type = "prob")[,2]
pred_nnet2_test <- predict(model_nnet2, newdata = data_test ,type = "prob")[,2]
pred_nnet2_train <- as.numeric(pred_nnet2_train > optimal.pcut)
pred_nnet2_test <- as.numeric(pred_nnet2_test > optimal.pcut)
cm_nnet2_train <- table(train_smote$Loan.Status, pred_nnet2_train)
cm_nnet2_test <- table(data_test$Loan.Status, pred_nnet2_test)
misscl_nnet2_train <- round((cm_nnet2_train[2]+cm_nnet2_train[3])/sum(cm_nnet2_train), 2)
misscl_nnet2_test <- round((cm_nnet2_test[2]+cm_nnet2_test[3])/sum(cm_nnet2_test), 2)
cat("train misclassfication rate:", misscl_nnet2_train, "| test misclassfication rate:", misscl_nnet2_test)
## train misclassfication rate: 0.49 | test misclassfication rate: 0.79
ROC curve
pred_nnet2_test_roc <- ROCR::prediction(pred_nnet2_test, data_test$Loan.Status)
pred_nnet2_test_roc <- ROCR::performance(pred_nnet2_test_roc, "tpr", "fpr")
plot(pred_nnet2_test_roc, colorize=TRUE)
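Note that pred_nnet2_test was already thresholded to 0/1 above, so this curve collapses to a single operating point (hence the near-0.5 AUC below). A more informative curve would use the raw class-1 probabilities, e.g.:
# ROC on raw probabilities rather than thresholded labels
prob_nnet2 <- predict(model_nnet2, newdata = data_test, type = "prob")[, 2]
roc_nnet2 <- ROCR::prediction(prob_nnet2, data_test$Loan.Status)
plot(ROCR::performance(roc_nnet2, "tpr", "fpr"), colorize = TRUE)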
Precision-Recall curve - another way to compute the AUC
score1= pred_nnet2_test[data_test$Loan.Status==1]
score0= pred_nnet2_test[data_test$Loan.Status==0]
roc= PRROC::roc.curve(score1, score0, curve = T)
roc$auc
## [1] 0.4945055
Plotting PR curve
pr= PRROC::pr.curve(score1, score0, curve = T)
plot(pr)
Model Improvement
Random Forest
Create a random forest model as model_rf:
set.seed(123)
model_rf <- randomForest(x = train_smote %>% select(-Loan.Status),
y = train_smote$Loan.Status,
ntree = 500)
model_rf
##
## Call:
## randomForest(x = train_smote %>% select(-Loan.Status), y = train_smote$Loan.Status, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 6.01%
## Confusion matrix:
## 0 1 class.error
## 0 1518 42 0.02692308
## 1 130 1170 0.10000000
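Note that this ~6% OOB error is computed on the SMOTE-balanced training data; since synthetic minority cases are interpolations of real ones, the estimate is likely optimistic, and the untouched validation split below gives a fairer picture.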
Check each predictor's contribution to the target variable
# note: model_rf$finalModel would be NULL here, since finalModel exists only
# on caret::train objects; varImp() works on the randomForest object directly
varImp(model_rf)
## Overall
## Loan.Amount 60.878598
## Funded.Amount 63.463375
## FAI 56.080875
## Term 158.884273
## Interest.Rate 54.385025
## Grade 78.523177
## Employment.Duration 62.664074
## DTI 58.428727
## Inquires 9.727607
## Open.Account 54.640021
## Public.Record 10.531676
## Revolving.Balance 68.690633
## RU 60.263150
## Total.Accounts 56.990086
## TRI 77.185770
## TRLF 63.703020
## Recoveries 64.206673
## CRF 56.089286
## C12 3.304977
## Last.week.Pay 62.236061
## TCA 100.292971
## TCB 67.346601
## TRCL 68.982525
Model Random Forest - Evaluation
set.seed(66)
pred_rf_val <- predict(object = model_rf, newdata = data_test, type="class")
tab_rf <- table(Predicted = pred_rf_val, Actual = data_test$Loan.Status)
tab_rf
## Actual
## Predicted 0 1
## 0 577 61
## 1 32 4
pred_rf_val <- as.factor(pred_rf_val)
paste0("Accuracy : ",round(sum(diag(tab_rf))/sum(tab_rf)*100,2))
## [1] "Accuracy : 86.2"
Support Vector Machine
library(e1071)
model_svm <- svm(Loan.Status ~ ., data = train_smote)
We also fit a support vector machine (SVM) as a further comparison; as the evaluation below shows, its accuracy falls well short of the random forest's.
set.seed(66)
pred_svm_val <- predict(object = model_svm, newdata = data_test, type="class")
tab_svm <- table(Predicted = pred_svm_val, Actual = data_test$Loan.Status)
tab_svm
## Actual
## Predicted 0 1
## 0 442 41
## 1 167 24
pred_svm_val <- as.factor(pred_svm_val)
paste0("Accuracy : ",round(sum(diag(tab_svm))/sum(tab_svm)*100,2))
## [1] "Accuracy : 69.14"
Conclusion
In this project, we examined loan applications with a variety of attributes and developed models that align with the available information, using them to predict whether a loan will default.
Throughout the project, we primarily employed a random forest model. Compared to standard logistic regression, it describes the data better, and despite being more complicated it can still be interpreted through its variable importances. The prediction model model_rf obtained an accuracy of about 86.2%, although given the class imbalance this figure should be read alongside the per-class results above.