The objective of this project is to develop a logistic model to predict whether a person applying for credit at a financial institution will be approved or not. The model will use a binary dependent variable, represented as “1” or “0,” indicating whether the loan will be approved or not. Thus, we will create a binary logistic model that produces an output between 0 and 1, representing the probability of the event occurring, which in this case is the granting of credit or a loan. To build this model, we will use various independent variables related to the characteristics of loan applicants.
Loan approval is a critical activity in financial institutions, but it also involves risks. Having the ability to identify in advance which candidates are more likely to be approved or rejected is essential for making informed financial decisions. In this context, the logistic model is an appropriate technique to address this problem, as it can handle binary dependent variables.
Logistic modeling is especially useful when the dependent variable is categorical. The output of the logistic model is a probability, indicating the likelihood of an event occurring. In the case of this project, the probability of credit approval will be calculated based on the independent variables related to applicant characteristics.
When constructing the logistic model, we relate the independent variables to the binary dependent variable through a logistic function. The coefficients of the independent variables are estimated by maximum likelihood, and the logistic function maps the resulting linear combination onto a probability of credit approval between 0 and 1.
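In symbols, the relationship can be sketched as follows. The notation (\(\alpha\) for the intercept, \(\beta_j\) for the coefficients, \(x_{ji}\) for the value of variable \(j\) for applicant \(i\), \(p_i\) for the approval probability) is assumed here for illustration and is not taken from the original analysis:

\[
z_i = \alpha + \beta_1 x_{1i} + \dots + \beta_k x_{ki},
\qquad
p_i = \frac{1}{1 + e^{-z_i}},
\qquad
\log\left[ \frac{p_i}{1 - p_i} \right] = z_i
\]

The last identity is why the fitted coefficients are read on the log-odds scale rather than directly as probabilities.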
Therefore, the logistic model will be a valuable tool to assist the financial institution in making informed decisions about credit granting, mitigating risks, and increasing accuracy in applicant analysis.
Database used:
loan
pacotes <- c("plotly","tidyverse","knitr","kableExtra","fastDummies","rgl","car",
             "reshape2","jtools","stargazer","lmtest","caret","pROC","ROCR","nnet",
             "magick","cowplot","globals","equatiomatic","PerformanceAnalytics")
options(rgl.debug = TRUE)
# Install any packages that are not available yet, then load all of them
instalador <- pacotes[!pacotes %in% rownames(installed.packages())]
if (length(instalador) > 0) {
  install.packages(instalador, dependencies = TRUE)
}
sapply(pacotes, require, character.only = TRUE)
dados <- read.csv("loan.csv")
summary(dados)
## Loan_ID Gender Married Dependents
## Length:614 Length:614 Length:614 Length:614
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Education Self_Employed ApplicantIncome CoapplicantIncome
## Length:614 Length:614 Min. : 150 Min. : 0
## Class :character Class :character 1st Qu.: 2878 1st Qu.: 0
## Mode :character Mode :character Median : 3812 Median : 1188
## Mean : 5403 Mean : 1621
## 3rd Qu.: 5795 3rd Qu.: 2297
## Max. :81000 Max. :41667
##
## LoanAmount Loan_Amount_Term Credit_History Property_Area
## Min. : 9.0 Min. : 12 Min. :0.0000 Length:614
## 1st Qu.:100.0 1st Qu.:360 1st Qu.:1.0000 Class :character
## Median :128.0 Median :360 Median :1.0000 Mode :character
## Mean :146.4 Mean :342 Mean :0.8422
## 3rd Qu.:168.0 3rd Qu.:360 3rd Qu.:1.0000
## Max. :700.0 Max. :480 Max. :1.0000
## NA's :22 NA's :14 NA's :50
## Loan_Status
## Length:614
## Class :character
## Mode :character
##
##
##
##
head(dados, 10)
## Loan_ID Gender Married Dependents Education Self_Employed
## 1 LP001002 Male No 0 Graduate No
## 2 LP001003 Male Yes 1 Graduate No
## 3 LP001005 Male Yes 0 Graduate Yes
## 4 LP001006 Male Yes 0 Not Graduate No
## 5 LP001008 Male No 0 Graduate No
## 6 LP001011 Male Yes 2 Graduate Yes
## 7 LP001013 Male Yes 0 Not Graduate No
## 8 LP001014 Male Yes 3+ Graduate No
## 9 LP001018 Male Yes 2 Graduate No
## 10 LP001020 Male Yes 1 Graduate No
## ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History
## 1 5849 0 NA 360 1
## 2 4583 1508 128 360 1
## 3 3000 0 66 360 1
## 4 2583 2358 120 360 1
## 5 6000 0 141 360 1
## 6 5417 4196 267 360 1
## 7 2333 1516 95 360 1
## 8 3036 2504 158 360 0
## 9 4006 1526 168 360 1
## 10 12841 10968 349 360 1
## Property_Area Loan_Status
## 1 Urban Y
## 2 Rural N
## 3 Urban Y
## 4 Urban Y
## 5 Urban Y
## 6 Urban Y
## 7 Urban Y
## 8 Semiurban N
## 9 Urban Y
## 10 Semiurban N
The variable Loan_ID is useless for our model as it uniquely identifies each observation solely for the purpose of data ordering. It will be impossible to extract any information or behavior from it that influences the dependent variable. Therefore, we will remove Loan_ID from the database:
dados_ajustados <- subset(dados, select = -Loan_ID)
In addition, the database contains observations with missing or empty values. We could fill them in using some imputation strategy, or simply delete the affected observations. We choose to delete them, since a variable like “Gender” cannot be imputed without being arbitrary.
dados_ajustados <- na.omit(dados_ajustados)
dados_ajustados <- dados_ajustados %>%
filter_all(all_vars(. != ""))
head(dados_ajustados, 10)
## Gender Married Dependents Education Self_Employed ApplicantIncome
## 1 Male Yes 1 Graduate No 4583
## 2 Male Yes 0 Graduate Yes 3000
## 3 Male Yes 0 Not Graduate No 2583
## 4 Male No 0 Graduate No 6000
## 5 Male Yes 2 Graduate Yes 5417
## 6 Male Yes 0 Not Graduate No 2333
## 7 Male Yes 3+ Graduate No 3036
## 8 Male Yes 2 Graduate No 4006
## 9 Male Yes 1 Graduate No 12841
## 10 Male Yes 2 Graduate No 3200
## CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area
## 1 1508 128 360 1 Rural
## 2 0 66 360 1 Urban
## 3 2358 120 360 1 Urban
## 4 0 141 360 1 Urban
## 5 4196 267 360 1 Urban
## 6 1516 95 360 1 Urban
## 7 2504 158 360 0 Semiurban
## 8 1526 168 360 1 Urban
## 9 10968 349 360 1 Semiurban
## 10 700 70 360 1 Urban
## Loan_Status
## 1 N
## 2 Y
## 3 Y
## 4 Y
## 5 Y
## 6 Y
## 7 N
## 8 Y
## 9 N
## 10 Y
We used commands to eliminate both “NA” values and empty strings. Note that the first observation of the original data, which had a missing (“NA”) LoanAmount, has been deleted.
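Before dropping rows, it can help to see where the gaps are. The following is a minimal sketch on a toy data frame (the `toy` object and its values are invented for illustration and are not the loan data), counting “NA” values and empty strings per column and then keeping only complete rows:

```r
# Toy data frame standing in for the loan data (values are illustrative only)
toy <- data.frame(
  Gender     = c("Male", "", "Female", "Male"),
  LoanAmount = c(NA, 128, 66, 120),
  stringsAsFactors = FALSE
)

# Number of NA values per column
na_por_coluna <- colSums(is.na(toy))

# Number of empty strings per column (comparisons involving NA are ignored)
vazios_por_coluna <- colSums(toy == "", na.rm = TRUE)

# Rows with neither NA nor empty strings, mirroring na.omit() plus the empty-string filter
completos <- toy[complete.cases(toy) & rowSums(toy == "", na.rm = TRUE) == 0, ]

print(na_por_coluna)
print(vazios_por_coluna)
print(nrow(completos))
```

Applied to `dados`, the same two `colSums()` calls would show which columns account for the rows that are removed.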
Some of the variables are categorized in a way that doesn’t make much sense. We will change “Dependents” to numeric and “Credit_History” to character. We can see that “Credit_History” is a categorical variable since it refers to the positive or negative credit history of the applicant. Remember that encoding categorical variables as “character” is only plausible when the order of categories is not relevant. Let’s convert the “Credit_History” variable from numeric to character:
dados_ajustados$Credit_History <- as.character(dados_ajustados$Credit_History)
Now let’s convert the “Dependents” variable from character to numeric:
dados_ajustados$Dependents <- as.numeric(dados_ajustados$Dependents)
## Warning: NAs introduced by coercion
During the conversion of the “Dependents” variable, R replaced the “3+” values in the variable with “NA.” This happened because R doesn’t interpret the “+” as a numeric value. Let’s replace the “NA” with “3,” keeping in mind that “3” now represents 3 or more dependents of the applicant.
# Use the number 3 (not the string "3") so that the column stays numeric
dados_ajustados$Dependents <- ifelse(is.na(dados_ajustados$Dependents), 3, dados_ajustados$Dependents)
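An alternative that avoids the coercion warning altogether is to recode “3+” before converting. This is a sketch on a toy vector (the names `dependents_chr` and `dependents_num` are hypothetical), not the approach used in the original analysis:

```r
# Toy vector with the same coding as the Dependents column
dependents_chr <- c("0", "1", "2", "3+", "0")

# Replace the literal text "3+" with "3" first, then convert;
# no NA is produced, so no warning is raised
dependents_num <- as.numeric(sub("3+", "3", dependents_chr, fixed = TRUE))

print(dependents_num)
```

As before, the value 3 should be read as “3 or more dependents.”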
Let’s check our changes using the summary function:
summary(dados_ajustados)
## Gender Married Dependents Education
## Length:480 Length:480 Length:480 Length:480
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## Length:480 Min. : 150 Min. : 0 Min. : 9.0
## Class :character 1st Qu.: 2899 1st Qu.: 0 1st Qu.:100.0
## Mode :character Median : 3859 Median : 1084 Median :128.0
## Mean : 5364 Mean : 1581 Mean :144.7
## 3rd Qu.: 5852 3rd Qu.: 2253 3rd Qu.:170.0
## Max. :81000 Max. :33837 Max. :600.0
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## Min. : 36.0 Length:480 Length:480 Length:480
## 1st Qu.:360.0 Class :character Class :character Class :character
## Median :360.0 Mode :character Mode :character Mode :character
## Mean :342.1
## 3rd Qu.:360.0
## Max. :480.0
With our database properly adjusted, all that’s left is to dummy code our categorical variables. Before that, let’s visualize some characteristics of our variables.
table(dados_ajustados$Loan_Status)
##
## N Y
## 148 332
We can see that we have 332 cases of approved loans and 148 cases where the loan was denied.
table(dados_ajustados$Gender)
##
## Female Male
## 86 394
table(dados_ajustados$Married)
##
## No Yes
## 169 311
table(dados_ajustados$Education)
##
## Graduate Not Graduate
## 383 97
table(dados_ajustados$Self_Employed)
##
## No Yes
## 414 66
table(dados_ajustados$Credit_History)
##
## 0 1
## 70 410
table(dados_ajustados$Property_Area)
##
## Rural Semiurban Urban
## 139 191 150
chart.Correlation((dados_ajustados[, c(3,6:9)]), histogram = TRUE)
Notice that the highest correlation can be observed between the applicant’s income (ApplicantIncome) and the loan amount requested (LoanAmount). All other correlations are extremely low.
Let’s now start the process of dummy coding our categorical variables and then proceed to create our initial model.
# Dummy-code the explanatory variables, dropping the most frequent category of each one
dados_dummies <- dummy_columns(.data = dados_ajustados,
                               select_columns = c("Gender", "Married", "Education",
                                                  "Self_Employed", "Credit_History",
                                                  "Property_Area"),
                               remove_selected_columns = TRUE,
                               remove_most_frequent_dummy = TRUE)
# Dummy-code the response, dropping the first category instead of the most frequent one
dados_dummies <- dummy_columns(.data = dados_dummies,
                               select_columns = "Loan_Status",
                               remove_selected_columns = TRUE,
                               remove_first_dummy = TRUE)
Notice that our output variable “Loan_Status” was also dummy-coded, because the model cannot be fitted to the character values “Y” and “N” directly. Unlike the other dummy variables, in the case of our dependent variable we choose to remove the first dummy instead of the most frequent one. This way, the fitted model returns the probability of “loan approval,” which makes it easier to interpret in its context.
This was only possible through the remove_first_dummy = T command because, after cleaning the dataset, our first observation contained the value “N” in “Loan_Status.” Otherwise, if we had used remove_most_frequent_dummy = T like in the other variables, we would have an output variable indicating the probability of “loan denial” since most observations have the value “Y” in the “Loan_Status” variable.
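If relying on the order of the categories feels fragile, the response can also be coded explicitly. The sketch below uses a toy vector (the names `loan_status` and `loan_status_y` are hypothetical) and is offered as an alternative, not as the method used in this analysis:

```r
# Toy response vector with the same coding as Loan_Status
loan_status <- c("N", "Y", "Y", "N", "Y")

# 1 when the loan was approved, 0 otherwise, regardless of which value appears first
loan_status_y <- as.integer(loan_status == "Y")

print(loan_status_y)
```

Coding the event this way makes it explicit that 1 means “approved,” whatever the ordering of the categories in the data.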
Now that our database is “clean” and our categorical variables are properly “dummy-coded,” we can begin creating our model.
Let’s create a maximum likelihood logistic model that can predict whether a person’s credit request is approved or not through an analysis of their personal characteristics and financial history. Please note that our output variable has been renamed to “Loan_Status_Y” due to the dummy-coding process.
modelo_Loan <- glm(formula = Loan_Status_Y ~ .,
data = dados_dummies,
family = "binomial")
summary(modelo_Loan)
##
## Call:
## glm(formula = Loan_Status_Y ~ ., family = "binomial", data = dados_dummies)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2979 -0.4076 0.5114 0.7044 2.3670
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.939e+00 7.937e-01 3.702 0.000214 ***
## Dependents 8.251e-02 1.341e-01 0.615 0.538452
## ApplicantIncome 5.446e-06 2.946e-05 0.185 0.853358
## CoapplicantIncome -4.773e-05 4.279e-05 -1.115 0.264670
## LoanAmount -2.868e-03 1.792e-03 -1.600 0.109530
## Loan_Amount_Term -6.240e-04 2.018e-03 -0.309 0.757093
## Gender_Female -3.713e-01 3.286e-01 -1.130 0.258475
## Married_No -5.255e-01 2.859e-01 -1.838 0.066095 .
## `Education_Not Graduate` -4.238e-01 3.026e-01 -1.401 0.161353
## Self_Employed_Yes -1.516e-01 3.504e-01 -0.433 0.665142
## Credit_History_0 -3.632e+00 4.308e-01 -8.432 < 2e-16 ***
## Property_Area_Rural -9.489e-01 3.011e-01 -3.151 0.001624 **
## Property_Area_Urban -8.502e-01 3.049e-01 -2.788 0.005300 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 593.05 on 479 degrees of freedom
## Residual deviance: 437.97 on 467 degrees of freedom
## AIC: 463.97
##
## Number of Fisher Scoring iterations: 5
confint(modelo_Loan, level = 0.95)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 1.462171e+00 4.593515e+00
## Dependents -1.771818e-01 3.504340e-01
## ApplicantIncome -5.493625e-05 5.728753e-05
## CoapplicantIncome -1.376342e-04 3.897337e-05
## LoanAmount -6.359118e-03 6.826728e-04
## Loan_Amount_Term -4.825091e-03 3.145783e-03
## Gender_Female -1.009540e+00 2.823850e-01
## Married_No -1.088665e+00 3.496724e-02
## `Education_Not Graduate` -1.010916e+00 1.792916e-01
## Self_Employed_Yes -8.217189e-01 5.578603e-01
## Credit_History_0 -4.562582e+00 -2.852234e+00
## Property_Area_Rural -1.548626e+00 -3.645477e-01
## Property_Area_Urban -1.457541e+00 -2.579827e-01
logLik(modelo_Loan)
## 'log Lik.' -218.9866 (df=13)
On its own, the log-likelihood of -218.99 says little about the quality of the fit, but the residual deviance (437.97) is well below the null deviance (593.05), which suggests that the predictors add information. However, at the 5% significance level, many of the independent variables appear to be non-significant: the “summary” function shows many p-values above 0.05, and the “confint” function, which returns 95% confidence intervals for the coefficients of the fitted logistic model, returned many intervals that include 0. Let’s try a model without these independent variables, and then we can compare the models.
Let’s run the stepwise algorithm on the model to get rid of the variables that are not statistically significant:
step_Loan <- step(modelo_Loan, k = 3.841459)
The value “k” is the penalty that step() applies per estimated parameter. Here k = 3.841459 is the critical value of the chi-squared distribution with 1 degree of freedom at the 5% significance level, so we are selecting variables at a 95% confidence level.
Parameters of the “step_Loan” model:
summary(step_Loan)
##
## Call:
## glm(formula = Loan_Status_Y ~ Married_No + Credit_History_0 +
## Property_Area_Rural + Property_Area_Urban, family = "binomial",
## data = dados_dummies)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1173 -0.4034 0.4741 0.6710 2.4897
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.1290 0.2468 8.628 < 2e-16 ***
## Married_No -0.5855 0.2423 -2.416 0.01568 *
## Credit_History_0 -3.6289 0.4259 -8.520 < 2e-16 ***
## Property_Area_Rural -0.9679 0.2961 -3.268 0.00108 **
## Property_Area_Urban -0.7526 0.2973 -2.531 0.01136 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 593.05 on 479 degrees of freedom
## Residual deviance: 445.58 on 475 degrees of freedom
## AIC: 455.58
##
## Number of Fisher Scoring iterations: 4
confint(step_Loan, level = 0.95)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 1.666519 2.6368059
## Married_No -1.061575 -0.1095959
## Credit_History_0 -4.550147 -2.8579468
## Property_Area_Rural -1.557995 -0.3932934
## Property_Area_Urban -1.343851 -0.1742342
We can see that all the less significant variables have been eliminated. The Stepwise algorithm has significantly reduced the number of independent variables.
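For reference, the value of k passed to step() can be reproduced rather than typed by hand. A minimal sketch (the name `k_critico` is hypothetical):

```r
# 95% quantile of the chi-squared distribution with 1 degree of freedom
k_critico <- qchisq(p = 0.95, df = 1)
print(k_critico)

# With this penalty, step() prefers dropping a 1-parameter term
# only when doing so raises the deviance by less than k_critico
```

Passing `k = qchisq(0.95, df = 1)` directly would make the chosen confidence level visible in the code.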
logLik(step_Loan)
## 'log Lik.' -222.789 (df=5)
logLik(modelo_Loan)
## 'log Lik.' -218.9866 (df=13)
AIC(step_Loan)
## [1] 455.578
AIC(modelo_Loan)
## [1] 463.9732
BIC(step_Loan)
## [1] 476.4469
BIC(modelo_Loan)
## [1] 518.2324
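The log-likelihood of “modelo_Loan” (-218.99) is higher than that of “step_Loan” (-222.79), which is expected because it has more parameters; a higher log-likelihood means a closer fit. Once model complexity is penalized, however, “step_Loan” has the lower AIC (455.58 vs. 463.97) and the lower BIC (476.45 vs. 518.23), so both criteria favor the simpler model. This is still not enough to determine which model has better predictive capability. Let’s test the accuracy of both models.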
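The information criteria can be recomputed by hand from the log-likelihoods reported above. The same numbers also allow a likelihood-ratio test, which is not part of the original analysis and is added here only as an illustration; it assumes the two models are nested, which holds because “step_Loan” uses a subset of the terms in “modelo_Loan”. All object names are hypothetical:

```r
# Values reported by logLik(): log-likelihood and number of parameters
ll_full <- -218.9866; df_full <- 13   # modelo_Loan
ll_step <- -222.789;  df_step <- 5    # step_Loan
n <- 480                              # observations after cleaning

# AIC = -2*LL + 2*df and BIC = -2*LL + log(n)*df
aic <- c(full = -2 * ll_full + 2 * df_full,
         step = -2 * ll_step + 2 * df_step)
bic <- c(full = -2 * ll_full + log(n) * df_full,
         step = -2 * ll_step + log(n) * df_step)

# Likelihood-ratio test: does the full model fit significantly better?
lr_stat <- 2 * (ll_full - ll_step)
lr_df   <- df_full - df_step
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(round(aic, 3))
print(round(bic, 3))
print(c(statistic = lr_stat, df = lr_df, p_value = lr_p))
```

A p-value above 0.05 here would mean that the eight extra parameters of the full model do not improve the fit significantly, which is in line with what AIC and BIC indicate.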
confusionMatrix(table(predict(modelo_Loan, type = "response") >= 0.5,
dados_dummies$Loan_Status_Y == 1)[2:1, 2:1])
## Confusion Matrix and Statistics
##
##
## TRUE FALSE
## TRUE 325 82
## FALSE 7 66
##
## Accuracy : 0.8146
## 95% CI : (0.7769, 0.8484)
## No Information Rate : 0.6917
## P-Value [Acc > NIR] : 7.161e-10
##
## Kappa : 0.4943
##
## Mcnemar's Test P-Value : 4.365e-15
##
## Sensitivity : 0.9789
## Specificity : 0.4459
## Pos Pred Value : 0.7985
## Neg Pred Value : 0.9041
## Prevalence : 0.6917
## Detection Rate : 0.6771
## Detection Prevalence : 0.8479
## Balanced Accuracy : 0.7124
##
## 'Positive' Class : TRUE
##
confusionMatrix(table(predict(step_Loan, type = "response") >= 0.5,
dados_dummies$Loan_Status_Y == 1)[2:1, 2:1])
## Confusion Matrix and Statistics
##
##
## TRUE FALSE
## TRUE 325 85
## FALSE 7 63
##
## Accuracy : 0.8083
## 95% CI : (0.7702, 0.8426)
## No Information Rate : 0.6917
## P-Value [Acc > NIR] : 5.209e-09
##
## Kappa : 0.4738
##
## Mcnemar's Test P-Value : 9.923e-16
##
## Sensitivity : 0.9789
## Specificity : 0.4257
## Pos Pred Value : 0.7927
## Neg Pred Value : 0.9000
## Prevalence : 0.6917
## Detection Rate : 0.6771
## Detection Prevalence : 0.8542
## Balanced Accuracy : 0.7023
##
## 'Positive' Class : TRUE
##
In a logistic model, the output is the probability of an event occurring. When we define a “cutoff” for a confusion matrix, we are setting the threshold above which an observation is classified as an event (here, an approved loan); the matrix then compares these classifications with the observed outcomes. With a cutoff of 0.5, both models had very similar metrics, but the original model performed slightly better overall. It’s worth noting that the cutoff of 0.5 yielded the best results for both models.
Original model (“modelo_Loan”):

| Sensitivity | Specificity | Accuracy |
|---|---|---|
| 0.9789157 | 0.4459459 | 0.8145833 |

Stepwise model (“step_Loan”):

| Sensitivity | Specificity | Accuracy |
|---|---|---|
| 0.9789157 | 0.4256757 | 0.8083333 |
In summary, accuracy is the overall proportion of observations classified correctly, sensitivity measures the ability to correctly identify instances of the positive class, and specificity measures the ability to correctly identify instances of the negative class.
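These three metrics can be checked by hand from the four cells of a confusion matrix. The sketch below uses the counts reported for “modelo_Loan” at the 0.5 cutoff; the variable names are hypothetical:

```r
# Counts from the confusion matrix of modelo_Loan (rows = predicted, columns = observed)
tp <- 325  # predicted approved, actually approved (true positives)
fp <- 82   # predicted approved, actually denied (false positives)
fn <- 7    # predicted denied, actually approved (false negatives)
tn <- 66   # predicted denied, actually denied (true negatives)

sensitivity <- tp / (tp + fn)                   # share of approved loans identified
specificity <- tn / (tn + fp)                   # share of denied loans identified
accuracy    <- (tp + tn) / (tp + fp + fn + tn)  # share of all cases classified correctly

print(round(c(sensitivity = sensitivity,
              specificity = specificity,
              accuracy = accuracy), 7))
```

The low specificity comes from the 82 denied loans that the model classifies as approved.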
Finally, let’s plot the ROC curve for both models. The ROC curve is an important evaluation tool in binary classification tasks: it shows how sensitivity and specificity trade off across all possible cutoffs, and therefore how well a model distinguishes between the positive and negative classes.
Let’s save the values necessary for plotting the ROC curve:
predicoes <- prediction(predictions = modelo_Loan$fitted.values,
labels = as.factor(dados_dummies$Loan_Status_Y))
sensitividade <- (performance(predicoes, measure = "sens"))@y.values[[1]]
especificidade <- (performance(predicoes, measure = "spec"))@y.values[[1]]
predicoes_step <- prediction(predictions = step_Loan$fitted.values,
labels = as.factor(dados_dummies$Loan_Status_Y))
sensitividade_step <- (performance(predicoes_step, measure = "sens"))@y.values[[1]]
especificidade_step <- (performance(predicoes_step, measure = "spec"))@y.values[[1]]
We need to use the roc function to specify the output variable and the predicted values:
ROC <- roc(response = dados_dummies$Loan_Status_Y,
predictor = modelo_Loan$fitted.values)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
ROC_step <- roc(response = dados_dummies$Loan_Status_Y,
predictor = step_Loan$fitted.values)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
In summary, a larger area under the ROC curve (AUC) indicates a greater ability of a binary logistic model to discriminate between the two classes. When analyzing the ROC curves of our models, we can see that our original model has a slightly larger area under the curve.
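To make the idea of “area under the curve” concrete, here is a minimal base-R sketch that computes it through the rank-sum (Mann-Whitney) identity. The labels and scores are toy values invented for illustration, not results from the loan models:

```r
# Toy labels (1 = event) and predicted probabilities; values are illustrative only
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)

# AUC = probability that a randomly chosen event receives a higher score
# than a randomly chosen non-event, computed from the ranks of the scores
n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)
auc_toy <- (sum(rank(scores)[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auc_toy)
```

For the objects created with roc() above, the pROC function auc() (for example, `auc(ROC)` and `auc(ROC_step)`) should return the corresponding areas directly.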
Overall, we have obtained two reliable models with good predictive capability, as both have very similar accuracy. The choice of which model is “better” would depend on the context.
Our original model has slightly superior metrics but is more complex with 12 independent variables.
On the other hand, the “stepwise” version is simpler with only 4 independent variables but has slightly inferior metrics.
Original model (“modelo_Loan”):

\[
\begin{aligned}
\operatorname{Loan\_Status\_Y} &\sim Bernoulli\left(\operatorname{prob}_{\operatorname{Loan\_Status\_Y} = 1} = \hat{P}\right) \\
\log\left[ \frac{\hat{P}}{1 - \hat{P}} \right] &= 2.93857 + 0.08251(\operatorname{Dependents}) + 5.446 \times 10^{-6}(\operatorname{ApplicantIncome}) - 4.773 \times 10^{-5}(\operatorname{CoapplicantIncome}) \\
&\quad - 0.00287(\operatorname{LoanAmount}) - 0.00062(\operatorname{Loan\_Amount\_Term}) - 0.37131(\operatorname{Gender\_Female}) - 0.52546(\operatorname{Married\_No}) \\
&\quad - 0.42383(\operatorname{Education\_Not\ Graduate}) - 0.15164(\operatorname{Self\_Employed\_Yes}) - 3.63222(\operatorname{Credit\_History\_0}) \\
&\quad - 0.94886(\operatorname{Property\_Area\_Rural}) - 0.85022(\operatorname{Property\_Area\_Urban})
\end{aligned}
\]

Stepwise model (“step_Loan”):

\[
\begin{aligned}
\operatorname{Loan\_Status\_Y} &\sim Bernoulli\left(\operatorname{prob}_{\operatorname{Loan\_Status\_Y} = 1} = \hat{P}\right) \\
\log\left[ \frac{\hat{P}}{1 - \hat{P}} \right] &= 2.12904 - 0.58552(\operatorname{Married\_No}) - 3.62894(\operatorname{Credit\_History\_0}) \\
&\quad - 0.9679(\operatorname{Property\_Area\_Rural}) - 0.75259(\operatorname{Property\_Area\_Urban})
\end{aligned}
\]
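where \(\hat{P}\) is the estimated probability of the event occurring. The right-hand side of each equation gives the log-odds (logit) of approval, so to obtain the probability itself we apply the inverse of the logit, that is, the logistic function: \(\hat{P} = 1 / (1 + e^{-z})\), where \(z\) is the value of the right-hand side.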
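As a worked example, the sketch below turns the stepwise equation into probabilities for two hypothetical applicants. The coefficients are copied from the “step_Loan” output; the applicants and the object names are invented for illustration:

```r
# Coefficients of the stepwise model, copied from the summary of step_Loan
coef_step <- c(intercept = 2.12904, Married_No = -0.58552,
               Credit_History_0 = -3.62894,
               Property_Area_Rural = -0.9679, Property_Area_Urban = -0.75259)

# Hypothetical applicant: married, positive credit history, semiurban property
# (every dummy equals 0, i.e. the reference categories)
x_ref <- c(1, 0, 0, 0, 0)
# Same applicant, but with a negative credit history
x_bad <- c(1, 0, 1, 0, 0)

# Log-odds first, then the inverse logit through plogis()
p_ref <- plogis(sum(coef_step * x_ref))
p_bad <- plogis(sum(coef_step * x_bad))

print(round(c(p_ref = p_ref, p_bad = p_bad), 3))
```

The estimated approval probability falls from roughly 0.89 to roughly 0.18 when only the credit history changes, which shows how strongly “Credit_History_0” weighs in the model.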