The objective of this project is to develop a logistic model to predict whether a person applying for credit at a financial institution will be approved or not. The model will use a binary dependent variable, represented as “1” or “0,” indicating whether the loan will be approved or not. Thus, we will create a binary logistic model that produces an output between 0 and 1, representing the probability of the event occurring, which in this case is the granting of credit or a loan. To build this model, we will use various independent variables related to the characteristics of loan applicants.

Loan approval is a critical activity in financial institutions, but it also involves risks. Having the ability to identify in advance which candidates are more likely to be approved or rejected is essential for making informed financial decisions. In this context, the logistic model is an appropriate technique to address this problem, as it can handle binary dependent variables.

Logistic modeling is especially useful when the dependent variable is categorical. The output of the logistic model is a probability, indicating the likelihood of an event occurring. In the case of this project, the probability of credit approval will be calculated based on the independent variables related to applicant characteristics.

When constructing the logistic model, we relate the independent variables to the binary dependent variable through the logistic function. The coefficients of the independent variables are estimated by maximum likelihood, and the logistic function converts the resulting linear combination into a probability of credit approval.
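For reference, the logistic function is usually written as below. The notation is an assumption introduced only for this illustration: \(\alpha\) is the intercept, \(\beta_1, \dots, \beta_k\) are the coefficients, and \(x_1, \dots, x_k\) are the independent variables.

\[ \hat{P} = \frac{1}{1 + e^{-(\alpha + \beta_1 x_1 + \dots + \beta_k x_k)}} \]

Whatever value the linear combination takes, \(\hat{P}\) stays strictly between 0 and 1, which is why it can be read as a probability.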

Therefore, the logistic model will be a valuable tool to assist the financial institution in making informed decisions about credit granting, mitigating risks, and increasing accuracy in applicant analysis.

Database used:
loan

Packages used

pacotes <- c("plotly","tidyverse","knitr","kableExtra","fastDummies","rgl","car",
             "reshape2","jtools","stargazer","lmtest","caret","pROC","ROCR","nnet",
             "magick","cowplot","globals","equatiomatic","PerformanceAnalytics")

options(rgl.debug = TRUE)

# Install whichever packages are missing, then load all of them
instalador <- pacotes[!pacotes %in% rownames(installed.packages())]
if (length(instalador) > 0) {
  install.packages(instalador, dependencies = TRUE)
}
sapply(pacotes, require, character.only = TRUE)

Loading and partially viewing our database in R

dados <- read.csv("loan.csv")
summary(dados)

##    Loan_ID             Gender            Married           Dependents       
##  Length:614         Length:614         Length:614         Length:614        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   Education         Self_Employed      ApplicantIncome CoapplicantIncome
##  Length:614         Length:614         Min.   :  150   Min.   :    0    
##  Class :character   Class :character   1st Qu.: 2878   1st Qu.:    0    
##  Mode  :character   Mode  :character   Median : 3812   Median : 1188    
##                                        Mean   : 5403   Mean   : 1621    
##                                        3rd Qu.: 5795   3rd Qu.: 2297    
##                                        Max.   :81000   Max.   :41667    
##                                                                         
##    LoanAmount    Loan_Amount_Term Credit_History   Property_Area     
##  Min.   :  9.0   Min.   : 12      Min.   :0.0000   Length:614        
##  1st Qu.:100.0   1st Qu.:360      1st Qu.:1.0000   Class :character  
##  Median :128.0   Median :360      Median :1.0000   Mode  :character  
##  Mean   :146.4   Mean   :342      Mean   :0.8422                     
##  3rd Qu.:168.0   3rd Qu.:360      3rd Qu.:1.0000                     
##  Max.   :700.0   Max.   :480      Max.   :1.0000                     
##  NA's   :22      NA's   :14       NA's   :50                         
##  Loan_Status       
##  Length:614        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

head(dados, 10)

##     Loan_ID Gender Married Dependents    Education Self_Employed
## 1  LP001002   Male      No          0     Graduate            No
## 2  LP001003   Male     Yes          1     Graduate            No
## 3  LP001005   Male     Yes          0     Graduate           Yes
## 4  LP001006   Male     Yes          0 Not Graduate            No
## 5  LP001008   Male      No          0     Graduate            No
## 6  LP001011   Male     Yes          2     Graduate           Yes
## 7  LP001013   Male     Yes          0 Not Graduate            No
## 8  LP001014   Male     Yes         3+     Graduate            No
## 9  LP001018   Male     Yes          2     Graduate            No
## 10 LP001020   Male     Yes          1     Graduate            No
##    ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History
## 1             5849                 0         NA              360              1
## 2             4583              1508        128              360              1
## 3             3000                 0         66              360              1
## 4             2583              2358        120              360              1
## 5             6000                 0        141              360              1
## 6             5417              4196        267              360              1
## 7             2333              1516         95              360              1
## 8             3036              2504        158              360              0
## 9             4006              1526        168              360              1
## 10           12841             10968        349              360              1
##    Property_Area Loan_Status
## 1          Urban           Y
## 2          Rural           N
## 3          Urban           Y
## 4          Urban           Y
## 5          Urban           Y
## 6          Urban           Y
## 7          Urban           Y
## 8      Semiurban           N
## 9          Urban           Y
## 10     Semiurban           N

The variable Loan_ID is of no use to our model: it only identifies each observation uniquely and carries no information about the dependent variable. Therefore, we will remove Loan_ID from the database:

dados_ajustados <- subset(dados, select = -Loan_ID)

In addition, the database contains observations with missing (NA) or empty values. We could fill them in using some imputation strategy, or simply delete the affected observations. Here we choose to delete them, since variables like “Gender” cannot be estimated except by pure arbitrariness.

dados_ajustados <- na.omit(dados_ajustados)
dados_ajustados <- dados_ajustados %>%
  filter_all(all_vars(. != ""))
head(dados_ajustados, 10)

##    Gender Married Dependents    Education Self_Employed ApplicantIncome
## 1    Male     Yes          1     Graduate            No            4583
## 2    Male     Yes          0     Graduate           Yes            3000
## 3    Male     Yes          0 Not Graduate            No            2583
## 4    Male      No          0     Graduate            No            6000
## 5    Male     Yes          2     Graduate           Yes            5417
## 6    Male     Yes          0 Not Graduate            No            2333
## 7    Male     Yes         3+     Graduate            No            3036
## 8    Male     Yes          2     Graduate            No            4006
## 9    Male     Yes          1     Graduate            No           12841
## 10   Male     Yes          2     Graduate            No            3200
##    CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area
## 1               1508        128              360              1         Rural
## 2                  0         66              360              1         Urban
## 3               2358        120              360              1         Urban
## 4                  0        141              360              1         Urban
## 5               4196        267              360              1         Urban
## 6               1516         95              360              1         Urban
## 7               2504        158              360              0     Semiurban
## 8               1526        168              360              1         Urban
## 9              10968        349              360              1     Semiurban
## 10               700         70              360              1         Urban
##    Loan_Status
## 1            N
## 2            Y
## 3            Y
## 4            Y
## 5            Y
## 6            Y
## 7            N
## 8            Y
## 9            N
## 10           Y

We used commands to eliminate both “NA” values and empty values. Notice that the original first observation, which had “NA” in LoanAmount, has been deleted.
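Before dropping rows, it can help to count how many missing or empty values each column has. The following is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical and are not taken from loan.csv). It counts NA and empty-string entries per column and then keeps only the complete rows, using base R only.

```r
# Hypothetical toy data: one NA in LoanAmount and one empty string in Gender
toy <- data.frame(
  Gender     = c("Male", "", "Female", "Male"),
  LoanAmount = c(NA, 128, 66, 120),
  stringsAsFactors = FALSE
)

# Count NA or empty-string entries in each column
missing_per_column <- sapply(toy, function(col) sum(is.na(col) | col == ""))
print(missing_per_column)

# Keep only rows with no NA and no empty string
is_blank  <- rowSums(toy == "", na.rm = TRUE) > 0
toy_clean <- toy[complete.cases(toy) & !is_blank, ]
print(nrow(toy_clean))
```

Running a count like this first shows how many observations the deletion strategy will cost.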

Converting the encoding of some of the independent variables

Some of the variables are stored with types that don’t make much sense. We will change “Dependents” to numeric and “Credit_History” to character. “Credit_History” is a categorical variable, since it refers to the positive or negative credit history of the applicant. Remember that encoding categorical variables as “character” is only plausible when the order of categories is not relevant. Let’s convert the “Credit_History” variable from numeric to character:

dados_ajustados$Credit_History <- as.character(dados_ajustados$Credit_History)

Now let’s convert the “Dependents” variable from character to numeric:

dados_ajustados$Dependents <- as.numeric(dados_ajustados$Dependents)

## Warning: NAs introduced by coercion

During the conversion of the “Dependents” variable, R replaced the “3+” values in the variable with “NA.” This happened because R doesn’t interpret the “+” as a numeric value. Let’s replace the “NA” with “3,” keeping in mind that “3” now represents 3 or more dependents of the applicant.

# Use the number 3 (not the string "3") so that the column stays numeric
dados_ajustados$Dependents <- ifelse(is.na(dados_ajustados$Dependents), 3, dados_ajustados$Dependents)
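An alternative that avoids the coercion warning is to strip the “+” before converting. This is only a sketch on a made-up vector (`dependents_raw` is hypothetical); as in the text, the resulting 3 stands for “3 or more” dependents.

```r
# Hypothetical values in the same format as the "Dependents" column
dependents_raw <- c("0", "1", "3+", "2")

# Remove the literal "+" and then convert to numeric; no NA is produced
dependents_num <- as.numeric(sub("+", "", dependents_raw, fixed = TRUE))
print(dependents_num)
```

Because the conversion never produces NA, a genuinely missing value would not be silently turned into 3.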

Let’s check our changes using the summary function:

summary(dados_ajustados)

##     Gender            Married           Dependents         Education        
##  Length:480         Length:480         Length:480         Length:480        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Self_Employed      ApplicantIncome CoapplicantIncome   LoanAmount   
##  Length:480         Min.   :  150   Min.   :    0     Min.   :  9.0  
##  Class :character   1st Qu.: 2899   1st Qu.:    0     1st Qu.:100.0  
##  Mode  :character   Median : 3859   Median : 1084     Median :128.0  
##                     Mean   : 5364   Mean   : 1581     Mean   :144.7  
##                     3rd Qu.: 5852   3rd Qu.: 2253     3rd Qu.:170.0  
##                     Max.   :81000   Max.   :33837     Max.   :600.0  
##  Loan_Amount_Term Credit_History     Property_Area      Loan_Status       
##  Min.   : 36.0    Length:480         Length:480         Length:480        
##  1st Qu.:360.0    Class :character   Class :character   Class :character  
##  Median :360.0    Mode  :character   Mode  :character   Mode  :character  
##  Mean   :342.1                                                            
##  3rd Qu.:360.0                                                            
##  Max.   :480.0

With our database properly adjusted, all that’s left is to dummy code our categorical variables. Before that, let’s visualize some characteristics of our variables.

Table of absolute frequencies for the dependent variable

table(dados_ajustados$Loan_Status)

## 
##   N   Y 
## 148 332

We can see that we have 332 cases of approved loans and 148 cases where the loan was denied.

Frequency tables for categorical variables

table(dados_ajustados$Gender)

## 
## Female   Male 
##     86    394

table(dados_ajustados$Married)

## 
##  No Yes 
## 169 311

table(dados_ajustados$Education)

## 
##     Graduate Not Graduate 
##          383           97

table(dados_ajustados$Self_Employed)

## 
##  No Yes 
## 414  66

table(dados_ajustados$Credit_History)

## 
##   0   1 
##  70 410

table(dados_ajustados$Property_Area)

## 
##     Rural Semiurban     Urban 
##       139       191       150

Correlations of the numeric independent variables

chart.Correlation((dados_ajustados[, c(3,6:9)]), histogram = TRUE)

Notice that the highest correlation can be observed between the applicant’s income (ApplicantIncome) and the loan amount requested (LoanAmount). All other correlations are extremely low.

Let’s now start the process of dummy coding our categorical variables and then proceed to create our initial model.

“One-hot encoding” or “Binarization of categorical variables” or “Dummy coding.”

# Dummy-code all categorical predictors at once, dropping the most frequent
# category of each one so that it becomes the reference category
dados_dummies <- dummy_columns(.data = dados_ajustados,
                       select_columns = c("Gender", "Married", "Education",
                                          "Self_Employed", "Credit_History",
                                          "Property_Area"),
                       remove_selected_columns = TRUE,
                       remove_most_frequent_dummy = TRUE)

# Dummy-code the dependent variable, dropping the first dummy instead
dados_dummies <- dummy_columns(.data = dados_dummies,
                       select_columns = "Loan_Status",
                       remove_selected_columns = TRUE,
                       remove_first_dummy = TRUE)

Notice that our output variable “Loan_Status” was also dummy-coded, because the model needs a numeric 0/1 response rather than the character values “Y” and “N.” Unlike the other dummy variables, in the case of our dependent variable, we choose to remove the first dummy instead of the most frequent one. This way, when we finish the model, we will have an output variable that indicates the probability of “loan approval,” making the model more understandable and straightforward within its context.

This was only possible through the remove_first_dummy = T command because, after cleaning the dataset, our first observation contained the value “N” in “Loan_Status.” Otherwise, if we had used remove_most_frequent_dummy = T like in the other variables, we would have an output variable indicating the probability of “loan denial” since most observations have the value “Y” in the “Loan_Status” variable.
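If relying on which dummy happens to be removed feels fragile, the 0/1 response can also be built explicitly. The sketch below uses a made-up vector (`status` is hypothetical) and base R only. It is an alternative to the dummy_columns call, not the approach used in this project.

```r
# Hypothetical values in the same format as the "Loan_Status" column
status <- c("N", "Y", "Y", "N", "Y")

# 1 when the loan was approved ("Y"), 0 otherwise
loan_status_y <- as.integer(status == "Y")
print(loan_status_y)
```

Written this way, the event coded as 1 is always “approved,” regardless of row order or category frequency.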

Now that our database is “clean” and our categorical variables are properly “dummy-coded,” we can begin creating our model.

Let’s create a maximum likelihood logistic model that can predict whether a person’s credit request is approved or not through an analysis of their personal characteristics and financial history. Please note that our output variable has been renamed to “Loan_Status_Y” due to the dummy-coding process.

Model

modelo_Loan <- glm(formula = Loan_Status_Y ~ ., 
                      data = dados_dummies, 
                      family = "binomial")

Parameters of the “modelo_Loan” model

summary(modelo_Loan)

## 
## Call:
## glm(formula = Loan_Status_Y ~ ., family = "binomial", data = dados_dummies)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2979  -0.4076   0.5114   0.7044   2.3670  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               2.939e+00  7.937e-01   3.702 0.000214 ***
## Dependents                8.251e-02  1.341e-01   0.615 0.538452    
## ApplicantIncome           5.446e-06  2.946e-05   0.185 0.853358    
## CoapplicantIncome        -4.773e-05  4.279e-05  -1.115 0.264670    
## LoanAmount               -2.868e-03  1.792e-03  -1.600 0.109530    
## Loan_Amount_Term         -6.240e-04  2.018e-03  -0.309 0.757093    
## Gender_Female            -3.713e-01  3.286e-01  -1.130 0.258475    
## Married_No               -5.255e-01  2.859e-01  -1.838 0.066095 .  
## `Education_Not Graduate` -4.238e-01  3.026e-01  -1.401 0.161353    
## Self_Employed_Yes        -1.516e-01  3.504e-01  -0.433 0.665142    
## Credit_History_0         -3.632e+00  4.308e-01  -8.432  < 2e-16 ***
## Property_Area_Rural      -9.489e-01  3.011e-01  -3.151 0.001624 ** 
## Property_Area_Urban      -8.502e-01  3.049e-01  -2.788 0.005300 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 593.05  on 479  degrees of freedom
## Residual deviance: 437.97  on 467  degrees of freedom
## AIC: 463.97
## 
## Number of Fisher Scoring iterations: 5

confint(modelo_Loan, level = 0.95)

## Waiting for profiling to be done...

##                                  2.5 %        97.5 %
## (Intercept)               1.462171e+00  4.593515e+00
## Dependents               -1.771818e-01  3.504340e-01
## ApplicantIncome          -5.493625e-05  5.728753e-05
## CoapplicantIncome        -1.376342e-04  3.897337e-05
## LoanAmount               -6.359118e-03  6.826728e-04
## Loan_Amount_Term         -4.825091e-03  3.145783e-03
## Gender_Female            -1.009540e+00  2.823850e-01
## Married_No               -1.088665e+00  3.496724e-02
## `Education_Not Graduate` -1.010916e+00  1.792916e-01
## Self_Employed_Yes        -8.217189e-01  5.578603e-01
## Credit_History_0         -4.562582e+00 -2.852234e+00
## Property_Area_Rural      -1.548626e+00 -3.645477e-01
## Property_Area_Urban      -1.457541e+00 -2.579827e-01

Extraction of the Log-Likelihood (LL) value

logLik(modelo_Loan)

## 'log Lik.' -218.9866 (df=13)

The residual deviance (437.97) is well below the null deviance (593.05), so the independent variables do improve on an intercept-only model; the log-likelihood value on its own, however, cannot tell us whether the model is well fitted. At the 5% significance level, many of the independent variables appear to be non-significant: the “summary” function shows many p-values above 0.05. Additionally, the “confint” function, which returns 95% confidence intervals for the coefficients of the fitted logistic model, returned many intervals that include the value 0, which may indicate that the variables with these intervals do not have a significant effect on the response variable. Let’s try a model without these independent variables, and then we can compare the models.
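To make the significance columns concrete, the z value and p-value reported by summary can be reproduced from an estimate and its standard error. The sketch below does this for Credit_History_0, using the rounded figures printed in the summary output (estimate -3.632, standard error 0.4308), and also converts the coefficient into an odds ratio. The variable names are hypothetical.

```r
estimate  <- -3.632   # Credit_History_0 estimate, as printed by summary()
std_error <- 0.4308   # its standard error, as printed by summary()

# Wald z statistic and two-sided p-value
z_value <- estimate / std_error
p_value <- 2 * pnorm(-abs(z_value))

# Odds ratio: multiplicative change in the odds of approval
odds_ratio <- exp(estimate)

print(c(z = z_value, p = p_value, odds_ratio = odds_ratio))
```

An odds ratio far below 1 means that, holding the other variables fixed, applicants with Credit_History equal to 0 have much lower odds of approval than those with Credit_History equal to 1.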

Model with Stepwise

Let’s run the stepwise algorithm on the model to get rid of the variables that are not very significant:

step_Loan <- step(modelo_Loan, k = 3.841459)
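The constant passed as k can be reproduced rather than typed by hand. A minimal sketch, assuming it is meant to be the 95% critical value of a chi-squared distribution with 1 degree of freedom (`k_critical` is a hypothetical name):

```r
# 95% quantile of the chi-squared distribution with 1 degree of freedom
k_critical <- qchisq(0.95, df = 1)
print(k_critical)
```

The result could then be supplied as `k = k_critical` in the call to step.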

The value “k” is used as a critical threshold for the chi-squared statistic, and in this case, we are seeking variable selection with a 95% confidence level.

Parameters of the “step_Loan” model

summary(step_Loan)

## 
## Call:
## glm(formula = Loan_Status_Y ~ Married_No + Credit_History_0 + 
##     Property_Area_Rural + Property_Area_Urban, family = "binomial", 
##     data = dados_dummies)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1173  -0.4034   0.4741   0.6710   2.4897  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           2.1290     0.2468   8.628  < 2e-16 ***
## Married_No           -0.5855     0.2423  -2.416  0.01568 *  
## Credit_History_0     -3.6289     0.4259  -8.520  < 2e-16 ***
## Property_Area_Rural  -0.9679     0.2961  -3.268  0.00108 ** 
## Property_Area_Urban  -0.7526     0.2973  -2.531  0.01136 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 593.05  on 479  degrees of freedom
## Residual deviance: 445.58  on 475  degrees of freedom
## AIC: 455.58
## 
## Number of Fisher Scoring iterations: 4

confint(step_Loan, level = 0.95)

## Waiting for profiling to be done...

##                         2.5 %     97.5 %
## (Intercept)          1.666519  2.6368059
## Married_No          -1.061575 -0.1095959
## Credit_History_0    -4.550147 -2.8579468
## Property_Area_Rural -1.557995 -0.3932934
## Property_Area_Urban -1.343851 -0.1742342

We can see that all the less significant variables have been eliminated. The Stepwise algorithm has significantly reduced the number of independent variables.

Comparing the models

Comparing Log-Likelihood (LL), AIC and BIC

logLik(step_Loan)

## 'log Lik.' -222.789 (df=5)

logLik(modelo_Loan)

## 'log Lik.' -218.9866 (df=13)

AIC(step_Loan)

## [1] 455.578

AIC(modelo_Loan)

## [1] 463.9732

BIC(step_Loan)

## [1] 476.4469

BIC(modelo_Loan)

## [1] 518.2324

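The “step_Loan” model has lower AIC and BIC values, so the criteria that penalize the number of parameters favor it. Its log-likelihood is slightly lower (-222.789 against -218.9866), which is expected when variables are removed from a model, since a higher log-likelihood indicates a closer fit. However, this is not enough to determine which model has better predictive capability. Let’s test the accuracy of both models.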
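The three indicators are linked: AIC and BIC are the log-likelihood multiplied by -2 plus a penalty on the number of parameters. The sketch below recomputes them from the printed log-likelihoods and degrees of freedom, with n = 480 observations. It also adds a likelihood-ratio test of the two nested models, which is not part of the original analysis. The function and variable names are hypothetical.

```r
n <- 480  # observations after cleaning

# AIC and BIC from a log-likelihood and a parameter count
info_criteria <- function(ll, n_par, n_obs) {
  c(AIC = -2 * ll + 2 * n_par,
    BIC = -2 * ll + log(n_obs) * n_par)
}

ic_step <- info_criteria(ll = -222.789,  n_par = 5,  n_obs = n)
ic_full <- info_criteria(ll = -218.9866, n_par = 13, n_obs = n)
print(rbind(step_Loan = ic_step, modelo_Loan = ic_full))

# Likelihood-ratio test: does the full model fit significantly better?
lr_stat <- 2 * (-218.9866 - (-222.789))
lr_p    <- pchisq(lr_stat, df = 13 - 5, lower.tail = FALSE)
print(c(LR = lr_stat, p = lr_p))
```

The p-value of this test comes out well above 0.05, which suggests that the eight variables removed by the stepwise procedure do not improve the fit significantly.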

Confusion matrix for cutoff = 0.5 using the confusionMatrix function from the “caret” package

confusionMatrix(table(predict(modelo_Loan, type = "response") >= 0.5,
                      dados_dummies$Loan_Status_Y == 1)[2:1, 2:1])

## Confusion Matrix and Statistics
## 
##        
##         TRUE FALSE
##   TRUE   325    82
##   FALSE    7    66
##                                           
##                Accuracy : 0.8146          
##                  95% CI : (0.7769, 0.8484)
##     No Information Rate : 0.6917          
##     P-Value [Acc > NIR] : 7.161e-10       
##                                           
##                   Kappa : 0.4943          
##                                           
##  Mcnemar's Test P-Value : 4.365e-15       
##                                           
##             Sensitivity : 0.9789          
##             Specificity : 0.4459          
##          Pos Pred Value : 0.7985          
##          Neg Pred Value : 0.9041          
##              Prevalence : 0.6917          
##          Detection Rate : 0.6771          
##    Detection Prevalence : 0.8479          
##       Balanced Accuracy : 0.7124          
##                                           
##        'Positive' Class : TRUE            
##

confusionMatrix(table(predict(step_Loan, type = "response") >= 0.5,
                      dados_dummies$Loan_Status_Y == 1)[2:1, 2:1])

## Confusion Matrix and Statistics
## 
##        
##         TRUE FALSE
##   TRUE   325    85
##   FALSE    7    63
##                                           
##                Accuracy : 0.8083          
##                  95% CI : (0.7702, 0.8426)
##     No Information Rate : 0.6917          
##     P-Value [Acc > NIR] : 5.209e-09       
##                                           
##                   Kappa : 0.4738          
##                                           
##  Mcnemar's Test P-Value : 9.923e-16       
##                                           
##             Sensitivity : 0.9789          
##             Specificity : 0.4257          
##          Pos Pred Value : 0.7927          
##          Neg Pred Value : 0.9000          
##              Prevalence : 0.6917          
##          Detection Rate : 0.6771          
##    Detection Prevalence : 0.8542          
##       Balanced Accuracy : 0.7023          
##                                           
##        'Positive' Class : TRUE            
##

In the context of a logistic model, the output is the probability of an event occurring. Therefore, when we define a “cutoff” for creating a confusion matrix, we are setting the probability threshold at or above which an observation is classified as an event (here, an approved loan); the confusion matrix then compares these classifications with the observed outcomes. In our case, we can see that in a confusion matrix for a cutoff of 0.5, both models had very similar metrics, but the original model performed slightly better overall. It’s worth noting that the cutoff of 0.5 yielded the best results for both models.

Model “modelo_Loan”

Sensitivity	Specificity	Accuracy
0.9789157	0.4459459	0.8145833

Model “step_Loan”

Sensitivity	Specificity	Accuracy
0.9789157	0.4256757	0.8083333

In summary, accuracy is the overall proportion of observations classified correctly, sensitivity measures the ability to correctly identify instances of the positive class, and specificity measures the ability to correctly identify instances of the negative class.
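These three metrics follow directly from the four cells of a confusion matrix. As an illustration, the sketch below recomputes them for “modelo_Loan” from the counts printed in its confusion matrix (rows are predictions, columns are observed values). The variable names are hypothetical.

```r
# Counts from the "modelo_Loan" confusion matrix at cutoff 0.5
tp <- 325  # predicted approved, actually approved
fp <- 82   # predicted approved, actually denied
fn <- 7    # predicted denied, actually approved
tn <- 66   # predicted denied, actually denied

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
accuracy    <- (tp + tn) / (tp + fp + fn + tn)

print(c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy))
```

The low specificity comes from the 82 denied loans that the model classifies as approved.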

Finally, let’s plot the ROC curve for both models. The ROC curve is an important evaluation metric in binary classification tasks. It provides valuable insights into a model’s performance and its ability to distinguish between positive and negative classes.

ROC curve

Let’s save the values necessary for plotting the ROC curve:

predicoes <- prediction(predictions = modelo_Loan$fitted.values, 
                        labels = as.factor(dados_dummies$Loan_Status_Y))
sensitividade <- (performance(predicoes, measure = "sens"))@y.values[[1]] 

especificidade <- (performance(predicoes, measure = "spec"))@y.values[[1]]

predicoes_step <- prediction(predictions = step_Loan$fitted.values, 
                        labels = as.factor(dados_dummies$Loan_Status_Y))
sensitividade_step <- (performance(predicoes_step, measure = "sens"))@y.values[[1]] 

especificidade_step <- (performance(predicoes_step, measure = "spec"))@y.values[[1]]

The roc function from the “pROC” package

We need to use the roc function to specify the output variable and the predicted values:

ROC <- roc(response = dados_dummies$Loan_Status_Y, 
           predictor = modelo_Loan$fitted.values)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

ROC_step <- roc(response = dados_dummies$Loan_Status_Y, 
           predictor = step_Loan$fitted.values)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

ROC curve for the “modelo_Loan” model

ROC curve for the “step_Loan” model

In summary, a larger area under the ROC curve indicates a greater ability of a binary logistic model to distinguish between the positive and negative classes. When analyzing the ROC curves of our models, we can see that our original model has a slightly larger area under the curve.
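The area under the ROC curve can be read as the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case. The sketch below computes it with a rank-based formula on made-up data (`truth`, `scores` and `auc_rank` are hypothetical and the values are not results from our models).

```r
# Hypothetical observed classes and predicted probabilities
truth  <- c(0, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80)

# Rank-based (Mann-Whitney) estimate of the area under the ROC curve
auc_rank <- function(scores, truth) {
  r  <- rank(scores)          # average ranks are used for ties
  n1 <- sum(truth == 1)       # number of positive cases
  n0 <- sum(truth == 0)       # number of negative cases
  (sum(r[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

print(auc_rank(scores, truth))
```

For the fitted models, the same quantity can be obtained from the objects created with roc, for example with auc(ROC) and auc(ROC_step) from the “pROC” package.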

Overall, we have obtained two reliable models with good predictive capability, as both have very similar accuracy. The choice of which model is “better” would depend on the context.

Our original model has slightly superior metrics but is more complex with 12 independent variables.

On the other hand, the “stepwise” version is simpler with only 4 independent variables but has slightly inferior metrics.

Equations of the models

modelo_Loan

\[ \begin{aligned} \operatorname{Loan\_Status\_Y} &\sim Bernoulli\left(\operatorname{prob}_{\operatorname{Loan\_Status\_Y} = \operatorname{1}}= \hat{P}\right) \\ \log\left[ \frac { \hat{P} }{ 1 - \hat{P} } \right] &= 2.93857 + 0.08251(\operatorname{Dependents}) + 5.446 \times 10^{-6}(\operatorname{ApplicantIncome}) - 4.773 \times 10^{-5}(\operatorname{CoapplicantIncome})\ - \\ &\quad 0.00287(\operatorname{LoanAmount}) - 0.00062(\operatorname{Loan\_Amount\_Term}) - 0.37131(\operatorname{Gender\_Female}) - 0.52546(\operatorname{Married\_No})\ - \\ &\quad 0.42383(\operatorname{Education\_Not\ Graduate}) - 0.15164(\operatorname{Self\_Employed\_Yes}) - 3.63222(\operatorname{Credit\_History\_0}) - 0.94886(\operatorname{Property\_Area\_Rural})\ - \\ &\quad 0.85022(\operatorname{Property\_Area\_Urban}) \end{aligned} \]

step_Loan

\[ \begin{aligned} \operatorname{Loan\_Status\_Y} &\sim Bernoulli\left(\operatorname{prob}_{\operatorname{Loan\_Status\_Y} = \operatorname{1}}= \hat{P}\right) \\ \log\left[ \frac { \hat{P} }{ 1 - \hat{P} } \right] &= 2.12904 - 0.58552(\operatorname{Married\_No}) - 3.62894(\operatorname{Credit\_History\_0}) - 0.9679(\operatorname{Property\_Area\_Rural})\ - \\ &\quad 0.75259(\operatorname{Property\_Area\_Urban}) \end{aligned} \]

where “P̂” represents the estimated probability of the event occurring. The right-hand side of each equation gives the log-odds (logit) of “P̂”; to obtain the probability itself, the inverse of the logit function (the logistic function) is applied to that value.
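As a worked example, the sketch below turns the “step_Loan” equation into probabilities for two hypothetical applicants. The coefficients are the ones printed in the equation; the applicant profiles and all names in the code are assumptions made for illustration.

```r
# Inverse of the logit: converts log-odds into a probability
inv_logit <- function(z) 1 / (1 + exp(-z))

# Coefficients of "step_Loan" as printed in its equation
b0           <-  2.12904
b_married_no <- -0.58552
b_credit_0   <- -3.62894
b_rural      <- -0.96790
b_urban      <- -0.75259

# Applicant A: married, Credit_History = 1, Semiurban (all dummies equal 0)
p_a <- inv_logit(b0)

# Applicant B: same as A, but with Credit_History = 0
p_b <- inv_logit(b0 + b_credit_0)

print(c(applicant_A = p_a, applicant_B = p_b))
```

Changing only the credit history moves the estimated probability of approval from roughly 0.89 to roughly 0.18, which shows how dominant this variable is in the model.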

A logistic model for loan approval

Rafael

2023-06-02