Introduction

In this project, we are modeling a dataset where the dependent variable (y) is binary, meaning it has only two values recorded. In order to model this type of data, our group chose the “Loan Approval Data Set,” recording 4269 observations (specifically in the region of India, therefore money is measured in rupees (Rs.)). This dataset, created by Archit Sharma, observes various predictors: annual income, loan amount, loan term, CIBIL score (same as credit score but for the “Credit Information Bureau India Limited”), residential assets value, commercial assets value, luxury assets value, bank assets value, education, number of dependents, and employment status. The dependent variable recorded is loan statu,s which has either a record of “Accepted” or “Rejected”. We want to see if any of these predictors have an effect on whether an applicant’s loan gets approved or not, and what we can imply from building a model that shows this effect. We will be building three types of models: Linear Probability, probit, and logit models, to see how the predictors influence the probability of loan status approval and extend that analysis to make conclusions and advice.

# Load Libraries 
library(tidyverse)
library(car)
library(janitor)
library(corrplot)
library(lmtest)
library(margins)

Load The Data

data <- read.csv("loan_approval_dataset.csv")

Citation

Archit T. Sharma. “Loan‑Approval‑Prediction‑Dataset.” Kaggle, 8 Oct 2023. Loan Approval Data Set used for Prediction Models, https://www.kaggle.com/datasets/architsharma01/loan-approval-prediction-dataset

Descriptive Analysis of Variables

This section shows the behavior of all the variables individually. Binary variables such as education, self employment, loan status (the dependent variable), and discrete variables like loan term and number of dependents will only show the proportions. The rest of the variables will have all figures.

Statistical Summaries

summary(data$income_annum)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  200000 2700000 5100000 5059124 7500000 9900000

summary(data$loan_amount)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   300000  7700000 14500000 15133450 21500000 39500000

summary(data$cibil_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   300.0   453.0   600.0   599.9   748.0   900.0

summary(data$residential_assets_value)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  -100000  2200000  5600000  7472617 11300000 29100000

summary(data$commercial_assets_value)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0  1300000  3700000  4973155  7600000 19400000

summary(data$luxury_assets_value)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   300000  7500000 14600000 15126306 21700000 39200000

summary(data$bank_asset_value)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0  2300000  4600000  4976692  7100000 14700000

#Proportions for Binary and Discrete Variables
# Education
data %>% tabyl (education)

data %>% tabyl (self_employed)

data %>% tabyl (no_of_dependents)

data %>% tabyl (loan_term)

data %>% tabyl (loan_status)

Histograms

#Histogram of Annual Income
hist (data$income_annum, breaks = "FD", col = "lavender",
      main = "Distribution of Applicant Annual Income",
      xlab = "Annual Income", probability = TRUE)
lines (density (data$income_annum) )

Annual Income Histogram: The distribution of annual income appears roughly uniform with a flat density curve, showing minimal skew and wide spread of values between Rs.0 to about Rs.10,000,000

#Histogram of Loan Amount 
hist (data$loan_amount, breaks = "FD", col = "lavender", 
      main = "Distribution of Loan Amount",
      xlab = "Loan Amount", probability = TRUE)
lines (density (data$loan_amount) )

Loan Amount Histogram: This distribution is right-skewed, meaning there is a higher-concentration of values on the lower end (Rs.0 to Rs.20,000,000), proceeding with a long tail extending toward higher amounts up to Rs.40,000,000.

#Histogram of Credit Score 
hist (data$cibil_score, breaks = "FD", col = "lavender", 
      main = "Distribution of CIBIL Score",
      xlab = "CIBIL Score", probability = TRUE)
lines (density (data$cibil_score) )

Credit Score Histogram: The distribution is mostly uniform, which shows an even spread of scores across the range with no visible skew. These values spread around 300 to 900.

#Histogram of Residential Assets Value
hist (data$residential_assets_value, breaks = "FD", col = "lavender",
      main = "Distribution of Residential Assets Value",
      xlab = "Resdential Assets Value", probability = TRUE)
lines (density (data$residential_assets_value) )

Residential Assets Value Histogram: These values are highly right skewed, with most of the values around Rs.0 to Rs.5,000,000 and a few reaching as high as Rs.30,000,000.

#Histogram of Commercial Assets Value
hist (data$commercial_assets_value, breaks = "FD", col = "lavender", 
      main = "Distribution of Commercial Assets Value",
      xlab = "Commercial Assets Value", probability = TRUE)
lines (density (data$commercial_assets_value) )

Commercial Asset Values Histogram: High Right skewness, with most vales concentrated between Rs.0 to Rs.5,000,000 but goes up to Rs.20,000,000.

#Histogram of Luxury Assets Value
hist (data$luxury_assets_value, breaks = "FD", col = "lavender",
      main = "Distribution of Luxury Assets Value",
      xlab = "Luxury Assets Value", probability = TRUE)
lines (density (data$luxury_assets_value) )

Luxury Asset Value Histogram: The distribution is right-skewed, with many applicants having Rs.0 to Rs.5,000,000 in luxury assets with a few reaching up to Rs.20,000,000.

#Histogram of Bank Assets Value
hist (data$bank_asset_value, breaks = "FD", col = "lavender",
      main = "Distribution of Bank Asset Value",
      xlab = "Bank Asset Value", probability = TRUE)
lines (density (data$bank_asset_value) )

Bank Asset Valye Histogram: the distribution is right-skewed, with most values around Rs.0 to 7,000,000. This extends to the right (long tail) to Rs.15,000,000 showing that very few applicants hold a high amount in bank assets.

Correlation Plot

#Making a new data set with only the numeric values
data_numeric <- data[ ,c(5,6,8,9,10,11,12)]
#Checking data columns 
head(data_numeric)

#Making the correlation plot
cor(data_numeric)

##                          income_annum loan_amount  cibil_score
## income_annum               1.00000000  0.92746991 -0.023034422
## loan_amount                0.92746991  1.00000000 -0.017034787
## cibil_score               -0.02303442 -0.01703479  1.000000000
## residential_assets_value   0.63684147  0.59459571 -0.019946757
## commercial_assets_value    0.64032770  0.60318809 -0.003769346
## luxury_assets_value        0.92914542  0.86091403 -0.028617628
## bank_asset_value           0.85109313  0.78812171 -0.015478271
##                          residential_assets_value commercial_assets_value
## income_annum                           0.63684147             0.640327698
## loan_amount                            0.59459571             0.603188088
## cibil_score                           -0.01994676            -0.003769346
## residential_assets_value               1.00000000             0.414786027
## commercial_assets_value                0.41478603             1.000000000
## luxury_assets_value                    0.59093243             0.591127500
## bank_asset_value                       0.52741756             0.548575593
##                          luxury_assets_value bank_asset_value
## income_annum                      0.92914542       0.85109313
## loan_amount                       0.86091403       0.78812171
## cibil_score                      -0.02861763      -0.01547827
## residential_assets_value          0.59093243       0.52741756
## commercial_assets_value           0.59112750       0.54857559
## luxury_assets_value               1.00000000       0.78851691
## bank_asset_value                  0.78851691       1.00000000

corrplot(cor(data_numeric))

The correlation matrix shows if the numeric variables are correlated with each other. Variable combinations that seem to have a high positive correlation are:

Income with Loan Amount (indicating that higher income applicants would request a larger loan amount)
Bank Asset Value with income, loan amount and luxury asset value. (indicating that people with high value in bank assets would be ones with high income and request a higher amount, therefore also have high value in luxury asset value)

Some moderate correlations are between commercial and residential asset values (applicants who hold commercial assets might also hold residential as well)

Credit Score (CIBIL Score) has a weak correlation with the variables.

The correlation plot suggests multicollinearity between the variables, therefore we should be careful when selecting our variables for the models later.

Box Plots

boxplot (data$income_annum, ylab="Annual Income")

Annual Income Box plot: The box plot shows relatively symmetric distribution, with the median around Rs.5,000,000. The interquartile rage is moderately wide, meaning that though there is some variability in the values, there are no extreme outliers.

boxplot (data$loan_amount, ylab="Loan Amount")

Loan Amount Box Plot: This box plots shows a right-skewed distribution, with the median around Rs.2,000,000. The box stretched to higher values as the whiskers extends up.

boxplot (data$cibil_score, ylab="CIBIL Score")

CIBIL score box plots: shows a roughly symmetric distribution with a median around 600. The total range seems to span from 300-600.

boxplot (data$residential_assets_value, ylab="Residential Assets Value")

Residential Assets Value box plot: This variable has an obvious right-skew in the observations. Some outliers are beyond Rs.20,000,000. The median of the observations are around Rs.5,000,000.

boxplot (data$commercial_assets_value, ylab="Commercial Assets Value")

Commercial Assets Value box plot: This variable displays a strong right-skewness. Some outliers are also past Rs.20,000,000 like the previous variable.

boxplot (data$luxury_assets_value, ylab="Luxury Assets Value")

Luxury Assets Value box plot: This variable is also very right-skewed with outliers above Rs.30,000,000. The median shows about Rs.10,000,000.

boxplot (data$bank_asset_value, ylab="Bans kAssets Value")

Bank Assets Value box plot: This variable has a moderate right-skewness in its distribution but shows a few outliers beyond Rs.15,000,000. The inter-quartile range shows variability.

Scatter Plots

data <- data
plot (data$income_annum, col = "midnightblue")
title ('Scater Plot of Annual Income')

Annual Income Scatter Plot: Shows a uniform vertical spread, there isn’t any trend or clustering.

data <- data
plot (data$loan_amount, col = "midnightblue")
title ('Scater Plot of Loan Amount')

Loan Amount Scatter Plot: Shows values densely scattered throughout the index. There are values ranging from Rs. 0 to Rs.40,000,000.

data <- data
plot (data$cibil_score, col = "midnightblue")
title ('Scater Plot of CIBIL Score')

CIBIL Score Scatter Plot: These values range from 300 to 900 and seems to be evenly dispersed across the observations. There don’t seem to have any major gaps.

data <- data
plot (data$residential_assets_value, col = "midnightblue")
title ('Scater Plot of Residential Asset Value')

Residential Assets Value Scatter Plot: This shows a concentration at the lower end of the observation values. These fall below Rs.10,000,000.

data <- data
plot (data$commercial_assets_value, col = "midnightblue")
title ('Scater Plot of Commercial Asset Value')

Commercial Assets Scatter Plot: Very densely packed lower values, and gets less dense as the values reach Rs. 20,000,000.

data <- data
plot (data$luxury_assets_value, col = "midnightblue")
title ('Scater Plot of Luxury Assets Value')

Luxury Assets Value Scatter Plot: heavy clustering of points under Rs. 15,000,000. The scatter plot shows obserations as high as Rs.40,000,000 meaning there is high variabilty.

data <- data
plot (data$bank_asset_value, col = "midnightblue")
title ('Scater Plot of Bank Assets Value')

Bank Assets Value Scatter Plot: Most values are concentrated below Rs.10,000,000. There are a few outliers in higher value reaching as for as Rs.15,000,000.

Modeling Building

Cleaning the binary variables

We Need to clean the variables education, employed, and the dependent variable loan status to make sure they are considered as binary.

#Cleaning the Binary Variables:
data$education <- as.factor(data$education)
data$self_employed <- as.factor(data$self_employed)
#Making Sure observations do not only show as 0's
unique(data$loan_status)

## [1] " Approved" " Rejected"

data$loan_status_clean <- trimws(tolower(data$loan_status))
data$loan_status_binary <- ifelse(data$loan_status_clean == "approved", 1, 0)
table(data$loan_status_binary)

## 
##    0    1 
## 1613 2656

The linear probability model will go through many transformations and tests to check which variables should be kept or transformed. Those variables will be used for the following probit and logit models. As of now loan amount, annual income, residential, commercial, luxury, and bnak assets are right skewed so a log transformation on those variables might help the model performance.

Linear Probability Model

#Linear Probability Model with no transformations or drops in variables
lpm_model_untransformed <- lm(loan_status_binary ~ no_of_dependents + 
                                education + self_employed + income_annum + loan_amount + 
                                loan_term + cibil_score + residential_assets_value + 
                                commercial_assets_value + luxury_assets_value + bank_asset_value, 
                              data = data)
#Linear Probabilit Model with log on right-skewed variables
lpm_model_transformed <- lm(loan_status_binary ~ no_of_dependents + education + 
                              self_employed + log(income_annum) + log(loan_amount) + 
                              loan_term + cibil_score + log(residential_assets_value+1) + 
                              log(commercial_assets_value + 1) + log(luxury_assets_value +1 ) + 
                              log(bank_asset_value +1), data = data)
#Check AIC and BIC values for each model
AIC(lpm_model_untransformed, lpm_model_transformed)

BIC(lpm_model_untransformed, lpm_model_transformed)

AIC and BIC values are lower for the model with transformations added therfore, we will keep that as our model for now. Next we will check for VIF.

vif(lpm_model_transformed)

##                  no_of_dependents                         education 
##                          1.003186                          1.001765 
##                     self_employed                 log(income_annum) 
##                          1.000778                         37.360885 
##                  log(loan_amount)                         loan_term 
##                         18.139865                          1.001388 
##                       cibil_score log(residential_assets_value + 1) 
##                          1.001406                          1.306223 
##  log(commercial_assets_value + 1)      log(luxury_assets_value + 1) 
##                          1.254114                         18.161984 
##         log(bank_asset_value + 1) 
##                          3.468172

#VIF extremely high for Log(annual Income), if dropped lets see model performance
lpm_model_transformed2 <- lm(loan_status_binary ~ no_of_dependents + education + 
                               self_employed + log(loan_amount) + loan_term + cibil_score + 
                              log(residential_assets_value+1) + log(commercial_assets_value + 1) + 
                               log(luxury_assets_value +1 ) + log(bank_asset_value +1), data = data)
vif(lpm_model_transformed2)

##                  no_of_dependents                         education 
##                          1.002923                          1.001739 
##                     self_employed                  log(loan_amount) 
##                          1.000722                         10.000158 
##                         loan_term                       cibil_score 
##                          1.000937                          1.001392 
## log(residential_assets_value + 1)  log(commercial_assets_value + 1) 
##                          1.298944                          1.251439 
##      log(luxury_assets_value + 1)         log(bank_asset_value + 1) 
##                         10.094075                          3.295644

#Drop Luxury Assets Value, next hight VIF value 
lpm_model_transformed3 <- lm(loan_status_binary ~ no_of_dependents + education + 
                               self_employed + log(loan_amount) + loan_term + cibil_score + 
                               log(residential_assets_value+1) + log(commercial_assets_value + 1) + 
                               log(bank_asset_value +1), data = data)
vif(lpm_model_transformed3)

##                  no_of_dependents                         education 
##                          1.002859                          1.001695 
##                     self_employed                  log(loan_amount) 
##                          1.000667                          3.440718 
##                         loan_term                       cibil_score 
##                          1.000815                          1.001122 
## log(residential_assets_value + 1)  log(commercial_assets_value + 1) 
##                          1.286766                          1.248119 
##         log(bank_asset_value + 1) 
##                          3.099826

#Now no VIF's are large so only those two variables can be dropped.

Now We can run a summary on the previous model as it seems to best fit, and drop variables that are insignificant to see if that improves model performance.

summary(lpm_model_transformed3)

## 
## Call:
## lm(formula = loan_status_binary ~ no_of_dependents + education + 
##     self_employed + log(loan_amount) + loan_term + cibil_score + 
##     log(residential_assets_value + 1) + log(commercial_assets_value + 
##     1) + log(bank_asset_value + 1), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.21896 -0.21797 -0.03831  0.21357  0.98043 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       -7.485e-01  9.202e-02  -8.134 5.41e-16 ***
## no_of_dependents                  -3.548e-03  2.747e-03  -1.292   0.1966    
## education Not Graduate            -8.799e-03  9.310e-03  -0.945   0.3447    
## self_employed Yes                  5.620e-03  9.306e-03   0.604   0.5459    
## log(loan_amount)                   2.357e-02  9.861e-03   2.390   0.0169 *  
## loan_term                         -1.033e-02  8.151e-04 -12.669  < 2e-16 ***
## cibil_score                        2.175e-03  2.702e-05  80.490  < 2e-16 ***
## log(residential_assets_value + 1) -5.795e-04  2.667e-03  -0.217   0.8280    
## log(commercial_assets_value + 1)   8.192e-04  1.991e-03   0.411   0.6808    
## log(bank_asset_value + 1)         -1.303e-02  7.172e-03  -1.816   0.0694 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3029 on 4231 degrees of freedom
##   (28 observations deleted due to missingness)
## Multiple R-squared:  0.6105, Adjusted R-squared:  0.6097 
## F-statistic: 736.8 on 9 and 4231 DF,  p-value: < 2.2e-16

#Summary tells us to drop variables less than 0.05, after dropping highest one, 
#AIC is still better if not dropped
lpm_model_transformed3_dropped <- lm(loan_status_binary ~ no_of_dependents + education + 
                                       self_employed + log(loan_amount) + loan_term + cibil_score + 
                                       log(commercial_assets_value +1) + 
                                       log(bank_asset_value + 1) , data = data)
#Check AIC values for each model
AIC(lpm_model_transformed3 , lpm_model_transformed3_dropped)

#AIC lower for when variables are NOT dropped so keeping all variables with the transformations.
lpm_final <- lm(loan_status_binary ~ no_of_dependents + education + self_employed + 
                  log(loan_amount) + loan_term + cibil_score + 
               log(residential_assets_value+1) + log(commercial_assets_value + 1) + 
                 log(bank_asset_value +1), data = data)
summary(lpm_final)

## 
## Call:
## lm(formula = loan_status_binary ~ no_of_dependents + education + 
##     self_employed + log(loan_amount) + loan_term + cibil_score + 
##     log(residential_assets_value + 1) + log(commercial_assets_value + 
##     1) + log(bank_asset_value + 1), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.21896 -0.21797 -0.03831  0.21357  0.98043 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       -7.485e-01  9.202e-02  -8.134 5.41e-16 ***
## no_of_dependents                  -3.548e-03  2.747e-03  -1.292   0.1966    
## education Not Graduate            -8.799e-03  9.310e-03  -0.945   0.3447    
## self_employed Yes                  5.620e-03  9.306e-03   0.604   0.5459    
## log(loan_amount)                   2.357e-02  9.861e-03   2.390   0.0169 *  
## loan_term                         -1.033e-02  8.151e-04 -12.669  < 2e-16 ***
## cibil_score                        2.175e-03  2.702e-05  80.490  < 2e-16 ***
## log(residential_assets_value + 1) -5.795e-04  2.667e-03  -0.217   0.8280    
## log(commercial_assets_value + 1)   8.192e-04  1.991e-03   0.411   0.6808    
## log(bank_asset_value + 1)         -1.303e-02  7.172e-03  -1.816   0.0694 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3029 on 4231 degrees of freedom
##   (28 observations deleted due to missingness)
## Multiple R-squared:  0.6105, Adjusted R-squared:  0.6097 
## F-statistic: 736.8 on 9 and 4231 DF,  p-value: < 2.2e-16

The final model was when no variables are dropped after the two were dropped from running VIF. These variables will be used in the proceeding models. The lpm model transformed 3 with the highest p-value dropped, was also tested manually dropping variables with lower p-values over 0.05 and they resulted in higher AIC as well. Therefore this model with the lowest AIC should be used.

Probit Model

# Probit Model 
probit_model <- glm(loan_status_binary ~ no_of_dependents + education +
                      self_employed + log(loan_amount) + loan_term + cibil_score + 
                      log(residential_assets_value+1) + log(commercial_assets_value + 1) + 
                      log(bank_asset_value +1), data = data,
                    family = binomial(link = "probit"))

## Warning in log(residential_assets_value + 1): NaNs produced

summary(probit_model)

## 
## Call:
## glm(formula = loan_status_binary ~ no_of_dependents + education + 
##     self_employed + log(loan_amount) + loan_term + cibil_score + 
##     log(residential_assets_value + 1) + log(commercial_assets_value + 
##     1) + log(bank_asset_value + 1), family = binomial(link = "probit"), 
##     data = data)
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -6.4552995  0.6407761 -10.074   <2e-16 ***
## no_of_dependents                  -0.0067010  0.0186097  -0.360    0.719    
## education Not Graduate            -0.0471894  0.0631855  -0.747    0.455    
## self_employed Yes                  0.0374856  0.0630765   0.594    0.552    
## log(loan_amount)                   0.0788525  0.0672682   1.172    0.241    
## loan_term                         -0.0871854  0.0060266 -14.467   <2e-16 ***
## cibil_score                        0.0126906  0.0003735  33.977   <2e-16 ***
## log(residential_assets_value + 1)  0.0259045  0.0182690   1.418    0.156    
## log(commercial_assets_value + 1)   0.0175391  0.0131984   1.329    0.184    
## log(bank_asset_value + 1)         -0.0770497  0.0487035  -1.582    0.114    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5622.1  on 4240  degrees of freedom
## Residual deviance: 1961.8  on 4231  degrees of freedom
##   (28 observations deleted due to missingness)
## AIC: 1981.8
## 
## Number of Fisher Scoring iterations: 7

Logit Model

logit_model <- glm(loan_status_binary ~ no_of_dependents + education + 
                     self_employed + log(loan_amount) + loan_term + cibil_score + 
                     log(residential_assets_value+1) + log(commercial_assets_value + 1) + 
                     log(bank_asset_value +1), data = data,
                   family = binomial(link = "logit"))

## Warning in log(residential_assets_value + 1): NaNs produced

summary(logit_model)

## 
## Call:
## glm(formula = loan_status_binary ~ no_of_dependents + education + 
##     self_employed + log(loan_amount) + loan_term + cibil_score + 
##     log(residential_assets_value + 1) + log(commercial_assets_value + 
##     1) + log(bank_asset_value + 1), family = binomial(link = "logit"), 
##     data = data)
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -1.273e+01  1.209e+00 -10.527   <2e-16 ***
## no_of_dependents                  -2.058e-02  3.452e-02  -0.596   0.5511    
## education Not Graduate            -1.271e-01  1.173e-01  -1.084   0.2786    
## self_employed Yes                  8.908e-02  1.170e-01   0.762   0.4463    
## log(loan_amount)                   2.027e-01  1.239e-01   1.636   0.1018    
## loan_term                         -1.493e-01  1.127e-02 -13.252   <2e-16 ***
## cibil_score                        2.432e-02  8.165e-04  29.783   <2e-16 ***
## log(residential_assets_value + 1)  2.619e-02  3.510e-02   0.746   0.4555    
## log(commercial_assets_value + 1)   1.398e-02  2.463e-02   0.567   0.5704    
## log(bank_asset_value + 1)         -1.515e-01  8.733e-02  -1.735   0.0827 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5622.1  on 4240  degrees of freedom
## Residual deviance: 1911.7  on 4231  degrees of freedom
##   (28 observations deleted due to missingness)
## AIC: 1931.7
## 
## Number of Fisher Scoring iterations: 7

Chosing the Model

AIC and BIC Tests

AIC(lpm_final, probit_model, logit_model)

BIC(lpm_final, probit_model, logit_model)

Confusion Matricies

Confusion matrix for Linear Probability Model:

data_from_model <- model.frame(lpm_final)
lpm_pred <- predict(lpm_final, type = "response")
lpm_class <- ifelse(lpm_pred >= 0.5, 1, 0)
table(Predicted = lpm_class, Actual = data_from_model$loan_status_binary)

##          Actual
## Predicted    0    1
##         0 1500  210
##         1  101 2430

Confusion Matrix for Probit Model:

data_from_model2 <- model.frame(probit_model)
probit_pred <- predict(probit_model, type = "response")
probit_class <- ifelse(probit_pred >= 0.5, 1, 0)
table(Predicted = probit_class, Actual = data_from_model2$loan_status_binary)

##          Actual
## Predicted    0    1
##         0 1397  173
##         1  204 2467

Confusion Matrix for Logit Model:

data_from_model3 <- model.frame(logit_model)
logit_pred <- predict(logit_model, type = "response")
logit_class <- ifelse(logit_pred >= 0.5, 1, 0)
table(Predicted = logit_class, Actual = data_from_model3$loan_status_binary)

##          Actual
## Predicted    0    1
##         0 1414  174
##         1  187 2466

Although the Linear Probability Model (LPM) shows the best AIC value (1916.87) and highest accuracy (approximately 92.66 percent), followed by the Logit model (AIC = 1931.70, accuracy about 91.49 percent) and the Probit model (AIC = 1981.77, accuracy about 91.10 percent), theoretical considerations are important. The LPM, has strong numeric metrics, but is known to suffer from issues such as heteroskedasticity and predicted probabilities that may fall outside the [0,1] interval. This limits it’s reliability for binary outcomes. In contrast, the Logit model is better suited for modeling binary dependent variables because it ensures valid probability estimates. Therefore, the Logit model is selected as the final model to best represent the data, while the superior performance of the LPM in terms of accuracy and AIC is acknowledged as part of the overall evaluation.

logit_model_final_pick <- glm(loan_status_binary ~ no_of_dependents + education +
                                self_employed + log(loan_amount) + loan_term + cibil_score + 
                     log(residential_assets_value+1) + log(commercial_assets_value + 1) +
                       log(bank_asset_value +1), data = data,
                   family = binomial(link = "logit"))

summary(logit_model_final_pick)

## 
## Call:
## glm(formula = loan_status_binary ~ no_of_dependents + education + 
##     self_employed + log(loan_amount) + loan_term + cibil_score + 
##     log(residential_assets_value + 1) + log(commercial_assets_value + 
##     1) + log(bank_asset_value + 1), family = binomial(link = "logit"), 
##     data = data)
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -1.273e+01  1.209e+00 -10.527   <2e-16 ***
## no_of_dependents                  -2.058e-02  3.452e-02  -0.596   0.5511    
## education Not Graduate            -1.271e-01  1.173e-01  -1.084   0.2786    
## self_employed Yes                  8.908e-02  1.170e-01   0.762   0.4463    
## log(loan_amount)                   2.027e-01  1.239e-01   1.636   0.1018    
## loan_term                         -1.493e-01  1.127e-02 -13.252   <2e-16 ***
## cibil_score                        2.432e-02  8.165e-04  29.783   <2e-16 ***
## log(residential_assets_value + 1)  2.619e-02  3.510e-02   0.746   0.4555    
## log(commercial_assets_value + 1)   1.398e-02  2.463e-02   0.567   0.5704    
## log(bank_asset_value + 1)         -1.515e-01  8.733e-02  -1.735   0.0827 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5622.1  on 4240  degrees of freedom
## Residual deviance: 1911.7  on 4231  degrees of freedom
##   (28 observations deleted due to missingness)
## AIC: 1931.7
## 
## Number of Fisher Scoring iterations: 7

confint(logit_model)

## Waiting for profiling to be done...

##                                          2.5 %       97.5 %
## (Intercept)                       -15.12188262 -10.38050296
## no_of_dependents                   -0.08829879   0.04709291
## education Not Graduate             -0.35734683   0.10258109
## self_employed Yes                  -0.14010266   0.31864935
## log(loan_amount)                   -0.04002269   0.44648504
## loan_term                          -0.17171741  -0.12752575
## cibil_score                         0.02276492   0.02596736
## log(residential_assets_value + 1)  -0.04337982   0.09409977
## log(commercial_assets_value + 1)   -0.03430664   0.06230666
## log(bank_asset_value + 1)          -0.33007784   0.01836002

#AME Calculation 
logit_margins <- margins(logit_model_final_pick)
summary(logit_margins)

Conclusion

In the final logit model predicting the probability of loan approval show that two variables: loan term and CIBIL score are statistically significant at the 1 percent level. The model uses a logistic link function, meaning each coefficient reflects the change in the log-odds of loan approval associated with a one-unit increase in the predictors, holding all other variables constant. The coefficient for loan term is approximately -0.149, with a 95 percent confidence interval ranging from -0.172 to -0.128. This negative value suggests that longer loan terms are associated with lower odds of loan approval. More specifically, for each one-month increase in the loan term, the log-odds of getting approved decrease, and since the confidence interval does not include zero, this effect is statistically reliable. The CIBIL score, has a positive and significant coefficient of about 0.024, with a 95 percent confidence interval from 0.023 to 0.026. This indicates that higher CIBIL scores are associated with higher odds of loan approval. Again, the confidence interval does not cross zero, supporting the reliability of this effect. The Average Marginal Effects (AMEs) were calculated for better interpretation of the significant predictors on the predicted probability of loan approval. The AME for CIBIL score is approximately 0.0017, which means that for every one-point increase in a borrower’s credit score, the probability of loan approval increases by about 0.17 percent, holding all other variables constant. Similarly, the loan term has an AME of approximately -0.0103, indicating that each additional month in loan duration is associated with a 1.03 percent decrease in the probability of approval, on average. Another predictor that could be considered is bank asset value with a p-value of about 0.08. However since it’s confidence interval crosses 0 it may not be the most reliable estimate. The other variables are considered in the model because removing them would not improve AIC and BIC. These findings imply how loan approval (in India) is determined mainly by a applicants CIBIL score and how long they want to take that loan out for. Therefore, applicants would be advised to have or build a strong CIBIL score and financial policies could implement programs for people wanting to improve their CIBIL score. Also, since longer loan terms are less likely to approved further analysis could be done to see why this is, how loaners pay back over a longer vs. a short period of time etc. This data set could be furthered analyzed by splitting the data into a training and testing model, applying more complicated models, or collecting observations (different variables such as inflation etc.) to see how they affect loan status in India.

QDV Modeling (Part 2)

Anvita Chinthada

2025-07-18