Research Question

We will be using data from a financial company's loan approval history, which includes loans from different people, information about those people, and whether each loan was approved or denied. Our research question is: how do a customer's demographics affect their loan acceptance rate, and how well can we automate the eligibility process?

Loading the Data

The data will come from this Kaggle dataset: https://www.kaggle.com/datasets/krishnaraj30/finance-loan-approval-prediction-data
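
Assuming the training file from the Kaggle page has been saved locally as train.csv (the filename here is an assumption), a minimal load might look like this:

    # Read the loan data; treating empty strings as missing is an
    # assumption needed for the NaN counts below to line up
    df <- read.csv("train.csv", na.strings = c("", "NA"))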

Removing NaN Values

Before we can perform data visualization or create our model, the first thing we need to do is ensure that there are no null values in the dataset, as these would cause later code to error out. We can check how many NaN values are in each column:
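
One way to produce these counts (a sketch, assuming the data frame is named df, as in the model call later on):

    # Count missing values in each column
    colSums(is.na(df))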

##           Loan_ID            Gender           Married        Dependents 
##                 0                13                 3                15 
##         Education     Self_Employed   ApplicantIncome CoapplicantIncome 
##                 0                32                 0                 0 
##        LoanAmount  Loan_Amount_Term    Credit_History     Property_Area 
##                22                14                50                 0 
##       Loan_Status 
##                 0

We can see seven columns that have NaN values. Because each column's null values make up only a small portion of the dataset, we can simply fill the missing entries in numeric columns with the mean and in categorical columns with the mode, then print the NaN counts again:
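
A minimal sketch of this imputation (get_mode is a hypothetical helper; the original code may differ in which columns get mean versus mode):

    # Hypothetical helper: most frequent non-NA value in a vector
    get_mode <- function(x) names(which.max(table(x)))

    for (col in names(df)) {
      missing <- is.na(df[[col]])
      if (!any(missing)) next
      if (is.numeric(df[[col]])) {
        df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)  # numeric: mean
      } else {
        df[[col]][missing] <- get_mode(df[[col]])            # categorical: mode
      }
    }
    colSums(is.na(df))  # verify no NaN values remain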

##           Loan_ID            Gender           Married        Dependents 
##                 0                 0                 0                 0 
##         Education     Self_Employed   ApplicantIncome CoapplicantIncome 
##                 0                 0                 0                 0 
##        LoanAmount  Loan_Amount_Term    Credit_History     Property_Area 
##                 0                 0                 0                 0 
##       Loan_Status 
##                 0

Data Visualization

We will look at the relationships between the variables and the Loan Status before building the ML model in order to better understand this dataset.

First, we'll see how the customer's income and requested loan amount affect their loan acceptance:
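
A sketch of the plot (the exact aesthetics are assumptions):

    library(ggplot2)

    # Scatter applicant income against loan amount, colored by approval status
    ggplot(df, aes(x = ApplicantIncome, y = LoanAmount, color = Loan_Status)) +
      geom_point(alpha = 0.6) +
      labs(title = "Income vs. Loan Amount by Loan Status")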

[Figure: scatter plot of ApplicantIncome vs. LoanAmount, colored by Loan_Status]

From this visualization, we can see that there isn't much of a meaningful relationship here: whether the applicant has a high or low income-to-loan-amount ratio doesn't seem to affect the loan's approval. We can also see that, as income increases, the loan amount tends to increase both in magnitude and in spread.

Next up, we’ll see how education and loan amount affects the loan status. One may hypothesize that individuals with education would be able to “get away” with requesting higher loans because loaners might see them as more likely to be able to repay their debts.

## No summary function supplied, defaulting to `mean_se()`

[Figure: average LoanAmount by Education, grouped by Loan_Status]

Based on the graph, the hypothesis does not hold. If it were true, the average loan amount for non-graduates' rejected loans would be much higher than for their accepted loans. In this regard, education doesn't affect loan status.

Finally, we’ll see the distribution of our target variable: the Loan Status.

We can see that most of the loans in the dataset were approved. If we feed this data straight into our model, it will be biased toward predicting "Yes", which is something we want to avoid. We will account for this later when building the model.

Converting Data to Numbers

Now that we’ve finished basic data visualization and have upsampled the data, before actually creating the model we have to convert all of the string features to numbers, or else the model won’t be able to understand it.

## [1] "Gender"
## [1] "Married"
## [1] "Dependents"
## [1] "Education"
## [1] "Self_Employed"
## [1] "Loan_Amount_Term"
## [1] "Property_Area"
## [1] "Loan_Status"

Logistic Regression Assumptions

We also have to make sure the data meets the two main assumptions of Logistic Regression before we analyze the coefficients and perform predictions.

  1. There is no multicollinearity

When two or more of the predictor features are highly correlated with each other, this creates multicollinearity and prevents the model (which is parametric) from assigning stable coefficients to each feature. We will use the Variance Inflation Factor (VIF) to check whether this is the case for any feature.
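
A sketch of the check using the car package (the model fit here mirrors the glm() call shown later):

    library(car)

    # Fit the full logistic model, then compute a VIF for each predictor
    model <- glm(Loan_Status ~ Gender + Married + Dependents + Education +
                   Self_Employed + ApplicantIncome + CoapplicantIncome +
                   LoanAmount + Loan_Amount_Term + Credit_History +
                   Property_Area,
                 family = "binomial", data = df)
    vif(model)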


A VIF greater than 5 indicates severe multicollinearity; given that none of the features here has a VIF of even 2, it is safe to say that there is no multicollinearity in the data.

  2. There are no outliers

Because of the underlying linear model in Logistic Regression, outliers in the data will degrade the model's performance, so there should be no outliers present. We can use Cook's Distance for each observation to measure this:
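
A sketch, reusing the model fitted for the VIF check:

    # Cook's distance for every observation; count how many exceed 0.5
    cooks <- cooks.distance(model)
    sum(cooks > 0.5)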

## [1] 0

Given that no observation's Cook's Distance exceeds 0.5, the conventional threshold for an influential outlier, we can say that there are no extreme outliers present in our data.

Interpreting Coefficients

Now, we can work towards actually creating the model and getting information from its coefficients.
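
A sketch of the fit and summary (this reproduces the call shown in the output below):

    model <- glm(Loan_Status ~ Gender + Married + Dependents + Education +
                   Self_Employed + ApplicantIncome + CoapplicantIncome +
                   LoanAmount + Loan_Amount_Term + Credit_History +
                   Property_Area,
                 family = "binomial", data = df)
    summary(model)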

## 
## Call:
## glm(formula = Loan_Status ~ Gender + Married + Dependents + Education + 
##     Self_Employed + ApplicantIncome + CoapplicantIncome + LoanAmount + 
##     Loan_Amount_Term + Credit_History + Property_Area, family = "binomial", 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0638  -0.3725   0.5950   0.6987   2.5326  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -2.022e+00  7.720e-01  -2.620   0.0088 ** 
## Gender            -9.458e-02  2.884e-01  -0.328   0.7429    
## Married            5.747e-01  2.427e-01   2.368   0.0179 *  
## Dependents         4.841e-02  1.159e-01   0.418   0.6760    
## Education         -4.422e-01  2.569e-01  -1.721   0.0852 .  
## Self_Employed     -2.920e-03  3.099e-01  -0.009   0.9925    
## ApplicantIncome    8.702e-06  2.344e-05   0.371   0.7104    
## CoapplicantIncome -5.340e-05  3.353e-05  -1.593   0.1113    
## LoanAmount        -2.052e-03  1.567e-03  -1.310   0.1903    
## Loan_Amount_Term  -8.070e-02  9.843e-02  -0.820   0.4123    
## Credit_History     3.865e+00  4.143e-01   9.329   <2e-16 ***
## Property_Area      8.299e-02  1.349e-01   0.615   0.5385    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 762.89  on 613  degrees of freedom
## Residual deviance: 572.53  on 602  degrees of freedom
## AIC: 596.53
## 
## Number of Fisher Scoring iterations: 5

Based on this, only two variables affect loan approval at a significance level of 0.05: Married (p = 0.018) and Credit_History (p < 2e-16). Education is marginal (p = 0.085), and Coapplicant's Income (p = 0.111) falls short of significance. Exponentiating the coefficients turns them into odds ratios: being married multiplies the odds of approval by about e^0.575 ≈ 1.8, while having a credit history multiplies them by roughly e^3.865 ≈ 48.
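
A sketch of that computation (confint.default() gives Wald-type intervals):

    # Odds ratios with Wald 95% confidence intervals
    exp(cbind(OddsRatio = coef(model), confint.default(model)))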

Evaluating Logistic Regression

Now we will test how well Logistic Regression can predict on this dataset. We will partition 70% of the dataset into a training set and the remaining 30% into a testing set.

In order to account for the class imbalance in the dataset, we'll raise the cutoff for predicting "Yes" on loan approval from 50% to 75%. We will train the model on the training set and measure its accuracy on the testing set.
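
A sketch of the evaluation (the seed and the variable names other than string_pred are assumptions):

    library(caret)

    set.seed(42)  # assumed seed, for reproducibility
    train_idx <- createDataPartition(df$Loan_Status, p = 0.7, list = FALSE)
    train <- df[train_idx, ]
    test  <- df[-train_idx, ]

    fit <- glm(Loan_Status ~ . - Loan_ID, family = "binomial", data = train)

    # Predict probabilities and apply the raised 75% cutoff for "Yes"
    probs <- predict(fit, newdata = test, type = "response")
    string_pred <- ifelse(probs > 0.75, "Yes", "No")
    actual <- ifelse(test$Loan_Status == 1, "Yes", "No")

    mean(string_pred == actual)  # accuracy
    table(actual, string_pred)   # confusion matrix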

## [1] 0.726776
##      string_pred
##        No Yes
##   No   32  25
##   Yes  25 101

We got an accuracy of around 73%, with notably better performance on the "Yes" observations (101 of 126 correct) than on the "No" observations (32 of 57).

Conclusion

Based on the Logistic Regression model, we were able to draw a few interesting conclusions from its coefficients (see "Interpreting Coefficients"). However, the model's accuracy was only around 73%, and imbalanced between "Yes" and "No", so it is clear that the dataset does not include all of the meaningful features that influence a loan's acceptance. Financial companies should record more data, and more meaningful features, if they want to automate the loan acceptance process to any real degree.