We will be using data from a financial company's loan approval history, which includes loans from different people, information about those people, and whether each loan was accepted or denied. Our research question will be: How do a customer's demographics affect their loan acceptance rate, and how well can we automate the eligibility process?
The data will come from this Kaggle dataset: https://www.kaggle.com/datasets/krishnaraj30/finance-loan-approval-prediction-data
Before we can perform data visualization or create our model, the first thing we need to do is make sure there are no null values in the dataset, as these will cause later code to error out. We can check how many NaN values are in each column:
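As a rough sketch (the file name `loan_data.csv` and the data frame name `df` are assumptions, not taken from the original code), reading the data and counting the missing values per column might look like this:

```r
# Assumption: the Kaggle training CSV has been saved locally as "loan_data.csv";
# empty strings are treated as missing values when reading it in
df <- read.csv("loan_data.csv", na.strings = c("", "NA"), stringsAsFactors = FALSE)

# Number of missing values in each column
colSums(is.na(df))
```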
## Loan_ID Gender Married Dependents
## 0 13 3 15
## Education Self_Employed ApplicantIncome CoapplicantIncome
## 0 32 0 0
## LoanAmount Loan_Amount_Term Credit_History Property_Area
## 22 14 50 0
## Loan_Status
## 0
We can see that seven columns have NaN values. Because each column's null values make up only a small portion of the dataset, we can simply fill them in, using the mean for numeric columns and the mode for categorical ones, and then print the NaN counts again:
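One way this imputation might be done, assuming the `df` data frame from above (the `get_mode()` helper is ours, not part of the original code):

```r
# Helper: the most frequent non-missing value of a vector
get_mode <- function(x) {
  x <- x[!is.na(x)]
  names(sort(table(x), decreasing = TRUE))[1]
}

# Numeric columns get the column mean, everything else gets the mode
for (col in names(df)) {
  if (anyNA(df[[col]])) {
    if (is.numeric(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
    } else {
      df[[col]][is.na(df[[col]])] <- get_mode(df[[col]])
    }
  }
}

# Check that no missing values remain
colSums(is.na(df))
```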
## Loan_ID Gender Married Dependents
## 0 0 0 0
## Education Self_Employed ApplicantIncome CoapplicantIncome
## 0 0 0 0
## LoanAmount Loan_Amount_Term Credit_History Property_Area
## 0 0 0 0
## Loan_Status
## 0
Before building the ML model, we will look at the relationships between the variables and the Loan Status in order to better understand this dataset.
First, we'll see how the customer's income and loan amount affect their loan acceptance:
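A sketch of the kind of scatter plot used here (the exact aesthetics are assumptions; the column names follow the dataset):

```r
library(ggplot2)

# Applicant income vs. requested loan amount, coloured by approval status
ggplot(df, aes(x = ApplicantIncome, y = LoanAmount, color = Loan_Status)) +
  geom_point(alpha = 0.6) +
  labs(title = "Income vs. Loan Amount by Loan Status",
       x = "Applicant Income", y = "Loan Amount")
```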
## Warning: Removed 12 rows containing missing values (`geom_point()`).
From this visualization we can see that there isn't much of a meaningful relationship here; whether the applicant has a high or low income-to-loan-amount ratio doesn't seem to affect the loan's approval. We can also see that, generally, as income increases, the requested loan amount increases roughly linearly and its spread widens.
Next, we'll see how education and loan amount affect the loan status. One may hypothesize that educated individuals would be able to "get away" with requesting higher loans, because lenders might see them as more likely to be able to repay their debts.
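Judging by the `mean_se()` message below, the original chunk used `stat_summary()` without an explicit summary function; a sketch of such a plot (the bar geometry and dodging are assumptions):

```r
# Average loan amount for graduates vs. non-graduates, split by loan status;
# stat_summary() with no function supplied falls back to mean_se()
ggplot(df, aes(x = Education, y = LoanAmount, fill = Loan_Status)) +
  stat_summary(geom = "bar", position = "dodge") +
  labs(title = "Average Loan Amount by Education and Loan Status",
       y = "Mean Loan Amount")
```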
## No summary function supplied, defaulting to `mean_se()`
Based on the graph, the hypothesis does not appear to hold: if it were true, the average loan amount of rejected loans for non-graduates would be much higher than for accepted loans. In this regard, education doesn't appear to affect loan status.
Finally, we’ll see the distribution of our target variable: the Loan Status.
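A simple bar chart is enough here (a sketch, reusing `df`):

```r
# Distribution of the target variable
ggplot(df, aes(x = Loan_Status)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Distribution of Loan Status", x = "Loan Status", y = "Count")
```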
We can see that most of the loans in the dataset were approved. If we feed this data straight into our model, it will be biased towards predicting "Yes", which is something we want to avoid. We will account for this class imbalance later, when building the model.
Now that we've finished the basic data visualization, and before actually creating the model, we have to convert all of the string features to numbers, or else the model won't be able to use them.
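The printed column names below suggest a loop over the categorical columns; a sketch of a simple integer (label) encoding, where the exact encoding scheme is an assumption:

```r
# Columns that hold categories rather than raw numeric measurements
cat_cols <- c("Gender", "Married", "Dependents", "Education", "Self_Employed",
              "Loan_Amount_Term", "Property_Area", "Loan_Status")

for (col in cat_cols) {
  print(col)
  # Replace each category with an integer code (0, 1, 2, ...)
  df[[col]] <- as.integer(as.factor(df[[col]])) - 1
}
```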
## [1] "Gender"
## [1] "Married"
## [1] "Dependents"
## [1] "Education"
## [1] "Self_Employed"
## [1] "Loan_Amount_Term"
## [1] "Property_Area"
## [1] "Loan_Status"
We also have to make sure the data meets the two main assumptions of Logistic Regression before we analyze the coefficients and perform predictions.
When two or more of the predictor features are highly correlated with each other, this creates multicollinearity, which prevents the model (which is parametric) from assigning stable coefficients to each feature. We will use the Variance Inflation Factor (VIF) to check whether this is the case for any feature.
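The VIF is computed from a fitted model, so a sketch might fit the full logistic regression first and then call `car::vif()` on it (the object name `model` is an assumption; the formula matches the model summary shown later):

```r
library(car)

# Full logistic regression model on the encoded data
model <- glm(Loan_Status ~ Gender + Married + Dependents + Education +
               Self_Employed + ApplicantIncome + CoapplicantIncome + LoanAmount +
               Loan_Amount_Term + Credit_History + Property_Area,
             family = "binomial", data = df)

# Variance Inflation Factor for each predictor
vif(model)
```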
A VIF > 5 indicates severe multicollinearity, and given that none of the features here has a VIF of 2 or greater, it is safe to say that there is no problematic multicollinearity in the data.
Because Logistic Regression fits a linear model on the log-odds scale, extreme outliers in the data can degrade its performance, so we want to confirm that none are present. We can use Cook's Distance for each observation to measure this.
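A sketch of this check, reusing the `model` object fitted above:

```r
# Cook's Distance for every observation; count how many exceed the 0.5 threshold
cooks <- cooks.distance(model)
sum(cooks > 0.5)
```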
## [1] 0
Given that no observation has a Cook's Distance greater than 0.5 (a common threshold for an influential outlier), we can say that there are no extreme outliers present in our data.
Now we can look at the fitted model itself and extract information from its coefficients.
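The output below can be obtained from the same fitted `model` object (a sketch):

```r
# Coefficients, standard errors, and p-values of the fitted model
summary(model)
```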
##
## Call:
## glm(formula = Loan_Status ~ Gender + Married + Dependents + Education +
## Self_Employed + ApplicantIncome + CoapplicantIncome + LoanAmount +
## Loan_Amount_Term + Credit_History + Property_Area, family = "binomial",
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0638 -0.3725 0.5950 0.6987 2.5326
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.022e+00 7.720e-01 -2.620 0.0088 **
## Gender -9.458e-02 2.884e-01 -0.328 0.7429
## Married 5.747e-01 2.427e-01 2.368 0.0179 *
## Dependents 4.841e-02 1.159e-01 0.418 0.6760
## Education -4.422e-01 2.569e-01 -1.721 0.0852 .
## Self_Employed -2.920e-03 3.099e-01 -0.009 0.9925
## ApplicantIncome 8.702e-06 2.344e-05 0.371 0.7104
## CoapplicantIncome -5.340e-05 3.353e-05 -1.593 0.1113
## LoanAmount -2.052e-03 1.567e-03 -1.310 0.1903
## Loan_Amount_Term -8.070e-02 9.843e-02 -0.820 0.4123
## Credit_History 3.865e+00 4.143e-01 9.329 <2e-16 ***
## Property_Area 8.299e-02 1.349e-01 0.615 0.5385
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 762.89 on 613 degrees of freedom
## Residual deviance: 572.53 on 602 degrees of freedom
## AIC: 596.53
##
## Number of Fisher Scoring iterations: 5
Based on this, only two variables are significant at the 0.05 level: Married and Credit_History (Education is the next closest, at the 0.10 level, with the coapplicant's income further behind). We can use the coefficients of these features along with their standard errors to get the following statistics:
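A sketch of how these 95% Wald confidence intervals for the odds ratios can be computed from the model output (the loop and formatting are ours):

```r
# exp(estimate +/- 1.96 * standard error) gives a 95% CI for the odds ratio
coefs <- summary(model)$coefficients
for (feat in c("Married", "Credit_History")) {
  est <- coefs[feat, "Estimate"]
  se  <- coefs[feat, "Std. Error"]
  ci  <- exp(est + c(-1.96, 1.96) * se)
  cat(sprintf("%s: odds ratio 95%% CI = [%.2f, %.2f]\n", feat, ci[1], ci[2]))
}
```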
We can say with 95% confidence that married people, when controlling for the other predictors in the dataset, have 1.10 - 2.86 times the odds of getting their loan approved.
We can say with 95% confidence that people whose credit history shows their past repayment deadlines were met, when controlling for the other predictors in the dataset, have 21 - 107 times the odds of getting their loans approved.
Now we will test how well Logistic Regression can predict on this dataset. We will partition 70% of the dataset into a training set and 30% into a testing set.
In order to account for the class imbalance in the dataset, we'll raise the cutoff for predicting "Yes" (loan approved) from 50% to 75%. We will train the model on the training set and evaluate its accuracy on the testing set.
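A sketch of this evaluation using `caret::createDataPartition()` for the split (the seed, object names, and the exact way the confusion table is printed are assumptions):

```r
library(caret)

set.seed(42)  # assumption: some fixed seed for reproducibility

# 70/30 train/test split, stratified on the target
train_idx <- createDataPartition(df$Loan_Status, p = 0.7, list = FALSE)
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Fit the logistic regression on the training set only
fit <- glm(Loan_Status ~ Gender + Married + Dependents + Education +
             Self_Employed + ApplicantIncome + CoapplicantIncome + LoanAmount +
             Loan_Amount_Term + Credit_History + Property_Area,
           family = "binomial", data = train)

# Predicted approval probabilities on the test set, with a 0.75 cutoff for "Yes"
probs <- predict(fit, newdata = test, type = "response")
string_pred <- ifelse(probs > 0.75, "Yes", "No")
actual      <- ifelse(test$Loan_Status == 1, "Yes", "No")

mean(string_pred == actual)   # overall accuracy
table(actual, string_pred)    # confusion table (rows = actual, columns = predicted)
```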
## [1] 0.726776
## string_pred
## No Yes
## No 32 25
## Yes 25 101
We got an overall accuracy of around 73%, with noticeably better performance on the "Yes" observations (about 80% classified correctly) than on the "No" observations (about 56%).
Based on the Logistic Regression model, we were able to draw a few interesting conclusions from its coefficients (see "Interpreting Coefficients"). However, looking at the model's accuracy, which was only around 73% and imbalanced between "Yes" and "No", it is clear that the dataset does not include all of the meaningful features that influence a loan's acceptance. Financial companies should record more, and more meaningful, features if they want to automate the loan acceptance process to any real extent.