Abstract
Predicting loan eligibility is an active area of applied machine learning. Most retail credit lending firms now use digital platforms that let customers check their loan eligibility from information provided online. This paper presents a machine learning approach to predicting home loan eligibility.
Executive summary
Aim
- To explore and understand the relationships between the variables that influence home loan eligibility and, in particular, to select a machine learning model that predicts home loan eligibility.
Methodology
Data
Two cross-sectional multivariate data sets with twelve (12) predictor variables were used for this analysis. The currency used in these data sets was INR.
The training data set had 614 sample observations with twelve (12) predictor variables, and one (1) response variable, loan status.
The test data set had 367 sample observations with twelve (12) predictor variables.
Technique(s)
Supervised machine learning.
Three machine learning techniques, namely, logistic regression, decision tree and naive Bayes classifier were used to analyse, classify and predict loan eligibility.
The training data set was randomly split into two data sets in a 60:40 ratio, for model selection and evaluation, respectively.
Loan eligibility was then predicted for the test data set.
Results
Logistic regression
- The logistic regression model predicted the loan eligibility based on credit history, marital status and education. A good credit history multiplies the odds of approval by roughly 45, holding the other predictors constant.
Decision tree
- The decision tree model predicted the loan eligibility solely based on credit history, resulting in a highly biased model.
Naive-Bayes classifier
- The naive Bayes classifier model predicted the loan eligibility based on conditional probability across all predictor variables. Note the trade-offs between true positive rate and true negative rate.
| Model | Accuracy % | True positive rate % | True negative rate % |
|---|---|---|---|
| Logistic regression | 82 | 97 | 48 |
| Decision tree | 82 | 97 | 48 |
| Naive-Bayes classifier | 82 | 96 | 51 |
Conclusion
From an interpretability and performance standpoint, the logistic regression model is preferable to the decision tree and naive Bayes classifier models.
Additionally, both the logistic regression and naive Bayes classifier models could be trained further with additional variables, such as applicant and coapplicant age, assets, pre-existing debt, and coapplicant credit history, to assess the effect on the models' performance.
Technical notes
Initial hypotheses
The hypotheses in this section were generated from the objective of this paper and prior knowledge of the subject matter. They were defined before analysing the data, so that the analysis in the following sections can be read against them.
Prior knowledge suggests that a variety of factors affect the odds of home loan eligibility. Some of the key factors are the applicant's credit history, disposable income, assets, age and education.
- Credit history: If the credit history is good, the odds of home loan eligibility are high.
- Income: The higher the disposable income, the higher the odds of home loan eligibility.
- Assets: The higher the value of assets, the higher the odds of home loan eligibility.
- Age: Early to middle-aged adults have higher potential for employment than senior adults, and hence higher odds of home loan eligibility.
- Education: Highly educated adults have higher earning potential, which increases the odds of home loan eligibility.
In summary, the initial assumption is that the key factors affecting the odds of home loan eligibility are credit history, disposable income, assets, age and education.
Exploratory data analysis
Descriptive statistics of the training and test data sets
Data Frame Summary
trainR
Dimensions: 614 x 13
Duplicates: 0
| No | Variable | Stats / Values | Freqs (% of Valid) | Missing |
|---|---|---|---|---|
| 1 | Loan_ID [factor] | 1. LP001002 2. LP001003 3. LP001005 4. LP001006 [ 610 others ] | | 0 (0%) |
| 2 | Gender [factor] | 1. Female 2. Male | | 13 (2.12%) |
| 3 | Married [factor] | 1. No 2. Yes | | 3 (0.49%) |
| 4 | Dependents [factor] | 1. 0 2. 1 3. 2 4. 3+ | | 15 (2.44%) |
| 5 | Education [factor] | 1. Graduate 2. Not Graduate | | 0 (0%) |
| 6 | Self_Employed [factor] | 1. No 2. Yes | | 32 (5.21%) |
| 7 | ApplicantIncome [integer] | Mean (sd) : 5403.5 (6109); min < med < max: 150 < 3812.5 < 81000; IQR (CV) : 2917.5 (1.1) | 505 distinct values | 0 (0%) |
| 8 | CoapplicantIncome [numeric] | Mean (sd) : 1621.2 (2926.2); min < med < max: 0 < 1188.5 < 41667; IQR (CV) : 2297.2 (1.8) | 287 distinct values | 0 (0%) |
| 9 | LoanAmount [integer] | Mean (sd) : 146.4 (85.6); min < med < max: 9 < 128 < 700; IQR (CV) : 68 (0.6) | 203 distinct values | 22 (3.58%) |
| 10 | Loan_Amount_Term [integer] | Mean (sd) : 342 (65.1); min < med < max: 12 < 360 < 480; IQR (CV) : 0 (0.2) | 10 distinct values | 14 (2.28%) |
| 11 | Credit_History [integer] | Min : 0; Mean : 0.8; Max : 1 | | 50 (8.14%) |
| 12 | Property_Area [factor] | 1. Rural 2. Semiurban 3. Urban | | 0 (0%) |
| 13 | Loan_Status [factor] | 1. N 2. Y | | 0 (0%) |
The summary output above is for the training data set.
About the data set:
- There are 614 sample observations and 13 variables in the data set.
- There are six (6) categorical variables, namely, Gender, Married, Dependents, Education, Self Employed, and Property Area.
- There are five (5) numeric variables, namely, Applicant Income, Co-applicant Income, Loan Amount, Loan Amount Term and Credit History.
- The response variable, Loan Status, is a two-level factor. Loan Status Y indicates the loan was approved, and N, not approved.
- The Loan ID variable is an identifier variable, and will not be of any value for this analysis.
Missing data:
- Categorical variables: Four (4) variables have missing data, namely, Gender, Married, Dependents and Self Employed.
- Numeric variables: Three (3) variables have missing data, namely, Loan Amount, Loan Amount Term and Credit History.
Class distribution:
- Gender: 81% were males, and 19% were females.
- Marital status: 65% were married, and 35% were unmarried.
- Dependents: 58% had no dependents, 17% had one dependent, 17% had two dependents, and 8% had three or more dependents.
- Education: 78% had a graduate degree, and 22% had no graduate-level qualification.
- Employment type: 86% were not self-employed, and 14% were self-employed.
- Credit History: 84% had good credit history, and 16% had poor credit history.
- Property Location: 38% were semi-urban, 33% were urban, and 29% were rural.
Response variable:
- Loan Status: 69% of the home loans were approved, and 31% were rejected.
Data Frame Summary
testR
Dimensions: 367 x 12
Duplicates: 0
| No | Variable | Stats / Values | Freqs (% of Valid) | Missing |
|---|---|---|---|---|
| 1 | Loan_ID [factor] | 1. LP001015 2. LP001022 3. LP001031 4. LP001035 [ 363 others ] | | 0 (0%) |
| 2 | Gender [factor] | 1. Female 2. Male | | 11 (3%) |
| 3 | Married [factor] | 1. No 2. Yes | | 0 (0%) |
| 4 | Dependents [factor] | 1. 0 2. 1 3. 2 4. 3+ | | 10 (2.72%) |
| 5 | Education [factor] | 1. Graduate 2. Not Graduate | | 0 (0%) |
| 6 | Self_Employed [factor] | 1. No 2. Yes | | 23 (6.27%) |
| 7 | ApplicantIncome [integer] | Mean (sd) : 4805.6 (4910.7); min < med < max: 0 < 3786 < 72529; IQR (CV) : 2196 (1) | 314 distinct values | 0 (0%) |
| 8 | CoapplicantIncome [integer] | Mean (sd) : 1569.6 (2334.2); min < med < max: 0 < 1025 < 24000; IQR (CV) : 2430.5 (1.5) | 194 distinct values | 0 (0%) |
| 9 | LoanAmount [integer] | Mean (sd) : 136.1 (61.4); min < med < max: 28 < 125 < 550; IQR (CV) : 57.8 (0.5) | 144 distinct values | 5 (1.36%) |
| 10 | Loan_Amount_Term [integer] | Mean (sd) : 342.5 (65.2); min < med < max: 6 < 360 < 480; IQR (CV) : 0 (0.2) | 12 distinct values | 6 (1.63%) |
| 11 | Credit_History [integer] | Min : 0; Mean : 0.8; Max : 1 | | 29 (7.9%) |
| 12 | Property_Area [factor] | 1. Rural 2. Semiurban 3. Urban | | 0 (0%) |
The summary output above is for the test data set.
About the data set:
- There are 367 sample observations and 12 variables in the data set.
- There are six (6) categorical variables, namely, Gender, Married, Dependents, Education, Self Employed, and Property Area.
- There are five (5) numeric variables, namely, Applicant Income, Co-applicant Income, Loan Amount, Loan Amount Term and Credit History.
- The response variable, Loan Status, should be predicted using the best model.
- The Loan ID variable is an identifier variable, and will not be of any value for this analysis.
Missing data:
- Categorical variables: Three (3) variables have missing data, namely, Gender, Dependents and Self Employed.
- Numeric variables: Three (3) variables have missing data, namely, Loan Amount, Loan Amount Term and Credit History.
Class distribution:
- Gender: 80% were males, and 20% were females.
- Marital status: 64% were married, and 36% were unmarried.
- Dependents: 56% had no dependents, 16% had one dependent, 17% had two dependents, and 11% had three or more dependents.
- Education: 77% had a graduate degree, and 23% had no graduate-level qualification.
- Employment type: 89% were not self-employed, and 11% were self-employed.
- Credit History: 83% had good credit history, and 17% had poor credit history.
- Property Location: 38% were urban, 32% were semi-urban, and 30% were rural.
Visualisation of the data
Univariate analysis
- The histograms and box-plots above represent the distribution, shape and spread of the numeric variables Applicant Income, Coapplicant Income, and Loan Amount, from the training data set.
Distribution and skewness:
- Income: Neither Applicant Income nor Coapplicant Income is Normally distributed. Both variables are strongly right-skewed.
- Loan Amount: The Loan Amount variable is also right-skewed.
Outliers:
- Income: There are a couple of outliers with very high income in both Applicant Income and Coapplicant Income variables.
Therefore, it is sensible to apply an appropriate transformation to these variables.
Bivariate analysis
The bar-plots above show the bivariate analysis between categorical variables and loan status from the training data set.
Row 1 bar-plots:
- Marital Status: Married applicants have a higher proportion of approved loans.
- Dependents: Applicants with two (2) dependents have a higher proportion of approved loans.
- Gender: Both genders have similar proportions of approved and unapproved loans.
Row 2 bar-plots:
- Credit History: Applicants with a good credit history have a higher proportion of approved loans.
- Education: Applicants with a graduate degree have a higher proportion of approved loans.
- Employment Type: The proportions of approved and unapproved loans appear to be similar for both groups.
Row 3 bar-plots:
- Property Area: Applicants in semi-urban areas have a higher proportion of approved loans.
- Income Group: The applicant income and coapplicant income were summed and binned as Income Group (Low < 2500, Average < 4000, High < 6000, Affluent < 81000). The Average, High and Affluent groups have a higher proportion of approved loans, suggesting that the higher the household income, the higher the chances of obtaining loan approval.
- Loan Amount Group: The Loan Amount was binned into three groups (Low < 100, Average < 200, High < 700). The Low and Average loan amount groups have a higher proportion of approved loans.
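The binning described in the Row 3 bullets can be sketched as follows. This is a minimal illustration rather than the code behind the plots: the function names are hypothetical, and the handling of values at or above the last cut-off is an assumption, since the text only gives upper bounds.

```python
from bisect import bisect_right

# Upper cut-offs taken from the text; each group covers values strictly below its cut-off.
INCOME_EDGES = [2500, 4000, 6000]
INCOME_LABELS = ["Low", "Average", "High", "Affluent"]
LOAN_EDGES = [100, 200]
LOAN_LABELS = ["Low", "Average", "High"]


def to_group(value, edges, labels):
    """Return the label of the first bin whose upper cut-off exceeds value."""
    # Assumption: values at or above the last cut-off fall into the top group.
    return labels[bisect_right(edges, value)]


def income_group(applicant_income, coapplicant_income):
    """Bin the household income (applicant plus coapplicant income)."""
    return to_group(applicant_income + coapplicant_income, INCOME_EDGES, INCOME_LABELS)


def loan_amount_group(loan_amount):
    """Bin the loan amount into Low, Average or High."""
    return to_group(loan_amount, LOAN_EDGES, LOAN_LABELS)
```

Using `bisect_right` keeps the "strictly below the cut-off" convention, so a household income of exactly 2500 lands in the Average group rather than the Low group.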
In summary, the variables credit history, marital status, education, property area, income and loan amount appear to affect the home loan eligibility.
Correlation analysis
The heatmap above shows the correlation between numeric variables in the training data set. The variables Applicant Income and Loan Amount are correlated (0.5), and Credit History and Loan Status are correlated (0.53).
Therefore, it is sensible to apply an appropriate transformation to the Applicant Income, Coapplicant Income and Loan Amount variables.
Data cleaning
Missing data treatment
## Loan_ID 0
## Gender 0
## Married 0
## Dependents 0
## Education 0
## Self_Employed 0
## ApplicantIncome 0
## CoapplicantIncome 0
## LoanAmount 0
## Loan_Amount_Term 0
## Credit_History 0
## Property_Area 0
## dtype: int64
The summary output above shows that the missing data in the training and test data sets were imputed, and that neither data set has any remaining missing values.
Missing values in the categorical variables were imputed with the mode. Missing values in the numeric variable Loan Amount were imputed with the median, and those in Loan Amount Term with the mode.
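The imputation rules above can be sketched with the Python standard library. The helper name `impute` and the toy columns are hypothetical and only illustrate the mode and median rules; they are not the code or data used in the analysis.

```python
from statistics import median, mode


def impute(values, strategy):
    """Replace None entries with the mode or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mode(observed) if strategy == "mode" else median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical toy columns; None marks a missing entry.
gender = ["Male", None, "Female", "Male"]
loan_amount = [128, None, 66, 120, 141]
loan_amount_term = [360, 360, None, 180]

gender_filled = impute(gender, "mode")                      # categorical: mode
loan_amount_filled = impute(loan_amount, "median")          # Loan Amount: median
loan_amount_term_filled = impute(loan_amount_term, "mode")  # Loan Amount Term: mode
```

The median is a natural choice for Loan Amount because the variable is right-skewed, while Loan Amount Term takes only a handful of distinct values, which makes the mode the more sensible fill value.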
Feature engineering
Multicollinearity treatment
- The variables Applicant Income and Coapplicant Income can be summed and merged as Total Income.
- \(Total\,Income_i = Applicant\,Income_i + Coapplicant\,Income_i\).
- The variable Loan Amount can be divided by Loan Amount Term to create EMI (equated monthly installment).
- \(EMI_i = Loan\,Amount_i / Loan\,Amount\,Term_i\).
Skewness treatment
As seen earlier in the univariate analysis, the income distribution was right-skewed, so it is sensible to log-transform the newly created Total Income variable.
Likewise, because the Loan Amount variable was right-skewed, it is sensible to log-transform the EMI variable.
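The two derived variables and their log-transformation can be sketched as below. The function name and the example applicant are hypothetical, and the formulas are applied exactly as written above, without any unit rescaling of Loan Amount.

```python
import math


def engineer_features(applicant_income, coapplicant_income, loan_amount, loan_amount_term):
    """Return (log Total Income, log EMI) for one applicant."""
    total_income = applicant_income + coapplicant_income
    emi = loan_amount / loan_amount_term
    # Natural logarithm; both quantities must be strictly positive.
    return math.log(total_income), math.log(emi)


# Hypothetical applicant, for illustration only.
log_total_income, log_emi = engineer_features(4000, 1500, 128, 360)
```

A Total Income of zero would need separate handling, since the logarithm is undefined there; the test data set has a minimum Applicant Income of 0, so this case may be worth checking.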
Inspect feature engineering effects
- Following log-transformation, the histograms above show that the Total Income and EMI variables are approximately Normally distributed.
The heatmap above shows the correlation between newly created and log-transformed variables, Total Income and EMI, and existing variables Credit History and Loan Status.
In summary, there are no major concerns with the feature engineering and transformation.
Finally, the training and test data sets are ready for model selection, evaluation and prediction.
Model selection, evaluation and prediction
- The training data set was split into two data sets in a 60:40 ratio. The former was used for model selection and the latter for model evaluation.
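A random 60:40 split of row indices can be sketched as follows. The function name, the seed and the rounding rule are assumptions. Rounding up is used because it gives 369 selection rows and 245 evaluation rows for 614 observations, which matches the case counts in the model output and evaluation tables.

```python
import math
import random


def split_indices(n_rows, train_fraction=0.6, seed=1):
    """Randomly partition row indices into selection and evaluation sets."""
    rng = random.Random(seed)  # arbitrary seed, used only for reproducibility
    indices = list(range(n_rows))
    rng.shuffle(indices)
    # Assumption: the size of the selection set is rounded up.
    n_selection = math.ceil(train_fraction * n_rows)
    return indices[:n_selection], indices[n_selection:]


selection_idx, evaluation_idx = split_indices(614)
```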
Model: Logistic regression
Model selection
##
## Call:
## glm(formula = Loan_Status ~ Married + Education + Credit_History,
## family = "binomial", data = trainSet1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9196 -0.3736 0.5874 0.7326 2.3843
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.7823 0.5081 -5.477 4.34e-08 ***
## MarriedYes 0.6468 0.2800 2.310 0.0209 *
## EducationNot Graduate -0.4914 0.3167 -1.552 0.1207
## Credit_History1 3.8053 0.4906 7.756 8.77e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 469.75 on 368 degrees of freedom
## Residual deviance: 340.92 on 365 degrees of freedom
## AIC: 348.92
##
## Number of Fisher Scoring iterations: 5
Using the AIC (Akaike Information Criterion) to compare candidate models, the logistic regression model summarised above was selected because it has the lowest AIC.
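As a quick consistency check on the output above: for a binary response, the residual deviance equals minus twice the log-likelihood, so the AIC is the residual deviance plus twice the number of estimated coefficients. The helper name below is hypothetical.

```python
def aic_from_deviance(residual_deviance, n_parameters):
    """AIC = -2 * log-likelihood + 2 * k, where deviance = -2 * log-likelihood for binary data."""
    return residual_deviance + 2 * n_parameters


# Values read from the summary output: residual deviance 340.92, four coefficients.
aic = aic_from_deviance(340.92, 4)
```

The result agrees with the reported AIC of 348.92.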
Model interpretation:
Based on the output above,
The variable Credit History is highly significant (P-value ~ 0), suggesting that a good credit history multiplies the odds of loan approval by about 45 (exp(3.8053) ≈ 44.9), holding all other variables constant.
The variable Married is significant (P-value ~ 0.02), suggesting that being married multiplies the odds of loan approval by about 1.9 (exp(0.6468) ≈ 1.91), holding all other variables constant.
The variable Education is not significant at the 5% level (P-value ~ 0.12), but it is part of the model with the lowest AIC.
Model equation:
\(\log(Odds_i) = \beta_0 + \beta_1 \cdot Married_{Yes,i} + \beta_2 \cdot Education_{NG,i} + \beta_3 \cdot Credit\,History_i\)
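The interpretation and equation above can be sketched in code using the coefficients as printed in the summary output. The function names are hypothetical. Because the coefficients are used to four decimals, the odds ratios may differ in the second decimal place from values obtained with more heavily rounded coefficients.

```python
import math

# Coefficients as printed in the summary output.
COEF = {
    "intercept": -2.7823,
    "married_yes": 0.6468,
    "education_not_graduate": -0.4914,
    "credit_history": 3.8053,
}


def odds_ratio(name):
    """Multiplicative change in the odds when the predictor goes from 0 to 1."""
    return math.exp(COEF[name])


def approval_probability(married, not_graduate, good_credit):
    """Inverse logit of the linear predictor; each argument is 0 or 1."""
    log_odds = (COEF["intercept"]
                + COEF["married_yes"] * married
                + COEF["education_not_graduate"] * not_graduate
                + COEF["credit_history"] * good_credit)
    return 1.0 / (1.0 + math.exp(-log_odds))
```

For example, `approval_probability(1, 0, 1)` is roughly 0.84 for a married graduate with a good credit history, against roughly 0.11 from `approval_probability(1, 0, 0)` for the same applicant with a poor credit history.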
Deviance as a goodness-of-fit statistic:
- There is no evidence (P-value ~ 0.81) to suggest lack-of-fit.
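The P-value above can be checked approximately by comparing the residual deviance with a chi-square distribution on the residual degrees of freedom. The sketch below uses the Wilson-Hilferty normal approximation so that it needs only the standard library. This is an approximation chosen for illustration and not necessarily how the reported value was computed.

```python
from statistics import NormalDist


def chi2_upper_tail(x, df):
    """Approximate P(X > x) for X ~ chi-square(df) using the Wilson-Hilferty approximation."""
    mean = 1.0 - 2.0 / (9.0 * df)
    sd = (2.0 / (9.0 * df)) ** 0.5
    z = ((x / df) ** (1.0 / 3.0) - mean) / sd
    return 1.0 - NormalDist().cdf(z)


# Residual deviance and degrees of freedom from the summary output.
p_value = chi2_upper_tail(340.92, 365)
```

The approximation gives a value close to the reported P-value of about 0.81.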
Model evaluation
| | Predicted No | Predicted Yes |
|---|---|---|
| Actual No | 37 | 40 |
| Actual Yes | 5 | 163 |
| Specificity | Sensitivity | Prediction Error |
|---|---|---|
| 0.48 | 0.97 | 0.18 |
The classification accuracy table above shows three key metrics:
Prediction error: The estimated prediction error is 0.18, i.e., 18% of the evaluation cases were misclassified.
Sensitivity: The estimated sensitivity is 0.97, i.e., the model’s true positive rate is 97%.
Specificity: The estimated specificity is 0.48, i.e., the model’s true negative rate is 48%.
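The three metrics can be recomputed from the confusion matrix counts. The sketch assumes that the rows of the matrix are the actual classes and the columns are the predicted classes, which is the reading consistent with the reported sensitivity. The function name is hypothetical.

```python
def classification_metrics(tn, fp, fn, tp):
    """Specificity, sensitivity and prediction error from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "specificity": tn / (tn + fp),          # true negative rate
        "sensitivity": tp / (tp + fn),          # true positive rate
        "prediction_error": (fp + fn) / total,  # misclassification rate
    }


# Counts from the logistic regression evaluation table,
# assuming rows are actual classes and columns are predicted classes.
metrics = classification_metrics(tn=37, fp=40, fn=5, tp=163)
rounded = {name: round(value, 2) for name, value in metrics.items()}
```

With these counts the specificity is 37/77, the sensitivity is 163/168 and the prediction error is 45/245.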
Model prediction
| Status | Count |
|---|---|
| No | 59 |
| Yes | 308 |
- The summary output above shows the model’s predictions for the test data set.
Model: Decision tree
Model selection
##
## Call:
## C5.0.formula(formula = Loan_Status ~ ., data = trainSet1DT)
##
##
## C5.0 [Release 2.07 GPL Edition] Mon Dec 28 00:10:56 2020
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 369 cases (10 attributes) from undefined.data
##
## Decision tree:
##
## Credit_History = 0: N (64/5)
## Credit_History = 1: Y (305/64)
##
##
## Evaluation on training data (369 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 2 69(18.7%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 59 64 (a): class N
## 5 241 (b): class Y
##
##
## Attribute usage:
##
## 100.00% Credit_History
##
##
## Time: 0.0 secs
- The decision tree model summary output above shows that loan eligibility can be predicted from the applicant’s credit history alone. However, a tree with a single split ignores every other predictor, which leads to a highly biased model.
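The fitted tree is a single-split rule, which can be written out directly. The sketch below also recomputes the training error from the leaf summaries in the output, where each leaf is reported as (cases/misclassified). The function and variable names are hypothetical.

```python
def predict_loan_status(credit_history):
    """Single-split rule from the tree output: poor history -> N, good history -> Y."""
    return "Y" if credit_history == 1 else "N"


# Leaf summaries from the output, written as (cases, misclassified).
leaves = {"Credit_History = 0": (64, 5), "Credit_History = 1": (305, 64)}

total_cases = sum(cases for cases, _ in leaves.values())
total_errors = sum(errors for _, errors in leaves.values())
training_error = total_errors / total_cases
```

The totals reproduce the 69 errors in 369 cases (18.7%) reported in the evaluation on training data.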
Model evaluation
| | Predicted No | Predicted Yes |
|---|---|---|
| Actual No | 37 | 40 |
| Actual Yes | 5 | 163 |
| Specificity | Sensitivity | Prediction Error |
|---|---|---|
| 0.48 | 0.97 | 0.18 |
The classification accuracy table above shows three key metrics:
Prediction error: The estimated prediction error is 0.18, i.e., 18% of the evaluation cases were misclassified.
Sensitivity: The estimated sensitivity is 0.97, i.e., the model’s true positive rate is 97%.
Specificity: The estimated specificity is 0.48, i.e., the model’s true negative rate is 48%.
Model prediction
| Loan status | Count |
|---|---|
| N | 59 |
| Y | 308 |
- The summary output above shows the model’s predictions for the test data set.
Model: Naive-Bayes classifier
Model selection
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = trainSet1DT[, c(1:7, 9:10)], y = trainSet1DT[,
## c(8)])
##
## A-priori probabilities:
## trainSet1DT[, c(8)]
## N Y
## 0.3333333 0.6666667
##
## Conditional probabilities:
## Gender
## trainSet1DT[, c(8)] Female Male
## N 0.1951220 0.8048780
## Y 0.1463415 0.8536585
##
## Married
## trainSet1DT[, c(8)] No Yes
## N 0.4390244 0.5609756
## Y 0.2845528 0.7154472
##
## Dependents
## trainSet1DT[, c(8)] 0 1 2 3+
## N 0.60975610 0.19512195 0.12195122 0.07317073
## Y 0.57317073 0.15853659 0.18292683 0.08536585
##
## Education
## trainSet1DT[, c(8)] Graduate Not Graduate
## N 0.7479675 0.2520325
## Y 0.8089431 0.1910569
##
## Self_Employed
## trainSet1DT[, c(8)] No Yes
## N 0.8943089 0.1056911
## Y 0.8658537 0.1341463
##
## Credit_History
## trainSet1DT[, c(8)] 0 1
## N 0.4796748 0.5203252
## Y 0.0203252 0.9796748
##
## Property_Area
## trainSet1DT[, c(8)] Rural Semiurban Urban
## N 0.3414634 0.2845528 0.3739837
## Y 0.2845528 0.3780488 0.3373984
##
## logTotInc
## trainSet1DT[, c(8)] [,1] [,2]
## N 8.650276 0.5882107
## Y 8.689458 0.4956212
##
## logEMI
## trainSet1DT[, c(8)] [,1] [,2]
## N 5.965753 0.5636649
## Y 5.991300 0.5614327
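To illustrate how the classifier combines the probabilities printed above, the sketch below computes the posterior probability of approval from the prior and two of the nine predictors. Restricting the calculation to Credit History and Married is a simplification made here for brevity. The full model multiplies the conditional probabilities of all predictors, using Normal densities for logTotInc and logEMI, for which the output lists a mean and a standard deviation per class.

```python
# Probabilities as printed in the output above, rounded to four decimals.
PRIOR = {"N": 0.3333, "Y": 0.6667}
P_CREDIT = {"N": {0: 0.4797, 1: 0.5203}, "Y": {0: 0.0203, 1: 0.9797}}
P_MARRIED = {"N": {"No": 0.4390, "Yes": 0.5610}, "Y": {"No": 0.2846, "Yes": 0.7154}}


def posterior_approved(credit_history, married):
    """P(Loan_Status = Y | credit history, marital status) under the naive Bayes assumption."""
    joint = {
        status: PRIOR[status] * P_CREDIT[status][credit_history] * P_MARRIED[status][married]
        for status in ("N", "Y")
    }
    return joint["Y"] / (joint["N"] + joint["Y"])
```

For a married applicant, this two-predictor posterior is about 0.83 with a good credit history and about 0.10 with a poor one, which shows how strongly Credit History drives the classification.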
Model evaluation
| | Predicted No | Predicted Yes |
|---|---|---|
| Actual No | 39 | 38 |
| Actual Yes | 7 | 161 |
| Specificity | Sensitivity | Prediction Error |
|---|---|---|
| 0.51 | 0.96 | 0.18 |
The classification accuracy table above shows three key metrics:
Prediction error: The estimated prediction error is 0.18, i.e., 18% of the evaluation cases were misclassified.
Sensitivity: The estimated sensitivity is 0.96, i.e., the model’s true positive rate is 96%.
Specificity: The estimated specificity is 0.51, i.e., the model’s true negative rate is 51%.
Model prediction
| Loan status | Count |
|---|---|
| N | 65 |
| Y | 302 |
- The summary output above shows the model’s predictions for the test data set.
Model comparison
The receiver operating characteristic (ROC) plot above compares the three models on their classification performance.
Using the area-under-curve (AUC) statistic, the naive Bayes classifier model is slightly better than the logistic regression and decision tree models. However, the differences between the models are small.
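As a rough cross-check of this ranking, the AUC can be computed from the evaluation tables under the assumption that each classifier outputs class labels only. In that case the ROC curve has a single interior point and the AUC reduces to the mean of sensitivity and specificity. The plot above may have been built from predicted probabilities instead, so these values are only indicative. The function name is hypothetical.

```python
def auc_from_hard_labels(tn, fp, fn, tp):
    """Area under the one-point ROC curve of a classifier that outputs class labels only."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Trapezoid through (0, 0), (1 - specificity, sensitivity) and (1, 1).
    return (sensitivity + specificity) / 2


# Counts from the three evaluation tables (rows assumed actual, columns predicted).
auc = {
    "logistic_regression": auc_from_hard_labels(37, 40, 5, 163),
    "decision_tree": auc_from_hard_labels(37, 40, 5, 163),
    "naive_bayes": auc_from_hard_labels(39, 38, 7, 161),
}
```

Under this assumption the naive Bayes classifier comes out marginally ahead of the other two models, which have identical evaluation tables.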
References
Loan prediction, Anonymous, 2016, Data set, https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/.
summarytools, Dominic Comtois, 2020, R package, https://cran.r-project.org/web/packages/summarytools/index.html.
ggplot2, Hadley Wickham, 2016, R package, https://ggplot2.tidyverse.org.
gridExtra, Baptiste Auguie, Anton Antonov, 2017, R package, https://cran.r-project.org/web/packages/gridExtra/index.html.
knitr, Several, 2020, R package, https://cran.r-project.org/web/packages/knitr/index.html.
dplyr, Several, 2020, R package, https://cran.r-project.org/web/packages/dplyr/index.html.
ggpubr, Alboukadel Kassambara, 2020, R package, https://cran.csiro.au/web/packages/ggpubr/ggpubr.pdf.
ggcorrplot, Alboukadel Kassambara, 2019, R package, https://cran.r-project.org/web/packages/ggcorrplot/index.html.
tidyverse, Hadley Wickham, 2019, R package, https://cran.r-project.org/web/packages/tidyverse/index.html.
C50, Several, 2020, R package, https://cran.r-project.org/web/packages/C50/index.html.
e1071, Several, 2019, R package, https://cran.r-project.org/web/packages/e1071/index.html.
pROC, Several, 2020, R package, https://cran.r-project.org/web/packages/pROC/index.html.
pandas, Several, 2020, Python module, https://pandas.pydata.org/about/citing.html.
matplotlib, J.D. Hunter, 2007, Python module, https://matplotlib.org.
seaborn, Several, 2017, Python module, https://seaborn.pydata.org.
prettydoc, Several, 2020, R package, https://cran.r-project.org/web/packages/prettydoc/vignettes/cayman.html.
Keywords
- loan prediction, multivariate, machine learning, supervised learning, classification, prediction, logistic regression, decision tree, naive Bayes classifier.