Home loan eligibility: What are the odds?

Abstract
Executive summary
- Aim
- Methodology
- Results
- Conclusion
Technical notes
References
Keywords

Abstract

Predicting loan eligibility is a topic of active interest. Most retail credit lending firms use digital platforms that let customers check their loan eligibility from information provided online. In this paper, I present a machine learning approach to predicting home loan eligibility.

Executive summary

Aim

The aim is to explore and understand the relationships between the variables that influence home loan eligibility and, in particular, to select a machine learning model to predict home loan eligibility.

Methodology

Data

Two cross-sectional multivariate data sets, each with twelve (12) predictor variables (counting the Loan ID identifier, which is not used for modelling), were used for this analysis. The currency used in these data sets was INR.
The training data set had 614 sample observations with twelve (12) predictor variables and one (1) response variable, loan status.
The test data set had 367 sample observations with twelve (12) predictor variables.

Technique(s)

Supervised machine learning.
Three machine learning techniques, namely logistic regression, decision tree and naive Bayes classifier, were used to analyse, classify and predict loan eligibility.
The training data set was randomly split into two data sets with a ratio of 60:40 for model selection and evaluation, respectively.
Loan eligibility was then predicted for the test data set.

Results

Logistic regression

The logistic regression model predicted the loan eligibility based on credit history, marital status and education. The odds of approval are roughly 45 times higher when an applicant has good credit history, holding the other variables constant.

Decision tree

The decision tree model predicted the loan eligibility solely based on credit history, resulting in a highly biased model.

Naive-Bayes classifier

The naive Bayes classifier model predicted the loan eligibility based on conditional probability across all predictor variables. Note the trade-offs between true positive rate and true negative rate.

Model comparison
Model	Accuracy %	True positive rate %	True negative rate %
Logistic regression	82	97	48
Decision tree	82	97	48
Naive-Bayes classifier	82	96	51

Conclusion

From the standpoint of model interpretability and performance, the logistic regression model is preferable to the decision tree and naive Bayes classifier models.
Additionally, both the logistic regression and naive Bayes classifier models could be trained further with additional variables pertaining to applicant and coapplicant age, assets, pre-existing debt and coapplicant credit history, to assess the models’ performance.

Technical notes

Initial hypotheses

The hypotheses in this section were generated from the objective of this paper and prior knowledge of the subject matter. They were defined before analysing the data, to help frame the data analysis in the forthcoming sections.

Based on prior knowledge, a variety of factors affect the odds of home loan eligibility. Some of the key factors that might affect these odds are the applicant’s credit history, disposable income, assets, age and education.

Credit history: If the credit history is good, then the odds of home loan eligibility are high.
Income: The higher the disposable income, the higher the odds of home loan eligibility.
Assets: The higher the value of assets, the higher the odds of home loan eligibility.
Age: Early to middle-aged adults have higher potential for employment than senior adults, and hence higher odds of home loan eligibility.
Education: Highly educated adults have higher earning potential, which increases the odds of home loan eligibility.

In summary, the initial assumption is that the key factors that might affect the odds of home loan eligibility are credit history, disposable income, assets, age and education.

Exploratory data analysis

Descriptive statistics of the training and test data sets

Data Frame Summary

trainR
Dimensions: 614 x 13
Duplicates: 0

Variable	Stats / Values	Freqs (% of Valid)	Missing
Loan_ID [factor]	1. LP001002; 2. LP001003; 3. LP001005; 4. LP001006; [ 610 others ]	1 (0.2%); 1 (0.2%); 1 (0.2%); 1 (0.2%); 610 (99.4%)	0 (0%)
Gender [factor]	1. Female; 2. Male	112 (18.6%); 489 (81.4%)	13 (2.12%)
Married [factor]	1. No; 2. Yes	213 (34.9%); 398 (65.1%)	3 (0.49%)
Dependents [factor]	1. 0; 2. 1; 3. 2; 4. 3+	345 (57.6%); 102 (17.0%); 101 (16.9%); 51 (8.5%)	15 (2.44%)
Education [factor]	1. Graduate; 2. Not Graduate	480 (78.2%); 134 (21.8%)	0 (0%)
Self_Employed [factor]	1. No; 2. Yes	500 (85.9%); 82 (14.1%)	32 (5.21%)
ApplicantIncome [integer]	Mean (sd): 5403.5 (6109); min < med < max: 150 < 3812.5 < 81000; IQR (CV): 2917.5 (1.1)	505 distinct values	0 (0%)
CoapplicantIncome [numeric]	Mean (sd): 1621.2 (2926.2); min < med < max: 0 < 1188.5 < 41667; IQR (CV): 2297.2 (1.8)	287 distinct values	0 (0%)
LoanAmount [integer]	Mean (sd): 146.4 (85.6); min < med < max: 9 < 128 < 700; IQR (CV): 68 (0.6)	203 distinct values	22 (3.58%)
Loan_Amount_Term [integer]	Mean (sd): 342 (65.1); min < med < max: 12 < 360 < 480; IQR (CV): 0 (0.2)	10 distinct values	14 (2.28%)
Credit_History [integer]	Min: 0; Mean: 0.8; Max: 1	0: 89 (15.8%); 1: 475 (84.2%)	50 (8.14%)
Property_Area [factor]	1. Rural; 2. Semiurban; 3. Urban	179 (29.1%); 233 (38.0%); 202 (32.9%)	0 (0%)
Loan_Status [factor]	1. N; 2. Y	192 (31.3%); 422 (68.7%)	0 (0%)

The summary output above is for the training data set.

About the data set:

There are 614 sample observations and 13 variables in the data set.
There are six (6) categorical variables, namely, Gender, Married, Dependents, Education, Self Employed, and Property Area.
There are five (5) numeric variables, namely, Applicant Income, Co-applicant Income, Loan Amount, Loan Amount Term and Credit History.
The response variable, Loan Status, is a two-level factor in the raw data: Y indicates the loan was approved, and N that it was not.
The Loan ID variable is an identifier variable, and will not be of any value for this analysis.

Missing data:

Categorical variables: Four (4) variables have missing data, namely, Gender, Married, Dependents and Self Employed.
Numeric variables: Three (3) variables have missing data, namely, Loan Amount, Loan Amount Term and Credit History.

Class distribution:

Gender: 81% were males, and 19% were females.
Marital status: 65% were married, and 35% were unmarried.
Dependents: 58% had no dependents, 17% had one dependent, 17% had two dependents, and 8% had three or more dependents.
Education: 78% had a graduate degree, and 22% had no graduate-level qualification.
Employment type: 86% were not self-employed, and 14% were self-employed.
Credit History: 84% had good credit history, and 16% had poor credit history.
Property Location: 38% were semi-urban, 33% were urban, and 29% were rural.

Response variable:

Loan Status: 69% of the home loans were approved, and 31% were rejected.

Data Frame Summary

testR
Dimensions: 367 x 12
Duplicates: 0

Variable	Stats / Values	Freqs (% of Valid)	Missing
Loan_ID [factor]	1. LP001015; 2. LP001022; 3. LP001031; 4. LP001035; [ 363 others ]	1 (0.3%); 1 (0.3%); 1 (0.3%); 1 (0.3%); 363 (98.9%)	0 (0%)
Gender [factor]	1. Female; 2. Male	70 (19.7%); 286 (80.3%)	11 (3%)
Married [factor]	1. No; 2. Yes	134 (36.5%); 233 (63.5%)	0 (0%)
Dependents [factor]	1. 0; 2. 1; 3. 2; 4. 3+	200 (56.0%); 58 (16.2%); 59 (16.5%); 40 (11.2%)	10 (2.72%)
Education [factor]	1. Graduate; 2. Not Graduate	283 (77.1%); 84 (22.9%)	0 (0%)
Self_Employed [factor]	1. No; 2. Yes	307 (89.2%); 37 (10.8%)	23 (6.27%)
ApplicantIncome [integer]	Mean (sd): 4805.6 (4910.7); min < med < max: 0 < 3786 < 72529; IQR (CV): 2196 (1)	314 distinct values	0 (0%)
CoapplicantIncome [integer]	Mean (sd): 1569.6 (2334.2); min < med < max: 0 < 1025 < 24000; IQR (CV): 2430.5 (1.5)	194 distinct values	0 (0%)
LoanAmount [integer]	Mean (sd): 136.1 (61.4); min < med < max: 28 < 125 < 550; IQR (CV): 57.8 (0.5)	144 distinct values	5 (1.36%)
Loan_Amount_Term [integer]	Mean (sd): 342.5 (65.2); min < med < max: 6 < 360 < 480; IQR (CV): 0 (0.2)	12 distinct values	6 (1.63%)
Credit_History [integer]	Min: 0; Mean: 0.8; Max: 1	0: 59 (17.5%); 1: 279 (82.5%)	29 (7.9%)
Property_Area [factor]	1. Rural; 2. Semiurban; 3. Urban	111 (30.2%); 116 (31.6%); 140 (38.1%)	0 (0%)

The summary output above is for the test data set.

About the data set:

There are 367 sample observations and 12 variables in the data set.
There are six (6) categorical variables, namely, Gender, Married, Dependents, Education, Self Employed, and Property Area.
There are five (5) numeric variables, namely, Applicant Income, Co-applicant Income, Loan Amount, Loan Amount Term and Credit History.
The response variable, Loan Status, is not present in the test data set and is to be predicted using the best model.
The Loan ID variable is an identifier variable, and will not be of any value for this analysis.

Missing data:

Categorical variables: Three (3) variables have missing data, namely, Gender, Dependents and Self Employed.
Numeric variables: Three (3) variables have missing data, namely, Loan Amount, Loan Amount Term and Credit History.

Class distribution:

Gender: 80% were males, and 20% were females.
Marital status: 64% were married, and 36% were unmarried.
Dependents: 56% had no dependents, 16% had one dependent, 17% had two dependents, and 11% had three or more dependents.
Education: 77% had a graduate degree, and 23% had no graduate-level qualification.
Employment type: 89% were not self-employed, and 11% were self-employed.
Credit History: 83% had good credit history, and 17% had poor credit history.
Property Location: 38% were urban, 32% were semi-urban, and 30% were rural.

Visualisation of the data

Univariate analysis

The histograms and box-plots above represent the distribution, shape and spread of the numeric variables Applicant Income, Coapplicant Income, and Loan Amount, from the training data set.

Distribution and skewness:

Income: Neither the Applicant Income nor the Coapplicant Income variable is normally distributed. Both variables show a strongly right-skewed distribution.
Loan Amount: The Loan Amount variable shows a right-skewed distribution.

Outliers:

Income: There are a couple of outliers with very high income in both Applicant Income and Coapplicant Income variables.

Therefore, it is sensible to apply an appropriate transformation to these variables.

Bivariate analysis

The bar-plots above show the bivariate analysis between categorical variables and loan status from the training data set.

Row 1 bar-plots:

Marital Status: Applicants who were married have a higher proportion of approved loans.
Dependents: Applicants with two (2) dependents have a higher proportion of approved loans.
Gender: Both genders have similar proportions of approved and unapproved loans.

Row 2 bar-plots:

Credit History: Applicants with good credit history have a higher proportion of approved loans.
Education: Applicants with a graduate degree have a higher proportion of approved loans.
Employment Type: The proportions of approved and unapproved loans appear to be similar for both groups.

Row 3 bar-plots:

Property Area: Applicants in semi-urban areas have a higher proportion of approved loans.
Income Group: The applicant income and coapplicant income were summed and binned as Income Group (Low < 2500, Average < 4000, High < 6000, Affluent < 81000). The Average, High and Affluent groups have a higher proportion of approved loans, suggesting that the higher the household income, the higher the chances of obtaining loan approval.
Loan Amount Group: The Loan Amount was binned into three groups (Low < 100, Average < 200, High < 700). The Low and Average loan amount groups have a higher proportion of approved loans.
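
The binning described for Income Group and Loan Amount Group can be sketched as follows. This is a minimal illustration in Python, not the code used for the analysis: the function names are hypothetical, the upper bounds are treated as exclusive, and the top bin is treated as open-ended, all of which are assumptions.

```python
from bisect import bisect_right

# Upper bounds quoted in the text. Treating them as exclusive, and the
# top bin as open-ended, is an assumption made for this sketch.
INCOME_EDGES = [2500, 4000, 6000]
INCOME_LABELS = ["Low", "Average", "High", "Affluent"]
LOAN_EDGES = [100, 200]
LOAN_LABELS = ["Low", "Average", "High"]


def bin_value(value, edges, labels):
    """Return the label of the first bin whose upper bound exceeds value."""
    return labels[bisect_right(edges, value)]


def income_group(applicant_income, coapplicant_income):
    """Sum the two incomes, then bin the household income."""
    total = applicant_income + coapplicant_income
    return bin_value(total, INCOME_EDGES, INCOME_LABELS)


def loan_amount_group(loan_amount):
    """Bin the loan amount into Low, Average or High."""
    return bin_value(loan_amount, LOAN_EDGES, LOAN_LABELS)


# Illustrative applicant: incomes 3000 and 2500, loan amount 128.
example_income_group = income_group(3000, 2500)
example_loan_group = loan_amount_group(128)
```

With these illustrative inputs, the household income of 5500 falls in the High group and the loan amount of 128 falls in the Average group.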

In summary, the variables credit history, marital status, education, property area, income and loan amount appear to affect the home loan eligibility.

Correlation analysis

The heatmap above shows the correlation between numeric variables in the training data set. The variables Applicant Income and Loan Amount are moderately correlated (0.5), as are Credit History and Loan Status (0.53).
Because the income and loan amount variables are correlated with each other, it is sensible to transform or combine the Applicant Income, Coapplicant Income and Loan Amount variables before modelling.

Data cleaning

Missing data treatment

## Loan_ID              0
## Gender               0
## Married              0
## Dependents           0
## Education            0
## Self_Employed        0
## ApplicantIncome      0
## CoapplicantIncome    0
## LoanAmount           0
## Loan_Amount_Term     0
## Credit_History       0
## Property_Area        0
## dtype: int64

The summary output above shows that the missing data in the training and test data sets were imputed, and that no missing data remain in either data set.
The missing data for categorical variables were imputed using the mode. The missing data for the numeric variable Loan Amount were imputed with the median, and those for Loan Amount Term with the mode.
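
As a rough illustration of mode and median imputation, the sketch below fills missing entries in plain Python lists. The helper `impute` and the sample values are hypothetical and are not taken from the data sets; the analysis itself may have used different tooling.

```python
from statistics import median, mode


def impute(values, strategy):
    """Replace None entries with the mode or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mode(observed) if strategy == "mode" else median(observed)
    return [fill if v is None else v for v in values]


# Illustrative values only, not taken from the data sets.
gender = ["Male", None, "Female", "Male"]
loan_amount = [128, None, 66, 120, 141]
loan_amount_term = [360, 360, None, 180]

gender_filled = impute(gender, "mode")                      # categorical: mode
loan_amount_filled = impute(loan_amount, "median")          # numeric: median
loan_amount_term_filled = impute(loan_amount_term, "mode")  # numeric: mode
```

The median is less sensitive than the mean to the high outliers seen in Loan Amount, while the mode suits Loan Amount Term because most loans share one term.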

Feature engineering

Multicollinearity treatment

The variables Applicant Income and Coapplicant Income can be summed and merged as Total Income.
- \(Total\,Income_i = Applicant\,Income_i + Coapplicant\,Income_i\).
The variable Loan Amount can be divided by Loan Amount Term to create EMI (equated monthly instalment).
- \(EMI_i = Loan\,Amount_i / Loan\,Amount\,Term_i\).

Skewness treatment

As we saw earlier in the univariate analysis, the income distribution was right-skewed, so it is sensible to log-transform the newly created Total Income variable.
Likewise, because the Loan Amount variable was right-skewed, it is sensible to log-transform the EMI variable.
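
The two derived variables and their log transformations can be sketched for a single applicant as below. The function name `engineer_features` and the input values are hypothetical; the names `logTotInc` and `logEMI` follow the naive Bayes output later in this paper. The sketch applies the formulas exactly as written above and assumes strictly positive inputs, since the logarithm is undefined at zero.

```python
import math


def engineer_features(applicant_income, coapplicant_income,
                      loan_amount, loan_amount_term):
    """Derive Total Income and EMI, then log-transform both.

    Assumes total income and EMI are strictly positive.
    """
    total_income = applicant_income + coapplicant_income
    emi = loan_amount / loan_amount_term
    return {
        "TotalIncome": total_income,
        "EMI": emi,
        "logTotInc": math.log(total_income),
        "logEMI": math.log(emi),
    }


# Illustrative applicant, not a row from the data sets.
example = engineer_features(5000, 1000, 128, 360)
```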

Inspect feature engineering effects

Following log-transformation, the histograms above show that the Total Income and EMI variables are approximately normally distributed.

The heatmap above shows the correlation between newly created and log-transformed variables, Total Income and EMI, and existing variables Credit History and Loan Status.
In summary, there are no major concerns with the feature engineering and transformation.
Finally, the training and test data sets are ready for model selection, evaluation and prediction.

Model selection, evaluation and prediction

The training data set was randomly split into two data sets with a 60:40 ratio. The former will be used for model selection and the latter for model evaluation.
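
A random 60:40 split of row indices might look like the sketch below. The function name, the seed and the choice to round the selection set up are assumptions; rounding up is chosen here only because it reproduces the 369 cases reported in the model output, leaving 245 for evaluation.

```python
import math
import random


def split_indices(n_rows, train_fraction=0.6, seed=2020):
    """Shuffle row indices and split them into selection and evaluation sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = math.ceil(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]


selection_idx, evaluation_idx = split_indices(614)
```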

Model: Logistic regression

Model selection

## 
## Call:
## glm(formula = Loan_Status ~ Married + Education + Credit_History, 
##     family = "binomial", data = trainSet1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9196  -0.3736   0.5874   0.7326   2.3843  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -2.7823     0.5081  -5.477 4.34e-08 ***
## MarriedYes              0.6468     0.2800   2.310   0.0209 *  
## EducationNot Graduate  -0.4914     0.3167  -1.552   0.1207    
## Credit_History1         3.8053     0.4906   7.756 8.77e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 469.75  on 368  degrees of freedom
## Residual deviance: 340.92  on 365  degrees of freedom
## AIC: 348.92
## 
## Number of Fisher Scoring iterations: 5

Using AIC (Akaike Information Criterion) as the model selection criterion, the logistic regression model summarised above is the best of the candidate models considered, having the lowest AIC.
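
As a quick consistency check on the summary output, the reported AIC can be reproduced from the residual deviance. For a binomial model fitted to ungrouped 0/1 responses, the residual deviance equals minus twice the log-likelihood, so AIC is the deviance plus twice the number of estimated coefficients. The variable names below are illustrative.

```python
residual_deviance = 340.92  # from the summary output
n_parameters = 4            # intercept plus three slope coefficients
n_observations = 369        # null degrees of freedom (368) plus one

# AIC = -2 * log-likelihood + 2 * k, with -2 * log-likelihood = deviance here.
aic = residual_deviance + 2 * n_parameters
residual_df = n_observations - n_parameters
```

Both values agree with the output: an AIC of 348.92 and 365 residual degrees of freedom.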

Model interpretation:

Based on the output above,

The variable Credit History is highly significant (P-value ~ 0), suggesting that when an applicant has good credit history the odds of loan approval are multiplied by approximately 45 (exp(3.8053) ≈ 44.9), holding the other variables constant.
The variable Married is significant (P-value ~ 0.02), suggesting that when an applicant is married the odds of loan approval are multiplied by approximately 1.9 (exp(0.6468) ≈ 1.91), holding the other variables constant.
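
The odds multipliers come from exponentiating the coefficient estimates. The sketch below does this for the estimates in the summary output; the dictionary names are illustrative. Exponentiating the unrounded estimates gives slightly different values from exponentiating estimates first rounded to two decimals (for example, exp(3.81) ≈ 45.15 and exp(0.65) ≈ 1.92).

```python
import math

# Coefficient estimates copied from the logistic regression summary output.
coefficients = {
    "MarriedYes": 0.6468,
    "EducationNot Graduate": -0.4914,
    "Credit_History1": 3.8053,
}

# The odds ratio for each term is exp(estimate).
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}
```

An odds ratio above 1 (Married, Credit History) raises the odds of approval, while one below 1 (Education: Not Graduate, about 0.61) lowers them.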

Model equation:

\(\log(Odds_i) = \beta_0 + \beta_1 \, Married_{Yes,i} + \beta_2 \, Education_{NG,i} + \beta_3 \, Credit\,History_i\)
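
To show how the model equation turns into a predicted probability, the sketch below plugs the fitted coefficients into the equation and applies the inverse logit. The function name and the two example applicants are hypothetical.

```python
import math

# Estimates from the summary output above.
INTERCEPT = -2.7823
BETA_MARRIED_YES = 0.6468
BETA_EDUCATION_NG = -0.4914
BETA_CREDIT_HISTORY = 3.8053


def approval_probability(married, not_graduate, good_credit):
    """Inverse logit of the linear predictor; each input is 0 or 1."""
    log_odds = (INTERCEPT
                + BETA_MARRIED_YES * married
                + BETA_EDUCATION_NG * not_graduate
                + BETA_CREDIT_HISTORY * good_credit)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Married graduate with good credit history versus poor credit history.
p_good_credit = approval_probability(1, 0, 1)
p_poor_credit = approval_probability(1, 0, 0)
```

For a married graduate, the predicted approval probability is about 0.84 with good credit history and about 0.11 with poor credit history.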

Deviance as a goodness-of-fit statistic:

Comparing the residual deviance (340.92) with a chi-squared distribution on 365 degrees of freedom gives no evidence (P-value ~ 0.81) of lack of fit.
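
The P-value can be approximated without a statistics package by using the Wilson-Hilferty normal approximation to the chi-squared distribution. This is an approximation chosen to keep the sketch dependency-free, not necessarily how the reported value was computed, and the function name is hypothetical.

```python
from statistics import NormalDist


def chi2_upper_tail_wh(x, df):
    """Approximate P(X > x) for X ~ chi-squared(df) via Wilson-Hilferty."""
    spread = 2.0 / (9.0 * df)
    z = ((x / df) ** (1.0 / 3.0) - (1.0 - spread)) / spread ** 0.5
    return 1.0 - NormalDist().cdf(z)


# Residual deviance and residual degrees of freedom from the summary output.
p_value = chi2_upper_tail_wh(340.92, 365)
```

The approximation gives a value close to the reported 0.81.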

Model evaluation

Classification Matrix
Actual / Predicted	No	Yes
No	37	40
Yes	5	163

Classification Accuracy
Specificity	Sensitivity	Prediction Error
0.48	0.97	0.18

The classification accuracy table above reports three key metrics:

Prediction error: The estimated prediction error is 0.18, i.e., 18%, which is the misclassification rate.
Sensitivity: The estimated sensitivity is 0.97, i.e., 97%, which is the model’s true positive rate.
Specificity: The estimated specificity is 0.48, i.e., 48%, which is the model’s true negative rate.
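
The three metrics follow directly from the classification matrix. The sketch below assumes that rows are actual classes and columns are predicted classes, which is the reading that reproduces the reported sensitivity and specificity; the function name is illustrative.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute specificity, sensitivity and prediction error from counts."""
    total = tn + fp + fn + tp
    return {
        "specificity": tn / (tn + fp),           # true negative rate
        "sensitivity": tp / (tp + fn),           # true positive rate
        "prediction_error": (fp + fn) / total,   # misclassification rate
    }


# Counts from the logistic regression classification matrix.
logistic = classification_metrics(tn=37, fp=40, fn=5, tp=163)
```

Rounded to two decimals, this gives a specificity of 0.48, a sensitivity of 0.97 and a prediction error of 0.18.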

Model prediction

Logistic regression: Test data prediction
Status	Count
No	59
Yes	308

The summary output above shows the model’s predictions for the test data set.

Model: Decision tree

Model selection

## 
## Call:
## C5.0.formula(formula = Loan_Status ~ ., data = trainSet1DT)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Mon Dec 28 00:10:56 2020
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 369 cases (10 attributes) from undefined.data
## 
## Decision tree:
## 
## Credit_History = 0: N (64/5)
## Credit_History = 1: Y (305/64)
## 
## 
## Evaluation on training data (369 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       2   69(18.7%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##      59    64    (a): class N
##       5   241    (b): class Y
## 
## 
##  Attribute usage:
## 
##  100.00% Credit_History
## 
## 
## Time: 0.0 secs

The decision tree model summary output above shows that loan eligibility is predicted from the applicant’s credit history alone. However, this will lead to a highly biased model.
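
The fitted tree has a single split, so it can be written as a one-line rule. The sketch below restates that rule and recomputes the training error from the leaf summaries in the output, where each leaf is reported as (cases/errors). The function and variable names are illustrative.

```python
def predict_loan_status(credit_history):
    """Two-leaf tree: approve only when credit history is good (1)."""
    return "Y" if credit_history == 1 else "N"


# Leaf summaries from the C5.0 output: (cases covered, cases misclassified).
leaves = {0: (64, 5), 1: (305, 64)}

total_cases = sum(cases for cases, _ in leaves.values())
total_errors = sum(errors for _, errors in leaves.values())
training_error_pct = round(100 * total_errors / total_cases, 1)
```

The totals match the output: 69 errors in 369 cases, or 18.7%.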

Model evaluation

Decision tree: Classification matrix
Actual / Predicted	No	Yes
No	37	40
Yes	5	163

Decision tree: Classification Accuracy
Specificity	Sensitivity	Prediction Error
0.48	0.97	0.18

The classification accuracy table above reports three key metrics:

Prediction error: The estimated prediction error is 0.18, i.e., 18%, which is the misclassification rate.
Sensitivity: The estimated sensitivity is 0.97, i.e., 97%, which is the model’s true positive rate.
Specificity: The estimated specificity is 0.48, i.e., 48%, which is the model’s true negative rate.

Model prediction

Decision tree: Test data prediction
Loan status	Count
N	59
Y	308

The summary output above shows the model’s predictions for the test data set.

Model: Naive-Bayes classifier

Model selection

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = trainSet1DT[, c(1:7, 9:10)], y = trainSet1DT[, 
##     c(8)])
## 
## A-priori probabilities:
## trainSet1DT[, c(8)]
##         N         Y 
## 0.3333333 0.6666667 
## 
## Conditional probabilities:
##                    Gender
## trainSet1DT[, c(8)]    Female      Male
##                   N 0.1951220 0.8048780
##                   Y 0.1463415 0.8536585
## 
##                    Married
## trainSet1DT[, c(8)]        No       Yes
##                   N 0.4390244 0.5609756
##                   Y 0.2845528 0.7154472
## 
##                    Dependents
## trainSet1DT[, c(8)]          0          1          2         3+
##                   N 0.60975610 0.19512195 0.12195122 0.07317073
##                   Y 0.57317073 0.15853659 0.18292683 0.08536585
## 
##                    Education
## trainSet1DT[, c(8)]  Graduate Not Graduate
##                   N 0.7479675    0.2520325
##                   Y 0.8089431    0.1910569
## 
##                    Self_Employed
## trainSet1DT[, c(8)]        No       Yes
##                   N 0.8943089 0.1056911
##                   Y 0.8658537 0.1341463
## 
##                    Credit_History
## trainSet1DT[, c(8)]         0         1
##                   N 0.4796748 0.5203252
##                   Y 0.0203252 0.9796748
## 
##                    Property_Area
## trainSet1DT[, c(8)]     Rural Semiurban     Urban
##                   N 0.3414634 0.2845528 0.3739837
##                   Y 0.2845528 0.3780488 0.3373984
## 
##                    logTotInc
## trainSet1DT[, c(8)]     [,1]      [,2]
##                   N 8.650276 0.5882107
##                   Y 8.689458 0.4956212
## 
##                    logEMI
## trainSet1DT[, c(8)]     [,1]      [,2]
##                   N 5.965753 0.5636649
##                   Y 5.991300 0.5614327
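
To illustrate how the naive Bayes classifier combines the a-priori and conditional probabilities printed above, the sketch below computes the posterior probability of approval using only two predictors, Credit_History and Married. Restricting the calculation to two predictors is a simplification for illustration; the fitted model multiplies the terms for all predictors. The names are hypothetical.

```python
# Values copied from the naive Bayes output above.
PRIOR = {"N": 0.3333333, "Y": 0.6666667}
P_CREDIT = {
    "N": {0: 0.4796748, 1: 0.5203252},
    "Y": {0: 0.0203252, 1: 0.9796748},
}
P_MARRIED = {
    "N": {"No": 0.4390244, "Yes": 0.5609756},
    "Y": {"No": 0.2845528, "Yes": 0.7154472},
}


def posterior_approved(credit_history, married):
    """Posterior P(Y) from prior times class-conditional probabilities."""
    scores = {
        status: PRIOR[status]
        * P_CREDIT[status][credit_history]
        * P_MARRIED[status][married]
        for status in PRIOR
    }
    return scores["Y"] / (scores["N"] + scores["Y"])


p_married_good_credit = posterior_approved(1, "Yes")
p_unmarried_poor_credit = posterior_approved(0, "No")
```

Under this two-predictor simplification, a married applicant with good credit history has a posterior approval probability of about 0.83, while an unmarried applicant with poor credit history has one of about 0.05.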

Model evaluation

Naive-Bayes model: Classification matrix
Actual / Predicted	No	Yes
No	39	38
Yes	7	161

Naive-Bayes model: Classification Accuracy
Specificity	Sensitivity	Prediction Error
0.51	0.96	0.18

The classification accuracy table above reports three key metrics:

Prediction error: The estimated prediction error is 0.18, i.e., 18%, which is the misclassification rate.
Sensitivity: The estimated sensitivity is 0.96, i.e., 96%, which is the model’s true positive rate.
Specificity: The estimated specificity is 0.51, i.e., 51%, which is the model’s true negative rate.

Model prediction

Naive-Bayes model: Test data prediction
Loan status	Count
N	65
Y	302

The summary output above shows the model’s predictions for the test data set.

Model comparison

The receiver operating characteristic (ROC) plot above compares the three models based on their classification performance.
Using the area-under-curve (AUC) statistic, the naive Bayes classifier model is slightly better than the logistic regression and decision tree models. However, the differences between the models are small.
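
The AUC values behind the plot are not reported in this section. As a hedged illustration of how AUC is computed, the sketch below applies the trapezoidal rule to the ROC curve of a classifier with a single operating point, such as the two-leaf decision tree, using its classification matrix. The function name is hypothetical, and the result need not match the plotted value.

```python
def trapezoid_auc(roc_points):
    """Area under a ROC curve given (FPR, TPR) points, by the trapezoidal rule."""
    points = sorted(roc_points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Counts from the decision tree classification matrix (rows assumed actual).
tn, fp, fn, tp = 37, 40, 5, 163
fpr = fp / (fp + tn)  # 1 - specificity
tpr = tp / (tp + fn)  # sensitivity

# One operating point plus the two trivial corners of the ROC square.
tree_auc = trapezoid_auc([(0.0, 0.0), (fpr, tpr), (1.0, 1.0)])
```

With a single operating point the AUC reduces to the average of sensitivity and specificity, which here is about 0.73.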

References

Loan prediction, Anonymous, 2016, Data set, https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/.
summarytools, Dominic Comtois, 2020, R package, https://cran.r-project.org/web/packages/summarytools/index.html.
ggplot2, Hadley Wickham, 2016, R package, https://ggplot2.tidyverse.org.
gridExtra, Baptiste Auguie, Anton Antonov, 2017, R package, https://cran.r-project.org/web/packages/gridExtra/index.html.
knitr, Several, 2020, R package, https://cran.r-project.org/web/packages/knitr/index.html.
dplyr, Several, 2020, R package, https://cran.r-project.org/web/packages/dplyr/index.html.
ggpubr, Alboukadel Kassambara, 2020, R package, https://cran.csiro.au/web/packages/ggpubr/ggpubr.pdf.
ggcorrplot, Alboukadel Kassambara, 2019, R package, https://cran.r-project.org/web/packages/ggcorrplot/index.html.
tidyverse, Hadley Wickham, 2019, R package, https://cran.r-project.org/web/packages/tidyverse/index.html.
C50, Several, 2020, R package, https://cran.r-project.org/web/packages/C50/index.html.
e1071, Several, 2019, R package, https://cran.r-project.org/web/packages/e1071/index.html.
pROC, Several, 2020, R package, https://cran.r-project.org/web/packages/pROC/index.html.
pandas, Several, 2020, Python module, https://pandas.pydata.org/about/citing.html.
matplotlib, J.D. Hunter, 2007, Python module, https://matplotlib.org.
seaborn, Several, 2017, Python module, https://seaborn.pydata.org.
prettydoc, Several, 2020, R package, https://cran.r-project.org/web/packages/prettydoc/vignettes/cayman.html.

Keywords

loan prediction, multivariate, machine learning, supervised learning, classification, prediction, logistic regression, decision tree, naive Bayes classifier.