A bank wants to automate its loan application approval process by fitting a machine learning model.
I trained a random forest model on historical loan application data and tested it on a held-out test set that the model had not seen during training.
Credit history was the most significant variable in determining whether a loan application was approved.
Model performance:
Accuracy: 80%.
Sensitivity: 65%, the ability to identify unworthy loan applications.
Specificity: 88%, the ability to identify worthy loan applications.
Area Under the Curve: 77%.
The model will be monitored and improved further as more data is collected.
Dependent Variable, Loan Status
The majority of the loan applications were approved, as seen in the table below.
Loan status    N (not approved)    Y (approved)
Count          192                 422
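The class counts above also give a useful baseline: a model that always predicts the majority class. The following is a minimal sketch in Python using only the two counts from the table; the variable names are illustrative and not taken from the original source code.

```python
# Class counts from the loan status table above.
counts = {"N": 192, "Y": 422}

total = sum(counts.values())
# Share of each class in the full data set.
shares = {label: n / total for label, n in counts.items()}

# Always predicting the majority class gives this accuracy on the full data.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

print(f"total applications: {total}")
print(f"majority class: {majority_label}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any model accuracy should be read against this baseline rather than against 50%, because the two classes are not balanced.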
Most Significant Variable
Credit history was the most significant variable in determining whether a loan application was approved, as seen in the figure below.
Model Evaluation
A confusion matrix is used to assess model performance, since the target variable is a binary response.
Confusion Matrix and Statistics

          Reference
Prediction  N  Y
         N 31 13
         Y 17 92

               Accuracy : 0.8039
                 95% CI : (0.7321, 0.8636)
    No Information Rate : 0.6863
    P-Value [Acc > NIR] : 0.0007737

                  Kappa : 0.5341

 Mcnemar's Test P-Value : 0.5838824

            Sensitivity : 0.6458
            Specificity : 0.8762
         Pos Pred Value : 0.7045
         Neg Pred Value : 0.8440
             Prevalence : 0.3137
         Detection Rate : 0.2026
   Detection Prevalence : 0.2876
      Balanced Accuracy : 0.7610

       'Positive' Class : N
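Most of the statistics above follow directly from the four cell counts of the confusion matrix. The sketch below recomputes them in plain Python as a cross-check. It assumes, as the output states, that N is the positive class; the variable names are illustrative, and the McNemar p-value assumes the continuity-corrected form of the test.

```python
import math

# Cell counts from the confusion matrix; 'N' (not approved) is the positive class.
tp = 31  # predicted N, actually N
fp = 13  # predicted N, actually Y
fn = 17  # predicted Y, actually N
tn = 92  # predicted Y, actually Y

total = tp + fp + fn + tn

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # share of actual N predicted as N
specificity = tn / (tn + fp)   # share of actual Y predicted as Y
ppv = tp / (tp + fp)           # positive predictive value
npv = tn / (tn + fn)           # negative predictive value
prevalence = (tp + fn) / total
balanced_accuracy = (sensitivity + specificity) / 2

# No information rate: accuracy of always predicting the larger reference class.
no_information_rate = max(tp + fn, fp + tn) / total

# Cohen's kappa: agreement beyond what the row and column totals give by chance.
expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
kappa = (accuracy - expected) / (1 - expected)

# McNemar's test with continuity correction on the two off-diagonal cells.
# For 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(x / 2)).
chi2 = (abs(fp - fn) - 1) ** 2 / (fp + fn)
mcnemar_p = math.erfc(math.sqrt(chi2 / 2))

for name, value in [
    ("accuracy", accuracy), ("sensitivity", sensitivity),
    ("specificity", specificity), ("ppv", ppv), ("npv", npv),
    ("prevalence", prevalence), ("balanced accuracy", balanced_accuracy),
    ("no information rate", no_information_rate), ("kappa", kappa),
    ("mcnemar p-value", mcnemar_p),
]:
    print(f"{name:>20}: {value:.4f}")
```

Because N is the positive class, sensitivity here measures how well the model catches applications that were not approved, and specificity measures how well it recognises approved ones.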
ROC Curve
The area under the ROC curve (AUC) is computed next. This is possible because this is a classification problem.
The model has an AUC of 76%.
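AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below implements that definition directly in Python. The function name pairwise_auc and the toy scores are hypothetical and only for illustration; the second example reuses the cell counts from the confusion matrix, treating the hard class predictions as scores.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class, 0 for the negative class.
    scores: higher means "more likely positive". Ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy scores, not taken from the loan data.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
toy_auc = pairwise_auc(toy_labels, toy_scores)

# Hard predictions from the confusion matrix, with N as the positive class:
# 31 actual N predicted N, 17 actual N predicted Y,
# 13 actual Y predicted N, 92 actual Y predicted Y.
hard_labels = [1] * 31 + [1] * 17 + [0] * 13 + [0] * 92
hard_scores = [1] * 31 + [0] * 17 + [1] * 13 + [0] * 92
hard_label_auc = pairwise_auc(hard_labels, hard_scores)

print(f"toy AUC: {toy_auc:.4f}")
print(f"AUC from hard class predictions: {hard_label_auc:.4f}")
```

When the scores are hard class labels rather than predicted probabilities, this quantity reduces to the balanced accuracy, (sensitivity + specificity) / 2. An AUC computed from predicted probabilities will generally differ from that value.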
End Notes
Find source code and data used here.