1. Problem Statement

A bank wants to automate its loan application approval process by fitting a machine learning model.

2. Solution Summary

I trained a random forest model on historical loan application data and evaluated it on a test data set that the model had not seen during training.
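
The report does not say how the data was divided into training and test sets. As a minimal sketch, assuming a stratified 75/25 split (the fraction, the seed, and the helper name `stratified_split` are all hypothetical, not taken from the source), the following shows one way to hold out unseen test data while preserving the class balance. The labels are placeholders built from the class counts reported in the Data Analysis section.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=42):
    """Return (train_idx, test_idx), sampling each class separately.

    Hypothetical helper: the split fraction and seed are assumptions.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Round the per-class test size down so the training set keeps the remainder.
        n_test = int(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Placeholder labels: 192 not approved (N) and 422 approved (Y).
labels = ["N"] * 192 + ["Y"] * 422
train_idx, test_idx = stratified_split(labels)

test_counts = {c: sum(labels[i] == c for i in test_idx) for c in ("N", "Y")}
print(len(train_idx), len(test_idx), test_counts)
# prints: 461 153 {'N': 48, 'Y': 105}
```

Under these assumptions the test set holds 48 N and 105 Y applications, which matches the column totals of the confusion matrix reported under Model Evaluation. That is consistent with a stratified 75/25 split but does not prove one was used.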

Credit history was the most significant variable in determining whether a loan application was approved.

Model performance:
Accuracy: 80%.
Sensitivity: 65%, the share of unworthy (not approved) loan applications that the model correctly identifies.
Specificity: 88%, the share of worthy (approved) loan applications that the model correctly identifies.
Area Under the Curve: 76%.

The model will be monitored and improved further as more data is collected.

3. Data Analysis

Dependent Variable: Loan Status
The majority of the loan applications were approved (422 of 614, about 69%), as seen in the table below.

  
    N   Y 
  192 422

Most Significant Variable
The credit history of the loan applicants was the most significant variable in determining whether a loan application was approved, as seen in the figure below.

Model Evaluation
A confusion matrix was constructed on the 153 test-set applications to assess model performance, since the target variable is a binary response.

  Confusion Matrix and Statistics
  
            Reference
  Prediction  N  Y
           N 31 13
           Y 17 92
                                            
                 Accuracy : 0.8039          
                   95% CI : (0.7321, 0.8636)
      No Information Rate : 0.6863          
      P-Value [Acc > NIR] : 0.0007737       
                                            
                    Kappa : 0.5341          
                                            
   Mcnemar's Test P-Value : 0.5838824       
                                            
              Sensitivity : 0.6458          
              Specificity : 0.8762          
           Pos Pred Value : 0.7045          
           Neg Pred Value : 0.8440          
               Prevalence : 0.3137          
           Detection Rate : 0.2026          
     Detection Prevalence : 0.2876          
        Balanced Accuracy : 0.7610          
                                            
         'Positive' Class : N               
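
To make the statistics above easier to check, here is a short sketch that recomputes them from the four cell counts, treating N as the positive class as the output states. The variable names are illustrative. The last step assumes that the "P-Value [Acc > NIR]" line is a one-sided exact binomial test of accuracy against the no-information rate, which is how such output is usually produced but is not stated in the source.

```python
from math import comb

# Cell counts from the confusion matrix above; "N" is the positive class.
tp, fn = 31, 17  # reference N: predicted N, predicted Y
fp, tn = 13, 92  # reference Y: predicted N, predicted Y
n = tp + fn + fp + tn

accuracy = (tp + tn) / n
sensitivity = tp / (tp + fn)   # unworthy applications correctly flagged
specificity = tn / (tn + fp)   # worthy applications correctly passed
ppv = tp / (tp + fp)           # precision for class N
npv = tn / (tn + fn)
balanced_accuracy = (sensitivity + specificity) / 2
no_information_rate = max(tp + fn, fp + tn) / n

# Cohen's kappa: observed agreement corrected for chance agreement.
expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
kappa = (accuracy - expected) / (1 - expected)

# Assumed one-sided exact binomial test: P(X >= correct) under the
# hypothesis that the true accuracy equals the no-information rate.
correct = tp + tn
p_value = sum(
    comb(n, k) * no_information_rate**k * (1 - no_information_rate) ** (n - k)
    for k in range(correct, n + 1)
)

for name, value in [
    ("Accuracy", accuracy),
    ("Sensitivity", sensitivity),
    ("Specificity", specificity),
    ("Pos Pred Value", ppv),
    ("Neg Pred Value", npv),
    ("Balanced Accuracy", balanced_accuracy),
    ("No Information Rate", no_information_rate),
    ("Kappa", kappa),
]:
    print(f"{name:>20} : {value:.4f}")
print(f"{'P-Value [Acc > NIR]':>20} : {p_value:.7f}")
```

Because N is the positive class, sensitivity here measures how well the model catches applications that should be declined, which is the weaker of the two rates (about 65% against about 88%).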
  

ROC Curve
The ROC curve was constructed and the area under it (AUC) was computed. This is possible only because this is a classification problem.

The model has an AUC of 76%.
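
The source does not say whether the ROC curve was built from predicted probabilities or from the hard class predictions. As a sketch under the second assumption, the AUC can be computed as the probability that a randomly chosen positive case (N) is scored higher than a randomly chosen negative case (Y), counting ties as one half. The function name `auc` and the 0/1 scores are illustrative; the scores are rebuilt from the confusion matrix counts, not taken from the original data.

```python
def auc(scores_pos, scores_neg):
    """Rank-based AUC: P(positive outranks negative), ties count as 1/2."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Assumed hard predictions, rebuilt from the confusion matrix
# (score 1 = predicted N, score 0 = predicted Y; N is the positive class).
pos_scores = [1] * 31 + [0] * 17  # applications whose true status is N
neg_scores = [1] * 13 + [0] * 92  # applications whose true status is Y

hard_label_auc = auc(pos_scores, neg_scores)
print(round(hard_label_auc, 4))
# prints: 0.761
```

With hard labels the ROC curve has a single interior point, so the AUC equals the balanced accuracy (0.7610), which agrees with the 76% reported here. Passing predicted probabilities for class N as the scores would give the AUC of the full curve, which generally differs.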

End Notes
Find source code and data used here.

Code and data
