Bank Loan Classifier

Author

Darwhin Gomez

1 Overview

Financial institutions generate revenue by providing a wide range of services, including access to capital through loans, credit cards, and other financial products. One of the critical challenges they face is determining whether a client should be approved for a loan or credit line. This decision requires careful consideration of multiple factors, including the applicant’s income, existing debts, age, assets, and other financial metrics.

Machine learning can be applied to this problem by leveraging historical data on past loan decisions to build predictive models that classify whether a client should be approved. In this project, I train three different machine learning models on a labeled financial dataset to classify clients as either “approved” or “not approved.”

However, the use of machine learning in credit approval must adhere to regulatory requirements that prioritize transparency and fairness. Financial institutions are often required by law to provide clear explanations for their credit decisions, ensuring that customers understand why they were approved or denied. This regulatory requirement is driven by principles of consumer protection and fairness, as outlined in laws such as the Fair Credit Reporting Act (FCRA) in the United States and the General Data Protection Regulation (GDPR) in Europe.

This need for transparency directly influences the choice of machine learning models used in credit decisioning. While complex models like Random Forests and Gradient Boosting can offer high accuracy, they are often considered “black boxes” due to their lack of interpretability. In contrast, simpler models like Logistic Regression provide clear and easily understandable explanations for decisions, making them more compliant with regulatory standards.

In this project, I will implement and compare three machine learning models:

  • Logistic Regression: A transparent and interpretable model whose coefficients show directly how each feature shifts the approval odds (see the sketch below).

  • Random Forest: A more complex model that can capture non-linear relationships but requires additional methods for interpretation.

  • Neural Network: A feed-forward network for binary classification, included to gauge the effectiveness of deep learning in this context. It can capture intricate patterns but offers little interpretability.

By comparing these models, I aim to not only achieve high predictive performance but also maintain a balance between accuracy and transparency, ensuring that the final model can be effectively explained to both customers and regulatory bodies.
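
To make the interpretability point concrete, below is a minimal sketch of how a logistic regression coefficient can be turned into a plain-language explanation of a credit decision. The coefficient value is hypothetical; the actual model is fitted in Section 4.

Code
# Illustrative only: converting a hypothetical logistic regression
# coefficient into an odds-ratio explanation for a credit decision.
beta_lpi <- 4.2                      # assumed coefficient on loan_percent_income (log-odds scale)
odds_ratio <- exp(0.10 * beta_lpi)   # effect of a 0.10 increase in the ratio
cat(sprintf("Raising loan_percent_income by 0.10 multiplies the approval odds by %.2f.\n",
            odds_ratio))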

2 EDA

2.1 Data Description

| Column | Description | Type |
|--------|-------------|------|
| person_age | Age of the person | Float |
| person_gender | Gender of the person | Categorical |
| person_education | Highest education level | Categorical |
| person_income | Annual income | Float |
| person_emp_exp | Years of employment experience | Integer |
| person_home_ownership | Home ownership status (e.g., rent, own, mortgage) | Categorical |
| loan_amnt | Loan amount requested | Float |
| loan_intent | Purpose of the loan | Categorical |
| loan_int_rate | Loan interest rate | Float |
| loan_percent_income | Loan amount as a percentage of annual income | Float |
| cb_person_cred_hist_length | Length of credit history in years | Float |
| credit_score | Credit score of the person | Integer |
| previous_loan_defaults_on_file | Indicator of previous loan defaults | Categorical |
| loan_status (target variable) | Loan approval status: 1 = approved; 0 = rejected | Integer |
Code
 # Libraries used throughout the analysis
 library(dplyr)     # data manipulation
 library(ggplot2)   # plotting
 library(gridExtra) # arranging plots side by side
 library(skimr)     # data summaries
 library(caret)     # model training and evaluation

 loan_data <- read.csv("loan_data.csv")
 str(loan_data)
'data.frame':   45000 obs. of  14 variables:
 $ person_age                    : num  22 21 25 23 24 21 26 24 24 21 ...
 $ person_gender                 : chr  "female" "female" "female" "female" ...
 $ person_education              : chr  "Master" "High School" "High School" "Bachelor" ...
 $ person_income                 : num  71948 12282 12438 79753 66135 ...
 $ person_emp_exp                : int  0 0 3 0 1 0 1 5 3 0 ...
 $ person_home_ownership         : chr  "RENT" "OWN" "MORTGAGE" "RENT" ...
 $ loan_amnt                     : num  35000 1000 5500 35000 35000 2500 35000 35000 35000 1600 ...
 $ loan_intent                   : chr  "PERSONAL" "EDUCATION" "MEDICAL" "MEDICAL" ...
 $ loan_int_rate                 : num  16 11.1 12.9 15.2 14.3 ...
 $ loan_percent_income           : num  0.49 0.08 0.44 0.44 0.53 0.19 0.37 0.37 0.35 0.13 ...
 $ cb_person_cred_hist_length    : num  3 2 3 2 4 2 3 4 2 3 ...
 $ credit_score                  : int  561 504 635 675 586 532 701 585 544 640 ...
 $ previous_loan_defaults_on_file: chr  "No" "Yes" "No" "No" ...
 $ loan_status                   : int  1 0 1 1 1 1 1 1 1 1 ...
Code
 skim(loan_data)
Data summary
Name loan_data
Number of rows 45000
Number of columns 14
_______________________
Column type frequency:
character 5
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
person_gender 0 1 4 6 0 2 0
person_education 0 1 6 11 0 5 0
person_home_ownership 0 1 3 8 0 4 0
loan_intent 0 1 7 17 0 6 0
previous_loan_defaults_on_file 0 1 2 3 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
person_age 0 1 27.76 6.05 20.00 24.00 26.00 30.00 144.00 ▇▁▁▁▁
person_income 0 1 80319.05 80422.50 8000.00 47204.00 67048.00 95789.25 7200766.00 ▇▁▁▁▁
person_emp_exp 0 1 5.41 6.06 0.00 1.00 4.00 8.00 125.00 ▇▁▁▁▁
loan_amnt 0 1 9583.16 6314.89 500.00 5000.00 8000.00 12237.25 35000.00 ▇▆▂▁▁
loan_int_rate 0 1 11.01 2.98 5.42 8.59 11.01 12.99 20.00 ▆▇▆▃▁
loan_percent_income 0 1 0.14 0.09 0.00 0.07 0.12 0.19 0.66 ▇▅▁▁▁
cb_person_cred_hist_length 0 1 5.87 3.88 2.00 3.00 4.00 8.00 30.00 ▇▂▁▁▁
credit_score 0 1 632.61 50.44 390.00 601.00 640.00 670.00 850.00 ▁▂▇▃▁
loan_status 0 1 0.22 0.42 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
Code
table(loan_data$person_education)

  Associate    Bachelor   Doctorate High School      Master 
      12028       13399         621       11972        6980 
Code
table(loan_data$person_gender)

female   male 
 20159  24841 
Code
table(loan_data$loan_intent)

DEBTCONSOLIDATION         EDUCATION   HOMEIMPROVEMENT           MEDICAL 
             7145              9153              4783              8548 
         PERSONAL           VENTURE 
             7552              7819 
Code
table(loan_data$person_home_ownership)

MORTGAGE    OTHER      OWN     RENT 
   18489      117     2951    23443 
Code
table(loan_data$previous_loan_defaults_on_file)

   No   Yes 
22142 22858 
Code
barplot(table(loan_data$person_education), 
        main = "Distribution of Education Levels",
        col = "grey", 
        las = 2, 
        cex.names = 0.8)

Code
barplot(table(loan_data$person_gender), 
        main = "Distribution of Gender",
        col = "grey", 
        las = 2, 
        cex.names = 0.8)

Code
barplot(table(loan_data$loan_intent), 
        main = "Distribution of Loan Intent",
        col = "grey", 
        las = 2, 
        cex.names = 0.8)

Code
barplot(table(loan_data$person_home_ownership), 
        main = "Distribution of Home Ownership",
        col = "grey", 
        las = 2, 
        cex.names = 0.8)

Code
barplot(table(loan_data$previous_loan_defaults_on_file), 
        main = "Distribution of Previous Loan Defaults",
        col = "grey",  
        las = 2, 
        cex.names = 0.8)

Code
numerical_features <- loan_data %>% 
  select_if(is.numeric) %>%
  select(-loan_status) %>%
  colnames()

# Create box plots and histograms for each numerical feature
for (feature in numerical_features) {
  # Boxplot
  boxplot_plot <- ggplot(loan_data, aes(y = !!sym(feature))) +
    geom_boxplot(fill = "skyblue", color = "darkblue", outlier.color = "red") +
    labs(title = paste("Boxplot of", feature), y = feature, x = "") +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 30, face = "bold"),
      axis.title.y = element_text(size = 30),
      axis.text.y = element_text(size = 25),
      axis.text.x = element_text(size = 25)
    )
  
  # Histogram
  hist_plot <- ggplot(loan_data, aes(x = !!sym(feature))) +
    geom_histogram(aes(y = after_stat(density)), fill = "lightgreen", color = "darkgreen", bins = 30) +
    geom_density(color = "red", linewidth = 1) +
    labs(title = paste("Histogram of", feature), x = feature, y = "Density") +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 30, face = "bold"),
      axis.title.x = element_text(size = 20),
      axis.title.y = element_text(size = 20),
      axis.text.x = element_text(size = 15),
      axis.text.y = element_text(size = 15)
    )
  
  # Display plots side by side
  grid.arrange(boxplot_plot, hist_plot, ncol = 2)
}

Code
loan_data |>
  filter(person_age > 100) |>
  print()
  person_age person_gender person_education person_income person_emp_exp
1        144          male         Bachelor        300616            125
2        144          male        Associate        241424            121
3        123        female      High School         97140            101
4        123          male         Bachelor         94723            100
5        144        female        Associate       7200766            124
6        116          male         Bachelor       5545545             93
7        109          male      High School       5556399             85
  person_home_ownership loan_amnt loan_intent loan_int_rate loan_percent_income
1                  RENT      4800     VENTURE         13.57                0.02
2              MORTGAGE      6000   EDUCATION         11.86                0.02
3                  RENT     20400   EDUCATION         10.25                0.21
4                  RENT     20000     VENTURE         11.01                0.21
5              MORTGAGE      5000    PERSONAL         12.73                0.00
6              MORTGAGE      3823     VENTURE         12.15                0.00
7              MORTGAGE      6195     VENTURE         12.58                0.00
  cb_person_cred_hist_length credit_score previous_loan_defaults_on_file
1                          3          789                             No
2                          2          807                             No
3                          3          805                            Yes
4                          4          714                            Yes
5                         25          850                             No
6                         24          708                             No
7                         22          792                             No
  loan_status
1           0
2           0
3           0
4           0
5           0
6           0
7           0
Code
table(loan_data$loan_status)

    0     1 
35000 10000 
Code
# Bar plot of the binary target (a histogram adds little for a 0/1 variable)
barplot(table(loan_data$loan_status),
        main = "Distribution of Loan Status",
        col = "grey")

2.2 Outlier Detection and Removal

During the exploratory data analysis, clear outliers emerged in the dataset, particularly in the person_age variable. Some applicants were listed as over 100 years old, with the highest age reaching 144. Given that it is highly unlikely for individuals of such advanced age to be applying for new loans, these records were deemed unrealistic and removed from the dataset. This ensures the model is trained on more reliable data and reduces the risk of skewed predictions due to extreme values.

Code
# Removing rows where person_age is 100 or more
loan_data_cleaned <- loan_data %>% 
  filter(person_age < 100)

3 Preprocessing

Code
# Binary Encoding
loan_data_cleaned <- loan_data_cleaned %>%
  mutate(
    person_gender = ifelse(person_gender == "male", 1, 0),
    previous_loan_defaults_on_file = ifelse(previous_loan_defaults_on_file == "Yes", 1, 0)
  )

# One-Hot Encoding for Nominal Features
loan_data_encoded <- loan_data_cleaned %>%
  mutate(
    person_education = as.factor(person_education),
    person_home_ownership = as.factor(person_home_ownership),
    loan_intent = as.factor(loan_intent)
  ) %>%
  # Applying one-hot encoding
  model.matrix(~ . - 1, data = .) %>% 
  as.data.frame()
# Convert the target variable to a clean binary factor (Yes/No)
loan_data_encoded$loan_status <- factor(loan_data_encoded$loan_status, 
                                        levels = c("0", "1"), 
                                        labels = c("No", "Yes"))
head(loan_data_encoded,10)
   person_age person_gender person_educationAssociate person_educationBachelor
1          22             0                         0                        0
2          21             0                         0                        0
3          25             0                         0                        0
4          23             0                         0                        1
5          24             1                         0                        0
6          21             0                         0                        0
7          26             0                         0                        1
8          24             0                         0                        0
9          24             0                         1                        0
10         21             0                         0                        0
   person_educationDoctorate person_educationHigh School person_educationMaster
1                          0                           0                      1
2                          0                           1                      0
3                          0                           1                      0
4                          0                           0                      0
5                          0                           0                      1
6                          0                           1                      0
7                          0                           0                      0
8                          0                           1                      0
9                          0                           0                      0
10                         0                           1                      0
   person_income person_emp_exp person_home_ownershipOTHER
1          71948              0                          0
2          12282              0                          0
3          12438              3                          0
4          79753              0                          0
5          66135              1                          0
6          12951              0                          0
7          93471              1                          0
8          95550              5                          0
9         100684              3                          0
10         12739              0                          0
   person_home_ownershipOWN person_home_ownershipRENT loan_amnt
1                         0                         1     35000
2                         1                         0      1000
3                         0                         0      5500
4                         0                         1     35000
5                         0                         1     35000
6                         1                         0      2500
7                         0                         1     35000
8                         0                         1     35000
9                         0                         1     35000
10                        1                         0      1600
   loan_intentEDUCATION loan_intentHOMEIMPROVEMENT loan_intentMEDICAL
1                     0                          0                  0
2                     1                          0                  0
3                     0                          0                  1
4                     0                          0                  1
5                     0                          0                  1
6                     0                          0                  0
7                     1                          0                  0
8                     0                          0                  1
9                     0                          0                  0
10                    0                          0                  0
   loan_intentPERSONAL loan_intentVENTURE loan_int_rate loan_percent_income
1                    1                  0         16.02                0.49
2                    0                  0         11.14                0.08
3                    0                  0         12.87                0.44
4                    0                  0         15.23                0.44
5                    0                  0         14.27                0.53
6                    0                  1          7.14                0.19
7                    0                  0         12.42                0.37
8                    0                  0         11.11                0.37
9                    1                  0          8.90                0.35
10                   0                  1         14.74                0.13
   cb_person_cred_hist_length credit_score previous_loan_defaults_on_file
1                           3          561                              0
2                           2          504                              1
3                           3          635                              0
4                           2          675                              0
5                           4          586                              0
6                           2          532                              0
7                           3          701                              0
8                           4          585                              0
9                           2          544                              0
10                          3          640                              0
   loan_status
1          Yes
2           No
3          Yes
4          Yes
5          Yes
6          Yes
7          Yes
8          Yes
9          Yes
10         Yes
Code
table(loan_data_encoded$loan_status)

   No   Yes 
34993 10000 

4 Modeling

Code
# Splitting the data into training and test sets (75/25)
set.seed(123)  # seed for reproducibility (the specific value is illustrative)
train_index <- createDataPartition(loan_data_encoded$loan_status, p = 0.75, list = FALSE)
train_data <- loan_data_encoded[train_index, ]
test_data <- loan_data_encoded[-train_index, ]

# Cross-validation (5-fold)
cv_control <- trainControl(
  method = "cv", 
  number = 5, 
  verboseIter = TRUE,
  classProbs = TRUE, 
  summaryFunction = twoClassSummary
)
train_data$loan_status <- as.factor(train_data$loan_status)
test_data$loan_status <- as.factor(test_data$loan_status)

# Check the levels to ensure it is binary
levels(train_data$loan_status)
[1] "No"  "Yes"
Code
head(train_data)
  person_age person_gender person_educationAssociate person_educationBachelor
2         21             0                         0                        0
3         25             0                         0                        0
4         23             0                         0                        1
6         21             0                         0                        0
7         26             0                         0                        1
8         24             0                         0                        0
  person_educationDoctorate person_educationHigh School person_educationMaster
2                         0                           1                      0
3                         0                           1                      0
4                         0                           0                      0
6                         0                           1                      0
7                         0                           0                      0
8                         0                           1                      0
  person_income person_emp_exp person_home_ownershipOTHER
2         12282              0                          0
3         12438              3                          0
4         79753              0                          0
6         12951              0                          0
7         93471              1                          0
8         95550              5                          0
  person_home_ownershipOWN person_home_ownershipRENT loan_amnt
2                        1                         0      1000
3                        0                         0      5500
4                        0                         1     35000
6                        1                         0      2500
7                        0                         1     35000
8                        0                         1     35000
  loan_intentEDUCATION loan_intentHOMEIMPROVEMENT loan_intentMEDICAL
2                    1                          0                  0
3                    0                          0                  1
4                    0                          0                  1
6                    0                          0                  0
7                    1                          0                  0
8                    0                          0                  1
  loan_intentPERSONAL loan_intentVENTURE loan_int_rate loan_percent_income
2                   0                  0         11.14                0.08
3                   0                  0         12.87                0.44
4                   0                  0         15.23                0.44
6                   0                  1          7.14                0.19
7                   0                  0         12.42                0.37
8                   0                  0         11.11                0.37
  cb_person_cred_hist_length credit_score previous_loan_defaults_on_file
2                          2          504                              1
3                          3          635                              0
4                          2          675                              0
6                          2          532                              0
7                          3          701                              0
8                          4          585                              0
  loan_status
2          No
3         Yes
4         Yes
6         Yes
7         Yes
8         Yes
Code
### Logistic Regression with Grid Search (CV) ###
lr_grid <- expand.grid(
  alpha = c(0, 0.5, 1),                  # 0 = Ridge, 1 = Lasso, 0.5 = Elastic Net
  lambda = seq(0.001, 0.1, length = 10)  # Regularization strength
)
### Logistic Regression with CV ###
lr_cv <- train(
  loan_status ~ ., 
  data = train_data, 
  method = "glmnet",    
  trControl = cv_control,
  tuneGrid = lr_grid,
  metric = "ROC"
)
+ Fold1: alpha=0.0, lambda=0.1 
- Fold1: alpha=0.0, lambda=0.1 
+ Fold1: alpha=0.5, lambda=0.1 
- Fold1: alpha=0.5, lambda=0.1 
+ Fold1: alpha=1.0, lambda=0.1 
- Fold1: alpha=1.0, lambda=0.1 
+ Fold2: alpha=0.0, lambda=0.1 
- Fold2: alpha=0.0, lambda=0.1 
+ Fold2: alpha=0.5, lambda=0.1 
- Fold2: alpha=0.5, lambda=0.1 
+ Fold2: alpha=1.0, lambda=0.1 
- Fold2: alpha=1.0, lambda=0.1 
+ Fold3: alpha=0.0, lambda=0.1 
- Fold3: alpha=0.0, lambda=0.1 
+ Fold3: alpha=0.5, lambda=0.1 
- Fold3: alpha=0.5, lambda=0.1 
+ Fold3: alpha=1.0, lambda=0.1 
- Fold3: alpha=1.0, lambda=0.1 
+ Fold4: alpha=0.0, lambda=0.1 
- Fold4: alpha=0.0, lambda=0.1 
+ Fold4: alpha=0.5, lambda=0.1 
- Fold4: alpha=0.5, lambda=0.1 
+ Fold4: alpha=1.0, lambda=0.1 
- Fold4: alpha=1.0, lambda=0.1 
+ Fold5: alpha=0.0, lambda=0.1 
- Fold5: alpha=0.0, lambda=0.1 
+ Fold5: alpha=0.5, lambda=0.1 
- Fold5: alpha=0.5, lambda=0.1 
+ Fold5: alpha=1.0, lambda=0.1 
- Fold5: alpha=1.0, lambda=0.1 
Aggregating results
Selecting tuning parameters
Fitting alpha = 1, lambda = 0.001 on full training set
Code
rf_grid <- expand.grid(
  mtry = c(3, 5, 7),             
  splitrule = "gini",            # Split rule (Gini for classification)
  min.node.size = c(5, 10)       # Minimum leaf size
)

rf_cv <- train(
  loan_status ~ ., 
  data = train_data, 
  method = "ranger",        # Fast Random Forest implementation
  trControl = cv_control,
  tuneGrid = rf_grid,
  metric = "ROC",
  importance = "impurity"   # passed through to ranger; needed for varImp() in Section 6
)
+ Fold1: mtry=3, splitrule=gini, min.node.size= 5 
- Fold1: mtry=3, splitrule=gini, min.node.size= 5 
+ Fold1: mtry=5, splitrule=gini, min.node.size= 5 
- Fold1: mtry=5, splitrule=gini, min.node.size= 5 
+ Fold1: mtry=7, splitrule=gini, min.node.size= 5 
- Fold1: mtry=7, splitrule=gini, min.node.size= 5 
+ Fold1: mtry=3, splitrule=gini, min.node.size=10 
- Fold1: mtry=3, splitrule=gini, min.node.size=10 
+ Fold1: mtry=5, splitrule=gini, min.node.size=10 
- Fold1: mtry=5, splitrule=gini, min.node.size=10 
+ Fold1: mtry=7, splitrule=gini, min.node.size=10 
- Fold1: mtry=7, splitrule=gini, min.node.size=10 
+ Fold2: mtry=3, splitrule=gini, min.node.size= 5 
- Fold2: mtry=3, splitrule=gini, min.node.size= 5 
+ Fold2: mtry=5, splitrule=gini, min.node.size= 5 
- Fold2: mtry=5, splitrule=gini, min.node.size= 5 
+ Fold2: mtry=7, splitrule=gini, min.node.size= 5 
- Fold2: mtry=7, splitrule=gini, min.node.size= 5 
+ Fold2: mtry=3, splitrule=gini, min.node.size=10 
- Fold2: mtry=3, splitrule=gini, min.node.size=10 
+ Fold2: mtry=5, splitrule=gini, min.node.size=10 
- Fold2: mtry=5, splitrule=gini, min.node.size=10 
+ Fold2: mtry=7, splitrule=gini, min.node.size=10 
- Fold2: mtry=7, splitrule=gini, min.node.size=10 
+ Fold3: mtry=3, splitrule=gini, min.node.size= 5 
- Fold3: mtry=3, splitrule=gini, min.node.size= 5 
+ Fold3: mtry=5, splitrule=gini, min.node.size= 5 
- Fold3: mtry=5, splitrule=gini, min.node.size= 5 
+ Fold3: mtry=7, splitrule=gini, min.node.size= 5 
- Fold3: mtry=7, splitrule=gini, min.node.size= 5 
+ Fold3: mtry=3, splitrule=gini, min.node.size=10 
- Fold3: mtry=3, splitrule=gini, min.node.size=10 
+ Fold3: mtry=5, splitrule=gini, min.node.size=10 
- Fold3: mtry=5, splitrule=gini, min.node.size=10 
+ Fold3: mtry=7, splitrule=gini, min.node.size=10 
- Fold3: mtry=7, splitrule=gini, min.node.size=10 
+ Fold4: mtry=3, splitrule=gini, min.node.size= 5 
- Fold4: mtry=3, splitrule=gini, min.node.size= 5 
+ Fold4: mtry=5, splitrule=gini, min.node.size= 5 
- Fold4: mtry=5, splitrule=gini, min.node.size= 5 
+ Fold4: mtry=7, splitrule=gini, min.node.size= 5 
- Fold4: mtry=7, splitrule=gini, min.node.size= 5 
+ Fold4: mtry=3, splitrule=gini, min.node.size=10 
- Fold4: mtry=3, splitrule=gini, min.node.size=10 
+ Fold4: mtry=5, splitrule=gini, min.node.size=10 
- Fold4: mtry=5, splitrule=gini, min.node.size=10 
+ Fold4: mtry=7, splitrule=gini, min.node.size=10 
- Fold4: mtry=7, splitrule=gini, min.node.size=10 
+ Fold5: mtry=3, splitrule=gini, min.node.size= 5 
- Fold5: mtry=3, splitrule=gini, min.node.size= 5 
+ Fold5: mtry=5, splitrule=gini, min.node.size= 5 
- Fold5: mtry=5, splitrule=gini, min.node.size= 5 
+ Fold5: mtry=7, splitrule=gini, min.node.size= 5 
- Fold5: mtry=7, splitrule=gini, min.node.size= 5 
+ Fold5: mtry=3, splitrule=gini, min.node.size=10 
- Fold5: mtry=3, splitrule=gini, min.node.size=10 
+ Fold5: mtry=5, splitrule=gini, min.node.size=10 
- Fold5: mtry=5, splitrule=gini, min.node.size=10 
+ Fold5: mtry=7, splitrule=gini, min.node.size=10 
- Fold5: mtry=7, splitrule=gini, min.node.size=10 
Aggregating results
Selecting tuning parameters
Fitting mtry = 7, splitrule = gini, min.node.size = 5 on full training set
Code
nnet_grid <- expand.grid(
  size = c(5, 10, 15),   # Neurons in the hidden layer
  decay = c(0.001, 0.01, 0.1)  # Regularization parameter
)

### Cross-Validated NNet Training ###
# Note: predictors are not centered/scaled here; nnet is sensitive to feature
# scale, which may partly explain its weaker performance relative to the other models.
nnet_cv <- train(
  loan_status ~ ., 
  data = train_data, 
  method = "nnet",
  trControl = cv_control,
  tuneGrid = nnet_grid,
  metric = "ROC",   # Optimizing for AUC-ROC
  linout = FALSE,   # Sigmoid for binary classification
  trace = FALSE     # Suppress training output
)
+ Fold1: size= 5, decay=0.001 
- Fold1: size= 5, decay=0.001 
+ Fold1: size=10, decay=0.001 
- Fold1: size=10, decay=0.001 
+ Fold1: size=15, decay=0.001 
- Fold1: size=15, decay=0.001 
+ Fold1: size= 5, decay=0.010 
- Fold1: size= 5, decay=0.010 
+ Fold1: size=10, decay=0.010 
- Fold1: size=10, decay=0.010 
+ Fold1: size=15, decay=0.010 
- Fold1: size=15, decay=0.010 
+ Fold1: size= 5, decay=0.100 
- Fold1: size= 5, decay=0.100 
+ Fold1: size=10, decay=0.100 
- Fold1: size=10, decay=0.100 
+ Fold1: size=15, decay=0.100 
- Fold1: size=15, decay=0.100 
+ Fold2: size= 5, decay=0.001 
- Fold2: size= 5, decay=0.001 
+ Fold2: size=10, decay=0.001 
- Fold2: size=10, decay=0.001 
+ Fold2: size=15, decay=0.001 
- Fold2: size=15, decay=0.001 
+ Fold2: size= 5, decay=0.010 
- Fold2: size= 5, decay=0.010 
+ Fold2: size=10, decay=0.010 
- Fold2: size=10, decay=0.010 
+ Fold2: size=15, decay=0.010 
- Fold2: size=15, decay=0.010 
+ Fold2: size= 5, decay=0.100 
- Fold2: size= 5, decay=0.100 
+ Fold2: size=10, decay=0.100 
- Fold2: size=10, decay=0.100 
+ Fold2: size=15, decay=0.100 
- Fold2: size=15, decay=0.100 
+ Fold3: size= 5, decay=0.001 
- Fold3: size= 5, decay=0.001 
+ Fold3: size=10, decay=0.001 
- Fold3: size=10, decay=0.001 
+ Fold3: size=15, decay=0.001 
- Fold3: size=15, decay=0.001 
+ Fold3: size= 5, decay=0.010 
- Fold3: size= 5, decay=0.010 
+ Fold3: size=10, decay=0.010 
- Fold3: size=10, decay=0.010 
+ Fold3: size=15, decay=0.010 
- Fold3: size=15, decay=0.010 
+ Fold3: size= 5, decay=0.100 
- Fold3: size= 5, decay=0.100 
+ Fold3: size=10, decay=0.100 
- Fold3: size=10, decay=0.100 
+ Fold3: size=15, decay=0.100 
- Fold3: size=15, decay=0.100 
+ Fold4: size= 5, decay=0.001 
- Fold4: size= 5, decay=0.001 
+ Fold4: size=10, decay=0.001 
- Fold4: size=10, decay=0.001 
+ Fold4: size=15, decay=0.001 
- Fold4: size=15, decay=0.001 
+ Fold4: size= 5, decay=0.010 
- Fold4: size= 5, decay=0.010 
+ Fold4: size=10, decay=0.010 
- Fold4: size=10, decay=0.010 
+ Fold4: size=15, decay=0.010 
- Fold4: size=15, decay=0.010 
+ Fold4: size= 5, decay=0.100 
- Fold4: size= 5, decay=0.100 
+ Fold4: size=10, decay=0.100 
- Fold4: size=10, decay=0.100 
+ Fold4: size=15, decay=0.100 
- Fold4: size=15, decay=0.100 
+ Fold5: size= 5, decay=0.001 
- Fold5: size= 5, decay=0.001 
+ Fold5: size=10, decay=0.001 
- Fold5: size=10, decay=0.001 
+ Fold5: size=15, decay=0.001 
- Fold5: size=15, decay=0.001 
+ Fold5: size= 5, decay=0.010 
- Fold5: size= 5, decay=0.010 
+ Fold5: size=10, decay=0.010 
- Fold5: size=10, decay=0.010 
+ Fold5: size=15, decay=0.010 
- Fold5: size=15, decay=0.010 
+ Fold5: size= 5, decay=0.100 
- Fold5: size= 5, decay=0.100 
+ Fold5: size=10, decay=0.100 
- Fold5: size=10, decay=0.100 
+ Fold5: size=15, decay=0.100 
- Fold5: size=15, decay=0.100 
Aggregating results
Selecting tuning parameters
Fitting size = 10, decay = 0.01 on full training set
Code
print(lr_cv)
glmnet 

33745 samples
   23 predictor
    2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 26996, 26996, 26996, 26996, 26996 
Resampling results across tuning parameters:

  alpha  lambda  ROC        Sens       Spec      
  0.0    0.001   0.9513692  0.9546580  0.68933333
  0.0    0.012   0.9513692  0.9546580  0.68933333
  0.0    0.023   0.9513404  0.9547723  0.68840000
  0.0    0.034   0.9503969  0.9606020  0.66106667
  0.0    0.045   0.9496148  0.9644885  0.63586667
  0.0    0.056   0.9489433  0.9687560  0.61146667
  0.0    0.067   0.9483548  0.9722995  0.58760000
  0.0    0.078   0.9478280  0.9755382  0.56346667
  0.0    0.089   0.9473556  0.9782054  0.53946667
  0.0    0.100   0.9469073  0.9813298  0.51386667
  0.5    0.001   0.9546551  0.9393789  0.74453333
  0.5    0.012   0.9525433  0.9471899  0.71066667
  0.5    0.023   0.9492251  0.9529053  0.67360000
  0.5    0.034   0.9464651  0.9597638  0.62933333
  0.5    0.045   0.9440454  0.9656315  0.57840000
  0.5    0.056   0.9427355  0.9725281  0.52893333
  0.5    0.067   0.9423622  0.9798819  0.48013333
  0.5    0.078   0.9421940  0.9858259  0.42960000
  0.5    0.089   0.9419933  0.9898266  0.36600000
  0.5    0.100   0.9416990  0.9935226  0.30453333
  1.0    0.001   0.9546706  0.9393408  0.74506667
  1.0    0.012   0.9506973  0.9454753  0.70840000
  1.0    0.023   0.9445611  0.9519528  0.65400000
  1.0    0.034   0.9426180  0.9607925  0.59480000
  1.0    0.045   0.9419336  0.9711564  0.52720000
  1.0    0.056   0.9405963  0.9823204  0.44466667
  1.0    0.067   0.9381269  0.9894456  0.33253333
  1.0    0.078   0.9352390  0.9941703  0.20546667
  1.0    0.089   0.9334380  0.9975233  0.09706667
  1.0    0.100   0.9295429  0.9996190  0.02186667

ROC was used to select the optimal model using the largest value.
The final values used for the model were alpha = 1 and lambda = 0.001.
Code
print(rf_cv)
Random Forest 

33745 samples
   23 predictor
    2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 26996, 26996, 26996, 26996, 26996 
Resampling results across tuning parameters:

  mtry  min.node.size  ROC        Sens       Spec     
  3      5             0.9727072  0.9801486  0.7450667
  3     10             0.9726066  0.9801105  0.7430667
  5      5             0.9744779  0.9748905  0.7676000
  5     10             0.9744929  0.9752715  0.7660000
  7      5             0.9750875  0.9740141  0.7733333
  7     10             0.9747832  0.9740903  0.7730667

Tuning parameter 'splitrule' was held constant at a value of gini
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 7, splitrule = gini
 and min.node.size = 5.
Code
print(nnet_cv)
Neural Network 

33745 samples
   23 predictor
    2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 26996, 26996, 26996, 26996, 26996 
Resampling results across tuning parameters:

  size  decay  ROC        Sens       Spec     
   5    0.001  0.5000000  1.0000000  0.0000000
   5    0.010  0.8072246  0.9362926  0.4952000
   5    0.100  0.8834763  0.9232235  0.6530667
  10    0.001  0.7131240  0.9476853  0.4129333
  10    0.010  0.8989247  0.9007811  0.6910667
  10    0.100  0.8643101  0.8982282  0.6724000
  15    0.001  0.7150516  0.9602210  0.3084000
  15    0.010  0.8140593  0.9367880  0.4981333
  15    0.100  0.8524729  0.9152982  0.5858667

ROC was used to select the optimal model using the largest value.
The final values used for the model were size = 10 and decay = 0.01.

5 Predictions

Code
# Making predictions with Logistic Regression on the test set
lr_test_preds <- predict(lr_cv, test_data, type = "prob")[, "Yes"]
lr_test_class <- ifelse(lr_test_preds > 0.5, "Yes", "No")

lr_conf_matrix <- confusionMatrix(
  as.factor(lr_test_class), 
  as.factor(test_data$loan_status), 
  positive = "Yes"
)

# Displaying the Confusion Matrix and Metrics
print(lr_conf_matrix)
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  8192  627
       Yes  556 1873
                                         
               Accuracy : 0.8948         
                 95% CI : (0.889, 0.9004)
    No Information Rate : 0.7777         
    P-Value [Acc > NIR] : < 2e-16        
                                         
                  Kappa : 0.6927         
                                         
 Mcnemar's Test P-Value : 0.04183        
                                         
            Sensitivity : 0.7492         
            Specificity : 0.9364         
         Pos Pred Value : 0.7711         
         Neg Pred Value : 0.9289         
             Prevalence : 0.2223         
         Detection Rate : 0.1665         
   Detection Prevalence : 0.2159         
      Balanced Accuracy : 0.8428         
                                         
       'Positive' Class : Yes            
                                         
Code
# Making predictions with Random Forest on the test set
rf_test_preds <- predict(rf_cv, test_data, type = "prob")[, "Yes"]
rf_test_class <- ifelse(rf_test_preds > 0.5, "Yes", "No")
rf_conf_matrix <- confusionMatrix(
  as.factor(rf_test_class), 
  as.factor(test_data$loan_status), 
  positive = "Yes"
)

# Displaying the Confusion Matrix and Metrics
print(rf_conf_matrix)
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  8505  585
       Yes  243 1915
                                          
               Accuracy : 0.9264          
                 95% CI : (0.9214, 0.9311)
    No Information Rate : 0.7777          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.7761          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.7660          
            Specificity : 0.9722          
         Pos Pred Value : 0.8874          
         Neg Pred Value : 0.9356          
             Prevalence : 0.2223          
         Detection Rate : 0.1703          
   Detection Prevalence : 0.1919          
      Balanced Accuracy : 0.8691          
                                          
       'Positive' Class : Yes             
                                          
Code
# Making predictions with Neural Network on the test set
nnet_test_preds <- predict(nnet_cv, test_data, type = "prob")[, "Yes"]
nnet_test_class <- ifelse(nnet_test_preds > 0.5, "Yes", "No")

# Generating the Confusion Matrix for Neural Network
nnet_conf_matrix <- confusionMatrix(
  as.factor(nnet_test_class), 
  as.factor(test_data$loan_status), 
  positive = "Yes"
)

# Displaying the Confusion Matrix and Metrics
print(nnet_conf_matrix)
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  8361 1532
       Yes  387  968
                                          
               Accuracy : 0.8294          
                 95% CI : (0.8223, 0.8363)
    No Information Rate : 0.7777          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.41            
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.38720         
            Specificity : 0.95576         
         Pos Pred Value : 0.71439         
         Neg Pred Value : 0.84514         
             Prevalence : 0.22226         
         Detection Rate : 0.08606         
   Detection Prevalence : 0.12047         
      Balanced Accuracy : 0.67148         
                                          
       'Positive' Class : Yes             
                                          
Code
# Load necessary library
library(pROC)
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'
The following objects are masked from 'package:stats':

    cov, smooth, var
Code
library(caret)

# Evaluation Function (with ROC Calculation)
evaluate_model <- function(true_labels, predicted_probs, predicted_classes) {
  roc_curve <- roc(true_labels, predicted_probs)
  auc <- auc(roc_curve)
  
  # Note: positive = "Yes" is not set here, so caret treats the first factor
  # level ("No") as the positive class; Precision/Recall/F1 in the comparison
  # table therefore describe the "No" (rejected) class, unlike the earlier
  # confusion matrices, which used positive = "Yes".
  confusion <- confusionMatrix(as.factor(predicted_classes), as.factor(true_labels))
  
  list(
    Accuracy = confusion$overall['Accuracy'],
    Precision = confusion$byClass['Precision'],
    Recall = confusion$byClass['Recall'],
    F1_Score = confusion$byClass['F1'],
    AUC = auc,
    ROC_Curve = roc_curve  # Storing the ROC curve for later plotting
  )
}

# Calculating ROC and other metrics for each model
lr_results <- evaluate_model(test_data$loan_status, lr_test_preds, lr_test_class)
Setting levels: control = No, case = Yes
Setting direction: controls < cases
Code
rf_results <- evaluate_model(test_data$loan_status, rf_test_preds, rf_test_class)
Setting levels: control = No, case = Yes
Setting direction: controls < cases
Code
nnet_results <- evaluate_model(test_data$loan_status, nnet_test_preds, nnet_test_class)
Setting levels: control = No, case = Yes
Setting direction: controls < cases
Code
# Displaying the Results in a Comparison Table
comparison_results <- data.frame(
  Model = c("Logistic Regression", "Random Forest", "Neural Network"),
  Accuracy = c(lr_results$Accuracy, rf_results$Accuracy, nnet_results$Accuracy),
  Precision = c(lr_results$Precision, rf_results$Precision, nnet_results$Precision),
  Recall = c(lr_results$Recall, rf_results$Recall, nnet_results$Recall),
  F1_Score = c(lr_results$F1_Score, rf_results$F1_Score, nnet_results$F1_Score),
  AUC = c(lr_results$AUC, rf_results$AUC, nnet_results$AUC)
)

print(comparison_results)
                Model  Accuracy Precision    Recall  F1_Score       AUC
1 Logistic Regression 0.8948257 0.9289035 0.9364426 0.9326578 0.9522683
2       Random Forest 0.9263869 0.9356436 0.9722222 0.9535822 0.9733427
3      Neural Network 0.8293919 0.8451430 0.9557613 0.8970549 0.7080928
Code
plot(1 - lr_results$ROC_Curve$specificities, lr_results$ROC_Curve$sensitivities, 
     col = "blue", type = "l", main = "ROC Curve Comparison",
     xlim = c(0, 1), ylim = c(0, 1), 
     xlab = "False Positive Rate (1 - Specificity)", 
     ylab = "True Positive Rate (Sensitivity)", lwd = 2)
lines(1 - rf_results$ROC_Curve$specificities, rf_results$ROC_Curve$sensitivities, col = "green", lwd = 2)
lines(1 - nnet_results$ROC_Curve$specificities, nnet_results$ROC_Curve$sensitivities, col = "black", lwd = 2)

# Adding the Random Guess Line (Diagonal)
abline(a=0, b=1, col="red", lty = 2, lwd = 2)

# Adding a clean, clear legend
legend("bottomright", 
       legend = c("Logistic Regression", "Random Forest", "Neural Network", "Random Guess"),
       col = c("blue", "green", "black", "red"), 
       lwd = 2, lty = c(1, 1, 1, 2))

6 Important Features

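The two plots below reference lr_importance_df and rf_importance_df, which are not constructed anywhere above. Here is a minimal sketch of how they could be derived from the fitted caret models using varImp(); it assumes rf_cv was trained with importance = "impurity" (as in the ranger call above) and uses varImp()'s default 0–100 scaling.

Code
# Deriving feature-importance data frames from the fitted caret models
lr_imp <- varImp(lr_cv)$importance
lr_importance_df <- data.frame(Feature = rownames(lr_imp),
                               Importance = lr_imp$Overall) |>
  arrange(desc(Importance))

rf_imp <- varImp(rf_cv)$importance   # requires importance = "impurity" in train()
rf_importance_df <- data.frame(Feature = rownames(rf_imp),
                               Importance = rf_imp$Overall) |>
  arrange(desc(Importance))
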
Code
# Plotting Feature Importance (Logistic Regression)
ggplot(lr_importance_df[1:10, ], aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 10 Feature Importance (Logistic Regression)",
       x = "Feature", y = "Importance")

Code
# Random Forest Feature Importance
ggplot(rf_importance_df[1:10, ], aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "darkgreen") +
  coord_flip() +
  labs(title = "Top 10 Feature Importance (Random Forest - Ranger)",
       x = "Feature", y = "Importance")

7 Conclusion

The dataset was initially explored, and severe unexplained outliers were identified and removed to ensure data quality. Following this, models were trained using a 5-fold cross-validation approach, with a grid search to test multiple hyperparameters for each model. The ROC metric was used as the primary criterion for selecting the best-performing models.

After cross-validation, the top-performing models were evaluated on a held-out test set that was excluded from training. The Random Forest model, with mtry = 7, splitrule = “gini”, and min.node.size = 5, achieved the highest performance, as shown in the ROC plot and results table.

Feature importance was then extracted from the Logistic Regression (Lasso) and Random Forest models, revealing that income, previous loan defaults, and the loan-to-income ratio were the most influential variables in determining loan approval outcomes. These features were consistently ranked highest across both models, highlighting their critical impact.

The Neural Network model performed above a random classifier but fell short of the Logistic Regression and Random Forest models. This may reflect the absence of input centering and scaling (to which nnet is sensitive), limitations in the available data, or insufficient compute for tuning. Even without further tuning, however, the Random Forest model demonstrated strong performance on unseen data, making it a robust choice for this task.