Bank Loan Classifier

Author

Darwhin Gomez

1 Overview

Financial institutions generate revenue by providing a wide range of services, including access to capital through loans, credit cards, and other financial products. One of the critical challenges they face is determining whether a client should be approved for a loan or credit line. This decision requires careful consideration of multiple factors, including the applicant’s income, existing debts, age, assets, and other financial metrics.

Machine learning can be applied to this problem by leveraging historical data on past loan decisions to build predictive models that classify whether a client should be approved. In this project, I train three different machine learning models on a labeled financial dataset to classify clients as either “approved” or “not approved.”

However, the use of machine learning in credit approval must adhere to regulatory requirements that prioritize transparency and fairness. Financial institutions are often required by law to provide clear explanations for their credit decisions, ensuring that customers understand why they were approved or denied. This regulatory requirement is driven by principles of consumer protection and fairness, as outlined in laws such as the Fair Credit Reporting Act (FCRA) in the United States and the General Data Protection Regulation (GDPR) in Europe.

This need for transparency directly influences the choice of machine learning models used in credit decisioning. While complex models like Random Forests and Gradient Boosting can offer high accuracy, they are often considered “black boxes” due to their lack of interpretability. In contrast, simpler models like Logistic Regression provide clear and easily understandable explanations for decisions, making them more compliant with regulatory standards.

In this project, I will implement and compare three machine learning models:

  • Logistic Regression: A transparent and interpretable model whose coefficients show directly how each feature shifts the approval odds (see the sketch below).

  • Random Forest: A more complex model that can capture non-linear relationships but requires additional methods for interpretation.

  • Neural Network: A feed-forward network for binary classification, included to gauge the effectiveness of deep learning in this context. It can capture intricate patterns but offers little interpretability.

By comparing these models, I aim to not only achieve high predictive performance but also maintain a balance between accuracy and transparency, ensuring that the final model can be effectively explained to both customers and regulatory bodies.
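
To make the interpretability point concrete, below is a minimal sketch of how a logistic regression coefficient can be turned into a plain-language explanation of a credit decision. The coefficient value is hypothetical; the actual model is fitted in Section 4.

Code
# Illustrative only: converting a hypothetical logistic regression
# coefficient into an odds-ratio explanation for a credit decision.
beta_lpi <- 4.2                      # assumed coefficient on loan_percent_income (log-odds scale)
odds_ratio <- exp(0.10 * beta_lpi)   # effect of a 0.10 increase in the ratio
cat(sprintf("Raising loan_percent_income by 0.10 multiplies the approval odds by %.2f.\n",
            odds_ratio))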

2 EDA

2.1 Data Description

| Column | Description | Type |
|--------|-------------|------|
| person_age | Age of the person | Float |
| person_gender | Gender of the person | Categorical |
| person_education | Highest education level | Categorical |
| person_income | Annual income | Float |
| person_emp_exp | Years of employment experience | Integer |
| person_home_ownership | Home ownership status (e.g., rent, own, mortgage) | Categorical |
| loan_amnt | Loan amount requested | Float |
| loan_intent | Purpose of the loan | Categorical |
| loan_int_rate | Loan interest rate | Float |
| loan_percent_income | Loan amount as a percentage of annual income | Float |
| cb_person_cred_hist_length | Length of credit history in years | Float |
| credit_score | Credit score of the person | Integer |
| previous_loan_defaults_on_file | Indicator of previous loan defaults | Categorical |
| loan_status (target variable) | Loan approval status: 1 = approved; 0 = rejected | Integer |
Code
 # Libraries used throughout the analysis
 library(dplyr)     # data manipulation
 library(ggplot2)   # plotting
 library(gridExtra) # arranging plots side by side
 library(skimr)     # data summaries
 library(caret)     # model training and evaluation

 loan_data <- read.csv("loan_data.csv")
 str(loan_data)
'data.frame':   45000 obs. of  14 variables:
 $ person_age                    : num  22 21 25 23 24 21 26 24 24 21 ...
 $ person_gender                 : chr  "female" "female" "female" "female" ...
 $ person_education              : chr  "Master" "High School" "High School" "Bachelor" ...
 $ person_income                 : num  71948 12282 12438 79753 66135 ...
 $ person_emp_exp                : int  0 0 3 0 1 0 1 5 3 0 ...
 $ person_home_ownership         : chr  "RENT" "OWN" "MORTGAGE" "RENT" ...
 $ loan_amnt                     : num  35000 1000 5500 35000 35000 2500 35000 35000 35000 1600 ...
 $ loan_intent                   : chr  "PERSONAL" "EDUCATION" "MEDICAL" "MEDICAL" ...
 $ loan_int_rate                 : num  16 11.1 12.9 15.2 14.3 ...
 $ loan_percent_income           : num  0.49 0.08 0.44 0.44 0.53 0.19 0.37 0.37 0.35 0.13 ...
 $ cb_person_cred_hist_length    : num  3 2 3 2 4 2 3 4 2 3 ...
 $ credit_score                  : int  561 504 635 675 586 532 701 585 544 640 ...
 $ previous_loan_defaults_on_file: chr  "No" "Yes" "No" "No" ...
 $ loan_status                   : int  1 0 1 1 1 1 1 1 1 1 ...
Code
 skim(loan_data)
Data summary
Name loan_data
Number of rows 45000
Number of columns 14
_______________________
Column type frequency:
character 5
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
person_gender 0 1 4 6 0 2 0
person_education 0 1 6 11 0 5 0
person_home_ownership 0 1 3 8 0 4 0
loan_intent 0 1 7 17 0 6 0
previous_loan_defaults_on_file 0 1 2 3 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
person_age 0 1 27.76 6.05 20.00 24.00 26.00 30.00 144.00 ▇▁▁▁▁
person_income 0 1 80319.05 80422.50 8000.00 47204.00 67048.00 95789.25 7200766.00 ▇▁▁▁▁
person_emp_exp 0 1 5.41 6.06 0.00 1.00 4.00 8.00 125.00 ▇▁▁▁▁
loan_amnt 0 1 9583.16 6314.89 500.00 5000.00 8000.00 12237.25 35000.00 ▇▆▂▁▁
loan_int_rate 0 1 11.01 2.98 5.42 8.59 11.01 12.99 20.00 ▆▇▆▃▁
loan_percent_income 0 1 0.14 0.09 0.00 0.07 0.12 0.19 0.66 ▇▅▁▁▁
cb_person_cred_hist_length 0 1 5.87 3.88 2.00 3.00 4.00 8.00 30.00 ▇▂▁▁▁
credit_score 0 1 632.61 50.44 390.00 601.00 640.00 670.00 850.00 ▁▂▇▃▁
loan_status 0 1 0.22 0.42 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
Code
table(loan_data$person_education)

  Associate    Bachelor   Doctorate High School      Master 
      12028       13399         621       11972        6980 
Code
table(loan_data$person_gender)

female   male 
 20159  24841 
Code
table(loan_data$loan_intent)

DEBTCONSOLIDATION         EDUCATION   HOMEIMPROVEMENT           MEDICAL 
             7145              9153              4783              8548 
         PERSONAL           VENTURE 
             7552              7819 
Code
table(loan_data$person_home_ownership)

MORTGAGE    OTHER      OWN     RENT 
   18489      117     2951    23443 
Code
table(loan_data$previous_loan_defaults_on_file)

   No   Yes 
22142 22858 
Code
barplot(table(loan_data$person_education), 
        main = "Distribution of Education Levels",
        col = "grey", 
        las = 2, 
        cex.names = 0.8)

Code
barplot(table(loan_data$person_gender), 
        main = "Distribution of Gender",
        col = "grey", 
        las = 2, 
        cex.names = 0.8)

Code
barplot(table(loan_data$loan_intent), 
        main = "Distribution of Loan Intent",
        col = "grey", 
        las = 2, 
        cex.names = 0.8)

Code
barplot(table(loan_data$person_home_ownership), 
        main = "Distribution of Home Ownership",
        col = "grey", 
        las = 2, 
        cex.names = 0.8)

Code
barplot(table(loan_data$previous_loan_defaults_on_file), 
        main = "Distribution of Previous Loan Defaults",
        col = "grey",  
        las = 2, 
        cex.names = 0.8)

Code
numerical_features <- loan_data %>% 
  select_if(is.numeric) %>%
  select(-loan_status) %>%
  colnames()

# Create box plots and histograms for each numerical feature
for (feature in numerical_features) {
  # Boxplot
  boxplot_plot <- ggplot(loan_data, aes(y = !!sym(feature))) +
    geom_boxplot(fill = "skyblue", color = "darkblue", outlier.color = "red") +
    labs(title = paste("Boxplot of", feature), y = feature, x = "") +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 30, face = "bold"),
      axis.title.y = element_text(size = 30),
      axis.text.y = element_text(size = 25),
      axis.text.x = element_text(size = 25)
    )
  
  # Histogram
  hist_plot <- ggplot(loan_data, aes(x = !!sym(feature))) +
    geom_histogram(aes(y = after_stat(density)), fill = "lightgreen", color = "darkgreen", bins = 30) +
    geom_density(color = "red", linewidth = 1) +
    labs(title = paste("Histogram of", feature), x = feature, y = "Density") +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 30, face = "bold"),
      axis.title.x = element_text(size = 20),
      axis.title.y = element_text(size = 20),
      axis.text.x = element_text(size = 15),
      axis.text.y = element_text(size = 15)
    )
  
  # Display plots side by side
  grid.arrange(boxplot_plot, hist_plot, ncol = 2)
}

Code
loan_data |>
  filter(person_age > 100) |>
  print()
  person_age person_gender person_education person_income person_emp_exp
1        144          male         Bachelor        300616            125
2        144          male        Associate        241424            121
3        123        female      High School         97140            101
4        123          male         Bachelor         94723            100
5        144        female        Associate       7200766            124
6        116          male         Bachelor       5545545             93
7        109          male      High School       5556399             85
  person_home_ownership loan_amnt loan_intent loan_int_rate loan_percent_income
1                  RENT      4800     VENTURE         13.57                0.02
2              MORTGAGE      6000   EDUCATION         11.86                0.02
3                  RENT     20400   EDUCATION         10.25                0.21
4                  RENT     20000     VENTURE         11.01                0.21
5              MORTGAGE      5000    PERSONAL         12.73                0.00
6              MORTGAGE      3823     VENTURE         12.15                0.00
7              MORTGAGE      6195     VENTURE         12.58                0.00
  cb_person_cred_hist_length credit_score previous_loan_defaults_on_file
1                          3          789                             No
2                          2          807                             No
3                          3          805                            Yes
4                          4          714                            Yes
5                         25          850                             No
6                         24          708                             No
7                         22          792                             No
  loan_status
1           0
2           0
3           0
4           0
5           0
6           0
7           0
Code
table(loan_data$loan_status)

    0     1 
35000 10000 
Code
# Bar plot of the binary target (a histogram adds little for a 0/1 variable)
barplot(table(loan_data$loan_status),
        main = "Distribution of Loan Status",
        col = "grey")

2.2 Outlier Detection and Removal

During the exploratory data analysis, clear outliers emerged in the dataset, particularly in the person_age variable. Some applicants were listed as over 100 years old, with the highest age reaching 144. Given that it is highly unlikely for individuals of such advanced age to be applying for new loans, these records were deemed unrealistic and removed from the dataset. This ensures the model is trained on more reliable data and reduces the risk of skewed predictions due to extreme values.

Code
# Removing rows where person_age is 100 or more
loan_data_cleaned <- loan_data %>% 
  filter(person_age < 100)

3 Preprocessing

Code
# Binary Encoding
loan_data_cleaned <- loan_data_cleaned %>%
  mutate(
    person_gender = ifelse(person_gender == "male", 1, 0),
    previous_loan_defaults_on_file = ifelse(previous_loan_defaults_on_file == "Yes", 1, 0)
  )

# One-Hot Encoding for Nominal Features
loan_data_encoded <- loan_data_cleaned %>%
  mutate(
    person_education = as.factor(person_education),
    person_home_ownership = as.factor(person_home_ownership),
    loan_intent = as.factor(loan_intent)
  ) %>%
  # Applying one-hot encoding
  model.matrix(~ . - 1, data = .) %>% 
  as.data.frame()
# Convert the target variable to a clean binary factor (Yes/No)
loan_data_encoded$loan_status <- factor(loan_data_encoded$loan_status, 
                                        levels = c("0", "1"), 
                                        labels = c("No", "Yes"))
head(loan_data_encoded,10)
   person_age person_gender person_educationAssociate person_educationBachelor
1          22             0                         0                        0
2          21             0                         0                        0
3          25             0                         0                        0
4          23             0                         0                        1
5          24             1                         0                        0
6          21             0                         0                        0
7          26             0                         0                        1
8          24             0                         0                        0
9          24             0                         1                        0
10         21             0                         0                        0
   person_educationDoctorate person_educationHigh School person_educationMaster
1                          0                           0                      1
2                          0                           1                      0
3                          0                           1                      0
4                          0                           0                      0
5                          0                           0                      1
6                          0                           1                      0
7                          0                           0                      0
8                          0                           1                      0
9                          0                           0                      0
10                         0                           1                      0
   person_income person_emp_exp person_home_ownershipOTHER
1          71948              0                          0
2          12282              0                          0
3          12438              3                          0
4          79753              0                          0
5          66135              1                          0
6          12951              0                          0
7          93471              1                          0
8          95550              5                          0
9         100684              3                          0
10         12739              0                          0
   person_home_ownershipOWN person_home_ownershipRENT loan_amnt
1                         0                         1     35000
2                         1                         0      1000
3                         0                         0      5500
4                         0                         1     35000
5                         0                         1     35000
6                         1                         0      2500
7                         0                         1     35000
8                         0                         1     35000
9                         0                         1     35000
10                        1                         0      1600
   loan_intentEDUCATION loan_intentHOMEIMPROVEMENT loan_intentMEDICAL
1                     0                          0                  0
2                     1                          0                  0
3                     0                          0                  1
4                     0                          0                  1
5                     0                          0                  1
6                     0                          0                  0
7                     1                          0                  0
8                     0                          0                  1
9                     0                          0                  0
10                    0                          0                  0
   loan_intentPERSONAL loan_intentVENTURE loan_int_rate loan_percent_income
1                    1                  0         16.02                0.49
2                    0                  0         11.14                0.08
3                    0                  0         12.87                0.44
4                    0                  0         15.23                0.44
5                    0                  0         14.27                0.53
6                    0                  1          7.14                0.19
7                    0                  0         12.42                0.37
8                    0                  0         11.11                0.37
9                    1                  0          8.90                0.35
10                   0                  1         14.74                0.13
   cb_person_cred_hist_length credit_score previous_loan_defaults_on_file
1                           3          561                              0
2                           2          504                              1
3                           3          635                              0
4                           2          675                              0
5                           4          586                              0
6                           2          532                              0
7                           3          701                              0
8                           4          585                              0
9                           2          544                              0
10                          3          640                              0
   loan_status
1          Yes
2           No
3          Yes
4          Yes
5          Yes
6          Yes
7          Yes
8          Yes
9          Yes
10         Yes
Code
table(loan_data_encoded$loan_status)

   No   Yes 
34993 10000 

4 Modeling

Code
# Splitting the data into training and test sets (75/25)
set.seed(123)  # seed for reproducibility (the specific value is illustrative)
train_index <- createDataPartition(loan_data_encoded$loan_status, p = 0.75, list = FALSE)
train_data <- loan_data_encoded[train_index, ]
test_data <- loan_data_encoded[-train_index, ]

# Cross-validation (5-fold)
cv_control <- trainControl(
  method = "cv", 
  number = 5, 
  verboseIter = TRUE,
  classProbs = TRUE, 
  summaryFunction = twoClassSummary
)
train_data$loan_status <- as.factor(train_data$loan_status)
test_data$loan_status <- as.factor(test_data$loan_status)

# Check the levels to ensure it is binary
levels(train_data$loan_status)
[1] "No"  "Yes"
Code
head(train_data)
  person_age person_gender person_educationAssociate person_educationBachelor
2         21             0                         0                        0
3         25             0                         0                        0
4         23             0                         0                        1
6         21             0                         0                        0
7         26             0                         0                        1
8         24             0                         0                        0
  person_educationDoctorate person_educationHigh School person_educationMaster
2                         0                           1                      0
3                         0                           1                      0
4                         0                           0                      0
6                         0                           1                      0
7                         0                           0                      0
8                         0                           1                      0
  person_income person_emp_exp person_home_ownershipOTHER
2         12282              0                          0
3         12438              3                          0
4         79753              0                          0
6         12951              0                          0
7         93471              1                          0
8         95550              5                          0
  person_home_ownershipOWN person_home_ownershipRENT loan_amnt
2                        1                         0      1000
3                        0                         0      5500
4                        0                         1     35000
6                        1                         0      2500
7                        0                         1     35000
8                        0                         1     35000
  loan_intentEDUCATION loan_intentHOMEIMPROVEMENT loan_intentMEDICAL
2                    1                          0                  0
3                    0                          0                  1
4                    0                          0                  1
6                    0                          0                  0
7                    1                          0                  0
8                    0                          0                  1
  loan_intentPERSONAL loan_intentVENTURE loan_int_rate loan_percent_income
2                   0                  0         11.14                0.08
3                   0                  0         12.87                0.44
4                   0                  0         15.23                0.44
6                   0                  1          7.14                0.19
7                   0                  0         12.42                0.37
8                   0                  0         11.11                0.37
  cb_person_cred_hist_length credit_score previous_loan_defaults_on_file
2                          2          504                              1
3                          3          635                              0
4                          2          675                              0
6                          2          532                              0
7                          3          701                              0
8                          4          585                              0
  loan_status
2          No
3         Yes
4         Yes
6         Yes
7         Yes
8         Yes
Code
### Logistic Regression with Grid Search (CV) ###
lr_grid <- expand.grid(
  alpha = c(0, 0.5, 1),                  # 0 = Ridge, 1 = Lasso, 0.5 = Elastic Net
  lambda = seq(0.001, 0.1, length = 10)  # Regularization strength
)
### Logistic Regression with CV ###
lr_cv <- train(
  loan_status ~ ., 
  data = train_data, 
  method = "glmnet",    
  trControl = cv_control,
  tuneGrid = lr_grid,
  metric = "ROC"
)
+ Fold1: alpha=0.0, lambda=0.1 
- Fold1: alpha=0.0, lambda=0.1 
+ Fold1: alpha=0.5, lambda=0.1 
- Fold1: alpha=0.5, lambda=0.1 
+ Fold1: alpha=1.0, lambda=0.1 
- Fold1: alpha=1.0, lambda=0.1 
+ Fold2: alpha=0.0, lambda=0.1 
- Fold2: alpha=0.0, lambda=0.1 
+ Fold2: alpha=0.5, lambda=0.1 
- Fold2: alpha=0.5, lambda=0.1 
+ Fold2: alpha=1.0, lambda=0.1 
- Fold2: alpha=1.0, lambda=0.1 
+ Fold3: alpha=0.0, lambda=0.1 
- Fold3: alpha=0.0, lambda=0.1 
+ Fold3: alpha=0.5, lambda=0.1 
- Fold3: alpha=0.5, lambda=0.1 
+ Fold3: alpha=1.0, lambda=0.1 
- Fold3: alpha=1.0, lambda=0.1 
+ Fold4: alpha=0.0, lambda=0.1 
- Fold4: alpha=0.0, lambda=0.1 
+ Fold4: alpha=0.5, lambda=0.1 
- Fold4: alpha=0.5, lambda=0.1 
+ Fold4: alpha=1.0, lambda=0.1 
- Fold4: alpha=1.0, lambda=0.1 
+ Fold5: alpha=0.0, lambda=0.1 
- Fold5: alpha=0.0, lambda=0.1 
+ Fold5: alpha=0.5, lambda=0.1 
- Fold5: alpha=0.5, lambda=0.1 
+ Fold5: alpha=1.0, lambda=0.1 
- Fold5: alpha=1.0, lambda=0.1 
Aggregating results
Selecting tuning parameters
Fitting alpha = 1, lambda = 0.001 on full training set
Code
rf_grid <- expand.grid(
  mtry = c(3, 5, 7),             
  splitrule = "gini",            # Split rule (Gini for classification)
  min.node.size = c(5, 10)       # Minimum leaf size
)

rf_cv <- train(
  loan_status ~ ., 
  data = train_data, 
  method = "ranger",        # Fast Random Forest implementation
  trControl = cv_control,
  tuneGrid = rf_grid,
  metric = "ROC",
  importance = "impurity"   # passed through to ranger; needed for varImp() in Section 6
)
+ Fold1: mtry=3, splitrule=gini, min.node.size= 5 
- Fold1: mtry=3, splitrule=gini, min.node.size= 5 
+ Fold1: mtry=5, splitrule=gini, min.node.size= 5 
- Fold1: mtry=5, splitrule=gini, min.node.size= 5 
+ Fold1: mtry=7, splitrule=gini, min.node.size= 5 
- Fold1: mtry=7, splitrule=gini, min.node.size= 5 
+ Fold1: mtry=3, splitrule=gini, min.node.size=10 
- Fold1: mtry=3, splitrule=gini, min.node.size=10 
+ Fold1: mtry=5, splitrule=gini, min.node.size=10 
- Fold1: mtry=5, splitrule=gini, min.node.size=10 
+ Fold1: mtry=7, splitrule=gini, min.node.size=10 
- Fold1: mtry=7, splitrule=gini, min.node.size=10 
+ Fold2: mtry=3, splitrule=gini, min.node.size= 5 
- Fold2: mtry=3, splitrule=gini, min.node.size= 5 
+ Fold2: mtry=5, splitrule=gini, min.node.size= 5 
- Fold2: mtry=5, splitrule=gini, min.node.size= 5 
+ Fold2: mtry=7, splitrule=gini, min.node.size= 5 
- Fold2: mtry=7, splitrule=gini, min.node.size= 5 
+ Fold2: mtry=3, splitrule=gini, min.node.size=10 
- Fold2: mtry=3, splitrule=gini, min.node.size=10 
+ Fold2: mtry=5, splitrule=gini, min.node.size=10 
- Fold2: mtry=5, splitrule=gini, min.node.size=10 
+ Fold2: mtry=7, splitrule=gini, min.node.size=10 
- Fold2: mtry=7, splitrule=gini, min.node.size=10 
+ Fold3: mtry=3, splitrule=gini, min.node.size= 5 
- Fold3: mtry=3, splitrule=gini, min.node.size= 5 
+ Fold3: mtry=5, splitrule=gini, min.node.size= 5 
- Fold3: mtry=5, splitrule=gini, min.node.size= 5 
+ Fold3: mtry=7, splitrule=gini, min.node.size= 5 
- Fold3: mtry=7, splitrule=gini, min.node.size= 5 
+ Fold3: mtry=3, splitrule=gini, min.node.size=10 
- Fold3: mtry=3, splitrule=gini, min.node.size=10 
+ Fold3: mtry=5, splitrule=gini, min.node.size=10 
- Fold3: mtry=5, splitrule=gini, min.node.size=10 
+ Fold3: mtry=7, splitrule=gini, min.node.size=10 
- Fold3: mtry=7, splitrule=gini, min.node.size=10 
+ Fold4: mtry=3, splitrule=gini, min.node.size= 5 
- Fold4: mtry=3, splitrule=gini, min.node.size= 5 
+ Fold4: mtry=5, splitrule=gini, min.node.size= 5 
- Fold4: mtry=5, splitrule=gini, min.node.size= 5 
+ Fold4: mtry=7, splitrule=gini, min.node.size= 5 
- Fold4: mtry=7, splitrule=gini, min.node.size= 5 
+ Fold4: mtry=3, splitrule=gini, min.node.size=10 
- Fold4: mtry=3, splitrule=gini, min.node.size=10 
+ Fold4: mtry=5, splitrule=gini, min.node.size=10 
- Fold4: mtry=5, splitrule=gini, min.node.size=10 
+ Fold4: mtry=7, splitrule=gini, min.node.size=10 
- Fold4: mtry=7, splitrule=gini, min.node.size=10 
+ Fold5: mtry=3, splitrule=gini, min.node.size= 5 
- Fold5: mtry=3, splitrule=gini, min.node.size= 5 
+ Fold5: mtry=5, splitrule=gini, min.node.size= 5 
- Fold5: mtry=5, splitrule=gini, min.node.size= 5 
+ Fold5: mtry=7, splitrule=gini, min.node.size= 5 
- Fold5: mtry=7, splitrule=gini, min.node.size= 5 
+ Fold5: mtry=3, splitrule=gini, min.node.size=10 
- Fold5: mtry=3, splitrule=gini, min.node.size=10 
+ Fold5: mtry=5, splitrule=gini, min.node.size=10 
- Fold5: mtry=5, splitrule=gini, min.node.size=10 
+ Fold5: mtry=7, splitrule=gini, min.node.size=10 
- Fold5: mtry=7, splitrule=gini, min.node.size=10 
Aggregating results
Selecting tuning parameters
Fitting mtry = 7, splitrule = gini, min.node.size = 5 on full training set
Code
nnet_grid <- expand.grid(
  size = c(5, 10, 15),   # Neurons in the hidden layer
  decay = c(0.001, 0.01, 0.1)  # Regularization parameter
)

### Cross-Validated NNet Training ###
# Note: predictors are not centered/scaled here; nnet is sensitive to feature
# scale, which may partly explain its weaker performance relative to the other models.
nnet_cv <- train(
  loan_status ~ ., 
  data = train_data, 
  method = "nnet",
  trControl = cv_control,
  tuneGrid = nnet_grid,
  metric = "ROC",   # Optimizing for AUC-ROC
  linout = FALSE,   # Sigmoid for binary classification
  trace = FALSE     # Suppress training output
)
+ Fold1: size= 5, decay=0.001 
- Fold1: size= 5, decay=0.001 
+ Fold1: size=10, decay=0.001 
- Fold1: size=10, decay=0.001 
+ Fold1: size=15, decay=0.001 
- Fold1: size=15, decay=0.001 
+ Fold1: size= 5, decay=0.010 
- Fold1: size= 5, decay=0.010 
+ Fold1: size=10, decay=0.010 
- Fold1: size=10, decay=0.010 
+ Fold1: size=15, decay=0.010 
- Fold1: size=15, decay=0.010 
+ Fold1: size= 5, decay=0.100 
- Fold1: size= 5, decay=0.100 
+ Fold1: size=10, decay=0.100 
- Fold1: size=10, decay=0.100 
+ Fold1: size=15, decay=0.100 
- Fold1: size=15, decay=0.100 
+ Fold2: size= 5, decay=0.001 
- Fold2: size= 5, decay=0.001 
+ Fold2: size=10, decay=0.001 
- Fold2: size=10, decay=0.001 
+ Fold2: size=15, decay=0.001 
- Fold2: size=15, decay=0.001 
+ Fold2: size= 5, decay=0.010 
- Fold2: size= 5, decay=0.010 
+ Fold2: size=10, decay=0.010 
- Fold2: size=10, decay=0.010 
+ Fold2: size=15, decay=0.010 
- Fold2: size=15, decay=0.010 
+ Fold2: size= 5, decay=0.100 
- Fold2: size= 5, decay=0.100 
+ Fold2: size=10, decay=0.100 
- Fold2: size=10, decay=0.100 
+ Fold2: size=15, decay=0.100 
- Fold2: size=15, decay=0.100 
+ Fold3: size= 5, decay=0.001 
- Fold3: size= 5, decay=0.001 
+ Fold3: size=10, decay=0.001 
- Fold3: size=10, decay=0.001 
+ Fold3: size=15, decay=0.001 
- Fold3: size=15, decay=0.001 
+ Fold3: size= 5, decay=0.010 
- Fold3: size= 5, decay=0.010 
+ Fold3: size=10, decay=0.010 
- Fold3: size=10, decay=0.010 
+ Fold3: size=15, decay=0.010 
- Fold3: size=15, decay=0.010 
+ Fold3: size= 5, decay=0.100 
- Fold3: size= 5, decay=0.100 
+ Fold3: size=10, decay=0.100 
- Fold3: size=10, decay=0.100 
+ Fold3: size=15, decay=0.100 
- Fold3: size=15, decay=0.100 
+ Fold4: size= 5, decay=0.001 
- Fold4: size= 5, decay=0.001 
+ Fold4: size=10, decay=0.001 
- Fold4: size=10, decay=0.001 
+ Fold4: size=15, decay=0.001 
- Fold4: size=15, decay=0.001 
+ Fold4: size= 5, decay=0.010 
- Fold4: size= 5, decay=0.010 
+ Fold4: size=10, decay=0.010 
- Fold4: size=10, decay=0.010 
+ Fold4: size=15, decay=0.010 
- Fold4: size=15, decay=0.010 
+ Fold4: size= 5, decay=0.100 
- Fold4: size= 5, decay=0.100 
+ Fold4: size=10, decay=0.100 
- Fold4: size=10, decay=0.100 
+ Fold4: size=15, decay=0.100 
- Fold4: size=15, decay=0.100 
+ Fold5: size= 5, decay=0.001 
- Fold5: size= 5, decay=0.001 
+ Fold5: size=10, decay=0.001 
- Fold5: size=10, decay=0.001 
+ Fold5: size=15, decay=0.001 
- Fold5: size=15, decay=0.001 
+ Fold5: size= 5, decay=0.010 
- Fold5: size= 5, decay=0.010 
+ Fold5: size=10, decay=0.010 
- Fold5: size=10, decay=0.010 
+ Fold5: size=15, decay=0.010 
- Fold5: size=15, decay=0.010 
+ Fold5: size= 5, decay=0.100 
- Fold5: size= 5, decay=0.100 
+ Fold5: size=10, decay=0.100 
- Fold5: size=10, decay=0.100 
+ Fold5: size=15, decay=0.100 
- Fold5: size=15, decay=0.100 
Aggregating results
Selecting tuning parameters
Fitting size = 10, decay = 0.01 on full training set
Code
print(lr_cv)
glmnet 

33745 samples
   23 predictor
    2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 26996, 26996, 26996, 26996, 26996 
Resampling results across tuning parameters:

  alpha  lambda  ROC        Sens       Spec      
  0.0    0.001   0.9513692  0.9546580  0.68933333
  0.0    0.012   0.9513692  0.9546580  0.68933333
  0.0    0.023   0.9513404  0.9547723  0.68840000
  0.0    0.034   0.9503969  0.9606020  0.66106667
  0.0    0.045   0.9496148  0.9644885  0.63586667
  0.0    0.056   0.9489433  0.9687560  0.61146667
  0.0    0.067   0.9483548  0.9722995  0.58760000
  0.0    0.078   0.9478280  0.9755382  0.56346667
  0.0    0.089   0.9473556  0.9782054  0.53946667
  0.0    0.100   0.9469073  0.9813298  0.51386667
  0.5    0.001   0.9546551  0.9393789  0.74453333
  0.5    0.012   0.9525433  0.9471899  0.71066667
  0.5    0.023   0.9492251  0.9529053  0.67360000
  0.5    0.034   0.9464651  0.9597638  0.62933333
  0.5    0.045   0.9440454  0.9656315  0.57840000
  0.5    0.056   0.9427355  0.9725281  0.52893333
  0.5    0.067   0.9423622  0.9798819  0.48013333
  0.5    0.078   0.9421940  0.9858259  0.42960000
  0.5    0.089   0.9419933  0.9898266  0.36600000
  0.5    0.100   0.9416990  0.9935226  0.30453333
  1.0    0.001   0.9546706  0.9393408  0.74506667
  1.0    0.012   0.9506973  0.9454753  0.70840000
  1.0    0.023   0.9445611  0.9519528  0.65400000
  1.0    0.034   0.9426180  0.9607925  0.59480000
  1.0    0.045   0.9419336  0.9711564  0.52720000
  1.0    0.056   0.9405963  0.9823204  0.44466667
  1.0    0.067   0.9381269  0.9894456  0.33253333
  1.0    0.078   0.9352390  0.9941703  0.20546667
  1.0    0.089   0.9334380  0.9975233  0.09706667
  1.0    0.100   0.9295429  0.9996190  0.02186667

ROC was used to select the optimal model using the largest value.
The final values used for the model were alpha = 1 and lambda = 0.001.
Code
print(rf_cv)
Random Forest 

33745 samples
   23 predictor
    2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 26996, 26996, 26996, 26996, 26996 
Resampling results across tuning parameters:

  mtry  min.node.size  ROC        Sens       Spec     
  3      5             0.9727072  0.9801486  0.7450667
  3     10             0.9726066  0.9801105  0.7430667
  5      5             0.9744779  0.9748905  0.7676000
  5     10             0.9744929  0.9752715  0.7660000
  7      5             0.9750875  0.9740141  0.7733333
  7     10             0.9747832  0.9740903  0.7730667

Tuning parameter 'splitrule' was held constant at a value of gini
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 7, splitrule = gini
 and min.node.size = 5.
Code
print(nnet_cv)
Neural Network 

33745 samples
   23 predictor
    2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 26996, 26996, 26996, 26996, 26996 
Resampling results across tuning parameters:

  size  decay  ROC        Sens       Spec     
   5    0.001  0.5000000  1.0000000  0.0000000
   5    0.010  0.8072246  0.9362926  0.4952000
   5    0.100  0.8834763  0.9232235  0.6530667
  10    0.001  0.7131240  0.9476853  0.4129333
  10    0.010  0.8989247  0.9007811  0.6910667
  10    0.100  0.8643101  0.8982282  0.6724000
  15    0.001  0.7150516  0.9602210  0.3084000
  15    0.010  0.8140593  0.9367880  0.4981333
  15    0.100  0.8524729  0.9152982  0.5858667

ROC was used to select the optimal model using the largest value.
The final values used for the model were size = 10 and decay = 0.01.

5 Predictions

Code
# Making predictions with Logistic Regression on the test set
lr_test_preds <- predict(lr_cv, test_data, type = "prob")[, "Yes"]
lr_test_class <- ifelse(lr_test_preds > 0.5, "Yes", "No")

lr_conf_matrix <- confusionMatrix(
  as.factor(lr_test_class), 
  as.factor(test_data$loan_status), 
  positive = "Yes"
)

# Displaying the Confusion Matrix and Metrics
print(lr_conf_matrix)
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  8192  627
       Yes  556 1873
                                         
               Accuracy : 0.8948         
                 95% CI : (0.889, 0.9004)
    No Information Rate : 0.7777         
    P-Value [Acc > NIR] : < 2e-16        
                                         
                  Kappa : 0.6927         
                                         
 Mcnemar's Test P-Value : 0.04183        
                                         
            Sensitivity : 0.7492         
            Specificity : 0.9364         
         Pos Pred Value : 0.7711         
         Neg Pred Value : 0.9289         
             Prevalence : 0.2223         
         Detection Rate : 0.1665         
   Detection Prevalence : 0.2159         
      Balanced Accuracy : 0.8428         
                                         
       'Positive' Class : Yes            
                                         
Code
# Making predictions with Random Forest on the test set
rf_test_preds <- predict(rf_cv, test_data, type = "prob")[, "Yes"]
rf_test_class <- ifelse(rf_test_preds > 0.5, "Yes", "No")
rf_conf_matrix <- confusionMatrix(
  as.factor(rf_test_class), 
  as.factor(test_data$loan_status), 
  positive = "Yes"
)

# Displaying the Confusion Matrix and Metrics
print(rf_conf_matrix)
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  8505  585
       Yes  243 1915
                                          
               Accuracy : 0.9264          
                 95% CI : (0.9214, 0.9311)
    No Information Rate : 0.7777          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.7761          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.7660          
            Specificity : 0.9722          
         Pos Pred Value : 0.8874          
         Neg Pred Value : 0.9356          
             Prevalence : 0.2223          
         Detection Rate : 0.1703          
   Detection Prevalence : 0.1919          
      Balanced Accuracy : 0.8691          
                                          
       'Positive' Class : Yes             
                                          
Code
# Making predictions with Neural Network on the test set
nnet_test_preds <- predict(nnet_cv, test_data, type = "prob")[, "Yes"]
nnet_test_class <- ifelse(nnet_test_preds > 0.5, "Yes", "No")

# Generating the Confusion Matrix for Neural Network
nnet_conf_matrix <- confusionMatrix(
  as.factor(nnet_test_class), 
  as.factor(test_data$loan_status), 
  positive = "Yes"
)

# Displaying the Confusion Matrix and Metrics
print(nnet_conf_matrix)
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  8361 1532
       Yes  387  968
                                          
               Accuracy : 0.8294          
                 95% CI : (0.8223, 0.8363)
    No Information Rate : 0.7777          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.41            
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.38720         
            Specificity : 0.95576         
         Pos Pred Value : 0.71439         
         Neg Pred Value : 0.84514         
             Prevalence : 0.22226         
         Detection Rate : 0.08606         
   Detection Prevalence : 0.12047         
      Balanced Accuracy : 0.67148         
                                          
       'Positive' Class : Yes             
                                          
Code
# Load necessary library
library(pROC)
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'
The following objects are masked from 'package:stats':

    cov, smooth, var
Code
library(caret)

# Evaluation Function (with ROC Calculation)
evaluate_model <- function(true_labels, predicted_probs, predicted_classes) {
  roc_curve <- roc(true_labels, predicted_probs)
  auc <- auc(roc_curve)
  
  # Note: positive = "Yes" is not set here, so caret treats the first factor
  # level ("No") as the positive class; Precision/Recall/F1 in the comparison
  # table therefore describe the "No" (rejected) class, unlike the earlier
  # confusion matrices, which used positive = "Yes".
  confusion <- confusionMatrix(as.factor(predicted_classes), as.factor(true_labels))
  
  list(
    Accuracy = confusion$overall['Accuracy'],
    Precision = confusion$byClass['Precision'],
    Recall = confusion$byClass['Recall'],
    F1_Score = confusion$byClass['F1'],
    AUC = auc,
    ROC_Curve = roc_curve  # Storing the ROC curve for later plotting
  )
}

# Calculating ROC and other metrics for each model
lr_results <- evaluate_model(test_data$loan_status, lr_test_preds, lr_test_class)
Setting levels: control = No, case = Yes
Setting direction: controls < cases
Code
rf_results <- evaluate_model(test_data$loan_status, rf_test_preds, rf_test_class)
Setting levels: control = No, case = Yes
Setting direction: controls < cases
Code
nnet_results <- evaluate_model(test_data$loan_status, nnet_test_preds, nnet_test_class)
Setting levels: control = No, case = Yes
Setting direction: controls < cases
Code
# Displaying the Results in a Comparison Table
comparison_results <- data.frame(
  Model = c("Logistic Regression", "Random Forest", "Neural Network"),
  Accuracy = c(lr_results$Accuracy, rf_results$Accuracy, nnet_results$Accuracy),
  Precision = c(lr_results$Precision, rf_results$Precision, nnet_results$Precision),
  Recall = c(lr_results$Recall, rf_results$Recall, nnet_results$Recall),
  F1_Score = c(lr_results$F1_Score, rf_results$F1_Score, nnet_results$F1_Score),
  AUC = c(lr_results$AUC, rf_results$AUC, nnet_results$AUC)
)

print(comparison_results)
                Model  Accuracy Precision    Recall  F1_Score       AUC
1 Logistic Regression 0.8948257 0.9289035 0.9364426 0.9326578 0.9522683
2       Random Forest 0.9263869 0.9356436 0.9722222 0.9535822 0.9733427
3      Neural Network 0.8293919 0.8451430 0.9557613 0.8970549 0.7080928
Code
plot(1 - lr_results$ROC_Curve$specificities, lr_results$ROC_Curve$sensitivities, 
     col = "blue", type = "l", main = "ROC Curve Comparison",
     xlim = c(0, 1), ylim = c(0, 1), 
     xlab = "False Positive Rate (1 - Specificity)", 
     ylab = "True Positive Rate (Sensitivity)", lwd = 2)
lines(1 - rf_results$ROC_Curve$specificities, rf_results$ROC_Curve$sensitivities, col = "green", lwd = 2)
lines(1 - nnet_results$ROC_Curve$specificities, nnet_results$ROC_Curve$sensitivities, col = "black", lwd = 2)

# Adding the Random Guess Line (Diagonal)
abline(a=0, b=1, col="red", lty = 2, lwd = 2)

# Adding a clean, clear legend
legend("bottomright", 
       legend = c("Logistic Regression", "Random Forest", "Neural Network", "Random Guess"),
       col = c("blue", "green", "black", "red"), 
       lwd = 2, lty = c(1, 1, 1, 2))

6 Important Features

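The two plots below reference lr_importance_df and rf_importance_df, which are not constructed anywhere above. Here is a minimal sketch of how they could be derived from the fitted caret models using varImp(); it assumes rf_cv was trained with importance = "impurity" (as in the ranger call above) and uses varImp()'s default 0–100 scaling.

Code
# Deriving feature-importance data frames from the fitted caret models
lr_imp <- varImp(lr_cv)$importance
lr_importance_df <- data.frame(Feature = rownames(lr_imp),
                               Importance = lr_imp$Overall) |>
  arrange(desc(Importance))

rf_imp <- varImp(rf_cv)$importance   # requires importance = "impurity" in train()
rf_importance_df <- data.frame(Feature = rownames(rf_imp),
                               Importance = rf_imp$Overall) |>
  arrange(desc(Importance))
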
Code
# Plotting Feature Importance (Logistic Regression)
ggplot(lr_importance_df[1:10, ], aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 10 Feature Importance (Logistic Regression)",
       x = "Feature", y = "Importance")

Code
# Random Forest Feature Importance
ggplot(rf_importance_df[1:10, ], aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "darkgreen") +
  coord_flip() +
  labs(title = "Top 10 Feature Importance (Random Forest - Ranger)",
       x = "Feature", y = "Importance")

7 Conclusion

The dataset was initially explored, and severe unexplained outliers were identified and removed to ensure data quality. Following this, models were trained using a 5-fold cross-validation approach, with a grid search to test multiple hyperparameters for each model. The ROC metric was used as the primary criterion for selecting the best-performing models.

After cross-validation, the top-performing models were evaluated on a held-out test set that was excluded from training. The Random Forest model, with mtry = 7, splitrule = “gini”, and min.node.size = 5, achieved the highest performance, as shown in the ROC plot and results table.

Feature importance was then extracted from the Logistic Regression (Lasso) and Random Forest models, revealing that income, previous loan defaults, and the loan-to-income ratio were the most influential variables in determining loan approval outcomes. These features were consistently ranked highest across both models, highlighting their critical impact.

The Neural Network model performed above a random classifier but fell short of the Logistic Regression and Random Forest models. This may reflect the absence of input centering and scaling (to which nnet is sensitive), limitations in the available data, or insufficient compute for tuning. Even without further tuning, however, the Random Forest model demonstrated strong performance on unseen data, making it a robust choice for this task.