Executive Summary – EMILY
Business Understanding and Business Question – EMILY
Data Understanding and Data Preparation – BRETT
- Additional Exploratory Visualizations
Modeling and Results – KEVIN
Model Tuning and Evaluation – KEVIN
Discussion, Limitations, and Deployment – SHERRA
Conclusion – EMILY
References – ALL

Executive Summary – EMILY

The firm’s current loan pre-screening system automatically denies 100% of applicants with a prior default, regardless of their current financial health. Our analysis found that this rigid rule is overly restrictive and is suppressing profitable loan growth. Exploratory analysis revealed a meaningful subset of auto-denied applicants whose credit scores, income stability, and leverage ratios, among other variables collected on applicant profiles, are comparable to those of approved borrowers. This indicates that the company may be forfeiting healthy revenue opportunities due to an inflexible legacy policy implemented through our automated qualification systems.

To address this gap, we developed a model to identify credit-worthy applicants within the group denied for a prior default. Because no applicants in this group were historically approved, we trained four models (Decision Tree, Logistic Regression, Support Vector Machine, and Random Forest) using only the “No Prior Default” population to learn the firm’s standard approval behavior. After tuning and evaluation using ROC curves and AUC metrics, the Random Forest model demonstrated the strongest predictive performance in replicating approval patterns.

We recommend deploying the Random Forest model as a routing tool rather than a final decision-maker. Instead of auto-denying all prior-default applicants, the system should generate a “Reconsideration Score,” with applicants above a 0.60 threshold flagged for manual underwriting review. This human-in-the-loop approach enables the firm to recover high-potential revenue while maintaining prudent risk controls, modernizing the underwriting process without disproportionately increasing default exposure.

Business Understanding and Business Question – EMILY

Business Problem: As a loan provider, reviewing applications and underwriting represent significant labor costs to our business. We initially set out to use the data gathered at loan application to apply a data-driven approach to loan decision-making to expedite decisions. However, initial exploratory analysis highlighted that the existence of a prior default is the only metric that results in automatic rejection in our current review process.

To determine whether this rule might be overly punitive or simplistic, we compared key financial metrics between auto-denied individuals with prior defaults and approved borrowers. We found evidence that, by rejecting this group without contextual review, the firm may be forfeiting a substantial volume of healthy loan revenue. We therefore refined our problem framing to capitalize on the opportunity presented by applicants with otherwise strong metrics who were previously auto-denied.

Business Question: Can we systematically identify credit-worthy, high-potential applicants within the historically auto-denied “Prior Default” population for manual underwriting review, thereby increasing overall loan origination revenue without assuming disproportionate risk of future default?

Data Understanding and Data Preparation – BRETT

Target Variable: The target variable for our predictive modeling is loan_status, a categorical variable classified as either “Approved” or “Denied”.

Data Cleaning & Attribute Formatting: The raw dataset required extensive sanitization to enforce logical business rules. Key preparation steps included:

Handling Impossible Values: Filtering out anomalous records, such as applicant ages over 100 years, employment experience exceeding 60 years, and credit scores outside the valid 300–850 range.
Financial Metric Sanitization: Treating non-positive incomes and loan amounts as invalid, and ensuring loan-to-income ratios (loan_percent_income) were positive and did not exceed 1.5.
Categorical Conversion: Key text attributes (education, home ownership, loan intent, and prior defaults) were converted to factors for modeling. Rows with missing critical targets were dropped to ensure a sound training set.
Target Leakage: loan_int_rate was removed because interest rates are assigned based on risk, embedding existing decision logic into the framework.
Regulatory Compliance: The person_gender attribute was excluded from the predictor space to comply with the Equal Credit Opportunity Act.

# Install and load standard project libraries
required_packages <- c(
    "tidyverse", "janitor", "skimr", "GGally",
    "caret", "randomForest", "pROC", "rpart", "rpart.plot", "factoextra", "cluster", "e1071"
)

installed <- rownames(installed.packages())
to_install <- setdiff(required_packages, installed)
if (length(to_install) > 0) install.packages(to_install, quiet = TRUE)

invisible(lapply(required_packages, library, character.only = TRUE))
set.seed(42) # For reproducibility

data_path <- "data/raw/loan_data.csv"
if (!file.exists(data_path)) stop("Data file not found at data/raw/loan_data.csv.")

df_raw <- read_csv(data_path, show_col_types = FALSE) %>% clean_names()

df_clean <- df_raw %>%
    dplyr::select(-loan_int_rate, -person_gender) %>%
    mutate(
        # Factors & Categories
        person_education = factor(person_education),
        person_home_ownership = factor(person_home_ownership),
        loan_intent = factor(loan_intent),
        previous_loan_defaults_on_file = factor(previous_loan_defaults_on_file),
        loan_status = factor(loan_status, levels = c(0, 1), labels = c("Denied", "Approved")),

        # Integer fixes
        person_age = as.integer(ifelse(person_age < 18 | person_age > 100, NA, person_age)),
        person_emp_exp = as.integer(ifelse(person_emp_exp < 0 | person_emp_exp > 60, NA, person_emp_exp)),
        cb_person_cred_hist_length = as.integer(ifelse(cb_person_cred_hist_length < 0 | cb_person_cred_hist_length > 60, NA, cb_person_cred_hist_length)),
        credit_score = as.integer(ifelse(credit_score < 300 | credit_score > 850, NA, credit_score)),

        # Financial metrics sanitization
        loan_percent_income = ifelse(loan_percent_income <= 0 | loan_percent_income > 1.5, NA, loan_percent_income),
        person_income = ifelse(person_income <= 0, NA, person_income),
        loan_amnt = ifelse(loan_amnt <= 0, NA, loan_amnt),

        # 1. Financial Maturity: How much of their life have they had credit?
        credit_to_age_ratio = cb_person_cred_hist_length / person_age,

        # 2. Income Stability: Income relative to years of employment (+1 avoids division by zero)
        income_per_year_emp = person_income / (person_emp_exp + 1),

        # 3. Loan-to-Credit-Score Proxy: Balances the ask against their reputation
        loan_to_score_ratio = loan_amnt / credit_score
    )

# Drop missing targets and key variables to ensure modeled data is completely sound
df_clean <- df_clean %>%
    drop_na()

Descriptive Statistics & Data Visualization: The pivotal moment in data understanding was isolating the training group (No Prior Defaults) from the holdout action group (Prior Defaults). Density plots comparing the two groups revealed a distinct, healthy distribution of credit scores well into the 750+ range within the auto-denied group. This visual evidence confirmed the core hypothesis: the blanket denial rule was overriding otherwise excellent financial profiles.

# The Training & Evaluation Group (No Defaults)
df_no_default <- df_clean %>% filter(previous_loan_defaults_on_file == "No")

# The "Holdout" Action Group (Prior Defaults)
df_with_default <- df_clean %>% filter(previous_loan_defaults_on_file == "Yes")

cat("No Prior Default (Train/Test Population):", nrow(df_no_default), "\n")

## No Prior Default (Train/Test Population): 22125

cat("Prior Default (To Be Scored Later):", nrow(df_with_default), "\n")

## Prior Default (To Be Scored Later): 22841

p1 <- ggplot(df_clean, aes(x = credit_score, fill = loan_status)) +
    geom_density(alpha = 0.5) +
    facet_wrap(~previous_loan_defaults_on_file, labeller = as_labeller(c("No" = "No Prior Default (Normal Processing)", "Yes" = "Prior Default (Auto Denied)"))) +
    scale_fill_manual(values = c("indianred", "seagreen")) +
    theme_minimal() +
    labs(
        title = "Credit Score Distribution by Approval vs. Prior Default Group",
        x = "Credit Score", y = "Density"
    )

print(p1)

Across the full cleaned dataset, the class distribution runs roughly 78% denied to 22% approved; within the No Prior Default training population it is closer to 55% denied to 45% approved. The remaining imbalance was addressed through threshold optimization during model tuning rather than resampling. Applicants with a prior default are denied at a 100% rate under current policy, while those without one are approved 45.2% of the time. Credit scores for prior-default applicants closely mirror those of denied no-default applicants; that overlap is the basis for the reconsideration model. Approved no-default applicants have higher incomes, better credit scores, and lower loan burdens than those denied.
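The approval figures quoted above can be reconciled with a short calculation. The sketch below uses only the group sizes printed earlier and the 45.2% approval rate for the no-default group; the variable names are illustrative and not part of the project code.

```r
# Group sizes taken from the printed output earlier in this section
n_no_default <- 22125
n_prior_default <- 22841

# Approval rates stated in the text: 45.2% without a prior default, 0% with one
rate_no_default <- 0.452
rate_prior_default <- 0

# Pooled approval rate across both groups
approved_total <- n_no_default * rate_no_default + n_prior_default * rate_prior_default
overall_rate <- approved_total / (n_no_default + n_prior_default)

round(100 * overall_rate, 1)
```

A 45.2% approval rate in roughly half of the applicant pool, combined with 0% in the other half, gives an overall approval share of about 22%, which is where the 78/22 split comes from.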

Additional Exploratory Visualizations

To further support our modeling effort, we explore other financial metrics within the No Prior Default group to better understand how they relate to loan approval status.

Visualization 1: Applicant Income by Loan Status

Understanding the distribution of an applicant’s income relative to their loan approval status is critical. In the boxplot below, we visualize this relationship. Given the presence of extreme high-income outliers, we’ve scaled the y-axis logarithmically to ensure a comfortable reading experience and to highlight the interquartile range comparisons between approved and denied applicants.

p2 <- ggplot(df_no_default, aes(x = loan_status, y = person_income, fill = loan_status)) +
    geom_boxplot(alpha = 0.7, outlier.color = "grey50", outlier.alpha = 0.4) +
    scale_y_log10(labels = scales::dollar_format()) +
    scale_fill_manual(values = c("indianred", "seagreen")) +
    theme_minimal() +
    theme(
        plot.margin = ggplot2::margin(15, 15, 15, 15, "pt"),
        plot.caption = element_text(size = 8, hjust = 0, margin = ggplot2::margin(t = 15, unit = "pt")),
        axis.title = element_text(size = 12)
    ) +
    labs(
        title = "Applicant Income Distribution by Loan Status (No Prior Defaults)",
        subtitle = "Y-axis displayed on log scale to accommodate outliers",
        x = "Loan Status",
        y = "Log Applicant Income",
        caption = "Method: Missing values omitted, filtered for valid ranges. Displayed on a log-10 scale.\nSource: Tawei Lo (2022). Loan Approval Classification Data. Kaggle."
    )
print(p2)

As illustrated, approved applicants tend to have a higher median income compared to denied applicants, establishing income as a likely strong predictor in our subsequent modeling.

Visualization 2: Loan Amount vs. Applicant Income

Next, we look at the relationship between the actual loan amount requested and the applicant’s current income. By plotting these variables continuously and coloring by the outcome class, we can observe the “safe” lending corridor where loan-to-income ratios typically result in approval.

p3 <- ggplot(df_no_default, aes(x = person_income, y = loan_amnt, color = loan_status)) +
    geom_point(alpha = 0.4) +
    scale_x_log10(labels = scales::dollar_format()) +
    scale_y_continuous(labels = scales::dollar_format()) +
    scale_color_manual(values = c("indianred", "seagreen")) +
    theme_minimal() +
    theme(
        plot.margin = ggplot2::margin(15, 15, 15, 15, "pt"),
        plot.caption = element_text(size = 8, hjust = 0, margin = ggplot2::margin(t = 15, unit = "pt")),
        axis.title = element_text(size = 12)
    ) +
    labs(
        title = "Loan Amount Requested vs. Applicant Income",
        subtitle = "Visualizing the lending corridor for approved loans",
        x = "Log Applicant Income",
        y = "Loan Amount Requested",
        color = "Loan Status",
        caption = "Method: Missing values omitted, filtered for valid ranges. X-axis displayed on a log-10 scale.\nSource: Tawei Lo (2022). Loan Approval Classification Data. Kaggle."
    )
print(p3)

The scatter plot clearly demonstrates an upper boundary (the diagonal limit) where loans are rarely approved if the requested amount is too high relative to the applicant’s income.

Visualization 3: Loan Intent Proportions

Finally, to incorporate categorical influences on the final decision, we evaluate loan intent (the purpose of the loan). This bar chart normalizes each loan intent category to 100% to visualize the proportion of approvals within each group.

p4 <- ggplot(df_no_default, aes(x = loan_intent, fill = loan_status)) +
    geom_bar(position = "fill", alpha = 0.85) +
    scale_fill_manual(values = c("indianred", "seagreen")) +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_minimal() +
    theme(
        plot.margin = ggplot2::margin(15, 15, 15, 15, "pt"),
        plot.caption = element_text(size = 8, hjust = 0, margin = ggplot2::margin(t = 15, unit = "pt")),
        axis.title = element_text(size = 12),
        axis.text.x = element_text(angle = 45, hjust = 1)
    ) +
    labs(
        title = "Proportion of Approved Loans by Loan Intent",
        x = "Stated Loan Intent",
        y = "Percentage of Total",
        fill = "Loan Status",
        caption = "Method: Missing values omitted, filtered for valid ranges. Position filled to show proportions.\nSource: Tawei Lo (2022). Loan Approval Classification Data. Kaggle."
    )
print(p4)

While financial metrics are the primary drivers of approval, certain loan intents (such as Education or Home Improvement) may historically have slightly different structural approval rates.

Modeling and Results – KEVIN

Because historical data contains zero approved applications for the “Prior Default” group, any model trained naively on the full dataset would simply learn to replicate the 100% denial rule. To circumvent this, we utilized a Proxy Target Modeling approach. We trained our algorithms strictly on the “No Prior Default” population to map the firm’s standard approval criteria.
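The idea can be illustrated with a minimal sketch on simulated data. Everything here is an assumption made for illustration (the toy columns, the simulated approval rule, and the use of a plain logistic regression); it is not the project pipeline, only the same train-on-one-group, score-the-other pattern.

```r
set.seed(1)
n <- 400

# Simulated applicants (hypothetical data, not the loan dataset)
toy <- data.frame(
    credit_score = round(runif(n, 300, 850)),
    prior_default = rep(c("No", "Yes"), each = n / 2)
)

# Simulated policy: approval depends on credit score, but any prior default is auto-denied
p_approve <- plogis((toy$credit_score - 600) / 50)
toy$approved <- ifelse(toy$prior_default == "No", rbinom(n, 1, p_approve), 0L)

train <- subset(toy, prior_default == "No")
holdout <- subset(toy, prior_default == "Yes")

# Learn the standard approval pattern from the no-default group only;
# the prior-default flag is deliberately left out of the predictors
fit <- glm(approved ~ credit_score, data = train, family = binomial)

# Project that pattern onto the auto-denied group
holdout$reconsideration_score <- predict(fit, newdata = holdout, type = "response")
summary(holdout$reconsideration_score)
```

Because the holdout group has no approvals at all, its scores cannot be validated against its own labels; they only express how the firm would likely have treated a similar applicant without the default flag.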

We established a transparent baseline using a Classification Decision Tree to identify the most heavily weighted features in standard approvals. To capture more complex boundaries and provide points of comparison, we subsequently fit three additional algorithms:

Random Forest: Utilized to handle complex feature interactions and generate variable importance metrics.
Support Vector Machines (SVM): Implemented to find the optimal hyperplane separating approved and denied loans.
Logistic Regression: Provided a probabilistic baseline for the likelihood of approval based on financial inputs.

# Stratified Train/Test split on the No-Default population
set.seed(42)
idx <- caret::createDataPartition(df_no_default$loan_status, p = 0.8, list = FALSE)
train_set <- df_no_default[idx, ] %>% dplyr::select(-previous_loan_defaults_on_file) # remove from predictors
test_set <- df_no_default[-idx, ] %>% dplyr::select(-previous_loan_defaults_on_file)

Model Tuning and Evaluation – KEVIN

To ensure optimal predictive accuracy before projecting these models onto the denied population, we tuned the hyperparameters of each algorithm:

Decision Tree

Tuning: Tuned using k-fold cross-validation to find the complexity parameter (cp) that minimized cross-validated error, resulting in a pruned tree that prevents overfitting.

Decision Tree Performance (Validation Set)

* True Positives (Approved | Predicted Approved): 1,317
* False Positives (Denied | Predicted Approved): 237
* True Negatives (Denied | Predicted Denied): 2,188
* False Negatives (Approved | Predicted Denied): 682
* Accuracy: 79.23%
* Precision (Positive Predictive Value): 84.75%
* Recall / Sensitivity (True Positive Rate): 65.88%
* Specificity (True Negative Rate): 90.23%
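For readers who want to check how the summary metrics follow from the four confusion-matrix counts, here is a small sketch. The helper name confusion_metrics is hypothetical and not part of caret; the counts are the decision tree figures reported above.

```r
# Derive the headline metrics from raw confusion-matrix counts
confusion_metrics <- function(tp, fp, tn, fn) {
    c(
        accuracy = (tp + tn) / (tp + fp + tn + fn),
        precision = tp / (tp + fp),      # positive predictive value
        recall = tp / (tp + fn),         # sensitivity, true positive rate
        specificity = tn / (tn + fp)     # true negative rate
    )
}

# Decision tree counts from the summary above
dt <- confusion_metrics(tp = 1317, fp = 237, tn = 2188, fn = 682)
round(100 * dt, 2)
```

The same function can be applied to the counts reported for the other models to compare them on equal footing.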

set.seed(42)
full_tree <- rpart(
    loan_status ~ .,
    data = train_set,
    method = "class",
    control = rpart.control(cp = 0) # Allows complex tree to grow
)

# Prune tree with minimum cross-validated error
min_xerror <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), ]
min_xerror_tree <- prune(full_tree, cp = min_xerror[1])
rpart.plot(min_xerror_tree, main = "Pruned Classification Tree")

test_set$ct_prob <- predict(min_xerror_tree, test_set)[, "Approved"]
test_set$ct_class <- factor(ifelse(test_set$ct_prob > 0.5, "Approved", "Denied"), levels = c("Denied", "Approved"))

cm_dt <- confusionMatrix(test_set$ct_class, test_set$loan_status, positive = "Approved")
print(cm_dt)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Denied Approved
##   Denied     2188      682
##   Approved    237     1317
##                                         
##                Accuracy : 0.7923        
##                  95% CI : (0.78, 0.8041)
##     No Information Rate : 0.5481        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.5723        
##                                         
##  Mcnemar's Test P-Value : < 2.2e-16     
##                                         
##             Sensitivity : 0.6588        
##             Specificity : 0.9023        
##          Pos Pred Value : 0.8475        
##          Neg Pred Value : 0.7624        
##              Prevalence : 0.4519        
##          Detection Rate : 0.2977        
##    Detection Prevalence : 0.3513        
##       Balanced Accuracy : 0.7805        
##                                         
##        'Positive' Class : Approved      
##

Visualization 4: Decision Tree Predicted Probabilities

Plotting the predicted probabilities for the test set against the applicant’s credit score helps illustrate the model’s confidence across the major risk factor. We expect a model that effectively discriminates to show high probabilities assigned to actual ‘Approved’ applicants and low probabilities to actual ‘Denied’ applicants.

p_ct_scatter <- ggplot(test_set, aes(x = credit_score, y = ct_prob, color = loan_status)) +
    geom_jitter(alpha = 0.5, width = 0.5, height = 0.02) +
    scale_color_manual(values = c("indianred", "seagreen")) +
    theme_minimal() +
    theme(
        plot.margin = ggplot2::margin(15, 15, 15, 15, "pt"),
        plot.caption = element_text(size = 8, hjust = 0, margin = ggplot2::margin(t = 15, unit = "pt")),
        axis.title = element_text(size = 12)
    ) +
    labs(
        title = "Decision Tree: Predicted Probability vs. Credit Score",
        subtitle = "Visualizing model assurance across applicant risk strata",
        x = "Applicant Credit Score",
        y = "Predicted Probability of Approval",
        color = "Actual Loan Status",
        caption = "Method: Predictions scored on holdout validation set. Jittered vertically to expose density.\nSource: Tawei Lo (2022). Loan Approval Classification Data. Kaggle."
    )
print(p_ct_scatter)

The flat stratification of points across a few fixed probability levels is typical for a pruned decision tree, as it categorizes applicants into specific terminal nodes rather than assigning a continuous, gradual probability curve.

Random Forest

Tuning: Optimized using the tuneRF function to identify the best mtry (number of variables randomly sampled as candidates at each split).

Random Forest Performance (Validation Set - Optimized at 0.417 Threshold)

* True Positives (Approved | Predicted Approved): 1,557
* False Positives (Denied | Predicted Approved): 427
* True Negatives (Denied | Predicted Denied): 1,998
* False Negatives (Approved | Predicted Denied): 442
* Accuracy: 80.36%
* Precision (Positive Predictive Value): 78.48%
* Recall / Sensitivity (True Positive Rate): 77.89%
* Specificity (True Negative Rate): 82.39%
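The 0.417 cut-off is selected with Youden's J statistic, which picks the threshold maximizing sensitivity + specificity - 1 rather than raw accuracy. The following is a simplified base-R sketch of that idea on made-up scores. The function name youden_threshold and the toy vectors are assumptions, and pROC's own procedure differs in detail (for example, it evaluates midpoints between observed scores).

```r
# Sweep candidate thresholds and keep the one with the largest Youden's J
youden_threshold <- function(probs, actual, positive = "Approved") {
    thresholds <- sort(unique(probs))
    is_pos <- actual == positive
    j <- vapply(thresholds, function(t) {
        pred_pos <- probs >= t
        sens <- sum(pred_pos & is_pos) / sum(is_pos)
        spec <- sum(!pred_pos & !is_pos) / sum(!is_pos)
        sens + spec - 1
    }, numeric(1))
    list(threshold = thresholds[which.max(j)], j = max(j))
}

# Toy predicted probabilities and true labels (hypothetical)
probs <- c(0.10, 0.20, 0.35, 0.40, 0.70, 0.75, 0.80, 0.90)
actual <- c("Denied", "Denied", "Denied", "Approved",
            "Approved", "Denied", "Approved", "Approved")

best <- youden_threshold(probs, actual)
best$threshold  # 0.4
best$j          # 0.75
```

In the toy data, a cut-off of 0.40 captures every approved case while misclassifying only one denied case, so it is preferred over higher thresholds that trade away sensitivity.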

set.seed(42)
res <- tuneRF(
    x = train_set %>% dplyr::select(-loan_status),
    y = train_set$loan_status,
    mtryStart = 2,
    ntreeTry = 200,
    trace = FALSE,
    plot = FALSE
)

## -0.05616225 0.05 
## 0.01378055 0.05

best_mtry <- res[which.min(res[, 2]), 1]

rf_best_model <- randomForest(
    loan_status ~ .,
    data = train_set,
    ntree = 500,
    mtry = best_mtry,
    nodesize = 5,
    importance = TRUE
)

# Feature Importance
imp <- importance(rf_best_model)
imp_df <- tibble(feature = rownames(imp), MeanDecreaseGini = imp[, "MeanDecreaseGini"]) %>%
    arrange(desc(MeanDecreaseGini)) %>%
    head(10)

ggplot(imp_df, aes(x = reorder(feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
    geom_col(fill = "steelblue") +
    coord_flip() +
    theme_minimal() +
    labs(title = "Top Driving Factors for Standard Loan Approvals", x = "", y = "Mean Decrease in Gini Impurity")

# Threshold optimization via Youden's J
rf_probs <- predict(rf_best_model, test_set, type = "prob")[, "Approved"]
rf_roc_obj <- pROC::roc(test_set$loan_status, rf_probs, levels = c("Denied", "Approved"), quiet = TRUE)
optimal_cut <- pROC::coords(rf_roc_obj, "best", ret = "threshold", best.method = "youden")
cat("Optimal Probability Threshold (Youden's J) is:", optimal_cut$threshold, "\n")

## Optimal Probability Threshold (Youden's J) is: 0.417

test_set$rf_prob <- rf_probs
test_set$rf_class <- factor(
    ifelse(rf_probs > optimal_cut$threshold, "Approved", "Denied"),
    levels = c("Denied", "Approved")
)

cm_rf <- confusionMatrix(test_set$rf_class, test_set$loan_status, positive = "Approved")
print(cm_rf)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Denied Approved
##   Denied     1998      442
##   Approved    427     1557
##                                           
##                Accuracy : 0.8036          
##                  95% CI : (0.7916, 0.8152)
##     No Information Rate : 0.5481          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6032          
##                                           
##  Mcnemar's Test P-Value : 0.6348          
##                                           
##             Sensitivity : 0.7789          
##             Specificity : 0.8239          
##          Pos Pred Value : 0.7848          
##          Neg Pred Value : 0.8189          
##              Prevalence : 0.4519          
##          Detection Rate : 0.3519          
##    Detection Prevalence : 0.4485          
##       Balanced Accuracy : 0.8014          
##                                           
##        'Positive' Class : Approved        
##

Visualization 5: Random Forest Predicted Probabilities

While a single decision tree applies rigid cutoffs, a Random Forest aggregates predictions across hundreds of trees, resulting in a much smoother probability distribution. Below we plot the aggregate probabilities given to applicants against their credit score.

p_rf_scatter <- ggplot(test_set, aes(x = credit_score, y = rf_prob, color = loan_status)) +
    geom_jitter(alpha = 0.5, width = 0.5, height = 0.02) +
    scale_color_manual(values = c("indianred", "seagreen")) +
    theme_minimal() +
    theme(
        plot.margin = ggplot2::margin(15, 15, 15, 15, "pt"),
        plot.caption = element_text(size = 8, hjust = 0, margin = ggplot2::margin(t = 15, unit = "pt")),
        axis.title = element_text(size = 12)
    ) +
    labs(
        title = "Random Forest: Predicted Probability vs. Credit Score",
        subtitle = "Continuous probability distribution across risk strata",
        x = "Applicant Credit Score",
        y = "Predicted Probability of Approval",
        color = "Actual Loan Status",
        caption = "Method: Predictions scored on holdout validation set. Jittered vertically to expose density.\nSource: Tawei Lo (2022). Loan Approval Classification Data. Kaggle."
    )
print(p_rf_scatter)

The resulting plot displays a more nuanced probability assignment, where applicants with higher credit scores are noticeably clustered near 1.0, while those with lower scores fall precipitously toward 0.0. Our model considers every factor, including credit score, allowing for more accurate reconsideration recommendations than any single variable that a human reviewer could readily check.

Support Vector Machines (SVM)

Tuning: Tuned using a radial kernel, testing across a range of cost parameters to find the best decision boundary.

Support Vector Machine Performance (Validation Set)

* True Positives (Approved | Predicted Approved): 1,319
* False Positives (Denied | Predicted Approved): 302
* True Negatives (Denied | Predicted Denied): 2,123
* False Negatives (Approved | Predicted Denied): 680
* Accuracy: 77.80%
* Precision (Positive Predictive Value): 81.37%
* Recall / Sensitivity (True Positive Rate): 65.98%
* Specificity (True Negative Rate): 87.55%

set.seed(42)
svm_tune <- tune(
    svm,
    loan_status ~ .,
    data = train_set,
    kernel = "radial",
    probability = TRUE,
    ranges = list(cost = c(0.1, 1, 10))
)

best_svm_mod <- svm_tune$best.model

svm_preds <- predict(best_svm_mod, test_set, probability = TRUE)
test_set$svm_prob <- attr(svm_preds, "probabilities")[, "Approved"]
test_set$svm_class <- svm_preds

cm_svm <- confusionMatrix(test_set$svm_class, test_set$loan_status, positive = "Approved")
print(cm_svm)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Denied Approved
##   Denied     2123      680
##   Approved    302     1319
##                                           
##                Accuracy : 0.778           
##                  95% CI : (0.7655, 0.7902)
##     No Information Rate : 0.5481          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5443          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6598          
##             Specificity : 0.8755          
##          Pos Pred Value : 0.8137          
##          Neg Pred Value : 0.7574          
##              Prevalence : 0.4519          
##          Detection Rate : 0.2981          
##    Detection Prevalence : 0.3664          
##       Balanced Accuracy : 0.7676          
##                                           
##        'Positive' Class : Approved        
##

Visualization 6: SVM Predicted Probabilities

The Support Vector Machine attempts to find an optimal geometric boundary separating approved and denied clients. We map its probability outputs across the credit score spectrum below to visualize its confidence.

p_svm_scatter <- ggplot(test_set, aes(x = credit_score, y = svm_prob, color = loan_status)) +
    geom_jitter(alpha = 0.5, width = 0.5, height = 0.02) +
    scale_color_manual(values = c("indianred", "seagreen")) +
    theme_minimal() +
    theme(
        plot.margin = ggplot2::margin(15, 15, 15, 15, "pt"),
        plot.caption = element_text(size = 8, hjust = 0, margin = ggplot2::margin(t = 15, unit = "pt")),
        axis.title = element_text(size = 12)
    ) +
    labs(
        title = "SVM: Predicted Probability vs. Credit Score",
        subtitle = "Visualizing the geometric margin's confidence mapping",
        x = "Applicant Credit Score",
        y = "Predicted Probability of Approval",
        color = "Actual Loan Status",
        caption = "Method: Probabilities derived from radial kernel SVM on holdout set. Jittered vertically.\nSource: Tawei Lo (2022). Loan Approval Classification Data. Kaggle."
    )
print(p_svm_scatter)

As expected, the SVM separates the classes well at the extremes, but in the middle range (credit scores 600-700), the model is less certain and assigns moderate probabilities.

Logistic Regression

Tuning: Refined using stepwise forward selection, which in R's step() adds predictors by AIC, to isolate the most informative predictors.

Logistic Regression Performance (Validation Set)

* True Positives (Approved | Predicted Approved): 1,491
* False Positives (Denied | Predicted Approved): 705
* True Negatives (Denied | Predicted Denied): 1,720
* False Negatives (Approved | Predicted Denied): 508
* Accuracy: 72.58%
* Precision (Positive Predictive Value): 67.90%
* Recall / Sensitivity (True Positive Rate): 74.59%
* Specificity (True Negative Rate): 70.93%

Business Interpretation of Model Errors:

* False Positives (705 applicants): The model predicted these applicants would be “Approved” (good risk), but their actual historical status was “Denied” (bad risk). Business Impact: This is a costly error. In our business context, approving a high-risk applicant who later defaults results in a direct financial loss of the loan principal. However, because loans on these applications are awarded only after human review, each of these cases should be caught before final approval.
* False Negatives (508 applicants): The model predicted these applicants would be “Denied” (bad risk), but their actual historical status was “Approved” (good risk). Business Impact: This is a missed opportunity cost. Rejecting a qualified applicant means the firm forfeits the potential interest revenue they would have generated over the life of the loan. However, because the model will be applied only to applicants who are already auto-denied, this exclusion is a nominal forgone opportunity relative to current practice.

As deployment on incoming applications lets us retrain the model on more dynamic datasets, we expect model error to fall and potential loan revenue to rise.

logit_full <- glm(loan_status ~ ., data = train_set, family = "binomial")
logit_null <- glm(loan_status ~ 1, data = train_set, family = "binomial")

forward_model <- step(logit_null, scope = list(lower = logit_null, upper = logit_full), direction = "forward", trace = 0)

test_set$logit_prob <- predict(forward_model, test_set, type = "response")
test_set$logit_class <- factor(ifelse(test_set$logit_prob > optimal_cut$threshold, "Approved", "Denied"), levels = c("Denied", "Approved"))

cm_logit <- confusionMatrix(test_set$logit_class, test_set$loan_status, positive = "Approved")
print(cm_logit)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Denied Approved
##   Denied     1720      508
##   Approved    705     1491
##                                           
##                Accuracy : 0.7258          
##                  95% CI : (0.7124, 0.7389)
##     No Information Rate : 0.5481          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4512          
##                                           
##  Mcnemar's Test P-Value : 1.827e-08       
##                                           
##             Sensitivity : 0.7459          
##             Specificity : 0.7093          
##          Pos Pred Value : 0.6790          
##          Neg Pred Value : 0.7720          
##              Prevalence : 0.4519          
##          Detection Rate : 0.3370          
##    Detection Prevalence : 0.4964          
##       Balanced Accuracy : 0.7276          
##                                           
##        'Positive' Class : Approved        
##

Visualization 7: Logistic Regression Predicted Probabilities

The Logistic Regression model provides a fundamental probability S-curve baseline for comparison against the black-box algorithms. Its probability curve relative to credit score is mapped below.

p_logit_scatter <- ggplot(test_set, aes(x = credit_score, y = logit_prob, color = loan_status)) +
    geom_jitter(alpha = 0.5, width = 0.5, height = 0.02) +
    scale_color_manual(values = c("indianred", "seagreen")) +
    theme_minimal() +
    theme(
        plot.margin = ggplot2::margin(15, 15, 15, 15, "pt"),
        plot.caption = element_text(size = 8, hjust = 0, margin = ggplot2::margin(t = 15, unit = "pt")),
        axis.title = element_text(size = 12)
    ) +
    labs(
        title = "Logistic Regression: Predicted Probability vs. Credit Score",
        subtitle = "Evaluating the linear baseline model's classification confidence",
        x = "Applicant Credit Score",
        y = "Predicted Probability of Approval",
        color = "Actual Loan Status",
        caption = "Method: Predictions from stepwise forward regression on holdout set. Points jittered for readability.\nSource: Tawei Lo (2022). Loan Approval Classification Data. Kaggle."
    )
print(p_logit_scatter)

The output shows a smooth gradient, but the actual ‘Approved’ and ‘Denied’ points overlap substantially in the middle probability ranges, which helps explain why the tree-based models outperformed the linear model.

Performance Evaluation with ROC

Model performance was evaluated on the test set and compared using ROC (Receiver Operating Characteristic) curves. The tuned Random Forest model yielded the highest Area Under the Curve (AUC) and overall test accuracy, making it the most reliable of the four models for replicating the firm’s standard approval behavior.

ct_roc <- pROC::roc(test_set$loan_status, test_set$ct_prob, direction = "<", levels = c("Denied", "Approved"), quiet = TRUE)
rf_roc <- pROC::roc(test_set$loan_status, test_set$rf_prob, direction = "<", levels = c("Denied", "Approved"), quiet = TRUE)
logit_roc <- pROC::roc(test_set$loan_status, test_set$logit_prob, direction = "<", levels = c("Denied", "Approved"), quiet = TRUE)
svm_roc <- pROC::roc(test_set$loan_status, test_set$svm_prob, direction = "<", levels = c("Denied", "Approved"), quiet = TRUE)

plot(ct_roc, print.auc = TRUE, col = "blue", main = "ROC Comparison")
plot(rf_roc, print.auc = TRUE, print.auc.y = 0.4, col = "green", add = TRUE)
plot(logit_roc, print.auc = TRUE, print.auc.y = 0.3, col = "red", add = TRUE)
plot(svm_roc, print.auc = TRUE, print.auc.y = 0.2, col = "black", add = TRUE)

legend("bottomright",
    legend = c("Decision Tree", "Random Forest", "Logistic Regression", "SVM"),
    col = c("blue", "green", "red", "black"), lwd = 2, cex = 0.8
)

ROC-AUC Scores (Test Set)

(Note: A model with an AUC of 0.5 is no better than random guessing, while 1.0 is perfect.)

* Random Forest AUC: 0.8891
* SVM AUC: 0.8465
* Decision Tree AUC: 0.8402
* Logistic Regression AUC: 0.8209
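To make the AUC figures more concrete: AUC equals the probability that a randomly chosen approved applicant receives a higher score than a randomly chosen denied applicant, with ties counted as half. The sketch below computes that quantity by brute force in base R. The helper `auc_by_rank` and the toy scores are invented for illustration and are not taken from the project data; `pROC::roc()` remains the method used in the analysis.

```r
# Hypothetical helper: AUC as a pairwise ranking probability (ties count half)
auc_by_rank <- function(scores, labels, positive = "Approved") {
    pos <- scores[labels == positive]
    neg <- scores[labels != positive]
    # Compare every positive-class score with every negative-class score
    wins <- outer(pos, neg, ">")
    ties <- outer(pos, neg, "==")
    (sum(wins) + 0.5 * sum(ties)) / (length(pos) * length(neg))
}

# Toy data, invented for illustration only
toy_scores <- c(0.90, 0.80, 0.35, 0.60, 0.30, 0.10)
toy_labels <- c("Approved", "Approved", "Approved", "Denied", "Denied", "Denied")

toy_auc <- auc_by_rank(toy_scores, toy_labels)
print(toy_auc)
```

In the toy data, eight of the nine approved/denied pairs are ranked correctly, so the result is 8/9. Read the same way, the Random Forest's 0.8891 means it ranks an approved applicant above a denied one in roughly 89% of such pairs.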

Discussion, Limitations, and Deployment – SHERRA

Deployment Plan:
Rather than automatically denying all applicants with a prior default, we propose integrating a predictive reconsideration layer into the underwriting workflow. Under this architecture, applicant data would be passed through a tuned Random Forest model via API to generate a “Reconsideration Score,” defined as the probability the applicant would have been approved under normal underwriting conditions. Applicants exceeding a calibrated business threshold (initially recommended at 0.60) would bypass automatic denial and be routed to a “Pending – Manual Review” queue. This preserves operational efficiency while introducing a structured, data-informed mechanism to selectively reconsider financially rehabilitated borrowers, with final decisions remaining under human oversight.
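The routing rule described above can be sketched as a small function. This is an illustrative sketch only: the function name `route_applicant`, the "Auto-Denied" label, and the choice to keep applicants with a missing score on the existing auto-denial path are assumptions, not part of the firm's system. The queue label is written with a plain hyphen for ASCII safety.

```r
# Hypothetical routing helper: maps Reconsideration Scores to a queue label.
# Assumption: applicants with a missing score stay on the existing auto-denial path.
route_applicant <- function(reconsideration_score, threshold = 0.60) {
    stopifnot(
        is.numeric(reconsideration_score),
        all(reconsideration_score >= 0 & reconsideration_score <= 1, na.rm = TRUE)
    )
    ifelse(
        !is.na(reconsideration_score) & reconsideration_score >= threshold,
        "Pending - Manual Review",
        "Auto-Denied"
    )
}

# Example scores, invented for illustration
routes <- route_applicant(c(0.82, 0.60, 0.41, NA))
print(routes)
```

Keeping the threshold as an argument means the business can recalibrate it without touching the model itself.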

# Apply model to the isolated prior default group
eval_default <- df_with_default %>%
    dplyr::select(-previous_loan_defaults_on_file)
reconsider_probs <- predict(
    rf_best_model,
    newdata = eval_default, type = "prob"
)[, "Approved"]

df_with_default$reconsideration_score <- reconsider_probs

# Plot the distribution of their scores
ggplot(df_with_default, aes(x = reconsideration_score)) +
    geom_histogram(bins = 40, fill = "purple", color = "white") +
    geom_vline(xintercept = 0.6, linetype = "dashed", color = "red", linewidth = 1) +
    theme_minimal() +
    labs(
        title = "Reconsideration Score Distribution among Prior Defaulters",
        subtitle = "Red Dashed Line = 60% Threshold for Manual Review",
        x = "Probability of Approval (Reconsideration Score)", y = "Frequency of Applicants"
    )

If we set a conservative threshold, where a score of 60% or greater routes the application to a manual review queue, how large is that opportunity pool?

threshold <- 0.6
good_candidates <- df_with_default %>% filter(reconsideration_score >= threshold)

potential_loan_volume <- sum(good_candidates$loan_amnt, na.rm = TRUE)

cat(sprintf("Total Excluded Applicants (Prior Default): %d\n", nrow(df_with_default)))

## Total Excluded Applicants (Prior Default): 22841

cat(sprintf("Highly Recommended Candidates (>=%s Score): %d\n", paste0(threshold * 100, "%"), nrow(good_candidates)))

## Highly Recommended Candidates (>=60% Score): 2020

cat(sprintf("Total Estimated Loan Volume for Reconsideration: $%s\n", format(potential_loan_volume, big.mark = ",")))

## Total Estimated Loan Volume for Reconsideration: $19,247,387
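Because the 0.60 cutoff is a business choice, it may help to see how the pool size and loan volume respond to other thresholds. The sketch below is one way to tabulate that, assuming a vector of scores and a vector of loan amounts. The helper `pool_by_threshold` and the toy inputs are hypothetical; in the analysis, the inputs would be `df_with_default$reconsideration_score` and `df_with_default$loan_amnt`.

```r
# Hypothetical helper: pool size and loan volume at several candidate thresholds
pool_by_threshold <- function(scores, amounts, thresholds = c(0.5, 0.6, 0.7, 0.8)) {
    rows <- lapply(thresholds, function(t) {
        flagged <- !is.na(scores) & scores >= t
        data.frame(
            threshold = t,
            n_flagged = sum(flagged),
            loan_volume = sum(amounts[flagged], na.rm = TRUE)
        )
    })
    do.call(rbind, rows)
}

# Toy inputs, invented for illustration only
toy_scores <- c(0.15, 0.55, 0.62, 0.71, 0.90)
toy_amounts <- c(5000, 8000, 12000, 10000, 7000)

sensitivity_table <- pool_by_threshold(toy_scores, toy_amounts)
print(sensitivity_table)
```

A table like this lets underwriting capacity (how many manual reviews are feasible) inform the threshold alongside risk appetite.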

Expected Value / Profit Curve Analysis (Per Data Science for Business, Ch. 7: Expected Value Framework)

To evaluate the true business impact of deploying the Random Forest model for reconsideration, we calculate the Expected Value (EV) of a classification decision. The formula is:

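\[EV = p(\mathbf{p} \mid Y) \cdot b(Y, \mathbf{p}) + p(\mathbf{n} \mid Y) \cdot c(Y, \mathbf{n})\]

Where $Y$ denotes an applicant the model flags for reconsideration, and $\mathbf{p}$ and $\mathbf{n}$ denote the actual positive (good loan) and negative (bad loan) classes:
* $p(\mathbf{p} \mid Y)$ = Probability that a flagged applicant is a True Positive (targeting a good loan, i.e., Precision = 78.48%)
* $b(Y, \mathbf{p})$ = Business benefit of a good loan (e.g., net interest income)
* $p(\mathbf{n} \mid Y)$ = Probability that a flagged applicant is a False Positive (targeting a bad loan, 1 - Precision = 21.52%)
* $c(Y, \mathbf{n})$ = Business cost of a defaulted loan (e.g., principal loss), entered as a negative number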

[MANUAL ASSUMPTIONS REQUIRED FOR FINAL CALCULATION]
1. Average Profit Per Good Loan ($b$): [INSERT ESTIMATE, e.g., $2,000]
2. Average Loss Per Defaulted Loan ($c$): [INSERT ESTIMATE, e.g., -$10,000]

Using our Random Forest precision of 78.48%, the Expected Profit of approving a reconsidered applicant is: \[EV = (0.7848 \times \text{Average Profit}) + (0.2152 \times \text{Average Loss})\]

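If $EV > \$0$, the reconsideration mechanism has a positive expected profit per reconsidered applicant under these assumptions. Note that this calculation treats the model's precision against historical approvals as a proxy for the share of reconsidered loans that perform, because no repayment outcomes are observed for the prior-default group.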
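A worked example shows how sensitive the sign of EV is to the two manual assumptions. The sketch below plugs in the illustrative placeholder values from the assumptions list ($2,000 profit and -$10,000 loss). These are examples only, not firm estimates, and the function name `expected_value` is introduced here for illustration.

```r
# Hypothetical helper: expected value of approving one reconsidered applicant
expected_value <- function(precision, profit_good, loss_bad) {
    precision * profit_good + (1 - precision) * loss_bad
}

precision_rf <- 0.7848 # Random Forest precision reported in this analysis

# Placeholder inputs from the assumptions list; replace with the firm's estimates
ev_example <- expected_value(precision_rf, profit_good = 2000, loss_bad = -10000)

# Break-even: EV >= 0 only if |loss| / profit <= precision / (1 - precision)
breakeven_ratio <- precision_rf / (1 - precision_rf)

cat(sprintf("EV per reconsidered applicant: %.2f\n", ev_example))
cat(sprintf("Break-even loss-to-profit ratio: %.2f\n", breakeven_ratio))
```

With these placeholder figures the EV is negative (about -$582 per applicant), because the assumed loss is five times the assumed profit while the break-even ratio at 78.48% precision is roughly 3.65. Whether reconsideration is profitable therefore depends heavily on the firm's actual profit and loss estimates, and on whatever precision the manual review step adds.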

Key Limitations:

The Absence of True Default Outcome Modeling: The model does not predict actual future default behavior; instead, it estimates similarity to historically approved applicants. Because applicants with prior defaults were automatically denied, we do not observe their loan performance. Therefore, the model supports policy refinement but does not provide validated risk forecasting. Without outcome validation, the firm faces uncertainty regarding the true loss implications of approving reconsidered applicants.
Contextual Blind Spots: While the model effectively evaluates structured financial indicators such as credit score, income, and debt ratios, it cannot offer insight into the underlying circumstances that led to a prior default. A default triggered by an isolated event such as a medical emergency, job loss during a macroeconomic downturn, or temporary liquidity shock carries fundamentally different risk implications than a default resulting from chronic overextension or persistent financial mismanagement. The algorithm cannot distinguish between these scenarios because it relies solely on quantitative snapshot data. As a result, it may overestimate the creditworthiness of some applicants or underestimate behavioral risk factors that are not captured in the dataset. This limitation reinforces the importance of maintaining human review to incorporate contextual judgment before capital is deployed.
Algorithmic Bias and Historical Pattern Reinforcement: The model is trained on historical underwriting decisions that inherently reflect the patterns and judgments embedded in past approval practices. If prior defaults or denial decisions were disproportionately concentrated among certain demographic, socioeconomic, or geographic groups, the model may unintentionally learn and perpetuate those historical disparities. Even in the absence of explicitly protected attributes, correlated variables such as income stability, ZIP code, or employment length can function as proxies that reproduce structural inequities. This creates the risk of disparate impact under fair lending regulations, where seemingly neutral decision criteria produce unequal outcomes across protected classes. As a result, reliance on historical approval data without adjustment could reinforce systemic bias rather than correct it. Prior to deployment, comprehensive fairness testing, disparate impact analysis, and ongoing monitoring would be essential to ensure compliance and mitigate unintended harm.
Target Leakage and Embedded Decision Logic: Certain features within the dataset, such as loan interest rate, may partially reflect prior underwriting assessments rather than purely independent borrower characteristics. Because interest rates are often assigned based on perceived risk, including such variables in the model risks embedding existing decision logic directly into the predictive framework. In this case, the model may appear highly predictive not because it has identified new drivers of creditworthiness, but because it is leveraging information already shaped by prior approval decisions. This phenomenon, known as target leakage, can artificially inflate model performance while reducing interpretability and true explanatory power. If not addressed, it may lead stakeholders to overestimate the model’s independent predictive capability. Careful feature selection and validation are therefore necessary to ensure the model captures genuine risk indicators rather than reinforcing pre-existing decision rules.
Model Drift and Economic Sensitivity: Credit risk is not static: borrower behavior and repayment patterns evolve across economic cycles, interest rate environments, and labor market conditions. A model trained on historical data reflects the economic context in which that data was generated, and its predictive performance may degrade as external conditions change. If deployed, the model would require continuous performance monitoring to detect shifts in approval accuracy, recalibration of decision thresholds to maintain an appropriate risk-return balance, and periodic retraining using updated data. In addition, formal drift-detection controls should be implemented to identify changes in applicant characteristics or risk distributions over time. Without active governance and monitoring, the institution could unknowingly expand exposure to higher-risk borrowers, particularly during economic downturns when default rates typically increase. Sustained oversight is essential to ensure that model performance remains stable and aligned with the firm’s risk appetite.
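One possible way to operationalize the drift monitoring mentioned in the last limitation is a Population Stability Index (PSI), which compares the binned distribution of scores (or of any input feature) at training time with a recent window. PSI is a common industry choice, but it is not named in the original analysis, so treat this as an assumption. The helper `psi` and the simulated score vectors below are hypothetical.

```r
# Hypothetical helper: Population Stability Index between two score samples.
# PSI = sum over bins of (actual share - expected share) * log(actual / expected)
psi <- function(expected, actual, breaks = seq(0, 1, by = 0.1), eps = 1e-6) {
    e <- table(cut(expected, breaks, include.lowest = TRUE)) / length(expected)
    a <- table(cut(actual, breaks, include.lowest = TRUE)) / length(actual)
    # Floor empty bins at a small value so the logarithm is defined
    e <- pmax(as.numeric(e), eps)
    a <- pmax(as.numeric(a), eps)
    sum((a - e) * log(a / e))
}

# Simulated scores, invented for illustration only
set.seed(42)
baseline_scores <- rbeta(5000, 2, 5)   # stand-in for training-time scores
recent_scores <- rbeta(5000, 2.5, 4)   # stand-in for a shifted recent window

psi_same <- psi(baseline_scores, baseline_scores)
psi_shifted <- psi(baseline_scores, recent_scores)
print(c(same = psi_same, shifted = psi_shifted))
```

An identical distribution gives a PSI of zero, and larger values indicate a larger shift. The alert level that should trigger recalibration or retraining would need to be set by the firm's model risk function.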

Human-in-the-Loop Governance:
To mitigate the operational, regulatory, and bias-related risks identified above, the model must function strictly as a decision-support routing tool rather than an autonomous approval mechanism. Applicants flagged by the Reconsideration Score should undergo mandatory underwriter review before any capital is deployed. Human reviewers should evaluate contextual documentation, assess the circumstances surrounding prior defaults, and document their rationale for final decisions to ensure accountability and auditability. Prior to production deployment, fairness audits and disparate impact analyses should be conducted to evaluate potential bias across protected classes. Once live, the model should be subject to ongoing performance validation and independent model risk management oversight to ensure continued alignment with regulatory standards and institutional risk appetite.
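A disparate impact analysis of the kind recommended above often starts by comparing the rate at which each group is flagged for manual review. The sketch below computes that ratio in base R. The `group` column, the toy data, and the helper `disparate_impact_ratio` are hypothetical, since the analysis does not identify which protected attributes are available. The 0.8 cutoff in the comment is the "four-fifths" screening heuristic from US employment guidelines, used here only as an illustrative benchmark and not as a legal test.

```r
# Hypothetical helper: flag rate of each group relative to a reference group
disparate_impact_ratio <- function(flagged, group, reference_group) {
    rates <- sapply(split(flagged, group), mean) # share flagged within each group
    rates / rates[[reference_group]]
}

# Toy data, invented for illustration only
toy <- data.frame(
    group = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
    flagged = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)
)

ratios <- disparate_impact_ratio(toy$flagged, toy$group, reference_group = "A")
print(round(ratios, 3))

# Illustrative screen: ratios below 0.8 warrant closer review
needs_review <- names(ratios)[ratios < 0.8]
print(needs_review)
```

In the toy data, group B is flagged at 40% versus 75% for group A, a ratio of about 0.53, so it would be surfaced for closer review. A production audit would go beyond this single ratio, for example by checking error rates within each group.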

Conclusion – EMILY

The legacy policy of universally denying all applicants with a prior default represents an overly rigid constraint that is demonstrably costing the firm viable loan originations. By implementing a predictive model trained on standard approval behaviors, we have created a systematic mechanism to “rescue” high-quality applicants from automated rejection. The proposed Random Forest model successfully identifies healthy financial profiles within this historically ignored cohort, representing approximately $19.2 million in potential loan volume across the 2,020 applicants scoring at or above the 0.60 threshold. By routing these high-scoring applicants to human underwriters for qualitative review, the firm can strategically expand its loan portfolio, increase revenue, and modernize its credit-decisioning pipeline while preserving necessary, human-led risk mitigation guardrails.

Our performance metrics support this capability. The tuned Random Forest model yielded the highest ROC-AUC (0.8891), indicating that it was the best of the four models at distinguishing historically approved profiles from historically denied ones. Its accuracy (80.36%), precision (78.48%), and recall (77.89%) indicate that the model reproduces the firm’s approval behavior with reasonable reliability.

References – ALL

Tawei Lo. (2022). Loan Approval Classification Data [Data set]. Kaggle. https://www.kaggle.com/datasets/taweilo/loan-approval-classification-data

Provost, F., & Fawcett, T. (2013). Data science for business: What you need to know about data mining and data-analytic thinking. O’Reilly Media. (ISBN 978-1449361327)

Loan Approval Optimization: Re-evaluating Automatic Denials

Kevin Blossfield, Emily McCabe, Sherra’ Williams, Brett Wilzbach

2026-03-03