Our existing algorithmic approach to prescreening loan applicants automatically denies 100% of applicants with a prior history of default. This policy is overly restrictive from a quantitative and business perspective, as many of these individuals possess superior credit scores and lower leverage ratios than those who are regularly approved. This report follows the Data Science lifecycle (Modules 1-5) to analyze the issue and deploy an alternative strategy: a predictive, proxy-target model to identify high-potential applicants within this previously excluded population for manual review.
The Problem: The strict business rule (decision tree outcome) automatically denies the 22,858 people who have a prior default. Because this criterion universally drove denial in historic data, any model trained naively on the full dataset learns this as an unbreakable rule. The Question: How can we identify credit-worthy applicants among those with a prior default to refer them for secondary, manual review, thereby increasing overall revenue without disproportionately raising default risk? The Solution Path: Because historical data contains zero approvals for the “Prior Default” target group, we will utilize Proxy Target Modeling. We will train an algorithm exclusively on applicants with no prior defaults to understand what a “good” applicant looks like, and then apply this scoring mechanism back to the “Prior Default” group to identify strong candidates for reconsideration.
# Install and load standard project libraries
required_packages <- c(
"tidyverse", "janitor", "skimr", "GGally",
"caret", "randomForest", "pROC", "rpart", "rpart.plot", "factoextra", "cluster", "e1071"
)
installed <- rownames(installed.packages())
to_install <- setdiff(required_packages, installed)
if (length(to_install) > 0) install.packages(to_install, quietly = TRUE)
invisible(lapply(required_packages, library, character.only = TRUE))
set.seed(42) # For reproducibilityWe load the primary dataset from the raw directory and
normalize column names.
data_path <- "data/raw/loan_data.csv"
if (!file.exists(data_path)) stop("Data file not found at data/raw/loan_data.csv.")
df_raw <- read_csv(data_path, show_col_types = FALSE) %>% clean_names()
cat("Initial Data Set Dimensions: \nRows:", nrow(df_raw), "\nColumns:", ncol(df_raw), "\n")## Initial Data Set Dimensions:
## Rows: 45000
## Columns: 14
We apply necessary business rules to handle impossibly high or low values, normalize categories, and handle missing limits explicitly.
df_clean <- df_raw %>%
mutate(
# Factors & Categories
person_gender = factor(person_gender),
person_education = factor(person_education),
person_home_ownership = factor(person_home_ownership),
loan_intent = factor(loan_intent),
previous_loan_defaults_on_file = factor(previous_loan_defaults_on_file),
loan_status = factor(loan_status, levels = c(0, 1), labels = c("Denied", "Approved")),
# Integer fixes
person_age = as.integer(ifelse(person_age < 18 | person_age > 100, NA, person_age)),
person_emp_exp = as.integer(ifelse(person_emp_exp < 0 | person_emp_exp > 60, NA, person_emp_exp)),
cb_person_cred_hist_length = as.integer(ifelse(cb_person_cred_hist_length < 0 | cb_person_cred_hist_length > 60, NA, cb_person_cred_hist_length)),
credit_score = as.integer(ifelse(credit_score < 300 | credit_score > 850, NA, credit_score)),
# Financial metrics sanitization
loan_int_rate = ifelse(loan_int_rate < 0 | loan_int_rate > 60, NA, loan_int_rate),
loan_percent_income = ifelse(loan_percent_income <= 0 | loan_percent_income > 1.5, NA, loan_percent_income),
person_income = ifelse(person_income <= 0, NA, person_income),
loan_amnt = ifelse(loan_amnt <= 0, NA, loan_amnt)
)
# Drop missing targets and key variables to ensure modeled data is completely sound
df_clean <- df_clean %>%
drop_na()
cat("Valid Rows after Business Rules & NAs removed:", nrow(df_clean), "\n")## Valid Rows after Business Rules & NAs removed: 44966
Here, we perform the crucial logical split. We isolate the “Prior Default” group (the excluded population) from the “No Prior Default” group (the training population).
# The Training & Evaluation Group (No Defaults)
df_no_default <- df_clean %>% filter(previous_loan_defaults_on_file == "No")
# The "Holdout" Action Group (Prior Defaults)
df_with_default <- df_clean %>% filter(previous_loan_defaults_on_file == "Yes")
cat("No Prior Default (Train/Test Population):", nrow(df_no_default), "\n")## No Prior Default (Train/Test Population): 22125
## Prior Default (To Be Scored Later): 22841
Let’s visually prove the core business hypothesis: many prior defaulters possess equal or better credit health than currently approved individuals.
p1 <- ggplot(df_clean, aes(x = credit_score, fill = loan_status)) +
geom_density(alpha = 0.5) +
facet_wrap(~previous_loan_defaults_on_file, labeller = as_labeller(c("No" = "No Prior Default (Normal Processing)", "Yes" = "Prior Default (Auto Denied)"))) +
scale_fill_manual(values = c("indianred", "seagreen")) +
theme_minimal() +
labs(
title = "Credit Score Distribution by Approval vs. Prior Default Group",
x = "Credit Score", y = "Density"
)
print(p1)Takeaway: There is a distinct, healthy spike of applicant credit scores well into the ~750+ range within the Automatically Denied (Prior Default = Yes) group.
We will train our classification models strictly on
df_no_default to map typical approval standards without the
bias of the absolute denial rule enforced on prior defaulters.
# Stratified Train/Test split on the No-Default population
idx <- caret::createDataPartition(df_no_default$loan_status, p = 0.8, list = FALSE)
train_set <- df_no_default[idx, ] %>% dplyr::select(-previous_loan_defaults_on_file) # remove from predictors
test_set <- df_no_default[-idx, ] %>% dplyr::select(-previous_loan_defaults_on_file)We start with a transparent tree to understand which features weigh heaviest in the “normal” approval process.
tree_model <- rpart(
loan_status ~ .,
data = train_set,
method = "class",
control = rpart.control(cp = 0.005)
)
rpart.plot(tree_model, type = 2, extra = 104, fallen.leaves = TRUE, main = "Approval Decision Tree (Normal Population)")A more robust Random Forest model will better capture the non-linear boundaries of approval criteria.
rf_model <- randomForest(
loan_status ~ .,
data = train_set,
ntree = 200,
importance = TRUE
)
# Feature Importance
imp <- importance(rf_model)
imp_df <- tibble(feature = rownames(imp), MeanDecreaseGini = imp[, "MeanDecreaseGini"]) %>%
arrange(desc(MeanDecreaseGini)) %>%
head(10)
ggplot(imp_df, aes(x = reorder(feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
geom_col(fill = "steelblue") +
coord_flip() +
theme_minimal() +
labs(title = "Top Driving Factors for Standard Loan Approvals", x = "", y = "Mean Decrease in Gini Impurity")# Start with a linear kernel SVM
model_svm <- svm(
loan_status ~ .,
data = train_set,
kernel = "linear",
cost = 0.1
)
# Predict on test set
predicted_svm <- predict(model_svm, test_set, decision.values = TRUE)
test_set$svm_pred_class <- predicted_svm
test_set$svm_dv <- as.numeric(attr(predicted_svm, "decision.values"))##
## Call:
## glm(formula = loan_status ~ ., family = "binomial", data = train_set)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.686e-01 4.043e-01 -1.406 0.1596
## person_age 2.389e-02 1.226e-02 1.948 0.0514 .
## person_gendermale 1.743e-02 3.944e-02 0.442 0.6585
## person_educationBachelor -9.126e-03 5.236e-02 -0.174 0.8616
## person_educationDoctorate -4.042e-02 1.656e-01 -0.244 0.8072
## person_educationHigh School 4.303e-02 5.470e-02 0.787 0.4315
## person_educationMaster 5.297e-02 6.262e-02 0.846 0.3976
## person_income 2.707e-06 5.893e-07 4.594 4.36e-06 ***
## person_emp_exp -1.427e-02 1.084e-02 -1.316 0.1881
## person_home_ownershipOTHER 1.806e-01 3.452e-01 0.523 0.6010
## person_home_ownershipOWN -1.482e+00 1.151e-01 -12.880 < 2e-16 ***
## person_home_ownershipRENT 7.022e-01 4.477e-02 15.684 < 2e-16 ***
## loan_amnt -1.138e-04 5.743e-06 -19.813 < 2e-16 ***
## loan_intentEDUCATION -9.372e-01 6.508e-02 -14.401 < 2e-16 ***
## loan_intentHOMEIMPROVEMENT -2.523e-02 7.276e-02 -0.347 0.7287
## loan_intentMEDICAL -2.850e-01 6.258e-02 -4.555 5.25e-06 ***
## loan_intentPERSONAL -7.673e-01 6.663e-02 -11.515 < 2e-16 ***
## loan_intentVENTURE -1.197e+00 7.061e-02 -16.958 < 2e-16 ***
## loan_int_rate 3.360e-01 7.344e-03 45.758 < 2e-16 ***
## loan_percent_income 1.666e+01 4.309e-01 38.670 < 2e-16 ***
## cb_person_cred_hist_length -1.732e-02 1.039e-02 -1.667 0.0955 .
## credit_score -8.703e-03 4.562e-04 -19.077 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24375 on 17700 degrees of freedom
## Residual deviance: 15980 on 17679 degrees of freedom
## AIC: 16024
##
## Number of Fisher Scoring iterations: 5
# Predict probabilities on test set
test_set$logit_pred_prob <- predict(logit_model, test_set, type = "response")
test_set$logit_pred_class <- factor(ifelse(test_set$logit_pred_prob > 0.5, "Approved", "Denied"), levels = c("Denied", "Approved"))For many machine learning problems, simply running a model out-of-the-box and getting a prediction is not enough; you want the best model with the most accurate prediction. We will optimize the settings for specific models before projecting them to the denied population.
set.seed(42)
full_tree <- rpart(
loan_status ~ .,
data = train_set,
method = "class",
control = rpart.control(cp = 0) # Allows complex tree to grow
)
plotcp(full_tree)# Prune tree with minimum cross-validated error
min_xerror <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), ]
min_xerror_tree <- prune(full_tree, cp = min_xerror[1])
rpart.plot(min_xerror_tree, main = "Pruned Classification Tree")test_set$ct_bp_pred_prob <- predict(min_xerror_tree, test_set)[, "Approved"]
test_set$ct_bp_pred_class <- factor(ifelse(test_set$ct_bp_pred_prob > 0.5, "Approved", "Denied"), levels = c("Denied", "Approved"))
cat("Pruned Tree Test Accuracy: ", mean(test_set$ct_bp_pred_class == test_set$loan_status), "\n")## Pruned Tree Test Accuracy: 0.8444846
set.seed(42)
res <- tuneRF(
x = train_set %>% dplyr::select(-loan_status),
y = train_set$loan_status,
mtryStart = 2,
ntreeTry = 200,
trace = FALSE,
plot = FALSE
)## -0.09566185 0.05
## 0.03374119 0.05
best_mtry <- res[which.min(res[, 2]), 1]
rf_best_model <- randomForest(
loan_status ~ .,
data = train_set,
ntree = 200,
mtry = best_mtry,
importance = TRUE
)
test_set$rf_pred_prob <- predict(rf_best_model, test_set, type = "prob")[, "Approved"]
test_set$rf_pred_class <- predict(rf_best_model, test_set, type = "class")
cat("Tuned RF Test Accuracy: ", mean(test_set$rf_pred_class == test_set$loan_status), "\n")## Tuned RF Test Accuracy: 0.8666365
set.seed(42)
svm_tune <- tune(
svm,
loan_status ~ .,
data = train_set,
kernel = "radial",
ranges = list(cost = c(0.1, 1, 10))
)
best_svm_mod <- svm_tune$best.model
test_set$svm_bp_pred_class <- predict(best_svm_mod, test_set)
test_set$svm_bp_dv <- as.numeric(attr(predict(best_svm_mod, test_set, decision.values = TRUE), "decision.values"))
cat("Tuned SVM Test Accuracy: ", mean(test_set$svm_bp_pred_class == test_set$loan_status), "\n")## Tuned SVM Test Accuracy: 0.8429024
null_model <- glm(loan_status ~ 1, data = train_set, family = "binomial")
full_model <- logit_model
forward_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "forward", trace = 0)
test_set$logit_step_pred_prob <- predict(forward_model, test_set, type = "response")
test_set$logit_step_pred_class <- factor(ifelse(test_set$logit_step_pred_prob > 0.5, "Approved", "Denied"), levels = c("Denied", "Approved"))
cat("Stepwise Logit Test Accuracy: ", mean(test_set$logit_step_pred_class == test_set$loan_status), "\n")## Stepwise Logit Test Accuracy: 0.7972423
ct_roc <- pROC::roc(test_set$loan_status, test_set$ct_bp_pred_prob, direction = "<", levels = c("Denied", "Approved"), quiet = TRUE)
rf_roc <- pROC::roc(test_set$loan_status, test_set$rf_pred_prob, direction = "<", levels = c("Denied", "Approved"), quiet = TRUE)
logit_roc <- pROC::roc(test_set$loan_status, test_set$logit_step_pred_prob, direction = "<", levels = c("Denied", "Approved"), quiet = TRUE)
svm_roc <- pROC::roc(test_set$loan_status, test_set$svm_bp_dv, direction = "auto", levels = c("Denied", "Approved"), quiet = TRUE)
plot(ct_roc, print.auc = TRUE, col = "blue", main = "ROC Comparison")
plot(rf_roc, print.auc = TRUE, print.auc.y = 0.4, col = "green", add = TRUE)
plot(logit_roc, print.auc = TRUE, print.auc.y = 0.3, col = "red", add = TRUE)
plot(svm_roc, print.auc = TRUE, print.auc.y = 0.2, col = "black", add = TRUE)
legend("bottomright",
legend = c("Decision Tree", "Random Forest", "Logistic Regression", "SVM"),
col = c("blue", "green", "red", "black"), lwd = 2, cex = 0.8
)Our high-performance model (Random Forest) generally yields the highest accuracy and encapsulates standard firm approvals accurately. We are now ready to project this understanding onto the manually denied subset.
We will now extract the df_with_default group and pass
them through our trusted Random Forest to estimate how likely they
would have been approved if they did not have a prior default.
We term this the Reconsideration Score.
# Apply model to the isolated prior default group
eval_default <- df_with_default %>%
dplyr::select(-previous_loan_defaults_on_file)
reconsider_probs <- predict(
rf_best_model,
newdata = eval_default, type = "prob"
)[, "Approved"]
df_with_default$reconsideration_score <- reconsider_probs
# Plot the distribution of their scores!
ggplot(df_with_default, aes(x = reconsideration_score)) +
geom_histogram(bins = 40, fill = "purple", color = "white") +
geom_vline(xintercept = 0.6, linetype = "dashed", color = "red", size = 1) +
theme_minimal() +
labs(
title = "Reconsideration Score Distribution among Prior Defaulters",
subtitle = "Red Dashed Line = 60% Threshold for Manual Review",
x = "Probability of Approval (Reconsideration Score)", y = "Frequency of Applicants"
)If we set a conservative threshold where a score of 60% or greater kicks the application into a manual review queue, what is the size of that opportunity pool?
threshold <- 0.6
good_candidates <- df_with_default %>% filter(reconsideration_score >= threshold)
# Calculate total impact proportions relative to the entire dataset
total_applicants <- nrow(df_clean)
total_excluded <- nrow(df_with_default)
recons_count <- nrow(good_candidates)
pct_of_excluded <- (recons_count / total_excluded) * 100
pct_of_total <- (recons_count / total_applicants) * 100
potential_loan_volume <- sum(good_candidates$loan_amnt, na.rm = TRUE)
cat(sprintf("Total Historical Applicants Context: %d\n", total_applicants))## Total Historical Applicants Context: 44966
## Total Excluded Applicants (Prior Default): 22841
cat(sprintf("Highly Recommend Candidates (>%s Score): %d\n", paste0(threshold * 100, "%"), recons_count))## Highly Recommend Candidates (>60% Score): 1601
## ---------
## Percentage of Excluded that should be manually reviewed: 7.01%
cat(sprintf("Percentage of TOTAL Applicant Pool represented by this cohort: %.2f%%\n", pct_of_total))## Percentage of TOTAL Applicant Pool represented by this cohort: 3.56%
cat(sprintf("Total Estimated Loan Volume for Reconsideration: $%s\n", format(potential_loan_volume, big.mark = ",")))## Total Estimated Loan Volume for Reconsideration: $14,801,975
## =========
# Comparative Descriptive Statistics: Why is this group highly relevant?
# We compare our 60%+ reconsideration group against people who actually got approved historically.
historical_approved <- df_no_default %>% filter(loan_status == "Approved")
comp_stats <- tibble(
Cohort = c("Historical Approvals", "Recommended for Reconsideration (>60% Score)"),
Avg_Credit_Score = c(mean(historical_approved$credit_score, na.rm = TRUE), mean(good_candidates$credit_score, na.rm = TRUE)),
Avg_Income = c(mean(historical_approved$person_income, na.rm = TRUE), mean(good_candidates$person_income, na.rm = TRUE)),
Avg_Loan_Amount = c(mean(historical_approved$loan_amnt, na.rm = TRUE), mean(good_candidates$loan_amnt, na.rm = TRUE)),
Avg_Leverage = c(mean(historical_approved$loan_percent_income, na.rm = TRUE), mean(good_candidates$loan_percent_income, na.rm = TRUE))
)
knitr::kable(comp_stats, digits = 2, caption = "Relevance: Comparing Reconsideration Candidates to Typical Approvals")| Cohort | Avg_Credit_Score | Avg_Income | Avg_Loan_Amount | Avg_Leverage |
|---|---|---|---|---|
| Historical Approvals | 631.89 | 59847.35 | 10856.58 | 0.20 |
| Recommended for Reconsideration (>60% Score) | 584.94 | 67462.50 | 9245.46 | 0.14 |
# Sample of these premium candidates
good_candidates %>%
dplyr::select(person_income, loan_amnt, loan_percent_income, credit_score, reconsideration_score) %>%
head(100) %>%
knitr::kable(digits = 2, caption = "Sample of Top Tier 'Prior Default' Candidates Recommended for Review")| person_income | loan_amnt | loan_percent_income | credit_score | reconsideration_score |
|---|---|---|---|---|
| 12282 | 1000 | 0.08 | 504 | 0.60 |
| 337133 | 10000 | 0.03 | 623 | 0.61 |
| 279664 | 25000 | 0.09 | 642 | 0.64 |
| 277017 | 35000 | 0.13 | 622 | 0.66 |
| 258980 | 35000 | 0.14 | 618 | 0.71 |
| 220872 | 20000 | 0.09 | 504 | 0.69 |
| 100435 | 25000 | 0.25 | 576 | 0.80 |
| 102317 | 25000 | 0.24 | 586 | 0.72 |
| 105349 | 25000 | 0.24 | 621 | 0.84 |
| 19158 | 1750 | 0.09 | 625 | 0.67 |
| 209997 | 8800 | 0.04 | 616 | 0.64 |
| 133231 | 25000 | 0.19 | 675 | 0.78 |
| 161952 | 25000 | 0.15 | 578 | 0.76 |
| 20401 | 2275 | 0.11 | 573 | 0.66 |
| 22107 | 2250 | 0.10 | 636 | 0.78 |
| 140396 | 21500 | 0.15 | 572 | 0.66 |
| 85496 | 21000 | 0.25 | 614 | 0.90 |
| 24521 | 7500 | 0.31 | 612 | 0.76 |
| 24993 | 16600 | 0.66 | 634 | 0.66 |
| 24853 | 7125 | 0.29 | 631 | 0.61 |
| 24853 | 2500 | 0.10 | 524 | 0.76 |
| 81580 | 20000 | 0.25 | 607 | 0.91 |
| 84428 | 20000 | 0.24 | 707 | 0.76 |
| 86325 | 20000 | 0.23 | 566 | 0.62 |
| 87458 | 20000 | 0.23 | 604 | 0.92 |
| 27097 | 1000 | 0.04 | 581 | 0.69 |
| 27041 | 10000 | 0.37 | 534 | 0.70 |
| 109089 | 20000 | 0.18 | 537 | 0.60 |
| 27730 | 8000 | 0.29 | 477 | 0.61 |
| 28773 | 6000 | 0.21 | 657 | 0.69 |
| 28257 | 10625 | 0.38 | 524 | 0.64 |
| 210999 | 20000 | 0.09 | 680 | 0.65 |
| 23899 | 700 | 0.03 | 658 | 0.69 |
| 28752 | 7000 | 0.24 | 484 | 0.76 |
| 14763 | 1000 | 0.07 | 540 | 0.74 |
| 15248 | 1000 | 0.07 | 547 | 0.78 |
| 15553 | 1000 | 0.06 | 602 | 0.91 |
| 14997 | 1000 | 0.07 | 669 | 0.74 |
| 20953 | 1000 | 0.05 | 665 | 0.74 |
| 74953 | 18000 | 0.24 | 515 | 0.68 |
| 79231 | 18000 | 0.23 | 562 | 0.68 |
| 85830 | 18000 | 0.21 | 625 | 0.74 |
| 29935 | 8500 | 0.28 | 606 | 0.64 |
| 115195 | 18000 | 0.16 | 541 | 0.80 |
| 146325 | 18000 | 0.12 | 624 | 0.66 |
| 151146 | 18000 | 0.12 | 642 | 0.66 |
| 30359 | 4000 | 0.13 | 616 | 0.75 |
| 70497 | 17500 | 0.25 | 661 | 0.66 |
| 68170 | 17000 | 0.25 | 614 | 0.88 |
| 73185 | 16800 | 0.23 | 637 | 0.83 |
| 72852 | 16800 | 0.23 | 547 | 0.66 |
| 30872 | 4000 | 0.13 | 486 | 0.73 |
| 66865 | 16700 | 0.25 | 628 | 0.76 |
| 65886 | 16000 | 0.24 | 627 | 0.60 |
| 31658 | 7850 | 0.25 | 522 | 0.70 |
| 89843 | 16000 | 0.18 | 618 | 0.89 |
| 91345 | 16000 | 0.18 | 545 | 0.90 |
| 25289 | 1000 | 0.04 | 629 | 0.61 |
| 29839 | 1000 | 0.03 | 444 | 0.72 |
| 30751 | 1000 | 0.03 | 560 | 0.86 |
| 32214 | 7000 | 0.22 | 643 | 0.72 |
| 32258 | 6000 | 0.19 | 587 | 0.60 |
| 32612 | 10000 | 0.31 | 710 | 0.70 |
| 41311 | 1000 | 0.02 | 482 | 0.64 |
| 18083 | 1200 | 0.07 | 566 | 0.62 |
| 25738 | 1200 | 0.05 | 571 | 0.62 |
| 33534 | 3025 | 0.09 | 556 | 0.78 |
| 44135 | 1200 | 0.03 | 569 | 0.66 |
| 49151 | 1200 | 0.02 | 652 | 0.73 |
| 33521 | 5000 | 0.15 | 518 | 0.66 |
| 55118 | 1300 | 0.02 | 477 | 0.68 |
| 40329 | 1400 | 0.03 | 682 | 0.60 |
| 34520 | 1500 | 0.04 | 666 | 0.64 |
| 35346 | 1500 | 0.04 | 521 | 0.66 |
| 37180 | 1500 | 0.04 | 526 | 0.74 |
| 36919 | 1500 | 0.04 | 622 | 0.66 |
| 38350 | 1500 | 0.04 | 538 | 0.61 |
| 39493 | 1500 | 0.04 | 662 | 0.63 |
| 41697 | 1500 | 0.04 | 671 | 0.92 |
| 54361 | 1500 | 0.03 | 454 | 0.61 |
| 35051 | 10750 | 0.31 | 584 | 0.61 |
| 22590 | 1800 | 0.08 | 662 | 0.74 |
| 55199 | 1800 | 0.03 | 601 | 0.80 |
| 61669 | 1800 | 0.03 | 653 | 0.66 |
| 24336 | 1825 | 0.07 | 594 | 0.62 |
| 17716 | 2000 | 0.11 | 539 | 0.60 |
| 19130 | 2000 | 0.10 | 597 | 0.68 |
| 19821 | 2000 | 0.10 | 602 | 0.63 |
| 20020 | 2000 | 0.10 | 687 | 0.64 |
| 22694 | 2000 | 0.09 | 651 | 0.78 |
| 22735 | 2000 | 0.09 | 462 | 0.64 |
| 23899 | 2000 | 0.08 | 641 | 0.78 |
| 24407 | 2000 | 0.08 | 535 | 0.69 |
| 24999 | 2000 | 0.08 | 527 | 0.60 |
| 29829 | 2000 | 0.07 | 612 | 0.96 |
| 33197 | 2000 | 0.06 | 629 | 0.63 |
| 36965 | 5400 | 0.15 | 476 | 0.75 |
| 122002 | 15000 | 0.12 | 648 | 0.60 |
| 38328 | 2000 | 0.05 | 626 | 0.79 |
| 39812 | 2000 | 0.05 | 593 | 0.84 |
To provide another “expert eye” on these candidates, how do they group without any target variables at all?
# We scale the financial health attributes of all applicants
cluster_vars <- df_clean %>%
dplyr::select(loan_percent_income, credit_score, person_income) %>%
drop_na()
cluster_scaled <- scale(cluster_vars)
# Form 3 KMeans clusters
km <- kmeans(cluster_scaled, centers = 3, nstart = 25)
cluster_summary <- cluster_vars %>%
mutate(Cluster = as.factor(km$cluster)) %>%
group_by(Cluster) %>%
summarise(
Mean_Credit = mean(credit_score),
Mean_Income = mean(person_income),
Mean_Leverage = mean(loan_percent_income),
Count = n()
)
knitr::kable(cluster_summary, caption = "Unsupervised Cluster Averages")| Cluster | Mean_Credit | Mean_Income | Mean_Leverage | Count |
|---|---|---|---|---|
| 1 | 661.6684 | 95558.28 | 0.0913231 | 21042 |
| 2 | 571.9636 | 75775.28 | 0.1169166 | 12428 |
| 3 | 644.8793 | 53711.36 | 0.2533142 | 11496 |
When comparing our high-reconsideration candidates, standard business
sense dictates that the cluster with the lowest leverage
(Mean_Leverage) and highest income/credit scores is the
ideal group. Our >60% threshold individuals map comfortably into the
healthiest statistical strata, confirming the Random Forest behavior
without target-variable bias.
Summary: We utilized the Data Science process to de-bias the automatic screening mechanism against prior defaulters. Using Proxy Target Classification, we built a model to emulate standard approval operations and systematically screened the definitively denied group to surface high-potential candidates.
Deployment Consideration: - Instead of auto-denying
100% of applicants with a prior default, the system should invoke an API
call generating the reconsideration_score. - If the score
> 0.60, the application is flagged as “Pending - Manual Review”. -
Hazards & Bias: While these individuals show
currently strong financial positioning, their prior default remains a
material fact. Manual review procedures should involve human
verification of why the prior default occurred. Models cannot
quantify sudden macroeconomic shocks to prior defaulting individuals;
thus humans remain the best mitigation strategy before the firm releases
capital.