Executive Summary

Our existing algorithmic approach to prescreening loan applicants automatically denies 100% of applicants with a prior history of default. This policy is overly restrictive from a quantitative and business perspective, as many of these individuals possess superior credit scores and lower leverage ratios than those who are regularly approved. This report follows the Data Science lifecycle (Modules 1-5) to analyze the issue and deploy an alternative strategy: a predictive, proxy-target model to identify high-potential applicants within this previously excluded population for manual review.

M1: Business Understanding

The Problem: The strict business rule (decision tree outcome) automatically denies the 22,858 people who have a prior default. Because this criterion universally drove denial in historic data, any model trained naively on the full dataset learns this as an unbreakable rule. The Question: How can we identify credit-worthy applicants among those with a prior default to refer them for secondary, manual review, thereby increasing overall revenue without disproportionately raising default risk? The Solution Path: Because historical data contains zero approvals for the “Prior Default” target group, we will utilize Proxy Target Modeling. We will train an algorithm exclusively on applicants with no prior defaults to understand what a “good” applicant looks like, and then apply this scoring mechanism back to the “Prior Default” group to identify strong candidates for reconsideration.

# Install and load standard project libraries
required_packages <- c(
  "tidyverse", "janitor", "skimr", "GGally",
  "caret", "randomForest", "pROC", "rpart", "rpart.plot", "factoextra", "cluster"
)

installed <- rownames(installed.packages())
to_install <- setdiff(required_packages, installed)
if (length(to_install) > 0) install.packages(to_install, quietly = TRUE)

invisible(lapply(required_packages, library, character.only = TRUE))
set.seed(42) # For reproducibility

M2: Data Understanding & Preparation

2.1 Load & Initial Structure Check

We load the primary dataset from the raw directory and normalize column names.

data_path <- "data/raw/loan_data.csv"
if (!file.exists(data_path)) stop("Data file not found at data/raw/loan_data.csv.")

df_raw <- read_csv(data_path, show_col_types = FALSE) %>% clean_names()

cat("Initial Data Set Dimensions: \nRows:", nrow(df_raw), "\nColumns:", ncol(df_raw), "\n")

## Initial Data Set Dimensions: 
## Rows: 45000 
## Columns: 14

2.2 Data Cleaning & Type Formatting

We apply necessary business rules to handle impossibly high or low values, normalize categories, and handle missing limits explicitly.

df_clean <- df_raw %>%
  mutate(
    # Factors & Categories
    person_gender = factor(person_gender),
    person_education = factor(person_education),
    person_home_ownership = factor(person_home_ownership),
    loan_intent = factor(loan_intent),
    previous_loan_defaults_on_file = factor(previous_loan_defaults_on_file),
    loan_status = factor(loan_status, levels = c(0, 1), labels = c("Denied", "Approved")),

    # Integer fixes
    person_age = as.integer(ifelse(person_age < 18 | person_age > 100, NA, person_age)),
    person_emp_exp = as.integer(ifelse(person_emp_exp < 0 | person_emp_exp > 60, NA, person_emp_exp)),
    cb_person_cred_hist_length = as.integer(ifelse(cb_person_cred_hist_length < 0 | cb_person_cred_hist_length > 60, NA, cb_person_cred_hist_length)),
    credit_score = as.integer(ifelse(credit_score < 300 | credit_score > 850, NA, credit_score)),

    # Financial metrics sanitization
    loan_int_rate = ifelse(loan_int_rate < 0 | loan_int_rate > 60, NA, loan_int_rate),
    loan_percent_income = ifelse(loan_percent_income <= 0 | loan_percent_income > 1.5, NA, loan_percent_income),
    person_income = ifelse(person_income <= 0, NA, person_income),
    loan_amnt = ifelse(loan_amnt <= 0, NA, loan_amnt)
  )

# Drop missing targets and key variables to ensure modeled data is completely sound
df_clean <- df_clean %>%
  drop_na()

cat("Valid Rows after Business Rules & NAs removed:", nrow(df_clean), "\n")

## Valid Rows after Business Rules & NAs removed: 44966

2.3 The “Base Case 1” Data Split

Here, we perform the crucial logical split. We isolate the “Prior Default” group (the excluded population) from the “No Prior Default” group (the training population).

# The Training & Evaluation Group (No Defaults)
df_no_default <- df_clean %>% filter(previous_loan_defaults_on_file == "No")

# The "Holdout" Action Group (Prior Defaults)
df_with_default <- df_clean %>% filter(previous_loan_defaults_on_file == "Yes")

cat("No Prior Default (Train/Test Population):", nrow(df_no_default), "\n")

## No Prior Default (Train/Test Population): 22125

cat("Prior Default (To Be Scored Later):", nrow(df_with_default), "\n")

## Prior Default (To Be Scored Later): 22841

Exploratory Overlap (The Business Case Evidence)

Let’s visually prove the core business hypothesis: many prior defaulters possess equal or better credit health than currently approved individuals.

p1 <- ggplot(df_clean, aes(x = credit_score, fill = loan_status)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~previous_loan_defaults_on_file, labeller = as_labeller(c("No" = "No Prior Default (Normal Processing)", "Yes" = "Prior Default (Auto Denied)"))) +
  scale_fill_manual(values = c("indianred", "seagreen")) +
  theme_minimal() +
  labs(
    title = "Credit Score Distribution by Approval vs. Prior Default Group",
    x = "Credit Score", y = "Density"
  )

print(p1)

Takeaway: There is a distinct, healthy spike of applicant credit scores well into the ~750+ range within the Automatically Denied (Prior Default = Yes) group.

M3: Modeling

3.1 Proxy Target Approach Setup

We will train our classification models strictly on df_no_default to map typical approval standards without the bias of the absolute denial rule enforced on prior defaulters.

# Stratified Train/Test split on the No-Default population
idx <- caret::createDataPartition(df_no_default$loan_status, p = 0.8, list = FALSE)
train_set <- df_no_default[idx, ] %>% dplyr::select(-previous_loan_defaults_on_file) # remove from predictors
test_set <- df_no_default[-idx, ] %>% dplyr::select(-previous_loan_defaults_on_file)

3.2 Interpretable Model Baseline (Decision Tree)

We start with a transparent tree to understand which features weigh heaviest in the “normal” approval process.

tree_model <- rpart(
  loan_status ~ .,
  data = train_set,
  method = "class",
  control = rpart.control(cp = 0.005)
)

rpart.plot(tree_model, type = 2, extra = 104, fallen.leaves = TRUE, main = "Approval Decision Tree (Normal Population)")

3.3 High-Performance Model (Random Forest)

A more robust Random Forest model will better capture the non-linear boundaries of approval criteria.

rf_model <- randomForest(
  loan_status ~ .,
  data = train_set,
  ntree = 200,
  importance = TRUE
)

# Feature Importance
imp <- importance(rf_model)
imp_df <- tibble(feature = rownames(imp), MeanDecreaseGini = imp[, "MeanDecreaseGini"]) %>%
  arrange(desc(MeanDecreaseGini)) %>%
  head(10)

ggplot(imp_df, aes(x = reorder(feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Top Driving Factors for Standard Loan Approvals", x = "", y = "Mean Decrease in Gini Impurity")

M4: Model Tuning & Evaluation

Let’s assert our model’s performance on the 20% validation set representing normal applicants. If our model accurately evaluates “normal” applicants, we can trust it to proxy score the separated prior defaulters.

rf_prob <- predict(rf_model, newdata = test_set, type = "prob")[, "Approved"]
rf_pred <- factor(ifelse(rf_prob > 0.5, "Approved", "Denied"), levels = c("Denied", "Approved"))

# Confusion Matrix
cm_rf <- confusionMatrix(rf_pred, test_set$loan_status, positive = "Approved")
print(cm_rf$table)

##           Reference
## Prediction Denied Approved
##   Denied     2250      409
##   Approved    175     1590

cat(paste0("\nOverall Accuracy on Test Set: ", round(cm_rf$overall["Accuracy"], 4) * 100, "%\n"))

## 
## Overall Accuracy on Test Set: 86.8%

cat(paste0("Sensitivity (True Approval Rate): ", round(cm_rf$byClass["Sensitivity"], 4) * 100, "%\n"))

## Sensitivity (True Approval Rate): 79.54%

# ROC Score Setup
roc_rf <- pROC::roc(response = test_set$loan_status, predictor = rf_prob, levels = c("Denied", "Approved"), direction = "<")
plot(roc_rf, main = paste("Random Forest ROC Curve (AUC =", round(auc(roc_rf), 3), ")"), col = "darkblue", lwd = 2)

Our model highly encapsulates and correctly classifies the standard firm approvals. We are now ready to project this understanding onto the manually denied subset.

M5: Base Case Execution & Deployment

5.1 Projecting Scores to Prior Defaulters (The Reconsideration Score)

We will now extract the df_with_default group and pass them through our trusted Random Forest to estimate how likely they would have been approved if they did not have a prior default. We term this the Reconsideration Score.

# Apply model to the isolated prior default group
eval_default <- df_with_default %>%
  dplyr::select(-previous_loan_defaults_on_file)
reconsider_probs <- predict(
  rf_model,
  newdata = eval_default, type = "prob"
)[, "Approved"]

df_with_default$reconsideration_score <- reconsider_probs

# Plot the distribution of their scores!
ggplot(df_with_default, aes(x = reconsideration_score)) +
  geom_histogram(bins = 40, fill = "purple", color = "white") +
  geom_vline(xintercept = 0.6, linetype = "dashed", color = "red", size = 1) +
  theme_minimal() +
  labs(
    title = "Reconsideration Score Distribution among Prior Defaulters",
    subtitle = "Red Dashed Line = 60% Threshold for Manual Review",
    x = "Probability of Approval (Reconsideration Score)", y = "Frequency of Applicants"
  )

5.2 Business Outcome: Revenue Identification

If we set a conservative threshold where a score of 60% or greater kicks the application into a manual review queue, what is the size of that opportunity pool?

threshold <- 0.6
good_candidates <- df_with_default %>% filter(reconsideration_score >= threshold)

# Calculate total impact proportions relative to the entire dataset
total_applicants <- nrow(df_clean)
total_excluded <- nrow(df_with_default)
recons_count <- nrow(good_candidates)

pct_of_excluded <- (recons_count / total_excluded) * 100
pct_of_total <- (recons_count / total_applicants) * 100
potential_loan_volume <- sum(good_candidates$loan_amnt, na.rm = TRUE)

cat(sprintf("Total Historical Applicants Context: %d\n", total_applicants))

## Total Historical Applicants Context: 44966

cat(sprintf("Total Excluded Applicants (Prior Default): %d\n", total_excluded))

## Total Excluded Applicants (Prior Default): 22841

cat(sprintf("Highly Recommend Candidates (>%s Score): %d\n", paste0(threshold * 100, "%"), recons_count))

## Highly Recommend Candidates (>60% Score): 1530

cat("---------\n")

## ---------

cat(sprintf("Percentage of Excluded that should be manually reviewed: %.2f%%\n", pct_of_excluded))

## Percentage of Excluded that should be manually reviewed: 6.70%

cat(sprintf("Percentage of TOTAL Applicant Pool represented by this cohort: %.2f%%\n", pct_of_total))

## Percentage of TOTAL Applicant Pool represented by this cohort: 3.40%

cat(sprintf("Total Estimated Loan Volume for Reconsideration: $%s\n", format(potential_loan_volume, big.mark = ",")))

## Total Estimated Loan Volume for Reconsideration: $14,422,732

cat("=========\n")

## =========

# Comparative Descriptive Statistics: Why is this group highly relevant?
# We compare our 60%+ reconsideration group against people who actually got approved historically.
historical_approved <- df_no_default %>% filter(loan_status == "Approved")

comp_stats <- tibble(
  Cohort = c("Historical Approvals", "Recommended for Reconsideration (>60% Score)"),
  Avg_Credit_Score = c(mean(historical_approved$credit_score, na.rm = TRUE), mean(good_candidates$credit_score, na.rm = TRUE)),
  Avg_Income = c(mean(historical_approved$person_income, na.rm = TRUE), mean(good_candidates$person_income, na.rm = TRUE)),
  Avg_Loan_Amount = c(mean(historical_approved$loan_amnt, na.rm = TRUE), mean(good_candidates$loan_amnt, na.rm = TRUE)),
  Avg_Leverage = c(mean(historical_approved$loan_percent_income, na.rm = TRUE), mean(good_candidates$loan_percent_income, na.rm = TRUE))
)

knitr::kable(comp_stats, digits = 2, caption = "Relevance: Comparing Reconsideration Candidates to Typical Approvals")

Relevance: Comparing Reconsideration Candidates to Typical Approvals
Cohort	Avg_Credit_Score	Avg_Income	Avg_Loan_Amount	Avg_Leverage
Historical Approvals	631.89	59847.35	10856.58	0.20
Recommended for Reconsideration (>60% Score)	587.69	66715.73	9426.62	0.15

# Sample of these premium candidates
good_candidates %>%
  dplyr::select(person_income, loan_amnt, loan_percent_income, credit_score, reconsideration_score) %>%
  head(100) %>%
  knitr::kable(digits = 2, caption = "Sample of Top Tier 'Prior Default' Candidates Recommended for Review")

Sample of Top Tier ‘Prior Default’ Candidates Recommended for Review
person_income	loan_amnt	loan_percent_income	credit_score	reconsideration_score
279664	25000	0.09	642	0.68
277017	35000	0.13	622	0.65
258980	35000	0.14	618	0.74
220872	20000	0.09	504	0.60
100435	25000	0.25	576	0.76
216925	18000	0.08	651	0.66
102317	25000	0.24	586	0.72
105349	25000	0.24	621	0.86
19158	1750	0.09	625	0.64
209997	8800	0.04	616	0.64
133231	25000	0.19	675	0.77
161952	25000	0.15	578	0.77
20401	2275	0.11	573	0.64
22107	2250	0.10	636	0.70
132873	24000	0.18	701	0.60
140396	21500	0.15	572	0.66
85496	21000	0.25	614	0.83
24521	7500	0.31	612	0.80
24853	2500	0.10	524	0.75
81580	20000	0.25	607	0.90
84428	20000	0.24	707	0.80
87458	20000	0.23	604	0.86
27097	1000	0.04	581	0.66
27041	10000	0.37	534	0.67
109089	20000	0.18	537	0.62
28773	6000	0.21	657	0.68
28257	10625	0.38	524	0.63
210999	20000	0.09	680	0.66
23899	700	0.03	658	0.68
28752	7000	0.24	484	0.76
14763	1000	0.07	540	0.78
15248	1000	0.07	547	0.76
15553	1000	0.06	602	0.91
14997	1000	0.07	669	0.66
20953	1000	0.05	665	0.72
72947	18000	0.25	631	0.62
74953	18000	0.24	515	0.69
79231	18000	0.23	562	0.62
85830	18000	0.21	625	0.73
29935	8500	0.28	606	0.64
115195	18000	0.16	541	0.79
146325	18000	0.12	624	0.64
151146	18000	0.12	642	0.68
30359	4000	0.13	616	0.74
70497	17500	0.25	661	0.68
68170	17000	0.25	614	0.84
73185	16800	0.23	637	0.80
72852	16800	0.23	547	0.61
30872	4000	0.13	486	0.76
66865	16700	0.25	628	0.71
31532	1200	0.04	623	0.61
31658	7850	0.25	522	0.72
79081	16000	0.20	509	0.62
89843	16000	0.18	618	0.90
91345	16000	0.18	545	0.90
25289	1000	0.04	629	0.73
29839	1000	0.03	444	0.75
30751	1000	0.03	560	0.88
32214	7000	0.22	643	0.69
36677	1000	0.03	514	0.61
41311	1000	0.02	482	0.69
18083	1200	0.07	566	0.64
33534	3025	0.09	556	0.78
49151	1200	0.02	652	0.72
33521	5000	0.15	518	0.64
55118	1300	0.02	477	0.64
34520	1500	0.04	666	0.64
35346	1500	0.04	521	0.64
37180	1500	0.04	526	0.78
36919	1500	0.04	622	0.64
38350	1500	0.04	538	0.63
39493	1500	0.04	662	0.62
41697	1500	0.04	671	0.91
35051	10750	0.31	584	0.61
27045	1600	0.06	547	0.60
55130	1700	0.03	605	0.65
22590	1800	0.08	662	0.64
55199	1800	0.03	601	0.84
61669	1800	0.03	653	0.62
69328	1800	0.03	526	0.62
24336	1825	0.07	594	0.66
19130	2000	0.10	597	0.64
20020	2000	0.10	687	0.60
22694	2000	0.09	651	0.73
23899	2000	0.08	641	0.72
24407	2000	0.08	535	0.63
24999	2000	0.08	527	0.72
29829	2000	0.07	612	0.96
36965	5400	0.15	476	0.65
122002	15000	0.12	648	0.62
38328	2000	0.05	626	0.77
39812	2000	0.05	593	0.88
37197	10625	0.29	655	0.69
130590	2000	0.02	628	0.75
54191	2100	0.04	549	0.88
72916	2100	0.03	619	0.64
22076	2150	0.10	591	0.66
37068	3075	0.08	521	0.69
18915	2200	0.12	632	0.70
37111	12000	0.32	607	0.78

M5 Context: Unsupervised Cross-Validation (Clustering)

To provide another “expert eye” on these candidates, how do they group without any target variables at all?

# We scale the financial health attributes of all applicants
cluster_vars <- df_clean %>%
  dplyr::select(loan_percent_income, credit_score, person_income) %>%
  drop_na()
cluster_scaled <- scale(cluster_vars)

# Form 3 KMeans clusters
km <- kmeans(cluster_scaled, centers = 3, nstart = 25)

cluster_summary <- cluster_vars %>%
  mutate(Cluster = as.factor(km$cluster)) %>%
  group_by(Cluster) %>%
  summarise(
    Mean_Credit = mean(credit_score),
    Mean_Income = mean(person_income),
    Mean_Leverage = mean(loan_percent_income),
    Count = n()
  )

knitr::kable(cluster_summary, caption = "Unsupervised Cluster Averages")

Unsupervised Cluster Averages
Cluster	Mean_Credit	Mean_Income	Mean_Leverage	Count
1	644.8793	53711.36	0.2533142	11496
2	571.9636	75775.28	0.1169166	12428
3	661.6684	95558.28	0.0913231	21042

When comparing our high-reconsideration candidates, standard business sense dictates that the cluster with the lowest leverage (Mean_Leverage) and highest income/credit scores is the ideal group. Our >60% threshold individuals map comfortably into the healthiest statistical strata, confirming the Random Forest behavior without target-variable bias.

Conclusion & Deployment Plan

Summary: We utilized the Data Science process to de-bias the automatic screening mechanism against prior defaulters. Using Proxy Target Classification, we built a model to emulate standard approval operations and systematically screened the definitively denied group to surface high-potential candidates.

Deployment Consideration: - Instead of auto-denying 100% of applicants with a prior default, the system should invoke an API call generating the reconsideration_score. - If the score > 0.60, the application is flagged as “Pending - Manual Review”. - Hazards & Bias: While these individuals show currently strong financial positioning, their prior default remains a material fact. Manual review procedures should involve human verification of why the prior default occurred. Models cannot quantify sudden macroeconomic shocks to prior defaulting individuals; thus humans remain the best mitigation strategy before the firm releases capital.

Loan Approval Optimization: Re-evaluating Automatic Denials

Team 2

2026-02-20