This report presents a comprehensive analysis of the bank loan default dataset. The project consists of three parts: an exploratory data analysis (Part I), association analysis via classical regression (Part II), and predictive modeling with cross-validation (Part III). We model two response variables: Credit_score (linear regression) and Default (logistic regression). The overall goal is to understand the factors related to loan defaults and credit scores and to build reliable models for prediction.
We begin by loading the raw data and applying the feature engineering function from Part II. This function scales numerical predictors and creates dummy variables for categorical predictors.
# Load the raw dataset
setwd("/Users/jeffery/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - jMacP/WCUPA/Classes/Fall 2025/STA551/Homework/Data")
loan_data <- read.csv("BankLoanDefaultDataset.csv")
# Utilizing the feature engineering function from Part II instead of Part I
create_analytical_features <- function(raw_data) {
  # Step 1: Clean data (truncate negative checking balances at zero)
  data_cleaned <- raw_data
  data_cleaned$Checking_amount[data_cleaned$Checking_amount < 0] <- 0
  # Step 2: Convert categorical predictors to factors
  data_cleaned$Gender <- as.factor(data_cleaned$Gender)
  data_cleaned$Marital_status <- as.factor(data_cleaned$Marital_status)
  data_cleaned$Emp_status <- as.factor(data_cleaned$Emp_status)
  # Step 3: Scale numerical predictors
  vars_to_scale <- c("Checking_amount", "Term", "Credit_score",
                     "Saving_amount", "Emp_duration", "Age", "No_of_credit_acc")
  scaled_data <- scale(data_cleaned[, vars_to_scale])
  colnames(scaled_data) <- paste0(vars_to_scale, "_scaled")
  # Step 4: Create dummy variables
  dummy_vars <- model.matrix(~ Gender + Marital_status + Emp_status - 1, data = data_cleaned)
  # Step 5: Combine all parts, retaining original Credit_score and Amount for the responses
  binary_vars <- data_cleaned[, c("Car_loan", "Personal_loan", "Home_loan", "Education_loan", "Default")]
  final_data <- cbind(Amount = data_cleaned$Amount,
                      Credit_score = data_cleaned$Credit_score,  # keep original Credit_score
                      scaled_data,
                      dummy_vars,
                      binary_vars)
  return(as.data.frame(final_data))
}
# Create the final dataset
regression_data <- create_analytical_features(loan_data)
cat("Dimensions of the final processed dataset:\n")Dimensions of the final processed dataset:
[1] 1000 18
In Part I, we performed an EDA. Key findings included:
- Amount and Saving_amount were right-skewed.
- The Default variable consisted of more non-defaulters (0) than defaulters (1).
# Recreating a plot from Part I
boxplot(Credit_score ~ Default, data = regression_data,
main = "Credit Score by Default",
xlab = "Default (0=No, 1=Yes)", ylab = "Credit Score",
col = c("darkgreen", "darkred"))Credit Score by Default Status
This part of the project focuses on classical regression modeling for
association analysis. We aim to understand the
relationships between predictors and our two response variables:
Credit_score and Default.
The primary goal of this linear regression analysis is
association analysis. We answer the question:
What customer and loan characteristics are associated with a
customer’s Credit score?
Understanding these associations can help identify factors linked to better financial health.
We begin by fitting a full model using the original
Credit_score as the response. We exclude
Amount, Default, and the scaled version of
Credit_score from the predictors. We then assess the
model’s assumptions using diagnostic plots.
# Build the initial model using the original Credit_score
ini.model <- lm(Credit_score ~ . - Amount - Default - Credit_score_scaled, data = regression_data)
# Assess the assumptions of the initial linear model
par(mfrow = c(2, 2))
plot(ini.model)
Diagnostic Plots for the Initial Linear Model
Observations:
The diagnostic plots show a slight curve in the Residuals vs Fitted plot, suggesting potential non-linearity. To address this, we use a Box-Cox plot to investigate whether a power transformation of the response variable could improve the model’s fit.
# Use Box-Cox (from the MASS package) to find a potential transformation for the response variable
boxcox(ini.model, lambda = seq(-2, 2, 0.1))
Box-Cox Plot for Power Transformation
The Box-Cox plot indicates that a lambda value near 0 may be optimal. Since the Box-Cox family is (y^lambda - 1)/lambda for lambda != 0, with log(y) as the limiting case at lambda = 0, this suggests a log transformation could be beneficial. We will create a new model with log(Credit_score) as the response.
We now have three candidate models to compare:
- ini.model: the full model on the original Credit_score.
- transform.model: the full model on log(Credit_score).
- final.model: a simplified version of the transformed model selected via step().
# Build a transformed model using log(Credit_score)
transform.model <- lm(log(Credit_score) ~ . - Amount - Default - Credit_score_scaled, data = regression_data)
# Use backward elimination to find final model from the transformed model
final.model <- step(transform.model, direction = "backward", trace = 0)
# Compare the R-squared of all three candidate models
r.ini.model <- summary(ini.model)$r.squared
r.transfd.model <- summary(transform.model)$r.squared
r.final.model <- summary(final.model)$r.squared
# Create a data frame for comparison and display it as a table
Rsquare <- data.frame(
ini.model = r.ini.model,
transfd.model = r.transfd.model,
final.model = r.final.model
)
kable(Rsquare, caption = "Coefficients of Determination (R-squared) for the Three Candidate Models")

| ini.model | transfd.model | final.model |
|---|---|---|
| 0.1425433 | 0.1491919 | 0.1456441 |
The comparison table shows that all three models have very similar
R-squared values. The transformed model and the final simplified model
offer a slightly better fit than the initial model. Given that
final.model uses fewer variables while retaining an R-squared nearly
identical to the transformed model’s, we select it as our final working model.
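As a supporting check (a hedged addition of ours, not part of the original comparison), adjusted R-squared penalizes extra predictors, so it should favor the leaner step-selected model:
# Hedged supporting check (our addition): adjusted R-squared penalizes model size
summary(transform.model)$adj.r.squared
summary(final.model)$adj.r.squared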
# Display the summary of the final selected model
final_lm_summary <- summary(final.model)
pander(final_lm_summary$coefficients, caption = "Summary of Final Linear Regression Model Coefficients")

|  | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 6.63 | 0.003393 | 1954 | 0 |
| Checking_amount_scaled | 0.008597 | 0.003408 | 2.522 | 0.01182 |
| Term_scaled | -0.01069 | 0.003346 | -3.196 | 0.001438 |
| Saving_amount_scaled | 0.00971 | 0.00345 | 2.814 | 0.004984 |
| Age_scaled | 0.02693 | 0.003594 | 7.495 | 1.466e-13 |
| Education_loan | -0.0186 | 0.01039 | -1.791 | 0.07362 |
The model’s R-squared value is 0.1456, which means that about 14.6% of the variability in the log of Credit_score is explained by the predictors.
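Because the response is modeled on the log scale, each coefficient translates into an approximate percent change in Credit_score. A minimal illustration of this conversion (our addition, not part of the original output):
# Illustration (our addition): a log-scale coefficient b implies roughly a
# 100 * (exp(b) - 1) percent change in Credit_score per one-unit increase
# (here, one standard deviation of the predictor)
b_age <- coef(final.model)["Age_scaled"]
100 * (exp(b_age) - 1)  # about a 2.7% higher Credit_score per 1 SD of Age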
Key Findings:
- Age_scaled, Saving_amount_scaled, and Checking_amount_scaled show significant positive associations with log(Credit_score).
- Term_scaled shows a significant negative association: longer loan terms are linked to lower credit scores.
- Education_loan shows a marginal negative association (p ≈ 0.074).
The purpose of this analysis is primarily association. We want to answer: What key characteristics of a loan applicant are associated with their likelihood of defaulting?
The dataset contains the Default variable and numerous
potential predictors.
We will start with a “full model” that includes all predictors. For
the variable selection process, we also define an “optimized model”
containing only predictors we believe are practically important; its
formula serves as the lower bound of the step() search scope. The
step() function will be used for backward selection to find
the final model, using AIC as the selection criterion.
# define the full model
full_logistic_model <- glm(Default ~ . - Amount - Credit_score, data = regression_data, family = "binomial")
# define the reduced model with practically important variables
optimized_logistic_model <- glm(Default ~ Credit_score_scaled + Checking_amount_scaled, data = regression_data, family = "binomial")
# using step() for backward selection to find the final model
# the scope is defined by the optimized and full models
final_logistic_model <- step(full_logistic_model,
scope = list(lower = formula(optimized_logistic_model), upper = formula(full_logistic_model)),
direction = "backward",
trace = 0)
The step() function has identified the best model based on AIC.
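As a quick sanity check (our addition), the AIC values of the full and step-selected models can also be compared directly; the step-selected model should show the lower value:
# Hedged check (our addition): compare AIC of the two fitted models
AIC(full_logistic_model, final_logistic_model)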
# Get the summary and display it using pander for clean formatting
final_summary <- summary(final_logistic_model)
pander(final_summary$coefficients, caption = "Summary of Final Logistic Regression Model Coefficients")

|  | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -2.632 | 0.3837 | -6.86 | 6.895e-12 |
| Checking_amount_scaled | -1.709 | 0.225 | -7.595 | 3.077e-14 |
| Term_scaled | 0.5333 | 0.1677 | 3.18 | 0.001474 |
| Credit_score_scaled | -0.8611 | 0.1614 | -5.334 | 9.593e-08 |
| Saving_amount_scaled | -1.58 | 0.2073 | -7.623 | 2.487e-14 |
| Age_scaled | -2.643 | 0.2607 | -10.14 | 3.725e-24 |
| Emp_statusunemployed | 0.6117 | 0.3375 | 1.813 | 0.06991 |
| Personal_loan | -0.8951 | 0.3335 | -2.684 | 0.007272 |
| Home_loan | -2.79 | 0.7782 | -3.585 | 0.0003368 |
| Education_loan | 1.42 | 0.5542 | 2.563 | 0.01038 |
Key Findings:
The final model highlights several factors significantly associated with defaulting:
- Higher Checking_amount, Saving_amount, Credit_score, and Age are all associated with substantially lower odds of default.
- A longer loan Term and holding an Education_loan are associated with higher odds of default.
- Holding a Personal_loan or Home_loan is associated with lower odds of default, and unemployed status shows a marginally higher default risk (p ≈ 0.070).
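To put these log-odds estimates on a more interpretable scale, they can be exponentiated into odds ratios. A minimal sketch (our addition; confint() on a glm uses profile likelihood and can take a moment):
# Hedged sketch (our addition): odds ratios with 95% profile-likelihood CIs
round(exp(cbind(OR = coef(final_logistic_model),
                confint(final_logistic_model))), 3)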
Now we shift from association to prediction. We will use a 75%/25% split for training and testing. We will then perform 5-fold cross-validation on the training data to select the model with the best predictive performance, which will then be evaluated on the 25% test set.
We first shuffle the data and split it into a 75% training set (for CV) and a 25% testing set (for final evaluation).
set.seed(123) # for reproducibility
# Get total sample size
n <- nrow(regression_data)
obs.ID <- 1:n
# Shuffle IDs
shuffled.id <- sample(obs.ID, n, replace = FALSE) # randomizing the observation IDs
shuffled_data <- regression_data[shuffled.id, ] # randomizing the data set
# Define training set size (75%)
n.train <- round(0.75 * n)
# Create training and testing sets
train.data <- shuffled_data[1:n.train, ]
test.data <- shuffled_data[(n.train + 1):n, ]
cat("Training data dimensions:", dim(train.data), "\n")Training data dimensions: 750 18
Testing data dimensions: 250 18
Cross-Validation for the Linear Models (Credit_score)
We will use 5-fold cross-validation to compare two candidate models based on their Mean Square Error (MSE). We must calculate MSE on the original Credit_score scale, so we will use exp() on our predictions from the log(Credit_score) models.
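One technical aside, included as a hedged sketch rather than part of the analysis: exp() of a log-scale prediction targets the conditional median rather than the mean. Under approximately normal log-scale errors, a mean-type prediction multiplies by exp(sigma^2 / 2):
# Hedged sketch (our addition, not used below): correcting back-transformation bias
sigma.hat <- summary(final.model)$sigma   # residual SD on the log scale
mean.pred <- exp(predict(final.model)) * exp(sigma.hat^2 / 2)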
The two candidate models are:
- M1: the full transformed model (transform.model).
- M2: the step-selected model (final.model).
# Set number of folds
k <- 5
# Define size of 5-fold splitting, -1 to guarantee 5*n.fold <= n.train
n.fold <- round(n.train / k) - 1
# Initialize vectors to store MSE for each fold
MSE.M1 <- rep(0, k)
MSE.M2 <- rep(0, k)
# Define the model formulas from Part II
formula.M1.linear <- log(Credit_score) ~ . - Amount - Default - Credit_score_scaled
formula.M2.linear <- summary(final.model)$call$formula # Use formula from step() in Part II
for (i in 1:k) {
# Get validation fold IDs
valid.id <- ((i - 1) * n.fold + 1):(i * n.fold)
# Define cross-training and cross-validation sets
cross.train <- train.data[-valid.id, ]
cross.valid <- train.data[valid.id,]
# Fit candidate models
lm1 <- lm(formula.M1.linear, data = cross.train)
lm2 <- lm(formula.M2.linear, data = cross.train)
# Predict on the validation fold
predM1_log <- predict(lm1, newdata = cross.valid)
predM2_log <- predict(lm2, newdata = cross.valid)
# Convert predictions back to original scale (exp) and get true values
predM1_orig <- exp(predM1_log)
predM2_orig <- exp(predM2_log)
true_values <- cross.valid$Credit_score
# Calculate MSE on the original scale
MSE.M1[i] <- mean((predM1_orig - true_values)^2)
MSE.M2[i] <- mean((predM2_orig - true_values)^2)
cat(paste("Fold", i, "- M1 MSE:", round(MSE.M1[i]), "| M2 MSE:", round(MSE.M2[i]), "\n"))
}
Fold 1 - M1 MSE: 5850 | M2 MSE: 5492
Fold 2 - M1 MSE: 5257 | M2 MSE: 5046
Fold 3 - M1 MSE: 7143 | M2 MSE: 6991
Fold 4 - M1 MSE: 4538 | M2 MSE: 4488
Fold 5 - M1 MSE: 4947 | M2 MSE: 4850
# Calculate average MSE across all folds
avg.MSE.M1 <- mean(MSE.M1)
avg.MSE.M2 <- mean(MSE.M2)
cat("M1 (Full Model) Average MSE:", round(avg.MSE.M1), "\n")M1 (Full Model) Average MSE: 5547
M2 (Step Model) Average MSE: 5373
# Plot the MSE results
plot(1:k, MSE.M1, type = "l", col = "navy", lwd = 2,
xlab = "Iterations 1 to 5", ylab = "MSE",
ylim = c(min(c(MSE.M1, MSE.M2)) - 5000, max(c(MSE.M1, MSE.M2)) + 5000),
xlim = c(0, 6),
main = "Performance Evaluation via 5-fold Cross Validation",
cex.main = 0.8)
points(1:k, MSE.M1, pch = 19, col = "navy")
lines(1:k, MSE.M2, type = "l", col = "darkred", lwd = 2)
points(1:k, MSE.M2, pch = 19, col = "darkred")
legend("topright", c(paste("M1 Avg. MSE:", round(avg.MSE.M1)),
paste("M2 Avg. MSE:", round(avg.MSE.M2))),
col = c("navy", "darkred"), lwd = 2, pch = 19, bty = "n", cex = 0.8)Linear Model 5-Fold CV Performance (MSE)
CV Result: The model with the lower average MSE is selected as the best predictive model. Based on the output, we choose M2, the step-selected model.
Cross-Validation for the Logistic Models (Default)
We repeat the 5-fold CV process to compare two logistic models. The primary metric will be the Area Under the Curve (AUC) from the ROC analysis.
The two candidate models are:
- M1: the full logistic model (full_logistic_model).
- M2: the step-selected logistic model (final_logistic_model).
# Initialize vectors to store AUC for each fold
AUC.M1 <- rep(0, k)
AUC.M2 <- rep(0, k)
# Use the same folds as the linear CV
# k, n.fold are already defined
# Define the model formulas from Part II
formula.M1.logistic <- Default ~ . - Amount - Credit_score
formula.M2.logistic <- summary(final_logistic_model)$call$formula # Use formula from step()
for (i in 1:k) {
# Get validation fold IDs
valid.id <- ((i - 1) * n.fold + 1):(i * n.fold)
# Define cross-training and cross-validation sets
cross.train <- train.data[-valid.id, ]
cross.valid <- train.data[valid.id,]
# Fit candidate models
glm1 <- glm(formula.M1.logistic, data = cross.train, family = "binomial")
glm2 <- glm(formula.M2.logistic, data = cross.train, family = "binomial")
# Predict probabilities on the validation fold
predM1_prob <- predict(glm1, newdata = cross.valid, type = "response")
predM2_prob <- predict(glm2, newdata = cross.valid, type = "response")
# Calculate AUC using roc() and auc() from the pROC package
# Using quiet = TRUE to prevent printing messages during the loop
AUC.M1[i] <- auc(roc(cross.valid$Default, predM1_prob, quiet = TRUE))
AUC.M2[i] <- auc(roc(cross.valid$Default, predM2_prob, quiet = TRUE))
cat(paste("Fold", i, "- M1 AUC:", round(AUC.M1[i], 4), "| M2 AUC:", round(AUC.M2[i], 4), "\n"))
}
Fold 1 - M1 AUC: 0.9704 | M2 AUC: 0.9691
Fold 2 - M1 AUC: 0.9783 | M2 AUC: 0.9814
Fold 3 - M1 AUC: 0.986 | M2 AUC: 0.9847
Fold 4 - M1 AUC: 0.9829 | M2 AUC: 0.9839
Fold 5 - M1 AUC: 0.9795 | M2 AUC: 0.9791
# Calculate average AUC across all folds
avg.AUC.M1 <- mean(AUC.M1)
avg.AUC.M2 <- mean(AUC.M2)
cat("M1 (Full Model) Average AUC:", round(avg.AUC.M1, 4), "\n")M1 (Full Model) Average AUC: 0.9794
M2 (Step Model) Average AUC: 0.9796
# Plot the AUC results
plot(1:k, AUC.M1, type = "l", col = "navy", lwd = 2,
xlab = "Iterations 1 to 5", ylab = "AUC",
ylim = c(min(c(AUC.M1, AUC.M2)) - 0.05, max(c(AUC.M1, AUC.M2)) + 0.05),
xlim = c(0, 6),
main = "Performance Evaluation via 5-fold Cross Validation",
cex.main = 0.8)
points(1:k, AUC.M1, pch = 19, col = "navy")
lines(1:k, AUC.M2, type = "l", col = "darkred", lwd = 2)
points(1:k, AUC.M2, pch = 19, col = "darkred")
legend("bottomright", c(paste("M1 Avg. AUC:", round(avg.AUC.M1, 4)),
paste("M2 Avg. AUC:", round(avg.AUC.M2, 4))),
col = c("navy", "darkred"), lwd = 2, pch = 19, bty = "n", cex = 0.8)Logistic Model 5-Fold CV Performance (AUC)
CV Result: The model with the higher average AUC is selected. Based on the output, we choose M2 (the step-selected model), as it performs slightly better than the full model while being simpler.
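To quantify how close the two models are, one can inspect the fold-wise paired differences in AUC (a hedged addition of ours; the mean difference is a small fraction of a percentage point):
# Hedged aside (our addition): fold-wise paired AUC difference (M2 - M1)
diff.auc <- AUC.M2 - AUC.M1
round(c(mean = mean(diff.auc), sd = sd(diff.auc)), 4)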
We will now train our chosen models (M2 for both) on the entire 75% training set and report their final performance on the 25% testing set.
# 1. Re-train the winning model (M2) on the entire training set
final.lm.test <- lm(formula.M2.linear, data = train.data)
# 2. Predict on the 25% test set
pred.lm.test_log <- predict(final.lm.test, newdata = test.data)
# 3. Convert predictions to original scale
pred.lm.test_orig <- exp(pred.lm.test_log)
true_values_test <- test.data$Credit_score
# 4. Calculate and report the final MSE on the test data
final.MSE <- mean((pred.lm.test_orig - true_values_test)^2)
cat("Final MSE on 25% Test Set:", round(final.MSE), "\n")Final MSE on 25% Test Set: 5106
# 1. Re-train the winning model (M2) on the entire training set
final.glm.test <- glm(formula.M2.logistic, data = train.data, family = "binomial")
# 2. Predict probabilities on the 25% test set
pred.glm.test_prob <- predict(final.glm.test, newdata = test.data, type = "response")
# 3. Calculate and report the final AUC and plot the ROC curve
final.roc.obj <- roc(test.data$Default, pred.glm.test_prob)
final.AUC <- auc(final.roc.obj)
cat("Final AUC on 25% Test Set:", round(final.AUC, 4), "\n")Final AUC on 25% Test Set: 0.9888
# Plot the final ROC curve
plot(final.roc.obj, main = "Final Model ROC Curve (Test Data)",
col = "darkblue", lwd = 2, print.auc = TRUE)Final ROC Curve on 25% Test Data
This project has analyzed the bank loan dataset from three perspectives:
- Part I explored the data, including the distributions of Credit_score and Default.
- Part II modeled log(Credit_score) and Default using classical regression for association analysis. The selection of these models was justified by a combination of diagnostic checks (such as the Box-Cox transformation) and statistical criteria (R-squared and AIC-based backward selection).
- Part III shifted to prediction. For both the linear (Credit_score) and logistic (Default) problems, the simpler, step-selected models (M2) from Part II were confirmed to be the better choice for prediction.
The final models were evaluated on the held-out test set, yielding a Final MSE of 5106 for predicting Credit_score and a Final AUC of 0.9888 for predicting Default. This demonstrates that our final step-selected models are not only valid for association analysis (as shown in Part II) but also reliable for prediction (as confirmed in Part III).