Introduction
This report presents Part II of the project focusing on classical
regression modeling. Building upon the exploratory data analysis (EDA)
and feature engineering from Part I, we will now construct and interpret
both linear and logistic regression models. The goals are to identify
factors associated with the borrower’s Credit_score and to predict the
likelihood of loan Default.
Data Preparation
Before modeling, we first load the raw data and process it using the
feature engineering function developed in Part I.
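# Load the packages used for the tables later in this report
library(knitr)   # provides kable()
library(pander)  # provides pander()
# Load the raw dataset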
setwd("/Users/jeffery/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - jMacP/WCUPA/Classes/Fall 2025/STA551/Homework/Data")
loan_data <- read.csv("BankLoanDefaultDataset.csv")
# Define the feature engineering function from Part I
create_analytical_features <- function(raw_data) {
  # Step 1: Clean data (negative checking balances are set to 0)
  data_cleaned <- raw_data
  data_cleaned$Checking_amount[data_cleaned$Checking_amount < 0] <- 0
  # Step 2: Convert to factors
  data_cleaned$Gender <- as.factor(data_cleaned$Gender)
  data_cleaned$Marital_status <- as.factor(data_cleaned$Marital_status)
  data_cleaned$Emp_status <- as.factor(data_cleaned$Emp_status)
  # Step 3: Scale numerical predictors
  vars_to_scale <- c("Checking_amount", "Term", "Credit_score",
                     "Saving_amount", "Emp_duration", "Age", "No_of_credit_acc")
  scaled_data <- scale(data_cleaned[, vars_to_scale])
  colnames(scaled_data) <- paste0(vars_to_scale, "_scaled")
  # Step 4: Create dummy variables
  dummy_vars <- model.matrix(~ Gender + Marital_status + Emp_status - 1, data = data_cleaned)
  # Step 5: Combine all parts, retaining original Credit_score for transformation
  binary_vars <- data_cleaned[, c("Car_loan", "Personal_loan", "Home_loan", "Education_loan", "Default")]
  final_data <- cbind(Amount = data_cleaned$Amount,
                      Credit_score = data_cleaned$Credit_score, # Keep original Credit_score
                      scaled_data,
                      dummy_vars,
                      binary_vars)
  return(as.data.frame(final_data))
}
# Create the final dataset for regression in a single step
regression_data <- create_analytical_features(loan_data)
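As a quick sanity check on the scaling step, the sketch below applies the same scale() and paste0() pattern to a small made-up data frame. The data frame toy and its values are hypothetical and are not taken from BankLoanDefaultDataset.csv; the point is only to show that each scaled column ends up with mean 0 and standard deviation 1.

```r
# Hypothetical toy data, used only to illustrate the scaling step
toy <- data.frame(Age  = c(25, 35, 45, 55),
                  Term = c(12, 24, 36, 48))

# Same pattern as Step 3 of create_analytical_features()
toy_scaled <- scale(toy)
colnames(toy_scaled) <- paste0(colnames(toy), "_scaled")

# Each scaled column should have mean 0 and standard deviation 1
col_means <- colMeans(toy_scaled)
col_sds   <- apply(toy_scaled, 2, sd)
print(round(col_means, 10))
print(col_sds)
```

Because the predictors are standardized this way, each coefficient on a *_scaled variable in the models that follow refers to a one-standard-deviation change in the original variable.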
Linear Regression Analysis
Statement of the Question
The primary goal of this linear regression analysis is association
analysis. We aim to answer the question: What customer and loan
characteristics are associated with a customer’s Credit_score?
Understanding these associations can help identify factors linked to
better financial health.
Model Building and Selection Process
Initial Model and Diagnostics
We begin by fitting a full model using the original Credit_score as the
response. We exclude Amount, Default, and the scaled version of
Credit_score (Credit_score_scaled) from the predictors. We then assess
the model’s assumptions using diagnostic plots.
# Build the initial model using the original Credit_score
ini.model <- lm(Credit_score ~ . - Amount - Default - Credit_score_scaled, data = regression_data)
# Assess the assumptions of the initial linear model
par(mfrow = c(2, 2))
plot(ini.model)
Observations:
The diagnostic plots show a slight curve in the Residuals vs Fitted
plot, suggesting potential non-linearity. To address this, we use a
Box-Cox plot to investigate whether a power transformation of the
response variable could improve the model’s fit. The log transformation
of Credit_score considered in the model comparison corresponds to the
Box-Cox case lambda = 0.
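The Box-Cox step can be sketched as follows. This is a minimal illustration on simulated data, not the report's actual output: the data frame toy, the seed, and the lambda grid are all assumptions. It uses boxcox() from the MASS package and picks the lambda that maximizes the profile log-likelihood; a value near 0 points toward a log transformation.

```r
library(MASS)  # provides boxcox()

set.seed(551)  # arbitrary seed, for reproducibility only
n <- 200
toy <- data.frame(x = runif(n, 0, 3))
# Simulate a positive response whose log is linear in x,
# so the Box-Cox estimate of lambda should land near 0
toy$y <- exp(1 + 0.8 * toy$x + rnorm(n, sd = 0.2))

# y = TRUE stores the response in the fit, which boxcox() needs
toy_fit <- lm(y ~ x, data = toy, y = TRUE)

# Profile log-likelihood over a grid of lambda values (no plot drawn)
bc <- boxcox(toy_fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)
```

Setting plotit = TRUE instead draws the familiar Box-Cox plot with a confidence interval for lambda, which is the version one would normally inspect for a fitted model such as ini.model.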
Model Comparison
We now have three candidate models to compare:
- ini.model: The full model on the original Credit_score.
- transform.model: The full model on log(Credit_score).
- final.model: A simplified version of the transformed model selected
via step().
# Build a transformed model using log(Credit_score)
transform.model <- lm(log(Credit_score) ~ . - Amount - Default - Credit_score_scaled, data = regression_data)
# Use backward elimination to find final model from the transformed model
final.model <- step(transform.model, direction = "backward", trace = 0)
# Compare the R-squared of all three candidate models
r.ini.model <- summary(ini.model)$r.squared
r.transfd.model <- summary(transform.model)$r.squared
r.final.model <- summary(final.model)$r.squared
# Create a data frame for comparison and display it as a table
Rsquare <- data.frame(
  ini.model = r.ini.model,
  transfd.model = r.transfd.model,
  final.model = r.final.model
)
kable(Rsquare, caption = "Coefficients of Determination (R-squared) for the Three Candidate Models")
Coefficients of Determination (R-squared) for the Three Candidate Models
| ini.model | transfd.model | final.model |
|---|---|---|
| 0.1425433 | 0.1491919 | 0.1456441 |
Interpretation of the Final Model
The comparison table shows that all three models have very similar
R-squared values, although the value for ini.model is computed on the
original Credit_score scale and is therefore not strictly comparable
with the two log-scale models. The full transformed model has the
highest R-squared (0.1492), and the simplified final.model is close
behind (0.1456) while using fewer variables. We therefore select
final.model as our final working model.
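Because R-squared for a Credit_score model and for a log(Credit_score) model are measured on different response scales, one way to put them on equal footing is to back-transform the log-scale predictions and compute R-squared against the original response. The sketch below does this on simulated data; the data frame toy and the helper r2_on_original_scale() are hypothetical names introduced for illustration, and exp(fitted(...)) is a naive back-transformation that ignores retransformation bias.

```r
set.seed(551)  # arbitrary seed, for reproducibility only
toy <- data.frame(x = runif(300, 0, 3))
toy$y <- exp(1 + 0.4 * toy$x + rnorm(300, sd = 0.3))

fit_raw <- lm(y ~ x, data = toy)       # response on the original scale
fit_log <- lm(log(y) ~ x, data = toy)  # response on the log scale

# Hypothetical helper: R-squared as 1 - SSE/SST on the original scale
r2_on_original_scale <- function(observed, predicted) {
  1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)
}

r2_raw <- r2_on_original_scale(toy$y, fitted(fit_raw))
# Back-transform the log-scale fitted values before comparing
r2_log_backtransformed <- r2_on_original_scale(toy$y, exp(fitted(fit_log)))

print(c(raw = r2_raw, log_backtransformed = r2_log_backtransformed))
```

For the untransformed model this helper reproduces summary(fit_raw)$r.squared, so the two numbers printed at the end are directly comparable.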
# Display the summary of the final selected model
final_lm_summary <- summary(final.model)
pander(final_lm_summary$coefficients, caption = "Summary of Final Linear Regression Model Coefficients")
Summary of Final Linear Regression Model Coefficients
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 6.63 | 0.003393 | 1954 | 0 |
| Checking_amount_scaled | 0.008597 | 0.003408 | 2.522 | 0.01182 |
| Term_scaled | -0.01069 | 0.003346 | -3.196 | 0.001438 |
| Saving_amount_scaled | 0.00971 | 0.00345 | 2.814 | 0.004984 |
| Age_scaled | 0.02693 | 0.003594 | 7.495 | 1.466e-13 |
| Education_loan | -0.0186 | 0.01039 | -1.791 | 0.07362 |
The model’s R-squared value is 0.1456, which means that about 14.6% of
the variability in log(Credit_score) is explained by the predictors.
Key Findings:
- Checking_amount_scaled: The coefficient is positive and significant
(p = 0.012). Applicants with higher checking account balances tend to
have higher credit scores.
- Saving_amount_scaled: This is also positively associated with credit
score (p = 0.005). Higher savings are linked to a higher score.
- Term_scaled: Longer loan terms are negatively associated with credit
score (p = 0.001), suggesting that longer-term loans tend to be taken
by borrowers with lower credit scores.
- Age_scaled: Older borrowers tend to have higher credit scores. This is
the strongest association in the model (t = 7.495).
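Since the response is log(Credit_score), each coefficient can be read as an approximate percentage change in Credit_score per one-standard-deviation increase in the predictor (or, for Education_loan, for having such a loan). The sketch below converts the estimates from the coefficient table using 100 * (exp(b) - 1); the vector log_scale_coefs is simply the table values typed in by hand, not output pulled from final.model.

```r
# Estimates copied by hand from the final linear model's coefficient table
log_scale_coefs <- c(Checking_amount_scaled = 0.008597,
                     Term_scaled            = -0.01069,
                     Saving_amount_scaled   = 0.00971,
                     Age_scaled             = 0.02693,
                     Education_loan         = -0.0186)

# Percentage change in Credit_score associated with a one-unit increase
# in each predictor, holding the others fixed
pct_change <- 100 * (exp(log_scale_coefs) - 1)
print(round(pct_change, 2))
```

For example, the Age_scaled estimate corresponds to roughly a 2.7% higher credit score per standard deviation of age, which shows that the associations are statistically clear but modest in size.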
Logistic Regression Analysis
Statement of the Question
The purpose of this analysis is primarily association. We want to
answer: What are the key characteristics of a loan applicant that are
associated with their likelihood of defaulting? The dataset contains
the Default variable and numerous potential predictors.
Model Building
Full and Optimized Models
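We will start with a “full model” that includes all predictors. For
the variable selection process, we also define a smaller “optimized
model” containing only the predictors we believe are practically
important. It serves as the lower bound of the search, so its
predictors are always retained in the selected model.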
# define the full model
full_logistic_model <- glm(Default ~ . - Amount - Credit_score, data = regression_data, family = "binomial")
# define the reduced model with practically important variables
optimized_logistic_model <- glm(Default ~ Credit_score_scaled + Checking_amount_scaled, data = regression_data, family = "binomial")
# using step() for backward selection to find the final model
# the scope is defined by the optimized and full models
final_logistic_model <- step(full_logistic_model,
                             scope = list(lower = formula(optimized_logistic_model),
                                          upper = formula(full_logistic_model)),
                             direction = "backward",
                             trace = 0)
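To make the role of the scope argument concrete, here is a minimal sketch on simulated data. The data frame toy and the variables x1 to x4 are hypothetical; only x2 actually drives the outcome. The predictor x1 is placed in the lower bound, so backward elimination is not allowed to drop it even though it carries no signal.

```r
set.seed(551)  # arbitrary seed, for reproducibility only
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
# Only x2 truly affects the outcome in this simulation
toy$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * toy$x2))

full_fit <- glm(y ~ x1 + x2 + x3 + x4, data = toy, family = "binomial")

# Backward elimination by AIC; terms in `lower` can never be removed
selected <- step(full_fit,
                 scope = list(lower = ~ x1, upper = formula(full_fit)),
                 direction = "backward",
                 trace = 0)

kept_terms <- attr(terms(selected), "term.labels")
print(kept_terms)
print(c(full = AIC(full_fit), selected = AIC(selected)))
```

The selected model always contains x1 and never has a higher AIC than the full model, which mirrors how Credit_score_scaled and Checking_amount_scaled are forced to stay in the report's final logistic model.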
Interpretation of the Final Model
The step() function has selected the model with the lowest AIC among
those visited during backward elimination.
# Get the summary and display it using pander for clean formatting
final_summary <- summary(final_logistic_model)
pander(final_summary$coefficients, caption = "Summary of Final Logistic Regression Model Coefficients")
Summary of Final Logistic Regression Model Coefficients
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -2.632 | 0.3837 | -6.86 | 6.895e-12 |
| Checking_amount_scaled | -1.709 | 0.225 | -7.595 | 3.077e-14 |
| Term_scaled | 0.5333 | 0.1677 | 3.18 | 0.001474 |
| Credit_score_scaled | -0.8611 | 0.1614 | -5.334 | 9.593e-08 |
| Saving_amount_scaled | -1.58 | 0.2073 | -7.623 | 2.487e-14 |
| Age_scaled | -2.643 | 0.2607 | -10.14 | 3.725e-24 |
| Emp_statusunemployed | 0.6117 | 0.3375 | 1.813 | 0.06991 |
| Personal_loan | -0.8951 | 0.3335 | -2.684 | 0.007272 |
| Home_loan | -2.79 | 0.7782 | -3.585 | 0.0003368 |
| Education_loan | 1.42 | 0.5542 | 2.563 | 0.01038 |
Key Findings:
The final model highlights several factors significantly associated
with defaulting:
- Credit_score_scaled: The coefficient is -0.86, a strong negative
association. A higher credit score is associated with much lower odds
of default.
- Term_scaled: The positive coefficient (0.53) implies that longer loan
terms are associated with higher log-odds of default.
- Checking_amount_scaled: A larger checking account balance is
associated with lower odds of default (-1.7).
- Age_scaled: Younger borrowers are more likely to default than older
borrowers (-2.6). This is the largest coefficient among the scaled
predictors.
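The coefficients above are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios per one-standard-deviation increase in each predictor. The sketch below does this for the four predictors discussed, using estimates typed in by hand from the coefficient table (the vector logit_coefs is not pulled from final_logistic_model), and also converts the intercept into a baseline default probability.

```r
# Estimates copied by hand from the final logistic model's coefficient table
logit_coefs <- c(Credit_score_scaled    = -0.8611,
                 Term_scaled            = 0.5333,
                 Checking_amount_scaled = -1.709,
                 Age_scaled             = -2.643)

# Odds ratio per one-standard-deviation increase in each predictor
odds_ratios <- exp(logit_coefs)
print(round(odds_ratios, 3))

# Predicted default probability when every scaled predictor is 0 (its mean)
# and every indicator variable in the model is 0
baseline_prob <- plogis(-2.632)
print(round(baseline_prob, 3))
```

An odds ratio below 1 means lower odds of default as the predictor increases. For instance, the Credit_score_scaled estimate implies that the odds of default are multiplied by about 0.42 for each standard deviation increase in credit score, while a one-standard-deviation longer term multiplies them by about 1.7.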
Summary and Discussion
The logistic regression analysis identified key predictors of loan
default. Factors such as a low credit score, a low checking account
balance, and a longer loan term are all associated with an increased
likelihood of default.
Conclusion
In this part of the project we successfully built both linear and
logistic regression models. The linear regression revealed that factors
like checking and savings are associated with the borrower’s
Credit_score. The logistic regression provided strong evidence that
Credit_score, Checking_amount, and Term are critical predictors for
Default.
These statistical models have provided a foundation for understanding
the relationships within the data.
