Introduction
This report presents Part II of the project, which focuses on classical
regression modeling. Building on the exploratory data analysis (EDA)
and feature engineering from Part I, we construct and interpret
both linear and logistic regression models. There are two goals:
identifying factors associated with the borrower’s credit score (Credit_score)
and modeling the likelihood of loan default (Default).
Data Preparation
Before modeling, we first load the raw data and process it using the
feature engineering function developed in Part I.
# Load the raw dataset
setwd("/Users/jeffery/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - jMacP/WCUPA/Classes/Fall 2025/STA551/Homework/Data")
loan_data <- read.csv("BankLoanDefaultDataset.csv")
# Define the feature engineering function from Part I
create_analytical_features <- function(raw_data) {
# Step 1: Clean data
data_cleaned <- raw_data
data_cleaned$Checking_amount[data_cleaned$Checking_amount < 0] <- 0
# Step 2: Convert to factors
data_cleaned$Gender <- as.factor(data_cleaned$Gender)
data_cleaned$Marital_status <- as.factor(data_cleaned$Marital_status)
data_cleaned$Emp_status <- as.factor(data_cleaned$Emp_status)
# Step 3: Scale numerical predictors
vars_to_scale <- c("Checking_amount", "Term", "Credit_score",
"Saving_amount", "Emp_duration", "Age", "No_of_credit_acc")
scaled_data <- scale(data_cleaned[, vars_to_scale])
colnames(scaled_data) <- paste0(vars_to_scale, "_scaled")
# Step 4: Create dummy variables
dummy_vars <- model.matrix(~ Gender + Marital_status + Emp_status - 1, data = data_cleaned)
# Step 5: Combine all parts, retaining original Credit_score for transformation
binary_vars <- data_cleaned[, c("Car_loan", "Personal_loan", "Home_loan", "Education_loan", "Default")]
final_data <- cbind(Amount = data_cleaned$Amount,
Credit_score = data_cleaned$Credit_score, # Keep original Credit_score
scaled_data,
dummy_vars,
binary_vars)
return(as.data.frame(final_data))
}
# Create the final dataset for regression in a single step
regression_data <- create_analytical_features(loan_data)
Linear Regression Analysis
Statement of the Question
The primary goal of this linear regression analysis is association analysis.
We aim to answer the question: What customer and loan characteristics are
associated with a customer’s Credit_score?
Understanding these associations can help identify factors linked to
better financial health.
Model Building and Selection Process
Initial Model and Diagnostics
We begin by fitting a full model using the original Credit_score as the
response. We exclude Amount, Default, and the scaled version of
Credit_score from the predictors. We then assess the model’s assumptions
using diagnostic plots.
# Build the initial model using the original Credit_score
ini.model <- lm(Credit_score ~ . - Amount - Default - Credit_score_scaled, data = regression_data)
# Assess the assumptions of the initial linear model
par(mfrow = c(2, 2))
plot(ini.model)
Observations:
The diagnostic plots show a slight curve in the Residuals vs
Fitted plot, suggesting potential non-linearity. To address
this, we will use a Box-Cox plot to investigate if a power
transformation of the response variable could improve the model’s
fit.
Model Comparison
We now have three candidate models to compare:
1. ini.model: The full model on the original Credit_score.
2. transform.model: The full model on log(Credit_score).
3. final.model: A simplified version of the transformed model selected via step().
# Build a transformed model using log(Credit_score)
transform.model <- lm(log(Credit_score) ~ . - Amount - Default - Credit_score_scaled, data = regression_data)
# Use backward elimination to find final model from the transformed model
final.model <- step(transform.model, direction = "backward", trace = 0)
# Compare the R-squared of all three candidate models
r.ini.model <- summary(ini.model)$r.squared
r.transfd.model <- summary(transform.model)$r.squared
r.final.model <- summary(final.model)$r.squared
# Create a data frame for comparison and display it as a table
Rsquare <- data.frame(
ini.model = r.ini.model,
transfd.model = r.transfd.model,
final.model = r.final.model
)
kable(Rsquare, caption = "Coefficients of Determination (R-squared) for the Three Candidate Models")
Coefficients of Determination (R-squared) for the Three Candidate Models

| ini.model | transfd.model | final.model |
|-----------|---------------|-------------|
| 0.1425433 | 0.1491919     | 0.1456441   |
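Interpretation of the Final Model
The comparison table shows that all three models have very similar
R-squared values. The transformed model (0.1492) and the final simplified
model (0.1456) fit slightly better than the initial model (0.1425), although
the initial model’s R-squared is measured on the original scale of
Credit_score rather than the log scale, so the comparison is only indicative.
Because final.model uses fewer variables while giving up very little
R-squared relative to the full transformed model, we select it as our
final working model.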
# Display the summary of the final selected model
final_lm_summary <- summary(final.model)
pander(final_lm_summary$coefficients, caption = "Summary of Final Linear Regression Model Coefficients")
Summary of Final Linear Regression Model Coefficients

|                        | Estimate | Std. Error | t value | Pr(>\|t\|) |
|------------------------|----------|------------|---------|------------|
| (Intercept)            | 6.63     | 0.003393   | 1954    | 0          |
| Checking_amount_scaled | 0.008597 | 0.003408   | 2.522   | 0.01182    |
| Term_scaled            | -0.01069 | 0.003346   | -3.196  | 0.001438   |
| Saving_amount_scaled   | 0.00971  | 0.00345    | 2.814   | 0.004984   |
| Age_scaled             | 0.02693  | 0.003594   | 7.495   | 1.466e-13  |
| Education_loan         | -0.0186  | 0.01039    | -1.791  | 0.07362    |
The model’s R-squared value is 0.1456 (the multiple R-squared reported in
the comparison table, not the adjusted value), which means that about 14.6%
of the variability in the log of Credit_score is explained by the predictors.
Key Findings:
- Checking_amount_scaled: The coefficient is positive and significant
(p = 0.012). Applicants with higher checking account balances tend to have
higher credit scores.
- Saving_amount_scaled: Savings are also positively associated with credit
score (p = 0.005); higher savings are linked to a higher score.
- Term_scaled: Longer loan terms are negatively associated with credit score
(p = 0.001), suggesting that longer-term loans tend to be taken by borrowers
with lower credit scores.
- Age_scaled: Age has the largest coefficient in the model; older borrowers
tend to have higher credit scores.
- Education_loan: The coefficient is negative but not significant at the
5% level (p = 0.074).
Logistic Regression Analysis
Statement of the Question
The purpose of this analysis is primarily association. We want to answer:
What are the key characteristics of a loan applicant that are associated
with their likelihood of defaulting?
The dataset contains the Default variable and numerous potential predictors.
Model Building
Full and Optimized Models
We start with a “full model” that includes all predictors. For the variable
selection process, we also define an “optimized model” containing only the
predictors we believe to be important in practice. It serves as the lower
bound of the search, so its predictors are always retained.
# define the full model
full_logistic_model <- glm(Default ~ . - Amount - Credit_score, data = regression_data, family = "binomial")
# define the reduced model with practically important variables
optimized_logistic_model <- glm(Default ~ Credit_score_scaled + Checking_amount_scaled, data = regression_data, family = "binomial")
# using step() for backward selection to find the final model
# the scope is defined by the optimized and full models
final_logistic_model <- step(full_logistic_model,
scope = list(lower = formula(optimized_logistic_model), upper = formula(full_logistic_model)),
direction = "backward",
trace = 0)
Interpretation of the Final Model
Starting from the full model, the step() function selected the final model
by backward elimination based on AIC.
# Get the summary and display it using pander for clean formatting
final_summary <- summary(final_logistic_model)
pander(final_summary$coefficients, caption = "Summary of Final Logistic Regression Model Coefficients")
Summary of Final Logistic Regression Model Coefficients

|                        | Estimate | Std. Error | z value | Pr(>\|z\|) |
|------------------------|----------|------------|---------|------------|
| (Intercept)            | -2.632   | 0.3837     | -6.86   | 6.895e-12  |
| Checking_amount_scaled | -1.709   | 0.225      | -7.595  | 3.077e-14  |
| Term_scaled            | 0.5333   | 0.1677     | 3.18    | 0.001474   |
| Credit_score_scaled    | -0.8611  | 0.1614     | -5.334  | 9.593e-08  |
| Saving_amount_scaled   | -1.58    | 0.2073     | -7.623  | 2.487e-14  |
| Age_scaled             | -2.643   | 0.2607     | -10.14  | 3.725e-24  |
| Emp_statusunemployed   | 0.6117   | 0.3375     | 1.813   | 0.06991    |
| Personal_loan          | -0.8951  | 0.3335     | -2.684  | 0.007272   |
| Home_loan              | -2.79    | 0.7782     | -3.585  | 0.0003368  |
| Education_loan         | 1.42     | 0.5542     | 2.563   | 0.01038    |
Key Findings:
The final model highlights several factors significantly associated
with defaulting:
- Credit_score_scaled: The coefficient is -0.86, a strong negative
association. Holding the other predictors fixed, a one-standard-deviation
increase in credit score multiplies the odds of default by about
exp(-0.86) ≈ 0.42.
- Term_scaled: The positive coefficient (0.53) implies that longer loan
terms are associated with higher log-odds of default.
- Checking_amount_scaled: A larger checking account balance is associated
with lower odds of default (-1.7).
- Age_scaled: Younger borrowers are more likely to default than older
borrowers (-2.6); this is the largest coefficient among the scaled predictors.
Summary and Discussion
The logistic regression analysis successfully identified key
predictors of loan default. Factors like a low credit
score, a low checking account balance and
a longer loan term are all associated with an increased
likelihood of default.
Conclusion
In this part of the project we built both linear and logistic regression
models. The linear regression showed that factors such as checking and
savings balances are associated with the borrower’s Credit_score. The
logistic regression provided strong evidence that Credit_score,
Checking_amount, and Term are important predictors of Default.
These statistical models provide a foundation for understanding
the relationships within the data.
---
title: 'Project One: Part II - Regression Analysis'
author: 'Jeff Delva'
date: "October 8, 2025"
output:
  html_document:
    toc: yes
    toc_float: yes
    toc_depth: 4
    fig_width: 8
    fig_height: 5
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
    highlight: tango
---

```{css, echo = FALSE}
h1.title {
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}
h4.author, h4.date {
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}
h1 {
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}
h2 {
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}
h3 {
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}
.header-section-number::after {
  content: ".";
}
```

```{r setup, include=FALSE}
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("pander")) {
   install.packages("pander")
   library(pander)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}


knitr::opts_chunk$set(echo = TRUE,
                      warning = FALSE,
                      results = TRUE,
                      message = FALSE,
                      comment = NA
                      )
```

# Introduction

This report presents Part II of the project, which focuses on classical regression modeling. Building on the exploratory data analysis (EDA) and feature engineering from Part I, we construct and interpret both linear and logistic regression models. There are two goals: identifying factors associated with the borrower's credit score (`Credit_score`) and modeling the likelihood of loan default (`Default`).

-----

# Data Preparation

Before modeling, we first load the raw data and process it using the feature engineering function developed in Part I. 

```{r prepare-data}
# Load the raw dataset
setwd("/Users/jeffery/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - jMacP/WCUPA/Classes/Fall 2025/STA551/Homework/Data")
loan_data <- read.csv("BankLoanDefaultDataset.csv")

# Define the feature engineering function from Part I
create_analytical_features <- function(raw_data) {
  # Step 1: Clean data
  data_cleaned <- raw_data
  data_cleaned$Checking_amount[data_cleaned$Checking_amount < 0] <- 0
  
  # Step 2: Convert to factors
  data_cleaned$Gender <- as.factor(data_cleaned$Gender)
  data_cleaned$Marital_status <- as.factor(data_cleaned$Marital_status)
  data_cleaned$Emp_status <- as.factor(data_cleaned$Emp_status)
  
  # Step 3: Scale numerical predictors
  vars_to_scale <- c("Checking_amount", "Term", "Credit_score",
                     "Saving_amount", "Emp_duration", "Age", "No_of_credit_acc")
  scaled_data <- scale(data_cleaned[, vars_to_scale])
  colnames(scaled_data) <- paste0(vars_to_scale, "_scaled")
  
  # Step 4: Create dummy variables
  dummy_vars <- model.matrix(~ Gender + Marital_status + Emp_status - 1, data = data_cleaned)
  
  # Step 5: Combine all parts, retaining original Credit_score for transformation
  binary_vars <- data_cleaned[, c("Car_loan", "Personal_loan", "Home_loan", "Education_loan", "Default")]
  final_data <- cbind(Amount = data_cleaned$Amount, 
                      Credit_score = data_cleaned$Credit_score, # Keep original Credit_score
                      scaled_data, 
                      dummy_vars, 
                      binary_vars)
  
  return(as.data.frame(final_data))
}

# Create the final dataset for regression in a single step
regression_data <- create_analytical_features(loan_data)
```
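
One detail of Step 4 is worth checking before fitting models that include an intercept. When the intercept is dropped with `- 1`, `model.matrix()` keeps an indicator column for every level of the first factor in the formula, and those columns sum to one. The sketch below uses a small made-up data frame (`toy`, its values, and the name `aliased` are illustrative assumptions, not part of the loan data) to show how such a redundant column can be detected after fitting.

```r
# Toy data: two factors and a numeric response (values are made up)
toy <- data.frame(
  Gender = factor(c("Female", "Male", "Male", "Female", "Male", "Female")),
  Emp_status = factor(c("employed", "unemployed", "employed",
                        "employed", "unemployed", "unemployed")),
  y = c(1.2, 0.7, 1.9, 1.1, 0.4, 1.5)
)

# Dropping the intercept keeps BOTH Gender indicator columns
dummies <- model.matrix(~ Gender + Emp_status - 1, data = toy)
colnames(dummies)

# Fitting with an intercept makes one Gender column redundant;
# lm() reports the redundant coefficient as NA
fit <- lm(y ~ ., data = data.frame(y = toy$y, dummies))
aliased <- names(coef(fit))[is.na(coef(fit))]
aliased
```

If a coefficient comes back as `NA` in `summary()`, the corresponding column is redundant given the intercept and the other predictors, and it can be dropped from the design without changing the fit.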

-----

# Linear Regression Analysis

## Statement of the Question

The primary goal of this linear regression analysis is **association analysis**. We aim to answer the question: **What customer and loan characteristics are associated with a customer's `Credit_score`?**

Understanding these associations can help identify factors linked to better financial health.

## Model Building and Selection Process

### Initial Model and Diagnostics

We begin by fitting a full model using the original `Credit_score` as the response. We exclude `Amount`, `Default`, and the scaled version of `Credit_score` from the predictors. We then assess the model's assumptions using diagnostic plots.

```{r initial-lm, fig.cap="Diagnostic Plots for the Initial Linear Model"}
# Build the initial model using the original Credit_score
ini.model <- lm(Credit_score ~ . - Amount - Default - Credit_score_scaled, data = regression_data)

# Assess the assumptions of the initial linear model
par(mfrow = c(2, 2))
plot(ini.model)
par(mfrow = c(1, 1))
```

**Observations:**

The diagnostic plots show a slight curve in the **Residuals vs Fitted** plot, suggesting potential non-linearity. To address this, we will use a Box-Cox plot to investigate if a power transformation of the response variable could improve the model's fit.

### Transformation

```{r box-cox-transformation, fig.cap="Box-Cox Plot for Power Transformation"}
# Use Box-Cox to find a potential transformation for the response variable
boxcox(ini.model, lambda = seq(-2, 2, 0.1))
```

The Box-Cox plot indicates that a lambda value near 0 may be optimal, suggesting a **log transformation** could be beneficial. We will create a new model with `log(Credit_score)` as the response.
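
Reading lambda off the plot can be complemented by extracting it numerically. The following is a minimal sketch on simulated data (the objects `sim_fit`, `lambda_hat`, and `ci_range` are hypothetical names, and the simulated response is not the loan data); it assumes only that `boxcox()` from `MASS` returns the lambda grid and profile log-likelihood when `plotit = FALSE`.

```r
library(MASS)  # provides boxcox()

set.seed(551)
n <- 200
x <- rnorm(n)
# Simulated response that is linear in x on the log scale (illustrative only)
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.3))
sim_fit <- lm(y ~ x)

# plotit = FALSE returns the lambda grid ($x) and profile log-likelihood ($y)
bc <- boxcox(sim_fit, lambda = seq(-2, 2, 0.1), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum
in_ci <- bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
ci_range <- range(bc$x[in_ci])

lambda_hat
ci_range
```

If the interval covers 0, a log transformation is a reasonable choice; if it also covers 1, the untransformed response may be adequate as well.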

### Model Comparison

We now have three candidate models to compare:

1.  `ini.model`: The full model on the original `Credit_score`.
2.  `transform.model`: The full model on `log(Credit_score)`.
3.  `final.model`: A simplified version of the transformed model selected via `step()`.

```{r comparison-credit-score}
# Build a transformed model using log(Credit_score)
transform.model <- lm(log(Credit_score) ~ . - Amount - Default - Credit_score_scaled, data = regression_data)

# Use backward elimination to find final model from the transformed model
final.model <- step(transform.model, direction = "backward", trace = 0)

# Compare the R-squared of all three candidate models
r.ini.model <- summary(ini.model)$r.squared
r.transfd.model <- summary(transform.model)$r.squared
r.final.model <- summary(final.model)$r.squared

# Create a data frame for comparison and display it as a table
Rsquare <- data.frame(
  ini.model = r.ini.model,
  transfd.model = r.transfd.model,
  final.model = r.final.model
)

kable(Rsquare, caption = "Coefficients of Determination (R-squared) for the Three Candidate Models")
```
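
Plain R-squared never decreases when predictors are added, so among nested models the larger one always scores at least as high. Adjusted R-squared and AIC penalize model size and are often more informative when a full model is compared with a reduced one. The sketch below illustrates this on simulated data; the data frame `sim`, the models `fit_full` and `fit_small`, and the table `compare` are assumptions made for the example, not objects from this report.

```r
set.seed(551)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
# Simulated positive response; "noise" has no real effect on it
sim$y <- exp(6.5 + 0.03 * sim$x1 - 0.01 * sim$x2 + rnorm(n, sd = 0.06))

fit_full  <- lm(log(y) ~ x1 + x2 + noise, data = sim)
fit_small <- step(fit_full, direction = "backward", trace = 0)

# Collect size-unaware and size-aware fit measures side by side
compare <- data.frame(
  model         = c("fit_full", "fit_small"),
  n_terms       = c(length(coef(fit_full)) - 1, length(coef(fit_small)) - 1),
  r.squared     = c(summary(fit_full)$r.squared, summary(fit_small)$r.squared),
  adj.r.squared = c(summary(fit_full)$adj.r.squared,
                    summary(fit_small)$adj.r.squared),
  AIC           = c(AIC(fit_full), AIC(fit_small))
)
compare
```

Both models here share the same response (`log(y)`), which is what makes the rows comparable. A model fitted to the untransformed response explains variance on a different scale, so its R-squared is not directly comparable with these.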

## Interpretation of the Final Model

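The comparison table shows that all three models have very similar R-squared values. The transformed model (0.1492) and the final simplified model (0.1456) fit slightly better than the initial model (0.1425), although the initial model's R-squared is measured on the original scale of `Credit_score` rather than the log scale, so the comparison is only indicative. Because `final.model` uses fewer variables while giving up very little R-squared relative to the full transformed model, we select it as our final working model.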

```{r final-linear-summary, results='asis'}
# Display the summary of the final selected model
final_lm_summary <- summary(final.model)
pander(final_lm_summary$coefficients, caption = "Summary of Final Linear Regression Model Coefficients")
```

The model's R-squared value is **0.1456** (the multiple R-squared reported in the comparison table, not the adjusted value), which means that about **14.6%** of the variability in the **log of `Credit_score`** is explained by the predictors.

**Key Findings:**

  * **Checking_amount_scaled:** The coefficient is positive and significant (p = 0.012). Applicants with higher checking account balances tend to have higher credit scores.
  * **Saving_amount_scaled:** Savings are also positively associated with credit score (p = 0.005); higher savings are linked to a higher score.
  * **Term_scaled:** Longer loan terms are negatively associated with credit score (p = 0.001), suggesting that longer-term loans tend to be taken by borrowers with lower credit scores.
  * **Age_scaled:** Age has the largest coefficient in the model; older borrowers tend to have higher credit scores.
  * **Education_loan:** The coefficient is negative but not significant at the 5% level (p = 0.074).
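
Because the response is `log(Credit_score)`, each coefficient can be read as an approximate percentage change in credit score: a one-unit increase in a predictor multiplies the expected score by `exp(coefficient)`. The sketch below applies that conversion to the estimates from the final model's coefficient table, typed in by hand; the names `log_coefs` and `pct_change` are illustrative. For the scaled predictors, one unit corresponds to one standard deviation.

```r
# Estimates copied from the final linear model's coefficient table
log_coefs <- c(
  Checking_amount_scaled = 0.008597,
  Term_scaled            = -0.01069,
  Saving_amount_scaled   = 0.00971,
  Age_scaled             = 0.02693,
  Education_loan         = -0.0186
)

# Percentage change in Credit_score per one-unit increase in each predictor
pct_change <- 100 * (exp(log_coefs) - 1)
round(pct_change, 2)
```

Under this reading, a borrower one standard deviation older is associated with a credit score roughly 2.7% higher, holding the other predictors fixed.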

-----

# Logistic Regression Analysis

## Statement of the Question

The purpose of this analysis is primarily **association**. We want to answer: **What are the key characteristics of a loan applicant that are associated with their likelihood of defaulting?**

The dataset contains the `Default` variable and numerous potential predictors.

## Model Building

### Full and Optimized Models

We start with a "full model" that includes all predictors. For the variable selection process, we also define an "optimized model" containing only the predictors we believe to be important in practice. It serves as the lower bound of the search, so its predictors are always retained.

```{r logistic-model-building}
# define the full model
full_logistic_model <- glm(Default ~ . - Amount - Credit_score, data = regression_data, family = "binomial")

# define the reduced model with practically important variables
optimized_logistic_model <- glm(Default ~ Credit_score_scaled + Checking_amount_scaled, data = regression_data, family = "binomial")

# using step() for backward selection to find the final model
# the scope is defined by the optimized and full models
final_logistic_model <- step(full_logistic_model,
                             scope = list(lower = formula(optimized_logistic_model), upper = formula(full_logistic_model)),
                             direction = "backward",
                             trace = 0)
```
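
The effect of the `lower` component of `scope` can be seen in a small simulation. This is a sketch under stated assumptions: the data frame `sim`, its variables `x1` to `x4`, and the fitted objects are invented for illustration and are unrelated to the loan data. Terms listed in `lower` are never candidates for removal, even when dropping them would reduce AIC.

```r
set.seed(551)
n <- 400
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
# Only x2 truly drives the outcome; x1 is deliberately irrelevant
sim$default <- rbinom(n, size = 1, prob = plogis(-1 + 1.2 * sim$x2))

full_fit <- glm(default ~ x1 + x2 + x3 + x4, data = sim, family = binomial)

# Without a lower bound, backward elimination is free to drop x1
free_fit <- step(full_fit, direction = "backward", trace = 0)

# With lower = ~ x1, x1 is protected from removal
kept_fit <- step(full_fit,
                 scope = list(lower = ~ x1, upper = formula(full_fit)),
                 direction = "backward", trace = 0)

attr(terms(free_fit), "term.labels")
attr(terms(kept_fit), "term.labels")
```

Fixing predictors in `lower` is a way to encode subject-matter judgment in the search, at the cost of possibly keeping terms that AIC alone would discard.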

## Interpretation of the Final Model

Starting from the full model, the `step()` function selected the final model by backward elimination based on AIC.

```{r final-logistic-summary, results='asis'}
# Get the summary and display it using pander for clean formatting
final_summary <- summary(final_logistic_model)
pander(final_summary$coefficients, caption = "Summary of Final Logistic Regression Model Coefficients")
```

**Key Findings:**

The final model highlights several factors significantly associated with defaulting:

  * **Credit_score_scaled:** The coefficient is **-0.86**, a strong negative association. Holding the other predictors fixed, a one-standard-deviation increase in credit score multiplies the odds of default by about exp(-0.86) ≈ 0.42.
  * **Term_scaled:** The positive coefficient (**0.53**) implies that longer loan terms are associated with higher log-odds of default.
  * **Checking_amount_scaled:** A larger checking account balance is associated with lower odds of default (**-1.7**).
  * **Age_scaled:** Younger borrowers are more likely to default than older borrowers (**-2.6**); this is the largest coefficient among the scaled predictors.
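
Logistic coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to communicate. The sketch below does this for four of the estimates and standard errors in the coefficient table, typed in by hand, and adds approximate 95% Wald intervals. The data frame `coef_tab` and its column names are illustrative, and the Wald interval is one common choice rather than the only one.

```r
# Estimates and standard errors copied from the final logistic model table
coef_tab <- data.frame(
  term      = c("Checking_amount_scaled", "Term_scaled",
                "Credit_score_scaled", "Age_scaled"),
  estimate  = c(-1.709, 0.5333, -0.8611, -2.643),
  std_error = c(0.225, 0.1677, 0.1614, 0.2607)
)

z <- qnorm(0.975)  # about 1.96, for a 95% Wald interval

# Odds ratio per one-standard-deviation increase, with interval bounds
coef_tab$odds_ratio <- exp(coef_tab$estimate)
coef_tab$or_lower   <- exp(coef_tab$estimate - z * coef_tab$std_error)
coef_tab$or_upper   <- exp(coef_tab$estimate + z * coef_tab$std_error)

coef_tab
```

An odds ratio below 1 means lower odds of default per standard deviation increase in the predictor, and an odds ratio above 1 means higher odds.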

## Summary and Discussion

The logistic regression analysis successfully identified key predictors of loan default. Factors like a **low credit score**, **a low checking account balance** and **a longer loan term** are all associated with an increased likelihood of default.
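
To see how a fitted logistic model turns applicant characteristics into a default probability, `predict()` with `type = "response"` can be applied to new profiles. The example below is a sketch on simulated data: `sim`, `sim_fit`, `profiles`, and the single predictor `score_scaled` are hypothetical and stand in for the report's model only by analogy.

```r
set.seed(551)
n <- 500
sim <- data.frame(score_scaled = rnorm(n))
# Simulated outcome: higher score means lower default probability
sim$default <- rbinom(n, size = 1, prob = plogis(-1 - 1.0 * sim$score_scaled))

sim_fit <- glm(default ~ score_scaled, data = sim, family = binomial)

# Two hypothetical applicants: one SD below and one SD above the mean score
profiles <- data.frame(score_scaled = c(-1, 1))
profiles$p_default <- predict(sim_fit, newdata = profiles, type = "response")
profiles
```

With the report's own model, the same call would take `final_logistic_model` and a data frame containing every predictor retained in that model.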

-----

# Conclusion

In this part of the project we built both linear and logistic regression models. The linear regression showed that factors such as checking and savings balances are associated with the borrower's `Credit_score`. The logistic regression provided strong evidence that `Credit_score`, `Checking_amount`, and `Term` are important predictors of `Default`. These statistical models provide a foundation for understanding the relationships within the data.
