1 Introduction

This report presents Part II of the project, which focuses on classical regression modeling. Building on the exploratory data analysis (EDA) and feature engineering from Part I, we construct and interpret both linear and logistic regression models. The goals are to identify factors associated with the borrower’s Credit_score and to model the likelihood of loan Default.


2 Data Preparation

Before modeling, we first load the raw data and process it using the feature engineering function developed in Part I.

# Load the raw dataset
setwd("/Users/jeffery/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - jMacP/WCUPA/Classes/Fall 2025/STA551/Homework/Data")
loan_data <- read.csv("BankLoanDefaultDataset.csv")

# Define the feature engineering function from Part I
create_analytical_features <- function(raw_data) {
  # Step 1: Clean data
  data_cleaned <- raw_data
  data_cleaned$Checking_amount[data_cleaned$Checking_amount < 0] <- 0
  
  # Step 2: Convert to factors
  data_cleaned$Gender <- as.factor(data_cleaned$Gender)
  data_cleaned$Marital_status <- as.factor(data_cleaned$Marital_status)
  data_cleaned$Emp_status <- as.factor(data_cleaned$Emp_status)
  
  # Step 3: Scale numerical predictors
  vars_to_scale <- c("Checking_amount", "Term", "Credit_score",
                     "Saving_amount", "Emp_duration", "Age", "No_of_credit_acc")
  scaled_data <- scale(data_cleaned[, vars_to_scale])
  colnames(scaled_data) <- paste0(vars_to_scale, "_scaled")
  
  # Step 4: Create dummy variables
  dummy_vars <- model.matrix(~ Gender + Marital_status + Emp_status - 1, data = data_cleaned)
  
  # Step 5: Combine all parts, retaining original Credit_score for transformation
  binary_vars <- data_cleaned[, c("Car_loan", "Personal_loan", "Home_loan", "Education_loan", "Default")]
  final_data <- cbind(Amount = data_cleaned$Amount, 
                      Credit_score = data_cleaned$Credit_score, # Keep original Credit_score
                      scaled_data, 
                      dummy_vars, 
                      binary_vars)
  
  return(as.data.frame(final_data))
}

# Create the final dataset for regression in a single step
regression_data <- create_analytical_features(loan_data)

3 Linear Regression Analysis

3.1 Statement of the Question

The primary goal of this linear regression analysis is association analysis. We aim to answer the question: What customer and loan characteristics are associated with a customer’s Credit_score?

Understanding these associations can help identify factors linked to better financial health.

3.2 Model Building and Selection Process

3.2.1 Initial Model and Diagnostics

We begin by fitting a full model using the original Credit_score as the response. We exclude Amount, Default, and the scaled version of Credit_score from the predictors. We then assess the model’s assumptions using diagnostic plots.

# Build the initial model using the original Credit_score
ini.model <- lm(Credit_score ~ . - Amount - Default - Credit_score_scaled, data = regression_data)

# Assess the assumptions of the initial linear model
par(mfrow = c(2, 2))
plot(ini.model)
par(mfrow = c(1, 1))

Figure: Diagnostic Plots for the Initial Linear Model

Observations:

The diagnostic plots show a slight curve in the Residuals vs Fitted plot, suggesting potential non-linearity. To address this, we will use a Box-Cox plot to investigate if a power transformation of the response variable could improve the model’s fit.

3.2.2 Transformation

# Use Box-Cox to find a potential transformation for the response variable
boxcox(ini.model, lambda = seq(-2, 2, 0.1))
Figure: Box-Cox Plot for Power Transformation

The Box-Cox plot indicates that a lambda value near 0 may be optimal, suggesting a log transformation could be beneficial. We will create a new model with log(Credit_score) as the response.

3.2.3 Model Comparison

We now have three candidate models to compare:

  1. ini.model: The full model on the original Credit_score.
  2. transform.model: The full model on log(Credit_score).
  3. final.model: A simplified version of the transformed model selected via step().

# Build a transformed model using log(Credit_score)
transform.model <- lm(log(Credit_score) ~ . - Amount - Default - Credit_score_scaled, data = regression_data)

# Use backward elimination to find final model from the transformed model
final.model <- step(transform.model, direction = "backward", trace = 0)

# Compare the R-squared of all three candidate models
r.ini.model <- summary(ini.model)$r.squared
r.transfd.model <- summary(transform.model)$r.squared
r.final.model <- summary(final.model)$r.squared

# Create a data frame for comparison and display it as a table
Rsquare <- data.frame(
  ini.model = r.ini.model,
  transfd.model = r.transfd.model,
  final.model = r.final.model
)

kable(Rsquare, caption = "Coefficients of Determination (R-squared) for the Three Candidate Models")
Coefficients of Determination (R-squared) for the Three Candidate Models

ini.model    transfd.model    final.model
0.1425433    0.1491919        0.1456441

3.3 Interpretation of the Final Model

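The comparison table shows that all three models have similar R-squared values, between about 0.14 and 0.15. The full transformed model has the highest value (0.1492), and the stepwise-selected final.model gives up very little (0.1456) while using far fewer predictors, so we select final.model as our final working model. Note that ini.model is fit to Credit_score while the other two are fit to log(Credit_score), so its R-squared is measured on a different response scale and is only roughly comparable.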

# Display the summary of the final selected model
final_lm_summary <- summary(final.model)
pander(final_lm_summary$coefficients, caption = "Summary of Final Linear Regression Model Coefficients")
Summary of Final Linear Regression Model Coefficients

                          Estimate    Std. Error   t value   Pr(>|t|)
(Intercept)               6.63        0.003393     1954      0
Checking_amount_scaled    0.008597    0.003408     2.522     0.01182
Term_scaled               -0.01069    0.003346     -3.196    0.001438
Saving_amount_scaled      0.00971     0.00345      2.814     0.004984
Age_scaled                0.02693     0.003594     7.495     1.466e-13
Education_loan            -0.0186     0.01039      -1.791    0.07362

The model’s R-squared value is 0.1456 (the unadjusted value from the comparison table above), which means that about 14.6% of the variability in the log of Credit_score is explained by the predictors.

Key Findings:

  • Checking_amount_scaled: The coefficient is positive and significant. Applicants with higher checking balances tend to have higher credit scores.
  • Saving_amount_scaled: This is also positively associated with credit score. Higher savings are linked to a higher score.
  • Term_scaled: Longer loan terms are associated with lower credit scores, suggesting that longer-term loans tend to be taken by borrowers with lower scores.
  • Age_scaled: Older borrowers tend to have higher credit scores. This is the strongest association in the model (t = 7.5).
  • Education_loan: The coefficient is negative but not significant at the 5% level (p = 0.074), so we do not interpret it further.

4 Logistic Regression Analysis

4.1 Statement of the Question

The purpose of this analysis is primarily association. We want to answer: What are the key characteristics of a loan applicant that are associated with their likelihood of defaulting?

The dataset contains the Default variable and numerous potential predictors.

4.2 Model Building

4.2.1 Full and Optimized Models

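We start with a “full model” that includes all predictors. For the variable selection process, we also define an “optimized model” containing only the predictors we believe are practically important. It serves as the lower bound of the stepwise search, so these predictors are always retained.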

# define the full model
full_logistic_model <- glm(Default ~ . - Amount - Credit_score, data = regression_data, family = "binomial")

# define the reduced model with practically important variables
optimized_logistic_model <- glm(Default ~ Credit_score_scaled + Checking_amount_scaled, data = regression_data, family = "binomial")

# using step() for backward selection to find the final model
# the scope is defined by the optimized and full models
final_logistic_model <- step(full_logistic_model,
                             scope = list(lower = formula(optimized_logistic_model), upper = formula(full_logistic_model)),
                             direction = "backward",
                             trace = 0)

4.3 Interpretation of the Final Model

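The step() function has selected the model with the lowest AIC among those visited during backward elimination, subject to keeping the two predictors in the lower scope.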

# Get the summary and display it using pander for clean formatting
final_summary <- summary(final_logistic_model)
pander(final_summary$coefficients, caption = "Summary of Final Logistic Regression Model Coefficients")
Summary of Final Logistic Regression Model Coefficients

                          Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)               -2.632      0.3837       -6.86     6.895e-12
Checking_amount_scaled    -1.709      0.225        -7.595    3.077e-14
Term_scaled               0.5333      0.1677       3.18      0.001474
Credit_score_scaled       -0.8611     0.1614       -5.334    9.593e-08
Saving_amount_scaled      -1.58       0.2073       -7.623    2.487e-14
Age_scaled                -2.643      0.2607       -10.14    3.725e-24
Emp_statusunemployed      0.6117      0.3375       1.813     0.06991
Personal_loan             -0.8951     0.3335       -2.684    0.007272
Home_loan                 -2.79       0.7782       -3.585    0.0003368
Education_loan            1.42        0.5542       2.563     0.01038

Key Findings:

The final model highlights several factors significantly associated with defaulting:

  • Credit_score_scaled: The coefficient is -0.86, a strong negative association. Holding the other predictors fixed, a higher credit score is associated with lower odds of default.
  • Term_scaled: The positive coefficient (0.53) implies that longer loan terms are associated with higher log-odds of default.
  • Checking_amount_scaled: A larger checking balance is associated with lower odds of default (-1.7).
  • Age_scaled: Younger borrowers are more likely to default than older borrowers (-2.6). This is the largest coefficient in magnitude among the scaled predictors.

4.4 Summary and Discussion

The logistic regression analysis successfully identified key predictors of loan default. Factors like a low credit score, a low checking account balance and a longer loan term are all associated with an increased likelihood of default.


5 Conclusion

In this part of the project we built both linear and logistic regression models. The linear regression showed that checking and savings balances, loan term, and age are associated with the borrower’s Credit_score, although the model explains only about 15% of the variability in its log. The logistic regression provided strong evidence that Credit_score, Checking_amount, and Term are important predictors of Default. These statistical models provide a foundation for understanding the relationships within the data.

---
title: 'Project One: Part II - Regression Analysis'
author: 'Jeff Delva'
date: "October 8, 2025"
output:
  html_document:
    toc: yes
    toc_float: yes
    toc_depth: 4
    fig_width: 8
    fig_height: 5
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
    highlight: tango
---

```{css, echo = FALSE}
h1.title {
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}
h4.author, h4.date {
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}
h1 {
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}
h2 {
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}
h3 {
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}
.header-section-number::after {
  content: ".";
}
```

```{r setup, include=FALSE}
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("pander")) {
   install.packages("pander")
   library(pander)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}


knitr::opts_chunk$set(echo = TRUE,
                      warning = FALSE,
                      results = TRUE,
                      message = FALSE,
                      comment = NA
                      )
```

# Introduction

This report presents Part II of the project, which focuses on classical regression modeling. Building on the exploratory data analysis (EDA) and feature engineering from Part I, we construct and interpret both linear and logistic regression models. The goals are to identify factors associated with the borrower's `Credit_score` and to model the likelihood of loan `Default`.

-----

# Data Preparation

Before modeling, we first load the raw data and process it using the feature engineering function developed in Part I. 

```{r prepare-data}
# Load the raw dataset
setwd("/Users/jeffery/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - jMacP/WCUPA/Classes/Fall 2025/STA551/Homework/Data")
loan_data <- read.csv("BankLoanDefaultDataset.csv")

# Define the feature engineering function from Part I
create_analytical_features <- function(raw_data) {
  # Step 1: Clean data
  data_cleaned <- raw_data
  data_cleaned$Checking_amount[data_cleaned$Checking_amount < 0] <- 0
  
  # Step 2: Convert to factors
  data_cleaned$Gender <- as.factor(data_cleaned$Gender)
  data_cleaned$Marital_status <- as.factor(data_cleaned$Marital_status)
  data_cleaned$Emp_status <- as.factor(data_cleaned$Emp_status)
  
  # Step 3: Scale numerical predictors
  vars_to_scale <- c("Checking_amount", "Term", "Credit_score",
                     "Saving_amount", "Emp_duration", "Age", "No_of_credit_acc")
  scaled_data <- scale(data_cleaned[, vars_to_scale])
  colnames(scaled_data) <- paste0(vars_to_scale, "_scaled")
  
  # Step 4: Create dummy variables
  dummy_vars <- model.matrix(~ Gender + Marital_status + Emp_status - 1, data = data_cleaned)
  
  # Step 5: Combine all parts, retaining original Credit_score for transformation
  binary_vars <- data_cleaned[, c("Car_loan", "Personal_loan", "Home_loan", "Education_loan", "Default")]
  final_data <- cbind(Amount = data_cleaned$Amount, 
                      Credit_score = data_cleaned$Credit_score, # Keep original Credit_score
                      scaled_data, 
                      dummy_vars, 
                      binary_vars)
  
  return(as.data.frame(final_data))
}

# Create the final dataset for regression in a single step
regression_data <- create_analytical_features(loan_data)
```
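
The scaling and dummy-coding steps inside `create_analytical_features()` can be illustrated on a small made-up data frame. This is only a sketch: the `toy` data, its values, and the object names below are hypothetical and are not part of the loan dataset.

```r
# Toy data frame; the values are hypothetical and only mimic two columns.
toy <- data.frame(
  Age        = c(25, 35, 45, 55),
  Emp_status = factor(c("employed", "unemployed", "employed", "unemployed"))
)

# scale() centers the column to mean 0 and rescales it to standard deviation 1.
age_scaled <- scale(toy[, "Age", drop = FALSE])
colnames(age_scaled) <- "Age_scaled"

# With "- 1" the intercept is dropped, so the first factor in the formula
# keeps every one of its levels as an indicator column.
dummies <- model.matrix(~ Emp_status - 1, data = toy)

toy_features <- as.data.frame(cbind(age_scaled, dummies))
print(toy_features)
```

Because `- 1` keeps every level of the first factor, those indicator columns sum to one in every row. A model that also has an intercept cannot estimate all of them separately, and `lm()` reports `NA` for the redundant coefficient.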

-----

# Linear Regression Analysis

## Statement of the Question

The primary goal of this linear regression analysis is **association analysis**. We aim to answer the question: **What customer and loan characteristics are associated with a customer's `Credit_score`?**

Understanding these associations can help identify factors linked to better financial health.

## Model Building and Selection Process

### Initial Model and Diagnostics

We begin by fitting a full model using the original `Credit_score` as the response. We exclude `Amount`, `Default`, and the scaled version of `Credit_score` from the predictors. We then assess the model's assumptions using diagnostic plots.

```{r initial-lm, fig.cap="Diagnostic Plots for the Initial Linear Model"}
# Build the initial model using the original Credit_score
ini.model <- lm(Credit_score ~ . - Amount - Default - Credit_score_scaled, data = regression_data)

# Assess the assumptions of the initial linear model
par(mfrow = c(2, 2))
plot(ini.model)
par(mfrow = c(1, 1))
```

**Observations:**

The diagnostic plots show a slight curve in the **Residuals vs Fitted** plot, suggesting potential non-linearity. To address this, we will use a Box-Cox plot to investigate if a power transformation of the response variable could improve the model's fit.

### Transformation

```{r box-cox-transformation, fig.cap="Box-Cox Plot for Power Transformation"}
# Use Box-Cox to find a potential transformation for the response variable
boxcox(ini.model, lambda = seq(-2, 2, 0.1))
```

The Box-Cox plot indicates that a lambda value near 0 may be optimal, suggesting a **log transformation** could be beneficial. We will create a new model with `log(Credit_score)` as the response.
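
Reading lambda off the plot can be backed up numerically. The sketch below uses simulated data (the variables `x`, `y`, and `sim_fit` are hypothetical, not the loan data) and relies on `MASS::boxcox()` returning the lambda grid and profile log-likelihood when `plotit = FALSE`. The interval uses the usual chi-squared cutoff that the plot also draws.

```r
library(MASS)  # provides boxcox()

set.seed(551)  # arbitrary seed so the sketch is reproducible
x <- runif(200, 1, 10)
# Simulated response that is linear on the log scale (hypothetical data).
y <- exp(1 + 0.3 * x + rnorm(200, sd = 0.2))
sim_fit <- lm(y ~ x)

# plotit = FALSE returns the lambda grid (x) and the profile log-likelihood (y).
bc <- boxcox(sim_fit, lambda = seq(-2, 2, 0.1), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum.
in_ci <- bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
lambda_ci <- range(bc$x[in_ci])

print(lambda_hat)
print(lambda_ci)
```

Applied to `ini.model`, the same steps would show whether 0 (the log transformation) actually falls inside the interval.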

### Model Comparison

We now have three candidate models to compare:

1.  `ini.model`: The full model on the original `Credit_score`.
2.  `transform.model`: The full model on `log(Credit_score)`.
3.  `final.model`: A simplified version of the transformed model selected via `step()`.

```{r comparison-credit-score}
# Build a transformed model using log(Credit_score)
transform.model <- lm(log(Credit_score) ~ . - Amount - Default - Credit_score_scaled, data = regression_data)

# Use backward elimination to find final model from the transformed model
final.model <- step(transform.model, direction = "backward", trace = 0)

# Compare the R-squared of all three candidate models
r.ini.model <- summary(ini.model)$r.squared
r.transfd.model <- summary(transform.model)$r.squared
r.final.model <- summary(final.model)$r.squared

# Create a data frame for comparison and display it as a table
Rsquare <- data.frame(
  ini.model = r.ini.model,
  transfd.model = r.transfd.model,
  final.model = r.final.model
)

kable(Rsquare, caption = "Coefficients of Determination (R-squared) for the Three Candidate Models")
```

## Interpretation of the Final Model

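The comparison table shows that all three models have similar R-squared values, between about 0.14 and 0.15. The full transformed model has the highest value (0.1492), and the stepwise-selected `final.model` gives up very little (0.1456) while using far fewer predictors, so we select `final.model` as our final working model. Note that `ini.model` is fit to `Credit_score` while the other two are fit to `log(Credit_score)`, so its R-squared is measured on a different response scale and is only roughly comparable.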

```{r final-linear-summary, results='asis'}
# Display the summary of the final selected model
final_lm_summary <- summary(final.model)
pander(final_lm_summary$coefficients, caption = "Summary of Final Linear Regression Model Coefficients")
```

The model's R-squared value is **0.1456** (the unadjusted value from the comparison table above), which means that about **14.6%** of the variability in the **log of `Credit_score`** is explained by the predictors.
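
Plain R-squared can never decrease when predictors are added, so comparisons between models of different size are often supplemented with adjusted R-squared or AIC. The following sketch shows how these quantities can be pulled from fitted `lm` objects. It uses simulated data, and every name in it (`sim`, `full_fit`, `reduced_fit`, `fit_stats`) is hypothetical.

```r
set.seed(551)  # arbitrary seed so the sketch is reproducible
n <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  noise1 = rnorm(n), noise2 = rnorm(n))
# Only x1 and x2 truly affect the simulated response.
sim$y <- 1 + 0.5 * sim$x1 - 0.3 * sim$x2 + rnorm(n)

full_fit    <- lm(y ~ ., data = sim)        # includes the two noise columns
reduced_fit <- lm(y ~ x1 + x2, data = sim)  # nested inside full_fit

fit_stats <- data.frame(
  model         = c("full", "reduced"),
  r.squared     = c(summary(full_fit)$r.squared,
                    summary(reduced_fit)$r.squared),
  adj.r.squared = c(summary(full_fit)$adj.r.squared,
                    summary(reduced_fit)$adj.r.squared),
  AIC           = c(AIC(full_fit), AIC(reduced_fit))
)
print(fit_stats)
```

In this report, replacing `$r.squared` with `$adj.r.squared` in the comparison chunk would give the adjusted values for the three candidate models.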

**Key Findings:**

  * **Checking_amount_scaled:** The coefficient is positive and significant. Applicants with higher checking balances tend to have higher credit scores.
  * **Saving_amount_scaled:** This is also positively associated with credit score. Higher savings are linked to a higher score.
  * **Term_scaled:** Longer loan terms are associated with lower credit scores, suggesting that longer-term loans tend to be taken by borrowers with lower scores.
  * **Age_scaled:** Older borrowers tend to have higher credit scores. This is the strongest association in the model.
  * **Education_loan:** The coefficient is negative but not significant at the 5% level, so we do not interpret it further.
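
Because the response is `log(Credit_score)`, each coefficient can be read approximately as a percent change in the fitted score per one-standard-deviation increase in the predictor. A minimal sketch of that conversion follows. The estimates are typed in by hand from the rendered coefficient table, and the vector name `log_coefs` is hypothetical. In the live report, `coef(final.model)` would supply the same numbers.

```r
# Estimates copied from the rendered coefficient table of the final linear model.
log_coefs <- c(
  Checking_amount_scaled = 0.008597,
  Term_scaled            = -0.01069,
  Saving_amount_scaled   = 0.00971,
  Age_scaled             = 0.02693
)

# With a log response, a one-unit increase in a predictor (here one standard
# deviation) multiplies the fitted response by exp(beta).
pct_change <- 100 * (exp(log_coefs) - 1)
print(round(pct_change, 2))
```

For coefficients this small, the percent change is close to 100 times the coefficient itself.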

-----

# Logistic Regression Analysis

## Statement of the Question

The purpose of this analysis is primarily **association**. We want to answer: **What are the key characteristics of a loan applicant that are associated with their likelihood of defaulting?**

The dataset contains the `Default` variable and numerous potential predictors.

## Model Building

### Full and Optimized Models

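We start with a "full model" that includes all predictors. For the variable selection process, we also define an "optimized model" containing only the predictors we believe are practically important. It serves as the lower bound of the stepwise search, so these predictors are always retained.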

```{r logistic-model-building}
# define the full model
full_logistic_model <- glm(Default ~ . - Amount - Credit_score, data = regression_data, family = "binomial")

# define the reduced model with practically important variables
optimized_logistic_model <- glm(Default ~ Credit_score_scaled + Checking_amount_scaled, data = regression_data, family = "binomial")

# using step() for backward selection to find the final model
# the scope is defined by the optimized and full models
final_logistic_model <- step(full_logistic_model,
                             scope = list(lower = formula(optimized_logistic_model), upper = formula(full_logistic_model)),
                             direction = "backward",
                             trace = 0)
```
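
The effect of the `lower` scope can be seen in a small simulation. This is a sketch under stated assumptions: the data frame `sim`, its columns, and the fitted objects are hypothetical. The predictor `x3` has no real effect on the outcome but is placed in the lower scope, so backward elimination is not allowed to drop it.

```r
set.seed(551)  # arbitrary seed so the sketch is reproducible
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
# Simulated binary outcome that depends on x1 and x2 only (hypothetical data).
sim$y <- rbinom(n, size = 1,
                prob = plogis(-0.5 + 1.2 * sim$x1 - 0.8 * sim$x2))

sim_full <- glm(y ~ ., data = sim, family = binomial)

# x3 is forced to stay in the model through the lower scope,
# even though it has no real effect on y.
sim_final <- step(sim_full,
                  scope = list(lower = ~ x3, upper = formula(sim_full)),
                  direction = "backward",
                  trace = 0)

print(formula(sim_final))
print(AIC(sim_full) - AIC(sim_final))  # non-negative: step() never raises AIC
```

The same mechanism is why `Credit_score_scaled` and `Checking_amount_scaled` are guaranteed to appear in `final_logistic_model`, regardless of their AIC contribution.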

## Interpretation of the Final Model

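The `step()` function has selected the model with the lowest AIC among those visited during backward elimination, subject to keeping the two predictors in the lower scope.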

```{r final-logistic-summary, results='asis'}
# Get the summary and display it using pander for clean formatting
final_summary <- summary(final_logistic_model)
pander(final_summary$coefficients, caption = "Summary of Final Logistic Regression Model Coefficients")
```

**Key Findings:**

The final model highlights several factors significantly associated with defaulting:

  * **Credit_score_scaled:** The coefficient is **-0.86**, a strong negative association. Holding the other predictors fixed, a higher credit score is associated with lower odds of default.
  * **Term_scaled:** The positive coefficient (**0.53**) implies that longer loan terms are associated with higher log-odds of default.
  * **Checking_amount_scaled:** A larger checking balance is associated with lower odds of default (**-1.7**).
  * **Age_scaled:** Younger borrowers are more likely to default than older borrowers (**-2.6**). This is the largest coefficient in magnitude among the scaled predictors.
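
Logistic coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to communicate. The sketch below does this for four of the predictors, with Wald-type 95% intervals. The estimates and standard errors are typed in by hand from the rendered coefficient table, and the data frame name `logit_tab` is hypothetical. In the live report, `coef(final_logistic_model)` would supply the estimates directly.

```r
# Estimates and standard errors copied from the rendered logistic coefficient table.
logit_tab <- data.frame(
  term     = c("Checking_amount_scaled", "Term_scaled",
               "Credit_score_scaled", "Age_scaled"),
  estimate = c(-1.709, 0.5333, -0.8611, -2.643),
  se       = c(0.225, 0.1677, 0.1614, 0.2607)
)

# Odds ratio per one-standard-deviation increase, with a Wald 95% interval
# computed on the log-odds scale and then exponentiated.
z <- qnorm(0.975)
logit_tab$odds_ratio <- exp(logit_tab$estimate)
logit_tab$or_lower   <- exp(logit_tab$estimate - z * logit_tab$se)
logit_tab$or_upper   <- exp(logit_tab$estimate + z * logit_tab$se)

print(logit_tab[, c("term", "odds_ratio", "or_lower", "or_upper")], digits = 3)
```

An odds ratio below 1 means the odds of default fall as the predictor rises, and a value above 1 means they increase.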

## Summary and Discussion

The logistic regression analysis successfully identified key predictors of loan default. Factors like a **low credit score**, **a low checking account balance** and **a longer loan term** are all associated with an increased likelihood of default.

-----

# Conclusion

In this part of the project we built both linear and logistic regression models. The linear regression showed that checking and savings balances, loan term, and age are associated with the borrower's `Credit_score`, although the model explains only about 15% of the variability in its log. The logistic regression provided strong evidence that `Credit_score`, `Checking_amount`, and `Term` are important predictors of `Default`. These statistical models provide a foundation for understanding the relationships within the data.
