1 Introduction

This report presents an exploratory data analysis (EDA) of the bank loan defaults dataset and the feature engineering performed to prepare it for modeling.

1.1 Problem Statements and Objectives

Based on the dataset, we aim to address two primary practical questions:

  1. Classification Problem: What are the key characteristics of a loan applicant that influence their likelihood of defaulting on a loan?

    • Analytical Question: Can we build a model to predict whether a customer will Default (1) or not Default (0)?
  2. Regression Problem: What customer and loan characteristics are associated with the total loan Amount requested?

    • Analytical Question: Can we build a model to predict the loan Amount a customer will request?

2 Data Description

2.1 Data Source and Collection

The data is from the BankLoanDefaultDataset.csv file. It is a dataset containing information on 1,000 loan applicants.

Source: Deepti Gupta, Applied Analytics through Case Studies Using SAS and R, Apress, ISBN 978-1-4842-3525-6.

2.2 Dataset Overview

The dataset contains 1,000 observations and 16 variables. We will first load the data and inspect its structure.

# load dataset
setwd("/Users/jeffery/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - jMacP/WCUPA/Classes/Fall 2025/STA551/Homework/Data")
loan_data <- read.csv("BankLoanDefaultDataset.csv")

# print the dataset dimensions, then inspect the structure
cat("Dimensions:", dim(loan_data)[1], "rows and", dim(loan_data)[2], "columns\n")
Dimensions: 1000 rows and 16 columns
str(loan_data)
'data.frame':   1000 obs. of  16 variables:
 $ Default         : int  0 0 0 1 1 0 0 0 0 1 ...
 $ Checking_amount : int  988 458 158 300 63 1071 -192 172 585 189 ...
 $ Term            : int  15 15 14 25 24 20 13 16 20 19 ...
 $ Credit_score    : int  796 813 756 737 662 828 856 763 778 649 ...
 $ Gender          : chr  "Female" "Female" "Female" "Female" ...
 $ Marital_status  : chr  "Single" "Single" "Single" "Single" ...
 $ Car_loan        : int  1 1 0 0 0 1 1 1 1 1 ...
 $ Personal_loan   : int  0 0 1 0 0 0 0 0 0 0 ...
 $ Home_loan       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Education_loan  : int  0 0 0 1 1 0 0 0 0 0 ...
 $ Emp_status      : chr  "employed" "employed" "employed" "employed" ...
 $ Amount          : int  1536 947 1678 1804 1184 475 626 1224 1162 786 ...
 $ Saving_amount   : int  3455 3600 3093 2449 2867 3282 3398 3022 3475 2711 ...
 $ Emp_duration    : int  12 25 43 0 4 12 11 12 12 0 ...
 $ Age             : int  38 36 34 29 30 32 38 36 36 29 ...
 $ No_of_credit_acc: int  1 1 1 1 1 2 1 1 1 1 ...
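
The absolute path passed to setwd() above only resolves on the author's machine. As a minimal sketch of a more portable pattern, the snippet below builds the path with file.path() and checks for expected columns right after loading. The helper name load_loan_data and its data_dir argument are illustrative assumptions, not part of the original analysis, and a tiny temporary CSV stands in for the real file so the example runs anywhere.

```r
# Hypothetical helper: load the CSV from a given directory and verify its columns.
load_loan_data <- function(data_dir,
                           file_name = "BankLoanDefaultDataset.csv",
                           required_cols = c("Default", "Amount")) {
  path <- file.path(data_dir, file_name)
  if (!file.exists(path)) stop("File not found: ", path)

  df <- read.csv(path, stringsAsFactors = FALSE)

  # Fail early if a column the analysis depends on is absent
  missing_cols <- setdiff(required_cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing expected columns: ", paste(missing_cols, collapse = ", "))
  }
  df
}

# Demonstration with a small stand-in file written to a temporary directory
demo_dir <- tempdir()
demo <- data.frame(Default = c(0, 1), Amount = c(1536, 1804))
write.csv(demo, file.path(demo_dir, "BankLoanDefaultDataset.csv"), row.names = FALSE)

loaded <- load_loan_data(demo_dir)
cat("Loaded", nrow(loaded), "rows and", ncol(loaded), "columns\n")
```

With the real file, data_dir would point at the folder containing BankLoanDefaultDataset.csv, and required_cols could list all 16 variable names.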

2.3 Variables

The dataset includes the classification target, Default, and 15 other variables. One of these, Amount, also serves as the target for the regression problem.

Variable Name      Description
Default            Target variable: 1 for default, 0 for non-default
Checking_amount    Amount in the checking account
Term               Loan term in months
Credit_score       Applicant’s credit score
Gender             Applicant’s gender (Male/Female)
Marital_status     Applicant’s marital status (Single/Married)
Car_loan           1 if applicant has a car loan, 0 otherwise
Personal_loan      1 if applicant has a personal loan, 0 otherwise
Home_loan          1 if applicant has a home loan, 0 otherwise
Education_loan     1 if applicant has an education loan, 0 otherwise
Emp_status         Employment status (employed/unemployed)
Amount             The loan amount requested
Saving_amount      Amount in the savings account
Emp_duration       Employment duration in months
Age                Applicant’s age in years
No_of_credit_acc   Number of credit accounts

3 Exploratory Data Analysis and Feature Engineering

We will now perform EDA to clean the data, show variable relationships and prepare for modeling.

3.1 Data Cleaning

3.1.1 Missing Values and Duplicates

First, we check for missing values and duplicated rows.

# check for missing values
cat("Total missing values:", sum(is.na(loan_data)), "\n")
Total missing values: 0 
# check for duplicated rows
cat("Total duplicated rows:", sum(duplicated(loan_data)), "\n")
Total duplicated rows: 0 

The data is clean with no missing values or duplicate entries.
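
A single total can hide where problems sit, and is.na() does not flag empty strings in character columns. The following sketch, run on a made-up toy data frame rather than the real data, shows one way to break the check down by column using only base R.

```r
# Toy data frame standing in for loan_data (values are made up)
toy <- data.frame(
  Credit_score = c(796, NA, 756, 737),
  Gender = c("Female", "Male", "", "Female"),
  stringsAsFactors = FALSE
)

# NA count per column
na_per_column <- colSums(is.na(toy))

# Empty or whitespace-only strings per character column, which is.na() misses
char_cols <- names(toy)[sapply(toy, is.character)]
blank_per_column <- sapply(toy[char_cols],
                           function(x) sum(trimws(x) == "", na.rm = TRUE))

print(na_per_column)
print(blank_per_column)
```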

3.1.2 Inspecting Data Issues

The Checking_amount column has negative values which likely represent overdrafts. To simplify, we set negative values to 0.

# Show summary to identify the issue
summary(loan_data$Checking_amount)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -665.0   164.8   351.5   362.4   553.5  1319.0 
# Create a copy to modify
loan_data_cleaned <- loan_data
# Use base R indexing to find negative values and set them to 0
loan_data_cleaned$Checking_amount[loan_data_cleaned$Checking_amount < 0] <- 0

cat("\nSummary of Checking_amount after correction:\n")

Summary of Checking_amount after correction:
summary(loan_data_cleaned$Checking_amount)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0   164.8   351.5   377.4   553.5  1319.0 
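
Setting negative balances to 0 discards the fact that an account was overdrawn. One possible variation, shown here only as a sketch, is to record an indicator before clamping. The overdraft flag is a hypothetical addition and is not used elsewhere in this report. The toy vector reuses the first ten Checking_amount values shown in the str() output above.

```r
# First ten Checking_amount values from the str() output above
checking <- c(988, 458, 158, 300, 63, 1071, -192, 172, 585, 189)

# Hypothetical indicator: 1 if the balance was negative before cleaning
overdraft <- as.integer(checking < 0)

# pmax() clamps element-wise and gives the same result as the indexing approach
checking_clamped <- pmax(checking, 0)

print(data.frame(checking, overdraft, checking_clamped))
```

Whether such a flag helps is an empirical question for the modeling stage.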

3.2 Individual Variable Analysis

We will visualize the distribution of individual variables using basic plots.

3.2.1 Numerical Variable Distributions

# Set up a 2x2 plotting area
par(mfrow = c(2, 2))

# Basic histograms
hist(loan_data_cleaned$Credit_score, main = "Distribution of Credit Score", xlab = "Credit Score", col = "lightblue")
hist(loan_data_cleaned$Age, main = "Distribution of Age", xlab = "Age (years)", col = "green")
hist(loan_data_cleaned$Amount, main = "Distribution of Loan Amount", xlab = "Amount", col = "pink")
hist(loan_data_cleaned$Saving_amount, main = "Distribution of Saving Amount", xlab = "Saving Amount", col = "yellow")
[Figure: Distributions of Key Numerical Variables]

# Reset plotting area to default
par(mfrow = c(1, 1))

Observations: Credit_score and Age are approximately normally distributed. Financial variables like Amount and Saving_amount are skewed to the right.
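
The visual impression of symmetry or skew can be backed by a number. The sketch below defines a moment-based sample skewness in base R and applies it to simulated toy data, not to the loan dataset. The function name sample_skewness is an assumption made for illustration. Values near 0 suggest symmetry, and positive values indicate a longer right tail.

```r
# Moment-based skewness: third central moment divided by (second central moment)^(3/2)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  deviations <- x - mean(x)
  m2 <- mean(deviations^2)
  m3 <- mean(deviations^3)
  m3 / m2^(3 / 2)
}

set.seed(1)
symmetric_toy <- rnorm(1000, mean = 750, sd = 50)   # roughly symmetric
right_skewed_toy <- rexp(1000, rate = 1 / 1200)     # long right tail

cat("Skewness (symmetric toy):   ", round(sample_skewness(symmetric_toy), 2), "\n")
cat("Skewness (right-skewed toy):", round(sample_skewness(right_skewed_toy), 2), "\n")
```

Applied to the report's data, the same function could be called on loan_data_cleaned$Amount or loan_data_cleaned$Saving_amount.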

3.2.2 Categorical Variable Distributions

# Set up a 2x2 plotting area
par(mfrow = c(2, 2))

# Basic bar plots
barplot(table(loan_data_cleaned$Default), main = "Loan Default Counts", col=c("darkgreen", "darkred"))
barplot(table(loan_data_cleaned$Gender), main = "Gender Distribution", col=c("pink", "lightblue"))
barplot(table(loan_data_cleaned$Marital_status), main = "Marital Status", col=c("orange", "purple"))
barplot(table(loan_data_cleaned$Emp_status), main = "Employment Status", col=c("gray", "brown"))
[Figure: Distributions of Categorical & Target Variables]

# Reset plotting area
par(mfrow = c(1, 1))

Observations: There are more non-defaulters (0) than defaulters (1).
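
The degree of imbalance is easier to judge as a proportion than from bar heights. A minimal sketch using table() and prop.table() follows. The toy vector holds only the first ten Default values shown in the str() output, so its proportions are illustrative and not the dataset's actual default rate.

```r
# First ten Default values from the str() output above
default_toy <- c(0, 0, 0, 1, 1, 0, 0, 0, 0, 1)

counts <- table(default_toy)
proportions <- prop.table(counts)

print(counts)
print(round(proportions, 2))
```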

3.3 Relationships between Variables

Now we explore relationships between variables using simple comparisons.

3.3.1 Relationship with Default

We use boxplots to compare numerical variables between the default and non-default applicants.

# Set up a 1x2 plotting area
par(mfrow = c(1, 2))

# basic boxplots
boxplot(Credit_score ~ Default, data = loan_data_cleaned,
        main = "Credit Score by Default",
        xlab = "Default (0=No, 1=Yes)", ylab = "Credit Score",
        col = c("darkgreen", "darkred"))

boxplot(Amount ~ Default, data = loan_data_cleaned,
        main = "Loan Amount by Default",
        xlab = "Default (0=No, 1=Yes)", ylab = "Loan Amount",
        col = c("darkgreen", "darkred"))
[Figure: Credit Score and Loan Amount by Default Status]

# Reset plotting area
par(mfrow = c(1, 1))

Observations: Customers who defaulted tend to have lower credit scores.
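
Group-wise summaries put numbers behind what the boxplots show. The sketch below computes medians by Default with aggregate(), using only the first ten rows visible in the str() output as toy data. The resulting figures therefore describe those ten rows, not the full dataset.

```r
# First ten rows of the relevant columns, copied from the str() output above
toy <- data.frame(
  Default      = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 1),
  Credit_score = c(796, 813, 756, 737, 662, 828, 856, 763, 778, 649),
  Amount       = c(1536, 947, 1678, 1804, 1184, 475, 626, 1224, 1162, 786)
)

# Median of each numeric variable within each Default group
group_medians <- aggregate(cbind(Credit_score, Amount) ~ Default,
                           data = toy, FUN = median)
print(group_medians)
```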

3.3.2 Relationship with Amount

To explore predictors for the regression problem, we can see how Amount relates to categorical variables.

# Set up a 1x2 plotting area
par(mfrow = c(1, 2))

# boxplots comparing Amount against categorical variables
boxplot(Amount ~ Emp_status, data = loan_data_cleaned,
        main = "Loan Amount by Employment Status",
        xlab = "Employment Status", ylab = "Loan Amount",
        col = c("gray", "brown"))

boxplot(Amount ~ Marital_status, data = loan_data_cleaned,
        main = "Loan Amount by Marital Status",
        xlab = "Marital Status", ylab = "Loan Amount",
        col = c("orange", "purple"))
[Figure: Loan Amount by Employment and Marital Status]

# Reset plotting area
par(mfrow = c(1, 1))

Observations: The boxplots show no noticeable difference in requested loan amount between these groups, suggesting that Emp_status and Marital_status may not be strong predictors on their own for the regression problem. This is a visual impression only; no formal test was run.
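
A visual comparison of boxplots can be complemented with a formal two-sample test. The following sketch runs a Welch t-test with t.test() on simulated toy data in which both groups are drawn from the same distribution. It is not a result from the loan dataset, and the choice of a t-test over alternatives such as the Wilcoxon rank-sum test is an assumption made for illustration.

```r
set.seed(42)

# Simulated toy data: both groups share the same mean and spread
toy <- data.frame(
  Emp_status = rep(c("employed", "unemployed"), each = 50),
  Amount = c(rnorm(50, mean = 1200, sd = 300),
             rnorm(50, mean = 1200, sd = 300))
)

# Welch two-sample t-test (does not assume equal variances)
test_result <- t.test(Amount ~ Emp_status, data = toy)

cat("Welch t statistic:", round(unname(test_result$statistic), 3), "\n")
cat("p-value:", round(test_result$p.value, 3), "\n")
```

On the real data, the same call with data = loan_data_cleaned would test whether mean Amount differs by employment status.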


4 Creating the Analytical Dataset

We will now perform feature engineering to create the final dataset for modeling.

Steps:

  1. Convert Categorical Variables: Gender, Marital_status, and Emp_status will be converted into factors, then into numerical dummy variables.
  2. Standardize Numerical Variables: We will scale numerical predictors to have a mean of 0 and a standard deviation of 1.
# Start with the cleaned data
analytical_data <- loan_data_cleaned

# 1. Convert character columns to factors
analytical_data$Gender <- as.factor(analytical_data$Gender)
analytical_data$Marital_status <- as.factor(analytical_data$Marital_status)
analytical_data$Emp_status <- as.factor(analytical_data$Emp_status)

# 2. Identify numeric predictors to be scaled
vars_to_scale <- c("Checking_amount", "Term", "Credit_score", "Amount",
                   "Saving_amount", "Emp_duration", "Age", "No_of_credit_acc")
scaled_data <- scale(analytical_data[, vars_to_scale])
colnames(scaled_data) <- paste0(vars_to_scale, "_scaled") # Add suffix to scaled columns

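# 3. Dummy-encode categorical variables using model.matrix. Because of "- 1",
#    Gender keeps both levels; Marital_status and Emp_status drop their reference level.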
dummy_vars <- model.matrix(~ Gender + Marital_status + Emp_status - 1, data = analytical_data)

# 4. Combine everything into the final analytical dataset
binary_vars <- analytical_data[, c("Car_loan", "Personal_loan", "Home_loan", "Education_loan", "Default")]
analytical_data_final <- cbind(scaled_data, dummy_vars, binary_vars)

# Display the first few rows and structure of the final dataset
cat("First 6 rows of the final analytical dataset:\n")
First 6 rows of the final analytical dataset:
head(as.data.frame(analytical_data_final))
  Checking_amount_scaled Term_scaled Credit_score_scaled Amount_scaled
1              2.2191098  -0.8686751          0.45805485     1.0378255
2              0.2930295  -0.8686751          0.67725070    -0.8885616
3             -0.7972046  -1.1772630         -0.05770008     1.5022517
4             -0.2811605   2.2172044         -0.30268368     1.9143481
5             -1.1424454   1.9086164         -1.26972418    -0.1134279
6              2.5207412   0.6742647          0.87065880    -2.4322878
  Saving_amount_scaled Emp_duration_scaled Age_scaled No_of_credit_acc_scaled
1            0.8120577          -0.9901873  1.6591037              -0.9355765
2            1.2390938          -0.6459033  1.1704853              -0.9355765
3           -0.2540600          -0.1692024  0.6818669              -0.9355765
4           -2.1506893          -1.3079880 -0.5396790              -0.9355765
5           -0.9196473          -1.2020544 -0.2953698              -0.9355765
6            0.3025595          -0.9901873  0.1932486              -0.3304170
  GenderFemale GenderMale Marital_statusSingle Emp_statusunemployed Car_loan
1            1          0                    1                    0        1
2            1          0                    1                    0        1
3            1          0                    1                    0        0
4            1          0                    1                    0        0
5            1          0                    1                    1        0
6            0          1                    0                    0        1
  Personal_loan Home_loan Education_loan Default
1             0         0              0       0
2             0         0              0       0
3             1         0              0       0
4             0         0              1       1
5             0         0              1       1
6             0         0              0       0
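
The output above contains both GenderFemale and GenderMale, which always sum to 1. If both columns enter a model that also has an intercept, they are perfectly collinear with it. The sketch below, on a made-up toy data frame, contrasts the encoding used in this report with one possible alternative that drops a reference level for every factor. The alternative is offered as an option, not as the report's method.

```r
# Toy factors (values are made up)
toy <- data.frame(
  Gender = factor(c("Female", "Male", "Female", "Male")),
  Marital_status = factor(c("Single", "Married", "Married", "Single"))
)

# As in the report: no intercept, so the first factor keeps all of its levels
with_all_gender_levels <- model.matrix(~ Gender + Marital_status - 1, data = toy)

# Alternative: keep the intercept in the formula, then drop its column,
# so every factor loses its reference level
reference_coded <- model.matrix(~ Gender + Marital_status, data = toy)[, -1, drop = FALSE]

print(colnames(with_all_gender_levels))
print(colnames(reference_coded))

# The two Gender columns sum to 1 in every row, duplicating an intercept
gender_sum <- with_all_gender_levels[, "GenderFemale"] +
  with_all_gender_levels[, "GenderMale"]
print(all(gender_sum == 1))
```

Functions such as lm() and glm() handle factor encoding themselves when given the unencoded factors, so the choice mainly matters when the design matrix is passed to a model directly.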

5 Wrapping Feature Engineering Code

To make these steps reusable, we can wrap them in a function.

create_analytical_features <- function(raw_data) {
  # Step 1: Clean data
  data_cleaned <- raw_data
  data_cleaned$Checking_amount[data_cleaned$Checking_amount < 0] <- 0
  
  # Step 2: Convert to factors
  data_cleaned$Gender <- as.factor(data_cleaned$Gender)
  data_cleaned$Marital_status <- as.factor(data_cleaned$Marital_status)
  data_cleaned$Emp_status <- as.factor(data_cleaned$Emp_status)
  
  # Step 3: Scale numerical predictors
  vars_to_scale <- c("Checking_amount", "Term", "Credit_score", "Amount",
                     "Saving_amount", "Emp_duration", "Age", "No_of_credit_acc")
  scaled_data <- scale(data_cleaned[, vars_to_scale])
  colnames(scaled_data) <- paste0(vars_to_scale, "_scaled")
  
  # Step 4: Create dummy variables
  dummy_vars <- model.matrix(~ Gender + Marital_status + Emp_status - 1, data = data_cleaned)
  
  # Step 5: Combine all parts
  binary_vars <- data_cleaned[, c("Car_loan", "Personal_loan", "Home_loan", "Education_loan", "Default")]
  final_data <- cbind(scaled_data, dummy_vars, binary_vars)
  
  return(as.data.frame(final_data))
}

# Example of using the function
analytical_data_from_function <- create_analytical_features(loan_data)
cat("First 3 rows of data processed by the function:\n")
First 3 rows of data processed by the function:
head(analytical_data_from_function, 3)
  Checking_amount_scaled Term_scaled Credit_score_scaled Amount_scaled
1              2.2191098  -0.8686751          0.45805485     1.0378255
2              0.2930295  -0.8686751          0.67725070    -0.8885616
3             -0.7972046  -1.1772630         -0.05770008     1.5022517
  Saving_amount_scaled Emp_duration_scaled Age_scaled No_of_credit_acc_scaled
1            0.8120577          -0.9901873  1.6591037              -0.9355765
2            1.2390938          -0.6459033  1.1704853              -0.9355765
3           -0.2540600          -0.1692024  0.6818669              -0.9355765
  GenderFemale GenderMale Marital_statusSingle Emp_statusunemployed Car_loan
1            1          0                    1                    0        1
2            1          0                    1                    0        1
3            1          0                    1                    0        0
  Personal_loan Home_loan Education_loan Default
1             0         0              0       0
2             0         0              0       0
3             1         0              0       0
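
Note that the function standardizes with the mean and standard deviation of whatever data it receives. If the data are later split into training and test sets, a common practice is to estimate these parameters on the training set and reuse them on new data. The sketch below shows that pattern with scale() on made-up toy values. It is an illustration under that assumption, not a change to the function above.

```r
# Toy training and new data (values are made up)
train <- data.frame(Credit_score = c(700, 750, 800), Age = c(30, 35, 40))
new_data <- data.frame(Credit_score = c(725, 775), Age = c(32, 38))

# Fit: scale() stores the column means and standard deviations as attributes
train_scaled <- scale(train)
centers <- attr(train_scaled, "scaled:center")
scales <- attr(train_scaled, "scaled:scale")

# Apply: reuse the training parameters when standardizing new data
new_scaled <- scale(new_data, center = centers, scale = scales)

print(round(new_scaled, 3))
```

Here Credit_score has a training mean of 750 and standard deviation of 50, so 725 maps to (725 - 750) / 50 = -0.5.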

6 Conclusion and Next Steps

This concludes our exploratory analysis of the bank loan defaults dataset. Two findings stand out: applicants who defaulted tend to have lower credit scores, and the Default variable is imbalanced, with more non-defaulters than defaulters. We cleaned Checking_amount, scaled the numerical variables, and encoded the categorical variables, then wrapped these steps in a function so they can be reapplied consistently. The resulting dataset is ready for Part II, which will focus on building and evaluating predictive models.

