Introduction
This report presents an exploratory data analysis (EDA) and the
feature engineering performed on the bank loan defaults dataset.
Problem Statements and Objectives
Based on the dataset, we aim to address two primary practical
questions:
Classification Problem: What are the key characteristics of a loan
applicant that influence their likelihood of defaulting on a loan?
- Analytical Question: Can we build a model to predict whether a
customer will Default (1) or not Default (0)?
Regression Problem: What customer and loan characteristics are
associated with the total loan Amount requested?
- Analytical Question: Can we build a model to predict the loan
Amount a customer will request?
Data Description
Data Source and Collection
The data comes from the BankLoanDefaultDataset.csv file, which
contains information on 1,000 loan applicants.
Source: Deepti Gupta, Applied Analytics through Case Studies Using SAS
and R, Apress, ISBN 978-1-4842-3525-6.
Dataset Overview
The dataset contains 1,000 observations and 16 variables. We will
first load the data and inspect its structure.
# load dataset
setwd("/Users/jeffery/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - jMacP/WCUPA/Classes/Fall 2025/STA551/Homework/Data")
loan_data <- read.csv("BankLoanDefaultDataset.csv")
# print the dataset dimensions
cat("Dimensions:", nrow(loan_data), "rows and", ncol(loan_data), "columns\n")
Dimensions: 1000 rows and 16 columns
# inspect the structure of the data
str(loan_data)
'data.frame': 1000 obs. of 16 variables:
$ Default : int 0 0 0 1 1 0 0 0 0 1 ...
$ Checking_amount : int 988 458 158 300 63 1071 -192 172 585 189 ...
$ Term : int 15 15 14 25 24 20 13 16 20 19 ...
$ Credit_score : int 796 813 756 737 662 828 856 763 778 649 ...
$ Gender : chr "Female" "Female" "Female" "Female" ...
$ Marital_status : chr "Single" "Single" "Single" "Single" ...
$ Car_loan : int 1 1 0 0 0 1 1 1 1 1 ...
$ Personal_loan : int 0 0 1 0 0 0 0 0 0 0 ...
$ Home_loan : int 0 0 0 0 0 0 0 0 0 0 ...
$ Education_loan : int 0 0 0 1 1 0 0 0 0 0 ...
$ Emp_status : chr "employed" "employed" "employed" "employed" ...
$ Amount : int 1536 947 1678 1804 1184 475 626 1224 1162 786 ...
$ Saving_amount : int 3455 3600 3093 2449 2867 3282 3398 3022 3475 2711 ...
$ Emp_duration : int 12 25 43 0 4 12 11 12 12 0 ...
$ Age : int 38 36 34 29 30 32 38 36 36 29 ...
$ No_of_credit_acc: int 1 1 1 1 1 2 1 1 1 1 ...
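The loading step above depends on an absolute working directory. As a hedged sketch of a more portable pattern, the example below reads a file by its full path built with file.path() instead of calling setwd(). The toy data frame, the data_dir variable, and the toy_loans.csv file name are made up for illustration; only a few values are borrowed from the structure output above.

```r
# Toy stand-in for the real CSV (an illustrative subset, not the actual file)
toy <- data.frame(
  Default = c(0L, 0L, 0L),
  Gender  = c("Female", "Female", "Female"),
  Amount  = c(1536L, 947L, 1678L)
)

# Hypothetical data directory; in practice this would be the project's data folder
data_dir <- tempdir()
csv_path <- file.path(data_dir, "toy_loans.csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read by full path, leaving the working directory untouched
toy_loaded <- read.csv(csv_path, stringsAsFactors = FALSE)
str(toy_loaded)
```

With this pattern, only data_dir needs to change when the report is run on another machine.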
Variables
The dataset includes a target variable, Default, and 15 predictor
variables.
Variable | Description
Default | Target variable: 1 for default, 0 for non-default
Checking_amount | Amount in the checking account
Term | Loan term in months
Credit_score | Applicant’s credit score
Gender | Applicant’s gender (Male/Female)
Marital_status | Applicant’s marital status (Single/Married)
Car_loan | 1 if applicant has a car loan, 0 otherwise
Personal_loan | 1 if applicant has a personal loan, 0 otherwise
Home_loan | 1 if applicant has a home loan, 0 otherwise
Education_loan | 1 if applicant has an education loan, 0 otherwise
Emp_status | Employment status (employed/unemployed)
Amount | The loan amount requested
Saving_amount | Amount in the savings account
Emp_duration | Employment duration in months
Age | Applicant’s age in years
No_of_credit_acc | Number of credit accounts
Exploratory Data Analysis and Feature Engineering
We will now perform EDA to clean the data, examine relationships
between variables, and prepare for modeling.
Data Cleaning
Missing Values and Duplicates
First, we check for any missing values or duplicated rows.
# check for missing values
cat("Total missing values:", sum(is.na(loan_data)), "\n")
Total missing values: 0
# check for duplicated rows
cat("Total duplicated rows:", sum(duplicated(loan_data)), "\n")
Total duplicated rows: 0
The data is clean with no missing values or duplicate entries.
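A single total can hide where problems sit. As a small sketch on a made-up toy data frame (not the loan data), the same checks can be broken down per column and per row, which is useful when the totals are not zero.

```r
# Toy data with one missing value in two columns and one repeated row
toy <- data.frame(
  Credit_score = c(796, NA, 756, 756),
  Age          = c(38, 36, 34, 34),
  Gender       = c("Female", NA, "Male", "Male")
)

# Missing values counted separately for each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows that repeat an earlier row exactly
dup_rows <- toy[duplicated(toy), ]
print(dup_rows)
```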
Inspecting Data Issues
The Checking_amount column has negative values, which likely
represent overdrafts. To simplify the analysis, we set these negative
values to 0.
# Show summary to identify the issue
summary(loan_data$Checking_amount)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-665.0 164.8 351.5 362.4 553.5 1319.0
# Create a copy to modify
loan_data_cleaned <- loan_data
# Use base R indexing to find negative values and set them to 0
loan_data_cleaned$Checking_amount[loan_data_cleaned$Checking_amount < 0] <- 0
cat("\nSummary of Checking_amount after correction:\n")
Summary of Checking_amount after correction:
summary(loan_data_cleaned$Checking_amount)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 164.8 351.5 377.4 553.5 1319.0
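Flooring at 0 discards the information that an account was overdrawn. One possible variation, which is not part of the analysis above, is to record an indicator before flooring. The sketch below uses a short made-up vector; the overdraft_flag name is hypothetical, and pmax() is used as an equivalent to the indexing approach.

```r
# Toy checking balances, including two negative (overdrawn) values
checking <- c(988, -192, 158, -665, 63)

# Hypothetical indicator: 1 if the balance was negative before correction
overdraft_flag <- as.integer(checking < 0)

# Element-wise maximum with 0 floors negative balances at 0
checking_floored <- pmax(checking, 0)

print(data.frame(checking, overdraft_flag, checking_floored))
```

Whether such a flag is worth keeping depends on whether overdrafts turn out to be informative for the Default outcome.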
Individual Variable Analysis
We will visualize the distribution of individual variables using
basic plots.
Numerical Variable Distributions
# Set up a 2x2 plotting area
par(mfrow = c(2, 2))
# Basic histograms
hist(loan_data_cleaned$Credit_score, main = "Distribution of Credit Score", xlab = "Credit Score", col = "lightblue")
hist(loan_data_cleaned$Age, main = "Distribution of Age", xlab = "Age (years)", col = "green")
hist(loan_data_cleaned$Amount, main = "Distribution of Loan Amount", xlab = "Amount", col = "pink")
hist(loan_data_cleaned$Saving_amount, main = "Distribution of Saving Amount", xlab = "Saving Amount", col = "yellow")
# Reset plotting area to default
par(mfrow = c(1, 1))
Observations: Credit_score and Age are approximately normally
distributed. Financial variables like Amount and Saving_amount are
skewed to the right.
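Visual impressions of skewness can be backed by a number. The sketch below defines a simple moment-based skewness helper (sample_skewness is a hypothetical name, not a base R function) and applies it to two made-up vectors: values near 0 suggest symmetry, and positive values suggest a right skew.

```r
# Moment-based skewness: third central moment divided by variance^(3/2)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Made-up examples: one symmetric vector, one with a long right tail
symmetric    <- c(-2, -1, 0, 1, 2)
right_skewed <- c(1, 1, 2, 2, 3, 10)

print(sample_skewness(symmetric))
print(sample_skewness(right_skewed))
```

The same helper could be applied to columns such as loan_data_cleaned$Amount to quantify the observations above.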
Categorical Variable Distributions
# Set up a 2x2 plotting area
par(mfrow = c(2, 2))
# Basic bar plots
barplot(table(loan_data_cleaned$Default), main = "Loan Default Counts", col=c("darkgreen", "darkred"))
barplot(table(loan_data_cleaned$Gender), main = "Gender Distribution", col=c("pink", "lightblue"))
barplot(table(loan_data_cleaned$Marital_status), main = "Marital Status", col=c("orange", "purple"))
barplot(table(loan_data_cleaned$Emp_status), main = "Employment Status", col=c("gray", "brown"))
# Reset plotting area
par(mfrow = c(1, 1))
Observations: There are more non-defaulters (0) than
defaulters (1).
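The degree of imbalance is easier to judge as a proportion than from bar heights. The sketch below uses only the first ten Default values shown in the structure output earlier, so the resulting proportions illustrate the technique and are not the proportions for the full dataset.

```r
# First ten Default values as shown in the str() output (not the full column)
default_toy <- c(0, 0, 0, 1, 1, 0, 0, 0, 0, 1)

# Counts per class, then the share of each class
counts <- table(default_toy)
props  <- prop.table(counts)

print(counts)
print(round(props, 2))
```

On the full data, the equivalent call would be prop.table(table(loan_data_cleaned$Default)).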
Relationships between Variables
Now we explore relationships between variables using simple
comparisons.
Relationship with Default
We use boxplots to compare numerical variables between the default
and non-default applicants.
# Set up a 1x2 plotting area
par(mfrow = c(1, 2))
# basic boxplots
boxplot(Credit_score ~ Default, data = loan_data_cleaned,
main = "Credit Score by Default",
xlab = "Default (0=No, 1=Yes)", ylab = "Credit Score",
col = c("darkgreen", "darkred"))
boxplot(Amount ~ Default, data = loan_data_cleaned,
main = "Loan Amount by Default",
xlab = "Default (0=No, 1=Yes)", ylab = "Loan Amount",
col = c("darkgreen", "darkred"))
# Reset plotting area
par(mfrow = c(1, 1))
Observations: Customers who defaulted tend to have
lower credit scores.
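A numeric summary can accompany the boxplots. As a sketch, the code below computes group medians with aggregate() on the first ten rows shown in the structure output earlier; with so few rows it only illustrates the approach and should not be read as a result for the full dataset.

```r
# First ten rows of three columns, taken from the str() output
toy <- data.frame(
  Default      = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 1),
  Credit_score = c(796, 813, 756, 737, 662, 828, 856, 763, 778, 649),
  Amount       = c(1536, 947, 1678, 1804, 1184, 475, 626, 1224, 1162, 786)
)

# Median of each numeric variable within each Default group
group_medians <- aggregate(cbind(Credit_score, Amount) ~ Default,
                           data = toy, FUN = median)
print(group_medians)
```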
Relationship with Amount
To explore predictors for the regression problem, we can see how
Amount relates to categorical variables.
# Set up a 1x2 plotting area
par(mfrow = c(1, 2))
# boxplots comparing Amount against categorical variables
boxplot(Amount ~ Emp_status, data = loan_data_cleaned,
main = "Loan Amount by Employment Status",
xlab = "Employment Status", ylab = "Loan Amount",
col = c("gray", "brown"))
boxplot(Amount ~ Marital_status, data = loan_data_cleaned,
main = "Loan Amount by Marital Status",
xlab = "Marital Status", ylab = "Loan Amount",
col = c("orange", "purple"))
# Reset plotting area
par(mfrow = c(1, 1))
Observations: Judging from the boxplots, the requested loan amount
does not appear to differ noticeably between these groups, suggesting
they may not be strong predictors on their own for the regression
problem.
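A visual comparison can be supplemented with a formal test. One option, which is our suggestion and was not run in the analysis above, is a Welch two-sample t-test via t.test(). The sketch below uses a made-up toy data frame, so the group assignment and the resulting p-value are purely illustrative.

```r
# Toy data: the grouping is invented for illustration only
toy <- data.frame(
  Emp_status = rep(c("employed", "unemployed"), each = 5),
  Amount     = c(1536, 947, 1678, 1224, 1162, 1804, 1184, 475, 626, 786)
)

# Group means for a quick numeric comparison
group_means <- tapply(toy$Amount, toy$Emp_status, mean)
print(group_means)

# Welch t-test (t.test does not assume equal variances by default)
welch <- t.test(Amount ~ Emp_status, data = toy)
print(welch$method)
print(welch$p.value)
```

On the real data, the formula Amount ~ Emp_status with data = loan_data_cleaned would apply the same test to the groups compared in the boxplots.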
Creating the Analytical Dataset
We will now perform feature engineering to create the final dataset
for modeling.
Steps:
- Convert Categorical Variables: Gender, Marital_status, and
Emp_status will be converted into factors, then into numerical dummy
variables.
- Standardize Numerical Variables: We will scale numerical predictors
to have a mean of 0 and a standard deviation of 1.
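To make the standardization step concrete, the sketch below applies z = (x - mean(x)) / sd(x) by hand to the first ten Credit_score values from the structure output and confirms that scale() produces the same result. The ten-value subset is used only for illustration.

```r
# First ten Credit_score values as shown in the str() output
credit_score <- c(796, 813, 756, 737, 662, 828, 856, 763, 778, 649)

# Manual standardization: subtract the mean, divide by the standard deviation
z_manual <- (credit_score - mean(credit_score)) / sd(credit_score)

# scale() returns a one-column matrix; flatten it for comparison
z_scale <- as.numeric(scale(credit_score))

print(all.equal(z_manual, z_scale))
print(c(mean = mean(z_manual), sd = sd(z_manual)))
```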
# Start with the cleaned data
analytical_data <- loan_data_cleaned
# 1. Convert character columns to factors
analytical_data$Gender <- as.factor(analytical_data$Gender)
analytical_data$Marital_status <- as.factor(analytical_data$Marital_status)
analytical_data$Emp_status <- as.factor(analytical_data$Emp_status)
# 2. Identify numeric predictors to be scaled
vars_to_scale <- c("Checking_amount", "Term", "Credit_score", "Amount",
"Saving_amount", "Emp_duration", "Age", "No_of_credit_acc")
scaled_data <- scale(analytical_data[, vars_to_scale])
colnames(scaled_data) <- paste0(vars_to_scale, "_scaled") # Add suffix to scaled columns
# 3. Dummy-encode categorical variables using model.matrix
#    (with "- 1", Gender keeps both levels; the other factors drop their first level)
dummy_vars <- model.matrix(~ Gender + Marital_status + Emp_status - 1, data = analytical_data)
# 4. Combine everything into the final analytical dataset
binary_vars <- analytical_data[, c("Car_loan", "Personal_loan", "Home_loan", "Education_loan", "Default")]
analytical_data_final <- cbind(scaled_data, dummy_vars, binary_vars)
# Display the first few rows and structure of the final dataset
cat("First 6 rows of the final analytical dataset:\n")
First 6 rows of the final analytical dataset:
head(as.data.frame(analytical_data_final))
Checking_amount_scaled Term_scaled Credit_score_scaled Amount_scaled
1 2.2191098 -0.8686751 0.45805485 1.0378255
2 0.2930295 -0.8686751 0.67725070 -0.8885616
3 -0.7972046 -1.1772630 -0.05770008 1.5022517
4 -0.2811605 2.2172044 -0.30268368 1.9143481
5 -1.1424454 1.9086164 -1.26972418 -0.1134279
6 2.5207412 0.6742647 0.87065880 -2.4322878
Saving_amount_scaled Emp_duration_scaled Age_scaled No_of_credit_acc_scaled
1 0.8120577 -0.9901873 1.6591037 -0.9355765
2 1.2390938 -0.6459033 1.1704853 -0.9355765
3 -0.2540600 -0.1692024 0.6818669 -0.9355765
4 -2.1506893 -1.3079880 -0.5396790 -0.9355765
5 -0.9196473 -1.2020544 -0.2953698 -0.9355765
6 0.3025595 -0.9901873 0.1932486 -0.3304170
GenderFemale GenderMale Marital_statusSingle Emp_statusunemployed Car_loan
1 1 0 1 0 1
2 1 0 1 0 1
3 1 0 1 0 0
4 1 0 1 0 0
5 1 0 1 1 0
6 0 1 0 0 1
Personal_loan Home_loan Education_loan Default
1 0 0 0 0
2 0 0 0 0
3 1 0 0 0
4 0 0 1 1
5 0 0 1 1
6 0 0 0 0
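The output shows both GenderFemale and GenderMale but only one column each for Marital_status and Emp_status. The sketch below, on a made-up toy data frame, illustrates why: removing the intercept with "- 1" gives the first factor a column per level, while later factors still drop their first level. It also shows that the two Gender columns always sum to 1, which matters for models that include an intercept.

```r
# Toy factors with the same level names as in the loan data
toy <- data.frame(
  Gender     = factor(c("Female", "Male", "Female", "Male")),
  Emp_status = factor(c("employed", "unemployed", "unemployed", "employed"))
)

# Without an intercept: the first factor gets one column per level
no_intercept <- model.matrix(~ Gender + Emp_status - 1, data = toy)
print(colnames(no_intercept))

# With an intercept: every factor drops its first (reference) level
with_intercept <- model.matrix(~ Gender + Emp_status, data = toy)
print(colnames(with_intercept))

# The two Gender columns are redundant: they sum to 1 in every row
print(no_intercept[, "GenderFemale"] + no_intercept[, "GenderMale"])
```

If the modeling step fits an intercept, dropping one of the two Gender columns is one way to avoid that redundancy.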
Wrapping Feature Engineering Code
To make these steps reusable, we can wrap them in a function.
create_analytical_features <- function(raw_data) {
# Step 1: Clean data
data_cleaned <- raw_data
data_cleaned$Checking_amount[data_cleaned$Checking_amount < 0] <- 0
# Step 2: Convert to factors
data_cleaned$Gender <- as.factor(data_cleaned$Gender)
data_cleaned$Marital_status <- as.factor(data_cleaned$Marital_status)
data_cleaned$Emp_status <- as.factor(data_cleaned$Emp_status)
# Step 3: Scale numerical predictors
vars_to_scale <- c("Checking_amount", "Term", "Credit_score", "Amount",
"Saving_amount", "Emp_duration", "Age", "No_of_credit_acc")
scaled_data <- scale(data_cleaned[, vars_to_scale])
colnames(scaled_data) <- paste0(vars_to_scale, "_scaled")
# Step 4: Create dummy variables
dummy_vars <- model.matrix(~ Gender + Marital_status + Emp_status - 1, data = data_cleaned)
# Step 5: Combine all parts
binary_vars <- data_cleaned[, c("Car_loan", "Personal_loan", "Home_loan", "Education_loan", "Default")]
final_data <- cbind(scaled_data, dummy_vars, binary_vars)
return(as.data.frame(final_data))
}
# Example of using the function
analytical_data_from_function <- create_analytical_features(loan_data)
cat("First 3 rows of data processed by the function:\n")
First 3 rows of data processed by the function:
head(analytical_data_from_function, 3)
Checking_amount_scaled Term_scaled Credit_score_scaled Amount_scaled
1 2.2191098 -0.8686751 0.45805485 1.0378255
2 0.2930295 -0.8686751 0.67725070 -0.8885616
3 -0.7972046 -1.1772630 -0.05770008 1.5022517
Saving_amount_scaled Emp_duration_scaled Age_scaled No_of_credit_acc_scaled
1 0.8120577 -0.9901873 1.6591037 -0.9355765
2 1.2390938 -0.6459033 1.1704853 -0.9355765
3 -0.2540600 -0.1692024 0.6818669 -0.9355765
GenderFemale GenderMale Marital_statusSingle Emp_statusunemployed Car_loan
1 1 0 1 0 1
2 1 0 1 0 1
3 1 0 1 0 0
Personal_loan Home_loan Education_loan Default
1 0 0 0 0
2 0 0 0 0
3 1 0 0 0
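One design point to keep in mind: the function recomputes the mean and standard deviation from whatever data it receives. If new data should be scaled consistently with the original data (for example, a held-out test set, which is an assumption about later modeling rather than something done in this report), the centering and scaling values can be saved and reused. The sketch below uses small made-up data frames.

```r
# Toy "training" rows and toy "new" rows (values are illustrative)
train    <- data.frame(Age = c(38, 36, 34, 29, 30),
                       Credit_score = c(796, 813, 756, 737, 662))
new_rows <- data.frame(Age = c(32, 38),
                       Credit_score = c(828, 856))

# scale() stores the means and standard deviations it used as attributes
train_scaled <- scale(train)
centers <- attr(train_scaled, "scaled:center")
scales  <- attr(train_scaled, "scaled:scale")

# Apply the training means and standard deviations to the new rows
new_scaled <- scale(new_rows, center = centers, scale = scales)
print(new_scaled)
```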
Conclusion and Next Steps
This concludes our exploratory analysis of the bank loan defaults
dataset. Key findings are that applicants who defaulted tend to have
lower credit scores, which suggests Credit_score will be an important
predictor of default, and that the Default variable is imbalanced, with
more non-defaulters than defaulters. We performed feature engineering
to scale numerical predictors and encode categorical variables, and
wrapped the entire process in a function so it can be reused
consistently. The resulting dataset is ready for Part II, which will
focus on building and evaluating predictive models.
