INTRODUCTION

In the banking sector, assessing the creditworthiness of loan applicants is crucial to minimize potential losses and ensure financial stability. This process involves evaluating various demographic and socio-economic attributes of prospective borrowers to make informed decisions regarding loan approvals.

In this case study, I explore a dataset from a bank that contains information about individuals who have applied for credit.Each entry in the dataset represents a loan applicant, classified as either a good credit risk or a bad credit risk based on their attributes. The primary goal of this analysis is to develop a predictive model that can assist the bank in determining whether to approve a loan for a prospective applicant.

The predictive model developed as part of this analysis, along with real-time visualization and exploratory analysis of the credit data, can be accessed through the following link: Click here to use the App. This application allows banks to input applicant data, receive predictions on credit risk, and explore various aspects of the dataset interactively.

Data Loading and Overview

credit <- read_csv("german_credit.csv")

## Rows: 1000 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Sex, Job, Housing, Saving accounts, Checking account, Purpose
## dbl (4): Age, Credit amount, Duration, Risk
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(credit)

## # A tibble: 6 × 10
##     Age Sex   Job   Housing `Saving accounts` `Checking account` `Credit amount`
##   <dbl> <chr> <chr> <chr>   <chr>             <chr>                        <dbl>
## 1    67 male  skil… own     little            little                        1169
## 2    22 fema… skil… own     little            moderate                      5951
## 3    49 male  unsk… own     little            little                        2096
## 4    45 male  skil… free    little            little                        7882
## 5    53 male  skil… free    little            little                        4870
## 6    35 male  unsk… free    little            moderate                      9055
## # ℹ 3 more variables: Duration <dbl>, Risk <dbl>, Purpose <chr>

glimpse(credit)

## Rows: 1,000
## Columns: 10
## $ Age                <dbl> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25, 24, 22,…
## $ Sex                <chr> "male", "female", "male", "male", "male", "male", "…
## $ Job                <chr> "skilled", "skilled", "unskilled_and_non-resident",…
## $ Housing            <chr> "own", "own", "own", "free", "free", "free", "own",…
## $ `Saving accounts`  <chr> "little", "little", "little", "little", "little", "…
## $ `Checking account` <chr> "little", "moderate", "little", "little", "little",…
## $ `Credit amount`    <dbl> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948, 305…
## $ Duration           <dbl> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, 48, 12, …
## $ Risk               <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, …
## $ Purpose            <chr> "radio/TV", "radio/TV", "education", "furniture/equ…

summary(credit)

##       Age            Sex                Job              Housing         
##  Min.   :19.00   Length:1000        Length:1000        Length:1000       
##  1st Qu.:27.00   Class :character   Class :character   Class :character  
##  Median :33.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :35.55                                                           
##  3rd Qu.:42.00                                                           
##  Max.   :75.00                                                           
##  Saving accounts    Checking account   Credit amount      Duration   
##  Length:1000        Length:1000        Min.   :  250   Min.   : 4.0  
##  Class :character   Class :character   1st Qu.: 1366   1st Qu.:12.0  
##  Mode  :character   Mode  :character   Median : 2320   Median :18.0  
##                                        Mean   : 3271   Mean   :20.9  
##                                        3rd Qu.: 3972   3rd Qu.:24.0  
##                                        Max.   :18424   Max.   :72.0  
##       Risk       Purpose         
##  Min.   :0.0   Length:1000       
##  1st Qu.:0.0   Class :character  
##  Median :0.0   Mode  :character  
##  Mean   :0.3                     
##  3rd Qu.:1.0                     
##  Max.   :1.0

# Check for missing values
colSums(is.na(credit))

##              Age              Sex              Job          Housing 
##                0                0                0                0 
##  Saving accounts Checking account    Credit amount         Duration 
##                0                0                0                0 
##             Risk          Purpose 
##                0                0

Data Cleaning and Feature Engineering

# Rename columns
credit <- credit %>%
  rename(
    Saving_accounts = `Saving accounts`,
    Checking_account = `Checking account`,
    Credit_amount = `Credit amount`
  )

# Create a New age group column
credit <- credit %>%
  mutate(AgeGroup = cut(Age, breaks = c(18, 30, 50, 100), labels = c("Young", "Middle-aged", "Senior")))

# Save and reload the cleaned dataset
write_csv(credit, "real_credit.csv")
credit <- read_csv("real_credit.csv")

## Rows: 1000 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Sex, Job, Housing, Saving_accounts, Checking_account, Purpose, AgeG...
## dbl (4): Age, Credit_amount, Duration, Risk
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Visualization

# Create histograms for numerical variables
credit %>%
  select_if(is.numeric) %>%
  select(-Risk) %>% 
  gather(key = "Variable", value = "Value") %>%
  group_by(Variable) %>%
  mutate(mean_value = mean(Value, na.rm = TRUE)) %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  geom_vline(aes(xintercept = mean_value), color = "red", linetype = "dashed") +
  facet_wrap(~Variable, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Numerical Variables", x = "Value", y = "Count")

Age: The distribution is right-skewed, meaning there are more younger individuals in the data. The majority of individuals are between 20 and 40 years old, with a smaller number of individuals in the older age groups.
Credit_amount: The distribution is also right-skewed, indicating that most individuals are borrowing smaller amounts. There is a long tail to the right, suggesting that a few individuals are borrowing significantly larger amounts.
Duration: The distribution is relatively uniform, with no clear skew. This suggests that the loan durations are evenly distributed across the different periods.

# Create a bar plot for the 'Risk' variable
ggplot(credit, aes(x = factor(Risk), fill = factor(Risk))) +
  geom_bar() +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal() +
  labs(title = "Distribution of Risk", x = "Risk", y = "Count")

The majority of individuals in the dataset are not at risk of defaulting on their loans (Risk = 0). A smaller number of individuals are classified as being at risk of defaulting (Risk = 1). This suggests that the bank has a relatively low rate of loan defaults.

numeric_columns <- credit %>% select_if(is.numeric)
cor_matrix <- cor(numeric_columns, use = "complete.obs")

corrplot(cor_matrix, method = "circle", type = "lower", tl.cex=0.8)

There is a weak negative correlation between age and credit amount. This suggests that older individuals tend to borrow slightly less than younger individuals, but the relationship is not very strong.
There is a moderate negative correlation between age and duration. This suggests that older individuals tend to take out loans for shorter durations than younger individuals.
There is a strong positive correlation between credit amount and duration. This suggests that the bank may be lending larger amounts to individuals willing to commit to longer loan terms.
The weak or no correlations between Risk and the other variables suggest that factors other than age, credit amount, and duration may be more important determinants of credit risk.

#Create a function to visualize numerical variables by risk
boxplot_function <- function(credit, x) {
  ggplot(credit, aes_string(x = "factor(Risk)", y = x, fill = "factor(Risk)")) +
    geom_boxplot() +
    scale_fill_brewer(palette = "Set2") +  
    theme_minimal() +
    labs(x = "Risk", y = x) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

boxplot_function(credit, "Duration")

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The median loan duration for at-risk applicants is significantly higher than for non-risk applicants. There are a few outliers in low-risk applicants, indicating some low-risk applicants have much longer loan durations, but they are rare.

High Risk : There is no significant presence of outliers.

Longer loan durations are associated with a higher risk of default. Applicants who take loans for extended periods tend to be more likely to default. Shorter durations are often seen among low-risk individuals, indicating a potential strategy for the bank might be to limit loan durations to reduce default risks.

boxplot_function(credit, "Age")

The median age for Risk 0 is slightly higher (~40 years) than for Risk 1 (~35 years). Individuals in the high-risk group tend to be slightly younger than those in the low-risk group.

For both risk categories, there are outliers, particularly older individuals who are far above the typical age range.

boxplot_function(credit, "Credit_amount")

This boxplot compares the credit amounts for people classified as “at risk” and “not at risk.”

At-risk applicants tend to request higher credit amounts compared to non-risk applicants. The median credit amount is higher for those at risk, with a larger range of credit values. There are also many outliers, particularly for the at-risk group, showing that some applicants take on significantly higher credit amounts.

#Function for categorical variables
barplot_function <- function(credit, x) {
  x_sym <- sym(x)
  ggplot(credit, aes(x = !!x_sym, fill = factor(Risk))) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set2") +
    theme_minimal() +
    labs(x = x, y = "Count") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

barplot_function(credit, "Sex")

Male applicants are more likely to be classified as both “not at risk” and “at risk” compared to female applicants. However, the proportion of males classified as “not at risk” is significantly higher than females. This suggests that male applicants make up a larger portion of the loan applications, but females appear to have a slightly lower risk proportion.

barplot_function(credit, "AgeGroup")

The bar plot shows the distribution of individuals based on their age group and risk status.

The proportion of individuals who are at risk (Risk = 1) is higher among young and senior individuals compared to middle-aged individuals. Young and senior individuals may be more vulnerable to financial shocks and may have a higher risk of default compared to middle-aged individuals.

Middle-aged individuals and risk: Middle-aged individuals may have established financial stability and may be less likely to face financial difficulties, leading to a lower risk of default.

barplot_function(credit, "Job")

The bar plot shows the distribution of credit loan applicants based on their job category and risk status

Individuals with higher levels of education and skills (highly skilled and skilled) may have a lower risk of defaulting on loans compared to those with lower levels of education or those who are not residents. The Unskilled and non-resident applicants may be more vulnerable to financial shocks and may have a higher risk of defaulting on loans.

barplot_function(credit, "Housing")

This bar plot compares the count of individuals in each housing category (free, own, rent) based on their risk status.

Individuals who own their homes appear to have a lower risk of defaulting on loans compared to those who rent or live rent-free. This might suggest that home ownership provides financial stability and reduces the likelihood of financial difficulties. Individuals who live rent-free have a higher proportion of being at risk. This could be due to various factors, such as lower income levels, unstable living situations, or other financial challenges associated with not owning a home.

barplot_function(credit, "Saving_accounts")

The bar plot shows the distribution of individuals based on their savings account balance (little, moderate, quite rich, rich).

Individuals with higher savings balances appear to have a lower risk of defaulting on loans compared to those with lower or moderate savings balances. This might suggest that having savings can provide financial stability and reduce the likelihood of financial difficulties. Low savings and risk: Individuals with little or moderate savings may be more vulnerable to financial shocks and may be at a higher risk of defaulting on loans.

barplot_function(credit, "Checking_account")

The bar plot shows the distribution of individuals based on their checking account balance (little, moderate, rich) and risk status.

Individuals with moderate or rich balances.appear to have a lower risk of defaulting on loans compared to those with lower checking account balances. This might suggest that having a sufficient balance in a checking account can provide financial stability and reduce the likelihood of financial difficulties.

barplot_function(credit, "Purpose")

The bar plot shows the distribution of individuals based on the purpose of their loans and Risk status.

The proportion of individuals who are at risk varies across different loan purposes. loans for business and domestic appliances have a higher proportion of individuals at risk, while loans for education and repairs have a lower proportion.

The risk of defaulting on a loan may be influenced by the purpose of the loan. Loans for business and domestic appliances, which may involve higher financial risks and uncertainties, might be associated with a higher risk of default. Conversely, loans for education and repairs, which may be more personal and less risky, might have a lower risk of default.

# Function to plot a barplot for Distribution of Categorical Variables
plot_bar <- function(credit, column_name) {
  # Convert character columns to factors for proper plotting
  if (is.character(credit[[column_name]])) {
    credit[[column_name]] <- as.factor(credit[[column_name]])
  }
  # Create the barplot 
  ggplot(credit, aes_string(x = column_name)) +
    geom_bar(aes_string(fill = column_name)) +
    labs(
      title = paste("Distribution of", column_name),
      x = column_name,
      y = "Count"
    ) +
    scale_fill_hue(h = c(0, 360)) + 
    scale_y_continuous(breaks = seq(0, max(table(credit[[column_name]])), by = 20)) +  #
    theme_minimal() +
    geom_text(
      stat = "count",
      aes(label = paste0(round(..count.. / sum(..count..) * 100, 1), "%")),
      vjust = -0.5
    ) +
    theme(legend.title = element_blank())
}

plot_bar(credit, "Housing")

## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The majority of credit loan Applicants own their own homes (71.3%), followed by those who rent (17.9%), and then those who live rent-free (10.8%).

plot_bar(credit, "AgeGroup")

Middle-aged individuals are the most common age group among credit loan applicants. This may suggest that this age group has a higher propensity to seek credit loans.

plot_bar(credit, "Sex")

Male applicants outnumber female applicants in the dataset. This may suggest that there are more male individuals seeking credit loans.

plot_bar(credit, "Job")

Highly skilled applicants are the most common job category among credit loan applicants. This may suggest that individuals with higher levels of education and skills are more likely to seek credit loans. While there is a significant number of unskilled and non-resident applicants, they are still outnumbered by highly skilled applicants. This may indicate that individuals with lower levels of education or those who are not residents may have more difficulty accessing credit loans.

plot_bar(credit, "Saving_accounts")

A large proportion of credit loan applicants have little savings. This may suggest that financial challenges and limited savings are common among individuals seeking credit loans. While a smaller percentage of applicants have moderate, quite rich, or rich savings balances, these individuals may be in a stronger financial position.

plot_bar(credit, "Checking_account")

A higher number of credit loan applicants have limited funds in their checking accounts. This may suggest that financial challenges and limited liquidity are common among individuals seeking credit loans. While a smaller percentage of applicants have rich checking account balances, these individuals may be in a stronger financial position.

plot_bar(credit, "Purpose")

Business loans and loans for personal purchases (car, domestic appliances) are the most popular among credit loan applicants. This suggests that these types of loans are in high demand. A variety of other loan purposes are less frequent, including education, furniture/equipment, radio/TV, repairs, and vacation/others.

MODEL BUILDING

# Remove 'Age' column from the dataset and use AgeGroup for modeling
credit <- credit %>% select(-Age)

# Convert categorical data to factors for modeling
credit$AgeGroup <- as.factor(credit$AgeGroup)
credit$Sex <- as.factor(credit$Sex)
credit$Job <- as.factor(credit$Job)
credit$Housing <- as.factor(credit$Housing)
credit$Saving_accounts <- as.factor(credit$Saving_accounts)
credit$Checking_account <- as.factor(credit$Checking_account)
credit$Purpose <- as.factor(credit$Purpose)
credit$Risk <- as.factor(credit$Risk)

#Split data in to training and testing sets
set.seed(419)
trainIndex <- createDataPartition(credit$Risk, p= 0.7, list = FALSE)
traindata <- credit[trainIndex,]
testdata <-  credit [-trainIndex,]

#Check Class dstribution in Training and Test data
print("Class Distribution in Training Data:")

## [1] "Class Distribution in Training Data:"

print(table(traindata$Risk))

## 
##   0   1 
## 490 210

print("Class Distribution in Test Data:")

## [1] "Class Distribution in Test Data:"

print(table(testdata$Risk))

## 
##   0   1 
## 210  90

# Original class distribution
prop.table(table(credit$Risk))

## 
##   0   1 
## 0.7 0.3

prop.table(table(testdata$Risk))

## 
##   0   1 
## 0.7 0.3

prop.table(table(traindata$Risk))

## 
##   0   1 
## 0.7 0.3

Both the training and test sets have a consistent class distribution, with 70% of the data representing good credit risks and 30% representing bad credit risks. This is to check if the data splitting maintained the original proportions of the classes.

# Balance the data using 
balance_traindata <- ovun.sample(Risk ~ ., data = traindata, method = "both", p = 0.5, seed = 123)$data

Balance the training data by equalizing the number of instances in each class (low-risk and high-risk) for better model performance.

Logistic Regression

# Train the GLM model with logistic regression
#gmodel <- glm(Risk ~ ., data = balance_traindata, family = binomial())

 #Save to reload the model later and use it for predictions or further analysis without having to  retrain it
#saveRDS(gmodel, file = "credit_risk_model_glm.rds")

# Splitting the dataset into features (X) and target (y)
gX_train <- traindata[, setdiff(names(traindata), "Risk")]
gy_train <- traindata$Risk

gX_test <- testdata[, setdiff(names(testdata), "Risk")]
gy_test <- testdata$Risk

# Load the logistic model
gmodel <- readRDS("credit_risk_model_glm.rds")

# Function to evaluate the model
evaluate_model <- function(model, X, y) {
  # Predict probabilities on the provided dataset
  logistic_predict <- predict(model, X, type = "response")
  
  # Convert probabilities to binary predictions using 0.5 threshold
  gprediction <- ifelse(logistic_predict > 0.5, 1, 0)
  
  # Generate the confusion matrix
  confusion_matrix <- confusionMatrix(factor(gprediction, levels = c(0, 1)), factor(y, levels = c(0, 1)))
  
  # Extract metrics
  metrics <- data.frame(
    Accuracy = confusion_matrix$overall['Accuracy'],
    Recall = confusion_matrix$byClass['Sensitivity'],
    Precision = confusion_matrix$byClass['Pos Pred Value'],
    F1 = confusion_matrix$byClass['F1']
  )
  
  # Return both metrics and confusion matrix
  return(list(
    Metrics = metrics,
    ConfusionMatrix = confusion_matrix$table  
  ))
}

# Evaluate the model on the test dataset
evaluation <- evaluate_model(gmodel, gX_test, gy_test)

# Convert confusion matrix to a data frame for ggplot
cm_table <- as.data.frame(evaluation$ConfusionMatrix)

# Plot the confusion matrix
ggplot(cm_table, aes(x = Reference, y = Prediction)) +
  geom_tile(aes(fill = Freq), color = "white") +
  geom_text(aes(label = Freq), vjust = 1) +
  scale_fill_gradient(low = "lightblue", high = "blue") +
  theme_minimal() +
  labs(title = "Confusion Matrix", x = "True Class", y = "Predicted Class")

# Evaluate the model on the training dataset
train_results <- evaluate_model(gmodel, gX_train, gy_train)
cat("Training Performance:\n")

## Training Performance:

print(train_results$Metrics)

##          Accuracy    Recall Precision        F1
## Accuracy     0.65 0.6897959 0.7842227 0.7339848

# Evaluate the model on the test dataset
test_results <- evaluate_model(gmodel, gX_test, gy_test)
cat("Testing Performance:\n")

## Testing Performance:

print(test_results$Metrics)

##          Accuracy    Recall Precision        F1
## Accuracy     0.62 0.6571429 0.7666667 0.7076923

Decision Tree

#Train the decision tree model
#model <- rpart(Risk ~ ., data = balance_traindata, method = "class", control = rpart.control(cp = 0.01))

#saveRDS(gmodel, file = "credit_risk_model.rds")

model <- readRDS("credit_risk_model.rds")
predict_credit <- predict(model, testdata, type = "class")

confMatrix <- confusionMatrix(predict_credit, testdata$Risk)
# Convert the confusion matrix table to a data frame for ggplot
cm_table <- as.data.frame(confMatrix$table)

# Plot the confusion matrix
ggplot(cm_table, aes(x = Reference, y = Prediction)) +
  geom_tile(aes(fill = Freq), color = "white") +
  geom_text(aes(label = Freq), vjust = 1) +
  scale_fill_gradient(low = "lightblue", high = "blue") +
  theme_minimal() +
  labs(title = "Confusion Matrix", x = "True Class", y = "Predicted Class")

The model correctly identified 143 low-risk customers who were not likely to default.
These are 67 customers wrongly labeled as high risk even though they were actually low risk.
These 39 customers were mistakenly identified as low risk when in reality, they were high risk.
The model correctly identified 51 high-risk customers who are likely to default.

# Function to evaluate the model
evaluate_model <- function(model, X, y) {
  predictions <- predict(model, X, type = "class")
  confusionmatrix <- confusionMatrix(predictions, y)
  # Extract metrics
  metrics <- data.frame(
    Accuracy = confusionmatrix$overall['Accuracy'],
    Recall = confusionmatrix$byClass['Sensitivity'],
    Precision = confusionmatrix$byClass['Pos Pred Value'],
    F1 = confusionmatrix$byClass['F1']
  )
  
  return(metrics)
}

# Extract features and target variable from the training and test datasets
set.seed(419)
X_train <- traindata[, setdiff(names(traindata), "Risk")]
y_train <- traindata$Risk
X_test <- testdata[, setdiff(names(testdata), "Risk")]
y_test <- testdata$Risk

# Evaluate model on training set
train_metrics <- evaluate_model(model, X_train, y_train)
cat("Training Performance:\n")

## Training Performance:

print(train_metrics)

##           Accuracy    Recall Precision        F1
## Accuracy 0.6828571 0.7142857 0.8101852 0.7592191

# Evaluate model on test set
test_metrics <- evaluate_model(model, X_test, y_test)
cat("Testing Performance:\n")

## Testing Performance:

print(test_metrics)

##           Accuracy    Recall Precision        F1
## Accuracy 0.6466667 0.6809524 0.7857143 0.7295918

The model is good at identifying low-risk cases and has a moderate ability to detect high-risk cases but misses some of them.

It has a higher F1 scores than the Logistic Regression Model. The F1 score, which balances precision and recall, is especially important in this credit risk scenario.

The Decision Tree model is more reliable for minimizing financial losses while ensuring accurate loan approvals compared to the Logistic Regression model.

# Extract feature importance from the decision tree model
importance <- model$variable.importance
  # Convert to a data frame for ggplot
  importance_df <- data.frame(
    Feature = names(importance),
    Importance = importance
  )
  # Sort the data frame by importance
  importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE), ]
  
  # Plot the feature importance 
  library(ggplot2)
  ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
    geom_bar(stat = "identity", fill = "violet") +
    coord_flip() +
    labs(title = "Feature Importance for Credit Risk Prediction",
         x = "Feature",
         y = "Relative Importance") +
    theme_minimal()

The most influential feature is Credit_amount, followed by Duration, Checking_account, and Saving_accounts. These variables have the highest importance in the model, while demographic features like Sex and AgeGroup contribute less to the prediction of credit risk.

CONCLUSION

In conclusion, this case study highlights the importance of accurately assessing credit risk in the banking sector to minimize potential losses and make informed loan approval decisions.By analyzing demographic and socio-economic attributes of loan applicants, I developed a predictive model that aids in determining whether a prospective applicant is likely to default.

This model, along with interactive data visualizations and exploratory tools, is accessible through a dedicated application. Banks can utilize this Application . to analyze the credit data and obtain real-time predictions on applicant creditworthiness.

Credit Loan Risk Analysis

Yusuf Anigilaje

2024-09-15