Definition of Credit Risk

Credit Risk vs Default Risk vs Counterparty Risk

  • Credit risk, broadly, is the risk of financial loss arising from a borrower’s or counterparty’s failure to meet contractual obligations; default risk and counterparty risk are two specific forms of it.

  • Default risk specifically refers to the risk of non-payment or failure to meet contractual obligations by a borrower or issuer of debt. It focuses on the probability of a borrower defaulting on their financial obligations, such as interest payments or principal repayments. Default risk is typically assessed by credit rating agencies, which assign ratings reflecting the likelihood of default.

  • Counterparty risk pertains to the risk of loss arising from the potential default or failure of a trading partner or counterparty in a financial transaction. It is particularly relevant in derivatives trading, where parties are exposed to the risk of their counterparties not fulfilling their contractual obligations. Counterparty risk extends beyond default risk, as it includes other risks like settlement risk, operational risk, and legal risk associated with the counterparty.2

How to measure credit risk? Let’s look at PD, LGD, and EAD

  • EAD = Exposure At Default represents the amount of exposure or outstanding balance at the time of a borrower’s default. It captures the potential maximum loss if default occurs.

  • PD = Probability of Default represents the likelihood of a borrower or counterparty defaulting on their financial obligations within a specific time horizon.

  • LGD = Loss Given Default represents the expected loss that a lender or investor would incur in the event of a borrower’s default. It is the proportion of the exposure that is not recoverable after default and is typically expressed as a percentage of the exposure amount. Together, the three components give the expected loss, EL = PD × LGD × EAD, as sketched below.
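
A minimal numerical sketch of that expected-loss calculation in R (the figures below are made up for illustration, not taken from the German credit data):

# Expected loss from its three components (illustrative numbers only)
PD  <- 0.02      # 2% probability of default over the horizon
LGD <- 0.45      # 45% of the exposure is lost if default occurs
EAD <- 100000    # exposure at default, in currency units

EL <- PD * LGD * EAD
EL               # expected loss = 900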

Today’s Stuff: Credit Risk Modelling using Machine Learning with R

Next Class!

If you like this class, you may also like…

Data Case: The German Credit Dataset

# Clear Working Space
rm(list = ls())

# Import the dataset (you can also import it manually using the "Import Dataset" button in the upper-right Environment pane)
data <- read.table("~/Downloads/german.data")

# If you prefer to work with a csv copy instead, write one out first:
# write.csv(data, file = "german.csv")
# After importing german.csv manually (it gains a row-index column), drop that first column:
# data <- german[, 2:22]

# Data Description
# See: https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data) 

# Define column names
colnames(data) <- c("CHK", "DUR", "CRH", "PUR", "CRA",
                 "SAB", "EMY", "INT", "STA", "GUA",
                 "RES", "PRP", "AGE", "OIP", "HOU", 
                 "CRD", "JOB", "PPL", "TEL", "FOR", "TYP")

# Data Structure 
str(data)

# Attribute Information:
# 
# Attribute 1: (qualitative)
# Status of existing checking account - CHK
# A11 : ... < 0 DM
# A12 : 0 <= ... < 200 DM
# A13 : ... >= 200 DM / salary assignments for at least 1 year
# A14 : no checking account
# 
# Attribute 2: (numerical)
# Duration in month - DUR
# 
# Attribute 3: (qualitative)
# Credit history - CRH
# A30 : no credits taken/ all credits paid back duly
# A31 : all credits at this bank paid back duly
# A32 : existing credits paid back duly till now
# A33 : delay in paying off in the past
# A34 : critical account/ other credits existing (not at this bank)
# 
# Attribute 4: (qualitative)
# Purpose - PUR
# A40 : car (new)
# A41 : car (used)
# A42 : furniture/equipment
# A43 : radio/television
# A44 : domestic appliances
# A45 : repairs
# A46 : education
# A47 : (vacation - does not exist?)
# A48 : retraining
# A49 : business
# A410 : others
# 
# Attribute 5: (numerical)
# Credit amount - CRA
# 
# Attribute 6: (qualitative)
# Savings account/bonds - SAB
# A61 : ... < 100 DM
# A62 : 100 <= ... < 500 DM
# A63 : 500 <= ... < 1000 DM
# A64 : ... >= 1000 DM
# A65 : unknown/ no savings account
# 
# Attribute 7: (qualitative)
# Present employment since - EMY 
# A71 : unemployed
# A72 : ... < 1 year
# A73 : 1 <= ... < 4 years
# A74 : 4 <= ... < 7 years
# A75 : ... >= 7 years
# 
# Attribute 8: (numerical)
# Installment rate in percentage of disposable income - INT
# 
# Attribute 9: (qualitative)
# Personal status and sex - STA
# A91 : male : divorced/separated
# A92 : female : divorced/separated/married
# A93 : male : single
# A94 : male : married/widowed
# A95 : female : single
# 
# Attribute 10: (qualitative)
# Other debtors / guarantors - GUA
# A101 : none
# A102 : co-applicant
# A103 : guarantor
# 
# Attribute 11: (numerical)
# Present residence since - RES
# 
# Attribute 12: (qualitative)
# Property - PRP
# A121 : real estate
# A122 : if not A121 : building society savings agreement/ life insurance
# A123 : if not A121/A122 : car or other, not in attribute 6
# A124 : unknown / no property
# 
# Attribute 13: (numerical)
# Age in years - AGE
# 
# Attribute 14: (qualitative)
# Other installment plans - OIP
# A141 : bank
# A142 : stores
# A143 : none
# 
# Attribute 15: (qualitative)
# Housing - HOU
# A151 : rent
# A152 : own
# A153 : for free
# 
# Attribute 16: (numerical)
# Number of existing credits at this bank - CRD
# 
# Attribute 17: (qualitative)
# Job - JOB
# A171 : unemployed/ unskilled - non-resident
# A172 : unskilled - resident
# A173 : skilled employee / official
# A174 : management/ self-employed/ highly qualified employee/ officer
# 
# Attribute 18: (numerical)
# Number of people being liable to provide maintenance for - PPL 
# 
# Attribute 19: (qualitative)
# Telephone - TEL
# A191 : none
# A192 : yes, registered under the customer's name
# 
# Attribute 20: (qualitative)
# foreign worker - FOR
# A201 : yes
# A202 : no
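# 
# Column 21: (label)
# Credit classification - TYP
# 1 : good
# 2 : bad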

# Data Cleaning 

# Define: Good Type = 0 and Bad Type = 1
data$TYP <- data$TYP -1 
summary(data$TYP)

# Expand the dataset (include dummy variables)
x.con <- data[,c(2,5,8,11,13,16,18)]

x.dum.1 <- model.matrix(~ CHK - 1, data)
x.dum.2 <- model.matrix(~ CRH - 1, data)
x.dum.3 <- model.matrix(~ PUR - 1, data)
x.dum.4 <- model.matrix(~ SAB - 1, data)
x.dum.5 <- model.matrix(~ EMY - 1, data)
x.dum.6 <- model.matrix(~ STA - 1, data)
x.dum.7 <- model.matrix(~ GUA - 1, data)
x.dum.8 <- model.matrix(~ PRP - 1, data)
x.dum.9 <- model.matrix(~ OIP - 1, data)
x.dum.10 <- model.matrix(~ HOU - 1, data)
x.dum.11 <- model.matrix(~ JOB - 1, data)
x.dum.12 <- model.matrix(~ TEL - 1, data)
x.dum.13 <- model.matrix(~ FOR - 1, data)
x.dum <- cbind(x.dum.1, x.dum.2, x.dum.3, x.dum.4, x.dum.5,
               x.dum.6, x.dum.7, x.dum.8, x.dum.9, x.dum.10,
               x.dum.11, x.dum.12, x.dum.13)

x <- cbind(x.con, x.dum)

rm(x.con, x.dum, x.dum.1, x.dum.2, x.dum.3, x.dum.4, x.dum.5,
        x.dum.6, x.dum.7, x.dum.8, x.dum.9, x.dum.10, x.dum.11,
        x.dum.12, x.dum.13)

y <- data[,21]

# Define a function: min-max normalization by column
factor_normalize <- function(x) {
  # Rescale each column to the [0, 1] range
  apply(x, 2, function(col) (col - min(col)) / (max(col) - min(col)))
}

# Apply factor normalization to the data (the result is a numeric matrix)
x <- factor_normalize(x)

### Now we have two datasets:

### "data" is the original data: 20 factors and 1 label ("TYP")
### "x" is expanded from the original 20-factor dataset to 61-factor dataset including dummy variables; and "y" is the label  

# Data Exploration 
psych::describe(data)
psych::describe(x)
psych::describe(y)

# Data Visualization
# install.packages("scatterPlotMatrix")
library(scatterPlotMatrix)
scatterPlotMatrix(data)
scatterPlotMatrix(data, zAxisDim = "TYP")
scatterPlotMatrix(data, zAxisDim = "TYP", regressionType = 1)

Basic Modelling Framework Using Logistic Regression

# Model estimation using Logistic regression 
mod_1 <- glm(TYP ~ 1, data = data, family = "binomial")
summary(mod_1)

mod_full <- glm(TYP ~ ., data = data, family = "binomial")
summary(mod_full)

# Type Prediction
prob.pre <- predict.glm(mod_full, type = "response")
type.pre <- ifelse(prob.pre <= 0.5, 0, 1)
type.act <- data$TYP

# Compare the Prediction with the Actual Data
# We will compute the following metrics to evaluate the model:
#   1. Confusion matrix
#   2. Accuracy 
#   3. F1 Score
#   4. Receiver Operating Curve (ROC)
#   5. Area Under ROC (AUROC or AUC)
  
# I will create a function called "accuracyFun" - that takes prob.pre, type.pre and type.act as inputs - to compute and return all of the above metrics as outputs

accuracyFun <- function(prob.pre, type.pre, type.act){
  
  # Confusion matrix (rows: predicted, columns: actual)
  mod.tab <- table(type.pre, type.act)
  
  tp <- mod.tab[2,2]
  tn <- mod.tab[1,1]
  fp <- mod.tab[2,1]
  fn <- mod.tab[1,2]
  tot <- tp + tn + fp + fn

  # Accuracy and F1 score
  mod.acc <- (tp + tn) / tot
  mod.f1  <- (2 * tp) / (2 * tp + fp + fn)
  
  # Draw the ROC curve and calculate the Area Under ROC (AUROC)
  # install.packages("pROC")
  library(pROC)
  mod.roc <- roc(response = as.numeric(type.act), predictor = as.numeric(prob.pre))
  plot(mod.roc)
  mod.auc <- auc(mod.roc)
  
  ans <- list(confusion = mod.tab, accuracy = mod.acc, f1 = mod.f1, auc = mod.auc)
  return(ans)
}

accuracyFun(prob.pre, type.pre, type.act)

Let’s Write The Laziest Machine Learning Code!

# Import the data
data <- read.table("~/Downloads/german.data")

# ... maybe a bit of cleaning 
colnames(data) <- c("CHK", "DUR", "CRH", "PUR", "CRA",
                 "SAB", "EMY", "INT", "STA", "GUA",
                 "RES", "PRP", "AGE", "OIP", "HOU", 
                 "CRD", "JOB", "PPL", "TEL", "FOR", "TYP")

data$TYP <- data$TYP -1 

# Let's throw all the factors in and estimate the model 
mod_full <- glm(TYP ~ ., data = data, family = "binomial")

# ... and predict 
prob.pre <- predict.glm(mod_full, type = "response")
type.pre <- ifelse(prob.pre <= 0.5, 0, 1)
type.act <- data$TYP

# ... and evaluate!

accuracyFun(prob.pre, type.pre, type.act)

Can we do better?

Can we do better? Stepwise Regression

# Stepwise GLM: AIC-based selection in both directions, starting from the full model
mod_step <- step(mod_full)
formula(mod_step)
summary(mod_step)

# ... and predict 
prob.pre.step <- predict.glm(mod_step, type = "response")
type.pre.step <- ifelse(prob.pre.step <= 0.5, 0, 1)
type.act.step <- data$TYP

# ... and evaluate!
accuracyFun(prob.pre.step, type.pre.step, type.act.step)

Can we do better? Regularization

LASSO: Least Absolute Shrinkage and Selection Operator

  • In lasso regression, the objective is to minimize the sum of squared residuals, similar to ordinary linear regression, while simultaneously adding a penalty term based on the absolute values of the regression coefficients. This penalty term encourages sparsity in the coefficient estimates, effectively shrinking some coefficients to zero, resulting in automatic feature selection.3

  • The lasso regression model can be represented mathematically as:

    minimize: (1 / (2 * n)) * ||Y - X * β||^2 + λ * ||β||_1

    Where:

    • Y represents the response variable.

    • X represents the matrix of predictor variables.

    • β represents the vector of regression coefficients.

    • λ (lambda) is the tuning parameter that controls the strength of the regularization.

    • By varying the value of λ, you can control the amount of regularization applied. Higher values of λ lead to more coefficients being set to zero, resulting in a sparser model.

Ridge Regression

  • Ridge regression is another type of regularized linear regression technique that helps mitigate the issue of multicollinearity and overfitting in predictive models. It adds a penalty term to the least squares objective function, which is based on the sum of squared regression coefficients.

  • The ridge regression model can be represented mathematically as:

    minimize: (1 / (2 * n)) * ||Y - X * β||^2 + λ * ||β||_2^2

    Where:

    • Y represents the response variable.

    • X represents the matrix of predictor variables.

    • β represents the vector of regression coefficients.

    • λ (lambda) is the tuning parameter that controls the strength of the regularization.

    • By varying the value of λ, you can control the amount of regularization applied. Higher values of λ increase the amount of shrinkage applied to the coefficients, resulting in a more constrained model.4

Elastic Net

  • Elastic Net is a regularized regression technique that combines both L1 (Lasso) and L2 (Ridge) regularization penalties. It is useful when dealing with high-dimensional datasets and situations where there are many correlated predictor variables.

  • The elastic net regression model can be represented mathematically as:

    minimize: (1 / (2 * n)) * ||Y - X * β||^2 + λ1 * ||β||_1 + λ2 * ||β||_2^2

    Where:

    • Y represents the response variable.

    • X represents the matrix of predictor variables.

    • β represents the vector of regression coefficients.

    • λ1 and λ2 are the tuning parameters that control the strength of the L1 and L2 regularization, respectively.

    • The λ1 parameter controls the amount of L1 regularization (sparsity-inducing) applied, similar to the Lasso penalty. The λ2 parameter controls the amount of L2 regularization (shrinkage) applied, similar to the Ridge penalty.

    • By adjusting the values of λ1 and λ2, you can control the trade-off between sparsity and shrinkage in the model.5

When to use LASSO, Ridge, Elastic Net?

  • From a practical standpoint, L1 (LASSO) tends to shrink coefficients to zero whereas L2 (RIDGE) tends to shrink coefficients evenly.

  • L1 is therefore useful for feature selection, as we can drop any variables associated with coefficients that go to zero. L2, on the other hand, is useful when you have collinear/codependent features.

# install.packages("glmnet")
library(glmnet)

# We will use "x" and "y" for glmnet. See https://glmnet.stanford.edu/ for the manual.

#############################
# Regularization: LASSO
#############################

# Just set the alpha to "1"

mod.lasso <- glmnet(x, y, family = "binomial", alpha = 1, lambda = NULL)
print(mod.lasso)
plot(mod.lasso)
coef(mod.lasso, s = 0.01)

# Once we decide on the value of lambda (or "s") in the formula, we can proceed to the prediction and evaluation parts

# ... and predict 
prob.pre.lasso <- predict(mod.lasso, x, s = 0.01, type = "response")
type.pre.lasso <- ifelse(prob.pre.lasso <= 0.5, 0, 1)
type.act.lasso <- y

# ... and evaluate!
accuracyFun(prob.pre.lasso, type.pre.lasso, type.act.lasso)

### You can vary/experiment with the s (or lambda) and see if the results improve!
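
Rather than eyeballing s, a common approach is to let glmnet choose lambda by cross-validation with cv.glmnet(). A minimal sketch (the 10-fold default and type.measure = "auc" are my choices, not part of the original code):

# Choose lambda by cross-validation (LASSO: alpha = 1)
set.seed(123)
cv.lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1, type.measure = "auc")
plot(cv.lasso)          # cross-validated AUC along the lambda path
cv.lasso$lambda.min     # lambda with the best cross-validated AUC
cv.lasso$lambda.1se     # a sparser choice within one standard error of the best

# The selected lambda can then be plugged into the prediction step via "s"
prob.pre.cv <- predict(cv.lasso, x, s = "lambda.min", type = "response")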

#############################
# Regularization: RIDGE
#############################

# Just set the alpha to "0"

mod.ridge <- glmnet(x, y, family = "binomial", alpha = 0, lambda = NULL)
print(mod.ridge)
plot(mod.ridge)
coef(mod.ridge, s = 1)

# Once we decide on the value of lambda (or "s") in the formula, we can proceed to the prediction and evaluation parts

# ... and predict 
prob.pre.ridge <- predict(mod.ridge, x, s = 0.01, type = "response")
type.pre.ridge <- ifelse(prob.pre.ridge <= 0.5, 0, 1)
type.act.ridge <- y

# ... and evaluate!
accuracyFun(prob.pre.ridge, type.pre.ridge, type.act.ridge)

#############################
# Regularization: ELASTIC NET
#############################

# Now alpha can vary between 0 and 1. Let's say alpha = 0.25

mod.elas <- glmnet(x, y, family = "binomial", alpha = 0.25, lambda = NULL)
print(mod.elas)
plot(mod.elas)
coef(mod.elas, s = 0.08)

# Once we decide on the value of lambda (or "s") in the formula, we can proceed to the prediction and evaluation parts

# ... and predict 
prob.pre.elas <- predict(mod.elas, x, s = 0.08, type = "response")
type.pre.elas <- ifelse(prob.pre.elas <= 0.5, 0, 1)
type.act.elas <- y

# ... and evaluate!
accuracyFun(prob.pre.elas, type.pre.elas, type.act.elas)

Can we do better? K-Fold Cross Validation

#####################################
# A Simple Cross Validation 
#####################################

# Randomly split data into train vs test datasets: 80% vs 20%

set.seed(123)
train_percentage <- 0.8
train_samples <- floor(train_percentage * nrow(data))
train_indices <- sample(seq_len(nrow(data)), train_samples)
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]

# Train the model on the train dataset
# Let's throw all the factors in and estimate the model 
mod_full_train <- glm(TYP ~ ., data = train_data, family = "binomial")

# ... and predict 
prob.pre.train <- predict.glm(mod_full_train, type = "response")
type.pre.train <- ifelse(prob.pre.train <= 0.5, 0, 1)
type.act.train <- train_data$TYP

# ... and evaluate!
accuracyFun(prob.pre.train,type.pre.train,type.act.train)

# Test the model on the test dataset 
# Use the trained model to predict the default probability on the test data
prob.pre.test <- predict.glm(mod_full_train, type = "response", newdata = test_data)
type.pre.test <- ifelse(prob.pre.test <= 0.5, 0, 1)
type.act.test <- test_data$TYP

# ... and evaluate
accuracyFun(prob.pre.test, type.pre.test, type.act.test)

#############################################
# 5-fold Cross Validation (The Short Version)
#############################################

# Define the number of folds for cross-validation
numFold <- 5

# Perform k-fold cross-validation
# install.packages("caret")
library(caret)

set.seed(123)
folds <- caret::createFolds(data$TYP, k = numFold, list = TRUE)

accuracy <- numeric(numFold)

for (i in 1:numFold) {
  # Split data into training and testing sets based on the fold indices
  train_indices <- unlist(folds[-i])
  test_indices <- folds[[i]]
  
  train <- data[train_indices,]
  x_test <- data[test_indices,1:20]
  y_test <- data[test_indices,21]
  
  # Build and train your model
  model <- glm(TYP ~ ., data = train, family = "binomial")
  
  # Make predictions on the test set
  y_pred <- ifelse(predict(model, x_test, type = "response")<=0.5,0,1)
  
  # Evaluate the model's performance
  accuracy[i] <- sum(y_pred == y_test) / length(y_test)
}

mean(accuracy)
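
caret can also run the whole cross-validation loop for you via trainControl() and train(). A sketch under the assumption that the outcome is recoded as a factor (the "good"/"bad" labels below are mine, not part of the dataset):

# 5-fold CV handled entirely by caret (classification needs a factor outcome)
cv_data <- data
cv_data$TYP <- factor(cv_data$TYP, labels = c("good", "bad"))

set.seed(123)
ctrl <- trainControl(method = "cv", number = 5)
mod_cv <- train(TYP ~ ., data = cv_data, method = "glm",
                family = "binomial", trControl = ctrl)
mod_cv   # reports the cross-validated accuracy (and kappa)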

Can we do better? Apply the (not-so) new modelling techniques

Neural Networks

  • A deep neural network (DNN) is a type of artificial neural network (ANN) that consists of multiple layers of interconnected nodes (neurons). It is designed to learn and model complex patterns and relationships in data by leveraging a hierarchical representation of features. Here’s a quick introduction to its main features:

    • Architecture: A DNN typically consists of an input layer, one or more hidden layers, and an output layer. Each layer is composed of multiple neurons, and the neurons are interconnected with weights.

    • Deep Learning: The “deep” in deep neural networks refers to the presence of multiple hidden layers. Having multiple layers allows the network to learn hierarchical representations of the input data, enabling it to capture increasingly abstract features as the information flows through the network.

    • Activation Functions: Neurons within each layer apply an activation function to the weighted sum of their inputs to introduce non-linearities into the network. Common activation functions include ReLU (Rectified Linear Unit), sigmoid, and tanh.

    • Forward Propagation: During the forward propagation phase, data is fed into the network, and computations are performed layer by layer. Each layer applies a weighted sum followed by the activation function to generate the output (a small base-R sketch of this step follows the list).

    • Backpropagation: The backpropagation algorithm is used to train the DNN. It involves computing the prediction error (or loss) and propagating it backward through the network to update the weights using gradient descent optimization. This process iteratively adjusts the weights to minimize the prediction error.
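
To make the forward pass concrete, here is a tiny base-R sketch of one observation flowing through a 3-input, 4-hidden-neuron, 1-output network; the weights are random placeholders, not a trained credit model:

# One forward pass through a 3-input -> 4-hidden -> 1-output network
relu    <- function(z) pmax(0, z)
sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(1)
x_in <- c(0.2, -1.5, 0.7)            # one observation with 3 features
W1   <- matrix(rnorm(3 * 4), 3, 4)   # input-to-hidden weights
b1   <- rep(0, 4)                    # hidden-layer biases
W2   <- matrix(rnorm(4), 4, 1)       # hidden-to-output weights
b2   <- 0

h   <- relu(x_in %*% W1 + b1)        # hidden layer: weighted sum + activation
out <- sigmoid(h %*% W2 + b2)        # output layer: a probability-like score
out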

Why do NNs work? We have the “Universal Approximation Theorem”

  • The Universal Approximation Theorem is a fundamental result in the field of artificial neural networks, which states that a feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function to arbitrary precision on a compact subset of the input space. In other words, given enough neurons in a single hidden layer, a neural network can approximate any continuous function.

Choice of Activation Function

Here is a list of popular activation functions commonly used in neural networks (a short R sketch of the first four follows the list):

  1. Sigmoid Function (Logistic Function): The sigmoid function is defined as f(x) = 1 / (1 + e^(-x)). It maps the input to a value between 0 and 1, which is useful for binary classification problems or as a smooth approximation of a step function.

  2. Hyperbolic Tangent Function (Tanh): The tanh function is defined as f(x) = (e^x - e^(-x)) / (e^x + e^(-x)). It maps the input to a value between -1 and 1, and it is symmetric around the origin. It is commonly used in hidden layers of neural networks.

  3. Rectified Linear Unit (ReLU): The ReLU function is defined as f(x) = max(0, x). It returns 0 for negative inputs and the input value for positive inputs. ReLU has become widely used in deep learning due to its simplicity and ability to alleviate the vanishing gradient problem.

  4. Softmax Function: The softmax function is commonly used in the output layer of a neural network for multi-class classification tasks. It takes a vector of real numbers as input and converts them into a probability distribution, where the sum of the probabilities is equal to 1. It is defined as f(x_i) = e^(x_i) / (Σ(e^(x_j))) for each element x_i in the input vector.

  5. There are other types such as leaky ReLU, parametric ReLU, Exponential LU etc.
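
The first four functions above are one-liners in R; a quick sketch to compare their outputs (tanh_fn simply writes out what base R's tanh() already computes):

# The activation functions from the list above
sigmoid <- function(x) 1 / (1 + exp(-x))
tanh_fn <- function(x) (exp(x) - exp(-x)) / (exp(x) + exp(-x))
relu    <- function(x) pmax(0, x)
softmax <- function(x) exp(x) / sum(exp(x))

z <- seq(-4, 4, by = 0.1)
plot(z, sigmoid(z), type = "l", ylim = c(-1, 2), ylab = "activation")
lines(z, tanh_fn(z), lty = 2)
lines(z, relu(z),    lty = 3)

softmax(c(1, 2, 3))   # sums to 1: a probability distribution over three classes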

# install.packages("keras")
library(keras)

# Prepare the dataset for the deep neural nets 
# I will use the datasets "x" and "y" here. We are supposed to organize the data this way for keras
# See https://tensorflow.rstudio.com/guides/keras/basics


# Split the data into training and testing sets (60% train / 40% test)
set.seed(123)
train_percentage <- 0.6
train_indices <- sample(seq_len(nrow(x)), floor(train_percentage * nrow(x)))

x_train <- x[train_indices,]
y_train <- y[train_indices]

x_test <- x[-train_indices,]
y_test <- y[-train_indices]

# Define the neural network model
mod.nn <- keras_model_sequential()

mod.nn %>%
  layer_dense(units = 10, activation = "relu", input_shape = 61) %>%
  layer_dense(units = 4, activation = "relu") %>%
  layer_dense(units = 1, activation =  "sigmoid") 

# Compile the model
mod.nn %>% compile(
  loss = "binary_crossentropy",
  optimizer = "adam",
  metrics = c("accuracy")
)

# Train the model
history <- mod.nn %>% fit(
  x_train, y_train,
  epochs = 20,
  batch_size = 32,
  validation_data = list(x_test, y_test)
)

# Evaluate the model
evaluation <- mod.nn %>% evaluate(x_test, y_test) 
print(evaluation)

# predicted probability

prob.pre.nn <- predict(mod.nn, x_test) 
type.pre.nn <- ifelse(prob.pre.nn <= 0.5,0,1)
type.act.nn <- y_test

# ... and evaluate!
accuracyFun(prob.pre.nn,type.pre.nn,type.act.nn)

What Have We Learned Today?

Next Up!


  1. International Monetary Fund (IMF). (2018). Global Financial Stability Report: A Bumpy Road Ahead. Retrieved from https://www.imf.org/en/Publications/GFSR/Issues/2018/09/25/global-financial-stability-report-october-2018↩︎

  2. ↩︎

  3. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58(1), 267-288.

    Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1-22.↩︎

  4. Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67.

    Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58(1), 267-288.↩︎

  5. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.

    Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1-22.↩︎

  6. https://www.researchgate.net/profile/David-West/publication/223425357_Neural_Network_Credit_Scoring_Models/links/5ae9c71c45851588dd826629/Neural-Network-Credit-Scoring-Models.pdf

    https://arxiv.org/abs/2207.08815↩︎