This is a demo of an end-to-end implementation of well-known classification methods using machine learning (ML) techniques in the context of credit risk measurement in R. This demo is organized as follows:

1 Classification using XGBoost in R

This section focuses on the implementation of the gradient boosting technique for classification problems, including credit default forecasting. We do not cover the theory or math behind the algorithm here and leave it for another time. However, the interested reader can consult several online sources to get a sense of the theory behind the gradient boosting algorithm.

“Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.” See A gentle intro to xgboost for a quick background on gradient boosting and XGBoost.
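To make the residual-fitting idea concrete, here is a toy sketch on simulated data; loess() merely stands in for the weak learners, so this illustrates the boosting principle rather than xgboost's actual tree-based implementation:

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

fit1  <- loess(y ~ x)                    # first model
res1  <- y - predict(fit1)               # its residuals (errors)
fit2  <- loess(res1 ~ x)                 # second model predicts the residuals
final <- predict(fit1) + predict(fit2)   # predictions are added together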

Note 1: It is worth highlighting that the xgboost package:

  • manages only numeric vectors. That is, if our dataset contains non-numeric features (e.g., factors), we need to transform them into numeric ones. This can be achieved via a one-hot encoding step.

  • can manage huge datasets very efficiently, so we can use a sparse matrix as the input fed into the model.

Both items above can be simultaneously achieved by calling the sparse.model.matrix() function.

In what follows, we provide a step-by-step implementation of the xgboost algorithm for a credit loan default forecasting problem.

  • The first step in working with the xgboost package is to install it in R. This package, along with the Matrix package, can be installed and loaded as follows:
# Install the packages first if needed: install.packages(c("xgboost", "Matrix"))
pack = c("xgboost", "Matrix")

# Load both packages
sapply(pack, require, character.only = TRUE)

# Clear the workspace
rm(list = ls())
  • In the second step we change the format of the input dataset to the one discussed in Note 1 as follows:
# One-hot encoding

dtrain = sparse.model.matrix(Default ~ . - 1, data = train)

# We take the label column out of the matrix and treat it separately
# as a numeric vector. That is, if the label vector is not numeric,
# we need to create a numeric equivalent vector.

dtest = sparse.model.matrix(Default ~ . - 1, data = test)

train.label  <- train$Default
test.label   <- test$Default

The formula Default ~ . - 1 used above means: transform all categorical features, except the column Default (the label vector), to binary values. The -1 removes the intercept column, which would otherwise be full of 1s (it is generated by the conversion).
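As a quick illustration of what this encoding produces, consider the following toy data frame (hypothetical data, not part of our credit dataset); the factor Region is expanded into 0/1 indicator columns while the label Default is kept aside:

toy <- data.frame(Default = c(1, 0, 1),
                  Region  = factor(c("North", "South", "North")),
                  Income  = c(40, 55, 32))

sparse.model.matrix(Default ~ . - 1, data = toy)
# A 3 x 3 sparse matrix with columns RegionNorth, RegionSouth, and Income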

  • In the next step we need to train the model. For simplicity, we first define a list of the parameters we want to use in the model and then fit the model itself.
#----------------- Train the model ------------------------------------------

# Set our hyperparameters
param <- list(objective   = "binary:logistic",
              eval_metric = "error",
              max_depth   = 7,
              eta         = 0.1,
              gamma       = 1)

set.seed(1234)

# Pass in our hyperparameters and train the model.
# Note the use of <- inside system.time(): an = here would be parsed as
# a named argument to system.time() and fail.
system.time(xgb <- xgboost(params  = param,
                           data    = dtrain,
                           label   = train.label,
                           nrounds = 500,
                           print_every_n = 100,
                           verbose = 1))

The default booster for the xgboost() function is the “gbtree” booster; the “gblinear” booster is another option. For a complete list of options available for the xgboost() function, please consult the xgboost package documentation in R.
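As a sketch, switching boosters only requires changing the parameter list; the settings below are illustrative choices, not tuned values:

param.lin <- list(booster     = "gblinear",
                  objective   = "binary:logistic",
                  eval_metric = "error")

xgb.lin <- xgboost(params  = param.lin,
                   data    = dtrain,
                   label   = train.label,
                   nrounds = 500,
                   verbose = 0)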

Note 2: It is worth mentioning that we can tune our hyperparameters using different approaches, among which is the Bayesian optimization technique. We later explain in detail how to perform hyperparameter tuning using Bayesian optimization in R.
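As a preview, a minimal sketch of such a search is given below, assuming the rBayesianOptimization package is installed; the objective function, bounds, and iteration counts are illustrative choices only:

library(rBayesianOptimization)

# Cross-validated error for a given set of hyperparameters
xgb_cv_bayes <- function(max_depth, eta, gamma) {
  cv <- xgb.cv(params = list(objective   = "binary:logistic",
                             eval_metric = "error",
                             max_depth   = max_depth,
                             eta         = eta,
                             gamma       = gamma),
               data = dtrain, label = train.label,
               nrounds = 100, nfold = 5, verbose = 0)
  # BayesianOptimization() maximizes Score, so we negate the CV error
  list(Score = -min(cv$evaluation_log$test_error_mean), Pred = 0)
}

opt <- BayesianOptimization(xgb_cv_bayes,
                            bounds = list(max_depth = c(3L, 10L),
                                          eta       = c(0.01, 0.3),
                                          gamma     = c(0, 5)),
                            init_points = 5, n_iter = 10)

opt$Best_Par  # the best hyperparameters found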

  • In the next step we check the model performance by applying the fitted model to the test dataset and calculating the confusion matrix.
#------------------------ Confusion matrix ---------------------

# confusionMatrix() comes from the caret package
library(caret)

# Create our prediction probabilities
pred = predict(xgb, dtest)

# Set our cutoff threshold
pred.resp = ifelse(pred >= 0.5, 1, 0)

# Create the confusion matrix (caret expects factors)
confusionMatrix(factor(pred.resp), factor(test.label), positive = "1")
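Beyond the confusion matrix, the predicted probabilities can also be scored with an ROC AUC; this assumes the pROC package is installed:

library(pROC)
auc(roc(test.label, pred))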

2 Classification using deep learning in R

This approach is thoroughly discussed in Deep Learning in R.