Introduction to Low Default Portfolios (LDPs)

The Basel Accord provides no formal definition of a low default portfolio (LDP). The Bank of England earlier suggested 20 as the minimum number of required defaults to begin modeling (Prudential Regulation Authority 2013). Hence, if you have fewer than 20 defaults, you definitely have a low default portfolio. In practice, whether a portfolio counts as low default depends not only on its quality, but also on the ratio of default cases to nondefault ones.

For example, suppose you have two classes: A (for example, Nondefaulter) and B (for example, Defaulter). Class A is 90% of your data set and class B is the other 10%, but you are most interested in identifying instances of class B. You can reach an accuracy of 90% by simply predicting class A every time, but this provides a useless classifier for your intended use case. Instead, a properly calibrated method may achieve a lower accuracy, but would have a substantially higher true positive rate (or recall), which is really the metric you should have been optimizing for. These scenarios often occur in the context of detection, such as for abusive content online, or disease markers in medical data.
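
The 90%/10% example above can be checked in a few lines of base R. This is only an illustrative sketch on simulated labels; the object names are made up for the example.

```r
# Simulated labels: 90% class A (nondefaulter), 10% class B (defaulter)
actual <- factor(rep(c("A", "B"), times = c(900, 100)), levels = c("A", "B"))

# A "classifier" that always predicts the majority class A
predicted <- factor(rep("A", length(actual)), levels = c("A", "B"))

# Accuracy looks good, but no class B case is ever found
accuracy <- mean(predicted == actual)
recall_B <- sum(predicted == "B" & actual == "B") / sum(actual == "B")

accuracy   # 0.9
recall_B   # 0
```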

In the context of credit scoring and the classification of loan applications, the situation described above is called the “imbalance problem”, “imbalanced data”, or a low default portfolio. Imbalanced data is recognized as one of the major problems in data analysis, data mining and machine learning, because most statistical models and machine learning algorithms implicitly assume that the classes are roughly equally represented (unskewed data). With imbalanced data, majority classes dominate minority classes, which biases classifiers towards the majority classes and causes poor classification of the minority classes. A classifier may even assign all the test data to the majority class. Here are a few practical settings where class imbalance often occurs:

  • Online advertising: An advertisement is presented to a viewer, which creates an impression. The click-through rate is the number of times an ad was clicked on divided by the total number of impressions and tends to be very low (Richardson et al. 2007 cite a rate less than 2.4%).

  • Pharmaceutical research: High-throughput screening is an experimental technique where large numbers of molecules (tens of thousands) are rapidly evaluated for biological activity. Usually only a few molecules show high activity; therefore, the frequency of interesting compounds is low.

  • Insurance sector: Artis et al. (2002) investigated auto insurance damage claims in Spain between the years of 1993 and 1996. Of claims undergoing auditing, the rate of fraud was estimated to be approximately 22%.

Difficulties When Handling Low Default Portfolios

Low default portfolios are quite common in a financial setting. A popular example is exposures to sovereigns; very few countries have gone into default in the past. Other examples are exposures to banks, insurance companies, and project finance, which is finance for large projects such as building highways or nuclear reactors. Exposures to large corporations and/or specialized lending are additional examples. When you bring new products to the market, it will also take some time before you have the necessary number of defaults to estimate standard credit risk models.

For low default portfolios, you typically lack modeling data, especially default data, which makes it very difficult to apply the advanced internal ratings based (IRB) approach, in which you need to estimate the probability of default (PD), the loss given default (LGD), and the exposure at default (EAD). Historical average default rates are not appropriate since they have been calculated on only a few observations. Because of data scarcity, the credit risk can thus be substantially underestimated or overestimated. This is a significant problem, especially given the fact that a substantial portion of a bank’s assets might consist of low default portfolios.

Here are some statements made by the Basel Committee Accord Implementation Group’s Validation Subgroup on the issue of low default portfolios (Basel Committee on Banking Supervision 2005), in the context of scorecard modelling and loan application classification:

  • “LDPs should not, by their very nature, automatically be excluded from IRB treatment.”

  • “…an additional set of rules or principles specifically applying to LDPs is neither necessary nor desirable.”

  • “…relatively sparse data might require increased reliance on alternative data sources and data enhancing tools for quantification and alternative techniques for validation.”

  • “…LDPs should not be considered or treated as conceptually different from other portfolios.”

The Financial Services Authority (FSA), which was the predecessor of the Prudential Regulation Authority (PRA) in the United Kingdom, earlier also explicitly confirmed that it should be possible to include a firm’s LDPs in the IRB approach (see Financial Services Authority 2006a, Section 7).

In rare disease classification, a machine learning or statistical model may suffer from the accuracy paradox, which makes it difficult to control false positives (Type I errors) and false negatives (Type II errors). A patient may have a rare disease, yet the model will not predict it because most of the data comes from patients without the disease. In the example of loan classification, the goal is to identify whether a loan application will default or not. Because most cases are nondefault, the model tends to classify defaulting applications as nondefault.

Select the Right Evaluation Metrics

For imbalanced data sets, applying inappropriate evaluation metrics can be dangerous. Imagine training data in which only 0.2% of the samples belong to class “1”. If accuracy is used to measure the goodness of a model, a model which classifies all testing samples into “0” will have an excellent accuracy (99.8%), but obviously, this model won’t provide any valuable information for us. In this case, alternative evaluation metrics can be applied, such as:

  • AUC: area under the ROC curve, which plots the true positive rate against the false positive rate across all cutoffs.

  • Precision (positive predictive value): how many selected instances are relevant.

  • Recall/Sensitivity: how many relevant instances are selected.

  • F1 score: harmonic mean of precision and recall.
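
As a small illustration of how these metrics relate to each other, the sketch below computes them from the four cells of a confusion matrix. The helper `class_metrics()` and the counts are hypothetical and only serve the example; the "positive" class is taken to be the rare one (for example, default).

```r
# Hypothetical helper: metrics from confusion-matrix counts
# tp / fp / fn / tn = true positives, false positives, false negatives, true negatives
class_metrics <- function(tp, fp, fn, tn) {
  precision   <- tp / (tp + fp)            # share of predicted positives that are real
  recall      <- tp / (tp + fn)            # share of real positives that are found
  specificity <- tn / (tn + fp)            # share of real negatives that are found
  f1          <- 2 * precision * recall / (precision + recall)
  accuracy    <- (tp + tn) / (tp + fp + fn + tn)
  c(accuracy = accuracy, precision = precision, recall = recall,
    specificity = specificity, f1 = f1)
}

# Illustrative counts: 1000 cases, 20 of them positive,
# the model finds 10 of the positives and raises 30 false alarms
m <- class_metrics(tp = 10, fp = 30, fn = 10, tn = 950)
round(m, 3)
```

With these made-up counts the accuracy is 0.96 while precision is only 0.25 and recall 0.5, which is the kind of gap that accuracy alone hides.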

Remedies for Low Default Portfolios

Default risk data sets often have a very skewed target class distribution where typically only about 1 percent or even less of the transactions are defaulters. Obviously, this creates problems for the analytical techniques discussed earlier since they are being flooded by all the nondefault observations and will thus tend toward classifying every observation as nondefault. Think about decision trees, for example: If they start from a data set with 99 percent/1 percent nondefault/default observations, then the entropy is already very low and hence it is very likely that the decision tree does not find any useful split and classifies all observations as nondefault, thereby achieving a classification accuracy of 99 percent, but essentially detecting none of the defaulters. It is thus recommended to increase the number of default observations or their weight, such that the analytical techniques can pay better attention to them. Various remedies are possible to do this and will be outlined in what follows.
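
To see how low the entropy of a 99 percent/1 percent node already is, it can be compared with that of a balanced node. The sketch below uses the usual Shannon entropy in bits; the function name `entropy()` is just an illustrative helper.

```r
# Shannon entropy (in bits) of a vector of class proportions
entropy <- function(p) {
  p <- p[p > 0]            # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

balanced_entropy <- entropy(c(0.5, 0.5))    # 1 bit: maximal uncertainty
skewed_entropy   <- entropy(c(0.99, 0.01))  # about 0.081 bits

c(balanced = balanced_entropy, skewed = skewed_entropy)
```

Since a split can reduce the entropy by at most 0.081 bits in the skewed case, a tree has very little to gain from splitting at all.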

Solution 1: Approach Based on Sampling Technique

When there is a priori knowledge of a class imbalance, one straightforward method to reduce its impact on model training is to select a training set sample to have roughly equal event rates during the initial data collection (see, e.g., Artis et al. 2002). Basically, instead of having the model deal with the imbalance, we can attempt to balance the class frequencies. Taking this approach eliminates the fundamental imbalance issue that plagues model training. However, if the training set is sampled to be balanced, the test set should be sampled to be more consistent with the state of nature and should reflect the imbalance so that honest estimates of future performance can be computed.

Two general post hoc approaches are under-sampling (or downsampling) and up-sampling (or oversampling) the data. Up-sampling is any technique that simulates or imputes additional data points to improve balance across classes, while down-sampling refers to any technique that reduces the number of samples to improve the balance across classes.

More specifically, the first way to increase the weight of the defaulters is either to oversample (up-sample) them or to undersample the nondefaulters. With oversampling, the idea is to replicate the defaulters two or more times so as to make the distribution less skewed. Ling and Li (1998) provide one approach to up-sampling in which cases from the minority classes are sampled with replacement until each class has approximately the same number. For the insurance data, the training set contained 6466 customers without a policy and 411 insured customers. If we keep the original minority class data, adding 6055 random samples (with replacement) would bring the minority class equal to the majority. In doing this, some minority class samples may show up in the training set with a fairly high frequency while each sample in the majority class has a single realization in the data. This is very similar to a case weight approach, with varying weights per case.
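
A minimal base R sketch of this up-sampling step is shown below. It uses a toy data frame that only reproduces the class counts quoted above (6466 and 411); the column names and class labels are assumptions made for the example.

```r
set.seed(1)

# Toy training set with the class counts quoted above (labels only, for brevity)
train <- data.frame(
  id    = 1:6877,
  class = rep(c("majority", "minority"), times = c(6466, 411))
)

minority_rows <- which(train$class == "minority")
n_extra <- sum(train$class == "majority") - length(minority_rows)   # 6055

# Keep every original row and add minority rows drawn with replacement
extra_rows <- sample(minority_rows, size = n_extra, replace = TRUE)
train_up   <- train[c(seq_len(nrow(train)), extra_rows), ]

table(train_up$class)
```

Both classes now have 6466 rows, and `table(extra_rows)` shows how often each minority case was replicated.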

In contrast, undersampling (or down-sampling) balances the data set by reducing the size of the abundant class. This method is used when the quantity of data is sufficient. By keeping all samples in the rare class and randomly selecting an equal number of samples in the abundant class, a balanced new data set can be retrieved for further modelling. Down-sampling selects data points from the majority class so that the majority class is roughly the same size as the minority class(es). There are several approaches to down-sampling. First, a basic approach is to randomly sample the majority classes so that all classes have approximately the same size. Another approach would be to take a bootstrap sample across all cases such that the classes are balanced in the bootstrap set. The advantage of this approach is that the bootstrap selection can be run many times so that the estimate of variation can be obtained about the down-sampling. One implementation of random forests can inherently down-sample by controlling the bootstrap sampling process within a stratification variable. If class is used as the stratification variable, then bootstrap samples will be created that are roughly the same size per class. These internally down-sampled versions of the training set are then used to construct trees in the ensemble.
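
The basic random down-sampling approach can be sketched in base R as follows. The toy data frame and its column names are assumptions for the example; it reuses the 6466/411 class counts mentioned earlier.

```r
set.seed(1)

# Toy training set: 6466 majority and 411 minority cases (labels only)
train <- data.frame(
  id    = 1:6877,
  class = rep(c("majority", "minority"), times = c(6466, 411))
)

majority_rows <- which(train$class == "majority")
minority_rows <- which(train$class == "minority")

# Keep all minority rows and an equally sized random subset of the majority rows
keep_majority <- sample(majority_rows, size = length(minority_rows))
train_down    <- train[c(keep_majority, minority_rows), ]

table(train_down$class)
```

The balanced set has only 822 rows, which shows why this method is mainly attractive when plenty of data is available.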

The third approach is a “mixed combination” of the two methods described above. The synthetic minority over-sampling technique (SMOTE), proposed by Chawla et al. (2002), is a data sampling procedure that uses both up-sampling and down-sampling, depending on the class, and has three operational parameters: the amount of up-sampling, the amount of down-sampling, and the number of neighbors that are used to impute new cases. To up-sample for the minority class, SMOTE synthesizes new cases. To do this, a data point is randomly selected from the minority class and its K-nearest neighbors (KNNs) are determined. Chawla et al. (2002) used five neighbors in their analyses, but different values can be used depending on the data. The new synthetic data point is a random combination of the predictors of the randomly selected data point and its neighbors. While the SMOTE algorithm adds new samples to the minority class via up-sampling, it also can down-sample cases from the majority class via random sampling in order to help balance the training set.
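
The up-sampling half of SMOTE can be sketched in base R for numeric predictors. This is a simplified illustration, not a full implementation: the helper `smote_sketch()` is hypothetical, it interpolates between the selected point and one randomly chosen neighbor, and it omits the down-sampling of the majority class.

```r
# Hypothetical helper: create n_new synthetic minority rows from numeric predictors
smote_sketch <- function(x_min, n_new, k = 5) {
  x_min <- as.matrix(x_min)
  stopifnot(nrow(x_min) > k)
  d <- as.matrix(dist(x_min))                 # pairwise Euclidean distances
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(x_min),
                  dimnames = list(NULL, colnames(x_min)))
  for (i in seq_len(n_new)) {
    j   <- sample(nrow(x_min), 1)             # randomly selected minority point
    nn  <- order(d[j, ])[2:(k + 1)]           # its k nearest neighbours (position 1 is itself)
    nb  <- nn[sample(length(nn), 1)]          # pick one of those neighbours
    gap <- runif(1)                           # random position between the two points
    synth[i, ] <- x_min[j, ] + gap * (x_min[nb, ] - x_min[j, ])
  }
  synth
}

set.seed(1)
x_min <- cbind(x1 = rnorm(20), x2 = rnorm(20))   # 20 simulated minority cases
new_points <- smote_sketch(x_min, n_new = 40)
dim(new_points)
```

Every synthetic point lies on a line segment between two existing minority cases, so the new cases stay inside the region already occupied by the minority class.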

Solution 2: Model Tuning

The simplest approach to counteracting the negative effects of class imbalance is to tune the model to maximize the accuracy of the minority class(es). For default loan prediction, tuning the model to maximize the sensitivity may help desensitize the training process to the high percentage of nondefault cases in the training set.

Solution 3: Search for the Optimal Threshold

When there are two possible outcome categories (such as in case of default loan prediction), another method for increasing the prediction accuracy of the minority class samples is to determine alternative cutoffs for the predicted probabilities which effectively changes the definition of a predicted event. The most straightforward approach is to use the ROC curve since it calculates the sensitivity and specificity across a continuum of cutoffs. Using this curve, an appropriate balance between sensitivity and specificity can be determined.
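
As a sketch of how a cutoff can be read off the ROC curve, the base R code below computes sensitivity and specificity over a grid of cutoffs and applies two common selection rules: the point closest to the perfect model (100% sensitivity and 100% specificity) and Youden's J index (J = sensitivity + specificity - 1). The data are simulated and all object names are illustrative.

```r
set.seed(1)

# Simulated outcome (about 5% events) and predicted event probabilities
y    <- rbinom(2000, 1, 0.05)
prob <- plogis(-3 + 2 * y + rnorm(2000))

# Sensitivity and specificity for each candidate cutoff
cutoffs <- seq(0.01, 0.99, by = 0.01)
sens <- sapply(cutoffs, function(th) mean(prob[y == 1] >= th))
spec <- sapply(cutoffs, function(th) mean(prob[y == 0] < th))

# Rule 1: largest Youden index; Rule 2: shortest distance to the top-left corner
youden      <- sens + spec - 1
dist_corner <- sqrt((1 - sens)^2 + (1 - spec)^2)

best_j <- cutoffs[which.max(youden)]
best_d <- cutoffs[which.min(dist_corner)]
c(youden = best_j, closest_to_corner = best_d)
```

With rare events the selected cutoffs typically sit well below the default of 0.5, which is what trades some specificity for a higher sensitivity.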

Several techniques exist for determining a new cutoff. First, if there is a particular target that must be met for the sensitivity or specificity, this point can be found on the ROC curve and the corresponding cutoff can be determined. Another approach is to find the point on the ROC curve that is closest (i.e., the shortest distance) to the perfect model (with 100% sensitivity and 100% specificity), which is associated with the upper left corner of the plot.

Another approach for determining the cutoff uses Youden’s J index (J = sensitivity + specificity - 1), which measures the proportion of correctly predicted samples for both the event and nonevent groups. This index can be computed for each cutoff that is used to create the ROC curve. The cutoff associated with the largest value of the Youden index may also show superior performance relative to the default 50% value.

Solution 4: Adjusting Prior Probabilities

Some models use prior probabilities, such as naïve Bayes and discriminant analysis classifiers. Unless specified manually, these models typically derive the value of the priors from the training data. Weiss and Provost (2001a) suggest that priors that reflect the natural class imbalance will materially bias predictions to the majority class. Using more balanced priors or a balanced training set may help deal with a class imbalance.

Solution 5: Cost-Sensitive Training

Instead of optimizing the typical performance measure, such as accuracy or impurity, some models can alternatively optimize a cost or loss function that differentially weights specific types of errors. For example, it may be appropriate to believe that misclassifying true events (false negatives) is X times as costly as incorrectly predicting nonevents (false positives). Incorporation of specific costs during model training may bias the model towards less frequent classes. Unlike using alternative cutoffs, unequal costs can affect the model parameters and thus have the potential to make true improvements to the classifier.
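
One simple way to make training cost-sensitive, among several, is to pass case weights to the model so that errors on events count more. The sketch below does this for a logistic regression with base R's `glm()`; the simulated data and the cost ratio of 10 are assumptions chosen only for illustration.

```r
set.seed(1)

# Simulated data: about 5% events, which tend to have larger x
n <- 2000
y <- rbinom(n, 1, 0.05)
x <- rnorm(n, mean = 1.5 * y)
d <- data.frame(y = y, x = x)

# Assumed cost ratio: a false negative is 10 times as costly as a false positive
d$w <- ifelse(d$y == 1, 10, 1)

fit_plain <- glm(y ~ x, family = binomial, data = d)
fit_cost  <- glm(y ~ x, family = binomial, data = d, weights = w)

# Sensitivity (recall) on the events at the default 0.5 cutoff
recall <- function(fit) {
  pred <- as.integer(predict(fit, newdata = d, type = "response") >= 0.5)
  sum(pred == 1 & d$y == 1) / sum(d$y == 1)
}

recall_plain <- recall(fit_plain)
recall_cost  <- recall(fit_cost)
c(plain = recall_plain, cost_sensitive = recall_cost)
```

The weighted fit changes the estimated coefficients themselves, not just the cutoff, and finds more of the events at the price of more false positives.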

A Real-world Application: Predicting Defaulters from Mortgage Applications

In this post I only present the sampling approach (over- and undersampling) for dealing with imbalanced data in practice. The data used is Caravan, which contains 5822 real customer records. Each record consists of 86 variables, containing sociodemographic data (variables 1-43) and product ownership (variables 44-86). The sociodemographic data is derived from zip codes. All customers living in areas with the same zip code have the same sociodemographic attributes. Variable 86 (Purchase) indicates whether the customer purchased a caravan insurance policy. Further information on the individual variables can be obtained at http://www.liacs.nl/~putten/library/cc2000/data.html.

For this data set, two technical aspects must be considered: (1) between-predictor correlations and (2) zero- and near zero-variance predictors.

Between-Predictor Correlations

Collinearity is the technical term for the situation where a pair of predictor variables have a substantial correlation with each other. It is also possible to have relationships between multiple predictors at once (called multicollinearity).

In general, there are good reasons to avoid data with highly correlated predictors. First, redundant predictors frequently add more complexity to the model than information they provide to the model. In situations where obtaining the predictor data is costly (either in time or money), fewer variables is obviously better. While this argument is mostly philosophical, there are mathematical disadvantages to having correlated predictor data. Using highly correlated predictors in techniques like linear regression can result in highly unstable models, numerical errors, and degraded predictive performance.

Since collinear predictors can inflate the variance of parameter estimates in linear regression, a statistic called the variance inflation factor (VIF) can be used to identify predictors that are impacted (Myers, 1994). Beyond linear regression, this method may be inadequate for several reasons: it was developed for linear models, it requires more samples than predictor variables, and, while it does identify collinear predictors, it does not determine which should be removed to resolve the problem.

A less theoretical, more heuristic approach to dealing with this issue is to remove the minimum number of predictors to ensure that all pairwise correlations are below a certain threshold. While this method only identifies collinearities in two dimensions, it can have a significantly positive effect on the performance of some models.

The algorithm is as follows:

  1. Calculate the correlation matrix of the predictors.

  2. Determine the two predictors associated with the largest absolute pairwise correlation (call them predictors A and B).

  3. Determine the average correlation between A and the other variables. Do the same for predictor B.

  4. If A has a larger average correlation, remove it; otherwise, remove predictor B.

  5. Repeat Steps 2–4 until no absolute correlations are above the threshold.

The idea is to first remove the predictors that have the most correlated relationships.

For example, suppose we wanted to use a model that is particularly sensitive to between-predictor correlations; we might apply a threshold of 0.75. This means that we want to eliminate the minimum number of predictors to achieve all pairwise correlations less than 0.75.
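
The five steps above can be sketched in base R as follows. The helper `remove_correlated()` is a hypothetical name for this example and assumes that all predictors are numeric with non-zero variance.

```r
# Hypothetical helper implementing the steps listed above
remove_correlated <- function(x, threshold = 0.75) {
  x <- as.data.frame(x)
  repeat {
    if (ncol(x) < 2) break
    cm <- abs(cor(x))                       # Step 1: absolute correlation matrix
    diag(cm) <- 0
    if (max(cm) <= threshold) break         # Step 5: stop when nothing exceeds the threshold
    pair <- which(cm == max(cm), arr.ind = TRUE)[1, ]   # Step 2: most correlated pair
    a <- pair[["row"]]
    b <- pair[["col"]]
    avg_a <- mean(cm[a, -a])                # Step 3: average correlation with the others
    avg_b <- mean(cm[b, -b])
    worst <- if (avg_a > avg_b) a else b    # Step 4: drop the more correlated of the two
    x <- x[, -worst, drop = FALSE]
  }
  x
}

# Simulated example: a and b are nearly identical, c is unrelated
set.seed(1)
z   <- rnorm(200)
dat <- data.frame(a = z + rnorm(200, sd = 0.1),
                  b = z + rnorm(200, sd = 0.1),
                  c = rnorm(200))
filtered <- remove_correlated(dat, threshold = 0.75)
names(filtered)
```

One of the two near-duplicate predictors is dropped and `c` is kept, so the remaining pairwise correlations are all below 0.75.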

Zero- and Near Zero-Variance Predictors

There are potential advantages to removing predictors prior to modeling. First, fewer predictors means decreased computational time and complexity. Second, if two predictors are highly correlated, this implies that they are measuring the same underlying information. Removing one should not compromise the performance of the model and might lead to a more parsimonious and interpretable model. Third, some models can be crippled by predictors with degenerate distributions. In these cases, there can be a significant improvement in model performance and/or stability without the problematic variables.

Consider a predictor variable that has a single unique value; we refer to this type of data as a zero variance predictor. For some models, such an uninformative variable may have little effect on the calculations. For example, a tree-based model is impervious to this type of predictor since it would never be used in a split. However, a model such as linear regression would find these data problematic and is likely to cause an error in the computations. In either case, these data have no information and can easily be discarded.

Similarly, some predictors might have only a handful of unique values that occur with very low frequencies. These “near-zero variance predictors” may have a single value for the vast majority of the samples.

How can the user diagnose this mode of problematic data? A rule of thumb for detecting near-zero variance predictors is:

  • The fraction of unique values over the sample size is low (say below 10%).

  • The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20).

If both of these criteria are true and the model in question is susceptible to this type of predictor, it may be advantageous to remove the variable from the model.
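
The rule of thumb can be written as a small base R check for a single predictor. The helper `is_near_zero_var()` and its default cutoffs (10% unique values, frequency ratio of 20) simply mirror the two criteria above and are not taken from any particular package.

```r
# Hypothetical helper applying the two rules of thumb to one predictor
is_near_zero_var <- function(x, unique_cut = 10, freq_cut = 20) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) == 1) return(TRUE)            # a single value: zero variance
  pct_unique <- 100 * length(counts) / length(x)   # unique values as % of sample size
  freq_ratio <- counts[[1]] / counts[[2]]          # most common vs second most common
  pct_unique < unique_cut && freq_ratio > freq_cut
}

sparse   <- c(rep(0, 980), rep(1, 15), rep(2, 5))  # 0.3% unique, ratio 980/15
balanced <- rep(0:1, each = 500)                   # 0.2% unique, ratio 1
constant <- rep(7, 100)

is_near_zero_var(sparse)     # TRUE
is_near_zero_var(balanced)   # FALSE
is_near_zero_var(constant)   # TRUE
```

Note that `balanced` also has few unique values, but it is not flagged because its two values are equally frequent; both criteria must hold.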

R Code

In this section I present R code for the oversampling and undersampling techniques.

The first stage is to determine the optimal ratio between the majority and minority classes. This process is done as follows. In the first step, an analytical model is built on the original data set with the skewed class distribution (for example, 95 percent/5 percent nondefaulters/defaulters). The area under the curve (AUC) of this model is recorded (possibly on an independent validation data set). In a next step, over- or undersampling is used to change the class distribution by 5 percent (for example, 90 percent/10 percent). Again, the AUC of the model is recorded. Subsequent models are built on samples of 85 percent/15 percent, 80 percent/20 percent, 75 percent/25 percent, and so on, each time recording their AUCs. Once the AUC starts to stagnate (or drop), the procedure stops and the optimal class ratio has been found.

The second stage is to compare model results based on some evaluation criteria.

R code for conducting over- and under-sampling:

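# Packages used below: caret (createDataPartition, train, confusionMatrix),
# dplyr (pipes, mutate, bind_rows, bind_cols) and magrittr (%<>%)
library(caret)
library(dplyr)
library(magrittr)

# Split data. remaining_df_after_zero is not defined in this excerpt; it is
# presumably the Caravan data after the preprocessing described above.
set.seed(1)
id <- createDataPartition(y = remaining_df_after_zero$Purchase, p = 0.5, list = FALSE)
df_train <- remaining_df_after_zero[id, ] # For training
df_test <- remaining_df_after_zero[-id, ] # For testing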


#-----------------------------------------------------------------------------
#  A function for calculating AUC / ROC based on 10 samples with given rate 
#  of minority class as described by oversampling technique 
#-----------------------------------------------------------------------------

library(ROSE)
library(pROC)

# Set conditions for training: 
  
set.seed(19950917)
ctrl <- trainControl(method = "repeatedcv",
                     number = 3,
                     repeats = 1,
                     summaryFunction = multiClassSummary, 
                     allowParallel = TRUE,
                     classProbs = TRUE)
  

my_auc_over <- function(minority_rate) {
  
  p <- minority_rate
  df_com <- tibble()  # data_frame() is deprecated; tibble() is re-exported by dplyr
  for (j in 1:10) {
    
    # Use oversampling for training logistic model: 
    set.seed(j)
    data_balanced_over <- ovun.sample(Purchase ~., data = df_train, p = p, method = "over")$data
    
    # Train logistic model: 
    my_logistic <- train(Purchase ~., method = "glm", trControl = ctrl, data = data_balanced_over)
    
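    # Calculate some model performance metrics: 
    
    test_pred <- predict(my_logistic, df_test)
    # caret expects the predictions first and the observed classes second
    cm <- confusionMatrix(data = test_pred, reference = df_test$Purchase)
    
    # Flatten the 2 x 2 table column by column (rows = predicted, columns = observed)
    bg_gg <- cm$table %>% 
      as.vector() %>% 
      matrix(ncol = 4) %>% 
      as.data.frame()
    
    # Names read <observed><predicted> with B = "No" and G = "Yes"
    names(bg_gg) <- c("BB", "BG", "GB", "GG")
    kq <- c(cm$overall, cm$byClass) 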
    
    ten <- kq %>% 
      as.data.frame() %>% 
      row.names()
    
    kq %>% 
      as.vector() %>% 
      matrix(ncol = 18) %>% 
      as.data.frame() -> all_df
    
    names(all_df) <- ten
    all_df <- bind_cols(all_df, bg_gg)
    
    # Calculate AUC: 
    pred <- predict(my_logistic, df_test, type = "prob") %>% pull(Yes)
    auc <- roc(df_test$Purchase, pred)$auc

    # Add AUC: 
    all_df %<>% mutate(AUC = auc %>% as.vector(), Minority_Rate = p)
    df_com <- bind_rows(df_com, all_df)

  }
  return(df_com)

}

# Minority rate from original data: 
Caravan$Purchase %>% table() / nrow(Caravan)
## .
##         No        Yes 
## 0.94022673 0.05977327
##    user  system elapsed 
##  101.25    3.19  104.80
## # A tibble: 9 x 2
##   Minority_Rate avg_auc
##           <dbl>   <dbl>
## 1          0.1    0.683
## 2          0.15   0.683
## 3          0.2    0.687
## 4          0.25   0.687
## 5          0.3    0.687
## 6          0.35   0.687
## 7          0.4    0.689
## 8          0.45   0.687
## 9          0.5    0.685
## # A tibble: 9 x 2
##   Minority_Rate avg_acc
##           <dbl>   <dbl>
## 1          0.1    0.935
## 2          0.15   0.923
## 3          0.2    0.907
## 4          0.25   0.885
## 5          0.3    0.857
## 6          0.35   0.820
## 7          0.4    0.777
## 8          0.45   0.728
## 9          0.5    0.680

##    user  system elapsed 
##   65.15    0.25   65.78
## # A tibble: 9 x 2
##   Majority_Rate avg_auc
##           <dbl>   <dbl>
## 1          0.5    0.665
## 2          0.55   0.652
## 3          0.6    0.655
## 4          0.65   0.646
## 5          0.7    0.636
## 6          0.75   0.646
## 7          0.8    0.609
## 8          0.85   0.589
## 9          0.9    0.562
## # A tibble: 9 x 2
##   Majority_Rate avg_acc
##           <dbl>   <dbl>
## 1          0.5    0.625
## 2          0.55   0.557
## 3          0.6    0.507
## 4          0.65   0.461
## 5          0.7    0.419
## 6          0.75   0.368
## 7          0.8    0.315
## 8          0.85   0.267
## 9          0.9    0.244

Some Key Conclusions

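From the empirical evidence on the Caravan data set, we can conclude that:

  1. Oversampling has a negligible impact on AUC (between 0.683 and 0.689 for every minority rate tried), although accuracy falls steadily as the minority rate rises (from 0.935 at a rate of 0.1 to 0.680 at 0.5).

  2. Undersampling has an adverse impact on AUC and accuracy: the best AUC is 0.665 and the best accuracy is 0.625, both lower than any oversampling result.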

The same conclusions are found from a post by Nina Zumel and John Mount: http://www.win-vector.com/blog/2015/02/does-balancing-classes-improve-classifier-performance/.

References

  1. Basel Committee on Banking Supervision. 2005. “Validation of Low-Default Portfolios in the Basel II Framework.” Basel Committee Newsletter no. 6, September.

  2. Artis M, Ayuso M, Guillen M (2002). “Detection of Automobile Insurance Fraud with Discrete Choice Models and Misclassified Claims.” The Journal of Risk and Insurance, 69(3), 325–340.

  3. Richardson M, Dominowska E, Ragno R (2007). “Predicting Clicks: Estimating the Click-Through Rate for New Ads.” In “Proceedings of the 16th International Conference on the World Wide Web,” pp. 521–530.

  4. Visa, S., & Ralescu, A. (2005, April). Issues in mining imbalanced data sets - a review paper. In Proceedings of the sixteen midwest artificial intelligence and cognitive science conference (Vol. 2005, pp. 67-73).

  5. Kotsiantis, S., Kanellopoulos, D., & Pintelas, P. (2006). Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering, 30(1), 25-36.

  6. Maloof, M. A. (2003, August). Learning when data sets are imbalanced and when costs are unequal and unknown. In ICML-2003 workshop on learning from imbalanced data sets II (Vol. 2, pp. 2-1).

  7. Japkowicz, N. (2003, August). Class imbalances: are we focusing on the right issue. In Workshop on Learning from Imbalanced Data Sets II (Vol. 1723, p. 63).

  8. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321-357.

  9. Myers R (1994). Classical and Modern Regression with Applications. PWS-KENT Publishing Company, Boston, MA, second edition.

---
title: "Problem of Low Default Portfolios (LDPs) in Credit Scoring and Modelling Scorecard" 
# subtitle: "The Serious Problem Must Be Handled"
author: "Nguyen Chi Dung"
output:
  html_document: 
    code_download: true
    # code_folding: hide
    highlight: pygments
    # number_sections: yes
    theme: "flatly"
    toc: TRUE
    toc_float: TRUE
---

```{r setup,include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```

### Introduction to Low Default Portfolios (LDPs)

The Basel Accord provides no formal definition of a low default portfolio (LDP). The Bank
of England earlier suggested 20 as the minimum number of required defaults to begin
modeling (Prudential Regulation Authority 2013). Hence, if you have fewer than 20
defaults, you definitely have a low default portfolio. The definition of a low default
portfolio strongly depends not only on the quality, but also on the ratio between default cases and nondefault ones. 

For example, suppose you have two classes: A (for example, Nondefaulter) and B (Default). Class A is 90% of your data-set and class B is the other 10%, but you are most interested in identifying instances of class B. You can reach an accuracy of 90% by simply predicting class A every time, but this provides a useless classifier for your intended use case. Instead, a properly calibrated method may achieve a lower accuracy, but would have a substantially higher true positive rate (or recall), which is really the metric you should have been optimizing for. These scenarios often occur in the context of detection, such as for abusive content online, or disease markers in medical data.

The situation described above is also called "imbalance problem" or "imbalanced data" or low default portfolio in context of credit scoring and classification of loan applications. The problem of imbalanced data is recognized as one of the major problems in the field of data analysis, data mining and machine learning as most statistical models as well as  machine learning algorithms assume that data is equally distributed (or unskewed data). In the case of imbalanced data, majority classes dominate over minority classes, causing the machine learning classifiers to be more biased towards majority classes. This causes poor classification of minority classes. Classifiers may even predict all the test data as majority classes. Here are a few practical settings where class imbalance often occurs:


- Online advertising: An advertisement is presented to a viewer which creates
an impression. The click through rate is the number of times an ad was
clicked on divided by the total number of impressions and tends to be very
low (Richardson et al. 2007 cite a rate less than 2.4%).

- Pharmaceutical research: High-throughput screening is an experimental
technique where large numbers of molecules (10000s) are rapidly evaluated
for biological activity. Usually only a few molecules show high activity;
therefore, the frequency of interesting compounds is low.

- Insurance Sector: Artis et al. (2002) investigated auto insurance damage
claims in Spain between the years of 1993 and 1996. Of claims undergoing
auditing, the rate of fraud was estimated to be approximately 22 %.


### Difficulties When Handling With Low Default Portfolios

Low default portfolios are quite common in a financial setting. A popular example
is exposures to sovereigns; very few countries have gone into default in the past.
Other examples are exposures to banks, insurance companies, and project finance,
which is finance for large projects such as building highways or nuclear reactors.
Exposures to large corporations and/or specialized lending are additional examples.
When you bring new products to the market, it will also take some time before you
have the necessary number of defaults to estimate standard credit risk models.

For low default portfolios, typically you have a lack of modeling data, especially default data, which makes it very difficult to apply the advanced
internal ratings based (IRB) approach, in which case you need to estimate the prob-
ability of default (PD), the LGD, and the EAD. Historical average default rates are
not appropriate since they have been calculated on only a few observations. Because
of data scarcity, the credit risk can thus be substantially underestimated or overesti-
mated. This is a significant problem, especially given the fact that a substantial portion
of a bank’s assets might consist of low default portfolios.

Here you can see some statements made by the Basel Committee Accord Implementation Group’s Validation Subgroup on the issue of low default portfolios (Basel
Committee on Banking Supervision 2005) in context of modelling scorecard and loan application classification: 

- “LDPs should not, by their very nature, automatically be excluded from IRB treatment.”

- “...an additional set of rules or principles specifically applying to LDPs is neither necessary nor desirable.”

- “...relatively sparse data might require increased reliance on alternative data sources and data enhancing tools for quantification and alternative techniques for validation.”

- “...LDPs should not be considered or treated as conceptually different from other portfolios.”

The Financial Services Authority (FSA), which was the predecessor of the Prudential Regulation Authority (PRA) in the United Kingdom, earlier also explicitly
confirmed that it should be possible to include a firm’s LDPs in the IRB approach (see Financial Services Authority 2006a, Section 7).

In rare disease classification, a machine learning or statistical model may suffer from the accuracy paradox: overall accuracy is high, yet it is difficult to control both false positives (Type I errors) and false negatives (Type II errors). A patient may have a rare disease, but the model will not predict it because most of the training data comes from patients without the disease. Likewise, in loan classification the goal is to identify whether a loan application will default. Because most cases are nondefault, the model tends to classify applications that will default as nondefault.


### Select the Right Evaluation Metrics

For imbalanced data sets, applying inappropriate evaluation metrics can be dangerous. Imagine our training data is the one illustrated in the graph above. If accuracy is used to measure the goodness of a model, a model that classifies all test samples as “0” will have an excellent accuracy (99.8%), but obviously this model won’t provide any valuable information. In this case, alternative evaluation metrics can be applied, such as:

- **AUC**: area under the ROC curve, which plots the true positive rate against the false positive rate across all cutoffs.

- **Precision (positive predictive value)**: how many selected instances are relevant.

- **Recall/Sensitivity**: how many relevant instances are selected.

- **F1 score**: harmonic mean of precision and recall.
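
To make these metrics concrete, here is a minimal sketch in base R. The confusion counts are hypothetical (a made-up test set of 1,000 loans with 10 defaults) and are only meant to show how a trivial "always nondefault" model can beat a useful model on accuracy while having zero recall.

```r
# Hypothetical confusion counts: 1,000 loans, 10 of them defaults.
tp <- 6    # defaults correctly flagged
fn <- 4    # defaults missed
fp <- 30   # nondefaults wrongly flagged
tn <- 960  # nondefaults correctly passed

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
precision   <- tp / (tp + fp)   # positive predictive value
recall      <- tp / (tp + fn)   # sensitivity, true positive rate
specificity <- tn / (tn + fp)   # true negative rate
f1          <- 2 * precision * recall / (precision + recall)

# A trivial model that always predicts "nondefault" on the same data:
trivial_accuracy <- (0 + 990) / 1000
trivial_recall   <- 0 / 10

round(c(accuracy = accuracy, precision = precision, recall = recall,
        specificity = specificity, f1 = f1), 3)
c(trivial_accuracy = trivial_accuracy, trivial_recall = trivial_recall)
```

In this toy case the trivial model has the higher accuracy (0.990 versus 0.966) but detects no defaults at all, which is why recall, precision, and F1 are more informative here.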

### Remedies for Low Default Portfolios

Default risk data sets often have a very skewed target class distribution where typically
only about 1 percent or even less of the transactions are defaulters. Obviously,
this creates problems for the analytical techniques discussed earlier since they are
being flooded by all the nondefault observations and will thus tend toward classifying
every observation as nondefault. Think about decision trees, for example: If
they start from a data set with 99 percent/1 percent nondefault/default observations,
then the entropy is already very low and hence it is very likely that the decision tree
does not find any useful split and classifies all observations as nondefault, thereby
achieving a classification accuracy of 99 percent, but essentially detecting none of
the defaulters. It is thus recommended to increase the number of default observations or their weight, such that the analytical techniques can pay better attention to
them. Various remedies are possible to do this and will be outlined in what follows.
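
As a quick check of the entropy claim, the following sketch computes the Shannon entropy (in bits) of a two-class node. The helper name `binary_entropy` is my own and not part of any package.

```r
# Shannon entropy (in bits) of a two-class node with default share p.
binary_entropy <- function(p) {
  q <- c(p, 1 - p)
  q <- q[q > 0]          # treat 0 * log2(0) as 0
  -sum(q * log2(q))
}

binary_entropy(0.50)  # balanced node: 1 bit
binary_entropy(0.01)  # 99/1 node: roughly 0.08 bits
```

A 99/1 node is already close to pure, so any split can reduce the entropy only marginally, which is why a tree may stop without splitting.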

### Solution 1: Approach Based on Sampling Technique

When there is a priori knowledge of a class imbalance, one straightforward
method to reduce its impact on model training is to select a training set
sample to have roughly equal event rates during the initial data collection
(see, e.g., Artis et al. 2002). Basically, instead of having the model deal with
the imbalance, we can attempt to balance the class frequencies. Taking this
approach eliminates the fundamental imbalance issue that plagues model
training. However, if the training set is sampled to be balanced, the test set
should be sampled to be more consistent with the state of nature and should
reflect the imbalance so that honest estimates of future performance can be
computed.


Two general post hoc approaches are **under-sampling** (or downsampling) and **up-sampling** (or oversampling) the data. Up-sampling is any technique that simulates or imputes additional data points to improve balance across classes, while down-sampling refers to any technique that reduces the number of samples to improve the
balance across classes.


More specifically, the weight of the defaulters can be increased either by oversampling (up-sampling) them
or by undersampling the nondefaulters. In oversampling, the idea is to replicate the defaulters two or more times so as to make the
distribution less skewed. Ling and Li (1998) provide one approach to up-sampling in which cases
from the minority classes are sampled with replacement until each class has
approximately the same number. For the insurance data, the training set
contained 6466 non-policy and 411 insured customers. If we keep the original
minority class data, adding 6055 random samples (with replacement) would
bring the minority class equal to the majority. In doing this, some minority
class samples may show up in the training set with a fairly high frequency
while each sample in the majority class has a single realization in the data.
This is very similar to the case weight approach shown in an earlier section,
with varying weights per case.
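
The up-sampling arithmetic above can be sketched in base R as follows. The class counts are the ones quoted in the text; the predictor `x` is simulated purely for illustration and the object names are my own.

```r
set.seed(42)

# Toy training set with the class counts quoted above (6466 vs. 411).
train <- data.frame(
  x     = rnorm(6466 + 411),
  class = factor(rep(c("majority", "minority"), times = c(6466, 411)))
)

minority_rows <- which(train$class == "minority")
n_extra <- sum(train$class == "majority") - length(minority_rows)  # 6055

# Draw the extra minority rows with replacement and append them.
extra_rows <- sample(minority_rows, size = n_extra, replace = TRUE)
train_up   <- rbind(train, train[extra_rows, ])

table(train_up$class)

# How often the most frequently duplicated minority row was drawn:
max(table(extra_rows))
```

Because the extra rows are exact copies, any resampling-based validation should be done before up-sampling, otherwise copies of the same case can end up in both the training and the hold-out data.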


In contrast, undersampling (or down-sampling) balances the dataset by reducing the size of the abundant class. This method is used when the quantity of data is sufficient. By keeping all samples in the rare class and randomly selecting an equal number of samples in the abundant class, a balanced new dataset can be retrieved for further modelling. Down-sampling selects data points from the majority class so that the majority
class is roughly the same size as the minority class(es). There are several
approaches to down-sampling. First, a basic approach is to randomly sample
the majority classes so that all classes have approximately the same size. Another approach would be to take a bootstrap sample across all cases such that
the classes are balanced in the bootstrap set. The advantage of this approach
is that the bootstrap selection can be run many times so that the estimate of variation can be obtained about the down-sampling. One implementation
of random forests can inherently down-sample by controlling the bootstrap
sampling process within a stratification variable. If class is used as the stratification variable, then bootstrap samples will be created that are roughly the
same size per class. These internally down-sampled versions of the training
set are then used to construct trees in the ensemble.
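
A minimal sketch of random down-sampling in base R is shown below. The data are simulated, `down_sample` is a hypothetical helper, and the repeated draws use simple random down-sampling rather than a full bootstrap, only to illustrate how the variation caused by down-sampling can be gauged.

```r
set.seed(42)

# Simulated training set: 6466 majority and 411 minority cases.
train <- data.frame(
  x     = rnorm(6466 + 411),
  class = factor(rep(c("majority", "minority"), times = c(6466, 411)))
)

# Row indices per class and the size of the smallest class.
rows_by_class <- split(seq_len(nrow(train)), train$class)
n_min <- min(lengths(rows_by_class))

# Keep n rows from every class (the minority class is kept whole).
down_sample <- function(rows_by_class, n) {
  unlist(lapply(rows_by_class, function(idx) sample(idx, size = n)),
         use.names = FALSE)
}

train_down <- train[down_sample(rows_by_class, n_min), ]
table(train_down$class)

# Repeat the draw many times to see how much a statistic varies
# purely because of which majority rows were retained.
means <- replicate(200, mean(train$x[down_sample(rows_by_class, n_min)]))
sd(means)
```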


The third approach is a "mixed combination" of the two methods described above. **The synthetic minority over-sampling technique (SMOTE)**, proposed by
Chawla et al. (2002), is a data sampling procedure that uses both up-sampling
and down-sampling, depending on the class, and has three operational parameters: the amount of up-sampling, the amount of down-sampling, and the
number of neighbors that are used to impute new cases. To up-sample for the
minority class, SMOTE synthesizes new cases. To do this, a data point is randomly selected from the minority class and its K-nearest neighbors (KNNs)
are determined. Chawla et al. (2002) used five neighbors in their analyses,
but different values can be used depending on the data. The new synthetic
data point is a random combination of the predictors of the randomly selected data point and its neighbors. While the SMOTE algorithm adds new
samples to the minority class via up-sampling, it also can down-sample cases
from the majority class via random sampling in order to help balance the
training set.
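
The up-sampling half of SMOTE can be sketched as below. This is a simplified illustration for numeric predictors only, not the reference implementation: `smote_sketch` is a hypothetical helper, it interpolates between a randomly chosen minority point and one of its k nearest minority neighbors, and it leaves out the down-sampling of the majority class.

```r
set.seed(1)

# Synthesize n_new minority cases by interpolating between a randomly chosen
# minority point and one of its k nearest minority neighbors.
smote_sketch <- function(X_min, n_new, k = 5) {
  X_min <- as.matrix(X_min)
  stopifnot(nrow(X_min) > k)
  d <- as.matrix(dist(X_min))   # pairwise Euclidean distances
  diag(d) <- Inf                # a point is not its own neighbor
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(X_min),
                      dimnames = list(NULL, colnames(X_min)))
  for (i in seq_len(n_new)) {
    base <- sample.int(nrow(X_min), 1)       # random minority point
    nbrs <- order(d[base, ])[seq_len(k)]     # its k nearest neighbors
    nb   <- nbrs[sample.int(k, 1)]           # pick one neighbor at random
    u    <- runif(1)                         # position along the segment
    synthetic[i, ] <- X_min[base, ] + u * (X_min[nb, ] - X_min[base, ])
  }
  synthetic
}

# Simulated minority class with two numeric predictors.
X_min <- matrix(rnorm(40), ncol = 2, dimnames = list(NULL, c("x1", "x2")))
new_cases <- smote_sketch(X_min, n_new = 30, k = 5)
dim(new_cases)
```

Each synthetic case lies on the line segment between two real minority cases, so the new points stay inside the region already occupied by the minority class.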


### Solution 2: Model Tuning

The simplest approach to counteracting the negative effects of class imbalance
is to tune the model to maximize the accuracy of the minority class(es).
For default loan prediction, tuning the model to maximize the sensitivity may
help desensitize the training process to the high percentage of nondefault cases in the training set. 

### Solution 3: Searching for an Optimal Threshold

When there are two possible outcome categories (such as in default loan prediction), another method for increasing the prediction accuracy of the minority class samples is to determine alternative cutoffs for the predicted probabilities, which effectively changes
the definition of a predicted event. The most straightforward approach is to
use the ROC curve since it calculates the sensitivity and specificity across
a continuum of cutoffs. Using this curve, an appropriate balance between
sensitivity and specificity can be determined.
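
As a rough illustration of choosing a cutoff from the ROC curve, the sketch below scans a grid of cutoffs on simulated data and reports two candidates: the cutoff that maximizes sensitivity + specificity - 1 (Youden's J index) and the cutoff closest to the perfect model in the upper left corner. The data, the grid, and all object names are assumptions made for the example.

```r
set.seed(7)

# Simulated labels (about 5 % events) and predicted event probabilities.
y     <- rbinom(2000, 1, 0.05)
p_hat <- plogis(-3 + 2 * y + rnorm(2000))

# Sensitivity and specificity for each candidate cutoff.
cutoffs <- seq(0.01, 0.99, by = 0.01)
roc_tab <- t(sapply(cutoffs, function(cut) {
  pred <- as.integer(p_hat >= cut)
  c(cutoff      = cut,
    sensitivity = sum(pred == 1 & y == 1) / sum(y == 1),
    specificity = sum(pred == 0 & y == 0) / sum(y == 0))
}))
roc_tab <- as.data.frame(roc_tab)

# Youden's J and the distance to the perfect model
# (sensitivity = specificity = 1).
roc_tab$youden <- roc_tab$sensitivity + roc_tab$specificity - 1
roc_tab$dist   <- sqrt((1 - roc_tab$sensitivity)^2 +
                       (1 - roc_tab$specificity)^2)

roc_tab[which.max(roc_tab$youden), ]
roc_tab[which.min(roc_tab$dist), ]
```

With a rare event, both candidate cutoffs typically fall well below the default 0.5. The cutoff should be chosen on data not used for the final performance estimate, otherwise that estimate is optimistic.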

Several techniques exist for determining a new cutoff. First, if there is
a particular target that must be met for the sensitivity or specificity, this
point can be found on the ROC curve and the corresponding cutoff can be
determined. Another approach is to find the point on the ROC curve that is
closest (i.e., the shortest distance) to the perfect model (with 100 % sensitivity
and 100 % specificity), which is associated with the upper left corner of the
plot.

Another approach for determining the cutoff uses Youden’s J index (sensitivity + specificity − 1), which measures the proportion of correctly predicted samples
for both the event and nonevent groups. This index can be computed for each
cutoff that is used to create the ROC curve. The cutoff associated with the
largest value of the Youden index may also show superior performance relative to the default 50 % value.

### Solution 4: Adjusting Prior Probabilities

Some models use prior probabilities, such as naïve Bayes and discriminant
analysis classifiers. Unless specified manually, these models typically derive
the value of the priors from the training data. Weiss and Provost (2001a)
suggest that priors that reflect the natural class imbalance will materially bias
predictions to the majority class. Using more balanced priors or a balanced
training set may help deal with a class imbalance.

### Solution 5: Cost-Sensitive Training

Instead of optimizing the typical performance measure, such as accuracy or
impurity, some models can alternatively optimize a cost or loss function
that differentially weights specific types of errors. For example, it may be appropriate to believe that misclassifying true events (false negatives) is X
times as costly as incorrectly predicting nonevents (false positives). Incorporation of specific costs during model training may bias the model towards less frequent classes. Unlike using alternative cutoffs, unequal costs can affect the model parameters and thus have the potential to make true improvements to the classifier.
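
One simple way to approximate unequal costs is through case weights. The sketch below fits an unweighted and a weighted logistic regression on simulated data; the cost ratio of 10 is an assumed value chosen for illustration, and case weighting is only one of several ways to make training cost-sensitive.

```r
set.seed(11)

# Simulated data: one predictor, roughly 5 % events.
n   <- 5000
x   <- rnorm(n)
dat <- data.frame(x = x, y = rbinom(n, 1, plogis(-3.5 + 1.2 * x)))

# Assumed cost ratio: a missed event (false negative) is 10 times as
# costly as a false alarm (false positive).
cost_ratio <- 10L
dat$w <- ifelse(dat$y == 1, cost_ratio, 1L)

fit_plain    <- glm(y ~ x, family = binomial, data = dat)
fit_weighted <- glm(y ~ x, family = binomial, data = dat, weights = w)

# Sensitivity on the training data at the default 0.5 cutoff.
sens <- function(fit) {
  pred <- predict(fit, type = "response") >= 0.5
  sum(pred & dat$y == 1) / sum(dat$y == 1)
}

rbind(plain = coef(fit_plain), weighted = coef(fit_weighted))
c(plain = sens(fit_plain), weighted = sens(fit_weighted))
```

The weights change the fitted coefficients themselves (mainly the intercept here), in contrast to an alternative cutoff, which leaves the model untouched. The predicted probabilities of the weighted model are no longer calibrated to the original event rate.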

### A Real-world Application: Predicting Defaulters from Mortgage Applications

In this post I will only present resampling (oversampling and undersampling) for dealing with imbalanced data in practice. The data used is Caravan, which contains 5822 real customer records. Each record consists of 86 variables, containing sociodemographic data (variables 1-43) and product ownership (variables 44-86). The sociodemographic data is derived from zip codes. All customers living in areas with the same zip code have the same sociodemographic attributes. Variable 86 (Purchase) indicates whether the customer purchased a caravan insurance policy. Further information on the individual variables can be obtained at http://www.liacs.nl/~putten/library/cc2000/data.html. 


For this data set, two technical aspects must be considered: (1) between-predictor correlations and (2) zero- and near-zero-variance predictors.

### Between-Predictor Correlations


Collinearity is the technical term for the situation where a pair of predictor variables have a substantial correlation with each other. It is also possible to have relationships between multiple predictors at once (called multicollinearity).

In general, there are good reasons to avoid data with highly correlated
predictors. First, redundant predictors frequently add more complexity to the
model than information they provide to the model. In situations where obtaining the predictor data is costly (either in time or money), fewer variables
is obviously better. While this argument is mostly philosophical, there are
mathematical disadvantages to having correlated predictor data. Using highly
correlated predictors in techniques like linear regression can result in highly
unstable models, numerical errors, and degraded predictive performance.

Since collinear predictors can inflate the variance of parameter estimates in linear regression, a statistic called the variance inflation
factor (VIF) can be used to identify predictors that are impacted (Myers, 1994). Beyond linear regression, this method may be inadequate for several
reasons: it was developed for linear models, it requires more samples than predictor variables, and, while it does identify collinear predictors, it does
not determine which should be removed to resolve the problem.
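
For reference, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all other predictors. The sketch below computes it with base R on simulated data; the variable names are made up for the example.

```r
set.seed(9)

# Simulated predictors: x2 is strongly related to x1, x3 is unrelated.
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# Regress each predictor on the others and convert R^2 into a VIF.
vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  fit <- lm(reformulate(others, response = v), data = X)
  1 / (1 - summary(fit)$r.squared)
})

round(vif, 2)
```

Both `x1` and `x2` receive a large VIF, which illustrates the last limitation mentioned above: the statistic flags the collinear pair but does not say which of the two to drop.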

A less theoretical, more heuristic approach to dealing with this issue is to remove the minimum number of predictors to ensure that all pairwise
correlations are below a certain threshold. While this method only identifies collinearities in two dimensions, it can have a significantly positive effect on the performance of some models.

The algorithm is as follows:

1. Calculate the correlation matrix of the predictors.

2. Determine the two predictors associated with the largest absolute pairwise
correlation (call them predictors A and B).

3. Determine the average correlation between A and the other variables.
Do the same for predictor B.

4. If A has a larger average correlation, remove it; otherwise, remove predictor B.

5. Repeat Steps 2–4 until no absolute correlations are above the threshold.

The idea is to first remove the predictors that have the most correlated relationships.

For example, suppose we want to use a model that is particularly sensitive to between-predictor correlations. We might then apply a threshold of 0.75, which means that we
want to eliminate the minimum number of predictors to achieve all pairwise correlations less than 0.75.
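
The five steps above can be sketched in base R as follows. `filter_correlated` is a hypothetical helper written for this post and the data are simulated; in practice a packaged implementation such as `findCorrelation()` from caret would normally be preferred.

```r
# Returns the names of predictors to drop so that all remaining pairwise
# absolute correlations are at or below `cutoff`.
filter_correlated <- function(X, cutoff = 0.75) {
  cm <- abs(cor(X))                 # Step 1: correlation matrix
  diag(cm) <- 0                     # ignore self-correlations
  dropped <- character(0)
  while (ncol(cm) > 1 && max(cm) > cutoff) {
    # Step 2: the pair (A, B) with the largest absolute correlation
    pair <- which(cm == max(cm), arr.ind = TRUE)[1, ]
    a <- unname(pair[1])
    b <- unname(pair[2])
    # Steps 3-4: drop whichever has the larger average correlation
    # with the other variables
    worst <- if (mean(cm[a, -a]) > mean(cm[b, -b])) a else b
    dropped <- c(dropped, colnames(cm)[worst])
    cm <- cm[-worst, -worst, drop = FALSE]   # Step 5: repeat on the rest
  }
  dropped
}

# Simulated predictors: x1, x2, x3 share a common factor, x4 is independent.
set.seed(3)
z <- rnorm(200)
X <- data.frame(x1 = z + rnorm(200, sd = 0.1),
                x2 = z + rnorm(200, sd = 0.1),
                x3 = z + rnorm(200, sd = 0.1),
                x4 = rnorm(200))

to_drop <- filter_correlated(X, cutoff = 0.75)
X_kept  <- X[, setdiff(names(X), to_drop), drop = FALSE]
to_drop
names(X_kept)
```

In this toy example two of the three near-duplicate predictors are removed and the independent predictor is retained.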

### Zero- and Near Zero-Variance Predictors

There are potential advantages to removing predictors prior to modeling.
First, fewer predictors means decreased computational time and complexity.
Second, if two predictors are highly correlated, this implies that they are measuring the same underlying information. Removing one should not compromise the performance of the model and might lead to a more parsimonious and interpretable model. Third, some models can be crippled by predictors
with degenerate distributions. In these cases, there can be a significant improvement in model performance and/or stability without the problematic variables.

Consider a predictor variable that has a single unique value; we refer to this type of data as **a zero variance predictor**. For some models, such an uninformative variable may have little effect on the calculations. For example, a tree-based model is impervious to this type of predictor since it
would never be used in a split. However, a model such as linear regression
would find these data problematic and is likely to cause an error in the computations. In either case, these data have no information and can easily be discarded.

Similarly, some predictors might have only a handful of unique values that occur with very low frequencies. These “near-zero variance predictors” may have a single value for the vast majority of the samples. 

How can the user diagnose this mode of problematic data? A rule of thumb for detecting near-zero variance predictors is: 

- The fraction of unique values over the sample size is low (say 10 %).

- The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20).

If both of these criteria are true and the model in question is susceptible to this type of predictor, it may be advantageous to remove the variable from the model.
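
The rule of thumb can be sketched in base R as below. The helper names `nzv_stats` and `is_nzv` are my own, the cutoffs of 20 and 10 % are taken from the text above, and the three example vectors are made up for illustration.

```r
# Frequency ratio (most common value vs. second most common value) and
# percentage of unique values for a single predictor.
nzv_stats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# Flag a predictor when both criteria hold.
is_nzv <- function(x, freq_cut = 20, unique_cut = 10) {
  s <- nzv_stats(x)
  unname(s["freq_ratio"] > freq_cut && s["pct_unique"] < unique_cut)
}

set.seed(5)
rare_flag <- c(rep(0, 980), rep(1, 20))   # ratio 980 / 20 = 49
balanced  <- rbinom(1000, 1, 0.5)         # ratio close to 1
constant  <- rep(1, 1000)                 # zero variance

nzv_stats(rare_flag)
sapply(list(rare_flag = rare_flag, balanced = balanced, constant = constant),
       is_nzv)
```

With these settings a constant (zero-variance) predictor is flagged as well, since its frequency ratio is treated as infinite.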

### R Code

In this section I present R code for the oversampling and undersampling techniques.

The first stage is to determine the optimal ratio between the majority and minority classes. This is done as follows. In the first step, an analytical model is built on the original data set with the skewed class distribution (for example, 95 percent/5 percent nondefaulters/defaulters). The area under the curve (AUC) of this model is recorded (possibly on
an independent validation data set). In the next step, over- or undersampling is used
to change the class distribution by 5 percentage points (for example, to 90 percent/10 percent). Again, the
AUC of the model is recorded. Subsequent models are built on samples of 85 percent/
15 percent, 80 percent/20 percent, 75 percent/25 percent, and so on, each time
recording their AUCs. Once the AUC starts to stagnate (or drop), the procedure stops
and the optimal class ratio has been found.

The second stage is to compare the model results based on several evaluation criteria.


```{r, fig.fullwidth = TRUE, fig.height=12, fig.width=12}
#=================================
#  Stage 1: Data Pre-processing
#=================================

# Clear workspace: 
rm(list = ls())

# Load packages and data: 
library(tidyverse)
library(magrittr)
library(ISLR)
data("Caravan")

#-------------------------------------------------------
#  Remove collinear and near-zero variance predictors
#-------------------------------------------------------

# All original predictors: 
df_predictors <- Caravan %>% select(-Purchase)

# Compute and plot a correlation matrix:  
library(corrplot)

df_predictors %>%
  cor(.) -> correlations 

correlations %>%
  corrplot(., order = "hclust", tl.cex = .35)

# Filter variables based on threshold of 0.8: 
library(caret)
highCorr <- findCorrelation(correlations, cutoff = .8)

# Remaining predictors:
remaining_df <- df_predictors %>% select(-all_of(highCorr))

# Remove near-zero variance predictors: 
remaining_df_after_zero <- remaining_df %>% select(-nearZeroVar(remaining_df))

# Add target variable: 
remaining_df_after_zero %<>% mutate(Purchase = Caravan$Purchase)

```

R code for the over- and under-sampling techniques:

```{r, fig.fullwidth = TRUE, fig.height=12, fig.width=12}
# Split data: 
set.seed(1)
id <- createDataPartition(y = remaining_df_after_zero$Purchase, p = 0.5, list = FALSE)
df_train <- remaining_df_after_zero[id, ] # For training
df_test <- remaining_df_after_zero[-id, ] # For testing


#-----------------------------------------------------------------------------
#  A function for calculating AUC / ROC based on 10 samples with given rate 
#  of minority class as described by oversampling technique 
#-----------------------------------------------------------------------------

library(ROSE)
library(pROC)

# Set conditions for training: 
  
set.seed(19950917)
ctrl <- trainControl(method = "repeatedcv",
                     number = 3,
                     repeats = 1,
                     summaryFunction = multiClassSummary, 
                     allowParallel = TRUE,
                     classProbs = TRUE)
  

my_auc_over <- function(minority_rate) {
  
  p <- minority_rate
  df_com <- tibble()  # data_frame() is deprecated in favour of tibble()
  for (j in 1:10) {
    
    # Use oversampling for training logistic model: 
    set.seed(j)
    data_balanced_over <- ovun.sample(Purchase ~., data = df_train, p = p, method = "over")$data
    
    # Train logistic model: 
    my_logistic <- train(Purchase ~., method = "glm", trControl = ctrl, data = data_balanced_over)
    
    # Calculate some model performance metrics: 
    
    test_pred <- predict(my_logistic, df_test)
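    # caret expects predictions first, then the reference (true) classes;
    # "Yes" (the minority class) is treated as the event of interest:
    cm <- confusionMatrix(data = test_pred, reference = df_test$Purchase, positive = "Yes")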
    
    bg_gg <- cm$table %>% 
      as.vector() %>% 
      matrix(ncol = 4) %>% 
      as.data.frame()
    
    names(bg_gg) <- c("BB", "GB", "BG", "GG")
    kq <- c(cm$overall, cm$byClass) 
    
    ten <- kq %>% 
      as.data.frame() %>% 
      row.names()
    
    kq %>% 
      as.vector() %>% 
      matrix(ncol = 18) %>% 
      as.data.frame() -> all_df
    
    names(all_df) <- ten
    all_df <- bind_cols(all_df, bg_gg)
    
    # Calculate AUC: 
    pred <- predict(my_logistic, df_test, type = "prob") %>% pull(Yes)
    auc <- roc(df_test$Purchase, pred)$auc

    # Add AUC: 
    all_df %<>% mutate(AUC = auc %>% as.vector(), Minority_Rate = p)
    df_com <- bind_rows(df_com, all_df)

  }
  return(df_com)

}

# Minority rate from original data: 
Caravan$Purchase %>% table() / nrow(Caravan)

# Use this function: 
system.time(optimal_rate <- lapply(seq(0.1, 0.5, 0.05), my_auc_over))
optimal_rate <- do.call("bind_rows", optimal_rate)

# Show results: 
optimal_rate %>% 
  group_by(Minority_Rate) %>% 
  summarise(avg_auc = mean(AUC))

optimal_rate %>% 
  group_by(Minority_Rate) %>% 
  summarise(avg_acc = mean(Accuracy))

# Model Performance based on some criteria: 

theme_set(theme_minimal())

optimal_rate %>% 
  select(Accuracy, Kappa, Sensitivity, Specificity, Precision, Recall, F1, Prevalence, AUC, Minority_Rate) %>% 
  mutate(Minority_Rate = factor(Minority_Rate)) %>% 
  gather(a, b, -Minority_Rate) %>% 
  ggplot(aes(Minority_Rate, b, fill = Minority_Rate, color = Minority_Rate)) + 
  geom_boxplot(show.legend = FALSE, alpha = 0.3) + 
  facet_wrap(~ a, scales = "free") + 
  coord_flip() + 
  theme(plot.margin = unit(c(1, 1, 1, 1), "cm")) + 
  theme(panel.grid.minor.x = element_blank()) + 
  labs(x = NULL, y = NULL, 
       title = "Figure 1: Model Performance Based on 9 Criteria for Alternative Minority Rates by Upsampling Technique", 
       subtitle = "Data Used: Caravan")


#-----------------------------------------------------------------------------
#  A function for calculating AUC / ROC based on 10 samples with given rate 
#  p as described by undersampling technique. Note that ovun.sample() takes
#  p as the share of the minority class in the resampled data.
#-----------------------------------------------------------------------------

my_auc_under <- function(majority_rate) {
  
  p <- majority_rate
  df_com <- tibble()  # data_frame() is deprecated in favour of tibble()
  for (j in 1:10) {
    
    # Use undersampling for training logistic model: 
    set.seed(j)
    data_balanced_under <- ovun.sample(Purchase ~., data = df_train, p = p, method = "under")$data
    
    # Train logistic model: 
    my_logistic <- train(Purchase ~., method = "glm", trControl = ctrl, data = data_balanced_under)
    
    # Calculate some model performance metrics: 
    
    test_pred <- predict(my_logistic, df_test)
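    # caret expects predictions first, then the reference (true) classes;
    # "Yes" (the minority class) is treated as the event of interest:
    cm <- confusionMatrix(data = test_pred, reference = df_test$Purchase, positive = "Yes")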
    
    bg_gg <- cm$table %>% 
      as.vector() %>% 
      matrix(ncol = 4) %>% 
      as.data.frame()
    
    names(bg_gg) <- c("BB", "GB", "BG", "GG")
    kq <- c(cm$overall, cm$byClass) 
    
    ten <- kq %>% 
      as.data.frame() %>% 
      row.names()
    
    kq %>% 
      as.vector() %>% 
      matrix(ncol = 18) %>% 
      as.data.frame() -> all_df
    
    names(all_df) <- ten
    all_df <- bind_cols(all_df, bg_gg)
    
    # Calculate AUC: 
    pred <- predict(my_logistic, df_test, type = "prob") %>% pull(Yes)
    auc <- roc(df_test$Purchase, pred)$auc

    # Add AUC: 
    all_df %<>% mutate(AUC = auc %>% as.vector(), Majority_Rate = p)
    df_com <- bind_rows(df_com, all_df)

  }
  
  return(df_com)

}


system.time(optimal_rate_under <- lapply(seq(0.5, 0.9, 0.05), my_auc_under))
optimal_rate_under <- do.call("bind_rows", optimal_rate_under)


# Show results: 
optimal_rate_under %>% 
  group_by(Majority_Rate) %>% 
  summarise(avg_auc = mean(AUC))

optimal_rate_under %>% 
  group_by(Majority_Rate) %>% 
  summarise(avg_acc = mean(Accuracy))

optimal_rate_under %>% 
  select(Accuracy, Kappa, Sensitivity, Specificity, Precision, Recall, F1, Prevalence, AUC, Majority_Rate) %>% 
  mutate(Majority_Rate = factor(Majority_Rate)) %>% 
  gather(a, b, -Majority_Rate) %>% 
  ggplot(aes(Majority_Rate, b, fill = Majority_Rate, color = Majority_Rate)) + 
  geom_boxplot(show.legend = FALSE, alpha = 0.3) + 
  facet_wrap(~ a, scales = "free") + 
  coord_flip() + 
  theme(plot.margin = unit(c(1, 1, 1, 1), "cm")) + 
  theme(panel.grid.minor.x = element_blank()) + 
  labs(x = NULL, y = NULL, 
       title = "Figure 2: Model Performance Based on 9 Criteria for Alternative Minority Rates by Downsampling Technique", 
       subtitle = "Data Used: Caravan")


```

### Some Key Conclusions

From the empirical evidence on the Caravan data set, we can conclude that: 

1. Oversampling technique has a negligible impact on AUC and Accuracy. 

2. Undersampling technique has an adverse impact on AUC and Accuracy.

The same conclusions are found from a post by Nina Zumel and John Mount: 
http://www.win-vector.com/blog/2015/02/does-balancing-classes-improve-classifier-performance/.  


### References

1. Basel Committee on Banking Supervision. 2005.  “Validation of Low-Default Portfolios in the Basel II Framework.”  Basel Committee Newsletter no. 6,  September.

2. Artis M, Ayuso M, Guillen M (2002). “Detection of Automobile Insurance Fraud with Discrete Choice Models and Misclassified Claims.” The Journal of Risk and Insurance, 69(3), 325–340.

3. Richardson M, Dominowska E, Ragno R (2007). “Predicting Clicks: Estimating the Click–Through Rate for New Ads.” In “Proceedings of the 16th International Conference on the World Wide Web,” pp. 521–530.

4. Visa, S., & Ralescu, A. (2005, April). Issues in mining imbalanced data sets - a review paper. In Proceedings of the sixteen midwest artificial intelligence and cognitive science conference (Vol. 2005, pp. 67-73).

5. Kotsiantis, S., Kanellopoulos, D., & Pintelas, P. (2006). Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering, 30(1), 25-36.

6. Maloof, M. A. (2003, August). Learning when data sets are imbalanced and when costs are unequal and unknown. In ICML-2003 workshop on learning from imbalanced data sets II (Vol. 2, pp. 2-1).

7. Japkowicz, N. (2003, August). Class imbalances: are we focusing on the right issue. In Workshop on Learning from Imbalanced Data Sets II (Vol. 1723, p. 63).

8. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321-357. 

9. Myers R (1994). Classical and Modern Regression with Applications. PWS-KENT Publishing Company, Boston, MA, second edition.



