Credit Scoring Development Using R

Ng Yong Kad

11/9/2020

This article is aimed at two groups of readers. The first group has learned the underlying theories of credit scorecards but has yet to see a detailed application of those theories. The second group is practitioners who are already well versed in the theories and applications, but would like to explore developing credit scorecards using open-source software such as R.

1.0 Introduction

The methodologies and processes for developing credit scorecards for lending purposes have long been held as the intellectual property of financial institutions and consulting firms.

Recently, even though several books have been published to shed light on credit scorecards, practical guidance and demonstrations of the hands-on aspects of credit scorecard development have been in short supply. This could be due to the following reasons:

  • Availability of data is a common hindrance. Due to regulations and customers’ privacy, financial institutions’ data cannot be shared with external parties. Even in the case of credit bureaus, data is subject to strict data protection regulations and, thus, is not available for research purposes. As a result, it is not surprising that step-by-step demonstrations of credit scorecard development are absent not only from published books but also from research papers (to the best of my knowledge); and

  • The software used to develop credit scorecards could be another obstacle. Different analysts, financial institutions and consulting firms use different software. Code written for one tool is generally not compatible with another. Therefore, selecting one particular tool for demonstrating credit scorecard development immediately reduces the demonstration’s appeal due to the lack of compatibility. On the other hand, using multiple tools to demonstrate the development process would make the scope unmanageable.

This article seeks to fill the gap by providing a practical, hands-on demonstration of how to develop a credit scorecard, as close to reality as possible. To that end, two decisions are made to address the aforementioned challenges: A) a real, publicly available (albeit small) credit data set is used, and B) the open-source software R is adopted.

The reasons for choosing R are twofold:

  1. R is widely used among data science practitioners and it is free. In addition, documentation on R is readily available and of good quality. Since R is popular, using it will not reduce the appeal of the demonstrations shown in this article. Non-R users can download R at no cost to run the R code provided in this article.

  2. A credit scorecard development package, called scorecard, has been built in R by Shichen Xie. This wonderful package makes credit scorecard development a breeze and will be used throughout this article.

With the above, eager students can download R and the data set to learn credit scorecard development. For analysts who currently use other software, this article can be taken as a showcase of how open-source software can be used reliably to build credit scorecards. Having a free alternative to consider and explore is always an attractive business case.

Finally, it is noteworthy that this article does not seek to produce a powerful credit scorecard, given the rather small data set used and to keep the length of this article manageable. It is also assumed that the reader has minimal or no knowledge of R (comments are added to the code to assist non-R users). Nonetheless, basic knowledge of credit scorecard theory is expected.

Note: Xie has prepared slides showing how a credit scorecard can be built using his package. The slides, however, were written in Chinese, and the data set used therein is not representative of what an analyst is likely to encounter in a real-world situation.

2.0 R and Packages

Other than the data set, which will be described in Section 3.0, the following are needed for this article:

  1. Base R: Base R can be downloaded from CRAN;

  2. RStudio: This is the IDE for R and can be downloaded for free from RStudio.com;

  3. Required R packages: Install these packages via RStudio, selecting the option to install all dependencies (or run the install.packages() call shown after this list).

    • scorecard - for credit scorecards development;
    • tidyverse - for data import and manipulations;
    • Hmisc - for summary statistics generation;
    • ClustOfVar - for variables clustering.
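
Equivalently, the packages can be installed from the R console in one call (a minimal sketch; the package names are as listed above):

# install the required packages together with all of their dependencies
install.packages(c("scorecard", "tidyverse", "Hmisc", "ClustOfVar"),
                 dependencies = TRUE)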

A quick introduction on how to install the software and how to navigate RStudio can be found in the YouTube video How to install R and install R Studio.

3.0 Data

This section describes the data set as well as the techniques used to process the data set.

A potential question the reader could raise concerns the size of data R can handle. This is a legitimate question, as it is well known that R loads data into memory (i.e. RAM), which could be an issue if the data set is very large. There are at least two ways to overcome this:

  1. Pre-process the data as much as possible using relational database software before loading it into R. R has connectors that allow it to link to a relational database server, so SQL commands can be submitted from R to the server for data manipulation purposes. Once in R, use data.table (instead of data.frame) to handle the data set (a sketch follows this list); or

  2. Have access to cloud computing services such as AWS and Azure.
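
As an illustration of the first option, below is a minimal sketch (not part of this article’s workflow) using the DBI and data.table packages; the driver, server, table and column names are placeholders.

# connect to a relational database server (connection details are placeholders)
library(DBI)
library(data.table)

con <- dbConnect(odbc::odbc(),
                 Driver   = "SQL Server",
                 Server   = "my_server",
                 Database = "my_database")

# push the heavy aggregation to the server and pull back only the summarised result
raw <- dbGetQuery(con, "
  SELECT customer_id, SUM(balance) AS total_balance
  FROM accounts
  GROUP BY customer_id")

dbDisconnect(con)

# convert to data.table for fast in-memory manipulation
dt <- as.data.table(raw)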

3.1 The Data Set

The data set analyzed in this article is sourced from Yeh, I. C., & Lien, C. H. (2009) and can be downloaded from the UCI Machine Learning Repository. DON’T download it now; the download will be done using R.

The data set contains records of 30,000 credit card holders in Taiwan. What makes this data set interesting is that, in addition to the usual demographic information, the payments, outstanding balances and delinquency statuses of the credit card holders were tracked over a period of time (6 months). This feature makes the data set similar to what an analyst would encounter in a real credit scorecard development project.

In total, there are 24 variables in the data set. The columns (i.e. variables) of the data set are neatly labeled and, therefore, largely self-explanatory. Further remarks will be made in this article should the need arise. Nonetheless, interested readers can find information on the columns’ attributes on the UCI Machine Learning Repository page.

Note: As the name of the default indicator default in the next month suggests, a credit scorecard built on this data set will be used to predict whether a credit card holder will default in the next month. In the banking industry, predicting default within the next 12 months is the norm.

3.2 Import Data into R

The data set from the UCI Machine Learning Repository page has an extra row in row 1 of the spreadsheet. It is crucial to delete this row from the spreadsheet. The cleaned-up version of the data set is stored on GitHub.

library(tidyverse) initiates the tidyverse package, which will be used for data processing and manipulation. read.csv loads the csv data set into R as an object called data. Carefully examine the first few rows of data, generated by running head(data), to ensure that the data set is imported correctly. str(data) gives a summary of the structure of the data set, which can be used in conjunction with head(data) for checking.

# load the tidyverse package. this needs to be run only once
library(tidyverse)  

# download the file from Github. note the path uses forward slashes
data <- read.csv("https://github.com/ngyongkad/scorecard/blob/main/Import%20Credit%20Card%20Defaults.csv?raw=true")  

# print out the first few lines of the data set
head(data) 

# print out the structure of the data set
str(data)  
##   ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6
## 1  1     20000   2         2        1  24     2     2    -1    -1    -2    -2
## 2  2    120000   2         2        2  26    -1     2     0     0     0     2
## 3  3     90000   2         2        2  34     0     0     0     0     0     0
## 4  4     50000   2         2        1  37     0     0     0     0     0     0
## 5  5     50000   1         2        1  57    -1     0    -1     0     0     0
## 6  6     50000   1         1        2  37     0     0     0     0     0     0
##   BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2
## 1      3913      3102       689         0         0         0        0      689
## 2      2682      1725      2682      3272      3455      3261        0     1000
## 3     29239     14027     13559     14331     14948     15549     1518     1500
## 4     46990     48233     49291     28314     28959     29547     2000     2019
## 5      8617      5670     35835     20940     19146     19131     2000    36681
## 6     64400     57069     57608     19394     19619     20024     2500     1815
##   PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default.payment.next.month
## 1        0        0        0        0                          1
## 2     1000     1000        0     2000                          1
## 3     1000     1000     1000     5000                          0
## 4     1200     1100     1069     1000                          0
## 5    10000     9000      689      679                          0
## 6      657     1000     1000      800                          0
## 'data.frame':    30000 obs. of  25 variables:
##  $ ID                        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ LIMIT_BAL                 : int  20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
##  $ SEX                       : int  2 2 2 2 1 1 1 2 2 1 ...
##  $ EDUCATION                 : int  2 2 2 2 2 1 1 2 3 3 ...
##  $ MARRIAGE                  : int  1 2 2 1 1 2 2 2 1 2 ...
##  $ AGE                       : int  24 26 34 37 57 37 29 23 28 35 ...
##  $ PAY_0                     : int  2 -1 0 0 -1 0 0 0 0 -2 ...
##  $ PAY_2                     : int  2 2 0 0 0 0 0 -1 0 -2 ...
##  $ PAY_3                     : int  -1 0 0 0 -1 0 0 -1 2 -2 ...
##  $ PAY_4                     : int  -1 0 0 0 0 0 0 0 0 -2 ...
##  $ PAY_5                     : int  -2 0 0 0 0 0 0 0 0 -1 ...
##  $ PAY_6                     : int  -2 2 0 0 0 0 0 -1 0 -1 ...
##  $ BILL_AMT1                 : int  3913 2682 29239 46990 8617 64400 367965 11876 11285 0 ...
##  $ BILL_AMT2                 : int  3102 1725 14027 48233 5670 57069 412023 380 14096 0 ...
##  $ BILL_AMT3                 : int  689 2682 13559 49291 35835 57608 445007 601 12108 0 ...
##  $ BILL_AMT4                 : int  0 3272 14331 28314 20940 19394 542653 221 12211 0 ...
##  $ BILL_AMT5                 : int  0 3455 14948 28959 19146 19619 483003 -159 11793 13007 ...
##  $ BILL_AMT6                 : int  0 3261 15549 29547 19131 20024 473944 567 3719 13912 ...
##  $ PAY_AMT1                  : int  0 0 1518 2000 2000 2500 55000 380 3329 0 ...
##  $ PAY_AMT2                  : int  689 1000 1500 2019 36681 1815 40000 601 0 0 ...
##  $ PAY_AMT3                  : int  0 1000 1000 1200 10000 657 38000 0 432 0 ...
##  $ PAY_AMT4                  : int  0 1000 1000 1100 9000 1000 20239 581 1000 13007 ...
##  $ PAY_AMT5                  : int  0 0 1000 1069 689 1000 13750 1687 1000 1122 ...
##  $ PAY_AMT6                  : int  0 2000 5000 1000 679 800 13770 1542 1000 0 ...
##  $ default.payment.next.month: int  1 1 0 0 0 0 0 0 0 0 ...

3.3 Data Exploration

Once the data set is imported correctly, it is always good to generate summary statistics to understand the data. This can be done using the package Hmisc as shown below. describe(data) gives detailed descriptive statistics for all of the variables in the data set. This report should be reviewed thoroughly to detect unusual values or anything that is out of line with expectations.

library(Hmisc)
describe(data)
## data 
## 
##  3  Variables      30000  Observations
## --------------------------------------------------------------------------------
## PAY_0 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    30000        0       11    0.871  -0.0167    1.159       -2       -1 
##      .25      .50      .75      .90      .95 
##       -1        0        0        2        2 
## 
## lowest : -2 -1  0  1  2, highest:  4  5  6  7  8
##                                                                             
## Value         -2    -1     0     1     2     3     4     5     6     7     8
## Frequency   2759  5686 14737  3688  2667   322    76    26    11     9    19
## Proportion 0.092 0.190 0.491 0.123 0.089 0.011 0.003 0.001 0.000 0.000 0.001
## --------------------------------------------------------------------------------
## PAY_2 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    30000        0       11    0.843  -0.1338    1.214       -2       -2 
##      .25      .50      .75      .90      .95 
##       -1        0        0        2        2 
## 
## lowest : -2 -1  0  1  2, highest:  4  5  6  7  8
##                                                                             
## Value         -2    -1     0     1     2     3     4     5     6     7     8
## Frequency   3782  6050 15730    28  3927   326    99    25    12    20     1
## Proportion 0.126 0.202 0.524 0.001 0.131 0.011 0.003 0.001 0.000 0.001 0.000
## --------------------------------------------------------------------------------
## PAY_3 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    30000        0       11    0.843  -0.1662    1.209       -2       -2 
##      .25      .50      .75      .90      .95 
##       -1        0        0        2        2 
## 
## lowest : -2 -1  0  1  2, highest:  4  5  6  7  8
##                                                                             
## Value         -2    -1     0     1     2     3     4     5     6     7     8
## Frequency   4085  5938 15764     4  3819   240    76    21    23    27     3
## Proportion 0.136 0.198 0.525 0.000 0.127 0.008 0.003 0.001 0.001 0.001 0.000
## --------------------------------------------------------------------------------

From the report above (due to the large number of outputs, only a few variables are shown), two observations are worth noting:

  1. No variables have missing values, which is rare in a real-world situation. In practice, missing values always warrant attention and investigation, because their existence could be due to some underlying business process rather than simply a lack of captured information. Variables with excessive genuinely missing values should be discarded (a common benchmark is more than 50% of values missing). A quick way to confirm the absence of missing values is sketched after this list.

  2. The “PAY_x” variables represent delinquency status. The first delinquency variable is “PAY_0”, followed by “PAY_2”. In other words, “PAY_1” is omitted. In addition, for the delinquency status variables in this data set, “-1” signifies that a particular customer was current on the payment. However, from the report generated using R, it can be observed that the delinquency status variables also contain the values “0” and “-2”, which are unexpected.

Point 2 will be addressed in the section below.
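
As a quick check of point 1, the count of missing values per variable can be confirmed with base R (illustrative only; not part of the scorecard workflow):

# number of missing values in each column (all zeros for this data set)
colSums(is.na(data))

# or expressed as a proportion of rows per column
round(colMeans(is.na(data)), 3)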

3.4 Data Manipulation and Variables Generation

3.4.1 Manipulations

There are a few things that need to be achieved in this section. First, rename all variables to lower case for easier typing of code. Second, rename “pay_0” to “pay_1”. Third, give the default indicator “default payment next month” the shorter name “gb_flag”, where “gb” stands for good/bad. Fourth, in all the “pay_x” variables, recode the values “-2” and “-1” into “0”, where “0” signifies current on payment. Finally, a cross-tab is run to check that the recoding is done correctly.

data <- data %>%
  # rename the variables to lower case
  rename_with(str_to_lower, everything()) %>%
  # rename "pay_0" to "pay_1" and "default payment next month" to "gb_flag"
  rename("pay_1" = "pay_0", "gb_flag" = "default.payment.next.month")

# set up a function to recode delinquency status
new_deq <- function(x) {
  ifelse(x %in% c(-2, -1), 0, x)
}
# recode delinquency status
data <- data %>% 
 mutate(across(pay_1:pay_6, new_deq, .names = "{col}_recode"))

# quick check on the recoding
table(data$pay_1, data$pay_1_recode)
##     
##          0     1     2     3     4     5     6     7     8
##   -2  2759     0     0     0     0     0     0     0     0
##   -1  5686     0     0     0     0     0     0     0     0
##   0  14737     0     0     0     0     0     0     0     0
##   1      0  3688     0     0     0     0     0     0     0
##   2      0     0  2667     0     0     0     0     0     0
##   3      0     0     0   322     0     0     0     0     0
##   4      0     0     0     0    76     0     0     0     0
##   5      0     0     0     0     0    26     0     0     0
##   6      0     0     0     0     0     0    11     0     0
##   7      0     0     0     0     0     0     0     9     0
##   8      0     0     0     0     0     0     0     0    19

3.4.2 Variables Generation

Except for some specific variables such as demographic information, raw variables in a data set are normally not used directly in credit scorecards. In this article, new variables are created from delinquency status, limit, and outstanding balance. To keep the scope simple, no new variables are generated from payment amount. By using the ideas and code provided below, however, the reader should have no problem creating his or her own variables from payment amount. Variables generation is limited only by imagination and creativity: given the same data set, different analysts could generate different variables.

## Delinquency status

# un-needed "pay_x" variables are removed 
var_data <- data %>%
  select(-(pay_1:pay_6))

# identify the position of the variables
deq_pos_l3m <- which(str_detect(names(var_data), "pay_(1|2|3)_recode"))
deq_pos_l6m <- which(str_detect(names(var_data), "pay_(1|2|3|4|5|6)_recode"))

# generate new variables
var_data <- var_data %>%
  mutate(
  # average delinquency
  avg_deq_l3m = apply(var_data[, deq_pos_l3m], 1, mean, na.rm = TRUE),
  avg_deq_l6m = apply(var_data[, deq_pos_l6m], 1, mean, na.rm = TRUE),
   
  # max delinquency
  max_deq_l3m = apply(var_data[, deq_pos_l3m], 1, max, na.rm = TRUE),
  max_deq_l6m = apply(var_data[, deq_pos_l6m], 1, max, na.rm = TRUE),
  
  # min delinquency
  min_deq_l3m = apply(var_data[, deq_pos_l3m], 1, min, na.rm = TRUE),
  min_deq_l6m = apply(var_data[, deq_pos_l6m], 1, min, na.rm = TRUE)
  )

## utilization rate

# set up a function to compute utilization rate
uti_rate <- function(x) {
  ifelse(x < 0, 0, x / var_data$limit_bal)
}

# compute utilization rate
var_data <- var_data %>% 
  mutate(across(contains("bill_amt"), .fns = list(util = uti_rate)))

# identify the position of the variables
util_pos_l3m <- which(str_detect(names(var_data), "bill_amt(1|2|3)_util"))
util_pos_l6m <- which(str_detect(names(var_data), "bill_amt(1|2|3|4|5|6)_util"))

# generate new variables
var_data <- var_data %>% 
  mutate(
  # average utilization
  avg_util_l3m = apply(var_data[, util_pos_l3m], 1, mean, na.rm = TRUE),
  avg_util_l6m = apply(var_data[, util_pos_l6m], 1, mean, na.rm = TRUE),
  
  # max utilization
  max_util_l3m = apply(var_data[, util_pos_l3m], 1, max, na.rm = TRUE),
  max_util_l6m = apply(var_data[, util_pos_l6m], 1, max, na.rm = TRUE),
  
  # min utilization
  min_util_l3m = apply(var_data[, util_pos_l3m], 1, min, na.rm = TRUE),
  min_util_l6m = apply(var_data[, util_pos_l6m], 1, min, na.rm = TRUE)
  )

3.4.3 Sampling

Having generated the new variables, the full data set can now be divided into a development/training sample and a validation sample.

In a real credit scorecard development project, the validation sample is an out-of-time (OOT) sample, which is taken from a different time period than that of the development sample. With this data set, however, splitting the data in the way prescribed above is not possible. Instead, 80% of the full population is randomly selected as the development sample, and the remaining 20% forms the validation sample.

# set the seed for the random sample
set.seed(1234) 

# randomly select from the id column
dev_ind <- sample(var_data$id, 24000, replace = FALSE) 

# use randomly selected ids to form the development sample  
dev <- var_data[dev_ind,] 

# use non-selected ids to form the validation sample
oot <- var_data[-dev_ind,] 

4.0 Univariate Analysis

4.1 Fine Classing

Fine classing is a technique that groups a variable’s values into a number of fine bins. Using these bins, a measure of the variable’s predictive power, known as information value (IV), can be computed. From these fine bins, further grouping can then be carried out, resulting in coarse classing. As will be shown in the sections below, the bins from coarse classing are the bins that will be used in the credit scorecard.

Specifically, the fine classing technique used in this article is called the frequency method in the package scorecard. This technique groups a variable’s values into bins containing roughly equal proportions of observations. It is a commonly used technique (see the note below). Following industry practice, the maximum number of fine bins is set at 20.

Note:

  • There are other techniques available in the package scorecard. Interested readers are referred to its
    manual.

  • The number of fine bins depends on the distribution of the variable. Some variables have values concentrated at a few points. In such cases, the number of fine bins will turn out to be fewer than 20.

  • The frequency method in the package scorecard works with numeric variables only. It is therefore necessary to convert or recode character variables (e.g. sex, education, etc.) into numeric variables beforehand (as the collectors of the data set used here had done).
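
For readers who want to see the arithmetic behind IV, the sketch below computes WOE and IV by hand for limit_bal using ten equal-frequency bins. It is purely illustrative and not part of the development workflow; the scorecard package does all of this internally (the sign convention of WOE depends on which outcome is treated as “positive”, a point revisited in Section 5.0).

# illustrative only: hand-computed WOE and IV for limit_bal on the development sample
iv_by_hand <- dev %>%
  mutate(bin = ntile(limit_bal, 10)) %>%   # 10 equal-frequency fine bins
  group_by(bin) %>%
  summarise(bad = sum(gb_flag == 1), good = sum(gb_flag == 0), .groups = "drop") %>%
  mutate(dist_bad  = bad / sum(bad),
         dist_good = good / sum(good),
         woe       = log(dist_bad / dist_good),     # one common sign convention
         bin_iv    = (dist_bad - dist_good) * woe)  # IV is unaffected by the sign convention

# total IV for limit_bal
sum(iv_by_hand$bin_iv)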

From this point on, the package scorecard will be used.

library(scorecard)

# select all the variables for IV computation
var_list <- dev %>%
  select(-id, -gb_flag) %>%
  names()

# invoke the woebin function in package scorecard
fine_class <- woebin(dev, 
                    y = "gb_flag", 
                    x = var_list, 
                    positive = 1, # the value in gb_flag that indicates default
                    method = "freq", # frequency method
                    bin_num_limit = 20)  # the max number of fine bins
## [INFO] creating woe binning ...
# collect the IV of the variables
iv <- map_df(fine_class, ~pluck(.x, 10, 1)) %>%
  pivot_longer(everything(), names_to = "var", values_to = "iv")
var iv
limit_bal 0.1785871
sex 0.0079849
education 0.0164957
marriage 0.0056892
age 0.0193498
bill_amt1 0.0128997
bill_amt2 0.0146458
bill_amt3 0.0128501
bill_amt4 0.0148755
bill_amt5 0.0153958
bill_amt6 0.0247551
pay_amt1 0.1066663
pay_amt2 0.0798784
pay_amt3 0.0744712
pay_amt4 0.0490983
pay_amt5 0.0421035
pay_amt6 0.0199621
pay_1_recode 0.8648336
pay_2_recode 0.5459762
pay_3_recode 0.4117610
pay_4_recode 0.3616306
pay_5_recode 0.3354879
pay_6_recode 0.2923127
avg_deq_l3m 0.8797259
avg_deq_l6m 0.3011544
max_deq_l3m 0.7979072
max_deq_l6m 0.7441613
min_deq_l3m 0.4711887
min_deq_l6m 0.0000000
bill_amt1_util 0.0156583
bill_amt2_util 0.0133279
bill_amt3_util 0.0128110
bill_amt4_util 0.0177728
bill_amt5_util 0.0182308
bill_amt6_util 0.0209823
avg_util_l3m 0.1136387
avg_util_l6m 0.1346854
max_util_l3m 0.0899689
max_util_l6m 0.1039890
min_util_l3m 0.0146800
min_util_l6m 0.0072010

The reports of the fine classing are contained in fine_class and can be viewed by typing fine_class at the console. Due to the large number of reports, only the first one is shown below.

## $limit_bal
##      variable             bin count count_distr good bad   badprob        woe
##  1: limit_bal    [-Inf,30000)  1990  0.08291667 1272 718 0.3608040  0.6802441
##  2: limit_bal   [30000,50000)  1477  0.06154167  936 541 0.3662830  0.7039240
##  3: limit_bal   [50000,70000)  3358  0.13991667 2466 892 0.2656343  0.2352337
##  4: limit_bal  [70000,100000)  2369  0.09870833 1774 595 0.2511608  0.1596895
##  5: limit_bal [100000,140000)  2452  0.10216667 1853 599 0.2442904  0.1228206
##  6: limit_bal [140000,160000)  1512  0.06300000 1228 284 0.1878307 -0.2120476
##  7: limit_bal [160000,200000)  1973  0.08220833 1623 350 0.1773948 -0.2819782
##  8: limit_bal [200000,210000)  1202  0.05008333  987 215 0.1788686 -0.2719118
##  9: limit_bal [210000,270000)  2725  0.11354167 2259 466 0.1710092 -0.3263717
## 10: limit_bal [270000,360000)  2258  0.09408333 1941 317 0.1403897 -0.5599366
## 11: limit_bal [360000,430000)  1448  0.06033333 1235 213 0.1470994 -0.5054139
## 12: limit_bal   [430000, Inf)  1236  0.05150000 1090 146 0.1181230 -0.7582061
##          bin_iv  total_iv breaks is_special_values
##  1: 0.045171693 0.1785871  30000             FALSE
##  2: 0.036066804 0.1785871  50000             FALSE
##  3: 0.008242682 0.1785871  70000             FALSE
##  4: 0.002628082 0.1785871 100000             FALSE
##  5: 0.001593515 0.1785871 140000             FALSE
##  6: 0.002665805 0.1785871 160000             FALSE
##  7: 0.006024928 0.1785871 200000             FALSE
##  8: 0.003423424 0.1785871 210000             FALSE
##  9: 0.010999969 0.1785871 270000             FALSE
## 10: 0.024967126 0.1785871 360000             FALSE
## 11: 0.013268441 0.1785871 430000             FALSE
## 12: 0.023534628 0.1785871    Inf             FALSE

Note: All of the generated fine classing reports can be saved to a csv file using the write.list function from the package erer, as sketched below. Interested readers are encouraged to explore.
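
A minimal sketch of that option, assuming the erer package has been installed (the file name is arbitrary):

# save all fine classing tables into a single csv file
library(erer)
write.list(fine_class, file = "fine_class_reports.csv")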

4.2 Initial Variables Removal

As discussed in the section above, new variables are derived because not all raw variables should be used directly. Numeric amount variables such as balance, limit, and payment are not advisable to use because their values vary widely, which can render a credit scorecard unstable. Instead, these variables are converted into ratios (e.g. utilization rates), which are bounded below by zero and whose maximum values should not be too far above 100%.

Also as discussed above, variables with a lot of missing values should be discarded. This does not apply in this article, however, because no variables have missing values.

Lastly, variables with low predictive power as measured by IV (IV < 0.02) are also removed. However, since age’s IV is borderline and it is preferable to have some demographic variables in a credit scorecard, age is kept.

With the above, the following variables are removed:

  1. Low IV
  • sex
  • education
  • marriage
  • bill_amt1 to bill_amt5
  • pay_amt6
  • min_deq_l6m
  • bill_amt1_util to bill_amt5_util
  • min_util_l3m
  • min_util_l6m
  2. Variables in amount
  • limit_bal
  • pay_amt1 to pay_amt5
  • bill_amt6

# keep the relevant variables
dev <- dev %>%
  select(id, age, gb_flag:max_util_l6m, avg_deq_l3m:min_deq_l3m, bill_amt6_util)

4.3 Variables Clustering

An analyst frequently faces thousands of variables. Even after filtering out variables with high missing values or with low IV values, hundreds of variables will still remain.

In addition to the IV and missing-values criteria, variables clustering can be added to the variables reduction tool kit. The idea of variables clustering is to group similar variables into one cluster. A few variables can then be selected from each of the different clusters. An additional benefit of variables clustering is that it could help reduce multi-collinearity among variables.

While it is not strictly necessary in this article, as the number of variables is small, variables clustering is conducted for completeness. From the stability plot shown below, 4 clusters appear appropriate for the development sample. Hence, 4 clusters are set using cutree. In the final output table below, the “x” column is the cluster number. The application of this output is shown in the section below.

Note: Variables clustering is computation intensive. It could take hours to complete if the number of variables is huge.

library(ClustOfVar)

# perform variables clustering
tree <- dev %>%
  select(-id, -gb_flag) %>%
  hclustvar() 

# stability plot
set.seed(345)
stab <- stability(tree, B = 30) # B is the number of bootstrap samples. set B at 30 to 50 to reduce computing time

boxplot(stab$matCR, main = "Dispersion of the adjusted Rand index")

# generate the final cluster output
clus <- cutree(tree, 4) 
x
age 1
pay_1_recode 2
pay_2_recode 2
pay_3_recode 2
pay_4_recode 3
pay_5_recode 3
pay_6_recode 3
avg_deq_l3m 2
avg_deq_l6m 2
max_deq_l3m 2
max_deq_l6m 2
min_deq_l3m 2
min_deq_l6m 3
bill_amt1_util 4
bill_amt2_util 4
bill_amt3_util 4
bill_amt4_util 4
bill_amt5_util 4
bill_amt6_util 4
avg_util_l3m 4
avg_util_l6m 4
max_util_l3m 4
max_util_l6m 4

4.4 Variables Reduction

One way to utilize the output from variables clustering is to combine it with IV, as follows:

var clus iv
age 1 0.0193498
avg_deq_l3m 2 0.8797259
pay_1_recode 2 0.8648336
max_deq_l3m 2 0.7979072
max_deq_l6m 2 0.7441613
pay_2_recode 2 0.5459762
min_deq_l3m 2 0.4711887
pay_3_recode 2 0.4117610
avg_deq_l6m 2 0.3011544
pay_4_recode 3 0.3616306
pay_5_recode 3 0.3354879
pay_6_recode 3 0.2923127
min_deq_l6m 3 0.0000000
avg_util_l6m 4 0.1346854
avg_util_l3m 4 0.1136387
max_util_l6m 4 0.1039890
max_util_l3m 4 0.0899689
bill_amt6_util 4 0.0209823
bill_amt5_util 4 0.0182308
bill_amt4_util 4 0.0177728
bill_amt1_util 4 0.0156583
bill_amt2_util 4 0.0133279
bill_amt3_util 4 0.0128110

In cluster 1, there is only one variable, so age is short-listed. In cluster 2, the first two variables, avg_deq_l3m and pay_1_recode, are selected because they have the highest IVs in the cluster. Likewise, for cluster 3, pay_4_recode and pay_5_recode are chosen. Finally, in cluster 4, the IVs of the first two variables are quite close and both are average-utilization variables. Therefore, instead of choosing the first two variables, the first (avg_util_l6m) and the third (max_util_l6m) are short-listed.

Note: The process presented above is one of many ways of using variables clustering. Different analysts use variables clustering differently.
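
The mechanical part of this short-listing can be expressed in code. The sketch below assumes clus is the named vector returned by cutree and iv is the IV table computed in Section 4.1; judgment is still applied afterwards (e.g. swapping in max_util_l6m for cluster 4, as discussed above).

# combine cluster membership with IV and take the top 2 IVs per cluster as a starting point
cluster_iv <- enframe(clus, name = "var", value = "clus") %>%
  inner_join(iv, by = "var") %>%
  arrange(clus, desc(iv))

shortlist <- cluster_iv %>%
  group_by(clus) %>%
  slice_max(iv, n = 2) %>%
  ungroup()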

5.0 Coarse Classing

To do coarse classing, it is necessary to go back to the fine classing reports generated previously, focusing solely on the short-listed variables.

One can use the fine classing reports directly for coarse classing. In this article, plots are used because they are easier to present. The plots for the short-listed variables are shown below one by one.
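
The object fine_class_final used in the plotting code below is presumably the result of re-running woebin on just the seven short-listed variables; a sketch of that assumed definition:

# assumed definition: fine classing restricted to the short-listed variables,
# so that the plots below appear in the same order as the discussion
fine_class_final <- woebin(dev,
                           y = "gb_flag",
                           x = c("age", "avg_deq_l3m", "pay_1_recode", "pay_4_recode",
                                 "pay_5_recode", "avg_util_l6m", "max_util_l6m"),
                           positive = 1,
                           method = "freq",
                           bin_num_limit = 20)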

The first plot is for age. Referring to the plot, the bad rate of this variable is U-shaped. A plausible explanation is that younger credit card holders are risky because of their lower income, while older credit card holders are also risky because they may have multiple loans, be in retirement, or incur heavy medical costs. For age, the coarse classing could be: a) <25, b) 25 to <45, c) 45 and above. Note that the third bin does not start from 46 as suggested by the plot; 45 is used simply to make the bound neater.

plot <- woebin_plot(fine_class_final)
plot[[1]]

avg_deq_l3m is straightforward. The bins with 44.1% and 45.1% bad rates are grouped together while the others remain unchanged.

plot[[2]]

No further classing is necessary for pay_1_recode.

plot[[3]]

No further classing is necessary for pay_4_recode.

plot[[4]]

No further classing is necessary for pay_5_recode.

plot[[5]]

Utilization rates are expected to be monotonic in bad rates. That is, higher utilization rates are associated with higher bad rates, and vice versa. Yet this expectation is not consistent with what is shown in the plot of avg_util_l6m. In the plot, it is those with lower utilization rates (i.e. the first few bins) that have higher bad rates.

For this variable, the coarse classing could be: a) <45%, b) 45% to <83%, and c) 83% and above. Note that minor tweaks are applied to the bounds of the bins. (Refer to the fine class report for the details of the bins.)

plot[[6]]

The situation of max_util_l6m is similar to that of avg_util_l6m. The coarse classing could be: a) <43%, b) 43% to <100%, and c) 100% and above. Minor tweaks are applied to the bounds of the bins.

plot[[7]]

The bounds resulting from the coarse classing of all short-listed variables can be saved into a list for future use.

breaks_list <- list(age = c("25", "45"), 
                    avg_deq_l3m = c("0.67",  "2"), 
                    pay_1_recode = c("1", "2"), 
                    pay_4_recode = c("1"), 
                    pay_5_recode = c("2"),
                    avg_util_l6m = c("0.45", "0.83"),
                    max_util_l6m = c("0.43", "1.00")
                    )

Caution: What is to be done next is important.

As opposed to other applications, credit scorecards in financial institutions are usually developed to predict “good” instead of “bad” (i.e. default).

The reason for predicting “good” is that credit scores are used as a risk-ranking tool: creditworthy customers (i.e. good customers) are given high scores, which is easy to communicate to non-technical audiences.

Given the aforementioned reason, the code needs to be changed slightly. That is, positive = 1 needs to be set to positive = 0. This will not change the IV values or the magnitudes of the bins’ WOE values; however, the signs of the WOE values will flip. The WOE values obtained with positive = 0 are the values that will be used to develop the credit scorecard.

# the seven short-listed variables from Section 4.4
short_var_list <- c("age", "avg_deq_l3m", "pay_1_recode", "pay_4_recode",
                    "pay_5_recode", "avg_util_l6m", "max_util_l6m")

# set positive = 0
coarse_class <- woebin(dev, 
                       y = "gb_flag", 
                       x = short_var_list, 
                       positive = 0, 
                       method = "freq", 
                       breaks_list = breaks_list) # from coarse classing results
## [INFO] creating woe binning ...
# transform variables' values into WOE values
dev_woe <- woebin_ply(dev, coarse_class)  
## [INFO] converting into woe values ...

6.0 Regression Analysis

The ground is now set for developing a credit scorecard. The technique, as widely documented in the literature, is based on logistic regression. To obtain a parsimonious logistic regression, one approach is the stepwise method, which seeks to minimize the AIC by allowing variables to enter and exit iteratively.

Each method has its own pros and cons, which will not be discussed here. Regardless of which method is used, it should be viewed as a tool that keeps the task manageable for the analyst. A significant amount of manual adjustment and judgment is still needed to arrive at the final credit scorecard.

The code below performs a stepwise logistic regression analysis. The stepwise method produces a logistic regression with 6 variables (max_util_l6m is excluded). All 6 variables are significant at the 5% level and all coefficients are positive. To keep things simple, the resulting logistic regression is taken as final.

# logistic regression
logistic <- glm(I(gb_flag == 0) ~ .,  # set the event as good, e.g. gb_flag==0
                family = binomial(),  # binomial is logistic regression
                data = dev_woe %>% select(-id)) # remove id column 
                                                                
# invoke stepwise regression based on AIC 
logistic_step <- step(logistic, direction = "both", trace = FALSE)

# print output
summary(logistic_step) 
## 
## Call:
## glm(formula = I(gb_flag == 0) ~ pay_2_recode + pay_3_recode + 
##     pay_6_recode + avg_deq_l6m + max_deq_l6m + min_deq_l3m + 
##     min_deq_l6m + bill_amt2_util + bill_amt5_util + max_util_l3m + 
##     age_woe + pay_1_recode_woe + pay_4_recode_woe + pay_5_recode_woe + 
##     avg_util_l6m_woe, family = binomial(), data = dev_woe %>% 
##     select(-id))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3131   0.4434   0.4973   0.5407   2.0459  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       1.03541    0.06094  16.992  < 2e-16 ***
## pay_2_recode     -0.29873    0.04104  -7.280 3.34e-13 ***
## pay_3_recode     -0.33725    0.04501  -7.492 6.77e-14 ***
## pay_6_recode     -0.41495    0.04833  -8.586  < 2e-16 ***
## avg_deq_l6m       1.37781    0.18145   7.593 3.11e-14 ***
## max_deq_l6m      -0.23049    0.03691  -6.245 4.23e-10 ***
## min_deq_l3m       0.12436    0.06516   1.908   0.0563 .  
## min_deq_l6m       0.12222    0.07234   1.689   0.0911 .  
## bill_amt2_util   -0.34741    0.16290  -2.133   0.0330 *  
## bill_amt5_util    0.61995    0.10781   5.751 8.89e-09 ***
## max_util_l3m      0.36507    0.14629   2.495   0.0126 *  
## age_woe           0.62331    0.14728   4.232 2.31e-05 ***
## pay_1_recode_woe  0.89252    0.03171  28.146  < 2e-16 ***
## pay_4_recode_woe  0.42746    0.05038   8.484  < 2e-16 ***
## pay_5_recode_woe  0.44321    0.05262   8.423  < 2e-16 ***
## avg_util_l6m_woe  0.83584    0.14955   5.589 2.28e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25433  on 23999  degrees of freedom
## Residual deviance: 21175  on 23984  degrees of freedom
## AIC: 21207
## 
## Number of Fisher Scoring iterations: 4

The output from the VIF analysis does not suggest any multi-collinearity issue.

# generate VIF 
vif(logistic_step, merge_coef = TRUE)
##             variable   Estimate Std. Error z value Pr(>|z|)      gvif
##  1:      (Intercept)  1.0354118     0.0609 16.9918   0.0000        NA
##  2:     pay_2_recode -0.2987315     0.0410 -7.2799   0.0000  5.052485
##  3:     pay_3_recode -0.3372513     0.0450 -7.4923   0.0000  5.744389
##  4:     pay_6_recode -0.4149456     0.0483 -8.5855   0.0000  5.477218
##  5:      avg_deq_l6m  1.3778062     0.1814  7.5935   0.0000 52.220875
##  6:      max_deq_l6m -0.2304948     0.0369 -6.2452   0.0000  6.205415
##  7:      min_deq_l3m  0.1243624     0.0652  1.9085   0.0563  5.841826
##  8:      min_deq_l6m  0.1222177     0.0723  1.6894   0.0911  3.302808
##  9:   bill_amt2_util -0.3474052     0.1629 -2.1326   0.0330 14.702017
## 10:   bill_amt5_util  0.6199534     0.1078  5.7506   0.0000  4.938752
## 11:     max_util_l3m  0.3650659     0.1463  2.4954   0.0126 12.842162
## 12:          age_woe  0.6233068     0.1473  4.2321   0.0000  1.015499
## 13: pay_1_recode_woe  0.8925226     0.0317 28.1461   0.0000  3.232373
## 14: pay_4_recode_woe  0.4274605     0.0504  8.4841   0.0000  3.394157
## 15: pay_5_recode_woe  0.4432148     0.0526  8.4231   0.0000  3.429983
## 16: avg_util_l6m_woe  0.8358393     0.1495  5.5891   0.0000  6.765641

7.0 Scorecard Creation, Scaling, and Validation

Lastly, with the regression obtained, a credit scorecard can be created and a validation exercise can be conducted.

The scaling used in this article is as follows (the scaling arithmetic is sketched after the list):

  • at a score of 500, the good:bad odds are 100; and
  • the points to double the odds (PDO) are 30. Moreover, the base score (i.e. the intercept of the regression) is distributed evenly across the 6 variables.
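
For reference, a minimal sketch of the standard scaling arithmetic implied by these parameters (the scorecard function performs this internally):

# score = offset + factor * ln(good:bad odds)
points0 <- 500
odds0   <- 100
pdo     <- 30

factor <- pdo / log(2)                    # about 43.3 points per doubling of the odds
offset <- points0 - factor * log(odds0)   # about 300.7

# e.g. a customer with good:bad odds of 200 would score roughly 530
offset + factor * log(200)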

The package scorecard can generate a full-blown report that contains variables statistics, score distribution, scaling information, performance indicators and validation results.

Caution: In the code below, PDO has to be set to -30 because positive = 0. Also, since positive = 0, anything labeled as good in the report actually refers to bad, and anything labeled as bad actually refers to good. Likewise, bad rate should be interpreted as good rate.

## scorecard creation

# select the variables in the final regression
dev_final <- dev %>%
  select(id, gb_flag, age, avg_deq_l3m, pay_1_recode, pay_4_recode, pay_5_recode, avg_util_l6m)

var_select <- c("age", "avg_deq_l3m", "pay_1_recode", "pay_4_recode", "pay_5_recode", "avg_util_l6m")

breaks_list <- list(age = c("25", "45"), 
                    avg_deq_l3m = c("0.67",  "2"), 
                    pay_1_recode = c("1", "2"), 
                    pay_4_recode = c("1"), 
                    pay_5_recode = c("2"),
                    avg_util_l6m = c("0.45", "0.83")
                    )

bins <- woebin(dev_final, 
               y = "gb_flag", 
               x = var_select,
               positive = 0, 
               method = "freq", 
               breaks_list = breaks_list)                                                                                            
## [INFO] creating woe binning ...
score_card <- scorecard(bins, 
                       logistic_step, 
                       points0 = 500, 
                       pdo = -30, # PDO must be set as negative   
                       odds0 = 100, 
                       basepoints_eq0 = TRUE)                                                                                   
# display results
score_card 
## $basepoints
##      variable bin woe points
## 1: basepoints  NA  NA      0
## 
## $pay_2_recode
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,good,bad...
## 
## $pay_3_recode
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,good,bad...
## 
## $pay_6_recode
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,good,bad...
## 
## $avg_deq_l6m
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,good,bad...
## 
## $max_deq_l6m
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,good,bad...
## 
## $min_deq_l3m
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,good,bad...
## 
## $min_deq_l6m
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,good,bad...
## 
## $bill_amt2_util
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,good,bad...
## 
## $bill_amt5_util
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,good,bad...
## 
## $max_util_l3m
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,good,bad...
## 
## $age
##    variable       bin count count_distr good   bad   badprob         woe
## 1:      age [-Inf,25)  2168  0.09033333  588  1580 0.7287823 -0.26366705
## 2:      age   [25,45) 17639  0.73495833 3711 13928 0.7896139  0.07047959
## 3:      age [45, Inf)  4193  0.17470833 1037  3156 0.7526830 -0.13914675
##         bin_iv   total_iv breaks is_special_values points
## 1: 0.006734044 0.01382605     25             FALSE     16
## 2: 0.003579285 0.01382605     45             FALSE     25
## 3: 0.003512724 0.01382605    Inf             FALSE     19
## 
## $pay_1_recode
##        variable      bin count count_distr good   bad   badprob        woe
## 1: pay_1_recode [-Inf,1) 18504    0.771000 2567 15937 0.8612732  0.5737853
## 2: pay_1_recode    [1,2)  2943    0.122625  995  1948 0.6619096 -0.5803045
## 3: pay_1_recode [2, Inf)  2553    0.106375 1774   779 0.3051312 -2.0751013
##        bin_iv  total_iv breaks is_special_values points
## 1: 0.21391743 0.8648336      1             FALSE     45
## 2: 0.04764138 0.8648336      2             FALSE      1
## 3: 0.60327484 0.8648336    Inf             FALSE    -57
## 
## $pay_4_recode
##        variable      bin count count_distr good   bad   badprob        woe
## 1: pay_4_recode [-Inf,1) 21158   0.8815833 3806 17352 0.8201153  0.2650088
## 2: pay_4_recode [1, Inf)  2842   0.1184167 1530  1312 0.4616467 -1.4058353
##        bin_iv  total_iv breaks is_special_values points
## 1: 0.05735741 0.3616306      1             FALSE     28
## 2: 0.30427319 0.3616306    Inf             FALSE     -3
## 
## $pay_5_recode
##        variable      bin count count_distr good   bad   badprob        woe
## 1: pay_5_recode [-Inf,2) 21595   0.8997917 3992 17603 0.8151424  0.2316568
## 2: pay_5_recode [2, Inf)  2405   0.1002083 1344  1061 0.4411642 -1.4885586
##        bin_iv  total_iv breaks is_special_values points
## 1: 0.04517924 0.3354879      2             FALSE     27
## 2: 0.29030861 0.3354879    Inf             FALSE     -6
## 
## $avg_util_l6m
##        variable         bin count count_distr good   bad   badprob        woe
## 1: avg_util_l6m [-Inf,0.45) 14337   0.5973750 2621 11716 0.8171863  0.2452793
## 2: avg_util_l6m [0.45,0.83)  6103   0.2542917 1546  4557 0.7466820 -0.1711267
## 3: avg_util_l6m [0.83, Inf)  3560   0.1483333 1169  2391 0.6716292 -0.5365572
##         bin_iv   total_iv breaks is_special_values points
## 1: 0.033490587 0.09009968   0.45             FALSE     32
## 2: 0.007798286 0.09009968   0.83             FALSE     17
## 3: 0.048810811 0.09009968    Inf             FALSE      4
# compute score 
score <- scorecard_ply(dev_final, score_card, only_total_score = F)

The validation report will be saved in the working directory.

## validation

# select the variables from OOT sample
oot_final <- oot %>%
  select(id, gb_flag, age, avg_deq_l3m, pay_1_recode, pay_4_recode, pay_5_recode, avg_util_l6m)

# generate reports
# good in the report is actually bad, whereas bad is actually good
report(list(dt1 = dev_final, dt2 = oot_final), 
       y = "gb_flag", 
       x = var_select, 
       breaks_list = breaks_list, 
       seed = NULL, 
       basepoints_eq0 = TRUE, 
       method = "freq", 
       positive = 0, 
       points0 = 500, 
       odds0 = 100,
       pdo = -30)
## [INFO] sheet1-dataset information
## [INFO] sheet2-model coefficients
## [INFO] sheet3-model performance

## [INFO] sheet4-variable woe binning

## [INFO] sheet5-scorecard
## [INFO] sheet6-population stability

## [INFO] sheet7-gains table
## [INFO] The report is saved as report_20201111_095120.xlsx

8.0 Conclusion

This article has shown that a credit scorecard can be developed with ease in R using the package scorecard. The caveat lies in the data set: the data set used in this article is small and clean and, therefore, does not require much cleaning and manipulation. The number of new variables that can be generated from it is also limited.