This article is aimed at two groups of readers. The first group has learned the underlying theory of credit scorecards but has yet to see a detailed application of that theory. The second group is practitioners who are already well versed in both the theory and the applications, but who would like to explore developing credit scorecards using an open-source tool such as R.
1.0 Introduction
The methodologies and processes for developing credit scorecards for lending purposes have long been held as the intellectual property of financial institutions and consulting firms.
Even though several books have been published recently to shed light on credit scorecards, practical guidance and demonstrations of the hands-on aspects of credit scorecard development remain in short supply. This could be due to the following reasons:
Availability of data is a common hindrance. Due to regulations and customers’ privacy, financial institutions’ data cannot be shared with external parties. Even in the case of credit bureaus, data is subject to strict data protection regulations and is therefore not available for research purposes. As a result, it is not surprising that step-by-step demonstrations of credit scorecard development are absent not only from published books but also from research papers (to the best of my knowledge); and
The software used to develop credit scorecards can be another obstacle. Different analysts, financial institutions and consulting firms use different software, and code written for one package is generally not compatible with another. Therefore, selecting a particular package for demonstrating credit scorecard development immediately reduces the demonstration’s appeal due to the lack of compatibility. On the other hand, using multiple packages to demonstrate the development process would make the scope unmanageable.
This article seeks to fill the gap by providing a practical, hands-on demonstration of how to develop a credit scorecard, as close to reality as possible. To that end, two decisions were made to address the aforementioned challenges: A) a real, publicly available (albeit small) credit data set is used, and B) the open-source software R is adopted.
The reasons for choosing R are twofold:
R is widely used among data science practitioners and it is free. In addition, documentation on R is readily available and of good quality. Since R is popular, using it will not reduce the appeal of the demonstrations shown in this article. Non-R users can download R at no cost to run the R code provided in this article.
A credit scorecard development package, called scorecard, has been built in R by Shichen Xie. This wonderful package makes credit scorecard development a breeze and will be used throughout this article.
With the above, eager students can download R and the data set to learn credit scorecard development. For analysts who currently use other software in their development work, this article can be taken as a showcase of how an open-source tool can be used reliably to build credit scorecards. Having a free alternative to consider and to explore is always an attractive business case.
Finally, it is worth noting that this article does not seek to produce a powerful credit scorecard, given the rather small data set used and the need to keep the length of this article manageable. It is also assumed that the reader has minimal or no knowledge of R (comments are added in the code to assist non-R users). Nonetheless, basic knowledge of credit scorecard theory is expected.
Note: Xie has prepared slides showing how a credit scorecard can be built using his package. The slides, however, were written in Chinese, and the data set used therein is not what an analyst would likely encounter in a real-world situation.
2.0 R and Packages
Other than the data set, which will be described in Section 3.0, the following are needed for this article:
Base R: Base R can be downloaded from CRAN;
RStudio: This is the IDE for R and can be downloaded for free from RStudio.com;
Required R packages: Install these packages via RStudio (select the option to install all dependencies during package installation).
- scorecard - for credit scorecards development;
- tidyverse - for data import and manipulations;
- Hmisc - for summary statistics generation;
- ClustOfVar - for variables clustering.
A quick introduction on how to install the software and how to navigate RStudio can be found in the YouTube video How to install R and install R Studio.
3.0 Data
This section describes the data set as well as the techniques used to process the data set.
A potential question the reader may raise concerns the data size R can handle. This is a legitimate question, as it is well known that R loads data into memory (i.e. RAM), which could be an issue if the data set is very large. There are at least two ways to overcome this:
Pre-process the data as much as possible using relational database software before loading it into R. R has “connectors” that allow it to link to relational database servers (e.g. SQL Server or MySQL). Via a connector, SQL commands can be submitted from R to the server for data manipulation purposes. Once in R, use data.table (instead of data.frame) to handle the data set; or
Have access to cloud computing services such as AWS and Azure.
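For illustration only, here is a minimal sketch of the first route. The DSN "my_dsn", the table credit_raw and the filter condition are assumptions, not part of this article's workflow:
# sketch: pre-process in the database, pull the result into R, then switch to data.table
library(DBI)          # database interface
library(data.table)   # memory-efficient data handling
con <- dbConnect(odbc::odbc(), dsn = "my_dsn")   # connect via an ODBC data source (illustrative)
raw <- dbGetQuery(con, "SELECT * FROM credit_raw WHERE obs_month >= '2020-01'")
dbDisconnect(con)
dt <- as.data.table(raw)                         # work with a data.table from here on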
3.1 The Data Set
The data set analyzed in this article is sourced from Yeh, I. C., & Lien, C. H. (2009) and can be downloaded from the UCI Machine Learning Repository. Don’t download it now; the download will be done using R.
The data set contains 30,000 credit card holders in Taiwan. What makes this data set interesting is that, in addition to the usual demographic information, the payments, outstanding balances and delinquency statuses of the credit card holders were tracked over a period of time (6 months). This feature makes the data set similar to what an analyst would encounter in a real credit scorecard development project.
In total, there are 24 variables in the data set. The columns (i.e. variables) are neatly labeled and therefore largely self-explanatory. Further remarks will be made in this article should the need arise. Nonetheless, interested readers can find information on the columns’ attributes on the UCI Machine Learning Repository page.
Note: As the name of the default indicator default in the next month suggests, a credit scorecard built on this data set will be used to predict whether a credit card holder will default in the next month. In the banking industry, predicting default over the next 12 months is the norm.
3.2 Import Data into R
The data set from the UCI Machine Learning Repository page has an extra header row in row 1 of the spreadsheet. It is crucial to delete this row from the spreadsheet. The cleaned-up version of the data set is stored on GitHub.
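Alternatively, for readers who prefer to work from the original UCI spreadsheet rather than the cleaned csv, the extra row can be skipped at import time. The sketch below assumes the .xls file has already been downloaded to the working directory; the file name is illustrative, and the resulting column-name formatting may differ slightly from the csv version used in this article.
# read the original UCI spreadsheet, skipping the extra header row (illustrative file name)
library(readxl)
data_from_xls <- read_excel("default_of_credit_card_clients.xls", skip = 1)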
library(tidyverse) loads the tidyverse package, which will be used for data processing and manipulation. read.csv loads the csv data set into R as an object called data. Examine carefully the first few rows of data, generated by running head(data), to ensure that the data set is imported correctly. str(data) gives a summary of the structure of the data set, which can be used in conjunction with head(data) for checking.
# load the tidyverse package. this only needs to be run once
library(tidyverse)
# download the file from Github. note the path uses forward slashes
data <- read.csv("https://github.com/ngyongkad/scorecard/blob/main/Import%20Credit%20Card%20Defaults.csv?raw=true")
# print out the first few lines of the data set
head(data)
# print out the structure of the data set
str(data)
## ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6
## 1 1 20000 2 2 1 24 2 2 -1 -1 -2 -2
## 2 2 120000 2 2 2 26 -1 2 0 0 0 2
## 3 3 90000 2 2 2 34 0 0 0 0 0 0
## 4 4 50000 2 2 1 37 0 0 0 0 0 0
## 5 5 50000 1 2 1 57 -1 0 -1 0 0 0
## 6 6 50000 1 1 2 37 0 0 0 0 0 0
## BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2
## 1 3913 3102 689 0 0 0 0 689
## 2 2682 1725 2682 3272 3455 3261 0 1000
## 3 29239 14027 13559 14331 14948 15549 1518 1500
## 4 46990 48233 49291 28314 28959 29547 2000 2019
## 5 8617 5670 35835 20940 19146 19131 2000 36681
## 6 64400 57069 57608 19394 19619 20024 2500 1815
## PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default.payment.next.month
## 1 0 0 0 0 1
## 2 1000 1000 0 2000 1
## 3 1000 1000 1000 5000 0
## 4 1200 1100 1069 1000 0
## 5 10000 9000 689 679 0
## 6 657 1000 1000 800 0
## 'data.frame': 30000 obs. of 25 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ LIMIT_BAL : int 20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
## $ SEX : int 2 2 2 2 1 1 1 2 2 1 ...
## $ EDUCATION : int 2 2 2 2 2 1 1 2 3 3 ...
## $ MARRIAGE : int 1 2 2 1 1 2 2 2 1 2 ...
## $ AGE : int 24 26 34 37 57 37 29 23 28 35 ...
## $ PAY_0 : int 2 -1 0 0 -1 0 0 0 0 -2 ...
## $ PAY_2 : int 2 2 0 0 0 0 0 -1 0 -2 ...
## $ PAY_3 : int -1 0 0 0 -1 0 0 -1 2 -2 ...
## $ PAY_4 : int -1 0 0 0 0 0 0 0 0 -2 ...
## $ PAY_5 : int -2 0 0 0 0 0 0 0 0 -1 ...
## $ PAY_6 : int -2 2 0 0 0 0 0 -1 0 -1 ...
## $ BILL_AMT1 : int 3913 2682 29239 46990 8617 64400 367965 11876 11285 0 ...
## $ BILL_AMT2 : int 3102 1725 14027 48233 5670 57069 412023 380 14096 0 ...
## $ BILL_AMT3 : int 689 2682 13559 49291 35835 57608 445007 601 12108 0 ...
## $ BILL_AMT4 : int 0 3272 14331 28314 20940 19394 542653 221 12211 0 ...
## $ BILL_AMT5 : int 0 3455 14948 28959 19146 19619 483003 -159 11793 13007 ...
## $ BILL_AMT6 : int 0 3261 15549 29547 19131 20024 473944 567 3719 13912 ...
## $ PAY_AMT1 : int 0 0 1518 2000 2000 2500 55000 380 3329 0 ...
## $ PAY_AMT2 : int 689 1000 1500 2019 36681 1815 40000 601 0 0 ...
## $ PAY_AMT3 : int 0 1000 1000 1200 10000 657 38000 0 432 0 ...
## $ PAY_AMT4 : int 0 1000 1000 1100 9000 1000 20239 581 1000 13007 ...
## $ PAY_AMT5 : int 0 0 1000 1069 689 1000 13750 1687 1000 1122 ...
## $ PAY_AMT6 : int 0 2000 5000 1000 679 800 13770 1542 1000 0 ...
## $ default.payment.next.month: int 1 1 0 0 0 0 0 0 0 0 ...
3.3 Data Exploration
Once the data set has been imported correctly, it is always good practice to generate summary statistics to understand the data. This can be done using the package Hmisc as shown below. describe(data) gives very detailed descriptive statistics on all of the variables in the data set. This report should be reviewed thoroughly to detect unusual values or anything that is out of line with expectations.
library(Hmisc)
describe(data)
## data
##
## 3 Variables 30000 Observations
## --------------------------------------------------------------------------------
## PAY_0
## n missing distinct Info Mean Gmd .05 .10
## 30000 0 11 0.871 -0.0167 1.159 -2 -1
## .25 .50 .75 .90 .95
## -1 0 0 2 2
##
## lowest : -2 -1 0 1 2, highest: 4 5 6 7 8
##
## Value -2 -1 0 1 2 3 4 5 6 7 8
## Frequency 2759 5686 14737 3688 2667 322 76 26 11 9 19
## Proportion 0.092 0.190 0.491 0.123 0.089 0.011 0.003 0.001 0.000 0.000 0.001
## --------------------------------------------------------------------------------
## PAY_2
## n missing distinct Info Mean Gmd .05 .10
## 30000 0 11 0.843 -0.1338 1.214 -2 -2
## .25 .50 .75 .90 .95
## -1 0 0 2 2
##
## lowest : -2 -1 0 1 2, highest: 4 5 6 7 8
##
## Value -2 -1 0 1 2 3 4 5 6 7 8
## Frequency 3782 6050 15730 28 3927 326 99 25 12 20 1
## Proportion 0.126 0.202 0.524 0.001 0.131 0.011 0.003 0.001 0.000 0.001 0.000
## --------------------------------------------------------------------------------
## PAY_3
## n missing distinct Info Mean Gmd .05 .10
## 30000 0 11 0.843 -0.1662 1.209 -2 -2
## .25 .50 .75 .90 .95
## -1 0 0 2 2
##
## lowest : -2 -1 0 1 2, highest: 4 5 6 7 8
##
## Value -2 -1 0 1 2 3 4 5 6 7 8
## Frequency 4085 5938 15764 4 3819 240 76 21 23 27 3
## Proportion 0.136 0.198 0.525 0.000 0.127 0.008 0.003 0.001 0.001 0.001 0.000
## --------------------------------------------------------------------------------
From the report above (due to the large volume of output, only a few variables are shown), two observations are worth noting:
All variables have no missing values, which is rare in a real-world situation. In practice, missing values always warrant attention and investigation, because their existence could be due to underlying business processes rather than simply information not being captured. Variables with an excessive share of genuinely missing values should be discarded (a common benchmark is more than 50% of values missing; a quick check is sketched below); and
The “PAY_x” variables represent delinquency status. The first delinquency variable is “PAY_0”, followed by “PAY_2”; in other words, “PAY_1” is omitted. In addition, for the delinquency status variables in this data set, “-1” signifies that a customer was current on the payment. However, from the report generated in R, it can be observed that the delinquency status variables also contain the values “0” and “-2”, which is unexpected.
Point 2 will be addressed in the section below.
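On point 1, although nothing is missing in this data set, the 50% benchmark is easy to check in code; a minimal sketch:
# share of missing values per variable (all zeros for this data set)
miss_rate <- colMeans(is.na(data))
# variables breaching the 50% benchmark, worst first (empty here)
sort(miss_rate[miss_rate > 0.5], decreasing = TRUE)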
3.4 Data Manipulation and Variables Generation
3.4.1 Manipulations
A few things need to be accomplished in this section. First, rename all variables to lower case for easier typing. Second, rename “pay_0” to “pay_1”. Third, rename the default indicator “default payment next month” to the shorter name “gb_flag”, where “gb” stands for good/bad. Fourth, in all the “pay_x” variables, recode the values “-2” and “-1” into “0”, where “0” signifies being current on payment. Finally, a cross-tab is run to check that the recoding has been done correctly.
data <- data %>%
# rename the variables to lower case
rename_with(str_to_lower, everything()) %>%
# rename "pay_0" to "pay_1" and "default payment next month" to "gb_flag"
rename("pay_1" = "pay_0", "gb_flag" = "default.payment.next.month")
# set up a function to recode delinquency status
new_deq <- function(x) {
ifelse(x %in% c(-2, -1), 0, x)
}
# recode delinquency status
data <- data %>%
mutate(across(pay_1:pay_6, new_deq, .names = "{col}_recode"))
# quick check on the recoding
table(data$pay_1, data$pay_1_recode)
##
## 0 1 2 3 4 5 6 7 8
## -2 2759 0 0 0 0 0 0 0 0
## -1 5686 0 0 0 0 0 0 0 0
## 0 14737 0 0 0 0 0 0 0 0
## 1 0 3688 0 0 0 0 0 0 0
## 2 0 0 2667 0 0 0 0 0 0
## 3 0 0 0 322 0 0 0 0 0
## 4 0 0 0 0 76 0 0 0 0
## 5 0 0 0 0 0 26 0 0 0
## 6 0 0 0 0 0 0 11 0 0
## 7 0 0 0 0 0 0 0 9 0
## 8 0 0 0 0 0 0 0 0 19
3.4.2 Variables Generation
Except for some specific variables such as demographic information, raw variables in a data set are normally not used directly in credit scorecards. In this article, new variables are created from the delinquency statuses, limit, and outstanding balances. To keep the scope simple, no new variables are generated from the payment amounts. However, using the ideas and code provided below, the reader should have no problem creating their own variables from the payment amounts (a small sketch follows this paragraph). Variable generation is limited only by imagination and creativity; given the same data set, different analysts could generate different variables.
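For instance, a payment-amount counterpart of the variables created below could look like the following sketch. The names avg_pay_l3m and avg_pay_l6m are illustrative and are not used elsewhere in this article.
## payment amount (illustrative only)
# identify the position of the payment amount variables
pay_pos_l3m <- which(str_detect(names(data), "pay_amt(1|2|3)$"))
pay_pos_l6m <- which(str_detect(names(data), "pay_amt(1|2|3|4|5|6)$"))
# average payment over the last 3 and 6 months
pay_vars <- data %>%
  mutate(
    avg_pay_l3m = apply(data[, pay_pos_l3m], 1, mean, na.rm = TRUE),
    avg_pay_l6m = apply(data[, pay_pos_l6m], 1, mean, na.rm = TRUE)
  )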
## Delinquency status
# un-needed "pay_x" variables are removed
var_data <- data %>%
select(-(pay_1:pay_6))
# identify the position of the variables
deq_pos_l3m <- which(str_detect(names(var_data), "pay_(1|2|3)_recode"))
deq_pos_l6m <- which(str_detect(names(var_data), "pay_(1|2|3|4|5|6)_recode"))
# generate new variables
var_data <- var_data %>%
mutate(
# average delinquency
avg_deq_l3m = apply(var_data[, deq_pos_l3m], 1, mean, na.rm = TRUE),
avg_deq_l6m = apply(var_data[, deq_pos_l6m], 1, mean, na.rm = TRUE),
# max delinquency
max_deq_l3m = apply(var_data[, deq_pos_l3m], 1, max, na.rm = TRUE),
max_deq_l6m = apply(var_data[, deq_pos_l6m], 1, max, na.rm = TRUE),
# min delinquency
min_deq_l3m = apply(var_data[, deq_pos_l3m], 1, min, na.rm = TRUE),
min_deq_l6m = apply(var_data[, deq_pos_l6m], 1, min, na.rm = TRUE)
)
## utilization rate
# set up a function to compute utilization rate
uti_rate <- function(x) {
ifelse(x < 0, 0, x / var_data$limit_bal)
}
# compute utilization rate
var_data <- var_data %>%
mutate(across(contains("bill_amt"), .fns = list(util = uti_rate)))
# identify the position of the variables
util_pos_l3m <- which(str_detect(names(var_data), "bill_amt(1|2|3)_util"))
util_pos_l6m <- which(str_detect(names(var_data), "bill_amt(1|2|3|4|5|6)_util"))
# generate new variables
var_data <- var_data %>%
mutate(
# average utilization
avg_util_l3m = apply(var_data[, util_pos_l3m], 1, mean, na.rm = TRUE),
avg_util_l6m = apply(var_data[, util_pos_l6m], 1, mean, na.rm = TRUE),
# max utilization
max_util_l3m = apply(var_data[, util_pos_l3m], 1, max, na.rm = TRUE),
max_util_l6m = apply(var_data[, util_pos_l6m], 1, max, na.rm = TRUE),
# min utilization
min_util_l3m = apply(var_data[, util_pos_l3m], 1, min, na.rm = TRUE),
min_util_l6m = apply(var_data[, util_pos_l6m], 1, min, na.rm = TRUE)
)
3.4.3 Sampling
Having generated the new variables, the full data set can now be divided into a development/training sample and a validation sample.
In a real credit scorecard development project, the validation sample is an out-of-time (OOT) sample, which must be taken from a different time period than the development sample. With this data set, however, splitting the data in that way is not possible. Instead, 80% of the full population is randomly selected as the development sample, and the remaining 20% serves as the validation sample.
# set the seed for the random sample
set.seed(1234)
# randomly select from the id column
dev_ind <- sample(var_data$id, 24000, replace = FALSE)
# use randomly selected ids to form the development sample
dev <- var_data[dev_ind,]
# use the non-selected ids to form the validation sample
oot <- var_data[-dev_ind,]
4.0 Univariate Analysis
4.1 Fine Classing
Fine classing is a technique that groups a variable’s values into a number of fine bins. Using these bins, a measure of the variable’s predictive power, known as the information value (IV), can be computed. From these fine bins, further grouping can then be carried out to produce the coarse classing. As will be shown in the section below, the bins from coarse classing are the ones used in the credit scorecard.
Specifically, the fine classing technique used in this article is called the frequency method in the package scorecard. This technique groups a variable’s values into bins containing (approximately) equal percentages of observations. It is a commonly used technique (see the note below). Following industry practice, the maximum number of fine bins is set at 20.
Note:
- There are other binning techniques available in the package scorecard. Interested readers are referred to its manual.
- The number of fine bins depends on the distribution of the variable. Some variables have values concentrated in a few points; in such cases, the number of fine bins will turn out to be fewer than 20.
- The frequency method in the package scorecard works with numeric variables only. It is therefore necessary to convert or recode character variables (e.g. sex, education, etc.) into numeric variables beforehand (as the collectors of the data set used here had done).
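For readers who want to see the arithmetic behind the IV figures reported below, here is a minimal sketch computing WOE and IV by hand from made-up bin counts. It follows the sign convention of the positive = 1 output shown later in this article (high-bad-rate bins carry positive WOE); it is not the package's internal code.
# hand computation of WOE and IV from hypothetical bin counts
bin_tbl <- tibble(
  bin  = c("bin1", "bin2", "bin3"),
  good = c(4000, 3000, 1000),
  bad  = c(400, 600, 1000)
)
bin_tbl %>%
  mutate(
    dist_good = good / sum(good),
    dist_bad  = bad / sum(bad),
    woe       = log(dist_bad / dist_good),    # WOE of each bin
    bin_iv    = (dist_bad - dist_good) * woe  # each bin's contribution to IV
  ) %>%
  summarise(total_iv = sum(bin_iv))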
From this point on, the package scorecard will be used.
library(scorecard)
# select all the variables for IV computation
var_list <- dev %>%
select(-id, -gb_flag) %>%
names()
# invoke the woebin function in package scorecard
fine_class <- woebin(dev,
y = "gb_flag",
x = var_list,
positive = 1, # the value in gb_flag that indicates default
method = "freq", # frequency method
bin_num_limit = 20) # the max number of fine bins
## [INFO] creating woe binning ...
# collect the IV of the variables
iv <- map_df(fine_class, ~pluck(.x, 10, 1)) %>%
pivot_longer(everything(), names_to = "var", values_to = "iv")

| var | iv |
|---|---|
| limit_bal | 0.1785871 |
| sex | 0.0079849 |
| education | 0.0164957 |
| marriage | 0.0056892 |
| age | 0.0193498 |
| bill_amt1 | 0.0128997 |
| bill_amt2 | 0.0146458 |
| bill_amt3 | 0.0128501 |
| bill_amt4 | 0.0148755 |
| bill_amt5 | 0.0153958 |
| bill_amt6 | 0.0247551 |
| pay_amt1 | 0.1066663 |
| pay_amt2 | 0.0798784 |
| pay_amt3 | 0.0744712 |
| pay_amt4 | 0.0490983 |
| pay_amt5 | 0.0421035 |
| pay_amt6 | 0.0199621 |
| pay_1_recode | 0.8648336 |
| pay_2_recode | 0.5459762 |
| pay_3_recode | 0.4117610 |
| pay_4_recode | 0.3616306 |
| pay_5_recode | 0.3354879 |
| pay_6_recode | 0.2923127 |
| avg_deq_l3m | 0.8797259 |
| avg_deq_l6m | 0.3011544 |
| max_deq_l3m | 0.7979072 |
| max_deq_l6m | 0.7441613 |
| min_deq_l3m | 0.4711887 |
| min_deq_l6m | 0.0000000 |
| bill_amt1_util | 0.0156583 |
| bill_amt2_util | 0.0133279 |
| bill_amt3_util | 0.0128110 |
| bill_amt4_util | 0.0177728 |
| bill_amt5_util | 0.0182308 |
| bill_amt6_util | 0.0209823 |
| avg_util_l3m | 0.1136387 |
| avg_util_l6m | 0.1346854 |
| max_util_l3m | 0.0899689 |
| max_util_l6m | 0.1039890 |
| min_util_l3m | 0.0146800 |
| min_util_l6m | 0.0072010 |
The fine classing reports are contained in fine_class and can be viewed by typing fine_class at the console and running it. Due to the large number of reports, only the first one is shown below.
## $limit_bal
## variable bin count count_distr good bad badprob woe
## 1: limit_bal [-Inf,30000) 1990 0.08291667 1272 718 0.3608040 0.6802441
## 2: limit_bal [30000,50000) 1477 0.06154167 936 541 0.3662830 0.7039240
## 3: limit_bal [50000,70000) 3358 0.13991667 2466 892 0.2656343 0.2352337
## 4: limit_bal [70000,100000) 2369 0.09870833 1774 595 0.2511608 0.1596895
## 5: limit_bal [100000,140000) 2452 0.10216667 1853 599 0.2442904 0.1228206
## 6: limit_bal [140000,160000) 1512 0.06300000 1228 284 0.1878307 -0.2120476
## 7: limit_bal [160000,200000) 1973 0.08220833 1623 350 0.1773948 -0.2819782
## 8: limit_bal [200000,210000) 1202 0.05008333 987 215 0.1788686 -0.2719118
## 9: limit_bal [210000,270000) 2725 0.11354167 2259 466 0.1710092 -0.3263717
## 10: limit_bal [270000,360000) 2258 0.09408333 1941 317 0.1403897 -0.5599366
## 11: limit_bal [360000,430000) 1448 0.06033333 1235 213 0.1470994 -0.5054139
## 12: limit_bal [430000, Inf) 1236 0.05150000 1090 146 0.1181230 -0.7582061
## bin_iv total_iv breaks is_special_values
## 1: 0.045171693 0.1785871 30000 FALSE
## 2: 0.036066804 0.1785871 50000 FALSE
## 3: 0.008242682 0.1785871 70000 FALSE
## 4: 0.002628082 0.1785871 100000 FALSE
## 5: 0.001593515 0.1785871 140000 FALSE
## 6: 0.002665805 0.1785871 160000 FALSE
## 7: 0.006024928 0.1785871 200000 FALSE
## 8: 0.003423424 0.1785871 210000 FALSE
## 9: 0.010999969 0.1785871 270000 FALSE
## 10: 0.024967126 0.1785871 360000 FALSE
## 11: 0.013268441 0.1785871 430000 FALSE
## 12: 0.023534628 0.1785871 Inf FALSE
Note: All of the generated fine classing reports can be saved to a csv file using the write.list function from the package erer. Interested readers are encouraged to explore this.
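A minimal sketch of such an export (the file name is illustrative; check ?write.list for the full set of arguments):
# export all fine classing tables into a single csv file
library(erer)
write.list(z = fine_class, file = "fine_class_reports.csv")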
4.2 Initial Variables Removal
As discussed in the section above, new variables are derived because not all raw variables should be used directly. Monetary variables such as balances, limits and payments are not advisable to use directly because the variances in their values are large, which can render a credit scorecard unstable. Instead, these variables are converted into ratios (e.g. utilization rates), which are bounded below by zero and whose maximum values should not be too far above 100%.
As also discussed above, variables with many missing values should be discarded. This does not apply in this article because no variables have missing values.
Lastly, variables with low predictive power as measured by IV (IV < 0.02) are also removed (a quick cross-check is sketched after the list below). However, since the IV of age is borderline and it is preferable to have some demographic variables in a credit scorecard, age is kept.
With the above, the following variables are removed:
- Low IV
- sex
- education
- marriage
- bill_amt1 to bill_amt5
- pay_amt6
- min_deq_l6m
- bill_amt1_util to bill_amt5_util
- min_util_l3m
- min_util_l6m
- Variables in amount
- limit_bal
- pay_amt1 to pay_amt5
- bill_amt6
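The low-IV removals above can be cross-checked by filtering the iv table computed in Section 4.1; a minimal sketch:
# variables whose fine-classing IV falls below the 0.02 cut-off, lowest first
iv %>%
  filter(iv < 0.02) %>%
  arrange(iv)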
# keep the relevant variables
dev <- dev %>%
  select(id, age, gb_flag:max_util_l6m, avg_deq_l3m:min_deq_l3m, bill_amt6_util)
4.3 Variables Clustering
An analyst frequently faces thousands of variables. Even after filtering out variables with a high share of missing values or low IVs, hundreds of variables may still remain.
In addition to the IV and missing-value criteria, variables clustering can be added to the variable reduction tool kit. The idea of variables clustering is to group similar variables into the same cluster; a few variables can then be selected from each cluster. An additional benefit of variables clustering is that it can help reduce multicollinearity among the variables.
While it is not really necessary in this article, as the number of variables is small, variables clustering is conducted for completeness. From the stability plot shown below, 4 clusters appear appropriate for the development sample. Hence, 4 clusters are specified using cutree. In the final output table below, the “x” column is the cluster assignment. The application of this output is shown in the section below.
Note: Variables clustering is computationally intensive. It could take hours to complete if the number of variables is huge.
library(ClustOfVar)
# perform variables clustering
tree <- dev %>%
select(-id, -gb_flag) %>%
hclustvar()
# stability plot
set.seed(345)
stab <- stability(tree, B = 30) # B is the number of bootstrap samples; set B at 30 to 50 to reduce computing time
boxplot(stab$matCR, main = "Dispersion of the adjusted Rand index")
# generate the final cluster output
clus <- cutree(tree, 4)

| variable | x |
|---|---|
| age | 1 |
| pay_1_recode | 2 |
| pay_2_recode | 2 |
| pay_3_recode | 2 |
| pay_4_recode | 3 |
| pay_5_recode | 3 |
| pay_6_recode | 3 |
| avg_deq_l3m | 2 |
| avg_deq_l6m | 2 |
| max_deq_l3m | 2 |
| max_deq_l6m | 2 |
| min_deq_l3m | 2 |
| min_deq_l6m | 3 |
| bill_amt1_util | 4 |
| bill_amt2_util | 4 |
| bill_amt3_util | 4 |
| bill_amt4_util | 4 |
| bill_amt5_util | 4 |
| bill_amt6_util | 4 |
| avg_util_l3m | 4 |
| avg_util_l6m | 4 |
| max_util_l3m | 4 |
| max_util_l6m | 4 |
4.4 Variables Reduction
One way to utilize the output from variables clustering is to combine it with the IVs, as follows:
| var | clus | iv |
|---|---|---|
| age | 1 | 0.0193498 |
| avg_deq_l3m | 2 | 0.8797259 |
| pay_1_recode | 2 | 0.8648336 |
| max_deq_l3m | 2 | 0.7979072 |
| max_deq_l6m | 2 | 0.7441613 |
| pay_2_recode | 2 | 0.5459762 |
| min_deq_l3m | 2 | 0.4711887 |
| pay_3_recode | 2 | 0.4117610 |
| avg_deq_l6m | 2 | 0.3011544 |
| pay_4_recode | 3 | 0.3616306 |
| pay_5_recode | 3 | 0.3354879 |
| pay_6_recode | 3 | 0.2923127 |
| min_deq_l6m | 3 | 0.0000000 |
| avg_util_l6m | 4 | 0.1346854 |
| avg_util_l3m | 4 | 0.1136387 |
| max_util_l6m | 4 | 0.1039890 |
| max_util_l3m | 4 | 0.0899689 |
| bill_amt6_util | 4 | 0.0209823 |
| bill_amt5_util | 4 | 0.0182308 |
| bill_amt4_util | 4 | 0.0177728 |
| bill_amt1_util | 4 | 0.0156583 |
| bill_amt2_util | 4 | 0.0133279 |
| bill_amt3_util | 4 | 0.0128110 |
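A table like the one above can be produced by joining the cluster assignments with the IV table; a sketch (the article's table may have been assembled differently):
# join the cluster assignments (from cutree) with the IVs and sort within clusters
clus_iv <- enframe(clus, name = "var", value = "clus") %>%
  inner_join(iv, by = "var") %>%
  arrange(clus, desc(iv))
clus_iv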
In cluster 1, there is only one variable, so age is short-listed. In cluster 2, the first 2 variables, avg_deq_l3m and pay_1_recode, are selected because they have the highest IVs in the cluster. Likewise, for cluster 3, pay_4_recode and pay_5_recode are chosen. Finally, in cluster 4, the IVs of the first 2 variables are quite close and both are average-utilization variables. Therefore, instead of choosing the first 2 variables, the first (avg_util_l6m) and the third (max_util_l6m) are short-listed.
Note: The process presented above is one of many ways of using variables clustering. Different analysts use variables clustering differently.
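As a purely mechanical starting point, before the manual adjustments described above, the top variables by IV within each cluster can also be pulled out programmatically; a sketch building on the clus_iv object from the previous snippet:
# top 2 variables by IV within each cluster
clus_iv %>%
  group_by(clus) %>%
  slice_max(order_by = iv, n = 2) %>%
  ungroup()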
5.0 Coarse Classing
To do coarse classing, it is necessary to go back to the fine classing reports generated previously, focusing solely on the short-listed variables.
One can use the fine classing reports directly for coarse classing. In this article, plots are used instead because they are easier to present. The plots are shown below, one by one, for the short-listed variables.
The first plot is for age. Referring to the plot, the bad rate of this variable is U-shaped. A plausible explanation is that younger credit card holders are risky because of their lower incomes, while older credit card holders are also risky because they may hold multiple loans, be in retirement, or incur heavy medical costs. For age, the coarse classing could be: a) <25, b) 25 to <45, c) 45 and above. Note that the third bin does not start from 46 as suggested by the plot; 45 is used simply to make the bound neater.
plot <- woebin_plot(fine_class_final) # fine classing results of the short-listed variables
plot[[1]]
avg_deq_l3m is straightforward. The bins with 44.1% and 45.1% bad rates are grouped together, while the others remain unchanged.
plot[[2]]
No further classing is necessary for pay_1_recode.
plot[[3]]
No further classing is necessary for pay_4_recode.
plot[[4]]
No further classing is necessary for pay_5_recode.
plot[[5]]
Utilization rates are expected to be monotonic in bad rates; that is, higher utilization rates are associated with higher bad rates, and vice versa. Yet this expectation is not consistent with what is shown in the plot of avg_util_l6m, where it is those with lower utilization rates (i.e. the first few bins) who have the higher bad rates.
For this variable, the coarse classing could be: a) <45%, b) 45% to <83%, and c) 83% and above. Note that minor tweaks are applied to the bounds of the bins. (Refer to the fine class report for the details of the bins.)
plot[[6]]
The situation for max_util_l6m is similar to that of avg_util_l6m. The coarse classing could be: a) <43%, b) 43% to <100%, and c) 100% and above. Minor tweaks are applied to the bounds of the bins.
plot[[7]]
The bounds resulting from the coarse classing of all the short-listed variables can be saved into a list for later use.
breaks_list <- list(age = c("25", "45"),
avg_deq_l3m = c("0.67", "2"),
pay_1_recode = c("1", "2"),
pay_4_recode = c("1"),
pay_5_recode = c("2"),
avg_util_l6m = c("0.45", "0.83"),
max_util_l6m = c("0.43", "1.00")
)
Caution: What is to be done next is important.
As opposed to other applications, credit scorecards in financial institutions are usually developed to predict “good” rather than “bad” (i.e. default).
The reason for predicting “good” is that credit scores are used as a risk-ranking tool: creditworthy (i.e. good) customers are given high scores, which is easy to communicate to non-technical audiences.
For this reason, the code needs to be changed slightly: positive = 1 needs to be set to positive = 0. This will not change the IV values or the magnitude of the WOE values of the bins; only the sign of the WOE values will flip. The WOE values obtained with positive = 0 are the ones used to develop the credit scorecard.
# set positive = 0
# short_var_list holds the names of the short-listed variables from Section 4
short_var_list <- c("age", "avg_deq_l3m", "pay_1_recode", "pay_4_recode",
                    "pay_5_recode", "avg_util_l6m", "max_util_l6m")
coarse_class <- woebin(dev,
                       y = "gb_flag",
                       x = short_var_list,
                       positive = 0,
                       method = "freq",
                       breaks_list = breaks_list) # from coarse classing results
## [INFO] creating woe binning ...
# transform variables' values into WOE values
dev_woe <- woebin_ply(dev, coarse_class)
## [INFO] converting into woe values ...
6.0 Regression Analysis
The ground is now set for developing a credit scorecard. The technique, as widely documented in the literature, is based on logistic regression. To obtain a parsimonious logistic regression, one approach is to use the stepwise method, which seeks to minimize the AIC by allowing variables to enter and exit iteratively.
Each selection method has its own pros and cons, which will not be discussed here. Regardless of which method is used, it should be viewed as a tool that helps the analyst keep the task manageable. A significant amount of manual adjustment and judgment is still needed to arrive at the final credit scorecard.
The code below performs the stepwise logistic regression. The stepwise method produces a logistic regression with 6 variables (max_util_l6m is excluded). All 6 variables are significant at the 5% level and all coefficients are positive. To keep things simple, the resulting logistic regression is taken as final.
# logistic regression
logistic <- glm(I(gb_flag == 0) ~ ., # set the event as good, i.e. gb_flag == 0
                family = binomial(), # the binomial family gives logistic regression
data = dev_woe %>% select(-id)) # remove id column
# invoke stepwise regression based on AIC
logistic_step <- step(logistic, direction = "both", trace = FALSE)
# print output
summary(logistic_step)
##
## Call:
## glm(formula = I(gb_flag == 0) ~ pay_2_recode + pay_3_recode +
## pay_6_recode + avg_deq_l6m + max_deq_l6m + min_deq_l3m +
## min_deq_l6m + bill_amt2_util + bill_amt5_util + max_util_l3m +
## age_woe + pay_1_recode_woe + pay_4_recode_woe + pay_5_recode_woe +
## avg_util_l6m_woe, family = binomial(), data = dev_woe %>%
## select(-id))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3131 0.4434 0.4973 0.5407 2.0459
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.03541 0.06094 16.992 < 2e-16 ***
## pay_2_recode -0.29873 0.04104 -7.280 3.34e-13 ***
## pay_3_recode -0.33725 0.04501 -7.492 6.77e-14 ***
## pay_6_recode -0.41495 0.04833 -8.586 < 2e-16 ***
## avg_deq_l6m 1.37781 0.18145 7.593 3.11e-14 ***
## max_deq_l6m -0.23049 0.03691 -6.245 4.23e-10 ***
## min_deq_l3m 0.12436 0.06516 1.908 0.0563 .
## min_deq_l6m 0.12222 0.07234 1.689 0.0911 .
## bill_amt2_util -0.34741 0.16290 -2.133 0.0330 *
## bill_amt5_util 0.61995 0.10781 5.751 8.89e-09 ***
## max_util_l3m 0.36507 0.14629 2.495 0.0126 *
## age_woe 0.62331 0.14728 4.232 2.31e-05 ***
## pay_1_recode_woe 0.89252 0.03171 28.146 < 2e-16 ***
## pay_4_recode_woe 0.42746 0.05038 8.484 < 2e-16 ***
## pay_5_recode_woe 0.44321 0.05262 8.423 < 2e-16 ***
## avg_util_l6m_woe 0.83584 0.14955 5.589 2.28e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 25433 on 23999 degrees of freedom
## Residual deviance: 21175 on 23984 degrees of freedom
## AIC: 21207
##
## Number of Fisher Scoring iterations: 4
The output from the VIF analysis does not suggest any multicollinearity issues.
# generate VIF
vif(logistic_step, merge_coef = TRUE)
## variable Estimate Std. Error z value Pr(>|z|) gvif
## 1: (Intercept) 1.0354118 0.0609 16.9918 0.0000 NA
## 2: pay_2_recode -0.2987315 0.0410 -7.2799 0.0000 5.052485
## 3: pay_3_recode -0.3372513 0.0450 -7.4923 0.0000 5.744389
## 4: pay_6_recode -0.4149456 0.0483 -8.5855 0.0000 5.477218
## 5: avg_deq_l6m 1.3778062 0.1814 7.5935 0.0000 52.220875
## 6: max_deq_l6m -0.2304948 0.0369 -6.2452 0.0000 6.205415
## 7: min_deq_l3m 0.1243624 0.0652 1.9085 0.0563 5.841826
## 8: min_deq_l6m 0.1222177 0.0723 1.6894 0.0911 3.302808
## 9: bill_amt2_util -0.3474052 0.1629 -2.1326 0.0330 14.702017
## 10: bill_amt5_util 0.6199534 0.1078 5.7506 0.0000 4.938752
## 11: max_util_l3m 0.3650659 0.1463 2.4954 0.0126 12.842162
## 12: age_woe 0.6233068 0.1473 4.2321 0.0000 1.015499
## 13: pay_1_recode_woe 0.8925226 0.0317 28.1461 0.0000 3.232373
## 14: pay_4_recode_woe 0.4274605 0.0504 8.4841 0.0000 3.394157
## 15: pay_5_recode_woe 0.4432148 0.0526 8.4231 0.0000 3.429983
## 16: avg_util_l6m_woe 0.8358393 0.1495 5.5891 0.0000 6.765641
7.0 Scorecard Creation, Scaling, and Validation
Lastly, with the regression obtained, a credit scorecard can be created and a validation exercise can be conducted.
The scaling used in this article is:
- at a score of 500, the good:bad odds are 100; and
- the points to double the odds (PDO) is 30. Moreover, the base score (i.e. the intercept of the regression) is distributed evenly across the 6 variables.
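For reference, the standard scorecard scaling arithmetic implied by these settings can be reproduced by hand; a minimal sketch (not the package's internal code):
# score = offset + factor * ln(good:bad odds)
points0 <- 500                              # anchor score
odds0   <- 100                              # good:bad odds at the anchor score
pdo     <- 30                               # points to double the odds
scaling_factor <- pdo / log(2)
offset         <- points0 - scaling_factor * log(odds0)
# a customer with good:bad odds of 200 (double the anchor odds) scores pdo points higher
offset + scaling_factor * log(200)          # = 530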
The package scorecard can generate a full-blown report that contains variables statistics, score distribution, scaling information, performance indicators and validation results.
Caution: In the code below, PDO has to be set to -30 because positive = 0. And since positive = 0, anything labeled as good in the report actually refers to bad, and anything labeled as bad actually refers to good. Likewise, the bad rate should be interpreted as the good rate.
## scorecard creation
# select the variables in the final regression
dev_final <- dev %>%
select(id, gb_flag, age, avg_deq_l3m, pay_1_recode, pay_4_recode, pay_5_recode, avg_util_l6m)
var_select <- c("age", "avg_deq_l3m", "pay_1_recode", "pay_4_recode", "pay_5_recode", "avg_util_l6m")
breaks_list <- list(age = c("25", "45"),
avg_deq_l3m = c("0.67", "2"),
pay_1_recode = c("1", "2"),
pay_4_recode = c("1"),
pay_5_recode = c("2"),
avg_util_l6m = c("0.45", "0.83")
)
bins <- woebin(dev_final,
y = "gb_flag",
x = var_select,
positive = 0,
method = "freq",
               breaks_list = breaks_list)
## [INFO] creating woe binning ...
score_card <- scorecard(bins,
logistic_step,
points0 = 500,
pdo = -30, # PDO must be set as negative
odds0 = 100,
basepoints_eq0 = TRUE)
# display results
score_card
## $basepoints
## variable bin woe points
## 1: basepoints NA NA 0
##
## $pay_2_recode
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,good,bad...
##
## $pay_3_recode
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,good,bad...
##
## $pay_6_recode
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,good,bad...
##
## $avg_deq_l6m
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,good,bad...
##
## $max_deq_l6m
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,good,bad...
##
## $min_deq_l3m
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,good,bad...
##
## $min_deq_l6m
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,good,bad...
##
## $bill_amt2_util
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,good,bad...
##
## $bill_amt5_util
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,good,bad...
##
## $max_util_l3m
## Empty data.table (0 rows and 13 cols): variable,bin,count,count_distr,good,bad...
##
## $age
## variable bin count count_distr good bad badprob woe
## 1: age [-Inf,25) 2168 0.09033333 588 1580 0.7287823 -0.26366705
## 2: age [25,45) 17639 0.73495833 3711 13928 0.7896139 0.07047959
## 3: age [45, Inf) 4193 0.17470833 1037 3156 0.7526830 -0.13914675
## bin_iv total_iv breaks is_special_values points
## 1: 0.006734044 0.01382605 25 FALSE 16
## 2: 0.003579285 0.01382605 45 FALSE 25
## 3: 0.003512724 0.01382605 Inf FALSE 19
##
## $pay_1_recode
## variable bin count count_distr good bad badprob woe
## 1: pay_1_recode [-Inf,1) 18504 0.771000 2567 15937 0.8612732 0.5737853
## 2: pay_1_recode [1,2) 2943 0.122625 995 1948 0.6619096 -0.5803045
## 3: pay_1_recode [2, Inf) 2553 0.106375 1774 779 0.3051312 -2.0751013
## bin_iv total_iv breaks is_special_values points
## 1: 0.21391743 0.8648336 1 FALSE 45
## 2: 0.04764138 0.8648336 2 FALSE 1
## 3: 0.60327484 0.8648336 Inf FALSE -57
##
## $pay_4_recode
## variable bin count count_distr good bad badprob woe
## 1: pay_4_recode [-Inf,1) 21158 0.8815833 3806 17352 0.8201153 0.2650088
## 2: pay_4_recode [1, Inf) 2842 0.1184167 1530 1312 0.4616467 -1.4058353
## bin_iv total_iv breaks is_special_values points
## 1: 0.05735741 0.3616306 1 FALSE 28
## 2: 0.30427319 0.3616306 Inf FALSE -3
##
## $pay_5_recode
## variable bin count count_distr good bad badprob woe
## 1: pay_5_recode [-Inf,2) 21595 0.8997917 3992 17603 0.8151424 0.2316568
## 2: pay_5_recode [2, Inf) 2405 0.1002083 1344 1061 0.4411642 -1.4885586
## bin_iv total_iv breaks is_special_values points
## 1: 0.04517924 0.3354879 2 FALSE 27
## 2: 0.29030861 0.3354879 Inf FALSE -6
##
## $avg_util_l6m
## variable bin count count_distr good bad badprob woe
## 1: avg_util_l6m [-Inf,0.45) 14337 0.5973750 2621 11716 0.8171863 0.2452793
## 2: avg_util_l6m [0.45,0.83) 6103 0.2542917 1546 4557 0.7466820 -0.1711267
## 3: avg_util_l6m [0.83, Inf) 3560 0.1483333 1169 2391 0.6716292 -0.5365572
## bin_iv total_iv breaks is_special_values points
## 1: 0.033490587 0.09009968 0.45 FALSE 32
## 2: 0.007798286 0.09009968 0.83 FALSE 17
## 3: 0.048810811 0.09009968 Inf FALSE 4
# compute score
score <- scorecard_ply(dev_final, score_card, only_total_score = F)
The validation report will be saved in the working directory.
## validation
# select the variables from OOT sample
oot_final <- oot %>%
select(id, gb_flag, age, avg_deq_l3m, pay_1_recode, pay_4_recode, pay_5_recode, avg_util_l6m)
# generate reports
# good in the report is actually bad, whereas bad is actually good
report(list(dt1 = dev_final, dt2 = oot_final),
y = "gb_flag",
x = var_select,
breaks_list = breaks_list,
seed = NULL,
basepoints_eq0 = TRUE,
method = "freq",
positive = 0,
points0 = 500,
odds0 = 100,
pdo = -30)## [INFO] sheet1-dataset information
## [INFO] sheet2-model coefficients
## [INFO] sheet3-model performance
## [INFO] sheet4-variable woe binning
## [INFO] sheet5-scorecard
## [INFO] sheet6-population stability
## [INFO] sheet7-gains table
## [INFO] The report is saved as report_20201111_095120.xlsx
8.0 Conclusion
This article has shown that a credit scorecard can be developed with ease in R using the package scorecard. The caveat lies in the data set: the one used in this article is small and clean, and therefore does not require much cleaning and manipulation, and the number of new variables that can be generated from it is also limited.