Introduction

The aim of the project is to perform a classification on a data set and to compare the quality of the models and methods used.

The structure of the project is as follows. The first part presents description of data and problem analyzed (initial descriptive analysis of the data, variable transformations, variable selection methods). The second part contains training / test data division, resampling (e.g. cross-validation). The third part covers comparison of prediction accuracy of at least 3 different machine learning techniques (Decision Tree, Random Forest, XGBoost). In order to compare the obtained results, a logistic regression is constructed as a Benchmark model. Finally, the ensembling procedure is used. The last part presents summary and conclusions.

Libraries

In order to realize presented aim of the project useful packages should be installed.¶

Data and problem description

The German scoring data is about persons, their characteristics and also include the variable responsible for credit risk (variable: default= {(1-default; 0-non-default). The data is collected by Professor Dr. Hans Hofmann from Institut fur Statistik und Okonometrie from Universitat Hamburg. The data is found on UCI Machine Learning Repository:

https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)

Number of Observations: 1000
Number of Attributes: 20 (7 numerical, 13 categorical)

Attribute description for the German scoring data:

Attribute 1: (qualitative) Status of existing checking account
– A11 : … < 0 DM
– A12 : 0 <= … < 200 DM
– A13 : … >= 200 DM / salary assignments for at least 1 year
– A14 : no checking account
Attribute 2: (numerical) Duration in month
Attribute 3: (qualitative) Credit history
– A30 : no credits taken/ all credits paid back duly
– A31 : all credits at this bank paid back duly
– A32 : existing credits paid back duly till now
– A33 : delay in paying off in the past
– A34 : critical account/other credits existing (not at this bank)
Attribute 4: (qualitative) Purpose
– A40 : car (new)
– A41 : car (used)
– A42 : furniture/equipment
– A43 : radio/television
– A44 : domestic appliances
– A45 : repairs
– A46 : education
– A47 : (vacation - does not exist?)
– A48 : retraining
– A49 : business
– A410 : others
Attribute 5: (numerical) Credit amount
Attribute 6: (qualitative) Savings account/bonds
– A61 : … < 100 DM
– A62 : 100 <= … < 500 DM
– A63 : 500 <= … < 1000 DM
– A64 : .. >= 1000 DM
– A65 : unknown/ no savings account
Attribute 7: (qualitative) Present employment since
– A71 : unemployed
– A72 : … < 1 year
– A73 : 1 <= … < 4 years
– A74 : 4 <= … < 7 years
– A75 : .. >= 7 years
Attribute 8: (numerical) Installment rate in percentage of disposable income
Attribute 9: (qualitative) Personal status and sex
– A91 : male : divorced/separated
– A92 : female : divorced/separated/married
– A93 : male : single
– A94 : male : married/widowed
– A95 : female : single
Attribute 10: (qualitative) Other debtors / guarantors
– A101 : none
– A102 : co-applicant
– A103 : guarantor
Attribute 11: (numerical) Present residence since
Attribute 12: (qualitative) Property
– A121 : real estate – A122 : if not A121 : building society savings agreement/ life insurance
– A123 : if not A121/A122 : car or other, not in attribute 6
– A124 : unknown / no property
Attribute 13: (numerical) Age in years
Attribute 14: (qualitative) Other installment plans
– A141 : bank
– A142 : stores
– A143 : none
Attribute 15: (qualitative) Housing
– A151 : rent
– A152 : own
– A153 : for free
Attribute 16: (numerical) Number of existing credits at this bank
Attribute 17: (qualitative) Job
– A171 : unemployed/ unskilled - non-resident
– A172 : unskilled - resident
– A173 : skilled employee / official
– A174 : management/ self-employed/ highly qualified employee/ officer
Attribute 18: (numerical) Number of people being liable to provide maintenance for
Attribute 19: (qualitative) Telephone
– A191 : none
– A192 : yes, registered under the customers name
Attribute 20: (qualitative) foreign worker
– A201 : yes
– A202 : no

Exploratory data analysis

Initially, let’s summarize the data with the use of in-built r functions.

credit <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")
summary(credit)

##       V1                  V2            V3                 V4           
##  Length:1000        Min.   : 4.0   Length:1000        Length:1000       
##  Class :character   1st Qu.:12.0   Class :character   Class :character  
##  Mode  :character   Median :18.0   Mode  :character   Mode  :character  
##                     Mean   :20.9                                        
##                     3rd Qu.:24.0                                        
##                     Max.   :72.0                                        
##        V5             V6                 V7                  V8       
##  Min.   :  250   Length:1000        Length:1000        Min.   :1.000  
##  1st Qu.: 1366   Class :character   Class :character   1st Qu.:2.000  
##  Median : 2320   Mode  :character   Mode  :character   Median :3.000  
##  Mean   : 3271                                         Mean   :2.973  
##  3rd Qu.: 3972                                         3rd Qu.:4.000  
##  Max.   :18424                                         Max.   :4.000  
##       V9                V10                 V11            V12           
##  Length:1000        Length:1000        Min.   :1.000   Length:1000       
##  Class :character   Class :character   1st Qu.:2.000   Class :character  
##  Mode  :character   Mode  :character   Median :3.000   Mode  :character  
##                                        Mean   :2.845                     
##                                        3rd Qu.:4.000                     
##                                        Max.   :4.000                     
##       V13            V14                V15                 V16       
##  Min.   :19.00   Length:1000        Length:1000        Min.   :1.000  
##  1st Qu.:27.00   Class :character   Class :character   1st Qu.:1.000  
##  Median :33.00   Mode  :character   Mode  :character   Median :1.000  
##  Mean   :35.55                                         Mean   :1.407  
##  3rd Qu.:42.00                                         3rd Qu.:2.000  
##  Max.   :75.00                                         Max.   :4.000  
##      V17                 V18            V19                V20           
##  Length:1000        Min.   :1.000   Length:1000        Length:1000       
##  Class :character   1st Qu.:1.000   Class :character   Class :character  
##  Mode  :character   Median :1.000   Mode  :character   Mode  :character  
##                     Mean   :1.155                                        
##                     3rd Qu.:1.000                                        
##                     Max.   :2.000                                        
##       V21     
##  Min.   :1.0  
##  1st Qu.:1.0  
##  Median :1.0  
##  Mean   :1.3  
##  3rd Qu.:2.0  
##  Max.   :2.0

str(credit)

## 'data.frame':    1000 obs. of  21 variables:
##  $ V1 : chr  "A11" "A12" "A14" "A11" ...
##  $ V2 : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ V3 : chr  "A34" "A32" "A34" "A32" ...
##  $ V4 : chr  "A43" "A43" "A46" "A42" ...
##  $ V5 : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ V6 : chr  "A65" "A61" "A61" "A61" ...
##  $ V7 : chr  "A75" "A73" "A74" "A74" ...
##  $ V8 : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ V9 : chr  "A93" "A92" "A93" "A93" ...
##  $ V10: chr  "A101" "A101" "A101" "A103" ...
##  $ V11: int  4 2 3 4 4 4 4 2 4 2 ...
##  $ V12: chr  "A121" "A121" "A121" "A122" ...
##  $ V13: int  67 22 49 45 53 35 53 35 61 28 ...
##  $ V14: chr  "A143" "A143" "A143" "A143" ...
##  $ V15: chr  "A152" "A152" "A152" "A153" ...
##  $ V16: int  2 1 1 1 2 1 1 1 1 2 ...
##  $ V17: chr  "A173" "A173" "A172" "A173" ...
##  $ V18: int  1 1 2 2 2 2 1 1 1 1 ...
##  $ V19: chr  "A192" "A191" "A191" "A191" ...
##  $ V20: chr  "A201" "A201" "A201" "A201" ...
##  $ V21: int  1 2 1 1 2 1 1 1 1 2 ...

# lets count missings in every column
colSums(is.na(credit)) %>% 
  sort()

##  V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
## V21 
##   0

any(is.na(credit))

## [1] FALSE

# no missings

Data credit consists of 21 variables. Above mentioned variables do not include missing observations. We should mark that our variables do not include the readable name of the variables. Therefore we should give the name of the variables.

colnames(credit) <- c("checking_account", "duration", "credit_history", "purpose", "credit_amount", "saving_account", "employment", "installment_rate", "status_sex", "other_debtors", "residince", "property", "age", "install_plans", "housing", "number_credits","job", "number_people", "telephone", "foreign", "default")

Let’s analyze dependent variable - default - in our dataset.

Data about defaults

credit$default <- as.factor(ifelse(credit$default == '2', 1, 0))
prop.table(table(credit$default))

## 
##   0   1 
## 0.7 0.3

In our sample we have 30.0% entities in default.

Let’s analyze explanatory variables in our dataset.

Initial analysis of the variable

We should start with the preparation of the variables for further analysis. We start with the preparation of the nominal variables.

#---------------------------------------------------
# nominal variables
#---------------------------------------------------
# Storing qualitative variables as factors

# 1. credit_history
is.character(credit$credit_history)

## [1] TRUE

credit$credit_history <- factor(credit$credit_history,
                            levels = c("A30", "A31", "A32", "A33", "A34"),
                            labels = c("NoCredit", "CreditsBankPaids", "CreditsPaid", "CreditDelay", "CriticalAccount"))



table(credit$credit_history)

## 
##         NoCredit CreditsBankPaids      CreditsPaid      CreditDelay 
##               40               49              530               88 
##  CriticalAccount 
##              293

# it should be converted into a factor
credit$credit_history <- as.factor(credit$credit_history)
table(credit$credit_history)

## 
##         NoCredit CreditsBankPaids      CreditsPaid      CreditDelay 
##               40               49              530               88 
##  CriticalAccount 
##              293

levels(credit$credit_history)

## [1] "NoCredit"         "CreditsBankPaids" "CreditsPaid"      "CreditDelay"     
## [5] "CriticalAccount"

glimpse(credit$credit_history)

##  Factor w/ 5 levels "NoCredit","CreditsBankPaids",..: 5 3 5 3 4 3 3 3 3 5 ...

table_credit_history <- table(credit$credit_history, credit$default)
mosaicplot(table_credit_history, color = T)

CrossTable(credit$credit_history, credit$default, prop.c = F, prop.t = F, prop.chisq = F)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                       | credit$default 
## credit$credit_history |         0 |         1 | Row Total | 
## ----------------------|-----------|-----------|-----------|
##              NoCredit |        15 |        25 |        40 | 
##                       |     0.375 |     0.625 |     0.040 | 
## ----------------------|-----------|-----------|-----------|
##      CreditsBankPaids |        21 |        28 |        49 | 
##                       |     0.429 |     0.571 |     0.049 | 
## ----------------------|-----------|-----------|-----------|
##           CreditsPaid |       361 |       169 |       530 | 
##                       |     0.681 |     0.319 |     0.530 | 
## ----------------------|-----------|-----------|-----------|
##           CreditDelay |        60 |        28 |        88 | 
##                       |     0.682 |     0.318 |     0.088 | 
## ----------------------|-----------|-----------|-----------|
##       CriticalAccount |       243 |        50 |       293 | 
##                       |     0.829 |     0.171 |     0.293 | 
## ----------------------|-----------|-----------|-----------|
##          Column Total |       700 |       300 |      1000 | 
## ----------------------|-----------|-----------|-----------|
## 
##

#Pearson's Chi-squared test
chisq.test(credit$default, credit$credit_history)

## 
##  Pearson's Chi-squared test
## 
## data:  credit$default and credit$credit_history
## X-squared = 61.691, df = 4, p-value = 1.279e-12

Based on Chi squared test the null hypothesis is rejected that the differences in groups seem to be not statistically significant on 5% significance level.

Based on the presented graph and table it is noticed that exists the relation between the credit history and the level of default.

In case of existing credits paid back duly till now it is observed that 361 entities (68.1%) with non-default status and 169 entities (31.9%) with default status.

People with no credits taken/ all credits paid back duly are the most often vulnerable to being in default status.

Based on Chi squared test the null hypothesis is rejected that the differences in groups seem to be not statistically significant on 5% significance level.

In case of the variable purpose it is observed that purpose of taking a loan to buy radio/television 218 entities (77.9%) are in non-default status and 62 entities (22.1%) are in default status. Additionally there is need to mark that people taking a loan for education more often are in default status than other purpose to take the credit.

Based on Chi squared test the null hypothesis is rejected that the differences in groups seem to be not statistically significant on 5% significance level.

In case of the variable personal status and sex it is observed that single male 402 entities (73.4%) are in non-default status and 146 entities (26.6%) are in default status. Male with divorced/separated status more often are in default status in comparison to other status.

Based on Chi squared test the null hypothesis is rejected that the differences in groups seem to be not statistically significant on 5% significance level.

In case of other debtors / guarantors it is observed that the position “none” has 635 entities (70%) are in non-default status and 272 entities (30%) are in default status. Also other debtors / guarantors in position co-applicant more often are in default than in other positions.

Based on Chi squared test the null hypothesis is rejected that the differences in groups seem to be not statistically significant on 5% significance level.

In case of property it is observed that property with position car 230 entities (69.3%) are in non-default status and 102 entities (30.7%) are in default status. Additionally people with no property more often in default in comparison to other properties.

Based on Chi squared test the null hypothesis is rejected that the differences in groups seem to be not statistically significant on 5% significance level.

It is observed that in other installment plans for position “none” 590 entities (72.5%) are in non-default status and 224 entities (27.5%) are in default status. For other installment plans with position bank exists higher probability being in default status than in other positions.

Based on Chi squared test the null hypothesis is rejected that the differences in groups seem to be not statistically significant on 5% significance level.

In case of houses it is observed that own houses 527 entities (73.9%) are in non-default status and 186 entities (26.1%) are in default status. Additionally people with housing for free have high probability being default status in comparison to other positions.

Based on Chi squared test there is no grounds to reject the null hypothesis that the differences in groups seem to be not statistically significant on 5% significance level. This variable is removed.

In case of job it is observed that skilled employee / official 444 entities (70.5%) are in non-default status and 186 entities (29.5%) are in default status. People who work in management area have high probability of default than other type of jobs.

Based on Chi squared test there is no grounds to reject the null hypothesis that the differences in groups seem to be not statistically significant on 5% significance level. This variable is removed.

In case of having telephone it is observed that 409 (68.6%) people with telephone are in non-default status and 187 entities (31.4%) are in default status. People without telephone have high probability of default than people with telephone.

Based on Chi squared test the null hypothesis is rejected that the differences in groups seem to be not statistically significant on 5% significance level.

In case of foreign people it is observed that 667 (69.3%) entities are in non-default status and 296 entities (30.7%) are in default status. Foreign people have high probability of default than non-foreign people.

Then discrete ordered variables will be prepared.

#--------------------------------------------
#  ORDINAL variables
#--------------------------------------------
# 1. saving_account

table(credit$saving_account)

## 
## A61 A62 A63 A64 A65 
## 603 103  63  48 183

# we add the order of values
credit$saving_account<- factor(credit$saving_account,
                           # levels from lowest to highest    
                           levels = c("A65",
                                      "A61",
                                      "A62",
                                      "A63",
                                      "A64"),
                           ordered = TRUE) # ordinal

credit$saving_account <- factor(credit$saving_account,
                            levels = c("A65", "A61", "A62", "A63", "A64"),
                            labels = c("NoSavings", "<100", "100_500", "500_1000", ">=1000"))

# we check the table of frequencies
table(credit$saving_account)

## 
## NoSavings      <100   100_500  500_1000    >=1000 
##       183       603       103        63        48

# The levels are shown in an increasing order
levels(credit$saving_account)

## [1] "NoSavings" "<100"      "100_500"   "500_1000"  ">=1000"

glimpse(credit$saving_account)

##  Ord.factor w/ 5 levels "NoSavings"<"<100"<..: 1 2 2 2 2 1 4 2 5 2 ...

table_saving_account <- table(credit$saving_account, credit$default)
mosaicplot(table_saving_account, color = T)

CrossTable(credit$saving_account, credit$default, prop.c = F, prop.t = F, prop.chisq = F)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                       | credit$default 
## credit$saving_account |         0 |         1 | Row Total | 
## ----------------------|-----------|-----------|-----------|
##             NoSavings |       151 |        32 |       183 | 
##                       |     0.825 |     0.175 |     0.183 | 
## ----------------------|-----------|-----------|-----------|
##                  <100 |       386 |       217 |       603 | 
##                       |     0.640 |     0.360 |     0.603 | 
## ----------------------|-----------|-----------|-----------|
##               100_500 |        69 |        34 |       103 | 
##                       |     0.670 |     0.330 |     0.103 | 
## ----------------------|-----------|-----------|-----------|
##              500_1000 |        52 |        11 |        63 | 
##                       |     0.825 |     0.175 |     0.063 | 
## ----------------------|-----------|-----------|-----------|
##                >=1000 |        42 |         6 |        48 | 
##                       |     0.875 |     0.125 |     0.048 | 
## ----------------------|-----------|-----------|-----------|
##          Column Total |       700 |       300 |      1000 | 
## ----------------------|-----------|-----------|-----------|
## 
##

#Pearson's Chi-squared test
chisq.test(credit$default, credit$saving_account)

## 
##  Pearson's Chi-squared test
## 
## data:  credit$default and credit$saving_account
## X-squared = 36.099, df = 4, p-value = 2.761e-07

Based on Chi squared test the null hypothesis is rejected that the differences in groups seem to be not statistically significant on 5% significance level.

People with saving account less than 100 have high probability of default than others.

Based on Chi squared test the null hypothesis is rejected that the differences in groups seem to be not statistically significant on 5% significance level.

People with checking account less than 0 have high probability than others.

Based on Chi squared test the null hypothesis is rejected that the differences in groups seem to be not statistically significant on 5% significance level.

People employed in enterprises with less than 1 people have higher probability of default than others.

Based on Chi squared test there is no grounds to reject the null hypothesis that the differences in groups seem to be not statistically significant on 5% significance level.

In case of one child being liable to provide maintenance for has higher probability of default than two children being liable to provide maintenance for.

Based on Chi squared test there is no grounds to reject the null hypothesis that the differences in groups seem to be not statistically significant on 5% significance level. But if we consider significance level on 15% the null hypothesis is rejected that the differences in groups seem to be not statistically significant.

For 4% installment rate has higher probability of default than other installment rate.

Based on Chi squared test there is no grounds to reject the null hypothesis that the differences in groups seem to be not statistically significant on 5% significance level.

People with two present residences have higher probability of default than others.

Based on Chi squared test there is no grounds to reject the null hypothesis that the differences in groups seem to be not statistically significant on 5% significance level.

People with four number of credits have higher probability of default than others.

Next, we prepare the continuous explanatory variables.

#--------------------------------------
# - X - NUMERIC VARIABLES
#--------------------------------------
# To applied to all columns we use sapply()
sapply(credit, is.numeric)

## checking_account         duration   credit_history          purpose 
##            FALSE             TRUE            FALSE            FALSE 
##    credit_amount   saving_account       employment installment_rate 
##             TRUE            FALSE            FALSE            FALSE 
##       status_sex    other_debtors        residince         property 
##            FALSE            FALSE            FALSE            FALSE 
##              age    install_plans          housing   number_credits 
##             TRUE            FALSE            FALSE            FALSE 
##              job    number_people        telephone          foreign 
##            FALSE            FALSE            FALSE            FALSE 
##          default 
##            FALSE

# We filter out those  variables which are numeric
# We store their list in a vector
credit_numeric_vars <- 
  sapply(credit, is.numeric) %>% 
  which() %>% 
  names()
credit_numeric_vars

## [1] "duration"      "credit_amount" "age"

# We check how many unique values

sapply(credit[, credit_numeric_vars], 
       function(x) 
         unique(x) %>% 
         length()) %>% 
  # lets sort variables by increasing 
  # number of levels in the end
  sort()

##      duration           age credit_amount 
##            33            53           921

ggplot(credit, aes(x = duration)) +
  geom_bar() +
  facet_grid(default ~ ., )

by(credit$duration, credit$default, summary)

## credit$default: 0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   12.00   18.00   19.21   24.00   60.00 
## ------------------------------------------------------------ 
## credit$default: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   12.00   24.00   24.86   36.00   72.00

ggplot(credit, aes(x = age)) +
  geom_bar() +
  facet_grid(default ~ ., )

by(credit$age, credit$default, summary)

## credit$default: 0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.00   27.00   34.00   36.22   42.25   75.00 
## ------------------------------------------------------------ 
## credit$default: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.00   25.00   31.00   33.96   40.00   74.00

ggplot(credit, aes(x = credit_amount)) +
  geom_bar() +
  facet_grid(default ~ ., )

by(credit$credit_amount, credit$default, summary)

## credit$default: 0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     250    1376    2244    2985    3635   15857 
## ------------------------------------------------------------ 
## credit$default: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     433    1352    2574    3938    5142   18424

Duration for non-default status is 19 months and for default-status is 25 months. Average age people in default is 36 years and in non-default is 34 years. People with non-default status have 2985 units of credit amount and people with default status have 3938 units of credit amount on average level.

Let’s store their list in a vector as numeric.

credit_numeric_vars <- 
  sapply(credit, is.numeric) %>% 
  which() %>% 
  names()

credit_numeric_vars

## [1] "duration"      "credit_amount" "age"

sapply(credit[, credit_numeric_vars], 
       function(x) 
         unique(x) %>% 
         length()) %>% 
  # lets sort variables by increasing 
  # number of levels in the end
  sort()

##      duration           age credit_amount 
##            33            53           921

In order to verify collinearity correlation matrix between numeric variables is calculated.

#---------------------------------------------------
# mutually correlated (irrelevant) variables
#---------------------------------------------------
# we calculate correlations between variables to identify redundant features
credit_correlations <- 
  cor(credit[,credit_numeric_vars],
      use = "pairwise.complete.obs")

#credit_correlations

In order to avoid the problem of collinearity, there are no variables highly correlated.

# lets decrease the cutoff to 0.75
findCorrelation(credit_correlations,
                cutoff = 0.75,
                names = TRUE)

## character(0)

# We this variables excluded from the model

Information value

Let’s consider Information Value for each variables in our dataset.

save(list = c("credit"),
     file = "slides/credit.RData")
credit$default <- ifelse(credit$default == '1', 1, 0)
IV <- create_infotables(data = credit, y = "default", bins = 5, parallel = FALSE)
IV$Summary

##            Variable           IV
## 1  checking_account 6.660115e-01
## 3    credit_history 2.932335e-01
## 2          duration 2.183994e-01
## 6    saving_account 1.960096e-01
## 4           purpose 1.691951e-01
## 12         property 1.126383e-01
## 5     credit_amount 9.342540e-02
## 13              age 8.781228e-02
## 7        employment 8.643363e-02
## 15          housing 8.329343e-02
## 14    install_plans 5.761454e-02
## 9        status_sex 4.467068e-02
## 20          foreign 4.387741e-02
## 10    other_debtors 3.201932e-02
## 8  installment_rate 2.632209e-02
## 16   number_credits 1.326652e-02
## 17              job 8.762766e-03
## 19        telephone 6.377605e-03
## 11        residince 3.588773e-03
## 18    number_people 4.339223e-05

Information values for each variables are greater than 0.02 without number_credits, job, telephone, residince and number_people. These variables will be excluded from the model.

credit[c("number_credits", "job", "telephone", "residince", "number_people", "status_sex")] <- list(NULL)

Additionally status_sex is dropped because actually commercial banks do not use in models discrimination by sex.

Data division, modelling and results

The data is divided into training and test samples in proportions 50:50, because small sample is considered.

credit$default <- as.factor(ifelse(credit$default == 1, "Yes", "No"))
set.seed(123456)
d = sort(sample(nrow(credit), nrow(credit) * .5))
train <- credit[d, ]
test <- credit[-d, ]

Before producing the first tree, let us also examine frequencies of the target variable inside the training and the testing set.

prop.table(table(train$default))

## 
##   No  Yes 
## 0.73 0.27

prop.table(table(test$default))

## 
##   No  Yes 
## 0.67 0.33

Percent of defaults in train sample and test sample are approximately equal.

Comparison of prediction accuracy of at least 3 different techniques

Classification trees

Let us define the formula of the model.

model1.formula <- default ~ checking_account + duration + credit_history + 
  purpose + credit_amount + saving_account + employment + installment_rate  + 
  property + age + housing +other_debtors + install_plans + foreign

The tree is created on the basis of all chosen predictors.

credit.tree1 is constructed using the Gini Index split criterion.

save(list = c("train"),
     file = "slides/train.RData")
save(list = c("test"),
     file = "slides/test.RData")
credit.tree1 <- 
  rpart(model1.formula, # model formula
        data = train, # data
        method = "class") # type of the tree: classification
#credit.tree1
fancyRpartPlot(credit.tree1)

#summary(credit.tree1)

credit.tree2 is constructed using Information Gain which is based on the concept of entropy.

credit.tree2 <- 
  rpart(model1.formula,
        data = train,
        method = "class",
        parms = list(split = 'information'))

#credit.tree2
fancyRpartPlot(credit.tree2)

Obtained tree no 2 is quite identical as that obtained with Gini coefficient.

credit.tree3 is constructed based on stopping criteria: minsplit, minbucket, maxdepth.

credit.tree3 <- 
  rpart(model1.formula,
        data = train,
        method = "class",
        minsplit = 100, # minimum number of observations in the given node for
        #the considered split (default = 20)
        minbucket = 50, # minimum number of observations in the terminal nodes 
        # (default = round(minsplit/3))
        maxdepth = 10,  # maximum depth of the tree, assuming that first node 
        #(root node) is denoted by 0 (default = 30)
        
        cp = 0) # non-negative complexity parameter
fancyRpartPlot(credit.tree3)

Based on the criteria used, smaller tree is obtained.

credit.tree4 is constructed based on imposing no restrictions on the growth of trees.

credit.tree4 <- 
  rpart(model1.formula,
        data = train,
        method = "class",
        # we don't impose any restriction on the tree growth 
        cp = -1) # cp - complexity parameter

#credit.tree4
fancyRpartPlot(credit.tree4)

The resulting tree is unlimited (large) and unreadable. The results obtained, due to the possibility of allowing trees to grow to large sizes, lead to over-fit of trees and poor results in the test sample.

credit.tree4p is constructed based on pruning criteria. In the construction of the tree, the parameter cp with the least error is used.

opt <- which.min(credit.tree4$cptable[, "xerror"])
cp <- credit.tree4$cptable[opt, "CP"]
credit.tree4p <- 
  prune(credit.tree4, cp = cp)
fancyRpartPlot(credit.tree4p)

credit.tree5 is constructed based on choosing the range of cp values for the cross-validation process in relation to the AUC (ROC) values.

In obtaining the above tree, caret package is used.

Let us also define a summary functions which will report 5 standard measures of the model quality.

fiveStats <- function(...) c(twoClassSummary(...), 
                             defaultSummary(...))

ctrl_cv5 <- trainControl(method = "cv",
                         number = 5, 
                         # saving all predictions
                         savePredictions = "final",
                         summaryFunction = fiveStats,
                         classProbs = TRUE)

levels(train$default) <- c("Yes", "No")
modelLookup("rpart")

##   model parameter                label forReg forClass probModel
## 1 rpart        cp Complexity Parameter   TRUE     TRUE      TRUE

cp.grid <- expand.grid(cp = seq(0, 0.03, 0.001))
set.seed(123456789)
credit.tree5 <- 
  train(model1.formula,
        data = train, 
        method = "rpart", 
        metric = "ROC",
        trControl = ctrl_cv5,
        tuneGrid  = cp.grid)
credit.tree5

## CART 
## 
## 500 samples
##  14 predictor
##   2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 400, 400, 400, 400, 400 
## Resampling results across tuning parameters:
## 
##   cp     ROC        Sens       Spec       Accuracy  Kappa    
##   0.000  0.7017757  0.8739726  0.3481481  0.732     0.2425358
##   0.001  0.7017757  0.8739726  0.3481481  0.732     0.2425358
##   0.002  0.7017757  0.8739726  0.3481481  0.732     0.2425358
##   0.003  0.7017757  0.8739726  0.3481481  0.732     0.2425358
##   0.004  0.6979198  0.8876712  0.3407407  0.740     0.2557514
##   0.005  0.6968544  0.9068493  0.3333333  0.752     0.2745060
##   0.006  0.6968544  0.9068493  0.3333333  0.752     0.2745060
##   0.007  0.6970066  0.9068493  0.3259259  0.750     0.2667056
##   0.008  0.6970066  0.9068493  0.3259259  0.750     0.2667056
##   0.009  0.6970066  0.9068493  0.3259259  0.750     0.2667056
##   0.010  0.7078133  0.8904110  0.3555556  0.746     0.2769349
##   0.011  0.7078133  0.8904110  0.3555556  0.746     0.2769349
##   0.012  0.7078133  0.8904110  0.3555556  0.746     0.2769349
##   0.013  0.6988838  0.8876712  0.3629630  0.746     0.2812751
##   0.014  0.6988838  0.8876712  0.3629630  0.746     0.2812751
##   0.015  0.6988838  0.8876712  0.3629630  0.746     0.2812751
##   0.016  0.6988838  0.8876712  0.3629630  0.746     0.2812751
##   0.017  0.6988838  0.8876712  0.3629630  0.746     0.2812751
##   0.018  0.6988838  0.8876712  0.3629630  0.746     0.2812751
##   0.019  0.6883815  0.8821918  0.3777778  0.746     0.2885569
##   0.020  0.6883815  0.8821918  0.3777778  0.746     0.2885569
##   0.021  0.6883815  0.8821918  0.3777778  0.746     0.2885569
##   0.022  0.6850330  0.8794521  0.3481481  0.736     0.2539525
##   0.023  0.6850330  0.8794521  0.3481481  0.736     0.2539525
##   0.024  0.6887874  0.8931507  0.3185185  0.738     0.2433087
##   0.025  0.6960426  0.8904110  0.3185185  0.736     0.2391009
##   0.026  0.6960426  0.8904110  0.3185185  0.736     0.2391009
##   0.027  0.6960426  0.8904110  0.3185185  0.736     0.2391009
##   0.028  0.7029427  0.8986301  0.2962963  0.736     0.2247007
##   0.029  0.7029427  0.8986301  0.2962963  0.736     0.2247007
##   0.030  0.7029427  0.8986301  0.2962963  0.736     0.2247007
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.012.

All the necessary trees for verification are assessed. Predictions on the training set and on the test set are created.

The values of the ROC curve for training sample are calculated.

We present the ROC curve on the graph for the obtained models for training sample. Additionally, the Gini coefficients are presented. The higher the value of the Gini coefficient, the better the model classifies.

To sum up, in case of training sample the highest Gini coefficient for tree4 and tree5.

The same calculations are repeated for the testing sample. The values of the ROC curve for test sample are calculated.

The ROC curve is presented on the graph for the obtained models for test sample.

The models perform a bit worse on the test sample.tree4p model is more stable.

Then we indicate VARIABLE IMPORTANCE.

In order to answer the question, the rpart() function which relies on the overall measure of importance of the predictors (the sum of the quality of the divisions for which the predictor is used in the division rule).

tree4p.importance <- credit.tree4p$variable.importance
# margins c(bottom, left, top, right) 
# default c(5, 4, 4, 2) + 0.1.
par(mar = c(5.1, 6.1, 4.1, 2.1))
barplot(rev(tree4p.importance), # vector
        col = "blue",  # colors
        main = "importance of variables in model tree4p",
        horiz = T,  # horizontal type of plot
        las = 1,    # labels always horizontally 
        cex.names = 0.6)

In this case, five most important predictors are: checking_account, duration, credit_history, purpose, credit_amount.

In addition, the varImp() function from the caret package to determine the importance of the variables is used, which consists in determining the measure: average decrease in node impurity.

(tree4p.importance2 <- varImp(credit.tree4p))

##                    Overall
## age              17.284520
## checking_account 21.380564
## credit_amount    17.951667
## credit_history   26.135652
## duration         27.721887
## employment        1.854236
## installment_rate  5.628966
## other_debtors     5.099142
## property          4.722302
## purpose           7.357860
## saving_account    5.640125
## housing           0.000000
## install_plans     0.000000
## foreign           0.000000

tree4p.importance2 <- 
  tree4p.importance2 %>% 
  rownames_to_column("variable") %>%
  arrange(desc(Overall))

barplot(rev(tree4p.importance2$Overall), # wector
        names.arg = rev(tree4p.importance2$variable), # lables
        col = "blue",  # colors
        main = "importance of variables in tree4p",
        horiz = T,  # horizontal type of plot
        las = 1,    # labels always horizontally
        cex.names = 0.6)

Let us analyze the first 10 variables that are selected on the basis of two approaches.

intersect(tree4p.importance2$variable[1:10],
          names(tree4p.importance)[1:10])

## [1] "duration"         "credit_history"   "checking_account" "credit_amount"   
## [5] "age"              "purpose"          "saving_account"   "property"

Based on two approaches for choosing important variables it is observed that 8 of them are common, but in different order.

Objects with the models into separate rds files are saved.

credit.tree1  %>% saveRDS(here("output", "credit.tree1.rds"))
credit.tree2  %>% saveRDS(here("output", "credit.tree2.rds"))
credit.tree3  %>% saveRDS(here("output", "credit.tree3.rds"))
credit.tree4  %>% saveRDS(here("output", "credit.tree4.rds"))
credit.tree4p %>% saveRDS(here("output", "credit.tree4p.rds"))
credit.tree5  %>% saveRDS(here("output", "credit.tree5.rds"))

Random Forest

Our second approach for classification is Random Forest.

credit.rf1 is our first random forest using randomForests() function.

set.seed(123456789)
credit.rf1 <- randomForest(model1.formula, 
                           data = train, localImp = TRUE)
# Saving the object with the model to the external `rds` file
saveRDS(credit.rf1, file = here("output", "credit.rf.rds"))
print(credit.rf1)

## 
## Call:
##  randomForest(formula = model1.formula, data = train, localImp = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 23.6%
## Confusion matrix:
##     Yes No class.error
## Yes 332 33  0.09041096
## No   85 50  0.62962963

Based on the classification table prediction error is lower for predictions of the "Yes" value and much more higher for the "No" value.

OOB estimate of error rate: 23.6%

plot(credit.rf1)

# the dark solid line: OOB error,
# the red line: the prediction error for "Yes" response,
# the green line: the prediction error for "No" response.

Based on the results of the OOB (out-of-bag) error approx. 100 trees is enough for analysis.

In according to above for credit.rf2 is restricted to 100 trees and estimate the model on bootstrap samples from the full data set. Additionally the mtry parameter is optimized: between 2 and 12. To do that the cross-validation process is used.

parameters_rf <- expand.grid(mtry = 2:12)
ctrl_oob <- trainControl(method = "oob", classProbs = TRUE)
  set.seed(123456789)
  credit.rf2 <-
    train(model1.formula,
          data = train,
          method = "rf",
          ntree = 100,
          nodesize = 100,
          tuneGrid = parameters_rf,
          trControl = ctrl_oob,
          importance = TRUE)


  # saving the object to the external file
saveRDS(object = credit.rf2,
          file   = here("output", "credit.rf2.rds"))
# loading the object from the external file
credit.rf2 <- readRDS(here("output", "credit.rf2.rds"))

credit.rf2

## Random Forest 
## 
## 500 samples
##  14 predictor
##   2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa     
##    2    0.730     0.00000000
##    3    0.730     0.00000000
##    4    0.730     0.00000000
##    5    0.730     0.00000000
##    6    0.730     0.00000000
##    7    0.736     0.03860160
##    8    0.734     0.04727794
##    9    0.740     0.06568923
##   10    0.738     0.04907085
##   11    0.732     0.02403496
##   12    0.736     0.04500072
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 9.

#str(credit.rf2)
plot(credit.rf2$results$mtry,
     credit.rf2$results$Accuracy, type = "b")

plot(credit.rf2$results$mtry,
     credit.rf2$results$Kappa, type = "b")

The final value used for the model is mtry = 9.

credit.rf3 is constrained to implement random forests model from the ranger package to optimize mtry.

modelLookup("ranger")

##    model     parameter                         label forReg forClass probModel
## 1 ranger          mtry #Randomly Selected Predictors   TRUE     TRUE      TRUE
## 2 ranger     splitrule                Splitting Rule   TRUE     TRUE      TRUE
## 3 ranger min.node.size             Minimal Node Size   TRUE     TRUE      TRUE

Tuning parameter splitrule is held constant at a value of Gini.

parameters_ranger <- 
  expand.grid(mtry = 2:12,
              # split rule
              splitrule = "gini",
              # minimum size of the terminal node
              min.node.size = c(40, 80, 100))

  set.seed(123456789)
  credit.rf3 <- 
    train(model1.formula, 
          data = train, 
          method = "ranger", 
          num.trees = 100, # default = 500
          # numbers of processor cores to use in computations
          num.threads = 3,
          # impurity measure
          importance = "impurity",
          # parameters
          tuneGrid = parameters_ranger, 
          trControl = ctrl_cv5)


# saving the object to the external file
saveRDS(credit.rf3, here("output", "credit.rf3.rds"))
credit.rf3 <- readRDS(here("output", "credit.rf3.rds"))

credit.rf3

## Random Forest 
## 
## 500 samples
##  14 predictor
##   2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 400, 400, 400, 400, 400 
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  ROC        Sens       Spec         Accuracy
##    2     40            0.7420599  0.9890411  0.007407407  0.724   
##    2     80            0.7459158  1.0000000  0.000000000  0.730   
##    2    100            0.7348554  1.0000000  0.000000000  0.730   
##    3     40            0.7433790  0.9753425  0.066666667  0.730   
##    3     80            0.7504820  1.0000000  0.007407407  0.732   
##    3    100            0.7404363  1.0000000  0.000000000  0.730   
##    4     40            0.7541350  0.9698630  0.096296296  0.734   
##    4     80            0.7507864  0.9917808  0.037037037  0.734   
##    4    100            0.7535261  0.9890411  0.014814815  0.726   
##    5     40            0.7451040  0.9698630  0.140740741  0.746   
##    5     80            0.7511923  0.9835616  0.081481481  0.740   
##    5    100            0.7512938  0.9972603  0.037037037  0.738   
##    6     40            0.7396246  0.9589041  0.237037037  0.764   
##    6     80            0.7519026  0.9835616  0.066666667  0.736   
##    6    100            0.7532217  0.9890411  0.037037037  0.732   
##    7     40            0.7496702  0.9479452  0.207407407  0.748   
##    7     80            0.7544394  0.9726027  0.088888889  0.734   
##    7    100            0.7614409  0.9863014  0.044444444  0.732   
##    8     40            0.7503805  0.9479452  0.207407407  0.748   
##    8     80            0.7566717  0.9726027  0.155555556  0.752   
##    8    100            0.7534247  0.9780822  0.074074074  0.734   
##    9     40            0.7541350  0.9589041  0.207407407  0.756   
##    9     80            0.7465246  0.9616438  0.162962963  0.746   
##    9    100            0.7569762  0.9808219  0.096296296  0.742   
##   10     40            0.7557585  0.9506849  0.259259259  0.764   
##   10     80            0.7509893  0.9616438  0.162962963  0.746   
##   10    100            0.7505835  0.9753425  0.111111111  0.742   
##   11     40            0.7525114  0.9452055  0.251851852  0.758   
##   11     80            0.7507864  0.9698630  0.155555556  0.750   
##   11    100            0.7504820  0.9780822  0.111111111  0.744   
##   12     40            0.7417555  0.9369863  0.244444444  0.750   
##   12     80            0.7512938  0.9698630  0.148148148  0.748   
##   12    100            0.7540335  0.9643836  0.140740741  0.742   
##   Kappa       
##   -0.004715835
##    0.000000000
##    0.000000000
##    0.058580360
##    0.010633649
##    0.000000000
##    0.088894949
##    0.040687679
##    0.005855492
##    0.145600924
##    0.087925390
##    0.049124854
##    0.248023148
##    0.069970616
##    0.036841674
##    0.197158083
##    0.083220385
##    0.043621500
##    0.196814603
##    0.169244049
##    0.071577723
##    0.212924898
##    0.160962963
##    0.102574611
##    0.258746153
##    0.163451815
##    0.115803995
##    0.244267168
##    0.164627056
##    0.120663424
##    0.224380646
##    0.155983717
##    0.137005151
## 
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 6, splitrule = gini
##  and min.node.size = 40.

plot(credit.rf3)

The results of randomForests and ranger are different.

Based on the results the final values used for the model are mtry = 6, splitrule = gini and min.node.size = 40.

Based on the estimated models, the quality is compared to training sample and test sample.

To sum up, the first model rf1 on training sample is strongly over-fitted!

On the test sample the model rf3 performs better than the other models (excluding rf1) but the model rf3 is over-fitted.

VARIABLE IMPORTANCE is indicated.

source(here("funs", "getVarImpPlot4Ranger.R"))
getVarImpPlot4Ranger(credit.rf3$finalModel,
                     sort = TRUE,
                     main = "Importance of predictors",
                     n.var = 15)

In this case, five most important predictors are: checking_account, duration, credit_amount, age, checking_account

XGBoost

The third alternative algorithm is eXtreme Gradient Boosting, because popular, quite quick and effective using the xgboost package.

The parameters which might be manipulated.

modelLookup("xgbTree")

##     model        parameter                          label forReg forClass
## 1 xgbTree          nrounds          # Boosting Iterations   TRUE     TRUE
## 2 xgbTree        max_depth                 Max Tree Depth   TRUE     TRUE
## 3 xgbTree              eta                      Shrinkage   TRUE     TRUE
## 4 xgbTree            gamma         Minimum Loss Reduction   TRUE     TRUE
## 5 xgbTree colsample_bytree     Subsample Ratio of Columns   TRUE     TRUE
## 6 xgbTree min_child_weight Minimum Sum of Instance Weight   TRUE     TRUE
## 7 xgbTree        subsample           Subsample Percentage   TRUE     TRUE
##   probModel
## 1      TRUE
## 2      TRUE
## 3      TRUE
## 4      TRUE
## 5      TRUE
## 6      TRUE
## 7      TRUE

credit.xgb is constructed to optimal values of parameters for boosting with relatively large value of learning rate.

credit.xgb

## eXtreme Gradient Boosting 
## 
## 500 samples
##  14 predictor
##   2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 400, 400, 400, 400, 400 
## Resampling results across tuning parameters:
## 
##   nrounds  ROC        Sens       Spec       Accuracy  Kappa     
##    10      0.7208016  0.9479452  0.1111111  0.722     0.08115829
##    20      0.7296804  0.9424658  0.1555556  0.730     0.12825027
##    30      0.7270421  0.9260274  0.2592593  0.746     0.22287364
##    40      0.7306951  0.9123288  0.2518519  0.734     0.19558872
##    50      0.7301877  0.9068493  0.2962963  0.742     0.23488282
##    60      0.7257230  0.9013699  0.3185185  0.744     0.24911376
##    70      0.7213597  0.8986301  0.3111111  0.740     0.23897364
##    80      0.7260274  0.9068493  0.3111111  0.746     0.25056502
##    90      0.7193303  0.9013699  0.3037037  0.740     0.23593242
##   100      0.7239980  0.8986301  0.3259259  0.744     0.25395998
## 
## Tuning parameter 'max_depth' was held constant at a value of 5
## Tuning
##  parameter 'min_child_weight' was held constant at a value of 5
## 
## Tuning parameter 'subsample' was held constant at a value of 0.8
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 30, max_depth = 5, eta
##  = 0.2, gamma = 1, colsample_bytree = 0.26, min_child_weight = 5 and
##  subsample = 0.8.

Based on the result of the model nrounds = 30 is obtained.

The next steps to determine an optimal number of trees for that value of learning rate.

credit.xgb2 is built to find optimal values of max_depthandmin_child_weight` parameters.

parameters_xgb2 <- expand.grid(nrounds = 30,
                              max_depth = seq(5, 10, 1),
                              eta = c(0.2), 
                              gamma = 1,
                              colsample_bytree = c(0.26),
                              min_child_weight = seq(5, 30, 5),
                              subsample = 0.8)

set.seed(123456789)
credit.xgb2 <- train(model1.formula,
                      data = train,
                      method = "xgbTree",
                      trControl = ctrl_cv5,
                      tuneGrid  = parameters_xgb2)

credit.xgb2 %>% saveRDS(here("output", "credit.xgb2.rds"))

credit.xgb2

## eXtreme Gradient Boosting 
## 
## 500 samples
##  14 predictor
##   2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 400, 400, 400, 400, 400 
## Resampling results across tuning parameters:
## 
##   max_depth  min_child_weight  ROC        Sens       Spec       Accuracy
##    5          5                0.7342466  0.8986301  0.3777778  0.758   
##    5         10                0.7417555  0.9123288  0.3481481  0.760   
##    5         15                0.7436327  0.8931507  0.3629630  0.750   
##    5         20                0.7320142  0.9452055  0.2222222  0.750   
##    5         25                0.7202943  0.9643836  0.1259259  0.738   
##    5         30                0.6894977  1.0000000  0.0000000  0.730   
##    6          5                0.7505835  0.9150685  0.3703704  0.768   
##    6         10                0.7438864  0.9150685  0.3407407  0.760   
##    6         15                0.7442922  0.9123288  0.3185185  0.752   
##    6         20                0.7329782  0.9342466  0.2370370  0.746   
##    6         25                0.7090817  0.9643836  0.1259259  0.738   
##    6         30                0.6598681  1.0000000  0.0000000  0.730   
##    7          5                0.7237950  0.8794521  0.3407407  0.734   
##    7         10                0.7300863  0.9013699  0.3185185  0.744   
##    7         15                0.7478945  0.9150685  0.3407407  0.760   
##    7         20                0.7214105  0.9287671  0.2074074  0.734   
##    7         25                0.7183156  0.9726027  0.1185185  0.742   
##    7         30                0.6727549  1.0000000  0.0000000  0.730   
##    8          5                0.7336377  0.8986301  0.3407407  0.748   
##    8         10                0.7349569  0.8876712  0.3333333  0.738   
##    8         15                0.7514460  0.9150685  0.3333333  0.758   
##    8         20                0.7279046  0.9424658  0.1851852  0.738   
##    8         25                0.7130898  0.9671233  0.1037037  0.734   
##    8         30                0.6872146  1.0000000  0.0000000  0.730   
##    9          5                0.7420599  0.9013699  0.3259259  0.746   
##    9         10                0.7500761  0.8986301  0.3407407  0.748   
##    9         15                0.7481989  0.9123288  0.3333333  0.756   
##    9         20                0.7259259  0.9369863  0.2222222  0.744   
##    9         25                0.7168442  0.9643836  0.1407407  0.742   
##    9         30                0.7096398  1.0000000  0.0000000  0.730   
##   10          5                0.7283612  0.8821918  0.3407407  0.736   
##   10         10                0.7372907  0.8986301  0.3185185  0.742   
##   10         15                0.7394723  0.8821918  0.3037037  0.726   
##   10         20                0.7287164  0.9397260  0.2074074  0.742   
##   10         25                0.7053780  0.9698630  0.1037037  0.736   
##   10         30                0.6814307  1.0000000  0.0000000  0.730   
##   Kappa     
##   0.30700479
##   0.29674883
##   0.28586272
##   0.21075458
##   0.11702359
##   0.00000000
##   0.32642829
##   0.29248784
##   0.26983344
##   0.21405909
##   0.11496483
##   0.00000000
##   0.24791049
##   0.25184378
##   0.29392335
##   0.16988246
##   0.12124197
##   0.00000000
##   0.27094719
##   0.24717464
##   0.28962837
##   0.16324392
##   0.09215991
##   0.00000000
##   0.26145267
##   0.27331959
##   0.28761607
##   0.19734882
##   0.13761274
##   0.00000000
##   0.24743238
##   0.25092838
##   0.21343424
##   0.18657620
##   0.09554045
##   0.00000000
## 
## Tuning parameter 'nrounds' was held constant at a value of 70
## Tuning
##  'colsample_bytree' was held constant at a value of 0.26
## Tuning
##  parameter 'subsample' was held constant at a value of 0.8
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 70, max_depth = 6, eta
##  = 0.2, gamma = 1, colsample_bytree = 0.26, min_child_weight = 5 and
##  subsample = 0.8.

Based on the results maxdepth = 6 and min_child_weight = 15 are obtained.

credit.xgb3 is constructed to optimize the colsample_bytree parameter.

parameters_xgb3 <- expand.grid(nrounds = 30,
                              max_depth = 6,
                              eta = c(0.2), 
                              gamma = 1,
                              colsample_bytree = seq(0.1, 0.8, 0.1),
                              min_child_weight = 15,
                              subsample = 0.8)
set.seed(123456789)
credit.xgb3 <- train(model1.formula,
                      data = train,
                      method = "xgbTree",
                      trControl = ctrl_cv5,
                      tuneGrid  = parameters_xgb3)

credit.xgb3 %>% saveRDS(here("output", "credit.xgb3.rds"))

credit.xgb3

## eXtreme Gradient Boosting 
## 
## 500 samples
##  14 predictor
##   2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 400, 400, 400, 400, 400 
## Resampling results across tuning parameters:
## 
##   colsample_bytree  ROC        Sens       Spec       Accuracy  Kappa    
##   0.1               0.7345510  0.9589041  0.1333333  0.736     0.1173078
##   0.2               0.7456114  0.9506849  0.2148148  0.752     0.2073743
##   0.3               0.7411974  0.9205479  0.2296296  0.734     0.1833923
##   0.4               0.7513445  0.9342466  0.2740741  0.756     0.2524838
##   0.5               0.7496195  0.9315068  0.2666667  0.752     0.2407831
##   0.6               0.7455099  0.9178082  0.2666667  0.742     0.2210748
##   0.7               0.7441400  0.9123288  0.3333333  0.756     0.2857935
##   0.8               0.7463217  0.9123288  0.2888889  0.744     0.2329054
## 
## Tuning parameter 'nrounds' was held constant at a value of 30
## Tuning
##  held constant at a value of 15
## Tuning parameter 'subsample' was held
##  constant at a value of 0.8
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 30, max_depth = 9, eta
##  = 0.2, gamma = 1, colsample_bytree = 0.4, min_child_weight = 15 and
##  subsample = 0.8.

Based on the results colsample_bytree = 0.4 is obtained.

credit.xgb4 is built to optimal length of the subsample (0.6, 0.7, 0.75, 0.8, 0.85, 0.9), which is used for creating the splits.

parameters_xgb4 <- expand.grid(nrounds = 30,
                              max_depth = 6,
                              eta = c(0.2), 
                              gamma = 1,
                              colsample_bytree = 0.4,
                              min_child_weight = 15,
                              subsample = c(0.6, 0.7, 0.75, 0.8, 0.85, 0.9))

set.seed(123456789)
credit.xgb4 <- train(model1.formula,
                      data = train,
                      method = "xgbTree",
                      trControl = ctrl_cv5,
                      tuneGrid  = parameters_xgb4)


credit.xgb4 %>% saveRDS(here("output", "credit.xgb4.rds"))

credit.xgb4

## eXtreme Gradient Boosting 
## 
## 500 samples
##  14 predictor
##   2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 400, 400, 400, 400, 400 
## Resampling results across tuning parameters:
## 
##   subsample  ROC        Sens       Spec       Accuracy  Kappa    
##   0.60       0.7339422  0.9479452  0.1703704  0.738     0.1500713
##   0.70       0.7349061  0.9342466  0.2370370  0.746     0.2068324
##   0.75       0.7409944  0.9260274  0.2814815  0.752     0.2481561
##   0.80       0.7473364  0.9287671  0.2666667  0.750     0.2343006
##   0.85       0.7449011  0.9150685  0.3037037  0.750     0.2570599
##   0.90       0.7515474  0.9178082  0.3333333  0.760     0.2930688
## 
## Tuning parameter 'nrounds' was held constant at a value of 30
## Tuning
##  held constant at a value of 0.6
## Tuning parameter 'min_child_weight' was
##  held constant at a value of 15
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 30, max_depth = 9, eta
##  = 0.2, gamma = 1, colsample_bytree = 0.6, min_child_weight = 15 and
##  subsample = 0.9.

Based on the results, the optimal value subsample = 0.7 is obtained. To sum up, all necessary values of parameters are obtained.

credit.xgb5 is constructed to find the lower the learning rate and proportionally increase number of iterations.

The lower learning rate by half (up to 0.10) is considered, while the number of trees will be doubled (up to 80).

parameters_xgb5 <- expand.grid(nrounds = 80,
                              max_depth = 6,
                              eta = 0.10, 
                              gamma = 1,
                              colsample_bytree = 0.4,
                              min_child_weight = 15,
                              subsample = 0.7)

set.seed(123456789)
credit.xgb5 <- train(model1.formula,
                      data = train,
                      method = "xgbTree",
                      trControl = ctrl_cv5,
                      tuneGrid  = parameters_xgb5)

credit.xgb5 %>% saveRDS(here("output", "credit.xgb5.rds"))

credit.xgb5

## eXtreme Gradient Boosting 
## 
## 500 samples
##  14 predictor
##   2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 400, 400, 400, 400, 400 
## Resampling results:
## 
##   ROC        Sens       Spec       Accuracy  Kappa    
##   0.7477423  0.9150685  0.3259259  0.756     0.2800949
## 
## Tuning parameter 'nrounds' was held constant at a value of 80
## Tuning
##  held constant at a value of 15
## Tuning parameter 'subsample' was held
##  constant at a value of 0.8

credit.xgb6 is built to correct the eta and the number of iterations in order to get more stable model.

parameters_xgb6 <- expand.grid(nrounds = 140,
                              max_depth = 6,
                              eta = 0.05, 
                              gamma = 1,
                              colsample_bytree = 0.4,
                              min_child_weight = 15,
                              subsample = 0.7)

set.seed(123456789)
credit.xgb6 <- train(model1.formula,
                      data = train,
                      method = "xgbTree",
                      trControl = ctrl_cv5,
                      tuneGrid  = parameters_xgb6)
credit.xgb6 %>% saveRDS(here("output", "credit.xgb6.rds"))

credit.xgb6

## eXtreme Gradient Boosting 
## 
## 500 samples
##  14 predictor
##   2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 400, 400, 400, 400, 400 
## Resampling results:
## 
##   ROC        Sens       Spec       Accuracy  Kappa    
##   0.7510908  0.9178082  0.2962963  0.75      0.2548434
## 
## Tuning parameter 'nrounds' was held constant at a value of 140
## Tuning
##  held constant at a value of 15
## Tuning parameter 'subsample' was held
##  constant at a value of 0.8

To verify the models the Gini coefficient on train and test sample is compared.

Next, into one list are collected and send to the ggroc() function.

On the test sample the model xgb6 performs better than the other models and the model xgb6 is less over-fitted (the differences of Gini confection between train and test sample are equal 9.7%).

Benchmark model

Results of estimation of 3 models are received using Machine Learning techniques. In order to compare received results Benchmark model is constructed: logistic regression.

To receive the results of Benchmark models we use 5-fold cross-validation process.

Logistic regression

credit.logit

## Generalized Linear Model 
## 
## 500 samples
##  14 predictor
##   2 classes: 'Yes', 'No' 
## 
## Pre-processing: centered (41), scaled (41) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 400, 400, 400, 400, 400 
## Resampling results:
## 
##   ROC        Sens       Spec       Accuracy  Kappa    
##   0.7426687  0.8821918  0.4074074  0.754     0.3148479

AUROC statistics is equal 74.3% for logistic regression in train sample.

Ensembling

4 different models are estimated on the training data set using 5-fold cross-validation process:

Logistic regression
Decision Tree
Random Forests
XGBoost

bottom-layer models are train four alternative models (logistic regression, decision tree, random forests, xgboost).

The names of four considered models are defined.

models <- c("logit","tree5", "rf3", "xgb6")

Predicting probabilities are calculated for the training dataset using the dplyr package.

prognoza_Yes <- function(model, dane) {
  prognoza <- 
    model %>% 
    paste0("credit.", .) %>% 
    get() %>% 
    predict(dane,
            type = "prob") %>% 
    .[, "Yes"]
  return(prognoza)
}

The correlation among four alternative models is calculated on training sample.

cor(preds.train)

##           logit     tree5       rf3      xgb6
## logit 1.0000000 0.6384253 0.8584019 0.8220886
## tree5 0.6384253 1.0000000 0.7308375 0.7568983
## rf3   0.8584019 0.7308375 1.0000000 0.9059911
## xgb6  0.8220886 0.7568983 0.9059911 1.0000000

corrplot::corrplot(cor(preds.train))

preds.train$default <- train$default

Based on the correlation matrix it is observed that correlation among four alternative models is high. The highest correlation between RandomForest model and XGboost.

The correlation among four alternative models is calculated on testing sample .

Again, let us analyze correlations among them.

cor(preds.test)

##           logit     tree5       rf3      xgb6
## logit 1.0000000 0.6558471 0.8506493 0.8199487
## tree5 0.6558471 1.0000000 0.7228121 0.7707745
## rf3   0.8506493 0.7228121 1.0000000 0.9179117
## xgb6  0.8199487 0.7707745 0.9179117 1.0000000

corrplot::corrplot(cor(preds.test))

preds.test$default <- test$default

Based on the correlation matrix we can see that correlation among four alternative models on testing sample are similar.

Stacking

For each model, we take out predictions of success for each observation in the cross-validation process.

Predictions of success for each model are saved in the $pred element.

preds_cv <- 
  sapply(models,
         function(x) x %>%
           paste0("credit.", .) %>%
           get() %>% 
           .["pred"] %>% 
           # this is a list
           .[[1]] %>% 
           .[,"Yes"]) %>% 
  data.frame()
head(preds_cv)

##       logit     tree5       rf3      xgb6
## 1 0.7062415 0.8784530 0.4582145 0.4397653
## 2 0.7262053 0.9117647 0.5817320 0.4871082
## 3 0.8698391 0.2500000 0.8520608 0.7764295
## 4 0.7225765 0.8784530 0.8593019 0.5356862
## 5 0.5748870 0.7222222 0.8747574 0.7190288
## 6 0.6328738 0.8784530 0.8215651 0.8002991

The order of the observations is saved in the $pred$RowIndex element.

preds_cv$RowIndex <- credit.logit$pred$rowIndex
head(preds_cv)

##       logit     tree5       rf3      xgb6 RowIndex
## 1 0.7062415 0.8784530 0.4582145 0.4397653       10
## 2 0.7262053 0.9117647 0.5817320 0.4871082       11
## 3 0.8698391 0.2500000 0.8520608 0.7764295       18
## 4 0.7225765 0.8784530 0.8593019 0.5356862       24
## 5 0.5748870 0.7222222 0.8747574 0.7190288       30
## 6 0.6328738 0.8784530 0.8215651 0.8002991       33

Sort observations are used.

preds_cv <- 
  preds_cv %>% 
  arrange(RowIndex)
head(preds_cv)

##       logit     tree5       rf3      xgb6 RowIndex
## 1 0.9602376 0.7878788 0.5602240 0.7938189        1
## 2 0.3075296 0.3684211 0.6964258 0.4639506        2
## 3 0.9889599 0.9246575 0.6632885 0.9513080        3
## 4 0.7699881 0.7368421 0.6567246 0.4384797        4
## 5 0.7073578 0.7878788 0.4971975 0.8521206        5
## 6 0.9070644 0.8842105 0.9327884 0.9133524        6

preds_cv$default <- train$default

Let’s construct the top layer model. In this case the logistic regression model is used.

(stacking.formula <- 
    as.formula(paste0("default ~ ",
                      paste0(models, collapse = " +")))
)

## default ~ logit + tree5 + rf3 + xgb6

logistic regression model is estimated.

set.seed(123456789)
credit.stacking.logit <- 
  train(stacking.formula, 
        data = preds_cv, 
        method = "glm",
        family = "binomial", 
        preProcess = c("center", "scale"),
        trControl = ctrl_cv5)

Predictions in the training dataset are obtained.

preds.train$stacking.logit <- 
  predict(credit.stacking.logit,
          newdata = preds_cv)

Probabilities of success for training sample are added.

preds.train$stacking.logit.p <- 
  predict(credit.stacking.logit,
          newdata = preds_cv, 
          type = "prob")[, "Yes"]
head(preds.train)

##       logit     tree5       rf3      xgb6 default stacking.logit
## 1 0.9583503 0.8333333 0.8398099 0.8605645     Yes            Yes
## 2 0.2831211 0.0000000 0.3518560 0.4337821      No             No
## 3 0.9778150 0.8793103 0.9207844 0.9581582     Yes            Yes
## 4 0.6447733 0.3750000 0.6267033 0.4632163     Yes            Yes
## 5 0.7176260 0.8793103 0.8201274 0.8479191     Yes            Yes
## 6 0.9503780 0.8793103 0.9186552 0.9320341     Yes            Yes
##   stacking.logit.p
## 1        0.8578812
## 2        0.3408507
## 3        0.9208861
## 4        0.5534084
## 5        0.8248563
## 6        0.8980857

Predictions in the testing dataset are calculated.

preds.test$stacking.logit <- 
  predict(credit.stacking.logit,
          newdata = preds.test[, models])

Probabilities of success for testing sample are added.

preds.test$stacking.logit.p <- 
  predict(credit.stacking.logit,
          newdata = preds.test[, models],
          type = "prob")[, "Yes"]

The required element is to verify the quality of the model on training sample.

The required element is to verify the quality of the model on test sample.

To sum up, the stable model for stacking approach is obtained.

Conclusion

Commercial banks are using the IRB approach more and more over than the standard approach due to the fact that banks have achieved significant profits. As part of the IRB approach, the basic ingredients of credit risk should be estimated, including PD (Probability of default) models.

In this study, the following models are estimated: 1. Logistic regression 2. Decision Tree 3. Random Forests 4. XGBoost based on the German scoring data. Additionally, the ensembling procedure is performed.

Amaong the 3 different models featured in this report (Decision Tree, Random Forests, XGBoost), logistic regression as a benchmark model worked best for train samples compared to Decision Tree, XGBoost and also in relation to stacking approach based on the Gini coefficient. For the test sample, the logistic regression achieved better results only in relation to Decision Tree. However, if the stability of the models is considered, then the most stable is stacking approach. Additionally, it should be noted that considered dataset covers a relatively small number of observations.

Classification on the German scoring data

Marian Nehrebecki and Aleksander Partyga

2022-04-04 19:39:10

Introduction

Libraries

Data and problem description

Exploratory data analysis

Data about defaults

Initial analysis of the variable

Information value

Data division, modelling and results

Comparison of prediction accuracy of at least 3 different techniques

Classification trees

Random Forest

XGBoost

Benchmark model

Logistic regression

Ensembling

Stacking

Conclusion