The aim of the project is to perform a classification on a data set and to compare the quality of the models and methods used.
The structure of the project is as follows. The first part presents a description of the data and the problem analyzed (initial descriptive analysis of the data, variable transformations, variable selection methods). The second part covers the training/test data split and resampling (e.g. cross-validation). The third part compares the prediction accuracy of at least 3 different machine learning techniques (Decision Tree, Random Forest, XGBoost). In order to compare the obtained results, a logistic regression is constructed as a benchmark model. Finally, the ensembling procedure is used. The last part presents the summary and conclusions.
In order to realize the presented aim of the project, the necessary packages should be installed.
The German scoring data describes persons and their characteristics, and includes the variable responsible for credit risk (variable default: 1 - default, 0 - non-default). The data was collected by Professor Dr. Hans Hofmann from the Institut für Statistik und Ökonometrie, Universität Hamburg. The data can be found in the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
Number of Observations: 1000
Number of Attributes: 20 (7 numerical, 13 categorical)
Attribute description for the German scoring data:
Attribute 1: (qualitative) Status of existing checking account
– A11 : … < 0 DM
– A12 : 0 <= … < 200 DM
– A13 : … >= 200 DM / salary assignments for at least 1 year
– A14 : no checking account
Attribute 2: (numerical) Duration in month
Attribute 3: (qualitative) Credit history
– A30 : no credits taken/ all credits paid back duly
– A31 : all credits at this bank paid back duly
– A32 : existing credits paid back duly till now
– A33 : delay in paying off in the past
– A34 : critical account/other credits existing (not at this bank)
Attribute 4: (qualitative) Purpose
– A40 : car (new)
– A41 : car (used)
– A42 : furniture/equipment
– A43 : radio/television
– A44 : domestic appliances
– A45 : repairs
– A46 : education
– A47 : (vacation - does not exist?)
– A48 : retraining
– A49 : business
– A410 : others
Attribute 5: (numerical) Credit amount
Attribute 6: (qualitative) Savings account/bonds
– A61 : … < 100 DM
– A62 : 100 <= … < 500 DM
– A63 : 500 <= … < 1000 DM
– A64 : .. >= 1000 DM
– A65 : unknown/ no savings account
Attribute 7: (qualitative) Present employment since
– A71 : unemployed
– A72 : … < 1 year
– A73 : 1 <= … < 4 years
– A74 : 4 <= … < 7 years
– A75 : .. >= 7 years
Attribute 8: (numerical) Installment rate in percentage of disposable income
Attribute 9: (qualitative) Personal status and sex
– A91 : male : divorced/separated
– A92 : female : divorced/separated/married
– A93 : male : single
– A94 : male : married/widowed
– A95 : female : single
Attribute 10: (qualitative) Other debtors / guarantors
– A101 : none
– A102 : co-applicant
– A103 : guarantor
Attribute 11: (numerical) Present residence since
Attribute 12: (qualitative) Property
– A121 : real estate
– A122 : if not A121 : building society savings agreement/ life insurance
– A123 : if not A121/A122 : car or other, not in attribute 6
– A124 : unknown / no property
Attribute 13: (numerical) Age in years
Attribute 14: (qualitative) Other installment plans
– A141 : bank
– A142 : stores
– A143 : none
Attribute 15: (qualitative) Housing
– A151 : rent
– A152 : own
– A153 : for free
Attribute 16: (numerical) Number of existing credits at this bank
Attribute 17: (qualitative) Job
– A171 : unemployed/ unskilled - non-resident
– A172 : unskilled - resident
– A173 : skilled employee / official
– A174 : management/ self-employed/ highly qualified employee/ officer
Attribute 18: (numerical) Number of people being liable to provide maintenance for
Attribute 19: (qualitative) Telephone
– A191 : none
– A192 : yes, registered under the customers name
Attribute 20: (qualitative) foreign worker
– A201 : yes
– A202 : no
Initially, let's summarize the data with the use of built-in R functions.
credit <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")
summary(credit)
## V1 V2 V3 V4
## Length:1000 Min. : 4.0 Length:1000 Length:1000
## Class :character 1st Qu.:12.0 Class :character Class :character
## Mode :character Median :18.0 Mode :character Mode :character
## Mean :20.9
## 3rd Qu.:24.0
## Max. :72.0
## V5 V6 V7 V8
## Min. : 250 Length:1000 Length:1000 Min. :1.000
## 1st Qu.: 1366 Class :character Class :character 1st Qu.:2.000
## Median : 2320 Mode :character Mode :character Median :3.000
## Mean : 3271 Mean :2.973
## 3rd Qu.: 3972 3rd Qu.:4.000
## Max. :18424 Max. :4.000
## V9 V10 V11 V12
## Length:1000 Length:1000 Min. :1.000 Length:1000
## Class :character Class :character 1st Qu.:2.000 Class :character
## Mode :character Mode :character Median :3.000 Mode :character
## Mean :2.845
## 3rd Qu.:4.000
## Max. :4.000
## V13 V14 V15 V16
## Min. :19.00 Length:1000 Length:1000 Min. :1.000
## 1st Qu.:27.00 Class :character Class :character 1st Qu.:1.000
## Median :33.00 Mode :character Mode :character Median :1.000
## Mean :35.55 Mean :1.407
## 3rd Qu.:42.00 3rd Qu.:2.000
## Max. :75.00 Max. :4.000
## V17 V18 V19 V20
## Length:1000 Min. :1.000 Length:1000 Length:1000
## Class :character 1st Qu.:1.000 Class :character Class :character
## Mode :character Median :1.000 Mode :character Mode :character
## Mean :1.155
## 3rd Qu.:1.000
## Max. :2.000
## V21
## Min. :1.0
## 1st Qu.:1.0
## Median :1.0
## Mean :1.3
## 3rd Qu.:2.0
## Max. :2.0
## 'data.frame': 1000 obs. of 21 variables:
## $ V1 : chr "A11" "A12" "A14" "A11" ...
## $ V2 : int 6 48 12 42 24 36 24 36 12 30 ...
## $ V3 : chr "A34" "A32" "A34" "A32" ...
## $ V4 : chr "A43" "A43" "A46" "A42" ...
## $ V5 : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ V6 : chr "A65" "A61" "A61" "A61" ...
## $ V7 : chr "A75" "A73" "A74" "A74" ...
## $ V8 : int 4 2 2 2 3 2 3 2 2 4 ...
## $ V9 : chr "A93" "A92" "A93" "A93" ...
## $ V10: chr "A101" "A101" "A101" "A103" ...
## $ V11: int 4 2 3 4 4 4 4 2 4 2 ...
## $ V12: chr "A121" "A121" "A121" "A122" ...
## $ V13: int 67 22 49 45 53 35 53 35 61 28 ...
## $ V14: chr "A143" "A143" "A143" "A143" ...
## $ V15: chr "A152" "A152" "A152" "A153" ...
## $ V16: int 2 1 1 1 2 1 1 1 1 2 ...
## $ V17: chr "A173" "A173" "A172" "A173" ...
## $ V18: int 1 1 2 2 2 2 1 1 1 1 ...
## $ V19: chr "A192" "A191" "A191" "A191" ...
## $ V20: chr "A201" "A201" "A201" "A201" ...
## $ V21: int 1 2 1 1 2 1 1 1 1 2 ...
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## V21
## 0
## [1] FALSE
The credit data set consists of 21 variables. None of them contains missing observations. Note that the variables do not have readable names, so we assign names to them.
colnames(credit) <- c("checking_account", "duration", "credit_history", "purpose", "credit_amount", "saving_account", "employment", "installment_rate", "status_sex", "other_debtors", "residince", "property", "age", "install_plans", "housing", "number_credits","job", "number_people", "telephone", "foreign", "default")
Let's analyze the dependent variable - default - in our dataset.
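The code producing the frequency table below is not shown in the report; a minimal sketch of how it could be obtained:
# a sketch (assumption): share of non-defaults and defaults in the sample
prop.table(table(credit$default))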
##
## 0 1
## 0.7 0.3
In our sample, 30% of the entities are in default.
Let's analyze the explanatory variables in our dataset.
We start by preparing the variables for further analysis, beginning with the nominal ones.
#---------------------------------------------------
# nominal variables
#---------------------------------------------------
# Storing qualitative variables as factors
# 1. credit_history
is.character(credit$credit_history)
## [1] TRUE
credit$credit_history <- factor(credit$credit_history,
levels = c("A30", "A31", "A32", "A33", "A34"),
labels = c("NoCredit", "CreditsBankPaids", "CreditsPaid", "CreditDelay", "CriticalAccount"))
table(credit$credit_history)
##
## NoCredit CreditsBankPaids CreditsPaid CreditDelay
## 40 49 530 88
## CriticalAccount
## 293
# it should be converted into a factor
credit$credit_history <- as.factor(credit$credit_history)
table(credit$credit_history)
##
## NoCredit CreditsBankPaids CreditsPaid CreditDelay
## 40 49 530 88
## CriticalAccount
## 293
## [1] "NoCredit" "CreditsBankPaids" "CreditsPaid" "CreditDelay"
## [5] "CriticalAccount"
## Factor w/ 5 levels "NoCredit","CreditsBankPaids",..: 5 3 5 3 4 3 3 3 3 5 ...
table_credit_history <- table(credit$credit_history, credit$default)
mosaicplot(table_credit_history, color = T)
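The cross-tabulation and the Chi-squared test shown below are produced without the corresponding code; a minimal sketch that could reproduce them, assuming the gmodels package for CrossTable():
# a sketch (assumption): cross-tabulation with row proportions and Pearson's Chi-squared test
library(gmodels)
CrossTable(credit$credit_history, credit$default,
           prop.r = TRUE, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)
chisq.test(credit$default, credit$credit_history)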
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | credit$default
## credit$credit_history | 0 | 1 | Row Total |
## ----------------------|-----------|-----------|-----------|
## NoCredit | 15 | 25 | 40 |
## | 0.375 | 0.625 | 0.040 |
## ----------------------|-----------|-----------|-----------|
## CreditsBankPaids | 21 | 28 | 49 |
## | 0.429 | 0.571 | 0.049 |
## ----------------------|-----------|-----------|-----------|
## CreditsPaid | 361 | 169 | 530 |
## | 0.681 | 0.319 | 0.530 |
## ----------------------|-----------|-----------|-----------|
## CreditDelay | 60 | 28 | 88 |
## | 0.682 | 0.318 | 0.088 |
## ----------------------|-----------|-----------|-----------|
## CriticalAccount | 243 | 50 | 293 |
## | 0.829 | 0.171 | 0.293 |
## ----------------------|-----------|-----------|-----------|
## Column Total | 700 | 300 | 1000 |
## ----------------------|-----------|-----------|-----------|
##
##
##
## Pearson's Chi-squared test
##
## data: credit$default and credit$credit_history
## X-squared = 61.691, df = 4, p-value = 1.279e-12
Based on the Chi-squared test, the null hypothesis that there is no difference between the groups is rejected at the 5% significance level.
Based on the presented graph and table, a relation between credit history and the level of default can be noticed.
For existing credits paid back duly till now, 361 entities (68.1%) have non-default status and 169 entities (31.9%) have default status.
People with no credits taken / all credits paid back duly are most often in default status.
Based on the Chi-squared test, the null hypothesis of no difference between the groups is rejected at the 5% significance level.
For the variable purpose, among loans taken to buy a radio/television, 218 entities (77.9%) are in non-default status and 62 entities (22.1%) are in default status. Additionally, people taking a loan for education are in default more often than those taking credit for other purposes.
Based on the Chi-squared test, the null hypothesis of no difference between the groups is rejected at the 5% significance level.
For the variable personal status and sex, among single males 402 entities (73.4%) are in non-default status and 146 entities (26.6%) are in default status. Divorced/separated males are in default more often than the other categories.
Based on the Chi-squared test, the null hypothesis of no difference between the groups is rejected at the 5% significance level.
For other debtors / guarantors, in the category "none" 635 entities (70%) are in non-default status and 272 entities (30%) are in default status. Entities in the co-applicant category are in default more often than in the other categories.
Based on the Chi-squared test, the null hypothesis of no difference between the groups is rejected at the 5% significance level.
For property, in the car (or other) category 230 entities (69.3%) are in non-default status and 102 entities (30.7%) are in default status. Additionally, people with no property are in default more often than those with other types of property.
Based on the Chi-squared test, the null hypothesis of no difference between the groups is rejected at the 5% significance level.
For other installment plans, in the category "none" 590 entities (72.5%) are in non-default status and 224 entities (27.5%) are in default status. Entities with installment plans at a bank have a higher probability of default than the other categories.
Based on the Chi-squared test, the null hypothesis of no difference between the groups is rejected at the 5% significance level.
For housing, among home owners 527 entities (73.9%) are in non-default status and 186 entities (26.1%) are in default status. Additionally, people with housing for free have a higher probability of default than the other categories.
Based on the Chi-squared test, there are no grounds to reject the null hypothesis of no difference between the groups at the 5% significance level. This variable is removed.
For job, among skilled employees / officials 444 entities (70.5%) are in non-default status and 186 entities (29.5%) are in default status. People who work in management have a higher probability of default than people in other types of jobs.
Based on the Chi-squared test, there are no grounds to reject the null hypothesis of no difference between the groups at the 5% significance level. This variable is removed.
For the telephone variable, 409 people with a telephone (68.6%) are in non-default status and 187 entities (31.4%) are in default status. People without a telephone have a higher probability of default than people with a telephone.
Based on the Chi-squared test, the null hypothesis of no difference between the groups is rejected at the 5% significance level.
For the foreign worker variable, 667 entities (69.3%) are in non-default status and 296 entities (30.7%) are in default status. Foreign workers have a higher probability of default than non-foreign workers.
Then discrete ordered variables will be prepared.
#--------------------------------------------
# ORDINAL variables
#--------------------------------------------
# 1. saving_account
table(credit$saving_account)
##
## A61 A62 A63 A64 A65
## 603 103 63 48 183
# we add the order of values
credit$saving_account<- factor(credit$saving_account,
# levels from lowest to highest
levels = c("A65",
"A61",
"A62",
"A63",
"A64"),
ordered = TRUE) # ordinal
credit$saving_account <- factor(credit$saving_account,
levels = c("A65", "A61", "A62", "A63", "A64"),
labels = c("NoSavings", "<100", "100_500", "500_1000", ">=1000"))
# we check the table of frequencies
table(credit$saving_account)
##
## NoSavings <100 100_500 500_1000 >=1000
## 183 603 103 63 48
## [1] "NoSavings" "<100" "100_500" "500_1000" ">=1000"
## Ord.factor w/ 5 levels "NoSavings"<"<100"<..: 1 2 2 2 2 1 4 2 5 2 ...
table_saving_account <- table(credit$saving_account, credit$default)
mosaicplot(table_saving_account, color = T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | credit$default
## credit$saving_account | 0 | 1 | Row Total |
## ----------------------|-----------|-----------|-----------|
## NoSavings | 151 | 32 | 183 |
## | 0.825 | 0.175 | 0.183 |
## ----------------------|-----------|-----------|-----------|
## <100 | 386 | 217 | 603 |
## | 0.640 | 0.360 | 0.603 |
## ----------------------|-----------|-----------|-----------|
## 100_500 | 69 | 34 | 103 |
## | 0.670 | 0.330 | 0.103 |
## ----------------------|-----------|-----------|-----------|
## 500_1000 | 52 | 11 | 63 |
## | 0.825 | 0.175 | 0.063 |
## ----------------------|-----------|-----------|-----------|
## >=1000 | 42 | 6 | 48 |
## | 0.875 | 0.125 | 0.048 |
## ----------------------|-----------|-----------|-----------|
## Column Total | 700 | 300 | 1000 |
## ----------------------|-----------|-----------|-----------|
##
##
##
## Pearson's Chi-squared test
##
## data: credit$default and credit$saving_account
## X-squared = 36.099, df = 4, p-value = 2.761e-07
Based on the Chi-squared test, the null hypothesis of no difference between the groups is rejected at the 5% significance level.
People with a saving account below 100 have a higher probability of default than others.
Based on the Chi-squared test, the null hypothesis of no difference between the groups is rejected at the 5% significance level.
People with a checking account balance below 0 have a higher probability of default than others.
Based on the Chi-squared test, the null hypothesis of no difference between the groups is rejected at the 5% significance level.
People employed for less than 1 year have a higher probability of default than others.
Based on the Chi-squared test, there are no grounds to reject the null hypothesis of no difference between the groups at the 5% significance level.
Entities with one person to provide maintenance for have a higher probability of default than those with two.
Based on the Chi-squared test, there are no grounds to reject the null hypothesis of no difference between the groups at the 5% significance level. However, at a 15% significance level the null hypothesis of no difference between the groups would be rejected.
An installment rate of 4% of disposable income is associated with a higher probability of default than the other installment rates.
Based on the Chi-squared test, there are no grounds to reject the null hypothesis of no difference between the groups at the 5% significance level.
People at their present residence for two years have a higher probability of default than others.
Based on the Chi-squared test, there are no grounds to reject the null hypothesis of no difference between the groups at the 5% significance level.
People with four existing credits have a higher probability of default than others.
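The remaining ordinal variables are recoded analogously to saving_account. For example, checking_account could be prepared as follows (a sketch; the ordering of the levels and the labels are assumptions):
# a sketch (assumption): checking_account as an ordered factor, from no account to the highest balance
credit$checking_account <- factor(credit$checking_account,
                                  levels = c("A14", "A11", "A12", "A13"),
                                  labels = c("NoAccount", "<0", "0_200", ">=200"),
                                  ordered = TRUE)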
Next, we prepare the continuous explanatory variables.
#--------------------------------------
# - X - NUMERIC VARIABLES
#--------------------------------------
# To apply the check to all columns we use sapply()
sapply(credit, is.numeric)
## checking_account duration credit_history purpose
## FALSE TRUE FALSE FALSE
## credit_amount saving_account employment installment_rate
## TRUE FALSE FALSE FALSE
## status_sex other_debtors residince property
## FALSE FALSE FALSE FALSE
## age install_plans housing number_credits
## TRUE FALSE FALSE FALSE
## job number_people telephone foreign
## FALSE FALSE FALSE FALSE
## default
## FALSE
# We filter out those variables which are numeric
# We store their list in a vector
credit_numeric_vars <-
sapply(credit, is.numeric) %>%
which() %>%
names()
credit_numeric_vars
## [1] "duration" "credit_amount" "age"
# We check how many unique values
sapply(credit[, credit_numeric_vars],
function(x)
unique(x) %>%
length()) %>%
# lets sort variables by increasing
# number of levels in the end
sort()
## duration age credit_amount
## 33 53 921
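The group-wise summaries below are shown without the corresponding code; a sketch using by() for each numeric variable:
# a sketch (assumption): summaries of the numeric variables by default status
by(credit$duration, credit$default, summary)
by(credit$age, credit$default, summary)
by(credit$credit_amount, credit$default, summary)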
## credit$default: 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 12.00 18.00 19.21 24.00 60.00
## ------------------------------------------------------------
## credit$default: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 12.00 24.00 24.86 36.00 72.00
## credit$default: 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 27.00 34.00 36.22 42.25 75.00
## ------------------------------------------------------------
## credit$default: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 25.00 31.00 33.96 40.00 74.00
## credit$default: 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 250 1376 2244 2985 3635 15857
## ------------------------------------------------------------
## credit$default: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 433 1352 2574 3938 5142 18424
The average duration is 19 months for non-default status and 25 months for default status. The average age is 36 years for non-default and 34 years for default. On average, people with non-default status have a credit amount of 2985 units and people with default status 3938 units.
In order to verify collinearity, the correlation matrix between the numeric variables is calculated.
#---------------------------------------------------
# mutually correlated (irrelevant) variables
#---------------------------------------------------
# we calculate correlations between variables to identify redundant features
credit_correlations <-
cor(credit[,credit_numeric_vars],
use = "pairwise.complete.obs")
#credit_correlations
No variables are highly correlated, so the problem of collinearity is avoided.
# lets decrease the cutoff to 0.75
findCorrelation(credit_correlations,
cutoff = 0.75,
names = TRUE)
## character(0)
Let's consider the Information Value of each variable in our dataset.
save(list = c("credit"),
file = "slides/credit.RData")
credit$default <- ifelse(credit$default == '1', 1, 0)
IV <- create_infotables(data = credit, y = "default", bins = 5, parallel = FALSE)
IV$Summary
## Variable IV
## 1 checking_account 6.660115e-01
## 3 credit_history 2.932335e-01
## 2 duration 2.183994e-01
## 6 saving_account 1.960096e-01
## 4 purpose 1.691951e-01
## 12 property 1.126383e-01
## 5 credit_amount 9.342540e-02
## 13 age 8.781228e-02
## 7 employment 8.643363e-02
## 15 housing 8.329343e-02
## 14 install_plans 5.761454e-02
## 9 status_sex 4.467068e-02
## 20 foreign 4.387741e-02
## 10 other_debtors 3.201932e-02
## 8 installment_rate 2.632209e-02
## 16 number_credits 1.326652e-02
## 17 job 8.762766e-03
## 19 telephone 6.377605e-03
## 11 residince 3.588773e-03
## 18 number_people 4.339223e-05
The Information Value is greater than 0.02 for all variables except number_credits, job, telephone, residince, and number_people. These variables will be excluded from the model.
credit[c("number_credits", "job", "telephone", "residince", "number_people", "status_sex")] <- list(NULL)
Additionally, status_sex is dropped because commercial banks do not discriminate by sex in their models.
The data is divided into training and test samples in 50:50 proportions, because the sample is small.
credit$default <- as.factor(ifelse(credit$default == 1, "Yes", "No"))
set.seed(123456)
d = sort(sample(nrow(credit), nrow(credit) * .5))
train <- credit[d, ]
test <- credit[-d, ]
Before producing the first tree, let us also examine the frequencies of the target variable in the training and test sets.
##
## No Yes
## 0.73 0.27
##
## No Yes
## 0.67 0.33
The percentage of defaults in the training sample and the test sample is approximately equal.
Let us define the formula of the model.
model1.formula <- default ~ checking_account + duration + credit_history +
purpose + credit_amount + saving_account + employment + installment_rate +
property + age + housing + other_debtors + install_plans + foreign
The tree is created on the basis of all the chosen predictors.
credit.tree1 is constructed using the Gini Index split criterion.
save(list = c("train"),
file = "slides/train.RData")
save(list = c("test"),
file = "slides/test.RData")
credit.tree1 <-
rpart(model1.formula, # model formula
data = train, # data
method = "class") # type of the tree: classification
#credit.tree1
fancyRpartPlot(credit.tree1)
credit.tree2 is constructed using Information Gain, which is based on the concept of entropy.
credit.tree2 <-
rpart(model1.formula,
data = train,
method = "class",
parms = list(split = 'information'))
#credit.tree2
fancyRpartPlot(credit.tree2)
The resulting tree 2 is almost identical to the one obtained with the Gini criterion.
credit.tree3 is constructed based on stopping criteria: minsplit, minbucket, maxdepth.
credit.tree3 <-
rpart(model1.formula,
data = train,
method = "class",
minsplit = 100, # minimum number of observations in the given node for
#the considered split (default = 20)
minbucket = 50, # minimum number of observations in the terminal nodes
# (default = round(minsplit/3))
maxdepth = 10, # maximum depth of the tree, assuming that first node
#(root node) is denoted by 0 (default = 30)
cp = 0) # non-negative complexity parameter
fancyRpartPlot(credit.tree3)
Based on the stopping criteria used, a smaller tree is obtained.
credit.tree4 is constructed based on imposing no restrictions on the growth of trees.
credit.tree4 <-
rpart(model1.formula,
data = train,
method = "class",
# we don't impose any restriction on the tree growth
cp = -1) # cp - complexity parameter
#credit.tree4
fancyRpartPlot(credit.tree4)
The resulting tree is unrestricted (large) and unreadable. Allowing trees to grow to such sizes leads to over-fitting and poor results on the test sample.
credit.tree4p is constructed by pruning: the cp value with the lowest cross-validation error is used.
opt <- which.min(credit.tree4$cptable[, "xerror"])
cp <- credit.tree4$cptable[opt, "CP"]
credit.tree4p <-
prune(credit.tree4, cp = cp)
fancyRpartPlot(credit.tree4p)
credit.tree5 is constructed by choosing a range of cp values for the cross-validation process and selecting with respect to the AUC (ROC) values.
To obtain this tree, the caret package is used.
Let us also define a summary function which will report 5 standard measures of model quality.
fiveStats <- function(...) c(twoClassSummary(...),
defaultSummary(...))
ctrl_cv5 <- trainControl(method = "cv",
number = 5,
# saving all predictions
savePredictions = "final",
summaryFunction = fiveStats,
classProbs = TRUE)
## model parameter label forReg forClass probModel
## 1 rpart cp Complexity Parameter TRUE TRUE TRUE
cp.grid <- expand.grid(cp = seq(0, 0.03, 0.001))
set.seed(123456789)
credit.tree5 <-
train(model1.formula,
data = train,
method = "rpart",
metric = "ROC",
trControl = ctrl_cv5,
tuneGrid = cp.grid)
credit.tree5
## CART
##
## 500 samples
## 14 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 400, 400, 400, 400, 400
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec Accuracy Kappa
## 0.000 0.7017757 0.8739726 0.3481481 0.732 0.2425358
## 0.001 0.7017757 0.8739726 0.3481481 0.732 0.2425358
## 0.002 0.7017757 0.8739726 0.3481481 0.732 0.2425358
## 0.003 0.7017757 0.8739726 0.3481481 0.732 0.2425358
## 0.004 0.6979198 0.8876712 0.3407407 0.740 0.2557514
## 0.005 0.6968544 0.9068493 0.3333333 0.752 0.2745060
## 0.006 0.6968544 0.9068493 0.3333333 0.752 0.2745060
## 0.007 0.6970066 0.9068493 0.3259259 0.750 0.2667056
## 0.008 0.6970066 0.9068493 0.3259259 0.750 0.2667056
## 0.009 0.6970066 0.9068493 0.3259259 0.750 0.2667056
## 0.010 0.7078133 0.8904110 0.3555556 0.746 0.2769349
## 0.011 0.7078133 0.8904110 0.3555556 0.746 0.2769349
## 0.012 0.7078133 0.8904110 0.3555556 0.746 0.2769349
## 0.013 0.6988838 0.8876712 0.3629630 0.746 0.2812751
## 0.014 0.6988838 0.8876712 0.3629630 0.746 0.2812751
## 0.015 0.6988838 0.8876712 0.3629630 0.746 0.2812751
## 0.016 0.6988838 0.8876712 0.3629630 0.746 0.2812751
## 0.017 0.6988838 0.8876712 0.3629630 0.746 0.2812751
## 0.018 0.6988838 0.8876712 0.3629630 0.746 0.2812751
## 0.019 0.6883815 0.8821918 0.3777778 0.746 0.2885569
## 0.020 0.6883815 0.8821918 0.3777778 0.746 0.2885569
## 0.021 0.6883815 0.8821918 0.3777778 0.746 0.2885569
## 0.022 0.6850330 0.8794521 0.3481481 0.736 0.2539525
## 0.023 0.6850330 0.8794521 0.3481481 0.736 0.2539525
## 0.024 0.6887874 0.8931507 0.3185185 0.738 0.2433087
## 0.025 0.6960426 0.8904110 0.3185185 0.736 0.2391009
## 0.026 0.6960426 0.8904110 0.3185185 0.736 0.2391009
## 0.027 0.6960426 0.8904110 0.3185185 0.736 0.2391009
## 0.028 0.7029427 0.8986301 0.2962963 0.736 0.2247007
## 0.029 0.7029427 0.8986301 0.2962963 0.736 0.2247007
## 0.030 0.7029427 0.8986301 0.2962963 0.736 0.2247007
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.012.
All the trees needed for verification have now been estimated. Predictions on the training set and on the test set are created.
The values of the ROC curve for the training sample are calculated.
We present the ROC curves for the obtained models on the training sample on a graph. Additionally, the Gini coefficients are reported. The higher the value of the Gini coefficient, the better the model classifies.
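The prediction and ROC/Gini calculations themselves are not shown; a minimal sketch for a single tree, assuming the pROC package (the object names are illustrative):
# a sketch (assumption): predicted probabilities, ROC and Gini = 2*AUC - 1 on the training sample
library(pROC)
pred.tree1.train <- predict(credit.tree1, newdata = train, type = "prob")[, "Yes"]
roc.tree1.train <- roc(train$default, pred.tree1.train)
gini.tree1.train <- 2 * auc(roc.tree1.train) - 1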
To sum up, for the training sample the highest Gini coefficients are obtained for tree4 and tree5.
The same calculations are repeated for the test sample: the values of the ROC curve for the test sample are calculated.
The ROC curves for the obtained models on the test sample are presented on the graph.
The models perform somewhat worse on the test sample. The tree4p model is more stable.
Next, we examine VARIABLE IMPORTANCE.
To answer this question, the variable.importance element of the rpart object is used; it relies on the overall measure of importance of the predictors (the sum of the quality of the splits in which the predictor is used in the splitting rule).
tree4p.importance <- credit.tree4p$variable.importance
# margins c(bottom, left, top, right)
# default c(5, 4, 4, 2) + 0.1.
par(mar = c(5.1, 6.1, 4.1, 2.1))
barplot(rev(tree4p.importance), # vector
col = "blue", # colors
main = "importance of variables in model tree4p",
horiz = T, # horizontal type of plot
las = 1, # labels always horizontally
cex.names = 0.6)
In this case, the five most important predictors are: checking_account, duration, credit_history, purpose, credit_amount.
In addition, the varImp() function from the caret package is used to determine the importance of the variables; it is based on the average decrease in node impurity.
## Overall
## age 17.284520
## checking_account 21.380564
## credit_amount 17.951667
## credit_history 26.135652
## duration 27.721887
## employment 1.854236
## installment_rate 5.628966
## other_debtors 5.099142
## property 4.722302
## purpose 7.357860
## saving_account 5.640125
## housing 0.000000
## install_plans 0.000000
## foreign 0.000000
# store the varImp() output shown above and sort it by the Overall measure
tree4p.importance2 <- varImp(credit.tree4p)
tree4p.importance2 <-
  tree4p.importance2 %>%
  rownames_to_column("variable") %>%
  arrange(desc(Overall))
barplot(rev(tree4p.importance2$Overall), # vector
names.arg = rev(tree4p.importance2$variable), # labels
col = "blue", # colors
main = "importance of variables in tree4p",
horiz = T, # horizontal type of plot
las = 1, # labels always horizontally
cex.names = 0.6)
Let us analyze the first 10 variables selected on the basis of the two approaches.
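The common set could be obtained, for example, with intersect() on the two rankings (a sketch; both objects are assumed to be sorted by decreasing importance):
# a sketch (assumption): variables present in the top 10 of both importance rankings
intersect(names(tree4p.importance)[1:10],
          tree4p.importance2$variable[1:10])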
## [1] "duration" "credit_history" "checking_account" "credit_amount"
## [5] "age" "purpose" "saving_account" "property"
Based on the two approaches for choosing important variables, 8 variables are common to both, although in a different order.
The objects with the models are saved into separate rds files.
credit.tree1 %>% saveRDS(here("output", "credit.tree1.rds"))
credit.tree2 %>% saveRDS(here("output", "credit.tree2.rds"))
credit.tree3 %>% saveRDS(here("output", "credit.tree3.rds"))
credit.tree4 %>% saveRDS(here("output", "credit.tree4.rds"))
credit.tree4p %>% saveRDS(here("output", "credit.tree4p.rds"))
credit.tree5 %>% saveRDS(here("output", "credit.tree5.rds"))
Our second approach for classification is Random Forest.
credit.rf1 is our first random forest, built with the randomForest() function.
set.seed(123456789)
credit.rf1 <- randomForest(model1.formula,
data = train, localImp = TRUE)
# Saving the object with the model to the external `rds` file
saveRDS(credit.rf1, file = here("output", "credit.rf.rds"))
print(credit.rf1)
##
## Call:
## randomForest(formula = model1.formula, data = train, localImp = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 23.6%
## Confusion matrix:
## Yes No class.error
## Yes 332 33 0.09041096
## No 85 50 0.62962963
Based on the classification table, the prediction error is lower for the "Yes" class and much higher for the "No" class.
The OOB estimate of the error rate is 23.6%.
# the dark solid line: OOB error,
# the red line: the prediction error for "Yes" response,
# the green line: the prediction error for "No" response.
Based on the OOB (out-of-bag) error, approximately 100 trees are enough for the analysis.
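The plot call itself is not shown; a minimal sketch that could produce the error-rate plot described by the comments above (assuming the randomForest object credit.rf1):
# a sketch (assumption): OOB and class-wise error rates versus the number of trees
plot(credit.rf1, main = "credit.rf1: error rates vs. number of trees")
legend("topright", legend = colnames(credit.rf1$err.rate), col = 1:3, lty = 1:3)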
Accordingly, credit.rf2 is restricted to 100 trees and estimated on bootstrap samples from the full data set. Additionally, the mtry parameter is optimized over the range 2 to 12; for this purpose the out-of-bag (OOB) estimate is used.
parameters_rf <- expand.grid(mtry = 2:12)
ctrl_oob <- trainControl(method = "oob", classProbs = TRUE)
set.seed(123456789)
credit.rf2 <-
train(model1.formula,
data = train,
method = "rf",
ntree = 100,
nodesize = 100,
tuneGrid = parameters_rf,
trControl = ctrl_oob,
importance = TRUE)
# saving the object to the external file
saveRDS(object = credit.rf2,
file = here("output", "credit.rf2.rds"))
# loading the object from the external file
credit.rf2 <- readRDS(here("output", "credit.rf2.rds"))
credit.rf2
## Random Forest
##
## 500 samples
## 14 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.730 0.00000000
## 3 0.730 0.00000000
## 4 0.730 0.00000000
## 5 0.730 0.00000000
## 6 0.730 0.00000000
## 7 0.736 0.03860160
## 8 0.734 0.04727794
## 9 0.740 0.06568923
## 10 0.738 0.04907085
## 11 0.732 0.02403496
## 12 0.736 0.04500072
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 9.
The final value used for the model is mtry = 9.
credit.rf3 implements the random forest model from the ranger package in order to optimize mtry.
## model parameter label forReg forClass probModel
## 1 ranger mtry #Randomly Selected Predictors TRUE TRUE TRUE
## 2 ranger splitrule Splitting Rule TRUE TRUE TRUE
## 3 ranger min.node.size Minimal Node Size TRUE TRUE TRUE
Tuning parameter splitrule is held constant at a value of Gini.
parameters_ranger <-
expand.grid(mtry = 2:12,
# split rule
splitrule = "gini",
# minimum size of the terminal node
min.node.size = c(40, 80, 100))
set.seed(123456789)
credit.rf3 <-
train(model1.formula,
data = train,
method = "ranger",
num.trees = 100, # default = 500
# numbers of processor cores to use in computations
num.threads = 3,
# impurity measure
importance = "impurity",
# parameters
tuneGrid = parameters_ranger,
trControl = ctrl_cv5)
# saving the object to the external file
saveRDS(credit.rf3, here("output", "credit.rf3.rds"))
credit.rf3 <- readRDS(here("output", "credit.rf3.rds"))
credit.rf3
## Random Forest
##
## 500 samples
## 14 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 400, 400, 400, 400, 400
## Resampling results across tuning parameters:
##
## mtry min.node.size ROC Sens Spec Accuracy
## 2 40 0.7420599 0.9890411 0.007407407 0.724
## 2 80 0.7459158 1.0000000 0.000000000 0.730
## 2 100 0.7348554 1.0000000 0.000000000 0.730
## 3 40 0.7433790 0.9753425 0.066666667 0.730
## 3 80 0.7504820 1.0000000 0.007407407 0.732
## 3 100 0.7404363 1.0000000 0.000000000 0.730
## 4 40 0.7541350 0.9698630 0.096296296 0.734
## 4 80 0.7507864 0.9917808 0.037037037 0.734
## 4 100 0.7535261 0.9890411 0.014814815 0.726
## 5 40 0.7451040 0.9698630 0.140740741 0.746
## 5 80 0.7511923 0.9835616 0.081481481 0.740
## 5 100 0.7512938 0.9972603 0.037037037 0.738
## 6 40 0.7396246 0.9589041 0.237037037 0.764
## 6 80 0.7519026 0.9835616 0.066666667 0.736
## 6 100 0.7532217 0.9890411 0.037037037 0.732
## 7 40 0.7496702 0.9479452 0.207407407 0.748
## 7 80 0.7544394 0.9726027 0.088888889 0.734
## 7 100 0.7614409 0.9863014 0.044444444 0.732
## 8 40 0.7503805 0.9479452 0.207407407 0.748
## 8 80 0.7566717 0.9726027 0.155555556 0.752
## 8 100 0.7534247 0.9780822 0.074074074 0.734
## 9 40 0.7541350 0.9589041 0.207407407 0.756
## 9 80 0.7465246 0.9616438 0.162962963 0.746
## 9 100 0.7569762 0.9808219 0.096296296 0.742
## 10 40 0.7557585 0.9506849 0.259259259 0.764
## 10 80 0.7509893 0.9616438 0.162962963 0.746
## 10 100 0.7505835 0.9753425 0.111111111 0.742
## 11 40 0.7525114 0.9452055 0.251851852 0.758
## 11 80 0.7507864 0.9698630 0.155555556 0.750
## 11 100 0.7504820 0.9780822 0.111111111 0.744
## 12 40 0.7417555 0.9369863 0.244444444 0.750
## 12 80 0.7512938 0.9698630 0.148148148 0.748
## 12 100 0.7540335 0.9643836 0.140740741 0.742
## Kappa
## -0.004715835
## 0.000000000
## 0.000000000
## 0.058580360
## 0.010633649
## 0.000000000
## 0.088894949
## 0.040687679
## 0.005855492
## 0.145600924
## 0.087925390
## 0.049124854
## 0.248023148
## 0.069970616
## 0.036841674
## 0.197158083
## 0.083220385
## 0.043621500
## 0.196814603
## 0.169244049
## 0.071577723
## 0.212924898
## 0.160962963
## 0.102574611
## 0.258746153
## 0.163451815
## 0.115803995
## 0.244267168
## 0.164627056
## 0.120663424
## 0.224380646
## 0.155983717
## 0.137005151
##
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 6, splitrule = gini
## and min.node.size = 40.
The results of randomForest and ranger are different.
Based on the results, the final values used for the model are mtry = 6, splitrule = gini and min.node.size = 40.
Based on the estimated models, their quality is compared on the training sample and the test sample.
To sum up, the first model, rf1, is strongly over-fitted on the training sample!
On the test sample, the model rf3 performs better than the other models (excluding rf1), but rf3 is also over-fitted.
VARIABLE IMPORTANCE is indicated.
source(here("funs", "getVarImpPlot4Ranger.R"))
getVarImpPlot4Ranger(credit.rf3$finalModel,
sort = TRUE,
main = "Importance of predictors",
n.var = 15)
In this case, the most important predictors are: checking_account, duration, credit_amount, and age.
The third alternative algorithm is eXtreme Gradient Boosting (the xgboost package), chosen because it is popular, quite fast, and effective.
The parameters which can be tuned are listed below.
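They could be listed, for example, with caret's modelLookup() (a sketch):
# a sketch (assumption): tunable parameters of the xgbTree method
modelLookup("xgbTree")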
## model parameter label forReg forClass
## 1 xgbTree nrounds # Boosting Iterations TRUE TRUE
## 2 xgbTree max_depth Max Tree Depth TRUE TRUE
## 3 xgbTree eta Shrinkage TRUE TRUE
## 4 xgbTree gamma Minimum Loss Reduction TRUE TRUE
## 5 xgbTree colsample_bytree Subsample Ratio of Columns TRUE TRUE
## 6 xgbTree min_child_weight Minimum Sum of Instance Weight TRUE TRUE
## 7 xgbTree subsample Subsample Percentage TRUE TRUE
## probModel
## 1 TRUE
## 2 TRUE
## 3 TRUE
## 4 TRUE
## 5 TRUE
## 6 TRUE
## 7 TRUE
credit.xgb is constructed to find the optimal number of boosting iterations with a relatively large learning rate.
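The train() call for credit.xgb is not shown; a sketch consistent with the grid reported in the output below (the parameter values are taken from that output):
# a sketch (assumption): tuning only the number of boosting iterations with eta = 0.2
parameters_xgb <- expand.grid(nrounds = seq(10, 100, 10),
                              max_depth = 5,
                              eta = 0.2,
                              gamma = 1,
                              colsample_bytree = 0.26,
                              min_child_weight = 5,
                              subsample = 0.8)
set.seed(123456789)
credit.xgb <- train(model1.formula,
                    data = train,
                    method = "xgbTree",
                    trControl = ctrl_cv5,
                    tuneGrid = parameters_xgb)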
## eXtreme Gradient Boosting
##
## 500 samples
## 14 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 400, 400, 400, 400, 400
## Resampling results across tuning parameters:
##
## nrounds ROC Sens Spec Accuracy Kappa
## 10 0.7208016 0.9479452 0.1111111 0.722 0.08115829
## 20 0.7296804 0.9424658 0.1555556 0.730 0.12825027
## 30 0.7270421 0.9260274 0.2592593 0.746 0.22287364
## 40 0.7306951 0.9123288 0.2518519 0.734 0.19558872
## 50 0.7301877 0.9068493 0.2962963 0.742 0.23488282
## 60 0.7257230 0.9013699 0.3185185 0.744 0.24911376
## 70 0.7213597 0.8986301 0.3111111 0.740 0.23897364
## 80 0.7260274 0.9068493 0.3111111 0.746 0.25056502
## 90 0.7193303 0.9013699 0.3037037 0.740 0.23593242
## 100 0.7239980 0.8986301 0.3259259 0.744 0.25395998
##
## Tuning parameter 'max_depth' was held constant at a value of 5
## Tuning
## parameter 'min_child_weight' was held constant at a value of 5
##
## Tuning parameter 'subsample' was held constant at a value of 0.8
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 30, max_depth = 5, eta
## = 0.2, gamma = 1, colsample_bytree = 0.26, min_child_weight = 5 and
## subsample = 0.8.
Based on the results of the model, nrounds = 30 is obtained.
The next step is to tune the tree-specific parameters for that value of the learning rate.
credit.xgb2 is built to find the optimal values of the max_depth and min_child_weight parameters.
parameters_xgb2 <- expand.grid(nrounds = 30,
max_depth = seq(5, 10, 1),
eta = c(0.2),
gamma = 1,
colsample_bytree = c(0.26),
min_child_weight = seq(5, 30, 5),
subsample = 0.8)
set.seed(123456789)
credit.xgb2 <- train(model1.formula,
data = train,
method = "xgbTree",
trControl = ctrl_cv5,
tuneGrid = parameters_xgb2)
credit.xgb2 %>% saveRDS(here("output", "credit.xgb2.rds"))
## eXtreme Gradient Boosting
##
## 500 samples
## 14 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 400, 400, 400, 400, 400
## Resampling results across tuning parameters:
##
## max_depth min_child_weight ROC Sens Spec Accuracy
## 5 5 0.7342466 0.8986301 0.3777778 0.758
## 5 10 0.7417555 0.9123288 0.3481481 0.760
## 5 15 0.7436327 0.8931507 0.3629630 0.750
## 5 20 0.7320142 0.9452055 0.2222222 0.750
## 5 25 0.7202943 0.9643836 0.1259259 0.738
## 5 30 0.6894977 1.0000000 0.0000000 0.730
## 6 5 0.7505835 0.9150685 0.3703704 0.768
## 6 10 0.7438864 0.9150685 0.3407407 0.760
## 6 15 0.7442922 0.9123288 0.3185185 0.752
## 6 20 0.7329782 0.9342466 0.2370370 0.746
## 6 25 0.7090817 0.9643836 0.1259259 0.738
## 6 30 0.6598681 1.0000000 0.0000000 0.730
## 7 5 0.7237950 0.8794521 0.3407407 0.734
## 7 10 0.7300863 0.9013699 0.3185185 0.744
## 7 15 0.7478945 0.9150685 0.3407407 0.760
## 7 20 0.7214105 0.9287671 0.2074074 0.734
## 7 25 0.7183156 0.9726027 0.1185185 0.742
## 7 30 0.6727549 1.0000000 0.0000000 0.730
## 8 5 0.7336377 0.8986301 0.3407407 0.748
## 8 10 0.7349569 0.8876712 0.3333333 0.738
## 8 15 0.7514460 0.9150685 0.3333333 0.758
## 8 20 0.7279046 0.9424658 0.1851852 0.738
## 8 25 0.7130898 0.9671233 0.1037037 0.734
## 8 30 0.6872146 1.0000000 0.0000000 0.730
## 9 5 0.7420599 0.9013699 0.3259259 0.746
## 9 10 0.7500761 0.8986301 0.3407407 0.748
## 9 15 0.7481989 0.9123288 0.3333333 0.756
## 9 20 0.7259259 0.9369863 0.2222222 0.744
## 9 25 0.7168442 0.9643836 0.1407407 0.742
## 9 30 0.7096398 1.0000000 0.0000000 0.730
## 10 5 0.7283612 0.8821918 0.3407407 0.736
## 10 10 0.7372907 0.8986301 0.3185185 0.742
## 10 15 0.7394723 0.8821918 0.3037037 0.726
## 10 20 0.7287164 0.9397260 0.2074074 0.742
## 10 25 0.7053780 0.9698630 0.1037037 0.736
## 10 30 0.6814307 1.0000000 0.0000000 0.730
## Kappa
## 0.30700479
## 0.29674883
## 0.28586272
## 0.21075458
## 0.11702359
## 0.00000000
## 0.32642829
## 0.29248784
## 0.26983344
## 0.21405909
## 0.11496483
## 0.00000000
## 0.24791049
## 0.25184378
## 0.29392335
## 0.16988246
## 0.12124197
## 0.00000000
## 0.27094719
## 0.24717464
## 0.28962837
## 0.16324392
## 0.09215991
## 0.00000000
## 0.26145267
## 0.27331959
## 0.28761607
## 0.19734882
## 0.13761274
## 0.00000000
## 0.24743238
## 0.25092838
## 0.21343424
## 0.18657620
## 0.09554045
## 0.00000000
##
## Tuning parameter 'nrounds' was held constant at a value of 70
## Tuning
## 'colsample_bytree' was held constant at a value of 0.26
## Tuning
## parameter 'subsample' was held constant at a value of 0.8
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 70, max_depth = 6, eta
## = 0.2, gamma = 1, colsample_bytree = 0.26, min_child_weight = 5 and
## subsample = 0.8.
Based on the results, max_depth = 6 and min_child_weight = 15 are chosen.
credit.xgb3 is constructed to optimize the colsample_bytree parameter.
parameters_xgb3 <- expand.grid(nrounds = 30,
max_depth = 6,
eta = c(0.2),
gamma = 1,
colsample_bytree = seq(0.1, 0.8, 0.1),
min_child_weight = 15,
subsample = 0.8)
set.seed(123456789)
credit.xgb3 <- train(model1.formula,
data = train,
method = "xgbTree",
trControl = ctrl_cv5,
tuneGrid = parameters_xgb3)
credit.xgb3 %>% saveRDS(here("output", "credit.xgb3.rds"))
## eXtreme Gradient Boosting
##
## 500 samples
## 14 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 400, 400, 400, 400, 400
## Resampling results across tuning parameters:
##
## colsample_bytree ROC Sens Spec Accuracy Kappa
## 0.1 0.7345510 0.9589041 0.1333333 0.736 0.1173078
## 0.2 0.7456114 0.9506849 0.2148148 0.752 0.2073743
## 0.3 0.7411974 0.9205479 0.2296296 0.734 0.1833923
## 0.4 0.7513445 0.9342466 0.2740741 0.756 0.2524838
## 0.5 0.7496195 0.9315068 0.2666667 0.752 0.2407831
## 0.6 0.7455099 0.9178082 0.2666667 0.742 0.2210748
## 0.7 0.7441400 0.9123288 0.3333333 0.756 0.2857935
## 0.8 0.7463217 0.9123288 0.2888889 0.744 0.2329054
##
## Tuning parameter 'nrounds' was held constant at a value of 30
## Tuning
## held constant at a value of 15
## Tuning parameter 'subsample' was held
## constant at a value of 0.8
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 30, max_depth = 9, eta
## = 0.2, gamma = 1, colsample_bytree = 0.4, min_child_weight = 15 and
## subsample = 0.8.
Based on the results colsample_bytree = 0.4 is obtained.
credit.xgb4 is built to find the optimal subsample fraction (0.6, 0.7, 0.75, 0.8, 0.85, 0.9) used when building each tree.
parameters_xgb4 <- expand.grid(nrounds = 30,
max_depth = 6,
eta = c(0.2),
gamma = 1,
colsample_bytree = 0.4,
min_child_weight = 15,
subsample = c(0.6, 0.7, 0.75, 0.8, 0.85, 0.9))
set.seed(123456789)
credit.xgb4 <- train(model1.formula,
data = train,
method = "xgbTree",
trControl = ctrl_cv5,
tuneGrid = parameters_xgb4)
credit.xgb4 %>% saveRDS(here("output", "credit.xgb4.rds"))
## eXtreme Gradient Boosting
##
## 500 samples
## 14 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 400, 400, 400, 400, 400
## Resampling results across tuning parameters:
##
## subsample ROC Sens Spec Accuracy Kappa
## 0.60 0.7339422 0.9479452 0.1703704 0.738 0.1500713
## 0.70 0.7349061 0.9342466 0.2370370 0.746 0.2068324
## 0.75 0.7409944 0.9260274 0.2814815 0.752 0.2481561
## 0.80 0.7473364 0.9287671 0.2666667 0.750 0.2343006
## 0.85 0.7449011 0.9150685 0.3037037 0.750 0.2570599
## 0.90 0.7515474 0.9178082 0.3333333 0.760 0.2930688
##
## Tuning parameter 'nrounds' was held constant at a value of 30
## Tuning
## held constant at a value of 0.6
## Tuning parameter 'min_child_weight' was
## held constant at a value of 15
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 30, max_depth = 9, eta
## = 0.2, gamma = 1, colsample_bytree = 0.6, min_child_weight = 15 and
## subsample = 0.9.
Based on the results, the optimal value subsample = 0.7 is obtained. To sum up, all necessary values of parameters are obtained.
credit.xgb5 is constructed by lowering the learning rate and proportionally increasing the number of iterations.
The learning rate is halved (to 0.10), while the number of trees is doubled (to 80).
parameters_xgb5 <- expand.grid(nrounds = 80,
max_depth = 6,
eta = 0.10,
gamma = 1,
colsample_bytree = 0.4,
min_child_weight = 15,
subsample = 0.7)
set.seed(123456789)
credit.xgb5 <- train(model1.formula,
data = train,
method = "xgbTree",
trControl = ctrl_cv5,
tuneGrid = parameters_xgb5)
credit.xgb5 %>% saveRDS(here("output", "credit.xgb5.rds"))
## eXtreme Gradient Boosting
##
## 500 samples
## 14 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 400, 400, 400, 400, 400
## Resampling results:
##
## ROC Sens Spec Accuracy Kappa
## 0.7477423 0.9150685 0.3259259 0.756 0.2800949
##
## Tuning parameter 'nrounds' was held constant at a value of 80
## Tuning
## held constant at a value of 15
## Tuning parameter 'subsample' was held
## constant at a value of 0.8
credit.xgb6 is built by further adjusting eta and the number of iterations in order to obtain a more stable model.
parameters_xgb6 <- expand.grid(nrounds = 140,
max_depth = 6,
eta = 0.05,
gamma = 1,
colsample_bytree = 0.4,
min_child_weight = 15,
subsample = 0.7)
set.seed(123456789)
credit.xgb6 <- train(model1.formula,
data = train,
method = "xgbTree",
trControl = ctrl_cv5,
tuneGrid = parameters_xgb6)
credit.xgb6 %>% saveRDS(here("output", "credit.xgb6.rds"))
## eXtreme Gradient Boosting
##
## 500 samples
## 14 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 400, 400, 400, 400, 400
## Resampling results:
##
## ROC Sens Spec Accuracy Kappa
## 0.7510908 0.9178082 0.2962963 0.75 0.2548434
##
## Tuning parameter 'nrounds' was held constant at a value of 140
## Tuning
## held constant at a value of 15
## Tuning parameter 'subsample' was held
## constant at a value of 0.8
To verify the models, the Gini coefficient on the training and test samples is compared.
Next, the ROC objects are collected into one list and passed to the ggroc() function.
On the test sample, the model xgb6 performs better than the other models and is less over-fitted (the difference in the Gini coefficient between the training and test samples equals 9.7%).
The results of the 3 models estimated with machine learning techniques are now available. In order to compare them, a benchmark model is constructed: logistic regression.
To obtain the results of the benchmark model, a 5-fold cross-validation process is used.
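The train() call for the benchmark is not shown; a sketch consistent with the output below (the object name credit.logit is an assumption, chosen to match the model names used later):
# a sketch (assumption): benchmark logistic regression with centering and scaling, 5-fold CV
set.seed(123456789)
credit.logit <- train(model1.formula,
                      data = train,
                      method = "glm",
                      family = "binomial",
                      preProcess = c("center", "scale"),
                      trControl = ctrl_cv5)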
## Generalized Linear Model
##
## 500 samples
## 14 predictor
## 2 classes: 'Yes', 'No'
##
## Pre-processing: centered (41), scaled (41)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 400, 400, 400, 400, 400
## Resampling results:
##
## ROC Sens Spec Accuracy Kappa
## 0.7426687 0.8821918 0.4074074 0.754 0.3148479
The AUROC statistic equals 74.3% for the logistic regression on the training sample.
4 different models are estimated on the training data set using the 5-fold cross-validation process:
the bottom-layer models are four alternative models (logistic regression, decision tree, random forest, xgboost).
The names of the four considered models are defined.
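A sketch of this definition (the vector name models and its contents are inferred from the prediction columns used below):
# a sketch (assumption): names of the bottom-layer models, matching objects credit.logit, credit.tree5, credit.rf3, credit.xgb6
models <- c("logit", "tree5", "rf3", "xgb6")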
Predicted probabilities are calculated for the training dataset using the dplyr package.
# helper: for a model name (e.g. "logit"), fetch the object credit.<name> and return predicted probabilities of "Yes" for the data set dane
prognoza_Yes <- function(model, dane) {
prognoza <-
model %>%
paste0("credit.", .) %>%
get() %>%
predict(dane,
type = "prob") %>%
.[, "Yes"]
return(prognoza)
}
The correlation among the four alternative models is calculated on the training sample.
## logit tree5 rf3 xgb6
## logit 1.0000000 0.6384253 0.8584019 0.8220886
## tree5 0.6384253 1.0000000 0.7308375 0.7568983
## rf3 0.8584019 0.7308375 1.0000000 0.9059911
## xgb6 0.8220886 0.7568983 0.9059911 1.0000000
Based on the correlation matrix, the correlation among the four alternative models is high. The highest correlation is between the Random Forest and XGBoost models.
The correlation among the four alternative models is also calculated on the test sample.
Again, let us analyze correlations among them.
## logit tree5 rf3 xgb6
## logit 1.0000000 0.6558471 0.8506493 0.8199487
## tree5 0.6558471 1.0000000 0.7228121 0.7707745
## rf3 0.8506493 0.7228121 1.0000000 0.9179117
## xgb6 0.8199487 0.7707745 0.9179117 1.0000000
Based on the correlation matrix, the correlations among the four alternative models on the test sample are similar.
For each model, we take out predictions of success for each observation in the cross-validation process.
Predictions of success for each model are saved in the $pred element.
preds_cv <-
sapply(models,
function(x) x %>%
paste0("credit.", .) %>%
get() %>%
.["pred"] %>%
# this is a list
.[[1]] %>%
.[,"Yes"]) %>%
data.frame()
head(preds_cv)
## logit tree5 rf3 xgb6
## 1 0.7062415 0.8784530 0.4582145 0.4397653
## 2 0.7262053 0.9117647 0.5817320 0.4871082
## 3 0.8698391 0.2500000 0.8520608 0.7764295
## 4 0.7225765 0.8784530 0.8593019 0.5356862
## 5 0.5748870 0.7222222 0.8747574 0.7190288
## 6 0.6328738 0.8784530 0.8215651 0.8002991
The order of the observations is saved in the $pred$RowIndex element.
## logit tree5 rf3 xgb6 RowIndex
## 1 0.7062415 0.8784530 0.4582145 0.4397653 10
## 2 0.7262053 0.9117647 0.5817320 0.4871082 11
## 3 0.8698391 0.2500000 0.8520608 0.7764295 18
## 4 0.7225765 0.8784530 0.8593019 0.5356862 24
## 5 0.5748870 0.7222222 0.8747574 0.7190288 30
## 6 0.6328738 0.8784530 0.8215651 0.8002991 33
The observations are then sorted by RowIndex.
## logit tree5 rf3 xgb6 RowIndex
## 1 0.9602376 0.7878788 0.5602240 0.7938189 1
## 2 0.3075296 0.3684211 0.6964258 0.4639506 2
## 3 0.9889599 0.9246575 0.6632885 0.9513080 3
## 4 0.7699881 0.7368421 0.6567246 0.4384797 4
## 5 0.7073578 0.7878788 0.4971975 0.8521206 5
## 6 0.9070644 0.8842105 0.9327884 0.9133524 6
Let’s construct the top layer model. In this case the logistic regression model is used.
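The construction of the stacking formula and the target column in preds_cv is not shown; a sketch under the assumption that the target is taken from the training set via RowIndex:
# a sketch (assumption): attach the target and define the top-layer formula
preds_cv$default <- train$default[preds_cv$RowIndex]
stacking.formula <- default ~ logit + tree5 + rf3 + xgb6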
## default ~ logit + tree5 + rf3 + xgb6
The logistic regression model is estimated.
set.seed(123456789)
credit.stacking.logit <-
train(stacking.formula,
data = preds_cv,
method = "glm",
family = "binomial",
preProcess = c("center", "scale"),
trControl = ctrl_cv5)
Predictions in the training dataset are obtained.
Probabilities of success for training sample are added.
preds.train$stacking.logit.p <-
predict(credit.stacking.logit,
newdata = preds_cv,
type = "prob")[, "Yes"]
head(preds.train)
## logit tree5 rf3 xgb6 default stacking.logit
## 1 0.9583503 0.8333333 0.8398099 0.8605645 Yes Yes
## 2 0.2831211 0.0000000 0.3518560 0.4337821 No No
## 3 0.9778150 0.8793103 0.9207844 0.9581582 Yes Yes
## 4 0.6447733 0.3750000 0.6267033 0.4632163 Yes Yes
## 5 0.7176260 0.8793103 0.8201274 0.8479191 Yes Yes
## 6 0.9503780 0.8793103 0.9186552 0.9320341 Yes Yes
## stacking.logit.p
## 1 0.8578812
## 2 0.3408507
## 3 0.9208861
## 4 0.5534084
## 5 0.8248563
## 6 0.8980857
Predictions in the testing dataset are calculated.
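The construction of preds.test from the bottom-layer models is not shown; a sketch using the prognoza_Yes() helper defined above (adding the target column is an assumption):
# a sketch (assumption): bottom-layer predicted probabilities of "Yes" on the test set
preds.test <- sapply(models, prognoza_Yes, dane = test) %>%
  data.frame()
preds.test$default <- test$default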
Probabilities of success for testing sample are added.
preds.test$stacking.logit.p <-
predict(credit.stacking.logit,
newdata = preds.test[, models],
type = "prob")[, "Yes"]
The next required element is to verify the quality of the model on the training sample.
The same verification is then performed on the test sample.
To sum up, a stable model is obtained with the stacking approach.
Commercial banks use the IRB approach more and more often instead of the standardized approach, because it has brought them significant benefits. As part of the IRB approach, the basic components of credit risk should be estimated, including PD (Probability of Default) models.
In this study, the following models are estimated on the German scoring data: 1. Logistic regression, 2. Decision Tree, 3. Random Forests, 4. XGBoost. Additionally, the ensembling procedure is performed.
Among the 3 different models featured in this report (Decision Tree, Random Forests, XGBoost), the logistic regression benchmark performed best on the training sample, based on the Gini coefficient, compared to the Decision Tree and XGBoost, and also relative to the stacking approach. On the test sample, the logistic regression achieved better results only relative to the Decision Tree. However, if the stability of the models is considered, the stacking approach is the most stable. Additionally, it should be noted that the considered dataset contains a relatively small number of observations.