loan_data <- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/162/datasets/8f48a2cbb6150e7ae32435e55f271cad5b4b8ecf/loan_data_ch1.rds")))
##################################
loan_data$ir_cat <- rep(NA, length(loan_data$int_rate))
loan_data$ir_cat[which(loan_data$int_rate <= 8)] <- "0-8"
loan_data$ir_cat[which(loan_data$int_rate > 8 & loan_data$int_rate <= 11)] <- "8-11"
loan_data$ir_cat[which(loan_data$int_rate > 11 & loan_data$int_rate <= 13.5)] <- "11-13.5"
loan_data$ir_cat[which(loan_data$int_rate > 13.5)] <- "13.5+"
loan_data$ir_cat[which(is.na(loan_data$int_rate))] <- "Missing"
loan_data$ir_cat <- as.factor(loan_data$ir_cat)
#Question
# Make the necessary replacements in the coarse classification example below
loan_data$emp_cat <- rep(NA, length(loan_data$emp_length))
loan_data$emp_cat[which(loan_data$emp_length <= 15)] <- "0-15"
loan_data$emp_cat[which(loan_data$emp_length > 15 & loan_data$emp_length <= 30)] <- "15-30"
loan_data$emp_cat[which(loan_data$emp_length > 30 & loan_data$emp_length <= 45)] <- "30-45"
loan_data$emp_cat[which(loan_data$emp_length > 45)] <- "45+"
loan_data$emp_cat[which(is.na(loan_data$emp_length))] <- "Missing"
loan_data$emp_cat <- as.factor(loan_data$emp_cat)
# Look at your new variable using plot()
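# (plot() call added here, not in the original notes) plot() on a factor gives a bar chart of its levels
plot(loan_data$emp_cat)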
str(loan_data)
## 'data.frame': 29092 obs. of 10 variables:
## $ loan_status : int 0 0 0 0 0 0 1 0 1 0 ...
## $ loan_amnt : int 5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
## $ int_rate : num 10.6 NA 13.5 NA NA ...
## $ grade : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 3 2 2 4 ...
## $ emp_length : int 10 25 13 3 9 11 0 3 3 0 ...
## $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 3 4 4 4 4 ...
## $ annual_inc : num 24000 12252 49200 36000 48000 ...
## $ age : int 33 31 24 39 24 28 22 22 28 22 ...
## $ ir_cat : Factor w/ 5 levels "0-8","11-13.5",..: 4 5 2 5 5 2 2 4 4 3 ...
## $ emp_cat : Factor w/ 5 levels "0-15","15-30",..: 1 2 1 1 1 1 1 1 1 1 ...
##################################
# Set seed of 567
set.seed(567)
# Store row numbers for training set: index_train
index_train <- sample(1:nrow(loan_data), 2/3*nrow(loan_data))
# Create training set: training_set
training_set <- loan_data[index_train, ]
# Create test set: test_set
test_set <- loan_data[-index_train, ]
log_model_full <- glm(loan_status ~ ., family = "binomial", data = training_set)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Another way of classifying groups (default vs non-default on a loan) is with decision trees.
You answer a series of questions from the top (the root node) down to the bottom (a leaf node), where the case is classified as default or non-default.
Quick Summary:
The tree is designed around a splitting decision at each node. But how are these decisions made? Why age greater than 40 rather than 45? These splitting decisions are made by calculating the impurity using the Gini measure: we calculate the impurity for each candidate split (age > 40, age > 41, age > 42, …) and choose the split with the least impurity.
For each node of the tree, the splitting decision will affect the final structure of the tree.
In the example below, we'll go over two possible splitting decisions.
We can answer these questions by measuring the impurity.
Splitting decisions are made by measuring the impurity of each candidate split of the tree.
Our goal is to minimize the impurity in each node of the tree.
A popular (and the default) measure is the Gini measure.
Quick Summary:
The Gini measure is used to calculate the impurity of the root node and of each leaf node created by a split. One variable can yield many candidate splits (e.g. age > 40, age > 45, age > 50, and so forth).
We're focusing on one split in this example: is age greater than 40? This split has two leaf nodes, yes and no.
Once we calculate the Gini measure for each leaf node, we use those values to calculate the purity gain.
The purity gain of each candidate split is what we compare across candidates, and we choose the split with the highest (maximum) purity gain, i.e. the minimum amount of impurity.
Assume you have a training set of 500 cases (250 defaults and 250 non-defaults).
In the course picture, the left number in each node is the count of non-defaults and the right number is the count of defaults.
In a perfect split, each leaf node would contain only defaults or only non-defaults, but since this is almost never the case, we compute the impurity using the Gini measure.
Equation: Gini (impurity) = 2 * prop(default) * prop(non-default)
The impurity in a certain node is given by 2 x the proportion of default in the node x the proportion of non-default in the node.
Purity Gain = Gini_Root - (prop(cases in N1) * Gini_N1) - (prop(cases in N2) * Gini_N2)
From our example: Purity Gain = 0.5 - (270/500 * 0.4664) - (230/500 * 0.4536) = 0.039488
The algorithm will compute all the possibilities and give us the split that has the highest purity gain.
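As a quick sketch of the arithmetic above (object names are illustrative; the leaf sizes and leaf Gini values are taken from the example):
gini_root_ex <- 2 * (250 / 500) * (250 / 500)  # 0.5
gini_N1 <- 0.4664                              # Gini of leaf N1 (270 cases)
gini_N2 <- 0.4536                              # Gini of leaf N2 (230 cases)
gain_ex <- gini_root_ex - (270 / 500) * gini_N1 - (230 / 500) * gini_N2
gain_ex                                        # 0.039488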
Instructions:
- The computation for the Gini of the root node is given.
- Compute the Gini measure for the left leaf node.
- Compute the Gini measure for the right leaf node.
- Compute the gain by taking the difference between the root-node Gini and the weighted leaf-node Gini measures.
- Information regarding the split in this tree can be found using $splits on the tree object, small_tree. small_tree is given in the course environment and is not available here.
Instead of gain, you should look at the improve column here. improve is an alternative metric for gain, simply obtained by multiplying gain by the number of cases in the data set. Make sure that the object improve (code given) has the same value as in small_tree$splits.
# The Gini-measure of the root node is given below
gini_root <- 2 * (89 / 500) * (411 / 500)
# Compute the Gini measure for the left leaf node
gini_ll <- 2 * (401/446) * (45/446)
# Compute the Gini measure for the right leaf node
gini_rl <- 2 * (10/54) * (44/54)
# Compute the gain
gain <- gini_root - (446 / 500) * gini_ll - (54 / 500) * gini_rl
# Compare the improve column in small_tree$splits with our computed gain multiplied by 500, and check that they are the same
# small_tree is only available in the course console.
small_tree$splits
improve <- gain * 500
Imagine if we had to calculate all possible splits by hand; that would take an enormous amount of time.
This is where the rpart package comes in!
Warning:
Using the default settings on the training_set of loan_data, including all variables (~ .) and setting method = "class" because the response variable (loan_status) is categorical, we get a warning message that only a root node is created.
The reason for the warning is that (similarly to what we observed when setting a cut-off for the logistic regression model) the highest accuracy is achieved by simply predicting all cases to be non-defaults.
library(rpart)
fit_default <- rpart(loan_status ~ ., method = "class", data = training_set)
#**observe the warning message!
plot(fit_default)
There are three main things we can do to overcome the unbalanced-data problem (defaults make up only a small fraction of the data set).
1) Over-sampling OR Under-sampling
+ You can over-sample the under-represented group (number of defaults on loan) OR under-sample the over-represented group (number of non-defaults on loan).
+ Balancing the training-set will have a positive effect on accuracy.
+ Note: over- or under-sampling should ONLY be applied to the training_set (NOT the test_set).
2) Changing the prior probabilities in the rpart() function
+ By default, the prior probabilities of "default" vs "non-default" are set equal to their proportions in the training_set.
+ By making the prior probability for "default" bigger, you can trick R into attaching more importance to defaults, leading to a better decision tree.
3) Include a loss matrix using the rpart() function
+ As a third option, the rpart() function allows a specification of a loss-matrix.
+ In this Loss Matrix, different costs can be associated with a misclassification of a “default” as a “non-default” versus the misclassification of a “non-default” as a “default”.
+ By increasing the misclassification cost of the former, more attention is given to the correct classification of "default", improving the quality of the decision tree.
Validate the models to see what's best!
All three methods are intended to overcome the problem of class imbalance, but validation is very important.
Under- or over-sampling may work very well for some data sets and poorly for others, and the same holds for the other methods.
Which method is best for the data at hand only becomes clear upon proper validation.
Exercise) Under-sampling the training set
To overcome the unbalanced data problem, you can use under- or over-sampling. The training set has been under-sampled for you, such that 1/3 of the training set consists of defaults and 2/3 of non-defaults. The resulting data set is available in your workspace as undersampled_training_set, and contains fewer observations (6570 instead of 19394). In this exercise, you will create a decision tree using the under-sampled data set.
You will notice that the trees in this and the next exercises are very big, so big that you cannot really read them anymore. Don’t worry about this for now, we will tell you how you can make them more manageable in the next video!
Instructions:
We'll use the ovun.sample() function from the ROSE package.
- First, use the table() function to identify how many default and non-default observations are in the training_set data frame.
- Next, save the loan default count (2138) in a variable called n_loan_status.
- Save the fraction of loan defaults we would like in the under-sampled data set.
table(training_set$loan_status)
##
## 0 1
## 17256 2138
2138/.33
## [1] 6478.788
n_loan_status <- 2138
new_frac_loan_status <- 0.33
new_n_total <- n_loan_status / new_frac_loan_status # = 2138 / 0.33 ≈ 6478.8 (total observation count)
library(ROSE)
## Warning: package 'ROSE' was built under R version 3.6.3
## Loaded ROSE 0.0-3
undersampled_result <- ovun.sample(loan_status ~ .,
data = training_set,
method = "under",
N = new_n_total,
seed = 1234)
undersampled_training_set <- undersampled_result$data
table(undersampled_training_set$loan_status)
##
## 0 1
## 4624 1854
# Load package rpart in your workspace.
library(rpart)
# Change the code provided in the video such that a decision tree is constructed using the undersampled training set. Include rpart.control to relax the complexity parameter to 0.001.
tree_undersample <- rpart(loan_status ~ ., method = "class",
data = undersampled_training_set,
control = rpart.control(cp = 0.001))
# Plot the decision tree
plot(tree_undersample, uniform = TRUE)
# Add labels to the decision tree
text(tree_undersample, use.n = TRUE)
Exercise) Changing the prior probabilities
You can also change the prior probabilities to obtain a decision tree. This is an indirect way of adjusting the importance of misclassifications for each class. You can specify another argument inside rpart() to include prior probabilities. The argument you are looking for has the following form
parms = list(prior=c(non_default_proportion, default_proportion))
Instructions:
# Load package rpart in your workspace.
library(rpart)
# Change the code below such that a tree is constructed with adjusted prior probabilities.
tree_prior <- rpart(loan_status ~ ., method = "class",
data = training_set,
parms = list(prior = c(0.7, 0.3)),
control = rpart.control(cp = 0.001))
# Plot the decision tree
plot(tree_prior, uniform = TRUE)
# Add labels to the decision tree
text(tree_prior)
Exercise 3) Including a Loss Matrix
Thirdly, you can include a loss matrix, changing the relative importance of misclassifying a default as non-default versus a non-default as a default.
You want to stress that misclassifying a default as a non-default should be penalized more heavily.
Including a loss matrix can again be done via the parms argument, this time using its loss element.
parms = list(loss = matrix(c(0, cost_def_as_nondef, cost_nondef_as_def, 0), ncol=2))
Doing this, you are constructing a 2x2-matrix with zeroes on the diagonal and changed loss penalties off-diagonal. The default loss matrix is all ones off-diagonal.
Instructions:
# Change the code below such that a decision tree is constructed using a loss matrix penalizing 10 times more heavily for misclassified defaults.
tree_loss_matrix <- rpart(loan_status ~ ., method = "class",
data = training_set,
# parms = list(loss = matrix(c(0, cost_def_as_nondef, cost_nondef_as_def, 0), ncol=2))
parms = list(loss = matrix(c(0, 10, 1, 0), ncol=2)),
control = rpart.control(cp = 0.001))
# Plot the decision tree
plot(tree_loss_matrix, uniform = TRUE)
# Add labels to the decision tree
text(tree_loss_matrix)
The decision trees in the previous exercises are very large.
It’s generally not advised to construct enormous decision trees.
Not only are they harder to interpret, but over-fitting may occur, leading to inferior results when applying the model to the test set.
The rpart package provides tools that make pruning of the decision tree fairly easy.
We'll be using the functions plotcp() and printcp() for pruning purposes.
Problem with large decision trees
- Too complex: not clear anymore
- Overfitting when applying to test set
Solution: use plotcp(), printcp() for pruning purposes
Let's have a look at what these functions do when we apply them to the tree we constructed using the undersampled training set (the tree from the video is not available, so we use our own tree_undersample).
We’ll look at printcp() first.
Applying this function to our object tree_undersample, you get an overview of the tree for an increasing number of splits (column nsplit).
You can change the CP (complexity parameter), which was fixed to 0.001 before, to obtain the desired complexity level.
But what level should be chosen?
We want to minimize the cross-validated error of the decision tree. These results are given in the column xerror. Cross-validation inside the training set is used to obtain this error measure.
Since there is a random element to the cross-validation process, it is necessary to set a seed if you want your results to be reproducible.
printcp(tree_undersample)
##
## Classification tree:
## rpart(formula = loan_status ~ ., data = undersampled_training_set,
## method = "class", control = rpart.control(cp = 0.001))
##
## Variables actually used in tree construction:
## [1] age annual_inc emp_length grade
## [5] home_ownership int_rate loan_amnt
##
## Root node error: 1854/6478 = 0.2862
##
## n= 6478
##
## CP nsplit rel error xerror xstd
## 1 0.0048544 0 1.00000 1.0000 0.019622
## 2 0.0032362 5 0.97519 1.0027 0.019637
## 3 0.0024272 6 0.97195 1.0119 0.019691
## 4 0.0021575 13 0.95092 1.0113 0.019688
## 5 0.0018878 15 0.94660 1.0124 0.019694
## 6 0.0016181 24 0.92826 1.0119 0.019691
## 7 0.0013484 41 0.89320 1.0167 0.019718
## 8 0.0012585 46 0.88619 1.0259 0.019771
## 9 0.0010787 55 0.87433 1.0512 0.019910
## 10 0.0010017 66 0.86192 1.0653 0.019985
## 11 0.0010000 74 0.85329 1.0690 0.020005
You can also plot the cross-validated error as a function of the complexity parameter CP and the size of the tree using the function plotcp().
The minimum cross-validated error can be found at the point marked in red in the picture, where cp = 0.003653. For this exact cp value, you can have a look at the printcp() output again.
plotcp(tree_undersample)
To prune the tree, the function prune() can be used with the initial unpruned tree tree_undersample as the first argument and a cp value that minimizes the cross-validated error as the second argument.
Plotting the pruned tree, you get a smaller decision tree.
Note that this tree does not give information on the true number of default vs non-default that are present in each leaf.
If you want this information, use the argument: use.n = TRUE in the text function.
Keep in mind that a "yes" to the test statement in a node leads you down the left-hand branch, and a "no" leads you down the right-hand branch.
This is not very intuitive in the default rpart plot. In addition, labels such as "home_ownership=cd" are not very informative.
ptree_undersample = prune(tree_undersample, cp = 0.0032362)
plot(ptree_undersample, uniform = TRUE)
text(ptree_undersample, use.n = TRUE)
prp(), from the rpart.plot package, is a much more intuitive plotting function than plot().
When plotting with prp() you receive guidance on which branch to take depending on the answer to the split question.
You’ll also get the value name for the factor variables (in this example: home_own = OWN or REN).
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.6.2
prp(ptree_undersample)
If you want to have the number of ‘default vs non-default’ cases, add the argument: extra = 1
library(rpart.plot)
prp(ptree_undersample, extra=1)
Exercise 1) Pruning the tree with changed prior probabilities
Pruning a tree is necessary to avoid over-fitting.
There were some big trees in the previous exercises. Now you will put what you have learned into practice and prune the previously constructed tree with the changed prior probabilities.
You will first set a seed to make sure the results are reproducible, because you will be examining cross-validated error results.
Results involve randomness and could differ slightly upon running the function again with a different seed.
In this exercise you will learn to identify which complexity parameter (CP) will minimize the cross-validated error results, then prune your tree based on this value.
Instructions:
library(rpart.plot)
# tree_prior is loaded in your workspace
# Plot the cross-validated error rate as a function of the complexity parameter
plotcp(tree_prior)
# Use printcp() to identify for which complexity parameter the cross-validated error rate is minimized.
printcp(tree_prior)
##
## Classification tree:
## rpart(formula = loan_status ~ ., data = training_set, method = "class",
## parms = list(prior = c(0.7, 0.3)), control = rpart.control(cp = 0.001))
##
## Variables actually used in tree construction:
## [1] age annual_inc emp_cat emp_length
## [5] grade home_ownership int_rate loan_amnt
##
## Root node error: 5818.2/19394 = 0.3
##
## n= 19394
##
## CP nsplit rel error xerror xstd
## 1 0.0052744 0 1.00000 1.00000 0.020400
## 2 0.0047527 3 0.98418 1.00502 0.020255
## 3 0.0035727 4 0.97942 0.99878 0.019929
## 4 0.0029150 8 0.96269 0.99305 0.019578
## 5 0.0022178 9 0.95977 0.99212 0.019421
## 6 0.0021406 15 0.94462 0.99118 0.019412
## 7 0.0018931 18 0.93820 0.98593 0.019340
## 8 0.0016159 19 0.93630 0.98877 0.019298
## 9 0.0015332 20 0.93469 0.99140 0.019252
## 10 0.0013233 23 0.93009 0.99615 0.019248
## 11 0.0012569 24 0.92877 0.99705 0.019131
## 12 0.0012458 25 0.92751 0.99795 0.019119
## 13 0.0012447 27 0.92502 0.99761 0.019115
## 14 0.0011704 30 0.92108 0.99761 0.019115
## 15 0.0011582 33 0.91757 0.99675 0.019102
## 16 0.0011387 37 0.91283 0.99694 0.019062
## 17 0.0010817 48 0.89841 0.99410 0.019014
## 18 0.0010804 49 0.89733 0.99578 0.019020
## 19 0.0010766 58 0.88299 0.99578 0.019020
## 20 0.0010596 65 0.87405 0.99566 0.019011
## 21 0.0010374 67 0.87193 0.99540 0.019048
## 22 0.0010341 69 0.86986 0.99481 0.019035
## 23 0.0010000 71 0.86779 0.99723 0.019045
# Create an index for the row with the minimum xerror
index <- which.min(tree_prior$cptable[ , "xerror"])
# Create tree_min
tree_min <- tree_prior$cptable[index, "CP"]
# Prune the tree using tree_min
ptree_prior <- prune(tree_prior, cp = tree_min)
# Use prp() to plot the pruned tree
prp(ptree_prior)
Exercise 2) Pruning the tree with the loss matrix
In this exercise, you will prune the tree that was built using a loss matrix in order to penalize misclassified defaults more than misclassified non-defaults.
Instructions:
# set a seed and run the code to construct the tree with the loss matrix again
set.seed(345)
tree_loss_matrix <- rpart(loan_status ~ ., method = "class", data = training_set,
parms = list(loss=matrix(c(0, 10, 1, 0), ncol = 2)),
control = rpart.control(cp = 0.001))
# Plot the cross-validated error rate as a function of the complexity parameter
plotcp(tree_loss_matrix)
# Prune the tree using cp = 0.0012788
ptree_loss_matrix <- prune(tree_loss_matrix, cp = 0.0012788)
# Use prp() and argument extra = 1 to plot the pruned tree
prp(ptree_loss_matrix, extra = 1)
We've covered several options for facilitating the construction of a tree, especially on unbalanced data.
Of course, the range of possibilities is endless, and there are other arguments you can pass to the rpart() function to shape your loan-default decision tree.
We’ll go over some of the most important ones.
An important and interesting argument in rpart() is weights.
The other interesting arguments mentioned here are in the rpart.control() function, which is passed to the control argument. We've used this before to change the cp (complexity parameter).
An interesting argument here is minsplit.
- minsplit: the minimum number of observations that must exist in a node for a split to be attempted. The default value is 20, but for unbalanced data it might be useful to lower this value.
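A minimal sketch (the object name and settings below are illustrative, not the course's): lowering minsplit, together with minbucket (the minimum number of observations in a terminal node), lets rpart() attempt splits in smaller nodes.
# Illustrative only: allow splits in nodes with as few as 5 observations and leaves with as few as 2
tree_small_nodes <- rpart(loan_status ~ ., method = "class",
                          data = training_set,
                          control = rpart.control(minsplit = 5, minbucket = 2, cp = 0.001))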
The last thing we need to do is to evaluate our Decision Tree.
Until now, we only used the training set, but now we’ll use our test set to evaluate the result.
This can be done using the predict() function.
The code to make predictions using the model with the undersampled data set is shown below:
By including an additional argument, type = "class", we get a vector of class predictions right away, without having to choose a cut-off value (as we did for logistic regression).
However, in some cases we want to choose a cut-off value ourselves. In that case, leave out the type = "class" argument and you'll get predicted probabilities instead of binary predictions.
pred_undersample_class = predict(ptree_undersample, newdata = test_set, type = "class")
str(pred_undersample_class)
## Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
## - attr(*, "names")= chr [1:9698] "2" "6" "7" "10" ...
OR
pred_undersample = predict(ptree_undersample, newdata = test_set)
str(pred_undersample)
## num [1:9698, 1:2] 0.589 0.665 0.37 0.589 0.665 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:9698] "2" "6" "7" "10" ...
## ..$ : chr [1:2] "0" "1"
Here a confusion matrix is given for the decision tree that was constructed using undersampled data and binary predictions.
Even though we used undersampling, very few defaults are predicted for our test set. It could be useful here to use probability predictions instead of binary predictions and lower the cut-off (a quick sketch follows the confusion matrix below).
table(test_set$loan_status, pred_undersample_class)
## pred_undersample_class
## 0 1
## 0 8449 160
## 1 1046 43
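As a sketch of that idea (the 0.3 cut-off and the object name are arbitrary illustrations, not from the course), we can reuse the probability matrix pred_undersample from above:
# Illustrative only: take the predicted probability of default (column "1") and apply a lower cut-off of 0.3
pred_undersample_30 <- ifelse(pred_undersample[, "1"] > 0.3, 1, 0)
table(test_set$loan_status, pred_undersample_30)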
Exercise 1) One final tree using more options
In this exercise, you will use some final arguments that were discussed above. Some specifications in the rpart.control() function will be changed, and some weights will be included using the weights argument in rpart().
The vector case_weights has been constructed for you and is loaded in your workspace. This vector contains weights of 1 for the non-defaults in the training set, and weights of 3 for the defaults in the training set. By specifying higher weights for defaults, the model will assign higher importance to classifying defaults correctly.
Instructions:
#Creating the case weight:
#code found on Stack overflow
#weights: optional case weights
#cost: a vector of non-negative costs, one for each variable in the model. Defaults to one for all variables. These are scalings to be applied when considering splits, so the improvement on splitting on a variable is divided by its cost in deciding which split to choose.
#The weights is for rows (e.g. give higher weight to smaller class), the cost is for columns.
#subset() restricts the data.frame training_set to the rows that meet the given condition.
positiveWeight = 1.0 / (nrow(subset(training_set, loan_status == 1)) / nrow(training_set))
negativeWeight = 1.0 / (nrow(subset(training_set, loan_status != 1)) / nrow(training_set))
modelWeights <- ifelse(training_set$loan_status== 1, positiveWeight, negativeWeight)
library(rpart.plot)
# set a seed and run the code to obtain a tree using weights, minsplit and minbucket
set.seed(345)
tree_weights <- rpart(loan_status ~ ., method = "class",
data = training_set,
                      #case_weights from the exercise is not available here, so we use modelWeights from above
                      weights = modelWeights,
control = rpart.control(minsplit = 5, minbucket = 2, cp = 0.001))
# Plot the cross-validated error rate for a changing cp
plotcp(tree_weights)
# Create an index for the row with the minimum xerror
index <- which.min(tree_weights$cptable[ , "xerror"])
# Create tree_min
tree_min <- tree_weights$cptable[index, "CP"]
# Prune the tree using tree_min
ptree_weights <- prune(tree_weights, cp = tree_min)
# Plot the pruned tree using the rpart.plot()-package
prp(ptree_weights, extra = 1)
Exercise 2) Confusion matrices and accuracy of our final trees
Over the past few exercises, you have constructed quite a few pruned decision trees, four in total. As you can see, the eventual number of splits varies quite a bit from one tree to another.
Now it is important to know which tree performs best in terms of accuracy.
In order to get the accuracy, you will start off by making predictions using the test set, and construct the confusion matrix for each of these trees. You will add the argument type = “class” when doing these predictions. By doing this there is no need to set a cut-off.
Nevertheless, it is important to be aware of the fact that not only the accuracy is important, but also the sensitivity and specificity. Additionally, predicting probabilities instead of binary values (0 or 1) has the advantage that the cut-off can be moved along. Then again, the difficulty here is the choice of the cut-off. You will return to this in the next chapter.
In case you needed a reminder, here is how to compute the accuracy: $ Classification Accuracy = (TP + TN) / (TP + FP + TN + FN) $
Instructions:
# Make predictions for each of the pruned trees using the test set.
pred_undersample <- predict(ptree_undersample, newdata = test_set, type = "class")
pred_prior <- predict(ptree_prior, newdata = test_set, type = "class")
pred_loss_matrix <- predict(ptree_loss_matrix, newdata = test_set, type = "class")
pred_weights <- predict(ptree_weights, newdata = test_set, type = "class")
# construct confusion matrices using the predictions.
confmat_undersample <- table(test_set$loan_status, pred_undersample)
confmat_prior <- table(test_set$loan_status, pred_prior)
confmat_loss_matrix <- table(test_set$loan_status, pred_loss_matrix)
confmat_weights <- table(test_set$loan_status, pred_weights)
# Compute the accuracies
acc_undersample <- sum(diag(confmat_undersample)) / nrow(test_set)
acc_prior <- sum(diag(confmat_prior)) / nrow(test_set)
acc_loss_matrix <- sum(diag(confmat_loss_matrix)) / nrow(test_set)
acc_weights <- sum(diag(confmat_weights)) / nrow(test_set)
print(acc_undersample)
print(acc_prior)
print(acc_loss_matrix)
print(acc_weights)
Section 4
In the previous chapters, we gave you an overview of how Logistic Regression models and Decision Trees can be used to predict probability of default.
Using the predict() function on the logistic regression model with type = "response", we get a vector of predicted probabilities of default.
Using the predict() function on the decision tree model, we will get a matrix with two columns.
Select the 2nd column to make them comparable to the predictions from the logistic regression model.
Until now, we've set cut-offs based on best guesses or a gut feeling in order to construct confusion matrices and accuracies for each of the models used.
The choice of cut-off is important, because changing it changes the confusion matrix.
When using these models on future applicants, the cut-off can be a way of deciding which loan applicants get a loan and which do not.
It is clear that this model will never be perfect, and no matter how many applicants a bank refuses, there will still be debtors that default on a loan.
log_reg_model <- glm(loan_status ~ age + emp_cat + ir_cat + loan_amnt + home_ownership + annual_inc,
family = binomial, data = training_set)
pred_log_regression_model <- predict(log_reg_model, newdata = test_set, type = "response")
The good thing is that the model can help a bank decide how many loans to approve if it does not want to exceed a certain percentage of defaults in its portfolio of customers.
We’ll start with a simple example using the full logistic regression model from before.
Let's assume the test set contains new applicants and the bank decides to reject 20% of them based on their fitted probability of default.
This means that the 20% with the highest predicted probability of default will be rejected by the bank.
To obtain the cut-off value that leads to a predicted "1" (default) for 20% of the cases in the test set, you need to look at the 80% quantile of the predictions vector.
Having used this cut-off, you know which test set loan applicants would have been accepted using the 80% acceptance rate.
log_model_full <- glm(loan_status ~ ., family = "binomial", data = training_set)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
predictions_all_full <- predict(log_model_full, newdata = test_set, type = "response")
cutoff <- quantile(predictions_all_full, 0.8)
pred_full_20 <- ifelse(predictions_all_full > cutoff, 1, 0)
We now want to have a look at the true status of the loans that would have been accepted using this cut-off, and see what percentage of the accepted loans actually defaulted; this is also referred to as the bad rate.
> Having accepted 80% of the loans, we see below that the bad rate is about 9.5% (0.0947).
true_and_predval <- cbind(test_set$loan_status, pred_full_20)
str(true_and_predval)
## num [1:9698, 1:2] 0 0 1 0 0 0 0 1 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:9698] "2" "6" "7" "10" ...
## ..$ : chr [1:2] "" "pred_full_20"
accepted_loans <- true_and_predval[pred_full_20 == 0, 1]
bad_rate <- sum(accepted_loans)/length(accepted_loans)
bad_rate
## [1] 0.09474091
If you repeat this for several acceptance rates, you can construct an entire table with acceptance rates, the corresponding cut-off values, and the corresponding bad rates.
This is a useful tool for a bank, because it can adapt its acceptance rate depending on the percentage of bad loans it can allow in its loan portfolio.
Suppose a bank can only allow a bad rate of 8%. Using this model (see the strategy table in the course picture), the bank would decide to accept 65% of the loans.
The strategy curve provides us with a visual tool as well.
It shows the bad rates as a function of the acceptance rate.
library(rpart)
printcp(tree_undersample)
summary(tree_undersample)
ptree_undersample = prune(tree_undersample, cp = 0.003653)
Exercise 1) Computing a bad rate given a fixed acceptance rate
We learned how to compute the bad rate (that is, the percentage of defaults) in the loan portfolio of a bank when given a fitted model and an acceptance rate.
In this exercise, you will compute the bad rate that a bank can expect when using the pruned tree “ptree_prior” that you fitted before, and an acceptance rate of 80%. As a reminder, the tree is plotted on your right hand side.
Instructions:
#Goal: Calculate the percentage of default (bad rate)
#step1) Create the prediction using pruned decision tree.
#step2) Create the cut-off for the predicted values.
#step3) Change the predicted values to binary values (1 = yes, 0 = no) using the cut-off.
#step4) Isolate the observations that are "default"
#step5) Calculate sum(default) / total number of instances
# Make predictions for the probability of default using the pruned tree and the test set.
prob_default_prior <- predict(ptree_prior, newdata = test_set)[ ,2]
# Obtain the cutoff for acceptance rate 80%
cutoff_prior <- quantile(prob_default_prior, 0.8)
# Obtain the binary predictions.
bin_pred_prior_80 <- ifelse(prob_default_prior > cutoff_prior, 1, 0)
# Obtain the actual default status for the accepted loans
accepted_status_prior_80 <- test_set$loan_status[bin_pred_prior_80 == 0]
# Obtain the bad rate for the accepted loans
sum(accepted_status_prior_80) / length(accepted_status_prior_80)
## [1] 0.09694009
Exercise 2) The strategy table and strategy curve
Repeating the calculations you did in the previous exercise for several acceptance rates, you can obtain a strategy table. This table can be a useful tool for banks, as it gives them better insight for defining an acceptance strategy.
You know how to compute a bad rate for a certain acceptance rate by now, so the function strategy_bank was written and loaded into your workspace to speed things up (it is not available here; a rough sketch is given below). This function computes the cut-off and bad rate for acceptance rates that are multiples of 5% (0%, 5%, 10%, …).
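Since strategy_bank itself is not available in these notes, here is a minimal sketch of what such a function could look like (the name strategy_bank_sketch and its internals are my reconstruction; it assumes the true statuses live in test_set$loan_status):
strategy_bank_sketch <- function(prob_of_def) {
  accept_rate <- seq(1, 0, by = -0.05)
  cutoff <- rep(NA, length(accept_rate))
  bad_rate <- rep(NA, length(accept_rate))
  for (i in seq_along(accept_rate)) {
    # Cut-off such that a fraction accept_rate[i] of applicants falls below it
    cutoff[i] <- quantile(prob_of_def, accept_rate[i])
    pred_i <- ifelse(prob_of_def > cutoff[i], 1, 0)
    accepted <- test_set$loan_status[pred_i == 0]
    bad_rate[i] <- sum(accepted) / length(accepted)
  }
  list(table = cbind(accept_rate, cutoff = round(cutoff, 4), bad_rate = round(bad_rate, 4)),
       accept_rate = accept_rate, cutoff = cutoff, bad_rate = bad_rate)
}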
Instructions:
# Have a look at the function strategy_bank
strategy_bank
# Apply the function strategy_bank to both predictions_cloglog and predictions_loss_matrix
strategy_cloglog <- strategy_bank(predictions_cloglog)
strategy_loss_matrix <- strategy_bank(predictions_loss_matrix)
# Obtain the strategy tables for both prediction-vectors
strategy_cloglog$table
strategy_loss_matrix$table
# Plot the strategy functions
par(mfrow = c(1,2))
plot(strategy_cloglog$accept_rate, strategy_cloglog$bad_rate,
type = "l", xlab = "Acceptance rate", ylab = "Bad rate",
lwd = 2, main = "logistic regression")
plot(strategy_loss_matrix$accept_rate, strategy_loss_matrix$bad_rate,
type = "l", xlab = "Acceptance rate",
ylab = "Bad rate", lwd = 2, main = "tree")
Exercise 2) continued question
Imagine you have 2 banks: Bank A, and Bank B. These banks are restricted to using a certain acceptance rate.
Based on the strategy tables, which model will Bank A and Bank B prefer to use?
Answer: Bank A will prefer to use the classification tree model, but for Bank B, the logistic regression model leads to lower bad rates.
In the previous exercise, we saw how a bank may prefer one model over another depending on the maximum bad rate it is allowed to have or, turning it around, depending on its acceptance rate.
Often, banks want to know which model is best overall, without having to make any assumption about the acceptance rate or cut-off.
To illustrate this, let's go back to the confusion-matrix metrics.
Previously we used classification accuracy, which includes all the test set elements in its denominator and is therefore tempting to use.
But classification accuracy is problematic in the credit-risk (and claims-risk) context: accuracy is generally maximized by choosing a very high cut-off, so it is not the best measure.
This is why most credit risk models are based on the measurement of sensitivity and specificity.
Definition example: In medical diagnosis, test sensitivity is the ability of a test to correctly identify those with the disease (true positive rate), whereas test specificity is the ability of the test to correctly identify those without the disease (true negative rate).
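For reference, in terms of the confusion-matrix counts used for accuracy above: $ Sensitivity = TP / (TP + FN) $ and $ Specificity = TN / (TN + FP) $.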
One of the most popular methods of evaluating credit risk models is based on sensitivity and specificity.
This evaluation method is called the ROC (Receiver Operating Characteristic) curve.
The ROC curve is constructed by plotting sensitivity (y-axis) against 1 - specificity (x-axis) for every possible cut-off value.
The curve always starts in the lower-left corner, where sensitivity = 0 and 1 - specificity = 0.
This corresponds to a cut-off of 1, which classifies all cases as "non-default".
The other end, the upper-right corner of the graph, corresponds to a cut-off of 0, which classifies all cases as "default".
Generally speaking, the closer the ROC curve is to the top-left corner, THE BETTER.
At the other extreme, the 45-degree red line would be the result of assigning cases to the default versus non-default group randomly rather than through a model.
Additionally, ROC curves may cross over, which makes it hard to tell which model is better.
This problem is illustrated in the course picture with ROC curves for two models.
A measure that banks like to use when comparing ROC curves is the AUC (Area Under the Curve).
Computing the AUC here, we can see that Model B is our better choice.
ROC-curves can easily be created using the pROC-package in R.
Let's have a look at whether there is a big difference between the ROC curves for the four logistic regression models previously used throughout this course.
A small heads up:
You will first draw the ROC-curves for these four models in one plot. Afterwards, you will look at the area under the curve.
Instructions:
- Load the pROC-package in your R-console.
- Construct the ROC-objects for the four logistic regression models using function roc(response, predictor).
+ Remember that the response is the loan status indicator in the test_set, which can be obtained through test_set$loan_status.
- Use the previously created objects to construct ROC curves. To draw them all on one plot, use plot() for the first ROC curve (ROC_logit), and use lines() to add the other three models to the same plot.
- Use the col-argument to change the color of the curve of ROC_probit to “blue”, ROC_cloglog to “red” and ROC_all_full to “green”.
+ Note: the x-axis label is Specificity and not “1-Specificity”, resulting in an axis that goes from 1 on the left-hand side to 0 on the right-hand side.
- It seems that the link function does not have a big impact on the ROC here, and the main trigger of a better ROC is the inclusion of more variables in the model. To get an exact idea of the performance of the ROC curves, have a look at the AUCs, using the function auc().
# Fit the logit, probit and cloglog-link logistic regression models
log_model_logit <- glm(loan_status ~ age + ir_cat + loan_amnt,
family = binomial(link = logit), data = training_set)
log_model_probit <- glm(loan_status ~ age + emp_cat + ir_cat + loan_amnt,
family = binomial(link = probit), data = training_set)
log_model_cloglog <- glm(loan_status ~ age + emp_cat + ir_cat + loan_amnt,
family = binomial(link = cloglog), data = training_set)
log_model_all_full <- glm(loan_status ~ age + emp_cat + ir_cat + loan_amnt + home_ownership + annual_inc,
family = binomial, data = training_set)
# Make predictions for all models using the test set
predictions_logit <- predict(log_model_logit, newdata = test_set, type = "response")
predictions_probit <- predict(log_model_probit, newdata = test_set, type = "response")
predictions_cloglog <- predict(log_model_cloglog, newdata = test_set, type = "response")
predictions_all_full <- predict(log_model_all_full, newdata = test_set, type = "response")
# Use a cut-off of 14% to make binary predictions-vectors
cutoff <- 0.14
class_pred_logit <- ifelse(predictions_logit > cutoff, 1, 0)
class_pred_probit <- ifelse(predictions_probit > cutoff, 1, 0)
class_pred_cloglog <- ifelse(predictions_cloglog > cutoff, 1, 0)
class_pred_all_full <- ifelse(predictions_all_full > cutoff, 1, 0)
# Load the pROC-package
library(pROC)
## Warning: package 'pROC' was built under R version 3.6.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Construct the objects containing ROC-information
ROC_logit <- roc(test_set$loan_status, predictions_logit)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
ROC_probit <- roc(test_set$loan_status, predictions_probit)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
ROC_cloglog <- roc(test_set$loan_status, predictions_cloglog)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
ROC_all_full <- roc(test_set$loan_status, predictions_all_full)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Draw all ROCs on one plot
plot(ROC_logit)
lines(ROC_probit, col="blue")
lines(ROC_cloglog, col="red")
lines(ROC_all_full, col="green")
# Compute the AUCs
auc(ROC_logit)
## Area under the curve: 0.6193
auc(ROC_probit)
## Area under the curve: 0.6297
auc(ROC_cloglog)
## Area under the curve: 0.6295
auc(ROC_all_full)
## Area under the curve: 0.6435
Let's have a look again at the ROC curve of the logistic regression model.
The ROC curve clearly improves when increasing the number of variables in the model from 4 to 7.
In other words, the ROC curve, or more specifically the AUC, can be used for variable selection.
The ROC curve (AUC measure) can be used for variable selection.
In fact, banks (and insurers) are particularly interested in knowing which variables are important for predicting loan default.
When evaluating models based on classification performance, it might be worth selecting variables based on the model's AUC rather than on p-values.
This method is referred to as AUC-based pruning.
Step 1) Start with a model including all variables (in our case, 7) and compute AUC
log_model_full <- glm(loan_status ~ grade + age + emp_cat + ir_cat + loan_amnt + home_ownership + annual_inc,
family = binomial, data = training_set)
predictions_model_full <- predict(log_model_full, newdata = test_set, type = "response")
AUC_model_full <- auc(test_set$loan_status, predictions_model_full)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
AUC_model_full
## Area under the curve: 0.6518
Step 2) Build 7 new models, where each time one of the variables is removed, and make PD-predictions using the test set.
Step 3) Keep the model that led to the best AUC measure
In this example, we get the highest AUC by removing home_ownership.
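As a sketch of one of the seven step-2 candidate models (the object names are mine; auc() from pROC is used as elsewhere in these notes): drop home_ownership from the full model, refit, and compute the AUC on the test set.
# Illustrative only: refit without home_ownership and evaluate its AUC
log_1_remove_home <- glm(loan_status ~ grade + age + emp_cat + ir_cat + loan_amnt + annual_inc,
                         family = binomial, data = training_set)
pred_1_remove_home <- predict(log_1_remove_home, newdata = test_set, type = "response")
auc(test_set$loan_status, pred_1_remove_home)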
In the video you saw how the “full” logistic regression model with a logit link was being pruned based on the AUC. You saw how the variable home_ownership was deleted from the model, as it improved the overall AUC. After repeating this process for two additional rounds, the variables age and ir_cat were deleted, leading to the model:
log_3_remove_ir <- glm(loan_status ~ loan_amnt + grade + annual_inc + emp_cat, family = binomial, data = training_set)
with an AUC of 0.6545.
Now, it’s your turn to see whether the AUC can still be improved by deleting another variable from the model.
Instructions:
# Build four models each time deleting one variable in log_3_remove_ir
log_4_remove_amnt <- glm(loan_status ~ grade + annual_inc + emp_cat,
family = binomial, data = training_set)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
log_4_remove_grade <- glm(loan_status ~ loan_amnt + annual_inc + emp_cat, family = binomial, data = training_set)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
log_4_remove_inc <- glm(loan_status ~ loan_amnt + grade + emp_cat, family = binomial, data = training_set)
log_4_remove_emp <- glm(loan_status ~ loan_amnt + grade + annual_inc, family = binomial, data = training_set)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Make PD-predictions for each of the models
pred_4_remove_amnt <- predict(log_4_remove_amnt, newdata = test_set, type = "response")
pred_4_remove_grade <- predict(log_4_remove_grade, newdata = test_set, type = "response")
pred_4_remove_inc <- predict(log_4_remove_inc, newdata = test_set, type = "response")
pred_4_remove_emp <- predict(log_4_remove_emp, newdata = test_set, type = "response")
# Compute the AUCs
auc(test_set$loan_status, pred_4_remove_amnt)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.6526
auc(test_set$loan_status, pred_4_remove_grade)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.5901
auc(test_set$loan_status, pred_4_remove_inc)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.6381
auc(test_set$loan_status, pred_4_remove_emp)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.6443
Deleting the variable loan_amnt gives the best AUC of the four candidates (0.6548 in the course's run; 0.6526 in the output above). The resulting model is
log_4_remove_amnt <- glm(loan_status ~ grade + annual_inc + emp_cat, family = binomial, data = training_set)
Is it possible to reduce the logistic regression model to only two variables without reducing the AUC? In this exercise you will find out!
Instructions:
# Build three models each time deleting one variable in log_4_remove_amnt
log_5_remove_grade <- glm(loan_status ~ annual_inc + emp_cat, family = binomial, data = training_set)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
log_5_remove_inc <- glm(loan_status ~ grade + emp_cat , family = binomial, data = training_set)
log_5_remove_emp <- glm(loan_status ~ grade + annual_inc, family = binomial, data = training_set)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Make PD-predictions for each of the models
pred_5_remove_grade <- predict(log_5_remove_grade, newdata = test_set, type = "response")
pred_5_remove_inc <- predict(log_5_remove_inc, newdata = test_set, type = "response")
pred_5_remove_emp <- predict(log_5_remove_emp, newdata = test_set, type = "response")
# Compute the AUCs
auc(test_set$loan_status, pred_5_remove_grade)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.5853
auc(test_set$loan_status, pred_5_remove_inc)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.6338
auc(test_set$loan_status, pred_5_remove_emp)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.6457
# Plot the ROC-curve for the best model here
plot(roc(test_set$loan_status,pred_4_remove_amnt))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
Other methods of classifying default vs non-default.
All the models discussed here focus on classification.
An important shortcoming of these models is that the timing of default is completely neglected.
Survival Analysis
Using survival analysis, one can obtain probabilities of loan default that change over time.
Another advantage of Survival Analysis is that variables that change over time can be included (time-varying covariates).