Loading data for course:

loan_data <- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/162/datasets/8f48a2cbb6150e7ae32435e55f271cad5b4b8ecf/loan_data_ch1.rds")))

##################################
loan_data$ir_cat <- rep(NA, length(loan_data$int_rate))

loan_data$ir_cat[which(loan_data$int_rate <= 8)] <- "0-8"
loan_data$ir_cat[which(loan_data$int_rate > 8 & loan_data$int_rate <= 11)] <- "8-11"
loan_data$ir_cat[which(loan_data$int_rate > 11 & loan_data$int_rate <= 13.5)] <- "11-13.5"
loan_data$ir_cat[which(loan_data$int_rate > 13.5)] <- "13.5+"
loan_data$ir_cat[which(is.na(loan_data$int_rate))] <- "Missing"

loan_data$ir_cat <- as.factor(loan_data$ir_cat)

#Question
# Make the necessary replacements in the coarse classification example below 
loan_data$emp_cat <- rep(NA, length(loan_data$emp_length))

loan_data$emp_cat[which(loan_data$emp_length <= 15)] <- "0-15"
loan_data$emp_cat[which(loan_data$emp_length > 15 & loan_data$emp_length <= 30)] <- "15-30"
loan_data$emp_cat[which(loan_data$emp_length > 30 & loan_data$emp_length <= 45)] <- "30-45"
loan_data$emp_cat[which(loan_data$emp_length > 45)] <- "45+"
loan_data$emp_cat[which(is.na(loan_data$emp_length))] <- "Missing"

loan_data$emp_cat <- as.factor(loan_data$emp_cat)

# Look at your new variable using plot()

str(loan_data)

## 'data.frame':    29092 obs. of  10 variables:
##  $ loan_status   : int  0 0 0 0 0 0 1 0 1 0 ...
##  $ loan_amnt     : int  5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
##  $ int_rate      : num  10.6 NA 13.5 NA NA ...
##  $ grade         : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 3 2 2 4 ...
##  $ emp_length    : int  10 25 13 3 9 11 0 3 3 0 ...
##  $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 3 4 4 4 4 ...
##  $ annual_inc    : num  24000 12252 49200 36000 48000 ...
##  $ age           : int  33 31 24 39 24 28 22 22 28 22 ...
##  $ ir_cat        : Factor w/ 5 levels "0-8","11-13.5",..: 4 5 2 5 5 2 2 4 4 3 ...
##  $ emp_cat       : Factor w/ 5 levels "0-15","15-30",..: 1 2 1 1 1 1 1 1 1 1 ...

##################################

# Set seed of 567
set.seed(567)

# Store row numbers for training set: index_train
index_train <- sample(1:nrow(loan_data), 2/3*nrow(loan_data))

# Create training set: training_set
training_set <- loan_data[index_train, ]

# Create test set: test_set
test_set <- loan_data[-index_train, ]

str(loan_data)

## 'data.frame':    29092 obs. of  10 variables:
##  $ loan_status   : int  0 0 0 0 0 0 1 0 1 0 ...
##  $ loan_amnt     : int  5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
##  $ int_rate      : num  10.6 NA 13.5 NA NA ...
##  $ grade         : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 3 2 2 4 ...
##  $ emp_length    : int  10 25 13 3 9 11 0 3 3 0 ...
##  $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 3 4 4 4 4 ...
##  $ annual_inc    : num  24000 12252 49200 36000 48000 ...
##  $ age           : int  33 31 24 39 24 28 22 22 28 22 ...
##  $ ir_cat        : Factor w/ 5 levels "0-8","11-13.5",..: 4 5 2 5 5 2 2 4 4 3 ...
##  $ emp_cat       : Factor w/ 5 levels "0-15","15-30",..: 1 2 1 1 1 1 1 1 1 1 ...

log_model_full <- glm(loan_status ~ ., family = "binomial", data = training_set)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Decision Tree

Another way of classifying groups (default on loan or no-default on loan) is with decision trees.

You answer the questions from the top: Root, through the bottom: Leaf node, and a decision is made to classify case as default or non-default.

How are “splitting decisions” made?

Quick Summary:
The tree is designed to have a spitting decision for each leaf node. But how are these decisions decided? Why not age greater then 45 instead of 40? Well, these splitting decisions are made by calculating the impurity using the Gini measure. We calculate the impurity for each splitting decision (age>40, age>41, age>42…) and choose the split that has the least amount of impurity.

For each node of the tree, the splitting decision will affect the final structure of the tree.

Example below we’ll go over 2 possible splitting decisions for categorical variables.

Categorical Variable: How do we decide, whether the question being asked should be:
- “Home ownership = RENT”
- OR
- “Home ownership = RENT or OTHER” ?
Continuous variable: How do we define the cut-off value for Age that defines the boundaries between default or not.
- “age > 40”
- OR
- “age > 45” ?

We can answer these questions by measuring the Impurity.

Splitting decisions are made by measuring the impurity of each node (decision) of the tree.

Our goal is to minimize the impurity in each node of the tree.

A popular & default measure is the gini measure.

Gini measure example

Quick Summary:
The gini measure is used to calculate the impurity of the root and of each leaf node within a node. One node can have many cases, right? (Exp: age > 40, age > 45, age > 50, and so forth).
We’re focusing on one node in this example here: is age greater than 40? This specific node has 2 leaf, yes & no.
Once we calculate the gini measure for each leaf node, we use those values to calculate the Purity gain.
The Purity gain value from each node is what we compare with the other nodes (cases) and choose the node with the highest (maximum) Purity Gain (aka minimum amount of Impurity).

Assume you have a training set of data with 500 cases (250 default on loan cases & 250 non-default on loan).
The actual number of non-default on loan is on the left numbers and default on loan on the right numbers.

The perfect scenario would have the leaf node split into 50/50 like in the picture below, but since this is never the case, we can compute the impurity using the gini measure.

Equation: Gini (impurity) = 2 * prop(default) * prop(non-default)

The impurity in a certain node is given by 2 x the proportion of default in the node x the proportion of non-default in the node.

First, calculate the gini measure to the root node:

Gini_Root = 2 * 250/500 * 250/500 = 0.5
This is the maximum possible amount with exactly as many defaults as non-defaults.

Next, calculate the gini measure for the lead nodes.

Gini_N1 = 2 * 170/270 * 100/270 = 0.4664
Gini_N2 = 2 * 80/230 * 150/230 = 0.4536

Finally, calculate the PURITY GAIN that is achieved going from the root node to the 2 node: N1 & N2

To calculate the Purity gain, subtract the weighted gini measure in the N1 node and N2 node from the gini measure in the Root.

Purity Gain = Gini_Root - (prop(cases in N1) * Gini_N1) - (prop(cases in N2) * Gini_N2)

From our example: Purity Gain = 0.5 - (270/500 * 0.4664) - (230/500 * 0.4536) = 0.039488

The algorithm will compute the possibilities and give us the leaf node that has the highest purity gain.

Exercise 1)

Instructions:
- The computation for the Gini of the root node is given.
- Compute the Gini measure for the left leaf node.
- Compute the Gini measure for the right leaf node.
- Compute the gain by taking the difference between the root node Gini and the weighted leaf node Gini measures. - Information regarding the split in this tree can be found using $split and the tree object, small_tree. small_tree is given and is not available here.
Instead of gain, you should look at the improve column here. improve is an alternative metric for gain, simply obtained by multiplying gain by the number of cases in the data set. Make sure that the object improve (code given) has the same value as in small_tree$split.

# The Gini-measure of the root node is given below
gini_root <- 2 * (89 / 500) * (411 / 500)

# Compute the Gini measure for the left leaf node
gini_ll <- 2 * (401/446) * (45/446)

# Compute the Gini measure for the right leaf node
gini_rl <- 2 * (10/54) * (44/54)

# Compute the gain
gain <- gini_root - (446 / 500) * gini_ll - (54 / 500) * gini_rl

# compare the gain-column in small_tree$splits with our computed gain, multiplied by 500, and assure they are the same
# small_tree is only available in counsel.
small_tree$splits
improve <- gain * 500

Building decision trees using rpart() package

Imagine if we had to calculate all splits, that would take us an unimaginable amount of time.

This is where the rpart() package comes in!

Warning:

rpart() is a useful package for building decision trees, but it’s important to warn you upfront that building decision trees in the credit risk context (and claim risk context) can be very challenging.
The main reason for this is that the credit risk data are very unbalanced due to low percentage of defaults on loans.

Using the default setting on the training_set of loan_data, including all variables " ~ ." and specifying the method to “class” as a response variable (loan_status is categorical), we get a warning message that only the root is created.

The reason for the warning is that, (similarly to what we had observed when setting a cut-off to a logistic regression model), the highest accuracy is achieved by simply predicting all cases to be “non-defaults”.

library(rpart)

fit_default <- rpart(loan_status ~ ., method = "class", data = training_set)

#**observe the warning message!
plot(fit_default)

Three techniques to overcome unbalance

There are 3 main things we can do to overcome the unbalanced data (default values are very low in data set).

1) Over-sampling OR Under-sampling
+ You can over-sample the under-represented group (number of defaults on loan) OR under-sample the over-represented group (number of non-defaults on loan).
+ Balancing the training-set will have a positive effect on accuracy.
+ Note: over OR under sampling should ONLY be applied the training_set (NOT the test_set).

2) Changing the prior probabilities in the rpart() function
+ Make the prior probabilities of “default vs non-default” set equal to the proportions in the training_set.
+ By making the prior probabilities for “default” bigger, you can trick R into attaching more importance to “defaults” leading to a better decision tree.

3) Include a Loss Matrix using rpart function
+ As a third option, the rpart() function allows a specification of a loss-matrix.
+ In this Loss Matrix, different costs can be associated with a misclassification of a “default” as a “non-default” versus the misclassification of a “non-default” as a “default”.
+ By increasing the misclassification costs of the former, more attention is given to the correct classification of “default” improving the quality of the decision tree.

Validate model to see what’s best!

All three methods are intended to overcome the problem of class imbalance, but validation is very important.

In some cases, under or over sampling may work very well for some data sets, it may work poorly for others.
The same could be true for the other way around.
What the best method for the data at hand would only become clear upon proper validation.

Exercise) Under-sampling the training set
To overcome the unbalanced data problem, you can use under- or oversampling. The training set has been under-sampled for you, such that 1/3 of the training set consists of defaults, and 2/3 of non-defaults. The resulting data set is available in your workspace and named undersampled_training_set, and contains less observations (6570 instead of 19394). In this exercise, you will create a decision tree using the undersampled data set.

You will notice that the trees in this and the next exercises are very big, so big that you cannot really read them anymore. Don’t worry about this for now, we will tell you how you can make them more manageable in the next video!

Instructions:

The rpart package has been installed for you. Load the package in your workspace.
Change the code provided such that a decision tree is constructed using the undersampled training set instead of training_set. Additionally, add the argument control = rpart.control(cp = 0.001).
- cp, which is the complexity parameter, is the threshold value for a decrease in overall lack of fit for any split. If cp is not met, further splits will no longer be pursued.
- cp’s default value is 0.01, but for complex problems, it is advised to relax cp.
Plot the decision tree using the function plot and the tree object name. Add a second argument uniform = TRUE to get equal-sized branches.
The previous command simply creates a tree with some nodes and edges, but without any text (or so-called “labels”) on it. Use function text() with sole argument tree_undersample to add labels.

Under-sampling example:

Using ovun.sample() function from ROSE package.

Instructions:
- First, use the table() function to identify how many default and non-default observations are in the training_set data frame.
- Next, save the loan default count (2138) in a variable called: n_loan_status.
- Save the percentage amount of loan default observation we would like in the under-sampled dataset.

table(training_set$loan_status)

## 
##     0     1 
## 17256  2138

2138/.33

## [1] 6478.788

n_loan_status <- 2138
new_frac_loan_status <- 0.33
new_n_total <- n_loan_status / new_frac_loan_status # = 2138/.33 = 6478 (total obsv count)

library(ROSE)

## Warning: package 'ROSE' was built under R version 3.6.3

## Loaded ROSE 0.0-3

undersampled_result <- ovun.sample(loan_status ~ .,
                                         data = training_set,
                                         method = "under",
                                         N = new_n_total,
                                         seed = 1234)

undersampled_training_set <- undersampled_result$data

table(undersampled_training_set$loan_status)

## 
##    0    1 
## 4624 1854

# Load package rpart in your workspace.
library(rpart)

# Change the code provided in the video such that a decision tree is constructed using the undersampled training set. Include rpart.control to relax the complexity parameter to 0.001.
tree_undersample <- rpart(loan_status ~ ., method = "class",
                          data =  undersampled_training_set,
                          control = rpart.control(cp = 0.001))

# Plot the decision tree
plot(tree_undersample, uniform = TRUE)

# Add labels to the decision tree
text(tree_undersample, use.n = TRUE)

Exercise) Changing the prior probabilities
You can also change the prior probabilities to obtain a decision tree. This is an indirect way of adjusting the importance of misclassifications for each class. You can specify another argument inside rpart() to include prior probabilities. The argument you are looking for has the following form

parms = list(prior=c(non_default_proportion, default_proportion))

Instructions:

Change the code provided such that a decision tree is constructed, including the argument parms and changing the proportion of non-defaults to 0.7, and of defaults to 0.3 (they should always sum up to 1).
Additionally, include control = rpart.control(cp = 0.001) as well.
Plot the decision tree using the function plot and the tree object name.
Add a second argument “uniform=TRUE” to get equal-sized branches.
Add labels to the tree using function text() and the decision tree object name.

# Load package rpart in your workspace.
library(rpart)

# Change the code below such that a tree is constructed with adjusted prior probabilities.
tree_prior <- rpart(loan_status ~ ., method = "class",
                    data = training_set, 
                    parms = list(prior = c(0.7, 0.3)),
                    control = rpart.control(cp = 0.001))

# Plot the decision tree
plot(tree_prior, uniform = TRUE)

# Add labels to the decision tree
text(tree_prior)

Exercise 3) Including a Loss Matrix

Thirdly, you can include a loss matrix, changing the relative importance of misclassifying a default as non-default versus a non-default as a default.

You want to stress that misclassifying a default as a non-default should be penalized more heavily.

Including a loss matrix can again be done in the argument parms in the loss matrix.

parms = list(loss = matrix(c(0, cost_def_as_nondef, cost_nondef_as_def, 0), ncol=2))

Doing this, you are constructing a 2x2-matrix with zeroes on the diagonal and changed loss penalties off-diagonal. The default loss matrix is all ones off-diagonal.

Instructions:

Change the code provided such a loss matrix is included, with a penalization that is 10 times bigger when misclassifying an actual default as a non-default.
- This can be done replacing cost_def_as_nondef by 10, and cost_nondef_as_def by 1.
- Similar to what you’ve done in the previous exercises, include rpart.control to relax the complexity parameter to 0.001.
Plot the decision tree using the function plot and the tree object name.
Add a second argument uniform = TRUE to get equal-sized branches, and add labels to the tree using text() with the tree object name.

# Change the code below such that a decision tree is constructed using a loss matrix penalizing 10 times more heavily for misclassified defaults.
tree_loss_matrix <- rpart(loan_status ~ ., method = "class",
                          data =  training_set,
                        # parms = list(loss = matrix(c(0, cost_def_as_nondef, cost_nondef_as_def, 0), ncol=2))
                          parms = list(loss = matrix(c(0, 10, 1, 0), ncol=2)),
                          control = rpart.control(cp = 0.001))

# Plot the decision tree
plot(tree_loss_matrix, uniform = TRUE)


# Add labels to the decision tree
text(tree_loss_matrix)

Pruning the Decision Tree

The decision trees in the previous exercises are very large.
It’s generally not advised to construct enormous decision trees.
Not only are they harder to interpret, but over-fitting may occur, leading to inferior results when applying the model to the test set.

The rpart package provides tools that make pruning of the decision fairly easy.
We’ll be using functions plotcp() and printcp() for pruning purpose.

Problem with large decision trees
- Too complex: not clear anymore
- Overfitting when applying to test set

Solution: use plotcp(), printcp() for pruning purposes

Let’s have a look at what these functions do when applying them to the tree we constructed using the “undersampled” training set (this is not available).

We’ll look at printcp() first.

printcp()

Applying this function to our object: tree_undersample, you get an overview of how to treat rows using more splits with the column nsplit [reference pic].

You can change the CP (complexity parameter) [reference pic] which was fixed to .001 before to obtain derived complexity level.
But what level should be chosen?
We want to minimize the cross validation error of the decision tree. These results are given in column xerror [reference pic]. Cross-validation inside the training set is used to obtain this error measure.
Since there is a random element to the cross-validation process, it is necessary to set a seed if you want your results to be reproducible.

printcp(tree_undersample)

## 
## Classification tree:
## rpart(formula = loan_status ~ ., data = undersampled_training_set, 
##     method = "class", control = rpart.control(cp = 0.001))
## 
## Variables actually used in tree construction:
## [1] age            annual_inc     emp_length     grade         
## [5] home_ownership int_rate       loan_amnt     
## 
## Root node error: 1854/6478 = 0.2862
## 
## n= 6478 
## 
##           CP nsplit rel error xerror     xstd
## 1  0.0048544      0   1.00000 1.0000 0.019622
## 2  0.0032362      5   0.97519 1.0027 0.019637
## 3  0.0024272      6   0.97195 1.0119 0.019691
## 4  0.0021575     13   0.95092 1.0113 0.019688
## 5  0.0018878     15   0.94660 1.0124 0.019694
## 6  0.0016181     24   0.92826 1.0119 0.019691
## 7  0.0013484     41   0.89320 1.0167 0.019718
## 8  0.0012585     46   0.88619 1.0259 0.019771
## 9  0.0010787     55   0.87433 1.0512 0.019910
## 10 0.0010017     66   0.86192 1.0653 0.019985
## 11 0.0010000     74   0.85329 1.0690 0.020005

plotcp()

You can also plot a cross-validation error as a function of the complexity parameter CP and size of the tree using the function plotcp().

The minimum cross-validated error can be found at the red error in the picture where the cp = 0.003653. For this exact cp value, you can have a look at printcp() again.

plotcp(tree_undersample)

plot the pruned tree

To prune the tree, the function prune() can be used with the initial unpruned tree: “tree_undersample” as the first argument and a cp value that minimize the cross-validated error as a second argument.
Plotting the pruned tree, you get a smaller decision tree.
Note that this tree does not give information on the true number of default vs non-default that are present in each leaf.
If you want this information, use the argument: use.n = TRUE in the text function.

Note: these values refer to the training set only.

Keep in-mind that a “yes” to the test statement in a node will lead you to follow the left-hand branch, and a “no” to the test statement will lead you follow the right-hand branch.

This is not very intuitive for the rpart() package. Addition the “home_ownership=cd” is not very informative.

ptree_undersample = prune(tree_undersample, cp = 0.0032362)

plot(ptree_undersample, uniform = TRUE)

text(ptree_undersample, use.n = TRUE)

prp() in the rpart.plot-package

prp() is a much more intuitive rpart package than plot().
When plotting with prp() you receive guidance on which branch to take depending on the answer to the split question.
You’ll also get the value name for the factor variables (in this example: home_own = OWN or REN).

library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 3.6.2

prp(ptree_undersample)

If you want to have the number of ‘default vs non-default’ cases, add the argument: extra = 1

library(rpart.plot)
prp(ptree_undersample, extra=1)

Exercise 1) Pruning the tree with changed prior probabilities
Pruning a tree is necessary to avoid over-fitting.
There were some big trees in the previous exercises and now you will put what you have learned into practice, and prune the previously constructed tree with the changed prior probabilities.

You will first set a seed to make sure the results are reproducible, because you will be examining cross-validated error results.
Results involve randomness and could differ slightly upon running the function again with a different seed.

In this exercise you will learn to identify which complexity parameter (CP) will minimize the cross-validated error results, then prune your tree based on this value.

Instructions:

tree_prior is loaded in your workspace.
Use plotcp() to visualize cross-validated error (X-val Relative Error) in relation to the complexity parameter for tree_prior.
Use printcp() to print a table of information about CP, splits, and errors. See if you can identify which split has the minimum cross-validated error in tree_prior.
Use which.min() to identify which row in tree_prior$cptable has the minimum cross-validated error “xerror”. Assign this to index.
Create tree_min by selecting the index of tree_prior$cptable within the column “CP”.
Use the prune() function to obtain the pruned tree. Call the pruned tree ptree_prior.
Package rpart.plot is loaded in your workspace. Plot the pruned tree using function prp() (default setting).

library(rpart.plot)

# tree_prior is loaded in your workspace

# Plot the cross-validated error rate as a function of the complexity parameter
plotcp(tree_prior)

# Use printcp() to identify for which complexity parameter the cross-validated error rate is minimized.
printcp(tree_prior)

## 
## Classification tree:
## rpart(formula = loan_status ~ ., data = training_set, method = "class", 
##     parms = list(prior = c(0.7, 0.3)), control = rpart.control(cp = 0.001))
## 
## Variables actually used in tree construction:
## [1] age            annual_inc     emp_cat        emp_length    
## [5] grade          home_ownership int_rate       loan_amnt     
## 
## Root node error: 5818.2/19394 = 0.3
## 
## n= 19394 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.0052744      0   1.00000 1.00000 0.020400
## 2  0.0047527      3   0.98418 1.00502 0.020255
## 3  0.0035727      4   0.97942 0.99878 0.019929
## 4  0.0029150      8   0.96269 0.99305 0.019578
## 5  0.0022178      9   0.95977 0.99212 0.019421
## 6  0.0021406     15   0.94462 0.99118 0.019412
## 7  0.0018931     18   0.93820 0.98593 0.019340
## 8  0.0016159     19   0.93630 0.98877 0.019298
## 9  0.0015332     20   0.93469 0.99140 0.019252
## 10 0.0013233     23   0.93009 0.99615 0.019248
## 11 0.0012569     24   0.92877 0.99705 0.019131
## 12 0.0012458     25   0.92751 0.99795 0.019119
## 13 0.0012447     27   0.92502 0.99761 0.019115
## 14 0.0011704     30   0.92108 0.99761 0.019115
## 15 0.0011582     33   0.91757 0.99675 0.019102
## 16 0.0011387     37   0.91283 0.99694 0.019062
## 17 0.0010817     48   0.89841 0.99410 0.019014
## 18 0.0010804     49   0.89733 0.99578 0.019020
## 19 0.0010766     58   0.88299 0.99578 0.019020
## 20 0.0010596     65   0.87405 0.99566 0.019011
## 21 0.0010374     67   0.87193 0.99540 0.019048
## 22 0.0010341     69   0.86986 0.99481 0.019035
## 23 0.0010000     71   0.86779 0.99723 0.019045

# Create an index for of the row with the minimum xerror
index <- which.min(tree_prior$cptabl[ , "xerror"])

# Create tree_min
tree_min <- tree_prior$cptable[index, "CP"]

#  Prune the tree using tree_min
ptree_prior <- prune(tree_prior, cp = tree_min)

# Use prp() to plot the pruned tree
prp(ptree_prior)

Exercise 2) Pruning the tree with the loss matrix

In this exercise, you will prune the tree that was built using a loss matrix in order to penalize misclassified defaults more than misclassified non-defaults.

Instructions:

Run the code to set a seed and construct tree_loss_matrix again.
Use function plotcp() to examine the cross-validated error-structure.
Looking at the cp-plot, you will notice that pruning the tree using the minimum cross-validated error will lead to a tree that is as big as the unpruned tree, as the cross-validated error reaches its minimum for cp = 0.001. Because you would like to make the tree somewhat smaller, try pruning the tree using cp = 0.0012788. For this complexity parameter, the cross-validated error approaches the minimum observed error. Call the pruned tree ptree_loss_matrix.
Package rpart.plot is loaded in your workspace. Plot the pruned tree using function prp() (including argument: extra = 1).

# set a seed and run the code to construct the tree with the loss matrix again
set.seed(345)
tree_loss_matrix  <- rpart(loan_status ~ ., method = "class", data = training_set,
                           parms = list(loss=matrix(c(0, 10, 1, 0), ncol = 2)),
                           control = rpart.control(cp = 0.001))

# Plot the cross-validated error rate as a function of the complexity parameter
plotcp(tree_loss_matrix)

# Prune the tree using cp = 0.0012788
ptree_loss_matrix <- prune(tree_loss_matrix, cp = 0.0012788)

# Use prp() and argument extra = 1 to plot the pruned tree
prp(ptree_loss_matrix, extra = 1)

Other tree options & confusion matrices

We’ve covered several options for facilitating a construction of a tree, especially on unbalanced data.
Off course, the range of possibilities is endless and there are other arguments you can pass through the rpart() function to shape your loan default decision tree.

We’ll go over some of the most important ones.

An important and interesting argument in rpart() is weights.

weights: allows you to include case weights for each of the cases in the training set.
Increasing the weights of the cases would have been another plausible strategy to counter the balance in a decision tree.

The other interesting argument mentioned here are in the rpart.control function that can be passed down to the argument “control”. We’ve used this argument before to change the “CP” (complexity parameter).
An interesting argument here is minsplit.
- minsplit: Allows you to change the minimum number of observations that must exist in a node in order for a split to be attempted. The default argument for this equals to 20, but for unbalanced data it might be useful lower this value.

minbucket: the argument minbucket specifies the minimum number of observations in any leaf node.
- the default here is 1/3 of the value specified in minsplit(), but it may be useful to try other values.
- warning: lowering this value (# of observation) can lead to over-fitting.

Making Predictions using the Decision Tree

The last thing we need to do is to evaluate our Decision Tree.

Until now, we only used the training set, but now we’ll use our test set to evaluate the result.
This can be done using the predict() function.

The code to make predictions using the model with the undersampled data set is shown below:

note: we use the pruned version of the tree

We include an additional argument: type = class, we get a vector of predictions right away without using a cut-off value anymore (like in Logistic Regression).

However, in some cases we have a non-binary prediction and we want to choose a cut-off value ourselves. For this case, leave out the type = class argument and you’ll get probabilities instead of binary predictions.

These probabilities are derived from the proportion of the leaf nodes where a particular case ends up given its co-variance values.

pred_undersample_class = predict(ptree_undersample, newdata = test_set, type = "class")
str(pred_undersample_class)

##  Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
##  - attr(*, "names")= chr [1:9698] "2" "6" "7" "10" ...

pred_undersample = predict(ptree_undersample, newdata = test_set)
str(pred_undersample)

##  num [1:9698, 1:2] 0.589 0.665 0.37 0.589 0.665 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:9698] "2" "6" "7" "10" ...
##   ..$ : chr [1:2] "0" "1"

Constructing a confusion matrix

Here a confusion matrix is given for the decision tree that was constructed using undersampled data and binary predictions.

Even though we used undersampling, very few defaults are predicted for our test set. It could be useful here to use probability predictions instead of binary predictions to lower the curt-off.

table(test_set$loan_status, pred_undersample_class)

##    pred_undersample_class
##        0    1
##   0 8449  160
##   1 1046   43

Exercise 1) One final tree using more options

In this exercise, you will use some final arguments that were discussed above. Some specifications in the rpart.control() function will be changed, and some weights will be included using the weights argument in rpart().
The vector case_weights has been constructed for you and is loaded in your workspace. This vector contains weights of 1 for the non-defaults in the training set, and weights of 3 for defaults in the training sets. By specifying higher weights for default, the model will assign higher importance to classifying defaults correctly.

Instructions:

Set a seed of 345.
Add to the provided code by passing case_weights to the weights argument of rpart().
Change the minimum number of splits that are allowed in a node to 5, and the minimum number of observations allowed in leaf nodes to 2 by using the arguments minsplit and minbucket in rpart.control respectively.
Use function plotcp() to investigate where the cross-validated error rate can be minimized.
Use which.min() to identify the row with the minimum “xerror” in tree_weights$cp. Assign this to index.
Use the provided code to select the cp for which the cross-validated error is minimized
Prune the tree using the complexity parameter where the cross-validated error rate is minimized. Store the pruned tree in ptree_weights.
Plot the pruned tree using function prp(). Include a second argument extra and set it equal to 1.

#Creating the case weight:
#code found on Stack overflow

#weights: optional case weights
#cost: a vector of non-negative costs, one for each variable in the model. Defaults to one for all variables. These are scalings to be applied when considering splits, so the improvement on splitting on a variable is divided by its cost in deciding which split to choose.

#The weights is for rows (e.g. give higher weight to smaller class), the cost is for columns.

#subset() function isolates the data.frame: training_set to a group of that meets condition.

positiveWeight = 1.0 / (nrow(subset(training_set, loan_status == 1)) / nrow(training_set))
negativeWeight = 1.0 / (nrow(subset(training_set, loan_status != 1)) / nrow(training_set))

modelWeights <- ifelse(training_set$loan_status== 1, positiveWeight, negativeWeight)

library(rpart.plot)

# set a seed and run the code to obtain a tree using weights, minsplit and minbucket
set.seed(345)
tree_weights <- rpart(loan_status ~ ., method = "class",
                      data = training_set,
                      #case_weights is a vector that's not available here.
                      modelWeights,
                      control = rpart.control(minsplit = 5, minbucket = 2, cp = 0.001))

# Plot the cross-validated error rate for a changing cp
plotcp(tree_weights)

# Create an index for of the row with the minimum xerror
index <- which.min(tree_weights$cp[ , "xerror"])

# Create tree_min
tree_min <- tree_weights$cp[index, "CP"]

# Prune the tree using tree_min
ptree_weights <- prune(tree_weights, cp = tree_min)

# Plot the pruned tree using the rpart.plot()-package
prp(ptree_weights, extra = 1)

Exercise 2) Confusion matrices and accuracy of our final trees Over the past few exercises, you have constructed quite a few pruned decision trees, with four in total. As you can see, the eventual number of splits varies quite a bit from one tree to another:

ptree_undersample # 7 splits
ptree_prior # 9 splits
ptree_loss_matrix # 24 splits
ptree_weights # 6 splits

Now it is important to know which tree performs best in terms of accuracy.
In order to get the accuracy, you will start off by making predictions using the test set, and construct the confusion matrix for each of these trees. You will add the argument type = “class” when doing these predictions. By doing this there is no need to set a cut-off.

Nevertheless, it is important to be aware of the fact that not only the accuracy is important, but also the sensitivity and specificity. Additionally, predicting probabilities instead of binary values (0 or 1) has the advantage that the cut-off can be moved along. Then again, the difficulty here is the choice of the cut-off. You will return to this in the next chapter.

In case you needed a reminder, here is how to compute the accuracy: $ Classification Accuracy = (TP + TN) / (TP + FP + TN + FN) $

Instructions:

Use predict() to make predictions for all four trees. The test_set should be included in the argument newdata. Don’t forget to include type = “class”!
Construct confusion matrices for each of these decision trees. Use the function table(), and include the “true” status (using test_set$loan_status) first, followed by the prediction.
Compute the accuracy using each of the confusion matrices.

# Make predictions for each of the pruned trees using the test set.
pred_undersample <- predict(ptree_undersample, newdata = test_set,  type = "class")
pred_prior <- predict(ptree_prior, newdata = test_set,  type = "class")
pred_loss_matrix <- predict(ptree_loss_matrix, newdata = test_set,  type = "class")
pred_weights <- predict(ptree_weights, newdata = test_set,  type = "class")

# construct confusion matrices using the predictions.
confmat_undersample <- table(test_set$loan_status, pred_undersample)
confmat_prior <- table(test_set$loan_status, pred_prior)
confmat_loss_matrix <- table(test_set$loan_status, pred_loss_matrix)
confmat_weights <- table(test_set$loan_status, pred_weights)

# Compute the accuracies
acc_undersample <- sum(diag(confmat_undersample)) / nrow(test_set)
acc_prior <- sum(diag(confmat_prior)) / nrow(test_set)
acc_loss_matrix <- sum(diag(confmat_loss_matrix)) / nrow(test_set)
acc_weights <-sum(diag(confmat_weights)) / nrow(test_set)

print(acc_undersample)
print(acc_prior)
print(acc_loss_matrix)
print(acc_weights)

Section 4

Finding the right cut-off: the strategy curve

In the previous chapters, we gave you an overview of how Logistic Regression models and Decision Trees can be used to predict probability of default.

Constructing a confusion matrix

Using the predict() function on the logistic regression model with type = “response”, we get a response, we get a vector of predicted probabilities of defaults.

Using the predict() function on the decision tree model, we will get a matrix with two columns.

The numbers on the left column (under 0) = predicted provability of non-default
The numbers on the right column (under 1) = predicted provability of default

Select the 2nd column to make them comparable to the predictions from the logistic regression model.

Cut-off?

Until now, we’ve set cut-offs based on best guesses or a gut feeling in order to make a accuracy maxtrixes for each of the models used.
The choice of a cut-off is important, but this changes the validation matrix.

Usually using these models on future applicants that come in, the cut-off could be a way of deciding which loan applicants would get a loan and which applicants will not.
It is clear that this model will never be perfect, and no matter how many applicants a bank refuses, there will still be debtors that default on a loan.

log_reg_model <-  glm(loan_status ~ age + emp_cat + ir_cat + loan_amnt + home_ownership + annual_inc,
                       family = binomial, data = training_set)
  
pred_log_regression_model <- predict(log_reg_model, newdata = test_set, type = "response")

The good thing is that model can help a bank decide how many loans they should approve if they don’t want exceed a certain percentage of defaults in their portfolio of customers.

We’ll start with a simple example using the full logistic regression model from before.

Let’s assume the test set contains new applicants and the bank decide to reject 20% of the new applicants based on their fitted probability of default.
This means that the 20% with the highest predicted probability of default will be rejected by the bank.

A certain strategy…

To obtain the cut-off value that would lead to predicted “1” (default) for 20% cases in a test set, you need to look at the 80% quantile of the predictions vector.
Having used this cut-off, you know which test set loan applicants would have been accepted using the 80% acceptance rate.

log_model_full <- glm(loan_status ~ ., family = "binomial", data = training_set)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

predictions_all_full <- predict(log_reg_model, newdata =  test_set, type = "response")

cutoff <- quantile(predictions_all_full, 0.8)

pred_full_20 <- ifelse(predictions_all_full > cutoff, 1, 0)

We want to have a look now at the true status of the loans that would have been accepted using this cut-off, and see what percentage of the accepted loans actually defaulted which is also referred to as bad rate.
> Having accepted 80% of the loan, we see that the bad rate is 0.897 %.

true_and_predval <- cbind(test_set$loan_status, pred_full_20)

str(true_and_predval)

##  num [1:9698, 1:2] 0 0 1 0 0 0 0 1 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:9698] "2" "6" "7" "10" ...
##   ..$ : chr [1:2] "" "pred_full_20"

accepted_loans <- true_and_predval[pred_full_20 == 0, 1]

bad_rate <- sum(accepted_loans)/length(accepted_loans)
bad_rate

## [1] 0.09474091

If you repeat this for several acceptance rate, you can construct an entire table with acceptance rates, corresponding cut-off values, and the percentage of bad_rate.
This is a useful tool for a bank because banks can adapt their own acceptable rate depending on the percentage of bad loans they can allow in their loan portfolio.

Suppose a bank can only allow an 8% of bad rate. Using this model (reference picture), a bank will decide to accept 65% of the loans.

The strategy curve provides us with a visual tool as well.
It shows the bad rates as a function of the acceptance rate.

library(rpart)

printcp(tree_undersample)

summary(tree_undersample)

ptree_undersample = prune(tree_undersample, cp = 0.003653)

Exercise 1) Computing a bad rate given a fixed acceptance rate

We learned how to compute the bad rate (or, the percentage of defaults) in the loan portfolio of a bank when given:

a specific model
the acceptance rate

In this exercise, you will compute the bad rate that a bank can expect when using the pruned tree “ptree_prior” that you fitted before, and an acceptance rate of 80%. As a reminder, the tree is plotted on your right hand side.

Instructions:

In the script, you are provided the code to make predictions for the probability of default using the pruned tree and test_set. Remember that if you use the predict() function for a tree, the probability of default can be found in the second column. Therefore [,2] was pasted to the predict() function.
Obtain the cut-off that leads to an acceptance rate of 80%, using prob_default_prior. You can use the quantile() function to do this, setting the second argument to 0.8. Assign the name cutoff_prior.
The code to obtain the actual binary default predictions (0 or 1) is provided. ifelse() here. Name the object bin_pred_prior_80.
The code to select the default indicators of test_set for the accepted loans cording to a 80% acceptance rate is provided.
Compute the percentage of defaults (or the “bad rate”) for the accepted loans. This is the number of occurrences of 1 in accepted_status_prior_80, divided by the total number of instances in this vector. Print the solution to your R-console.

#Goal: Calculate the percentage of default (bad rate)
#step1) Create the prediction using pruned decision tree.
#step2) Create the cut-off for the predicited values.
#step3) Change the predicited values to binary values (1 = yes, 0 = no) using the cut-off.
#step4) Isolate the observations that are "default"
#step5) Calculate sum(default)/total # of instance

# Make predictions for the probability of default using the pruned tree and the test set.
prob_default_prior <- predict(ptree_prior, newdata = test_set)[ ,2]

# Obtain the cutoff for acceptance rate 80%
cutoff_prior <- quantile(prob_default_prior, 0.8)
  
# Obtain the binary predictions.
bin_pred_prior_80 <- ifelse(prob_default_prior > cutoff_prior, 1, 0)

# Obtain the actual default status for the accepted loans
accepted_status_prior_80 <- test_set$loan_status[bin_pred_prior_80 == 0]

# Obtain the bad rate for the accepted loans
sum(accepted_status_prior_80) / length(accepted_status_prior_80)

## [1] 0.09694009

Exercise 2) The strategy table and strategy curve Repeating the calculations you did in the previous exercise for several acceptance rates, you can obtain a strategy table. This table can be a useful tool for banks, as they can give them a better insight to define an acceptance strategy.

You know how to compute a bad rate for a certain acceptance rate by now, so the function strategy_bank was written and loaded into your workspace to speed things up (not available here). This function computes the cut-off and bad rate for the acceptance rates that are multiples of 5% (0%, 5%, 10%, …).

Instructions:

Have a look at the function strategy_bank (not available here).
The vector predictions_cloglog contains predicted probabilities of default using the cloglog model you used in chapter 2, the vector predictions_loss_matrix contains the predicted probabilities of default using the pruned tree including a loss matrix (previously constructed in chapter 3). Apply function strategy_bank to each of the prediction-vectors, assign the name strategy_cloglog and strategy_loss_matrix respectively.
The strategy tables can be obtained using the object names in combination with $table.
The strategy curves have been plotted for you. The strategy curve of the tree model shows pretty strange behavior. Because of the structure of classification trees, you might have a bigger chance for weird “jumps” here. Additionally, the tree with a loss matrix was a very big one, so this might be the result of overfitting!

# Have a look at the function strategy_bank
strategy_bank

# Apply the function strategy_bank to both predictions_cloglog and predictions_loss_matrix
strategy_cloglog <- strategy_bank(predictions_cloglog)
strategy_loss_matrix <- strategy_bank(predictions_loss_matrix)

# Obtain the strategy tables for both prediction-vectors
strategy_cloglog$table
strategy_loss_matrix $table

# Plot the strategy functions
par(mfrow = c(1,2))
plot(strategy_cloglog$accept_rate, strategy_cloglog$bad_rate, 
     type = "l", xlab = "Acceptance rate", ylab = "Bad rate", 
     lwd = 2, main = "logistic regression")

plot(strategy_loss_matrix$accept_rate, strategy_loss_matrix$bad_rate, 
     type = "l", xlab = "Acceptance rate", 
     ylab = "Bad rate", lwd = 2, main = "tree")

Exercise 2) continued question

Imagine you have 2 banks: Bank A, and Bank B. These banks are restricted to using a certain acceptance rate.

Bank A has to use an acceptance rate of 45%
Bank B has to use an acceptance rate of 85%

Based on the strategy tables, which model will Bank A and Bank B prefer to use?

Answer: Bank A will prefer to use the classification tree model, but for Bank B, the logistic regression model leads to lower bad rates.

The ROC-Curve (Receiver Operating Characteristic)

In the previous exercise, we saw how banks prefers one model over another depending on the minimum bad rate they are allowed to have OR turning it around depending on their acceptance rate.

strategy table/curve: still make assumptions

Often, banks want to know what is the best model overall without having to make any assumption regarding acceptance rate or cut-off.

To illustrate this, lets go back to the Decision Matrix Metric.

Previously we used classification accuracy that includes all the test set elements in the denominator, and as a result it’s tempting to use.
But Classification accuracy is problematic in the credit-risk (and claim risk) context it is not the best to use because generally accuracy is maximized when a very high cut-off is chosen.

Accuracy is maximized when a very high cut-off is chosen, not the best measurement.

In other words, when all test-set elements are classified as non-defaults.

This is why most credit risk models are based on the measurement of sensitivity and Specificity.

Using measurement: Sensitivity & Specificity

Definition example: In medical diagnosis, test sensitivity is the ability of a test to correctly identify those with the disease (true positive rate), whereas test specificity is the ability of the test to correctly identify those without the disease (true negative rate).

One of the most popular method of evaluating credit risk model is based on sensitivity and specificity.

This evaluation method is called the ROC (Receiver Operating Characteristic) curve.

The ROC-curve is constructed by plotting Sensitivity (y-axis) against $1 - Specificity$ (x-axis) for each possible cut-off (cut-off = 1).

The curve always starts in the lower left corner where:

Sensitivity = 0
Specificity = 1

Which makes the cut-off equal to 1 (makes all cases “Non-defaults”).

The other end, upper-right on graph, is where the cut-off = 0:

Sensitivity = 1
Specificity = 0

Generally speaking, the closer the ROC curve is to the top left of the corner, THE BETTER.

This curve has higher sensitivity associated with higher specificity.

On the other extreme, the 45 degree red line would be the result of assigning cases to the default versus non-default group randomly rather than through a model.

In reality, when comparing models you’ll often see that the differences between several ROC curves are not as clear as shown in the previous slides.

AUC measure (Area Under the Curve)

Additionally ROC-curve may have cross-over which makes it hard to tell which model is better.

This problem is illustrated where with an ROC-Curve with 2 models.

Black ROC-curve for Model A
Blue ROC-curve for Model B

A measure that banks like to use when comparing ROC curves is called AUC - Area Under the Curve.

The AUC of a model is between 0.5 (diagonal red line in prev pic) and 1 (corresponds to a perfect model).

Computing the AUC here, we can see that Model B is our better choice.

Exercise 1) ROC-curves for comparison of logistic regression models

ROC-curves can easily be created using the pROC-package in R.
Let’s have a look if there is a big difference between ROC-curves for the four logistic regression-models previously used throughout this course.

A small heads up:

predictions_logit contains probability of default (PD) predictions using the default logit link and containing variables age, emp_cat, ir_cat and loan_amnt.
predictions_probit contains PD-predictions using the probit and containing variables age, emp_cat, ir_cat and loan_amnt.
predictions_cloglog contains PD-predictions using the cloglog link and containing variables age, emp_cat, ir_cat and loan_amnt.
predictions_all_full contains PD-predictions using the default logit link and containing all seven variables in the data set.

You will first draw the ROC-curves for these four models in one plot. Afterwards, you will look at the area under the curve.

Instructions:
- Load the pROC-package in your R-console.
- Construct the ROC-objects for the four logistic regression models using function roc(response, predictor).
+ Remember that the response is the loan status indicator in the test_set, which can be obtained through test_set$loan_status.
- Use the previously created objects to construct ROC-curves. To draw them all on one plot, use plot() for the first ROC-curve drawn (for ROC_logit), and use lines() for the other three models to the same plot.
- Use the col-argument to change the color of the curve of ROC_probit to “blue”, ROC_cloglog to “red” and ROC_all_full to “green”.
+ Note: the x-axis label is Specificity and not “1-Specificity”, resulting in an axis that goes from 1 on the left-hand side to 0 on the right-hand side.
- It seems that the link function does not have a big impact on the ROC here, and the main trigger of a better ROC is the inclusion of more variables in a model. To get an exact idea of the performance of the ROC-curves, have a look at the AUC’s, using function auc().

# Fit the logit, probit and cloglog-link logistic regression models
log_model_logit <- glm(loan_status ~ age  + ir_cat + loan_amnt,
                       family = binomial(link = logit), data = training_set)
log_model_probit <- glm(loan_status ~ age + emp_cat + ir_cat + loan_amnt,
                       family = binomial(link = probit), data = training_set)

log_model_cloglog <-  glm(loan_status ~ age + emp_cat + ir_cat + loan_amnt,
                       family = binomial(link = cloglog), data = training_set)

log_model_all_full <-  glm(loan_status ~ age + emp_cat + ir_cat + loan_amnt + home_ownership + annual_inc,
                       family = binomial, data = training_set)
  
# Make predictions for all models using the test set
predictions_logit <- predict(log_model_logit, newdata = test_set, type = "response")
predictions_probit <- predict(log_model_probit, newdata = test_set, type = "response")
predictions_cloglog <- predict(log_model_cloglog, newdata = test_set, type = "response")
predictions_all_full <- predict(log_model_all_full, newdata = test_set, type = "response")
  
# Use a cut-off of 14% to make binary predictions-vectors
cutoff <- 0.14
class_pred_logit <- ifelse(predictions_logit > cutoff, 1, 0)
class_pred_probit <- ifelse(predictions_probit > cutoff, 1, 0)
class_pred_cloglog <- ifelse(predictions_cloglog > cutoff, 1, 0)
class_pred_all_full<- ifelse(predictions_all_full > cutoff, 1, 0)

# Load the pROC-package
library(pROC)

## Warning: package 'pROC' was built under R version 3.6.3

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

# Construct the objects containing ROC-information
ROC_logit <- roc(test_set$loan_status, predictions_logit)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

ROC_probit <- roc(test_set$loan_status, predictions_probit)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

ROC_cloglog <- roc(test_set$loan_status, predictions_cloglog)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

ROC_all_full <- roc(test_set$loan_status, predictions_all_full)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

# Draw all ROCs on one plot
plot(ROC_logit)
lines(ROC_probit, col="blue")
lines(ROC_cloglog, col="red")
lines(ROC_all_full, col="green")

# Compute the AUCs
auc(ROC_logit)

## Area under the curve: 0.6193

auc(ROC_probit)

## Area under the curve: 0.6297

auc(ROC_cloglog)

## Area under the curve: 0.6295

auc(ROC_all_full)

## Area under the curve: 0.6435

Input selection based on the AUC

Let’s have a look again for the ROC Curve of the Logistic Regression model.

The ROC Curve clearly improves when increasing the variable amount by 4 to 7 in the model.

In other words, the ROC Curve, or more specifically the AUC can be used for variable selection.

ROC Curve (AUC measure) can be used for variable selection.

In fact, banks (and insurance) are particularly interested in knowing which variables are important for predicting loan default.

When evaluating models based on classification performance, it might be worth looking at the AUC performance of this model for variable selection rather than p-value.

This former method is referred to as AUC-based pruning.

AUC-based pruning

Step 1) Start with a model including all variables (in our case, 7) and compute AUC

log_model_full <-  glm(loan_status ~ grade + age + emp_cat + ir_cat + loan_amnt + home_ownership + annual_inc,
                       family = binomial, data = training_set)

predictions_model_full <- predict(log_model_full, newdata = test_set, type = "response")

AUC_model_full <- auc(test_set$loan_status, predictions_model_full)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

AUC_model_full

## Area under the curve: 0.6518

Step 2) Build 7 new models, where each time one of the variables is removed, and make PD-predictions using the test set.

Step 3) Keep the model that led to the best AUC measure

In this example, we get the highest AUC by removing home_ownership.

Exercise) Another round of pruning based on AUC

In the video you saw how the “full” logistic regression model with a logit link was being pruned based on the AUC. You saw how the variable home_ownership was deleted from the model, as it improved the overall AUC. After repeating this process for two additional rounds, the variables age and ir_cat were deleted, leading to the model:

log_3_remove_ir <- glm(loan_status ~ loan_amnt + grade + annual_inc + emp_cat, family = binomial, data = training_set)
with an AUC of 0.6545.
Now, it’s your turn to see whether the AUC can still be improved by deleting another variable from the model.

Instructions:

Delete one variable at a time in the model log_3_remove_ir, Remember that you should be using the default link function (logit).
Make probability of default-predictions for each of the models you created.
Use function auc() using test_set$loan_status as a first argument and the predictions for each of the four models as a second argument to obtain the AUC’s for each model.
Copy the name of the object (as given in the first question of this exercise) that represents the model with the best AUC.

# Build four models each time deleting one variable in log_3_remove_ir
log_4_remove_amnt <- glm(loan_status ~ grade + annual_inc + emp_cat, 
                        family = binomial, data = training_set)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

log_4_remove_grade <- glm(loan_status ~ loan_amnt + annual_inc + emp_cat, family = binomial, data = training_set)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

log_4_remove_inc <- glm(loan_status ~ loan_amnt + grade + emp_cat, family = binomial, data = training_set)

log_4_remove_emp <- glm(loan_status ~ loan_amnt + grade + annual_inc, family = binomial, data = training_set)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# Make PD-predictions for each of the models
pred_4_remove_amnt <- predict(log_4_remove_amnt, newdata = test_set, type = "response")

pred_4_remove_grade <- predict(log_4_remove_grade, newdata = test_set, type = "response")

pred_4_remove_inc <- predict(log_4_remove_inc, newdata = test_set, type = "response")

pred_4_remove_emp <- predict(log_4_remove_emp, newdata = test_set, type = "response")

# Compute the AUCs
auc(test_set$loan_status, pred_4_remove_amnt)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## Area under the curve: 0.6526

auc(test_set$loan_status, pred_4_remove_grade)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Area under the curve: 0.5901

auc(test_set$loan_status, pred_4_remove_inc)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Area under the curve: 0.6381

auc(test_set$loan_status, pred_4_remove_emp)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Area under the curve: 0.6443

Exercise) Further model reduction?

Deleting the variable loan_amnt, the AUC can be further improved to 0.6548! The resulting model is

log_4_remove_amnt <- glm(loan_status ~ grade + annual_inc + emp_cat, family = binomial, data = training_set) Is it possible to reduce the logistic regression model to only two variable without reducing the AUC? In this exercise you will see if it is possible!

Instructions:

Again, delete one variable at a time in the model log_4_remove_amnt, Remember that you should be using the default link function (logit).
Use predict() to make probability of default predictions for each model you created.
Obtain the AUC’s for each of the three models, using test_set$loan_status as a first argument and the predictions for each of the three models as a second argument.
Plot the ROC-curve for the model with the highest AUC in your workspace, using plot(roc()) where the content of roc() is the same as the content for the function auc() with the highest AUC. Note that it is possible that the AUC cannot be reduced anymore with respect to model log_4_remove_amnt. The predictions for this model are loaded in your workspace as pred_4_remove_amnt, in the case that this model leads to the highest AUC.

# Build three models each time deleting one variable in log_4_remove_amnt
log_5_remove_grade <- glm(loan_status ~ annual_inc + emp_cat, family = binomial, data = training_set)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

log_5_remove_inc <- glm(loan_status ~ grade  + emp_cat , family = binomial, data = training_set)
log_5_remove_emp <- glm(loan_status ~ grade + annual_inc, family = binomial, data = training_set)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# Make PD-predictions for each of the models
pred_5_remove_grade <- predict(log_5_remove_grade, newdata = test_set, type = "response")
pred_5_remove_inc <- predict(log_5_remove_inc, newdata = test_set, type = "response")
pred_5_remove_emp <- predict(log_5_remove_emp, newdata = test_set, type = "response")

# Compute the AUCs
auc(test_set$loan_status, pred_5_remove_grade)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## Area under the curve: 0.5853

auc(test_set$loan_status, pred_5_remove_inc)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Area under the curve: 0.6338

auc(test_set$loan_status, pred_5_remove_emp)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Area under the curve: 0.6457

# Plot the ROC-curve for the best model here
plot(roc(test_set$loan_status,pred_4_remove_amnt))

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Course Wrap up

Other methods

Other methods of classifying default vs non-default.

Discriminant Analysis
- A discriminant function is used to assign data one group or another (default or non-default) in a credits modeling case.
Random Forest (very popular and effective method)
- random forest are based on decision trees using an ensemble of trees to classify cases.
Neural networks
Support vector machines
- Neural networks and support vector machine techniques are extremely flexible & effective,but the disadvantage is that these methods are complex and black box which makes it hard to understand how certain variables effect the outcome.

All models referred here focuses on classification.

An important short coming of these models is that timing of default is completely neglected.

No distinction is made of when the default was made in the early stages of the repayment and one that defaults after many months of repayments.
A method that has become increasing popular (in credit risk) that overcomes the problem above is the survival analysis.

Survival Analysis

Using Survival Analysis, one can obtain probability of loan default that change over time.
Another advantage of Survival Analysis is that variables that change over time can be included (time-varying covariates).
- This enables us to include factors that reflect the state of the economy, such as house prices and employment level.
- As a number of loan default typical goes up in hard economic times, including these factors can significantly improve the predictive power.

Part 3, Decision Tree

Hikaru

12/31/2019

Loading data for course:

Decision Tree

How are “splitting decisions” made?

Gini measure example

Exercise 1)

Building decision trees using rpart() package

Three techniques to overcome unbalance

Under-sampling example:

Pruning the Decision Tree

printcp()

plotcp()

plot the pruned tree

prp() in the rpart.plot-package

Other tree options & confusion matrices

Making Predictions using the Decision Tree

Constructing a confusion matrix

Finding the right cut-off: the strategy curve

Constructing a confusion matrix

Cut-off?

A certain strategy…

The ROC-Curve (Receiver Operating Characteristic)

Using measurement: Sensitivity & Specificity

AUC measure (Area Under the Curve)

Exercise 1) ROC-curves for comparison of logistic regression models

Input selection based on the AUC

AUC-based pruning

Exercise) Another round of pruning based on AUC

Exercise) Further model reduction?

Course Wrap up

Other methods