1 Introduction and data preprocessing

1.1 Introduction and data structure

1.1.1 Exploring the credit data

We will be examining the dataset loan_data:

loan_data <- readRDS("data/loan_data_ch1.rds")

After being given loan_data, you are particularly interested about the defaulted loans in the data set. You want to get an idea of the number, and percentage of defaults. Defaults are rare, so you always want to check what the proportion of defaults is in a loan dataset. The CrossTable() function is very useful here.

Remember that default information is stored in the response variable loan_status, where 1 represents a default, and 0 represents non-default.

To learn more about variable structures and spot unexpected tendencies in the data, you should examine the relationship between loan_status and certain factor variables. For example, you would expect that the proportion of defaults in the group of customers with grade G (worst credit rating score) is substantially higher than the proportion of defaults in the grade A group (best credit rating score).

Conveniently, CrossTable() can also be applied on two categorical variables.

# View the structure of loan_data
str(loan_data)
'data.frame':   29092 obs. of  8 variables:
 $ loan_status   : int  0 0 0 0 0 0 1 0 1 0 ...
 $ loan_amnt     : int  5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
 $ int_rate      : num  10.6 NA 13.5 NA NA ...
 $ grade         : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 3 2 2 4 ...
 $ emp_length    : int  10 25 13 3 9 11 0 3 3 0 ...
 $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 3 4 4 4 4 ...
 $ annual_inc    : num  24000 12252 49200 36000 48000 ...
 $ age           : int  33 31 24 39 24 28 22 22 28 22 ...
# Load the gmodels package 
library(gmodels)

# Call CrossTable() on loan_status
CrossTable(loan_data$loan_status)

 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  29092 

 
          |         0 |         1 | 
          |-----------|-----------|
          |     25865 |      3227 | 
          |     0.889 |     0.111 | 
          |-----------|-----------|



 

Call CrossTable() with x argument loan_data$grade and y argument loan_data$loan_status. We only want row-wise proportions, so set prop.r to TRUE, but prop.c , prop.t and prop.chisq to FALSE (default values here are TRUE, and this would lead to inclusion of column proportions, table proportions and chi-square contributions for each cell. We do not need these here.)

# Call CrossTable() on grade and loan_status
CrossTable(loan_data$grade, loan_data$loan_status, prop.r = TRUE, prop.c = FALSE, prop.t  = FALSE, prop.chisq = FALSE)

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|-------------------------|

 
Total Observations in Table:  29092 

 
                | loan_data$loan_status 
loan_data$grade |         0 |         1 | Row Total | 
----------------|-----------|-----------|-----------|
              A |      9084 |       565 |      9649 | 
                |     0.941 |     0.059 |     0.332 | 
----------------|-----------|-----------|-----------|
              B |      8344 |       985 |      9329 | 
                |     0.894 |     0.106 |     0.321 | 
----------------|-----------|-----------|-----------|
              C |      4904 |       844 |      5748 | 
                |     0.853 |     0.147 |     0.198 | 
----------------|-----------|-----------|-----------|
              D |      2651 |       580 |      3231 | 
                |     0.820 |     0.180 |     0.111 | 
----------------|-----------|-----------|-----------|
              E |       692 |       176 |       868 | 
                |     0.797 |     0.203 |     0.030 | 
----------------|-----------|-----------|-----------|
              F |       155 |        56 |       211 | 
                |     0.735 |     0.265 |     0.007 | 
----------------|-----------|-----------|-----------|
              G |        35 |        21 |        56 | 
                |     0.625 |     0.375 |     0.002 | 
----------------|-----------|-----------|-----------|
   Column Total |     25865 |      3227 |     29092 | 
----------------|-----------|-----------|-----------|

 

The proportion of defaults increases when the credit rating moves from A to G (worse credit scores).

1.2 Histograms and outliers

1.2.1 Histograms

You previously explored categorical variables using the CrossTable() function. Now you would like to explore continuous variables to identify potential outliers or unexpected data structures.

To do this, let’s experiment with the function hist() to understand the distribution of the number of loans for different customers.

# Create histogram of loan_amnt: hist_1
hist_1 <- hist(loan_data$loan_amnt)

Use $breaks along with the object hist_1 to get more information on the histogram breaks. Knowing the location of the breaks is important because if they are poorly chosen, the histogram may be misleading.

# Print locations of the breaks in hist_1
hist_1$breaks
 [1]     0  2000  4000  6000  8000 10000 12000 14000 16000 18000 20000 22000 24000 26000 28000 30000
[17] 32000 34000 36000

Change the number of breaks in hist_1 to 200 by specifying the breaks argument. Additionally, name the x-axis “Loan amount” using the xlab argument and title it “Histogram of the loan amount” using the main argument. Save the result to hist_2.

# Change number of breaks and add labels: hist_2
hist_2 <- hist(loan_data$loan_amnt, breaks = 200, xlab = "Loan amount", 
               main = "Histogram of the loan amount")

Why do the peaks occur where they occur?

Note that there are some high peaks at round values: 5000, 10000, 15000,… People tend to borrow round numbers.

1.2.2 Outliers

Now it’s time to look at the structure of the variable age.

hist(loan_data$age)

There is a lot of blank space on the right-hand side of the plot. This is an indication of possible outliers. You will look at a scatterplot to verify this. If you find any outliers you will delete them.

If outliers are observed for several variables, it might be useful to look at bivariate plots. It’s possible the outliers belong to the same observation. If so, there is even more reason to delete the observation because it is more likely that some information stored in it is wrong.

# Plot the age variable
plot(loan_data$age, ylab = "Age")

The oldest person in this data set is older than 122 years! Get the index of this outlier using which() and the age of 122 as a cutoff (you can do this using loan_data$age > 122). Assign it to the object index_highage.

# Save the outlier's index to index_highage
index_highage <- which(loan_data$age > 122)

Create a new data set new_data, after removing the observation with the high age using the object index_highage.

# Create data set new_data with outlier deleted
new_data <- loan_data[-index_highage, ]

Have a look at the bivariate scatterplot, with age on the x-axis and annual income on the y-axis.

# Make bivariate scatterplot of age and annual income
plot(loan_data$age, loan_data$annual_inc, xlab = "Age", ylab = "Annual Income")

Did you notice that you have a bivariate outlier? The person with the huge annual wage of $6 million appeared to be 144 years old. This must be a mistake, so you will definitely delete this observation from the data in the following exercises.

1.3 Missing data and coarse classification

1.3.1 Deleting missing data

You saw before that the interest rate (int_rate) in the data set loan_data depends on the customer. Unfortunately some observations are missing interest rates. You now need to identify how many interest rates are missing and then delete them.

# Look at summary of loan_data
summary(loan_data$int_rate)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   5.42    7.90   10.99   11.00   13.47   23.22    2776 
# Get indices of missing interest rates: na_index
na_index <- which(is.na(loan_data$int_rate))

# Remove observations with missing interest rates: loan_data_delrow_na
loan_data_delrow_na <- loan_data[-na_index, ]

# Make copy of loan_data
loan_data_delcol_na <- loan_data

# Delete interest rate column from loan_data_delcol_na
loan_data_delcol_na$int_rate <- NULL

1.3.2 Replacing missing data

Rather than deleting the missing interest rates, you may want to replace them instead.

Create an object called median_ir, containing the median of the interest rates in loan_data using the median() function. Don’t forget to include the argument na.rm = TRUE.

# Compute the median of int_rate
median_ir <- median(loan_data$int_rate, na.rm = TRUE)

# Make copy of loan_data
loan_data_replace <- loan_data

# Replace missing interest rates with median
loan_data_replace$int_rate[na_index] <- median_ir

# Check if the NAs are gone
summary(loan_data_replace$int_rate)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5.42    8.49   10.99   11.00   13.11   23.22 

1.3.3 Keeping missing data

In some situations, the fact that an input is missing is important information in itself. NAs can be kept in a separate “missing” category using coarse classification.

Coarse classification allows you to simplify your data and improve the interpretability of your model. Coarse classification requires you to bin your responses into groups that contain ranges of values. You can use this binning technique to place all NAs in their own bin.

# Make the necessary replacements in the coarse classification example below 
loan_data$ir_cat <- rep(NA, length(loan_data$int_rate))

loan_data$ir_cat[which(loan_data$int_rate <= 8)] <- "0-8"
loan_data$ir_cat[which(loan_data$int_rate > 8 & loan_data$int_rate <= 11)] <- "8-11"
loan_data$ir_cat[which(loan_data$int_rate > 11 & loan_data$int_rate <= 13.5)] <- "11-13.5"
loan_data$ir_cat[which(loan_data$int_rate > 13.5)] <- "13.5+"
loan_data$ir_cat[which(is.na(loan_data$int_rate))] <- "Missing"

loan_data$ir_cat <- as.factor(loan_data$ir_cat)

# Look at your new variable using plot()
plot(loan_data$ir_cat)

1.4 Data splitting and confusion matrices

1.4.1 Splitting the data set

To make your training and test sets, you should first set a seed using set.seed(). Seeds allow you to create a starting point for randomly generated numbers, so that each time your code is run the same answer is generated. The advantage of doing this in your sampling is that you or anyone else can recreate the exact same training and test sets by using the same seed.

Using sample(), you can randomly assign observations to the training and test set.

For this exercise you will use the two first arguments in the sample() function: - The first argument is the vector from which we will sample values. We will randomly pick row numbers as indices; you can use 1:nrow(loan_data) to create the vector of row numbers. - The second argument is the number of items to choose. We will enter 2 / 3 * nrow(loan_data), as we construct the training set first.

# Set seed of 567
set.seed(567)

# Store row numbers for training set: index_train
index_train <- sample(1:nrow(loan_data), 2 / 3 * nrow(loan_data))

# Create training set: training_set
training_set <- loan_data[index_train, ]

# Create test set: test_set
test_set <- loan_data[-index_train, ]

1.4.2 Creating a confusion matrix

In this example, assume that you have run a model and stored the predicted outcomes in a vector called model_pred. You want to see how the model performed so you will construct a confusion matrix. You will compare the actual loan status column (loan_status) to the predicted values (model_pred), using the table() function, where the arguments are the true values and the predicted values. Recall the confusion matrix structure:

\(Classification accuracy = \frac{(TP+TN)}{(TP+FP+TN+FN)}\)

\(Sensitivity=\frac{TP}{(TP+FN)}\)

\(Specificity=\frac{TN}{(TN+FP)}\)

model_pred <- sample(c(0, 1), replace = TRUE, size = 9698)
# Create confusion matrix
conf_matrix <- table(test_set$loan_status, model_pred)
conf_matrix
   model_pred
       0    1
  0 4268 4341
  1  535  554
# Compute classification accuracy
(conf_matrix[1, 1] + conf_matrix[2, 2]) / (conf_matrix[1, 1] + conf_matrix[1, 2] + conf_matrix[2, 1] + conf_matrix[2, 2])
[1] 0.4972159
# Compute sensitivity
conf_matrix[2, 2] / (conf_matrix[2, 1] + conf_matrix[2, 2])
[1] 0.5087236

2 Logistic regression

2.1 Logistic regression: introduction

2.1.1 Basic logistic regression

In the video, you looked at a logistic regression model including the variable age as a predictor. Now, you will include a categorical variable, and learn how to interpret its parameter estimates.

When you include a categorical variable in a logistic regression model in R, you will obtain a parameter estimate for all but one of its categories. This category for which no parameter estimate is given is called the reference category. The parameter for each of the other categories represents the odds ratio in favor of a loan default between the category of interest and the reference category.

loan_data <- readRDS("data/loan_data_ch2.rds")
# Set seed of 567
set.seed(567)

# Store row numbers for training set: index_train
index_train <- sample(1:nrow(loan_data), 2 / 3 * nrow(loan_data))

# Create training set: training_set
training_set <- loan_data[index_train, ]

# Create test set: test_set
test_set <- loan_data[-index_train, ]
# Build a glm model with variable ir_cat as a predictor
log_model_cat <- glm(formula = loan_status ~ ir_cat, family = "binomial",
                     data = training_set)

# Print the parameter estimates 
log_model_cat

Call:  glm(formula = loan_status ~ ir_cat, family = "binomial", data = training_set)

Coefficients:
  (Intercept)  ir_cat11-13.5    ir_cat13.5+     ir_cat8-11  ir_catMissing  
      -2.8893         0.9849         1.3805         0.5496         0.7225  

Degrees of Freedom: 19393 Total (i.e. Null);  19389 Residual
Null Deviance:      13440 
Residual Deviance: 13030    AIC: 13040
# Look at the different categories in ir_cat using table()
table(loan_data$ir_cat)

    0-8 11-13.5   13.5+    8-11 Missing 
   7130    6953    6002    6230    2776 

How do you interpret the parameter estimate for the interest rates that are between 8% and 11%? Compared to the reference category with interest rates between 0% and 8%, the odds in favor of default change by a multiple of…

exp(0.5496)
[1] 1.73256

2.1.2 Multiple variables in a logistic regression model

The interpretation of a single parameter still holds when including several variables in a model. When you do include several variables and ask for the interpretation when a certain variable changes, it is assumed that the other variables remain constant, or unchanged. There is a fancy latin phrase for this, ceteris paribus, literally meaning “keeping all others the same”.

To build a logistic regression model with multiple variables, you can use the + sign to add variables. Your formula will look something like:

\(y ~ x1 + ... + xk\)

In order to evaluate the model there are a number of things to be aware of. You already looked at the parameter values, but that is not the only thing of importance. Also important is the statistical significance of a certain parameter estimate. The significance of a parameter is often referred to as a p-value, however in a model output you will see it denoted as \(Pr(>|t|)\). In glm, mild significance is denoted by a “.” to very strong significance denoted by "***". When a parameter is not significant, this means you cannot assure that this parameter is significantly different from 0. Statistical significance is important. In general, it only makes sense to interpret the effect on default for significant parameters.

# Build the logistic regression model
log_model_multi <- glm(loan_status ~ age + ir_cat + grade + loan_amnt + annual_inc, family = "binomial", data = training_set)


# Obtain significance levels using summary()
summary(log_model_multi)

Call:
glm(formula = loan_status ~ age + ir_cat + grade + loan_amnt + 
    annual_inc, family = "binomial", data = training_set)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1799  -0.5321  -0.4329  -0.3342   3.4786  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -2.400e+00  1.273e-01 -18.851  < 2e-16 ***
age           -2.505e-03  3.815e-03  -0.657 0.511375    
ir_cat11-13.5  7.048e-01  1.319e-01   5.345 9.06e-08 ***
ir_cat13.5+    7.012e-01  1.466e-01   4.784 1.72e-06 ***
ir_cat8-11     4.624e-01  1.159e-01   3.992 6.57e-05 ***
ir_catMissing  4.587e-01  1.285e-01   3.571 0.000356 ***
gradeB         1.108e-01  1.049e-01   1.056 0.290875    
gradeC         4.542e-01  1.209e-01   3.757 0.000172 ***
gradeD         7.549e-01  1.373e-01   5.497 3.85e-08 ***
gradeE         8.481e-01  1.662e-01   5.103 3.34e-07 ***
gradeF         1.484e+00  2.326e-01   6.381 1.75e-10 ***
gradeG         1.954e+00  3.331e-01   5.866 4.47e-09 ***
loan_amnt     -6.373e-06  4.219e-06  -1.511 0.130904    
annual_inc    -5.890e-06  7.631e-07  -7.719 1.17e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13443  on 19393  degrees of freedom
Residual deviance: 12854  on 19380  degrees of freedom
AIC: 12882

Number of Fisher Scoring iterations: 5

2.2 Logistic regression: predicting the probability of default

2.2.1 Predicting the probability of default

you can predict the probability for all the test set cases at once using the predict() function.

After having obtained all the predictions for the test set elements, it is useful to get an initial idea of how good the model is at discriminating by looking at the range of predicted probabilities. A small range means that predictions for the test set cases do not lie far apart, and therefore the model might not be very good at discriminating good from bad customers. With low default percentages, you will notice that in general, very low probabilities of default are predicted. It’s time to have a look at a first model.

log_model_small <- glm(formula = loan_status ~ age + ir_cat, family = "binomial", data = training_set)
# Build the logistic regression model
predictions_all_small <- predict(log_model_small, newdata = test_set, type = "response")

# Look at the range of the object "predictions_all_small"
range(predictions_all_small)
[1] 0.03853433 0.18866941

2.2.2 Making more discriminative models

In the previous exercise, the range for predicted probabilities of default was rather small. As discussed, small predicted default probabilities are to be expected with low default rates, but building bigger models (which basically means: including more predictors) can expand the range of your predictions.

Whether this will eventually lead to better predictions still needs to be validated and depends on the quality of the newly included predictors. But first, have a look at how bigger models can expand the range.

# Change the code below to construct a logistic regression model using all available predictors in the data set
log_model_full <- glm(loan_status ~ ., family = "binomial", data = training_set)

# Make PD-predictions for all test set elements using the the full logistic regression model
predictions_all_full <- predict(log_model_full, newdata = test_set, type = "response")

# Look at the predictions range
range(predictions_all_full)
[1] 6.471741e-06 5.436561e-01

2.3 Evaluating the logistic regression model result

2.3.1 Specifying a cut-off

We have shown you how the specification of a cut-off can make the difference to obtain a good confusion matrix. Now, you will learn how to transform the prediction vector to a vector of binary values indicating the status of the loan. The ifelse() function in R can help you here.

Applying the ifelse() function in the context of a cut-off, you would have something like

ifelse(predictions > 0.3, 1, 0) In the first argument, you are testing whether a certain value in the predictions-vector is bigger than 0.3. If this is TRUE, R returns “1” (specified in the second argument), if FALSE, R returns “0” (specified in the third argument), representing “default” and “no default”, respectively.

# The code for the logistic regression model and the predictions is given below
log_model_full <- glm(loan_status ~ ., family = "binomial", data = training_set)
predictions_all_full <- predict(log_model_full, newdata = test_set, type = "response")

# Make a binary predictions-vector using a cut-off of 15%
pred_cutoff_15 <- ifelse(predictions_all_full > 0.15, 1, 0)

# Construct a confusion matrix
table(test_set$loan_status, pred_cutoff_15)
   pred_cutoff_15
       0    1
  0 6726 1878
  1  683  410

2.3.2 Comparing two cut-offs

Have a look again at the confusion matrix using the full model and a cut-off of 15%, which is stored in the object conf_matrix_15, and another confusion matrix using a cut-off of 20% and the same model, stored in conf_matrix_20.

Accuracy increases, sensitivity decreases and specificity increases.

2.4 Wrap-up and remarks

3 Decision trees

3.1 What is a decision tree?

3.1.1 Computing the gain for a tree

We looked at how the Gini-measure is used to create the perfect split for a tree. Now, you will compute the gain for the tree loaded in your workspace.

The data set contains 500 cases, 89 of these cases are defaults. This led to a Gini of 0.292632 in the root node. As a small reminder, remember that Gini of a certain node = 2 * proportion of defaults in this node * proportion of non-defaults in this node. Have a look at the code for a refresher.

gini_root <- 2 * (89 / 500) * (411 / 500)

You will use these Gini measures to help you calculate the gain of the leaf nodes with respect to the root node. Look at the following code to get an idea of how you can use the gini measures you created to calculate the gain of a node.

Gain = gini_root - (prop(cases left leaf) * gini_left) - (prop(cases right leaf * gini_right))

Compute the gini in the left hand and the right hand node, and the gain of the two leaf nodes with respect to the root node. The object containing the tree is small_tree.

# Load package rpart in your workspace.
library(rpart)

# The Gini-measure of the root node is given below
gini_root <- 2 * 89 / 500 * 411 / 500

# Compute the Gini measure for the left leaf node
gini_ll <- 2 * (401/446)*(45/446) 

# Compute the Gini measure for the right leaf node
gini_rl <-2 * (10/54)*(44/54) 

# Compute the gain
gain <- gini_root - 446 / 500 * gini_ll - 54 / 500 * gini_rl
gain
[1] 0.09820084
# compare the gain-column in small_tree$splits with our computed gain, multiplied by 500, and assure they are the same

small_set <- readRDS("small_set.rds")
small_tree <- rpart(formula = status ~ age, data = small_set, method = "class", 
    control = rpart.control(minsplit = 5, cp = 0.001, maxdepth = 1))
small_tree$splits
    count ncat  improve index adj
age   500    1 49.10042  32.5   0
improve <- gain * 500
improve
[1] 49.10042

3.2 Building decision trees using the rpart()-package

3.2.1 Undersampling the training set

In the video, you saw that to overcome the unbalanced data problem, you can use under- or oversampling. The training set has been undersampled for you, such that 1/3 of the training set consists of defaults, and 2/3 of non-defaults. The resulting data set is available in your workspace and named undersampled_training_set, and contains less observations (6570 instead of 19394). In this exercise, you will create a decision tree using the undersampled data set.

You will notice that the trees in this and the next exercises are very big, so big that you cannot really read them anymore.

# Load package rpart in your workspace.
library(rpart)

# Change the code provided in the video such that a decision tree is constructed using the undersampled training set. Include rpart.control to relax the complexity parameter to 0.001.
tree_undersample <- rpart(loan_status ~ ., method = "class",
                          data =  training_set)

Change the code provided such that a decision tree is constructed using the undersampled training set instead of training_set. Additionally, add the argument control = rpart.control(cp = 0.001). cp, which is the complexity parameter, is the threshold value for a decrease in overall lack of fit for any split. If cp is not met, further splits will no longer be pursued. cp’s default value is 0.01, but for complex problems, it is advised to relax cp.

Plot the decision tree using the function plot and the tree object name. Add a second argument uniform = TRUE to get equal-sized branches.

library(ROSE)
undersampled_training_set <- ovun.sample(loan_status ~ ., data = training_set, method = "under", N = 6570, seed = 1)$data
table(undersampled_training_set$loan_status )

   0    1 
4436 2134 
# Change the code provided in the video such that a decision tree is constructed using the undersampled training set. Include rpart.control to relax the complexity parameter to 0.001.
tree_undersample <- rpart(loan_status ~ ., method = "class",
                          data =  undersampled_training_set,  control = rpart.control(cp = 0.001))

# Plot the decision tree
plot(tree_undersample, uniform = TRUE)

# Add labels to the decision tree
text(tree_undersample)

3.2.2 Changing the prior probabilities

As mentioned in the video, you can also change the prior probabilities to obtain a decision tree. This is an indirect way of adjusting the importance of misclassifications for each class. You can specify another argument inside rpart() to include prior probabities. The argument you are looking for has the following form

parms = list(prior=c(non_default_proportion, default_proportion))

# Change the code below such that a tree is constructed with adjusted prior probabilities.
tree_prior <- rpart(loan_status ~ ., method = "class",
                    data = training_set)

Change the code provided such that a decision tree is constructed , including the argument parms and changing the proportion of non-defaults to 0.7, and of defaults to 0.3 (they should always sum up to 1). Additionally, include control = rpart.control(cp = 0.001) as well.

# Change the code below such that a tree is constructed with adjusted prior probabilities.
tree_prior <- rpart(loan_status ~ ., method = "class",
                    data = training_set, parms = list(prior = c(0.7, 0.3)),
                    control = rpart.control(cp = 0.001))

# Plot the decision tree
plot(tree_prior, uniform = TRUE)

# Add labels to the decision tree
text(tree_prior)

3.2.3 Including a loss matrix

Thirdly, you can include a loss matrix, changing the relative importance of misclassifying a default as non-default versus a non-default as a default. You want to stress that misclassifying a default as a non-default should be penalized more heavily. Including a loss matrix can again be done in the argument parms in the loss matrix.

parms = list(loss = matrix(c(0, cost_def_as_nondef, cost_nondef_as_def, 0), ncol=2))

Doing this, you are constructing a 2x2-matrix with zeroes on the diagonal and changed loss penalties off-diagonal. The default loss matrix is all ones off-diagonal.

tree_loss_matrix <- rpart(loan_status ~ ., method = "class",
                          data =  training_set)

Change the code provided such a loss matrix is included, with a penalization that is 10 times bigger when misclassifying an actual default as a non-default. This can be done replacing cost_def_as_nondef by 10, and cost_nondef_as_def by 1. Similar to what you’ve done in the previous exercises, include rpart.control to relax the complexity parameter to 0.001.

# Change the code below such that a decision tree is constructed using a loss matrix penalizing 10 times more heavily for misclassified defaults.
tree_loss_matrix <- rpart(loan_status ~ ., method = "class",
                          data =  training_set,
                          parms = list(loss = matrix(c(0, 10, 1, 0), ncol = 2)),
                          control = rpart.control(cp = 0.001))


# Plot the decision tree
plot(tree_loss_matrix, uniform = TRUE)

# Add labels to the decision tree
text(tree_loss_matrix)

3.3 Pruning the decision tree

3.3.1 Pruning the tree with changed prior probabilities

Pruning a tree is necessary to avoid overfitting. There were some big trees in the previous exercises and now you will put what you have learned into practice, and prune the previously constructed tree with the changed prior probabilities. The rpart package is already loaded in your workspace.

You will first set a seed to make sure the results are reproducible as mentioned in the video, because you will be examining cross-validated error results. Results involve randomness and could differ slightly upon running the function again with a different seed.

In this exercise you will learn to identify which complexity parameter (CP) will minimize the cross-validated error results, then prune your tree based on this value.

tree_undersample tree_prior tree_loss_matrix

Use plotcp() to visualize cross-vaidated error (X-val Relative Error) in relation to the complexity parameter for tree_prior. Use printcp() to print a table of information about CP, splits, and errors. See if you can identify which split has the minimum cross-validated error in tree_prior.

set.seed(345)
# tree_prior is loaded in your workspace

# Plot the cross-validated error rate as a function of the complexity parameter
plotcp(tree_prior)


# Use printcp() to identify for which complexity parameter the cross-validated error rate is minimized.
printcp(tree_prior)

Classification tree:
rpart(formula = loan_status ~ ., data = training_set, method = "class", 
    parms = list(prior = c(0.7, 0.3)), control = rpart.control(cp = 0.001))

Variables actually used in tree construction:
[1] age            annual_inc     emp_cat        grade          home_ownership ir_cat        
[7] loan_amnt     

Root node error: 5818.2/19394 = 0.3

n= 19394 

          CP nsplit rel error  xerror     xstd
1  0.0055301      0   1.00000 1.00000 0.020422
2  0.0047484      5   0.97235 0.99718 0.020347
3  0.0043786      6   0.96760 0.99623 0.020180
4  0.0028281      7   0.96322 0.97751 0.019459
5  0.0026498      9   0.95757 0.97588 0.019344
6  0.0018569     10   0.95492 0.97468 0.019267
7  0.0014322     11   0.95306 0.97759 0.019326
8  0.0013610     20   0.93861 0.98244 0.019346
9  0.0013529     24   0.93317 0.98163 0.019346
10 0.0011536     27   0.92793 0.98505 0.019308
11 0.0010000     28   0.92677 0.98986 0.019247
# Create an index for of the row with the minimum xerror
index <- which.min(tree_prior$cptable[ , "xerror"])

# Create tree_min
tree_min <- tree_prior$cptable[index, "CP"]

#  Prune the tree using tree_min
ptree_prior <- prune(tree_prior, cp = tree_min)

# Use prp() to plot the pruned tree
library(rpart.plot)
prp(ptree_prior) 

3.3.2 Pruning the tree with the loss matrix

In this exercise, you will prune the tree that was built using a loss matrix in order to penalize misclassified defaults more than misclassified non-defaults.

# set a seed and run the code to construct the tree with the loss matrix again
set.seed(345)
tree_loss_matrix  <- rpart(loan_status ~ ., method = "class", data = training_set,
                           parms = list(loss=matrix(c(0, 10, 1, 0), ncol = 2)),
                           control = rpart.control(cp = 0.001))

# Plot the cross-validated error rate as a function of the complexity parameter
plotcp(tree_loss_matrix)

Looking at the cp-plot, you will notice that pruning the tree using the minimum cross-validated error will lead to a tree that is as big as the unpruned tree, as the cross-validated error reaches its minimum for cp = 0.001. Because you would like to make the tree somewhat smaller, try pruning the tree using cp = 0.0012788. For this complexity parameter, the cross-validated error approaches the minimum observed error. Call the pruned tree ptree_loss_matrix.

# Prune the tree using cp = 0.0012788
ptree_loss_matrix <- prune(tree_loss_matrix, cp = 0.0012788)

# Use prp() and argument extra = 1 to plot the pruned tree
prp(ptree_loss_matrix, extra = 1)

# Create an index for of the row with the minimum xerror
index <- which.min(tree_loss_matrix$cptable[ , "xerror"])

# Create tree_min
tree_min <- tree_loss_matrix$cptable[index, "CP"]
# Prune the tree using cp = 0.0012788
ptree_loss_matrix <- prune(tree_loss_matrix, cp = tree_min)

# Use prp() and argument extra = 1 to plot the pruned tree
prp(ptree_loss_matrix, extra = 1)

3.4 Other tree options and the construction of confusion matrices

3.4.1 One final tree using more options

You will use some final arguments that were discussed in the video. Some specifications in the rpart.control()-function will be changed, and some weights will be included using the weights argument in rpart(). The vector case_weights has been constructed for you and is loaded in your workspace. This vector contains weights of 1 for the non-defaults in the training set, and weights of 3 for defaults in the training sets. By specifying higher weights for default, the model will assign higher importance to classifying defaults correctly.

case_weights <- ifelse(training_set$loan_status == 0, 1, 3)
# set a seed and run the code to obtain a tree using weights, minsplit and minbucket
set.seed(345)
tree_weights <- rpart(loan_status ~ ., method = "class",
                      data = training_set,
                      weights = case_weights, 
                      control = rpart.control(minsplit = 5, minbucket = 2, cp = 0.001))

# Plot the cross-validated error rate for a changing cp
plotcp(tree_weights)


# Create an index for of the row with the minimum xerror
index <- which.min(tree_weights$cp[ , "xerror"])

# Create tree_min
tree_min <- tree_weights$cp[index, "CP"]

# Prune the tree using tree_min
ptree_weights <- prune(tree_weights, cp = tree_min)

# Plot the pruned tree using the rpart.plot()-package
prp(ptree_weights, extra = 1)

3.4.2 Confusion matrices and accuracy of our final trees

Over the past few exercises, you have constructed quite a few pruned decision trees, with four in total. As you can see, the eventual number of splits varies quite a bit from one tree to another:

ptree_undersample  # 7 splits
ptree_prior  # 9 splits
ptree_loss_matrix  # 24 splits
ptree_weights  # 6 splits

Now it is important to know which tree performs best in terms of accuracy. In order to get the accuracy, you will start off by making predictions using the test set, and construct the confusion matrix for each of these trees. You will add the argument type = “class” when doing these predictions. By doing this there is no need to set a cut-off. Nevertheless, it is important to be aware of the fact that not only the accuracy is important, but also the sensitivity and specificity. Additionally, predicting probabilities instead of binary values (0 or 1) has the advantage that the cut-off can be moved along. Then again, the difficulty here is the choice of the cut-off.

\(Classification accuracy = \frac{(TP+TN)}{(TP+FP+TN+FN)}\)

ptree_undersample <- tree_undersample
ptree_prior <- tree_prior 
ptree_loss_matrix <- tree_loss_matrix
ptree_weights <- tree_weights
# Make predictions for each of the pruned trees using the test set.
pred_undersample <- predict(ptree_undersample, newdata = test_set,  type = "class")
pred_prior <- predict(ptree_prior, newdata = test_set,  type = "class")
pred_loss_matrix <- predict(ptree_loss_matrix, newdata = test_set,  type = "class")
pred_weights <- predict(ptree_weights, newdata = test_set,  type = "class")

# construct confusion matrices using the predictions.
confmat_undersample <- table(test_set$loan_status, pred_undersample)
confmat_prior <- table(test_set$loan_status, pred_prior)
confmat_loss_matrix <- table(test_set$loan_status, pred_loss_matrix)
confmat_weights <- table(test_set$loan_status, pred_weights)

# Compute the accuracies
acc_undersample <- sum(diag(confmat_undersample)) / nrow(test_set)
acc_undersample
[1] 0.8210787
acc_prior <- sum(diag(confmat_prior)) / nrow(test_set)
acc_prior
[1] 0.848613
acc_loss_matrix <- sum(diag(confmat_loss_matrix)) / nrow(test_set)
acc_loss_matrix
[1] 0.5271734
acc_weights <- sum(diag(confmat_weights)) / nrow(test_set)
acc_weights
[1] 0.8627411

4 Evaluating a credit risk model

4.1 Finding the right cut-off: the strategy curve

In the video, you learned how to compute the bad rate (or, the percentage of defaults) in the loan portfolio of a bank when given:

  • a specific model
  • the acceptance rate

In this exercise, you will compute the bad rate that a bank can expect when using the pruned tree ptree_prior that you fitted before, and an acceptance rate of 80%.

n the script, you are provided the code to make predictions for the probability of default using the pruned tree and test_set. Remember that if you use the predict() function for a tree, the probability of default can be found in the second column. Therefore [,2] was pasted to the predict() function.

Obtain the cut-off that leads to an acceptance rate of 80%, using prob_default_prior. You can use the quantile()- function to do this, setting the second argument to 0.8. Assign the name cutoff_prior. The code to obtain the actual binary default predictions (0 or 1) is provided. ifelse() here. Name the object bin_pred_prior_80.

The code to select the default indicators of test_set for the accepted loans acording to a 80% acceptance rate is provided.

Compute the percentage of defaults (or the “bad rate”) for the accepted loans. This is the number of occurences of 1 in accepted_status_prior_80, divided by the total number of instances in this vector. Print the solution to your R-console.

# Make predictions for the probability of default using the pruned tree and the test set.
prob_default_prior <- predict(ptree_prior, newdata = test_set)[ ,2]

# Obtain the cutoff for acceptance rate 80%
cutoff_prior <- quantile(prob_default_prior, 0.8)  

# Obtain the binary predictions.
bin_pred_prior_80 <- ifelse(prob_default_prior > cutoff_prior, 1, 0)

# Obtain the actual default status for the accepted loans
accepted_status_prior_80 <- test_set$loan_status[bin_pred_prior_80 == 0]

# Obtain the bad rate for the accepted loans
sum(accepted_status_prior_80) / length(accepted_status_prior_80)
[1] 0.102678

4.1.1 The strategy table and strategy curve

Repeating the calculations you did in the previous exercise for several acceptance rates, you can obtain a strategy table. This table can be a useful tool for banks, as they can give them a better insight to define an acceptance strategy.

You know how to compute a bad rate for a certain acceptance rate by now, so the function strategy_bank was written and loaded into your workspace to speed things up. This function computes the cut-off and bad rate for the acceptance rates that are multiples of 5% (0%, 5%, 10%, …).

strategy_bank <- function(prob_of_def){
    cutoff = rep(NA, 21)
    bad_rate = rep(NA, 21)
    accept_rate = seq(1, 0, by = -0.05)
    for (i in 1:21) {
        cutoff[i] = quantile(prob_of_def, accept_rate[i])
        pred_i = ifelse(prob_of_def > cutoff[i], 1, 0)
        pred_as_good = test_set$loan_status[pred_i == 0]
        bad_rate[i] = sum(pred_as_good)/length(pred_as_good)
    }
        table = cbind(accept_rate, cutoff = round(cutoff, 4), bad_rate = round(bad_rate, 
        4))
    return(list(table = table, bad_rate = bad_rate, accept_rate = accept_rate, 
        cutoff = cutoff))
}
  • Have a look at the function strategy_bank.
  • The vector predictions_cloglog contains predicted probabilities of default using the cloglog model you used in chapter 2, the vector predictions_loss_matrixcontains the predicted probabilities of default using the pruned tree including a loss matrix (previously constructed in chapter 3). Apply function strategy_bank to each of the prediction-vectors, assign the name strategy_cloglog and strategy_loss_matrix respectively.
  • The strategy tables can be obtained using the object names in combination with $table.
  • The strategy curves have been plotted for you. The strategy curve of the tree model shows pretty strange behavior. Because of the structure of classification trees, you might have a bigger chance for weird “jumps” here. Additionally, the tree with a loss matrix was a very big one, so this might be the result of overfitting!
#withougth type = "class" to get probabilities
predictions_loss_matrix <- predict(ptree_loss_matrix, newdata = test_set)[,2]
# Apply the function strategy_bank to both predictions_cloglog and predictions_loss_matrix
strategy_cloglog <- strategy_bank(predictions_cloglog)
strategy_loss_matrix <- strategy_bank(predictions_loss_matrix)

# Obtain the strategy tables for both prediction-vectors
strategy_cloglog$table
      accept_rate cutoff bad_rate
 [1,]        1.00 0.3404   0.1127
 [2,]        0.95 0.1939   0.1090
 [3,]        0.90 0.1832   0.1047
 [4,]        0.85 0.1702   0.1003
 [5,]        0.80 0.1477   0.0973
 [6,]        0.75 0.1388   0.0936
 [7,]        0.70 0.1329   0.0894
 [8,]        0.65 0.1267   0.0855
 [9,]        0.60 0.1185   0.0823
[10,]        0.55 0.1093   0.0802
[11,]        0.50 0.1012   0.0773
[12,]        0.45 0.0949   0.0721
[13,]        0.40 0.0913   0.0686
[14,]        0.35 0.0875   0.0632
[15,]        0.30 0.0831   0.0622
[16,]        0.25 0.0730   0.0519
[17,]        0.20 0.0552   0.0499
[18,]        0.15 0.0531   0.0437
[19,]        0.10 0.0508   0.0489
[20,]        0.05 0.0481   0.0468
[21,]        0.00 0.0327   0.0000
strategy_loss_matrix$table
      accept_rate cutoff bad_rate
 [1,]        1.00 0.3913   0.1127
 [2,]        0.95 0.1911   0.1110
 [3,]        0.90 0.1911   0.1110
 [4,]        0.85 0.1911   0.1110
 [5,]        0.80 0.1911   0.1110
 [6,]        0.75 0.1818   0.0942
 [7,]        0.70 0.1545   0.0899
 [8,]        0.65 0.1503   0.0901
 [9,]        0.60 0.1299   0.0872
[10,]        0.55 0.1299   0.0872
[11,]        0.50 0.1263   0.0759
[12,]        0.45 0.0635   0.0729
[13,]        0.40 0.0635   0.0729
[14,]        0.35 0.0630   0.0690
[15,]        0.30 0.0630   0.0690
[16,]        0.25 0.0486   0.0570
[17,]        0.20 0.0471   0.0589
[18,]        0.15 0.0325   0.0471
[19,]        0.10 0.0298   0.0445
[20,]        0.05 0.0298   0.0445
[21,]        0.00 0.0000   0.1039
# Plot the strategy functions
par(mfrow = c(1,2))
plot(strategy_cloglog$accept_rate, strategy_cloglog$bad_rate, 
     type = "l", xlab = "Acceptance rate", ylab = "Bad rate", 
     lwd = 2, main = "logistic regression")

plot(strategy_loss_matrix$accept_rate, strategy_loss_matrix$bad_rate, 
     type = "l", xlab = "Acceptance rate", 
     ylab = "Bad rate", lwd = 2, main = "tree")

4.2 The ROC-curve

4.2.1 ROC-curves for comparison of logistic regression models

ROC-curves can easily be created using the pROC-package in R. Let’s have a look if there is a big difference between ROC-curves for the four logistic regression-models previously used throughout this course. A small heads up:

  • predictions_logit contains probability of default (PD) predictions using the default logit link and containing variables age, emp_cat, ir_cat and loan_amnt.
  • predictions_probit contains PD-predictions using the probit and containing variables age, emp_cat, ir_cat and loan_amnt.
  • predictions_cloglog contains PD-predictions using the cloglog link and containing variables age, emp_cat, ir_cat and loan_amnt.
  • predictions_all_full contains PD-predictions using the default logit link and containing all seven variables in the data set.

Load the pROC-package in your R-console. Construct the ROC-objects for the four logistic regression models using function roc(response, predictor). Remember that the response is the loan status indicator in the test_set, which can be obtained through test_set$loan_status. Use the previously created objects to construct ROC-curves. To draw them all on one plot, use plot() for the first ROC-curve drawn (for ROC_logit), and use lines() for the other three models to the same plot. Use the col-argument to change the color of the curve of ROC_probit to “blue”, ROC_cloglog to “red” and ROC_all_full to “green”. Note that, in contrast with what has been discussed in the video, the x-axis label is Specificity and not “1-Specificity”, resulting in an axis that goes from 1 on the left-hand side to 0 on the right-hand side. It seems that the link function does not have a big impact on the ROC here, and the main trigger of a better ROC is the inclusion of more variables in a model. To get an exact idea of the performance of the ROC-curves, have a look at the AUC’s, using function auc().

# Load the pROC-package
library(pROC)

# Construct the objects containing ROC-information
ROC_logit <- roc(test_set$loan_status, predictions_logit)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
ROC_probit <- roc(test_set$loan_status, predictions_probit)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
ROC_cloglog <- roc(test_set$loan_status, predictions_cloglog)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
ROC_all_full <- roc(test_set$loan_status, predictions_all_full)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
# Draw all ROCs on one plot
plot(ROC_logit)
lines(ROC_probit, col="blue")
lines(ROC_cloglog, col="red")
lines(ROC_all_full, col="green")


# Compute the AUCs
auc(ROC_logit)
Area under the curve: 0.6263
auc(ROC_probit)
Area under the curve: 0.6263
auc(ROC_cloglog)
Area under the curve: 0.6263
auc(ROC_all_full)
Area under the curve: 0.6481

4.2.2 ROC-curves for comparison of tree-based models

It’s time for you to repeat the previous exercises, now comparing the tree-based models. The pROC() is now loaded in your workspace. The PD-predictions for tree-based methods are stored in the objects

predictions_undersample predictions_prior predictions_loss_matrix predictions_weights

predictions_undersample <- predict(tree_undersample, newdata = test_set)[,2]
predictions_prior <- predict(tree_prior, newdata = test_set)[,2]
predictions_loss_matrix <- predict(ptree_loss_matrix, newdata = test_set)[,2]
predictions_weights <- predict(tree_weights , newdata = test_set)[,2]
# Construct the objects containing ROC-information
ROC_undersample <- roc(test_set$loan_status, predictions_undersample)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
ROC_prior <- roc(test_set$loan_status, predictions_prior)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
ROC_loss_matrix <- roc(test_set$loan_status, predictions_loss_matrix)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
ROC_weights <- roc(test_set$loan_status, predictions_weights)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
# Draw the ROC-curves in one plot
plot(ROC_undersample)
lines(ROC_prior, col="blue")
lines(ROC_loss_matrix, col="red")
lines(ROC_weights, col="green")


# Compute the AUCs
auc(ROC_undersample)
Area under the curve: 0.6016
auc(ROC_prior)
Area under the curve: 0.5821
auc(ROC_loss_matrix)
Area under the curve: 0.617
auc(ROC_weights)
Area under the curve: 0.5858

4.3 Input selection based on the AUC

4.3.1 Another round of pruning based on AUC

In the video you saw how the “full” logistic regression model with a logit link was being pruned based on the AUC. You saw how the variable home_ownership was deleted from the model, as it improved the overall AUC. After repeating this process for two additional rounds, the variables age and ir_cat were deleted, leading to the model:

log_3_remove_ir <- glm(loan_status ~ loan_amnt + grade + annual_inc + emp_cat, family = binomial, data = training_set)

with an AUC of 0.6545. Now, it’s your turn to see whether the AUC can still be improved by deleting another variable from the model.

# Build four models each time deleting one variable in log_3_remove_ir
log_4_remove_amnt <- glm(loan_status ~ grade + annual_inc + emp_cat, 
                         family = binomial, data = training_set) 
log_4_remove_grade <- glm(loan_status ~ loan_amnt  + annual_inc + emp_cat, 
                          family = binomial, data = training_set)
log_4_remove_inc <- glm(loan_status ~ loan_amnt + grade  + emp_cat , 
                        family = binomial, data = training_set)
log_4_remove_emp <- glm(loan_status ~ loan_amnt + grade + annual_inc, 
                        family = binomial, data = training_set)

# Make PD-predictions for each of the models
pred_4_remove_amnt <- predict(log_4_remove_amnt, newdata = test_set, type = "response")
pred_4_remove_grade <- predict(log_4_remove_grade, newdata = test_set, type = "response")
pred_4_remove_inc <- predict(log_4_remove_inc, newdata = test_set, type = "response")
pred_4_remove_emp <- predict(log_4_remove_emp, newdata = test_set, type = "response")

# Compute the AUCs
auc(test_set$loan_status, pred_4_remove_amnt)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Area under the curve: 0.6484
auc(test_set$loan_status, pred_4_remove_grade)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Area under the curve: 0.582
auc(test_set$loan_status, pred_4_remove_inc)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Area under the curve: 0.6366
auc(test_set$loan_status, pred_4_remove_emp)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Area under the curve: 0.641

4.3.2 Further model reduction?

Deleting the variable loan_amnt, the AUC can be further improved to 0.6548! The resulting model is

log_4_remove_amnt <- glm(loan_status ~ grade + annual_inc + emp_cat, family = binomial, data = training_set) 

Is it possible to reduce the logistic regression model to only two variable without reducing the AUC? In this exercise you will see if it is possible!

# Build three models each time deleting one variable in log_4_remove_amnt
log_5_remove_grade <- glm(loan_status ~ annual_inc + emp_cat, family = binomial, data = training_set)  
log_5_remove_inc <- glm(loan_status ~ grade  + emp_cat , family = binomial, data = training_set)
log_5_remove_emp <- glm(loan_status ~ grade + annual_inc, family = binomial, data = training_set)

# Make PD-predictions for each of the models
pred_5_remove_grade <- predict(log_5_remove_grade, newdata = test_set, type = "response")
pred_5_remove_inc <- predict(log_5_remove_inc, newdata = test_set, type = "response")
pred_5_remove_emp <- predict(log_5_remove_emp, newdata = test_set, type = "response")

# Compute the AUCs
auc(test_set$loan_status, pred_5_remove_grade)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Area under the curve: 0.5766
auc(test_set$loan_status, pred_5_remove_inc)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Area under the curve: 0.6345
auc(test_set$loan_status, pred_5_remove_emp)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Area under the curve: 0.6419
# Plot the ROC-curve for the best model here
plot(roc(test_set$loan_status,pred_4_remove_amnt))
Setting levels: control = 0, case = 1
Setting direction: controls < cases

---
title: "Credit Risk Modeling in R"
output:
  html_notebook:
    toc: true
    toc_float: true
    toc_collapsed: false
    number_sections: true
    
toc_depth: 3
---
# Introduction and data preprocessing

## Introduction and data structure

### Exploring the credit data

We will be examining the dataset loan_data:
```{r}
loan_data <- readRDS("data/loan_data_ch1.rds")
```
After being given loan_data, you are particularly interested about the defaulted loans in the data set. You want to get an idea of the number, and percentage of defaults. Defaults are rare, so you always want to check what the proportion of defaults is in a loan dataset. The CrossTable() function is very useful here.

Remember that default information is stored in the response variable loan_status, where 1 represents a default, and 0 represents non-default.

To learn more about variable structures and spot unexpected tendencies in the data, you should examine the relationship between loan_status and certain factor variables. For example, you would expect that the proportion of defaults in the group of customers with grade G (worst credit rating score) is substantially higher than the proportion of defaults in the grade A group (best credit rating score).

Conveniently, CrossTable() can also be applied on two categorical variables.
```{r}
# View the structure of loan_data
str(loan_data)

# Load the gmodels package 
library(gmodels)

# Call CrossTable() on loan_status
CrossTable(loan_data$loan_status)
```
Call CrossTable() with x argument loan_data\$grade and y argument loan_data\$loan_status. We only want row-wise proportions, so set prop.r to TRUE, but prop.c , prop.t and prop.chisq to FALSE (default values here are TRUE, and this would lead to inclusion of column proportions, table proportions and chi-square contributions for each cell. We do not need these here.)
```{r}
# Call CrossTable() on grade and loan_status
CrossTable(loan_data$grade, loan_data$loan_status, prop.r = TRUE, prop.c = FALSE, prop.t  = FALSE, prop.chisq = FALSE)
```
The proportion of defaults increases when the credit rating moves from A to G (worse credit scores).

## Histograms and outliers

### Histograms

You previously explored categorical variables using the CrossTable() function. Now you would like to explore continuous variables to identify potential outliers or unexpected data structures.

To do this, let's experiment with the function hist() to understand the distribution of the number of loans for different customers.
```{r}
# Create histogram of loan_amnt: hist_1
hist_1 <- hist(loan_data$loan_amnt)
```
Use $breaks along with the object hist_1 to get more information on the histogram breaks. Knowing the location of the breaks is important because if they are poorly chosen, the histogram may be misleading.
```{r}
# Print locations of the breaks in hist_1
hist_1$breaks
```
Change the number of breaks in hist_1 to 200 by specifying the breaks argument. Additionally, name the x-axis "Loan amount" using the xlab argument and title it "Histogram of the loan amount" using the main argument. Save the result to hist_2.
```{r}
# Change number of breaks and add labels: hist_2
hist_2 <- hist(loan_data$loan_amnt, breaks = 200, xlab = "Loan amount", 
               main = "Histogram of the loan amount")
```
Why do the peaks occur where they occur?

Note that there are some high peaks at round values: 5000, 10000, 15000,… People tend to borrow round numbers. 

### Outliers

Now it's time to look at the structure of the variable age.
```{r}
hist(loan_data$age)
```
There is a lot of blank space on the right-hand side of the plot. This is an indication of possible outliers. You will look at a scatterplot to verify this. If you find any outliers you will delete them.

If outliers are observed for several variables, it might be useful to look at bivariate plots. It's possible the outliers belong to the same observation. If so, there is even more reason to delete the observation because it is more likely that some information stored in it is wrong.
```{r}
# Plot the age variable
plot(loan_data$age, ylab = "Age")
```
The oldest person in this data set is older than 122 years! Get the index of this outlier using which() and the age of 122 as a cutoff (you can do this using loan_data$age > 122). Assign it to the object index_highage.
```{r}
# Save the outlier's index to index_highage
index_highage <- which(loan_data$age > 122)
```
Create a new data set new_data, after removing the observation with the high age using the object index_highage.
```{r}
# Create data set new_data with outlier deleted
new_data <- loan_data[-index_highage, ]
```

Have a look at the bivariate scatterplot, with age on the x-axis and annual income on the y-axis. 
```{r}
# Make bivariate scatterplot of age and annual income
plot(loan_data$age, loan_data$annual_inc, xlab = "Age", ylab = "Annual Income")
```
Did you notice that you have a bivariate outlier? The person with the huge annual wage of $6 million appeared to be 144 years old. This must be a mistake, so you will definitely delete this observation from the data in the following exercises.

## Missing data and coarse classification

### Deleting missing data

You saw before that the interest rate (int_rate) in the data set loan_data depends on the customer. Unfortunately some observations are missing interest rates. You now need to identify how many interest rates are missing and then delete them.
```{r}
# Look at summary of loan_data
summary(loan_data$int_rate)

# Get indices of missing interest rates: na_index
na_index <- which(is.na(loan_data$int_rate))

# Remove observations with missing interest rates: loan_data_delrow_na
loan_data_delrow_na <- loan_data[-na_index, ]

# Make copy of loan_data
loan_data_delcol_na <- loan_data

# Delete interest rate column from loan_data_delcol_na
loan_data_delcol_na$int_rate <- NULL
```
### Replacing missing data

Rather than deleting the missing interest rates, you may want to replace them instead. 

Create an object called median_ir, containing the median of the interest rates in loan_data using the median() function. Don't forget to include the argument na.rm = TRUE.
```{r}
# Compute the median of int_rate
median_ir <- median(loan_data$int_rate, na.rm = TRUE)

# Make copy of loan_data
loan_data_replace <- loan_data

# Replace missing interest rates with median
loan_data_replace$int_rate[na_index] <- median_ir

# Check if the NAs are gone
summary(loan_data_replace$int_rate)
```
### Keeping missing data

In some situations, the fact that an input is missing is important information in itself. NAs can be kept in a separate "missing" category using coarse classification.

Coarse classification allows you to simplify your data and improve the interpretability of your model. Coarse classification requires you to bin your responses into groups that contain ranges of values. You can use this binning technique to place all NAs in their own bin.
```{r}
# Make the necessary replacements in the coarse classification example below 
loan_data$ir_cat <- rep(NA, length(loan_data$int_rate))

loan_data$ir_cat[which(loan_data$int_rate <= 8)] <- "0-8"
loan_data$ir_cat[which(loan_data$int_rate > 8 & loan_data$int_rate <= 11)] <- "8-11"
loan_data$ir_cat[which(loan_data$int_rate > 11 & loan_data$int_rate <= 13.5)] <- "11-13.5"
loan_data$ir_cat[which(loan_data$int_rate > 13.5)] <- "13.5+"
loan_data$ir_cat[which(is.na(loan_data$int_rate))] <- "Missing"

loan_data$ir_cat <- as.factor(loan_data$ir_cat)

# Look at your new variable using plot()
plot(loan_data$ir_cat)
```
## Data splitting and confusion matrices

### Splitting the data set

To make your training and test sets, you should first set a seed using set.seed(). Seeds allow you to create a starting point for randomly generated numbers, so that each time your code is run the same answer is generated. The advantage of doing this in your sampling is that you or anyone else can recreate the exact same training and test sets by using the same seed.

Using sample(), you can randomly assign observations to the training and test set.

For this exercise you will use the two first arguments in the sample() function:
-	The first argument is the vector from which we will sample values. We will randomly pick row numbers as indices; you can use 1:nrow(loan_data) to create the vector of row numbers.
-	The second argument is the number of items to choose. We will enter 2 / 3 * nrow(loan_data), as we construct the training set first.

```{r}
# Set seed of 567
set.seed(567)

# Store row numbers for training set: index_train
index_train <- sample(1:nrow(loan_data), 2 / 3 * nrow(loan_data))

# Create training set: training_set
training_set <- loan_data[index_train, ]

# Create test set: test_set
test_set <- loan_data[-index_train, ]
```
### Creating a confusion matrix

In this example, assume that you have run a model and stored the predicted outcomes in a vector called model_pred. You want to see how the model performed so you will construct a confusion matrix. You will compare the actual loan status column (loan_status) to the predicted values (model_pred), using the table() function, where the arguments are the true values and the predicted values. Recall the confusion matrix structure:

![](confusion_matrix.png)

$Classification accuracy = \frac{(TP+TN)}{(TP+FP+TN+FN)}$

$Sensitivity=\frac{TP}{(TP+FN)}$

$Specificity=\frac{TN}{(TN+FP)}$

```{r}
model_pred <- sample(c(0, 1), replace = TRUE, size = 9698)
```
```{r}
# Create confusion matrix
conf_matrix <- table(test_set$loan_status, model_pred)
conf_matrix

# Compute classification accuracy
(conf_matrix[1, 1] + conf_matrix[2, 2]) / (conf_matrix[1, 1] + conf_matrix[1, 2] + conf_matrix[2, 1] + conf_matrix[2, 2])

# Compute sensitivity
conf_matrix[2, 2] / (conf_matrix[2, 1] + conf_matrix[2, 2])
```
# Logistic regression

## Logistic regression: introduction

### Basic logistic regression

In the video, you looked at a logistic regression model including the variable age as a predictor. Now, you will include a categorical variable, and learn how to interpret its parameter estimates.

When you include a categorical variable in a logistic regression model in R, you will obtain a parameter estimate for all but one of its categories. This category for which no parameter estimate is given is called the reference category. The parameter for each of the other categories represents the odds ratio in favor of a loan default between the category of interest and the reference category.
```{r}
loan_data <- readRDS("data/loan_data_ch2.rds")
```
```{r}
# Set seed of 567
set.seed(567)

# Store row numbers for training set: index_train
index_train <- sample(1:nrow(loan_data), 2 / 3 * nrow(loan_data))

# Create training set: training_set
training_set <- loan_data[index_train, ]

# Create test set: test_set
test_set <- loan_data[-index_train, ]
```

```{r}
# Build a glm model with variable ir_cat as a predictor
log_model_cat <- glm(formula = loan_status ~ ir_cat, family = "binomial",
                     data = training_set)

# Print the parameter estimates 
log_model_cat

# Look at the different categories in ir_cat using table()
table(loan_data$ir_cat)
```
How do you interpret the parameter estimate for the interest rates that are between 8% and 11%? Compared to the reference category with interest rates between 0% and 8%, the odds in favor of default change by a multiple of…
```{r}
exp(0.5496)
```
### Multiple variables in a logistic regression model

The interpretation of a single parameter still holds when including several variables in a model. When you do include several variables and ask for the interpretation when a certain variable changes, it is assumed that the other variables remain constant, or unchanged. There is a fancy latin phrase for this, ceteris paribus, literally meaning "keeping all others the same".

To build a logistic regression model with multiple variables, you can use the + sign to add variables. Your formula will look something like:

$y ~ x1 + ... + xk$

In order to evaluate the model there are a number of things to be aware of. You already looked at the parameter values, but that is not the only thing of importance. Also important is the statistical significance of a certain parameter estimate. The significance of a parameter is often referred to as a p-value, however in a model output you will see it denoted as $Pr(>|t|)$. In glm, mild significance is denoted by a "." to very strong significance denoted by "***". When a parameter is not significant, this means you cannot assure that this parameter is significantly different from 0. Statistical significance is important. In general, it only makes sense to interpret the effect on default for significant parameters.
```{r}
# Build the logistic regression model
log_model_multi <- glm(loan_status ~ age + ir_cat + grade + loan_amnt + annual_inc, family = "binomial", data = training_set)


# Obtain significance levels using summary()
summary(log_model_multi)
```
## Logistic regression: predicting the probability of default

### Predicting the probability of default

you can predict the probability for all the test set cases at once using the predict() function.

After having obtained all the predictions for the test set elements, it is useful to get an initial idea of how good the model is at discriminating by looking at the range of predicted probabilities. A small range means that predictions for the test set cases do not lie far apart, and therefore the model might not be very good at discriminating good from bad customers. With low default percentages, you will notice that in general, very low probabilities of default are predicted. It's time to have a look at a first model.

```{r}
log_model_small <- glm(formula = loan_status ~ age + ir_cat, family = "binomial", data = training_set)
```
```{r}
# Build the logistic regression model
predictions_all_small <- predict(log_model_small, newdata = test_set, type = "response")

# Look at the range of the object "predictions_all_small"
range(predictions_all_small)
```
### Making more discriminative models

In the previous exercise, the range for predicted probabilities of default was rather small. As discussed, small predicted default probabilities are to be expected with low default rates, but building bigger models (which basically means: including more predictors) can expand the range of your predictions.

Whether this will eventually lead to better predictions still needs to be validated and depends on the quality of the newly included predictors. But first, have a look at how bigger models can expand the range.
```{r}
# Change the code below to construct a logistic regression model using all available predictors in the data set
log_model_full <- glm(loan_status ~ ., family = "binomial", data = training_set)

# Make PD-predictions for all test set elements using the the full logistic regression model
predictions_all_full <- predict(log_model_full, newdata = test_set, type = "response")

# Look at the predictions range
range(predictions_all_full)
```
## Evaluating the logistic regression model result

### Specifying a cut-off

We have shown you how the specification of a cut-off can make the difference to obtain a good confusion matrix. Now, you will learn how to transform the prediction vector to a vector of binary values indicating the status of the loan. The ifelse() function in R can help you here.

Applying the ifelse() function in the context of a cut-off, you would have something like

ifelse(predictions > 0.3, 1, 0)
In the first argument, you are testing whether a certain value in the predictions-vector is bigger than 0.3. If this is TRUE, R returns "1" (specified in the second argument), if FALSE, R returns "0" (specified in the third argument), representing "default" and "no default", respectively.
```{r}
# The code for the logistic regression model and the predictions is given below
log_model_full <- glm(loan_status ~ ., family = "binomial", data = training_set)
predictions_all_full <- predict(log_model_full, newdata = test_set, type = "response")

# Make a binary predictions-vector using a cut-off of 15%
pred_cutoff_15 <- ifelse(predictions_all_full > 0.15, 1, 0)

# Construct a confusion matrix
table(test_set$loan_status, pred_cutoff_15)
```
### Comparing two cut-offs

Have a look again at the confusion matrix using the full model and a cut-off of 15%, which is stored in the object conf_matrix_15, and another confusion matrix using a cut-off of 20% and the same model, stored in conf_matrix_20.

Accuracy increases, sensitivity decreases and specificity increases.

## Wrap-up and remarks

### Comparing link functions for a given cut-off

In this last exercise, you will fit a model using each of the three link functions (logit, probit and cloglog), make predictions for the test set, classify the predictions in the appropriate group (default versus non-default) for a given cut-off, make a confusion matrix and compute the accuracy and sensitivity for each of the models given the cut-off value! Wow, you've learned a lot so far. And finally, you will try to identify the model that performs best in terms of accuracy given the cut-off value!

It is important to know that the differences between the models will generally be very small, and again, the results will depend on the chosen cut-off value. The observed outcome (default versus non-default) is stored in true_val in the console.
```{r}
# Fit the logit, probit and cloglog-link logistic regression models
log_model_logit <- glm(loan_status ~ age + emp_cat + ir_cat + loan_amnt,
                       family = binomial(link = logit), data = training_set)
log_model_probit <- glm(loan_status ~ age + emp_cat + ir_cat + loan_amnt,
                       family = binomial(link = probit), data = training_set)

log_model_cloglog <- glm(loan_status ~ age + emp_cat + ir_cat + loan_amnt,
                       family = binomial(link = cloglog), data = training_set)
  
# Make predictions for all models using the test set
predictions_logit <- predict(log_model_logit, newdata = test_set, type = "response")
predictions_probit <- predict(log_model_probit, newdata = test_set, type = "response")
predictions_cloglog <- predict(log_model_cloglog, newdata = test_set, type = "response")
  
# Use a cut-off of 14% to make binary predictions-vectors
cutoff <- 0.14
class_pred_logit <- ifelse(predictions_logit > cutoff, 1, 0)
class_pred_probit <- ifelse(predictions_probit > cutoff, 1, 0)
class_pred_cloglog <- ifelse(predictions_cloglog > cutoff, 1, 0)
  
# Make a confusion matrix for the three models
true_val <- test_set$loan_status
tab_class_logit <- table(true_val,class_pred_logit)
tab_class_logit
tab_class_probit <- table(true_val,class_pred_probit)
tab_class_probit
tab_class_cloglog <- table(true_val,class_pred_cloglog)
tab_class_cloglog
  
# Compute the classification accuracy for all three models
acc_logit <- sum(diag(tab_class_logit)) / nrow(test_set)
acc_logit
acc_probit <- sum(diag(tab_class_probit)) / nrow(test_set)
acc_probit
acc_cloglog <- sum(diag(tab_class_cloglog)) / nrow(test_set)
acc_cloglog
```
# Decision trees

## What is a decision tree?

### Computing the gain for a tree

We looked at how the Gini-measure is used to create the perfect split for a tree. Now, you will compute the gain for the tree loaded in your workspace.

The data set contains 500 cases, 89 of these cases are defaults. This led to a Gini of 0.292632 in the root node. As a small reminder, remember that Gini of a certain node = 2 * proportion of defaults in this node * proportion of non-defaults in this node. Have a look at the code for a refresher.

    gini_root <- 2 * (89 / 500) * (411 / 500)


You will use these Gini measures to help you calculate the gain of the leaf nodes with respect to the root node. Look at the following code to get an idea of how you can use the gini measures you created to calculate the gain of a node.

    Gain = gini_root - (prop(cases left leaf) * gini_left) - (prop(cases right leaf * gini_right))
    
Compute the gini in the left hand and the right hand node, and the gain of the two leaf nodes with respect to the root node. The object containing the tree is small_tree.
```{r}
# Load package rpart in your workspace.
library(rpart)

# The Gini-measure of the root node is given below
gini_root <- 2 * 89 / 500 * 411 / 500

# Compute the Gini measure for the left leaf node
gini_ll <- 2 * (401/446)*(45/446) 

# Compute the Gini measure for the right leaf node
gini_rl <-2 * (10/54)*(44/54) 

# Compute the gain
gain <- gini_root - 446 / 500 * gini_ll - 54 / 500 * gini_rl
gain

# compare the gain-column in small_tree$splits with our computed gain, multiplied by 500, and assure they are the same

small_set <- readRDS("small_set.rds")
small_tree <- rpart(formula = status ~ age, data = small_set, method = "class", 
    control = rpart.control(minsplit = 5, cp = 0.001, maxdepth = 1))
small_tree$splits
improve <- gain * 500
improve
```
## Building decision trees using the rpart()-package

### Undersampling the training set

In the video, you saw that to overcome the unbalanced data problem, you can use under- or oversampling. The training set has been undersampled for you, such that 1/3 of the training set consists of defaults, and 2/3 of non-defaults. The resulting data set is available in your workspace and named undersampled_training_set, and contains less observations (6570 instead of 19394). In this exercise, you will create a decision tree using the undersampled data set.

You will notice that the trees in this and the next exercises are very big, so big that you cannot really read them anymore.

```{r}
# Load package rpart in your workspace.
library(rpart)

# Change the code provided in the video such that a decision tree is constructed using the undersampled training set. Include rpart.control to relax the complexity parameter to 0.001.
tree_undersample <- rpart(loan_status ~ ., method = "class",
                          data =  training_set)
```
Change the code provided such that a decision tree is constructed using the undersampled training set instead of training_set. Additionally, add the argument control = rpart.control(cp = 0.001). cp, which is the complexity parameter, is the threshold value for a decrease in overall lack of fit for any split. If cp is not met, further splits will no longer be pursued. cp's default value is 0.01, but for complex problems, it is advised to relax cp.

Plot the decision tree using the function plot and the tree object name. Add a second argument uniform = TRUE to get equal-sized branches.
```{r}
library(ROSE)
undersampled_training_set <- ovun.sample(loan_status ~ ., data = training_set, method = "under", N = 6570, seed = 1)$data
table(undersampled_training_set$loan_status )
```

```{r}
# Change the code provided in the video such that a decision tree is constructed using the undersampled training set. Include rpart.control to relax the complexity parameter to 0.001.
tree_undersample <- rpart(loan_status ~ ., method = "class",
                          data =  undersampled_training_set,  control = rpart.control(cp = 0.001))

# Plot the decision tree
plot(tree_undersample, uniform = TRUE)

# Add labels to the decision tree
text(tree_undersample)
```
### Changing the prior probabilities

As mentioned in the video, you can also change the prior probabilities to obtain a decision tree. This is an indirect way of adjusting the importance of misclassifications for each class. You can specify another argument inside rpart() to include prior probabities. The argument you are looking for has the following form
    
    parms = list(prior=c(non_default_proportion, default_proportion))

    # Change the code below such that a tree is constructed with adjusted prior probabilities.
    tree_prior <- rpart(loan_status ~ ., method = "class",
                        data = training_set)

Change the code provided such that a decision tree is constructed , including the argument parms and changing the proportion of non-defaults to 0.7, and of defaults to 0.3 (they should always sum up to 1). Additionally, include control = rpart.control(cp = 0.001) as well.

```{r}
# Change the code below such that a tree is constructed with adjusted prior probabilities.
tree_prior <- rpart(loan_status ~ ., method = "class",
                    data = training_set, parms = list(prior = c(0.7, 0.3)),
                    control = rpart.control(cp = 0.001))

# Plot the decision tree
plot(tree_prior, uniform = TRUE)

# Add labels to the decision tree
text(tree_prior)
```
### Including a loss matrix

Thirdly, you can include a loss matrix, changing the relative importance of misclassifying a default as non-default versus a non-default as a default. You want to stress that misclassifying a default as a non-default should be penalized more heavily. Including a loss matrix can again be done in the argument parms in the loss matrix.

    parms = list(loss = matrix(c(0, cost_def_as_nondef, cost_nondef_as_def, 0), ncol=2))

Doing this, you are constructing a 2x2-matrix with zeroes on the diagonal and changed loss penalties off-diagonal. The default loss matrix is all ones off-diagonal.

    tree_loss_matrix <- rpart(loan_status ~ ., method = "class",
                              data =  training_set)

Change the code provided such a loss matrix is included, with a penalization that is 10 times bigger when misclassifying an actual default as a non-default. This can be done replacing cost_def_as_nondef by 10, and cost_nondef_as_def by 1. Similar to what you've done in the previous exercises, include rpart.control to relax the complexity parameter to 0.001.
```{r}
# Change the code below such that a decision tree is constructed using a loss matrix penalizing 10 times more heavily for misclassified defaults.
tree_loss_matrix <- rpart(loan_status ~ ., method = "class",
                          data =  training_set,
                          parms = list(loss = matrix(c(0, 10, 1, 0), ncol = 2)),
                          control = rpart.control(cp = 0.001))


# Plot the decision tree
plot(tree_loss_matrix, uniform = TRUE)

# Add labels to the decision tree
text(tree_loss_matrix)
```
## Pruning the decision tree

### Pruning the tree with changed prior probabilities

Pruning a tree is necessary to avoid overfitting. There were some big trees in the previous exercises and now you will put what you have learned into practice, and prune the previously constructed tree with the changed prior probabilities. The rpart package is already loaded in your workspace.

You will first set a seed to make sure the results are reproducible as mentioned in the video, because you will be examining cross-validated error results. Results involve randomness and could differ slightly upon running the function again with a different seed.

In this exercise you will learn to identify which complexity parameter (CP) will minimize the cross-validated error results, then prune your tree based on this value.

tree_undersample
tree_prior 
tree_loss_matrix 

Use plotcp() to visualize cross-vaidated error (X-val Relative Error) in relation to the complexity parameter for tree_prior.
Use printcp() to print a table of information about CP, splits, and errors. See if you can identify which split has the minimum cross-validated error in tree_prior.
```{r}
set.seed(345)
# tree_prior is loaded in your workspace

# Plot the cross-validated error rate as a function of the complexity parameter
plotcp(tree_prior)

# Use printcp() to identify for which complexity parameter the cross-validated error rate is minimized.
printcp(tree_prior)

# Create an index for of the row with the minimum xerror
index <- which.min(tree_prior$cptable[ , "xerror"])

# Create tree_min
tree_min <- tree_prior$cptable[index, "CP"]

#  Prune the tree using tree_min
ptree_prior <- prune(tree_prior, cp = tree_min)

# Use prp() to plot the pruned tree
library(rpart.plot)
prp(ptree_prior) 
```
### Pruning the tree with the loss matrix

In this exercise, you will prune the tree that was built using a loss matrix in order to penalize misclassified defaults more than misclassified non-defaults.

```{r}
# set a seed and run the code to construct the tree with the loss matrix again
set.seed(345)
tree_loss_matrix  <- rpart(loan_status ~ ., method = "class", data = training_set,
                           parms = list(loss=matrix(c(0, 10, 1, 0), ncol = 2)),
                           control = rpart.control(cp = 0.001))

# Plot the cross-validated error rate as a function of the complexity parameter
plotcp(tree_loss_matrix)
```
Looking at the cp-plot, you will notice that pruning the tree using the minimum cross-validated error will lead to a tree that is as big as the unpruned tree, as the cross-validated error reaches its minimum for cp = 0.001. Because you would like to make the tree somewhat smaller, try pruning the tree using cp = 0.0012788. For this complexity parameter, the cross-validated error approaches the minimum observed error. Call the pruned tree ptree_loss_matrix.
```{r}
# Prune the tree using cp = 0.0012788
ptree_loss_matrix <- prune(tree_loss_matrix, cp = 0.0012788)

# Use prp() and argument extra = 1 to plot the pruned tree
prp(ptree_loss_matrix, extra = 1)
```
```{r}
# Create an index for of the row with the minimum xerror
index <- which.min(tree_loss_matrix$cptable[ , "xerror"])

# Create tree_min
tree_min <- tree_loss_matrix$cptable[index, "CP"]
# Prune the tree using cp = 0.0012788
ptree_loss_matrix <- prune(tree_loss_matrix, cp = tree_min)

# Use prp() and argument extra = 1 to plot the pruned tree
prp(ptree_loss_matrix, extra = 1)
```
## Other tree options and the construction of confusion matrices

### One final tree using more options

You will use some final arguments that were discussed in the video. Some specifications in the rpart.control()-function will be changed, and some weights will be included using the weights argument in rpart(). The vector case_weights has been constructed for you and is loaded in your workspace. This vector contains weights of 1 for the non-defaults in the training set, and weights of 3 for defaults in the training sets. By specifying higher weights for default, the model will assign higher importance to classifying defaults correctly.
```{r}
case_weights <- ifelse(training_set$loan_status == 0, 1, 3)
```

```{r}
# set a seed and run the code to obtain a tree using weights, minsplit and minbucket
set.seed(345)
tree_weights <- rpart(loan_status ~ ., method = "class",
                      data = training_set,
                      weights = case_weights, 
                      control = rpart.control(minsplit = 5, minbucket = 2, cp = 0.001))

# Plot the cross-validated error rate for a changing cp
plotcp(tree_weights)

# Create an index for of the row with the minimum xerror
index <- which.min(tree_weights$cp[ , "xerror"])

# Create tree_min
tree_min <- tree_weights$cp[index, "CP"]

# Prune the tree using tree_min
ptree_weights <- prune(tree_weights, cp = tree_min)

# Plot the pruned tree using the rpart.plot()-package
prp(ptree_weights, extra = 1)
```
### Confusion matrices and accuracy of our final trees

Over the past few exercises, you have constructed quite a few pruned decision trees, with four in total. As you can see, the eventual number of splits varies quite a bit from one tree to another:

    ptree_undersample  # 7 splits
    ptree_prior  # 9 splits
    ptree_loss_matrix  # 24 splits
    ptree_weights  # 6 splits

Now it is important to know which tree performs best in terms of accuracy. In order to get the accuracy, you will start off by making predictions using the test set, and construct the confusion matrix for each of these trees. You will add the argument type = "class" when doing these predictions. By doing this there is no need to set a cut-off.
Nevertheless, it is important to be aware of the fact that not only the accuracy is important, but also the sensitivity and specificity. Additionally, predicting probabilities instead of binary values (0 or 1) has the advantage that the cut-off can be moved along. Then again, the difficulty here is the choice of the cut-off.

$Classification accuracy = \frac{(TP+TN)}{(TP+FP+TN+FN)}$

```{r}
ptree_undersample <- tree_undersample
ptree_prior <- tree_prior 
ptree_loss_matrix <- tree_loss_matrix
ptree_weights <- tree_weights
# Make predictions for each of the pruned trees using the test set.
pred_undersample <- predict(ptree_undersample, newdata = test_set,  type = "class")
pred_prior <- predict(ptree_prior, newdata = test_set,  type = "class")
pred_loss_matrix <- predict(ptree_loss_matrix, newdata = test_set,  type = "class")
pred_weights <- predict(ptree_weights, newdata = test_set,  type = "class")

# construct confusion matrices using the predictions.
confmat_undersample <- table(test_set$loan_status, pred_undersample)
confmat_prior <- table(test_set$loan_status, pred_prior)
confmat_loss_matrix <- table(test_set$loan_status, pred_loss_matrix)
confmat_weights <- table(test_set$loan_status, pred_weights)

# Compute the accuracies
acc_undersample <- sum(diag(confmat_undersample)) / nrow(test_set)
acc_undersample
acc_prior <- sum(diag(confmat_prior)) / nrow(test_set)
acc_prior
acc_loss_matrix <- sum(diag(confmat_loss_matrix)) / nrow(test_set)
acc_loss_matrix
acc_weights <- sum(diag(confmat_weights)) / nrow(test_set)
acc_weights
```
# Evaluating a credit risk model

## Finding the right cut-off: the strategy curve

In the video, you learned how to compute the bad rate (or, the percentage of defaults) in the loan portfolio of a bank when given:

-	a specific model
-	the acceptance rate

In this exercise, you will compute the bad rate that a bank can expect when using the pruned tree ptree_prior that you fitted before, and an acceptance rate of 80%.

n the script, you are provided the code to make predictions for the probability of default using the pruned tree and test_set. Remember that if you use the predict() function for a tree, the probability of default can be found in the second column. Therefore [,2] was pasted to the predict() function.

Obtain the cut-off that leads to an acceptance rate of 80%, using prob_default_prior. You can use the quantile()- function to do this, setting the second argument to 0.8. Assign the name cutoff_prior.
The code to obtain the actual binary default predictions (0 or 1) is provided. ifelse() here. Name the object bin_pred_prior_80.

The code to select the default indicators of test_set for the accepted loans acording to a 80% acceptance rate is provided.

Compute the percentage of defaults (or the "bad rate") for the accepted loans. This is the number of occurences of 1 in accepted_status_prior_80, divided by the total number of instances in this vector. Print the solution to your R-console.
```{r}
# Make predictions for the probability of default using the pruned tree and the test set.
prob_default_prior <- predict(ptree_prior, newdata = test_set)[ ,2]

# Obtain the cutoff for acceptance rate 80%
cutoff_prior <- quantile(prob_default_prior, 0.8)  

# Obtain the binary predictions.
bin_pred_prior_80 <- ifelse(prob_default_prior > cutoff_prior, 1, 0)

# Obtain the actual default status for the accepted loans
accepted_status_prior_80 <- test_set$loan_status[bin_pred_prior_80 == 0]

# Obtain the bad rate for the accepted loans
sum(accepted_status_prior_80) / length(accepted_status_prior_80)
```
### The strategy table and strategy curve

Repeating the calculations you did in the previous exercise for several acceptance rates, you can obtain a strategy table. This table can be a useful tool for banks, as they can give them a better insight to define an acceptance strategy.

You know how to compute a bad rate for a certain acceptance rate by now, so the function strategy_bank was written and loaded into your workspace to speed things up. This function computes the cut-off and bad rate for the acceptance rates that are multiples of 5% (0%, 5%, 10%, …).
```{r}
strategy_bank <- function(prob_of_def){
    cutoff = rep(NA, 21)
    bad_rate = rep(NA, 21)
    accept_rate = seq(1, 0, by = -0.05)
    for (i in 1:21) {
        cutoff[i] = quantile(prob_of_def, accept_rate[i])
        pred_i = ifelse(prob_of_def > cutoff[i], 1, 0)
        pred_as_good = test_set$loan_status[pred_i == 0]
        bad_rate[i] = sum(pred_as_good)/length(pred_as_good)
    }
        table = cbind(accept_rate, cutoff = round(cutoff, 4), bad_rate = round(bad_rate, 
        4))
    return(list(table = table, bad_rate = bad_rate, accept_rate = accept_rate, 
        cutoff = cutoff))
}
```

- Have a look at the function strategy_bank.
- The vector predictions_cloglog contains predicted probabilities of default using the cloglog model you used in chapter 2, the vector predictions_loss_matrixcontains the predicted probabilities of default using the pruned tree including a loss matrix (previously constructed in chapter 3). Apply function strategy_bank to each of the prediction-vectors, assign the name strategy_cloglog and strategy_loss_matrix respectively.
- The strategy tables can be obtained using the object names in combination with $table.
- The strategy curves have been plotted for you. The strategy curve of the tree model shows pretty strange behavior. Because of the structure of classification trees, you might have a bigger chance for weird "jumps" here. Additionally, the tree with a loss matrix was a very big one, so this might be the result of overfitting!
```{r}
#withougth type = "class" to get probabilities
predictions_loss_matrix <- predict(ptree_loss_matrix, newdata = test_set)[,2]
```

```{r}
# Apply the function strategy_bank to both predictions_cloglog and predictions_loss_matrix
strategy_cloglog <- strategy_bank(predictions_cloglog)
strategy_loss_matrix <- strategy_bank(predictions_loss_matrix)

# Obtain the strategy tables for both prediction-vectors
strategy_cloglog$table
strategy_loss_matrix$table

# Plot the strategy functions
par(mfrow = c(1,2))
plot(strategy_cloglog$accept_rate, strategy_cloglog$bad_rate, 
     type = "l", xlab = "Acceptance rate", ylab = "Bad rate", 
     lwd = 2, main = "logistic regression")

plot(strategy_loss_matrix$accept_rate, strategy_loss_matrix$bad_rate, 
     type = "l", xlab = "Acceptance rate", 
     ylab = "Bad rate", lwd = 2, main = "tree")
```
## The ROC-curve

### ROC-curves for comparison of logistic regression models

ROC-curves can easily be created using the pROC-package in R. Let's have a look if there is a big difference between ROC-curves for the four logistic regression-models previously used throughout this course. A small heads up:

* predictions_logit contains probability of default (PD) predictions using the default logit link and containing variables age, emp_cat, ir_cat and loan_amnt.
* predictions_probit contains PD-predictions using the probit and containing variables age, emp_cat, ir_cat and loan_amnt.
* predictions_cloglog contains PD-predictions using the cloglog link and containing variables age, emp_cat, ir_cat and loan_amnt.
* predictions_all_full contains PD-predictions using the default logit link and containing all seven variables in the data set.

Load the pROC-package in your R-console.
Construct the ROC-objects for the four logistic regression models using function roc(response, predictor). Remember that the response is the loan status indicator in the test_set, which can be obtained through test_set$loan_status.
Use the previously created objects to construct ROC-curves. To draw them all on one plot, use plot() for the first ROC-curve drawn (for ROC_logit), and use [lines()](http://www.rdocumentation.org/packages/graphics/functions/lines to add the ROC-curves) for the other three models to the same plot.
Use the col-argument to change the color of the curve of ROC_probit to "blue", ROC_cloglog to "red" and ROC_all_full to "green". Note that, in contrast with what has been discussed in the video, the x-axis label is Specificity and not "1-Specificity", resulting in an axis that goes from 1 on the left-hand side to 0 on the right-hand side.
It seems that the link function does not have a big impact on the ROC here, and the main trigger of a better ROC is the inclusion of more variables in a model. To get an exact idea of the performance of the ROC-curves, have a look at the AUC's, using function auc().

```{r}
# Load the pROC-package
library(pROC)

# Construct the objects containing ROC-information
ROC_logit <- roc(test_set$loan_status, predictions_logit)
ROC_probit <- roc(test_set$loan_status, predictions_probit)
ROC_cloglog <- roc(test_set$loan_status, predictions_cloglog)
ROC_all_full <- roc(test_set$loan_status, predictions_all_full)

# Draw all ROCs on one plot
plot(ROC_logit)
lines(ROC_probit, col="blue")
lines(ROC_cloglog, col="red")
lines(ROC_all_full, col="green")

# Compute the AUCs
auc(ROC_logit)
auc(ROC_probit)
auc(ROC_cloglog)
auc(ROC_all_full)
```
### ROC-curves for comparison of tree-based models
It's time for you to repeat the previous exercises, now comparing the tree-based models. The pROC() is now loaded in your workspace. The PD-predictions for tree-based methods are stored in the objects

predictions_undersample
predictions_prior
predictions_loss_matrix
predictions_weights
```{r}
predictions_undersample <- predict(tree_undersample, newdata = test_set)[,2]
predictions_prior <- predict(tree_prior, newdata = test_set)[,2]
predictions_loss_matrix <- predict(ptree_loss_matrix, newdata = test_set)[,2]
predictions_weights <- predict(tree_weights , newdata = test_set)[,2]
```


```{r}
# Construct the objects containing ROC-information
ROC_undersample <- roc(test_set$loan_status, predictions_undersample)
ROC_prior <- roc(test_set$loan_status, predictions_prior)
ROC_loss_matrix <- roc(test_set$loan_status, predictions_loss_matrix)
ROC_weights <- roc(test_set$loan_status, predictions_weights)

# Draw the ROC-curves in one plot
plot(ROC_undersample)
lines(ROC_prior, col="blue")
lines(ROC_loss_matrix, col="red")
lines(ROC_weights, col="green")

# Compute the AUCs
auc(ROC_undersample)
auc(ROC_prior)
auc(ROC_loss_matrix)
auc(ROC_weights)
```
## Input selection based on the AUC


### Another round of pruning based on AUC
In the video you saw how the "full" logistic regression model with a logit link was being pruned based on the AUC. You saw how the variable home_ownership was deleted from the model, as it improved the overall AUC. After repeating this process for two additional rounds, the variables age and ir_cat were deleted, leading to the model:

    log_3_remove_ir <- glm(loan_status ~ loan_amnt + grade + annual_inc + emp_cat, family = binomial, data = training_set)

with an AUC of 0.6545. Now, it's your turn to see whether the AUC can still be improved by deleting another variable from the model.
```{r}
# Build four models each time deleting one variable in log_3_remove_ir
log_4_remove_amnt <- glm(loan_status ~ grade + annual_inc + emp_cat, 
                         family = binomial, data = training_set) 
log_4_remove_grade <- glm(loan_status ~ loan_amnt  + annual_inc + emp_cat, 
                          family = binomial, data = training_set)
log_4_remove_inc <- glm(loan_status ~ loan_amnt + grade  + emp_cat , 
                        family = binomial, data = training_set)
log_4_remove_emp <- glm(loan_status ~ loan_amnt + grade + annual_inc, 
                        family = binomial, data = training_set)

# Make PD-predictions for each of the models
pred_4_remove_amnt <- predict(log_4_remove_amnt, newdata = test_set, type = "response")
pred_4_remove_grade <- predict(log_4_remove_grade, newdata = test_set, type = "response")
pred_4_remove_inc <- predict(log_4_remove_inc, newdata = test_set, type = "response")
pred_4_remove_emp <- predict(log_4_remove_emp, newdata = test_set, type = "response")

# Compute the AUCs
auc(test_set$loan_status, pred_4_remove_amnt)
auc(test_set$loan_status, pred_4_remove_grade)
auc(test_set$loan_status, pred_4_remove_inc)
auc(test_set$loan_status, pred_4_remove_emp)
```
### Further model reduction?

Deleting the variable loan_amnt, the AUC can be further improved to 0.6548! The resulting model is

    log_4_remove_amnt <- glm(loan_status ~ grade + annual_inc + emp_cat, family = binomial, data = training_set) 

Is it possible to reduce the logistic regression model to only two variable without reducing the AUC? In this exercise you will see if it is possible!
```{r}
# Build three models each time deleting one variable in log_4_remove_amnt
log_5_remove_grade <- glm(loan_status ~ annual_inc + emp_cat, family = binomial, data = training_set)  
log_5_remove_inc <- glm(loan_status ~ grade  + emp_cat , family = binomial, data = training_set)
log_5_remove_emp <- glm(loan_status ~ grade + annual_inc, family = binomial, data = training_set)

# Make PD-predictions for each of the models
pred_5_remove_grade <- predict(log_5_remove_grade, newdata = test_set, type = "response")
pred_5_remove_inc <- predict(log_5_remove_inc, newdata = test_set, type = "response")
pred_5_remove_emp <- predict(log_5_remove_emp, newdata = test_set, type = "response")

# Compute the AUCs
auc(test_set$loan_status, pred_5_remove_grade)
auc(test_set$loan_status, pred_5_remove_inc)
auc(test_set$loan_status, pred_5_remove_emp)

# Plot the ROC-curve for the best model here
plot(roc(test_set$loan_status,pred_4_remove_amnt))
```

