Logistic Regression Model: Introduction

What is logistic regression? A regression model whose output is a probability between 0 and 1.

Fitting a logistic model in R

log_model <- glm(loan_status ~ age , family = 'binomial', data = training_set)
log_model

Printing the model shows the fitted coefficients:

  • The intercept is Beta0
  • The coefficient of ‘age’ is Beta1

The fitted model is logit(P(default)) = log(p / (1 - p)) = Beta0 + Beta1 * age.

Loading data for course:

loan_data <- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/162/datasets/8f48a2cbb6150e7ae32435e55f271cad5b4b8ecf/loan_data_ch1.rds")))

##################################
loan_data$ir_cat <- rep(NA, length(loan_data$int_rate))

loan_data$ir_cat[which(loan_data$int_rate <= 8)] <- "0-8"
loan_data$ir_cat[which(loan_data$int_rate > 8 & loan_data$int_rate <= 11)] <- "8-11"
loan_data$ir_cat[which(loan_data$int_rate > 11 & loan_data$int_rate <= 13.5)] <- "11-13.5"
loan_data$ir_cat[which(loan_data$int_rate > 13.5)] <- "13.5+"
loan_data$ir_cat[which(is.na(loan_data$int_rate))] <- "Missing"

loan_data$ir_cat <- as.factor(loan_data$ir_cat)

#Question
# Make the necessary replacements in the coarse classification example below 
loan_data$emp_cat <- rep(NA, length(loan_data$emp_length))

loan_data$emp_cat[which(loan_data$emp_length <= 15)] <- "0-15"
loan_data$emp_cat[which(loan_data$emp_length > 15 & loan_data$emp_length <= 30)] <- "15-30"
loan_data$emp_cat[which(loan_data$emp_length > 30 & loan_data$emp_length <= 45)] <- "30-45"
loan_data$emp_cat[which(loan_data$emp_length > 45)] <- "45+"
loan_data$emp_cat[which(is.na(loan_data$emp_length))] <- "Missing"

loan_data$emp_cat <- as.factor(loan_data$emp_cat)

# Look at your new variable using plot()
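# plot() on a factor draws a bar chart of the category counts
plot(loan_data$emp_cat)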

str(loan_data)
## 'data.frame':    29092 obs. of  10 variables:
##  $ loan_status   : int  0 0 0 0 0 0 1 0 1 0 ...
##  $ loan_amnt     : int  5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
##  $ int_rate      : num  10.6 NA 13.5 NA NA ...
##  $ grade         : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 3 2 2 4 ...
##  $ emp_length    : int  10 25 13 3 9 11 0 3 3 0 ...
##  $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 3 4 4 4 4 ...
##  $ annual_inc    : num  24000 12252 49200 36000 48000 ...
##  $ age           : int  33 31 24 39 24 28 22 22 28 22 ...
##  $ ir_cat        : Factor w/ 5 levels "0-8","11-13.5",..: 4 5 2 5 5 2 2 4 4 3 ...
##  $ emp_cat       : Factor w/ 5 levels "0-15","15-30",..: 1 2 1 1 1 1 1 1 1 1 ...
##################################
# Set seed of 567
set.seed(567)

# Store row numbers for training set: index_train
index_train <- sample(1:nrow(loan_data), 2/3*nrow(loan_data))

# Create training set: training_set
training_set <- loan_data[index_train, ]

# Create test set: test_set
test_set <- loan_data[-index_train, ]


Exercise 1) Basic logistic regression

Instructions:

  • Construct a logistic regression model called log_model_cat with the categorical variable ir_cat as the only predictor. Your call to glm() should contain three arguments:
    • loan_status ~ ir_cat
    • family = “binomial”
    • data = training_set
  • View the result in the console to see your parameter estimates.
  • Find out what the reference category is by looking at the structure of ir_cat (in the full data set loan_data) again. Use the table() function to do this.
# Build a glm model with variable ir_cat as a predictor
log_model_cat <- glm(formula = loan_status ~ ir_cat, family = "binomial", data = training_set)

# Print the parameter estimates 
log_model_cat
## 
## Call:  glm(formula = loan_status ~ ir_cat, family = "binomial", data = training_set)
## 
## Coefficients:
##   (Intercept)  ir_cat11-13.5    ir_cat13.5+     ir_cat8-11  ir_catMissing  
##       -2.9091         1.0071         1.3874         0.5933         0.7937  
## 
## Degrees of Freedom: 19393 Total (i.e. Null);  19389 Residual
## Null Deviance:       13460 
## Residual Deviance: 13050     AIC: 13060
# Look at the different categories in ir_cat using table(); the level that does not
# appear among the coefficients above (0-8) is the reference category
table(loan_data$ir_cat)
## 
##     0-8 11-13.5   13.5+    8-11 Missing 
##    7130    6954    6002    6230    2776

Exercise 2) Multiple variables in a logistic regression model

  • Use the “+” sign to add multiple variables to the logistic regression model.
  • Your formula should look like y ~ x1 + … + xk

Instruction:

  • Create a logistic regression model using the glm() function, and the training_set. Include the variables age, ir_cat, grade, loan_amnt, and annual_inc. Call this model log_model_multi.
  • Obtain the significance levels using summary() in combination with your model. You will look more deeply into what significance levels mean in the next exercise!
# Build the logistic regression model
log_model_multi <- glm(loan_status ~ age + ir_cat + grade + loan_amnt +
                      annual_inc , family = "binomial", data = training_set)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Obtain significance levels using summary()
summary(log_model_multi)
## 
## Call:
## glm(formula = loan_status ~ age + ir_cat + grade + loan_amnt + 
##     annual_inc, family = "binomial", data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1959  -0.5329  -0.4383  -0.3322   3.2329  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.428e+00  1.284e-01 -18.920  < 2e-16 ***
## age           -3.980e-03  3.863e-03  -1.030   0.3029    
## ir_cat11-13.5  7.018e-01  1.313e-01   5.344 9.08e-08 ***
## ir_cat13.5+    6.556e-01  1.462e-01   4.485 7.31e-06 ***
## ir_cat8-11     4.867e-01  1.160e-01   4.196 2.71e-05 ***
## ir_catMissing  5.067e-01  1.275e-01   3.973 7.09e-05 ***
## gradeB         1.427e-01  1.041e-01   1.370   0.1707    
## gradeC         4.806e-01  1.200e-01   4.006 6.18e-05 ***
## gradeD         8.033e-01  1.365e-01   5.887 3.93e-09 ***
## gradeE         9.689e-01  1.643e-01   5.896 3.73e-09 ***
## gradeF         1.512e+00  2.296e-01   6.584 4.58e-11 ***
## gradeG         2.072e+00  3.443e-01   6.019 1.76e-09 ***
## loan_amnt     -8.438e-06  4.208e-06  -2.005   0.0449 *  
## annual_inc    -4.788e-06  7.365e-07  -6.502 7.95e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13460  on 19393  degrees of freedom
## Residual deviance: 12892  on 19380  degrees of freedom
## AIC: 12920
## 
## Number of Fisher Scoring iterations: 5

To evaluate the model, look at the p-values, reported in the column Pr(>|z|).
In the summary() output, mild significance is marked with a “.” and very strong significance with "***".
We want to keep predictors with LOW p-values (high significance); predictors with high p-values contribute little to the model.
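If you want the p-values programmatically rather than reading them off the printout, they are stored in the coefficient matrix of the summary object (standard R, not course code):

# Extract the Pr(>|z|) column of the coefficient matrix and sort it
p_values <- coef(summary(log_model_multi))[, "Pr(>|z|)"]
sort(p_values)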

Logistic Regression: Predicting the probability of default

Example with “age” and “home ownership”.

log_model_small <- glm(loan_status ~ age + home_ownership, family = "binomial", data = training_set)
log_model_small
## 
## Call:  glm(formula = loan_status ~ age + home_ownership, family = "binomial", 
##     data = training_set)
## 
## Coefficients:
##         (Intercept)                  age  home_ownershipOTHER  
##           -2.047183            -0.006617             0.534611  
##   home_ownershipOWN   home_ownershipRENT  
##            0.162187             0.241760  
## 
## Degrees of Freedom: 19393 Total (i.e. Null);  19389 Residual
## Null Deviance:       13460 
## Residual Deviance: 13430     AIC: 13440

Making Predictions in R

Use the predict() function to predict the probability of default for one or many test set cases.
Always make sure that the test cases are stored in a data frame.

By default, predict() returns the linear predictor (the log-odds). By setting the type argument to “response”, you get the actual predicted probability of default for this specific customer.

test_case <- as.data.frame(test_set[1,])
test_case
##   loan_status loan_amnt int_rate grade emp_length home_ownership
## 2           0      2400       NA     C         25           RENT
##   annual_inc age  ir_cat emp_cat
## 2      12252  31 Missing   15-30
predict(log_model_small, newdata = test_case)
##         2 
## -2.010561
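Adding type = "response" returns the probability itself; applying the logistic function to the log-odds printed above should give the same number (a quick check):

predict(log_model_small, newdata = test_case, type = "response")
# equivalent to transforming the linear predictor by hand:
1 / (1 + exp(-(-2.010561)))   # about 0.118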

Exercise 1) Predicting the probability of default

The video reviewed the predicted probability of just one test case, but by using the predict() function, you can do all test cases at once!

It’s useful to get an initial idea of how good the model is at discriminating by looking at the range of predicted probabilities. A small range means that the test-case predictions do NOT lie far apart, so the model will not be good at discriminating good customers from bad ones.

Instructions:

  • The code for the prediction of test_case in the video is copied in your workspace. Change the code so that predict() is applied to all cases in test_set. Store the result in the object predictions_all_small.
  • Get an initial idea of how well the model can discriminate using range().

# Build the logistic regression model
predictions_all_small <- predict(log_model_small, newdata = test_set, type = "response")

# Look at the range of the object "predictions_all_small"
range(predictions_all_small)
## [1] 0.07793829 0.16090135

Exercise 2) Making more discriminative models

In the exercise above, the range for the predicted probabilities of default was small. We can expand the range by adding more predictors and building a bigger model.
The model’s quality will depend on the predictive power of the newly added predictors.

Instructions:
  • Make log_model_full the same way you made log_model_small, but this time include all available predictors in the data set. If you don’t want to type the name of every column separately, you can select all variables using loan_status ~ .
  • Create your prediction vector predictions_all_full for all the cases in the test set using predict(). Notice that these values represent the probability of defaulting.
  • Look at the range of the predictions.

# Build the logistic regression model
log_model_full <- glm(loan_status ~ ., family = "binomial", data = training_set)

# Make PD-predictions for all test set elements using the the full logistic regression model
predictions_all_full <- predict(log_model_full, newdata = test_set, type = "response")

# Look at the predictions range
range(predictions_all_full)

Evaluating the logistic regression model result

Recap: model evaluation

Using the predicted probabilities, you can build a confusion matrix that compares the actual loan status in test_set to the predicted loan status.

To turn the predicted probabilities into binary predictions, we need a “cutoff” or “threshold value” between 0 and 1.
A seemingly logical cutoff would be 0.5.

Exercise 1) Specifying a cut-off

Transform the vector of predicted probabilities into a binary vector that indicates the predicted status of each loan.
Use the ifelse() function to convert the predicted values to binary.

Instructions:

  • The code for the full logistic regression model along with the predictions vector is given in your console.
  • Using a cutoff of 0.15, create the vector pred_cutoff_15 using the ifelse() function and predictions_all_full.
  • Look at the confusion matrix using table() (enter the true values, test_set$loan_status, as the first argument).

# The code for the logistic regression model and the predictions is given below
log_model_full <- glm(loan_status ~ ., family = "binomial", data = training_set)
predictions_all_full <- predict(log_model_full, newdata = test_set, type = "response")

# Make a binary predictions-vector using a cut-off of 15%
pred_cutoff_15 <- ifelse(predictions_all_full > 0.15, 1, 0)

# Construct a confusion matrix
conf_matrix <- table(test_set$loan_status, pred_cutoff_15)

Wrap-up and remarks

What is the best (optimal) cut-off value?

To determine the best cut-off value, we need to look at measures derived from the confusion matrix (a sketch of the computations follows this list):

  1. Accuracy = the diagonal: the percentage of correctly classified instances.
  2. Sensitivity = the bottom row: the percentage of bad customers (defaults) that are classified correctly.
  3. Specificity = the top row: the percentage of good customers (non-defaults) that are classified correctly.
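A minimal sketch of these computations on conf_matrix from the previous exercise (assuming the true status is in the rows, the predictions in the columns, and that both classes appear among the predictions):

# Accuracy: correctly classified cases over all cases
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)

# Sensitivity: defaults (row "1") that are classified correctly
sensitivity <- conf_matrix["1", "1"] / sum(conf_matrix["1", ])

# Specificity: non-defaults (row "0") that are classified correctly
specificity <- conf_matrix["0", "0"] / sum(conf_matrix["0", ])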

Accuracy

In the course chart (not reproduced here), accuracy rises steeply for cut-offs up to about 25%, rises slightly between 25% and 51%, and stays flat for all cut-offs above 51%.

Accuracy increasing with the cut-off is a typical result in credit risk modeling, and more generally in any logistic model with unbalanced groups. Looking at accuracy alone, a 51% cut-off would seem to be the best option. However, above 51% every case is predicted as non-default, so the high accuracy merely reflects the class imbalance: the model no longer identifies any defaults.

Sensitivity & Specificity

Unlike accuracy, sensitivity strictly decreases as the cut-off increases, while specificity strictly increases.

If we take a cut-off of 0, every case is predicted as a default:

  • Sensitivity: 1037 / (1037 + 0) = 100%
  • Specificity: 0 / (0 + 8640) = 0%

At the opposite extreme, a cut-off of 1 predicts every case as a non-default:

  • Sensitivity: 0 / (0 + 1037) = 0%
  • Specificity: 8640 / (8640 + 0) = 100%

The trade-off between Sensitivity & Specificity ALWAYS exists.

More Logistic Regression Models

Logit

By default, glm() with family = "binomial" uses the logit link, so it is the same model as one fit with family = binomial(link = logit).

log_model_full <- glm(loan_status ~ ., family = "binomial", data = training_set)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# This is the SAME as explicitly specifying the logit link:
log_model_full <- glm(loan_status ~ ., family = binomial(link = logit), data = training_set)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Both calls compute the same probability of default.

Probit & Cloglog

log_model_full <- glm(loan_status ~ ., family = binomial(link = probit), data = training_set)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
log_model_full <- glm(loan_status ~ ., family = binomial(link = cloglog), data = training_set)

Note: a negative parameter estimate still implies a DECREASE in default probability and a positive parameter estimate an INCREASE, whatever the link.
However, the function mapping the linear predictor to the actual probability of default differs between links, so translating parameter estimates into default probabilities is more involved than with the logit link.
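As an illustration of how the same linear predictor maps to a default probability under each link (the value of eta below is just an illustrative number, not a fitted one):

eta <- -2                # hypothetical linear predictor (intercept + coefficients * x)
plogis(eta)              # logit link:  1 / (1 + exp(-eta)), about 0.119
pnorm(eta)               # probit link: standard normal CDF, about 0.023
1 - exp(-exp(eta))       # cloglog link, about 0.127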

Decision Tree

Another way of classifying cases (default or non-default) is with decision trees.
Starting from the top (the root node), you answer a sequence of questions until you reach a leaf node at the bottom, where the case is classified as default or non-default.

How are “splitting decisions” made?

For each node of the tree, the splitting decision will affect the final structure of the tree.

Example questions:

  • Categorical variable: should we split on “Home ownership = RENT” or on “Home ownership = RENT or OTHER”?
  • Continuous variable: should we split at “age > 40” or at “age > 45”?

We can answer these questions by measuring the Impurity.

Our goal is to minimize the impurity in each node of the tree.
A popular measure, and the default in rpart, is the Gini measure.

Gini measure example

Assume you have a training set of 500 cases (250 defaults & 250 non-defaults).
A candidate split sends 270 of these cases to the left node N1 and the remaining 230 to the right node N2 (the class counts per node appear in the Gini computations below).

The perfect split would send all defaults to one node and all non-defaults to the other. Since this is rarely the case, we quantify how far a split is from perfect using the Gini measure of impurity.

The impurity of a node is given by 2 times the proportion of defaults in the node times the proportion of non-defaults in the node:

  • Gini = 2 * prop(default) * prop(non-default)

First, calculate the Gini measure for the root node:

  • Gini_Root = 2 * 250/500 * 250/500 = 0.5
  • This is the maximum possible value, reached when there are exactly as many defaults as non-defaults.

Next, calculate the Gini measure for the leaf nodes.

  • Gini_N1 = 2 * 170/270 * 100/270 = 0.4664
  • Gini_N2 = 2 * 80/230 * 150/230 = 0.4536

Finally, calculate the PURITY GAIN achieved by going from the root node to the two leaf nodes N1 & N2.

  • Calculate the Purity Gain by subtracting the weighted Gini measures in the Nodes N1 and N2 from the gini measure in the Root.

Formula:
Purity Gain = Gini_Root - (prop(cases in N1) * Gini_N1) - (prop(cases in N2) * Gini_N2)

From our example: Purity Gain = 0.5 - (270/500 * 0.4664) - (230/500 * 0.4536) = 0.039488

The algorithm will compute the possibilities and give us the leaf node that has the highest purity gain.
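A quick way to check these numbers in R (the counts are the ones from the example above):

# Gini impurity of a node containing n1 cases of one class and n2 of the other
gini_node <- function(n1, n2) 2 * (n1 / (n1 + n2)) * (n2 / (n1 + n2))

gini_root <- gini_node(250, 250)   # 0.5
gini_n1   <- gini_node(170, 100)   # ~0.4664
gini_n2   <- gini_node(80, 150)    # ~0.4536
gini_root - 270/500 * gini_n1 - 230/500 * gini_n2   # purity gain, ~0.0395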

Exercise 1)


Instructions:

  • The computation for the Gini of the root node is given.
  • Compute the Gini measure for the left leaf node.
  • Compute the Gini measure for the right leaf node.
  • Compute the gain by taking the difference between the root node Gini and the weighted leaf node Gini measures.
  • Information regarding the split in this tree can be found using $splits on the tree object, small_tree. Instead of gain, you should look at the improve column here. improve is an alternative metric for gain, obtained simply by multiplying gain by the number of cases in the data set. Make sure that the object improve (code given) has the same value as in small_tree$splits.
# The Gini-measure of the root node is given below
gini_root <- 2 * 89 / 500 * 411 / 500

# Compute the Gini measure for the left leaf node
gini_ll <- 2 * 401/446 * 45/446

# Compute the Gini measure for the right leaf node
gini_rl <- 2 * 10/54 * 44/54

# Compute the gain
gain <- gini_root - 446 / 500 * gini_ll - 54 / 500 * gini_rl

# Compare the improve column in small_tree$splits with our computed gain multiplied
# by 500 (the number of cases) and check that they are the same.
# Note: small_tree is only available in the DataCamp console.
improve <- gain * 500
small_tree$splits

Building decision trees using rpart package

Imagine having to evaluate every possible split by hand; that would be quite a challenge.
This is where the rpart package comes in.

Warning:

  • It’s hard to build good decision trees for credit risk data.
  • Main reason: unbalanced data
    • credit data is very unbalanced, with a low percentage of defaults.

Using the default settings on the training_set of loan_data, including all variables (loan_status ~ .) and specifying method = “class” (because the response loan_status is categorical), we get a warning message that only a root node is created.

library(rpart)

fit_default <- rpart(loan_status ~ ., method = "class", data = training_set)

#observe the warning message!
plot(fit_default)

The reason for the warning is that, similarly to what we observed when setting a cut-off for the logistic regression model, the highest accuracy is achieved by simply predicting every case as “non-default”, so the tree never splits.

Three techniques to overcome unbalance

There are three main techniques for dealing with the unbalanced data (the proportion of defaults in the data set is very low); rough sketches of how these options look in R follow at the end of this section.

1. Over-sampling OR under-sampling
+ You can over-sample the under-represented group (defaults) OR under-sample the over-represented group (non-defaults).
+ Balancing the training set helps the model pay proper attention to the defaults.
+ Over- or under-sampling should ONLY be applied to the training_set and NEVER to the test_set.

2. Changing the prior probabilities in the rpart() function
+ By default, the prior probabilities of default vs. non-default are set equal to their proportions in the training_set. By making the prior probability of “default” larger, you can trick R into attaching more importance to defaults, leading to a better decision tree.

3. Including a loss matrix in the rpart() function
+ In this loss matrix, different costs can be associated with misclassifying a “default” as a “non-default” versus misclassifying a “non-default” as a “default”.
+ By increasing the misclassification cost of the former, more weight is given to the correct classification of defaults, improving the quality of the decision tree.

The techniques should ALWAYS be validated so we can decide on the best model.
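As a rough sketch of what these options can look like in R (the sampling scheme, prior probabilities and loss values below are illustrative choices, not the course’s settings):

# Option 1: under-sample the non-defaults so the training set is more balanced
defaults     <- training_set[training_set$loan_status == 1, ]
non_defaults <- training_set[training_set$loan_status == 0, ]
undersampled_train <- rbind(defaults,
                            non_defaults[sample(nrow(non_defaults), nrow(defaults)), ])

# Option 2: inflate the prior probability of default (priors follow the order of the
# class levels of loan_status: non-default first, then default)
fit_prior <- rpart(loan_status ~ ., method = "class", data = training_set,
                   parms = list(prior = c(0.7, 0.3)))

# Option 3: penalize misclassifying a true default as a non-default more heavily.
# In rpart's loss matrix, rows are the true classes and columns the predicted ones,
# so the entry below makes a missed default 10 times as costly as a false alarm.
fit_loss <- rpart(loan_status ~ ., method = "class", data = training_set,
                  parms = list(loss = matrix(c(0,  1,
                                               10, 0), ncol = 2, byrow = TRUE)))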