What is logistic regression? A regression model whose output lies between 0 and 1 and can be interpreted as a probability.
log_model <- glm(loan_status ~ age , family = 'binomial', data = training_set)
log_model
Printing the model shows the fitted coefficients.
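Under the hood, glm() with family = "binomial" maps a linear predictor to a probability through the logistic function; a minimal illustration (not course code):
# The logistic (inverse logit) function maps any real number into (0, 1)
logistic <- function(x) 1 / (1 + exp(-x))
logistic(c(-5, 0, 5))   # approximately 0.0067, 0.5000, 0.9933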
loan_data <- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/162/datasets/8f48a2cbb6150e7ae32435e55f271cad5b4b8ecf/loan_data_ch1.rds")))
##################################
loan_data$ir_cat <- rep(NA, length(loan_data$int_rate))
loan_data$ir_cat[which(loan_data$int_rate <= 8)] <- "0-8"
loan_data$ir_cat[which(loan_data$int_rate > 8 & loan_data$int_rate <= 11)] <- "8-11"
loan_data$ir_cat[which(loan_data$int_rate > 11 & loan_data$int_rate <= 13.5)] <- "11-13.5"
loan_data$ir_cat[which(loan_data$int_rate > 13.5)] <- "13.5+"
loan_data$ir_cat[which(is.na(loan_data$int_rate))] <- "Missing"
loan_data$ir_cat <- as.factor(loan_data$ir_cat)
#Question
# Make the necessary replacements in the coarse classification example below
loan_data$emp_cat <- rep(NA, length(loan_data$emp_length))
loan_data$emp_cat[which(loan_data$emp_length <= 15)] <- "0-15"
loan_data$emp_cat[which(loan_data$emp_length > 15 & loan_data$emp_length <= 30)] <- "15-30"
loan_data$emp_cat[which(loan_data$emp_length > 30 & loan_data$emp_length <= 45)] <- "30-45"
loan_data$emp_cat[which(loan_data$emp_length > 45)] <- "45+"
loan_data$emp_cat[which(is.na(loan_data$emp_length))] <- "Missing"
loan_data$emp_cat <- as.factor(loan_data$emp_cat)
# Look at your new variable using plot()
str(loan_data)
## 'data.frame': 29092 obs. of 10 variables:
## $ loan_status : int 0 0 0 0 0 0 1 0 1 0 ...
## $ loan_amnt : int 5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
## $ int_rate : num 10.6 NA 13.5 NA NA ...
## $ grade : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 3 2 2 4 ...
## $ emp_length : int 10 25 13 3 9 11 0 3 3 0 ...
## $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 3 4 4 4 4 ...
## $ annual_inc : num 24000 12252 49200 36000 48000 ...
## $ age : int 33 31 24 39 24 28 22 22 28 22 ...
## $ ir_cat : Factor w/ 5 levels "0-8","11-13.5",..: 4 5 2 5 5 2 2 4 4 3 ...
## $ emp_cat : Factor w/ 5 levels "0-15","15-30",..: 1 2 1 1 1 1 1 1 1 1 ...
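As the comment above suggests, the new factor can also be inspected visually; calling plot() on a factor draws a bar plot of the level counts:
plot(loan_data$emp_cat)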
##################################
# Set seed of 567
set.seed(567)
# Store row numbers for training set: index_train
index_train <- sample(1:nrow(loan_data), 2/3*nrow(loan_data))
# Create training set: training_set
training_set <- loan_data[index_train, ]
# Create test set: test_set
test_set <- loan_data[-index_train, ]
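A quick sanity check on the split sizes (not course code; sample() truncates 2/3 of 29092 to an integer, which matches the degrees of freedom reported by glm() below):
nrow(training_set)   # should be 19394 (about two thirds of 29092)
nrow(test_set)       # should be 9698 (the remaining third)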
Instructions:
# Build a glm model with variable ir_cat as a predictor
log_model_cat <- glm(formula = loan_status ~ ir_cat, family = "binomial", data = training_set)
# Print the parameter estimates
log_model_cat
##
## Call: glm(formula = loan_status ~ ir_cat, family = "binomial", data = training_set)
##
## Coefficients:
## (Intercept) ir_cat11-13.5 ir_cat13.5+ ir_cat8-11 ir_catMissing
## -2.9091 1.0071 1.3874 0.5933 0.7937
##
## Degrees of Freedom: 19393 Total (i.e. Null); 19389 Residual
## Null Deviance: 13460
## Residual Deviance: 13050 AIC: 13060
# Look at the different categories in ir_cat using table()
table(loan_data$ir_cat)
##
## 0-8 11-13.5 13.5+ 8-11 Missing
## 7130 6954 6002 6230 2776
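The same kind of check can be run on the coarse-classified employment length:
table(loan_data$emp_cat)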
Instruction:
# Build the logistic regression model
log_model_multi <- glm(loan_status ~ age + ir_cat + grade + loan_amnt +
annual_inc , family = "binomial", data = training_set)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Obtain significance levels using summary()
summary(log_model_multi)
##
## Call:
## glm(formula = loan_status ~ age + ir_cat + grade + loan_amnt +
## annual_inc, family = "binomial", data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1959 -0.5329 -0.4383 -0.3322 3.2329
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.428e+00 1.284e-01 -18.920 < 2e-16 ***
## age -3.980e-03 3.863e-03 -1.030 0.3029
## ir_cat11-13.5 7.018e-01 1.313e-01 5.344 9.08e-08 ***
## ir_cat13.5+ 6.556e-01 1.462e-01 4.485 7.31e-06 ***
## ir_cat8-11 4.867e-01 1.160e-01 4.196 2.71e-05 ***
## ir_catMissing 5.067e-01 1.275e-01 3.973 7.09e-05 ***
## gradeB 1.427e-01 1.041e-01 1.370 0.1707
## gradeC 4.806e-01 1.200e-01 4.006 6.18e-05 ***
## gradeD 8.033e-01 1.365e-01 5.887 3.93e-09 ***
## gradeE 9.689e-01 1.643e-01 5.896 3.73e-09 ***
## gradeF 1.512e+00 2.296e-01 6.584 4.58e-11 ***
## gradeG 2.072e+00 3.443e-01 6.019 1.76e-09 ***
## loan_amnt -8.438e-06 4.208e-06 -2.005 0.0449 *
## annual_inc -4.788e-06 7.365e-07 -6.502 7.95e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13460 on 19393 degrees of freedom
## Residual deviance: 12892 on 19380 degrees of freedom
## AIC: 12920
##
## Number of Fisher Scoring iterations: 5
To evaluate the model, we can look at the p-values, shown in the column Pr(>|z|).
In the summary() output, mild significance is denoted by “.” and very strong significance by “***”.
We want to keep predictors with low p-values (strong significance) and consider dropping those with high p-values.
An example using only age and home_ownership:
log_model_small <- glm(loan_status ~ age + home_ownership, family = "binomial", data = training_set)
log_model_small
##
## Call: glm(formula = loan_status ~ age + home_ownership, family = "binomial",
## data = training_set)
##
## Coefficients:
## (Intercept) age home_ownershipOTHER
## -2.047183 -0.006617 0.534611
## home_ownershipOWN home_ownershipRENT
## 0.162187 0.241760
##
## Degrees of Freedom: 19393 Total (i.e. Null); 19389 Residual
## Null Deviance: 13460
## Residual Deviance: 13430 AIC: 13440
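As a side note (not part of the course exercise), the coefficients of a logit model can be exponentiated to read them as odds ratios; a minimal sketch:
# exp() of a logit coefficient gives the multiplicative change in the odds of
# default for a one-unit increase in that predictor, holding the others fixed
exp(coef(log_model_small))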
Use the predict() function to predict the probability of default for one or many test set cases.
Always make sure that the test cases are stored in a data frame.
In the predict() function, setting the type argument to “response” gives the actual predicted probability of default for this specific customer; without it, predict() returns the linear predictor on the log-odds scale.
test_case <- as.data.frame(test_set[1,])
test_case
## loan_status loan_amnt int_rate grade emp_length home_ownership
## 2 0 2400 NA C 25 RENT
## annual_inc age ir_cat emp_cat
## 2 12252 31 Missing 15-30
predict(log_model_small, newdata = test_case)
## 2
## -2.010561
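Note that the value above is the linear predictor (log-odds) because type was not specified. A sketch of the same call with type = "response" (the value in the comment is derived from the coefficients above):
# Predicted probability of default for the same test case
predict(log_model_small, newdata = test_case, type = "response")
# equals 1 / (1 + exp(2.010561)), roughly 0.118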
The video reviewed the predicted probability of just one test case, but by using the predict() function, you can do all test cases at once!
It’s useful to get an initial idea of how good the model is at discriminating by looking at the range of predicted probabilities. A small range means that the test-case predictions do NOT lie far apart, so the model will not be good at discriminating good customers from bad ones.
Instructions:
- The code for the prediction of test_case in the video is copied in your work space. Change the code such that the function predict() is applied to all cases in test_set. You can store them in the object predictions_all_small.
- Get an initial idea of how well the model can discriminate using range().
# Apply predict() to all cases in the test set
predictions_all_small <- predict(log_model_small, newdata = test_set, type = "response")
# Look at the range of the object "predictions_all_small"
range(predictions_all_small)
## [1] 0.07793829 0.16090135
In the exercise above, the range for the predicted probabilities of default was small. We can expand the range by adding more predictors and building a bigger model.
The model’s quality will depend on the predictive power of the newly added predictors.
Instructions:
- Make log_model_full in the same way you made log_model_small, but this time include all available predictors in the data set. If you don’t want to type the name of every column separately, you can simply select all variables using loan_status ~ .
- Create your prediction vector predictions_all_full for all the cases in the test set using predict(). Notice that these values represent the probability of defaulting.
- Look at the range of the predictions.
# Build the logistic regression model
log_model_full <- glm(loan_status ~ ., family = "binomial", data = training_set)
# Make PD-predictions for all test set elements using the full logistic regression model
predictions_all_full <- predict(log_model_full, newdata = test_set, type = "response")
# Look at the predictions range
range(predictions_all_full)
Using the predicted probabilities, you can build a confusion matrix that compares the test set’s true loan status to the predicted loan status.
We need to make a “cutoff” or “threshold value” between 0 and 1.
A logical cutoff could be “0.5”.
Transform the vector of predicted probabilities into a binary vector that indicates the status of each loan, using the ifelse() function.
Instruction:
- The code for the full logistic regression model along with the predictions-vector is given in your console.
- Using a cutoff of 0.15, create vector pred_cutoff_15 using the ifelse() function and predictions_all_full.
- Look at the confusion matrix using table() (enter the true values, so test_set$loan_status, in the first argument).
# The code for the logistic regression model and the predictions is given below
log_model_full <- glm(loan_status ~ ., family = "binomial", data = training_set)
predictions_all_full <- predict(log_model_full, newdata = test_set, type = "response")
# Make a binary predictions-vector using a cut-off of 15%
pred_cutoff_15 <- ifelse(predictions_all_full > 0.15, 1, 0)
# Construct a confusion matrix
conf_matrix <- table(test_set$loan_status, pred_cutoff_15)
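From this confusion matrix the usual measures can be computed; a minimal sketch, assuming both predicted classes occur so conf_matrix is 2 x 2 (the variable names below are illustrative):
# Rows are the true loan_status (0 = non-default, 1 = default),
# columns are the predicted status at the 15% cut-off
TN <- conf_matrix[1, 1]   # non-defaults predicted as non-default
FP <- conf_matrix[1, 2]   # non-defaults predicted as default
FN <- conf_matrix[2, 1]   # defaults predicted as non-default
TP <- conf_matrix[2, 2]   # defaults predicted as default
accuracy    <- (TP + TN) / sum(conf_matrix)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)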
To determine the best cut-off value, we need to look at the confusion-matrix measures: accuracy, sensitivity, and specificity.
In the chart below, accuracy increases steeply as the cut-off rises to about 25% (x-axis), increases only slightly until about 51%, and no longer changes for cut-offs greater than 51%.
Accuracy increasing with the cut-off is a typical result in credit risk modeling, or more generally in any classification problem with unbalanced groups. Looking at accuracy alone, we would conclude that a 51% cut-off is our best option. However, beyond 51% the model simply predicts every case as non-default, so accuracy alone gives a very biased picture of model quality.
Unlike accuracy, sensitivity is strictly decreasing in the cut-off and specificity is strictly increasing.
If we take a cut-off of 0, every case is predicted as a default:
- Sensitivity: 1037 / (1037 + 0) = 100%
- Specificity: 0 / (0 + 8640) = 0%
At the opposite extreme, a cut-off of 1 predicts every case as a non-default:
- Sensitivity: 0 / (0 + 1037) = 0%
- Specificity: 8640 / (8640 + 0) = 100%
The trade-off between Sensitivity & Specificity ALWAYS exists.
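A sketch of how such accuracy/sensitivity/specificity curves could be reproduced from predictions_all_full (assumes the objects from the exercises above are in the workspace; not course code):
keep  <- !is.na(predictions_all_full)        # drop cases the model could not score
truth <- test_set$loan_status[keep]
p     <- predictions_all_full[keep]
cutoffs <- seq(0, 1, by = 0.01)
acc <- sens <- spec <- numeric(length(cutoffs))
for (i in seq_along(cutoffs)) {
  pred    <- ifelse(p > cutoffs[i], 1, 0)
  acc[i]  <- mean(pred == truth)
  sens[i] <- sum(pred == 1 & truth == 1) / sum(truth == 1)
  spec[i] <- sum(pred == 0 & truth == 0) / sum(truth == 0)
}
plot(cutoffs, acc, type = "l", ylim = c(0, 1), xlab = "Cut-off", ylab = "Measure")
lines(cutoffs, sens, lty = 2)   # sensitivity: decreases as the cut-off rises
lines(cutoffs, spec, lty = 3)   # specificity: increases as the cut-off rises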
A logistic regression model fit with family = "binomial" is the same as a logit-link logistic regression model.
log_model_full <- glm(loan_status ~ ., family = "binomial", data = training_set)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#this is the SAME as:
#Use "link = logit" argument.
log_model_full <- glm(loan_status ~ ., family = binomial(link = logit), data = training_set)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
The probit link below also models the probability of default, just through a different link function:
log_model_full <- glm(loan_status ~ ., family = binomial(link = probit), data = training_set)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
log_model_full <- glm(loan_status ~ ., family = binomial(link = cloglog), data = training_set)
Note: A negative parameter estimate in any of these models predicts a DECREASE in default probability, and a positive parameter estimate predicts an INCREASE.
However, the function linking the linear predictor to the actual probability of default differs between the links, so translating an estimate into a change in default probability is more complex than for the logit link (where the simple log-odds/odds-ratio interpretation applies).
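To illustrate the point, these are the inverse link functions that map a linear predictor eta to a probability (standard R functions, not course code):
eta <- -2                # an arbitrary value of the linear predictor
plogis(eta)              # logit:   exp(eta) / (1 + exp(eta))
pnorm(eta)               # probit:  standard normal CDF
1 - exp(-exp(eta))       # cloglog: complementary log-log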
Instructions:
# Fit the logit, probit and cloglog-link logistic regression models
log_model_logit <- glm(loan_status ~ age + ir_cat + loan_amnt,
family = binomial(link = logit), data = training_set)
log_model_probit <- glm(loan_status ~ age + emp_cat + ir_cat + loan_amnt,
family = binomial(link = probit), data = training_set)
log_model_cloglog <- glm(loan_status ~ age + emp_cat + ir_cat + loan_amnt,
family = binomial(link = cloglog), data = training_set)
# Make predictions for all models using the test set
predictions_logit <- predict(log_model_logit, newdata = test_set, type = "response")
predictions_probit <- predict(log_model_probit, newdata = test_set, type = "response")
predictions_cloglog <- predict(log_model_cloglog, newdata = test_set, type = "response")
# Use a cut-off of 14% to make binary predictions-vectors
cutoff <- 0.14
class_pred_logit <- ifelse(predictions_logit > cutoff, 1, 0)
class_pred_probit <- ifelse(predictions_probit > cutoff, 1, 0)
class_pred_cloglog <- ifelse(predictions_cloglog > cutoff, 1, 0)
# Make a confusion matrix for the three models (the true values come from the test set)
true_val <- test_set$loan_status
tab_class_logit <- table(true_val, class_pred_logit)
tab_class_probit <- table(true_val, class_pred_probit)
tab_class_cloglog <- table(true_val, class_pred_cloglog)
# Compute the classification accuracy for all three models
acc_logit <- sum(diag(tab_class_logit)) / nrow(test_set)
acc_probit <- sum(diag(tab_class_probit)) / nrow(test_set)
acc_cloglog <- sum(diag(tab_class_cloglog)) / nrow(test_set)
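The three accuracies can then be compared side by side (not part of the course exercise):
# Print the accuracies of the three link functions next to each other
c(logit = acc_logit, probit = acc_probit, cloglog = acc_cloglog)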
Another way of classifying groups (default or no-default) is with decision trees.
Starting at the top (the root node), you answer the splitting questions down to a leaf node, where a decision is made to classify a case as default or non-default.
For each node of the tree, the splitting decision (which variable to split on, and at which value) affects the final structure of the tree.
We can compare candidate splits by measuring their impurity.
Our goal is to minimize the impurity in each node of the tree.
A popular measure, and the default in rpart, is the Gini measure.
Assume you have a training set of data with 500 cases (250 default & 250 non-default).
In the node diagrams, the number of non-defaults is shown on the left (green) and the number of defaults on the right (red).
The perfect scenario would be a split that produces pure leaf nodes (one containing only defaults, the other only non-defaults), but since this is rarely the case, we compute the impurity of each node using the Gini measure.
The impurity in a certain node is given by 2 times the proportion of defaults in the node times the proportion of non-defaults in the node: Gini = 2 * p(default) * p(non-default).
First, calculate the Gini measure for the root node.
Next, calculate the Gini measure for the leaf nodes.
Finally, calculate the PURITY GAIN achieved by going from the root node to the two leaf nodes N1 and N2.
Formula:
Purity Gain = Gini_Root - (prop(cases in N1) * Gini_N1) - (prop(cases in N2) * Gini_N2)
From our example: Purity Gain = 0.5 - (270/500 * 0.4664) - (230/500 * 0.4536) = 0.039488
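A quick check of this arithmetic in R (numbers taken from the worked example above):
gini_root_ex <- 2 * 250/500 * 250/500                          # 0.5
gain_ex <- gini_root_ex - 270/500 * 0.4664 - 230/500 * 0.4536
gain_ex                                                        # 0.039488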
The algorithm evaluates the candidate splits and chooses the one with the highest purity gain.
Instructions:
# The Gini-measure of the root node is given below
gini_root <- 2 * 89 / 500 * 411 / 500
# Compute the Gini measure for the left leaf node
gini_ll <- 2 * 401/446 * 45/446
# Compute the Gini measure for the right leaf node
gini_rl <- 2 * 10/54 * 44/54
# Compute the gain
gain <- gini_root - 446 / 500 * gini_ll - 54 / 500 * gini_rl
# Compare the gain column in small_tree$splits with our computed gain, multiplied by 500, and verify that they are the same
# (small_tree is only available in the console)
small_tree$splits
improve <- gain * 500
Imagine if we had to calculate all splits, that would be quite a challenge.
This is where the rpart package comes in.
Warning:
Using the default settings on the training_set of loan_data, including all variables (~ .) and specifying method = "class" because the response (loan_status) is categorical, we get a warning that only a root node is created.
library(rpart)
fit_default <- rpart(loan_status ~ ., method = "class", data = training_set)
#observe the warning message!
plot(fit_default)
The reason for the warning is that, similar to what we observed when setting a cut-off for the logistic regression model, the highest accuracy is achieved by simply predicting all cases as “non-default”.
There are 3 main things we can do to overcome the unbalanced data (defaults make up only a small share of the data set).
1. Over-sampling OR Under-sampling
+ You can over-sample the under-represented group (defaults) OR under-sample the over-represented group.
+ Balancing the training set makes it easier for the tree to learn the characteristics of defaults.
+ Over- or under-sampling should ONLY be applied to the training_set and NOT to the test_set.
2. Changing the prior probabilities in the rpart function
+ By default, the prior probabilities of default vs. non-default are set equal to their proportions in the training_set. By making the prior probability of “default” bigger, you can trick R into attaching more importance to defaults, leading to a better decision tree.
3. Include a Loss Matrix using rpart function
+ In this Loss Matrix, different costs can be associated with a misclassification of a “default” as a “non-default” versus the misclassification of a “non-default” as a “default”.
+ By increasing the misclassification cost of the former, more attention is given to the correct classification of defaults, improving the quality of the decision tree.
The techniques should ALWAYS be validated so we can decide on the best model.
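A sketch of how options 2 and 3 could look with rpart(); the prior and loss values below are illustrative choices, not values from the course:
library(rpart)
# Option 2: change the prior probabilities
# (order follows the factor levels of loan_status: non-default, default)
tree_prior <- rpart(loan_status ~ ., method = "class", data = training_set,
                    parms = list(prior = c(0.7, 0.3)))
# Option 3: include a loss matrix; here misclassifying a default as a
# non-default costs 10 times as much as the opposite error
tree_loss <- rpart(loan_status ~ ., method = "class", data = training_set,
                   parms = list(loss = matrix(c(0, 10, 1, 0), ncol = 2)))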