Credit risk modeling is all about loan default. A loan default occurs when a borrower cannot make good on their payments.
Components of expected loss (EL)
EL (Expected Loss) = PD (Probability of Default) x EAD (Exposure at Default) x LGD (Loss Given Default)
The main focus will be the Probability of Default (PD).
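As a quick illustration of the formula, here is a minimal sketch with hypothetical values (none of these numbers come from the data set used below):
# Hypothetical single loan: 10% chance of default, $10,000 outstanding,
# and 60% of the exposure lost if the borrower defaults
PD  <- 0.10
EAD <- 10000
LGD <- 0.60
EL  <- PD * EAD * LGD
EL
## [1] 600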
Banks keep past customer default records, which can be used to predict which future customers will default.
Each line in the data represents a customer.
The loan_status field shows whether the customer defaulted on the loan (1) or did not (0).
The grade field is a bureau score of the customer (the only behavioral data in the data set). The other fields used are loan amount, interest rate, employment length, home ownership, annual income, and age.
#RDS file for loan_data
loan_data <- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/162/datasets/8f48a2cbb6150e7ae32435e55f271cad5b4b8ecf/loan_data_ch1.rds")))
The CrossTable() function in the gmodels package shows the counts and proportions of a field's values.
Instructions:
loan_data <- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/162/datasets/8f48a2cbb6150e7ae32435e55f271cad5b4b8ecf/loan_data_ch1.rds")))
# View the structure of loan_data
str(loan_data)
## 'data.frame': 29092 obs. of 8 variables:
## $ loan_status : int 0 0 0 0 0 0 1 0 1 0 ...
## $ loan_amnt : int 5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
## $ int_rate : num 10.6 NA 13.5 NA NA ...
## $ grade : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 3 2 2 4 ...
## $ emp_length : int 10 25 13 3 9 11 0 3 3 0 ...
## $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 3 4 4 4 4 ...
## $ annual_inc : num 24000 12252 49200 36000 48000 ...
## $ age : int 33 31 24 39 24 28 22 22 28 22 ...
# Load the gmodels package
library(gmodels)
# Call CrossTable() on loan_status
CrossTable(loan_data$loan_status)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 29092
##
##
## | 0 | 1 |
## |-----------|-----------|
## | 25865 | 3227 |
## | 0.889 | 0.111 |
## |-----------|-----------|
##
##
##
##
# Call CrossTable() on grade and loan_status
CrossTable(loan_data$grade, loan_data$loan_status, prop.r = TRUE, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 29092
##
##
## | loan_data$loan_status
## loan_data$grade | 0 | 1 | Row Total |
## ----------------|-----------|-----------|-----------|
## A | 9084 | 565 | 9649 |
## | 0.941 | 0.059 | 0.332 |
## ----------------|-----------|-----------|-----------|
## B | 8344 | 985 | 9329 |
## | 0.894 | 0.106 | 0.321 |
## ----------------|-----------|-----------|-----------|
## C | 4904 | 844 | 5748 |
## | 0.853 | 0.147 | 0.198 |
## ----------------|-----------|-----------|-----------|
## D | 2651 | 580 | 3231 |
## | 0.820 | 0.180 | 0.111 |
## ----------------|-----------|-----------|-----------|
## E | 692 | 176 | 868 |
## | 0.797 | 0.203 | 0.030 |
## ----------------|-----------|-----------|-----------|
## F | 155 | 56 | 211 |
## | 0.735 | 0.265 | 0.007 |
## ----------------|-----------|-----------|-----------|
## G | 35 | 21 | 56 |
## | 0.625 | 0.375 | 0.002 |
## ----------------|-----------|-----------|-----------|
## Column Total | 25865 | 3227 | 29092 |
## ----------------|-----------|-----------|-----------|
##
##
How would you interpret the results in the table you constructed at the end of the previous exercise?
Answer: The proportion of defaults increases when the credit rating moves from A to G.
library(gmodels)
# Use "main", "xlab", and "ylab" to rename labels.
hist(loan_data$int_rate, main = "Histogram of Interest Rate", xlab = "Interest rate", ylab = "Count")
# Observation of Histogram:
# There are very few loans that have an interest rate higher than 20%.
In the example below, the histogram shows essentially only one visible bar, because the extreme incomes stretch the x-axis. We can change this by defining the number of breaks to use in the graph.
hist_income <- hist(loan_data$annual_inc, xlab = "Annual Income", main = "Histogram of Annual Income")
# By using $breaks, we can get the locations of the histogram breaks.
hist_income$breaks
## [1] 0 500000 1000000 1500000 2000000 2500000 3000000 3500000
## [9] 4000000 4500000 5000000 5500000 6000000
By applying the rule-of-thumb number of breaks, sqrt(nrow(data_set_name)), we get a few more bars than before. This is getting better but still needs improvement.
n_breaks <- sqrt(nrow(loan_data)) #170.5638
hist_income_n <- hist(loan_data$annual_inc, breaks = n_breaks, xlab = "Annual Income", main = "Histogram of Annual Income")
Look at a scatter plot to see if there are any outliers in the data set.
We can see in the plot that there is one observation with over $6 million in annual income.
plot(loan_data$annual_inc, ylab = "Annual Income")
When is a value an outlier?
Data scientists use:
Expert judgement
Rule of thumb: flag values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR
Mostly: a combination of both
Delete observations with an annual income (annual_inc) greater than $3 million.
# First, find the row indices of the observations you want to delete
index_outlier_expert <- which(loan_data$annual_inc > 3000000)
# Second, delete those rows from the data set
loan_data_expert <- loan_data[-index_outlier_expert, ]
# New dataset = loan_data_expert
Delete data using the rule of thumb: an observation is an outlier if it is bigger than Q3 + 1.5 * IQR or smaller than Q1 - 1.5 * IQR (IQR = interquartile range). Here, only the upper cutoff is applied to annual income.
outlier_cutoff <- quantile(loan_data$annual_inc, 0.75) + 1.5 * IQR(loan_data$annual_inc)
index_outlier_ROT <- which(loan_data$annual_inc > outlier_cutoff)
loan_data_ROT <- loan_data[-index_outlier_ROT, ]
# New dataset = loan_data_ROT
When we rerun the histograms using the data without outliers, there are more breaks and the plots are much easier to analyze.
# Histogram with the new data set from above: loan_data_expert
hist(loan_data_expert$annual_inc, breaks = sqrt(nrow(loan_data_expert)), main = "expert judgement", xlab = "Annual income expert judgement")
# Histogram with the new data set from above: loan_data_ROT
hist(loan_data_ROT$annual_inc, breaks = sqrt(nrow(loan_data_ROT)), xlab = "Annual income rule of thumb")
Deleting the outliers can also be done just for visualization, if you plan to keep them in the analysis.
Bivariate plots can point out the bivariate outliers (outliers in 2 dimensions of the data).
plot(loan_data$emp_length, loan_data$annual_inc, xlab = "Employment length", ylab = "Annual Income")
Instructions:
# Create histogram of loan_amnt: hist_1
hist_1 <- hist(loan_data$loan_amnt)
# Print locations of the breaks in hist_1
hist_1$breaks
## [1] 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 20000
## [12] 22000 24000 26000 28000 30000 32000 34000 36000
# Change number of breaks and add labels: hist_2
hist_2 <- hist(loan_data$loan_amnt, breaks = 200, xlab = "Loan Amount",
main = "Histogram of the loan amount")
Instructions:
# Plot the age variable
plot(loan_data$age, ylab = "Age")
# Save the outlier's index to index_highage
index_highage <- which(loan_data$age > 122)
# Create data set new_data with outlier deleted
new_data <- loan_data[-index_highage, ]
# Make bivariate scatterplot of age and annual income
plot(loan_data$age, loan_data$annual_inc, xlab = "Age", ylab = "Annual Income")
At this point, we have deleted outliers from Annual Income and Age from the data set.
There are NAs in the data set for two fields: int_rate and emp_length.
First, we want to know how many values are missing.
Use the summary() function.
summary(loan_data$emp_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 2.000 4.000 6.145 8.000 62.000 809
There are three main ways to handle missing values: delete the rows, delete the column, or replace (impute) the values. (Keeping the NAs as their own category is a fourth option, discussed further below.)
The code below deletes every row that has an NA value in the emp_length field.
Newly added code: which(is.na())
# Find the row indices where emp_length is NA
index_NA <- which(is.na(loan_data$emp_length))
# Delete those rows from the data set
loan_data_no_NA <- loan_data[-c(index_NA), ]
The code below will delete the entire column.
# Create a copy of the data set before deleting the column, and give it a different name
loan_data_delete_employ <- loan_data
# Setting the column to NULL removes emp_length from the data set entirely
loan_data_delete_employ$emp_length <- NULL
A common practice is to replace missing values with the median of the variable. This is called median imputation.
# Find the row indices where emp_length is NA
index_NA <- which(is.na(loan_data$emp_length))
# Save a copy of the data set
loan_data_replace <- loan_data
# Replace the NA values in emp_length with the median value
loan_data_replace$emp_length[index_NA] <- median(loan_data$emp_length, na.rm = TRUE)
In some situations, we want to keep the NAs because they are informative for the analysis.
A problem arises when we use a model that deletes NAs automatically.
To keep models from dropping these observations, we can use coarse classification: the continuous variable is put into "bins" (categories), and the NAs get a bin of their own.
Example: let's create a new variable emp_cat that will replace emp_length.
If we plot the new field, we see that the 0 - 15 bin carries a very high proportion of the data set.
You could redo the process (trial and error) and split the 0 - 15 group into more granular bins.
Instructions:
# Look at summary of loan_data
summary(loan_data$int_rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5.42 7.90 10.99 11.00 13.47 23.22 2776
# Get indices of missing interest rates: na_index
na_index <- which(is.na(loan_data$int_rate))
# Remove observations with missing interest rates: loan_data_delrow_na
loan_data_delrow_na <- loan_data[-na_index, ]
# Make copy of loan_data
loan_data_delcol_na <- loan_data
# Delete interest rate column from loan_data_delcol_na
loan_data_delcol_na$int_rate <- NULL
Instructions:
# Compute the median of int_rate
median_ir <- median(loan_data$int_rate, na.rm = TRUE)
# Make copy of loan_data
loan_data_replace <- loan_data
# Replace missing interest rates with median
loan_data_replace$int_rate[na_index] <- median_ir
# Check if the NAs are gone
summary(loan_data_replace$int_rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.42 8.49 10.99 11.00 13.11 23.22
Instructions:
Use coarse classification to keep the observations with missing data by putting the NAs in a bin of their own.
#Question
# Make the necessary replacements in the coarse classification example below
loan_data$emp_cat <- rep(NA, length(loan_data$emp_length))
loan_data$emp_cat[which(loan_data$emp_length <= 15)] <- "0-15"
loan_data$emp_cat[which(loan_data$emp_length > 15 & loan_data$emp_length <= 30)] <- "15-30"
loan_data$emp_cat[which(loan_data$emp_length > 30 & loan_data$emp_length <= 45)] <- "30-45"
loan_data$emp_cat[which(loan_data$emp_length > 45)] <- "45+"
loan_data$emp_cat[which(is.na(loan_data$emp_length))] <- "Missing"
loan_data$emp_cat <- as.factor(loan_data$emp_cat)
# Look at your new variable using plot()
plot(loan_data$emp_cat)
#Answer
# Make the necessary replacements in the coarse classification example below
loan_data$ir_cat <- rep(NA, length(loan_data$int_rate))
loan_data$ir_cat[which(loan_data$int_rate <= 8)] <- "0-8"
loan_data$ir_cat[which(loan_data$int_rate > 8 & loan_data$int_rate <= 11)] <- "8-11"
loan_data$ir_cat[which(loan_data$int_rate > 11 & loan_data$int_rate <= 13.5)] <- "11-13.5"
loan_data$ir_cat[which(loan_data$int_rate > 13.5)] <- "13.5+"
loan_data$ir_cat[which(is.na(loan_data$int_rate))] <- "Missing"
loan_data$ir_cat <- as.factor(loan_data$ir_cat)
# Look at your new variable using plot()
plot(loan_data$ir_cat)
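For reference, the same kind of binning can also be written more compactly with cut(); this is just a sketch (ir_cat2 is a hypothetical name, and the NAs still have to be labelled explicitly):
# Bin int_rate with the same cut-offs as above (right-closed intervals)
ir_cat2 <- cut(loan_data$int_rate, breaks = c(0, 8, 11, 13.5, Inf),
               labels = c("0-8", "8-11", "11-13.5", "13.5+"))
# cut() leaves missing values as NA, so give them their own "Missing" level
ir_cat2 <- factor(ifelse(is.na(ir_cat2), "Missing", as.character(ir_cat2)))
plot(ir_cat2)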
After the data is fully preprocessed, we can start our analysis.
You can use the same data set to run the model on and to evaluate the results, but this usually leads to results that are too optimistic.
We need to split the data into 2 pieces, so we can run the model on the Training Set and evaluate the model on the Test Set.
A common way of splitting the data is to use 2/3 for training and 1/3 for testing. The performance estimate can vary quite a bit depending on which 2/3 of the data we select for the training set.
One way to reduce this variation is to use cross-validation.
For a 2/3 training set and 1/3 test set example, a cross-validation variant looks like this:
The data is split into 3 equal parts; in each round, 2 parts form the training set and 1 part forms the test set, and the performance estimates from the rounds are averaged. You could technically use as many parts as you want, but that means fitting and evaluating the model once per part, which makes it computationally heavy. In this course, we'll just use 1 training set and 1 test set.
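A minimal sketch of that 3-part idea (illustrative only; fold_id, training_set_k and test_set_k are hypothetical names, and no model is actually fitted in the loop):
set.seed(567)
# Randomly assign each row of loan_data to one of 3 folds
fold_id <- sample(rep(1:3, length.out = nrow(loan_data)))
for (k in 1:3) {
  training_set_k <- loan_data[fold_id != k, ]  # 2 folds, about 2/3 of the data
  test_set_k     <- loan_data[fold_id == k, ]  # 1 fold, about 1/3 of the data
  # Fit the model on training_set_k, evaluate it on test_set_k,
  # then average the 3 performance estimates.
}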
Imagine you just ran a model, and now we want to apply it to our test set to see how good the results are.
Evaluating the model for credit risk means comparing the observed outcomes of default vs. non-default, stored in the loan_status field of test_set (test_set$loan_status), with the predicted outcomes (model_prediction).
Here, we're just using 14 values to demonstrate the concept, but usually we'd use the confusion matrix to evaluate a larger set of values.
A confusion matrix is a contingency table of correct and incorrect classifications. The CORRECT classifications lie on the diagonal from top left to bottom right: there are 8 correctly classified and predicted non-default customers and 3 correctly classified and predicted default customers.
The INCORRECT classifications lie on the diagonal from top right to bottom left: there were 2 non-defaulters incorrectly classified as defaulters, and 1 defaulter incorrectly classified as a non-defaulter.
There are three important measures in the confusion matrix (a small sketch reproducing this 14-observation example follows the list).
1. Accuracy = the diagonal: the percentage of correctly classified instances.
+ (8 + 3)/14 = 78.57%
2. Sensitivity = the bottom row: the percentage of bad customers (defaulters) that are classified correctly.
+ 3/(1 + 3) = 75%
3. Specificity = the top row: the percentage of good customers (non-defaulters) that are classified correctly.
+ 8/(8 + 2) = 80%
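As a complement to the numbers above, here is a minimal sketch that reproduces this 14-observation example in R (true_status and pred_status are hypothetical vectors constructed only to match the counts described in the text):
# 10 actual non-defaulters followed by 4 actual defaulters
true_status <- c(rep(0, 10), rep(1, 4))
# 8 correct non-defaults, 2 false defaults, 3 correct defaults, 1 missed default
pred_status <- c(rep(0, 8), rep(1, 2), rep(1, 3), 0)
conf_example <- table(true_status, pred_status)
conf_example
# Accuracy: correctly classified instances (the diagonal)
sum(diag(conf_example)) / sum(conf_example)        # (8 + 3) / 14 = 0.7857
# Sensitivity: correctly classified defaulters (bottom row)
conf_example["1", "1"] / sum(conf_example["1", ])  # 3 / (1 + 3) = 0.75
# Specificity: correctly classified non-defaulters (top row)
conf_example["0", "0"] / sum(conf_example["0", ])  # 8 / (8 + 2) = 0.80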
Instructions:
# Set seed of 567
set.seed(567)
# Store row numbers for training set: index_train
index_train <- sample(1:nrow(loan_data), 2/3*nrow(loan_data))
# Create training set: training_set
training_set <- loan_data[index_train, ]
# Create test set: test_set
test_set <- loan_data[-index_train, ]
Not available here, but in the console there is a predicted outcome stored in a vector called "model_pred".
We want to use the Confusion Matrix to see how well the model performed.
Compare the actual loan status (loan_status) to the predicted values (model_pred), using the table() function.
This was the worst example.
# Create confusion matrix
conf_matrix <- table(test_set$loan_status, model_pred)
# Compute classification accuracy
(6092 + 349) / nrow(test_set)
# Compute sensitivity
349 / 1037