Introduction

Credit risk modeling is all about loan default. A loan default occurs when a borrower cannot make good on their payments.

Components of expected loss (EL)

EL (Expected Loss) = PD (Probability of Default) x EAD (Exposure at Default) x LGD (Loss Given Default)

The main focus of this course will be the probability of default (PD).
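As a toy illustration of the formula (a minimal sketch; the numbers are made up for illustration, not taken from the course):

# Expected loss for a single hypothetical loan
PD  <- 0.02      # probability of default
EAD <- 10000     # exposure at default, in dollars
LGD <- 0.60      # loss given default (share of EAD not recovered)

EL <- PD * EAD * LGD
EL   # 120, i.e. an expected loss of $120 on this loan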

Information used by banks

Banks keep records of past customer defaults, which can be used to predict which future customers will default.

  • Application Information:
    • Income
    • Marital Status
    • and so forth
  • Behavioral Information:
    • Current Account Balance
    • Payment history
    • Account history, and so forth

The Data

Each row in the data represents a customer.
The loan_status field indicates whether the customer defaulted on the loan (1) or did not (0).
The grade field is a bureau score for the customer (the only behavioral variable in the data set). The other fields used are loan amount, interest rate, employment length, home ownership, annual income, and age.

#RDS file for loan_data

loan_data <- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/162/datasets/8f48a2cbb6150e7ae32435e55f271cad5b4b8ecf/loan_data_ch1.rds")))

CrossTable

The CrossTable() function in the gmodels package shows the counts and proportions of a field's values.

Exercise 1)

Instructions:

  • Get familiar with the data set by looking at its structure with str().
  • Load the gmodels package using library(). It is already installed on DataCamp’s servers.
  • Have a look at the CrossTable() of loan status, using just one argument: loan_data$loan_status.
  • Call CrossTable() with x argument loan_data$grade and y argument loan_data$loan_status. We only want row-wise proportions, so set prop.r to TRUE, but prop.c, prop.t and prop.chisq to FALSE (these all default to TRUE, which would add column proportions, table proportions and chi-square contributions for each cell; we do not need those here).
loan_data <- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/162/datasets/8f48a2cbb6150e7ae32435e55f271cad5b4b8ecf/loan_data_ch1.rds")))

# View the structure of loan_data
str(loan_data)
## 'data.frame':    29092 obs. of  8 variables:
##  $ loan_status   : int  0 0 0 0 0 0 1 0 1 0 ...
##  $ loan_amnt     : int  5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
##  $ int_rate      : num  10.6 NA 13.5 NA NA ...
##  $ grade         : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 3 2 2 4 ...
##  $ emp_length    : int  10 25 13 3 9 11 0 3 3 0 ...
##  $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 3 4 4 4 4 ...
##  $ annual_inc    : num  24000 12252 49200 36000 48000 ...
##  $ age           : int  33 31 24 39 24 28 22 22 28 22 ...
# Load the gmodels package 
library(gmodels)

# Call CrossTable() on loan_status
CrossTable(loan_data$loan_status)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  29092 
## 
##  
##           |         0 |         1 | 
##           |-----------|-----------|
##           |     25865 |      3227 | 
##           |     0.889 |     0.111 | 
##           |-----------|-----------|
## 
## 
## 
## 
# Call CrossTable() on grade and loan_status
CrossTable(loan_data$grade, loan_data$loan_status, prop.r = TRUE, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  29092 
## 
##  
##                 | loan_data$loan_status 
## loan_data$grade |         0 |         1 | Row Total | 
## ----------------|-----------|-----------|-----------|
##               A |      9084 |       565 |      9649 | 
##                 |     0.941 |     0.059 |     0.332 | 
## ----------------|-----------|-----------|-----------|
##               B |      8344 |       985 |      9329 | 
##                 |     0.894 |     0.106 |     0.321 | 
## ----------------|-----------|-----------|-----------|
##               C |      4904 |       844 |      5748 | 
##                 |     0.853 |     0.147 |     0.198 | 
## ----------------|-----------|-----------|-----------|
##               D |      2651 |       580 |      3231 | 
##                 |     0.820 |     0.180 |     0.111 | 
## ----------------|-----------|-----------|-----------|
##               E |       692 |       176 |       868 | 
##                 |     0.797 |     0.203 |     0.030 | 
## ----------------|-----------|-----------|-----------|
##               F |       155 |        56 |       211 | 
##                 |     0.735 |     0.265 |     0.007 | 
## ----------------|-----------|-----------|-----------|
##               G |        35 |        21 |        56 | 
##                 |     0.625 |     0.375 |     0.002 | 
## ----------------|-----------|-----------|-----------|
##    Column Total |     25865 |      3227 |     29092 | 
## ----------------|-----------|-----------|-----------|
## 
## 

Exercise 2) Interpreting a CrossTable()

How would you interpret the results in the table you constructed at the end of the previous exercise?
Answer: The proportion of defaults increases when the credit rating moves from A to G.
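This can be checked numerically (a quick sketch; since loan_status is coded 0/1, the mean of loan_status within each grade is the observed default rate):

# Default rate per credit grade: it should increase from A to G
tapply(loan_data$loan_status, loan_data$grade, mean)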

 

Video: Histogram & Outliers

Using function hist()

# Use "main", "xlab", and "ylab" to set the plot title and axis labels.
hist(loan_data$int_rate, main = "Histogram of Interest Rate", xlab = "Interest rate", ylab = "Count")

# Observation from the histogram:
# Very few loans have an interest rate higher than 20%.
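As a quick numeric check (a minimal sketch; the 20% threshold comes from the observation above):

# Share of loans with an interest rate above 20% (NAs excluded)
mean(loan_data$int_rate > 20, na.rm = TRUE)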

Using function hist() on annual_inc

In the example below, essentially all observations fall into a single bar of the histogram, because a few extreme incomes stretch the x-axis. We can change this by defining the number of breaks to use in the graph.

hist_income <- hist(loan_data$annual_inc, xlab = "Annual Income", main = "Histogram of Annual Income")

# Using "$breaks", we can retrieve the locations of the histogram breaks.
hist_income$breaks
##  [1]       0  500000 1000000 1500000 2000000 2500000 3000000 3500000
##  [9] 4000000 4500000 5000000 5500000 6000000

Applying the rule-of-thumb number of breaks, sqrt(nrow(data_set_name)), gives a few more bars than before. This is better, but still needs improvement.

n_breaks <- sqrt(nrow(loan_data)) #170.5638

hist_income_n <- hist(loan_data$annual_inc, breaks = n_breaks, xlab = "Annual Income", main = "Histogram of Annual Income")

Scatter plot of annual_inc

Look at a scatter plot to see if there are any outliers in the data set.
In the plot, we can see one observation with an annual income of over $6 million.

plot(loan_data$annual_inc, ylab = "Annual Income")
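To pinpoint that observation (a small sketch):

# Row index and value of the largest annual income in the data set
index_max_inc <- which.max(loan_data$annual_inc)
loan_data$annual_inc[index_max_inc]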

Outlier

  • When is a value an outlier?
    Data scientists typically use:

  • Expert judgement

  • Rule of thumb: flag values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR

  • Mostly: a combination of both

Expert judgement example: delete observations whose annual income (annual_inc) is greater than $3 million.

# First, find the row indices of the observations you want to delete
index_outlier_expert <- which(loan_data$annual_inc > 3000000)

# Second, remove those rows from the data set
loan_data_expert <- loan_data[-index_outlier_expert, ]

# New dataset = loan_data_expert

Rule-of-thumb example: delete observations that are outliers, i.e. larger than Q3 + 1.5 * IQR (or, in general, smaller than Q1 - 1.5 * IQR), where IQR is the interquartile range.

outlier_cutoff <- quantile(loan_data$annual_inc, 0.75) + 1.5 * IQR(loan_data$annual_inc)

index_outlier_ROT <- which(loan_data$annual_inc > outlier_cutoff)

loan_data_ROT <- loan_data[-index_outlier_ROT, ]

# New dataset = loan_data_ROT

When we rerun the histograms on the data without outliers, the breaks are more informative and the plots are much easier to analyze.

# Histogram with the new data set from above: loan_data_expert
hist(loan_data_expert$annual_inc, breaks = sqrt(nrow(loan_data_expert)), main = "expert judgement", xlab = "Annual income expert judgement")

# Histogram with the new data set from above: loan_data_ROT
hist(loan_data_ROT$annual_inc, breaks = sqrt(nrow(loan_data_ROT)), xlab = "Annual income rule of thumb")

Deleting the outliers can also be done only for visualization purposes, if you plan to keep them in the analysis.

Bivariate plot

Bivariate plots can point out the bivariate outliers (outliers in 2 dimensions of the data).

plot(loan_data$emp_length, loan_data$annual_inc, xlab = "Employment length", ylab = "Annual Income")
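One way to follow up on such a plot is to flag extreme combinations programmatically (a minimal sketch; the cutoffs of 60 years and $1,000,000 are purely illustrative, not from the course):

# Flag observations that look extreme in both dimensions
index_bivariate <- which(loan_data$emp_length > 60 & loan_data$annual_inc > 1000000)
loan_data[index_bivariate, c("emp_length", "annual_inc")]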

Histogram Exercise

Exercise 1) Regroup Histogram Breaks

Instructions:

  • Use hist() to create a histogram with only one argument: loan_data$loan_amnt. Assign the result to a new object called hist_1.
  • Use $breaks along with the object hist_1 to get more information on the histogram breaks. Knowing the location of the breaks is important because if they are poorly chosen, the histogram may be misleading.
  • Change the number of breaks in hist_1 to 200 by specifying the breaks argument. Additionally, name the x-axis “Loan amount” using the xlab argument and title it “Histogram of the loan amount” using the main argument. Save the result to hist_2. Why do the peaks occur where they occur?
# Create histogram of loan_amnt: hist_1
hist_1 <- hist(loan_data$loan_amnt)

# Print locations of the breaks in hist_1
hist_1$breaks
##  [1]     0  2000  4000  6000  8000 10000 12000 14000 16000 18000 20000
## [12] 22000 24000 26000 28000 30000 32000 34000 36000
# Change number of breaks and add labels: hist_2
hist_2 <- hist(loan_data$loan_amnt, breaks = 200, xlab = "Loan Amount", 
               main = "Histogram of the loan amount")

Exercise 2) Outliers (Identify & Delete)

Instructions:

  • Build a scatter plot of the variable age (through loan_data$age) using the function plot(). Give the y-axis the appropriate label "Age" using ylab as a second argument.
  • The oldest person in this data set is older than 122 years! Get the index of this outlier using which() and the age of 122 as a cutoff (you can do this using loan_data$age > 122). Assign it to the object index_highage.
  • Create a new data set new_data, after removing the observation with the high age using the object index_highage.
  • Have a look at the bivariate scatter plot, with age on the x-axis and annual income on the y-axis. Change the labels to “Age” and “Annual Income”, respectively.
# Plot the age variable
plot(loan_data$age, ylab = "Age")

# Save the outlier's index to index_highage
index_highage <- which(loan_data$age > 122)

# Create data set new_data with outlier deleted
new_data <- loan_data[-index_highage, ]

# Make bivariate scatterplot of age and annual income, using the cleaned data
plot(new_data$age, new_data$annual_inc, xlab = "Age", ylab = "Annual Income")

Missing Data & Coarse Classification

At this point, we have deleted the annual income and age outliers from the data set.

Missing inputs

There are NAs in the data set in two fields: int_rate and emp_length.

First, we want to know how many inputs are missing.
Use the summary() function.

summary(loan_data$emp_length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   2.000   4.000   6.145   8.000  62.000     809
  • There are 809 NA observations in the employment length field.
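An alternative quick check (a minimal sketch) that counts the NAs directly:

# Number of missing values in emp_length
sum(is.na(loan_data$emp_length))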

Missing input: strategies

There are three main ways to handle missing inputs:

  1. Delete the rows with missing inputs (or the entire variable)
  2. Replace them (imputation)
  3. Keep them

Delete rows

The code below deletes every row that has an NA value in the emp_length field.
Newly introduced code: which(is.na()).

# Find the row indices where emp_length is NA
index_NA <- which(is.na(loan_data$emp_length))

# Delete those rows from the data set
loan_data_no_NA <- loan_data[-c(index_NA), ]

Delete column

The code below will delete the entire column.

# Make a copy of the data set under a different name before deleting the column
loan_data_delete_employ <- loan_data

# Setting a column to NULL removes it from the data frame
loan_data_delete_employ$emp_length <- NULL

Replacing a variable

A common practice for replacing missing values is to use the median of the variable. This is called median imputation.

# Find the row indices where emp_length is NA
index_NA <- which(is.na(loan_data$emp_length))

# Save a copy of the data set
loan_data_replace <- loan_data

# Replace the NA values in emp_length with the median of the observed values
loan_data_replace$emp_length[index_NA] <- median(loan_data$emp_length, na.rm = TRUE)

Keeping NA’s

In some situations, we want to keep the NAs because the missingness itself is informative for the analysis.
A problem arises when we use a model that deletes NAs automatically.
To keep models from deleting NAs automatically, we can use coarse classification, which puts a continuous variable into "bins", with the NAs assigned to a bin of their own.

Example: let's create a new variable emp_cat that will replace emp_length.

  • emp_length contains values from 0 to 62 years.
  • We want to put these values in bins of 15 years.
  • Groups: "0-15", "15-30", "30-45", "45+", "Missing"

If we plot the new field, we see that the 0-15 group carries a very high proportion of the data set.

You should redo the process (trial and error) and create more granular bins within the 0-15 group, as in the sketch after this paragraph.
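A possible refinement (a minimal sketch; the finer bin edges and the variable name emp_cat_fine are illustrative choices, not from the course):

# Finer bins for the crowded 0-15 range, built with cut()
loan_data$emp_cat_fine <- cut(loan_data$emp_length,
                              breaks = c(0, 5, 10, 15, 30, 45, Inf),
                              labels = c("0-5", "5-10", "10-15", "15-30", "30-45", "45+"),
                              include.lowest = TRUE)

# Keep the NAs by giving them their own "Missing" bin
levels(loan_data$emp_cat_fine) <- c(levels(loan_data$emp_cat_fine), "Missing")
loan_data$emp_cat_fine[is.na(loan_data$emp_cat_fine)] <- "Missing"

# Plot the distribution over the new bins
plot(loan_data$emp_cat_fine)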

Exercise 1) Deleting missing data

Instructions:

  • Take a look at the number of missing inputs for the variable int_rate using summary().
  • Use which() and is.na() to create an index of the observations without a recorded interest rate. Store the result in the object na_index.
  • Create a new data set called loan_data_delrow_na, which does not contain the observations with missing interest rates.
  • Recall that we made a copy of loan_data called loan_data_delcol_na. Instead of deleting the observations with missing interest rates, delete the entire int_rate column by setting it equal to NULL.
# Look at summary of loan_data
summary(loan_data$int_rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    5.42    7.90   10.99   11.00   13.47   23.22    2776
# Get indices of missing interest rates: na_index
na_index <- which(is.na(loan_data$int_rate))

# Remove observations with missing interest rates: loan_data_delrow_na
loan_data_delrow_na <- loan_data[-na_index, ]

# Make copy of loan_data
loan_data_delcol_na <- loan_data

# Delete interest rate column from loan_data_delcol_na
loan_data_delcol_na$int_rate <- NULL

Exercise 2) Replacing missing data

Instructions:

  • Create an object called median_ir, containing the median of the interest rates in loan_data using the median() function. Don’t forget to include the argument na.rm = TRUE.
  • In the new data set loan_data_replace, replace all the missing instances in the indices stored in object na_index with the median of all the interest rates, median_ir.
  • Have a look at the variable int_rate in the new data set using summary() to make sure that the NAs are gone.
# Compute the median of int_rate
median_ir <- median(loan_data$int_rate, na.rm = TRUE)

# Make copy of loan_data
loan_data_replace <- loan_data

# Replace missing interest rates with median
loan_data_replace$int_rate[na_index] <- median_ir

# Check if the NAs are gone
summary(loan_data_replace$int_rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.42    8.49   10.99   11.00   13.11   23.22

Exercise 3) Keeping missing data

Instructions:
Use Coarse Classification to keep your NA data by binning the NA’s to their own bin.

  • Make the necessary changes to the code provided to coarse classify int_rate, saving the result to a new variable called ir_cat.
    • First, replace loan_data$emp_cat with loan_data$ir_cat where it occurs in the R script, as well as replacing loan_data$emp_length with loan_data$int_rate.
    • Next, the variables should be binned in categories “0-8”, “8-11”, “11-13.5”, and “13.5+” (replacing “0-15”,“15-30”,“30-45” and “45+”). Usage of > and <= is exactly as in the video. Make sure to change the numbers in the conditional statements too (15, 30 and 45 should be changed to 8, 11 and 13.5 respectively).
  • Look at your new variable ir_cat using plot(loan_data$ir_cat).
#Question
# Make the necessary replacements in the coarse classification example below 
loan_data$emp_cat <- rep(NA, length(loan_data$emp_length))

loan_data$emp_cat[which(loan_data$emp_length <= 15)] <- "0-15"
loan_data$emp_cat[which(loan_data$emp_length > 15 & loan_data$emp_length <= 30)] <- "15-30"
loan_data$emp_cat[which(loan_data$emp_length > 30 & loan_data$emp_length <= 45)] <- "30-45"
loan_data$emp_cat[which(loan_data$emp_length > 45)] <- "45+"
loan_data$emp_cat[which(is.na(loan_data$emp_length))] <- "Missing"

loan_data$emp_cat <- as.factor(loan_data$emp_cat)

# Look at your new variable using plot()
#Answer
# Make the necessary replacements in the coarse classification example below 
loan_data$ir_cat <- rep(NA, length(loan_data$int_rate))

loan_data$ir_cat[which(loan_data$int_rate <= 8)] <- "0-8"
loan_data$ir_cat[which(loan_data$int_rate > 8 & loan_data$int_rate <= 11)] <- "8-11"
loan_data$ir_cat[which(loan_data$int_rate > 11 & loan_data$int_rate <= 13.5)] <- "11-13.5"
loan_data$ir_cat[which(loan_data$int_rate > 13.5)] <- "13.5+"
loan_data$ir_cat[which(is.na(loan_data$int_rate))] <- "Missing"

loan_data$ir_cat <- as.factor(loan_data$ir_cat)

# Look at your new variable using plot()
plot(loan_data$ir_cat)

Data Splitting & Confusion Matrices

After the data is fully preprocessed, we can start the analysis.
You could use the same data set both to fit the model and to evaluate the results, but this usually leads to performance estimates that are too optimistic.


We need to split the data into two pieces, so we can fit the model on the training set and evaluate it on the test set.
A common way of splitting the data is to use 2/3 for training and 1/3 for testing. The performance estimate can vary a lot depending on which 2/3 of the data ends up in the training set.

One way to reduce this variation is to use cross-validation.

For a 2/3 training set and 1/3 test set, a cross-validation variant looks like this:

The data is split into three equal parts; in each round, two parts form the training set and the remaining part is the test set, and the parts rotate through the test role. You could technically use as many parts as you want, but that means fitting the model once per part, which makes it computationally heavy. In this course, we'll just have one training set and one test set.
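A rough sketch of how the three-part split could be set up (the fold assignment below is only an illustration, not code used later in the course):

# Assign each row of loan_data to one of three folds at random
set.seed(123)
fold <- sample(rep(1:3, length.out = nrow(loan_data)))

# One round of the rotation: folds 1 and 2 train, fold 3 tests
training_cv <- loan_data[fold != 3, ]
test_cv     <- loan_data[fold == 3, ]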

Imagine you have just fit a model, and now we want to apply it to our test set to see how good the results are.
Evaluating the model for credit risk means comparing the observed outcomes, default vs. non-default, stored in the loan_status field of the test set (test_set$loan_status), with the predicted outcomes (model_prediction).
Here, we're just using 14 values to demonstrate the concept, but usually the confusion matrix is used to evaluate a much larger set of values.

A confusion matrix is a contingency table of correct and incorrect classifications. The CORRECT classifications lie on the diagonal from top left to bottom right. Here there are 8 correctly classified non-default customers and 3 correctly classified default customers.

The INCORRECT classifications lie on the diagonal from top right to bottom left.
There were 2 non-defaulters incorrectly classified as defaulters, and 1 defaulter incorrectly classified as a non-defaulter.

Three important measures can be read from the confusion matrix.

  1. Accuracy: the diagonal; the percentage of correctly classified instances.
     • (8 + 3) / 14 = 78.57%
  2. Sensitivity: the bottom row; the percentage of bad customers (defaulters) that are classified correctly.
     • 3 / (1 + 3) = 75%
  3. Specificity: the top row; the percentage of good customers (non-defaulters) that are classified correctly.
     • 8 / (8 + 2) = 80%

Exercise 1) Splitting the data set

  1. Setting the seed: setting the seed fixes the starting point of the random number generator, so analyses that use random numbers can be reproduced exactly.
  2. sample(): use the sample() function to randomly assign observations to the training and test set. Here the function takes two arguments.
  • The 1st argument is the vector we sample from. To randomly pick row numbers as indices, create the vector of row numbers: 1:nrow(loan_data).
  • The 2nd argument is the number of items to choose. Rule of thumb (two-thirds rule): 2/3 x nrow(loan_data).

Instructions:

  • Set a seed of 567 using the set.seed() function.
  • Store the row indices of the training set in the object index_train. Use the sample() function with a first and a second argument as discussed above.
  • Create the training set by selecting the row numbers stored in index_train from the data set loan_data. Save the result to training_set.
  • The test set contains the rows that are not in index_train. Copy the code that you used to create the training set, but use the negative sign (-) right before index_train inside the square brackets. Save the result to test_set.
# Set seed of 567
set.seed(567)

# Store row numbers for training set: index_train
index_train <- sample(1:nrow(loan_data), 2/3*nrow(loan_data))

# Create training set: training_set
training_set <- loan_data[index_train, ]

# Create test set: test_set
test_set <- loan_data[-index_train, ]

Exercise 2) Creating a confusion matrix

The predicted outcome is not available here, but in the DataCamp console it is stored in a vector called model_pred.
We want to use the confusion matrix to see how well the model performed.
Compare the actual loan status (test_set$loan_status) to the predicted values (model_pred) using the table() function.

(This was the weakest exercise in the chapter, since model_pred is only available in the DataCamp console.)

# Create confusion matrix
conf_matrix <- table(test_set$loan_status, model_pred)

# Compute classification accuracy: correctly classified / all observations in the test set
(6092 + 349) / nrow(test_set)

# Compute sensitivity: correctly predicted defaults / all actual defaults in the test set
349 / 1037