Note: Estimated Read-time ~ 30 Minutes
In this project/case study, I will discuss some ways to forecast the chance that a specific customer who applies for a loan will default on their payments (or not).
First of all, to "forecast" something is a strange expression, isn't it? Most people associate "forecasting" with the daily weather report - and we all know that this kind of forecast is not reliable all the time.
So why make the effort of trying to look into the future to estimate a default probability? Why don't banks just give money to anyone who needs it, accepting that nothing can be done about bad debtors?
That's of course not how this works - the most intuitive way (and still a really popular one) is that a skilled and experienced loan officer does his/her best to decide whether an applicant gets approved - or not.
Well, even though I have personally lent a few bucks to others via P2P lending, I don't consider myself "skilled" and "experienced" enough to estimate the creditworthiness of a customer. But with some "statistics magic", I will try my best, so let's dive right in!
Note: This little project is strongly based on Lore Dirick's "Credit Risk Modeling in R" DataCamp course, but I will try to make the principles and ideas behind every piece of code and calculation even more straightforward to grasp.
# Import Data
database <- readRDS("loan_data_ch1.rds")
# Show first 4 lines
head(database[1:6], n = 4)
## loan_status loan_amnt int_rate grade emp_length home_ownership
## 1 0 5000 10.65 B 10 RENT
## 2 0 2400 NA C 25 RENT
## 3 0 10000 13.49 C 13 RENT
## 4 0 5000 NA A 3 RENT
Let's get started - if you are new to programming with R, please don't mind the gray-backgrounded boxes; you don't have to understand exactly what they mean. They just contain the R code I execute to produce the results shown in the white boxes. They are also useful if you want to replicate everything I do here.
Note: If you want to understand what a line of code does, just look at the #comment above it.
In this case, I import the necessary dataset from DataCamp (see the link above) and display the first 4 lines (plus header) of our new database.
But that's not all - we have about 29,000 customers (who were already given a loan, so we know whether they repaid it or not) to "play with", as you can see here:
tail(database[1:6], n = 4) # Show last 4 lines
## loan_status loan_amnt int_rate grade emp_length home_ownership
## 29089 0 8500 10.28 C 3 RENT
## 29090 0 5000 8.07 A 0 MORTGAGE
## 29091 0 5000 7.43 A 0 MORTGAGE
## 29092 0 7500 NA E 0 OWN
What can we conclude by looking at these two boxes? We have 29,092 rows - one per customer - and the columns loan_status, loan_amnt, int_rate, grade, emp_length and home_ownership. The table actually has two more columns that were cut off above; here they are:
head(database[7:8], n = 4) # Show first 4 lines of the 2 missing columns
## annual_inc age
## 1 24000 33
## 2 12252 31
## 3 49200 24
## 4 36000 39
We now have our database and we roughly know what it looks like. But that's not enough - with some simple functions in R, we can gain even more detailed information. Instead of just building a statistical model without knowing what it does, let's have a look at some interesting tables:
library(gmodels)
Note: Up to this point, I have only worked with the base version of R. But since R is an open-source programming language, basically everyone can contribute to make R even more powerful (a bit like Wikipedia). In order to use functions built by other users/developers, we need to import them first. Every time you see me calling library(), I am making use of an external R package.
CrossTable(database$loan_status)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 29092
##
##
## | 0 | 1 |
## |-----------|-----------|
## | 25865 | 3227 |
## | 0.889 | 0.111 |
## |-----------|-----------|
##
##
##
##
So this table is already way more interesting, isn't it? It simply splits our data and shows the number of people who successfully repaid their loan - and those who didn't.
About 3,227 customers (or 11.1%) defaulted on their loans. Please note that we don't know at which point of the loan duration they defaulted - we only know that at some point these customers were no longer able to pay interest and principal (I will come back to this later).
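By the way, if you don't feel like installing gmodels, base R can give you the same counts and shares - a quick equivalent:
# Same information with base R only
table(database$loan_status)             # counts of repaid (0) and defaulted (1) loans
prop.table(table(database$loan_status)) # the same as shares (~0.889 vs. ~0.111)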
We can do the same thing to find out how many people lived under which housing circumstances:
CrossTable(database$home_ownership)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 29092
##
##
## | MORTGAGE | OTHER | OWN | RENT |
## |-----------|-----------|-----------|-----------|
## | 12002 | 97 | 2301 | 14692 |
## | 0.413 | 0.003 | 0.079 | 0.505 |
## |-----------|-----------|-----------|-----------|
##
##
##
##
The biggest group of customers were paying rent (~50%), while the second biggest group lived in a mortgage-financed house (~41%). Only ~8% lived in a house they owned outright.
Well, I honestly don't know what OTHER could mean. Apparently, there are 97 customers who may have lived under a bridge or temporarily rent-free somewhere else (maybe soldiers living in barracks, or grown adults living rent-free at "Hotel Mama").
What we don't see yet is how home ownership affects the default probability. That would be quite interesting - what do you think? Which group has the highest chance of defaulting?
The CrossTable function can give us the answer:
CrossTable(database$home_ownership, database$loan_status, prop.r = TRUE,
prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 29092
##
##
## | database$loan_status
## database$home_ownership | 0 | 1 | Row Total |
## ------------------------|-----------|-----------|-----------|
## MORTGAGE | 10821 | 1181 | 12002 |
## | 0.902 | 0.098 | 0.413 |
## ------------------------|-----------|-----------|-----------|
## OTHER | 80 | 17 | 97 |
## | 0.825 | 0.175 | 0.003 |
## ------------------------|-----------|-----------|-----------|
## OWN | 2049 | 252 | 2301 |
## | 0.890 | 0.110 | 0.079 |
## ------------------------|-----------|-----------|-----------|
## RENT | 12915 | 1777 | 14692 |
## | 0.879 | 0.121 | 0.505 |
## ------------------------|-----------|-----------|-----------|
## Column Total | 25865 | 3227 | 29092 |
## ------------------------|-----------|-----------|-----------|
##
##
This table looks a bit more confusing, but it's quite powerful and gives us some really interesting insights.
Can you see which home-ownership group has the highest chance of defaulting? (Remember: default = loan_status 1.)
With a default rate of 17.5%, the OTHER group is the most likely to default. So it seems that living under a bridge or rent-free with your parents is not a good sign for your creditworthiness. Please note that - due to the small number of OTHER customers - this is not a statistically meaningful conclusion.
Surprisingly, the best debtors (= lowest default probability) are those who live in a mortgage-financed house! That's quite interesting - intuitively, you would think that those who own their house outright are more likely to pay back their loans. Well, that's a topic better discussed somewhere else.
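If you only care about the default rate per group (and not the whole table), a compact base R one-liner does the trick:
# Default rate per home-ownership category (row proportions, column "1")
prop.table(table(database$home_ownership, database$loan_status), margin = 1)[, "1"]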
Remember that every customer was given a grade from A to G, representing their creditworthiness? Obviously, someone has already done some analysis of the default probability. With this information, we can evaluate the existing grading system:
CrossTable(database$grade, database$loan_status, prop.r = TRUE,
prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 29092
##
##
## | database$loan_status
## database$grade | 0 | 1 | Row Total |
## ---------------|-----------|-----------|-----------|
## A | 9084 | 565 | 9649 |
## | 0.941 | 0.059 | 0.332 |
## ---------------|-----------|-----------|-----------|
## B | 8344 | 985 | 9329 |
## | 0.894 | 0.106 | 0.321 |
## ---------------|-----------|-----------|-----------|
## C | 4904 | 844 | 5748 |
## | 0.853 | 0.147 | 0.198 |
## ---------------|-----------|-----------|-----------|
## D | 2651 | 580 | 3231 |
## | 0.820 | 0.180 | 0.111 |
## ---------------|-----------|-----------|-----------|
## E | 692 | 176 | 868 |
## | 0.797 | 0.203 | 0.030 |
## ---------------|-----------|-----------|-----------|
## F | 155 | 56 | 211 |
## | 0.735 | 0.265 | 0.007 |
## ---------------|-----------|-----------|-----------|
## G | 35 | 21 | 56 |
## | 0.625 | 0.375 | 0.002 |
## ---------------|-----------|-----------|-----------|
## Column Total | 25865 | 3227 | 29092 |
## ---------------|-----------|-----------|-----------|
##
##
If you want to understand this kind of table, don't try to read every single cell. Just look at the really interesting part: how the default probability evolves from A (= best grade) to G (= worst grade).
It seems that this grading model is quite useful - can you see the clear relationship? With every downgrade, the customers in that group become more likely to default. "A"-graded customers have a 5.9% default probability on average, while the worst grade, "G", comes with an enormous default rate of 37.5%!
Of course, I don't want to do this for each of our 8 variables.
A more intuitive way of displaying such a relationship is to plot it:
Note: Usually you don't display the code when you plot something, since the only interesting thing is the actual plot. But for reasons of replicability, I will show the code anyway.
# Creating a variable that contains our CrossTable shown above
data <-CrossTable(database$grade, database$loan_status, prop.r = TRUE, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)
# Creating a variable that contains only the default probability
data_relationship <- data$prop.row[,2]
# Plotting it
position <- data_relationship / 2
text(x = barplot(data_relationship),labels=names(data_relationship), y = position)
title("The worse the grade, the higher the default probability")
Now the relationship is even easier to see!
While the step from one grade to the next gets smaller all the way down to "E", it changes dramatically from "E" to "F" and even more from "F" to "G". That suggests the bank should probably stop giving loans to "F"- and "G"-graded debtors, or at least demand a much higher interest rate from them.
In R, it's really straightforward to show this. All we do is plot the differences between consecutive grades:
difference <- diff(data_relationship)
plot(difference, type = "b", xlab = "Grades", ylab = "Changes in Default Probability",xaxt="n")
axis(1, at=1:6, labels=names(difference))
title("Probability of Default are changing dramatically from E/F and F/G")
I think we have gained enough insights from our data. We now know that the grade is a good indicator for estimating the default probability - and home ownership is also a variable worth considering when deciding whether to give a future customer a loan.
By now we actually have a really simple forecasting system: imagine you are a loan officer, responsible for deciding whether a customer gets a loan or not. When someone enters your office asking for money, you could just ask, "Okay, what's your grade?" When the customer says, "My grade is C", you simply look up the CrossTable shown above to estimate how likely this specific customer is to default.
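In R, this naive lookup "model" is literally a one-liner - here is a little sketch of the idea:
# Historical default rate per grade: our lookup table
grade_rates <- tapply(database$loan_status, database$grade, mean)
grade_rates["C"]   # estimated default chance for a grade-C applicant (~14.7%, see the table above)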
This approach is - of course - not really scientifically sound, because we should incorporate every variable we have (and the relationships between them).
Note: This approach is actually part of descriptive statistics (gaining insights from past data), while what we are trying to do here belongs to predictive statistics (estimating a future outcome).
A non-trivial and critical task for every aspiring data scientist is to prepare the data for further analysis. Admittedly, this may not be the most interesting part of the project - feel free to skip it and jump ahead to Building the Forecast Model (via the table of contents on the left).
After this chapter, our database will look a little different - but don't worry, I will briefly explain the changes made.
summary(database)
## loan_status loan_amnt int_rate grade
## Min. :0.0000 Min. : 500 Min. : 5.42 A:9649
## 1st Qu.:0.0000 1st Qu.: 5000 1st Qu.: 7.90 B:9329
## Median :0.0000 Median : 8000 Median :10.99 C:5748
## Mean :0.1109 Mean : 9594 Mean :11.00 D:3231
## 3rd Qu.:0.0000 3rd Qu.:12250 3rd Qu.:13.47 E: 868
## Max. :1.0000 Max. :35000 Max. :23.22 F: 211
## NA's :2776 G: 56
## emp_length home_ownership annual_inc age
## Min. : 0.000 MORTGAGE:12002 Min. : 4000 Min. : 20.0
## 1st Qu.: 2.000 OTHER : 97 1st Qu.: 40000 1st Qu.: 23.0
## Median : 4.000 OWN : 2301 Median : 56424 Median : 26.0
## Mean : 6.145 RENT :14692 Mean : 67169 Mean : 27.7
## 3rd Qu.: 8.000 3rd Qu.: 80000 3rd Qu.: 30.0
## Max. :62.000 Max. :6000000 Max. :144.0
## NA's :809
Most of the time when you handle a large dataset, you are confronted with NA's (missing values). In this case, there are many NA's in the int_rate column and some in the emp_length column.
head(database[1:6], n = 4)
## loan_status loan_amnt int_rate grade emp_length home_ownership
## 1 0 5000 10.65 B 10 RENT
## 2 0 2400 NA C 25 RENT
## 3 0 10000 13.49 C 13 RENT
## 4 0 5000 NA A 3 RENT
For example, no annual interest rate is given for customers 2 and 4. This can have different reasons (failed recording, the customer refused to share the information, etc.).
There are several ways to handle them:
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
data_locf <- na.locf(database)
head(data_locf[1:6], n = 5)
## loan_status loan_amnt int_rate grade emp_length home_ownership
## 1 0 5000 10.65 B 10 RENT
## 2 0 2400 10.65 C 25 RENT
## 3 0 10000 13.49 C 13 RENT
## 4 0 5000 13.49 A 3 RENT
## 5 0 3000 13.49 E 9 RENT
Look, the NA's are gone! What kind of sorcery is this? In this example, I used the na.locf() function ("last observation carried forward") from the zoo package (loaded here as a dependency of xts) to simply fill each NA with the last observed value. Customer 2 now has the same rate as customer 1, and customer 4 the same as customer 3.
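To see what na.locf() really does, here it is on a tiny made-up vector:
# Each NA is replaced by the last known value before it
na.locf(c(1, NA, NA, 4))   # returns 1 1 1 4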
Here is the other direction ("next observation carried backward"):
data_nocb <- na.locf(database, fromLast = TRUE)
head(data_nocb[1:6], n = 5)
## loan_status loan_amnt int_rate grade emp_length home_ownership
## 1 0 5000 10.65 B 10 RENT
## 2 0 2400 13.49 C 25 RENT
## 3 0 10000 13.49 C 13 RENT
## 4 0 5000 12.69 A 3 RENT
## 5 0 3000 12.69 E 9 RENT
Well, while both of these approaches come in really handy when you work with time series data (like stock prices), in this particular case they don't make any sense. Why should customer 2 have the same rate as customer 1 or 3? They are not related to each other in any way - so let's dismiss this idea.
But wait... what if we don't want to remove all those NA's or replace them with (meaningless) numbers? What if a missing interest rate is actually information we can use to improve our forecast?
Note: In data analysis, this is called MNAR (Missing Not At Random), which means that the fact that a value is missing may itself be related to the (unobserved) value.
Let's try something different: why don't we create bins of interest rates (like 0-7, 7-9, etc.)? Since these bins are non-numerical factors, we can add one last bin called "Missing" and still use those customers in our calculations!
database$int_bin <- rep(NA, length(database$int_rate))
database$int_bin[which(database$int_rate <= 7)] <- "0-7"
database$int_bin[which(database$int_rate > 7 & database$int_rate <= 9)] <- "7-9"
database$int_bin[which(database$int_rate > 9 & database$int_rate <= 11)] <- "9-11"
database$int_bin[which(database$int_rate > 11 & database$int_rate <= 13.5)] <- "11-13.5"
database$int_bin[which(database$int_rate > 13.5)] <- "13.5+"
database$int_bin[which(is.na(database$int_rate))] <- "Missing"
database$int_bin <- as.factor(database$int_bin)
Okay, I know - that's a lot of confusing code. Let me explain it:
- First, a new column int_bin is created, filled only with NA's and with the same length as all the other columns.
- Each of the which() lines then assigns the matching customers to a bin, based on their interest rate.
- Customers with a missing interest rate end up in the "Missing" bin.
- Finally, as.factor() turns the new column into a categorical variable (a factor).
Here you can see how many customers are in each bin (maybe I should make the first bin larger?):
plot(database$int_bin)
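As a side note, the same binning can be written much more compactly with base R's cut() function. A sketch that should reproduce the bins above (I assume the same boundary handling as the which() version):
# Alternative: build the interest-rate bins with cut() instead of many which() calls
int_bin_alt <- cut(database$int_rate,
                   breaks = c(-Inf, 7, 9, 11, 13.5, Inf),
                   labels = c("0-7", "7-9", "9-11", "11-13.5", "13.5+"))
int_bin_alt <- addNA(int_bin_alt)                             # keep NA as its own factor level
levels(int_bin_alt)[is.na(levels(int_bin_alt))] <- "Missing"  # ...and rename that level to "Missing"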
Now we do the same for the emp_length column. To make this look a little less confusing, I removed the code for the bin creation. Just know that there are new bins for emp_length and they look like this:
plot(database$emp_bin)
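For replicability, here is a sketch of what that hidden code presumably looks like - the cut points (0-1, 1-3, 3-7, 7+ and Missing) are taken from the summary further below; the exact boundary handling is my assumption:
# Presumed emp_length binning (same pattern as for int_rate; boundaries assumed)
database$emp_bin <- rep(NA, length(database$emp_length))
database$emp_bin[which(database$emp_length <= 1)] <- "0-1"
database$emp_bin[which(database$emp_length > 1 & database$emp_length <= 3)] <- "1-3"
database$emp_bin[which(database$emp_length > 3 & database$emp_length <= 7)] <- "3-7"
database$emp_bin[which(database$emp_length > 7)] <- "7+"
database$emp_bin[which(is.na(database$emp_length))] <- "Missing"
database$emp_bin <- as.factor(database$emp_bin)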
Now that we have new, improved columns for the interest rate and the employment length, let's remove the old ones:
database$int_rate <- NULL
database$emp_length <- NULL
Before I finish the data preparation, let me look for something called "outliers". Outliers are data points that differ significantly from all the others. Imagine you have a dataset with people's ages ranging from 18 to 60 - and then there is one person who is supposedly 1,000 years old. Obviously, this can't be right, so you have to remove that data point before you go ahead and analyse the data.
Let's have another look at our summary:
summary(database)
## loan_status loan_amnt grade home_ownership
## Min. :0.0000 Min. : 500 A:9649 MORTGAGE:12002
## 1st Qu.:0.0000 1st Qu.: 5000 B:9329 OTHER : 97
## Median :0.0000 Median : 8000 C:5748 OWN : 2301
## Mean :0.1109 Mean : 9594 D:3231 RENT :14692
## 3rd Qu.:0.0000 3rd Qu.:12250 E: 868
## Max. :1.0000 Max. :35000 F: 211
## G: 56
## annual_inc age int_bin emp_bin
## Min. : 4000 Min. : 20.0 0-7 :3342 0-1 :6260
## 1st Qu.: 40000 1st Qu.: 23.0 11-13.5:6954 1-3 :6505
## Median : 56424 Median : 26.0 13.5+ :6002 3-7 :7726
## Mean : 67169 Mean : 27.7 7-9 :5150 7+ :7792
## 3rd Qu.: 80000 3rd Qu.: 30.0 9-11 :4868 Missing: 809
## Max. :6000000 Max. :144.0 Missing:2776
##
Feel free to compare the new summary to the old one computed above.
You now know what an "outlier" is - and actually, there is a really big one in here. Can you spot it?
Maybe this plot can help you:
plot(database$age, ylab = "Age")
Do you see this lonely point at the top of the image? Apparently, one of the customers was 144 years old! This is either Wolverine asking for money, or it's a mistake - maybe it was meant to be "44" and someone added a 1? Who knows, but let's remove it:
# Save the Position of all Ages above 100
index_highage <- which(database$age > 100)
# Delete every Row with Ages above 100
database <- database[-index_highage, ]
Great! How old is our new oldest customer?
max(database$age)
## [1] 94
Okay, a 94-year-old who applied for a loan? At least that's more realistic than a 144-year-old, so let's keep it.
If you looked at the summary closely, you may have spotted another outlier:
plot(database$annual_inc, ylab = "Annual Income")
Here we have a few of them! Why would you apply for a loan of a few thousand dollars when you earn more than 1 million $/year? This may be legitimate, but let's remove these customers anyway.
index_highincome <- which(database$annual_inc > 1000000)
database <- database[-index_highincome, ]
If you skipped the last two parts, here are the changes made: the interest rate and the employment length were turned into the factor columns int_bin and emp_bin (each with an extra "Missing" level) and the original columns were removed, and the outliers - customers older than 100 years or with an annual income above $1,000,000 - were dropped.
Let's get to the "meat and potatoes" part of this project - building the forecast model.
I will briefly explain how we will do this:
Remember that we want to forecast the default probability of future customers? We don't want to find out whether past customers were able to repay their loans - we already know that!
If you ever want to build a forecast model (or something similar, like a trading backtest), NEVER EVER fit your model on your whole dataset. Please remember this - NEVER. Just don't.
Instead, we split our data into 2 groups - this process is called sampling:
# Create the training set: the first 20,000 customers
training <- database[1:20000,]
# Create the test set: the remaining customers
test <- database[20000:29092,]
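A quick side note: splitting by row position only works well if the rows are in no particular order. A random split is usually the safer habit - a sketch (the object names here are made up):
# Alternative: draw a random 20,000-customer training sample instead of the first 20,000 rows
set.seed(42)                                 # any fixed seed, just for reproducibility
train_idx  <- sample(nrow(database), 20000)  # 20,000 random row positions
training_r <- database[train_idx, ]          # random training set
test_r     <- database[-train_idx, ]         # all remaining rows form the test set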
We will only fit the model on the training set - we are not allowed to touch the test set in any way.
But why? Wouldn't a model built on all 29,000 customers be more useful? The answer is: no, absolutely not.
In statistics, this is connected to something called overfitting.
Let me give you an example:
Imagine you are an intern at a bank and your task is to find a trading system that - literally - printed money in the past. So you look at your database (4 weeks of stock data in May) and you see something interesting: every Wednesday, in all 4 weeks, there was a really sharp uptrend! On average, you find that stock prices rose between +1% and +3% - every single Wednesday! Wow, that was easy. Now you simulate a backtest in which your system bought every Wednesday, and the system made 10% in only 1 month! That's great - now you can show your results to your manager and get a well-paid permanent contract, because you just showed your bank that it can make 10% per month by simply buying every Wednesday.
You probably know that this is not how it works - why? Because the intern in this example was totally overfitting his system: he built the model on all 4 weeks, and how did he test it? Right - on the same 4 weeks! That's an absolute no-go - you never, ever test your model on the same data it was built on.
In this example, the intern should check how his "money-printing machine" performs when tested on a different month (like June).
I hope you get the concept of overfitting. An overfitted model is a useless model.
Note: If you ever encounter an ad claiming something like "this trading system made 1000% in the last 2 years", you can be sure it's just a hopelessly overfitted model, used to bait naive people into buying it. Research has shown that overfitted trading systems are very likely to lose money when used in real life.
Let's find out which variables are actually useful for our model:
model_age <- glm(loan_status ~ age, family = "binomial", data = training)
model_age
##
## Call: glm(formula = loan_status ~ age, family = "binomial", data = training)
##
## Coefficients:
## (Intercept) age
## -1.905871 -0.009138
##
## Degrees of Freedom: 19999 Total (i.e. Null); 19998 Residual
## Null Deviance: 13320
## Residual Deviance: 13310 AIC: 13320
You may ask now: wait - what is this? No introduction to logistic regression or anything? What do all those numbers mean?
Look at the code we used:
glm(loan_status ~ age, family = "binomial", data = training)
- glm stands for "generalized linear model" and is a commonly used function for building this kind of forecast model.
- loan_status, the first variable we put into the function, is the outcome we want to predict.
- ~ age is the so-called predictor - simply put, the variable used to predict that outcome.
- family = "binomial" just means: there are 2 possible outcomes. A customer can either default - or not. If you want to predict more complex outcomes, you need another family, but that is way beyond this project.
- data = training is the data used. Remember: NEVER use the test set.
Basically, we tell the computer: "Hey, I have this variable I want to predict (loan_status). Please check whether you can predict its outcome with this other variable (age)."
As I said above - when a function returns a whole variety of outputs, try to concentrate on the part of interest; in this case, that's the coefficients:
model_age$coefficients
## (Intercept) age
## -1.905870516 -0.009138289
There is a reason we don't calculate these numbers by hand - there is some heavy math going on under the hood. If you have a PhD in statistics or math, you may know exactly what the glm function does, but for everyone with a "mortal profession", let me just interpret the results in plain English:
With all other variables held fixed, the older a customer is, the less likely he/she is to default on the payments. To be exact, the odds in favor of defaulting are multiplied by e^-0.009138289 = 0.9909033 - a decrease of roughly 1% - for every additional year of age.
Here is the exact calculation:
coefficient <- model_age$coefficients # intercept and age coefficient
e <- exp(1) # Euler's number e - remember it from school math?
Percentage <- e^coefficient
Percentage
## (Intercept) age
## 0.1486931 0.9909033
If you think this looks like complicated math, try checking out the formulas and calculations behind the glm function. This is not supposed to be an intermediate statistics course, so let us be satisfied with 2 things: the sign of a coefficient tells us the direction of the effect (negative = older customers default less often), and e raised to the coefficient tells us by how much the odds change per unit (here: per year of age).
Of course, we could now do this for all 8 variables. But while poor statistics students are forced to calculate this by hand, let us simply use our glm function again - this time on every variable!
model_all <- glm(loan_status ~ ., family = "binomial", data = training)
summary(model_all)
##
## Call:
## glm(formula = loan_status ~ ., family = "binomial", data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2542 -0.5137 -0.4157 -0.3148 3.6920
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.489e+00 1.558e-01 -15.972 < 2e-16 ***
## loan_amnt -1.389e-05 4.357e-06 -3.187 0.00144 **
## gradeB 6.698e-01 1.598e-01 4.192 2.77e-05 ***
## gradeC 8.696e-01 1.663e-01 5.229 1.70e-07 ***
## gradeD 1.131e+00 1.790e-01 6.317 2.66e-10 ***
## gradeE 1.390e+00 2.078e-01 6.689 2.25e-11 ***
## gradeF 1.767e+00 2.904e-01 6.087 1.15e-09 ***
## gradeG 2.503e+00 4.863e-01 5.148 2.64e-07 ***
## home_ownershipOWN -7.265e-02 9.550e-02 -0.761 0.44680
## home_ownershipRENT 6.617e-03 5.446e-02 0.122 0.90329
## annual_inc -6.867e-06 8.588e-07 -7.997 1.28e-15 ***
## age -5.040e-03 3.936e-03 -1.281 0.20030
## int_bin11-13.5 3.750e-01 1.843e-01 2.035 0.04185 *
## int_bin13.5+ 4.744e-01 1.923e-01 2.467 0.01364 *
## int_bin7-9 5.447e-01 1.054e-01 5.167 2.37e-07 ***
## int_bin9-11 1.081e-01 1.899e-01 0.569 0.56909
## int_binMissing 2.676e-01 1.668e-01 1.604 0.10871
## emp_bin1-3 8.777e-03 7.519e-02 0.117 0.90707
## emp_bin3-7 3.220e-02 7.186e-02 0.448 0.65409
## emp_bin7+ 1.589e-01 7.401e-02 2.146 0.03184 *
## emp_binMissing 7.237e-01 1.132e-01 6.392 1.64e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13317 on 19999 degrees of freedom
## Residual deviance: 12671 on 19979 degrees of freedom
## AIC: 12713
##
## Number of Fisher Scoring iterations: 5
I know I "promised" to keep things straightforward, and I don't want you to be overwhelmed by all the information displayed here, so let me make it simple: what is this?
glm(loan_status ~ ., family = "binomial", data = training)
As you can see in our call to the glm function, I now try to predict the loan_status outcome with the variable ".". This dot simply means that we use ALL other available variables as predictors.
Instead of just printing the results, I used the summary function to help us answer the following question: which variables should we include?
In statistics, this is called the significance of a variable, measured by its p-value. And again - I don't want to go too deep into the calculations; just have a look at the results. Can you see the asterisks (*) printed to the right of some variables?
They tell us how "important" each variable is for forecasting the loan_status outcome. Remember that in the chapter Further Insights I said that there are only 97 people (out of 29,000!) living under OTHER circumstances? I also said that 97 people are far too few to draw conclusions from.
The computer now tells us exactly the same thing - just "scientifically correct" and for EVERY variable we have.
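If you prefer plain numbers over counting stars, you can pull the p-values straight out of the summary object:
# Extract the p-value of every coefficient and sort them (smallest = most significant first)
pvals <- summary(model_all)$coefficients[, "Pr(>|z|)"]
sort(round(pvals, 4))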
Surprisingly, this will be the shortest chapter. Why? Because actually building the model is a really short task.
We now know 3 important things: which function (glm) we need to use, which data we are allowed to fit it on (only the training set), and which variables are actually significant.
So, here it is:
Final_Model <- glm(loan_status ~ loan_amnt + grade + annual_inc + int_bin + emp_bin, family = "binomial", data = training)
Note: We don't use variables like age or home_ownership because they are not significant. If you were trying to predict tomorrow's weather and found out that the amount of chocolate you ate today has absolutely no effect on it, would you use that variable for your prediction? No, because it's not significant.
We have nearly made it! In the next chapter, I will try to predict the default chance of a single customer using only his age, so you can see what calculations are being made (be careful, they are not really easy to grasp).
And finally, we will test our newly built model on the UNUSED AND NEVER touched test set.
Before we try to predict the loan_status outcomes for roughly 9,000 customers, let me give you an example of how this works. Imagine we built our model ONLY with the age variable. Of course, this is not a good model, since age has absolutely NO significance and thus should not be used (check for yourself: the age variable doesn't have a single *).
But for the sake of explanation, it will do. We have already built this model:
model_age$coefficients
## (Intercept) age
## -1.905870516 -0.009138289
Do you remember that I used this information to make the following statement: with all other variables fixed, the default odds decrease by roughly 1% for every additional year of age?
But relative to what? You probably also wondered what this strange intercept could mean.
The intercept describes the reference case - here, a (hypothetical) customer of age 0. Every other customer's default chance is derived from this reference.
Here is the formula we will use to estimate how likely it is that loan_status equals 1, given the age of the customer:
\[ P(\text{loan\_status} = 1 \mid \text{age}) \approx \frac{1}{e^{-(\text{intercept} + \text{age} \times \text{coefficient})}} = e^{\,\text{intercept} + \text{age} \times \text{coefficient}} \]
(Strictly speaking, this expression is the odds of default rather than the probability - the exact logistic probability would be 1/(1 + e^-(intercept + age x coefficient)) - but for small default chances the difference is modest, so we stick with the simpler version used in the code below.) First, let us calculate the default chance of our reference case:
DefaultProp <- 1 / (e^-(-1.905870516+0*(-0.009138289)))
DefaultProp
## [1] 0.1486931
Now that we know this, we can calculate the expected default chance of any customer using nothing but their age!
Person 1, age 25:
DefaultProp_Person1 <- 1 / (e^-(-1.905870516+25*(-0.009138289))) #
DefaultProp_Person1
## [1] 0.1183241
A person aged 25 has a default chance of about 11.8%. So let us check my statement that this number should decrease by roughly 1% for a person aged 26.
Person 2, age 26 - Way 1 (the easy, simplified way):
Change <- -0.9909033/100
DefaultProp_Person2_Easy <- DefaultProp_Person1 * (1 + Change)
DefaultProp_Person2_Easy
## [1] 0.1171516
Way 2 (using the formula):
DefaultProp_Person2_Formula <- 1 / (e^-(-1.905870516+26*(-0.009138289))) #
DefaultProp_Person2_Formula
## [1] 0.1172478
Okay, the two results are not exactly identical - the "decrease by ~1%" shortcut is only an approximation (the exact multiplier per year of age is 0.9909) - but my statement was pretty much correct!
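Of course, you don't have to do this arithmetic by hand - predict() does it for you. Just note that type = "response" returns the exact logistic probability 1/(1 + e^-x), so the values will come out slightly lower than our simplified e^x shortcut (the little data frame below is made up just for this check):
# Let R compute the default probabilities for a 25- and a 26-year-old directly
predict(model_age, newdata = data.frame(age = c(25, 26)), type = "response")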
Imagine that on Monday you finish your work on the forecasting model, and now you want to use it and see whether it's useful or not. On Tuesday, a customer walks in: the first one of our test set:
test[1,] #Show the 1st Customer
## loan_status loan_amnt grade home_ownership annual_inc age int_bin
## 20005 0 15500 E MORTGAGE 415000 35 13.5+
## emp_bin
## 20005 1-3
You may think now: wait - we already know whether this customer defaulted, since loan_status equals 0. Why should we try to predict it? Simply put: our model doesn't know it! It doesn't know whether this customer repaid his/her loan; it only sees the values of the other variables shown.
So, would our model give this customer a loan?
predictions_1st <- predict(Final_Model, newdata = test[1,], type = "response")
predictions_1st
## 20005
## 0.02102588
Wow! Our model says that this customer will default with a chance of only about 2.10%! That's really low (compared to the 11.1% average), so our first customer would have been approved. And, surprise, surprise, our model was successful, since this customer did NOT default (loan_status = 0).
Let us have a look at our second customer:
test[2,]
## loan_status loan_amnt grade home_ownership annual_inc age int_bin
## 20006 0 22750 B RENT 58600 22 11-13.5
## emp_bin
## 20006 3-7
predictions_2nd <- predict(Final_Model, newdata = test[2,], type = "response")
predictions_2nd
## 20006
## 0.09362765
Now our model says that this customer is much more likely to default - with a chance of ~9.36%.
Would we give this customer a loan? With this question, I want to move on to our next important task.
In the last chapter, you saw that the 1st customer would be given a loan, since a default chance of ~2.10% is really low. But our 2nd customer is a bit more difficult... should we accept a ~9.36% risk or not?
Of course, we don't want to ask this question anew for every single customer, so let us define a cut-off level that simply says: "every customer with a predicted default chance above this cut-off level will be denied".
Instead of calculating the default chance of only 1 customer, let us calculate it for all 9,084 customers in the test set:
predictions_All <- predict(Final_Model, newdata = test, type = "response")
head(predictions_All)
## 20005 20006 20007 20008 20009 20010
## 0.02102588 0.09362765 0.08709034 0.06293566 0.22926239 0.16855996
We now have a really powerful result: an estimated default chance for every customer in the test set.
Now, let us play Fannie Mae and Freddie Mac ("FanFred") in the year 2007 (they basically gave a loan to everyone who could spell their own name correctly). Since we don't care about the default chance, we just give EVERYONE a loan - no matter what our model says:
# Set cut-off-level to 1.00
cutoff <- 1.00
pred_cutoff_0 <- ifelse(predictions_All > cutoff,1,0)
table(test$loan_status,pred_cutoff_0)
## pred_cutoff_0
## 0
## 0 7929
## 1 1155
The results are - as expected - devastating! Out of 9,084 customers, we gave all 9,084 a loan! While 7,929 (~87%) repaid, 1,155 (~13%) defaulted!
So the accuracy (the share of cases in which our model was correct) equals:
\[ \text{Accuracy} = \frac{\text{Right Choices}}{\text{Total Choices}} \]
Accuracy_0 <- 7929 / 9084
Accuracy_0
## [1] 0.8728534
The FanFred method was right in 87% of all cases, since it predicted 87% of all loan_status outcomes correctly.
Let us increase the complexity: after our bank was bailed out with billions of taxpayers' money because we gave everyone a loan, the government (our new owner) expects us to set our cut-off level to only 5%. So whenever our model says that a specific customer will default with a chance of more than 5%, that customer is denied:
cutoff <- 0.05
pred_cutoff_5 <- ifelse(predictions_All > cutoff,1,0)
table(test$loan_status,pred_cutoff_5)
## pred_cutoff_5
## 0 1
## 0 538 7391
## 1 43 1112
Oh no, it's nearly the same code, but now we have an additional column named "1"! Don't worry, the results are easy to read:
- 538 customers were approved and repaid their loan - RIGHT.
- 7,391 customers were denied although they would have repaid - FALSE, because they would have repaid.
- 1,112 customers were denied - RIGHT, because they would have defaulted.
- 43 customers were approved - FALSE, because they defaulted.
Ignoring the fact that we only gave 6.40% ((538 + 43) / 9084) of all customers a loan, at least we have only 43 defaulters (compared to 1,155).
To calculate our new accuracy, we have to modify the formula, because now there are 2 kinds of right choices (TL = approved and repaid, TD = denied and would have defaulted):
\[ \text{Accuracy} = \frac{TL + TD}{\text{Total Choices}} \]
Accuracy_5 <- (538 + 1112) / 9084
Accuracy_5
## [1] 0.181638
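A small tip so you don't have to type the counts in by hand: the correct predictions sit on the diagonal of the table, so the accuracy can be computed directly from it:
# Accuracy straight from the confusion matrix (diagonal = correct predictions)
conf_5 <- table(test$loan_status, pred_cutoff_5)
sum(diag(conf_5)) / sum(conf_5)   # same result as above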
Our model was right in only ~18% of all cases? That can't be true - is a conservative cut-off level of 5% really that bad, especially compared to the ridiculous FanFred system?
You probably figured out that it's not possible to evaluate a model by its accuracy alone. The intuition behind this is straightforward: what is more important - correctly forecasting someone who was given a loan and repaid it (= TL), or someone who was denied a loan and really would have defaulted (= TD)?
That's easy - in most cases, spotting a defaulter is far more important.
The simplified calculation behind this statement:
\[ \text{Profit if the debtor repays} = \sum_{i=1}^{n} \text{Interest Payment}_i \] where n = total loan duration in years.
Simply put - when a debtor repays their loan, the bank's profit is the sum of all interest payments.
But how much is lost when a customer defaults?
\[ \text{Loss if the debtor defaults} = \text{Loan Amount} - \sum_{i=1}^{n} \left( \text{Interest Payment}_i + \text{Principal Payment}_i \right) \]
In most cases, the loss (2nd formula) exceeds the profit (1st formula), so it's more important to catch a potential defaulter and prevent these losses.
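To make this concrete, here is a toy calculation with completely made-up numbers (ignoring discounting and amortization):
# Entirely made-up example: why one defaulter can wipe out several good loans
loan_amount <- 10000
profit_good <- 0.10 * loan_amount * 3  # ~3 years of 10% interest if fully repaid -> 3000
paid_back   <- 2000                    # assume the defaulter repaid 2000 before failing
loss_bad    <- loan_amount - paid_back # loss on one default                      -> 8000
loss_bad / profit_good                 # ~2.7 successful loans needed to offset one defaulter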
Now the government complains that we don't give enough loans to the citizens, preventing them from living the "American Dream" - financing unnecessary stuff like second cars, swimming pools and college tuition fees. So from now on, we are forced to set our cut-off level to 15%, as a more balanced approach:
cutoff <- 0.15
pred_cutoff_15 <- ifelse(predictions_All > cutoff,1,0)
table(test$loan_status,pred_cutoff_15)
## pred_cutoff_15
## 0 1
## 0 6118 1811
## 1 728 427
Feel free to evaluate the results on your own! (Try to understand what each number means.)
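(If you want to check your answer: with this cut-off, 6118 + 728 = 6846 customers get a loan, 728 of whom default, and the overall accuracy works out like this:)
# Accuracy for the 15% cut-off: correct predictions are on the diagonal of the table above
Accuracy_15 <- (6118 + 427) / 9084
Accuracy_15   # roughly 0.72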
There are 2 main questions left to answer:
- Is our model actually useful?
- Which of the 3 strategies (cut-off levels) is the best?
Unfortunately, with the data given in this dataset, it's not possible to say whether our model is useful, since we:
- don't know the interest payments the repaying customers actually made over their loan durations, and
- don't know how much the defaulting customers paid back before they failed.
Without knowing these 2 things, we can't compute the two key figures from "Profit vs. Loss - what's more important?", and thus we can't decide how many successful debtors we need to compensate for 1 defaulter.
But that wasn't the purpose of this project. We successfully built a model that is able to estimate the default probability of a customer who applies for a loan.
Next time you fill out a credit application, you'll know what happens to the data you provide and how the bank uses it to estimate whether you will repay your loan - or not.
Questions? Feel free to contact me on my LinkedIn account!