Debt and the problems that follow from it have been a prominent and controversial topic in the media recently. From student loans to car payments to mortgages, almost everyone carries some form of debt. This analysis investigates two questions: what are the causes of loan default, and which types of loans are the riskiest (that is, lead to more defaults and missed payments)? To answer them, we look at three types of models: logistic regression, a neural network and decision trees. With logistic regression we use statistical techniques to estimate the probability that someone goes into loan default and then use that probability to predict whether they will default. With the neural network we use cross-validation and ROC curves to assess accuracy. Lastly, we use decision trees to find key variables and cut-off values that determine whether someone with certain characteristics will default on their loans.
The data is a subset of a larger database that records who defaulted on their loans and who did not. The dataset contains 1000 observations and 16 variables of interest. Below are the variables and their descriptions:
Variable 1: Checking Amount (Numeric)
Variable 2: Term (in months) (Numeric)
Variable 3: Credit Score (Numeric)
Variable 4: Gender (Categorical)
Variable 5: Marital Status (Categorical)
Variable 6: Car Loan (1-Car Loan, 0-No Car Loan)
Variable 7: Personal Loan (1-Personal Loan, 0-No Personal Loan)
Variable 8: Home Loan (1-Home Loan, 0- No Home Loan)
Variable 9: Education Loan (1- Education Loan, 0- No Education Loan)
Variable 10: Employment Status
Variable 11: Loan Amount
Variable 12: Savings Amount
Variable 13: Employment Duration in Months
Variable 14: Age in Years
Variable 15: Number of Credit Accounts
Variable 16: Default Status (1-Yes, 0-No)
According to the source, 13 of the 16 variables are numeric and 3 are categorical. The loan-type indicators could also be treated as categorical variables, depending on the loan type and whether a borrower holds multiple loans. This is something we will investigate in the EDA. The most important variable is Default Status, as it is the dependent variable in this study.
An early look at the dataset, summarized below, shows no missing values. This is a great sign, as we do not have to impute or estimate missing values to shore up our analysis, and no further cleaning is needed before the analysis. There was some concern about incorrect values for the checking amount, since some values are negative, but on further investigation this is not a problem: it is common for individuals to have a negative balance in their accounts.
| Default | Checking_amount | Term | Credit_score | Gender | Marital_status | Car_loan | Personal_loan | Home_loan | Education_loan | Emp_status | Amount | Saving_amount | Emp_duration | Age | No_of_credit_acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. :0.0 | Min. :-665.0 | Min. : 9.00 | Min. : 376.0 | Length:1000 | Length:1000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Min. :0.000 | Length:1000 | Min. : 244 | Min. :2082 | Min. : 0.00 | Min. :18.00 | Min. :1.000 |
| 1st Qu.:0.0 | 1st Qu.: 164.8 | 1st Qu.:16.00 | 1st Qu.: 725.8 | Class :character | Class :character | 1st Qu.:0.000 | 1st Qu.:0.000 | 1st Qu.:0.000 | 1st Qu.:0.000 | Class :character | 1st Qu.:1016 | 1st Qu.:2951 | 1st Qu.: 15.00 | 1st Qu.:29.00 | 1st Qu.:1.000 |
| Median :0.0 | Median : 351.5 | Median :18.00 | Median : 770.5 | Mode :character | Mode :character | Median :0.000 | Median :0.000 | Median :0.000 | Median :0.000 | Mode :character | Median :1226 | Median :3203 | Median : 41.00 | Median :32.00 | Median :2.000 |
| Mean :0.3 | Mean : 362.4 | Mean :17.82 | Mean : 760.5 | NA | NA | Mean :0.353 | Mean :0.474 | Mean :0.056 | Mean :0.112 | NA | Mean :1219 | Mean :3179 | Mean : 49.39 | Mean :31.21 | Mean :2.546 |
| 3rd Qu.:1.0 | 3rd Qu.: 553.5 | 3rd Qu.:20.00 | 3rd Qu.: 812.0 | NA | NA | 3rd Qu.:1.000 | 3rd Qu.:1.000 | 3rd Qu.:0.000 | 3rd Qu.:0.000 | NA | 3rd Qu.:1420 | 3rd Qu.:3402 | 3rd Qu.: 85.00 | 3rd Qu.:34.00 | 3rd Qu.:3.000 |
| Max. :1.0 | Max. :1319.0 | Max. :27.00 | Max. :1029.0 | NA | NA | Max. :1.000 | Max. :1.000 | Max. :1.000 | Max. :1.000 | NA | Max. :2362 | Max. :4108 | Max. :120.00 | Max. :42.00 | Max. :9.000 |
The purpose of EDA is to examine the variables and the dataset for any trends, dependencies or anomalies that need to be addressed before modeling and analysis. In the sections below we look at the numeric variables, their relationships and patterns, and whether they need to be transformed. We also look at the categorical variables, their dependencies and any patterns observed. Lastly, based on those results, we determine whether any transformations or treatments are needed before analysis.
Before doing any analysis, some variables that are coded numerically (0 for no, 1 for yes) need to be changed to categorical variables. Those variables are Default, Car_loan, Personal_loan, Home_loan and Education_loan. Those changes are reflected in the code below.
The purpose of numerical data analysis is to use the graphics that we create to investigate any potential associations that could be interesting for our analysis. We do this by looking at the correlations of the variables as well as looking at the density curves of the variables we will study in pairs.
Looking at the correlations between the numeric variables, there does not seem to be any strong correlation. All correlations are below 0.3, which indicates only weak linear association. The density curves raise a couple of concerns. The first is employment duration, which appears to be strongly associated with loan default status. The same can be said for number of credit accounts, as these curves are nearly identical. We will therefore remove those variables from the analytical dataset.
A couple of variables of interest need further investigation. The first is the term of the loans. Most loans have very similar terms: in the summary above, the middle half of terms falls between 16 and 20 months, with a range of 9 to 27 months. The plot below shows the distribution of the term variable.
From the distributions of the continuous variables, all variables besides credit score appear approximately normally distributed. To look into credit score further, we categorize each score into a level based on the FICO score ranges. According to US News & World Report (https://money.usnews.com/credit-cards/articles/what-are-the-credit-score-ranges), credit scores are categorized as Poor: 579 and lower, Fair: 580-669, Good: 670-739, Very Good: 740-799 and Exceptional: 800 and above. We will use these classifications for our categorical analysis of Credit Score.
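The credit score grouping described above can be sketched as a small lookup function. This is only an illustration written in Python, not the code used for the analysis; the function name `fico_group` is hypothetical, and the example scores are the credit score quantiles from the summary table.

```python
def fico_group(score):
    """Map a credit score to the FICO band quoted from US News & World Report."""
    if score < 580:
        return "Poor"         # 579 and lower
    if score < 670:
        return "Fair"         # 580-669
    if score < 740:
        return "Good"         # 670-739
    if score < 800:
        return "Very Good"    # 740-799
    return "Exceptional"      # 800 and above


# Credit_score minimum, 1st quartile, median and 3rd quartile from the summary table
for score in (376, 725.8, 770.5, 812):
    print(score, fico_group(score))
```

Using strict upper bounds (for example `score < 740`) keeps non-integer scores such as 739.5 in the lower band, which is one reasonable reading of the published ranges.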
The purpose of categorical data analysis is to look for relationships between variables of certain categories. We can do this by making mosaic plots and comparing the number of defaults in each category. For this study I looked at what I call Life Status (Employment, Marital and Gender) as well as the types of loans (Car, Personal, Home and Education).
From the plots above, there is no apparent association between employment status and defaulting on loans: the proportion of defaulters is about the same in each employment group. Comparing married and single borrowers, those who are single have a higher rate of default. Comparing gender, females have a higher rate of default than males. Based on these results, employment status will also be dropped from the analytical dataset.
The mosaic plots above show the types of loans and their associations with default. In terms of overall size, significantly more people have car loans or personal loans than home or education loans. The plots show that having a car loan or an education loan is positively associated with loan default. The opposite is true for home loans and personal loans: borrowers with those loans default less often.
From the mosaic plot of the credit score groups, credit score group is associated with loan default: borrowers in better credit groups have a lower rate of default.
Based on the results of the EDA, I believe that after switching certain numeric variables to categorical, we have a good idea of what the data looks like and how it will behave in modeling. Numerically, there do not seem to be any strong associations among the numeric variables themselves. There is an association between loan default status and both employment duration and number of credit accounts. Among the categorical variables, there is no association between employment status and loan default, and there are more defaults among single borrowers than married ones. Females had a higher rate of default than males. As for loan types, those with car loans and education loans have a higher rate of default than those without, and the opposite is true for personal and home loans. For our analytical dataset we will use the following 12 variables to predict loan default:
Checking Amount (Numeric)
Term (in months) (Numeric)
Credit Score (Numeric)
Gender (Categorical)
Marital Status (Categorical)
Car Loan (1-Car Loan, 0-No Car Loan)
Personal Loan (1-Personal Loan, 0-No Personal Loan)
Home Loan (1-Home Loan, 0- No Home Loan)
Education Loan (1- Education Loan, 0- No Education Loan)
Loan Amount
Savings Amount
Age in Years
In this study we look into the causes of loan default and the types of loans associated with it. The main question of interest is: what are the causes of loan default? The second question is whether different types of loans are associated with a higher percentage of loan default.
The response variable is Loan Default. Loan default is when someone stops making required payments on a loan. The variable is binary, with 1 for defaulting on a loan and 0 for not defaulting (still making payments).
To answer the questions of interest, we construct a logistic regression model for our dataset. We start with one large model that contains all the variables and a reduced model with two variables. We then use a stepwise approach to find the best model. Lastly, we predict the outcome for some made-up values to see whether our final model works.
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 39.7515891 | 4.6373275 | 8.5720902 | 0.0000000 |
| Checking_amount | -0.0051152 | 0.0006708 | -7.6254106 | 0.0000000 |
| Term | 0.1703708 | 0.0516719 | 3.2971634 | 0.0009767 |
| Credit_score | -0.0108810 | 0.0020374 | -5.3405879 | 0.0000001 |
| Gender | 0.2127173 | 0.4992540 | 0.4260703 | 0.6700566 |
| Marital_status | -0.2321025 | 0.4680443 | -0.4958985 | 0.6199660 |
| Car_loan | -0.4637733 | 2.6739669 | -0.1734402 | 0.8623054 |
| Personal_loan | -1.4303372 | 2.6737602 | -0.5349534 | 0.5926821 |
| Home_loan | -3.5442514 | 2.7615556 | -1.2834257 | 0.1993430 |
| Education_loan | 0.7035589 | 2.7085576 | 0.2597541 | 0.7950535 |
| Amount | 0.0008521 | 0.0005094 | 1.6728565 | 0.0943555 |
| Saving_amount | -0.0047812 | 0.0005978 | -7.9973755 | 0.0000000 |
| Age | -0.6443179 | 0.0640591 | -10.0581719 | 0.0000000 |
From the results above, most of the variables are not significant. The variables that are significant at the 0.05 level are checking amount, term, credit score, saving amount and age; amount is only marginally significant (p ≈ 0.094).
This model has only two variables, the amount of the loan and the credit score. We chose these because credit score determines interest rates and because having more to pay off plausibly makes default more likely.
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 9.4980041 | 0.9772893 | 9.718723 | 0.0000000 |
| Amount | 0.0009304 | 0.0002637 | 3.528215 | 0.0004184 |
| Credit_score | -0.0153278 | 0.0012510 | -12.251962 | 0.0000000 |
As expected, amount and credit score on their own are highly significant in the model for loan default. These two variables will be the starting point of our model building.
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 39.1974627 | 3.8175409 | 10.267726 | 0.0000000 |
| Checking_amount | -0.0051112 | 0.0006691 | -7.638687 | 0.0000000 |
| Term | 0.1740823 | 0.0511469 | 3.403572 | 0.0006651 |
| Credit_score | -0.0108403 | 0.0020334 | -5.331213 | 0.0000001 |
| Personal_loan | -0.9735234 | 0.3321871 | -2.930648 | 0.0033826 |
| Home_loan | -3.0603146 | 0.7605712 | -4.023706 | 0.0000573 |
| Education_loan | 1.1607001 | 0.5489788 | 2.114289 | 0.0344906 |
| Amount | 0.0008553 | 0.0005100 | 1.677108 | 0.0935214 |
| Saving_amount | -0.0047847 | 0.0005974 | -8.009094 | 0.0000000 |
| Age | -0.6436853 | 0.0636483 | -10.113164 | 0.0000000 |
Above is the final model, which uses 9 of the original 16 variables. The only variable whose p-value is not significant at the 0.05 level is amount, which is nonetheless important when thinking about loan default because it is what must be paid off. Next we use some made-up scenarios to predict whether someone will default on their loans.
| Checking_amount | Term | Credit_score | Personal_loan | Home_loan | Education_loan | Amount | Saving_amount | Age | Pred.Response |
|---|---|---|---|---|---|---|---|---|---|
| 634 | 18 | 600 | 1 | 0 | 0 | 1534 | 3350 | 33 | 0 |
| 300 | 25 | 737 | 0 | 0 | 1 | 1804 | 2449 | 29 | 1 |
Above are the made-up test values used to try the model. Using the 9 variables in the final model, the model predicted that the first individual will not default on their loan, while the second individual will.
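To make the prediction step concrete, the sketch below plugs the two made-up applicants into the final-model coefficients and applies the logistic function. It is an illustration in Python, not the code used for the analysis; the function name `default_probability` is hypothetical, and the 0.5 classification threshold is an assumption, since the threshold used for this table is not stated.

```python
import math

# Coefficients copied from the final-model table
COEF = {
    "(Intercept)": 39.1974627,
    "Checking_amount": -0.0051112,
    "Term": 0.1740823,
    "Credit_score": -0.0108403,
    "Personal_loan": -0.9735234,
    "Home_loan": -3.0603146,
    "Education_loan": 1.1607001,
    "Amount": 0.0008553,
    "Saving_amount": -0.0047847,
    "Age": -0.6436853,
}


def default_probability(row):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum of b_j * x_j)))."""
    eta = COEF["(Intercept)"] + sum(COEF[name] * value for name, value in row.items())
    return 1.0 / (1.0 + math.exp(-eta))


# The two made-up applicants from the prediction table
applicants = [
    {"Checking_amount": 634, "Term": 18, "Credit_score": 600, "Personal_loan": 1,
     "Home_loan": 0, "Education_loan": 0, "Amount": 1534, "Saving_amount": 3350, "Age": 33},
    {"Checking_amount": 300, "Term": 25, "Credit_score": 737, "Personal_loan": 0,
     "Home_loan": 0, "Education_loan": 1, "Amount": 1804, "Saving_amount": 2449, "Age": 29},
]

probs = [default_probability(a) for a in applicants]
for p in probs:
    # 0.5 is an assumed threshold for turning a probability into a 0/1 prediction
    print(round(p, 4), int(p >= 0.5))
```

The first applicant's probability comes out far below the threshold and the second's far above it, which agrees with the predicted responses in the table.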
The purpose of cross validation is to use our data to test the models that we have come up with to figure out which model is the best. In order to do this we will be splitting the data into a training set (70% of data) and a testing set (30% of data). The training set will allow us to pick a model and then the test set will be used as validation for our model. We will also use the testing dataset to come up with our optimal cut-off probability.
To find the optimal cut-off we consider 20 candidate cut-off probabilities and use 5-fold cross-validation to find the best one. We do this for two different models and pick the better one based on accuracy. The two models are the full model with all the analytical variables and the final 9-variable model from the previous section.
To test the final model, we use the remaining 30% of the analytical dataset to test the model’s predictions at the optimal cutoff point, which we determined to be a probability of 0.62.
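The cut-off search can be sketched as follows. This is a simplified Python illustration rather than the procedure actually run: it assumes the out-of-fold predicted probabilities are already available (the model refitting inside each fold is omitted), it assigns folds by index, and the data and the function names `accuracy_at` and `best_cutoff` are hypothetical.

```python
def accuracy_at(cutoff, probs, labels):
    """Share of observations classified correctly when predicting 1 if p >= cutoff."""
    hits = sum((p >= cutoff) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)


def best_cutoff(probs, labels, n_cutoffs=20, n_folds=5):
    """Return the candidate cutoff with the highest mean accuracy across folds."""
    # 20 evenly spaced candidates strictly between 0 and 1
    cutoffs = [(i + 1) / (n_cutoffs + 1) for i in range(n_cutoffs)]
    # Assumed fold assignment: observation i goes to fold i mod n_folds
    folds = [list(range(k, len(labels), n_folds)) for k in range(n_folds)]
    mean_acc = []
    for c in cutoffs:
        accs = [
            accuracy_at(c, [probs[i] for i in fold], [labels[i] for i in fold])
            for fold in folds
        ]
        mean_acc.append(sum(accs) / n_folds)
    best = max(range(n_cutoffs), key=lambda i: mean_acc[i])
    return cutoffs[best], mean_acc[best]


# Toy out-of-fold probabilities and true labels (illustrative only)
probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.70, 0.80, 0.85, 0.90, 0.95]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
cutoff, acc = best_cutoff(probs, labels)
print(cutoff, acc)
```

When several candidates tie on accuracy, this sketch keeps the smallest one; a real analysis might prefer a different tie-breaking rule depending on which type of error is more costly.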
| test.accuracy |
|---|
| 0.92 |
The results show a test accuracy of 0.92 (92%), close to the accuracy obtained with the training dataset. This suggests that the model fits the data well and does an excellent job of predicting default status correctly.
There are performance measures we can use to assess how well our model works, both local and global. The local measures are precision, the proportion of true positives among all predicted positives; recall, the proportion of actual positives that are correctly predicted as positive; and the F1 score, the harmonic mean of precision and recall. These are all evaluated at the cutoff probability. The global measures are sensitivity, which is the same as recall, and specificity, which is the true negative rate. We also use a ROC curve, which shows performance across all cutoff probability thresholds. The AUC summarizes how well the model separates the two classes across all thresholds.
| sensitivity | specificity | precision | recall | F1 |
|---|---|---|---|---|
| 0.8977273 | 0.9292453 | 0.8404255 | 0.8977273 | 0.8681319 |
The table above shows that at the optimal cutoff the sensitivity is 0.898, the proportion of actual defaulters who are predicted to default. The specificity is 0.929, the proportion of non-defaulters who are predicted not to default.
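The measures in the table can be reproduced from a confusion matrix. The counts in the Python sketch below are not reported in the source; they are an assumption, chosen because they are consistent with a 300-observation test set and reproduce the tabulated sensitivity, specificity, precision and F1 as well as the 0.92 test accuracy.

```python
# Assumed confusion-matrix counts (inferred from the reported rates, not given in the source)
TP, FN = 79, 9      # actual defaulters: predicted default / predicted no default
TN, FP = 197, 15    # actual non-defaulters: predicted no default / predicted default

sensitivity = TP / (TP + FN)          # share of actual defaulters that are caught
specificity = TN / (TN + FP)          # share of non-defaulters correctly cleared
precision = TP / (TP + FP)            # share of predicted defaulters who do default
recall = sensitivity                  # recall is another name for sensitivity
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (TP + TN) / (TP + FN + TN + FP)

for name, value in [("sensitivity", sensitivity), ("specificity", specificity),
                    ("precision", precision), ("F1", f1), ("accuracy", accuracy)]:
    print(name, round(value, 7))
```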
## Warning in one.minus.spec[-101] - one.minus.spec[-1]: longer object length is
## not a multiple of shorter object length
The ROC curve above plots the false positive rate (1 - specificity) against the true positive rate (sensitivity). The curve is traced from 20 cutoff points between 0 and 1. Moving right along the x-axis, specificity decreases: we flag more non-defaulters incorrectly, but we also catch more of the actual defaulters. The AUC is 0.981, which means there is about a 98% chance that the model assigns a higher default probability to a randomly chosen defaulter than to a randomly chosen non-defaulter.
Another way to build models is a data-driven approach via a neural network. This approach uses machine learning to find patterns in the variables that predict the response variable. To do this we standardize all of the numeric variables and recode the categorical variables as dummy variables. We use the analytical dataset that we created as a result of our EDA.
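The construction of a ROC curve from a grid of cutoffs, and the trapezoidal approximation of its AUC, can be sketched as below. This is a Python illustration on made-up data, not the code behind the plot; the function names `roc_points` and `auc_trapezoid` are hypothetical.

```python
def roc_points(probs, labels, n_cutoffs=20):
    """(false positive rate, true positive rate) pairs over an evenly spaced cutoff grid."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = {(0.0, 0.0), (1.0, 1.0)}   # make sure the curve is anchored at both corners
    for i in range(n_cutoffs + 1):
        c = i / n_cutoffs
        tp = sum(1 for p, y in zip(probs, labels) if p >= c and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= c and y == 0)
        points.add((fp / neg, tp / pos))
    return sorted(points)


def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy predicted probabilities and true labels (illustrative only)
probs = [0.12, 0.41, 0.37, 0.83]
labels = [0, 0, 1, 1]
auc = auc_trapezoid(roc_points(probs, labels))
print(auc)
```

Pairing each point with its successor via `zip(points, points[1:])` keeps the two sequences the same length, which avoids the kind of vector-length mismatch that hand-written trapezoid code can run into.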
## [1] "(Intercept)" "Default"
## [3] "Checking_amount" "Term"
## [5] "Credit_score" "Gender"
## [7] "Marital_status" "Car_loan"
## [9] "Personal_loan" "Home_loan"
## [11] "Education_loan" "Amount"
## [13] "Saving_amount" "Age"
## [15] "grp.credit_scoreFair" "grp.credit_scoreGood"
## [17] "grp.credit_scorePoor" "grp.credit_scoreVery Good"
## [19] "default.status"
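The preprocessing described above, standardizing numeric variables and dummy-coding categorical ones, can be sketched as follows. This Python illustration assumes z-score standardization (the text does not say which kind of standardization was used), and the values, the reference level and the function names `standardize` and `dummy` are hypothetical.

```python
from statistics import mean, stdev


def standardize(values):
    """Z-score: subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def dummy(values, reference):
    """Code a two-level categorical variable as 0 (reference level) or 1."""
    return [0 if v == reference else 1 for v in values]


ages = [18, 29, 32, 34, 42]                              # illustrative values
genders = ["Male", "Female", "Female", "Male", "Female"]  # illustrative values

z = standardize(ages)
g = dummy(genders, reference="Male")
print([round(v, 3) for v in z])
print(g)
```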
Below is the defined model that we will be putting into the Neural Network:
Default = CheckingAmount+ Term + CreditScore + Gender + MaritalStatus + CarLoan + PersonalLoan + HomeLoan + EducationLoan + Amount + SavingAmount + Age
To train and test the Neural Network, the data will be split into a training dataset (70% of data) and a testing dataset (30% of data). The Neural Network will be built on the training dataset, while accuracy and performance will be assessed with the testing dataset.
| Quantity | Value |
|---|---|
| error | 4.9662358 |
| reached.threshold | 0.0093099 |
| steps | 2681.0000000 |
| Intercept.to.1layhid1 | -8.5848714 |
| CheckingAmount.to.1layhid1 | 10.6997799 |
| Term.to.1layhid1 | -7.0213942 |
| CreditScore.to.1layhid1 | 20.1028208 |
| Gender.to.1layhid1 | -8.6470249 |
| MaritalStatus.to.1layhid1 | -7.2778820 |
| CarLoan.to.1layhid1 | -1.9047456 |
| PersonalLoan.to.1layhid1 | -0.0129849 |
| HomeLoan.to.1layhid1 | -0.7345207 |
| EducationLoan.to.1layhid1 | -3.3819397 |
| Amount.to.1layhid1 | -1.5866660 |
| SavingAmount.to.1layhid1 | 10.7228029 |
| Age.to.1layhid1 | 18.1772274 |
| Intercept.to.Default | 0.9945111 |
| 1layhid1.to.Default | -0.9883547 |
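The weights above describe a network with a single hidden unit, and the sketch below shows how such a network turns inputs into an output. This Python illustration rests on two assumptions that the source does not state: a logistic activation in the hidden unit and a linear output unit. Inputs are taken to be on the standardized scale, and the function name `forward` is hypothetical.

```python
import math

# Weights copied from the table above
HIDDEN_BIAS = -8.5848714
HIDDEN_W = {
    "CheckingAmount": 10.6997799, "Term": -7.0213942, "CreditScore": 20.1028208,
    "Gender": -8.6470249, "MaritalStatus": -7.2778820, "CarLoan": -1.9047456,
    "PersonalLoan": -0.0129849, "HomeLoan": -0.7345207, "EducationLoan": -3.3819397,
    "Amount": -1.5866660, "SavingAmount": 10.7228029, "Age": 18.1772274,
}
OUT_BIAS = 0.9945111
OUT_W = -0.9883547


def forward(x):
    """One forward pass; inputs missing from x are treated as 0."""
    z = HIDDEN_BIAS + sum(w * x.get(name, 0.0) for name, w in HIDDEN_W.items())
    hidden = 1.0 / (1.0 + math.exp(-z))   # assumed logistic activation
    return OUT_BIAS + OUT_W * hidden      # assumed linear output unit


baseline = forward({})                         # every input at 0
high_score = forward({"CreditScore": 2.0})     # credit score raised by 2 units
print(round(baseline, 4), round(high_score, 4))
```

Under these assumptions, a large positive input to the hidden unit pushes the output toward 0, so the positive weights on credit score, savings, checking amount and age all act to lower the predicted default score.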
In order to test the accuracy and performance of the Neural Network model, it must be tested and cross validated with the remaining data. This will be done by finding an optimal cut-off probability for predicting loan default and then using that probability to find the accuracy of the model as well as coming up with the ROC curve.
The resulting plot shows the optimal cut-off probability to be 0.57 for predicting loan default. This value will be used to compare accuracy against the logistic model later on. Below is the ROC curve, used to assess how well the Neural Network model fits the data.
## Warning in roc.default(category, prediction): Deprecated use a matrix as
## predictor. Unexpected results may be produced, please pass a numeric vector.
The ROC curve gives an AUC that is much higher than 0.5, which means this model discriminates well and predicts loan default much better than random guessing.
To test the accuracy of the Neural Network model, the cutoff probability of 0.57 will be used in order for the model to make predictions. The predictions will be compared to the actual Loan Default values and the accuracy will be calculated from there.
| | 0 | 1 |
|---|---|---|
| FALSE | 463 | 36 |
| TRUE | 15 | 186 |
| accuracy |
|---|
| 0.9271429 |
The tables above show the counts of correct and incorrect predictions and the resulting accuracy. Overall, this model has an accuracy of about 92.7% (649 of 700 predictions correct), which is very similar to the logistic model. Both models would be excellent choices for our final model. The AUC for both models is in the 90% range as well. The logistic model is simpler as it uses fewer variables than the Neural Network, but otherwise the performance measures are very strong for both models and you could not go wrong with either.
Decision tree algorithms are supervised learning algorithms that can be used in modeling. They split the data on categorical or numeric predictors to predict outcomes. Two metrics, the Gini index and entropy, measure how much information a candidate split gives us for making a decision. This dataset has many categorical variables, specifically the types of loans that borrowers have taken out. Combining that information with the other variables can help us better predict default. The approach here is to create eight different decision trees. We penalize false negatives and false positives at different rates, building two trees for each penalty scheme, one using the Gini index to choose splits and one using entropy.
The decision trees are built with different penalties for incorrect predictions. In this scenario, it is very costly for a company to incorrectly predict whether someone will default on their loans. We therefore build trees that do not penalize incorrect predictions, that penalize only false positives, that penalize only false negatives, and that penalize both. For each scheme we build one tree with the Gini index and one with entropy.
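The two splitting criteria named above can be written down in a few lines. The Python sketch below is an illustration only; the function names and the example split are hypothetical, and the 30% default rate at the root is the mean of Default from the summary table.

```python
import math


def gini(proportions):
    """Gini index: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)


def entropy(proportions):
    """Shannon entropy in bits; classes with zero proportion contribute nothing."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)


def split_impurity(measure, children):
    """Weighted impurity of a split; children is a list of (weight, proportions)."""
    return sum(weight * measure(props) for weight, props in children)


root = [0.7, 0.3]                      # 70% no default, 30% default
# Hypothetical split: half the data at 10% default, the other half at 50%
children = [(0.5, [0.9, 0.1]), (0.5, [0.5, 0.5])]

print(gini(root), entropy(root))
print(split_impurity(gini, children), split_impurity(entropy, children))
```

A split is worth making when the weighted impurity of the children is lower than the impurity of the parent node; the tree-growing algorithm picks the split with the largest such reduction.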
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
## Call rpart.plot with roundint=FALSE,
## or rebuild the rpart model with model=TRUE.
Above are the eight decision trees for defaulting on loans. For each penalty scheme and splitting index, the tree is different and gives different information and predictions.
In the sections below we will use ROC curves and cut-off probabilities to determine the best tree.
### ROC Curves
ROC curves show us how accurate our models are and whether using the trees is better than randomly guessing Default or No Default.
Above are the corresponding ROC curves for the 8 decision trees. The best model is info10.10 because it has the largest AUC, meaning it discriminates best between defaulters and non-defaulters, and it also penalizes incorrect predictions, which separates it from the unpenalized model.
The purpose of the cut-off probability is to determine the optimal probability given by the model to determine whether it will predict if someone will default or not default. Below are those cut-off probabilities.
The above plots show the optimal cut-off points for each decision tree. We can use these values to further optimize our models and ROC curves and to pick the best model depending on how accurate we want to be.
From our modeling, ROC curves and cut-off probabilities, the best of the decision tree models is the entropy tree that penalizes all incorrect predictions. This model gave an accuracy well over 90%, the highest of the models. It also gave a very large AUC, also well over 90%, which indicates that it separates defaulters from non-defaulters very well.
The purpose of this study was to investigate the predictors and causes of loan default using multiple models, and to consider how those models could be implemented. The three models used were logistic regression, a neural network and decision trees. All three did extremely well when tested for accuracy, with values in the 90% range. That is high compared with random guessing between default and no default, which would give an accuracy of around 50%.
The simplest and easiest model to use is the logistic regression model. It uses only nine of the sixteen variables. One can plug in the values and it gives a prediction of whether someone will default on their loan, which is useful in deciding whether your company should grant the loan. It is easy to understand because its output maps directly to default or no default.
The next model was the machine learning model, a neural network. This model is very accurate but harder to understand and use. It is also more complex, as it uses all twelve of the remaining variables. The model has great accuracy and is cross-validated, but it does not give a prediction of default or no default that is easy to interpret. It would be tough to implement, but it is a good resource for gauging how well the loans are doing overall.
Lastly, there are the decision trees. These models are simpler than the neural network but more complex than the logistic regression model. They also use all 12 variables in the analytical dataset. The best of them was the entropy decision tree that penalizes both false positives and false negatives. This matters because there is a loss both in predicting that someone will not default when they do and in predicting that someone will default when they do not. From a business perspective, we lose money on loans given to people who do not pay them off, and we also lose the interest we could have collected from people we turned down who would have paid the loan off. To implement this model, one follows the splits and cut-offs on the tree to determine whether an individual with certain characteristics will default on their loan.
In the end, the best recommendation is the logistic regression model: it has very high accuracy and is the easiest to implement, since one can essentially enter values into an Excel sheet and get a clean prediction of whether someone will default on their loans.
Harzog, Beverly. “What Are the Credit Score Ranges? | Credit Card News & Advice | U.S. News.” US News & World Report, 17 July 2019, money.usnews.com/credit-cards/articles/what-are-the-credit-score-ranges.