With the rapid growth of smart banking in recent years, lenders have been pushing to increase profitability while taking on greater risk. The primary function of the credit market is to locate borrowers and make sound investment decisions based on their creditworthiness. The purpose of this thesis is to look into how lenders make decisions, specifically how they avoid high-risk borrowers. Lending money and shopping online are closely related in how the decision is reached. To illustrate, consider an online shopping system such as Amazon: a buyer makes the final purchase decision based on the comments (feedback) left for the seller and the overall rating of how good and how credible the seller's product is, all provided by those who have made a previous purchase. This feedback typically covers product performance, satisfaction and errors, which are the features. Our loan data similarly consists of variables (features). In lending, investors operate much like consumers making an online purchase. The investor checks the feedback (default), the existing credit history and many other factors before deciding whether the borrower will get the loan he or she is asking for.
Lending is one of the best ways to equip borrowers with the amount of money they need based on their status, purpose of loan and creditworthiness. By studying the loan data, we will analyze the different factors and the different borrowers. One issue with lending is that repayment is not always guaranteed, but there are a number of measures we can take to assess each borrower by checking the information supplied and comparing it through clustering.
This project also aims to help lenders avoid high-risk borrowers, judge borrowers and classify them into groups based on their similarities. Clustering analysis is the method used for this classification, and an algorithm will be used to determine how many clusters to create. The data contains 1000 observations and 21 features that give us insight into each borrower, so that their creditworthiness can be computed faster. The most important feature in this data is the default variable, which shows each borrower's risk level and serves as a guide to an optimal solution.
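One common way to decide how many clusters to create is the elbow method with k-means. The sketch below is only an illustration under stated assumptions: the text does not name the clustering algorithm, the `borrowers` data frame is synthetic, and the final choice of 3 clusters applies to this toy data only.

```r
# Synthetic stand-in for borrower features (NOT the thesis data).
set.seed(42)
borrowers <- data.frame(
  duration = c(rnorm(50, 12, 3), rnorm(50, 36, 5), rnorm(50, 60, 4)),
  amount   = c(rnorm(50, 2000, 400), rnorm(50, 6000, 800), rnorm(50, 12000, 1000))
)

# Standardise the columns so that no single feature dominates the distance.
scaled <- scale(borrowers)

# Total within-cluster sum of squares for k = 1..8.
wss <- sapply(1:8, function(k) {
  kmeans(scaled, centers = k, nstart = 20)$tot.withinss
})
print(round(wss, 1))

# The "elbow" is the k after which adding clusters reduces wss only slightly.
# Fit the chosen number of clusters (3 is an assumption for this toy data).
fit <- kmeans(scaled, centers = 3, nstart = 20)
table(fit$cluster)
```

Plotting `wss` against `k` makes the elbow easier to spot than reading the printed values.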
We will also assess the probability of default for every incoming loan application. This will be based on a model that receives all the necessary inputs and returns a prediction of default after learning from previous data. The final output of this model is 0 or 1, with 1 assigned to cases where the predicted probability of default is greater than 0.5.
We begin with a summary of the data.
| Dataset | Rows | Columns |
|---|---|---|
| Loan Application Data | 1000 | 21 |
| Feature | Mean | Std Deviation | Median | Max | Min | Mode |
|---|---|---|---|---|---|---|
| Monthly Loan Duration | 20.903 | 12.05881 | 18 | 72 | 4 | 24 |
| Requested Amount | 33271.258 | 2822.737 | 2319.5 | 18424 | 250 | 548230 |
| Installment Rate | 2.973 | 1.118715 | 3 | 4 | 1 | 4 |
| Residence History | 2.845 | 1.103718 | 3 | 4 | 1 | 4 |
| Age | 35.546 | 11.37547 | 33 | 75 | 19 | 27 |
| Existing Loan | 1.407 | 0.57765 | 1 | 4 | 1 | 1 |
| Default | 1.3 | 0.45848 | 1 | 2 | 1 | 1 |
| Dependents | 1.155 | 0.362085 | 1 | 2 | 1 | 1 |
Overall, the summary statistics show that the smallest requested amount is 250 dollars and the largest is 18,424 dollars.
In addition, ages in this data range from 19 to 75, with most applicants aged 40 or younger. Most of the applicants have lived in the current country for almost 3 years.
Finally, most borrowers in the data have a single dependent. Most loans run for up to about 20 months, which is equivalent to a year and 8 months.
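The columns of the summary table can be computed per feature with a small helper. This is a minimal sketch: `stat_mode` and `summarise_feature` are hypothetical helper names, and the `duration` vector is toy data rather than the loan data.

```r
# Helper: most frequent value (base R has no statistical mode function;
# the built-in mode() returns the storage type instead).
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Collect the six statistics reported in the summary table.
summarise_feature <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x),
    max = max(x), min = min(x), mode = stat_mode(x))
}

# Toy vector standing in for a numeric column such as loan duration.
duration <- c(6, 12, 12, 18, 24, 24, 24, 36, 48)
round(summarise_feature(duration), 3)
```

Applied to a data frame, `sapply(df[numeric_columns], summarise_feature)` would give one column of statistics per feature, where `df` and `numeric_columns` are placeholders for the real data.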
From the project objectives in the introduction above, three key questions need to be answered:
Banks and other lending businesses are usually at risk when lending because of the payment setup and because some borrowers receive a loan but do not pay when it is due. If the borrower does not pay back, the bank or lending business has no way to make up for the loss. We need to minimize the potential risk by assessing the important features and placing weights on them. These are important factors to consider in every loan application.
The main goal of investing in a loan is to earn a profit or return. One of the main ways lenders can increase their profit is by targeting borrowers in the low or medium risk categories, and so they try to curb potential losses by avoiding those in the high risk bracket. The aim of this section of the project is to help lenders identify the best group of borrowers by assessing them on different factors. The primary issue when making the investment decision is choosing the methodology used to scrutinize borrowers. Here, borrowers are classified using clustering analysis, based on the similarities in their default patterns.
From the overall data, we can deduce the proportions of borrowers who default and who do not default on their loans, shown in the table below.
| Category | Proportion |
|---|---|
| Defaulted | 0.3 |
| Not Defaulted | 0.7 |
Most borrowers (70%) are at a good creditworthiness level, while a sizeable share (30%) have loans that are considered to be in default (Table 3).
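The proportions can be reproduced with a short base R sketch. It assumes the default column is coded 1 for no default and 2 for default (an assumption based on the summary table, where the values range from 1 to 2). It uses a toy vector built from the totals of the group table (700 non-defaulters and 300 defaulters) rather than the real data.

```r
# Toy default column; the coding 1 = no default, 2 = default is an assumption.
default <- c(rep(1, 700), rep(2, 300))

# Label the codes, count each category, and convert counts to proportions.
status <- factor(default, levels = c(1, 2),
                 labels = c("Not Defaulted", "Defaulted"))
props <- prop.table(table(status))
print(props)
```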
The metric used in this project is based on probability and proportions. After counting the defaulters and non-defaulters, the proportion of defaulters was calculated for each group. Groups with a proportion greater than 0.5 were categorized as “High Risk”, groups with a proportion between 0.1 and 0.5 as “Medium Risk”, and groups with a proportion lower than 0.1 as “Low Risk”. A detailed breakdown is given in the visualizations below.
The table below shows the groupings and insights about the defaulters. We also use the bucket metric explained above to classify each group as low, medium or high risk.
| Group Number | Number of Non Defaulters | Number of Defaulters | Proportion of Defaulters | Risk Level |
|---|---|---|---|---|
| Group 1 | 324 | 27 | 0.076 | Low Risk |
| Group 2 | 18 | 162 | 0.9 | High Risk |
| Group 3 | 189 | 13 | 0.064 | Low Risk |
| Group 4 | 67 | 15 | 0.183 | Medium Risk |
| Group 5 | 26 | 29 | 0.527 | High Risk |
| Group 6 | 57 | 41 | 0.418 | Medium Risk |
| Group 7 | 19 | 13 | 0.406 | Medium Risk |
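The bucket rule can be applied to the group counts directly. The sketch below takes the counts from the table above and computes each proportion as defaulters divided by group size. The function name `risk_level` is hypothetical, and treating the whole band from 0.1 to 0.5 as medium risk is an assumption.

```r
# Counts copied from the group table.
groups <- data.frame(
  group          = paste("Group", 1:7),
  non_defaulters = c(324, 18, 189, 67, 26, 57, 19),
  defaulters     = c(27, 162, 13, 15, 29, 41, 13)
)

# Proportion of defaulters within each group.
groups$proportion <- groups$defaulters /
  (groups$non_defaulters + groups$defaulters)

# Bucket rule: above 0.5 is high, below 0.1 is low, and everything in
# between is medium (the medium band is an assumption).
risk_level <- function(p) {
  ifelse(p > 0.5, "High Risk", ifelse(p < 0.1, "Low Risk", "Medium Risk"))
}
groups$risk <- risk_level(groups$proportion)

print(transform(groups, proportion = round(proportion, 3)))
```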
The figure below shows which groups defaulted the most. The group with the most defaults is common law, followed by single people, with married and divorced people at about the same level of default.
Default Count Based on Personal Status
We have established that single individuals and people under common law default the most; next we want to see why they apply for a loan and what they use it for. The figure below shows this. Home maintenance (furniture, domestic appliances, repairs and home entertainment) is the main purpose, followed by skill development, which encompasses education, training programs and business development. In the further analysis and conclusion, we will tie these findings together into a brief summary of the borrowers we should watch out for.
Default Count Based on Personal Status with Purpose of Loan
Most of the defaulters request loans of 5000 dollars or less, and these applicants are usually 40 years old or younger.
A concentration of points at the left end of the figure below confirms this. There are two exceptional cases of defaulters requesting over 15000 dollars.
Relationship between the amount requested by the borrower and the Age (Specifically for those categorised under high risk)
Figure 4 below shows the distribution of housing status among these defaulters. A larger percentage of those in the high risk category actually own houses. This is consistent with Figure 2, where most of the defaulters requested loans for home maintenance (furniture, domestic appliances, repairs and home entertainment). While we might expect that owning a house makes a borrower less likely to default, we see the exact opposite in this case.
We have extremely few defaulters under the category of those who have fully paid off their houses.
Personal Status and Housing Situation of Those who are classified as high risk
This statistic describes the installment payment setup for the borrowers, where 1 means weekly, 2 means bi-weekly, 3 means monthly and 4 means quarterly. The borrowers with the longest installment period (mostly monthly or quarterly) are those with a checking balance of < 0, which means that they do not have sufficient funds in their account. It is therefore understandable that those individuals would choose the longest installment period to pay up.
Surprisingly, those with more than a thousand dollars in their balance also have the monthly installment payment setup.
Lastly, the dependents of these borrowers also matter because we need to understand how many people they are providing for and how much responsibility they carry, including children, parents and extended family members. The table below gives the number of dependents for the common law defaulters. Over 90% of them have 1 dependent but still defaulted. This could be due to the weight of responsibility, payments and tasks that even a single dependent brings, since dependents can be financially draining.
| Number of Dependents | Count |
|---|---|
| 1 | 71 |
| 2 | 1 |
We will use regression to develop the model that predicts the default probability from the features of the data.
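```r
library(magrittr)  # provides the %>% pipe

# Predicted default probabilities for the held-out test set
possibility <- step_loan_data %>% predict(test_ld, type = "response")
# Classify as default (1) when the predicted probability exceeds 0.5
predicted.classes <- ifelse(possibility > 0.5, 1, 0)
head(predicted.classes)
##  2  4 10 15 21 28
##  0  1  0  1  0  1

# Score a single new application. predict() expects newdata to be a data
# frame with the model's predictor columns, so newdat needs that shape.
newdat <- c(1,24,1,10000,3,1,2,1,0,3,2,28,2,1,0,1,1,1,1,0,0,0,1,0,0)
possibility <- step_loan_data %>% predict(newdat, type = "response")
# Recompute the class from the new probability before printing it
predicted.classes <- ifelse(possibility > 0.5, 1, 0)
predicted.classes
## 3011
##    1
```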
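For readers who want to reproduce the workflow end to end, the following self-contained sketch fits a model of the same general kind on synthetic data. It rests on assumptions: the text only says "regression", so the use of logistic regression with stepwise AIC selection is inferred from `type = "response"` and the name `step_loan_data`. The `loans` data frame, its three predictors and the coefficients that generate the outcome are all invented for illustration.

```r
# Synthetic loan data (NOT the thesis data); ranges loosely follow the summary.
set.seed(1)
n <- 500
loans <- data.frame(
  duration = round(runif(n, 4, 72)),
  amount   = round(runif(n, 250, 18000)),
  age      = round(runif(n, 19, 75))
)
# Invented rule: longer and larger loans default more often (assumption).
logit <- -2 + 0.04 * loans$duration + 0.0001 * loans$amount - 0.01 * loans$age
loans$default <- rbinom(n, 1, plogis(logit))

# Hold out 20% of the rows for testing.
test_idx <- sample(n, size = n / 5)
train <- loans[-test_idx, ]
test  <- loans[test_idx, ]

# Logistic regression, followed by stepwise selection by AIC.
full_model <- glm(default ~ ., data = train, family = binomial)
step_model <- step(full_model, trace = 0)

# Predicted probabilities and 0/1 classes at the 0.5 cut-off.
prob <- predict(step_model, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)

# A new applicant is passed as a one-row data frame with matching column names.
new_applicant <- data.frame(duration = 24, amount = 10000, age = 28)
new_prob <- predict(step_model, newdata = new_applicant, type = "response")
print(ifelse(new_prob > 0.5, 1, 0))

# Share of test rows classified correctly.
print(mean(pred == test$default))
```

Passing the new applicant as a named data frame, rather than a bare vector, lets `predict()` match each value to the right predictor.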
We will focus on the cluster/group with the largest number of non-defaulters, as this helps us understand their similarities and why so many of them were placed in the first group, which has 324 non-defaulters out of 351.
The figure below shows that 68% of the non-defaulters in the good category, who are also typical customers, are skilled employees, with just 2% of them being unemployed non-residents. This is reasonable: these individuals are not entitled to work in the country, so repaying a loan can be more difficult for them, assuming they do not have any other source of income or wealth.
Loan Applicants by Job
Most of the typical customers who are non-defaulters have fully repaid this bank and applied for the loan to meet skill development needs, which encompass education, training programs and business development. Home maintenance remains the top reason most applicants ask for a loan.
Credit History with the Purpose of Loan
Many points are concentrated at the low end of the requested amount: a lot of the typical customers are single applicants who request 5000 dollars or less, with just a few of them requesting more than 10,000 dollars.
Personal Status and The Requested Amount Borrowed (Good Borrowers)
Lenders try to make as much profit as possible and get paid when it is due. In order to do that, they have to avoid high risk borrowers. These borrowers were the focus of this analysis after classifying borrowers into low, medium and high risk. After completing the analysis of these high risk borrowers, we were able to identify their important features and specifics, which are listed below.
The combination of all these features will serve as the benchmark for figuring out, to a reasonable extent, whether an incoming loan applicant will default.
Our typical customers who are good and on the low risk side are those who are
The reason to focus on this group is the profit the lender could make by lending to those who are diligent, with a low level of risk and good standing in credit history. Again, the combination of these features will serve as a good benchmark or guide for identifying the good applicants.