Brief Introduction

With the growth of smart banking in recent years, lenders have pushed to increase profitability while taking on greater risk. The primary function of the credit market is to locate borrowers and make sound investment decisions based on their creditworthiness. The purpose of this thesis is to examine how lenders make decisions, specifically how they avoid high-risk borrowers.

Lending money and shopping online involve a similar kind of decision. Consider an online shopping platform such as Amazon: before making a purchase, a buyer looks at the comments (feedback) left for the seller and at the overall rating of how good and how credible the seller's product is, and makes the final judgment based on the feedback from those who have made a previous purchase. That feedback usually covers product performance, satisfaction and errors, which act as the features. Our loan data is similar in that it consists of variables (features). In lending, investors operate much like consumers making an online purchase. The investor checks the feedback (default), the existing credit history and many other factors before deciding whether the borrower will get the loan he or she is asking for.

Lending is one of the best ways to equip borrowers with the amount of money they need based on their status, purpose of loan and creditworthiness. By studying the loan data, we will analyze the different factors and the different borrowers. A recurring issue with lending is that repayment is not guaranteed, but we can assess each borrower by checking the information supplied and comparing borrowers through clustering.

Objective

This project aims to help lenders avoid high-risk borrowers by judging borrowers and classifying them into groups based on their similarities. Clustering analysis is the method used for this classification, and an algorithm is used to determine how many clusters to create. The data contains 1000 observations and 21 features that give insight into each borrower, so that creditworthiness can be computed faster. The most important feature is the default variable, which shows each borrower's risk level and serves as a guide to an optimal solution.

We will also assess the probability of default for every incoming loan application. This is based on a model that takes the necessary inputs and returns a prediction of default after learning from previous data. The final output of the model is 0 or 1, where 1 is assigned when the predicted probability of default is greater than 0.5.

Descriptive Statistics

We begin with a summary of the data.

Dimension of the Data

                        Rows   Columns
Loan Application Data   1000   21

Descriptive Statistics of the Quantitative Variables

Variable                Mean        Std Deviation   Median   Max     Min   Mode
Monthly Loan Duration   20.903      12.05881        18       72      4     24
Requested Amount        33271.258   2822.737        2319.5   18424   250   548230
Installment Rate        2.973       1.118715        3        4       1     4
Residence History       2.845       1.103718        3        4       1     4
Age                     35.546      11.37547        33       75      19    27
Existing Loan           1.407       0.57765         1        4       1     1
Default                 1.3         0.45848         1        2       1     1
Dependents              1.155       0.362085        1        2       1     1
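
The columns of this table can be reproduced with base R. The sketch below is only illustrative: the `loans` data frame holds made-up values, and the helper names `stat_mode` and `describe` are assumptions rather than names from the original analysis. R has no built-in function for the statistical mode, so the helper returns the most frequent value.

```r
# Toy stand-in for the loan data (hypothetical values, for illustration only)
loans <- data.frame(
  duration = c(6, 12, 24, 24, 36),
  age      = c(22, 27, 27, 35, 49)
)

# Most frequent value; ties resolve to the smallest value
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# One set of summary statistics for a single numeric variable
describe <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x),
    max = max(x), min = min(x), mode = stat_mode(x))
}

# One row per variable, one column per statistic
summary_table <- t(sapply(loans, describe))
print(round(summary_table, 3))
```

Applied to the real data, the same `describe` helper would be mapped over the eight quantitative columns listed in the table.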

Overall, the summary statistics show that the smallest requested amount is 250 dollars and the largest is 18,424 dollars.

In addition, ages in this data range from 19 to 75 with a median of 33, so most applicants are 40 or younger. On average, applicants have lived in the current country for almost 3 years.

Finally, most borrowers have 1 dependent. The average loan duration is about 21 months, which is roughly a year and 9 months.

Methodology

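The project objectives in the introduction lead to three key questions, which the three analyses below address in turn: how can borrowers be classified as low, medium or high risk; what is the probability that an incoming loan applicant will default; and who are the typical good borrowers to focus on?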

Factors Considered To Affect Loan Risks

Banks and other loan businesses are exposed to risk because some borrowers receive a loan but do not pay when payment is due. If the borrower does not pay back, the lender has no way to make up for the loss. We need to minimize the potential risk by assessing the important features and placing weights on them. The following factors are important to consider in every loan application.

  • Age - The age of the applicant.
  • Purpose of Loan - The reason for the collection of the loan.
  • Requested Amount - The amount requested.
  • Property - If the applicant owns a property or rents one.
  • Existing Loans - The number of current existing loans to be repaid.
  • Personal Status - The applicant is either single, married, in a common law relationship or divorced.
  • Existing Credit History - The current state of the credit history.
  • Checking Balance - The amount in the bank account.
  • Dependents - The number of dependents.

Analysis I - Classifying the Borrowers as Low, Medium or High Risk.

The main goal of investing in a loan is to earn a profit or return. One of the main ways lenders can increase their profit is by targeting borrowers in the low or medium risk categories and avoiding those in the high risk bracket, which curbs potential losses. The aim of this section is to help lenders identify the best group of borrowers by assessing them on different factors. The primary issue in making the investment decision is choosing the method used to scrutinize borrowers. Here, borrowers are classified using clustering analysis, based on the similarities in their default patterns.
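
The text does not name the clustering algorithm or the rule used to pick the number of clusters. As one possible illustration only, the sketch below assumes k-means on standardised numeric features and the elbow heuristic (total within-cluster sum of squares). The data are simulated and the column names are hypothetical.

```r
set.seed(42)

# Simulated borrowers (hypothetical): three loose groups in age and amount
borrowers <- data.frame(
  age    = c(rnorm(50, 25, 3), rnorm(50, 40, 4), rnorm(50, 60, 5)),
  amount = c(rnorm(50, 2000, 400), rnorm(50, 6000, 800), rnorm(50, 12000, 1500))
)

# Standardise so that no single feature dominates the distance calculation
features <- scale(borrowers)

# Total within-cluster sum of squares for k = 1..8 (the "elbow" curve)
wss <- sapply(1:8, function(k) {
  kmeans(features, centers = k, nstart = 20)$tot.withinss
})
print(round(wss, 1))

# Fit the chosen number of clusters and attach the labels to the data
fit <- kmeans(features, centers = 3, nstart = 20)
borrowers$group <- fit$cluster
print(table(borrowers$group))
```

The value of k is read off where the curve in `wss` stops dropping sharply. In the analysis that follows, the borrowers end up in seven groups.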

From the overall data, we can deduce the proportions of borrowers who default and who do not default on their loans, shown in the table below.

Proportion of Defaulted and Not Defaulted

Category        Proportion
Defaulted       0.3
Not Defaulted   0.7

Most borrowers (70%) are at a good creditworthiness level, while a sizeable share (30%) have loans that are considered to be in default (Table 3).
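
The class proportions follow directly from the Default column. The sketch below assumes that the variable is coded 1 for non-defaulters and 2 for defaulters, which is consistent with its reported mean of 1.3 and with the 300 defaulters counted across the groups later in this section. The vector is constructed to match that mean and is not read from the original file.

```r
# Default coded 1 or 2; built to match the reported mean of 1.3
default <- c(rep(1, 700), rep(2, 300))
print(mean(default))

# Assumption: 1 = not defaulted, 2 = defaulted
status <- factor(default, levels = c(1, 2),
                 labels = c("Not Defaulted", "Defaulted"))

# Share of each class among all borrowers
proportions <- prop.table(table(status))
print(proportions)
```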

Bucket Metric Used For Classifying Defaulters.

The metric used in this project is based on probability and proportions. After counting the defaulters and non-defaulters in each group, the proportion of defaulters was calculated. Groups with a proportion greater than 0.5 were categorized as “High Risk”, groups with a proportion between 0.1 and 0.5 as “Medium Risk”, and groups with a proportion lower than 0.1 as “Low Risk”. A breakdown is given in the table and visualizations below.

The table below shows the groupings and insights about the defaulters, with each group classified as low, medium or high risk using the bucket metric explained above.

Insights on Proportion of Defaulters in Each Group

Group Number   Number of Non Defaulters   Number of Defaulters   Proportion of Defaulters   Risk Level
Group 1        324                        27                     0.077                      Low Risk
Group 2        18                         162                    0.9                        High Risk
Group 3        189                        13                     0.064                      Low Risk
Group 4        67                         15                     0.183                      Medium Risk
Group 5        26                         29                     0.527                      High Risk
Group 6        57                         41                     0.418                      Medium Risk
Group 7        19                         13                     0.406                      Medium Risk
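
The bucket metric can be applied to the group counts in a few lines of R. In this sketch the proportion is recomputed from the counts as defaulters divided by group size, and the cut-offs are an assumption: below 0.1 is low risk, above 0.5 is high risk, and everything in between is treated as medium risk. The column and function names are illustrative.

```r
# Counts per group as reported in the table
groups <- data.frame(
  group          = paste("Group", 1:7),
  non_defaulters = c(324, 18, 189, 67, 26, 57, 19),
  defaulters     = c(27, 162, 13, 15, 29, 41, 13)
)

# Share of defaulters among all borrowers in the group
groups$proportion <- groups$defaulters /
  (groups$defaulters + groups$non_defaulters)

# Assumed cut-offs: below 0.1 low, above 0.5 high, everything else medium
risk_level <- function(p) {
  ifelse(p < 0.1, "Low Risk", ifelse(p > 0.5, "High Risk", "Medium Risk"))
}
groups$risk <- risk_level(groups$proportion)

print(transform(groups, proportion = round(proportion, 3)))
```

Working from the counts rather than from rounded proportions avoids borderline cases, such as a group that sits close to the 0.5 cut-off.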

Visualisation 1

The figure below shows the default counts by personal status. Borrowers in common law relationships default the most, followed by single borrowers, while married and divorced borrowers default at about the same level.

Figure 1: Default Count Based on Personal Status

Visualisation 2

We have established that single borrowers and borrowers in common law relationships default the most. Next we look at why they apply for a loan, that is, what they use the loan for. The figure below shows that home maintenance (furniture, domestic appliances, repairs and home entertainment) is the main purpose, followed by skill development, which encompasses education, training programs and business development. In the further analysis and conclusion, we will tie these findings together into a brief summary of the borrowers we should watch out for.

Figure 2: Default Count Based on Personal Status with Purpose of Loan
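
The counts behind the first two visualisations are simple frequency tables. The sketch below uses a handful of made-up defaulter records with hypothetical column names, and shows one way to tabulate defaults by personal status and then by status and purpose together.

```r
# Hypothetical defaulter records (illustrative values only)
defaulters <- data.frame(
  personal_status = c("single", "common law", "common law", "married",
                      "single", "common law", "divorced", "single"),
  purpose = c("home maintenance", "home maintenance", "skill development",
              "car", "skill development", "home maintenance",
              "home maintenance", "home maintenance")
)

# Defaults per personal status
by_status <- table(defaulters$personal_status)
print(sort(by_status, decreasing = TRUE))

# Defaults per personal status, split by loan purpose
by_purpose <- table(defaulters$personal_status, defaulters$purpose)
print(by_purpose)
```

Passing `t(by_purpose)` to `barplot()` with `beside = TRUE` would draw the grouped bars, one bar per purpose within each status.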

Visualisation 3

Most of the defaulters request loans of 5000 dollars or less and are 40 years old or younger.

The concentration of points at the left end of the figure below confirms this. Only two defaulters requested over 15000 dollars.

Figure 3: Relationship between the amount requested by the borrower and the Age (Specifically for those categorised under high risk)

Visualisation 4

Figure 4 below shows the distribution of housing status among these defaulters. A larger percentage of those in the high risk category own their houses. This agrees with Figure 2, where most of the defaulters requested loans for home maintenance (furniture, domestic appliances, repairs and home entertainment). While we might expect house ownership to make default less likely, we see the opposite in this case.

There are extremely few defaulters among those who have fully paid off their houses.

Figure 4: Personal Status and Housing Situation of Those who are classified as high risk

Dependents and Installment Rate of The Common Law Defaulters

  • The average installment rates for borrowers with a checking balance of less than 0 dollars, between 1 and 1000 dollars, and greater than 1000 dollars are 3.3, 2.8 and 3.0 respectively.

This statistic describes the installment payment setup for the borrowers, where 1 means weekly, 2 means bi-weekly, 3 means monthly and 4 means quarterly. Borrowers with a checking balance below 0, who do not have sufficient funds in their account, have the longest payment intervals (mostly monthly or quarterly). It is therefore understandable that these individuals take the longest installment period to pay.

Surprisingly, those with greater than a thousand dollars in their balance, also have the monthly installment payment set up.
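
A grouped average like the one above can be computed with `tapply`. The records in this sketch are invented and the column names are assumptions, so the printed averages will not match the figures quoted in the text.

```r
# Hypothetical records for common law defaulters (illustrative values only)
cl_defaulters <- data.frame(
  checking_balance = c("< 0", "< 0", "1 - 1000", "1 - 1000", "> 1000", "> 1000"),
  installment_rate = c(4, 3, 2, 3, 3, 3)
)

# Keep the buckets in a meaningful order instead of alphabetical order
cl_defaulters$checking_balance <- factor(
  cl_defaulters$checking_balance,
  levels = c("< 0", "1 - 1000", "> 1000")
)

# Mean installment rate within each checking-balance bucket
avg_rate <- tapply(cl_defaulters$installment_rate,
                   cl_defaulters$checking_balance, mean)
print(round(avg_rate, 2))
```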

Lastly, the dependents of these borrowers matter a lot, because we need to understand how many people they are catering for and how much responsibility they carry, including kids, parents and extended family members. The table below gives the number of dependents for the common law defaulters. Almost all of them (71 of 72) have 1 dependent but still defaulted. This suggests that the responsibility and expenses brought on by even a single dependent can be financially draining.

Count of Dependents - Common Law Defaulters

Number of Dependents   Count
1                      71
2                      1

Analysis II - Probability of Default (Using Multiple Linear Regression)

We will use regression to develop the model that predicts the probability of default from the features of the data.
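
The fitting step itself is not shown in this report. Because the predictions later in this section use `type = "response"` with a 0.5 cut-off, the sketch below assumes a logistic regression (a binomial `glm`) followed by stepwise selection with `step()`. That is an assumption about the method, and the data, coefficients and object names are simulated or hypothetical.

```r
set.seed(1)

# Simulated loan data (hypothetical): default depends on duration and amount
n <- 400
loan_data <- data.frame(
  duration = round(runif(n, 4, 72)),
  amount   = round(runif(n, 250, 18000)),
  age      = round(runif(n, 19, 75))
)
log_odds <- -3 + 0.04 * loan_data$duration + 0.0001 * loan_data$amount
loan_data$default <- rbinom(n, 1, plogis(log_odds))

# 70/30 split into training and test rows
train_rows <- sample(n, size = round(0.7 * n))
train_ld <- loan_data[train_rows, ]
test_ld  <- loan_data[-train_rows, ]

# Full logistic model, then stepwise selection by AIC
full_model <- glm(default ~ ., data = train_ld, family = binomial)
step_model <- step(full_model, direction = "both", trace = 0)

# Predicted default probabilities on the held-out rows, cut at 0.5
prob <- predict(step_model, newdata = test_ld, type = "response")
predicted_class <- ifelse(prob > 0.5, 1, 0)
print(head(predicted_class))
```

With a binomial family, `type = "response"` returns probabilities between 0 and 1, which is what makes the 0.5 cut-off meaningful.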

Using Test Data

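library(magrittr)  # provides the %>% pipe
# Predicted default probabilities for the test set
possibility <- step_loan_data %>% predict(test_ld, type = "response")
# Classify as default (1) when the probability exceeds 0.5
predicted.classes <- ifelse(possibility > 0.5, 1, 0)
head(predicted.classes)
##  2  4 10 15 21 28 
##  0  1  0  1  0  1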

Using New Incoming Loan Applicant Data

  • Checking balance of 1-1000 dollars
  • Loan Duration of 24 months
  • Existing credit history - 1 (Critical)
  • Requested Amount - 10,000 dollars
  • Savings Balance - 501-1000 dollars
  • Employment Duration - 1 year
  • Installment Rate - 2 (Bi-weekly)
  • Personal Status - 1 (Divorced)
  • Other Debtors - 0 (None)
  • Residence History - 3 years
  • Property - 2 (Savings or Real Estate)
  • Age - 28
  • Installment - 2 (Bank)
  • Housing - 1 (Fully Paid)
  • Existing Loans - 0
  • Dependents - 1
  • Landline - 1 (Yes)
  • Foreign Worker - Yes
  • Purpose of Loan - Skill development
  • Job - Self employed
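newdat <- c(1,24,1,10000,3,1,2,1,0,3,2,28,2,1,0,1,1,1,1,0,0,0,1,0,0)
# predict() looks up predictors by name, so newdat has to be supplied as a
# one-row data frame whose column names match those used to fit step_loan_data.
possibility <- step_loan_data %>% predict(newdat, type = "response")
# Classify the new applicant from its predicted probability
predicted.classes <- ifelse(possibility > 0.5, 1, 0)
predicted.classes
## 3011 
##    1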
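
Scoring a single applicant requires a one-row data frame whose column names match the predictors of the fitted model. The minimal sketch below shows the pattern with a small logistic model fitted on simulated data. The model, its two predictors and all values are hypothetical stand-ins for the full set of features listed above.

```r
set.seed(7)

# Small simulated training set (hypothetical)
train <- data.frame(
  duration = round(runif(200, 4, 72)),
  amount   = round(runif(200, 250, 18000))
)
train$default <- rbinom(200, 1, plogis(-3 + 0.04 * train$duration +
                                         0.0001 * train$amount))
model <- glm(default ~ duration + amount, data = train, family = binomial)

# New applicant as a one-row data frame with the same column names
new_applicant <- data.frame(duration = 24, amount = 10000)

# Probability of default, then the 0/1 decision at the 0.5 cut-off
prob_default <- predict(model, newdata = new_applicant, type = "response")
predicted_class <- ifelse(prob_default > 0.5, 1, 0)
print(prob_default)
print(predicted_class)
```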

Analysis III - Who are The Typical Customers To Focus On? The Good Borrowers.

We focus on the group that contains the largest number of non-defaulters, Group 1, where 324 of 351 borrowers did not default. Studying this group helps us understand what these borrowers have in common and why so many of them were placed together.

Visualisation 5

The figure below shows that 68% of the non-defaulters in the good category, our typical customers, are skilled employees, while just 2% are unemployed non-residents. This is reasonable: unemployed non-residents are not entitled to work in the country, so repaying a loan is harder for them unless they have another source of income or wealth.

Figure 5: Loan Applicants by Job

Visualisation 6

Most of the typical customers who are non-defaulters have fully repaid their loans at this bank and applied for the loan to meet skill development needs, which encompass education, training programs and business development. Home maintenance remains the top reason applicants ask for a loan.

Figure 6: Credit History with the Purpose of Loan

Visualisation 7

Many points are concentrated at the low end of the requested amount. Many of the typical customers are single applicants who borrow often and request 5000 dollars or less, with just a few of them requesting more than 10,000 dollars.

Figure 7: Personal Status and The Requested Amount Borrowed (Good Borrowers)

Conclusion

Lenders try to make as much profit as possible and to get paid when payment is due. In order to do that, they have to avoid high risk borrowers. These borrowers became the focus of this analysis after all borrowers were classified into low, medium and high risk. From the analysis of these high risk borrowers, we were able to identify their important features and specifics, which are listed below.

The combination of all these features will serve as the benchmark for judging, to a reasonable extent, whether an incoming loan applicant will default.

Our typical customers who are good and on the low risk side are those who are

The reason to focus on this group is the profit the lender could make by lending to diligent borrowers with a low level of risk and a good credit history. Again, the combination of these features will serve as a good benchmark or guide for identifying the good applicants.

References

https://www.researchgate.net/publication/346080530_CLUSTERING_ANALYSIS_TO_SUPPORT_LENDER%27S_DECISION-MAKING_IN_P2P_LENDING_-_Bondora_case_study_borrower%27s_creditworthiness_classification - Yingqi Zuo