Automobile: “Automobile, byname auto, also called motorcar or car, a usually four-wheeled vehicle designed primarily for passenger transportation and commonly propelled by an internal-combustion engine using a volatile fuel” (Cromer et al., 2023). Automobiles may also be referred to as vehicles, motor vehicles, light trucks, etc. These days, automobiles can be propelled by either an internal-combustion engine or an electric battery. About 282 million vehicles were registered in the United States in 2021, roughly 90 million more than the approximately 193 million registered in 1990 (U.S. Vehicle Fleet 1990-2021 | Statista, 2023). Also, the average selling price of a new vehicle in the United States in 2022 was about $46,000 (U.S.: Average Selling Price of New Vehicles 2022 | Statista, 2023).
Auto Loan: An auto loan is the money you borrow to pay for your car, which you repay with interest in fixed installments (Martin, 2023). Auto loans are also referred to as car loans, vehicle loans, car financing, etc. These loans are often secured loans, meaning that the car is used as collateral to secure the loan. Typically, consumers borrow money to buy vehicles: in 2022, consumers owed about 1.41 trillion US dollars on the vehicles they drove, the average auto loan balance is about $22,000, and about 80% of all new vehicles on the road are financed through a loan or lease (Chris, 2023). This shows that the auto-loan industry is a multi-trillion-dollar industry, which underscores the importance of loan-default models to help minimize losses.
Vehicle Loan Default: Loan default occurs when a borrower fails to make installment payments as agreed in the loan terms. For secured loans, the lender (the bank in this case) can repossess the asset (the car) used as collateral for the loan. Banks usually prefer not to do this, but are sometimes forced to if the borrower defaults and does not make an arrangement with the lender. When a lender repossesses the car, its value at the time of repossession may not cover the outstanding loan balance, and the lender has no option but to write off that balance as a loss. In the US, default rates for auto loans are on the rise and currently sit at about 2%. The benefits of being able to predict whether a borrower will default cannot be overemphasized: such predictions not only help lenders decide whether to approve or deny a loan application, they also help them price the interest rate for each borrower appropriately.
The problem of loan default is inherently probabilistic, and it features strongly in credit risk modeling and credit scoring. Hence, lenders use a vast array of credit risk tools to determine whether a borrower is likely to default. Entering default simply means that the lender determines that the borrower is not going to pay, usually some time after 90 days of missed payments, and it can result in the car being repossessed (O’Brien, 2023).
According to Crook et al. (2007), credit scoring is concerned with developing empirical models to support decision making in the retail credit business. A credit score is a model-based estimate of the probability that a borrower will show some undesirable behavior in the future. In application scoring, for example, lenders employ predictive models, called scorecards, to estimate how likely an applicant is to default. Such PD (probability of default) scorecards are routinely developed using classification algorithms (Hand & Henley, 1997).
Whilst the extension of credit goes back to the Babylonian times, the history of credit scoring began in 1941 with the publication by Durand of a study that distinguished between good and bad loans made by 37 firms (Crook et al. 2007). Since then, the already established techniques of statistical discrimination have been developed and an enormous number of new classification algorithms have been researched and tested. Virtually all major banks use credit scoring with specialized consultancies providing credit scoring services and offering powerful software to score applicants, monitor their performance and manage their accounts.
Altman and Saunders (1998) published an overview of credit risk modelling over the preceding 20 years and found that the field had evolved drastically during that period due to newly emerging statistical techniques (Altman & Saunders, 1998). Later, another group of researchers published an extension of Altman and Saunders’ work, presenting further developments in credit risk modelling (Hao, Alam, & Carling, 2010). Their work identified more than 1000 articles on this topic and found that the logistic regression (LR) model and discriminant analysis are the most widely used methods for constructing scoring systems.
Also, Crook et al. (2007) conducted research on credit risk scoring and found that the most common method of estimating a classifier that separates applicants likely to repay from those unlikely to repay is logistic regression, with the logit value compared against a cut-off. In essence, this research claims that the industry standard for predicting loan default is the logistic regression model.
Lessmann et al. (2015) compared 41 classifiers based on six performance measures across eight real-world credit scoring data sets from the UK, Europe, and Australia. They investigated overall model performance across the datasets and examined the predictive performance in each case. The conclusion from this research suggests that several classifiers predict risk significantly better than the industry standard of logistic regression (LR). It goes further to recommend the Random Forest (RF) model as a benchmark because of its effectiveness, precision, and interpretability.
Agrawal et al. (2014) studied the impact of contract-specific variables as predictors in commercial vehicle loans. In their research, applying a logistic regression model for predicting default, around 11 out of 17 contract-specific variables were identified as providing additional assistance to the credit lending institution (Agrawal, Agrawal, & Raizada, 2014). The authors also suggest that contract information could improve accuracy in more advanced nonlinear models; specifically, they suggest the use of neural networks as one potential predictive model to improve performance based on contract information (Agrawal et al., 2014).
Keeping the outcome of the above literature in mind, this project aims to contribute to the field of vehicle loan default prediction by developing machine learning models that can predict vehicle loan default from the available data, and by comparing three (3) models, Random Forest, XGBoost, and Neural Networks, against the industry-standard Logistic Regression model. The data set used in this project contains more data than those used in most of the literature reviewed above; more data generally yields more reliable estimates, which should help in comparing the models and deciding which one provides the best prediction metrics. In this work, we describe the data used, the pre-processing and feature selection involved, and provide a quick overview of predictive analytics as well as a brief review of the scoring metrics for classification problems. In addition, each of the models used is briefly explained, followed by the analysis of the data, the modeling and testing, and the conclusions from the findings.
Logistic Regression is a classification algorithm used when the response variable is binary, meaning that the response variable is a two-level categorical variable. The Logistic Regression model belongs to the family of generalized linear models (GLMs). Typical examples of binary response variables are Yes/No, Male/Female, Cancer/No Cancer, and Approve/Deny; these response variables are often coded as 1 or 0. Even though the name contains “regression”, the logistic regression model is used when the response variable is discrete, i.e., it is a classification algorithm.
Logistic regression need not be binary: there are situations where the response variable has more than two categories. Such situations call for multinomial logistic regression, which is beyond the scope of this project.
The logistic regression model relates the probability that a response
variable would be successful to the predictors \(x_{1, i}, x_{2, i}, ..., x_{k, i}\) through
a framework like that of multiple regression:
\(\mathrm{logit}(p_{i}) = \log_{e}\left(\frac{p_{i}}{1-p_{i}}\right) = \beta_{0} + \beta_{1}x_{1,i} + \beta_{2}x_{2,i} + \cdots + \beta_{k}x_{k,i}\)
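To make this concrete, a logistic regression of this form can be fit in R with the base glm() function. The sketch below is illustrative only; the object names (df_train, df_test, loan_default) simply mirror the naming used later in this report, and the formula is not the exact specification used in the analysis.
# Illustrative sketch: fit a logistic regression (logit link) in base R.
logit_model <- glm(loan_default ~ ., data = df_train, family = binomial(link = "logit"))
summary(logit_model)   # estimated beta coefficients on the logit scale

# Predicted probabilities of default and a 0.5 cut-off to form classes
p_hat <- predict(logit_model, newdata = df_test, type = "response")
pred_class <- ifelse(p_hat > 0.5, 1, 0)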
Tree-based models are a type of supervised learning algorithm used
for both classification and regression problems. They construct a
decision tree that recursively splits the data into subsets based on the
most significant features. The tree structure consists of nodes
representing the features, edges representing the decision rules, and
leaves representing the output (class or value). The key advantage of
tree-based models is their ability to handle non-linear relationships
and interactions between features. However, they are prone to
overfitting, especially when the trees become too complex. Tree-based models follow two major approaches: bagging and boosting.
Bagging (Bootstrap Aggregating): It’s a technique that aims to
reduce variance and prevent overfitting by training multiple models on
different bootstrapped subsets of the dataset and then averaging the
predictions. Random Forest is an ensemble method based on bagging,
utilizing multiple decision trees trained on different subsets of the
data.
Boosting: Boosting is an ensemble technique that combines weak
learners (typically shallow trees) sequentially to create a strong
model. It focuses on improving the shortcomings of its predecessors by
assigning higher weight to misclassified data, effectively learning from
previous mistakes. Gradient Boosting and XGBoost are examples of
boosting algorithms.
Random Forest: It is an ensemble learning method that constructs multiple decision trees and merges their predictions to improve accuracy and reduce overfitting. It introduces randomness both in feature selection and in dataset bootstrapping. By aggregating predictions from many trees, it tends to be more robust and less prone to overfitting than a single decision tree.
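As an illustration, a random forest can be fit with the ranger package, which is also the implementation loaded later in this report. The data frame and response names below are assumptions for the sketch, not the exact call used in the analysis.
library(ranger)

# Illustrative sketch: random forest on the training data with ranger.
# loan_default is assumed to be a factor so that class probabilities are returned.
rf_model <- ranger(loan_default ~ .,
                   data = df_train,
                   num.trees = 500,     # number of bootstrapped trees
                   probability = TRUE,  # return class probabilities
                   seed = 42)
rf_model$prediction.error   # out-of-bag estimate of the prediction error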
Gradient Boosting: This is a boosting technique that builds trees sequentially, fitting each tree to the residuals (errors) of the preceding trees and thereby reducing the error at each step. Gradient boosting is a powerful algorithm known for its ability to handle complex data and achieve high accuracy.
XGBoost: It is an optimized and highly efficient implementation of gradient boosting. XGBoost improves upon the traditional gradient boosting method by introducing regularization, parallel computing, and a variety of enhancements that significantly speed up the training process and improve accuracy.
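For illustration, the sketch below shows a typical xgboost call with a small amount of regularization; X_train (an all-numeric matrix of dummified predictors) and y_train (the 0/1 response) are assumed objects, not ones defined in this report.
library(xgboost)

# Illustrative sketch: gradient boosting with xgboost.
# X_train is an all-numeric predictor matrix; y_train is the 0/1 response.
dtrain <- xgb.DMatrix(data = X_train, label = y_train)
xgb_model <- xgb.train(params = list(objective = "binary:logistic",  # probability of default
                                     eta = 0.1,       # learning rate
                                     max_depth = 6,   # depth of each tree
                                     lambda = 1),     # L2 regularization
                       data = dtrain,
                       nrounds = 200)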
Artificial Neural Networks (ANNs) are a class of machine learning models inspired by the structure and function of the human brain. They consist of interconnected nodes, known as neurons, organized in layers to process and learn from data.
Image and Speech Recognition: CNNs are widely used in image
classification, object detection, and speech recognition.
Natural Language Processing (NLP): RNNs and variants like LSTM
and GRU are used for text analysis, language translation, and sentiment
analysis.
Predictive Modeling: ANNs are applied in various predictive
tasks, including regression, classification, and time-series
forecasting.
Computational Complexity: Training large ANNs can be
computationally intensive and require significant resources.
Overfitting: ANNs, especially with complex architectures, can
overfit the training data if not properly regularized or trained on
diverse data.
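As a concrete example, a small feed-forward network for this kind of binary classification can be fit with the nnet package loaded later in this report. The sketch is illustrative; df_train_dummy (a data frame with dummified, numeric predictors and a factor response) is an assumption, not an object from this analysis.
library(nnet)

# Illustrative sketch: single-hidden-layer neural network with nnet.
# df_train_dummy is assumed to hold numeric/dummified predictors and a factor loan_default.
nnet_model <- nnet(loan_default ~ .,
                   data = df_train_dummy,
                   size = 5,      # neurons in the hidden layer
                   decay = 0.01,  # weight decay (regularization) against overfitting
                   maxit = 200)   # maximum training iterations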
The data was split using the caTools package into training data (df_train) and testing data (df_test), as sketched below.
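A minimal sketch of such a split with caTools::sample.split follows; the 70/30 ratio and the seed are assumptions, not necessarily the settings used in this report, and the input data frame name follows the cleaned training data defined later.
library(caTools)

# Illustrative sketch: train/test split stratified on the response variable.
set.seed(123)
split_flag <- sample.split(vehicle_loan_default_train_cleaned$loan_default, SplitRatio = 0.7)
df_train <- subset(vehicle_loan_default_train_cleaned, split_flag == TRUE)
df_test  <- subset(vehicle_loan_default_train_cleaned, split_flag == FALSE)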
Note: Primary accounts are those which the customer has taken for his personal use, while secondary accounts are those on which the customer acts as a co-applicant or guarantor.
The response variable is the loan_default variable; it is a two-level categorical variable with a value of 1 for default and a value of 0 for no default.
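For example, the class balance of the response can be checked with a simple frequency table (illustrative; the data frame name follows the cleaned training data defined later in this report):
# Counts and proportions of non-defaults (0) vs defaults (1)
table(vehicle_loan_default_train_cleaned$loan_default)
prop.table(table(vehicle_loan_default_train_cleaned$loan_default))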
The method adopted in this analysis follows a number of steps, and code re-use was essential to avoid errors resulting from code duplication. To perform this analysis, the following steps were taken in RStudio:
1. Clean the data using the data_cleaning function developed in this analysis. It leverages the tidyverse group of packages.
2. Visualize the distribution of the response variable loan_default in a bar chart. Further exploratory data analysis was done to better understand the data.
3. Pre-process the data using the data_preprocessing function, also developed in this analysis.
4. Train the models using the model_training function, developed as part of this analysis as well.
Further details about the functions used in the analysis are provided in the section below:
To clean and pre-process the data, two major functions, data_cleaning and data_preprocessing, were developed to avoid code duplication and errors, such that new data can simply be passed into those functions to be cleaned, pre-processed, and made ready for either training or testing.
This function pre-processes the data for either training or testing purposes. It expects the cleaned data from the data_cleaning function and a mode that specifies whether the data will be used for training or testing. It has a further helper function, data_preprocess_scaling, that uses the standard-scaler approach to scale the data. For testing data, the mode is passed as ‘test’ and the function simply calls the helper function to scale the data and returns the scaled data, which can then be used for testing purposes. For training data, the mode is passed as ‘train’ and the function uses the ROSE package in R to perform oversampling/under-sampling to balance the classes in this binary classification problem, after which it calls the helper function to standardize/scale the balanced data.
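A minimal sketch of what this step could look like is shown below, assuming the cleaned data frame produced by data_cleaning. It is not the report’s actual implementation, only an illustration of the ROSE class balancing and standard scaling described above; the function names carry a _sketch suffix to make that explicit.
library(ROSE)

# Illustrative sketch (not the report's exact code) of the pre-processing step.
data_preprocess_scaling_sketch <- function(df){
  # standard-scale every numeric predictor; leave the response untouched
  num_cols <- setdiff(names(df)[sapply(df, is.numeric)], "loan_default")
  df[num_cols] <- scale(df[num_cols])
  df
}

data_preprocessing_sketch <- function(df, mode = c("train", "test")){
  mode <- match.arg(mode)
  if (mode == "train") {
    # balance the classes with combined over/under-sampling (ROSE::ovun.sample)
    df <- ovun.sample(loan_default ~ ., data = df, method = "both",
                      N = nrow(df), p = 0.5, seed = 42)$data
  }
  data_preprocess_scaling_sketch(df)
}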
This function performs the model training for any of the four (4) models. It expects a pre-processed dataframe and the type of model to train. It has a further helper function, xgb_nnet_preprocess, that does additional preprocessing for the XGB and Neural Network models. For the XGB model, the training function expects a DMatrix, which can contain only numeric values; in that case, the model_training function calls this helper function, which converts the non-numeric predictor variables to dummies and converts the dataframe into the DMatrix format needed to train the model with the XGB algorithm. For the NNET model, the helper function only dummifies the non-numeric predictor variables.
For models that need further pre-processing, the function calls the
appropriate helper function to do that and then trains the model using
the data and type of model provided to return a trained model.
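The helper described above might look roughly like the sketch below, assuming the fastDummies and xgboost packages already loaded for this analysis. This is an illustration of the dummification and DMatrix conversion, not the report’s exact code, hence the _sketch suffix on the name.
library(fastDummies)
library(xgboost)

# Illustrative sketch of the dummification / DMatrix conversion described above.
xgb_nnet_preprocess_sketch <- function(df, model_type){
  # turn character/factor predictors into 0/1 dummy columns
  df_dummy <- dummy_cols(df, remove_selected_columns = TRUE)
  if (model_type == "xgb") {
    # xgboost needs an all-numeric DMatrix plus a separate label vector
    X <- as.matrix(df_dummy[, setdiff(names(df_dummy), "loan_default")])
    return(xgb.DMatrix(data = X, label = df_dummy$loan_default))
  }
  df_dummy   # for the neural network, the dummified data frame is enough
}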
This behaves similarly to the model_training function, although it does so for predictions. It expects three parameters: data, trained_model, and model_type. Depending on the type of model, it determines whether the data needs further pre-processing, as in the case of the XGB and NNET models, where it calls the same helper function xgb_nnet_preprocess to return the data in the same format that was given to the training model, since that is what is needed to make predictions. The predictions returned are probabilities of belonging to a certain class (Class 0). However, this function goes further to convert these probabilities to classes using a given threshold, which can always be adjusted based on business needs. For this project, a 50% threshold is used.
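For instance, converting predicted probabilities into class labels with an adjustable cut-off amounts to a one-liner; the object names below are illustrative only.
# Illustrative: convert predicted probabilities into 0/1 classes at a chosen cut-off.
# predicted_probs is assumed to hold probabilities of default (class 1); if a model
# returns probabilities of class 0 instead, the comparison is reversed.
threshold <- 0.5
predicted_class <- ifelse(predicted_probs >= threshold, 1, 0)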
After training a model and using it to make predictions, it is very important to determine whether the predictions of the model can be useful or taken seriously. Hence, we evaluate the model, and this function does that evaluation. It calculates the accuracy, precision, recall, and AUC for any one of the four models once the actual and predicted values are provided. It relies heavily on the Metrics package in R for the evaluations, except for AUC, where it uses the pROC package.
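A minimal sketch of such an evaluation helper, assuming the Metrics and pROC packages named above, could look like the following; the function and argument names are illustrative, and passing predicted probabilities for the AUC is an assumption rather than the report’s exact signature.
library(Metrics)
library(pROC)

# Illustrative sketch of the evaluation step: accuracy, precision, recall and AUC.
evaluate_model_sketch <- function(actual, predicted_class, predicted_probs){
  list(accuracy  = Metrics::accuracy(actual, predicted_class),
       precision = Metrics::precision(actual, predicted_class),
       recall    = Metrics::recall(actual, predicted_class),
       auc       = as.numeric(pROC::auc(pROC::roc(actual, predicted_probs))))
}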
Note: Although there is a section for exploratory data analysis, no dedicated function was written for it.
Libraries
Load the required libraries for the analysis
library(Amelia) # To visualize missing data
library(caret)
library(caTools) # for train test split
library(corrplot) # To plot correlation plot
library(cowplot) # To combine plots in a grid
library(fastDummies) # to convert character variables to dummies
library(ggcorrplot) # To plot correlation plot
library(kableExtra)
library(Metrics) # for model evaluation
library(nnet)
library(pROC)
library(ranger) # for random forest implementation
library(ROSE) # to balance the data
library(tidyverse)
library(xgboost)
Read the datasets into memory from the GitHub location.
url_vehicle_loan_default_train = "https://raw.githubusercontent.com/chinedu2301/data698-analytics-project/main/data/vehicle_default_train_data.csv"
url_vehicle_loan_default_test = "https://raw.githubusercontent.com/chinedu2301/data698-analytics-project/main/data/vehicle_default_test_data.csv"
vehicle_loan_default_train_raw = read_csv(url_vehicle_loan_default_train) %>% as_tibble()
vehicle_loan_default_test_raw = read_csv(url_vehicle_loan_default_test) %>% as_tibble()
Display a few records of the dataset to get an idea of what the data looks like.
# display a few records of the raw data
raw_data_few_records <- kable(head(vehicle_loan_default_train_raw, 50), "html") %>%
kable_paper("hover", full_width = F) %>%
scroll_box(width = "850px", height = "350px")
raw_data_few_records
| UNIQUEID | DISBURSED_AMOUNT | ASSET_COST | LTV | BRANCH_ID | SUPPLIER_ID | MANUFACTURER_ID | CURRENT_PINCODE_ID | DATE_OF_BIRTH | EMPLOYMENT_TYPE | DISBURSAL_DATE | STATE_ID | EMPLOYEE_CODE_ID | MOBILENO_AVL_FLAG | AADHAR_FLAG | PAN_FLAG | VOTERID_FLAG | DRIVING_FLAG | PASSPORT_FLAG | PERFORM_CNS_SCORE | PERFORM_CNS_SCORE_DESCRIPTION | PRI_NO_OF_ACCTS | PRI_ACTIVE_ACCTS | PRI_OVERDUE_ACCTS | PRI_CURRENT_BALANCE | PRI_SANCTIONED_AMOUNT | PRI_DISBURSED_AMOUNT | SEC_NO_OF_ACCTS | SEC_ACTIVE_ACCTS | SEC_OVERDUE_ACCTS | SEC_CURRENT_BALANCE | SEC_SANCTIONED_AMOUNT | SEC_DISBURSED_AMOUNT | PRIMARY_INSTAL_AMT | SEC_INSTAL_AMT | NEW_ACCTS_IN_LAST_SIX_MONTHS | DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS | AVERAGE_ACCT_AGE | CREDIT_HISTORY_LENGTH | NO_OF_INQUIRIES | LOAN_DEFAULT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 420825 | 50578 | 58400 | 89.55 | 67 | 22807 | 45 | 1441 | 01-01-1984 | Salaried | 03-08-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 0 | 0 |
| 537409 | 47145 | 65550 | 73.23 | 67 | 22807 | 45 | 1502 | 31-07-1985 | Self employed | 26-09-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 598 | I-Medium Risk | 1 | 1 | 1 | 27600 | 50200 | 50200 | 0 | 0 | 0 | 0 | 0 | 0 | 1991 | 0 | 0 | 1 | 1yrs 11mon | 1yrs 11mon | 0 | 1 |
| 417566 | 53278 | 61360 | 89.63 | 67 | 22807 | 45 | 1497 | 24-08-1985 | Self employed | 01-08-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 0 | 0 |
| 624493 | 57513 | 66113 | 88.48 | 67 | 22807 | 45 | 1501 | 30-12-1993 | Self employed | 26-10-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 305 | L-Very High Risk | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 0 | 0 | 0 | 0yrs 8mon | 1yrs 3mon | 1 | 1 |
| 539055 | 52378 | 60300 | 88.39 | 67 | 22807 | 45 | 1495 | 09-12-1977 | Self employed | 26-09-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 1 | 1 |
| 518279 | 54513 | 61900 | 89.66 | 67 | 22807 | 45 | 1501 | 08-09-1990 | Self employed | 19-09-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 825 | A-Very Low Risk | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1347 | 0 | 0 | 0 | 1yrs 9mon | 2yrs 0mon | 0 | 0 |
| 529269 | 46349 | 61500 | 76.42 | 67 | 22807 | 45 | 1502 | 01-06-1988 | Salaried | 23-09-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 0 | 0 |
| 510278 | 43894 | 61900 | 71.89 | 67 | 22807 | 45 | 1501 | 04-10-1989 | Salaried | 16-09-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 17 | Not Scored: Not Enough Info available on the customer | 1 | 1 | 0 | 72879 | 74500 | 74500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 2mon | 0yrs 2mon | 0 | 0 |
| 490213 | 53713 | 61973 | 89.56 | 67 | 22807 | 45 | 1497 | 15-11-1991 | Self employed | 05-09-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 718 | D-Very Low Risk | 1 | 1 | 0 | -41 | 365384 | 365384 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4yrs 8mon | 4yrs 8mon | 1 | 0 |
| 510980 | 52603 | 61300 | 86.95 | 67 | 22807 | 45 | 1492 | 01-06-1968 | Salaried | 16-09-2018 | 6 | 1998 | 1 | 0 | 0 | 1 | 0 | 0 | 818 | A-Very Low Risk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2608 | 0 | 0 | 0 | 1yrs 7mon | 1yrs 7mon | 0 | 0 |
| 548567 | 53278 | 61230 | 89.83 | 67 | 22807 | 45 | 1493 | 01-01-1979 | Self employed | 29-09-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 300 | M-Very High Risk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2270 | 0 | 0 | 0 | 0yrs 7mon | 0yrs 7mon | 0 | 1 |
| 486821 | 64769 | 74190 | 89.23 | 67 | 22807 | 45 | 1446 | 07-09-1984 | Salaried | 03-09-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 786 | B-Very Low Risk | 3 | 2 | 0 | 676 | 36154 | 23374 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2yrs 1mon | 2yrs 3mon | 1 | 0 |
| 478647 | 53278 | 61330 | 89.68 | 67 | 22807 | 45 | 1497 | 01-06-1974 | Salaried | 30-08-2018 | 6 | 1998 | 1 | 0 | 0 | 1 | 0 | 0 | 300 | M-Very High Risk | 7 | 2 | 1 | 0 | 69900 | 69900 | 0 | 0 | 0 | 0 | 0 | 0 | 3300 | 0 | 0 | 0 | 1yrs 3mon | 2yrs 9mon | 0 | 1 |
| 479533 | 49478 | 57010 | 89.46 | 67 | 22807 | 45 | 1497 | 16-08-1984 | Salaried | 30-08-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 738 | C-Very Low Risk | 10 | 5 | 0 | 79750 | 187000 | 187000 | 0 | 0 | 0 | 0 | 0 | 0 | 23309 | 0 | 1 | 0 | 1yrs 0mon | 2yrs 1mon | 4 | 1 |
| 483869 | 49278 | 57080 | 89.35 | 67 | 22807 | 45 | 1495 | 18-02-1973 | Self employed | 31-08-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 300 | M-Very High Risk | 5 | 5 | 3 | 95597 | 179252 | 179252 | 0 | 0 | 0 | 0 | 0 | 0 | 3514 | 0 | 0 | 0 | 3yrs 11mon | 7yrs 2mon | 0 | 1 |
| 600655 | 47549 | 61400 | 79.80 | 67 | 22807 | 45 | 1440 | 05-07-1994 | Salaried | 22-10-2018 | 6 | 1998 | 1 | 0 | 0 | 1 | 0 | 0 | 17 | Not Scored: Not Enough Info available on the customer | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7900 | 0 | 1 | 0 | 0yrs 1mon | 0yrs 1mon | 0 | 1 |
| 513916 | 57713 | 65750 | 89.28 | 67 | 22807 | 45 | 1440 | 01-06-1976 | Self employed | 18-09-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 300 | M-Very High Risk | 6 | 4 | 2 | 29069 | 1067200 | 1067200 | 0 | 0 | 0 | 0 | 0 | 0 | 47100 | 0 | 1 | 1 | 2yrs 6mon | 5yrs 6mon | 0 | 0 |
| 522020 | 53503 | 62100 | 87.28 | 67 | 22807 | 45 | 1498 | 27-02-1983 | Self employed | 20-09-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 688 | E-Low Risk | 13 | 8 | 0 | 1076657 | 2277048 | 2277048 | 0 | 0 | 0 | 0 | 0 | 0 | 4982 | 0 | 1 | 0 | 1yrs 10mon | 4yrs 7mon | 0 | 0 |
| 492995 | 70017 | 86760 | 82.99 | 67 | 22807 | 45 | 1479 | 10-08-1988 | Self employed | 06-09-2018 | 6 | 1998 | 1 | 0 | 0 | 1 | 0 | 0 | 585 | I-Medium Risk | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1yrs 9mon | 1yrs 9mon | 0 | 1 |
| 568857 | 58259 | 68500 | 86.13 | 67 | 22807 | 45 | 1468 | 16-04-1980 | Self employed | 11-10-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 615 | H-Medium Risk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 11mon | 0yrs 11mon | 1 | 1 |
| 590630 | 58013 | 69650 | 84.71 | 67 | 22807 | 45 | 1497 | 01-11-1978 | Self employed | 20-10-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 750 | C-Very Low Risk | 9 | 1 | 0 | 134499 | 32198 | 32198 | 0 | 0 | 0 | 0 | 0 | 0 | 557 | 0 | 1 | 0 | 0yrs 6mon | 0yrs 10mon | 1 | 0 |
| 467015 | 31184 | 57110 | 56.91 | 67 | 22807 | 45 | 1498 | 29-02-1984 | Salaried | 27-08-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 801 | B-Very Low Risk | 7 | 5 | 0 | 1338774 | 2306289 | 2291743 | 0 | 0 | 0 | 0 | 0 | 0 | 11083 | 0 | 0 | 0 | 2yrs 9mon | 5yrs 10mon | 2 | 0 |
| 563215 | 43594 | 78256 | 57.50 | 67 | 22744 | 86 | 1499 | 14-07-1994 | Self employed | 08-10-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 0 | 0 |
| 513139 | 54513 | 61900 | 89.66 | 67 | 22807 | 45 | 1468 | 31-05-1979 | Self employed | 17-09-2018 | 6 | 1998 | 1 | 0 | 0 | 1 | 0 | 0 | 738 | C-Very Low Risk | 1 | 1 | 0 | 6690 | 25200 | 25200 | 0 | 0 | 0 | 0 | 0 | 0 | 1700 | 0 | 0 | 0 | 1yrs 3mon | 1yrs 3mon | 0 | 0 |
| 498082 | 73123 | 92900 | 79.66 | 67 | 22807 | 45 | 1480 | 02-01-1989 | Self employed | 10-09-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 0 | 0 |
| 586411 | 55213 | 68600 | 83.09 | 67 | 22807 | 45 | 1494 | 01-01-1986 | Salaried | 18-10-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 0 | 0 |
| 440293 | 53713 | 61780 | 89.83 | 67 | 22807 | 45 | 1468 | 02-08-1968 | Self employed | 16-08-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 0 | 1 |
| 566763 | 57713 | 68040 | 86.27 | 67 | 22807 | 45 | 1497 | 01-01-1976 | Self employed | 10-10-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 0 | 1 |
| 605314 | 57513 | 65750 | 88.97 | 67 | 22807 | 45 | 1497 | 12-09-1972 | Self employed | 23-10-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 615 | H-Medium Risk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3yrs 1mon | 3yrs 1mon | 1 | 1 |
| 519075 | 54513 | 61900 | 89.66 | 67 | 22807 | 45 | 1473 | 27-06-1969 | Self employed | 19-09-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 730 | D-Very Low Risk | 5 | 3 | 0 | 101518 | 162800 | 162800 | 0 | 0 | 0 | 0 | 0 | 0 | 8972 | 0 | 1 | 0 | 2yrs 0mon | 5yrs 4mon | 0 | 0 |
| 551137 | 45349 | 60300 | 76.29 | 67 | 22807 | 45 | 1501 | 01-01-1974 | Self employed | 30-09-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 758 | C-Very Low Risk | 5 | 3 | 0 | 909093 | 1442349 | 1442349 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2yrs 0mon | 2yrs 9mon | 0 | 0 |
| 525983 | 46549 | 69518 | 69.05 | 67 | 22744 | 86 | 1480 | 23-05-1990 | Salaried | 21-09-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 0 | 0 |
| 501823 | 57259 | 70100 | 82.74 | 67 | 22807 | 45 | 1497 | 01-06-1966 | Salaried | 12-09-2018 | 6 | 1998 | 1 | 0 | 0 | 1 | 0 | 0 | 768 | B-Very Low Risk | 7 | 3 | 0 | 324323 | 604845 | 604845 | 0 | 0 | 0 | 0 | 0 | 0 | 1219 | 0 | 1 | 0 | 1yrs 10mon | 4yrs 10mon | 0 | 0 |
| 451537 | 42594 | 60630 | 72.57 | 67 | 22807 | 45 | 1497 | 29-07-1996 | Self employed | 21-08-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 0 | 0 |
| 439084 | 50678 | 58300 | 89.88 | 67 | 22807 | 45 | 1474 | 01-06-1977 | Self employed | 14-08-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 0 | 0 |
| 584660 | 53078 | 64280 | 84.01 | 67 | 22807 | 45 | 1497 | 05-10-1993 | Self employed | 17-10-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 610 | H-Medium Risk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2111 | 0 | 0 | 0 | 0yrs 4mon | 0yrs 4mon | 1 | 1 |
| 606338 | 56013 | 63930 | 89.16 | 67 | 22807 | 45 | 1502 | 22-09-1986 | Self employed | 23-10-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 653 | F-Low Risk | 9 | 6 | 0 | 3878357 | 4015900 | 4015900 | 0 | 0 | 0 | 0 | 0 | 0 | 126287 | 0 | 4 | 0 | 0yrs 11mon | 4yrs 0mon | 1 | 1 |
| 641415 | 58013 | 65838 | 89.61 | 67 | 22807 | 45 | 1497 | 01-01-1991 | Self employed | 30-10-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 413 | K-High Risk | 13 | 3 | 1 | 19121 | 91161 | 91161 | 0 | 0 | 0 | 0 | 0 | 0 | 22427 | 0 | 0 | 2 | 1yrs 1mon | 2yrs 1mon | 4 | 0 |
| 590213 | 55759 | 63100 | 89.54 | 67 | 22807 | 45 | 1492 | 19-06-1977 | Self employed | 20-10-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 709 | D-Very Low Risk | 4 | 3 | 0 | 18518 | 77480 | 77480 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1yrs 9mon | 3yrs 9mon | 4 | 0 |
| 422926 | 50578 | 58400 | 89.55 | 67 | 22807 | 45 | 1577 | 16-08-1996 | Salaried | 06-08-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 0 | 1 |
| 557071 | 51303 | 66450 | 78.25 | 67 | 22807 | 45 | 1499 | 01-06-1969 | Salaried | 04-10-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 719 | D-Very Low Risk | 5 | 2 | 0 | 8000 | 145000 | 145000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1yrs 3mon | 2yrs 11mon | 0 | 0 |
| 582949 | 40894 | 61230 | 67.78 | 67 | 22807 | 45 | 1497 | 07-02-1993 | Self employed | 16-10-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 0 | 0 |
| 596436 | 42894 | 70600 | 61.61 | 67 | 22807 | 45 | 1497 | 20-06-1982 | Self employed | 21-10-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 0 | 0 |
| 507978 | 64282 | 74290 | 89.11 | 67 | 22807 | 45 | 1474 | 01-06-1990 | Salaried | 15-09-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 0 | 0 |
| 529381 | 57213 | 64750 | 89.88 | 67 | 22807 | 45 | 1502 | 27-11-1976 | Salaried | 23-09-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 16 | Not Scored: No Activity seen on the customer (Inactive) | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 11mon | 0yrs 11mon | 0 | 0 |
| 480958 | 68082 | 79806 | 87.71 | 67 | 22744 | 86 | 1504 | 05-03-1990 | Salaried | 31-08-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | No Bureau History Available | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0yrs 0mon | 0yrs 0mon | 0 | 0 |
| 566809 | 48349 | 67650 | 72.43 | 67 | 22807 | 45 | 1497 | 15-01-1993 | Salaried | 10-10-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 15 | Not Scored: Sufficient History Not Available | 1 | 1 | 0 | 155000 | 155000 | 155000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0yrs 0mon | 0yrs 0mon | 1 | 0 |
| 585779 | 61013 | 68850 | 89.76 | 67 | 22807 | 45 | 1501 | 01-06-1986 | Self employed | 17-10-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 701 | E-Low Risk | 9 | 2 | 0 | 157671 | 214800 | 214800 | 0 | 0 | 0 | 0 | 0 | 0 | 2667 | 0 | 0 | 0 | 0yrs 5mon | 1yrs 8mon | 0 | 1 |
| 559601 | 54078 | 70000 | 78.57 | 67 | 22807 | 45 | 1497 | 25-03-1974 | Salaried | 06-10-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 626 | H-Medium Risk | 4 | 4 | 1 | 2470898 | 2836417 | 2836417 | 0 | 0 | 0 | 0 | 0 | 0 | 29840 | 0 | 0 | 0 | 2yrs 6mon | 5yrs 2mon | 1 | 0 |
| 612741 | 57613 | 68950 | 84.99 | 67 | 22807 | 45 | 1499 | 14-09-1980 | Self employed | 24-10-2018 | 6 | 1998 | 1 | 1 | 0 | 0 | 0 | 0 | 717 | D-Very Low Risk | 1 | 1 | 0 | 3793 | 49597 | 49597 | 0 | 0 | 0 | 0 | 0 | 0 | 1956 | 0 | 0 | 0 | 2yrs 10mon | 2yrs 10mon | 1 | 1 |
This function cleans the data. It accepts only a dataframe (or a tibble).
data_cleaning <- function(df){
# This function accepts a dataframe (df) as input and returns another dataframe (cleaned_df) that is clean.
tryCatch({
if(is_tibble(df) | is.data.frame(df)){
print("The dataframe is a tibble, will proceed to clean data")
print("Data Cleaning in progress...")
# rename all the columns in the dataframe to lowercase
cleaned_df <- df %>% rename_all(tolower) %>%
# compute the age of the applicant in number of years
mutate(date_of_birth = dmy(date_of_birth),
disbursal_date = dmy(disbursal_date),
age = difftime(disbursal_date, date_of_birth, units = "days"),
age_years = round(as.numeric(age / 365.25), 0)) %>%
# extract the years and month component of the average_acct_age and convert to years
mutate(average_acct_age_year_comp = as.numeric(str_extract(average_acct_age, "\\d+")),
average_acct_age_mon_comp = as.numeric(str_extract(average_acct_age, "\\d+(?=mon)")),
average_acct_age = round((average_acct_age_year_comp + average_acct_age_mon_comp/12), 0)
) %>%
# extract the years and month component of the credit_history_length and convert to years
mutate(credit_history_length_year_comp = as.numeric(str_extract(credit_history_length, "\\d+")),
credit_history_length_comp = as.numeric(str_extract(credit_history_length, "\\d+(?=mon)")),
credit_history_length = round((credit_history_length_year_comp + credit_history_length_comp/12), 0)
) %>%
# clean up the perform_cns_score_distribution to include only a few categories
mutate(lowercase_cns_description = tolower(perform_cns_score_description),
perform_cns_score_description = case_when(
str_detect(lowercase_cns_description, "very low risk") ~ "very_low_risk",
str_detect(lowercase_cns_description, "low risk") ~ "low_risk",
str_detect(lowercase_cns_description, "medium risk") ~ "medium_risk",
str_detect(lowercase_cns_description, "high risk") ~ "high_risk",
str_detect(lowercase_cns_description, "very high risk") ~ "very_high_risk",
str_detect(lowercase_cns_description, "not scored|no bureau") ~ "low_risk",
TRUE ~ "none")) %>%
# clean up the employment type to have only few categories
mutate(lower_case_employment_type = tolower(employment_type),
employment_type = case_when(
str_detect(lower_case_employment_type, "salaried") ~ "salaried",
str_detect(lower_case_employment_type, "self employed") ~ "self_employed",
TRUE ~ "not_reported")) %>%
# select only the required columns
select(
age_years, disbursed_amount, asset_cost, ltv, employment_type, perform_cns_score_description,
pri_no_of_accts, pri_active_accts, pri_overdue_accts, pri_current_balance, pri_sanctioned_amount,
pri_disbursed_amount, sec_no_of_accts, sec_active_accts, sec_overdue_accts, sec_current_balance,
sec_sanctioned_amount, sec_disbursed_amount, primary_instal_amt, sec_instal_amt, new_accts_in_last_six_months,
delinquent_accts_in_last_six_months, average_acct_age, credit_history_length, no_of_inquiries, loan_default
)
print("Data Cleaning complete!!!")
return(cleaned_df)
}
else{
print("The dataframe is not a tibble. Kindly have your data in the form of a dataframe or a tibble")
}
},
#if an error occurs, tell me the error
error=function(e) {
message('An Error Occurred')
print(e)
},
#or if a warning occurs, tell me the warning
warning=function(w) {
message('A Warning Occurred')
print(w)
return(NA)
}
)
}
Use the data_cleaning function to clean the data.
## [1] "The dataframe is a tibble, will proceed to clean data"
## [1] "Data Cleaning in progress..."
## [1] "Data Cleaning complete!!!"
Display a few records of the cleaned data.
# display a few records of the cleaned data
cleaned_data_few_records <- kable(head(vehicle_loan_default_train_cleaned, 200), "html") %>%
kable_paper("hover", full_width = F) %>%
scroll_box(width = "850px", height = "350px")
cleaned_data_few_records
| age_years | disbursed_amount | asset_cost | ltv | employment_type | perform_cns_score_description | pri_no_of_accts | pri_active_accts | pri_overdue_accts | pri_current_balance | pri_sanctioned_amount | pri_disbursed_amount | sec_no_of_accts | sec_active_accts | sec_overdue_accts | sec_current_balance | sec_sanctioned_amount | sec_disbursed_amount | primary_instal_amt | sec_instal_amt | new_accts_in_last_six_months | delinquent_accts_in_last_six_months | average_acct_age | credit_history_length | no_of_inquiries | loan_default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35 | 50578 | 58400 | 89.55 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 33 | 47145 | 65550 | 73.23 | self_employed | medium_risk | 1 | 1 | 1 | 27600 | 50200 | 50200 | 0 | 0 | 0 | 0 | 0 | 0 | 1991 | 0 | 0 | 1 | 2 | 2 | 0 | 1 |
| 33 | 53278 | 61360 | 89.63 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 25 | 57513 | 66113 | 88.48 | self_employed | high_risk | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 41 | 52378 | 60300 | 88.39 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 28 | 54513 | 61900 | 89.66 | self_employed | very_low_risk | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1347 | 0 | 0 | 0 | 2 | 2 | 0 | 0 |
| 30 | 46349 | 61500 | 76.42 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 29 | 43894 | 61900 | 71.89 | salaried | low_risk | 1 | 1 | 0 | 72879 | 74500 | 74500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 27 | 53713 | 61973 | 89.56 | self_employed | very_low_risk | 1 | 1 | 0 | -41 | 365384 | 365384 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 5 | 1 | 0 |
| 50 | 52603 | 61300 | 86.95 | salaried | very_low_risk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2608 | 0 | 0 | 0 | 2 | 2 | 0 | 0 |
| 40 | 53278 | 61230 | 89.83 | self_employed | high_risk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2270 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 34 | 64769 | 74190 | 89.23 | salaried | very_low_risk | 3 | 2 | 0 | 676 | 36154 | 23374 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 1 | 0 |
| 44 | 53278 | 61330 | 89.68 | salaried | high_risk | 7 | 2 | 1 | 0 | 69900 | 69900 | 0 | 0 | 0 | 0 | 0 | 0 | 3300 | 0 | 0 | 0 | 1 | 3 | 0 | 1 |
| 34 | 49478 | 57010 | 89.46 | salaried | very_low_risk | 10 | 5 | 0 | 79750 | 187000 | 187000 | 0 | 0 | 0 | 0 | 0 | 0 | 23309 | 0 | 1 | 0 | 1 | 2 | 4 | 1 |
| 46 | 49278 | 57080 | 89.35 | self_employed | high_risk | 5 | 5 | 3 | 95597 | 179252 | 179252 | 0 | 0 | 0 | 0 | 0 | 0 | 3514 | 0 | 0 | 0 | 4 | 7 | 0 | 1 |
| 24 | 47549 | 61400 | 79.80 | salaried | low_risk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7900 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 42 | 57713 | 65750 | 89.28 | self_employed | high_risk | 6 | 4 | 2 | 29069 | 1067200 | 1067200 | 0 | 0 | 0 | 0 | 0 | 0 | 47100 | 0 | 1 | 1 | 2 | 6 | 0 | 0 |
| 36 | 53503 | 62100 | 87.28 | self_employed | low_risk | 13 | 8 | 0 | 1076657 | 2277048 | 2277048 | 0 | 0 | 0 | 0 | 0 | 0 | 4982 | 0 | 1 | 0 | 2 | 5 | 0 | 0 |
| 30 | 70017 | 86760 | 82.99 | self_employed | medium_risk | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 0 | 1 |
| 38 | 58259 | 68500 | 86.13 | self_employed | medium_risk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 40 | 58013 | 69650 | 84.71 | self_employed | very_low_risk | 9 | 1 | 0 | 134499 | 32198 | 32198 | 0 | 0 | 0 | 0 | 0 | 0 | 557 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 34 | 31184 | 57110 | 56.91 | salaried | very_low_risk | 7 | 5 | 0 | 1338774 | 2306289 | 2291743 | 0 | 0 | 0 | 0 | 0 | 0 | 11083 | 0 | 0 | 0 | 3 | 6 | 2 | 0 |
| 24 | 43594 | 78256 | 57.50 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 39 | 54513 | 61900 | 89.66 | self_employed | very_low_risk | 1 | 1 | 0 | 6690 | 25200 | 25200 | 0 | 0 | 0 | 0 | 0 | 0 | 1700 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 30 | 73123 | 92900 | 79.66 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 33 | 55213 | 68600 | 83.09 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 50 | 53713 | 61780 | 89.83 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 43 | 57713 | 68040 | 86.27 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 46 | 57513 | 65750 | 88.97 | self_employed | medium_risk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 3 | 1 | 1 |
| 49 | 54513 | 61900 | 89.66 | self_employed | very_low_risk | 5 | 3 | 0 | 101518 | 162800 | 162800 | 0 | 0 | 0 | 0 | 0 | 0 | 8972 | 0 | 1 | 0 | 2 | 5 | 0 | 0 |
| 45 | 45349 | 60300 | 76.29 | self_employed | very_low_risk | 5 | 3 | 0 | 909093 | 1442349 | 1442349 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 0 | 0 |
| 28 | 46549 | 69518 | 69.05 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 52 | 57259 | 70100 | 82.74 | salaried | very_low_risk | 7 | 3 | 0 | 324323 | 604845 | 604845 | 0 | 0 | 0 | 0 | 0 | 0 | 1219 | 0 | 1 | 0 | 2 | 5 | 0 | 0 |
| 22 | 42594 | 60630 | 72.57 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 41 | 50678 | 58300 | 89.88 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 25 | 53078 | 64280 | 84.01 | self_employed | medium_risk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2111 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 32 | 56013 | 63930 | 89.16 | self_employed | low_risk | 9 | 6 | 0 | 3878357 | 4015900 | 4015900 | 0 | 0 | 0 | 0 | 0 | 0 | 126287 | 0 | 4 | 0 | 1 | 4 | 1 | 1 |
| 28 | 58013 | 65838 | 89.61 | self_employed | high_risk | 13 | 3 | 1 | 19121 | 91161 | 91161 | 0 | 0 | 0 | 0 | 0 | 0 | 22427 | 0 | 0 | 2 | 1 | 2 | 4 | 0 |
| 41 | 55759 | 63100 | 89.54 | self_employed | very_low_risk | 4 | 3 | 0 | 18518 | 77480 | 77480 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 4 | 4 | 0 |
| 22 | 50578 | 58400 | 89.55 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 49 | 51303 | 66450 | 78.25 | salaried | very_low_risk | 5 | 2 | 0 | 8000 | 145000 | 145000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 0 | 0 |
| 26 | 40894 | 61230 | 67.78 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 36 | 42894 | 70600 | 61.61 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 28 | 64282 | 74290 | 89.11 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 42 | 57213 | 64750 | 89.88 | salaried | low_risk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 28 | 68082 | 79806 | 87.71 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 26 | 48349 | 67650 | 72.43 | salaried | low_risk | 1 | 1 | 0 | 155000 | 155000 | 155000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 32 | 61013 | 68850 | 89.76 | self_employed | low_risk | 9 | 2 | 0 | 157671 | 214800 | 214800 | 0 | 0 | 0 | 0 | 0 | 0 | 2667 | 0 | 0 | 0 | 0 | 2 | 0 | 1 |
| 45 | 54078 | 70000 | 78.57 | salaried | medium_risk | 4 | 4 | 1 | 2470898 | 2836417 | 2836417 | 0 | 0 | 0 | 0 | 0 | 0 | 29840 | 0 | 0 | 0 | 2 | 5 | 1 | 0 |
| 38 | 57613 | 68950 | 84.99 | self_employed | very_low_risk | 1 | 1 | 0 | 3793 | 49597 | 49597 | 0 | 0 | 0 | 0 | 0 | 0 | 1956 | 0 | 0 | 0 | 3 | 3 | 1 | 1 |
| 33 | 58413 | 66100 | 89.86 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 47 | 58459 | 71200 | 84.27 | salaried | low_risk | 2 | 2 | 0 | 20865 | 93201 | 93201 | 0 | 0 | 0 | 0 | 0 | 0 | 2000 | 0 | 1 | 0 | 2 | 3 | 0 | 0 |
| 29 | 49478 | 57520 | 88.66 | self_employed | high_risk | 2 | 1 | 1 | 1959 | 364800 | 364800 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 5 | 0 | 1 |
| 24 | 55513 | 67950 | 83.15 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 34 | 53599 | 66115 | 84.70 | salaried | low_risk | 10 | 9 | 0 | 3656027 | 3690603 | 3690603 | 0 | 0 | 0 | 0 | 0 | 0 | 28721 | 0 | 6 | 0 | 0 | 1 | 3 | 0 |
| 34 | 53040 | 67067 | 82.73 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 24 | 50475 | 62413 | 84.60 | salaried | very_low_risk | 3 | 1 | 0 | 2412 | 36920 | 36920 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 25 | 49458 | 63000 | 82.54 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 43 | 48693 | 65500 | 77.86 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 54 | 48500 | 59313 | 83.79 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 42 | 48693 | 62577 | 81.50 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 40 | 54273 | 66855 | 86.75 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 35 | 43869 | 62577 | 71.91 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 47 | 50673 | 62577 | 84.70 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 24 | 49458 | 63000 | 82.54 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 34 | 45268 | 62577 | 76.23 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 34 | 50943 | 63896 | 85.29 | salaried | high_risk | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1780 | 0 | 0 | 0 | 1 | 2 | 0 | 0 |
| 28 | 49713 | 68000 | 77.94 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 52 | 46759 | 62577 | 78.30 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 34 | 49458 | 63000 | 82.54 | salaried | very_low_risk | 2 | 2 | 0 | 5428 | 63500 | 63500 | 0 | 0 | 0 | 0 | 0 | 0 | 4248 | 0 | 0 | 0 | 2 | 3 | 0 | 1 |
| 36 | 54343 | 68862 | 82.77 | salaried | low_risk | 9 | 1 | 0 | 79569 | 80000 | 80000 | 0 | 0 | 0 | 0 | 0 | 0 | 5154 | 0 | 1 | 0 | 1 | 2 | 2 | 0 |
| 24 | 42874 | 63840 | 68.92 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 52 | 54273 | 71840 | 80.73 | salaried | low_risk | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 13 | 0 | 0 |
| 39 | 48468 | 65500 | 77.86 | salaried | low_risk | 1 | 1 | 0 | 58558 | 48220 | 48220 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 26 | 50046 | 70516 | 72.75 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 30 | 47773 | 63306 | 80.56 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 46 | 46548 | 63306 | 78.82 | salaried | very_low_risk | 1 | 1 | 0 | 0 | 12000 | 12000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 33 | 54131 | 69936 | 82.36 | salaried | very_low_risk | 11 | 2 | 0 | 45639 | 75000 | 75000 | 0 | 0 | 0 | 0 | 0 | 0 | 22267 | 0 | 1 | 0 | 1 | 2 | 5 | 1 |
| 36 | 50743 | 66115 | 78.65 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 33 | 48258 | 63896 | 80.60 | salaried | low_risk | 1 | 1 | 0 | 51500 | 51500 | 51500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 49 | 47773 | 63306 | 80.56 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 48 | 44779 | 62852 | 74.78 | salaried | medium_risk | 5 | 2 | 1 | 34922 | 60000 | 60000 | 0 | 0 | 0 | 0 | 0 | 0 | 11759 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
| 26 | 45814 | 66115 | 71.09 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 41 | 48468 | 60410 | 84.42 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 49 | 44819 | 66487 | 69.19 | salaried | very_low_risk | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 4 | 0 | 0 |
| 22 | 50673 | 63840 | 83.02 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 41 | 37939 | 64500 | 62.02 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 20 | 52428 | 67405 | 81.60 | not_reported | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 20 | 51653 | 63896 | 86.08 | not_reported | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 36 | 52818 | 63896 | 88.42 | salaried | medium_risk | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 24 | 51428 | 64840 | 84.82 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 |
| 20 | 49488 | 63306 | 83.72 | not_reported | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 40 | 51663 | 68000 | 79.41 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 26 | 49458 | 62852 | 82.73 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 43 | 52508 | 66246 | 81.51 | salaried | high_risk | 36 | 11 | 2 | 327845 | 489490 | 489490 | 0 | 0 | 0 | 0 | 0 | 0 | 40747 | 0 | 3 | 1 | 1 | 7 | 0 | 0 |
| 36 | 55333 | 73805 | 78.59 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 41 | 38939 | 59313 | 67.44 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 51 | 49713 | 63896 | 82.95 | self_employed | medium_risk | 8 | 2 | 0 | 2131369 | 3254115 | 3254115 | 0 | 0 | 0 | 0 | 0 | 0 | 2884 | 0 | 0 | 0 | 2 | 4 | 0 | 0 |
| 28 | 49458 | 63000 | 82.54 | salaried | low_risk | 1 | 1 | 0 | 282809 | 292906 | 292906 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 21 | 40884 | 59313 | 70.81 | not_reported | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 42 | 50295 | 67528 | 77.92 | salaried | very_low_risk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1328 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 29 | 44575 | 59540 | 78.94 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 25 | 49973 | 63306 | 84.51 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 38 | 45769 | 66365 | 72.33 | salaried | very_low_risk | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 4 | 0 | 0 |
| 41 | 51653 | 63306 | 86.88 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 30 | 46809 | 59313 | 80.93 | salaried | very_low_risk | 1 | 1 | 0 | 3391 | 25000 | 25000 | 0 | 0 | 0 | 0 | 0 | 0 | 1350 | 0 | 0 | 0 | 2 | 2 | 0 | 1 |
| 40 | 41670 | 59313 | 74.18 | salaried | low_risk | 17 | 3 | 0 | 1002630 | 1028000 | 1028000 | 0 | 0 | 0 | 0 | 0 | 0 | 23951 | 0 | 1 | 0 | 1 | 2 | 0 | 1 |
| 43 | 48693 | 62577 | 81.50 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 36 | 51428 | 63306 | 86.88 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 29 | 54273 | 69067 | 83.98 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 41 | 51428 | 63306 | 86.88 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 52 | 45769 | 59313 | 80.93 | self_employed | very_low_risk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2460 | 0 | 0 | 0 | 2 | 2 | 0 | 0 |
| 30 | 42969 | 63840 | 72.06 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 25 | 48258 | 63896 | 80.60 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 28 | 47650 | 62577 | 79.74 | salaried | very_low_risk | 2 | 1 | 0 | 1352 | 15000 | 15000 | 0 | 0 | 0 | 0 | 0 | 0 | 7460 | 0 | 0 | 0 | 1 | 2 | 1 | 0 |
| 30 | 42690 | 63000 | 69.84 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 50 | 41854 | 59300 | 74.20 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 47 | 46759 | 62577 | 78.30 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 38 | 48433 | 63896 | 80.15 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 38 | 46555 | 59313 | 82.61 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 23 | 48699 | 59313 | 84.13 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 31 | 49250 | 65269 | 77.37 | salaried | low_risk | 1 | 1 | 0 | 27537 | 28196 | 28196 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 32 | 34959 | 60410 | 59.59 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 34 | 43090 | 64800 | 70.22 | salaried | low_risk | 1 | 1 | 0 | 45500 | 45500 | 45500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 42 | 28084 | 63720 | 45.51 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 21 | 49683 | 62577 | 83.10 | not_reported | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 47 | 50673 | 65269 | 81.20 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 36 | 38164 | 63896 | 64.17 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 57 | 50673 | 62577 | 84.70 | salaried | very_low_risk | 3 | 2 | 0 | 2681 | 15749 | 15749 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 |
| 42 | 50458 | 63896 | 84.51 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 36 | 49458 | 62577 | 83.10 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 23 | 42874 | 63800 | 68.97 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 31 | 49683 | 62577 | 83.10 | salaried | very_low_risk | 2 | 1 | 0 | 21323 | 26000 | 26000 | 0 | 0 | 0 | 0 | 0 | 0 | 2856 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| 27 | 50743 | 63149 | 82.34 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 38 | 60213 | 84398 | 73.46 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 29 | 54303 | 69542 | 79.09 | salaried | high_risk | 5 | 4 | 1 | 312718 | 677000 | 677000 | 0 | 0 | 0 | 0 | 0 | 0 | 12979 | 0 | 0 | 2 | 2 | 4 | 3 | 1 |
| 32 | 49803 | 65368 | 77.25 | salaried | very_low_risk | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4164 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 46 | 51403 | 65687 | 79.32 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 30 | 60971 | 72657 | 85.00 | salaried | low_risk | 2 | 1 | 0 | 10000 | 42500 | 42500 | 0 | 0 | 0 | 0 | 0 | 0 | 2154 | 0 | 0 | 0 | 2 | 3 | 0 | 1 |
| 50 | 45349 | 65368 | 70.37 | salaried | very_low_risk | 2 | 1 | 0 | 33612 | 40000 | 40000 | 0 | 0 | 0 | 0 | 0 | 0 | 3740 | 0 | 0 | 0 | 1 | 2 | 0 | 0 |
| 27 | 36439 | 61865 | 59.81 | self_employed | very_low_risk | 3 | 1 | 0 | 13785 | 56173 | 56173 | 0 | 0 | 0 | 0 | 0 | 0 | 4020 | 0 | 0 | 0 | 2 | 5 | 0 | 0 |
| 33 | 51078 | 65368 | 79.55 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 28 | 74951 | 102945 | 74.02 | self_employed | very_low_risk | 2 | 1 | 0 | 21480 | 201381 | 201381 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 3 | 0 | 0 |
| 34 | 58259 | 66068 | 89.30 | self_employed | very_low_risk | 7 | 4 | 0 | 196020 | 259363 | 259363 | 0 | 0 | 0 | 0 | 0 | 0 | 26627 | 0 | 2 | 0 | 1 | 4 | 0 | 0 |
| 40 | 52303 | 66310 | 79.93 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 45 | 46349 | 65687 | 71.55 | salaried | low_risk | 1 | 1 | 0 | 5470 | 5470 | 5470 | 0 | 0 | 0 | 0 | 0 | 0 | 954 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 46 | 51303 | 65060 | 79.93 | salaried | very_low_risk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1250 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 33 | 51078 | 65310 | 79.62 | salaried | very_low_risk | 5 | 4 | 0 | 46481 | 99090 | 99090 | 0 | 0 | 0 | 0 | 0 | 0 | 7210 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 36 | 58259 | 65687 | 89.82 | salaried | very_low_risk | 8 | 4 | 0 | 3259073 | 3610215 | 3587762 | 0 | 0 | 0 | 0 | 0 | 0 | 30113 | 0 | 2 | 0 | 1 | 7 | 3 | 0 |
| 34 | 52003 | 68695 | 76.72 | salaried | low_risk | 1 | 1 | 0 | 8455 | 14500 | 14500 | 0 | 0 | 0 | 0 | 0 | 0 | 1209 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 40 | 49349 | 65368 | 76.49 | self_employed | very_low_risk | 6 | 4 | 0 | 13438 | 48579 | 48579 | 0 | 0 | 0 | 0 | 0 | 0 | 2785 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 22 | 55567 | 66252 | 84.99 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 44 | 40094 | 61865 | 65.79 | self_employed | low_risk | 9 | 6 | 3 | 20196 | 351003 | 285648 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 3 | 11 | 0 | 0 |
| 42 | 55259 | 65937 | 84.93 | salaried | very_low_risk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1475 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 22 | 40394 | 65368 | 62.72 | salaried | very_low_risk | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 1 | 1171994 | 1690000 | 1690000 | 1090 | 9382 | 0 | 0 | 3 | 5 | 0 | 0 |
| 37 | 52303 | 62365 | 84.98 | self_employed | high_risk | 5 | 4 | 1 | 1147365 | 1163250 | 1163250 | 0 | 0 | 0 | 0 | 0 | 0 | 13050 | 0 | 2 | 0 | 1 | 2 | 1 | 0 |
| 25 | 40094 | 65368 | 62.26 | salaried | low_risk | 4 | 4 | 0 | 42063 | 66950 | 66950 | 0 | 0 | 0 | 0 | 0 | 0 | 4037 | 0 | 3 | 0 | 0 | 1 | 0 | 0 |
| 25 | 50303 | 67099 | 76.01 | self_employed | low_risk | 3 | 3 | 1 | 7960 | 38950 | 7960 | 0 | 0 | 0 | 0 | 0 | 0 | 1464 | 0 | 2 | 1 | 1 | 2 | 1 | 1 |
| 42 | 53578 | 68870 | 79.13 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 25 | 52303 | 68695 | 77.15 | self_employed | low_risk | 1 | 1 | 0 | 0 | 15000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 35 | 58259 | 65687 | 89.82 | self_employed | very_low_risk | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2597 | 0 | 0 | 0 | 2 | 2 | 0 | 0 |
| 48 | 54305 | 64760 | 85.00 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 23 | 48835 | 61865 | 79.99 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 33 | 56059 | 63349 | 89.66 | salaried | low_risk | 2 | 1 | 0 | 0 | 4968 | 4968 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 3 | 0 | 0 |
| 30 | 50303 | 64651 | 78.89 | salaried | high_risk | 3 | 2 | 1 | 30206 | 99100 | 105564 | 1 | 1 | 0 | 0 | 40000 | 361 | 2315 | 0 | 1 | 1 | 1 | 3 | 0 | 0 |
| 25 | 42394 | 77968 | 55.15 | self_employed | high_risk | 5 | 1 | 1 | 22075 | 45000 | 34589 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 39 | 57759 | 65368 | 89.49 | self_employed | medium_risk | 5 | 5 | 1 | 476937 | 644797 | 644797 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 2 | 0 | 0 |
| 53 | 54078 | 65197 | 84.36 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 35 | 41094 | 57782 | 72.17 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 61 | 35939 | 61865 | 59.00 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 31 | 54003 | 64389 | 84.95 | self_employed | low_risk | 5 | 3 | 0 | 192520 | 555613 | 507644 | 0 | 0 | 0 | 0 | 0 | 0 | 4671 | 0 | 0 | 0 | 3 | 3 | 0 | 0 |
| 27 | 27229 | 61865 | 44.77 | salaried | low_risk | 4 | 3 | 0 | 4340 | 31809 | 31809 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 44 | 51078 | 65203 | 79.75 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 40 | 49349 | 65690 | 76.12 | self_employed | very_low_risk | 3 | 2 | 0 | 974963 | 1260000 | 1260000 | 0 | 0 | 0 | 0 | 0 | 0 | 19740 | 0 | 0 | 0 | 7 | 11 | 0 | 0 |
| 40 | 51303 | 65249 | 79.69 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 42 | 44749 | 61865 | 73.39 | self_employed | very_low_risk | 10 | 2 | 0 | 1420891 | 1776733 | 1776733 | 0 | 0 | 0 | 0 | 0 | 0 | 31311 | 0 | 0 | 0 | 2 | 4 | 0 | 0 |
| 33 | 47049 | 65215 | 73.14 | salaried | very_low_risk | 2 | 2 | 0 | 5420 | 35271 | 35271 | 0 | 0 | 0 | 0 | 0 | 0 | 3847 | 0 | 1 | 0 | 1 | 1 | 4 | 0 |
| 26 | 43394 | 66068 | 66.60 | self_employed | very_low_risk | 10 | 5 | 0 | 145434 | 180522 | 180522 | 0 | 0 | 0 | 0 | 0 | 0 | 19485 | 0 | 4 | 0 | 0 | 1 | 1 | 0 |
| 29 | 53803 | 68245 | 79.86 | self_employed | very_low_risk | 1 | 1 | 0 | 1200 | 13900 | 13900 | 0 | 0 | 0 | 0 | 0 | 0 | 2317 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 33 | 38439 | 65215 | 59.80 | salaried | very_low_risk | 4 | 3 | 0 | 37855 | 90026 | 90026 | 0 | 0 | 0 | 0 | 0 | 0 | 5721 | 0 | 1 | 0 | 2 | 5 | 0 | 0 |
| 53 | 48349 | 65368 | 74.96 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 41 | 54759 | 65368 | 84.90 | self_employed | very_low_risk | 10 | 6 | 0 | 741288 | 1299454 | 1102429 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 13 | 0 | 0 |
| 42 | 73723 | 99500 | 74.97 | self_employed | very_low_risk | 2 | 1 | 0 | 820301 | 900000 | 900000 | 0 | 0 | 0 | 0 | 0 | 0 | 3000 | 0 | 0 | 0 | 1 | 2 | 0 | 0 |
| 34 | 40394 | 69498 | 58.99 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 41 | 56013 | 65687 | 86.78 | self_employed | very_low_risk | 13 | 2 | 0 | 881713 | 1115000 | 1115000 | 0 | 0 | 0 | 0 | 0 | 0 | 3722 | 0 | 0 | 0 | 1 | 4 | 0 | 0 |
| 24 | 46349 | 62465 | 75.24 | self_employed | very_low_risk | 1 | 1 | 0 | 24900 | 50000 | 50000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 51 | 56959 | 68695 | 83.99 | self_employed | very_low_risk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 45 | 58259 | 66068 | 89.30 | self_employed | very_low_risk | 3 | 1 | 0 | 7493 | 14990 | 14990 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 31 | 53878 | 68601 | 79.88 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 28 | 42394 | 61917 | 69.45 | self_employed | medium_risk | 12 | 7 | 1 | 1074066 | 1353681 | 1341560 | 0 | 0 | 0 | 0 | 0 | 0 | 1565 | 0 | 4 | 1 | 1 | 2 | 0 | 1 |
| 31 | 51303 | 65251 | 79.69 | salaried | very_low_risk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2000 | 0 | 0 | 0 | 2 | 2 | 0 | 0 |
| 55 | 74122 | 103060 | 72.77 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 46 | 54576 | 65368 | 85.00 | salaried | low_risk | 3 | 3 | 0 | 8710 | 27154 | 27154 | 0 | 0 | 0 | 0 | 0 | 0 | 1910 | 0 | 1 | 0 | 1 | 1 | 1 | 0 |
| 59 | 52303 | 66552 | 79.64 | self_employed | very_low_risk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 0 | 0 |
| 36 | 49049 | 64217 | 77.39 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 39 | 46349 | 65687 | 71.55 | self_employed | very_low_risk | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4592 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 38 | 51003 | 65687 | 78.71 | self_employed | very_low_risk | 2 | 2 | 0 | 13163 | 31251 | 31251 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| 51 | 53078 | 65687 | 82.21 | salaried | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 25 | 51303 | 67714 | 76.79 | salaried | very_low_risk | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5556 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 26 | 35939 | 61865 | 59.00 | self_employed | low_risk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
## [1] 233154 26
There are 26 columns (25 predictor variables and 1 response variable)
and 233,154 observations.
## Rows: 233,154
## Columns: 26
## $ age_years <dbl> 35, 33, 33, 25, 41, 28, 30, 29, 27…
## $ disbursed_amount <dbl> 50578, 47145, 53278, 57513, 52378,…
## $ asset_cost <dbl> 58400, 65550, 61360, 66113, 60300,…
## $ ltv <dbl> 89.55, 73.23, 89.63, 88.48, 88.39,…
## $ employment_type <chr> "salaried", "self_employed", "self…
## $ perform_cns_score_description <chr> "low_risk", "medium_risk", "low_ri…
## $ pri_no_of_accts <dbl> 0, 1, 0, 3, 0, 2, 0, 1, 1, 1, 1, 3…
## $ pri_active_accts <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 2…
## $ pri_overdue_accts <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pri_current_balance <dbl> 0, 27600, 0, 0, 0, 0, 0, 72879, -4…
## $ pri_sanctioned_amount <dbl> 0, 50200, 0, 0, 0, 0, 0, 74500, 36…
## $ pri_disbursed_amount <dbl> 0, 50200, 0, 0, 0, 0, 0, 74500, 36…
## $ sec_no_of_accts <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ sec_active_accts <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ sec_overdue_accts <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ sec_current_balance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ sec_sanctioned_amount <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ sec_disbursed_amount <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ primary_instal_amt <dbl> 0, 1991, 0, 31, 0, 1347, 0, 0, 0, …
## $ sec_instal_amt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ new_accts_in_last_six_months <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ delinquent_accts_in_last_six_months <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ average_acct_age <dbl> 0, 2, 0, 1, 0, 2, 0, 0, 5, 2, 1, 2…
## $ credit_history_length <dbl> 0, 2, 0, 1, 0, 2, 0, 0, 5, 2, 1, 2…
## $ no_of_inquiries <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1…
## $ loan_default <dbl> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0…
# use missmap function from the Amelia package to check for NA values
missmap(vehicle_loan_default_train_cleaned,
plot.background = element_rect(fill = "antiquewhite"),
main = "Vehicle Loan Default - Missing Values",
x.cex = 0.45,
y.cex = 0.6,
margins = c(7.1, 7.1),
col = c("yellow", "black"), legend = FALSE)
From the plot of missing values, we can see that there are no missing values in the data, so no NA imputation or handling of nulls is necessary. A quick numeric cross-check is sketched below.
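As a simple cross-check of the missmap plot, the missing values can also be counted per column in base R (a minimal sketch using the cleaned data frame defined above):
# Count NA values in each column; every count should be zero
colSums(is.na(vehicle_loan_default_train_cleaned))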
loan_default_grouped <- vehicle_loan_default_train_cleaned %>% group_by(loan_default) %>%
summarise(count = n()) %>%
mutate(percentage = round((count / sum(count) * 100), 2))
loan_default_grouped_displayed <- kable(head(loan_default_grouped, 200), "html") %>%
kable_paper("hover", full_width = F)
loan_default_grouped_displayed
| loan_default | count | percentage |
|---|---|---|
| 0 | 182543 | 78.29 |
| 1 | 50611 | 21.71 |
p_bar_loan_default_category <- loan_default_grouped %>% ggplot(aes(x=factor(loan_default),
y = percentage, fill = factor(loan_default))) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Loan Default Distribution", x = "Loan Default", y = "Percentage",
fill = "Loan Default") +
scale_y_continuous(labels = scales::percent_format(scale = 1)) + # Format y-axis as percentages
theme_minimal() + theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"))
p_bar_loan_default_category
From the values and plot above for the loan default categories, we can see that there are more observations in category 0 (no default) than in category 1 (default). Before modeling, the data will be balanced by oversampling the default category (combined with undersampling the majority class), as sketched below.
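A minimal sketch of that balancing step, assuming the same ROSE::ovun.sample call that the pre-processing function later in this report wraps (the target size of 1.5 times the original data is illustrative):
# Combined over- and under-sampling with the ROSE package
balanced <- ovun.sample(loan_default ~ .,
                        data = as.data.frame(vehicle_loan_default_train_cleaned),
                        method = "both", p = 0.5,
                        N = round(1.5 * nrow(vehicle_loan_default_train_cleaned)),
                        seed = 1994)$data
# Class counts should now be roughly balanced
table(balanced$loan_default)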
# Age distribution - All categories
age_dist_all <- vehicle_loan_default_train_cleaned %>% ggplot(aes(x = age_years)) +
geom_histogram(binwidth = 2, fill = "blue", color = "white", alpha = 0.7) +
labs(title = "Age Distribution - All Categories", x = "Age", y = "Frequency") + theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"))
# Age distribution - Loan Default
age_dist_loan_default <- vehicle_loan_default_train_cleaned %>% filter(loan_default == 1) %>% ggplot(aes(x = age_years)) +
geom_histogram(binwidth = 2, fill = "blue", color = "white", alpha = 0.7) +
labs(title = "Age Distribution - Loan Default", x = "Age", y = "Frequency") + theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"))
# Age distribution - No Default
age_dist_no_default <- vehicle_loan_default_train_cleaned %>% filter(loan_default == 0) %>% ggplot(aes(x = age_years)) +
geom_histogram(binwidth = 2, fill = "blue", color = "white", alpha = 0.7) +
labs(title = "Age Distribution - Loan Default", x = "Age", y = "Frequency") + theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"))
# Plot all three plots
plot_age_dist <- plot_grid(age_dist_loan_default, age_dist_no_default, age_dist_all, byrow = TRUE, nrow = 3)
plot_age_dist
We can see that the age distribution is roughly the same for the default and no-default groups. Most applicants are between 20 and 40 years old, with fewer between 40 and 60 and almost none above 70.
The perform_cns_score_description variable categorizes the risk level of the applicant based on their credit score.
# All Categories
cns_score_desc_all <- vehicle_loan_default_train_cleaned %>% group_by(perform_cns_score_description) %>%
summarise(count = n()) %>%
mutate(percentage = round((count / sum(count) * 100), 2)) %>%
ggplot(aes(x = perform_cns_score_description, y = percentage,
fill = factor(perform_cns_score_description))) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Risk Level Distribution - All Categories", x = "Risk Level", y = "Percentage",
fill = "Risk Level") +
scale_y_continuous(labels = scales::percent_format(scale = 1)) + # Format y-axis as percentages
theme_minimal() + theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"))
cns_score_desc_all
A significantly high proportion of applicants fall under the low_risk category, and over 85% fall under either very_low_risk or low_risk. This may be because applicants prefer to work on their credit to improve their risk level before applying for vehicle loans, especially as low-risk applicants tend to get better interest rates on their loans.
loan_default_risk_level <- vehicle_loan_default_train_cleaned %>%
count(loan_default, perform_cns_score_description, name = "Record_Count") %>%
ggplot(aes(x=loan_default, y = Record_Count, fill = perform_cns_score_description)) +
geom_bar(stat = "identity", position = "stack") +
labs(title = "Loan Default Distribution by Risk Level", x = "Loan Default", y = "Percentage",
fill = "Risk Level") + theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"))
loan_default_risk_level
It is clear that a large portion of the non-defaulters are categorized as low_risk or very_low_risk. Among the defaulters, low_risk applicants still constitute a large portion; the default rate within each risk level is quantified below.
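To quantify that observation, the default rate within each risk level can be tabulated directly (a small sketch using the cleaned training data):
# Default rate (%) by credit-risk category
vehicle_loan_default_train_cleaned %>%
  group_by(perform_cns_score_description) %>%
  summarise(applicants = n(),
            default_rate = round(mean(loan_default) * 100, 2)) %>%
  arrange(desc(default_rate))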
employment_type <- vehicle_loan_default_train_cleaned %>%
count(loan_default, employment_type, name = "Record_Count") %>%
ggplot(aes(x=loan_default, y = Record_Count, fill = employment_type)) +
geom_bar(stat = "identity", position = "stack") +
labs(title = "Loan Default Distribution by Employment Type", x = "Loan Default", y = "Percentage",
fill = "Employment Type") + theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"))
employment_type
Salaried and self-employed applicants make up most of the data, in almost equal proportions for both default categories, although the self-employed are slightly more numerous.
vehicle_loan_default_train_cleaned_numeric <- vehicle_loan_default_train_cleaned %>%
select(-employment_type, -perform_cns_score_description)
corr_matrix <- cor(vehicle_loan_default_train_cleaned_numeric)
correlation_plot <- ggcorrplot(corr_matrix,
lab = TRUE, # Show axis labels
lab_size = 2, # Adjust the size of axis labels
hc.order = TRUE, # Reorder the correlation matrix
type = "lower",
outline.col = "white",
colors = c("blue", "white", "red"),
ggtheme = ggplot2::theme_minimal(),
title = "Correlation Plot") + coord_fixed(ratio = 0.9) +
theme(axis.text.x = element_text(size = 9, angle = 45, hjust = 1),
axis.text.y = element_text(size = 9, hjust = 1),
plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "gray80"),
plot.background = element_rect(fill = "antiquewhite"),
axis.title = element_text(size = 10)) + labs(x = NULL, y = NULL)
Since the data is imbalanced, we pre-process it to obtain a balanced dataset and also standardize the numeric variables using the standard normal (z-score) transformation.
data_preprocess_scaling <- function(df){
# This helper function standardizes the numeric variables of the df using the standard normal method
df <- as.data.frame(df)
df_char <- df %>% select(loan_default, employment_type, perform_cns_score_description)
df_numeric <- df %>% select(-loan_default, -employment_type, -perform_cns_score_description)
df_numeric_scaled <- df_numeric %>% mutate_all( ~ (scale(.) %>% as.vector))
df_scaled_combined <- cbind(df_char, df_numeric_scaled)
return(df_scaled_combined)
}
xgb_nnet_preprocess <- function(df, mode){
# convert the categorical variable to dummies
df2 <- dummy_cols(df, select_columns = c("employment_type","perform_cns_score_description"),
remove_selected_columns = TRUE) %>% as.data.frame()
# prepare the xgb.DMatrix to use in the xgboost training
df2_train <- as.data.frame(df2[, -1])
df2_label <- as.data.frame(df2[, 1])
df_dmatrix <- xgb.DMatrix(as.matrix(sapply(df2_train, as.numeric)), label=as.matrix(df2_label))
if(mode == "xgboost"){
return(df_dmatrix)
} else if(mode == "nnet"){
return(df2)
} else{
print("Mode not supported.")
}
}
data_preprocessing <- function(df, mode = "train"){
# This function pre-processes the cleaned data and get the data ready for training.
tryCatch({
if(is_tibble(df) | is.data.frame(df)){
if(mode == "test"){
df_scaled <- data_preprocess_scaling(df)
print("Data Pre-processing complete")
return(df_scaled)
} else if(mode == "train"){
curr_frame <<- sys.nframe() # sends the current frame to the global environment.
# The ovun.sample function in the ROSE package assumes the data to be in the global env so you have to tell it
# which frame (scope) to find the data else this will fail if executed inside a function.
df_ovun <- ovun.sample(formula = formula(loan_default ~ .), data = get("df", sys.frame(curr_frame)),
N = 1.5 * nrow(df), seed = 1994, method = "both")$data %>% as.data.frame() %>% as_tibble() # N is 1.5x the rows of the input df
print("Oversampling and undersampling completed")
df_scaled <- data_preprocess_scaling(df_ovun)
print("Data Pre-processing complete!")
return(df_scaled)
} else {
print("You did not enter a valid mode type: Enter train or test for mode")
}
}
else{
print("The dataframe is not a tibble. Kindly have your data in the form of a dataframe or a tibble")
}
},
#if an error occurs, tell me the error
error=function(e) {
message('An Error Occurred')
print(e)
},
#or if a warning occurs, tell me the warning
warning=function(w) {
message('A Warning Occurred')
print(w)
return(NA)
}
)
}
Use the caTools library to split the cleaned dataset into training and testing datasets in a 70:30 ratio.
# Set a seed
set.seed(1994)
#Split the sample
sampling <- sample.split(vehicle_loan_default_train_cleaned$loan_default, SplitRatio = 0.7)
# Training Data
df_train_subset <- subset(vehicle_loan_default_train_cleaned, sampling == TRUE)
# Testing Data
df_test_subset <- subset(vehicle_loan_default_train_cleaned, sampling == FALSE)
Pre-process the train dataset
## [1] "Oversampling and undersampling completed"
## [1] "Data Pre-processing complete!"
Pre-process the test dataset
## [1] "Data Pre-processing complete"
The function model_training trains a machine learning
model according to the mode selected (logistic, rf, xgboost, or
nnet).
model_training <- function(df, mode = "logistic"){
if(mode == "logistic"){
print("Training a Logistic Regression Model...")
logistic_model <- glm(formula = loan_default ~ . ,
family = binomial(link = 'logit'), data = df)
print("Logistic Regression Model complete")
return(logistic_model)
} else if(mode == "rf"){
print("Training a Random Forest Classification Model...")
rf_model_ranger <- ranger(
formula = loan_default ~ .,
data = df,
num.trees = 500,
mtry = floor(length(df) / 3),
probability = TRUE,
verbose = FALSE,
classification = TRUE
)
print("Random Forest Classification Model complete")
return(rf_model_ranger)
} else if(mode == "xgboost"){
print("Training an XGBoost Classification Model...")
# pre-process the data to obtain the dmatrix
df_dmatrix <- xgb_nnet_preprocess(df, mode)
xgb_model <- xgboost(data = df_dmatrix, nthread = 4, nrounds = 150,
max.depth = 10, eta = 0.1, objective = "binary:logistic", verbose = FALSE)
print("XGBoost Training complete")
return(xgb_model)
} else if(mode == "nnet"){
print("Training a Neural Network Classification Model...")
# pre-process the model to convert character variables to dummies
df_nnet <- xgb_nnet_preprocess(df, mode)
nnet_model <- nnet(loan_default ~ ., data = df_nnet, decay = 5e-4,
size = 20, maxit = 100, trace = F, set.seed(1994))
print("NNET Training complete")
return(nnet_model)
} else {
print("You did not enter a valid mode type: Enter logistic, rf, xgboost or svm")
}
}
The function model_prediction generates loan_default predictions for new data for each of the model types; predicted probabilities are converted to classes using a 0.5 threshold.
model_prediction = function(df, trained_model, model_type){
# remove the response variable from the dataframe if it exists
if("loan_default" %in% colnames(df)){
test_data = df %>% select(-loan_default)
} else {
test_data = df
}
# make predictions
if(model_type == "rf"){
predictions = predict(trained_model, data = test_data)$predictions[,1]
} else if (model_type == "logistic"){
predictions = predict(trained_model, newdata = test_data, type = "response")
} else if (model_type == "xgboost"){
test_data_xgb = xgb_nnet_preprocess(df, model_type)
predictions = predict(trained_model, newdata = test_data_xgb)
} else{
test_data_nnet = xgb_nnet_preprocess(test_data, model_type)
predictions = predict(trained_model, newdata = test_data_nnet)
}
# convert probabilities to classes
predicted = ifelse(predictions > 0.5, 1, 0)
return(predicted)
}
The function model_metrics computes evaluation metrics such as accuracy, precision, recall, and AUC, and prints a confusion matrix for the model.
model_metrics = function(actual, predicted){
# accuracy
accuracy_model = round((Metrics::accuracy(actual, predicted)), 4)
# precision
precision_model = round((Metrics::precision(actual, predicted)), 4)
# recall
recall_model = round((Metrics::recall(actual, predicted)), 4)
# auc
auc_model = round((pROC::auc(actual, predicted)), 4)
# model metrics
model_eval_metrics = c(accuracy_model, precision_model, recall_model, auc_model) %>% t()
column_names = c("Accuracy", "Precision", "Recall", "AUC")
evaluation_metrics = data.frame(values = model_eval_metrics)
colnames(evaluation_metrics) = column_names
# confusion Matrix
confusion_table = table(predicted, actual)
confusion_matrix = caret::confusionMatrix(confusion_table)
print("**********************************************************************")
print(confusion_matrix)
print("**********************************************************************")
return(evaluation_metrics)
}
Train the logistic model
## [1] "Training a Logistic Regression Model..."
## [1] "Logistic Regression Model complete"
Evaluate the logistic Model
logistic_actual = df_test$loan_default
logistic_predicted = model_prediction(df_test, logistic_model, "logistic")
logistic_model_metrics = model_metrics(logistic_actual, logistic_predicted)
## [1] "**********************************************************************"
## Confusion Matrix and Statistics
##
## actual
## predicted 0 1
## 0 27660 4961
## 1 27103 10222
##
## Accuracy : 0.5416
## 95% CI : (0.5379, 0.5453)
## No Information Rate : 0.7829
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1168
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5051
## Specificity : 0.6733
## Pos Pred Value : 0.8479
## Neg Pred Value : 0.2739
## Prevalence : 0.7829
## Detection Rate : 0.3954
## Detection Prevalence : 0.4664
## Balanced Accuracy : 0.5892
##
## 'Positive' Class : 0
##
## [1] "**********************************************************************"
Display Model Metrics - Logistic Regression Model
## Accuracy Precision Recall AUC
## Logistic Model 0.5416 0.2739 0.6733 0.5892
Train a random forest classification model for the data
## [1] "Training a Random Forest Classification Model..."
## [1] "Random Forest Classification Model complete"
Evaluate the Random Forest Model
rf_actual = df_test$loan_default
rf_predicted = model_prediction(df_test, rf_model, "rf")
rf_model_metrics = model_metrics(rf_actual, rf_predicted)
## [1] "**********************************************************************"
## Confusion Matrix and Statistics
##
## actual
## predicted 0 1
## 0 8012 3208
## 1 46751 11975
##
## Accuracy : 0.2857
## 95% CI : (0.2824, 0.2891)
## No Information Rate : 0.7829
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.0319
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.1463
## Specificity : 0.7887
## Pos Pred Value : 0.7141
## Neg Pred Value : 0.2039
## Prevalence : 0.7829
## Detection Rate : 0.1145
## Detection Prevalence : 0.1604
## Balanced Accuracy : 0.4675
##
## 'Positive' Class : 0
##
## [1] "**********************************************************************"
Display Model Metrics - Random Forest Classification Model
## Accuracy Precision Recall AUC
## Random Forest Model 0.2857 0.2039 0.7887 0.4675
Train an XGBoost classification model for the data
## [1] "Training an XGBoost Classification Model..."
## [1] "XGBoost Training complete"
Evaluate the XGBoost Model
xgb_actual = df_test$loan_default
xgb_predicted = model_prediction(df_test, xgb_model, "xgboost")
xgb_model_metrics = model_metrics(xgb_actual, xgb_predicted)
## [1] "**********************************************************************"
## Confusion Matrix and Statistics
##
## actual
## predicted 0 1
## 0 31609 6375
## 1 23154 8808
##
## Accuracy : 0.5778
## 95% CI : (0.5742, 0.5815)
## No Information Rate : 0.7829
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1124
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5772
## Specificity : 0.5801
## Pos Pred Value : 0.8322
## Neg Pred Value : 0.2756
## Prevalence : 0.7829
## Detection Rate : 0.4519
## Detection Prevalence : 0.5430
## Balanced Accuracy : 0.5787
##
## 'Positive' Class : 0
##
## [1] "**********************************************************************"
Display Model Metrics - XGBoost classification Model
## Accuracy Precision Recall AUC
## XGBoost 0.5778 0.2756 0.5801 0.5787
Train a Neural Network model for the data
## [1] "Training a Neural Network Classification Model..."
## [1] "NNET Training complete"
Evaluate the NNET Model
nnet_actual = df_test$loan_default
nnet_predicted = model_prediction(df_test, nnet_model, "nnet")
nnet_model_metrics = model_metrics(nnet_actual, nnet_predicted)
## [1] "**********************************************************************"
## Confusion Matrix and Statistics
##
## actual
## predicted 0 1
## 0 22452 3582
## 1 32311 11601
##
## Accuracy : 0.4868
## 95% CI : (0.4831, 0.4906)
## No Information Rate : 0.7829
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1034
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.4100
## Specificity : 0.7641
## Pos Pred Value : 0.8624
## Neg Pred Value : 0.2642
## Prevalence : 0.7829
## Detection Rate : 0.3210
## Detection Prevalence : 0.3722
## Balanced Accuracy : 0.5870
##
## 'Positive' Class : 0
##
## [1] "**********************************************************************"
Display Model Metrics - Neural Network Model
## Accuracy Precision Recall AUC
## Neural Network 0.4868 0.2642 0.7641 0.587
The table below compares the results of the four (4) models trained:
results = rbind(logistic_model_metrics, rf_model_metrics, xgb_model_metrics, nnet_model_metrics)
kable(results, "html") %>%
kable_paper("hover", full_width = F) %>%
scroll_box(width = "500px", height = "200px")
| | Accuracy | Precision | Recall | AUC |
|---|---|---|---|---|
| Logistic Model | 0.5416 | 0.2739 | 0.6733 | 0.5892 |
| Random Forest Model | 0.2857 | 0.2039 | 0.7887 | 0.4675 |
| XGBoost | 0.5778 | 0.2756 | 0.5801 | 0.5787 |
| Neural Network | 0.4868 | 0.2642 | 0.7641 | 0.5870 |
As we can see from the results, the models perform differently across the different metrics (Accuracy, Precision, Recall, AUC). It is important to note that the metric used to select the final model depends on the goal of the analysis.
In this case of vehicle loan default prediction, the main goal is to identify applicants who will default on their loans. Note that caret's confusion matrix output above labels 0 (no default) as the positive class, whereas the Precision and Recall figures in the summary table are computed with respect to the default class (1). Since the main concern is correctly identifying defaults to minimize financial risk, Recall is considered the more critical metric: high recall ensures that most defaults are captured, even at the cost of some false alarms.
Recall measures the proportion of actual defaults that are correctly
predicted by the model. It focuses on the model’s ability to capture all
positive instances of defaults. High recall means a low false negative
rate, which is crucial when identifying all actual defaults is
important, even if it means higher false positives.
On the other hand, precision measures the proportion of correctly predicted defaults out of all instances predicted as defaults; it focuses on the accuracy of the default predictions. High precision means a low false positive rate, which is valuable when minimizing false alarms is a priority.
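As a quick sanity check, the Recall and Precision reported for the logistic model can be recomputed directly from its confusion matrix above, treating default (1) as the class of interest:
# Counts read off the logistic model's confusion matrix above
tp <- 10222  # predicted 1, actual 1
fn <- 4961   # predicted 0, actual 1
fp <- 27103  # predicted 1, actual 0
recall    <- tp / (tp + fn)   # ~ 0.6733
precision <- tp / (tp + fp)   # ~ 0.2739
c(recall = round(recall, 4), precision = round(precision, 4))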
Also, accuracy measures the overall correctness of predictions as the ratio of correctly predicted instances (both defaults and non-defaults) to the total number of instances. While accuracy is an intuitive metric, its reliability is affected by class imbalance, which is usually the case for loan-default problems since there are typically far fewer defaults than non-defaults.
Lastly, AUC quantifies the model’s ability to distinguish between
classes. It represents the area under the Receiver Operating
Characteristic (ROC) curve, which plots the true positive rate (TPR)
against the false positive rate (FPR) at various threshold settings. It
is a valuable metric for assessing the overall performance of a
classification model, including its ability to discriminate between
default and non-default instances in a loan default prediction problem.
It provides a comprehensive evaluation of the model’s discriminatory
power, especially in scenarios with class imbalance.
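Because model_prediction returns hard 0/1 labels, the AUC values above are computed from class predictions rather than probabilities. A minimal sketch, assuming the raw probabilities are kept, of computing AUC for the logistic model from its predicted probabilities with pROC:
# AUC from predicted probabilities instead of hard class labels
logistic_probs <- predict(logistic_model,
                          newdata = df_test %>% select(-loan_default),
                          type = "response")
pROC::auc(pROC::roc(df_test$loan_default, logistic_probs))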
Looking at the results, we find that all of the models are competitive in terms of recall, all performed poorly on precision, and accuracy was about average for the most part. The AUC values are only slightly above 0.5 for most models, and we believe the overall metrics can be improved with higher-quality data, feature selection, and hyper-parameter tuning.
One major limitation of the model is the data source. The data was obtained from Kaggle, an open data platform, which may call the reliability of the data for real-world usage into question. Some features, such as the credit score classification, may not be available for young applicants or recent immigrants who have not yet built a credit history. Higher-quality data is required to obtain a better-performing model.
Also, most of the models have tunable parameters that could yield better-performing models, but the models trained in this analysis did not include hyper-parameter tuning, and as such the best model in each category may not have been obtained. The tuning process was avoided here because some of the models, such as the neural network and the tree-based models, can quickly become complex and may require substantial computational capacity for cross-validation and grid search. For example, for the random forest model, 500 trees were chosen arbitrarily, without verifying whether that number of trees (or the other parameters) is adequate. Likewise, the number of boosting rounds for the Extreme Gradient Boosting (XGBoost) model was chosen without searching for the best combination of tunable parameters. Furthermore, for the neural network model, we do not know what network depth or number of hidden nodes would yield the best-performing model.
In addition, proper feature selection was not performed to determine whether certain features can be dropped or which features offer the most predictive value. Lastly, we did not compare training error to test error to determine whether the trained models suffer from overfitting or underfitting.
One of the most important next steps is to obtain better-quality, reputable data, possibly from a financial institution. This may be difficult, because financial data are highly regulated and companies treat such data as part of their intellectual property and may be unwilling to share it. If this work is done as part of an internal modeling process, better-quality data may be available for training a loan default model.
In addition, it is important to conduct hyper-parameter tuning to determine which parameter values are best suited for each model; a sketch follows. Note that as more data is included, the computational resources needed for extensive hyper-parameter tuning of complex models such as neural networks and tree-based models increase significantly.
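A minimal tuning sketch for the random forest model, assuming a small illustrative grid over num.trees and mtry and using ranger's out-of-bag prediction error as the selection criterion (the grid values and the use of OOB error are assumptions, not part of the original pipeline):
# Illustrative grid search over two ranger hyper-parameters
tune_grid <- expand.grid(num_trees = c(300, 500, 800), mtry = c(5, 8, 12))
tune_results <- lapply(seq_len(nrow(tune_grid)), function(i) {
  fit <- ranger(loan_default ~ ., data = df_train,
                num.trees = tune_grid$num_trees[i],
                mtry = tune_grid$mtry[i],
                probability = TRUE, classification = TRUE,
                seed = 1994, verbose = FALSE)
  # out-of-bag prediction error as a quick tuning criterion
  data.frame(tune_grid[i, ], oob_error = fit$prediction.error)
})
do.call(rbind, tune_results)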
We can also perform feature selection and check for multicollinearity using the Variance Inflation Factor (VIF) approach so that only features that provide predictive value are retained; a sketch follows. Several other feature engineering techniques, such as principal component analysis (PCA), can also be explored.
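A sketch of a VIF check on the numeric predictors, assuming the car package is available (it is not loaded elsewhere in this report); VIF values above roughly 5 to 10 are commonly taken to indicate problematic collinearity:
library(car) # assumed to be installed; provides vif()
# Fit an auxiliary logistic model on the numeric predictors only
vif_model <- glm(loan_default ~ ., family = binomial(link = "logit"),
                 data = df_train %>% select(-employment_type, -perform_cns_score_description))
# If vif() reports aliased coefficients, drop one of each perfectly collinear pair first
sort(car::vif(vif_model), decreasing = TRUE)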
Furthermore, next steps should involve comparing training error to test error to determine whether a model is overfitting or underfitting, and appropriate steps should then be taken to mitigate whichever is the case; a sketch follows.
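A minimal sketch of such a comparison for the logistic model, reusing the prediction helper defined above; a large gap between training and test accuracy would point to overfitting, while poor performance on both would suggest underfitting:
# Compare accuracy on the (balanced) training data and the held-out test data
train_pred <- model_prediction(df_train, logistic_model, "logistic")
test_pred  <- model_prediction(df_test,  logistic_model, "logistic")
data.frame(dataset  = c("train", "test"),
           accuracy = c(Metrics::accuracy(df_train$loan_default, train_pred),
                        Metrics::accuracy(df_test$loan_default,  test_pred)))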
Altman, E. I., & Saunders, A. (1998). Credit risk measurement: Developments over the last 20 years. Journal of Banking and Finance, 21(11-12), 1728–1742.
Agrawal, A., Agrawal, M., & Raizada, D. A. (2014). Predicting defaults in commercial vehicle loans using logistic regression: Case of an Indian NBFC. International Journal of Research in Commerce and Management, 5, 22–28.
Brownlee, J. (2019, August 12). Overfitting and Underfitting With Machine Learning Algorithms. Machine Learning Mastery. Retrieved November 4, 2023, from https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/#:%7E:text=Overfitting%3A%20Good%20performance%20on%20the,poor%20generalization%20to%20other%20data
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine learning, 20(3), 273-297.
Crook, J. N., Edelman, D. B., & Thomas, L. C. (2007). Recent developments in consumer credit risk assessment. European Journal of Operational Research, 183, 1447–1465.
Cromer, O. C., Purdy, K. W., Cromer, G. C., & Foster, C. G. (2023, October 13). Automobile | Definition, History, industry, design, & Facts. Encyclopedia Britannica. https://www.britannica.com/technology/automobile
Diez, D., Barr, C. D., & Cetinkaya-Rundel, M. (2019). OpenIntro statistics 4th Edition.
Education, I. C. (2021, March 25). Underfitting. IBM Cloud Learn. Retrieved November 5, 2023, from https://www.ibm.com/cloud/learn/underfitting
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, 1189-1232.
GeeksforGeeks. (2021, October 20). ML | Underfitting and Overfitting. Retrieved November 4, 2023, from https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/
Hand, D. J., & Henley, W. E. (1997). Statistical classification methods in consumer credit scoring: A review. Journal of the Royal Statistical Society, Series A, 160, 523–541.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
Hao, C., Alam, M. M., & Carling, K. (2010). Review of the literature on credit risk modeling: Development of the past 10 years. Banks and Bank Systems, 5 (3).
Lessmann, S., Baesens, B., Seow, H.-V., & Thomas, L. C. (2015). Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. European Journal of Operational Research, 247 (1), 124–136.
Martin, A. (2023). What is an auto loan? Bankrate. https://www.bankrate.com/loans/auto-loans/what-is-an-auto-loan/
Model Fit: Underfitting vs. Overfitting - Amazon Machine Learning. (n.d.). Amazon Machine Learning Developer Guide. Retrieved November 4, 2023, from https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html
O’Brien, S. (2023, February 4). Auto loan delinquencies are rising. Here’s what to do if you’re struggling with payments. CNBC. https://www.cnbc.com/2023/02/04/auto-loan-delinquencies-rise-what-to-do-if-you-struggle-with-payments.html
Probasco, J. (2023). Expert explanation of how auto loans work. Investopedia. https://www.investopedia.com/how-car-loans-work-5202265
U.S. vehicle fleet 1990-2021 | Statista. (2023, August 24). Statista. https://www.statista.com/statistics/183505/number-of-vehicles-in-the-united-states-since-1990/
U.S.: average selling price of new vehicles 2022 | Statista. (2023, June 7). Statista. https://www.statista.com/statistics/274927/new-vehicle-average-selling-price-in-the-united-states/
Witkowski, R. (2023, June 2). For people under 30, car loan delinquencies hit a 15-year high. Is the economy running out of gas? Forbes Advisor. https://www.forbes.com/advisor/auto-loans/car-loan-late-payments/
# Load libraries
library(Amelia) # To visualize missing data
library(caret)
library(caTools)
library(corrplot) # To plot correlation plot
library(cowplot) # To combine plots in a grid
library(fastDummies) # to convert character variables to dummies
library(ggcorrplot) # To plot correlation plot
library(kableExtra)
library(Metrics) # for model evaluation
library(nnet)
library(pROC)
library(ranger) # for random forest implementation
library(ROSE) # to balance the data
library(tidyverse)
library(xgboost)
# Read the data
url_vehicle_loan_default_train = "https://raw.githubusercontent.com/chinedu2301/data698-analytics-project/main/data/vehicle_default_train_data.csv"
url_vehicle_loan_default_test = "https://raw.githubusercontent.com/chinedu2301/data698-analytics-project/main/data/vehicle_default_test_data.csv"
vehicle_loan_default_train_raw = read_csv(url_vehicle_loan_default_train) %>% as_tibble()
vehicle_loan_default_test_raw = read_csv(url_vehicle_loan_default_test) %>% as_tibble()
# data_cleaning
data_cleaning <- function(df){
# This function accepts a dataframe (df) as input and returns another dataframe (cleaned_df) that is clean.
tryCatch({
if(is_tibble(df) | is.data.frame(df)){
print("The dataframe is a tibble, will proceed to clean data")
print("Data Cleaning in progress...")
# rename all the columns in the dataframe to lowercase
cleaned_df <- df %>% rename_all(tolower) %>%
# compute the age of the applicant in number of years
mutate(date_of_birth = dmy(date_of_birth),
disbursal_date = dmy(disbursal_date),
age = difftime(disbursal_date, date_of_birth, units = "days"),
age_years = round(as.numeric(age / 365.25), 0)) %>%
# extract the years and month component of the average_acct_age and convert to years
mutate(average_acct_age_year_comp = as.numeric(str_extract(average_acct_age, "\\d+")),
average_acct_age_mon_comp = as.numeric(str_extract(average_acct_age, "\\d+(?=mon)")),
average_acct_age = round((average_acct_age_year_comp + average_acct_age_mon_comp/12), 0)
) %>%
# extract the years and month component of the credit_history_length and convert to years
mutate(credit_history_length_year_comp = as.numeric(str_extract(credit_history_length, "\\d+")),
credit_history_length_comp = as.numeric(str_extract(credit_history_length, "\\d+(?=mon)")),
credit_history_length = round((credit_history_length_year_comp + credit_history_length_comp/12), 0)
) %>%
# clean up the perform_cns_score_distribution to include only a few categories
mutate(lowercase_cns_description = tolower(perform_cns_score_description),
perform_cns_score_description = case_when(
str_detect(lowercase_cns_description, "very low risk") ~ "very_low_risk",
str_detect(lowercase_cns_description, "low risk") ~ "low_risk",
str_detect(lowercase_cns_description, "medium risk") ~ "medium_risk",
str_detect(lowercase_cns_description, "high risk") ~ "high_risk",
str_detect(lowercase_cns_description, "very high risk") ~ "very_high_risk",
str_detect(lowercase_cns_description, "not scored|no bureau") ~ "low_risk",
TRUE ~ "none")) %>%
# clean up the employment type to have only few categories
mutate(lower_case_employment_type = tolower(employment_type),
employment_type = case_when(
str_detect(lower_case_employment_type, "salaried") ~ "salaried",
str_detect(lower_case_employment_type, "self employed") ~ "self_employed",
TRUE ~ "not_reported")) %>%
# select only the required columns
select(
age_years, disbursed_amount, asset_cost, ltv, employment_type, perform_cns_score_description,
pri_no_of_accts, pri_active_accts, pri_overdue_accts, pri_current_balance, pri_sanctioned_amount,
pri_disbursed_amount, sec_no_of_accts, sec_active_accts, sec_overdue_accts, sec_current_balance,
sec_sanctioned_amount, sec_disbursed_amount, primary_instal_amt, sec_instal_amt, new_accts_in_last_six_months,
delinquent_accts_in_last_six_months, average_acct_age, credit_history_length, no_of_inquiries, loan_default
)
print("Data Cleaning complete!!!")
return(cleaned_df)
}
else{
print("The dataframe is not a tibble. Kindly have your data in the form of a dataframe or a tibble")
}
},
#if an error occurs, tell me the error
error=function(e) {
message('An Error Occurred')
print(e)
},
#or if a warning occurs, tell me the warning
warning=function(w) {
message('A Warning Occurred')
print(w)
return(NA)
}
)
}
# clean the data
vehicle_loan_default_train_cleaned = data_cleaning(vehicle_loan_default_train_raw)
# data pre-processing
data_preprocess_scaling <- function(df){
# This helper function standardizes the numeric variables of the df using the standard normal method
df <- as.data.frame(df)
df_char <- df %>% select(loan_default, employment_type, perform_cns_score_description)
df_numeric <- df %>% select(-loan_default, -employment_type, -perform_cns_score_description)
df_numeric_scaled <- df_numeric %>% mutate_all( ~ (scale(.) %>% as.vector))
df_scaled_combined <- cbind(df_char, df_numeric_scaled)
return(df_scaled_combined)
}
xgb_nnet_preprocess <- function(df, mode){
# convert the categorical variable to dummies
df2 <- dummy_cols(df, select_columns = c("employment_type","perform_cns_score_description"),
remove_selected_columns = TRUE) %>% as.data.frame()
# prepare the xgb.DMatrix to use in the xgboost training
df2_train <- as.data.frame(df2[, -1])
df2_label <- as.data.frame(df2[, 1])
df_dmatrix <- xgb.DMatrix(as.matrix(sapply(df2_train, as.numeric)), label=as.matrix(df2_label))
if(mode == "xgboost"){
return(df_dmatrix)
} else if(mode == "nnet"){
return(df2)
} else{
print("Mode not supported.")
}
}
data_preprocessing <- function(df, mode = "train"){
# This function pre-processes the cleaned data and get the data ready for training.
tryCatch({
if(is_tibble(df) | is.data.frame(df)){
if(mode == "test"){
df_scaled <- data_preprocess_scaling(df)
print("Data Pre-processing complete")
return(df_scaled)
} else if(mode == "train"){
curr_frame <<- sys.nframe() # sends the current frame to the global environment.
# The ovun.sample function in the ROSE package assumes the data to be in the global env so you have to tell it
# which frame (scope) to find the data else this will fail if executed inside a function.
df_ovun <- ovun.sample(formula = formula(loan_default ~ .), data = get("df", sys.frame(curr_frame)),
N = 1.5 * nrow(df), seed = 1994, method = "both")$data %>% as.data.frame() %>% as_tibble() # N is 1.5x the rows of the input df
print("Oversampling and undersampling completed")
df_scaled <- data_preprocess_scaling(df_ovun)
print("Data Pre-processing complete!")
return(df_scaled)
} else {
print("You did not enter a valid mode type: Enter train or test for mode")
}
}
else{
print("The dataframe is not a tibble. Kindly have your data in the form of a dataframe or a tibble")
}
},
#if an error occurs, tell me the error
error=function(e) {
message('An Error Occurred')
print(e)
},
#or if a warning occurs, tell me the warning
warning=function(w) {
message('A Warning Occurred')
print(w)
return(NA)
}
)
}
# Train Test Split
# Set a seed
set.seed(1994)
#Split the sample
sampling <- sample.split(vehicle_loan_default_train_cleaned$loan_default, SplitRatio = 0.7)
# Training Data
df_train_subset <- subset(vehicle_loan_default_train_cleaned, sampling == TRUE)
# Testing Data
df_test_subset <- subset(vehicle_loan_default_train_cleaned, sampling == FALSE)
df_train = data_preprocessing(df_train_subset, mode = "train")
df_test = data_preprocessing(df_test_subset, mode = "test")
# Model training function
model_training <- function(df, mode = "logistic"){
if(mode == "logistic"){
print("Training a Logistic Regression Model...")
logistic_model <- glm(formula = loan_default ~ . ,
family = binomial(link = 'logit'), data = df)
print("Logistic Regression Model complete")
return(logistic_model)
} else if(mode == "rf"){
print("Training a Random Forest Classification Model...")
rf_model_ranger <- ranger(
formula = loan_default ~ .,
data = df,
num.trees = 500,
mtry = floor(length(df) / 3),
probability = TRUE,
verbose = FALSE,
classification = TRUE
)
print("Random Forest Classification Model complete")
return(rf_model_ranger)
} else if(mode == "xgboost"){
print("Training an XGBoost Classification Model...")
# pre-process the data to obtain the dmatrix
df_dmatrix <- xgb_nnet_preprocess(df, mode)
xgb_model <- xgboost(data = df_dmatrix, nthread = 4, nrounds = 150,
max.depth = 10, eta = 0.1, objective = "binary:logistic", verbose = FALSE)
print("XGBoost Training complete")
return(xgb_model)
} else if(mode == "nnet"){
print("Training a Neural Network Classification Model...")
# pre-process the model to convert character variables to dummies
df_nnet <- xgb_nnet_preprocess(df, mode)
nnet_model <- nnet(loan_default ~ ., data = df_nnet, decay = 5e-4,
size = 20, maxit = 100, trace = F, set.seed(1994))
print("NNET Training complete")
return(nnet_model)
} else {
print("You did not enter a valid mode type: Enter logistic, rf, xgboost or svm")
}
}
# model prediction function
model_prediction = function(df, trained_model, model_type){
# remove the response variable from the dataframe if it exists
if("loan_default" %in% colnames(df)){
test_data = df %>% select(-loan_default)
} else {
test_data = df
}
# make predictions
if(model_type == "rf"){
predictions = predict(trained_model, data = test_data)$predictions[,1]
} else if (model_type == "logistic"){
predictions = predict(trained_model, newdata = test_data, type = "response")
} else if (model_type == "xgboost"){
test_data_xgb = xgb_nnet_preprocess(df, model_type)
predictions = predict(trained_model, newdata = test_data_xgb)
} else{
test_data_nnet = xgb_nnet_preprocess(test_data, model_type)
predictions = predict(trained_model, newdata = test_data_nnet)
}
# convert probabilities to classes
predicted = ifelse(predictions > 0.5, 1, 0)
return(predicted)
}
# Model Metrics
model_metrics = function(actual, predicted){
# accuracy
accuracy_model = round((Metrics::accuracy(actual, predicted)), 4)
# precision
precision_model = round((Metrics::precision(actual, predicted)), 4)
# recall
recall_model = round((Metrics::recall(actual, predicted)), 4)
# auc
auc_model = round((pROC::auc(actual, predicted)), 4)
# model metrics
model_eval_metrics = c(accuracy_model, precision_model, recall_model, auc_model) %>% t()
column_names = c("Accuracy", "Precision", "Recall", "AUC")
evaluation_metrics = data.frame(values = model_eval_metrics)
colnames(evaluation_metrics) = column_names
# confusion Matrix
confusion_table = table(predicted, actual)
confusion_matrix = caret::confusionMatrix(confusion_table)
print("**********************************************************************")
print(confusion_matrix)
print("**********************************************************************")
return(evaluation_metrics)
}
# Train Logistic Model
logistic_model <- model_training(df_train, mode = "logistic")
logistic_actual = df_test$loan_default
logistic_predicted = model_prediction(df_test, logistic_model, "logistic")
logistic_model_metrics = model_metrics(logistic_actual, logistic_predicted)
rownames(logistic_model_metrics) = "Logistic Model"
# Train RF model
rf_model <- model_training(df_train, mode = "rf")
rf_actual = df_test$loan_default
rf_predicted = model_prediction(df_test, rf_model, "rf")
rf_accuracy = accuracy(rf_actual, rf_predicted)
rf_model_metrics = model_metrics(rf_actual, rf_predicted)
rownames(rf_model_metrics) = "Random Forest Model"
# Train an XGBoost model
xgb_model <- model_training(df_train, mode = "xgboost")
xgb_actual = df_test$loan_default
xgb_predicted = model_prediction(df_test, xgb_model, "xgboost")
xgb_model_metrics = model_metrics(xgb_actual, xgb_predicted)
rownames(xgb_model_metrics) = "XGBoost"
# Train an NNET model
nnet_model <- model_training(df_train, mode = "nnet")
nnet_actual = df_test$loan_default
nnet_predicted = model_prediction(df_test, nnet_model, "nnet")
nnet_model_metrics = model_metrics(nnet_actual, nnet_predicted)
rownames(nnet_model_metrics) = "Neural Network"
# Result
results = rbind(logistic_model_metrics, rf_model_metrics, xgb_model_metrics, nnet_model_metrics)
kable(results, "html") %>%
kable_paper("hover", full_width = F) %>%
scroll_box(width = "500px", height = "200px")