ABSTRACT

This project develops a model to predict auto-loan default using machine learning techniques. The dataset, obtained from Kaggle, contains 41 variables and over 230,000 records of previous loan applicants.
Most banks typically use a logistic regression model to decide whether an applicant is at risk of default. In this project, several other machine learning techniques were explored: tree-based models such as Random Forest and XGBoost (Extreme Gradient Boosting), as well as Artificial Neural Networks. A logistic regression model was developed as well and compared to these other models. We found that the Neural Network model provided the best result based on recall, and that the logistic regression model did not perform poorly in comparison but instead provided competitive performance. However, due to time and resource constraints, the models were not fully exploited: hyper-parameter tuning was not performed on the models with tunable parameters, and no feature engineering was done. Ultimately, the Logistic Regression model was chosen based on its competitive overall performance, with recall, accuracy, and AUC values good enough when compared to the other three models. Even though the Neural Network (NNET) model has better recall, it may be difficult to interpret the model to determine why a customer was refused a loan. Hence, the Logistic Regression model, which is the industry standard, is still chosen here because of its simplicity, ease of understanding and interpretation, and very short training time.

  1. INTRODUCTION

1.1 Research Problem

This project aims to develop a machine learning model that predicts whether a potential borrower will default on their auto loan, using available data collected about the customer.
The auto-loan industry is a multi-billion dollar industry that affects almost all aspects of life in the developed world as well as in developing countries. Automobiles are now the de facto means of movement in suburbs and even in some major cities, and the volume of automobiles on the roads has increased tremendously over the past decades. Buying an automobile (referred to in this project as a vehicle) is an important decision for most people, and banks are usually at the center of this decision since the majority of vehicle purchases are made through financing. Hence, financial institutions like banks constantly face the problem of deciding whether a borrower will be able to make payments throughout the lifetime of their auto loan.
Banks do not want to issue loans to individuals who would default, and at the same time do not want to deny loans to individuals who would not default. To do this, they want to be able to predict, with a high level of confidence, whether to approve or deny an auto loan, so that they can minimize losses and write-offs when a borrower is unable to make payments while still making profits from customers who can service their loans. This project aims to solve this problem by developing a predictive model, using machine learning techniques, to help banks decide which potential borrowers are likely to default on their loans.

1.2 Definition of Terms

Automobile: “Automobile, byname auto, also called motorcar or car, a usually four-wheeled vehicle designed primarily for passenger transportation and commonly propelled by an internal-combustion engine using a volatile fuel” (Cromer et al., 2023). Automobiles are also referred to as vehicles, motor vehicles, light trucks, etc. These days, automobiles can be propelled by either an internal-combustion engine or an electric battery. About 282 million vehicles were registered in the United States in 2021, roughly 90 million more than the approximately 193 million registered in 1990 (U.S. Vehicle Fleet 1990-2021 | Statista, 2023). Also, the average price of a new motor vehicle in the United States in 2022 was about $46,000 USD (U.S.: Average Selling Price of New Vehicles 2022 | Statista, 2023).

Auto Loan: An auto loan is the money you borrow to pay for your car. You have to repay the loan with interest in fixed installments (Martin, 2023). Auto loans are also referred to as car loans, vehicle loans, car financing, etc. These loans are often secured loans, meaning that the car is used as the collateral to secure the loan. Typically, consumers borrow money to buy vehicles: consumers owed about 1.41 trillion US dollars on vehicles they drove in 2022, the average auto loan balance is about $22,000 USD, and about 80% of all new vehicles on the road are financed through a loan or lease (Chris, 2023). This shows that the auto-loan industry is a multi-trillion dollar industry, which underscores the importance of loan default models to help minimize losses.

Vehicle Loan Default: Loan default refers to a borrower failing to make their installment payments as agreed in the loan terms. Usually, for secured loans, the lender (the bank in this case) can repossess the asset (the car) used as collateral for the loan. Banks usually do not want to do that, but are sometimes forced to if the borrower defaults and does not make an arrangement with the lender. When a lender repossesses the car, its value at the time of repossession may not cover the loan balance, and the lender will have no option but to write off that balance as a loss. In the US, default rates for auto loans are on the rise and currently sit at about 2%. The benefits of being able to predict whether a borrower will default cannot be overemphasized: it not only helps lenders decide whether to approve or deny a loan application, it also helps them price the interest rate for borrowers appropriately.
The problem of loan default is inherently probabilistic and is considered central to credit risk modeling/credit scoring. Hence, lenders use a vast array of credit risk tools to determine whether a borrower is likely to default. Entering default simply means that the lender determines that the borrower is not going to pay, usually some time after 90 days of no payments, and it can translate into the car being repossessed (O’Brien, 2023).


2. LITERATURE REVIEW

The probability of loan default is a well-researched topic, especially because the industry is a multi-trillion dollar one. The area is of significant economic importance and is often referred to as credit risk modeling or credit scoring. We review literature related to credit in general and then literature related to auto loans.

According to Crook et al. (2007), credit scoring is concerned with developing empirical models to support decision making in the retail credit business, and a credit score is a model-based estimate of the probability that a borrower will show some undesirable behavior in the future. In application scoring, for example, lenders employ predictive models, called scorecards, to estimate how likely an applicant is to default. Such PD (probability of default) scorecards are routinely developed using classification algorithms (Hand & Henley, 1997).

Whilst the extension of credit goes back to the Babylonian times, the history of credit scoring began in 1941 with the publication by Durand of a study that distinguished between good and bad loans made by 37 firms (Crook et al. 2007). Since then, the already established techniques of statistical discrimination have been developed and an enormous number of new classification algorithms have been researched and tested. Virtually all major banks use credit scoring with specialized consultancies providing credit scoring services and offering powerful software to score applicants, monitor their performance and manage their accounts.

Altman and Saunders (1998) published an overview of credit risk modelling covering the preceding 20 years, finding that it had evolved drastically over that period due to newly emerging statistical techniques (Altman & Saunders, 1998). Later, another group of researchers published an extension of Altman and Saunders’ work, presenting further developments in credit risk modelling (Hao, Alam, & Carling, 2010). Their work identified more than 1,000 articles on the topic and found that the logistic regression (LR) model and discriminant analysis are the most widely used methods for constructing scoring systems.

Also, Crook et al. (2007) conducted research on credit risk scoring and found that the most common method of estimating a classifier of applicants into those likely to repay and those unlikely to repay is logistic regression, with the logit value compared against a cut-off. In effect, this research supports the claim that the industry standard for predicting loan default is the logistic regression model.

Lessmann et al. (2015) compared 41 classifiers based on six performance measures across eight real-world credit scoring data sets from the UK, Europe, and Australia. They investigated overall model performance using several datasets and examined the predictive performance in each case. The conclusion from this research suggests that several classifiers predict risk significantly better than the industry standard of Logistic Regression (LR). It went further to recommend the Random Forest (RF) model as a benchmark because of its effectiveness, precision, and interpretability.

Agrawal et al. (2014) studied the impact of contract-specific variables as predictors in commercial vehicle loans. In their research, applying a logistic regression model for predicting default, around 11 out of 17 contract-specific variables were identified as providing additional assistance to the credit lending institution (Agrawal, Agrawal, & Raizada, 2014). The authors also suggest that contract information could improve accuracy in more advanced nonlinear models; specifically, they suggest Neural Networks as one potential predictive model to improve performance based on contract information (Agrawal et al., 2014).

Keeping the outcome of the above literature in mind, this project aims to contribute to the field of vehicle loan prediction by developing machine learning models able to predict vehicle loan default from the available data, and by comparing three (3) models (Random Forest, XGBoost, and Neural Networks) to the industry-standard Logistic Regression model. The data set used in this project contains more data than those used in most of the literature reviewed above; statistically, more data tends to produce better results, and we hope this will help in comparing the models and deciding which provides the best prediction metrics. In this work, we explain the data used and the pre-processing and feature selection involved, provide a quick overview of predictive analytics and of the scoring metrics for classification problems, briefly explain each of the models used, and then present the analysis of the data, the modeling/testing, and the conclusions from the findings.


3. OVERVIEW OF THE MODELS

3.1 Predictive Analytics Overview

Predictive analytics involves using historical data, statistical algorithms, and machine learning techniques to predict future outcomes. There are different types of algorithms for making predictions; generally, we have regression models and classification models. Regression algorithms are mainly used when the variable to be predicted is a continuous value, while classification algorithms are used to predict categories or classes. Classification algorithms are a fundamental part of predictive analytics and are used to categorize data into classes or groups based on specific features or attributes. Examples of classification algorithms include, but are not limited to, Logistic Regression, Decision Trees, Random Forest, XGBoost, SVM, KNN, and Naive Bayes. The problem of vehicle loan default prediction is a classification problem, and the classification algorithms used to solve it in this project are briefly explained below.

3.2 Logistic Regression

Logistic Regression is a classification algorithm used when the response variable is binary, i.e., a two-level categorical variable. The model belongs to the family of generalized linear models (GLMs). Typical examples of binary response variables are Yes/No, Male/Female, Cancer/No Cancer, and Approve/Deny; these are often coded as 1 or 0. Even though the name contains “regression”, the logistic regression model is used when the response variable is discrete, i.e., it is a classification algorithm.
Logistic regression need not be binary: when the response variable has more than two levels, the model is known as multinomial logistic regression, which is beyond the scope of this project.
The logistic regression model relates the probability \(p_i\) that a response is a success to the predictors \(x_{1, i}, x_{2, i}, ..., x_{k, i}\) through a framework like that of multiple regression:
\(logit(p_{i}) = log_{e}(\frac{p_{i}}{1-p_{i}}) = \beta_{0} + \beta_{1}x_{1,i} + \beta_{2}x_{2,i} + ... + \beta_{k}x_{k,i}\)

Assumptions for Logistic Regression
  • Each outcome of the response variable is independent of the other outcomes.
  • The response variable must follow a binomial distribution.
  • Each predictor \(x_{i}\) is linearly related to the \(logit(p_{i})\) if other predictors are held constant.
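To illustrate, a binary logistic regression can be fit in R with the built-in glm() function. The snippet below is a minimal sketch on simulated data; the variable names and coefficients are hypothetical and are not taken from the project dataset.

    # minimal logistic regression sketch on simulated data (hypothetical variables)
    set.seed(42)
    n   <- 1000
    ltv <- runif(n, 40, 95)                      # loan-to-value ratio
    age <- sample(21:65, n, replace = TRUE)      # applicant age in years
    p   <- plogis(-4 + 0.05 * ltv - 0.01 * age)  # true default probability
    default <- rbinom(n, 1, p)                   # simulated 0/1 response

    fit <- glm(default ~ ltv + age, family = binomial(link = "logit"))
    summary(fit)                          # estimated coefficients on the logit scale
    head(predict(fit, type = "response")) # fitted probabilities p_i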
3.3 Tree-Based Methods

    Tree-based models are a type of supervised learning algorithm used for both classification and regression problems. They construct a decision tree that recursively splits the data into subsets based on the most significant features. The tree structure consists of nodes representing the features, edges representing the decision rules, and leaves representing the output (class or value). The key advantage of tree-based models is their ability to handle non-linear relationships and interactions between features. However, they are prone to overfitting, especially when the trees become too complex. Tree-based models follow two major ensemble approaches: bagging and boosting.
    Bagging (Bootstrap Aggregating): It’s a technique that aims to reduce variance and prevent overfitting by training multiple models on different bootstrapped subsets of the dataset and then averaging the predictions. Random Forest is an ensemble method based on bagging, utilizing multiple decision trees trained on different subsets of the data.
    Boosting: Boosting is an ensemble technique that combines weak learners (typically shallow trees) sequentially to create a strong model. It focuses on improving the shortcomings of its predecessors by assigning higher weight to misclassified data, effectively learning from previous mistakes. Gradient Boosting and XGBoost are examples of boosting algorithms.

    3.3.1 Random Forest

    It’s an ensemble learning method that constructs multiple decision trees and merges their predictions to improve accuracy and reduce overfitting. It introduces randomness both in feature selection and dataset bootstrapping. By aggregating predictions from various trees, it tends to be more robust and less prone to overfitting compared to a single decision tree.
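    As a brief illustration, a random forest can be trained in R with the ranger package (the implementation loaded later in this project). The sketch below uses the built-in iris data reduced to two classes, not the loan data:

    library(ranger) # fast random forest implementation

    # minimal sketch: binary classification on built-in data (not the loan data)
    iris_bin <- iris[iris$Species != "setosa", ]
    iris_bin$Species <- droplevels(iris_bin$Species)

    rf_fit <- ranger(Species ~ ., data = iris_bin,
                     num.trees = 500,    # number of bootstrapped trees
                     probability = TRUE) # return class probabilities
    rf_fit$prediction.error              # out-of-bag estimate of the error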

    3.3.2 Gradient Boosting

    This is a boosting technique that builds trees sequentially. It fits each tree to the residuals (errors) of the preceding tree, reducing the errors at each step. Gradient boosting is a powerful algorithm known for its ability to handle complex data and achieve high accuracy.

3.3.3 XGBoost (Extreme Gradient Boosting)

    It is an optimized and highly efficient implementation of gradient boosting. XGBoost improves upon the traditional gradient boosting method by introducing regularization, parallel computing, and a variety of enhancements that significantly speed up the training process and improve accuracy.
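    For illustration, the xgboost package (used later in this project) fits such a model on a numeric matrix. The sketch below uses simulated data with hypothetical variables, not the loan data:

    library(xgboost) # gradient boosting implementation

    # minimal XGBoost sketch on simulated numeric data (not the loan data)
    set.seed(1)
    X <- matrix(rnorm(2000), ncol = 4)                  # 500 rows, 4 numeric features
    y <- as.numeric(X[, 1] + X[, 2]^2 + rnorm(500) > 1) # simulated 0/1 labels

    dtrain  <- xgb.DMatrix(data = X, label = y) # xgboost trains on a numeric DMatrix
    xgb_fit <- xgboost(data = dtrain, nrounds = 50,
                       objective = "binary:logistic", # outputs probabilities
                       max_depth = 3, eta = 0.1,      # depth and learning-rate controls
                       verbose = 0)
    head(predict(xgb_fit, dtrain)) # predicted probabilities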

    3.4 Neural Networks

    Artificial Neural Networks (ANNs) are a class of machine learning models inspired by the structure and function of the human brain. They consist of interconnected nodes, known as neurons, organized in layers to process and learn from data.

    Structure of ANNs
  • Neurons(Nodes): Neurons are the basic units in an ANN. Each neuron receives input signals, performs computations, and produces an output signal that is transmitted to other neurons in the network.
  • Layers: ANNs typically consist of three types of layers:
    Input Layer: Receives input data and passes it to the next layer.
    Hidden Layers: Intermediate layers between the input and output layers. They extract and transform features through complex computations.
    Output Layer: Produces the final output or prediction based on the learned representations from the hidden layers.
  • Connections (Weights): Neurons are connected to neurons in adjacent layers by weighted connections. These weights are adjusted during the learning process to optimize the network’s performance.
Training Process
  • Forward Propagation: Input data is passed through the network, and computations are performed layer by layer, generating an output.
  • Loss Calculation: The output is compared to the actual target, and a loss function measures the difference between the predicted and actual values.
  • Backpropagation: The algorithm calculates gradients of the loss function with respect to the weights of the network using chain rule and propagates this information backward through the network.
  • Weight Update: The weights are adjusted using optimization algorithms (e.g., gradient descent) to minimize the loss function, making the predictions closer to the actual values.
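    As a concrete sketch of this training loop, the nnet package (used later in this project) fits a single-hidden-layer feedforward network in a single call; the data below is simulated and the settings are illustrative only:

    library(nnet) # single-hidden-layer feedforward networks

    # minimal feedforward network sketch on simulated data (not the loan data)
    set.seed(7)
    df <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
    df$y <- factor(as.numeric(df$x1 * df$x2 > 0)) # binary target

    nn_fit <- nnet(y ~ x1 + x2, data = df,
                   size = 5,      # neurons in the hidden layer
                   decay = 0.01,  # weight decay regularization
                   maxit = 200,   # optimization iterations
                   trace = FALSE)
    head(predict(nn_fit, df)) # predicted probabilities for class 1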
Types of ANNs
  • Feedforward Neural Networks (FNN): The most basic type where information flows in one direction—from input to output without cycles. This is the type used in this project.
  • Recurrent Neural Networks (RNN): Networks that allow feedback loops, enabling them to process sequential data by retaining memory of previous inputs.
  • Convolutional Neural Networks (CNN): Specialized for processing grid-like data, such as images. They use convolutional layers to learn features hierarchically.
Applications

    Image and Speech Recognition: CNNs are widely used in image classification, object detection, and speech recognition.
    Natural Language Processing (NLP): RNNs and variants like LSTM and GRU are used for text analysis, language translation, and sentiment analysis.
    Predictive Modeling: ANNs are applied in various predictive tasks, including regression, classification, and time-series forecasting.

    Challenges

    Computational Complexity: Training large ANNs can be computationally intensive and require significant resources.
    Overfitting: ANNs, especially with complex architectures, can overfit the training data if not properly regularized or trained on diverse data.

    3.5 Scoring the Model and Understanding the Scoring Metrics

    After obtaining and fitting a classification model, we need certain metrics to evaluate how well the model performs. Below are some metrics that can be used to evaluate the performance of a classification model:

    [Figure: confusion matrix layout (TP, FP, TN, FN). Source: https://www.debadityachakravorty.com/ai-ml/cmatrix/]

    The type of metric to use depends on the situation; a short computational sketch follows this list.
  • Accuracy: This is the most common measure used to evaluate a classification model. Accuracy is the ratio of correctly classified observations to the total. This tells us the percentage of observations that our model is correctly classifying.
    \(accuracy = \frac{TP + TN}{TP+FP+TN+FN}\)
    Accuracy is great for symmetric datasets and when the cost of false positives and false negatives are similar.
  • Precision: This is the percentage of positively classified results that are relevant. It is computed as the number of true positives (TP) divided by all observations classified as positive by the model (both TP and FP). In other words, a precision of 90% means that of all the observations classified as positive by our model, 90% are actually positive.
    \(precision = \frac{TP}{TP+FP}\)
    We use precision when we want to be more confident about our true positives. For example, in spam/ham emails, you want to be sure that the email is spam before putting it in the spam box.
  • Recall (Sensitivity): The recall is also regarded as the sensitivity of the model. It is the proportion of actual positives that are correctly classified by the model: the number of true positives divided by the total number of actual positives, whether correctly or incorrectly classified (TP and FN). Recall tells us what proportion of the positive class got correctly classified, e.g., what percentage of cancer patients were correctly identified as cancer patients by the model. A recall of 95% means that of all the actually positive cases, 95% were correctly classified by the model.
    \(recall = \frac{TP}{TP+FN}\)
    We use recall when having a false positive is far better than having a false negative. For example, you would rather tell someone that they have cancer (FP) when in fact they don’t than tell them that they don’t have cancer (FN) when in fact they do. It would be disastrous to give a false negative to a cancer patient, because they would probably have had time to ameliorate the situation had they been informed earlier. Recall is the better metric when the cost of false negatives is unacceptable, i.e., a false positive is preferable to a false negative.
  • F1 Score: This is the harmonic mean of the precision and recall. It is also called F-score or F-Measure.
    \(F1 = \frac{2 * precision * recall}{precision + recall}\)
    F1 is best for uneven class distributions, and it can be used to compare different models.
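    To make the definitions above concrete, the snippet below computes each metric directly from confusion-matrix counts; the counts themselves are made up for illustration:

    # metric calculations from confusion-matrix counts (illustrative numbers)
    TP <- 80; FP <- 20; TN <- 90; FN <- 10

    accuracy  <- (TP + TN) / (TP + FP + TN + FN)
    precision <- TP / (TP + FP)
    recall    <- TP / (TP + FN) # sensitivity
    f1        <- 2 * precision * recall / (precision + recall)

    c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)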
3.6 Underfitting and Overfitting

    In supervised machine learning problems, we aim for a model that properly fits the training data and also generalizes well to new/test data. When a model is unable to generalize to new data, it cannot serve its purpose. This is often the result of underfitting or overfitting.
  • Underfitting: This is a situation in machine learning where a model does not properly fit the training data, resulting in high training error and high test error. The model performs poorly on the training data as well as on new data. An underfit model is not suitable for the data because it cannot properly capture the relationship between the input examples (X) and the target values (Y). Poor performance on the training data could mean that the model is too simple to describe the target properly. A typical example of an underfit model is using a linear model for data points with a quadratic relationship.
    Since the model cannot generalize well on new data, it cannot be leveraged for prediction or classification tasks. High bias and low variance are good indicators of underfitting.
  • Overfitting: This is the opposite of underfitting. It is a situation where a model fits the training data very closely but performs poorly on new datasets, resulting in low training error but high test error. The model essentially learns and memorizes the noise and details in the training data, such that it cannot generalize to unseen data. A typical example of overfitting is fitting data with a quadratic relationship using a cubic or higher-order polynomial model. High variance and low bias are indicators of overfitting.

    [Figure: illustration of underfitting vs. overfitting. Source: https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html]


    Reducing Underfitting
    There are different ways to decrease underfitting, such as:
  • Increase model complexity. It could be that the selected model is too simple and a more complex model may be required.
  • Perform feature selection/feature engineering.
  • Increase the duration of training to get better results.
  • Decrease the amount of regularization used.
    Reducing Overfitting
    There are different ways to reduce overfitting, such as:
  • Reduce the model complexity. It could be that the model is too complex and a simpler model is required.
  • Increase the amount of training data
  • Perform feature selection: Reduce the number of features
  • Increase the amount of regularization
  • Early stopping in the training
  • Perform cross-validation (a minimal sketch follows this list)
  • Use ensemble methods
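    As one example of the remedies above, k-fold cross-validation can be set up with the caret package (loaded later in this analysis). This is a minimal sketch on built-in two-class data, not the loan data:

    library(caret) # model training utilities

    # minimal 5-fold cross-validation sketch on built-in data
    iris_bin <- iris[iris$Species != "setosa", ]
    iris_bin$Species <- droplevels(iris_bin$Species)

    ctrl   <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
    cv_fit <- train(Species ~ ., data = iris_bin,
                    method = "glm", family = "binomial",
                    trControl = ctrl)
    cv_fit$results # accuracy averaged over the held-out folds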


4. METHODOLOGY

    4.1 Dataset

    The data used in this analysis was obtained from Kaggle. The csv files containing the data were downloaded from Kaggle and placed on GitHub. There are three files: the train dataset, the test dataset, and a dictionary. The dictionary file explains what the variables represent. Unfortunately, the test dataset does not contain labels (i.e., the response variable) and could not be used to evaluate how well the model does, because labels are needed to compare against the predicted values. Hence, after cleaning, the train data was split using the caTools package into training (df_train) and testing (df_test) sets, as sketched below.
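    For reference, the split follows the standard caTools pattern. A minimal sketch, assuming cleaned_data holds the output of the data_cleaning function and a 70/30 ratio (the exact ratio is a detail of the analysis code):

    library(caTools) # for train test split

    # stratified train/test split on the response (70/30 ratio assumed)
    set.seed(123)
    split    <- sample.split(cleaned_data$loan_default, SplitRatio = 0.7)
    df_train <- subset(cleaned_data, split == TRUE)
    df_test  <- subset(cleaned_data, split == FALSE)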
    Variables
    There are 41 variables in the dataset: forty (40) predictor variables and one response variable. The variables in the dataset are:
  • UniqueID: Identifier for customer
  • loan_default: Payment default in the first EMI on due date
  • disbursed_amount: Amount of loan disbursed
  • asset_cost: Cost of the Asset
  • ltv: Loan to Value of asset
  • branch_id: Branch where the loan was disbursed
  • supplier_id: Vehicle Dealer where the loan was disbursed
  • manufacturer_id: Vehicle manufacturer (Hero, Honda, TVS, etc.)
  • Current_pincode: Current pincode of the customer
  • Date.of.Birth: Date of Birth of the customer
  • Employment.Type: Employment Type of the customer (Salaried/Self Employed)
  • DisbursalDate: Date of loan disbursement
  • State_ID: State of disbursement
  • MobileNo_Avl_Flag: If Mobile no. was shared by the customer then flagged as 1
  • Aadhar: If aadhar was shared by the customer then flagged as 1
  • PAN_flag: If pan was shared by the customer then flagged as 1
  • VoterID_flag: If voter Id was shared by the customer then flagged as 1
  • Driving_flag: If Driver license was shared by the customer then flagged as 1
  • Passport_flag: If passport was shared by the customer then flagged as 1
  • PERFORM_CNS.SCORE: Bureau Score
  • PERFORM_CNS.SCORE.DESCRIPTION: Bureau Score description
  • PRI.NO.OF.ACCTS: Count of total loans taken by the customer at the time of disbursement
  • PRI.ACTIVE.ACCTS: Count of active loans taken by the customer at the time of disbursement
  • PRI.OVERDUE.ACCTS: Count of default accounts at the time of disbursement
  • PRI.CURRENT.BALANCE: Total principal outstanding amount of the active loans at the time of disbursement
  • PRI.SANCTIONED.AMOUNT: Total amount that was sanctioned for all the loans at the time of disbursement
  • PRI.DISBURSED.AMOUNT: Total amount that was disbursed for all the loans at the time of disbursement
  • SEC.NO.OF.ACCTS: Count of total loans taken by the customer at the time of disbursement
  • SEC.ACTIVE.ACCTS: Count of active loans taken by the customer at the time of disbursement
  • SEC.OVERDUE.ACCTS: Count of default accounts at the time of disbursement
  • SEC.CURRENT.BALANCE: Total principal outstanding amount of the active loans at the time of disbursement
  • SEC.SANCTIONED.AMOUNT: Total amount that was sanctioned for all the loans at the time of disbursement
  • SEC.DISBURSED.AMOUNT: Total amount that was disbursed for all the loans at the time of disbursement
  • PRIMARY.INSTAL.AMT: EMI Amount of the primary loan
  • SEC.INSTAL.AMT: EMI Amount of the secondary loan
  • NEW.ACCTS.IN.LAST.SIX.MONTHS: New loans taken by the customer in the last 6 months before the disbursement
  • DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS: Loans defaulted in the last 6 months
  • AVERAGE.ACCT.AGE: Average loan tenure
  • CREDIT.HISTORY.LENGTH: Time since first loan
  • NO.OF_INQUIRIES: Enquiries done by the customer for loans
    Note: Primary accounts are those the customer has taken for personal use, while secondary accounts are those on which the customer acts as a co-applicant or guarantor.
    The response variable is the loan_default variable and it is a two level categorical variable with a value of 1 for default and value 0 for no default.

    4.2 Methodology

    The method adopted in this analysis follows a number of steps, and code re-use was essential to avoid errors resulting from code duplication. To perform this analysis, the following steps were taken in RStudio:

  • Read Data: The data was read into memory from the external location and a few records were displayed.
  • Clean Data: The data was cleaned using the data_cleaning function developed in this analysis. It leverages the tidyverse group of packages.
  • Explore Data: The data was explored. It was determined that there are 41 variables (40 predictors and 1 response variable). Also, there were over 230,000 observations, and the data was found to be unbalanced after visualizing the two-level categorical response variable loan_default in a bar chart. Further exploratory data analysis was done to better understand the data.
  • Pre-Process Data: The data was pre-processed using the data_preprocessing function also developed in this analysis.
  • Train Test Split: Split the data into training and test data.
  • Model Training: After cleaning and pre-processing, the train data was used to train four (4) different machine learning models: Logistic Regression, Random Forest, XGBoost, and Neural Network. The training was done using the model_training function developed as part of this analysis.
  • Model Prediction and Evaluation: After training each model, predictions were made, and the results were compared to the labels in the data to determine how well the model performs, using four different metrics: Accuracy, Precision, Recall, and AUC.
  • Results: The results of the model evaluations were compared to determine the best performing model.
  • Next Steps and Conclusion: Based on the results of the evaluation metrics and other findings during the analysis, next steps recommendations were made and a conclusion provided on the project findings.
    Further details about the functions used in the analysis are provided in the sections below.

    4.3 Data Cleaning and Pre-processing Details

    To clean and pre-process the data, two major functions, data_cleaning and data_preprocessing, were developed to avoid code duplication and errors, such that new data can simply be passed into those functions to be cleaned, pre-processed, and made ready for either training or testing.

    4.3.1 data_cleaning:

    This function cleans the data. It accepts an R dataframe or a tibble with a fixed schema and performs the following cleaning steps:
  • It starts by removing the ‘period’ in the column names and then converting all the names to snake case (lower) format.
  • It then derives the age of applicants by subtracting the date of birth from the date of disbursement, rounded to the nearest whole number, after both dates have been converted from strings to dates using the lubridate package in R.
  • Then, it normalizes average_acct_age by combining its year and month components and rounding the result to the nearest year.
  • It normalizes credit_history_length by combining its year and month components and rounding the result to the nearest year.
  • It cleans up perform_cns_score_description into meaningful risk categories.
  • It cleans up the employment_type of the applicant into salaried, self_employed, and not_reported.
  • Finally, it selects the relevant columns that will be used further in the analysis and returns the cleaned data as an R dataframe.
4.3.2 data_preprocessing:

    This function pre-processes the data for either training or testing purposes. It expects the cleaned data from the data_cleaning function and a mode specifying whether the data will be used for training or testing. It has a helper function, data_preprocess_scaling, that uses the standard-scaler approach to scale the data. For testing data, mode is passed as ‘test’ and the function simply calls the helper to scale the data, returning scaled data ready for testing. For training data, mode is passed as ‘train’ and the function first uses the ROSE package in R to oversample/under-sample and balance the classes in this binary classification problem, after which it calls the helper to standardize/scale the balanced data.
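    A condensed sketch of that logic is shown below; it is illustrative, not the project's exact implementation, and assumes the input dataframe has the loan_default response plus numeric and character predictors:

    library(ROSE) # to balance the data

    # hedged sketch of the data_preprocessing logic described above
    data_preprocessing_sketch <- function(df, mode = "train") {
      if (mode == "train") {
        # balance the classes with combined over/under-sampling (ROSE)
        df <- ovun.sample(loan_default ~ ., data = df, method = "both",
                          N = nrow(df), p = 0.5, seed = 1)$data
      }
      # standard-scaler step: center and scale the numeric predictors only
      num_cols <- sapply(df, is.numeric) & names(df) != "loan_default"
      df[num_cols] <- scale(df[num_cols])
      df
    }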

    4.3.3 model_training:

    This function trains any of the four (4) models. It expects a pre-processed dataframe and the type of model to train. It has a helper function, xgb_nnet_preprocess, that does further preprocessing for the XGBoost and Neural Network models. For the XGBoost model, the train function expects a DMatrix, which can contain only numeric values; in that case, model_training calls the helper, which converts the non-numeric predictor variables to dummies and converts the dataframe into the DMatrix format required to train with the XGBoost algorithm. For the NNET model, the helper function only dummifies the non-numeric predictor variables.
    For models that need further pre-processing, the function calls the appropriate helper and then trains the model using the data and model type provided, returning a trained model.
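    A simplified structural sketch of such a dispatcher is shown below; only the logistic and random forest branches are spelled out, since the XGBoost and NNET branches additionally require the dummifying helper described above:

    library(ranger) # for random forest implementation

    # hedged structural sketch of model_training (not the project's exact code)
    model_training_sketch <- function(df, model_type) {
      switch(model_type,
        logistic = glm(loan_default ~ ., data = df, family = binomial),
        rf       = ranger(as.factor(loan_default) ~ ., data = df,
                          num.trees = 500, probability = TRUE),
        stop("xgb and nnet require the xgb_nnet_preprocess step first")
      )
    }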

    4.3.4 model_prediction:

    This behaves similarly to the model_training function, but for predictions. It expects three parameters: data, trained_model, and model_type. Depending on the model type, it determines whether the data needs further pre-processing; for the XGBoost and NNET models it calls the same helper, xgb_nnet_preprocess, so that the data is in the same format that was used for training, which is what is needed to make predictions. The predictions returned are probabilities of belonging to a certain class (Class 0). The function then converts these probabilities to classes using a given threshold, which can be adjusted based on business needs; for this project, a 50% threshold is used.
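    The probability-to-class conversion itself is a one-liner; a minimal sketch with the 50% threshold, remembering that the probabilities here are for Class 0:

    # convert predicted probabilities of Class 0 into class labels (threshold = 0.5)
    prob_to_class <- function(prob_class0, threshold = 0.5) {
      ifelse(prob_class0 >= threshold, 0, 1)
    }

    prob_to_class(c(0.92, 0.41, 0.55)) # returns 0 1 0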

    4.3.5 model_evaluation:

    After training a model and using it to make predictions, it is very important to determine whether the predictions can be trusted. This function evaluates the model: it calculates the accuracy, precision, recall, and AUC for any of the four models once the actual and predicted values are provided. It relies heavily on the Metrics package in R for the evaluations, except for AUC, where it uses the pROC package.
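    A condensed sketch of such an evaluation function, using the two packages named above (the project's actual function may differ in detail):

    library(Metrics) # accuracy, precision, recall
    library(pROC)    # AUC

    # hedged sketch of the model_evaluation logic described above
    model_evaluation_sketch <- function(actual, predicted_class, predicted_prob) {
      c(accuracy  = Metrics::accuracy(actual, predicted_class),
        precision = Metrics::precision(actual, predicted_class),
        recall    = Metrics::recall(actual, predicted_class),
        auc       = as.numeric(pROC::auc(actual, predicted_prob)))
    }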

    Note: although there is a section for exploratory data analysis, no dedicated function was written for it.


5. ANALYSIS, TESTING AND RESULTS

    5.1 Analysis

    5.1.1 Data Cleaning and Pre-processing

    Libraries
    Load the required libraries for the analysis

    library(Amelia) # To visualize missing data
    library(caret) # model training utilities
    library(caTools) # for train test split
    library(corrplot) # To plot correlation plot
    library(cowplot) # To combine plots in a grid
    library(fastDummies) # to convert character variables to dummies
    library(ggcorrplot) # To plot correlation plot
    library(kableExtra) # to style HTML tables
    library(Metrics) # for model evaluation
    library(nnet) # for neural network implementation
    library(pROC) # for AUC calculation
    library(ranger) # for random forest implementation
    library(ROSE) # to balance the data
    library(tidyverse) # for data manipulation and plotting
    library(xgboost) # for gradient boosting implementation

    Read the datasets into memory from the GitHub location.

    url_vehicle_loan_default_train = "https://raw.githubusercontent.com/chinedu2301/data698-analytics-project/main/data/vehicle_default_train_data.csv"
    url_vehicle_loan_default_test = "https://raw.githubusercontent.com/chinedu2301/data698-analytics-project/main/data/vehicle_default_test_data.csv"
    vehicle_loan_default_train_raw = read_csv(url_vehicle_loan_default_train) %>% as_tibble()
    vehicle_loan_default_test_raw = read_csv(url_vehicle_loan_default_test) %>% as_tibble()

    Display a few records of the dataset to get an idea of what the data looks like.

    # display a few records of the raw data
    raw_data_few_records <- kable(head(vehicle_loan_default_train_raw, 50), "html") %>%
                            kable_paper("hover", full_width = F) %>%
                            scroll_box(width = "850px", height = "350px")
    raw_data_few_records
    [Output: scrollable HTML table showing the first 50 records of the raw training data, 41 columns from UNIQUEID through LOAN_DEFAULT.]

    data_cleaning function

    This function cleans the data. It accepts only a dataframe (or a tibble).

    data_cleaning <- function(df){
      # This function accepts a dataframe (df) as input and returns another dataframe (cleaned_df) that is clean.
      tryCatch({
      if(is_tibble(df) | is.data.frame(df)){
        print("The dataframe is a tibble, will proceed to clean data")
        print("Data Cleaning in progress...")
        # rename all the columns in the dataframe to lowercase
        cleaned_df <- df %>% rename_all(tolower) %>%  
                    # compute the age of the applicant in number of years
                    mutate(date_of_birth = dmy(date_of_birth),  
                          disbursal_date = dmy(disbursal_date), 
                          age = difftime(disbursal_date, date_of_birth, units = "days"),
                          age_years = round(as.numeric(age / 365.25), 0)) %>%
                   # extract the years and month component of the average_acct_age  and convert to years
                    mutate(average_acct_age_year_comp = as.numeric(str_extract(average_acct_age, "\\d+")),
                          average_acct_age_mon_comp = as.numeric(str_extract(average_acct_age, "\\d+(?=mon)")),
                          average_acct_age = round((average_acct_age_year_comp + average_acct_age_mon_comp/12), 0)
                           ) %>%
                    # extract the years and month component of the credit_history_length and convert to years
                    mutate(credit_history_length_year_comp = as.numeric(str_extract(credit_history_length, "\\d+")),
                           credit_history_length_comp = as.numeric(str_extract(credit_history_length, "\\d+(?=mon)")),
                           credit_history_length = round((credit_history_length_year_comp + credit_history_length_comp/12), 0)
                           )  %>%
                    # clean up the perform_cns_score_description to include only a few categories
                    # (case_when keeps the first match, so "very high risk" must be tested
                    #  before "high risk", just as "very low risk" is tested before "low risk")
                    mutate(lowercase_cns_description = tolower(perform_cns_score_description),
                           perform_cns_score_description = case_when(
                                    str_detect(lowercase_cns_description, "very low risk") ~ "very_low_risk",
                                    str_detect(lowercase_cns_description, "low risk") ~ "low_risk",
                                    str_detect(lowercase_cns_description, "medium risk") ~ "medium_risk",
                                    str_detect(lowercase_cns_description, "very high risk") ~ "very_high_risk",
                                    str_detect(lowercase_cns_description, "high risk") ~ "high_risk",
                                    str_detect(lowercase_cns_description, "not scored|no bureau") ~ "low_risk",
                                    TRUE ~ "none"))  %>%
                    # clean up the employment type to have only few categories
                    mutate(lower_case_employment_type = tolower(employment_type),
                           employment_type = case_when(
                                    str_detect(lower_case_employment_type, "salaried") ~ "salaried",
                                    str_detect(lower_case_employment_type, "self employed") ~ "self_employed",
                                    TRUE ~ "not_reported")) %>%
                    # select only the required columns
                    select(
                      age_years, disbursed_amount, asset_cost, ltv, employment_type, perform_cns_score_description,
                      pri_no_of_accts, pri_active_accts, pri_overdue_accts, pri_current_balance, pri_sanctioned_amount,
                      pri_disbursed_amount, sec_no_of_accts, sec_active_accts, sec_overdue_accts, sec_current_balance,
                      sec_sanctioned_amount, sec_disbursed_amount, primary_instal_amt, sec_instal_amt, new_accts_in_last_six_months,
                      delinquent_accts_in_last_six_months, average_acct_age, credit_history_length, no_of_inquiries, loan_default
                    )
                    print("Data Cleaning complete!!!")
        return(cleaned_df)
      }
      else{
      print("The dataframe is not a tibble. Kindly have your data in the form of a dataframe or a tibble")
      }
        
      },
      #if an error occurs, tell me the error
      error=function(e) {
            message('An Error Occurred')
            print(e)
            },
      #or if a warning occurs, tell me the warning
      warning=function(w) {
            message('A Warning Occurred')
            print(w)
            return(NA)
            }
        )
      
    }

    Use the data_cleaning function to clean the data.

    # clean the data
    vehicle_loan_default_train_cleaned = data_cleaning(vehicle_loan_default_train_raw)
    ## [1] "The dataframe is a tibble, will proceed to clean data"
    ## [1] "Data Cleaning in progress..."
    ## [1] "Data Cleaning complete!!!"

    Display a few records of the cleaned data.

    # display a few records of the cleaned data
    cleaned_data_few_records <- kable(head(vehicle_loan_default_train_cleaned, 200), "html") %>% 
                                kable_paper("hover", full_width = F) %>%
                                scroll_box(width = "850px", height = "350px")
    cleaned_data_few_records
    [Output: scrollable HTML table showing the first records of the cleaned training data, 26 columns from age_years through loan_default.]
    52 54273 71840 80.73 salaried low_risk 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 13 0 0
    39 48468 65500 77.86 salaried low_risk 1 1 0 58558 48220 48220 0 0 0 0 0 0 0 0 1 0 0 0 0 1
    26 50046 70516 72.75 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    30 47773 63306 80.56 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    46 46548 63306 78.82 salaried very_low_risk 1 1 0 0 12000 12000 0 0 0 0 0 0 0 0 0 0 1 1 0 0
    33 54131 69936 82.36 salaried very_low_risk 11 2 0 45639 75000 75000 0 0 0 0 0 0 22267 0 1 0 1 2 5 1
    36 50743 66115 78.65 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    33 48258 63896 80.60 salaried low_risk 1 1 0 51500 51500 51500 0 0 0 0 0 0 0 0 1 0 0 0 0 0
    49 47773 63306 80.56 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    48 44779 62852 74.78 salaried medium_risk 5 2 1 34922 60000 60000 0 0 0 0 0 0 11759 0 1 1 1 1 1 0
    26 45814 66115 71.09 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    41 48468 60410 84.42 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    49 44819 66487 69.19 salaried very_low_risk 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 4 0 0
    22 50673 63840 83.02 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    41 37939 64500 62.02 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    20 52428 67405 81.60 not_reported low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    20 51653 63896 86.08 not_reported low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    36 52818 63896 88.42 salaried medium_risk 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
    24 51428 64840 84.82 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0
    20 49488 63306 83.72 not_reported low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    40 51663 68000 79.41 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    26 49458 62852 82.73 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    43 52508 66246 81.51 salaried high_risk 36 11 2 327845 489490 489490 0 0 0 0 0 0 40747 0 3 1 1 7 0 0
    36 55333 73805 78.59 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    41 38939 59313 67.44 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    51 49713 63896 82.95 self_employed medium_risk 8 2 0 2131369 3254115 3254115 0 0 0 0 0 0 2884 0 0 0 2 4 0 0
    28 49458 63000 82.54 salaried low_risk 1 1 0 282809 292906 292906 0 0 0 0 0 0 0 0 1 0 0 0 0 1
    21 40884 59313 70.81 not_reported low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    42 50295 67528 77.92 salaried very_low_risk 1 0 0 0 0 0 0 0 0 0 0 0 1328 0 0 0 1 1 0 0
    29 44575 59540 78.94 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    25 49973 63306 84.51 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    38 45769 66365 72.33 salaried very_low_risk 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 4 0 0
    41 51653 63306 86.88 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    30 46809 59313 80.93 salaried very_low_risk 1 1 0 3391 25000 25000 0 0 0 0 0 0 1350 0 0 0 2 2 0 1
    40 41670 59313 74.18 salaried low_risk 17 3 0 1002630 1028000 1028000 0 0 0 0 0 0 23951 0 1 0 1 2 0 1
    43 48693 62577 81.50 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    36 51428 63306 86.88 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    29 54273 69067 83.98 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    41 51428 63306 86.88 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    52 45769 59313 80.93 self_employed very_low_risk 1 0 0 0 0 0 0 0 0 0 0 0 2460 0 0 0 2 2 0 0
    30 42969 63840 72.06 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    25 48258 63896 80.60 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    28 47650 62577 79.74 salaried very_low_risk 2 1 0 1352 15000 15000 0 0 0 0 0 0 7460 0 0 0 1 2 1 0
    30 42690 63000 69.84 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    50 41854 59300 74.20 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    47 46759 62577 78.30 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    38 48433 63896 80.15 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    38 46555 59313 82.61 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    23 48699 59313 84.13 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    31 49250 65269 77.37 salaried low_risk 1 1 0 27537 28196 28196 0 0 0 0 0 0 0 0 0 0 1 1 0 1
    32 34959 60410 59.59 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    34 43090 64800 70.22 salaried low_risk 1 1 0 45500 45500 45500 0 0 0 0 0 0 0 0 1 0 0 0 0 0
    42 28084 63720 45.51 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    21 49683 62577 83.10 not_reported low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    47 50673 65269 81.20 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    36 38164 63896 64.17 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    57 50673 62577 84.70 salaried very_low_risk 3 2 0 2681 15749 15749 0 0 0 0 0 0 0 0 2 0 0 1 0 0
    42 50458 63896 84.51 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    36 49458 62577 83.10 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    23 42874 63800 68.97 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    31 49683 62577 83.10 salaried very_low_risk 2 1 0 21323 26000 26000 0 0 0 0 0 0 2856 0 1 0 1 1 1 1
    27 50743 63149 82.34 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    38 60213 84398 73.46 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    29 54303 69542 79.09 salaried high_risk 5 4 1 312718 677000 677000 0 0 0 0 0 0 12979 0 0 2 2 4 3 1
    32 49803 65368 77.25 salaried very_low_risk 2 0 0 0 0 0 0 0 0 0 0 0 4164 0 0 0 0 0 0 1
    46 51403 65687 79.32 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    30 60971 72657 85.00 salaried low_risk 2 1 0 10000 42500 42500 0 0 0 0 0 0 2154 0 0 0 2 3 0 1
    50 45349 65368 70.37 salaried very_low_risk 2 1 0 33612 40000 40000 0 0 0 0 0 0 3740 0 0 0 1 2 0 0
    27 36439 61865 59.81 self_employed very_low_risk 3 1 0 13785 56173 56173 0 0 0 0 0 0 4020 0 0 0 2 5 0 0
    33 51078 65368 79.55 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
    28 74951 102945 74.02 self_employed very_low_risk 2 1 0 21480 201381 201381 0 0 0 0 0 0 0 0 0 0 3 3 0 0
    34 58259 66068 89.30 self_employed very_low_risk 7 4 0 196020 259363 259363 0 0 0 0 0 0 26627 0 2 0 1 4 0 0
    40 52303 66310 79.93 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    45 46349 65687 71.55 salaried low_risk 1 1 0 5470 5470 5470 0 0 0 0 0 0 954 0 1 0 0 0 1 0
    46 51303 65060 79.93 salaried very_low_risk 1 0 0 0 0 0 0 0 0 0 0 0 1250 0 0 0 0 0 0 0
    33 51078 65310 79.62 salaried very_low_risk 5 4 0 46481 99090 99090 0 0 0 0 0 0 7210 0 0 0 1 1 1 1
    36 58259 65687 89.82 salaried very_low_risk 8 4 0 3259073 3610215 3587762 0 0 0 0 0 0 30113 0 2 0 1 7 3 0
    34 52003 68695 76.72 salaried low_risk 1 1 0 8455 14500 14500 0 0 0 0 0 0 1209 0 1 0 0 0 0 0
    40 49349 65368 76.49 self_employed very_low_risk 6 4 0 13438 48579 48579 0 0 0 0 0 0 2785 0 1 0 1 1 0 0
    22 55567 66252 84.99 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    44 40094 61865 65.79 self_employed low_risk 9 6 3 20196 351003 285648 0 0 0 0 0 0 0 0 2 1 3 11 0 0
    42 55259 65937 84.93 salaried very_low_risk 1 0 0 0 0 0 0 0 0 0 0 0 1475 0 0 0 1 1 0 0
    22 40394 65368 62.72 salaried very_low_risk 1 0 0 0 0 0 2 2 1 1171994 1690000 1690000 1090 9382 0 0 3 5 0 0
    37 52303 62365 84.98 self_employed high_risk 5 4 1 1147365 1163250 1163250 0 0 0 0 0 0 13050 0 2 0 1 2 1 0
    25 40094 65368 62.26 salaried low_risk 4 4 0 42063 66950 66950 0 0 0 0 0 0 4037 0 3 0 0 1 0 0
    25 50303 67099 76.01 self_employed low_risk 3 3 1 7960 38950 7960 0 0 0 0 0 0 1464 0 2 1 1 2 1 1
    42 53578 68870 79.13 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    25 52303 68695 77.15 self_employed low_risk 1 1 0 0 15000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    35 58259 65687 89.82 self_employed very_low_risk 2 0 0 0 0 0 0 0 0 0 0 0 2597 0 0 0 2 2 0 0
    48 54305 64760 85.00 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    23 48835 61865 79.99 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    33 56059 63349 89.66 salaried low_risk 2 1 0 0 4968 4968 0 0 0 0 0 0 0 0 0 0 3 3 0 0
    30 50303 64651 78.89 salaried high_risk 3 2 1 30206 99100 105564 1 1 0 0 40000 361 2315 0 1 1 1 3 0 0
    25 42394 77968 55.15 self_employed high_risk 5 1 1 22075 45000 34589 0 0 0 0 0 0 0 0 0 0 1 1 0 0
    39 57759 65368 89.49 self_employed medium_risk 5 5 1 476937 644797 644797 0 0 0 0 0 0 0 0 4 0 0 2 0 0
    53 54078 65197 84.36 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    35 41094 57782 72.17 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    61 35939 61865 59.00 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    31 54003 64389 84.95 self_employed low_risk 5 3 0 192520 555613 507644 0 0 0 0 0 0 4671 0 0 0 3 3 0 0
    27 27229 61865 44.77 salaried low_risk 4 3 0 4340 31809 31809 0 0 0 0 0 0 0 0 1 0 0 1 0 0
    44 51078 65203 79.75 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    40 49349 65690 76.12 self_employed very_low_risk 3 2 0 974963 1260000 1260000 0 0 0 0 0 0 19740 0 0 0 7 11 0 0
    40 51303 65249 79.69 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
    42 44749 61865 73.39 self_employed very_low_risk 10 2 0 1420891 1776733 1776733 0 0 0 0 0 0 31311 0 0 0 2 4 0 0
    33 47049 65215 73.14 salaried very_low_risk 2 2 0 5420 35271 35271 0 0 0 0 0 0 3847 0 1 0 1 1 4 0
    26 43394 66068 66.60 self_employed very_low_risk 10 5 0 145434 180522 180522 0 0 0 0 0 0 19485 0 4 0 0 1 1 0
    29 53803 68245 79.86 self_employed very_low_risk 1 1 0 1200 13900 13900 0 0 0 0 0 0 2317 0 0 0 1 1 0 1
    33 38439 65215 59.80 salaried very_low_risk 4 3 0 37855 90026 90026 0 0 0 0 0 0 5721 0 1 0 2 5 0 0
    53 48349 65368 74.96 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    41 54759 65368 84.90 self_employed very_low_risk 10 6 0 741288 1299454 1102429 0 0 0 0 0 0 0 0 0 0 4 13 0 0
    42 73723 99500 74.97 self_employed very_low_risk 2 1 0 820301 900000 900000 0 0 0 0 0 0 3000 0 0 0 1 2 0 0
    34 40394 69498 58.99 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    41 56013 65687 86.78 self_employed very_low_risk 13 2 0 881713 1115000 1115000 0 0 0 0 0 0 3722 0 0 0 1 4 0 0
    24 46349 62465 75.24 self_employed very_low_risk 1 1 0 24900 50000 50000 0 0 0 0 0 0 0 0 0 0 1 1 0 0
    51 56959 68695 83.99 self_employed very_low_risk 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
    45 58259 66068 89.30 self_employed very_low_risk 3 1 0 7493 14990 14990 0 0 0 0 0 0 0 0 1 0 1 1 0 0
    31 53878 68601 79.88 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    28 42394 61917 69.45 self_employed medium_risk 12 7 1 1074066 1353681 1341560 0 0 0 0 0 0 1565 0 4 1 1 2 0 1
    31 51303 65251 79.69 salaried very_low_risk 1 0 0 0 0 0 0 0 0 0 0 0 2000 0 0 0 2 2 0 0
    55 74122 103060 72.77 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    46 54576 65368 85.00 salaried low_risk 3 3 0 8710 27154 27154 0 0 0 0 0 0 1910 0 1 0 1 1 1 0
    59 52303 66552 79.64 self_employed very_low_risk 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0
    36 49049 64217 77.39 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    39 46349 65687 71.55 self_employed very_low_risk 4 0 0 0 0 0 0 0 0 0 0 0 4592 0 0 0 0 1 0 0
    38 51003 65687 78.71 self_employed very_low_risk 2 2 0 13163 31251 31251 0 0 0 0 0 0 0 0 2 0 0 0 0 0
    51 53078 65687 82.21 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    25 51303 67714 76.79 salaried very_low_risk 2 0 0 0 0 0 0 0 0 0 0 0 5556 0 0 0 1 1 0 0
    26 35939 61865 59.00 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

    5.1.2 Exploratory Data Analysis

    Check Shape of data
    dim(vehicle_loan_default_train_cleaned)
    ## [1] 233154     26
    There are 26 columns (25 predictor variables and 1 response variable) and 233,154 observations.
    Take a Glimpse at the data
    glimpse(vehicle_loan_default_train_cleaned)
    ## Rows: 233,154
    ## Columns: 26
    ## $ age_years                           <dbl> 35, 33, 33, 25, 41, 28, 30, 29, 27…
    ## $ disbursed_amount                    <dbl> 50578, 47145, 53278, 57513, 52378,…
    ## $ asset_cost                          <dbl> 58400, 65550, 61360, 66113, 60300,…
    ## $ ltv                                 <dbl> 89.55, 73.23, 89.63, 88.48, 88.39,…
    ## $ employment_type                     <chr> "salaried", "self_employed", "self…
    ## $ perform_cns_score_description       <chr> "low_risk", "medium_risk", "low_ri…
    ## $ pri_no_of_accts                     <dbl> 0, 1, 0, 3, 0, 2, 0, 1, 1, 1, 1, 3…
    ## $ pri_active_accts                    <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 2…
    ## $ pri_overdue_accts                   <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ pri_current_balance                 <dbl> 0, 27600, 0, 0, 0, 0, 0, 72879, -4…
    ## $ pri_sanctioned_amount               <dbl> 0, 50200, 0, 0, 0, 0, 0, 74500, 36…
    ## $ pri_disbursed_amount                <dbl> 0, 50200, 0, 0, 0, 0, 0, 74500, 36…
    ## $ sec_no_of_accts                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ sec_active_accts                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ sec_overdue_accts                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ sec_current_balance                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ sec_sanctioned_amount               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ sec_disbursed_amount                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ primary_instal_amt                  <dbl> 0, 1991, 0, 31, 0, 1347, 0, 0, 0, …
    ## $ sec_instal_amt                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ new_accts_in_last_six_months        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ delinquent_accts_in_last_six_months <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ average_acct_age                    <dbl> 0, 2, 0, 1, 0, 2, 0, 0, 5, 2, 1, 2…
    ## $ credit_history_length               <dbl> 0, 2, 0, 1, 0, 2, 0, 0, 5, 2, 1, 2…
    ## $ no_of_inquiries                     <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1…
    ## $ loan_default                        <dbl> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0…
    Check for Null Values
    # use missmap function from the Amelia package to check for NA values
    missmap(vehicle_loan_default_train_cleaned,
            plot.background = element_rect(fill = "antiquewhite"),
            main = "Vehicle Loan Default - Missing Values", 
            x.cex = 0.45,
            y.cex = 0.6,
            margins = c(7.1, 7.1),
            col = c("yellow", "black"), legend = FALSE)


    From the missing-values plot, we can see that there are no missing values in the data, so no NA imputation or other null handling is necessary.
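
    As a quick numeric cross-check of the plot, the NA counts can also be computed directly; this is a simple base-R check shown here for illustration, not part of the original pipeline.

    # count missing values per column and overall
    na_counts <- colSums(is.na(vehicle_loan_default_train_cleaned))
    sum(na_counts) # expected to be 0, consistent with the missingness plot above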

    Check for Imbalance of data for the loan_default variable
    loan_default_grouped <- vehicle_loan_default_train_cleaned %>% group_by(loan_default) %>% 
                            summarise(count = n()) %>%
                            mutate(percentage = round((count / sum(count) * 100), 2))
    
    loan_default_grouped_displayed <- kable(head(loan_default_grouped, 200), "html") %>% 
                                      kable_paper("hover", full_width = F)
    loan_default_grouped_displayed
    loan_default count percentage
    0 182543 78.29
    1 50611 21.71
    p_bar_loan_default_category <- loan_default_grouped %>% ggplot(aes(x=factor(loan_default), 
                                                                       y = percentage, fill = factor(loan_default))) +
                                   geom_bar(stat = "identity", position = "dodge") +
                                   labs(title = "Loan Default Distribution", x = "Loan Default", y = "Percentage", 
                                        fill = "Loan Default") +
                                   scale_y_continuous(labels = scales::percent_format(scale = 1)) +  # Format y-axis as percentages
                                   theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                            panel.background = element_rect(fill = "gray80"),
                                                            plot.background = element_rect(fill = "antiquewhite"))                          
    p_bar_loan_default_category


    From the values and plot above, we can see that there are far more observations in category 0 (no default) than in category 1 (default). Before modeling, the data will be balanced by oversampling the minority "default" category (combined with undersampling of the majority category), as previewed below.
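
    The balancing itself is implemented later inside the pre-processing function (Section 5.2.2) using the ROSE package. As a standalone preview, a minimal sketch of that step looks like this; the 1.5x target sample size mirrors the choice used there.

    library(ROSE) # provides ovun.sample for combined over/under-sampling
    # method = "both" oversamples the minority class (default) and undersamples
    # the majority class until the target sample size N is reached
    balanced <- ovun.sample(loan_default ~ ., data = vehicle_loan_default_train_cleaned,
                            method = "both", N = 1.5 * nrow(vehicle_loan_default_train_cleaned),
                            seed = 1994)$data
    table(balanced$loan_default) # roughly balanced class counts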

    Age Distribution
    # Age distribution - All categories
    age_dist_all <- vehicle_loan_default_train_cleaned %>% ggplot(aes(x = age_years)) +
                        geom_histogram(binwidth = 2, fill = "blue", color = "white", alpha = 0.7) +
                        labs(title = "Age Distribution - All Categories", x = "Age", y = "Frequency") + theme_minimal() +
                        theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))
    
    # Age distribution - Loan Default
    age_dist_loan_default <- vehicle_loan_default_train_cleaned %>% filter(loan_default == 1) %>% ggplot(aes(x = age_years)) +
                        geom_histogram(binwidth = 2, fill = "blue", color = "white", alpha = 0.7) +
                        labs(title = "Age Distribution - Loan Default", x = "Age", y = "Frequency") + theme_minimal() +
                        theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))
    
    # Age distribution - No Default
    age_dist_no_default <- vehicle_loan_default_train_cleaned %>% filter(loan_default == 0) %>% ggplot(aes(x = age_years)) +
                        geom_histogram(binwidth = 2, fill = "blue", color = "white", alpha = 0.7) +
                        labs(title = "Age Distribution - No Default", x = "Age", y = "Frequency") + theme_minimal() +
                        theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))
    
    # Plot all three plots
    plot_age_dist <- plot_grid(age_dist_loan_default, age_dist_no_default, age_dist_all, byrow = TRUE, nrow = 3) 
    plot_age_dist


    The age distribution is broadly similar for the default and no-default groups. Most applicants are between 20 and 40 years old, fewer are between 40 and 60, and almost none are older than 70.

    CNS Score Description

    The perform_cns_score_description variable categorizes the risk level of the applicant based on their credit score.

    # All Categories
    cns_score_desc_all <- vehicle_loan_default_train_cleaned %>% group_by(perform_cns_score_description) %>% 
                                    summarise(count = n()) %>%
                                    mutate(percentage = round((count / sum(count) * 100), 2)) %>%
                                    ggplot(aes(x = perform_cns_score_description, y = percentage,
                                                                        fill = factor(perform_cns_score_description))) +
                                    geom_bar(stat = "identity", position = "dodge") +
                                    labs(title = "Risk Level Distribution - All Categories", x = "Risk Level", y = "Percentage", 
                                         fill = "Risk Level") +
                                    scale_y_continuous(labels = scales::percent_format(scale = 1)) +  # Format y-axis as percentages
                                    theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                             panel.background = element_rect(fill = "gray80"),
                                                             plot.background = element_rect(fill = "antiquewhite"))
    
    cns_score_desc_all


    A significantly high proportion of the applicants fall under the low_risk category, and over 85% fall under either very_low_risk or low_risk. This may reflect applicants preferring to improve their credit, and thus their risk level, before applying for vehicle loans, especially since low-risk applicants tend to get better interest rates.

    Loan Default Distribution By Risk Level
    loan_default_risk_level <- vehicle_loan_default_train_cleaned %>% 
                                   count(loan_default, perform_cns_score_description, name = "Record_Count") %>%
                                   ggplot(aes(x=loan_default, y = Record_Count, fill = perform_cns_score_description)) +
                                   geom_bar(stat = "identity", position = "stack") +
                               labs(title = "Loan Default Distribution by Risk Level", x = "Loan Default", y = "Count", 
                                        fill = "Risk Level") + theme_minimal()  + 
                                   theme(plot.title = element_text(hjust = 0.5),
                                                            panel.background = element_rect(fill = "gray80"),
                                                            plot.background = element_rect(fill = "antiquewhite")) 
    loan_default_risk_level


    It is clear that a large portion of the non-defaulters are categorized as low_risk or very_low_risk. Notably, the low_risk category also constitutes a large share of the defaulters.

    Employment Type
    employment_type <- vehicle_loan_default_train_cleaned %>% 
                                   count(loan_default, employment_type, name = "Record_Count") %>%
                                   ggplot(aes(x=loan_default, y = Record_Count, fill = employment_type)) +
                                   geom_bar(stat = "identity", position = "stack") +
                               labs(title = "Loan Default Distribution by Employment Type", x = "Loan Default", y = "Count", 
                                        fill = "Employment Type") + theme_minimal()  + 
                                   theme(plot.title = element_text(hjust = 0.5),
                                                            panel.background = element_rect(fill = "gray80"),
                                                            plot.background = element_rect(fill = "antiquewhite")) 
    employment_type


    Salaried and self-employed applicants constitute most of the data, in roughly equal proportions within both default categories, although the self-employed are slightly more numerous.

    Correlation Plot
    vehicle_loan_default_train_cleaned_numeric <- vehicle_loan_default_train_cleaned %>% 
                                                  select(-employment_type, -perform_cns_score_description)
    corr_matrix <- cor(vehicle_loan_default_train_cleaned_numeric)
    correlation_plot <- ggcorrplot(corr_matrix, 
                                   lab = TRUE, # Show correlation coefficients on the tiles
                                   lab_size = 2, # Size of the coefficient labels
                                   hc.order = TRUE, # Reorder the correlation matrix
                                   type = "lower", 
                                   outline.col = "white", 
                                   colors = c("blue", "white", "red"), 
                                   ggtheme = ggplot2::theme_minimal(),
                                   title = "Correlation Plot") + coord_fixed(ratio = 0.9) + 
                                   theme(axis.text.x = element_text(size = 9, angle = 45, hjust = 1),
                                         axis.text.y = element_text(size = 9, hjust = 1),
                                         plot.title = element_text(hjust = 0.5),
                                         panel.background = element_rect(fill = "gray80"),
                                         plot.background = element_rect(fill = "antiquewhite"),
                                         axis.title = element_text(size = 10)) +labs(x = NULL, y = NULL)
    correlation_plot


    5.2.2 Data Pre-processing, Model Training, and Testing

    Preprocess the data

    Since the data is imbalanced, we preprocess the data to get a balanced dataset and also standardize the numeric variables in the data using the standard normal distribution.

    data_preprocess_scaling <- function(df){
                          # This helper function standardizes the numeric variables of the df using the standard normal method
                          df <- as.data.frame(df)
                          df_char <- df %>% select(loan_default, employment_type, perform_cns_score_description) 
                          df_numeric <- df %>% select(-loan_default, -employment_type, -perform_cns_score_description)
                          df_numeric_scaled <- df_numeric %>% mutate_all( ~ (scale(.) %>% as.vector))
                          df_scaled_combined <- cbind(df_char, df_numeric_scaled)
                          return(df_scaled_combined)
    }
    
    
    xgb_nnet_preprocess <- function(df, mode){
     # convert the categorical variables to dummy variables
                  df2 <- dummy_cols(df, select_columns = c("employment_type","perform_cns_score_description"), 
                                    remove_selected_columns = TRUE) %>% as.data.frame()
                  # prepare the xgb.DMatrix used for xgboost training; loan_default is
                  # assumed to be the first column after pre-processing
                  df2_train <- as.data.frame(df2[, -1]) # predictor columns
                  df2_label <- as.data.frame(df2[, 1])  # response column (loan_default)
                  df_dmatrix <- xgb.DMatrix(as.matrix(sapply(df2_train, as.numeric)), label=as.matrix(df2_label))
                  
                  if(mode == "xgboost"){
                    return(df_dmatrix)
                    
                  } else if(mode == "nnet"){
                    return(df2)
                    
                  } else{
                    print("Mode not supported.")
                  }
                  
    }
    
    
    data_preprocessing <- function(df, mode = "train"){
        # This function pre-processes the cleaned data and gets it ready for training.
        tryCatch({
          if(is_tibble(df) | is.data.frame(df)){
            if(mode == "test"){
               df_scaled <- data_preprocess_scaling(df)
               print("Data Pre-processing complete")
               return(df_scaled)
               
            } else if(mode == "train"){
              curr_frame <<- sys.nframe() # sends the current frame number to the global environment.
              # The ovun.sample function in the ROSE package assumes the data is in the global
              # environment, so we must tell it which frame (scope) holds the data; otherwise
              # the call fails when executed inside a function.
              df_ovun <- ovun.sample(formula = formula(loan_default ~ .), data = get("df", sys.frame(curr_frame)),
                                     N = 1.5 * nrow(df), seed = 1994, method = "both")$data %>% as.data.frame() %>% as_tibble()
              print("Oversampling and undersampling completed")
              df_scaled <- data_preprocess_scaling(df_ovun)
              print("Data Pre-processing complete!")
              return(df_scaled)
              
            } else {
              print("You did not enter a valid mode type: Enter train or test for mode")
              
            }
          }
          
        else{
        print("The input is not a data frame or tibble. Kindly provide your data as a data frame or a tibble.")
      }
        
      },
      #if an error occurs, tell me the error
      error=function(e) {
            message('An Error Occurred')
            print(e)
            },
      #or if a warning occurs, tell me the warning
      warning=function(w) {
            message('A Warning Occurred')
            print(w)
            return(NA)
            }
        )
      
    }
    Train Test Split

    Use the caTools library to split the cleaned dataset into training and testing datasets in a 70:30 ratio.

    # Set a seed
    set.seed(1994)
    #Split the sample
    sampling <- sample.split(vehicle_loan_default_train_cleaned$loan_default, SplitRatio = 0.7) 
    # Training Data
    df_train_subset <- subset(vehicle_loan_default_train_cleaned, sampling == TRUE)
    # Testing Data
    df_test_subset <- subset(vehicle_loan_default_train_cleaned, sampling == FALSE)

    Pre-process the train dataset

    df_train = data_preprocessing(df_train_subset, mode = "train")
    ## [1] "Oversampling and undersampling completed"
    ## [1] "Data Pre-processing complete!"

    Pre-process the test dataset

    df_test = data_preprocessing(df_test_subset, mode = "test")
    ## [1] "Data Pre-processing complete"
    Develop function to train the data

    The function model_training trains a machine learning model according to the mode selected (logistic, rf, xgboost, or nnet).

    model_training <- function(df, mode = "logistic"){
      
              if(mode == "logistic"){
                  print("Training a Logistic Regression Model...")
                  logistic_model <- glm(formula = loan_default ~ . , 
                                           family = binomial(link = 'logit'), data = df)
                  print("Logistic Regression Model complete")
                  return(logistic_model)
               
            } else if(mode == "rf"){
                  print("Training a Random Forest Classification Model...")
                  rf_model_ranger <- ranger(
                                     formula   = loan_default ~ ., 
                                     data      = df, 
                                     num.trees = 500,
                                     mtry      = floor(length(df) / 3),
                                     probability = TRUE,
                                     verbose = FALSE,
                                     classification = TRUE
                                     )
                  print("Random Forest Classification Model complete")
                  return(rf_model_ranger)         
              
            } else if(mode == "xgboost"){
                  print("Training an XGBoost Classification Model...")
                  # pre-process the data to obtain the dmatrix
                  df_dmatrix <- xgb_nnet_preprocess(df, mode)
                  xgb_model <- xgboost(data = df_dmatrix, nthread = 4, nrounds = 150,
                                       max.depth = 10, eta = 0.1, objective = "binary:logistic", verbose = FALSE)
                  print("XGBoost Training complete")
                  return(xgb_model)
              
        } else if(mode == "nnet"){
              print("Training a Neural Network Classification Model...")
              # pre-process the data to convert character variables to dummies
              df_nnet <- xgb_nnet_preprocess(df, mode)
              set.seed(1994) # set the seed before fitting so the initial weights are reproducible
              nnet_model <- nnet(loan_default ~ ., data = df_nnet, decay = 5e-4, 
                                 size = 20, maxit = 100, trace = F)
              print("NNET Training complete")
              return(nnet_model)
              
            } else {
          print("You did not enter a valid mode type: Enter logistic, rf, xgboost, or nnet for mode")
              
            }
          
    }
    Develop function to predict data

    The function model_prediction predicts the loan_default class for new data for each of the model types, converting the predicted probabilities to classes with a 0.5 cutoff.

    model_prediction = function(df, trained_model, model_type){
      
      # remove the response variable from the dataframe if it exists
      if("loan_default" %in% colnames(df)){
        test_data = df %>% select(-loan_default)
      } else {
        test_data = df
      }
      
      # make predictions
      if(model_type == "rf"){
        # ranger returns a matrix of class probabilities; column 1 is used here
        # (check colnames of $predictions to confirm which class it corresponds to,
        # since reading the wrong column would invert the predicted classes)
        predictions = predict(trained_model, data = test_data)$predictions[,1]
      } else if (model_type == "logistic"){
        predictions = predict(trained_model, newdata = test_data, type = "response")
      }  else if (model_type == "xgboost"){
        test_data_xgb = xgb_nnet_preprocess(df, model_type)
        predictions = predict(trained_model, newdata = test_data_xgb)
      } else{
        test_data_nnet = xgb_nnet_preprocess(test_data, model_type)
        predictions = predict(trained_model, newdata = test_data_nnet)
      }
      
      # convert probabilities to classes
      predicted = ifelse(predictions > 0.5, 1, 0)
      
      return(predicted)
    }
    Develop function to evaluate metrics of the model

    The function model_metrics computes accuracy, precision, recall, and AUC for a model and also prints its confusion matrix.

    model_metrics = function(actual, predicted){
      
      # accuracy
      accuracy_model = round((Metrics::accuracy(actual, predicted)), 4)
      # precision
      precision_model = round((Metrics::precision(actual, predicted)), 4)
      # recall 
      recall_model = round((Metrics::recall(actual, predicted)), 4)
      # auc
      auc_model = round((pROC::auc(actual, predicted)), 4)
      
      # model metrics
      model_eval_metrics = c(accuracy_model, precision_model, recall_model, auc_model) %>% t()
      column_names = c("Accuracy", "Precision", "Recall", "AUC")
      evaluation_metrics = data.frame(values = model_eval_metrics)
      colnames(evaluation_metrics) = column_names
      
      # confusion Matrix
      confusion_table = table(predicted, actual)
      confusion_matrix = caret::confusionMatrix(confusion_table)
      print("**********************************************************************")
      print(confusion_matrix)
      print("**********************************************************************")
      
      return(evaluation_metrics)
    }
    5.2.2.1 Model Training and Testing - Logistic Regression

    Train the logistic model

    logistic_model <- model_training(df_train, mode = "logistic")
    ## [1] "Training a Logistic Regression Model..."
    ## [1] "Logistic Regression Model complete"

    Evaluate the logistic Model

    logistic_actual = df_test$loan_default
    logistic_predicted = model_prediction(df_test, logistic_model, "logistic")
    logistic_model_metrics = model_metrics(logistic_actual, logistic_predicted)
    ## [1] "**********************************************************************"
    ## Confusion Matrix and Statistics
    ## 
    ##          actual
    ## predicted     0     1
    ##         0 27660  4961
    ##         1 27103 10222
    ##                                           
    ##                Accuracy : 0.5416          
    ##                  95% CI : (0.5379, 0.5453)
    ##     No Information Rate : 0.7829          
    ##     P-Value [Acc > NIR] : 1               
    ##                                           
    ##                   Kappa : 0.1168          
    ##                                           
    ##  Mcnemar's Test P-Value : <2e-16          
    ##                                           
    ##             Sensitivity : 0.5051          
    ##             Specificity : 0.6733          
    ##          Pos Pred Value : 0.8479          
    ##          Neg Pred Value : 0.2739          
    ##              Prevalence : 0.7829          
    ##          Detection Rate : 0.3954          
    ##    Detection Prevalence : 0.4664          
    ##       Balanced Accuracy : 0.5892          
    ##                                           
    ##        'Positive' Class : 0               
    ##                                           
    ## [1] "**********************************************************************"

    Display Model Metrics - Logistic Regression Model

    rownames(logistic_model_metrics) = "Logistic Model"
    logistic_model_metrics 
    ##                Accuracy Precision Recall    AUC
    ## Logistic Model   0.5416    0.2739 0.6733 0.5892
    5.2.2.2 Model Training and Testing - Random Forest

    Train a random forest classification model for the data

    rf_model <- model_training(df_train, mode = "rf")
    ## [1] "Training a Random Forest Classification Model..."
    ## [1] "Random Forest Classification Model complete"

    Evaluate the Random Forest Model

    rf_actual = df_test$loan_default
    rf_predicted = model_prediction(df_test, rf_model, "rf")
    rf_model_metrics = model_metrics(rf_actual, rf_predicted)
    ## [1] "**********************************************************************"
    ## Confusion Matrix and Statistics
    ## 
    ##          actual
    ## predicted     0     1
    ##         0  8012  3208
    ##         1 46751 11975
    ##                                           
    ##                Accuracy : 0.2857          
    ##                  95% CI : (0.2824, 0.2891)
    ##     No Information Rate : 0.7829          
    ##     P-Value [Acc > NIR] : 1               
    ##                                           
    ##                   Kappa : -0.0319         
    ##                                           
    ##  Mcnemar's Test P-Value : <2e-16          
    ##                                           
    ##             Sensitivity : 0.1463          
    ##             Specificity : 0.7887          
    ##          Pos Pred Value : 0.7141          
    ##          Neg Pred Value : 0.2039          
    ##              Prevalence : 0.7829          
    ##          Detection Rate : 0.1145          
    ##    Detection Prevalence : 0.1604          
    ##       Balanced Accuracy : 0.4675          
    ##                                           
    ##        'Positive' Class : 0               
    ##                                           
    ## [1] "**********************************************************************"

    Display Model Metrics - Random Forest Classification Model

    rownames(rf_model_metrics) = "Random Forest Model"
    rf_model_metrics 
    ##                     Accuracy Precision Recall    AUC
    ## Random Forest Model   0.2857    0.2039 0.7887 0.4675
    5.2.2.3 Model Training and Testing - XGBoost

    Train an XGBoost classification model for the data

    xgb_model <- model_training(df_train, mode = "xgboost")
    ## [1] "Training an XGBoost Classification Model..."
    ## [1] "XGBoost Training complete"

    Evaluate the XGBoost Model

    xgb_actual = df_test$loan_default
    xgb_predicted = model_prediction(df_test, xgb_model, "xgboost")
    xgb_model_metrics = model_metrics(xgb_actual, xgb_predicted)
    ## [1] "**********************************************************************"
    ## Confusion Matrix and Statistics
    ## 
    ##          actual
    ## predicted     0     1
    ##         0 31609  6375
    ##         1 23154  8808
    ##                                           
    ##                Accuracy : 0.5778          
    ##                  95% CI : (0.5742, 0.5815)
    ##     No Information Rate : 0.7829          
    ##     P-Value [Acc > NIR] : 1               
    ##                                           
    ##                   Kappa : 0.1124          
    ##                                           
    ##  Mcnemar's Test P-Value : <2e-16          
    ##                                           
    ##             Sensitivity : 0.5772          
    ##             Specificity : 0.5801          
    ##          Pos Pred Value : 0.8322          
    ##          Neg Pred Value : 0.2756          
    ##              Prevalence : 0.7829          
    ##          Detection Rate : 0.4519          
    ##    Detection Prevalence : 0.5430          
    ##       Balanced Accuracy : 0.5787          
    ##                                           
    ##        'Positive' Class : 0               
    ##                                           
    ## [1] "**********************************************************************"

    Display Model Metrics - XGBoost classification Model

    rownames(xgb_model_metrics) = "XGBoost"
    xgb_model_metrics
    ##         Accuracy Precision Recall    AUC
    ## XGBoost   0.5778    0.2756 0.5801 0.5787
    5.2.2.4 Model Training and Testing - Artificial Neural Networks

    Train a Neural Network model for the data

    nnet_model <- model_training(df_train, mode = "nnet")
    ## [1] "Training a Neural Network Classification Model..."
    ## [1] "NNET Training complete"

    Evaluate the NNET Model

    nnet_actual = df_test$loan_default
    nnet_predicted = model_prediction(df_test, nnet_model, "nnet")
    nnet_model_metrics = model_metrics(nnet_actual, nnet_predicted)
    ## [1] "**********************************************************************"
    ## Confusion Matrix and Statistics
    ## 
    ##          actual
    ## predicted     0     1
    ##         0 22452  3582
    ##         1 32311 11601
    ##                                           
    ##                Accuracy : 0.4868          
    ##                  95% CI : (0.4831, 0.4906)
    ##     No Information Rate : 0.7829          
    ##     P-Value [Acc > NIR] : 1               
    ##                                           
    ##                   Kappa : 0.1034          
    ##                                           
    ##  Mcnemar's Test P-Value : <2e-16          
    ##                                           
    ##             Sensitivity : 0.4100          
    ##             Specificity : 0.7641          
    ##          Pos Pred Value : 0.8624          
    ##          Neg Pred Value : 0.2642          
    ##              Prevalence : 0.7829          
    ##          Detection Rate : 0.3210          
    ##    Detection Prevalence : 0.3722          
    ##       Balanced Accuracy : 0.5870          
    ##                                           
    ##        'Positive' Class : 0               
    ##                                           
    ## [1] "**********************************************************************"

    Display Model Metrics - Neural Network Model

    rownames(nnet_model_metrics) = "Neural Network"
    nnet_model_metrics
    ##                Accuracy Precision Recall   AUC
    ## Neural Network   0.4868    0.2642 0.7641 0.587

    5.3 Results

    The table below compares the results of the four (4) models trained:

    results = rbind(logistic_model_metrics, rf_model_metrics, xgb_model_metrics, nnet_model_metrics)
    kable(results, "html") %>%
                            kable_paper("hover", full_width = F) %>%
                            scroll_box(width = "500px", height = "200px")
    Accuracy Precision Recall AUC
    Logistic Model 0.5416 0.2739 0.6733 0.5892
    Random Forest Model 0.2857 0.2039 0.7887 0.4675
    XGBoost 0.5778 0.2756 0.5801 0.5787
    Neural Network 0.4868 0.2642 0.7641 0.5870

    As the table shows, the models perform differently across the four metrics (Accuracy, Precision, Recall, AUC), and the metric used to select the final model depends on the goal of the analysis.
    For vehicle loan default prediction, the main goal is to identify applicants who will default on their loans. Note that the confusion matrices above treat 0 (no default) as the positive class, while the Recall and Precision values in the metrics table are computed with 1 (default) as the positive class.
    Since the main concern is correctly identifying defaults to minimize financial risk, recall is considered the most critical metric here. High recall ensures that most defaults are captured, even at the cost of some false alarms.
    Recall measures the proportion of actual defaults that are correctly predicted by the model; it reflects the model's ability to capture all true default cases. High recall means a low false-negative rate, which is crucial when missing an actual default is costly, even if it comes with more false positives.
    Precision, on the other hand, measures the proportion of correctly predicted defaults out of all instances predicted as defaults. High precision means a low false-positive rate, which matters when minimizing false alarms (creditworthy applicants flagged as defaulters) is a priority.
    Accuracy measures the overall correctness of predictions: the ratio of correctly predicted instances (both defaults and non-defaults) to the total. While accuracy is an intuitive metric, it can be misleading under class imbalance, which is typical of loan-default problems since defaults are far rarer than non-defaults.
    Lastly, AUC quantifies the model's ability to distinguish between the classes. It is the area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR) across threshold settings. It provides a useful summary of a classifier's discriminatory power, especially in scenarios with class imbalance.
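
    To make these definitions concrete, the metrics reported for the logistic regression model can be reproduced by hand from its confusion matrix above, treating 1 (default) as the positive class as the Metrics package does; with hard 0/1 predictions, the AUC reduces to the balanced accuracy.

    # Entries read from the logistic regression confusion matrix above
    TP <- 10222 # actual default (1) predicted as default
    FN <- 4961  # actual default predicted as no-default
    FP <- 27103 # actual no-default predicted as default
    TN <- 27660 # actual no-default predicted as no-default
    
    recall    <- TP / (TP + FN)                        # 10222 / 15183 = 0.6733
    precision <- TP / (TP + FP)                        # 10222 / 37325 = 0.2739
    accuracy  <- (TP + TN) / (TP + TN + FP + FN)       # 37882 / 69946 = 0.5416
    auc_hard  <- (TP / (TP + FN) + TN / (TN + FP)) / 2 # 0.5892 for hard 0/1 predictions
    round(c(accuracy, precision, recall, auc_hard), 4)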

    Looking at the results, all of the models are fairly competitive on recall, all performed poorly on precision, and most achieved roughly average accuracy. The AUC values are, for the most part, only modestly above 0.5, and I strongly believe the overall metrics of these models can be improved with higher-quality data, feature selection, and hyper-parameter tuning.

    5.4 Limitations of the model and data

    One major limitation of the model is the data source. The data was obtained from Kaggle, an open data platform, which raises questions about its reliability for real-world use. Some features, such as the credit-score risk classification, may not be available for young applicants or recent immigrants who have not yet built a credit history. Higher-quality data is required to obtain a better-performing model.
    Also, most of the models have tunable parameters that could yield better performance, but the models trained in this analysis did not undergo hyper-parameter tuning, so the best model in each category may not have been obtained. Tuning was avoided here because models like the Neural Network and the tree-based models can quickly become complex and may require substantial computational capacity for cross-validation and grid search. For example, the Random Forest was grown with an arbitrary 500 trees, without verifying whether that number, or the other parameter settings, was adequate. Similarly, the number of boosting rounds for the Extreme Gradient Boosting model was chosen arbitrarily rather than by searching over combinations of tunable parameters, and for the Neural Network we do not know what network depth or number of hidden nodes would produce the best-performing model.
    In addition, proper feature selection was not performed to determine whether certain features could be dropped or which features offer the most predictive value.
    Lastly, we did not compare the training error to the test error to determine whether the trained models suffer from overfitting or underfitting.

    5.5 Next Steps

    One of the most important next steps is to obtain better-quality data, possibly from a reputable financial institution. This may be difficult, since financial data are highly regulated and companies treat such data as part of their intellectual property and may not be willing to share it. If this modeling were done as part of an internal process at such an institution, better-quality data could be used to train the loan default model.

    In addition, it is very important to conduct hyper-parameter tuning to determine the parameter values best suited to each model. Note, however, that as more data is included, the computational resources needed for extensive tuning of complex models like Neural Networks and tree-based models increase significantly. A sketch of what such tuning could look like follows.
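    As an illustration only, the sketch below runs a small grid search over the Random Forest's num.trees and mtry settings; the grid values and the use of recall as the scoring metric are assumptions for demonstration, and in a full workflow each combination would be scored on a validation fold (or via cross-validation) rather than the final test set.

    # Hypothetical tuning sketch; grid values are illustrative placeholders
    tune_grid <- expand.grid(num_trees = c(250, 500, 1000), mtry = c(5, 8, 12))
    tune_grid$recall <- NA_real_
    for (i in seq_len(nrow(tune_grid))) {
      fit <- ranger(loan_default ~ ., data = df_train,
                    num.trees = tune_grid$num_trees[i], mtry = tune_grid$mtry[i],
                    probability = TRUE, classification = TRUE,
                    seed = 1994, verbose = FALSE)
      # take the probability column for the default class (check colnames of
      # $predictions to confirm which column corresponds to class 1)
      prob_default <- predict(fit, data = select(df_test, -loan_default))$predictions[, 2]
      tune_grid$recall[i] <- Metrics::recall(df_test$loan_default,
                                             ifelse(prob_default > 0.5, 1, 0))
    }
    tune_grid[which.max(tune_grid$recall), ] # best combination found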

    We can also perform feature selection and check for multicollinearity using the Variance Inflation Factor (VIF) approach, retaining only the features that provide predictive value. Other techniques such as principal component analysis (PCA) can be explored as well.
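    A minimal sketch of the VIF check mentioned above, assuming the car package is available (this was not part of the original analysis):

    library(car) # provides vif()
    # Variance inflation factors from the fitted logistic model; predictors with
    # values well above 5-10 suggest multicollinearity. Categorical predictors
    # (e.g., employment_type) are reported as generalized VIFs.
    car::vif(logistic_model)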

    Furthermore, next steps should include comparing the training error with the test error to determine whether a model is overfitting or underfitting, and appropriate steps should then be taken to mitigate whichever problem is found.
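
    As a minimal sketch of that comparison for the logistic regression model, reusing the helper functions defined earlier:

    # Compare accuracy on the training data vs. the held-out test data; a large
    # gap (training much higher than test) would point to overfitting.
    train_predicted <- model_prediction(df_train, logistic_model, "logistic")
    test_predicted  <- model_prediction(df_test, logistic_model, "logistic")
    round(c(train = Metrics::accuracy(df_train$loan_default, train_predicted),
            test  = Metrics::accuracy(df_test$loan_default, test_predicted)), 4)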


    1. CONCLUSION

    The purpose of this project was to develop a machine learning model for loan default prediction in the automotive credit industry, in order to minimize the financial losses that institutions suffer from loan defaults. Banks are keenly interested in identifying customers who are more likely to default on their auto loans and avoiding extending credit to them. The literature review shows that the current industry standard for credit risk modeling and loan default prediction is Logistic Regression. Throughout this study, Logistic Regression, Random Forest, Extreme Gradient Boosting, and Neural Network models were examined, and the finding is that the non-conventional tree-based and Neural Network models perform slightly better than the industry standard on some metrics. The industry-standard logistic regression is good enough and performs well relative to the other models, though not necessarily better than them. However, since Random Forest, XGBoost, and Neural Networks are difficult to interpret, it may be hard for banks to explain why a customer's loan application was rejected. The Logistic Regression model, by contrast, is far easier to interpret than its counterparts, and it would be difficult to sway loan officers and decision makers away from it without extracting substantially better performance from the other models, since its performance is not far behind theirs.
    Hence, I still recommend the industry-standard Logistic Regression model for its simplicity, its ease of understanding and interpretation, its short training and prediction time, and its competitive performance compared to the more complex models. Even so, I do not shut the door on the other models: as we continue to gather more information about customer behavior in an increasingly digital world, the dimensionality of each observation will likely keep growing, and Logistic Regression may struggle to retain predictive power on highly dimensional data. In that scenario, models like Neural Networks and tree-based models may come in handy.


    1. REFERENCES

    Agarwal, S., Ambrose, B. W., & Chomsisengphet, S. (2008). Determinants of automobile loan default and prepayment. Economic Perspectives - Federal Reserve Bank of Chicago.

    Altman, E. I., & Saunders, A. (1998). Credit risk measurement: Developments over the last 20 years. Journal of Banking and Finance, 21(11-12), 1721–1742.

    Agrawal, A., Agrawal, M., & Raizada, D. A. (2014). Predicting defaults in commercial vehicle loans using logistic regression: The case of an Indian NBFC. International Journal of Research in Commerce and Management, 5, 22–28.

    Brownlee, J. (2019, August 12). Overfitting and Underfitting With Machine Learning Algorithms. Machine Learning Mastery. Retrieved November 4, 2023, from https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/#:%7E:text=Overfitting%3A%20Good%20performance%20on%20the,poor%20generalization%20to%20other%20data

    Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.

Crook, J. N., Edelman, D. B., & Thomas, L. C. (2007). Recent developments in consumer credit risk assessment. European Journal of Operational Research, 183(3), 1447–1465.

    Cromer, O. C., Purdy, K. W., Cromer, G. C., & Foster, C. G. (2023, October 13). Automobile | Definition, History, industry, design, & Facts. Encyclopedia Britannica. https://www.britannica.com/technology/automobile

Diez, D., Barr, C. D., & Cetinkaya-Rundel, M. (2019). OpenIntro Statistics (4th ed.). OpenIntro.

    Education, I. C. (2021, March 25). Underfitting. IBM Cloud Learn. Retrieved November 5, 2023, from https://www.ibm.com/cloud/learn/underfitting

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.

    GeeksforGeeks. (2021, October 20). ML | Underfitting and Overfitting. Retrieved November 4, 2023, from https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/

Hand, D. J., & Henley, W. E. (1997). Statistical classification methods in consumer credit scoring: A review. Journal of the Royal Statistical Society: Series A (Statistics in Society), 160(3), 523–541.

Hao, C., Alam, M. M., & Carling, K. (2010). Review of the literature on credit risk modeling: Development of the past 10 years. Banks and Bank Systems, 5(3).

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

Lessmann, S., Baesens, B., Seow, H.-V., & Thomas, L. C. (2015). Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. European Journal of Operational Research, 247(1), 124–136.

    Martin, A. (2023). What is an auto loan? Bankrate. https://www.bankrate.com/loans/auto-loans/what-is-an-auto-loan/

    Model Fit: Underfitting vs. Overfitting - Amazon Machine Learning. (n.d.). Amazon Machine Learning Developer Guide. Retrieved November 4, 2023, from https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html

    O’Brien, S. (2023, February 4). Auto loan delinquencies are rising. Here’s what to do if you’re struggling with payments. CNBC. https://www.cnbc.com/2023/02/04/auto-loan-delinquencies-rise-what-to-do-if-you-struggle-with-payments.html

    Probasco, J. (2023). Expert explanation of how auto loans work. Investopedia. https://www.investopedia.com/how-car-loans-work-5202265

    U.S. vehicle fleet 1990-2021 | Statista. (2023, August 24). Statista. https://www.statista.com/statistics/183505/number-of-vehicles-in-the-united-states-since-1990/

    U.S.: average selling price of new vehicles 2022 | Statista. (2023, June 7). Statista. https://www.statista.com/statistics/274927/new-vehicle-average-selling-price-in-the-united-states/

    Witkowski, R. (2023, June 2). For People Under 30, Car Loan Delinquencies Hit A 15-Year High. Is The Economy Running Out Of Gas? Forbes Advisor. https://www.forbes.com/advisor/auto-loans/car-loan-late-payments/


    1. APPENDIX

This section contains all the code used in the analysis, in order of execution; no output is shown here.
# Load libraries
library(Amelia) # To visualize missing data
library(caret) # For the confusion matrix
library(caTools) # For the train/test split
library(corrplot) # To plot correlation plot
library(cowplot) # To combine plots in a grid
library(fastDummies) # To convert character variables to dummies
library(ggcorrplot) # To plot correlation plot
library(kableExtra) # To style tables
library(knitr) # For kable()
library(lubridate) # For dmy() date parsing (attached explicitly; only a core tidyverse package from tidyverse 2.0.0)
library(Metrics) # For model evaluation
library(nnet) # For the neural network model
library(pROC) # For AUC and ROC curves
library(ranger) # For random forest implementation
library(ROSE) # To balance the data
library(tidyverse)
library(xgboost) # For gradient boosting
    
    # Read the data
    url_vehicle_loan_default_train = "https://raw.githubusercontent.com/chinedu2301/data698-analytics-project/main/data/vehicle_default_train_data.csv"
    url_vehicle_loan_default_test = "https://raw.githubusercontent.com/chinedu2301/data698-analytics-project/main/data/vehicle_default_test_data.csv"
    vehicle_loan_default_train_raw = read_csv(url_vehicle_loan_default_train) %>% as_tibble()
    vehicle_loan_default_test_raw = read_csv(url_vehicle_loan_default_test) %>% as_tibble()
    
    # data_cleaning
    data_cleaning <- function(df){
      # This function accepts a dataframe (df) as input and returns another dataframe (cleaned_df) that is clean.
      tryCatch({
      if(is_tibble(df) | is.data.frame(df)){
        print("The dataframe is a tibble, will proceed to clean data")
        print("Data Cleaning in progress...")
        # rename all the columns in the dataframe to lowercase
        cleaned_df <- df %>% rename_all(tolower) %>%  
                    # compute the age of the applicant in number of years
                    mutate(date_of_birth = dmy(date_of_birth),  
                          disbursal_date = dmy(disbursal_date), 
                          age = difftime(disbursal_date, date_of_birth, units = "days"),
                          age_years = round(as.numeric(age / 365.25), 0)) %>%
                   # extract the years and month component of the average_acct_age  and convert to years
                    mutate(average_acct_age_year_comp = as.numeric(str_extract(average_acct_age, "\\d+")),
                          average_acct_age_mon_comp = as.numeric(str_extract(average_acct_age, "\\d+(?=mon)")),
                          average_acct_age = round((average_acct_age_year_comp + average_acct_age_mon_comp/12), 0)
                           ) %>%
                    # extract the years and month component of the credit_history_length and convert to years
                    mutate(credit_history_length_year_comp = as.numeric(str_extract(credit_history_length, "\\d+")),
                           credit_history_length_comp = as.numeric(str_extract(credit_history_length, "\\d+(?=mon)")),
                           credit_history_length = round((credit_history_length_year_comp + credit_history_length_comp/12), 0)
                           )  %>%
                    # clean up the perform_cns_score_distribution to include only a few categories
                    mutate(lowercase_cns_description = tolower(perform_cns_score_description),
                           perform_cns_score_description = case_when(
                                    str_detect(lowercase_cns_description, "very low risk") ~ "very_low_risk",
                                    str_detect(lowercase_cns_description, "low risk") ~ "low_risk",
                                    str_detect(lowercase_cns_description, "medium risk") ~ "medium_risk",
                                    str_detect(lowercase_cns_description, "high risk") ~ "high_risk",
                                    str_detect(lowercase_cns_description, "very high risk") ~ "very_high_risk",
                                    str_detect(lowercase_cns_description, "not scored|no bureau") ~ "low_risk",
                                    TRUE ~ "none"))  %>%
                    # clean up the employment type to have only few categories
                    mutate(lower_case_employment_type = tolower(employment_type),
                           employment_type = case_when(
                                    str_detect(lower_case_employment_type, "salaried") ~ "salaried",
                                    str_detect(lower_case_employment_type, "self employed") ~ "self_employed",
                                    TRUE ~ "not_reported")) %>%
                    # select only the required columns
                    select(
                      age_years, disbursed_amount, asset_cost, ltv, employment_type, perform_cns_score_description,
                      pri_no_of_accts, pri_active_accts, pri_overdue_accts, pri_current_balance, pri_sanctioned_amount,
                      pri_disbursed_amount, sec_no_of_accts, sec_active_accts, sec_overdue_accts, sec_current_balance,
                      sec_sanctioned_amount, sec_disbursed_amount, primary_instal_amt, sec_instal_amt, new_accts_in_last_six_months,
                      delinquent_accts_in_last_six_months, average_acct_age, credit_history_length, no_of_inquiries, loan_default
                    )
                    print("Data Cleaning complete!!!")
        return(cleaned_df)
        
      }
      else{
      print("The dataframe is not a tibble. Kindly have your data in the form of a dataframe or a tibble")
      }
        
      },
      #if an error occurs, tell me the error
      error=function(e) {
            message('An Error Occurred')
            print(e)
            },
      #or if a warning occurs, tell me the warning
      warning=function(w) {
            message('A Warning Occurred')
            print(w)
            return(NA)
            }
        )
      
    }
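
# Illustrative check (not part of the original pipeline): data_cleaning()
# assumes tenure fields such as average_acct_age and credit_history_length
# arrive as strings like "2yrs 10mon"; verify the regexes on a sample value.
sample_acct_age <- "2yrs 10mon"
as.numeric(str_extract(sample_acct_age, "\\d+"))        # years component -> 2
as.numeric(str_extract(sample_acct_age, "\\d+(?=mon)")) # months component -> 10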
    
    # clean the data
    vehicle_loan_default_train_cleaned = data_cleaning(vehicle_loan_default_train_raw)
    
    # data pre-processing
    data_preprocess_scaling <- function(df){
                          # This helper function standardizes the numeric variables of the df using the standard normal method
                          df <- as.data.frame(df)
                          df_char <- df %>% select(loan_default, employment_type, perform_cns_score_description) 
                          df_numeric <- df %>% select(-loan_default, -employment_type, -perform_cns_score_description)
                          df_numeric_scaled <- df_numeric %>% mutate_all( ~ (scale(.) %>% as.vector))
                          df_scaled_combined <- cbind(df_char, df_numeric_scaled)
                          return(df_scaled_combined)
    }
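
# Illustrative check (not part of the original pipeline): after scaling,
# each numeric column should have mean ~0 and standard deviation ~1, while
# loan_default and the two character columns are left untouched.
scaled_demo <- data_preprocess_scaling(vehicle_loan_default_train_cleaned)
round(mean(scaled_demo$ltv), 4) # approximately 0
round(sd(scaled_demo$ltv), 4)   # approximately 1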
    
    
xgb_nnet_preprocess <- function(df, mode){
  # convert the categorical variables to dummies
  df2 <- dummy_cols(df, select_columns = c("employment_type","perform_cns_score_description"), 
                    remove_selected_columns = TRUE) %>% as.data.frame()
  
  if(mode == "xgboost"){
    # prepare the xgb.DMatrix used in xgboost training/prediction;
    # this assumes loan_default is the first column of df
    df2_train <- as.data.frame(df2[, -1])
    df2_label <- as.data.frame(df2[, 1])
    df_dmatrix <- xgb.DMatrix(as.matrix(sapply(df2_train, as.numeric)), label = as.matrix(df2_label))
    return(df_dmatrix)
    
  } else if(mode == "nnet"){
    # nnet works directly with the dummy-encoded dataframe, so no
    # DMatrix is built here (it would be wrong for label-free test data)
    return(df2)
    
  } else{
    print("Mode not supported.")
  }
  
}
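
# Illustrative check (not part of the original pipeline): in "nnet" mode
# the function returns a dummy-encoded dataframe, with one 0/1 column per
# level of the two categorical variables.
nnet_demo <- xgb_nnet_preprocess(vehicle_loan_default_train_cleaned, mode = "nnet")
grep("employment_type|perform_cns_score", colnames(nnet_demo), value = TRUE)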
    
    
    data_preprocessing <- function(df, mode = "train"){
        # This function pre-processes the cleaned data and get the data ready for training.
        tryCatch({
          if(is_tibble(df) | is.data.frame(df)){
            if(mode == "test"){
               df_scaled <- data_preprocess_scaling(df)
               print("Data Pre-processing complete")
               return(df_scaled)
               
            } else if(mode == "train"){
              curr_frame <<- sys.nframe() # sends the current frame to the global environment.
              # The ovun.sample function in the ROSE package assumes the data to be in the global env so you have to tell it  
              # which frame (scope) to find the data else this will fail if executed inside a function.
          df_ovun <- ovun.sample(formula = formula(loan_default ~ .), data = get("df", sys.frame(curr_frame)),
                                 N = 1.5 * nrow(df), seed = 1994, method = "both")$data %>% as.data.frame() %>% as_tibble()
              print("Oversampling and undersampling completed")
              df_scaled <- data_preprocess_scaling(df_ovun)
              print("Data Pre-processing complete!")
              return(df_scaled)
              
            } else {
              print("You did not enter a valid mode type: Enter train or test for mode")
              
            }
          }
          
        else{
        print("The dataframe is not a tibble. Kindly have your data in the form of a dataframe or a tibble")
      }
        
      },
      #if an error occurs, tell me the error
      error=function(e) {
            message('An Error Occurred')
            print(e)
            },
      #or if a warning occurs, tell me the warning
      warning=function(w) {
            message('A Warning Occurred')
            print(w)
            return(NA)
            }
        )
      
    }
    
    # Train Test Split
    # Set a seed
    set.seed(1994)
    #Split the sample
    sampling <- sample.split(vehicle_loan_default_train_cleaned$loan_default, SplitRatio = 0.7) 
    # Training Data
    df_train_subset <- subset(vehicle_loan_default_train_cleaned, sampling == TRUE)
    # Testing Data
    df_test_subset <- subset(vehicle_loan_default_train_cleaned, sampling == FALSE)
    
    
    df_train = data_preprocessing(df_train_subset, mode = "train")
    df_test = data_preprocessing(df_test_subset, mode = "test")
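
# Illustrative check (not part of the original pipeline): the combined
# over/under-sampling should leave the training classes roughly balanced,
# while the untouched test split keeps the original class imbalance.
prop.table(table(df_train$loan_default))
prop.table(table(df_test$loan_default))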
    
    # Model training function
    model_training <- function(df, mode = "logistic"){
      
              if(mode == "logistic"){
                  print("Training a Logistic Regression Model...")
                  logistic_model <- glm(formula = loan_default ~ . , 
                                           family = binomial(link = 'logit'), data = df)
                  print("Logistic Regression Model complete")
                  return(logistic_model)
               
            } else if(mode == "rf"){
                  print("Training a Random Forest Classification Model...")
                  rf_model_ranger <- ranger(
                                     formula   = loan_default ~ ., 
                                     data      = df, 
                                     num.trees = 500,
                                     mtry      = floor(length(df) / 3),
                                     probability = TRUE,
                                     verbose = FALSE,
                                     classification = TRUE
                                     )
                  print("Random Forest Classification Model complete")
                  return(rf_model_ranger)         
              
            } else if(mode == "xgboost"){
                  print("Training an XGBoost Classification Model...")
                  # pre-process the data to obtain the dmatrix
                  df_dmatrix <- xgb_nnet_preprocess(df, mode)
                  xgb_model <- xgboost(data = df_dmatrix, nthread = 4, nrounds = 150,
                                       max.depth = 10, eta = 0.1, objective = "binary:logistic", verbose = FALSE)
                  print("XGBoost Training complete")
                  return(xgb_model)
              
            } else if(mode == "nnet"){
                  print("Training a Neural Network Classification Model...")
                  # pre-process the data to convert character variables to dummies
                  df_nnet <- xgb_nnet_preprocess(df, mode)
                  set.seed(1994) # set the seed before training for reproducible weight initialization
                  nnet_model <- nnet(loan_default ~ ., data = df_nnet, decay = 5e-4, 
                                     size = 20, maxit = 100, trace = FALSE)
                  print("NNET Training complete")
                  return(nnet_model)
              
            } else {
              print("You did not enter a valid mode type: Enter logistic, rf, xgboost or svm")
              
            }
          
    }
    
    # model prediction function
    model_prediction = function(df, trained_model, model_type){
      
      # remove the response variable from the dataframe if it exists
      if("loan_default" %in% colnames(df)){
        test_data = df %>% select(-loan_default)
      } else {
        test_data = df
      }
      
      # make predictions
      if(model_type == "rf"){
        predictions = predict(trained_model, data = test_data)$predictions[,1]
      } else if (model_type == "logistic"){
        predictions = predict(trained_model, newdata = test_data, type = "response")
      }  else if (model_type == "xgboost"){
        test_data_xgb = xgb_nnet_preprocess(df, model_type)
        predictions = predict(trained_model, newdata = test_data_xgb)
      } else{
        test_data_nnet = xgb_nnet_preprocess(test_data, model_type)
        predictions = predict(trained_model, newdata = test_data_nnet)
      }
      
      # convert probabilities to classes
      predicted = ifelse(predictions > 0.5, 1, 0)
      
      return(predicted)
    }
    
    # Model Metrics
    model_metrics = function(actual, predicted){
      
      # accuracy
      accuracy_model = round((Metrics::accuracy(actual, predicted)), 4)
      # precision
      precision_model = round((Metrics::precision(actual, predicted)), 4)
      # recall 
      recall_model = round((Metrics::recall(actual, predicted)), 4)
      # auc
      auc_model = round((pROC::auc(actual, predicted)), 4)
      
      # model metrics
      model_eval_metrics = c(accuracy_model, precision_model, recall_model, auc_model) %>% t()
      column_names = c("Accuracy", "Precision", "Recall", "AUC")
      evaluation_metrics = data.frame(values = model_eval_metrics)
      colnames(evaluation_metrics) = column_names
      
      # confusion Matrix
      confusion_table = table(predicted, actual)
      confusion_matrix = caret::confusionMatrix(confusion_table)
      print("**********************************************************************")
      print(confusion_matrix)
      print("**********************************************************************")
      
      return(evaluation_metrics)
    }
    
    # Train Logistic Model
    logistic_model <- model_training(df_train, mode = "logistic")
    logistic_actual = df_test$loan_default
    logistic_predicted = model_prediction(df_test, logistic_model, "logistic")
    logistic_model_metrics = model_metrics(logistic_actual, logistic_predicted)
    rownames(logistic_model_metrics) = "Logistic Model"
    
    # Train RF model
    rf_model <- model_training(df_train, mode = "rf")
    rf_actual = df_test$loan_default
    rf_predicted = model_prediction(df_test, rf_model, "rf")
    rf_accuracy = accuracy(rf_actual, rf_predicted)
    rf_model_metrics = model_metrics(rf_actual, rf_predicted)
    rownames(rf_model_metrics) = "Random Forest Model"
    
    # Train an XGBoost model
    xgb_model <- model_training(df_train, mode = "xgboost")
    xgb_actual = df_test$loan_default
    xgb_predicted = model_prediction(df_test, xgb_model, "xgboost")
    xgb_model_metrics = model_metrics(xgb_actual, xgb_predicted)
    rownames(xgb_model_metrics) = "XGBoost"
    
    # Train an NNET model
    nnet_model <- model_training(df_train, mode = "nnet")
    nnet_actual = df_test$loan_default
    nnet_predicted = model_prediction(df_test, nnet_model, "nnet")
    nnet_model_metrics = model_metrics(nnet_actual, nnet_predicted)
    rownames(nnet_model_metrics) = "Neural Network"
    
    # Result
    results = rbind(logistic_model_metrics, rf_model_metrics, xgb_model_metrics, nnet_model_metrics)
    kable(results, "html") %>%
                            kable_paper("hover", full_width = F) %>%
                            scroll_box(width = "500px", height = "200px")
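
# Illustrative extension (not part of the original analysis): the metrics
# above are computed from hard 0/1 predictions; plotting the ROC curve on
# the predicted probabilities of the logistic model gives a threshold-free
# view of its discriminatory power.
logistic_prob <- predict(logistic_model, newdata = df_test, type = "response")
roc_logistic <- pROC::roc(df_test$loan_default, logistic_prob)
plot(roc_logistic, main = "ROC Curve - Logistic Regression Model")
pROC::auc(roc_logistic)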