ABSTRACT

This project develops a model to predict auto-loan default using machine learning techniques. The dataset, obtained from Kaggle, contains 41 variables and over 230,000 records of previous loan applicants.
Most banks typically use a logistic regression model to decide whether an applicant is at risk of default. In this project, several other machine learning techniques were explored: tree-based models such as Random Forest and XGBoost (Extreme Gradient Boosting), as well as Artificial Neural Networks. A logistic regression model was developed as well and compared to these other models. We found that the Neural Network model provided the best result based on recall, and that the logistic regression model did not perform poorly in comparison but instead provided competitive performance. However, due to time and resource constraints, the models were not fully exploited: hyper-parameter tuning was not performed on the models with tunable parameters, and no feature engineering was done. Ultimately, the Logistic Regression model was chosen based on its competitive overall performance, with recall, accuracy, and AUC values good enough when compared to the other three models. Even though the Neural Network (NNET) model has better recall, it may be difficult to interpret the model to determine why a customer was refused a loan. Hence, the Logistic Regression model, which is the industry standard, is still chosen here because of its simplicity, ease of understanding and interpretation, and very short training time.

  1. INTRODUCTION

1.1 Research Problem

This project aims to develop a machine learning model that predicts whether a potential borrower will default on their auto loan, using available data collected about the customer.
The auto-loan industry is a multi-billion dollar industry that affects almost all aspects of life in the developed world as well as in developing countries. Automobiles are now the de facto means of movement in suburbs and even in some major cities, and the volume of automobiles on the roads has increased tremendously over the past decades. Buying an automobile (referred to in this project as a vehicle) is an important decision for most people, and banks are usually at the center of this decision since the majority of vehicle purchases are made through financing. Hence, financial institutions like banks constantly face the problem of deciding whether a borrower will be able to make payments throughout the lifetime of their auto loan.
Banks do not want to issue loans to individuals who would default, and at the same time do not want to deny loans to individuals who would not default. To do this, they want to be able to predict, with a high level of confidence, whether to approve or deny an auto loan, so that they can minimize losses and write-offs when a borrower is unable to make payments while still making profits from customers who can service their loans. This project aims to solve this problem by developing a predictive model, using machine learning techniques, to help banks decide which potential borrowers are likely to default on their loans.

1.2 Definition of Terms

Automobile: “Automobile, byname auto, also called motorcar or car, a usually four-wheeled vehicle designed primarily for passenger transportation and commonly propelled by an internal-combustion engine using a volatile fuel” (Cromer et al., 2023). Automobiles are also referred to as vehicles, motor vehicles, light trucks, etc. These days, automobiles can be propelled by either an internal-combustion engine or an electric battery. About 282 million vehicles were registered in the United States in 2021, roughly 90 million more than the approximately 193 million registered in 1990 (U.S. Vehicle Fleet 1990-2021 | Statista, 2023). Also, the average price of a new motor vehicle in the United States in 2022 was about $46,000 USD (U.S.: Average Selling Price of New Vehicles 2022 | Statista, 2023).

Auto Loan: An auto loan is the money you borrow to pay for your car. You have to repay the loan with interest in fixed installments (Martin, 2023). Auto loans are also referred to as car loans, vehicle loans, car financing, etc. These loans are often secured loans, meaning that the car is used as the collateral to secure the loan. Typically, consumers borrow money to buy vehicles: consumers owed about 1.41 trillion US dollars on vehicles they drove in 2022, the average auto loan balance is about $22,000 USD, and about 80% of all new vehicles on the road are financed through a loan or lease (Chris, 2023). This shows that the auto-loan industry is a multi-trillion dollar industry, which underscores the importance of loan default models to help minimize losses.

Vehicle Loan Default: Loan default refers to a borrower failing to make their installment payments as agreed in the loan terms. Usually, for secured loans, the lender (the bank in this case) can repossess the asset (the car) used as collateral for the loan. Banks usually do not want to do that, but are sometimes forced to if the borrower defaults and does not make an arrangement with the lender. When a lender repossesses the car, its value at the time of repossession may not cover the loan balance, and the lender will have no option but to write off that balance as a loss. In the US, default rates for auto loans are on the rise and currently sit at about 2%. The benefits of being able to predict whether a borrower will default cannot be overemphasized: it not only helps lenders decide whether to approve or deny a loan application, it also helps them price the interest rate for borrowers appropriately.
The problem of loan default is inherently probabilistic and is considered central to credit risk modeling/credit scoring. Hence, lenders use a vast array of credit risk tools to determine whether a borrower is likely to default. Entering default simply means that the lender determines that the borrower is not going to pay, usually some time after 90 days of no payments, and it can translate into the car being repossessed (O’Brien, 2023).


2. LITERATURE REVIEW

The probability of loan default is a well-researched topic, especially because the industry is a multi-trillion dollar one. The area is of significant economic importance and is often referred to as credit risk modeling or credit scoring. We review literature related to credit in general and then literature related to auto loans.

According to Crook et al. (2007), credit scoring is concerned with developing empirical models to support decision making in the retail credit business, and a credit score is a model-based estimate of the probability that a borrower will show some undesirable behavior in the future. In application scoring, for example, lenders employ predictive models, called scorecards, to estimate how likely an applicant is to default. Such PD (probability of default) scorecards are routinely developed using classification algorithms (Hand & Henley, 1997).

Whilst the extension of credit goes back to the Babylonian times, the history of credit scoring began in 1941 with the publication by Durand of a study that distinguished between good and bad loans made by 37 firms (Crook et al. 2007). Since then, the already established techniques of statistical discrimination have been developed and an enormous number of new classification algorithms have been researched and tested. Virtually all major banks use credit scoring with specialized consultancies providing credit scoring services and offering powerful software to score applicants, monitor their performance and manage their accounts.

Altman and Saunders (1998) published an overview of credit risk modelling covering the preceding 20 years, finding that it had evolved drastically over that period due to newly emerging statistical techniques (Altman & Saunders, 1998). Later, another group of researchers published an extension of Altman and Saunders’ work, presenting further developments in credit risk modelling (Hao, Alam, & Carling, 2010). Their work identified more than 1,000 articles on the topic and found that the logistic regression (LR) model and discriminant analysis are the most widely used methods for constructing scoring systems.

Also, Crook et al. (2007) conducted research on credit risk scoring and found that the most common method of estimating a classifier of applicants into those likely to repay and those unlikely to repay is logistic regression, with the logit value compared against a cut-off. In effect, this research supports the claim that the industry standard for predicting loan default is the logistic regression model.

Lessmann et al. (2015) compared 41 classifiers based on six performance measures across eight real-world credit scoring data sets from the UK, Europe, and Australia. They investigated overall model performance using several datasets and examined the predictive performance in each case. The conclusion from this research suggests that several classifiers predict risk significantly better than the industry standard of Logistic Regression (LR). It went further to recommend the Random Forest (RF) model as a benchmark because of its effectiveness, precision, and interpretability.

Agrawal et al. (2014) studied the impact of contract-specific variables as predictors in commercial vehicle loans. In their research, applying a logistic regression model for predicting default, around 11 out of 17 contract-specific variables were identified as providing additional assistance to the credit lending institution (Agrawal, Agrawal, & Raizada, 2014). The authors also suggest that contract information could improve accuracy in more advanced nonlinear models; specifically, they suggest Neural Networks as one potential predictive model to improve performance based on contract information (Agrawal et al., 2014).

Keeping the outcome of the above literature in mind, this project aims to contribute to the field of vehicle loan prediction by developing machine learning models able to predict vehicle loan default from the available data, and by comparing three (3) models (Random Forest, XGBoost, and Neural Networks) to the industry-standard Logistic Regression model. The data set used in this project contains more data than those used in most of the literature reviewed above; statistically, more data tends to produce better results, and we hope this will help in comparing the models and deciding which provides the best prediction metrics. In this work, we explain the data used and the pre-processing and feature selection involved, provide a quick overview of predictive analytics and of the scoring metrics for classification problems, briefly explain each of the models used, and then present the analysis of the data, the modeling/testing, and the conclusions from the findings.


3. OVERVIEW OF THE MODELS

3.1 Predictive Analytics Overview

Predictive analytics involves using historical data, statistical algorithms, and machine learning techniques to predict future outcomes. There are different types of algorithms for making predictions; generally, we have regression models and classification models. Regression algorithms are mainly used when the variable to be predicted is a continuous value, while classification algorithms are used to predict categories or classes. Classification algorithms are a fundamental part of predictive analytics and are used to categorize data into classes or groups based on specific features or attributes. Examples of classification algorithms include, but are not limited to, Logistic Regression, Decision Trees, Random Forest, XGBoost, SVM, KNN, and Naive Bayes. The problem of vehicle loan default prediction is a classification problem, and the classification algorithms used to solve it in this project are briefly explained below.

3.2 Logistic Regression

Logistic Regression is a classification algorithm used when the response variable is binary, i.e., a two-level categorical variable. The model belongs to the family of generalized linear models (GLMs). Typical examples of binary response variables are Yes/No, Male/Female, Cancer/No Cancer, and Approve/Deny; these are often coded as 1 or 0. Even though the name contains “regression”, the logistic regression model is used when the response variable is discrete, i.e., it is a classification algorithm.
Logistic regression need not be binary: when the response variable has more than two levels, the model is known as multinomial logistic regression, which is beyond the scope of this project.
The logistic regression model relates the probability \(p_i\) that a response is a success to the predictors \(x_{1, i}, x_{2, i}, ..., x_{k, i}\) through a framework like that of multiple regression:
\(logit(p_{i}) = log_{e}(\frac{p_{i}}{1-p_{i}}) = \beta_{0} + \beta_{1}x_{1,i} + \beta_{2}x_{2,i} + ... + \beta_{k}x_{k,i}\)

Assumptions for Logistic Regression
  • Each outcome of the response variable is independent of the other outcomes.
  • The response variable must follow a binomial distribution.
  • Each predictor \(x_{i}\) is linearly related to the \(logit(p_{i})\) if other predictors are held constant.
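To illustrate, a binary logistic regression can be fit in R with the built-in glm() function. The snippet below is a minimal sketch on simulated data; the variable names and coefficients are hypothetical and are not taken from the project dataset.

    # minimal logistic regression sketch on simulated data (hypothetical variables)
    set.seed(42)
    n   <- 1000
    ltv <- runif(n, 40, 95)                      # loan-to-value ratio
    age <- sample(21:65, n, replace = TRUE)      # applicant age in years
    p   <- plogis(-4 + 0.05 * ltv - 0.01 * age)  # true default probability
    default <- rbinom(n, 1, p)                   # simulated 0/1 response

    fit <- glm(default ~ ltv + age, family = binomial(link = "logit"))
    summary(fit)                          # estimated coefficients on the logit scale
    head(predict(fit, type = "response")) # fitted probabilities p_i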
3.3 Tree-Based Methods

    Tree-based models are a type of supervised learning algorithm used for both classification and regression problems. They construct a decision tree that recursively splits the data into subsets based on the most significant features. The tree structure consists of nodes representing the features, edges representing the decision rules, and leaves representing the output (class or value). The key advantage of tree-based models is their ability to handle non-linear relationships and interactions between features. However, they are prone to overfitting, especially when the trees become too complex. Tree-based models follow two major ensemble approaches: bagging and boosting.
    Bagging (Bootstrap Aggregating): It’s a technique that aims to reduce variance and prevent overfitting by training multiple models on different bootstrapped subsets of the dataset and then averaging the predictions. Random Forest is an ensemble method based on bagging, utilizing multiple decision trees trained on different subsets of the data.
    Boosting: Boosting is an ensemble technique that combines weak learners (typically shallow trees) sequentially to create a strong model. It focuses on improving the shortcomings of its predecessors by assigning higher weight to misclassified data, effectively learning from previous mistakes. Gradient Boosting and XGBoost are examples of boosting algorithms.

    3.3.1 Random Forest

    It’s an ensemble learning method that constructs multiple decision trees and merges their predictions to improve accuracy and reduce overfitting. It introduces randomness both in feature selection and dataset bootstrapping. By aggregating predictions from various trees, it tends to be more robust and less prone to overfitting compared to a single decision tree.
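    As a brief illustration, a random forest can be trained in R with the ranger package (the implementation loaded later in this project). The sketch below uses the built-in iris data reduced to two classes, not the loan data:

    library(ranger) # fast random forest implementation

    # minimal sketch: binary classification on built-in data (not the loan data)
    iris_bin <- iris[iris$Species != "setosa", ]
    iris_bin$Species <- droplevels(iris_bin$Species)

    rf_fit <- ranger(Species ~ ., data = iris_bin,
                     num.trees = 500,    # number of bootstrapped trees
                     probability = TRUE) # return class probabilities
    rf_fit$prediction.error              # out-of-bag estimate of the error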

    3.3.2 Gradient Boosting

    This is a boosting technique that builds trees sequentially. It fits each tree to the residuals (errors) of the preceding tree, reducing the errors at each step. Gradient boosting is a powerful algorithm known for its ability to handle complex data and achieve high accuracy.

3.3.3 XGBoost (Extreme Gradient Boosting)

    It is an optimized and highly efficient implementation of gradient boosting. XGBoost improves upon the traditional gradient boosting method by introducing regularization, parallel computing, and a variety of enhancements that significantly speed up the training process and improve accuracy.
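    For illustration, the xgboost package (used later in this project) fits such a model on a numeric matrix. The sketch below uses simulated data with hypothetical variables, not the loan data:

    library(xgboost) # gradient boosting implementation

    # minimal XGBoost sketch on simulated numeric data (not the loan data)
    set.seed(1)
    X <- matrix(rnorm(2000), ncol = 4)                  # 500 rows, 4 numeric features
    y <- as.numeric(X[, 1] + X[, 2]^2 + rnorm(500) > 1) # simulated 0/1 labels

    dtrain  <- xgb.DMatrix(data = X, label = y) # xgboost trains on a numeric DMatrix
    xgb_fit <- xgboost(data = dtrain, nrounds = 50,
                       objective = "binary:logistic", # outputs probabilities
                       max_depth = 3, eta = 0.1,      # depth and learning-rate controls
                       verbose = 0)
    head(predict(xgb_fit, dtrain)) # predicted probabilities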

    3.4 Neural Networks

    Artificial Neural Networks (ANNs) are a class of machine learning models inspired by the structure and function of the human brain. They consist of interconnected nodes, known as neurons, organized in layers to process and learn from data.

    Structure of ANNs
  • Neurons(Nodes): Neurons are the basic units in an ANN. Each neuron receives input signals, performs computations, and produces an output signal that is transmitted to other neurons in the network.
  • Layers: ANNs typically consist of three types of layers:
    Input Layer: Receives input data and passes it to the next layer.
    Hidden Layers: Intermediate layers between the input and output layers. They extract and transform features through complex computations.
    Output Layer: Produces the final output or prediction based on the learned representations from the hidden layers.
  • Connections (Weights): Neurons are connected to neurons in adjacent layers by weighted connections. These weights are adjusted during the learning process to optimize the network’s performance.
Training Process
  • Forward Propagation: Input data is passed through the network, and computations are performed layer by layer, generating an output.
  • Loss Calculation: The output is compared to the actual target, and a loss function measures the difference between the predicted and actual values.
  • Backpropagation: The algorithm calculates gradients of the loss function with respect to the weights of the network using chain rule and propagates this information backward through the network.
  • Weight Update: The weights are adjusted using optimization algorithms (e.g., gradient descent) to minimize the loss function, making the predictions closer to the actual values.
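    As a concrete sketch of this training loop, the nnet package (used later in this project) fits a single-hidden-layer feedforward network in a single call; the data below is simulated and the settings are illustrative only:

    library(nnet) # single-hidden-layer feedforward networks

    # minimal feedforward network sketch on simulated data (not the loan data)
    set.seed(7)
    df <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
    df$y <- factor(as.numeric(df$x1 * df$x2 > 0)) # binary target

    nn_fit <- nnet(y ~ x1 + x2, data = df,
                   size = 5,      # neurons in the hidden layer
                   decay = 0.01,  # weight decay regularization
                   maxit = 200,   # optimization iterations
                   trace = FALSE)
    head(predict(nn_fit, df)) # predicted probabilities for class 1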
Types of ANNs
  • Feedforward Neural Networks (FNN): The most basic type where information flows in one direction—from input to output without cycles. This is the type used in this project.
  • Recurrent Neural Networks (RNN): Networks that allow feedback loops, enabling them to process sequential data by retaining memory of previous inputs.
  • Convolutional Neural Networks (CNN): Specialized for processing grid-like data, such as images. They use convolutional layers to learn features hierarchically.
Applications

    Image and Speech Recognition: CNNs are widely used in image classification, object detection, and speech recognition.
    Natural Language Processing (NLP): RNNs and variants like LSTM and GRU are used for text analysis, language translation, and sentiment analysis.
    Predictive Modeling: ANNs are applied in various predictive tasks, including regression, classification, and time-series forecasting.

    Challenges

    Computational Complexity: Training large ANNs can be computationally intensive and require significant resources.
    Overfitting: ANNs, especially with complex architectures, can overfit the training data if not properly regularized or trained on diverse data.

    3.5 Scoring the Model and Understanding the Scoring Metrics

    After obtaining and fitting a classification model, we need certain metrics to evaluate how well the model performs. Below are some metrics that can be used to evaluate the performance of a classification model:

    [Figure: confusion matrix layout (TP, FP, TN, FN). Source: https://www.debadityachakravorty.com/ai-ml/cmatrix/]

    The type of metric to use depends on the situation; a short computational sketch follows this list.
  • Accuracy: This is the most common measure used to evaluate a classification model. Accuracy is the ratio of correctly classified observations to the total. This tells us the percentage of observations that our model is correctly classifying.
    \(accuracy = \frac{TP + TN}{TP+FP+TN+FN}\)
    Accuracy is great for symmetric datasets and when the cost of false positives and false negatives are similar.
  • Precision: This is the percentage of positively classified results that are relevant. It is computed as the number of true positives (TP) divided by all observations classified as positive by the model (both TP and FP). In other words, a precision of 90% means that of all the observations classified as positive by our model, 90% are actually positive.
    \(precision = \frac{TP}{TP+FP}\)
    We use precision when we want to be more confident about our true positives. For example, in spam/ham emails, you want to be sure that the email is spam before putting it in the spam box.
  • Recall (Sensitivity): The recall is also regarded as the sensitivity of the model. It is the proportion of actual positives that are correctly classified by the model: the number of true positives divided by the total number of actual positives, whether correctly or incorrectly classified (TP and FN). Recall tells us what proportion of the positive class got correctly classified, e.g., what percentage of cancer patients were correctly identified as cancer patients by the model. A recall of 95% means that of all the actually positive cases, 95% were correctly classified by the model.
    \(recall = \frac{TP}{TP+FN}\)
    We use recall when having a false positive is far better than having a false negative. For example, you would rather tell someone that they have cancer (FP) when in fact they don’t than tell them that they don’t have cancer (FN) when in fact they do. It would be disastrous to give a false negative to a cancer patient, because they would probably have had time to ameliorate the situation had they been informed earlier. Recall is the better metric when the cost of false negatives is unacceptable, i.e., a false positive is preferable to a false negative.
  • F1 Score: This is the harmonic mean of the precision and recall. It is also called F-score or F-Measure.
    \(F1 = \frac{2 * precision * recall}{precision + recall}\)
    F1 is best for uneven class distributions, and it can be used to compare different models.
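    To make the definitions above concrete, the snippet below computes each metric directly from confusion-matrix counts; the counts themselves are made up for illustration:

    # metric calculations from confusion-matrix counts (illustrative numbers)
    TP <- 80; FP <- 20; TN <- 90; FN <- 10

    accuracy  <- (TP + TN) / (TP + FP + TN + FN)
    precision <- TP / (TP + FP)
    recall    <- TP / (TP + FN) # sensitivity
    f1        <- 2 * precision * recall / (precision + recall)

    c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)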
3.6 Underfitting and Overfitting

    In supervised machine learning problems, we aim for a model that properly fits the training data and also generalizes well to new/test data. When a model is unable to generalize to new data, it cannot serve its purpose. This is often the result of underfitting or overfitting.
  • Underfitting: This is a situation in machine learning where a model does not properly fit the training data, resulting in high training error and high test error. The model performs poorly on the training data as well as on new data. An underfit model is not suitable for the data because it cannot properly capture the relationship between the input examples (X) and the target values (Y). Poor performance on the training data could mean that the model is too simple to describe the target properly. A typical example of an underfit model is using a linear model for data points with a quadratic relationship.
    Since the model cannot generalize well on new data, it cannot be leveraged for prediction or classification tasks. High bias and low variance are good indicators of underfitting.
  • Overfitting: This is the opposite of underfitting. It is a situation where a model fits the training data very closely but performs poorly on new datasets, resulting in low training error but high test error. The model essentially learns and memorizes the noise and details in the training data, such that it cannot generalize to unseen data. A typical example of overfitting is fitting data with a quadratic relationship using a cubic or higher-order polynomial model. High variance and low bias are indicators of overfitting.

    [Figure: illustration of underfitting vs. overfitting. Source: https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html]


    Reducing Underfitting
    There are different ways to decrease underfitting, such as:
  • Increase model complexity. It could be that the selected model is too simple and a more complex model may be required.
  • Perform feature selection/feature engineering.
  • Increase the duration of training to get better results.
  • Decrease the amount of regularization used.
    Reducing Overfitting
    There are different ways to reduce overfitting, such as:
  • Reduce the model complexity. It could be that the model is too complex and a simpler model is required.
  • Increase the amount of training data
  • Perform feature selection: Reduce the number of features
  • Increase the amount of regularization
  • Early stopping in the training
  • Perform cross-validation (a minimal sketch follows this list)
  • Use ensemble methods
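    As one example of the remedies above, k-fold cross-validation can be set up with the caret package (loaded later in this analysis). This is a minimal sketch on built-in two-class data, not the loan data:

    library(caret) # model training utilities

    # minimal 5-fold cross-validation sketch on built-in data
    iris_bin <- iris[iris$Species != "setosa", ]
    iris_bin$Species <- droplevels(iris_bin$Species)

    ctrl   <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
    cv_fit <- train(Species ~ ., data = iris_bin,
                    method = "glm", family = "binomial",
                    trControl = ctrl)
    cv_fit$results # accuracy averaged over the held-out folds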


4. METHODOLOGY

    4.1 Dataset

    The data used in this analysis was obtained from Kaggle. The csv files containing the data were downloaded from Kaggle and placed on GitHub. There are three files: the train dataset, the test dataset, and a dictionary. The dictionary file explains what the variables represent. Unfortunately, the test dataset does not contain labels (i.e., the response variable) and could not be used to evaluate how well the model does, because labels are needed to compare against the predicted values. Hence, after cleaning, the train data was split using the caTools package into training (df_train) and testing (df_test) sets, as sketched below.
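    For reference, the split follows the standard caTools pattern. A minimal sketch, assuming cleaned_data holds the output of the data_cleaning function and a 70/30 ratio (the exact ratio is a detail of the analysis code):

    library(caTools) # for train test split

    # stratified train/test split on the response (70/30 ratio assumed)
    set.seed(123)
    split    <- sample.split(cleaned_data$loan_default, SplitRatio = 0.7)
    df_train <- subset(cleaned_data, split == TRUE)
    df_test  <- subset(cleaned_data, split == FALSE)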
    Variables
    There are 41 variables in the dataset: forty (40) predictor variables and one response variable. The variables in the dataset are:
  • UniqueID: Identifier for customer
  • loan_default: Payment default in the first EMI on due date
  • disbursed_amount: Amount of loan disbursed
  • asset_cost: Cost of the Asset
  • ltv: Loan to Value of asset
  • branch_id: Branch where the loan was disbursed
  • supplier_id: Vehicle Dealer where the loan was disbursed
  • manufacturer_id: Vehicle manufacturer (Hero, Honda, TVS, etc.)
  • Current_pincode: Current pincode of the customer
  • Date.of.Birth: Date of Birth of the customer
  • Employment.Type: Employment Type of the customer (Salaried/Self Employed)
  • DisbursalDate: Date of loan disbursement
  • State_ID: State of disbursement
  • MobileNo_Avl_Flag: If Mobile no. was shared by the customer then flagged as 1
  • Aadhar: If aadhar was shared by the customer then flagged as 1
  • PAN_flag: If pan was shared by the customer then flagged as 1
  • VoterID_flag: If voter Id was shared by the customer then flagged as 1
  • Driving_flag: If Driver license was shared by the customer then flagged as 1
  • Passport_flag: If passport was shared by the customer then flagged as 1
  • PERFORM_CNS.SCORE: Bureau Score
  • PERFORM_CNS.SCORE.DESCRIPTION: Bureau Score description
  • PRI.NO.OF.ACCTS: Count of total loans taken by the customer at the time of disbursement
  • PRI.ACTIVE.ACCTS: Count of active loans taken by the customer at the time of disbursement
  • PRI.OVERDUE.ACCTS: Count of default accounts at the time of disbursement
  • PRI.CURRENT.BALANCE: Total principal outstanding amount of the active loans at the time of disbursement
  • PRI.SANCTIONED.AMOUNT: Total amount that was sanctioned for all the loans at the time of disbursement
  • PRI.DISBURSED.AMOUNT: Total amount that was disbursed for all the loans at the time of disbursement
  • SEC.NO.OF.ACCTS: Count of total loans taken by the customer at the time of disbursement
  • SEC.ACTIVE.ACCTS: Count of active loans taken by the customer at the time of disbursement
  • SEC.OVERDUE.ACCTS: Count of default accounts at the time of disbursement
  • SEC.CURRENT.BALANCE: Total principal outstanding amount of the active loans at the time of disbursement
  • SEC.SANCTIONED.AMOUNT: Total amount that was sanctioned for all the loans at the time of disbursement
  • SEC.DISBURSED.AMOUNT: Total amount that was disbursed for all the loans at the time of disbursement
  • PRIMARY.INSTAL.AMT: EMI Amount of the primary loan
  • SEC.INSTAL.AMT: EMI Amount of the secondary loan
  • NEW.ACCTS.IN.LAST.SIX.MONTHS: New loans taken by the customer in the last 6 months before the disbursement
  • DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS: Loans defaulted in the last 6 months
  • AVERAGE.ACCT.AGE: Average loan tenure
  • CREDIT.HISTORY.LENGTH: Time since first loan
  • NO.OF_INQUIRIES: Enquiries done by the customer for loans
    Note: Primary accounts are those the customer has taken for personal use, while secondary accounts are those on which the customer acts as a co-applicant or guarantor.
    The response variable is the loan_default variable and it is a two level categorical variable with a value of 1 for default and value 0 for no default.

    4.2 Methodology

    The method adopted in this analysis follows a number of steps, and code re-use was essential to avoid errors resulting from code duplication. To perform this analysis, the following steps were taken in RStudio:

  • Read Data: The data was read into memory from the external location and a few records were displayed.
  • Clean Data: The data was cleaned using the data_cleaning function developed in this analysis. It leverages the tidyverse group of packages.
  • Explore Data: The data was explored. It was determined that there are 41 variables (40 predictors and 1 response variable). Also, there were over 230,000 observations, and the data was found to be unbalanced after visualizing the two-level categorical response variable loan_default in a bar chart. Further exploratory data analysis was done to better understand the data.
  • Pre-Process Data: The data was pre-processed using the data_preprocessing function also developed in this analysis.
  • Train Test Split: Split the data into training and test data.
  • Model Training: After cleaning and pre-processing, the train data was used to train four (4) different machine learning models: Logistic Regression, Random Forest, XGBoost, and Neural Network. The training was done using the model_training function developed as part of this analysis.
  • Model Prediction and Evaluation: After training each model, predictions were made, and the results were compared to the labels in the data to determine how well the model performs, using four different metrics: Accuracy, Precision, Recall, and AUC.
  • Results: The results of the model evaluations were compared to determine the best performing model.
  • Next Steps and Conclusion: Based on the results of the evaluation metrics and other findings during the analysis, next steps recommendations were made and a conclusion provided on the project findings.
    Further details about the functions used in the analysis are provided in the sections below.

    4.3 Data Cleaning and Pre-processing Details

    To clean and pre-process the data, two major functions, data_cleaning and data_preprocessing, were developed to avoid code duplication and errors, such that new data can simply be passed into those functions to be cleaned, pre-processed, and made ready for either training or testing.

    4.3.1 data_cleaning:

    This function cleans the data. It accepts an R dataframe or a tibble with a fixed schema and performs the following cleaning steps:
  • It starts by removing the ‘period’ in the column names and then converting all the names to snake case (lower) format.
  • It then derives the age of applicants by subtracting the date of birth from the date of disbursement, rounded to the nearest whole number, after both dates have been converted from strings to dates using the lubridate package in R.
  • Then, it normalizes average_acct_age by combining its year and month components and rounding the result to the nearest year.
  • It normalizes credit_history_length by combining its year and month components and rounding the result to the nearest year.
  • It cleans up perform_cns_score_description into meaningful risk categories.
  • It cleans up the employment_type of the applicant into salaried, self_employed, and not_reported.
  • Finally, it selects the relevant columns that will be used further in the analysis and returns the cleaned data as an R dataframe.
4.3.2 data_preprocessing:

    This function pre-processes the data for either training or testing purposes. It expects the cleaned data from the data_cleaning function and a mode specifying whether the data will be used for training or testing. It has a helper function, data_preprocess_scaling, that uses the standard-scaler approach to scale the data. For testing data, mode is passed as ‘test’ and the function simply calls the helper to scale the data, returning scaled data ready for testing. For training data, mode is passed as ‘train’ and the function first uses the ROSE package in R to oversample/under-sample and balance the classes in this binary classification problem, after which it calls the helper to standardize/scale the balanced data.
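    A condensed sketch of that logic is shown below; it is illustrative, not the project's exact implementation, and assumes the input dataframe has the loan_default response plus numeric and character predictors:

    library(ROSE) # to balance the data

    # hedged sketch of the data_preprocessing logic described above
    data_preprocessing_sketch <- function(df, mode = "train") {
      if (mode == "train") {
        # balance the classes with combined over/under-sampling (ROSE)
        df <- ovun.sample(loan_default ~ ., data = df, method = "both",
                          N = nrow(df), p = 0.5, seed = 1)$data
      }
      # standard-scaler step: center and scale the numeric predictors only
      num_cols <- sapply(df, is.numeric) & names(df) != "loan_default"
      df[num_cols] <- scale(df[num_cols])
      df
    }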

    4.3.3 model_training:

    This function trains any of the four (4) models. It expects a pre-processed dataframe and the type of model to train. It has a helper function, xgb_nnet_preprocess, that does further preprocessing for the XGBoost and Neural Network models. For the XGBoost model, the train function expects a DMatrix, which can contain only numeric values; in that case, model_training calls the helper, which converts the non-numeric predictor variables to dummies and converts the dataframe into the DMatrix format required to train with the XGBoost algorithm. For the NNET model, the helper function only dummifies the non-numeric predictor variables.
    For models that need further pre-processing, the function calls the appropriate helper and then trains the model using the data and model type provided, returning a trained model.
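    A simplified structural sketch of such a dispatcher is shown below; only the logistic and random forest branches are spelled out, since the XGBoost and NNET branches additionally require the dummifying helper described above:

    library(ranger) # for random forest implementation

    # hedged structural sketch of model_training (not the project's exact code)
    model_training_sketch <- function(df, model_type) {
      switch(model_type,
        logistic = glm(loan_default ~ ., data = df, family = binomial),
        rf       = ranger(as.factor(loan_default) ~ ., data = df,
                          num.trees = 500, probability = TRUE),
        stop("xgb and nnet require the xgb_nnet_preprocess step first")
      )
    }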

    4.3.4 model_prediction:

    This behaves similarly to the model_training function, but for predictions. It expects three parameters: data, trained_model, and model_type. Depending on the model type, it determines whether the data needs further pre-processing; for the XGBoost and NNET models it calls the same helper, xgb_nnet_preprocess, so that the data is in the same format that was used for training, which is what is needed to make predictions. The predictions returned are probabilities of belonging to a certain class (Class 0). The function then converts these probabilities to classes using a given threshold, which can be adjusted based on business needs; for this project, a 50% threshold is used.
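    The probability-to-class conversion itself is a one-liner; a minimal sketch with the 50% threshold, remembering that the probabilities here are for Class 0:

    # convert predicted probabilities of Class 0 into class labels (threshold = 0.5)
    prob_to_class <- function(prob_class0, threshold = 0.5) {
      ifelse(prob_class0 >= threshold, 0, 1)
    }

    prob_to_class(c(0.92, 0.41, 0.55)) # returns 0 1 0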

    4.3.5 model_evaluation:

    After training a model and using it to make predictions, it is very important to determine whether the predictions can be trusted. This function evaluates the model: it calculates the accuracy, precision, recall, and AUC for any of the four models once the actual and predicted values are provided. It relies heavily on the Metrics package in R for the evaluations, except for AUC, where it uses the pROC package.
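    A condensed sketch of such an evaluation function, using the two packages named above (the project's actual function may differ in detail):

    library(Metrics) # accuracy, precision, recall
    library(pROC)    # AUC

    # hedged sketch of the model_evaluation logic described above
    model_evaluation_sketch <- function(actual, predicted_class, predicted_prob) {
      c(accuracy  = Metrics::accuracy(actual, predicted_class),
        precision = Metrics::precision(actual, predicted_class),
        recall    = Metrics::recall(actual, predicted_class),
        auc       = as.numeric(pROC::auc(actual, predicted_prob)))
    }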

    Note: although there is a section for exploratory data analysis, no dedicated function was written for it.


5. ANALYSIS, TESTING AND RESULTS

    5.1 Analysis

    5.1.1 Data Cleaning and Pre-processing

    Libraries
    Load the required libraries for the analysis

    library(Amelia) # To visualize missing data
    library(caret) # model training utilities
    library(caTools) # for train test split
    library(corrplot) # To plot correlation plot
    library(cowplot) # To combine plots in a grid
    library(fastDummies) # to convert character variables to dummies
    library(ggcorrplot) # To plot correlation plot
    library(kableExtra) # to style HTML tables
    library(Metrics) # for model evaluation
    library(nnet) # for neural network implementation
    library(pROC) # for AUC calculation
    library(ranger) # for random forest implementation
    library(ROSE) # to balance the data
    library(tidyverse) # for data manipulation and plotting
    library(xgboost) # for gradient boosting implementation

    Read the datasets into memory from the GitHub location.

    url_vehicle_loan_default_train = "https://raw.githubusercontent.com/chinedu2301/data698-analytics-project/main/data/vehicle_default_train_data.csv"
    url_vehicle_loan_default_test = "https://raw.githubusercontent.com/chinedu2301/data698-analytics-project/main/data/vehicle_default_test_data.csv"
    vehicle_loan_default_train_raw = read_csv(url_vehicle_loan_default_train) %>% as_tibble()
    vehicle_loan_default_test_raw = read_csv(url_vehicle_loan_default_test) %>% as_tibble()

    Display a few records of the dataset to get an idea of what the data looks like.

    # display a few records of the raw data
    raw_data_few_records <- kable(head(vehicle_loan_default_train_raw, 50), "html") %>%
                            kable_paper("hover", full_width = F) %>%
                            scroll_box(width = "850px", height = "350px")
    raw_data_few_records
    [Output: scrollable HTML table showing the first 50 records of the raw training data, 41 columns from UNIQUEID through LOAN_DEFAULT.]

    data_cleaning function

    This function cleans the data. It accepts only a dataframe (or a tibble).

    data_cleaning <- function(df){
      # This function accepts a dataframe (df) as input and returns another dataframe (cleaned_df) that is clean.
      tryCatch({
      if(is_tibble(df) | is.data.frame(df)){
        print("The dataframe is a tibble, will proceed to clean data")
        print("Data Cleaning in progress...")
        # rename all the columns in the dataframe to lowercase
        cleaned_df <- df %>% rename_all(tolower) %>%  
                    # compute the age of the applicant in number of years
                    mutate(date_of_birth = dmy(date_of_birth),  
                          disbursal_date = dmy(disbursal_date), 
                          age = difftime(disbursal_date, date_of_birth, units = "days"),
                          age_years = round(as.numeric(age / 365.25), 0)) %>%
                   # extract the years and month component of the average_acct_age  and convert to years
                    mutate(average_acct_age_year_comp = as.numeric(str_extract(average_acct_age, "\\d+")),
                          average_acct_age_mon_comp = as.numeric(str_extract(average_acct_age, "\\d+(?=mon)")),
                          average_acct_age = round((average_acct_age_year_comp + average_acct_age_mon_comp/12), 0)
                           ) %>%
                    # extract the years and month component of the credit_history_length and convert to years
                    mutate(credit_history_length_year_comp = as.numeric(str_extract(credit_history_length, "\\d+")),
                           credit_history_length_comp = as.numeric(str_extract(credit_history_length, "\\d+(?=mon)")),
                           credit_history_length = round((credit_history_length_year_comp + credit_history_length_comp/12), 0)
                           )  %>%
                    # clean up the perform_cns_score_description to include only a few categories
                    # (case_when keeps the first match, so "very high risk" must be tested
                    #  before "high risk", just as "very low risk" is tested before "low risk")
                    mutate(lowercase_cns_description = tolower(perform_cns_score_description),
                           perform_cns_score_description = case_when(
                                    str_detect(lowercase_cns_description, "very low risk") ~ "very_low_risk",
                                    str_detect(lowercase_cns_description, "low risk") ~ "low_risk",
                                    str_detect(lowercase_cns_description, "medium risk") ~ "medium_risk",
                                    str_detect(lowercase_cns_description, "very high risk") ~ "very_high_risk",
                                    str_detect(lowercase_cns_description, "high risk") ~ "high_risk",
                                    str_detect(lowercase_cns_description, "not scored|no bureau") ~ "low_risk",
                                    TRUE ~ "none"))  %>%
                    # clean up the employment type to have only few categories
                    mutate(lower_case_employment_type = tolower(employment_type),
                           employment_type = case_when(
                                    str_detect(lower_case_employment_type, "salaried") ~ "salaried",
                                    str_detect(lower_case_employment_type, "self employed") ~ "self_employed",
                                    TRUE ~ "not_reported")) %>%
                    # select only the required columns
                    select(
                      age_years, disbursed_amount, asset_cost, ltv, employment_type, perform_cns_score_description,
                      pri_no_of_accts, pri_active_accts, pri_overdue_accts, pri_current_balance, pri_sanctioned_amount,
                      pri_disbursed_amount, sec_no_of_accts, sec_active_accts, sec_overdue_accts, sec_current_balance,
                      sec_sanctioned_amount, sec_disbursed_amount, primary_instal_amt, sec_instal_amt, new_accts_in_last_six_months,
                      delinquent_accts_in_last_six_months, average_acct_age, credit_history_length, no_of_inquiries, loan_default
                    )
                    print("Data Cleaning complete!!!")
        return(cleaned_df)
      }
      else{
      print("The dataframe is not a tibble. Kindly have your data in the form of a dataframe or a tibble")
      }
        
      },
      #if an error occurs, tell me the error
      error=function(e) {
            message('An Error Occurred')
            print(e)
            },
      #or if a warning occurs, tell me the warning
      warning=function(w) {
            message('A Warning Occurred')
            print(w)
            return(NA)
            }
        )
      
    }

    Use the data_cleaning function to clean the data.

    # clean the data
    vehicle_loan_default_train_cleaned = data_cleaning(vehicle_loan_default_train_raw)
    ## [1] "The dataframe is a tibble, will proceed to clean data"
    ## [1] "Data Cleaning in progress..."
    ## [1] "Data Cleaning complete!!!"

    Display a few records of the cleaned data.

    # display a few records of the cleaned data
    cleaned_data_few_records <- kable(head(vehicle_loan_default_train_cleaned, 200), "html") %>% 
                                kable_paper("hover", full_width = F) %>%
                                scroll_box(width = "850px", height = "350px")
    cleaned_data_few_records
    [Output: scrollable HTML table showing the first records of the cleaned training data, 26 columns from age_years through loan_default.]
    52 54273 71840 80.73 salaried low_risk 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 13 0 0
    39 48468 65500 77.86 salaried low_risk 1 1 0 58558 48220 48220 0 0 0 0 0 0 0 0 1 0 0 0 0 1
    26 50046 70516 72.75 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    30 47773 63306 80.56 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    46 46548 63306 78.82 salaried very_low_risk 1 1 0 0 12000 12000 0 0 0 0 0 0 0 0 0 0 1 1 0 0
    33 54131 69936 82.36 salaried very_low_risk 11 2 0 45639 75000 75000 0 0 0 0 0 0 22267 0 1 0 1 2 5 1
    36 50743 66115 78.65 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    33 48258 63896 80.60 salaried low_risk 1 1 0 51500 51500 51500 0 0 0 0 0 0 0 0 1 0 0 0 0 0
    49 47773 63306 80.56 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    48 44779 62852 74.78 salaried medium_risk 5 2 1 34922 60000 60000 0 0 0 0 0 0 11759 0 1 1 1 1 1 0
    26 45814 66115 71.09 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    41 48468 60410 84.42 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    49 44819 66487 69.19 salaried very_low_risk 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 4 0 0
    22 50673 63840 83.02 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    41 37939 64500 62.02 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    20 52428 67405 81.60 not_reported low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    20 51653 63896 86.08 not_reported low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    36 52818 63896 88.42 salaried medium_risk 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
    24 51428 64840 84.82 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0
    20 49488 63306 83.72 not_reported low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    40 51663 68000 79.41 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    26 49458 62852 82.73 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    43 52508 66246 81.51 salaried high_risk 36 11 2 327845 489490 489490 0 0 0 0 0 0 40747 0 3 1 1 7 0 0
    36 55333 73805 78.59 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    41 38939 59313 67.44 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    51 49713 63896 82.95 self_employed medium_risk 8 2 0 2131369 3254115 3254115 0 0 0 0 0 0 2884 0 0 0 2 4 0 0
    28 49458 63000 82.54 salaried low_risk 1 1 0 282809 292906 292906 0 0 0 0 0 0 0 0 1 0 0 0 0 1
    21 40884 59313 70.81 not_reported low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    42 50295 67528 77.92 salaried very_low_risk 1 0 0 0 0 0 0 0 0 0 0 0 1328 0 0 0 1 1 0 0
    29 44575 59540 78.94 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    25 49973 63306 84.51 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    38 45769 66365 72.33 salaried very_low_risk 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 4 0 0
    41 51653 63306 86.88 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    30 46809 59313 80.93 salaried very_low_risk 1 1 0 3391 25000 25000 0 0 0 0 0 0 1350 0 0 0 2 2 0 1
    40 41670 59313 74.18 salaried low_risk 17 3 0 1002630 1028000 1028000 0 0 0 0 0 0 23951 0 1 0 1 2 0 1
    43 48693 62577 81.50 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    36 51428 63306 86.88 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    29 54273 69067 83.98 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    41 51428 63306 86.88 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    52 45769 59313 80.93 self_employed very_low_risk 1 0 0 0 0 0 0 0 0 0 0 0 2460 0 0 0 2 2 0 0
    30 42969 63840 72.06 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    25 48258 63896 80.60 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    28 47650 62577 79.74 salaried very_low_risk 2 1 0 1352 15000 15000 0 0 0 0 0 0 7460 0 0 0 1 2 1 0
    30 42690 63000 69.84 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    50 41854 59300 74.20 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    47 46759 62577 78.30 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    38 48433 63896 80.15 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    38 46555 59313 82.61 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    23 48699 59313 84.13 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    31 49250 65269 77.37 salaried low_risk 1 1 0 27537 28196 28196 0 0 0 0 0 0 0 0 0 0 1 1 0 1
    32 34959 60410 59.59 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    34 43090 64800 70.22 salaried low_risk 1 1 0 45500 45500 45500 0 0 0 0 0 0 0 0 1 0 0 0 0 0
    42 28084 63720 45.51 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    21 49683 62577 83.10 not_reported low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    47 50673 65269 81.20 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    36 38164 63896 64.17 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    57 50673 62577 84.70 salaried very_low_risk 3 2 0 2681 15749 15749 0 0 0 0 0 0 0 0 2 0 0 1 0 0
    42 50458 63896 84.51 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    36 49458 62577 83.10 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    23 42874 63800 68.97 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    31 49683 62577 83.10 salaried very_low_risk 2 1 0 21323 26000 26000 0 0 0 0 0 0 2856 0 1 0 1 1 1 1
    27 50743 63149 82.34 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    38 60213 84398 73.46 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    29 54303 69542 79.09 salaried high_risk 5 4 1 312718 677000 677000 0 0 0 0 0 0 12979 0 0 2 2 4 3 1
    32 49803 65368 77.25 salaried very_low_risk 2 0 0 0 0 0 0 0 0 0 0 0 4164 0 0 0 0 0 0 1
    46 51403 65687 79.32 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    30 60971 72657 85.00 salaried low_risk 2 1 0 10000 42500 42500 0 0 0 0 0 0 2154 0 0 0 2 3 0 1
    50 45349 65368 70.37 salaried very_low_risk 2 1 0 33612 40000 40000 0 0 0 0 0 0 3740 0 0 0 1 2 0 0
    27 36439 61865 59.81 self_employed very_low_risk 3 1 0 13785 56173 56173 0 0 0 0 0 0 4020 0 0 0 2 5 0 0
    33 51078 65368 79.55 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
    28 74951 102945 74.02 self_employed very_low_risk 2 1 0 21480 201381 201381 0 0 0 0 0 0 0 0 0 0 3 3 0 0
    34 58259 66068 89.30 self_employed very_low_risk 7 4 0 196020 259363 259363 0 0 0 0 0 0 26627 0 2 0 1 4 0 0
    40 52303 66310 79.93 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    45 46349 65687 71.55 salaried low_risk 1 1 0 5470 5470 5470 0 0 0 0 0 0 954 0 1 0 0 0 1 0
    46 51303 65060 79.93 salaried very_low_risk 1 0 0 0 0 0 0 0 0 0 0 0 1250 0 0 0 0 0 0 0
    33 51078 65310 79.62 salaried very_low_risk 5 4 0 46481 99090 99090 0 0 0 0 0 0 7210 0 0 0 1 1 1 1
    36 58259 65687 89.82 salaried very_low_risk 8 4 0 3259073 3610215 3587762 0 0 0 0 0 0 30113 0 2 0 1 7 3 0
    34 52003 68695 76.72 salaried low_risk 1 1 0 8455 14500 14500 0 0 0 0 0 0 1209 0 1 0 0 0 0 0
    40 49349 65368 76.49 self_employed very_low_risk 6 4 0 13438 48579 48579 0 0 0 0 0 0 2785 0 1 0 1 1 0 0
    22 55567 66252 84.99 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    44 40094 61865 65.79 self_employed low_risk 9 6 3 20196 351003 285648 0 0 0 0 0 0 0 0 2 1 3 11 0 0
    42 55259 65937 84.93 salaried very_low_risk 1 0 0 0 0 0 0 0 0 0 0 0 1475 0 0 0 1 1 0 0
    22 40394 65368 62.72 salaried very_low_risk 1 0 0 0 0 0 2 2 1 1171994 1690000 1690000 1090 9382 0 0 3 5 0 0
    37 52303 62365 84.98 self_employed high_risk 5 4 1 1147365 1163250 1163250 0 0 0 0 0 0 13050 0 2 0 1 2 1 0
    25 40094 65368 62.26 salaried low_risk 4 4 0 42063 66950 66950 0 0 0 0 0 0 4037 0 3 0 0 1 0 0
    25 50303 67099 76.01 self_employed low_risk 3 3 1 7960 38950 7960 0 0 0 0 0 0 1464 0 2 1 1 2 1 1
    42 53578 68870 79.13 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    25 52303 68695 77.15 self_employed low_risk 1 1 0 0 15000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    35 58259 65687 89.82 self_employed very_low_risk 2 0 0 0 0 0 0 0 0 0 0 0 2597 0 0 0 2 2 0 0
    48 54305 64760 85.00 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    23 48835 61865 79.99 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    33 56059 63349 89.66 salaried low_risk 2 1 0 0 4968 4968 0 0 0 0 0 0 0 0 0 0 3 3 0 0
    30 50303 64651 78.89 salaried high_risk 3 2 1 30206 99100 105564 1 1 0 0 40000 361 2315 0 1 1 1 3 0 0
    25 42394 77968 55.15 self_employed high_risk 5 1 1 22075 45000 34589 0 0 0 0 0 0 0 0 0 0 1 1 0 0
    39 57759 65368 89.49 self_employed medium_risk 5 5 1 476937 644797 644797 0 0 0 0 0 0 0 0 4 0 0 2 0 0
    53 54078 65197 84.36 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    35 41094 57782 72.17 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    61 35939 61865 59.00 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    31 54003 64389 84.95 self_employed low_risk 5 3 0 192520 555613 507644 0 0 0 0 0 0 4671 0 0 0 3 3 0 0
    27 27229 61865 44.77 salaried low_risk 4 3 0 4340 31809 31809 0 0 0 0 0 0 0 0 1 0 0 1 0 0
    44 51078 65203 79.75 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    40 49349 65690 76.12 self_employed very_low_risk 3 2 0 974963 1260000 1260000 0 0 0 0 0 0 19740 0 0 0 7 11 0 0
    40 51303 65249 79.69 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
    42 44749 61865 73.39 self_employed very_low_risk 10 2 0 1420891 1776733 1776733 0 0 0 0 0 0 31311 0 0 0 2 4 0 0
    33 47049 65215 73.14 salaried very_low_risk 2 2 0 5420 35271 35271 0 0 0 0 0 0 3847 0 1 0 1 1 4 0
    26 43394 66068 66.60 self_employed very_low_risk 10 5 0 145434 180522 180522 0 0 0 0 0 0 19485 0 4 0 0 1 1 0
    29 53803 68245 79.86 self_employed very_low_risk 1 1 0 1200 13900 13900 0 0 0 0 0 0 2317 0 0 0 1 1 0 1
    33 38439 65215 59.80 salaried very_low_risk 4 3 0 37855 90026 90026 0 0 0 0 0 0 5721 0 1 0 2 5 0 0
    53 48349 65368 74.96 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    41 54759 65368 84.90 self_employed very_low_risk 10 6 0 741288 1299454 1102429 0 0 0 0 0 0 0 0 0 0 4 13 0 0
    42 73723 99500 74.97 self_employed very_low_risk 2 1 0 820301 900000 900000 0 0 0 0 0 0 3000 0 0 0 1 2 0 0
    34 40394 69498 58.99 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    41 56013 65687 86.78 self_employed very_low_risk 13 2 0 881713 1115000 1115000 0 0 0 0 0 0 3722 0 0 0 1 4 0 0
    24 46349 62465 75.24 self_employed very_low_risk 1 1 0 24900 50000 50000 0 0 0 0 0 0 0 0 0 0 1 1 0 0
    51 56959 68695 83.99 self_employed very_low_risk 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
    45 58259 66068 89.30 self_employed very_low_risk 3 1 0 7493 14990 14990 0 0 0 0 0 0 0 0 1 0 1 1 0 0
    31 53878 68601 79.88 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    28 42394 61917 69.45 self_employed medium_risk 12 7 1 1074066 1353681 1341560 0 0 0 0 0 0 1565 0 4 1 1 2 0 1
    31 51303 65251 79.69 salaried very_low_risk 1 0 0 0 0 0 0 0 0 0 0 0 2000 0 0 0 2 2 0 0
    55 74122 103060 72.77 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    46 54576 65368 85.00 salaried low_risk 3 3 0 8710 27154 27154 0 0 0 0 0 0 1910 0 1 0 1 1 1 0
    59 52303 66552 79.64 self_employed very_low_risk 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0
    36 49049 64217 77.39 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    39 46349 65687 71.55 self_employed very_low_risk 4 0 0 0 0 0 0 0 0 0 0 0 4592 0 0 0 0 1 0 0
    38 51003 65687 78.71 self_employed very_low_risk 2 2 0 13163 31251 31251 0 0 0 0 0 0 0 0 2 0 0 0 0 0
    51 53078 65687 82.21 salaried low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    25 51303 67714 76.79 salaried very_low_risk 2 0 0 0 0 0 0 0 0 0 0 0 5556 0 0 0 1 1 0 0
    26 35939 61865 59.00 self_employed low_risk 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

    5.1.2 Exploratory Data Analysis

    Check Shape of data
    dim(vehicle_loan_default_train_cleaned)
    ## [1] 233154     26
    There are 26 columns (25 predictor variables and 1 response variable) and 233,154 observations.
    Take a Glimpse at the data
    glimpse(vehicle_loan_default_train_cleaned)
    ## Rows: 233,154
    ## Columns: 26
    ## $ age_years                           <dbl> 35, 33, 33, 25, 41, 28, 30, 29, 27…
    ## $ disbursed_amount                    <dbl> 50578, 47145, 53278, 57513, 52378,…
    ## $ asset_cost                          <dbl> 58400, 65550, 61360, 66113, 60300,…
    ## $ ltv                                 <dbl> 89.55, 73.23, 89.63, 88.48, 88.39,…
    ## $ employment_type                     <chr> "salaried", "self_employed", "self…
    ## $ perform_cns_score_description       <chr> "low_risk", "medium_risk", "low_ri…
    ## $ pri_no_of_accts                     <dbl> 0, 1, 0, 3, 0, 2, 0, 1, 1, 1, 1, 3…
    ## $ pri_active_accts                    <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 2…
    ## $ pri_overdue_accts                   <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ pri_current_balance                 <dbl> 0, 27600, 0, 0, 0, 0, 0, 72879, -4…
    ## $ pri_sanctioned_amount               <dbl> 0, 50200, 0, 0, 0, 0, 0, 74500, 36…
    ## $ pri_disbursed_amount                <dbl> 0, 50200, 0, 0, 0, 0, 0, 74500, 36…
    ## $ sec_no_of_accts                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ sec_active_accts                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ sec_overdue_accts                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ sec_current_balance                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ sec_sanctioned_amount               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ sec_disbursed_amount                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ primary_instal_amt                  <dbl> 0, 1991, 0, 31, 0, 1347, 0, 0, 0, …
    ## $ sec_instal_amt                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ new_accts_in_last_six_months        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ delinquent_accts_in_last_six_months <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
    ## $ average_acct_age                    <dbl> 0, 2, 0, 1, 0, 2, 0, 0, 5, 2, 1, 2…
    ## $ credit_history_length               <dbl> 0, 2, 0, 1, 0, 2, 0, 0, 5, 2, 1, 2…
    ## $ no_of_inquiries                     <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1…
    ## $ loan_default                        <dbl> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0…
    Check for Null Values
    # use missmap function from the Amelia package to check for NA values
    missmap(vehicle_loan_default_train_cleaned,
            plot.background = element_rect(fill = "antiquewhite"),
            main = "Vehicle Loan Default - Missing Values", 
            x.cex = 0.45,
            y.cex = 0.6,
            margins = c(7.1, 7.1),
            col = c("yellow", "black"), legend = FALSE)


    From the missing-values plot, we can see that there are no missing values in the data, so no NA imputation or other null handling is necessary.
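
    As a quick numeric cross-check of the plot, the NA counts can also be computed directly; this is a simple base-R check shown here for illustration, not part of the original pipeline.

    # count missing values per column and overall
    na_counts <- colSums(is.na(vehicle_loan_default_train_cleaned))
    sum(na_counts) # expected to be 0, consistent with the missingness plot above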

    Check for Imbalance of data for the loan_default variable
    loan_default_grouped <- vehicle_loan_default_train_cleaned %>% group_by(loan_default) %>% 
                            summarise(count = n()) %>%
                            mutate(percentage = round((count / sum(count) * 100), 2))
    
    loan_default_grouped_displayed <- kable(head(loan_default_grouped, 200), "html") %>% 
                                      kable_paper("hover", full_width = F)
    loan_default_grouped_displayed
    loan_default count percentage
    0 182543 78.29
    1 50611 21.71
    p_bar_loan_default_category <- loan_default_grouped %>% ggplot(aes(x=factor(loan_default), 
                                                                       y = percentage, fill = factor(loan_default))) +
                                   geom_bar(stat = "identity", position = "dodge") +
                                   labs(title = "Loan Default Distribution", x = "Loan Default", y = "Percentage", 
                                        fill = "Loan Default") +
                                   scale_y_continuous(labels = scales::percent_format(scale = 1)) +  # Format y-axis as percentages
                                   theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                            panel.background = element_rect(fill = "gray80"),
                                                            plot.background = element_rect(fill = "antiquewhite"))                          
    p_bar_loan_default_category


    From the values and plot above, we can see that there are far more observations in category 0 (no default) than in category 1 (default). Before modeling, the data will be balanced by oversampling the minority "default" category (combined with undersampling of the majority category), as previewed below.
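
    The balancing itself is implemented later inside the pre-processing function (Section 5.2.2) using the ROSE package. As a standalone preview, a minimal sketch of that step looks like this; the 1.5x target sample size mirrors the choice used there.

    library(ROSE) # provides ovun.sample for combined over/under-sampling
    # method = "both" oversamples the minority class (default) and undersamples
    # the majority class until the target sample size N is reached
    balanced <- ovun.sample(loan_default ~ ., data = vehicle_loan_default_train_cleaned,
                            method = "both", N = 1.5 * nrow(vehicle_loan_default_train_cleaned),
                            seed = 1994)$data
    table(balanced$loan_default) # roughly balanced class counts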

    Age Distribution
    # Age distribution - All categories
    age_dist_all <- vehicle_loan_default_train_cleaned %>% ggplot(aes(x = age_years)) +
                        geom_histogram(binwidth = 2, fill = "blue", color = "white", alpha = 0.7) +
                        labs(title = "Age Distribution - All Categories", x = "Age", y = "Frequency") + theme_minimal() +
                        theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))
    
    # Age distribution - Loan Default
    age_dist_loan_default <- vehicle_loan_default_train_cleaned %>% filter(loan_default == 1) %>% ggplot(aes(x = age_years)) +
                        geom_histogram(binwidth = 2, fill = "blue", color = "white", alpha = 0.7) +
                        labs(title = "Age Distribution - Loan Default", x = "Age", y = "Frequency") + theme_minimal() +
                        theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))
    
    # Age distribution - No Default
    age_dist_no_default <- vehicle_loan_default_train_cleaned %>% filter(loan_default == 0) %>% ggplot(aes(x = age_years)) +
                        geom_histogram(binwidth = 2, fill = "blue", color = "white", alpha = 0.7) +
                        labs(title = "Age Distribution - No Default", x = "Age", y = "Frequency") + theme_minimal() +
                        theme(plot.title = element_text(hjust = 0.5),
                                                        panel.background = element_rect(fill = "gray80"),
                                                        plot.background = element_rect(fill = "antiquewhite"))
    
    # Plot all three plots
    plot_age_dist <- plot_grid(age_dist_loan_default, age_dist_no_default, age_dist_all, byrow = TRUE, nrow = 3) 
    plot_age_dist


    The age distribution is broadly similar for the default and no-default groups. Most applicants are between 20 and 40 years old, fewer are between 40 and 60, and almost none are older than 70.

    CNS Score Description

    The perform_cns_score_description variable categorizes the risk level of the applicant based on their credit score.

    # All Categories
    cns_score_desc_all <- vehicle_loan_default_train_cleaned %>% group_by(perform_cns_score_description) %>% 
                                    summarise(count = n()) %>%
                                    mutate(percentage = round((count / sum(count) * 100), 2)) %>%
                                    ggplot(aes(x = perform_cns_score_description, y = percentage,
                                                                        fill = factor(perform_cns_score_description))) +
                                    geom_bar(stat = "identity", position = "dodge") +
                                    labs(title = "Risk Level Distribution - All Categories", x = "Risk Level", y = "Percentage", 
                                         fill = "Risk Level") +
                                    scale_y_continuous(labels = scales::percent_format(scale = 1)) +  # Format y-axis as percentages
                                    theme_minimal()  + theme(plot.title = element_text(hjust = 0.5),
                                                             panel.background = element_rect(fill = "gray80"),
                                                             plot.background = element_rect(fill = "antiquewhite"))
    
    cns_score_desc_all


    A significantly high proportion of the applicants fall under the low_risk category, and over 85% fall under either very_low_risk or low_risk. This may reflect applicants preferring to improve their credit, and thus their risk level, before applying for vehicle loans, especially since low-risk applicants tend to get better interest rates.

    Loan Default Distribution By Risk Level
    loan_default_risk_level <- vehicle_loan_default_train_cleaned %>% 
                                   count(loan_default, perform_cns_score_description, name = "Record_Count") %>%
                                   ggplot(aes(x=loan_default, y = Record_Count, fill = perform_cns_score_description)) +
                                   geom_bar(stat = "identity", position = "stack") +
                               labs(title = "Loan Default Distribution by Risk Level", x = "Loan Default", y = "Count", 
                                        fill = "Risk Level") + theme_minimal()  + 
                                   theme(plot.title = element_text(hjust = 0.5),
                                                            panel.background = element_rect(fill = "gray80"),
                                                            plot.background = element_rect(fill = "antiquewhite")) 
    loan_default_risk_level


    It is clear that a large portion of the non-defaulters are categorized as low_risk or very_low_risk. Notably, the low_risk category also constitutes a large share of the defaulters.

    Employment Type
    employment_type <- vehicle_loan_default_train_cleaned %>% 
                                   count(loan_default, employment_type, name = "Record_Count") %>%
                                   ggplot(aes(x=loan_default, y = Record_Count, fill = employment_type)) +
                                   geom_bar(stat = "identity", position = "stack") +
                               labs(title = "Loan Default Distribution by Employment Type", x = "Loan Default", y = "Count", 
                                        fill = "Employment Type") + theme_minimal()  + 
                                   theme(plot.title = element_text(hjust = 0.5),
                                                            panel.background = element_rect(fill = "gray80"),
                                                            plot.background = element_rect(fill = "antiquewhite")) 
    employment_type


    Salaried and self-employed applicants constitute most of the data, in roughly equal proportions within both default categories, although the self-employed are slightly more numerous.

    Correlation Plot
    vehicle_loan_default_train_cleaned_numeric <- vehicle_loan_default_train_cleaned %>% 
                                                  select(-employment_type, -perform_cns_score_description)
    corr_matrix <- cor(vehicle_loan_default_train_cleaned_numeric)
    correlation_plot <- ggcorrplot(corr_matrix, 
                                   lab = TRUE, # Show correlation coefficients on the tiles
                                   lab_size = 2, # Size of the coefficient labels
                                   hc.order = TRUE, # Reorder the correlation matrix
                                   type = "lower", 
                                   outline.col = "white", 
                                   colors = c("blue", "white", "red"), 
                                   ggtheme = ggplot2::theme_minimal(),
                                   title = "Correlation Plot") + coord_fixed(ratio = 0.9) + 
                                   theme(axis.text.x = element_text(size = 9, angle = 45, hjust = 1),
                                         axis.text.y = element_text(size = 9, hjust = 1),
                                         plot.title = element_text(hjust = 0.5),
                                         panel.background = element_rect(fill = "gray80"),
                                         plot.background = element_rect(fill = "antiquewhite"),
                                         axis.title = element_text(size = 10)) +labs(x = NULL, y = NULL)
    correlation_plot


    5.2.2 Data Pre-processing, Model Training, and Testing

    Preprocess the data

    Since the data is imbalanced, we preprocess the data to get a balanced dataset and also standardize the numeric variables in the data using the standard normal distribution.

    data_preprocess_scaling <- function(df){
                          # This helper function standardizes the numeric variables of the df using the standard normal method
                          df <- as.data.frame(df)
                          df_char <- df %>% select(loan_default, employment_type, perform_cns_score_description) 
                          df_numeric <- df %>% select(-loan_default, -employment_type, -perform_cns_score_description)
                          df_numeric_scaled <- df_numeric %>% mutate_all( ~ (scale(.) %>% as.vector))
                          df_scaled_combined <- cbind(df_char, df_numeric_scaled)
                          return(df_scaled_combined)
    }
    
    
    xgb_nnet_preprocess <- function(df, mode){
     # convert the categorical variables to dummy variables
                  df2 <- dummy_cols(df, select_columns = c("employment_type","perform_cns_score_description"), 
                                    remove_selected_columns = TRUE) %>% as.data.frame()
                  # prepare the xgb.DMatrix used for xgboost training; loan_default is
                  # assumed to be the first column after pre-processing
                  df2_train <- as.data.frame(df2[, -1]) # predictor columns
                  df2_label <- as.data.frame(df2[, 1])  # response column (loan_default)
                  df_dmatrix <- xgb.DMatrix(as.matrix(sapply(df2_train, as.numeric)), label=as.matrix(df2_label))
                  
                  if(mode == "xgboost"){
                    return(df_dmatrix)
                    
                  } else if(mode == "nnet"){
                    return(df2)
                    
                  } else{
                    print("Mode not supported.")
                  }
                  
    }
    
    
    data_preprocessing <- function(df, mode = "train"){
        # This function pre-processes the cleaned data and gets it ready for training.
        tryCatch({
          if(is_tibble(df) | is.data.frame(df)){
            if(mode == "test"){
               df_scaled <- data_preprocess_scaling(df)
               print("Data Pre-processing complete")
               return(df_scaled)
               
            } else if(mode == "train"){
              curr_frame <<- sys.nframe() # sends the current frame number to the global environment.
              # The ovun.sample function in the ROSE package assumes the data is in the global
              # environment, so we must tell it which frame (scope) holds the data; otherwise
              # the call fails when executed inside a function.
              df_ovun <- ovun.sample(formula = formula(loan_default ~ .), data = get("df", sys.frame(curr_frame)),
                                     N = 1.5 * nrow(df), seed = 1994, method = "both")$data %>% as.data.frame() %>% as_tibble()
              print("Oversampling and undersampling completed")
              df_scaled <- data_preprocess_scaling(df_ovun)
              print("Data Pre-processing complete!")
              return(df_scaled)
              
            } else {
              print("You did not enter a valid mode type: Enter train or test for mode")
              
            }
          }
          
        else{
        print("The input is not a data frame or tibble. Kindly provide your data as a data frame or a tibble.")
      }
        
      },
      #if an error occurs, tell me the error
      error=function(e) {
            message('An Error Occurred')
            print(e)
            },
      #or if a warning occurs, tell me the warning
      warning=function(w) {
            message('A Warning Occurred')
            print(w)
            return(NA)
            }
        )
      
    }
    Train Test Split

    Use the caTools library to split the cleaned dataset into training and testing datasets in a 70:30 ratio.

    # Set a seed
    set.seed(1994)
    #Split the sample
    sampling <- sample.split(vehicle_loan_default_train_cleaned$loan_default, SplitRatio = 0.7) 
    # Training Data
    df_train_subset <- subset(vehicle_loan_default_train_cleaned, sampling == TRUE)
    # Testing Data
    df_test_subset <- subset(vehicle_loan_default_train_cleaned, sampling == FALSE)

    Pre-process the train dataset

    df_train = data_preprocessing(df_train_subset, mode = "train")
    ## [1] "Oversampling and undersampling completed"
    ## [1] "Data Pre-processing complete!"

    Pre-process the test dataset

    df_test = data_preprocessing(df_test_subset, mode = "test")
    ## [1] "Data Pre-processing complete"
    Develop function to train the data

    The function model_training trains a machine learning model according to the mode selected (logistic, rf, xgboost, or nnet).

    model_training <- function(df, mode = "logistic"){
      
              if(mode == "logistic"){
                  print("Training a Logistic Regression Model...")
                  logistic_model <- glm(formula = loan_default ~ . , 
                                           family = binomial(link = 'logit'), data = df)
                  print("Logistic Regression Model complete")
                  return(logistic_model)
               
            } else if(mode == "rf"){
                  print("Training a Random Forest Classification Model...")
                  rf_model_ranger <- ranger(
                                     formula   = loan_default ~ ., 
                                     data      = df, 
                                     num.trees = 500,
                                     mtry      = floor(length(df) / 3),
                                     probability = TRUE,
                                     verbose = FALSE,
                                     classification = TRUE
                                     )
                  print("Random Forest Classification Model complete")
                  return(rf_model_ranger)         
              
            } else if(mode == "xgboost"){
                  print("Training an XGBoost Classification Model...")
                  # pre-process the data to obtain the dmatrix
                  df_dmatrix <- xgb_nnet_preprocess(df, mode)
                  xgb_model <- xgboost(data = df_dmatrix, nthread = 4, nrounds = 150,
                                       max.depth = 10, eta = 0.1, objective = "binary:logistic", verbose = FALSE)
                  print("XGBoost Training complete")
                  return(xgb_model)
              
        } else if(mode == "nnet"){
              print("Training a Neural Network Classification Model...")
              # pre-process the data to convert character variables to dummies
              df_nnet <- xgb_nnet_preprocess(df, mode)
              set.seed(1994) # set the seed before fitting so the initial weights are reproducible
              nnet_model <- nnet(loan_default ~ ., data = df_nnet, decay = 5e-4, 
                                 size = 20, maxit = 100, trace = F)
              print("NNET Training complete")
              return(nnet_model)
              
            } else {
          print("You did not enter a valid mode type: Enter logistic, rf, xgboost, or nnet for mode")
              
            }
          
    }
    Develop function to predict data

    The function model_prediction predicts the loan_default class for new data for each of the model types, converting the predicted probabilities to classes with a 0.5 cutoff.

    model_prediction = function(df, trained_model, model_type){
      
      # remove the response variable from the dataframe if it exists
      if("loan_default" %in% colnames(df)){
        test_data = df %>% select(-loan_default)
      } else {
        test_data = df
      }
      
      # make predictions
      if(model_type == "rf"){
        # ranger returns a matrix of class probabilities; column 1 is used here
        # (check colnames of $predictions to confirm which class it corresponds to,
        # since reading the wrong column would invert the predicted classes)
        predictions = predict(trained_model, data = test_data)$predictions[,1]
      } else if (model_type == "logistic"){
        predictions = predict(trained_model, newdata = test_data, type = "response")
      }  else if (model_type == "xgboost"){
        test_data_xgb = xgb_nnet_preprocess(df, model_type)
        predictions = predict(trained_model, newdata = test_data_xgb)
      } else{
        test_data_nnet = xgb_nnet_preprocess(test_data, model_type)
        predictions = predict(trained_model, newdata = test_data_nnet)
      }
      
      # convert probabilities to classes
      predicted = ifelse(predictions > 0.5, 1, 0)
      
      return(predicted)
    }
    Develop function to evaluate metrics of the model

    The function model_metrics computes accuracy, precision, recall, and AUC for a model and also prints its confusion matrix.

    model_metrics = function(actual, predicted){
      
      # accuracy
      accuracy_model = round((Metrics::accuracy(actual, predicted)), 4)
      # precision
      precision_model = round((Metrics::precision(actual, predicted)), 4)
      # recall 
      recall_model = round((Metrics::recall(actual, predicted)), 4)
      # auc
      auc_model = round((pROC::auc(actual, predicted)), 4)
      
      # model metrics
      model_eval_metrics = c(accuracy_model, precision_model, recall_model, auc_model) %>% t()
      column_names = c("Accuracy", "Precision", "Recall", "AUC")
      evaluation_metrics = data.frame(values = model_eval_metrics)
      colnames(evaluation_metrics) = column_names
      
      # confusion Matrix
      confusion_table = table(predicted, actual)
      confusion_matrix = caret::confusionMatrix(confusion_table)
      print("**********************************************************************")
      print(confusion_matrix)
      print("**********************************************************************")
      
      return(evaluation_metrics)
    }
    5.2.2.1 Model Training and Testing - Logistic Regression

    Train the logistic model

    logistic_model <- model_training(df_train, mode = "logistic")
    ## [1] "Training a Logistic Regression Model..."
    ## [1] "Logistic Regression Model complete"

    Evaluate the logistic Model

    logistic_actual = df_test$loan_default
    logistic_predicted = model_prediction(df_test, logistic_model, "logistic")
    logistic_model_metrics = model_metrics(logistic_actual, logistic_predicted)
    ## [1] "**********************************************************************"
    ## Confusion Matrix and Statistics
    ## 
    ##          actual
    ## predicted     0     1
    ##         0 27660  4961
    ##         1 27103 10222
    ##                                           
    ##                Accuracy : 0.5416          
    ##                  95% CI : (0.5379, 0.5453)
    ##     No Information Rate : 0.7829          
    ##     P-Value [Acc > NIR] : 1               
    ##                                           
    ##                   Kappa : 0.1168          
    ##                                           
    ##  Mcnemar's Test P-Value : <2e-16          
    ##                                           
    ##             Sensitivity : 0.5051          
    ##             Specificity : 0.6733          
    ##          Pos Pred Value : 0.8479          
    ##          Neg Pred Value : 0.2739          
    ##              Prevalence : 0.7829          
    ##          Detection Rate : 0.3954          
    ##    Detection Prevalence : 0.4664          
    ##       Balanced Accuracy : 0.5892          
    ##                                           
    ##        'Positive' Class : 0               
    ##                                           
    ## [1] "**********************************************************************"

    Display Model Metrics - Logistic Regression Model

    rownames(logistic_model_metrics) = "Logistic Model"
    logistic_model_metrics 
    ##                Accuracy Precision Recall    AUC
    ## Logistic Model   0.5416    0.2739 0.6733 0.5892
    5.2.2.2 Model Training and Testing - Random Forest

    Train a random forest classification model for the data

    rf_model <- model_training(df_train, mode = "rf")
    ## [1] "Training a Random Forest Classification Model..."
    ## [1] "Random Forest Classification Model complete"

    Evaluate the Random Forest Model

    rf_actual = df_test$loan_default
    rf_predicted = model_prediction(df_test, rf_model, "rf")
    rf_model_metrics = model_metrics(rf_actual, rf_predicted)
    ## [1] "**********************************************************************"
    ## Confusion Matrix and Statistics
    ## 
    ##          actual
    ## predicted     0     1
    ##         0  8012  3208
    ##         1 46751 11975
    ##                                           
    ##                Accuracy : 0.2857          
    ##                  95% CI : (0.2824, 0.2891)
    ##     No Information Rate : 0.7829          
    ##     P-Value [Acc > NIR] : 1               
    ##                                           
    ##                   Kappa : -0.0319         
    ##                                           
    ##  Mcnemar's Test P-Value : <2e-16          
    ##                                           
    ##             Sensitivity : 0.1463          
    ##             Specificity : 0.7887          
    ##          Pos Pred Value : 0.7141          
    ##          Neg Pred Value : 0.2039          
    ##              Prevalence : 0.7829          
    ##          Detection Rate : 0.1145          
    ##    Detection Prevalence : 0.1604          
    ##       Balanced Accuracy : 0.4675          
    ##                                           
    ##        'Positive' Class : 0               
    ##                                           
    ## [1] "**********************************************************************"

    Display Model Metrics - Random Forest Classification Model

    rownames(rf_model_metrics) = "Random Forest Model"
    rf_model_metrics 
    ##                     Accuracy Precision Recall    AUC
    ## Random Forest Model   0.2857    0.2039 0.7887 0.4675
    5.2.2.3 Model Training and Testing - XGBoost

    Train an XGBoost classification model for the data

    xgb_model <- model_training(df_train, mode = "xgboost")
    ## [1] "Training an XGBoost Classification Model..."
    ## [1] "XGBoost Training complete"

    Evaluate the XGBoost Model

    xgb_actual = df_test$loan_default
    xgb_predicted = model_prediction(df_test, xgb_model, "xgboost")
    xgb_model_metrics = model_metrics(xgb_actual, xgb_predicted)
    ## [1] "**********************************************************************"
    ## Confusion Matrix and Statistics
    ## 
    ##          actual
    ## predicted     0     1
    ##         0 31609  6375
    ##         1 23154  8808
    ##                                           
    ##                Accuracy : 0.5778          
    ##                  95% CI : (0.5742, 0.5815)
    ##     No Information Rate : 0.7829          
    ##     P-Value [Acc > NIR] : 1               
    ##                                           
    ##                   Kappa : 0.1124          
    ##                                           
    ##  Mcnemar's Test P-Value : <2e-16          
    ##                                           
    ##             Sensitivity : 0.5772          
    ##             Specificity : 0.5801          
    ##          Pos Pred Value : 0.8322          
    ##          Neg Pred Value : 0.2756          
    ##              Prevalence : 0.7829          
    ##          Detection Rate : 0.4519          
    ##    Detection Prevalence : 0.5430          
    ##       Balanced Accuracy : 0.5787          
    ##                                           
    ##        'Positive' Class : 0               
    ##                                           
    ## [1] "**********************************************************************"

    Display Model Metrics - XGBoost classification Model

    rownames(xgb_model_metrics) = "XGBoost"
    xgb_model_metrics
    ##         Accuracy Precision Recall    AUC
    ## XGBoost   0.5778    0.2756 0.5801 0.5787
    5.2.2.4 Model Training and Testing - Artificial Neural Networks

    Train a Neural Network model for the data

    nnet_model <- model_training(df_train, mode = "nnet")
    ## [1] "Training a Neural Network Classification Model..."
    ## [1] "NNET Training complete"

    Evaluate the NNET Model

    nnet_actual = df_test$loan_default
    nnet_predicted = model_prediction(df_test, nnet_model, "nnet")
    nnet_model_metrics = model_metrics(nnet_actual, nnet_predicted)
    ## [1] "**********************************************************************"
    ## Confusion Matrix and Statistics
    ## 
    ##          actual
    ## predicted     0     1
    ##         0 22452  3582
    ##         1 32311 11601
    ##                                           
    ##                Accuracy : 0.4868          
    ##                  95% CI : (0.4831, 0.4906)
    ##     No Information Rate : 0.7829          
    ##     P-Value [Acc > NIR] : 1               
    ##                                           
    ##                   Kappa : 0.1034          
    ##                                           
    ##  Mcnemar's Test P-Value : <2e-16          
    ##                                           
    ##             Sensitivity : 0.4100          
    ##             Specificity : 0.7641          
    ##          Pos Pred Value : 0.8624          
    ##          Neg Pred Value : 0.2642          
    ##              Prevalence : 0.7829          
    ##          Detection Rate : 0.3210          
    ##    Detection Prevalence : 0.3722          
    ##       Balanced Accuracy : 0.5870          
    ##                                           
    ##        'Positive' Class : 0               
    ##                                           
    ## [1] "**********************************************************************"

    Display Model Metrics - Neural Network Model

    rownames(nnet_model_metrics) = "Neural Network"
    nnet_model_metrics
    ##                Accuracy Precision Recall   AUC
    ## Neural Network   0.4868    0.2642 0.7641 0.587

    5.3 Results

    The table below compares the results of the four (4) models trained:

    results = rbind(logistic_model_metrics, rf_model_metrics, xgb_model_metrics, nnet_model_metrics)
    kable(results, "html") %>%
                            kable_paper("hover", full_width = F) %>%
                            scroll_box(width = "500px", height = "200px")
    Accuracy Precision Recall AUC
    Logistic Model 0.5416 0.2739 0.6733 0.5892
    Random Forest Model 0.2857 0.2039 0.7887 0.4675
    XGBoost 0.5778 0.2756 0.5801 0.5787
    Neural Network 0.4868 0.2642 0.7641 0.5870

    As the table shows, the models perform differently across the four metrics (Accuracy, Precision, Recall, AUC), and the metric used to select the final model depends on the goal of the analysis.
    For vehicle loan default prediction, the main goal is to identify applicants who will default on their loans. Note that the confusion matrices above treat 0 (no default) as the positive class, while the Recall and Precision values in the metrics table are computed with 1 (default) as the positive class.
    Since the main concern is correctly identifying defaults to minimize financial risk, recall is considered the most critical metric here. High recall ensures that most defaults are captured, even at the cost of some false alarms.
    Recall measures the proportion of actual defaults that are correctly predicted by the model; it reflects the model's ability to capture all true default cases. High recall means a low false-negative rate, which is crucial when missing an actual default is costly, even if it comes with more false positives.
    Precision, on the other hand, measures the proportion of correctly predicted defaults out of all instances predicted as defaults. High precision means a low false-positive rate, which matters when minimizing false alarms (creditworthy applicants flagged as defaulters) is a priority.
    Accuracy measures the overall correctness of predictions: the ratio of correctly predicted instances (both defaults and non-defaults) to the total. While accuracy is an intuitive metric, it can be misleading under class imbalance, which is typical of loan-default problems since defaults are far rarer than non-defaults.
    Lastly, AUC quantifies the model's ability to distinguish between the classes. It is the area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR) across threshold settings. It provides a useful summary of a classifier's discriminatory power, especially in scenarios with class imbalance.
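
    To make these definitions concrete, the metrics reported for the logistic regression model can be reproduced by hand from its confusion matrix above, treating 1 (default) as the positive class as the Metrics package does; with hard 0/1 predictions, the AUC reduces to the balanced accuracy.

    # Entries read from the logistic regression confusion matrix above
    TP <- 10222 # actual default (1) predicted as default
    FN <- 4961  # actual default predicted as no-default
    FP <- 27103 # actual no-default predicted as default
    TN <- 27660 # actual no-default predicted as no-default
    
    recall    <- TP / (TP + FN)                        # 10222 / 15183 = 0.6733
    precision <- TP / (TP + FP)                        # 10222 / 37325 = 0.2739
    accuracy  <- (TP + TN) / (TP + TN + FP + FN)       # 37882 / 69946 = 0.5416
    auc_hard  <- (TP / (TP + FN) + TN / (TN + FP)) / 2 # 0.5892 for hard 0/1 predictions
    round(c(accuracy, precision, recall, auc_hard), 4)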

    Looking at the results, all of the models are fairly competitive on recall, all performed poorly on precision, and most achieved roughly average accuracy. The AUC values are, for the most part, only modestly above 0.5, and I strongly believe the overall metrics of these models can be improved with higher-quality data, feature selection, and hyper-parameter tuning.

    5.4 Limitations of the model and data

    One major limitation of the model is the data source. The data was obtained from Kaggle, an open data platform, which raises questions about its reliability for real-world use. Some features, such as the credit-score risk classification, may not be available for young applicants or recent immigrants who have not yet built a credit history. Higher-quality data is required to obtain a better-performing model.
    Also, most of the models have tunable parameters that could yield better performance, but the models trained in this analysis did not undergo hyper-parameter tuning, so the best model in each category may not have been obtained. Tuning was avoided here because models like the Neural Network and the tree-based models can quickly become complex and may require substantial computational capacity for cross-validation and grid search. For example, the Random Forest was grown with an arbitrary 500 trees, without verifying whether that number, or the other parameter settings, was adequate. Similarly, the number of boosting rounds for the Extreme Gradient Boosting model was chosen arbitrarily rather than by searching over combinations of tunable parameters, and for the Neural Network we do not know what network depth or number of hidden nodes would produce the best-performing model.
    In addition, proper feature selection was not performed to determine whether certain features could be dropped or which features offer the most predictive value.
    Lastly, we did not compare the training error to the test error to determine whether the trained models suffer from overfitting or underfitting.

    5.5 Next Steps

    One of the most important next steps is to obtain better-quality data, possibly from a reputable financial institution. This may be difficult, since financial data are highly regulated and companies treat such data as part of their intellectual property and may not be willing to share it. If this modeling were done as part of an internal process at such an institution, better-quality data could be used to train the loan default model.

    In addition, it is very important to conduct hyper-parameter tuning to determine the parameter values best suited to each model. Note, however, that as more data is included, the computational resources needed for extensive tuning of complex models like Neural Networks and tree-based models increase significantly. A sketch of what such tuning could look like follows.
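    As an illustration only, the sketch below runs a small grid search over the Random Forest's num.trees and mtry settings; the grid values and the use of recall as the scoring metric are assumptions for demonstration, and in a full workflow each combination would be scored on a validation fold (or via cross-validation) rather than the final test set.

    # Hypothetical tuning sketch; grid values are illustrative placeholders
    tune_grid <- expand.grid(num_trees = c(250, 500, 1000), mtry = c(5, 8, 12))
    tune_grid$recall <- NA_real_
    for (i in seq_len(nrow(tune_grid))) {
      fit <- ranger(loan_default ~ ., data = df_train,
                    num.trees = tune_grid$num_trees[i], mtry = tune_grid$mtry[i],
                    probability = TRUE, classification = TRUE,
                    seed = 1994, verbose = FALSE)
      # take the probability column for the default class (check colnames of
      # $predictions to confirm which column corresponds to class 1)
      prob_default <- predict(fit, data = select(df_test, -loan_default))$predictions[, 2]
      tune_grid$recall[i] <- Metrics::recall(df_test$loan_default,
                                             ifelse(prob_default > 0.5, 1, 0))
    }
    tune_grid[which.max(tune_grid$recall), ] # best combination found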

    We can also perform feature selection and check for multicollinearity using the Variance Inflation Factor (VIF) approach, retaining only the features that provide predictive value. Other techniques such as principal component analysis (PCA) can be explored as well.
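    A minimal sketch of the VIF check mentioned above, assuming the car package is available (this was not part of the original analysis):

    library(car) # provides vif()
    # Variance inflation factors from the fitted logistic model; predictors with
    # values well above 5-10 suggest multicollinearity. Categorical predictors
    # (e.g., employment_type) are reported as generalized VIFs.
    car::vif(logistic_model)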

    Furthermore, next steps should include comparing the training error with the test error to determine whether a model is overfitting or underfitting, and appropriate steps should then be taken to mitigate whichever problem is found.
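
    As a minimal sketch of that comparison for the logistic regression model, reusing the helper functions defined earlier:

    # Compare accuracy on the training data vs. the held-out test data; a large
    # gap (training much higher than test) would point to overfitting.
    train_predicted <- model_prediction(df_train, logistic_model, "logistic")
    test_predicted  <- model_prediction(df_test, logistic_model, "logistic")
    round(c(train = Metrics::accuracy(df_train$loan_default, train_predicted),
            test  = Metrics::accuracy(df_test$loan_default, test_predicted)), 4)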


    1. CONCLUSION

    The purpose of this project was to develop a machine learning model for loan default prediction in the automotive credit industry, in order to minimize the financial losses that institutions suffer from loan defaults. Banks are keenly interested in identifying customers who are more likely to default on their auto loans and avoiding extending credit to them. The literature review shows that the current industry standard for credit risk modeling and loan default prediction is Logistic Regression. Throughout this study, Logistic Regression, Random Forest, Extreme Gradient Boosting, and Neural Network models were examined, and the finding is that the non-conventional tree-based and Neural Network models perform slightly better than the industry standard on some metrics. The industry-standard logistic regression is good enough and performs well relative to the other models, though not necessarily better than them. However, since Random Forest, XGBoost, and Neural Networks are difficult to interpret, it may be hard for banks to explain why a customer's loan application was rejected. The Logistic Regression model, by contrast, is far easier to interpret than its counterparts, and it would be difficult to sway loan officers and decision makers away from it without extracting substantially better performance from the other models, since its performance is not far behind theirs.
    Hence, I still recommend the industry-standard Logistic Regression model for its simplicity, its ease of understanding and interpretation, its short training and prediction time, and its competitive performance compared to the more complex models. Even so, I do not shut the door on the other models: as we continue to gather more information about customer behavior in an increasingly digital world, the dimensionality of each observation will likely keep growing, and Logistic Regression may struggle to retain predictive power on highly dimensional data. In that scenario, models like Neural Networks and tree-based models may come in handy.


    1. REFERENCES

    Agarwal, S., Ambrose, B. W., & Chomsisengphet, S. (2008). Determinants of automobile loan default and prepayment. Economic Perspectives - Federal Reserve Bank of Chicago.

    Altman, E. I., & Saunders, A. (1998). Credit risk measurement: Developments over the last 20 years. Journal of Banking and Finance, 21(11-12), 1721–1742.

    Agrawal, A., Agrawal, M., & Raizada, D. A. (2014). Predicting defaults in commercial vehicle loans using logistic regression: The case of an Indian NBFC. International Journal of Research in Commerce and Management, 5, 22–28.

    Brownlee, J. (2019, August 12). Overfitting and Underfitting With Machine Learning Algorithms. Machine Learning Mastery. Retrieved November 4, 2023, from https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/#:%7E:text=Overfitting%3A%20Good%20performance%20on%20the,poor%20generalization%20to%20other%20data

    Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.

Crook, J. N., Edelman, D. B., & Thomas, L. C. (2007). Recent developments in consumer credit risk assessment. European Journal of Operational Research, 183(3), 1447–1465.

    Cromer, O. C., Purdy, K. W., Cromer, G. C., & Foster, C. G. (2023, October 13). Automobile | Definition, History, industry, design, & Facts. Encyclopedia Britannica. https://www.britannica.com/technology/automobile

Diez, D., Barr, C. D., & Cetinkaya-Rundel, M. (2019). OpenIntro Statistics (4th ed.). OpenIntro.

    Education, I. C. (2021, March 25). Underfitting. IBM Cloud Learn. Retrieved November 5, 2023, from https://www.ibm.com/cloud/learn/underfitting

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.

    GeeksforGeeks. (2021, October 20). ML | Underfitting and Overfitting. Retrieved November 4, 2023, from https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/

Hand, D. J., & Henley, W. E. (1997). Statistical classification methods in consumer credit scoring: A review. Journal of the Royal Statistical Society: Series A (Statistics in Society), 160(3), 523–541.

Hao, C., Alam, M. M., & Carling, K. (2010). Review of the literature on credit risk modeling: Development of the past 10 years. Banks and Bank Systems, 5(3).

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

Lessmann, S., Baesens, B., Seow, H.-V., & Thomas, L. C. (2015). Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. European Journal of Operational Research, 247(1), 124–136.

    Martin, A. (2023). What is an auto loan? Bankrate. https://www.bankrate.com/loans/auto-loans/what-is-an-auto-loan/

    Model Fit: Underfitting vs. Overfitting - Amazon Machine Learning. (n.d.). Amazon Machine Learning Developer Guide. Retrieved November 4, 2023, from https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html

    O’Brien, S. (2023, February 4). Auto loan delinquencies are rising. Here’s what to do if you’re struggling with payments. CNBC. https://www.cnbc.com/2023/02/04/auto-loan-delinquencies-rise-what-to-do-if-you-struggle-with-payments.html

    Probasco, J. (2023). Expert explanation of how auto loans work. Investopedia. https://www.investopedia.com/how-car-loans-work-5202265

    U.S. vehicle fleet 1990-2021 | Statista. (2023, August 24). Statista. https://www.statista.com/statistics/183505/number-of-vehicles-in-the-united-states-since-1990/

    U.S.: average selling price of new vehicles 2022 | Statista. (2023, June 7). Statista. https://www.statista.com/statistics/274927/new-vehicle-average-selling-price-in-the-united-states/

    Witkowski, R. (2023, June 2). For People Under 30, Car Loan Delinquencies Hit A 15-Year High. Is The Economy Running Out Of Gas? Forbes Advisor. https://www.forbes.com/advisor/auto-loans/car-loan-late-payments/


    1. APPENDIX

This section contains all the code used in the analysis, in order of execution; no output is shown here.
# Load libraries
library(Amelia) # To visualize missing data
library(caret) # For the confusion matrix
library(caTools) # For the train/test split
library(corrplot) # To plot correlation plot
library(cowplot) # To combine plots in a grid
library(fastDummies) # To convert character variables to dummies
library(ggcorrplot) # To plot correlation plot
library(kableExtra) # To style tables
library(knitr) # For kable()
library(lubridate) # For dmy() date parsing (attached explicitly; only a core tidyverse package from tidyverse 2.0.0)
library(Metrics) # For model evaluation
library(nnet) # For the neural network model
library(pROC) # For AUC and ROC curves
library(ranger) # For random forest implementation
library(ROSE) # To balance the data
library(tidyverse)
library(xgboost) # For gradient boosting
    
    # Read the data
    url_vehicle_loan_default_train = "https://raw.githubusercontent.com/chinedu2301/data698-analytics-project/main/data/vehicle_default_train_data.csv"
    url_vehicle_loan_default_test = "https://raw.githubusercontent.com/chinedu2301/data698-analytics-project/main/data/vehicle_default_test_data.csv"
    vehicle_loan_default_train_raw = read_csv(url_vehicle_loan_default_train) %>% as_tibble()
    vehicle_loan_default_test_raw = read_csv(url_vehicle_loan_default_test) %>% as_tibble()
    
    # data_cleaning
    data_cleaning <- function(df){
      # This function accepts a dataframe (df) as input and returns another dataframe (cleaned_df) that is clean.
      tryCatch({
      if(is_tibble(df) | is.data.frame(df)){
        print("The dataframe is a tibble, will proceed to clean data")
        print("Data Cleaning in progress...")
        # rename all the columns in the dataframe to lowercase
        cleaned_df <- df %>% rename_all(tolower) %>%  
                    # compute the age of the applicant in number of years
                    mutate(date_of_birth = dmy(date_of_birth),  
                          disbursal_date = dmy(disbursal_date), 
                          age = difftime(disbursal_date, date_of_birth, units = "days"),
                          age_years = round(as.numeric(age / 365.25), 0)) %>%
                   # extract the years and month component of the average_acct_age  and convert to years
                    mutate(average_acct_age_year_comp = as.numeric(str_extract(average_acct_age, "\\d+")),
                          average_acct_age_mon_comp = as.numeric(str_extract(average_acct_age, "\\d+(?=mon)")),
                          average_acct_age = round((average_acct_age_year_comp + average_acct_age_mon_comp/12), 0)
                           ) %>%
                    # extract the years and month component of the credit_history_length and convert to years
                    mutate(credit_history_length_year_comp = as.numeric(str_extract(credit_history_length, "\\d+")),
                           credit_history_length_comp = as.numeric(str_extract(credit_history_length, "\\d+(?=mon)")),
                           credit_history_length = round((credit_history_length_year_comp + credit_history_length_comp/12), 0)
                           )  %>%
                    # clean up the perform_cns_score_distribution to include only a few categories
                    mutate(lowercase_cns_description = tolower(perform_cns_score_description),
                           perform_cns_score_description = case_when(
                                    str_detect(lowercase_cns_description, "very low risk") ~ "very_low_risk",
                                    str_detect(lowercase_cns_description, "low risk") ~ "low_risk",
                                    str_detect(lowercase_cns_description, "medium risk") ~ "medium_risk",
                                    str_detect(lowercase_cns_description, "high risk") ~ "high_risk",
                                    str_detect(lowercase_cns_description, "very high risk") ~ "very_high_risk",
                                    str_detect(lowercase_cns_description, "not scored|no bureau") ~ "low_risk",
                                    TRUE ~ "none"))  %>%
                    # clean up the employment type to have only few categories
                    mutate(lower_case_employment_type = tolower(employment_type),
                           employment_type = case_when(
                                    str_detect(lower_case_employment_type, "salaried") ~ "salaried",
                                    str_detect(lower_case_employment_type, "self employed") ~ "self_employed",
                                    TRUE ~ "not_reported")) %>%
                    # select only the required columns
                    select(
                      age_years, disbursed_amount, asset_cost, ltv, employment_type, perform_cns_score_description,
                      pri_no_of_accts, pri_active_accts, pri_overdue_accts, pri_current_balance, pri_sanctioned_amount,
                      pri_disbursed_amount, sec_no_of_accts, sec_active_accts, sec_overdue_accts, sec_current_balance,
                      sec_sanctioned_amount, sec_disbursed_amount, primary_instal_amt, sec_instal_amt, new_accts_in_last_six_months,
                      delinquent_accts_in_last_six_months, average_acct_age, credit_history_length, no_of_inquiries, loan_default
                    )
                    print("Data Cleaning complete!!!")
        return(cleaned_df)
        
      }
      else{
      print("The dataframe is not a tibble. Kindly have your data in the form of a dataframe or a tibble")
      }
        
      },
      #if an error occurs, tell me the error
      error=function(e) {
            message('An Error Occurred')
            print(e)
            },
      #or if a warning occurs, tell me the warning
      warning=function(w) {
            message('A Warning Occurred')
            print(w)
            return(NA)
            }
        )
      
    }
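
# Illustrative check (not part of the original pipeline): data_cleaning()
# assumes tenure fields such as average_acct_age and credit_history_length
# arrive as strings like "2yrs 10mon"; verify the regexes on a sample value.
sample_acct_age <- "2yrs 10mon"
as.numeric(str_extract(sample_acct_age, "\\d+"))        # years component -> 2
as.numeric(str_extract(sample_acct_age, "\\d+(?=mon)")) # months component -> 10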
    
    # clean the data
    vehicle_loan_default_train_cleaned = data_cleaning(vehicle_loan_default_train_raw)
    
    # data pre-processing
    data_preprocess_scaling <- function(df){
                          # This helper function standardizes the numeric variables of the df using the standard normal method
                          df <- as.data.frame(df)
                          df_char <- df %>% select(loan_default, employment_type, perform_cns_score_description) 
                          df_numeric <- df %>% select(-loan_default, -employment_type, -perform_cns_score_description)
                          df_numeric_scaled <- df_numeric %>% mutate_all( ~ (scale(.) %>% as.vector))
                          df_scaled_combined <- cbind(df_char, df_numeric_scaled)
                          return(df_scaled_combined)
    }
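
# Illustrative check (not part of the original pipeline): after scaling,
# each numeric column should have mean ~0 and standard deviation ~1, while
# loan_default and the two character columns are left untouched.
scaled_demo <- data_preprocess_scaling(vehicle_loan_default_train_cleaned)
round(mean(scaled_demo$ltv), 4) # approximately 0
round(sd(scaled_demo$ltv), 4)   # approximately 1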
    
    
xgb_nnet_preprocess <- function(df, mode){
  # convert the categorical variables to dummies
  df2 <- dummy_cols(df, select_columns = c("employment_type","perform_cns_score_description"), 
                    remove_selected_columns = TRUE) %>% as.data.frame()
  
  if(mode == "xgboost"){
    # prepare the xgb.DMatrix used in xgboost training/prediction;
    # this assumes loan_default is the first column of df
    df2_train <- as.data.frame(df2[, -1])
    df2_label <- as.data.frame(df2[, 1])
    df_dmatrix <- xgb.DMatrix(as.matrix(sapply(df2_train, as.numeric)), label = as.matrix(df2_label))
    return(df_dmatrix)
    
  } else if(mode == "nnet"){
    # nnet works directly with the dummy-encoded dataframe, so no
    # DMatrix is built here (it would be wrong for label-free test data)
    return(df2)
    
  } else{
    print("Mode not supported.")
  }
  
}
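
# Illustrative check (not part of the original pipeline): in "nnet" mode
# the function returns a dummy-encoded dataframe, with one 0/1 column per
# level of the two categorical variables.
nnet_demo <- xgb_nnet_preprocess(vehicle_loan_default_train_cleaned, mode = "nnet")
grep("employment_type|perform_cns_score", colnames(nnet_demo), value = TRUE)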
    
    
    data_preprocessing <- function(df, mode = "train"){
        # This function pre-processes the cleaned data and get the data ready for training.
        tryCatch({
          if(is_tibble(df) | is.data.frame(df)){
            if(mode == "test"){
               df_scaled <- data_preprocess_scaling(df)
               print("Data Pre-processing complete")
               return(df_scaled)
               
            } else if(mode == "train"){
              curr_frame <<- sys.nframe() # sends the current frame to the global environment.
              # The ovun.sample function in the ROSE package assumes the data to be in the global env so you have to tell it  
              # which frame (scope) to find the data else this will fail if executed inside a function.
          df_ovun <- ovun.sample(formula = formula(loan_default ~ .), data = get("df", sys.frame(curr_frame)),
                                 N = 1.5 * nrow(df), seed = 1994, method = "both")$data %>% as.data.frame() %>% as_tibble()
              print("Oversampling and undersampling completed")
              df_scaled <- data_preprocess_scaling(df_ovun)
              print("Data Pre-processing complete!")
              return(df_scaled)
              
            } else {
              print("You did not enter a valid mode type: Enter train or test for mode")
              
            }
          }
          
        else{
        print("The dataframe is not a tibble. Kindly have your data in the form of a dataframe or a tibble")
      }
        
      },
      #if an error occurs, tell me the error
      error=function(e) {
            message('An Error Occurred')
            print(e)
            },
      #or if a warning occurs, tell me the warning
      warning=function(w) {
            message('A Warning Occurred')
            print(w)
            return(NA)
            }
        )
      
    }
    
    # Train Test Split
    # Set a seed
    set.seed(1994)
    #Split the sample
    sampling <- sample.split(vehicle_loan_default_train_cleaned$loan_default, SplitRatio = 0.7) 
    # Training Data
    df_train_subset <- subset(vehicle_loan_default_train_cleaned, sampling == TRUE)
    # Testing Data
    df_test_subset <- subset(vehicle_loan_default_train_cleaned, sampling == FALSE)
    
    
    df_train = data_preprocessing(df_train_subset, mode = "train")
    df_test = data_preprocessing(df_test_subset, mode = "test")
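
# Illustrative check (not part of the original pipeline): the combined
# over/under-sampling should leave the training classes roughly balanced,
# while the untouched test split keeps the original class imbalance.
prop.table(table(df_train$loan_default))
prop.table(table(df_test$loan_default))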
    
    # Model training function
    model_training <- function(df, mode = "logistic"){
      
              if(mode == "logistic"){
                  print("Training a Logistic Regression Model...")
                  logistic_model <- glm(formula = loan_default ~ . , 
                                           family = binomial(link = 'logit'), data = df)
                  print("Logistic Regression Model complete")
                  return(logistic_model)
               
            } else if(mode == "rf"){
                  print("Training a Random Forest Classification Model...")
                  rf_model_ranger <- ranger(
                                     formula   = loan_default ~ ., 
                                     data      = df, 
                                     num.trees = 500,
                                     mtry      = floor(length(df) / 3),
                                     probability = TRUE,
                                     verbose = FALSE,
                                     classification = TRUE
                                     )
                  print("Random Forest Classification Model complete")
                  return(rf_model_ranger)         
              
            } else if(mode == "xgboost"){
                  print("Training an XGBoost Classification Model...")
                  # pre-process the data to obtain the dmatrix
                  df_dmatrix <- xgb_nnet_preprocess(df, mode)
                  xgb_model <- xgboost(data = df_dmatrix, nthread = 4, nrounds = 150,
                                       max.depth = 10, eta = 0.1, objective = "binary:logistic", verbose = FALSE)
                  print("XGBoost Training complete")
                  return(xgb_model)
              
            } else if(mode == "nnet"){
                  print("Training a Neural Network Classification Model...")
                  # pre-process the data to convert character variables to dummies
                  df_nnet <- xgb_nnet_preprocess(df, mode)
                  set.seed(1994) # set the seed before training for reproducible weight initialization
                  nnet_model <- nnet(loan_default ~ ., data = df_nnet, decay = 5e-4, 
                                     size = 20, maxit = 100, trace = FALSE)
                  print("NNET Training complete")
                  return(nnet_model)
              
            } else {
              print("You did not enter a valid mode type: Enter logistic, rf, xgboost or svm")
              
            }
          
    }
    
    # model prediction function
    model_prediction = function(df, trained_model, model_type){
      
      # remove the response variable from the dataframe if it exists
      if("loan_default" %in% colnames(df)){
        test_data = df %>% select(-loan_default)
      } else {
        test_data = df
      }
      
      # make predictions
      if(model_type == "rf"){
        predictions = predict(trained_model, data = test_data)$predictions[,1]
      } else if (model_type == "logistic"){
        predictions = predict(trained_model, newdata = test_data, type = "response")
      }  else if (model_type == "xgboost"){
        test_data_xgb = xgb_nnet_preprocess(df, model_type)
        predictions = predict(trained_model, newdata = test_data_xgb)
      } else{
        test_data_nnet = xgb_nnet_preprocess(test_data, model_type)
        predictions = predict(trained_model, newdata = test_data_nnet)
      }
      
      # convert probabilities to classes
      predicted = ifelse(predictions > 0.5, 1, 0)
      
      return(predicted)
    }
    
    # Model Metrics
    model_metrics = function(actual, predicted){
      
      # accuracy
      accuracy_model = round((Metrics::accuracy(actual, predicted)), 4)
      # precision
      precision_model = round((Metrics::precision(actual, predicted)), 4)
      # recall 
      recall_model = round((Metrics::recall(actual, predicted)), 4)
      # auc
      auc_model = round((pROC::auc(actual, predicted)), 4)
      
      # model metrics
      model_eval_metrics = c(accuracy_model, precision_model, recall_model, auc_model) %>% t()
      column_names = c("Accuracy", "Precision", "Recall", "AUC")
      evaluation_metrics = data.frame(values = model_eval_metrics)
      colnames(evaluation_metrics) = column_names
      
      # confusion Matrix
      confusion_table = table(predicted, actual)
      confusion_matrix = caret::confusionMatrix(confusion_table)
      print("**********************************************************************")
      print(confusion_matrix)
      print("**********************************************************************")
      
      return(evaluation_metrics)
    }
    
    # Train Logistic Model
    logistic_model <- model_training(df_train, mode = "logistic")
    logistic_actual = df_test$loan_default
    logistic_predicted = model_prediction(df_test, logistic_model, "logistic")
    logistic_model_metrics = model_metrics(logistic_actual, logistic_predicted)
    rownames(logistic_model_metrics) = "Logistic Model"
    
    # Train RF model
    rf_model <- model_training(df_train, mode = "rf")
    rf_actual = df_test$loan_default
    rf_predicted = model_prediction(df_test, rf_model, "rf")
    rf_accuracy = accuracy(rf_actual, rf_predicted)
    rf_model_metrics = model_metrics(rf_actual, rf_predicted)
    rownames(rf_model_metrics) = "Random Forest Model"
    
    # Train an XGBoost model
    xgb_model <- model_training(df_train, mode = "xgboost")
    xgb_actual = df_test$loan_default
    xgb_predicted = model_prediction(df_test, xgb_model, "xgboost")
    xgb_model_metrics = model_metrics(xgb_actual, xgb_predicted)
    rownames(xgb_model_metrics) = "XGBoost"
    
    # Train an NNET model
    nnet_model <- model_training(df_train, mode = "nnet")
    nnet_actual = df_test$loan_default
    nnet_predicted = model_prediction(df_test, nnet_model, "nnet")
    nnet_model_metrics = model_metrics(nnet_actual, nnet_predicted)
    rownames(nnet_model_metrics) = "Neural Network"
    
    # Result
    results = rbind(logistic_model_metrics, rf_model_metrics, xgb_model_metrics, nnet_model_metrics)
    kable(results, "html") %>%
                            kable_paper("hover", full_width = F) %>%
                            scroll_box(width = "500px", height = "200px")
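
# Illustrative extension (not part of the original analysis): the metrics
# above are computed from hard 0/1 predictions; plotting the ROC curve on
# the predicted probabilities of the logistic model gives a threshold-free
# view of its discriminatory power.
logistic_prob <- predict(logistic_model, newdata = df_test, type = "response")
roc_logistic <- pROC::roc(df_test$loan_default, logistic_prob)
plot(roc_logistic, main = "ROC Curve - Logistic Regression Model")
pROC::auc(roc_logistic)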