Introduction: Three V’s

Big data is commonly described in terms of three V’s: volume, velocity, and variety.
Volume comes from the amount of information a data source produces; typically the source seems vast and may have no end to its growth.
Velocity refers to the speed at which that data is created and collected.
Variety refers to the diversity of data points represented in it.

Introduction: Banking

As one might imagine, big data is intertwined with, and a product of, daily human interactions such as those associated with banking. This has not gone unnoticed by the banking industry.
According to Garadis (n.d.), the banking industry is rife with data, and daily transactions are a prime example. The value of that data may be vast, and the insights it could provide may be under-leveraged.
Generally, the data generated and collected within the industry may help institutions improve customer experience through the understanding it provides, better position their internal interests, and much more. Ultimately, these things may work to increase revenue and/or limit losses.

Introduction: Loan default

One may imagine, then, that data concerning loans, a source of revenue for many banking institutions, is highly valued and thoroughly analysed, especially information on defaulting.
According to Brozic (2023), loan default may be considered a label applied by a lender to a borrower who has missed or neglected to make payments on a loan provided by that lender within a given time period.
Defaulting on a loan is a terrible experience, one that most wish to avoid. It may leave long-lasting effects on one’s mental health and have consequences that limit one’s quality of life, earning potential, and much more. Consequently, the information gleaned from loan default data may be useful.

Data Set Description

The data set used is called the “BankLoanDefaultDataset”.
It is taken from the book Applied Analytics through Case Studies Using SAS and R by Deepti Gupta (Apress, ISBN 978-1-4842-3525-6).
Where it was posted, it was described as “…a subset of a large data set. The structure of the data set is simple. It can be used for logistic and binary classification / predictive models and algorithms”.
Default is used as its response variable. There are 1000 observations and 16 variables: the response and 15 predictors.
The structure of the data set follows.

Variable	Description
Default \((y)\)	1 = bank loan default, 0 = no bank loan default
Checking_amount \((x_1)\)	Numeric
Term \((x_2)\)	Numeric, in months
Credit_score \((x_3)\)	Numeric
Gender \((x_4)\)	Categorical
Marital_status \((x_5)\)	Categorical
Car_loan \((x_6)\)	1 = has a car loan, 0 = no car loan (stored as numeric)
Personal_loan \((x_7)\)	1 = has a personal loan, 0 = no personal loan (stored as numeric)
Home_loan \((x_8)\)	1 = has a home loan, 0 = no home loan (stored as numeric)
Education_loan \((x_9)\)	1 = has an education loan, 0 = no education loan (stored as numeric)
Emp_status \((x_{10})\)	Categorical
Amount \((x_{11})\)	Numeric
Saving_amount \((x_{12})\)	Numeric
Emp_duration \((x_{13})\)	Numeric, in months
Age \((x_{14})\)	Numeric, in years
No_of_credit_acc \((x_{15})\)	Numeric

Research Questions

The primary question for this analysis is how well the explored predictive models perform at predicting loan default.

Exploratory Data Analysis

Exploratory data analysis was conducted first.
- Before the models could be explored, they needed to be built.
- Before they could be built, some exploration of the data set was necessary.
The following is the structure of the data set.
- At the top, it can be seen that there are 1000 observations and 16 variables.
- Default, the response variable, is stored as an integer.
- Several of the variables stored as integers seem to take only the values 0 and 1.
- There are also variables stored as characters, each of which seems to have at least two categories.
- Both of those variable types were treated as factors after exploring the summary statistics for each variable and the data set description.
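
The screening rule described in the bullets above can be sketched in a few lines. This is an illustration only, not the code used in the analysis: the helper name `looks_categorical` is hypothetical, and the sample holds just the first ten values of a few columns (for Gender, the first four), copied from the structure printout.

```python
# First values of a few columns, copied from the structure printout.
sample = {
    "Default": [0, 0, 0, 1, 1, 0, 0, 0, 0, 1],
    "Term": [15, 15, 14, 25, 24, 20, 13, 16, 20, 19],
    "Gender": ["Female", "Female", "Female", "Female"],
    "Car_loan": [1, 1, 0, 0, 0, 1, 1, 1, 1, 1],
    "Age": [38, 36, 34, 29, 30, 32, 38, 36, 36, 29],
}


def looks_categorical(values):
    """Hypothetical rule: text columns and integer columns coded only as 0/1."""
    if all(isinstance(v, str) for v in values):
        return True
    return set(values) <= {0, 1}


# Columns that the rule flags as candidates for conversion to factors.
factor_candidates = [name for name, values in sample.items()
                     if looks_categorical(values)]
print(factor_candidates)
```

A rule this simple would not flag No_of_credit_acc, which takes more than two integer values; that variable needs a separate look at its distinct values.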

Exploratory Data Analysis: Data Structure Table

'data.frame':   1000 obs. of  16 variables:
 $ Default         : int  0 0 0 1 1 0 0 0 0 1 ...
 $ Checking_amount : int  988 458 158 300 63 1071 -192 172 585 189 ...
 $ Term            : int  15 15 14 25 24 20 13 16 20 19 ...
 $ Credit_score    : int  796 813 756 737 662 828 856 763 778 649 ...
 $ Gender          : chr  "Female" "Female" "Female" "Female" ...
 $ Marital_status  : chr  "Single" "Single" "Single" "Single" ...
 $ Car_loan        : int  1 1 0 0 0 1 1 1 1 1 ...
 $ Personal_loan   : int  0 0 1 0 0 0 0 0 0 0 ...
 $ Home_loan       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Education_loan  : int  0 0 0 1 1 0 0 0 0 0 ...
 $ Emp_status      : chr  "employed" "employed" "employed" "employed" ...
 $ Amount          : int  1536 947 1678 1804 1184 475 626 1224 1162 786 ...
 $ Saving_amount   : int  3455 3600 3093 2449 2867 3282 3398 3022 3475 2711 ...
 $ Emp_duration    : int  12 25 43 0 4 12 11 12 12 0 ...
 $ Age             : int  38 36 34 29 30 32 38 36 36 29 ...
 $ No_of_credit_acc: int  1 1 1 1 1 2 1 1 1 1 ...

Exploratory Data Analysis: Summary Statistics

The following are summary statistics for each of the variables, followed by a frequency table for the variable No_of_credit_acc.
Note that this variable takes only 9 distinct values and is also stored as an integer.
For those reasons it was also assumed to be a factor variable and was converted into one.
Lastly, the summary output shows no missing values in the data set.

Exploratory Data Analysis: Data

Data Summary

Variable	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
Default	0.0	0.0	0.0	0.3	1.0	1.0
Checking_amount	-665.0	164.8	351.5	362.4	553.5	1319.0
Term	9.00	16.00	18.00	17.82	20.00	27.00
Credit_score	376.0	725.8	770.5	760.5	812.0	1029.0
Gender	character (length 1000)
Marital_status	character (length 1000)
Car_loan	0.000	0.000	0.000	0.353	1.000	1.000
Personal_loan	0.000	0.000	0.000	0.474	1.000	1.000
Home_loan	0.000	0.000	0.000	0.056	0.000	1.000
Education_loan	0.000	0.000	0.000	0.112	0.000	1.000
Emp_status	character (length 1000)
Amount	244	1016	1226	1219	1420	2362
Saving_amount	2082	2951	3203	3179	3402	4108
Emp_duration	0.00	15.00	41.00	49.39	85.00	120.00
Age	18.00	29.00	32.00	31.21	34.00	42.00
No_of_credit_acc	1.000	1.000	2.000	2.546	3.000	9.000

Frequency Table: No_of_credit_acc

No_of_credit_acc	Freq
1	308
2	325
3	119
4	105
5	109
6	6
7	8
8	6
9	14

Exploratory Data Analysis: Response Variable

Next, the response variable was explored.
From the previous information, it is a binary variable that uses the values 0 and 1 to encode a borrower’s loan default status.
A proportion table for these two levels follows.
From it, there are 7/3 (about 2.3) times as many borrowers who did not default as borrowers who did.
Note that the response variable is a binary factor variable that was stored as an integer, with the integer 1 associated with bank loan default, so \(P(Y=1|X)=P(Default=1|X)\).
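
The class balance can be worked through with exact fractions. The sketch assumes the printed proportions 0.7 and 0.3 are exact for the 1000 observations; the variable names are illustrative.

```python
from fractions import Fraction

n_obs = 1000                      # observations in the data set
p_no_default = Fraction(7, 10)    # proportion with Default = 0
p_default = Fraction(3, 10)       # proportion with Default = 1

# Ratio of non-defaulters to defaulters.
ratio = p_no_default / p_default

# Implied counts, assuming the proportions are exact.
counts = {
    "no_default": int(p_no_default * n_obs),
    "default": int(p_default * n_obs),
}
print(ratio, round(float(ratio), 2), counts)
```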

Exploratory Data Analysis: Response Variable Proportion Table

Proportions: Default
Default	Proportion
0	0.7
1	0.3

Exploratory Data Analysis: Correlations

Next, correlations between the numeric variables that were not turned into factors were assessed.
Correlation among these variables was low: no pair had a correlation above .35 in absolute value.
In fact, Checking_amount, Term, Credit_score, and Saving_amount each had their strongest correlation with Age.
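
These two observations can be checked directly against the reported coefficients. The sketch below hard-codes the correlation matrix from this analysis, rounded to four decimals, and finds each variable's strongest off-diagonal partner; the helper name `strongest_partner` is illustrative.

```python
names = ["Checking_amount", "Term", "Credit_score", "Amount",
         "Saving_amount", "Emp_duration", "Age"]

# Reported correlation coefficients, rounded to four decimals.
corr = [
    [1.0000, -0.1916, 0.1893, -0.1153, 0.2014, 0.0698, 0.2974],
    [-0.1916, 1.0000, -0.1954, 0.0541, -0.1868, -0.0637, -0.2444],
    [0.1893, -0.1954, 1.0000, -0.0784, 0.2138, 0.0676, 0.3281],
    [-0.1153, 0.0541, -0.0784, 1.0000, -0.0097, 0.0179, -0.1078],
    [0.2014, -0.1868, 0.2138, -0.0097, 1.0000, 0.0909, 0.3431],
    [0.0698, -0.0637, 0.0676, 0.0179, 0.0909, 1.0000, 0.0798],
    [0.2974, -0.2444, 0.3281, -0.1078, 0.3431, 0.0798, 1.0000],
]


def strongest_partner(i):
    """Return (name, r) for the off-diagonal entry with the largest |r| in row i."""
    others = (k for k in range(len(names)) if k != i)
    j = max(others, key=lambda k: abs(corr[i][k]))
    return names[j], corr[i][j]


partners = {names[i]: strongest_partner(i) for i in range(len(names))}
largest = max(abs(r) for _, r in partners.values())

for name, (partner, r) in partners.items():
    print(f"{name}: {partner} ({r:+.4f})")
print("largest |r|:", largest)
```

Amount and Emp_duration are the two variables whose strongest partner is not Age, and both of their strongest correlations are below 0.12 in absolute value.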

Exploratory Data Analysis: Correlations Matrix

Correlations Matrix
	Checking_amount	Term	Credit_score	Amount	Saving_amount	Emp_duration	Age
Checking_amount	1.0000000	-0.1916292	0.1892957	-0.1153301	0.2013942	0.0698080	0.2974109
Term	-0.1916292	1.0000000	-0.1954363	0.0540702	-0.1868427	-0.0637356	-0.2443853
Credit_score	0.1892957	-0.1954363	1.0000000	-0.0783984	0.2138242	0.0676228	0.3280754
Amount	-0.1153301	0.0540702	-0.0783984	1.0000000	-0.0097196	0.0179394	-0.1077698
Saving_amount	0.2013942	-0.1868427	0.2138242	-0.0097196	1.0000000	0.0909485	0.3430830
Emp_duration	0.0698080	-0.0637356	0.0676228	0.0179394	0.0909485	1.0000000	0.0798093
Age	0.2974109	-0.2443853	0.3280754	-0.1077698	0.3430830	0.0798093	1.0000000

Exploratory Data Analysis: Variables Transformations

Lastly, after the variables were transformed into factors, the data set was re-examined. The following is a printout of its structure.

'data.frame':   1000 obs. of  16 variables:
 $ Default         : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 1 1 1 2 ...
 $ Checking_amount : int  988 458 158 300 63 1071 -192 172 585 189 ...
 $ Term            : int  15 15 14 25 24 20 13 16 20 19 ...
 $ Credit_score    : int  796 813 756 737 662 828 856 763 778 649 ...
 $ Gender          : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 2 1 1 2 ...
 $ Marital_status  : Factor w/ 2 levels "Married","Single": 2 2 2 2 2 1 2 2 2 1 ...
 $ Car_loan        : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 2 2 2 2 ...
 $ Personal_loan   : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
 $ Home_loan       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ Education_loan  : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 1 1 1 1 ...
 $ Emp_status      : Factor w/ 2 levels "employed","unemployed": 1 1 1 1 2 1 1 1 2 1 ...
 $ Amount          : int  1536 947 1678 1804 1184 475 626 1224 1162 786 ...
 $ Saving_amount   : int  3455 3600 3093 2449 2867 3282 3398 3022 3475 2711 ...
 $ Emp_duration    : int  12 25 43 0 4 12 11 12 12 0 ...
 $ Age             : int  38 36 34 29 30 32 38 36 36 29 ...
 $ No_of_credit_acc: Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 2 1 1 1 1 ...

Model Selection: Multiple Logistic Regression

Before training the predictive models and exploring their performance, the models had to be built.
Simple criteria were used to do that.
- The first model contains all variables from the data set.
- The second model contains only those variables that were found to be statistically significant in the first model.
- The third model was produced by an automatic variable selection procedure, which chooses as its final model the candidate with the lowest AIC value.
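
For reference, the AIC of a fitted model is \(2k - 2\ln(\hat{L})\), where \(k\) is the number of estimated coefficients and \(\hat{L}\) is the maximized likelihood. The sketch below shows how a lowest-AIC rule picks among candidates. Every number in it is invented for illustration: the labels, log-likelihoods, and coefficient counts are not taken from the models in this analysis.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2*ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical candidates: (label, maximized log-likelihood, number of coefficients).
candidates = [
    ("A", -210.0, 23),
    ("B", -222.0, 6),
    ("C", -214.0, 11),
]

scores = {label: aic(ll, k) for label, ll, k in candidates}

# The selection rule keeps the candidate with the smallest AIC.
best = min(scores, key=scores.get)
print(scores, best)
```

In this made-up example the largest model fits best but pays the largest penalty, so a mid-sized candidate wins.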

Model Selection: Multicollinearity

After the models were built, multicollinearity was explored within each model using variance inflation factors.
The following output displays that information for each model. For models 2 and 3, none of the values strayed far from one.
The same may be said for most variables in model 1.
However, for each of the loan types, even after adjustment, the values were all greater than 3, and some were even greater than 9.
Why this is the case is unknown.
However, after exploring the raw data, it was found that all observations had at most one type of loan.
Output of the model summary is withheld; the models were not used to make inferences about Default.
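
The observation about loan types can be quantified from the summary statistics. The sketch below recovers counts from the reported means of the four loan indicators and, using the finding that no observation has more than one loan type, computes how many observations have exactly one. Reading this as a contributor to the high values is a conjecture, not a result of the analysis: if nearly every row has exactly one loan indicator equal to 1, the four indicators almost always sum to 1, which makes them nearly redundant with the model intercept.

```python
n_obs = 1000

# Means of the 0/1 loan indicators, from the summary statistics.
loan_means = {
    "Car_loan": 0.353,
    "Personal_loan": 0.474,
    "Home_loan": 0.056,
    "Education_loan": 0.112,
}

# With 1000 observations, a mean to three decimals corresponds to a whole count.
loan_counts = {name: round(mean * n_obs) for name, mean in loan_means.items()}

# Because no observation has more than one loan type, the counts do not overlap.
with_one_loan = sum(loan_counts.values())
without_loan = n_obs - with_one_loan
print(loan_counts, with_one_loan, without_loan)
```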

Model Selection: GVIF Model 1

VIF: Model 1
	GVIF	Df	GVIF^(1/(2*Df))
Checking_amount	1.185956	1	1.089016
Term	1.045746	1	1.022617
Credit_score	1.099802	1	1.048714
Gender	2.573097	1	1.604088
Marital_status	2.721962	1	1.649837
Car_loan	83.354431	1	9.129865
Personal_loan	81.896071	1	9.049645
Home_loan	14.354499	1	3.788733
Education_loan	32.253932	1	5.679254
Emp_status	1.143766	1	1.069470
Amount	1.076582	1	1.037585
Saving_amount	1.228646	1	1.108443
Emp_duration	1.166378	1	1.079990
Age	1.242704	1	1.114766
No_of_credit_acc	1.338116	8	1.018371
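
The third column of the table is the adjusted value GVIF^(1/(2*Df)), which puts terms with different degrees of freedom on a comparable scale. As a check, the sketch below recomputes it for a few rows from the reported GVIF and Df values; the function name `adjusted_gvif` is illustrative.

```python
def adjusted_gvif(gvif, df):
    """Adjusted generalized VIF: GVIF ** (1 / (2 * Df))."""
    return gvif ** (1 / (2 * df))


# (GVIF, Df, reported adjusted value) copied from the Model 1 table.
rows = {
    "Car_loan": (83.354431, 1, 9.129865),
    "Personal_loan": (81.896071, 1, 9.049645),
    "Home_loan": (14.354499, 1, 3.788733),
    "Education_loan": (32.253932, 1, 5.679254),
    "No_of_credit_acc": (1.338116, 8, 1.018371),
}

recomputed = {name: adjusted_gvif(g, df) for name, (g, df, _) in rows.items()}
for name, value in recomputed.items():
    print(f"{name}: {value:.6f} (reported {rows[name][2]})")
```

For a term with one degree of freedom the adjustment is simply a square root, so the four loan indicators remain well above the other variables after adjustment.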

Model Selection: GVIF Model 2

VIF: Model 2
	VIF
Checking_amount	1.115226
Term	1.015942
Credit_score	1.047873
Saving_amount	1.120649
Age	1.146111

Model Selection: GVIF Model 3

VIF: Model 3
	VIF
Checking_amount	1.177743
Term	1.020826
Credit_score	1.072932
Personal_loan	1.187600
Home_loan	1.176122
Education_loan	1.200096
Emp_status	1.031324
Amount	1.049029
Saving_amount	1.204225
Age	1.169251

Data Split

After multiple logistic regression (MLR) was conducted, the three models were cross-validated using k-fold cross validation.
Their predictive error and accuracy were stored after each was fit.
- This was done five times, once for each held-out fold.
- The averages of the predictive error and accuracy over these five iterations were stored and then output to tables.

The following table gives the number of observations in each non-overlapping partition used in the k-fold cross validation procedure.

Fold Quantities

folds	Freq
1	195
2	186
3	223
4	199
5	197
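
The unequal fold sizes in the table are consistent with each observation being given a fold label drawn independently at random, rather than the data being cut into equal blocks. The source does not state how the folds were assigned, so the sketch below is an assumption; the function names and the seed are illustrative, and the sizes it produces will not match the table.

```python
import random
from collections import Counter


def assign_folds(n_obs, k, seed):
    """Give each observation a fold label from 1..k, drawn independently."""
    rng = random.Random(seed)
    return [rng.randint(1, k) for _ in range(n_obs)]


def split(folds, held_out):
    """Return (train_indices, test_indices) for one held-out fold."""
    train = [i for i, f in enumerate(folds) if f != held_out]
    test = [i for i, f in enumerate(folds) if f == held_out]
    return train, test


folds = assign_folds(1000, 5, seed=1)
sizes = Counter(folds)
print(sorted(sizes.items()))

# Each fold is held out once; the remaining folds form the training data.
for held_out in range(1, 6):
    train, test = split(folds, held_out)
    print(held_out, len(train), len(test))
```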

k-Fold Cross Validation

Tabular output from the 5-fold cross validation procedure follows.
- The first table contains the average predictive error of each model.
- PE1, PE2, and PE3 represent the full, reduced, and auto models, respectively.
- Likewise, ACC1, ACC2, and ACC3 represent the average accuracy of each model, in that same order.
The final model was chosen as the one with the lowest average predictive error.
The cut-off probability used to dichotomize each model’s predictions for later assessment was .5.
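
A minimal sketch of the dichotomization and error calculation follows. The predicted probabilities and observed labels are made up for illustration, the function names are hypothetical, and treating a probability of exactly .5 as non-default is an assumption, since the source does not say how ties were handled.

```python
def dichotomize(probabilities, cutoff=0.5):
    """Label a case 1 (default) when its predicted probability exceeds the cut-off."""
    return [1 if p > cutoff else 0 for p in probabilities]


def prediction_error(predicted, observed):
    """Share of cases where the predicted label differs from the observed one."""
    wrong = sum(p != o for p, o in zip(predicted, observed))
    return wrong / len(observed)


# Made-up predicted probabilities and observed labels for one held-out fold.
probs = [0.91, 0.12, 0.48, 0.73, 0.05]
observed = [1, 0, 1, 1, 0]

predicted = dichotomize(probs)
error = prediction_error(predicted, observed)
accuracy = 1 - error
print(predicted, error, accuracy)
```

Accuracy is the complement of the predictive error, which is why each ACC value in the tables equals one minus the corresponding PE value.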

Average prediction errors

PE1	PE2	PE3
0.0741	0.0713	0.0631

Average prediction accuracy

ACC1	ACC2	ACC3
0.9259	0.9287	0.9369

Discussion: Results

The predictive abilities of three models were explored.
Ultimately, these abilities were quantified as the predictive error of each model, and these errors were then compared.
Model three had the lowest predictive error (0.0631, against 0.0741 and 0.0713 for models one and two).
However, the models’ predictive errors did not differ by much.
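
To put the size of the differences in concrete terms, the sketch below takes the reported averages, picks the lowest error, and computes the gap between the best and worst model. The dictionary keys are illustrative labels.

```python
# Average prediction errors and accuracies reported by 5-fold cross validation.
pe = {"model 1": 0.0741, "model 2": 0.0713, "model 3": 0.0631}
acc = {"model 1": 0.9259, "model 2": 0.9287, "model 3": 0.9369}

# Model with the lowest average prediction error.
best = min(pe, key=pe.get)

# Gap between the highest and lowest error, rounded to the reported precision.
gap = round(max(pe.values()) - min(pe.values()), 4)

# Accuracy should be the complement of the error for every model.
consistent = all(abs(pe[m] + acc[m] - 1) < 1e-9 for m in pe)
print(best, gap, consistent)
```

A gap of 0.011 corresponds to roughly one additional misclassified case per hundred predictions.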

Discussion: Process

To get these predictive errors, 5-fold cross validation was used, with each model fitted on training data.
- At each iteration, test data withheld from model training was used to assess the predictions made by each model.
- The cut-off probability used to dichotomize each model’s predictions for later assessment was .5.
Exploratory data analysis was conducted before each model’s predictive abilities were explored.
- During that exploration, categorical and pseudo-categorical variables were transformed into factors.
- Multicollinearity within each model was also explored. For models 2 and 3, none was found.
- The same may be said for model 1.
  - However, each of the loan types had high GVIF values, even after adjustment. Why this was the case is unknown.
  - Note, however, that all observations had at most one type of loan.
- Correlations between the numeric variables that were not turned into factors were assessed as well. Low amounts were found.

Discussion: Models

The models were not used to make inferences, so no model summary output or interpretations were given.

The first model contained all variables from the data set.
The second model contained only those variables that were found to be statistically significant in the first model.
The third model was produced using automatic variable selection, which chooses as its final model the candidate with the lowest AIC value.

Note that both the second model and the automatic model might have differed with each iteration of k-fold cross validation, based on their selection criteria. That is, if the full model had been fitted with training data from differing folds, it may have produced a reduced model entirely different from model 2. The same can be said for model 3.

Conclusion

Predictive error among the three models did not differ by much. However, if one were to pick a model for prediction based on lowest predictive error, one might choose the third model. Further research might be directed to exploring the cause of the high GVIF values in model 1.

References

Garadis, P. (n.d.). Modern Data Analytics in Banking: Benefits, Outlook & More. Retrieved from Hitachi Solutions: https://global.hitachi-solutions.com/blog/big-data-banking/
Qureshi, A. (n.d.). Data Retention Policy 101: Best Practices, Examples & More [with Template]. Retrieved August 07, 2021, from https://www.intradyn.com/data-retention-policy/
Segal, T. (2022, November 29). What Is Big Data? Definition, How It Works, and Uses. Retrieved from Investopedia: https://www.investopedia.com/terms/b/big-data.asp

Exploring Predictive Models
