In this project, we followed the CRISP-DM model to solve this problem, dividing our analysis into the following sections:
Our motivation for this project is to apply data analytics in a business context to increase the efficiency of the business. In the retail lending business, banks receive loan applications from individuals with different profiles and must decide whether or not to approve them. When banks reject applications from good credit applicants who are likely to repay the loan, they lose business opportunities. On the other hand, if banks accept applications from bad credit applicants who are likely to default, they incur financial losses. Thus, in this project, we leverage the German Credit data set to develop a model that helps banks determine the credit quality of their loan applications.
Additionally, we can help banks increase the efficiency of the loan approval process by enabling loan applications to be analyzed in real time and approval decisions to be provided instantly.
Business Objective: The goal of this analysis is to obtain a model that is able to determine if new applicants present a good or bad credit risk based on personal and credit profiles.
Situation: We use the German Credit data set, which contains data on 1,000 past credit applicants described by 30 variables such as account status, credit history, duration of credit and purpose of credit. Each applicant is rated as “Good” or “Bad” credit (encoded as 1 and 0 respectively in the response variable).
Data mining goals: In data mining terms, the aim of this project is to use credit application information, including personal information, credit history and type of credit, to build a model that classifies applicants into the “Good” or “Bad” group. Banks will then be able to predict whether an applicant is “Good” or “Bad” based on the application information.
We will take a closer look at the German Credit data to avoid any data-related issues in the next steps, Data Preparation and Modeling.
Missing values
To gain further insight into the structure of our data, we plot the following charts to check each variable for missing observations. We do not observe any missing values, so we proceed to look at each variable more closely to identify possible outliers.
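A minimal sketch of how this check could be performed in R, assuming the data frame is called german_credit (the name used later in the Data Preparation section); the use of the DataExplorer package is an assumption:

```r
# Count missing values per variable (all counts are expected to be zero)
colSums(is.na(german_credit))

# Optional visual check of missingness
DataExplorer::plot_missing(german_credit)
```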
Overall Summary
In this section, we will show the summary of each variable.
Key Findings from the summary table below
Response variable: 70% good applicants and 30% bad applicants.
Duration: the average credit duration is 20.9 months, with first and third quartiles of 12 and 24 months respectively.
History: notably, 29% of applicants have a critical account.
Variables considered as Purposes of credit:
Gender and status of applicants:
Variable | N | Customers, N = 1,000¹ |
---|---|---|
CHK_ACCT | 1,000 | |
< 0 DM | | 274 (27%) |
>= 200 DM | | 63 (6.3%) |
0 < -- < 200 DM | | 269 (27%) |
No checking account | | 394 (39%) |
DURATION | 1,000 | |
Median (IQR) | | 18 (12, 24) |
Mean (Range) | | 21 (4, 72) |
HISTORY | 1,000 | |
All credits at this bank paid back duly | | 49 (4.9%) |
Critical account | | 293 (29%) |
Existing credits paid | | 530 (53%) |
Delay in paying off in the past | | 88 (8.8%) |
No credit data | | 40 (4.0%) |
AMOUNT | 1,000 | |
Median (IQR) | | 2,320 (1,366, 3,972) |
Mean (Range) | | 3,271 (250, 18,424) |
SAV_ACCT | 1,000 | |
< 100 DM | | 603 (60%) |
>= 1000 DM | | 48 (4.8%) |
100 <= -- < 500 DM | | 103 (10%) |
500 <= -- < 1000 DM | | 63 (6.3%) |
Unknown/No savings account | | 183 (18%) |
EMPLOYMENT | 1,000 | |
< 1 year | | 172 (17%) |
>= 7 years | | 253 (25%) |
1 <= -- < 4 years | | 339 (34%) |
4 <= -- < 7 years | | 174 (17%) |
Unemployed | | 62 (6.2%) |
INSTALL_RATE | 1,000 | |
1 | | 136 (14%) |
2 | | 231 (23%) |
3 | | 157 (16%) |
4 | | 476 (48%) |
PRESENT_RESIDENT | 1,000 | |
< 1 year | | 130 (13%) |
>= 4 years | | 413 (41%) |
1 <= -- < 2 years | | 308 (31%) |
2 <= -- < 3 years | | 149 (15%) |
AGE | 1,000 | |
Median (IQR) | | 33 (27, 42) |
Mean (Range) | | 36 (19, 125) |
NUM_CREDITS | 1,000 | |
1 | | 633 (63%) |
2 | | 333 (33%) |
3 | | 28 (2.8%) |
4 | | 6 (0.6%) |
JOB | 1,000 | |
Management/self-employed/highly qualified employee/officer | | 148 (15%) |
Skilled employee/official | | 630 (63%) |
Unemployed/unskilled non-resident | | 22 (2.2%) |
Unskilled - resident | | 200 (20%) |
NUM_DEPENDENTS | 1,000 | |
1 | | 845 (84%) |
2 | | 155 (16%) |
NEW_CAR, Positive (Yes) | 1,000 | 234 (23%) |
USED_CAR, Positive (Yes) | 1,000 | 103 (10%) |
FURNITURE, Positive (Yes) | 1,000 | 181 (18%) |
RADIO.TV, Positive (Yes) | 1,000 | 280 (28%) |
EDUCATION, Positive (Yes) | 1,000 | |
Binary as -1 (OBS #37) | | 1 (0.1%) |
No | | 950 (95%) |
Yes | | 49 (4.9%) |
RETRAINING, Positive (Yes) | 1,000 | 97 (9.7%) |
MALE_DIV, Positive (Yes) | 1,000 | 50 (5.0%) |
MALE_SINGLE, Positive (Yes) | 1,000 | 548 (55%) |
MALE_MAR_or_WID, Positive (Yes) | 1,000 | 92 (9.2%) |
CO.APPLICANT, Positive (Yes) | 1,000 | 41 (4.1%) |
GUARANTOR, Positive (Yes) | 1,000 | |
Binary as 2 (OBS #234) | | 1 (0.1%) |
No | | 948 (95%) |
Yes | | 51 (5.1%) |
REAL_ESTATE, Positive (Yes) | 1,000 | 282 (28%) |
PROP_UNKN_NONE, Positive (Yes) | 1,000 | 154 (15%) |
OTHER_INSTALL, Positive (Yes) | 1,000 | 186 (19%) |
RENT, Positive (Yes) | 1,000 | 179 (18%) |
OWN_RES, Positive (Yes) | 1,000 | 713 (71%) |
TELEPHONE, Positive (Yes) | 1,000 | 404 (40%) |
FOREIGN, Positive (Yes) | 1,000 | 37 (3.7%) |
RESPONSE, Positive (1 = Good) | 1,000 | 700 (70%) |
¹ n (%)
Histogram to visualize numerical and categorical variables
Histograms summarize the distribution of the data set. Binary variables are not included in the chart because they take only the values 0 and 1, so their histograms are not very informative.
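A possible sketch of how these histograms could be produced with ggplot2; the exact list of non-binary variables is an assumption:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

non_binary_vars <- c("CHK_ACCT", "DURATION", "HISTORY", "AMOUNT", "SAV_ACCT",
                     "EMPLOYMENT", "INSTALL_RATE", "PRESENT_RESIDENT", "AGE",
                     "NUM_CREDITS", "JOB", "NUM_DEPENDENTS")

german_credit %>%
  select(all_of(non_binary_vars)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 20) +                 # one histogram per variable
  facet_wrap(~ variable, scales = "free")
```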
Key Findings from the histograms
AGE, AMOUNT and DURATION: positively skewed, meaning the bank can expect mostly young clients who borrow small loans with credit durations of roughly 0-25 months.
The most frequent CHK_ACCT category is “no checking account”, and only a few applicants hold more than 200 DM. The majority of applicants have less than 100 DM in their savings account.
Most applicants are employed in the skilled employee category. The most frequent employment duration is 1-4 years.
The most common installment rate is 4% of disposable income, and the most common number of existing credits at this bank is 1. Most applicants have one person for whom they are liable to provide maintenance. In general, applicants have lived at their current residence for more than one year.
Most applicants have paid back their existing credits duly; however, the second most frequent credit-history group is the critical account.
Box plot to interpret numerical data
We use box plot to visualize quartile, median, skewness and outliers
of numerical data. Thus, in this section, we select only numerical
variables including DURATION, AMOUNT, INSTALL_RATE, AGE, NUM_CREDITS and
NUM_DEPENDENTS. However, we also include EMPLOYMENT and
PRESENT_RESIDENT. Although both are categorical variables, the values
represent the length of employment and residency, which also provide
numerical information.
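A sketch of how these box plots could be drawn, with the response on the y axis so that good applicants (1) appear on top; the plotting choices are assumptions:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

box_vars <- c("DURATION", "AMOUNT", "INSTALL_RATE", "AGE", "NUM_CREDITS",
              "NUM_DEPENDENTS", "EMPLOYMENT", "PRESENT_RESIDENT")

german_credit %>%
  select(RESPONSE, all_of(box_vars)) %>%
  pivot_longer(-RESPONSE, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value, y = factor(RESPONSE))) +
  geom_boxplot() +                            # response 1 (good) on top, 0 (bad) below
  facet_wrap(~ variable, scales = "free_x")
```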
For each box plot below, the top box (response of 1) represents good applicants and the bottom one (response of 0) represents bad applicants.
Regarding the ‘good’ applicants, the box plots show that they tend to have one or more of the following characteristics: shorter credit duration, longer employment duration, and lower credit amount. Regarding the AGE variable, the age profile of good applicants tends to be slightly higher. We also spot a possible data error: an age of 125 years. As for the remaining variables, INSTALL_RATE, NUM_CREDITS, NUM_DEPENDENTS and PRESENT_RESIDENT, there are no significant differences between good and bad applicants.
Summary by RESPONSE
In this section, we show the average of each variable split by the response variable. Since our data contains binary variables, we believe the average is more informative than the median. Please note that a response of 0 represents ‘Bad’ applicants and a response of 1 represents ‘Good’ applicants.
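A minimal sketch of how this summary could be computed, assuming the columns listed in the output below:

```r
# Column means split by the response (0 = Bad, 1 = Good)
vars <- c("DURATION", "AMOUNT", "INSTALL_RATE", "AGE", "NUM_CREDITS",
          "NUM_DEPENDENTS", "NEW_CAR", "USED_CAR", "FURNITURE", "RADIO.TV",
          "EDUCATION", "RETRAINING", "MALE_DIV", "MALE_SINGLE",
          "MALE_MAR_or_WID", "CO.APPLICANT", "GUARANTOR", "REAL_ESTATE",
          "PROP_UNKN_NONE", "OTHER_INSTALL", "RENT", "OWN_RES",
          "TELEPHONE", "FOREIGN")

round(sapply(split(german_credit[vars], german_credit$RESPONSE), colMeans), 2)
```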
## 0 1
## DURATION 24.86 19.21
## AMOUNT 3938.13 2985.46
## INSTALL_RATE 3.10 2.92
## AGE 33.96 36.30
## NUM_CREDITS 1.37 1.42
## NUM_DEPENDENTS 1.15 1.16
## NEW_CAR 0.30 0.21
## USED_CAR 0.06 0.12
## FURNITURE 0.19 0.18
## RADIO.TV 0.21 0.31
## EDUCATION 0.07 0.04
## RETRAINING 0.11 0.09
## MALE_DIV 0.07 0.04
## MALE_SINGLE 0.49 0.57
## MALE_MAR_or_WID 0.08 0.10
## CO.APPLICANT 0.06 0.03
## GUARANTOR 0.03 0.06
## REAL_ESTATE 0.20 0.32
## PROP_UNKN_NONE 0.22 0.12
## OTHER_INSTALL 0.25 0.16
## RENT 0.23 0.16
## OWN_RES 0.62 0.75
## TELEPHONE 0.38 0.42
## FOREIGN 0.01 0.05
From the table above, we observe that the average credit amount of ‘Good’ applicants is almost 3,000, while the average amount of ‘Bad’ applicants is almost 4,000. Looking at the purpose of credit, the top purpose of ‘Good’ applicants is radio/TV (31% of all ‘Good’ applicants), whereas the top purpose of ‘Bad’ applicants is a new car (30% of all ‘Bad’ applicants).
Scatter plots and Correlations
In this section, we show scatter plots and linear correlations of all
numerical variables as well as the response variable.
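One way this could be done in R; the use of GGally for the scatter-plot matrix is an assumption:

```r
num_vars <- c("DURATION", "AMOUNT", "INSTALL_RATE", "AGE",
              "NUM_CREDITS", "NUM_DEPENDENTS", "RESPONSE")

# Pairwise linear correlations
round(cor(german_credit[num_vars]), 2)

# Scatter-plot matrix with correlations
GGally::ggpairs(german_credit[num_vars])
```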
From the result, we see that AMOUNT and DURATION have a
correlation of 0.63, and both seem to be negatively correlated with the
response variable. This supports our observation from the box plot
section that the ‘good’ applicants tend to have short credit duration
and low credit amount.
Data cleaning
In this section, we list all possible errors found in the data set during Data Understanding and correct them in the german_credit table as follows:
Next, before applying the models to the data, we convert the categorical variables from numeric (int) values to categorical (factor). For example, CHK_ACCT is actually categorical, but its values in german_credit are stored as numeric (int). The following pie chart illustrates this transformation.
For the RESPONSE variable, we transform the values to “Good” if they are equal to 1 and “Bad” otherwise, and store the new values in a column called Applicant (a sketch of both transformations is shown below). It is worth mentioning that we could have kept the data as 0/1 and applied a regression task, but we preferred to visualize the categorical data with “Good” and “Bad” values.
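A sketch of both transformations; the exact set of columns converted to factors is an assumption:

```r
library(dplyr)

# Hypothetical list of integer-coded categorical variables
categorical_vars <- c("CHK_ACCT", "HISTORY", "SAV_ACCT",
                      "EMPLOYMENT", "PRESENT_RESIDENT", "JOB")

german_credit <- german_credit %>%
  mutate(across(all_of(categorical_vars), as.factor)) %>%           # int -> factor
  mutate(Applicant = factor(if_else(RESPONSE == 1, "Good", "Bad")))  # new label column
```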
Finally, we split the German Credit data into two data sets to ensure that the models do not overfit and that the prediction results generalize well. We randomly select 80% of the observations (800) as our training set and keep the remaining observations (200) as our test set. Please note that we set a seed value for the data partitioning, for reproducibility.
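A minimal sketch of the 80/20 split; the seed value and the object names df.tr/df.te are assumptions (df.tr is the name that appears in the rpart output further down):

```r
set.seed(123)  # hypothetical seed, set for reproducibility

train_idx <- sample(seq_len(nrow(german_credit)),
                    size = round(0.8 * nrow(german_credit)))

df.tr <- german_credit[train_idx, ]   # 800 observations (training set)
df.te <- german_credit[-train_idx, ]  # 200 observations (test set)
```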
Our goal is to obtain a model that may be used to determine if new
applicants present a good or bad credit risk. Since we have transformed
the column RESPONSE
as a factor with categorical values, we
will apply models that consider a classification task.
We have chosen the models as follows:
Decision trees are algorithms that recursively search the space for the best possible split until they can no longer do so (Ivo Bernardo, 2021). Their basic mechanism is to split the data space into rectangles, evaluating each split. The main goal is for each split to minimize the impurity relative to the previous one.
Build the model - Unbalanced data without Cross validation
After creating the model with the rpart function, we plot it to visualize the resulting classification tree. In the graph we can observe that the main splitting variable selected is “CHK_ACCT = 0,1”; this split reduces the impurity by 43.99. In addition, we can determine which variables have the most significant impact on reducing the impurity function: in this case, they are those with the longest splitting branches.
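A sketch of the tree construction; the rpart call matches the one shown in the printcp output below, while the object name and the use of rpart.plot are assumptions:

```r
library(rpart)
library(rpart.plot)

# Grow a deep classification tree (low cp) on the training set
fit.tree <- rpart(Applicant ~ ., data = df.tr, method = "class",
                  model = TRUE, cp = 0.001)

# Visualize the tree
rpart.plot(fit.tree)
```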
Pruning the Tree
We decided to prune the tree to reduce the statistical noise in the data: as the tree splits over and over, the branch lengths become shorter and the importance of each split diminishes. Another reason is that decision trees are susceptible to overfitting, so reducing the size of the model should improve accuracy on new data.
In the complexity table below, we can see the variables that were actually used in the construction of the classification tree. Furthermore, the tree we retain is tree number 3. We have chosen it using the usual rule of thumb: pick the simplest tree whose cross-validated error (xerror) is within one standard error (xstd) of the lowest xerror. Trees 4, 5, and 6 also satisfy this condition, but we discard them in favor of the simpler tree.
##
## Classification tree:
## rpart(formula = Applicant ~ ., data = df.tr, method = "class",
## model = TRUE, cp = 0.001)
##
## Variables actually used in tree construction:
## [1] AGE AMOUNT CHK_ACCT DURATION
## [5] EMPLOYMENT HISTORY JOB MALE_SINGLE
## [9] NEW_CAR OTHER_INSTALL OWN_RES PRESENT_RESIDENT
## [13] REAL_ESTATE SAV_ACCT
##
## Root node error: 240/800 = 0.3
##
## n= 800
##
## CP nsplit rel error xerror xstd
## 1 0.0520833 0 1.00000 1.00000 0.054006
## 2 0.0361111 3 0.84167 0.94583 0.053129
## 3 0.0166667 6 0.73333 0.85417 0.051449
## 4 0.0111111 10 0.66667 0.82083 0.050773
## 5 0.0093750 13 0.63333 0.81250 0.050599
## 6 0.0083333 18 0.57917 0.84583 0.051284
## 7 0.0041667 21 0.55417 0.85000 0.051367
## 8 0.0020833 24 0.54167 0.81667 0.050686
## 9 0.0010417 26 0.53750 0.85833 0.051531
## 10 0.0010000 30 0.53333 0.86667 0.051694
Another way to find the tree with the lowest cross-validated error is to plot it against the size of the tree: the relative error (rel error) is on the y axis, the complexity parameter (cp) is on the x axis, and the size of the tree (number of terminal nodes) is shown along the top of the plot. The black line is the cross-validated error rate (xerror) at each split.
From the graph, we can see right away which cp to choose: the simplest tree whose error falls below the dotted line. In this case we could choose the tree with 7 terminal nodes and a cp of 0.025.
To visualize how the classification tree looks after pruning, we plot it again. Note: for the following tree we used a cp of 0.0166667, row 3 of the complexity table.
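A sketch of the pruning step at the chosen complexity parameter; the object names are assumptions:

```r
# Prune the full tree at cp = 0.0166667 (row 3 of the complexity table) and plot it
fit.pruned <- prune(fit.tree, cp = 0.0166667)
rpart.plot(fit.pruned)
```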
We can see from the graph that the classification tree has become smaller and that the main splitting variable remains the same, as expected, but the third node (OTHER_INSTALL) that appeared in the previous tree has been removed, as have the fourth node (SAV_ACCT) and the JOB node.
Model evaluation - Unbalanced data without Cross validation
##
## Bad Good
## 240 560
From the table we can observe that the training set is unbalanced, with 240 bad applicants and 560 good applicants. Since there are many more “Good” applicants than “Bad” applicants, any model would tend to favor predicting “Good”.
First, we want to measure the accuracy of our model on the unbalanced data to see how good it is, so we compute the confusion matrix.
We note an accuracy of 0.74, a balanced accuracy of 0.63, and a large gap between the sensitivity (0.35) and specificity (0.91), which is not good. For that reason, we decided to balance our data to improve the balanced accuracy, and to make the overall score more robust by applying a cross-validation technique to the model. This will help our model find the best set of hyperparameters and produce better results.
Class Balancing - Re-sampling Data
Balancing by re-sampling consists of increasing the number of cases in the smallest class (here “Bad”) by sampling cases from this category at random, with replacement, until it matches the size of the largest category (here “Good”). It has the same aim as sub-sampling, namely to have the same number of cases in each category (sub-sampling instead reduces the largest category).
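A sketch of this re-sampling step using caret's upSample; the seed and object names are assumptions:

```r
library(caret)

set.seed(123)  # hypothetical seed
df.tr.up <- upSample(x = df.tr[, setdiff(names(df.tr), "Applicant")],
                     y = df.tr$Applicant,
                     yname = "Applicant")

table(df.tr.up$Applicant)  # 560 "Bad" and 560 "Good"
```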
After applying the re-sampling we balance the data set to 560 applicants each.
##
## Bad Good
## 560 560
Model evaluation - Balanced data with Cross validation and Tuning CP
In the previous analysis we saw how to build a decision tree model with the rpart function (by hand), how to select the preferred cp manually from the complexity table, and how to identify it on a graph. Now we will build a decision tree model with cross-validation (CV) using the caret package (automatically), applied to the balanced data created in the previous step. Finally, we will compare which model gives the highest balanced accuracy.
First, we split the training data into 10 non-overlapping subsets (folds): 9/10 of the folds are used to train the model, and the remaining 1/10 is used as a validation set.
Secondly, we build the model, passing the resampling specification to caret's trainControl function. Looking at the results across the tuning parameters, we find that only 3 cp values were evaluated, with the best at 0.01785714, which makes us suspect that a better one could be found by searching a larger grid of hyperparameters.
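A minimal sketch of this step with caret; the seed and object names are assumptions:

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation

set.seed(123)  # hypothetical seed
fit.cv <- train(Applicant ~ ., data = df.tr.up,
                method = "rpart", trControl = ctrl)
fit.cv
```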
## CART
##
## 1120 samples
## 30 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1008, 1008, 1008, 1008, 1008, 1008, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01785714 0.7205357 0.4410714
## 0.03125000 0.7000000 0.4000000
## 0.36607143 0.5678571 0.1357143
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01785714.
We built the model again, this time with a tuning grid going from 0 to 0.03 in steps of 0.001, in order to find a better hyperparameter than the ones already evaluated.
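A sketch of the same call with an explicit tuning grid, assuming the ctrl object from the previous sketch:

```r
grid.cp <- expand.grid(cp = seq(0, 0.03, by = 0.001))

set.seed(123)  # hypothetical seed
fit.cv.grid <- train(Applicant ~ ., data = df.tr.up,
                     method = "rpart", trControl = ctrl,
                     tuneGrid = grid.cp)
fit.cv.grid
```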
## CART
##
## 1120 samples
## 30 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1008, 1008, 1008, 1008, 1008, 1008, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.000 0.7669643 0.5339286
## 0.001 0.7687500 0.5375000
## 0.002 0.7714286 0.5428571
## 0.003 0.7767857 0.5535714
## 0.004 0.7803571 0.5607143
## 0.005 0.7812500 0.5625000
## 0.006 0.7642857 0.5285714
## 0.007 0.7607143 0.5214286
## 0.008 0.7553571 0.5107143
## 0.009 0.7562500 0.5125000
## 0.010 0.7517857 0.5035714
## 0.011 0.7517857 0.5035714
## 0.012 0.7482143 0.4964286
## 0.013 0.7428571 0.4857143
## 0.014 0.7401786 0.4803571
## 0.015 0.7366071 0.4732143
## 0.016 0.7285714 0.4571429
## 0.017 0.7258929 0.4517857
## 0.018 0.7196429 0.4392857
## 0.019 0.7196429 0.4392857
## 0.020 0.7232143 0.4464286
## 0.021 0.7232143 0.4464286
## 0.022 0.7241071 0.4482143
## 0.023 0.7241071 0.4482143
## 0.024 0.7214286 0.4428571
## 0.025 0.7214286 0.4428571
## 0.026 0.7214286 0.4428571
## 0.027 0.7214286 0.4428571
## 0.028 0.7169643 0.4339286
## 0.029 0.7107143 0.4214286
## 0.030 0.6991071 0.3982143
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.005.
From the results in the table, the model with cp = 0.005 delivers the highest accuracy, 78.1%, which is higher than the accuracy of the previous model with cp = 0.01785714.
Finally, we apply the model to the test set to visualize the outcome on the confusion matrix table.
As expected, the accuracy has decreased from 0.74 to 0.68 but the balanced accuracy has increased from 0.63 to 0.66. We can also observe that sensitivity and specificity are now more balanced. The sensitivity is significantly improved from 0.35 to 0.60, while the specificity decreases from 0.91 to 0.71.
Overall, we determine that the model built with the caret package using a tuning grid is the one that performs best on new data (the test set) for the classification tree. Moreover, since we wanted to visualize the best classification tree, we plot it in a graph.
Random Forest (RF) is an algorithm that builds a set of decision trees and produces a final prediction by aggregating the outcomes of the trees considered (the user can define the number of trees and the number of variables tried at each node). One reason we decided to test this method is that RF is considered more stable than a single decision tree: more trees generally mean better performance. However, these advantages come at a price: RF slows down computation and cannot be visualized as easily. We will nevertheless look at its results for later comparison (Saikumar Talari, 2022).
Model evaluation - Balanced data with Cross validation and Tuning Parameter
For this method we take the same approach as for the classification tree, but we use another class-balancing technique called sub-sampling. Balancing by sub-sampling consists of decreasing the number of cases in the largest class (here “Good”) by sampling cases at random from this category until it matches the size of the smallest category (here “Bad”). It has the same aim as re-sampling, namely to have the same number of cases in each category (re-sampling instead increases the smallest category). Finally, we again apply a cross-validation technique.
We also tune the mtry hyperparameter, which indicates the number of variables randomly sampled as candidates at each split, using the tuneLength parameter.
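A sketch of the random forest fit with sub-sampling inside cross-validation; the seed, the tuneLength value and the object names are assumptions:

```r
library(caret)

# 10-fold CV with down-sampling ("sub-sampling") of the majority class
ctrl.down <- trainControl(method = "cv", number = 10, sampling = "down")

set.seed(123)  # hypothetical seed
fit.rf <- train(Applicant ~ ., data = df.tr,
                method = "rf", trControl = ctrl.down,
                tuneLength = 15)
fit.rf$bestTune  # optimal mtry (11 in our run)
```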
The optimal model is selected using the largest value of Accuracy. The hyper parameter of the optimal model is mtry = 11.
According to the results of the confusion matrix, we note changes between the classification tree model and the random forest model in terms of accuracy and balanced accuracy, which decrease from 0.68 to 0.66 and from 0.657 to 0.644 respectively. Moreover, the gap between the sensitivity and the specificity is larger, which means the model is less precise in determining whether an applicant is Good or Bad. In addition, Cohen's Kappa shows a strength of agreement of about 0.26 (fair agreement), meaning that the observed accuracy is only slightly higher than what one would expect from a random model. Overall, the Random Forest results are worse than those of the classification tree.
Neural Networks (NN) are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They are constructed from nodes, which represent neurons, connected by arcs, and they interpret sensory data through a kind of machine perception, labeling or clustering raw input (Chris Nicholson, 2020). Since NN are applicable to classification problems, we decided to consider them in our analysis.
Model evaluation - Balanced data with Cross validation and Tuning Parameters
For this method we take the same approach as for the previous models, so that we can compare the results and determine which model performs best at predicting the classes “Good” and “Bad”. To balance the data, we once again apply the sub-sampling method, and we also apply a cross-validation technique.
First, we build the model with the caret package under the considerations mentioned above to determine the best NN model. In addition, we select a grid from 1 to 10 (in steps of 1) for the number of nodes in the hidden layer (note: nnet fits a single-hidden-layer neural network), and another grid for the weight decay from 0.1 to 0.5 in steps of 0.1. The selected metric is accuracy.
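A sketch of the neural network tuning described above, assuming the ctrl.down object from the random forest sketch; the seed and object names are assumptions:

```r
grid.nnet <- expand.grid(size  = seq(1, 10, by = 1),      # nodes in the hidden layer
                         decay = seq(0.1, 0.5, by = 0.1)) # weight decay

set.seed(123)  # hypothetical seed
fit.nnet <- train(Applicant ~ ., data = df.tr,
                  method = "nnet", metric = "Accuracy",
                  trControl = ctrl.down, tuneGrid = grid.nnet,
                  trace = FALSE)  # silence nnet's iteration log
plot(fit.nnet)  # accuracy as a function of size and decay
```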
After the model was built, we wanted to visualize which hyper parameters were chosen for building the best model, so we plot them into a chart.
According to the chart, we note that the highest accuracy is reached with only 1 node in the hidden layer and a weight decay of 0.3 (the regularization parameter used to avoid over-fitting).
For visualization purposes, we plot the neural network. The positive connection weights are shown in green, while the negative connections are in blue.
Comparing the confusion-matrix results with those of the classification tree model (the best one so far), we note that the accuracy of the NN model is lower by 0.025 and that Cohen's Kappa is also lower by about 0.025. Moreover, the balanced accuracy is 0.008 below the classification tree result, and the precision is lower (0.447). Based on these results, we still prefer the classification tree over the NN.
Logistic regression is a regression method adapted to binary classification. The basic idea is to reuse the machinery developed for linear regression by modeling the probability p_i with a linear predictor function, i.e. a linear combination of the explanatory variables and a set of regression coefficients that are specific to the model at hand but the same for all trials. The linear combination is transformed into a probability using a sigmoid function.
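For reference, a standard way to write this link, where p_i is the probability that applicant i is “Good” and x_i1, …, x_ik are the explanatory variables:

$$p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik})}}$$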
Model evaluation - Balanced data with Cross validation
As with the previous models, we balance the response data (Good and Bad) by applying the sub-sampling method. A 10-fold cross-validation technique is also performed on the training set to alleviate the over-fitting issue.
For the model development, the caret package is still used, with the glm method. In principle, we should pass an additional argument of “binomial” to the “family” parameter, because glm refers to the generalized linear model, which includes many models such as linear regression, ANOVA, Poisson regression, logistic regression, etc. In practice, however, we do not need to specify the family, since caret automatically detects that we are performing classification and uses family = “binomial” by default. Additionally, twoClassSummary and classProbs = TRUE are required to compute measures specific to two-class problems, such as the area under the ROC curve, the sensitivity and the specificity.
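A sketch of the logistic regression fit and of the ROC object used below; the seed, the object names, and computing the ROC on the test set with the pROC package are assumptions:

```r
library(caret)
library(pROC)

ctrl.roc <- trainControl(method = "cv", number = 10, sampling = "down",
                         classProbs = TRUE, summaryFunction = twoClassSummary)

set.seed(123)  # hypothetical seed
fit.glm <- train(Applicant ~ ., data = df.tr,
                 method = "glm", family = "binomial",
                 trControl = ctrl.roc, metric = "ROC")

# Predicted probabilities of the "Good" class on the test set
probs <- predict(fit.glm, newdata = df.te, type = "prob")[, "Good"]
ROC <- roc(response = df.te$Applicant, predictor = probs)
```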
We do not set a threshold for this model, so the default threshold of 50% is applied.
We then plot the receiver operating characteristic (ROC) curve to find the optimal threshold. The optimal threshold can be seen at 0.531 on the ROC curve, together with the corresponding specificity and sensitivity values. Thus, our default threshold (50%) is appropriate.
plot(ROC, print.thres="best")
After that, we apply the trained model to the test set. From the confusion matrix, we see that the accuracy, Kappa, balanced accuracy, sensitivity and specificity are 0.64, 0.25, 0.643, 0.65 and 0.636, respectively. Compared with the results of the classification tree model (the best one so far), only the sensitivity of logistic regression is higher, by 0.05. We therefore conclude that the classification tree model still performs best.
A Support Vector Machine (SVM) is another simple machine learning algorithm that aims to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly separates the classes of observations. The selected hyperplane is the one with the maximum margin; in other words, to separate the two classes of data points, it maximizes the distance to the nearest data points of both classes. Maximizing the margin provides some reinforcement so that future data points can be classified with more confidence. SVM can be applied to both linear and non-linear problems.
Model evaluation - Balanced data with Cross validation and Tuning Cost
Due to the number of features (30 variables), it is difficult to use a scatter plot to judge whether this problem is linear or non-linear. Therefore, we fit both a linear SVM and a non-linear SVM, using a radial SVM as an example of the latter.
Linear SVM
Starting with the linear SVM, we again use the caret package to build the model and select the svmLinear method. To reduce the effect of over-fitting and improve the robustness of the model, we still apply a 10-fold cross-validation technique and the sub-sampling method. Moreover, we build a search grid and fit the model with each possible value in the grid to select a good hyperparameter, which in this case is the cost. The cost controls the tolerance to misclassification: if the cost is close to zero, there is little penalty on the distance to the margin, so many points can be misclassified and the boundary is very smooth; if the cost is very large, few misclassifications are allowed, the boundary is less smooth, and overfitting becomes possible.
We set the cost in the grid to be 0.01, 0.1, 1, 10, 100 and 1,000.
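A sketch of the linear SVM with this cost grid, assuming the ctrl.down object defined earlier; the seed and object names are assumptions:

```r
grid.cost <- expand.grid(C = c(0.01, 0.1, 1, 10, 100, 1000))

set.seed(123)  # hypothetical seed
fit.svml <- train(Applicant ~ ., data = df.tr,
                  method = "svmLinear", trControl = ctrl.down,
                  tuneGrid = grid.cost)
fit.svml$bestTune  # C = 100 in our run
```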
From the plot and the results below, the accuracy apparently peaks at C = 100 (0.72375).
## C
## 5 100
## Support Vector Machines with Linear Kernel
##
## 800 samples
## 30 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 1e-02 0.70875 0.3919747
## 1e-01 0.70625 0.3783631
## 1e+00 0.72125 0.3969411
## 1e+01 0.70875 0.3823382
## 1e+02 0.72375 0.4052089
## 1e+03 0.71375 0.3895158
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 100.
Then, we apply the model to the test set. Regarding the confusion matrix, the accuracy, kappa, balanced accuracy, sensitivity and specificity are 0.635, 0.217, 0.62, 0.583 and 0.657 respectively. The classification tree is still the best one so far.
Radial SVM
We repeat the procedure for an SVM with a radial basis kernel. Here there are two parameters to tune: sigma and the cost (C). The grid choice is rather arbitrary (often the result of trial and error). We use the same cost values as for the linear SVM and set sigma to 0.01, 0.02, 0.05, and 0.1.
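A sketch of the radial SVM tuning with this two-dimensional grid, under the same assumptions as before:

```r
grid.radial <- expand.grid(sigma = c(0.01, 0.02, 0.05, 0.1),
                           C = c(0.01, 0.1, 1, 10, 100, 1000))

set.seed(123)  # hypothetical seed
fit.svmr <- train(Applicant ~ ., data = df.tr,
                  method = "svmRadial", trControl = ctrl.down,
                  tuneGrid = grid.radial)
fit.svmr$bestTune  # sigma = 0.01 and C = 1 in our run
```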
As we can see from the plot and the model results, the optimal model from this search has sigma = 0.01 and C = 1, with an accuracy of 0.71875.
## sigma C
## 3 0.01 1
## Support Vector Machines with Radial Basis Function Kernel
##
## 800 samples
## 30 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## sigma C Accuracy Kappa
## 0.01 1e-02 0.66875 0.3445613
## 0.01 1e-01 0.69250 0.3824192
## 0.01 1e+00 0.71875 0.4156249
## 0.01 1e+01 0.70500 0.3672905
## 0.01 1e+02 0.65875 0.2936150
## 0.01 1e+03 0.66000 0.2796385
## 0.02 1e-02 0.67250 0.3568070
## 0.02 1e-01 0.66125 0.3364705
## 0.02 1e+00 0.71375 0.4024602
## 0.02 1e+01 0.68875 0.3370347
## 0.02 1e+02 0.70000 0.3611860
## 0.02 1e+03 0.69000 0.3365943
## 0.05 1e-02 0.63875 0.2849361
## 0.05 1e-01 0.65375 0.2609499
## 0.05 1e+00 0.70375 0.3656006
## 0.05 1e+01 0.69500 0.3496004
## 0.05 1e+02 0.68625 0.3283431
## 0.05 1e+03 0.69250 0.3416771
## 0.10 1e-02 0.52125 0.1466878
## 0.10 1e-01 0.54125 0.1365361
## 0.10 1e+00 0.66750 0.2661624
## 0.10 1e+01 0.68000 0.2867268
## 0.10 1e+02 0.69375 0.3237131
## 0.10 1e+03 0.70750 0.3313321
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.01 and C = 1.
Similarly, the radial SVM model is applied to the test set to measure its performance. From the confusion matrix, the accuracy, Kappa, balanced accuracy, sensitivity and specificity are 0.655, 0.253, 0.639, 0.6 and 0.679 respectively. Comparing the two SVM models, we observe that the radial SVM performs better than the linear SVM on every metric. However, the classification tree model still performs best.
The KNN algorithm is one of the simplest machine learning techniques, based on the assumption that similar things exist in close proximity: in other words, similar observations are near each other. KNN captures the idea of similarity (sometimes called distance, proximity, or closeness) by calculating the distance between data points. There are several ways of calculating distance, such as the Manhattan distance, Hamming distance, Euclidean distance, and the Gower index.
Model evaluation - Balanced data with Cross validation and Tuning parameter
We use the knn method in the caret package to build the model, apply the sub-sampling method, and perform 10-fold cross-validation through the trControl parameter. In addition, we standardize the features by assigning ‘center’ and ‘scale’ to the preProcess parameter.
Moreover, we set the tuneLength parameter to 20 so that the algorithm evaluates 20 candidate values of the hyperparameter k.
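A sketch of the KNN fit, assuming the ctrl.down object defined earlier; the seed and object names are assumptions:

```r
set.seed(123)  # hypothetical seed
fit.knn <- train(Applicant ~ ., data = df.tr,
                 method = "knn", trControl = ctrl.down,
                 preProcess = c("center", "scale"),  # standardize the features
                 tuneLength = 20)                    # 20 candidate values of k
fit.knn
```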
From the results below, the model selects the hyperparameter k = 9, since it delivers the highest accuracy (0.66).
## k-Nearest Neighbors
##
## 800 samples
## 30 predictor
## 2 classes: 'Bad', 'Good'
##
## Pre-processing: centered (45), scaled (45)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Addtional sampling using down-sampling prior to pre-processing
##
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.64625 0.2794474
## 7 0.65625 0.2971011
## 9 0.66125 0.3173530
## 11 0.65625 0.3108543
## 13 0.64500 0.2913405
## 15 0.65875 0.3180636
## 17 0.63250 0.2757460
## 19 0.65500 0.3202601
## 21 0.64500 0.3018661
## 23 0.63750 0.2933197
## 25 0.65000 0.3212294
## 27 0.65250 0.3075140
## 29 0.63375 0.2910729
## 31 0.64375 0.2994611
## 33 0.65625 0.3333056
## 35 0.63625 0.3072473
## 37 0.62125 0.2814219
## 39 0.65875 0.3312813
## 41 0.64250 0.3184974
## 43 0.65375 0.3234573
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
Then, we apply the model to the test set. From the confusion matrix below, we find that the accuracy, Kappa, balanced accuracy, sensitivity and specificity are 0.62, 0.221, 0.629, 0.65 and 0.607 respectively. All the metrics of this model, except the sensitivity, are lower than the results of the classification tree model.
LDA (Linear Discriminant Analysis) is a classifier used when a linear boundary is required; it is obtained by fitting class-conditional densities to the data and applying Bayes' rule. LDA assumes that all response classes share the same covariance and that the distribution of each response class is normal, with a class-specific mean and a common variance.
Model evaluation - Balanced data with Cross validation
We use the lda method in the caret package to build the model and apply the sub-sampling method to avoid bias from the unbalanced data. In addition, we set the “cv” method with number = 10 in the trControl argument, i.e. a 10-fold cross-validation technique.
We also standardize the features by assigning ‘center’ and ‘scale’ to the preProcess parameter.
Please note that there is no tuning parameter for the lda method, so we do not assign any values to the tuneLength or tuneGrid parameters.
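A sketch of the LDA fit and its evaluation on the test set, under the same assumptions as the previous sketches:

```r
set.seed(123)  # hypothetical seed
fit.lda <- train(Applicant ~ ., data = df.tr,
                 method = "lda", trControl = ctrl.down,
                 preProcess = c("center", "scale"))

# Confusion matrix on the test set
pred.lda <- predict(fit.lda, newdata = df.te)
confusionMatrix(pred.lda, df.te$Applicant)
```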
## Linear Discriminant Analysis
##
## 800 samples
## 30 predictor
## 2 classes: 'Bad', 'Good'
##
## Pre-processing: centered (45), scaled (45)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Addtional sampling using down-sampling prior to pre-processing
##
## Resampling results:
##
## Accuracy Kappa
## 0.7425 0.4511797
Afterward, we apply the trained model to the test set. From the confusion matrix, we see that the accuracy, Kappa, balanced accuracy, sensitivity and specificity are 0.65, 0.257, 0.646, 0.65 and 0.643, respectively. All metrics, except sensitivity, of LDA are lower than those of the classification tree model (the best one so far). Thus, the classification tree model still performs best.
In contrast to LDA, QDA (Quadratic Discriminant Analysis) is less strict about the covariance assumption and allows a different covariance for each class, which results in a quadratic boundary.
Model evaluation - Balanced data with Cross validation
We use the qda method in the caret package to build the model and apply the sub-sampling method to avoid bias from the unbalanced data. In addition, we set the “cv” method with number = 10 in the trControl argument, i.e. a 10-fold cross-validation technique.
We also standardize the features by assigning ‘center’ and ‘scale’ to the preProcess parameter.
Please note that there is no tuning parameter for the qda method, so we do not assign any values to the tuneLength or tuneGrid parameters.
## Quadratic Discriminant Analysis
##
## 800 samples
## 30 predictor
## 2 classes: 'Bad', 'Good'
##
## Pre-processing: centered (45), scaled (45)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Addtional sampling using down-sampling prior to pre-processing
##
## Resampling results:
##
## Accuracy Kappa
## 0.7 0.3578281
Afterward, we apply the trained model to the test set. From the confusion matrix, we see that the accuracy, Kappa, balanced accuracy, sensitivity and specificity are 0.685, 0.33, 0.685, 0.683 and 0.686, respectively. Compared with the classification tree model (the best model so far), QDA outperforms it on all metrics except specificity. Thus, we select QDA as the best model.
A Naive Bayes classifier belongs to a family of simple probabilistic classifiers: machine learning models used to discriminate between different objects based on certain features. The crux of the Naive Bayes classifier is Bayes' theorem, and it is generally used for classification tasks. There are several variants of this classifier, such as Multinomial Naive Bayes, Bernoulli Naive Bayes, and Gaussian Naive Bayes.
Model evaluation - Balanced data with Cross validation and Tuning Parameters
The caret package is still the main package we use to build the model; for the Naive Bayes classifier, the naive_bayes method is used. As good practice to avoid bias from the unbalanced data, the sub-sampling method is set through the trControl argument. Moreover, we also set the “cv” method with number = 10 in trControl, i.e. a 10-fold cross-validation technique.
In addition, a search grid is built to find good parameters for our trained model. In the naive_bayes method, we can tune three parameters: laplace (the Laplace smoothing correction), usekernel (whether a kernel density estimate is used instead of a Gaussian distribution for the numeric features), and adjust (the bandwidth adjustment of the kernel density estimate).
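A sketch of the search grid, with the values taken from the tuning results shown below; the seed and object names are assumptions:

```r
grid.nb <- expand.grid(laplace   = c(0, 0.5, 1),
                       usekernel = c(FALSE, TRUE),
                       adjust    = c(0.75, 1, 1.25, 1.5))

set.seed(123)  # hypothetical seed
fit.nb <- train(Applicant ~ ., data = df.tr,
                method = "naive_bayes", trControl = ctrl.down,
                tuneGrid = grid.nb)
plot(fit.nb)  # accuracy by distribution type, laplace and adjust
```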
From the results below, we observe that the accuracy with the Gaussian distribution (‘FALSE’, left-hand chart) is better than the accuracy with the kernel density estimate (‘TRUE’, right-hand chart). The accuracies of the Gaussian-distribution models are relatively close to one another, and the model with the highest accuracy is the one with the Gaussian distribution, a Laplace correction of 1 and a bandwidth adjustment of 0.75.
## Naive Bayes
##
## 800 samples
## 30 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## usekernel laplace adjust Accuracy Kappa
## FALSE 0.0 0.75 0.69250 0.3611620
## FALSE 0.0 1.00 0.67625 0.3484189
## FALSE 0.0 1.25 0.68750 0.3568673
## FALSE 0.0 1.50 0.68375 0.3580817
## FALSE 0.5 0.75 0.68875 0.3593466
## FALSE 0.5 1.00 0.69625 0.3701652
## FALSE 0.5 1.25 0.68250 0.3418565
## FALSE 0.5 1.50 0.68750 0.3606921
## FALSE 1.0 0.75 0.70375 0.3877056
## FALSE 1.0 1.00 0.68125 0.3438536
## FALSE 1.0 1.25 0.68875 0.3580144
## FALSE 1.0 1.50 0.70000 0.3824663
## TRUE 0.0 0.75 0.60750 0.2855257
## TRUE 0.0 1.00 0.57000 0.2415623
## TRUE 0.0 1.25 0.60125 0.2801110
## TRUE 0.0 1.50 0.59000 0.2645727
## TRUE 0.5 0.75 0.61250 0.3005501
## TRUE 0.5 1.00 0.57625 0.2446172
## TRUE 0.5 1.25 0.60125 0.2632262
## TRUE 0.5 1.50 0.61500 0.2974588
## TRUE 1.0 0.75 0.61375 0.2897495
## TRUE 1.0 1.00 0.61500 0.2834386
## TRUE 1.0 1.25 0.58875 0.2653659
## TRUE 1.0 1.50 0.62375 0.3161825
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 1, usekernel = FALSE
## and adjust = 0.75.
## laplace usekernel adjust
## 9 1 FALSE 0.75
We then apply the trained model to the test set. From the confusion matrix, we see that the accuracy, Kappa, balanced accuracy, sensitivity and specificity are 0.62, 0.215, 0.624, 0.633 and 0.614, respectively. Compared with the Quadratic Discriminant Analysis (QDA), the best model so far, every metric of the Naive Bayes classifier is lower. We conclude that QDA remains the best model.
In this section, we create a table containing the sensitivity, specificity, accuracy, balanced accuracy, and Cohen's kappa of every model, so that we can see and compare the performance of each model more easily and clearly.
From the model performance summary table, we observe that the Quadratic Discriminant Analysis (QDA) model outperforms the other models in terms of sensitivity, accuracy, balanced accuracy and Cohen's kappa; only the decision tree model achieves a higher specificity. Therefore, we consider QDA the most suitable model for this problem.
Model | Sensitivity | Specificity | Accuracy | Balanced accuracy | Kappa |
---|---|---|---|---|---|
Decision Tree | 0.6000000 | 0.7142857 | 0.680 | 0.6571429 | 0.2920354 |
Random Forest | 0.6166667 | 0.6714286 | 0.655 | 0.6440476 | 0.2596567 |
Neural Networks | 0.6333333 | 0.6642857 | 0.655 | 0.6488095 | 0.2659574 |
Logistic Regression | 0.6500000 | 0.6357143 | 0.640 | 0.6428571 | 0.2500000 |
Linear SVM | 0.5833333 | 0.6571429 | 0.635 | 0.6202381 | 0.2167382 |
Radial SVM | 0.6000000 | 0.6785714 | 0.655 | 0.6392857 | 0.2532468 |
K-Nearest Neighbors | 0.6500000 | 0.6071429 | 0.620 | 0.6285714 | 0.2213115 |
Linear Discriminant Analysis | 0.6500000 | 0.6428571 | 0.645 | 0.6464286 | 0.2573222 |
Quadratic Discriminant Analysis | 0.6833333 | 0.6857143 | 0.685 | 0.6845238 | 0.3297872 |
Naive Bayes classifier | 0.6333333 | 0.6142857 | 0.620 | 0.6238095 | 0.2148760 |
After the model evaluation, it would be very useful in a business context to see which variables are significant in the selected model.
Variable importance is a method that measures the importance of each feature for the model's prediction quality. We analyze the variable importance of our best model, the Quadratic Discriminant Analysis (QDA).
We use the AUC loss to compare the model quality after shuffling each variable in turn. AUC is a synthetic measure of the distance to a random model in the ROC curve plot; the larger the AUC, the better the model.
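A sketch of how this permutation importance could be computed with the DALEX package; the explainer set-up, the use of the test set and the object names (including fit.qda for the fitted QDA model) are assumptions:

```r
library(DALEX)

# Wrap the fitted caret QDA model (hypothetical object name fit.qda)
explainer.qda <- explain(model = fit.qda,
                         data  = df.te[, setdiff(names(df.te), "Applicant")],
                         y     = as.numeric(df.te$Applicant == "Good"),
                         label = "QDA")

# Permutation-based variable importance measured as 1 - AUC loss
vi.qda <- model_parts(explainer.qda, loss_function = loss_one_minus_auc)
plot(vi.qda)
```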
According to the feature importance of the QDA model, the top 5 most important variables are HISTORY (Credit History), OWN_RES (Applicant owns residence or not), USED_CAR (Purpose of Credit), CHK_ACCT (Balance in checking account) and SAV_ACCT (Balance in savings account). It is important to mention that if we remove these variables, the AUC of the model will have the largest loss.
There are some limitations of the variable importance method that we should take into consideration. For example, variable importance cannot capture interaction relationships between features, yet the combination (interaction) of several features might be a good predictor. Furthermore, the variable importance measure depends on the data set, so it might be subject to over-fitting.
Deployment is the process of using your new insights to make improvements within your organization. We will focus on planning for deployment and planning monitoring and maintenance.
Planning for Deployment
In this step, we plan the responsibilities of each key stakeholder group: application developers, database experts, and credit officers.
Application developers need to create an interactive web interface using tools such as R Shiny and then deploy the final model into the web interface.
Database experts need to maintain the application databases and ensure the quality of the data. They also need to be able to modify the databases in case new variables are needed for model improvement in the future.
Credit officers need to learn how to use the web interface and how to use the model output in their decision-making process. For example, at the beginning, credit officers might base a decision on both the model output (50%) and other qualitative information or factors not taken into account in the model (50%). Once the model is improved and provides better accuracy, the credit decision might be automated and rely 100% on the model output. It is also very important that credit officers are informed whenever there are any changes to the model.
Planning Monitoring and Maintenance
It’s also important to frequently monitor the model performance over time to ensure that the model performs as expected when it’s rolled out.
The frequency of monitoring should depend on the number of applicants over the time horizon, because we need a sufficient sample size to ensure that the observed accuracy is robust and meaningful. We suggest monitoring the model accuracy once there are at least 200 new applications, which is the same size as our test set.
In addition, we need to feed new information into the model and re-train the model based on new data. This is because the data distribution and customer behavior can be expected to drift over time.
caret: Classification and Regression Training, accessed on 30 April 2022
IBM, Introduction to CRISP-DM, accessed on 10 May 2022
Ivo Bernardo, “Classification Decision Trees, Easily Explained”, Aug 30, 2021
Saikumar Talari, “Random Forest® vs Decision Tree: Key Differences”, February 18, 2022