In this project, we followed the CRISP-DM model to solve this problem, dividing our analysis into the following sections:
Our motivation for this project is to apply data analytics in a business context to increase the efficiency of the business. In the retail lending business, banks receive loan applications from individuals with different profiles and must decide whether or not to approve them. When banks reject applications from good credit applicants who are likely to repay the loan, they lose business opportunities. On the other hand, if banks accept applications from bad credit applicants who are likely to default, they incur financial losses. Thus, in this project, we leverage the German Credit data set to develop a model that helps banks determine the credit quality of their loan applications.
Additionally, we can help banks increase the efficiency of the loan approval process by enabling loan applications to be analyzed in real time and approval decisions to be provided instantly.
Business Objective: The goal of this analysis is to obtain a model that is able to determine if new applicants present a good or bad credit risk based on personal and credit profiles.
Situation: We use the German Credit data set, which contains data on 1,000 past credit applicants described by 30 variables such as account status, credit history, duration of credit and purpose of credit. Each applicant is rated as “Good” or “Bad” credit (encoded as 1 and 0 respectively in the response variable).
Data mining goals: In data mining terms, the aim of this project is to use credit application information, including personal information, credit history and type of credit, to build a model that classifies applicants into the “Good” or “Bad” group. Banks will then be able to predict whether an applicant is “Good” or “Bad” based on the application information.
We will take a closer look at the German Credit data to avoid any data-related issues in the next steps, Data Preparation and Modeling.
Missing values
To gain further insight into the structure of our data, we plot the following charts to check each variable for missing observations. We do not observe any missing values, so we proceed to look at each variable more closely to identify possible outliers.
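A minimal sketch of how this check could be performed in R, assuming the data frame is called german_credit (the name used later in the Data Preparation section); the use of the DataExplorer package is an assumption:

```r
# Count missing values per variable (all counts are expected to be zero)
colSums(is.na(german_credit))

# Optional visual check of missingness
DataExplorer::plot_missing(german_credit)
```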
Overall Summary
In this section, we will show the summary of each variable.
Key Findings from the summary table below
Response variable: 70% good applicants and 30% bad applicants.
Duration: the average credit duration is 20.9 months, with first and third quartiles of 12 and 24 months respectively.
History: notably, 29% of applicants have a critical account.
Variables considered as Purposes of credit:
Gender and status of applicants:
Variable | N | Customers, N = 1,000¹ |
---|---|---|
CHK_ACCT | 1,000 | |
< 0 DM | | 274 (27%) |
>= 200 DM | | 63 (6.3%) |
0 < -- < 200 DM | | 269 (27%) |
No checking account | | 394 (39%) |
DURATION | 1,000 | |
Median (IQR) | | 18 (12, 24) |
Mean (Range) | | 21 (4, 72) |
HISTORY | 1,000 | |
All credits at this bank paid back duly | | 49 (4.9%) |
Critical account | | 293 (29%) |
Existing credits paid | | 530 (53%) |
Delay in paying off in the past | | 88 (8.8%) |
No credit data | | 40 (4.0%) |
AMOUNT | 1,000 | |
Median (IQR) | | 2,320 (1,366, 3,972) |
Mean (Range) | | 3,271 (250, 18,424) |
SAV_ACCT | 1,000 | |
< 100 DM | | 603 (60%) |
>= 1000 DM | | 48 (4.8%) |
100 <= -- < 500 DM | | 103 (10%) |
500 <= -- < 1000 DM | | 63 (6.3%) |
Unknown/No savings account | | 183 (18%) |
EMPLOYMENT | 1,000 | |
< 1 year | | 172 (17%) |
>= 7 years | | 253 (25%) |
1 <= -- < 4 years | | 339 (34%) |
4 <= -- < 7 years | | 174 (17%) |
Unemployed | | 62 (6.2%) |
INSTALL_RATE | 1,000 | |
1 | | 136 (14%) |
2 | | 231 (23%) |
3 | | 157 (16%) |
4 | | 476 (48%) |
PRESENT_RESIDENT | 1,000 | |
< 1 year | | 130 (13%) |
>= 4 years | | 413 (41%) |
1 <= -- < 2 years | | 308 (31%) |
2 <= -- < 3 years | | 149 (15%) |
AGE | 1,000 | |
Median (IQR) | | 33 (27, 42) |
Mean (Range) | | 36 (19, 125) |
NUM_CREDITS | 1,000 | |
1 | | 633 (63%) |
2 | | 333 (33%) |
3 | | 28 (2.8%) |
4 | | 6 (0.6%) |
JOB | 1,000 | |
Management/self-employed/highly qualified employee/officer | | 148 (15%) |
Skilled employee/official | | 630 (63%) |
Unemployed/unskilled non-resident | | 22 (2.2%) |
Unskilled - resident | | 200 (20%) |
NUM_DEPENDENTS | 1,000 | |
1 | | 845 (84%) |
2 | | 155 (16%) |
NEW_CAR, Positive (Yes) | 1,000 | 234 (23%) |
USED_CAR, Positive (Yes) | 1,000 | 103 (10%) |
FURNITURE, Positive (Yes) | 1,000 | 181 (18%) |
RADIO.TV, Positive (Yes) | 1,000 | 280 (28%) |
EDUCATION, Positive (Yes) | 1,000 | |
Binary as -1 (OBS #37) | | 1 (0.1%) |
No | | 950 (95%) |
Yes | | 49 (4.9%) |
RETRAINING, Positive (Yes) | 1,000 | 97 (9.7%) |
MALE_DIV, Positive (Yes) | 1,000 | 50 (5.0%) |
MALE_SINGLE, Positive (Yes) | 1,000 | 548 (55%) |
MALE_MAR_or_WID, Positive (Yes) | 1,000 | 92 (9.2%) |
CO.APPLICANT, Positive (Yes) | 1,000 | 41 (4.1%) |
GUARANTOR, Positive (Yes) | 1,000 | |
Binary as 2 (OBS #234) | | 1 (0.1%) |
No | | 948 (95%) |
Yes | | 51 (5.1%) |
REAL_ESTATE, Positive (Yes) | 1,000 | 282 (28%) |
PROP_UNKN_NONE, Positive (Yes) | 1,000 | 154 (15%) |
OTHER_INSTALL, Positive (Yes) | 1,000 | 186 (19%) |
RENT, Positive (Yes) | 1,000 | 179 (18%) |
OWN_RES, Positive (Yes) | 1,000 | 713 (71%) |
TELEPHONE, Positive (Yes) | 1,000 | 404 (40%) |
FOREIGN, Positive (Yes) | 1,000 | 37 (3.7%) |
RESPONSE, Positive (1 = Good) | 1,000 | 700 (70%) |
¹ n (%)
Histogram to visualize numerical and categorical variables
Histograms summarize the distribution of the data set. Binary variables are not included in the chart because they take only the values 0 and 1, so their histograms are not very informative.
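A possible sketch of how these histograms could be produced with ggplot2; the exact list of non-binary variables is an assumption:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

non_binary_vars <- c("CHK_ACCT", "DURATION", "HISTORY", "AMOUNT", "SAV_ACCT",
                     "EMPLOYMENT", "INSTALL_RATE", "PRESENT_RESIDENT", "AGE",
                     "NUM_CREDITS", "JOB", "NUM_DEPENDENTS")

german_credit %>%
  select(all_of(non_binary_vars)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 20) +                 # one histogram per variable
  facet_wrap(~ variable, scales = "free")
```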
Key Findings from the histograms
AGE, AMOUNT and DURATION: positively skewed, meaning the bank can expect mostly young clients who borrow small loans with credit durations of roughly 0-25 months.
The most frequent CHK_ACCT category is “no checking account”, and only a few applicants hold more than 200 DM. The majority of applicants have less than 100 DM in their savings account.
Most applicants are employed in the skilled employee category. The most frequent employment duration is 1-4 years.
The most common installment rate is 4% of disposable income, and the most common number of existing credits at this bank is 1. Most applicants have one person for whom they are liable to provide maintenance. In general, applicants have lived at their current residence for more than one year.
Most applicants have paid back their existing credits duly; however, the second most frequent credit-history group is the critical account.
Box plot to interpret numerical data
We use box plot to visualize quartile, median, skewness and outliers
of numerical data. Thus, in this section, we select only numerical
variables including DURATION, AMOUNT, INSTALL_RATE, AGE, NUM_CREDITS and
NUM_DEPENDENTS. However, we also include EMPLOYMENT and
PRESENT_RESIDENT. Although both are categorical variables, the values
represent the length of employment and residency, which also provide
numerical information.
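A sketch of how these box plots could be drawn, with the response on the y axis so that good applicants (1) appear on top; the plotting choices are assumptions:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

box_vars <- c("DURATION", "AMOUNT", "INSTALL_RATE", "AGE", "NUM_CREDITS",
              "NUM_DEPENDENTS", "EMPLOYMENT", "PRESENT_RESIDENT")

german_credit %>%
  select(RESPONSE, all_of(box_vars)) %>%
  pivot_longer(-RESPONSE, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value, y = factor(RESPONSE))) +
  geom_boxplot() +                            # response 1 (good) on top, 0 (bad) below
  facet_wrap(~ variable, scales = "free_x")
```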
For each box plot below, the top box (response of 1) represents good applicants and the bottom one (response of 0) represents bad applicants.
Regarding the ‘good’ applicants, the box plots show that they tend to have one or more of the following characteristics: shorter credit duration, longer employment duration, and lower credit amount. Regarding the AGE variable, the age profile of good applicants tends to be slightly higher. We also spot a possible data error: an age of 125 years. As for the remaining variables, INSTALL_RATE, NUM_CREDITS, NUM_DEPENDENTS and PRESENT_RESIDENT, there are no significant differences between good and bad applicants.
Summary by RESPONSE
In this section, we show the average of each variable split by the response variable. Since our data contains binary variables, we believe the average is more informative than the median. Please note that a response of 0 represents ‘Bad’ applicants and a response of 1 represents ‘Good’ applicants.
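A minimal sketch of how this summary could be computed, assuming the columns listed in the output below:

```r
# Column means split by the response (0 = Bad, 1 = Good)
vars <- c("DURATION", "AMOUNT", "INSTALL_RATE", "AGE", "NUM_CREDITS",
          "NUM_DEPENDENTS", "NEW_CAR", "USED_CAR", "FURNITURE", "RADIO.TV",
          "EDUCATION", "RETRAINING", "MALE_DIV", "MALE_SINGLE",
          "MALE_MAR_or_WID", "CO.APPLICANT", "GUARANTOR", "REAL_ESTATE",
          "PROP_UNKN_NONE", "OTHER_INSTALL", "RENT", "OWN_RES",
          "TELEPHONE", "FOREIGN")

round(sapply(split(german_credit[vars], german_credit$RESPONSE), colMeans), 2)
```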
## 0 1
## DURATION 24.86 19.21
## AMOUNT 3938.13 2985.46
## INSTALL_RATE 3.10 2.92
## AGE 33.96 36.30
## NUM_CREDITS 1.37 1.42
## NUM_DEPENDENTS 1.15 1.16
## NEW_CAR 0.30 0.21
## USED_CAR 0.06 0.12
## FURNITURE 0.19 0.18
## RADIO.TV 0.21 0.31
## EDUCATION 0.07 0.04
## RETRAINING 0.11 0.09
## MALE_DIV 0.07 0.04
## MALE_SINGLE 0.49 0.57
## MALE_MAR_or_WID 0.08 0.10
## CO.APPLICANT 0.06 0.03
## GUARANTOR 0.03 0.06
## REAL_ESTATE 0.20 0.32
## PROP_UNKN_NONE 0.22 0.12
## OTHER_INSTALL 0.25 0.16
## RENT 0.23 0.16
## OWN_RES 0.62 0.75
## TELEPHONE 0.38 0.42
## FOREIGN 0.01 0.05
From the table above, we observe that the average credit amount of ‘Good’ applicants is almost 3,000, while the average amount of ‘Bad’ applicants is almost 4,000. Looking at the purpose of credit, the top purpose of ‘Good’ applicants is radio/TV (31% of all ‘Good’ applicants), whereas the top purpose of ‘Bad’ applicants is a new car (30% of all ‘Bad’ applicants).
Scatter plots and Correlations
In this section, we show scatter plots and linear correlations of all
numerical variables as well as the response variable.
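One way this could be done in R; the use of GGally for the scatter-plot matrix is an assumption:

```r
num_vars <- c("DURATION", "AMOUNT", "INSTALL_RATE", "AGE",
              "NUM_CREDITS", "NUM_DEPENDENTS", "RESPONSE")

# Pairwise linear correlations
round(cor(german_credit[num_vars]), 2)

# Scatter-plot matrix with correlations
GGally::ggpairs(german_credit[num_vars])
```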
From the result, we see that AMOUNT and DURATION have a
correlation of 0.63, and both seem to be negatively correlated with the
response variable. This supports our observation from the box plot
section that the ‘good’ applicants tend to have short credit duration
and low credit amount.
Data cleaning
In this section, we list all possible errors found in the data set during Data Understanding and correct them in the german_credit table as follows:
Next, before applying the models to the data, we convert the categorical variables from numeric (int) values to categorical (factor). For example, CHK_ACCT is actually categorical, but its values in german_credit are stored as numeric (int). The following pie chart illustrates this transformation.
For the RESPONSE variable, we transform the values to “Good” if they are equal to 1 and “Bad” otherwise, and store the new values in a column called Applicant (a sketch of both transformations is shown below). It is worth mentioning that we could have kept the data as 0/1 and applied a regression task, but we preferred to visualize the categorical data with “Good” and “Bad” values.
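A sketch of both transformations; the exact set of columns converted to factors is an assumption:

```r
library(dplyr)

# Hypothetical list of integer-coded categorical variables
categorical_vars <- c("CHK_ACCT", "HISTORY", "SAV_ACCT",
                      "EMPLOYMENT", "PRESENT_RESIDENT", "JOB")

german_credit <- german_credit %>%
  mutate(across(all_of(categorical_vars), as.factor)) %>%           # int -> factor
  mutate(Applicant = factor(if_else(RESPONSE == 1, "Good", "Bad")))  # new label column
```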
Finally, we split the German Credit data into two data sets to ensure that the models do not overfit and that the prediction results generalize well. We randomly select 80% of the observations (800) as our training set and keep the remaining observations (200) as our test set. Please note that we set a seed value for the data partitioning, for reproducibility.
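A minimal sketch of the 80/20 split; the seed value and the object names df.tr/df.te are assumptions (df.tr is the name that appears in the rpart output further down):

```r
set.seed(123)  # hypothetical seed, set for reproducibility

train_idx <- sample(seq_len(nrow(german_credit)),
                    size = round(0.8 * nrow(german_credit)))

df.tr <- german_credit[train_idx, ]   # 800 observations (training set)
df.te <- german_credit[-train_idx, ]  # 200 observations (test set)
```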
Our goal is to obtain a model that may be used to determine if new
applicants present a good or bad credit risk. Since we have transformed
the column RESPONSE
as a factor with categorical values, we
will apply models that consider a classification task.
We have chosen the models as follows:
Decision trees are algorithms that recursively search the space for the best possible split until they can no longer do so (Ivo Bernardo, 2021). Their basic mechanism is to split the data space into rectangles, evaluating each split. The main goal is for each split to minimize the impurity relative to the previous one.
Build the model - Unbalanced data without Cross validation
After creating the model with the rpart function, we plot it to visualize the resulting classification tree. In the graph we can observe that the main splitting variable selected is “CHK_ACCT = 0,1”; this split reduces the impurity by 43.99. In addition, we can determine which variables have the most significant impact on reducing the impurity function: in this case, they are those with the longest splitting branches.
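A sketch of the tree construction; the rpart call matches the one shown in the printcp output below, while the object name and the use of rpart.plot are assumptions:

```r
library(rpart)
library(rpart.plot)

# Grow a deep classification tree (low cp) on the training set
fit.tree <- rpart(Applicant ~ ., data = df.tr, method = "class",
                  model = TRUE, cp = 0.001)

# Visualize the tree
rpart.plot(fit.tree)
```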
Pruning the Tree
We decided to prune the tree to reduce the statistical noise in the data: as the tree splits over and over, the branch lengths become shorter and the importance of each split diminishes. Another reason is that decision trees are susceptible to overfitting, so reducing the size of the model should improve accuracy on new data.
In the complexity table below, we can see the variables that were actually used in the construction of the classification tree. Furthermore, the tree we retain is tree number 3. We have chosen it using the usual rule of thumb: pick the simplest tree whose cross-validated error (xerror) is within one standard error (xstd) of the lowest xerror. Trees 4, 5, and 6 also satisfy this condition, but we discard them in favor of the simpler tree.
##
## Classification tree:
## rpart(formula = Applicant ~ ., data = df.tr, method = "class",
## model = TRUE, cp = 0.001)
##
## Variables actually used in tree construction:
## [1] AGE AMOUNT CHK_ACCT DURATION
## [5] EMPLOYMENT HISTORY JOB MALE_SINGLE
## [9] NEW_CAR OTHER_INSTALL OWN_RES PRESENT_RESIDENT
## [13] REAL_ESTATE SAV_ACCT
##
## Root node error: 240/800 = 0.3
##
## n= 800
##
## CP nsplit rel error xerror xstd
## 1 0.0520833 0 1.00000 1.00000 0.054006
## 2 0.0361111 3 0.84167 0.94583 0.053129
## 3 0.0166667 6 0.73333 0.85417 0.051449
## 4 0.0111111 10 0.66667 0.82083 0.050773
## 5 0.0093750 13 0.63333 0.81250 0.050599
## 6 0.0083333 18 0.57917 0.84583 0.051284
## 7 0.0041667 21 0.55417 0.85000 0.051367
## 8 0.0020833 24 0.54167 0.81667 0.050686
## 9 0.0010417 26 0.53750 0.85833 0.051531
## 10 0.0010000 30 0.53333 0.86667 0.051694
Another way to find the tree with the lowest cross-validated error is to plot it against the size of the tree: the relative error (rel error) is on the y axis, the complexity parameter (cp) is on the x axis, and the size of the tree (number of terminal nodes) is shown along the top of the plot. The black line is the cross-validated error rate (xerror) at each split.
From the graph, we can see right away which cp to choose: the simplest tree whose error falls below the dotted line. In this case we could choose the tree with 7 terminal nodes and a cp of 0.025.
To visualize how the classification tree looks after pruning, we plot it again. Note: for the following tree we used a cp of 0.0166667, row 3 of the complexity table.
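A sketch of the pruning step at the chosen complexity parameter; the object names are assumptions:

```r
# Prune the full tree at cp = 0.0166667 (row 3 of the complexity table) and plot it
fit.pruned <- prune(fit.tree, cp = 0.0166667)
rpart.plot(fit.pruned)
```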
We can see from the graph that the classification tree has become smaller and that the main splitting variable remains the same, as expected, but the third node (OTHER_INSTALL) that appeared in the previous tree has been removed, as have the fourth node (SAV_ACCT) and the JOB node.
Model evaluation - Unbalanced data without Cross validation
##
## Bad Good
## 240 560
From the table we can observe that the training set is unbalanced, with 240 bad applicants and 560 good applicants. Since there are many more “Good” applicants than “Bad” applicants, any model would tend to favor predicting “Good”.
First, we want to measure the accuracy of our model on the unbalanced data to see how good it is, so we compute the confusion matrix.
We note an accuracy of 0.74, a balanced accuracy of 0.63, and a large gap between the sensitivity (0.35) and specificity (0.91), which is not good. For that reason, we decided to balance our data to improve the balanced accuracy, and to make the overall score more robust by applying a cross-validation technique to the model. This will help our model find the best set of hyperparameters and produce better results.
Class Balancing - Re-sampling Data
Balancing by re-sampling consists of increasing the number of cases in the smallest class (here “Bad”) by sampling cases from this category at random, with replacement, until it matches the size of the largest category (here “Good”). It has the same aim as sub-sampling, namely to have the same number of cases in each category (sub-sampling instead reduces the largest category).
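A sketch of this re-sampling step using caret's upSample; the seed and object names are assumptions:

```r
library(caret)

set.seed(123)  # hypothetical seed
df.tr.up <- upSample(x = df.tr[, setdiff(names(df.tr), "Applicant")],
                     y = df.tr$Applicant,
                     yname = "Applicant")

table(df.tr.up$Applicant)  # 560 "Bad" and 560 "Good"
```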
After applying the re-sampling we balance the data set to 560 applicants each.
##
## Bad Good
## 560 560
Model evaluation - Balanced data with Cross validation and Tuning CP
In the previous analysis we saw how to build a decision tree model with the rpart function (by hand), how to select the preferred cp manually from the complexity table, and how to identify it on a graph. Now we will build a decision tree model with cross-validation (CV) using the caret package (automatically), applied to the balanced data created in the previous step. Finally, we will compare which model gives the highest balanced accuracy.
First, we split the training data into 10 non-overlapping subsets (folds): 9/10 of the folds are used to train the model, and the remaining 1/10 is used as a validation set.
Secondly, we build the model, passing the resampling specification to caret's trainControl function. Looking at the results across the tuning parameters, we find that only 3 cp values were evaluated, with the best at 0.01785714, which makes us suspect that a better one could be found by searching a larger grid of hyperparameters.
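A minimal sketch of this step with caret; the seed and object names are assumptions:

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation

set.seed(123)  # hypothetical seed
fit.cv <- train(Applicant ~ ., data = df.tr.up,
                method = "rpart", trControl = ctrl)
fit.cv
```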
## CART
##
## 1120 samples
## 30 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1008, 1008, 1008, 1008, 1008, 1008, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01785714 0.7205357 0.4410714
## 0.03125000 0.7000000 0.4000000
## 0.36607143 0.5678571 0.1357143
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01785714.
We built the model again, this time with a tuning grid going from 0 to 0.03 in steps of 0.001, in order to find a better hyperparameter than the ones already evaluated.
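A sketch of the same call with an explicit tuning grid, assuming the ctrl object from the previous sketch:

```r
grid.cp <- expand.grid(cp = seq(0, 0.03, by = 0.001))

set.seed(123)  # hypothetical seed
fit.cv.grid <- train(Applicant ~ ., data = df.tr.up,
                     method = "rpart", trControl = ctrl,
                     tuneGrid = grid.cp)
fit.cv.grid
```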
## CART
##
## 1120 samples
## 30 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1008, 1008, 1008, 1008, 1008, 1008, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.000 0.7669643 0.5339286
## 0.001 0.7687500 0.5375000
## 0.002 0.7714286 0.5428571
## 0.003 0.7767857 0.5535714
## 0.004 0.7803571 0.5607143
## 0.005 0.7812500 0.5625000
## 0.006 0.7642857 0.5285714
## 0.007 0.7607143 0.5214286
## 0.008 0.7553571 0.5107143
## 0.009 0.7562500 0.5125000
## 0.010 0.7517857 0.5035714
## 0.011 0.7517857 0.5035714
## 0.012 0.7482143 0.4964286
## 0.013 0.7428571 0.4857143
## 0.014 0.7401786 0.4803571
## 0.015 0.7366071 0.4732143
## 0.016 0.7285714 0.4571429
## 0.017 0.7258929 0.4517857
## 0.018 0.7196429 0.4392857
## 0.019 0.7196429 0.4392857
## 0.020 0.7232143 0.4464286
## 0.021 0.7232143 0.4464286
## 0.022 0.7241071 0.4482143
## 0.023 0.7241071 0.4482143
## 0.024 0.7214286 0.4428571
## 0.025 0.7214286 0.4428571
## 0.026 0.7214286 0.4428571
## 0.027 0.7214286 0.4428571
## 0.028 0.7169643 0.4339286
## 0.029 0.7107143 0.4214286
## 0.030 0.6991071 0.3982143
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.005.
From the results in the table, the model with cp = 0.005 delivers the highest accuracy, 78.1%, which is higher than the accuracy of the previous model with cp = 0.01785714.
Finally, we apply the model to the test set to visualize the outcome on the confusion matrix table.
As expected, the accuracy has decreased from 0.74 to 0.68 but the balanced accuracy has increased from 0.63 to 0.66. We can also observe that sensitivity and specificity are now more balanced. The sensitivity is significantly improved from 0.35 to 0.60, while the specificity decreases from 0.91 to 0.71.
Overall, we determine that the model built with the caret package using a tuning grid is the one that performs best on new data (the test set) for the classification tree. Moreover, since we wanted to visualize the best classification tree, we plot it in a graph.
Random Forest (RF) is an algorithm that builds a set of decision trees and produces a final prediction by aggregating the outcomes of the trees considered (the user can define the number of trees and the number of variables tried at each node). One reason we decided to test this method is that RF is considered more stable than a single decision tree: more trees generally mean better performance. However, these advantages come at a price: RF slows down computation and cannot be visualized as easily. We will nevertheless look at its results for later comparison (Saikumar Talari, 2022).
Model evaluation - Balanced data with Cross validation and Tuning Parameter
For this method we take the same approach as for the classification tree, but we use another class-balancing technique called sub-sampling. Balancing by sub-sampling consists of decreasing the number of cases in the largest class (here “Good”) by sampling cases at random from this category until it matches the size of the smallest category (here “Bad”). It has the same aim as re-sampling, namely to have the same number of cases in each category (re-sampling instead increases the smallest category). Finally, we again apply a cross-validation technique.
We also tune the mtry hyperparameter, which indicates the number of variables randomly sampled as candidates at each split, using the tuneLength parameter.
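A sketch of the random forest fit with sub-sampling inside cross-validation; the seed, the tuneLength value and the object names are assumptions:

```r
library(caret)

# 10-fold CV with down-sampling ("sub-sampling") of the majority class
ctrl.down <- trainControl(method = "cv", number = 10, sampling = "down")

set.seed(123)  # hypothetical seed
fit.rf <- train(Applicant ~ ., data = df.tr,
                method = "rf", trControl = ctrl.down,
                tuneLength = 15)
fit.rf$bestTune  # optimal mtry (11 in our run)
```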
The optimal model is selected using the largest value of Accuracy. The hyper parameter of the optimal model is mtry = 11.
According to the results of the confusion matrix, we note changes between the classification tree model and the random forest model in terms of accuracy and balanced accuracy, which decrease from 0.68 to 0.66 and from 0.657 to 0.644 respectively. Moreover, the gap between the sensitivity and the specificity is larger, which means the model is less precise in determining whether an applicant is Good or Bad. In addition, Cohen's Kappa shows a strength of agreement of about 0.26 (fair agreement), meaning that the observed accuracy is only slightly higher than what one would expect from a random model. Overall, the Random Forest results are worse than those of the classification tree.
Neural Networks (NN) are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They are constructed from nodes, which represent neurons, connected by arcs, and they interpret sensory data through a kind of machine perception, labeling or clustering raw input (Chris Nicholson, 2020). Since NN are applicable to classification problems, we decided to consider them in our analysis.
Model evaluation - Balanced data with Cross validation and Tuning Parameters
For this method we take the same approach as for the previous models, so that we can compare the results and determine which model performs best at predicting the classes “Good” and “Bad”. To balance the data, we once again apply the sub-sampling method, and we also apply a cross-validation technique.
First, we build the model with the caret package under the considerations mentioned above to determine the best NN model. In addition, we select a grid from 1 to 10 (in steps of 1) for the number of nodes in the hidden layer (note: nnet fits a single-hidden-layer neural network), and another grid for the weight decay from 0.1 to 0.5 in steps of 0.1. The selected metric is accuracy.
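A sketch of the neural network tuning described above, assuming the ctrl.down object from the random forest sketch; the seed and object names are assumptions:

```r
grid.nnet <- expand.grid(size  = seq(1, 10, by = 1),      # nodes in the hidden layer
                         decay = seq(0.1, 0.5, by = 0.1)) # weight decay

set.seed(123)  # hypothetical seed
fit.nnet <- train(Applicant ~ ., data = df.tr,
                  method = "nnet", metric = "Accuracy",
                  trControl = ctrl.down, tuneGrid = grid.nnet,
                  trace = FALSE)  # silence nnet's iteration log
plot(fit.nnet)  # accuracy as a function of size and decay
```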
After the model was built, we wanted to visualize which hyper parameters were chosen for building the best model, so we plot them into a chart.
According to the chart, we note that the highest accuracy is reached with only 1 node in the hidden layer and a weight decay of 0.3 (the regularization parameter used to avoid over-fitting).
For visualization purposes, we plot the neural network. The positive connection weights are shown in green, while the negative connections are in blue.
Comparing the confusion-matrix results with those of the classification tree model (the best one so far), we note that the accuracy of the NN model is lower by 0.025 and that Cohen's Kappa is also lower by about 0.025. Moreover, the balanced accuracy is 0.008 below the classification tree result, and the precision is lower (0.447). Based on these results, we still prefer the classification tree over the NN.
Logistic regression is a regression method adapted to binary classification. The basic idea is to reuse the machinery developed for linear regression by modeling the probability p_i with a linear predictor function, i.e. a linear combination of the explanatory variables and a set of regression coefficients that are specific to the model at hand but the same for all trials. The linear combination is transformed into a probability using a sigmoid function.
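For reference, a standard way to write this link, where p_i is the probability that applicant i is “Good” and x_i1, …, x_ik are the explanatory variables:

$$p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik})}}$$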
Model evaluation - Balanced data with Cross validation
As with the previous models, we balance the response data (Good and Bad) by applying the sub-sampling method. A 10-fold cross-validation technique is also performed on the training set to alleviate the over-fitting issue.
For the model development, the caret package is still used, with the glm method. In principle, we should pass an additional argument of “binomial” to the “family” parameter, because glm refers to the generalized linear model, which includes many models such as linear regression, ANOVA, Poisson regression, logistic regression, etc. In practice, however, we do not need to specify the family, since caret automatically detects that we are performing classification and uses family = “binomial” by default. Additionally, twoClassSummary and classProbs = TRUE are required to compute measures specific to two-class problems, such as the area under the ROC curve, the sensitivity and the specificity.
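A sketch of the logistic regression fit and of the ROC object used below; the seed, the object names, and computing the ROC on the test set with the pROC package are assumptions:

```r
library(caret)
library(pROC)

ctrl.roc <- trainControl(method = "cv", number = 10, sampling = "down",
                         classProbs = TRUE, summaryFunction = twoClassSummary)

set.seed(123)  # hypothetical seed
fit.glm <- train(Applicant ~ ., data = df.tr,
                 method = "glm", family = "binomial",
                 trControl = ctrl.roc, metric = "ROC")

# Predicted probabilities of the "Good" class on the test set
probs <- predict(fit.glm, newdata = df.te, type = "prob")[, "Good"]
ROC <- roc(response = df.te$Applicant, predictor = probs)
```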
We do not set a threshold for this model, so the default threshold of 50% is applied.
We then plot the receiver operating characteristic (ROC) curve to find the optimal threshold. The optimal threshold can be seen at 0.531 on the ROC curve, together with the corresponding specificity and sensitivity values. Thus, our default threshold (50%) is appropriate.
plot(ROC, print.thres="best")
After that, we apply the trained model to the test set. From the confusion matrix, we see that the accuracy, Kappa, balanced accuracy, sensitivity and specificity are 0.64, 0.25, 0.643, 0.65 and 0.636, respectively. Compared with the results of the classification tree model (the best one so far), only the sensitivity of logistic regression is higher, by 0.05. We therefore conclude that the classification tree model still performs best.
A Support Vector Machine (SVM) is another simple machine learning algorithm that aims to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly separates the classes of observations. The selected hyperplane is the one with the maximum margin; in other words, to separate the two classes of data points, it maximizes the distance to the nearest data points of both classes. Maximizing the margin provides some reinforcement so that future data points can be classified with more confidence. SVM can be applied to both linear and non-linear problems.
Model evaluation - Balanced data with Cross validation and Tuning Cost
Due to the number of features (30 variables), it is difficult to use a scatter plot to judge whether this problem is linear or non-linear. Therefore, we fit both a linear SVM and a non-linear SVM, using a radial SVM as an example of the latter.
Linear SVM
Starting with the linear SVM, we again use the caret package to build the model and select the svmLinear method. To reduce the effect of over-fitting and improve the robustness of the model, we still apply a 10-fold cross-validation technique and the sub-sampling method. Moreover, we build a search grid and fit the model with each possible value in the grid to select a good hyperparameter, which in this case is the cost. The cost controls the tolerance to misclassification: if the cost is close to zero, there is little penalty on the distance to the margin, so many points can be misclassified and the boundary is very smooth; if the cost is very large, few misclassifications are allowed, the boundary is less smooth, and overfitting becomes possible.
We set the cost in the grid to be 0.01, 0.1, 1, 10, 100 and 1,000.
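A sketch of the linear SVM with this cost grid, assuming the ctrl.down object defined earlier; the seed and object names are assumptions:

```r
grid.cost <- expand.grid(C = c(0.01, 0.1, 1, 10, 100, 1000))

set.seed(123)  # hypothetical seed
fit.svml <- train(Applicant ~ ., data = df.tr,
                  method = "svmLinear", trControl = ctrl.down,
                  tuneGrid = grid.cost)
fit.svml$bestTune  # C = 100 in our run
```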
From the plot and the results below, the accuracy apparently peaks at C = 100 (0.72375).
## C
## 5 100
## Support Vector Machines with Linear Kernel
##
## 800 samples
## 30 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 1e-02 0.70875 0.3919747
## 1e-01 0.70625 0.3783631
## 1e+00 0.72125 0.3969411
## 1e+01 0.70875 0.3823382
## 1e+02 0.72375 0.4052089
## 1e+03 0.71375 0.3895158
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 100.
Then, we apply the model to the test set. Regarding the confusion matrix, the accuracy, kappa, balanced accuracy, sensitivity and specificity are 0.635, 0.217, 0.62, 0.583 and 0.657 respectively. The classification tree is still the best one so far.
Radial SVM
We repeat the procedure for an SVM with a radial basis kernel. Here there are two parameters to tune: sigma and the cost (C). The grid choice is rather arbitrary (often the result of trial and error). We use the same cost values as for the linear SVM and set sigma to 0.01, 0.02, 0.05, and 0.1.
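A sketch of the radial SVM tuning with this two-dimensional grid, under the same assumptions as before:

```r
grid.radial <- expand.grid(sigma = c(0.01, 0.02, 0.05, 0.1),
                           C = c(0.01, 0.1, 1, 10, 100, 1000))

set.seed(123)  # hypothetical seed
fit.svmr <- train(Applicant ~ ., data = df.tr,
                  method = "svmRadial", trControl = ctrl.down,
                  tuneGrid = grid.radial)
fit.svmr$bestTune  # sigma = 0.01 and C = 1 in our run
```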
As we can see from the plot and the model results, the optimal model from this search has sigma = 0.01 and C = 1, with an accuracy of 0.71875.
## sigma C
## 3 0.01 1
## Support Vector Machines with Radial Basis Function Kernel
##
## 800 samples
## 30 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## sigma C Accuracy Kappa
## 0.01 1e-02 0.66875 0.3445613
## 0.01 1e-01 0.69250 0.3824192
## 0.01 1e+00 0.71875 0.4156249
## 0.01 1e+01 0.70500 0.3672905
## 0.01 1e+02 0.65875 0.2936150
## 0.01 1e+03 0.66000 0.2796385
## 0.02 1e-02 0.67250 0.3568070
## 0.02 1e-01 0.66125 0.3364705
## 0.02 1e+00 0.71375 0.4024602
## 0.02 1e+01 0.68875 0.3370347
## 0.02 1e+02 0.70000 0.3611860
## 0.02 1e+03 0.69000 0.3365943
## 0.05 1e-02 0.63875 0.2849361
## 0.05 1e-01 0.65375 0.2609499
## 0.05 1e+00 0.70375 0.3656006
## 0.05 1e+01 0.69500 0.3496004
## 0.05 1e+02 0.68625 0.3283431
## 0.05 1e+03 0.69250 0.3416771
## 0.10 1e-02 0.52125 0.1466878
## 0.10 1e-01 0.54125 0.1365361
## 0.10 1e+00 0.66750 0.2661624
## 0.10 1e+01 0.68000 0.2867268
## 0.10 1e+02 0.69375 0.3237131
## 0.10 1e+03 0.70750 0.3313321
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.01 and C = 1.
Similarly, the radial SVM model is applied to the test set to measure its performance. From the confusion matrix, the accuracy, Kappa, balanced accuracy, sensitivity and specificity are 0.655, 0.253, 0.639, 0.6 and 0.679 respectively. Comparing the two SVM models, we observe that the radial SVM performs better than the linear SVM on every metric. However, the classification tree model still performs best.
The KNN algorithm is one of the simplest machine learning techniques, based on the assumption that similar things exist in close proximity: in other words, similar observations are near each other. KNN captures the idea of similarity (sometimes called distance, proximity, or closeness) by calculating the distance between data points. There are several ways of calculating distance, such as the Manhattan distance, Hamming distance, Euclidean distance, and the Gower index.
Model evaluation - Balanced data with Cross validation and Tuning parameter
We use the knn method in the caret package to build the model, apply the sub-sampling method, and perform 10-fold cross-validation through the trControl parameter. In addition, we standardize the features by assigning ‘center’ and ‘scale’ to the preProcess parameter.
Moreover, we set the tuneLength parameter to 20 so that the algorithm evaluates 20 candidate values of the hyperparameter k.
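A sketch of the KNN fit, assuming the ctrl.down object defined earlier; the seed and object names are assumptions:

```r
set.seed(123)  # hypothetical seed
fit.knn <- train(Applicant ~ ., data = df.tr,
                 method = "knn", trControl = ctrl.down,
                 preProcess = c("center", "scale"),  # standardize the features
                 tuneLength = 20)                    # 20 candidate values of k
fit.knn
```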
From the results below, the model selects the hyperparameter k = 9, since it delivers the highest accuracy (0.66).
## k-Nearest Neighbors
##
## 800 samples
## 30 predictor
## 2 classes: 'Bad', 'Good'
##
## Pre-processing: centered (45), scaled (45)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Addtional sampling using down-sampling prior to pre-processing
##
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.64625 0.2794474
## 7 0.65625 0.2971011
## 9 0.66125 0.3173530
## 11 0.65625 0.3108543
## 13 0.64500 0.2913405
## 15 0.65875 0.3180636
## 17 0.63250 0.2757460
## 19 0.65500 0.3202601
## 21 0.64500 0.3018661
## 23 0.63750 0.2933197
## 25 0.65000 0.3212294
## 27 0.65250 0.3075140
## 29 0.63375 0.2910729
## 31 0.64375 0.2994611
## 33 0.65625 0.3333056
## 35 0.63625 0.3072473
## 37 0.62125 0.2814219
## 39 0.65875 0.3312813
## 41 0.64250 0.3184974
## 43 0.65375 0.3234573
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
Then, we apply the model to the test set. From the confusion matrix below, we find that the accuracy, Kappa, balanced accuracy, sensitivity and specificity are 0.62, 0.221, 0.629, 0.65 and 0.607 respectively. All the metrics of this model, except the sensitivity, are lower than the results of the classification tree model.
LDA (Linear Discriminant Analysis) is a classifier used when a linear boundary is required; it is obtained by fitting class-conditional densities to the data and applying Bayes' rule. LDA assumes that all response classes share the same covariance and that the distribution of each response class is normal, with a class-specific mean and a common variance.
Model evaluation - Balanced data with Cross validation
We use the lda method in the caret package to build the model and apply the sub-sampling method to avoid bias from the unbalanced data. In addition, we set the “cv” method with number = 10 in the trControl argument, i.e. a 10-fold cross-validation technique.
We also standardize the features by assigning ‘center’ and ‘scale’ to the preProcess parameter.
Please note that there is no tuning parameter for the lda method, so we do not assign any values to the tuneLength or tuneGrid parameters.
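A sketch of the LDA fit and its evaluation on the test set, under the same assumptions as the previous sketches:

```r
set.seed(123)  # hypothetical seed
fit.lda <- train(Applicant ~ ., data = df.tr,
                 method = "lda", trControl = ctrl.down,
                 preProcess = c("center", "scale"))

# Confusion matrix on the test set
pred.lda <- predict(fit.lda, newdata = df.te)
confusionMatrix(pred.lda, df.te$Applicant)
```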
## Linear Discriminant Analysis
##
## 800 samples
## 30 predictor
## 2 classes: 'Bad', 'Good'
##
## Pre-processing: centered (45), scaled (45)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Addtional sampling using down-sampling prior to pre-processing
##
## Resampling results:
##
## Accuracy Kappa
## 0.7425 0.4511797
Afterward, we apply the trained model to the test set. From the confusion matrix, we see that the accuracy, Kappa, balanced accuracy, sensitivity and specificity are 0.65, 0.257, 0.646, 0.65 and 0.643, respectively. All metrics, except sensitivity, of LDA are lower than those of the classification tree model (the best one so far). Thus, the classification tree model still performs best.
In contrast to LDA, QDA (Quadratic Discriminant Analysis) is less strict about the covariance assumption and allows a different covariance for each class, which results in a quadratic boundary.
Model evaluation - Balanced data with Cross validation
We use the qda method in the caret package to build the model and apply the sub-sampling method to avoid bias from the unbalanced data. In addition, we set the “cv” method with number = 10 in the trControl argument, i.e. a 10-fold cross-validation technique.
We also standardize the features by assigning ‘center’ and ‘scale’ to the preProcess parameter.
Please note that there is no tuning parameter for the qda method, so we do not assign any values to the tuneLength or tuneGrid parameters.
## Quadratic Discriminant Analysis
##
## 800 samples
## 30 predictor
## 2 classes: 'Bad', 'Good'
##
## Pre-processing: centered (45), scaled (45)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Addtional sampling using down-sampling prior to pre-processing
##
## Resampling results:
##
## Accuracy Kappa
## 0.7 0.3578281
Afterward, we apply the trained model to the test set. From the confusion matrix, we see that the accuracy, Kappa, balanced accuracy, sensitivity and specificity are 0.685, 0.33, 0.685, 0.683 and 0.686, respectively. Compared with the classification tree model (the best model so far), QDA outperforms it on all metrics except specificity. Thus, we select QDA as the best model.
A Naive Bayes classifier belongs to a family of simple probabilistic classifiers: machine learning models used to discriminate between different objects based on certain features. The crux of the Naive Bayes classifier is Bayes' theorem, and it is generally used for classification tasks. There are several variants of this classifier, such as Multinomial Naive Bayes, Bernoulli Naive Bayes, and Gaussian Naive Bayes.
Model evaluation - Balanced data with Cross validation and Tuning Parameters
The caret package is still the main package we use to build the model; for the Naive Bayes classifier, the naive_bayes method is used. As good practice to avoid bias from the unbalanced data, the sub-sampling method is set through the trControl argument. Moreover, we also set the “cv” method with number = 10 in trControl, i.e. a 10-fold cross-validation technique.
In addition, a search grid is built to find good parameters for our trained model. In the naive_bayes method, we can tune three parameters: laplace (the Laplace smoothing correction), usekernel (whether a kernel density estimate is used instead of a Gaussian distribution for the numeric features), and adjust (the bandwidth adjustment of the kernel density estimate).
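A sketch of the search grid, with the values taken from the tuning results shown below; the seed and object names are assumptions:

```r
grid.nb <- expand.grid(laplace   = c(0, 0.5, 1),
                       usekernel = c(FALSE, TRUE),
                       adjust    = c(0.75, 1, 1.25, 1.5))

set.seed(123)  # hypothetical seed
fit.nb <- train(Applicant ~ ., data = df.tr,
                method = "naive_bayes", trControl = ctrl.down,
                tuneGrid = grid.nb)
plot(fit.nb)  # accuracy by distribution type, laplace and adjust
```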
From the results below, we observe that the accuracy with the Gaussian distribution (‘FALSE’, left-hand chart) is better than the accuracy with the kernel density estimate (‘TRUE’, right-hand chart). The accuracies of the Gaussian-distribution models are relatively close to one another, and the model with the highest accuracy is the one with the Gaussian distribution, a Laplace correction of 1 and a bandwidth adjustment of 0.75.
## Naive Bayes
##
## 800 samples
## 30 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## usekernel laplace adjust Accuracy Kappa
## FALSE 0.0 0.75 0.69250 0.3611620
## FALSE 0.0 1.00 0.67625 0.3484189
## FALSE 0.0 1.25 0.68750 0.3568673
## FALSE 0.0 1.50 0.68375 0.3580817
## FALSE 0.5 0.75 0.68875 0.3593466
## FALSE 0.5 1.00 0.69625 0.3701652
## FALSE 0.5 1.25 0.68250 0.3418565
## FALSE 0.5 1.50 0.68750 0.3606921
## FALSE 1.0 0.75 0.70375 0.3877056
## FALSE 1.0 1.00 0.68125 0.3438536
## FALSE 1.0 1.25 0.68875 0.3580144
## FALSE 1.0 1.50 0.70000 0.3824663
## TRUE 0.0 0.75 0.60750 0.2855257
## TRUE 0.0 1.00 0.57000 0.2415623
## TRUE 0.0 1.25 0.60125 0.2801110
## TRUE 0.0 1.50 0.59000 0.2645727
## TRUE 0.5 0.75 0.61250 0.3005501
## TRUE 0.5 1.00 0.57625 0.2446172
## TRUE 0.5 1.25 0.60125 0.2632262
## TRUE 0.5 1.50 0.61500 0.2974588
## TRUE 1.0 0.75 0.61375 0.2897495
## TRUE 1.0 1.00 0.61500 0.2834386
## TRUE 1.0 1.25 0.58875 0.2653659
## TRUE 1.0 1.50 0.62375 0.3161825
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 1, usekernel = FALSE
## and adjust = 0.75.
## laplace usekernel adjust
## 9 1 FALSE 0.75
We then apply the trained model to the test set. From the confusion matrix, we see that the accuracy, Kappa, balanced accuracy, sensitivity and specificity are 0.62, 0.215, 0.624, 0.633 and 0.614, respectively. Compared with the Quadratic Discriminant Analysis (QDA), the best model so far, every metric of the Naive Bayes classifier is lower. We conclude that QDA remains the best model.
In this section, we create a table containing the sensitivity, specificity, accuracy, balanced accuracy, and Cohen's kappa of every model, so that we can see and compare the performance of each model more easily and clearly.
From the model performance summary table, we observe that the Quadratic Discriminant Analysis (QDA) model outperforms the other models in terms of sensitivity, accuracy, balanced accuracy and Cohen's kappa; only the decision tree model achieves a higher specificity. Therefore, we consider QDA the most suitable model for this problem.
Model | Sensitivity | Specificity | Accuracy | Balanced accuracy | Kappa |
---|---|---|---|---|---|
Decision Tree | 0.6000000 | 0.7142857 | 0.680 | 0.6571429 | 0.2920354 |
Random Forest | 0.6166667 | 0.6714286 | 0.655 | 0.6440476 | 0.2596567 |
Neural Networks | 0.6333333 | 0.6642857 | 0.655 | 0.6488095 | 0.2659574 |
Logistic Regression | 0.6500000 | 0.6357143 | 0.640 | 0.6428571 | 0.2500000 |
Linear SVM | 0.5833333 | 0.6571429 | 0.635 | 0.6202381 | 0.2167382 |
Radial SVM | 0.6000000 | 0.6785714 | 0.655 | 0.6392857 | 0.2532468 |
K-Nearest Neighbors | 0.6500000 | 0.6071429 | 0.620 | 0.6285714 | 0.2213115 |
Linear Discriminant Analysis | 0.6500000 | 0.6428571 | 0.645 | 0.6464286 | 0.2573222 |
Quadratic Discriminant Analysis | 0.6833333 | 0.6857143 | 0.685 | 0.6845238 | 0.3297872 |
Naive Bayes classifier | 0.6333333 | 0.6142857 | 0.620 | 0.6238095 | 0.2148760 |
After the model evaluation, it would be very useful in a business context to see which variables are significant in the selected model.
Variable importance is a method that measures the importance of each feature for the model's prediction quality. We analyze the variable importance of our best model, the Quadratic Discriminant Analysis (QDA).
We use the AUC loss to compare the model quality after shuffling each variable in turn. AUC is a synthetic measure of the distance to a random model in the ROC curve plot; the larger the AUC, the better the model.
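A sketch of how this permutation importance could be computed with the DALEX package; the explainer set-up, the use of the test set and the object names (including fit.qda for the fitted QDA model) are assumptions:

```r
library(DALEX)

# Wrap the fitted caret QDA model (hypothetical object name fit.qda)
explainer.qda <- explain(model = fit.qda,
                         data  = df.te[, setdiff(names(df.te), "Applicant")],
                         y     = as.numeric(df.te$Applicant == "Good"),
                         label = "QDA")

# Permutation-based variable importance measured as 1 - AUC loss
vi.qda <- model_parts(explainer.qda, loss_function = loss_one_minus_auc)
plot(vi.qda)
```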
According to the feature importance of the QDA model, the top 5 most important variables are HISTORY (Credit History), OWN_RES (Applicant owns residence or not), USED_CAR (Purpose of Credit), CHK_ACCT (Balance in checking account) and SAV_ACCT (Balance in savings account). It is important to mention that if we remove these variables, the AUC of the model will have the largest loss.
There are some limitations of the variable importance method that we should take into consideration. For example, variable importance cannot capture interaction relationships between features, yet the combination (interaction) of several features might be a good predictor. Furthermore, the variable importance measure depends on the data set, so it might be subject to over-fitting.
Deployment is the process of using your new insights to make improvements within your organization. We will focus on planning for deployment and planning monitoring and maintenance.
Planning for Deployment
In this step, we plan the responsibilities of each key stakeholder group: application developers, database experts, and credit officers.
Application developers need to create an interactive web interface using tools such as R Shiny and then deploy the final model into the web interface.
Database experts need to maintain the application databases and ensure the quality of the data. They also need to be able to modify the databases in case new variables are needed for model improvement in the future.
Credit officers need to learn how to use the web interface and how to use the model output in their decision-making process. For example, at the beginning, credit officers might base a decision on both the model output (50%) and other qualitative information or factors not taken into account in the model (50%). Once the model is improved and provides better accuracy, the credit decision might be automated and rely 100% on the model output. It is also very important that credit officers are informed whenever there are any changes to the model.
Planning Monitoring and Maintenance
It’s also important to frequently monitor the model performance over time to ensure that the model performs as expected when it’s rolled out.
The frequency of monitoring should depend on the number of applicants over the time horizon, because we need a sufficient sample size to ensure that the observed accuracy is robust and meaningful. We suggest monitoring the model accuracy once there are at least 200 new applications, which is the same size as our test set.
In addition, we need to feed new information into the model and re-train the model based on new data. This is because the data distribution and customer behavior can be expected to drift over time.
caret: Classification and Regression Training, accessed on 30 April 2022
IBM, Introduction to CRISP-DM, accessed on 10 May 2022
Ivo Bernardo, “Classification Decision Trees, Easily Explained”, Aug 30, 2021
Saikumar Talari, “Random Forest® vs Decision Tree: Key Differences”, February 18, 2022