Import Library

library(partykit)

## Loading required package: grid

## Loading required package: libcoin

## Loading required package: mvtnorm

library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

library(e1071)
library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

library(tidyverse)

## -- Attaching packages ------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## Warning: package 'readr' was built under R version 3.6.3

## -- Conflicts ---------------------------------------- tidyverse_conflicts() --
## x dplyr::combine()  masks randomForest::combine()
## x dplyr::filter()   masks stats::filter()
## x dplyr::lag()      masks stats::lag()
## x ggplot2::margin() masks randomForest::margin()

library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

Data Exploration

Before modelling, we start by exploring the data.

loan <- read.csv("loan.csv")
str(loan)

## 'data.frame':    1000 obs. of  17 variables:
##  $ checking_balance    : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
##  $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_history      : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
##  $ purpose             : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
##  $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ savings_balance     : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
##  $ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
##  $ percent_of_income   : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ years_at_residence  : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ other_credit        : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ housing             : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
##  $ existing_loans_count: int  2 1 1 1 2 1 1 1 1 2 ...
##  $ job                 : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
##  $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ phone               : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
##  $ default             : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...

head(loan)

##   checking_balance months_loan_duration credit_history              purpose
## 1           < 0 DM                    6       critical furniture/appliances
## 2       1 - 200 DM                   48           good furniture/appliances
## 3          unknown                   12       critical            education
## 4           < 0 DM                   42           good furniture/appliances
## 5           < 0 DM                   24           poor                  car
## 6          unknown                   36           good            education
##   amount savings_balance employment_duration percent_of_income
## 1   1169         unknown           > 7 years                 4
## 2   5951        < 100 DM         1 - 4 years                 2
## 3   2096        < 100 DM         4 - 7 years                 2
## 4   7882        < 100 DM         4 - 7 years                 2
## 5   4870        < 100 DM         1 - 4 years                 3
## 6   9055         unknown         1 - 4 years                 2
##   years_at_residence age other_credit housing existing_loans_count       job
## 1                  4  67         none     own                    2   skilled
## 2                  2  22         none     own                    1   skilled
## 3                  3  49         none     own                    1 unskilled
## 4                  4  45         none   other                    1   skilled
## 5                  4  53         none   other                    2   skilled
## 6                  4  35         none   other                    1 unskilled
##   dependents phone default
## 1          1   yes      no
## 2          1    no     yes
## 3          2    no      no
## 4          2    no      no
## 5          2    no     yes
## 6          2   yes      no

The output above shows that the loan data consists of 1000 observations and 17 variables. Each row is a historical record of a bank customer, and the default column records whether that customer defaulted. The features are described below:

checking_balance and savings_balance: Status of existing checking/savings account
months_loan_duration : Duration of the loan in months
credit_history: Between critical, good, perfect, poor and very good
purpose: Between business, car (new), car (used), education, furniture/appliances, and renovations; the two car categories appear in the data as car and car0
employment_duration: Present employment since
percent_of_income: Installment rate in percentage of disposable income
years_at_residence: Present residence since
other_credit: Other installment plans (bank/store)
housing: Between rent, own, or for free
job: Between management, skilled, unskilled and unemployed
dependents: Number of people being liable to provide maintenance for
phone: Between no and yes (yes means a phone registered under the customer's name)

Loans are risky, but they are also a product that generates profit for the institution through the difference between borrowing and lending rates. Identifying risky customers is therefore one way to minimize lender losses. We will use the given set of predictors to model and predict the default variable.

Exploratory Data Analysis

loandefault <- loan %>% 
  filter(default == "yes")

loandefault$purpose %>% 
  table()

## .
##             business                  car                 car0 
##                   34                  106                    5 
##            education furniture/appliances          renovations 
##                   23                  124                    8

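Based on the counts above, furniture/appliances is the loan purpose with the most defaults (124), followed by car (106). These are raw counts, so they also reflect how common each purpose is among all loans.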
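To separate "most defaults" from "highest default rate", the counts can be turned into within-purpose proportions. The sketch below uses a small hypothetical data frame named toy (not the real loan.csv) and base R only; the same two calls should work on loan if the column names match.

```r
# Hypothetical toy data, not the real loan.csv
toy <- data.frame(
  purpose = c("car", "car", "car", "car", "education", "education"),
  default = c("yes", "no", "no", "no", "yes", "no")
)

# Number of defaults per purpose (the kind of table shown above)
default_counts <- table(toy$purpose[toy$default == "yes"])

# Row-wise proportions: share of "no" and "yes" within each purpose
default_rates <- prop.table(table(toy$purpose, toy$default), margin = 1)

print(default_counts)
print(round(default_rates, 2))
```

In this toy example both purposes have one default, yet the default rate differs because the car group is larger.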

Cross-Validation

Before we build our model, we should split the dataset into training and test data.

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100)
index <- sample(1:nrow(loan),0.8*nrow(loan))
data_train<- loan[index,]
data_test<- loan[-index,]
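A simple random draw like this one does not guarantee that both parts keep the same share of defaults. As an alternative, a stratified split samples 80% within each class. The sketch below is one way to do this in base R on a hypothetical data frame named toy; it is not the split used in this analysis.

```r
# Hypothetical toy data standing in for the loan data (70% "no", 30% "yes")
set.seed(100)
toy <- data.frame(
  id = 1:100,
  default = factor(rep(c("no", "yes"), times = c(70, 30)))
)

# Sample 80% of the row indices separately within each class
idx_by_class <- split(seq_len(nrow(toy)), toy$default)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Both parts keep the 70/30 class proportions
prop.table(table(toy_train$default))
prop.table(table(toy_test$default))
```

The caret package also offers createDataPartition() for stratified splitting.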

Check whether the target classes are balanced in the full, train, and test data:

prop.table(table(loan$default))

## 
##  no yes 
## 0.7 0.3

prop.table(table(data_train$default))

## 
##     no    yes 
## 0.6775 0.3225

prop.table(table(data_test$default))

## 
##   no  yes 
## 0.79 0.21

Based on the proportions above, the target variable is imbalanced, with "no" far more frequent than "yes". We therefore balance the train data before using it for our models.

loan_train_down <- downSample(select(data_train,-default), data_train$default, yname = "default")
prop.table(table(loan_train_down$default))

## 
##  no yes 
## 0.5 0.5
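The idea behind random downsampling is to keep every class at the size of the smallest one by discarding randomly chosen rows from the larger classes. The base R sketch below illustrates that idea on a hypothetical data frame named toy_train; it is an illustration of the technique, not caret's actual implementation.

```r
# Hypothetical toy training data with an imbalanced target
set.seed(100)
toy_train <- data.frame(
  amount  = 1:10,
  default = factor(c(rep("no", 7), rep("yes", 3)))
)

# Size of the smallest class
n_min <- min(table(toy_train$default))

# Keep n_min randomly chosen rows from every class
rows_by_class <- split(seq_len(nrow(toy_train)), toy_train$default)
keep <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), n_min)]
}))
toy_down <- toy_train[keep, ]

table(toy_down$default)
```

The price of balancing this way is that some majority-class rows are thrown away, so the model is trained on less data.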

Decision Tree

After splitting the data into data_train and data_test and downsampling the train data into loan_train_down, let us build our first model.

model_dt <- ctree(formula = default ~ ., 
                  data = loan_train_down,
                  control = ctree_control(mincriterion = 0.10))
plot(model_dt)

loan_prediction <- predict(model_dt, data_test)
loan_pred_prob <- predict(model_dt, data_test, type = "prob")

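In our decision tree, we set mincriterion = 0.10. In ctree_control(), mincriterion is the value of 1 - p-value that must be exceeded for a node to be split, so 0.10 allows a split whenever its p-value is below 0.90. This is far more permissive than the default of 0.95, so the tree is allowed to grow larger rather than being pruned.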

We will also plot the model using the type = "simple" argument.

plot(model_dt, type= "simple")

Predict on data train and test

A good machine learning model generalizes well from the training data to any data from the problem domain, which lets us make predictions on data the model has never seen. Two terms describe how a model can fail at this: overfitting, where it performs well on the training data but poorly on new data, and underfitting, where it performs poorly on both. We therefore validate whether our model is good enough by comparing its performance on the train and test data.

pred_train_dt <- predict(model_dt, loan_train_down, type = "response")
pred_test_dt <- predict(model_dt, data_test, type = "response")

confusionMatrix(
  pred_train_dt,
  loan_train_down$default,
  positive = "yes" # the positive class is "yes"
)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  191  57
##        yes  67 201
##                                           
##                Accuracy : 0.7597          
##                  95% CI : (0.7204, 0.7959)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5194          
##                                           
##  Mcnemar's Test P-Value : 0.419           
##                                           
##             Sensitivity : 0.7791          
##             Specificity : 0.7403          
##          Pos Pred Value : 0.7500          
##          Neg Pred Value : 0.7702          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3895          
##    Detection Prevalence : 0.5194          
##       Balanced Accuracy : 0.7597          
##                                           
##        'Positive' Class : yes             
##

confusionMatrix(
  pred_test_dt,
  data_test$default,
  positive = "yes" # the positive class is "yes"
)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  98  15
##        yes 60  27
##                                           
##                Accuracy : 0.625           
##                  95% CI : (0.5539, 0.6923)
##     No Information Rate : 0.79            
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1888          
##                                           
##  Mcnemar's Test P-Value : 3.761e-07       
##                                           
##             Sensitivity : 0.6429          
##             Specificity : 0.6203          
##          Pos Pred Value : 0.3103          
##          Neg Pred Value : 0.8673          
##              Prevalence : 0.2100          
##          Detection Rate : 0.1350          
##    Detection Prevalence : 0.4350          
##       Balanced Accuracy : 0.6316          
##                                           
##        'Positive' Class : yes             
##

Model Evaluation

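From the decision tree performance above, accuracy falls from 0.76 on the training data to 0.625 on the test data. A gap in this direction is usually read as a sign of overfitting rather than underfitting, although neither score is high.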
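The headline statistics in the test confusion matrix can be reproduced by hand, which makes it clearer what each one measures. The counts below are copied from the decision tree output on data_test, with "yes" as the positive class; the variable names are only illustrative.

```r
# Counts copied from the decision tree confusion matrix on data_test
tp <- 27  # predicted yes, actually yes
fn <- 15  # predicted no,  actually yes
fp <- 60  # predicted yes, actually no
tn <- 98  # predicted no,  actually no

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)  # recall for the "yes" class
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)  # reported as Pos Pred Value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

The low precision shows that most customers flagged as "yes" by the tree did not actually default.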

Random Forest

The second model we want to build is a Random Forest. We have prepared the model in model_rf.RDS, which was built with the following seed and repeated k-fold cross-validation settings:

set.seed(100) # the seed number
number = 5 # the number of folds (k) in k-fold cross-validation
repeats = 3 # the number of times the k-fold cross-validation is repeated

model_rf <- readRDS("model_rf.RDS")

model_rf$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 33.61%
## Confusion matrix:
##      no yes class.error
## no  158  80   0.3361345
## yes  80 158   0.3361345

A random forest already provides an out-of-bag (OOB) estimate of its error on unseen data, because each tree is evaluated on the observations left out of its bootstrap sample. Based on the model_rf$finalModel summary above, the out-of-bag error rate of our model is 33.61%, which means we expect it to misclassify about 33.61% of unseen data drawn like the balanced training data.
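The 33.61% figure follows directly from the OOB confusion matrix printed above. The short base R check below copies those counts into a matrix (the object names are illustrative) and recomputes the overall and per-class error.

```r
# OOB confusion matrix copied from model_rf$finalModel
oob <- matrix(c(158,  80,
                 80, 158),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("no", "yes"),
                              predicted = c("no", "yes")))

# Overall OOB error: share of observations off the diagonal
oob_error <- 1 - sum(diag(oob)) / sum(oob)

# Per-class error: misclassified share within each actual class
class_error <- 1 - diag(oob) / rowSums(oob)

round(100 * oob_error, 2)
round(class_error, 7)
```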

Predicting the test data

After loading the model, we can now predict data_test using model_rf.

loan_prediction <- predict(model_rf, data_test, type = "raw")

Model evaluation

Next, we evaluate the performance of the random forest model with the confusionMatrix() function.

confusionMatrix(
  as.factor(loan_prediction),
  as.factor(data_test$default)
)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  123   2
##        yes  35  40
##                                           
##                Accuracy : 0.815           
##                  95% CI : (0.7541, 0.8663)
##     No Information Rate : 0.79            
##     P-Value [Acc > NIR] : 0.2193          
##                                           
##                   Kappa : 0.5673          
##                                           
##  Mcnemar's Test P-Value : 1.435e-07       
##                                           
##             Sensitivity : 0.7785          
##             Specificity : 0.9524          
##          Pos Pred Value : 0.9840          
##          Neg Pred Value : 0.5333          
##              Prevalence : 0.7900          
##          Detection Rate : 0.6150          
##    Detection Prevalence : 0.6250          
##       Balanced Accuracy : 0.8654          
##                                           
##        'Positive' Class : no              
##
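Because no positive argument is passed in this call, the output reports "no" as the positive class, so the Sensitivity shown is the recall for non-defaulting customers. To compare with the decision tree, where "yes" was the positive class, the corresponding figures can be recomputed from the same counts. The object names below are illustrative.

```r
# Counts copied from the random forest confusion matrix on data_test
rf_cm <- matrix(c(123,  2,
                   35, 40),
                nrow = 2, byrow = TRUE,
                dimnames = list(prediction = c("no", "yes"),
                                reference  = c("no", "yes")))

# Treat "yes" (default) as the class of interest
recall_yes    <- rf_cm["yes", "yes"] / sum(rf_cm[, "yes"])  # 40 / 42
precision_yes <- rf_cm["yes", "yes"] / sum(rf_cm["yes", ])  # 40 / 75

round(c(recall_yes = recall_yes, precision_yes = precision_yes), 4)
```

These match the Specificity and Neg Pred Value lines in the output above, which describe the "yes" class when "no" is treated as positive.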

varImp(model_rf)

## rf variable importance
## 
##   only 20 most important variables shown (out of 35)
## 
##                                Overall
## amount                          100.00
## months_loan_duration             90.17
## age                              89.60
## checking_balanceunknown          82.43
## percent_of_income                42.50
## years_at_residence               37.09
## savings_balanceunknown           25.77
## existing_loans_count             20.97
## phoneyes                         20.97
## checking_balance1 - 200 DM       19.95
## other_creditnone                 19.87
## credit_historygood               19.46
## housingown                       17.71
## credit_historyperfect            17.64
## jobskilled                       17.30
## savings_balance> 1000 DM         16.72
## purposefurniture/appliances      15.88
## employment_duration1 - 4 years   15.08
## purposecar                       14.44
## employment_duration4 - 7 years   13.36

We can see that the amount variable has the highest importance, followed by months_loan_duration and age.

Naive Bayes

The last model we want to compare is Naive Bayes. This model has several advantages, for example:

The model is relatively fast to train
It produces probabilistic predictions
It can handle irrelevant features

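Below are the characteristics of Naive Bayes to keep in mind:

It assumes that the predictor variables are independent of each other given the class
Its probability estimates can be skewed by data scarcity, for example when a category is rarely or never observed together with a class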

Naive Bayes model fitting

Now let us build a Naive Bayes model using the naiveBayes() function from the e1071 package, setting the laplace parameter to 1 to apply Laplace smoothing.

model_naive <- naiveBayes(default ~ ., loan_train_down, laplace = 1)
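To see what Laplace smoothing does, it helps to compute a conditional probability table by hand. The sketch below uses hypothetical housing and default vectors (not the loan data) and the usual additive smoothing formula, in which the smoothing value is added to every cell count. It is meant as an illustration of the technique rather than a copy of the e1071 internals.

```r
# Hypothetical toy data: one categorical predictor and the class label
housing <- factor(c("own", "own", "rent", "own", "rent", "own"),
                  levels = c("other", "own", "rent"))
default <- factor(c("no", "no", "no", "no", "yes", "yes"))

counts  <- table(default, housing)
laplace <- 1

# Unsmoothed estimate of P(housing | default): "other" gets probability 0
raw <- counts / rowSums(counts)

# Additive (Laplace) smoothing: add `laplace` to every cell count
smoothed <- (counts + laplace) / (rowSums(counts) + laplace * nlevels(housing))

round(raw, 3)
round(smoothed, 3)
```

Without smoothing, a single zero probability would force the whole product of conditional probabilities to zero for any customer with housing "other".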

Predicting the test data with the Naive Bayes model

Using the test dataset created earlier, we predict with model_naive. With type = "raw", the prediction returns the probability of each class for every observation in the test data.

pred_roc <- predict(model_naive, data_test, type = "raw")

Now, let’s take a look at the ROC curve and AUC. We compare the predicted probability of the positive class (the second column of pred_roc) with the actual labels.

spam_roc <- ROCR::prediction(pred_roc[,2],
                             data_test$default)  

# auc
performance(spam_roc, "auc")

## An object of class "performance"
## Slot "x.name":
## [1] "None"
## 
## Slot "y.name":
## [1] "Area under the ROC curve"
## 
## Slot "alpha.name":
## [1] "none"
## 
## Slot "x.values":
## list()
## 
## Slot "y.values":
## [[1]]
## [1] 0.7248342
## 
## 
## Slot "alpha.values":
## list()
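The AUC can be read as the probability that a randomly chosen defaulting customer receives a higher score than a randomly chosen non-defaulting one. The base R sketch below computes it with the rank-based (Mann-Whitney) formula on hypothetical scores and labels, not on the loan data.

```r
# Hypothetical scores and labels, not taken from the loan data
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c("yes", "yes", "no", "yes", "no", "no", "yes", "no")

is_pos <- labels == "yes"
n_pos  <- sum(is_pos)
n_neg  <- sum(!is_pos)

# Rank-based (Mann-Whitney) formula for the area under the ROC curve
auc <- (sum(rank(scores)[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
auc
```

For the ROCR object above, the numeric value sits in the y.values slot shown in the output, so it can be extracted with performance(spam_roc, "auc")@y.values[[1]].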

perf <- performance(spam_roc, "tpr", "fpr")

plot(perf, colorize = TRUE)

The area under the ROC curve is about 0.72. A value of 0.5 corresponds to random guessing and 1 to perfect separation, so the Naive Bayes model separates defaulting from non-defaulting customers moderately well on the test data.

Conclusion

In terms of performance, Random Forest is better at identifying high-risk customers. However, in a financial institution we are required to produce a rule-based model that can be implemented in the existing system. The best model to use is therefore the decision tree, because a decision tree is easily translated into a set of rules.

Credit Risk Analysis

Ezra Soterion Nugroho

2/21/2020