Import Library
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
## -- Attaching packages ------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.4
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'readr' was built under R version 3.6.3
## -- Conflicts ---------------------------------------- tidyverse_conflicts() --
## x dplyr::combine() masks randomForest::combine()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x ggplot2::margin() masks randomForest::margin()
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
Data Exploration
Before modelling, we start by exploring the data.
## 'data.frame': 1000 obs. of 17 variables:
## $ checking_balance : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
## $ months_loan_duration: int 6 48 12 42 24 36 24 36 12 30 ...
## $ credit_history : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
## $ purpose : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
## $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ savings_balance : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
## $ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
## $ percent_of_income : int 4 2 2 2 3 2 3 2 2 4 ...
## $ years_at_residence : int 4 2 3 4 4 4 4 2 4 2 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ other_credit : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ housing : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
## $ existing_loans_count: int 2 1 1 1 2 1 1 1 1 2 ...
## $ job : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
## $ dependents : int 1 1 2 2 2 2 1 1 1 1 ...
## $ phone : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
## $ default : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...
## checking_balance months_loan_duration credit_history purpose
## 1 < 0 DM 6 critical furniture/appliances
## 2 1 - 200 DM 48 good furniture/appliances
## 3 unknown 12 critical education
## 4 < 0 DM 42 good furniture/appliances
## 5 < 0 DM 24 poor car
## 6 unknown 36 good education
## amount savings_balance employment_duration percent_of_income
## 1 1169 unknown > 7 years 4
## 2 5951 < 100 DM 1 - 4 years 2
## 3 2096 < 100 DM 4 - 7 years 2
## 4 7882 < 100 DM 4 - 7 years 2
## 5 4870 < 100 DM 1 - 4 years 3
## 6 9055 unknown 1 - 4 years 2
## years_at_residence age other_credit housing existing_loans_count job
## 1 4 67 none own 2 skilled
## 2 2 22 none own 1 skilled
## 3 3 49 none own 1 unskilled
## 4 4 45 none other 1 skilled
## 5 4 53 none other 2 skilled
## 6 4 35 none other 1 unskilled
## dependents phone default
## 1 1 yes no
## 2 1 no yes
## 3 2 no no
## 4 2 no no
## 5 2 no yes
## 6 2 yes no
Based on the output above, the loan data consists of 1000 observations and 17 variables. It is historical data on bank customers, and the default variable records whether each customer defaulted. The features are described below:
- checking_balance and savings_balance: Status of existing checking/savings account
- months_loan_duration: Duration of the loan in months
- credit_history: Between critical, good, perfect, poor and very good
- purpose: Between business, car (new), car (used), education, furniture, and renovations
- employment_duration: Present employment since
- percent_of_income: Installment rate as a percentage of disposable income
- years_at_residence: Present residence since
- other_credit: Other installment plans (bank/store)
- housing: Between rent, own, or for free
- job: Between management, skilled, unskilled and unemployed
- dependents: Number of people the customer is liable to provide maintenance for
- phone: Between no and yes (registered under the customer's name)
Loans are risky, but they are also a product that generates profit for the institution through the difference between borrowing and lending rates. Identifying risky customers is therefore one way to minimize lender losses. We will use the given set of predictors to model the default variable and predict which customers are likely to default.
Exploratory Data Analysis
## .
## business car car0
## 34 106 5
## education furniture/appliances renovations
## 23 124 8
Based on the exploration above, furniture/appliances loans account for the largest number of defaults (124), followed by car loans (106).
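The table above reports raw counts, so a purpose with many loans will tend to show many defaults. As a minimal sketch of how counts and default rates could be compared, the snippet below uses a small made-up data frame (`toy` and its values are assumptions for illustration, not the loan data):

```r
# Toy data standing in for the loan data (values are made up for illustration).
toy <- data.frame(
  purpose = c("car", "car", "car", "education", "education",
              "business", "business", "business", "business", "car"),
  default = c("yes", "no", "no", "yes", "yes",
              "no", "yes", "no", "no", "yes")
)

# Number of defaults per purpose (the kind of count shown in the table above).
default_counts <- table(toy$purpose[toy$default == "yes"])

# Share of loans within each purpose that defaulted.
default_rates <- tapply(toy$default == "yes", toy$purpose, mean)

print(default_counts)
print(round(default_rates, 2))
```

In this toy example "car" and "education" have the same number of defaults, but every education loan defaults, so the rate tells a different story from the count.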
Cross-Validation
Before we build our model, we should split the dataset into training and test data.
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
index <- sample(1:nrow(loan), 0.8 * nrow(loan))
data_train <- loan[index, ]
data_test <- loan[-index, ]
Checking whether the train and test data are balanced
##
## no yes
## 0.7 0.3
##
## no yes
## 0.6775 0.3225
##
## no yes
## 0.79 0.21
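The proportions above show that the simple random split left the test set with fewer defaults (21%) than the training set (32%). One possible alternative, not used in the original analysis, is a stratified split that samples 80% within each class. The sketch below uses base R on a made-up data frame (`toy` is an assumption); caret's `createDataPartition()` offers similar stratified sampling.

```r
set.seed(100)

# Toy data: 70% "no" and 30% "yes", mimicking the overall class balance.
toy <- data.frame(
  id = 1:1000,
  default = factor(rep(c("no", "yes"), times = c(700, 300)))
)

# Sample 80% of the row indices within each class, then combine them.
rows_by_class <- split(seq_len(nrow(toy)), toy$default)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Both subsets keep the original class proportions.
prop.table(table(toy_train$default))
prop.table(table(toy_test$default))
```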
Based on the proportions of the target variable above, the target variable is imbalanced (about 68% "no" versus 32% "yes" in the training data); hence we balance the training data by downsampling before using it for our models.
loan_train_down <- downSample(select(data_train, -default), data_train$default, yname = "default")
prop.table(table(loan_train_down$default))
##
## no yes
## 0.5 0.5
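To make the downsampling step more concrete, here is a rough base R sketch of the idea: keep as many randomly chosen rows from each class as there are in the smallest class. This is an illustration of the technique on made-up data (`toy_train` is an assumption), not the internals of caret's `downSample()`.

```r
set.seed(100)

# Toy training set with roughly the imbalance seen above (542 "no", 258 "yes").
toy_train <- data.frame(
  amount  = seq_len(800),
  default = factor(rep(c("no", "yes"), times = c(542, 258)))
)

# Size of the smallest class.
n_min <- min(table(toy_train$default))

# Keep n_min randomly chosen rows from every class.
keep <- unlist(lapply(split(seq_len(nrow(toy_train)), toy_train$default),
                      function(rows) sample(rows, n_min)))
toy_down <- toy_train[keep, ]

table(toy_down$default)
```

The trade-off is that rows from the majority class are discarded, so the balanced training set is smaller than the original one.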
Decision Tree
After splitting our data into data_train and data_test and balancing the training data, let us build our first model.
model_dt <- ctree(formula = default ~ .,
                  data = loan_train_down,
                  control = ctree_control(mincriterion = 0.10))
plot(model_dt)
loan_prediction <- predict(model_dt, data_test)
loan_pred_prob <- predict(model_dt, data_test, type = "prob")
- In our decision tree, we set mincriterion = 0.10. In ctree_control(), mincriterion is the value of 1 - p-value that must be exceeded for a node to be split, so a value of 0.10 allows any split whose p-value is below 0.90. This is much more permissive than the default of 0.95 and lets the tree grow larger rather than pruning it.
- We also plot the model using the type = "simple" argument.
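The effect of mincriterion can be illustrated with a simplified check. The partykit documentation describes mincriterion as the value of 1 - p-value that must be exceeded to implement a split. The helper `would_split()` below is hypothetical and ignores details such as multiplicity adjustment; it only shows how the threshold behaves.

```r
# Simplified view of the split rule: a node is split only when
# 1 - p-value of the association test exceeds mincriterion.
would_split <- function(p_value, mincriterion) {
  (1 - p_value) > mincriterion
}

p_values <- c(0.01, 0.20, 0.50, 0.95)

# Lenient setting used in this analysis: most candidate splits pass.
would_split(p_values, mincriterion = 0.10)

# The partykit default (0.95) only accepts strongly significant splits.
would_split(p_values, mincriterion = 0.95)
```

A lower mincriterion therefore produces a deeper tree, while a higher value produces a smaller one.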
Predict on data train and test
A good machine learning model generalizes well from the training data to new data from the same problem domain, which lets us make predictions on data the model has never seen. Two terms describe how a model can fail at this: overfitting, where the model fits the training data closely but performs worse on new data, and underfitting, where it performs poorly even on the training data. We therefore validate whether our model is good enough by checking its performance on both the training and the test data.
pred_train_dt <- predict(model_dt, loan_train_down, type = "response")
pred_test_dt <- predict(model_dt, data_test, type = "response")
confusionMatrix(
  pred_train_dt,
  loan_train_down$default,
  positive = "yes" # the positive class is "yes"
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 191 57
## yes 67 201
##
## Accuracy : 0.7597
## 95% CI : (0.7204, 0.7959)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5194
##
## Mcnemar's Test P-Value : 0.419
##
## Sensitivity : 0.7791
## Specificity : 0.7403
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.7702
## Prevalence : 0.5000
## Detection Rate : 0.3895
## Detection Prevalence : 0.5194
## Balanced Accuracy : 0.7597
##
## 'Positive' Class : yes
##
confusionMatrix(
  pred_test_dt,
  data_test$default,
  positive = "yes" # the positive class is "yes"
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 98 15
## yes 60 27
##
## Accuracy : 0.625
## 95% CI : (0.5539, 0.6923)
## No Information Rate : 0.79
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1888
##
## Mcnemar's Test P-Value : 3.761e-07
##
## Sensitivity : 0.6429
## Specificity : 0.6203
## Pos Pred Value : 0.3103
## Neg Pred Value : 0.8673
## Prevalence : 0.2100
## Detection Rate : 0.1350
## Detection Prevalence : 0.4350
## Balanced Accuracy : 0.6316
##
## 'Positive' Class : yes
##
Model Evaluation
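- From the decision tree performance above, accuracy is only 0.76 on the training data and falls to 0.625 on the test data, which is below the no-information rate of 0.79. We conclude that our decision tree model tends to underfit.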
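As a sanity check on the test-set figures reported above, the accuracy, no-information rate and balanced accuracy can be recomputed by hand from the confusion matrix counts. The sketch below copies the four counts from the decision tree's test output; the variable names are illustrative.

```r
# Test-set confusion matrix of the decision tree, copied from the output above.
cm <- matrix(c(98, 60, 15, 27), nrow = 2,
             dimnames = list(Prediction = c("no", "yes"),
                             Reference  = c("no", "yes")))

accuracy    <- sum(diag(cm)) / sum(cm)              # share of correct predictions
nir         <- max(colSums(cm)) / sum(cm)           # always predicting the majority class
sensitivity <- cm["yes", "yes"] / sum(cm[, "yes"])  # defaulters correctly flagged
specificity <- cm["no", "no"] / sum(cm[, "no"])     # non-defaulters correctly passed
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, nir = nir,
        balanced_accuracy = balanced_accuracy), 4)
```

Because the test set is imbalanced, plain accuracy is compared against the no-information rate, while balanced accuracy averages the performance on the two classes.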
Random Forest
The second model that we want to build is Random Forest. We have prepared the model in model_rf.RDS. The model was built with the following settings:
- set.seed(100) # the seed number
- number = 5 # the number of folds in k-fold cross-validation
- repeats = 3 # the number of times the cross-validation is repeated
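The exact training call behind model_rf.RDS is not shown, so the following is only a guess at how such settings are typically passed to caret. It assumes repeated k-fold cross-validation (`method = "repeatedcv"`) and uses the built-in iris data as a stand-in for the loan data; both choices are assumptions made for illustration.

```r
library(caret)   # also needs the randomForest package installed

set.seed(100)

# 5-fold cross-validation, repeated 3 times (assumed resampling method).
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

# iris is only a stand-in data set; the original model was fitted
# on the loan data with a call that is not shown in the text.
rf_sketch <- train(Species ~ ., data = iris, method = "rf", trControl = ctrl)

# The underlying randomForest object, including its OOB error estimate.
rf_sketch$finalModel
```

A fitted model of this kind can be stored with `saveRDS()` and loaded again with `readRDS()`, which avoids retraining each time the document is rendered.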
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 33.61%
## Confusion matrix:
## no yes class.error
## no 158 80 0.3361345
## yes 80 158 0.3361345
In practice, a random forest already provides an out-of-bag (OOB) estimate of its error on unseen data, computed from the observations left out of each tree's bootstrap sample. Based on the model_rf$finalModel summary above, the OOB error rate of our model is 33.61%. That means the model is expected to misclassify roughly one in three unseen observations.
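The reported OOB error rate can be reproduced from the OOB confusion matrix in the summary above. The sketch below copies those counts into a matrix (rows are the actual classes, columns the predicted ones); the variable names are illustrative.

```r
# Out-of-bag confusion matrix copied from the model summary above
# (rows: actual class, columns: predicted class).
oob <- matrix(c(158, 80, 80, 158), nrow = 2,
              dimnames = list(actual = c("no", "yes"),
                              predicted = c("no", "yes")))

# Overall OOB error: share of observations off the diagonal.
oob_error <- 1 - sum(diag(oob)) / sum(oob)

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(oob) / rowSums(oob)

round(oob_error, 4)
round(class_error, 4)
```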
Predicting the test data
After building the model, we can now predict on data_train and data_test using model_rf.
Model evaluation
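Next, we evaluate the performance of the random forest model with the confusionMatrix() function. Note that the positive class in the output below is "no", so Sensitivity here refers to non-defaulters, unlike the decision tree output where the positive class is "yes".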
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 123 2
## yes 35 40
##
## Accuracy : 0.815
## 95% CI : (0.7541, 0.8663)
## No Information Rate : 0.79
## P-Value [Acc > NIR] : 0.2193
##
## Kappa : 0.5673
##
## Mcnemar's Test P-Value : 1.435e-07
##
## Sensitivity : 0.7785
## Specificity : 0.9524
## Pos Pred Value : 0.9840
## Neg Pred Value : 0.5333
## Prevalence : 0.7900
## Detection Rate : 0.6150
## Detection Prevalence : 0.6250
## Balanced Accuracy : 0.8654
##
## 'Positive' Class : no
##
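Since the positive class in the output above is "no", it can help to recompute recall and precision with "yes" as the positive class, which is the view that matters for catching defaulters. The helper `class_metrics()` below is a hypothetical function written for this illustration, using the counts copied from the matrix above.

```r
# Random forest test-set confusion matrix, copied from the output above.
cm_rf <- matrix(c(123, 35, 2, 40), nrow = 2,
                dimnames = list(Prediction = c("no", "yes"),
                                Reference  = c("no", "yes")))

# Recall and precision for a chosen positive class.
class_metrics <- function(cm, positive) {
  c(recall    = cm[positive, positive] / sum(cm[, positive]),
    precision = cm[positive, positive] / sum(cm[positive, ]))
}

# Matches Sensitivity and Pos Pred Value in the output above.
round(class_metrics(cm_rf, positive = "no"), 4)

# The same matrix viewed with defaulters as the positive class.
round(class_metrics(cm_rf, positive = "yes"), 4)
```

With "yes" as the positive class, recall is 40/42 and precision is 40/75, so the model catches nearly all defaulters at the cost of flagging a number of non-defaulters.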
## rf variable importance
##
## only 20 most important variables shown (out of 35)
##
## Overall
## amount 100.00
## months_loan_duration 90.17
## age 89.60
## checking_balanceunknown 82.43
## percent_of_income 42.50
## years_at_residence 37.09
## savings_balanceunknown 25.77
## existing_loans_count 20.97
## phoneyes 20.97
## checking_balance1 - 200 DM 19.95
## other_creditnone 19.87
## credit_historygood 19.46
## housingown 17.71
## credit_historyperfect 17.64
## jobskilled 17.30
## savings_balance> 1000 DM 16.72
## purposefurniture/appliances 15.88
## employment_duration1 - 4 years 15.08
## purposecar 14.44
## employment_duration4 - 7 years 13.36
We can see that the amount variable has the most influence, followed by months_loan_duration and age.
Naive Bayes
The last model we want to compare is Naive Bayes. This model has several advantages, for example:
- The model is relatively fast to train
- It produces probabilistic predictions
- It can handle irrelevant features
Below are the characteristics of Naive Bayes:
- It assumes that the predictor variables are independent of one another
- Its probability estimates can be skewed when data are scarce
Naive Bayes model fitting
Now let us build a Naive Bayes model using the naiveBayes() function from the e1071 package and set the laplace parameter.
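The laplace value used for the model is not shown in the text. To illustrate what the parameter is for, here is a small base R sketch of additive (Laplace) smoothing for a categorical predictor. The data frame `toy` and the helper `cond_prob()` are assumptions made for this example, not part of the original analysis.

```r
# Toy training data: no defaulter has housing == "other" (zero-frequency case).
toy <- data.frame(
  housing = factor(c("own", "own", "rent", "other", "own", "rent"),
                   levels = c("other", "own", "rent")),
  default = factor(c("no", "no", "no", "no", "yes", "yes"))
)

# Conditional probabilities P(housing | default) with additive smoothing.
cond_prob <- function(x, y, laplace = 0) {
  counts <- table(y, x)
  (counts + laplace) / (rowSums(counts) + laplace * nlevels(x))
}

# Without smoothing, P(other | yes) is exactly zero.
round(cond_prob(toy$housing, toy$default, laplace = 0), 3)

# With laplace = 1, every category gets a small non-zero probability.
round(cond_prob(toy$housing, toy$default, laplace = 1), 3)
```

Without smoothing, a single zero probability would force the posterior for that class to zero regardless of the other predictors, which is the data-scarcity problem that smoothing is meant to soften.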
Predict the naive bayes model, using data train and data test
Using the test dataset we created earlier, we predict with model_naive. The prediction gives the probability of the positive class for each observation in the test data.
Now, let's take a look at ROC and AUC performance. We compare the predicted probability of the positive class with the actual labels.
## An object of class "performance"
## Slot "x.name":
## [1] "None"
##
## Slot "y.name":
## [1] "Area under the ROC curve"
##
## Slot "alpha.name":
## [1] "none"
##
## Slot "x.values":
## list()
##
## Slot "y.values":
## [[1]]
## [1] 0.7248342
##
##
## Slot "alpha.values":
## list()
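To show what the AUC value measures, the sketch below computes it in base R from ranks (the Mann-Whitney formulation): the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The scores, labels and the helper `auc_rank()` are made up for illustration and are not the model's predictions.

```r
# Made-up predicted probabilities and true labels for illustration.
prob_yes <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
actual   <- c("yes", "yes", "no", "yes", "no", "no", "yes", "no")

auc_rank <- function(score, label, positive = "yes") {
  is_pos <- label == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  # Mann-Whitney U statistic from the ranks of the positive cases.
  u <- sum(rank(score)[is_pos]) - n_pos * (n_pos + 1) / 2
  u / (n_pos * n_neg)
}

auc_rank(prob_yes, actual)
```

In this toy example 12 of the 16 positive-negative pairs are ordered correctly, giving an AUC of 0.75. A value of 0.5 corresponds to random ordering and 1 to perfect separation.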
The AUC of the Naive Bayes model is about 0.72. We also inspect the ROC curve to see if there are any undesirable results from our model.
Conclusion
In terms of performance, Random Forest is better at identifying high-risk customers: on the test data it misses only 2 of the 42 defaulters, whereas the decision tree misses 15. But in a financial institution, we are required to produce a rule-based model that can be implemented in the existing system. So the best model to use is the decision tree, because a decision tree model is easily translated into a set of rules.