Analyze the loan dataset, which contains historical data on bank customers and whether or not they defaulted. The data is stored in this repository as loan.csv. To complete this assignment, you will need to build classification models using the Naive Bayes, Decision Tree, and Random Forest algorithms by following these steps:
Before we jump into modeling, we will explore the data. Load the given data (loan.csv) and assign it to an object named loan, then inspect it using the str() or glimpse() function.
# your code here
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Based on our investigation above, the loan data consists of 1000 observations and 17 variables. Each feature is described below:
- checking_balance and savings_balance: Status of existing checking/savings account
- months_loan_duration: Duration of the loan period in months
- credit_history: Between critical, good, perfect, poor, and very good
- purpose: Between business, car(new), car(used), education, furniture, and renovations
- amount: Loan amount in DM (Deutsche Mark)
- employment_duration: Length of time at current job
- percent_of_income: Installment rate in percentage of disposable income
- years_at_residence: Number of years at current residence
- age: Customer’s age
- other_credit: Other installment plans (bank/store)
- housing: Between rent, own, or for free
- existing_loans_count: Number of ongoing loans
- job: Between management, skilled, unskilled and unemployed
- dependents: Number of people being liable to provide maintenance for
- phone: Either no or yes (registered under customer name)
- default: Either no or yes. A loan’s default is considered as yes when it is defaulted, charged off, or past due date
You should also make sure that each column stores the right data type. You can do data wrangling below if you need to.
Tips: You can also use the parameter stringsAsFactors = TRUE in read.csv() so that all character columns are automatically stored as factors.
# your code
loan <- read.csv("loan.csv", stringsAsFactors = TRUE)
glimpse(loan)
## Rows: 1,000
## Columns: 17
## $ checking_balance <fct> < 0 DM, 1 - 200 DM, unknown, < 0 DM, < 0 DM, unkn~
## $ months_loan_duration <int> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, 48, 12~
## $ credit_history <fct> critical, good, critical, good, poor, good, good,~
## $ purpose <fct> furniture/appliances, furniture/appliances, educa~
## $ amount <int> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948, 3~
## $ savings_balance <fct> unknown, < 100 DM, < 100 DM, < 100 DM, < 100 DM, ~
## $ employment_duration <fct> > 7 years, 1 - 4 years, 4 - 7 years, 4 - 7 years,~
## $ percent_of_income <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, 3, 3, 1, 4, 2, 4, 4~
## $ years_at_residence <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, 1, 4, 1, 4, 4, 2, 4~
## $ age <int> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25, 24, 2~
## $ other_credit <fct> none, none, none, none, none, none, none, none, n~
## $ housing <fct> own, own, own, other, other, other, own, rent, ow~
## $ existing_loans_count <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2~
## $ job <fct> skilled, skilled, unskilled, skilled, skilled, un~
## $ dependents <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ phone <fct> yes, no, no, no, no, yes, no, yes, no, no, no, no~
## $ default <fct> no, yes, no, no, yes, no, no, no, no, yes, yes, y~
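As a minimal sketch of the stringsAsFactors tip, the snippet below reads a tiny inline CSV with and without the argument and compares the resulting column classes. The three rows are made up for illustration (they are not read from loan.csv), and the object names csv_text, as_chr and as_fct are assumptions of this sketch.

```r
# Three made-up rows that mimic a few loan.csv columns (illustrative data only)
csv_text <- "checking_balance,amount,default
< 0 DM,1169,no
1 - 200 DM,5951,yes
unknown,2096,no"

# Without the conversion, text columns stay as character
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# With the conversion, every text column becomes a factor
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

sapply(as_chr, class)
sapply(as_fct, class)

# A single column can also be converted after loading
as_chr$default <- as.factor(as_chr$default)
levels(as_chr$default)
```

Converting only the columns that need it with as.factor() is a reasonable alternative when some text columns should remain plain strings.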
As a data scientist, you will develop a model that aids management with their decision-making process. The first thing we need to know is what kind of business question we would like to solve. Loans are risky, but they are also a product that generates profit for the institution through the difference between borrowing and lending rates, so identifying risky customers is one way to minimize lender losses. From there, we will try to predict the default variable using the given set of predictors.
Before we go through the modeling section, take your time to do the exploration step. Try to investigate the historical number of defaulted customers for each loan purpose. Please do some data aggregation to get the answer.
Hint: Because we focus only on the customers who defaulted, filter the data on the required condition (default == "yes").
# your code here
x <- loan %>% filter(default == "yes")
summary(x$purpose)
## business car car0
## 34 106 5
## education furniture/appliances renovations
## 23 124 8
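The aggregation above can also be sketched in base R. The example below uses a small made-up data frame (toy is not the real loan data) to count defaulted customers per purpose, and additionally computes the default rate within each purpose, which can tell a different story when the groups differ in size.

```r
# Made-up records (illustrative only, not the real loan data)
toy <- data.frame(
  purpose = factor(c("car", "car", "business", "education", "car", "business")),
  default = factor(c("yes", "no", "yes", "yes", "yes", "no"))
)

# Keep only defaulted customers, then count them per purpose
defaulted <- toy[toy$default == "yes", ]
counts <- sort(table(defaulted$purpose), decreasing = TRUE)
counts

# Share of defaults within each purpose, which accounts for group size
default_rate <- prop.table(table(toy$purpose, toy$default), margin = 1)[, "yes"]
round(default_rate, 2)
```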
Before we build our model, we should split the dataset into training and test data. Please split the data into 80% training and 20% test using sample() function, set.seed(100), and store it as data_train and data_test.
Notes: Make sure you use RNGkind() and set.seed() before splitting, and run them together with your sample() code.
#RNGkind(sample.kind = "Rounding")
#set.seed(100)
intrain <- sample(nrow(loan), nrow(loan)*0.8)
data_train <- loan[intrain, ]
data_test <- loan[-intrain, ]
Let’s look at the proportion of our target classes in train data using prop.table(table(object$target)) to make sure we have a balanced proportion in train data.
# your code here
prop.table(table(data_train$default))
##
## no yes
## 0.71 0.29
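A short sketch of why the seed and the sample() call should run together: the hypothetical helper split_once() below sets the seed immediately before sampling, so calling it twice returns the same rows. The toy data frame is a made-up stand-in with the same number of rows as the loan data, and RNGkind() is left out of the sketch for brevity.

```r
# Stand-in data with the same 1000-row shape (values are made up)
toy <- data.frame(id = 1:1000,
                  default = factor(rep(c("no", "yes"), times = c(700, 300))))

# Hypothetical helper: seed and sample in one step so the split is repeatable
split_once <- function(data, seed, train_frac = 0.8) {
  set.seed(seed)
  idx <- sample(nrow(data), nrow(data) * train_frac)
  list(train = data[idx, ], test = data[-idx, ])
}

first  <- split_once(toy, seed = 100)
second <- split_once(toy, seed = 100)

# Same seed, same rows; train and test never overlap
identical(first$train$id, second$train$id)
length(intersect(first$train$id, first$test$id))

# Class proportions in the training part
prop.table(table(first$train$default))
```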
Based on the proportion above, we can conclude that our target variable can be considered imbalanced; hence we will have to balance the data before using it for our models. One important thing to keep in mind is that all sub-sampling operations have to be applied only to the training dataset. So please do it on data_train using the downSample() function from the caret package, and then store the downsampled data in the data_train_down object. You also need to make sure that the target variable is already stored as a factor.
Notes: set the argument yname = "default".
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(100)
# your code here
data_train_down <- downSample(x = data_train %>% select(-default),
                              y = data_train$default,
                              yname = "default")
table(data_train_down$default)
##
## no yes
## 232 232
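To make the idea of downsampling concrete, here is a rough base R sketch that keeps a random subset of each class equal in size to the smallest class. It mirrors the idea behind downSample() and is not its actual implementation; toy_train and all other names are made up for illustration.

```r
# Made-up imbalanced training data (illustrative only)
set.seed(100)
toy_train <- data.frame(
  amount  = round(runif(100, 500, 10000)),
  default = factor(rep(c("no", "yes"), times = c(71, 29)))
)

# Size of the smallest class
n_min <- min(table(toy_train$default))

# For each class, keep a random subset of n_min row indices
rows_by_class <- split(seq_len(nrow(toy_train)), toy_train$default)
keep <- unlist(lapply(rows_by_class,
                      function(rows) rows[sample.int(length(rows), n_min)]))

toy_train_down <- toy_train[keep, ]
table(toy_train_down$default)
```

The price of this balance is that rows from the majority class are discarded, which is why it is applied to the training data only.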
In the following steps, please use data_train_down to build the Naive Bayes, Decision Tree, and Random Forest models below.
After splitting our data into train and test sets and downsampling our train data, let us build our first model, Naive Bayes. There are several advantages in using this model, for example:
Build a Naive Bayes model using naiveBayes() function from the e1071 package, then set the laplace parameter as 1. Store the model under model_naive before moving on to the next section.
library(e1071)
# your code here
model_naive <- naiveBayes(default~., data = data_train_down, laplace = 1)
model_naive
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## no yes
## 0.5 0.5
##
## Conditional probabilities:
## checking_balance
## Y < 0 DM > 200 DM 1 - 200 DM unknown
## no 0.20762712 0.08050847 0.25847458 0.45338983
## yes 0.42796610 0.05508475 0.37288136 0.14406780
##
## months_loan_duration
## Y [,1] [,2]
## no 18.70259 11.13329
## yes 25.03879 13.65573
##
## credit_history
## Y critical good perfect poor very good
## no 0.39240506 0.47257384 0.02531646 0.08016878 0.02953586
## yes 0.16877637 0.56118143 0.08860759 0.08860759 0.09282700
##
## purpose
## Y business car car0 education furniture/appliances
## no 0.09663866 0.33613445 0.02521008 0.02941176 0.48319328
## yes 0.12605042 0.34033613 0.01260504 0.08403361 0.41596639
## purpose
## Y renovations
## no 0.02941176
## yes 0.02100840
##
## amount
## Y [,1] [,2]
## no 2748.065 2042.307
## yes 4048.832 3658.293
##
## savings_balance
## Y < 100 DM > 1000 DM 100 - 500 DM 500 - 1000 DM unknown
## no 0.53586498 0.05485232 0.11814346 0.07594937 0.21518987
## yes 0.70464135 0.02531646 0.13080169 0.04219409 0.09704641
##
## employment_duration
## Y < 1 year > 7 years 1 - 4 years 4 - 7 years unemployed
## no 0.12658228 0.28691983 0.31645570 0.20675105 0.06329114
## yes 0.24472574 0.21940928 0.32911392 0.13502110 0.07172996
##
## percent_of_income
## Y [,1] [,2]
## no 2.857759 1.140187
## yes 3.099138 1.070501
##
## years_at_residence
## Y [,1] [,2]
## no 2.818966 1.140359
## yes 2.857759 1.077728
##
## age
## Y [,1] [,2]
## no 36.72845 10.52062
## yes 33.73276 11.12141
##
## other_credit
## Y bank none store
## no 0.11914894 0.83829787 0.04255319
## yes 0.19574468 0.73617021 0.06808511
##
## housing
## Y other own rent
## no 0.08510638 0.77021277 0.14468085
## yes 0.14042553 0.62978723 0.22978723
##
## existing_loans_count
## Y [,1] [,2]
## no 1.409483 0.5586748
## yes 1.366379 0.5499223
##
## job
## Y management skilled unemployed unskilled
## no 0.13983051 0.60593220 0.02966102 0.22457627
## yes 0.16949153 0.61016949 0.02118644 0.19915254
##
## dependents
## Y [,1] [,2]
## no 1.172414 0.3785564
## yes 1.159483 0.3669173
##
## phone
## Y no yes
## no 0.5897436 0.4102564
## yes 0.6324786 0.3675214
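The effect of laplace = 1 can be checked against the printed table. The counts below are back-calculated from the checking_balance row for class no, under the assumption that each smoothed probability equals (count + laplace) / (n + laplace * number of levels); they are not read from the data directly.

```r
# Counts for class "no", back-calculated from the printed table (assumption)
counts <- c("< 0 DM" = 48, "> 200 DM" = 18, "1 - 200 DM" = 60, "unknown" = 106)
laplace <- 1

# Unsmoothed relative frequencies
raw <- counts / sum(counts)

# Add `laplace` to every level's count before normalising
smoothed <- (counts + laplace) / (sum(counts) + laplace * length(counts))

round(rbind(raw, smoothed), 8)
```

With 232 observations per class the adjustment is small; its main purpose is to keep a level that never occurs within a class from producing a zero probability.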
Try to predict our test data using model_naive and use type = "class" to obtain class prediction. Store the prediction under pred_naive object.
# your code here
pred_naive <- predict(model_naive, newdata = data_test, type = "class")
pred_naive
## [1] no no no no no no yes no no yes no yes no yes no yes no yes
## [19] yes no yes no no yes no no no yes yes no no no yes no yes no
## [37] yes yes no no yes yes no no yes no yes yes no no no no no yes
## [55] no yes no no no yes no no no no no yes yes no yes yes no yes
## [73] yes no no no no yes no no yes no no no no yes no no no yes
## [91] no no yes no yes yes yes yes no no yes no no yes no yes no yes
## [109] no yes yes no no yes yes no no no no no no yes no no no no
## [127] yes no yes yes yes no yes no no no no yes no no no no yes no
## [145] no yes no yes no no no no no yes yes no no no yes no yes no
## [163] yes yes no no yes no no no no yes yes yes no no no yes yes yes
## [181] yes yes yes no yes no no yes yes yes no no no yes no no no yes
## [199] yes yes
## Levels: no yes
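For numeric predictors, the two columns printed in the model output ([,1] and [,2]) are the per-class mean and standard deviation, which are plugged into a normal density at prediction time. The sketch below uses only months_loan_duration with the printed values; real predictions combine all 16 predictors, so this single-feature posterior is for illustration only, and posterior_one_feature is a hypothetical helper.

```r
# Per-class mean and sd of months_loan_duration, copied from the model output
mu    <- c(no = 18.70259, yes = 25.03879)
sigma <- c(no = 11.13329, yes = 13.65573)
prior <- c(no = 0.5, yes = 0.5)

# Posterior for one customer using this single predictor only
posterior_one_feature <- function(months) {
  likelihood <- dnorm(months, mean = mu, sd = sigma)
  unnormalised <- prior * likelihood
  unnormalised / sum(unnormalised)
}

round(posterior_one_feature(6), 3)   # a short loan
round(posterior_one_feature(36), 3)  # a long loan
```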
The last part of model building is model evaluation. You can check the performance of the Naive Bayes model using confusionMatrix(), comparing the predicted class (pred_naive) with the actual label in data_test. Make sure that you use defaulted customers as the positive class (positive = "yes").
# your code here
library(caret)
confusionMatrix(data = pred_naive, reference = data_test$default, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 96 24
## yes 36 44
##
## Accuracy : 0.7
## 95% CI : (0.6314, 0.7626)
## No Information Rate : 0.66
## P-Value [Acc > NIR] : 0.1310
##
## Kappa : 0.359
##
## Mcnemar's Test P-Value : 0.1556
##
## Sensitivity : 0.6471
## Specificity : 0.7273
## Pos Pred Value : 0.5500
## Neg Pred Value : 0.8000
## Prevalence : 0.3400
## Detection Rate : 0.2200
## Detection Prevalence : 0.4000
## Balanced Accuracy : 0.6872
##
## 'Positive' Class : yes
##
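The headline metrics in the output above follow directly from the four cells of the confusion matrix. A minimal sketch that recomputes them by hand, with the cell values copied from the Naive Bayes matrix:

```r
# Cells of the Naive Bayes confusion matrix above, with "yes" as positive
tp <- 44  # predicted yes, actually yes
fn <- 24  # predicted no,  actually yes
fp <- 36  # predicted yes, actually no
tn <- 96  # predicted no,  actually no

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # recall: share of real defaulters caught
specificity <- tn / (tn + fp)   # share of non-defaulters correctly cleared
precision   <- tp / (tp + fp)   # "Pos Pred Value" in the caret output

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```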
The next model we’re trying to build is Decision Tree. Use ctree() function to build the model and store it under the model_dt object. To tune our model, let’s set the parameter mincriterion = 0.90.
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
set.seed(100)
# your code here
model_dt <- ctree(formula = default ~ .,
                  data = data_train_down,
                  control = ctree_control(mincriterion = 0.90))
model_dt
##
## Model formula:
## default ~ checking_balance + months_loan_duration + credit_history +
## purpose + amount + savings_balance + employment_duration +
## percent_of_income + years_at_residence + age + other_credit +
## housing + existing_loans_count + job + dependents + phone
##
## Fitted party:
## [1] root
## | [2] checking_balance < 0 DM, 1 - 200 DM
## | | [3] months_loan_duration <= 11: no (n = 45, err = 33.3%)
## | | [4] months_loan_duration > 11: yes (n = 250, err = 31.2%)
## | [5] checking_balance > 200 DM, unknown
## | | [6] other_credit in bank, store: yes (n = 33, err = 45.5%)
## | | [7] other_credit in none
## | | | [8] employment_duration < 1 year, unemployed: no (n = 31, err = 41.9%)
## | | | [9] employment_duration > 7 years, 1 - 4 years, 4 - 7 years
## | | | | [10] job in management, unskilled: no (n = 32, err = 28.1%)
## | | | | [11] job in skilled: no (n = 73, err = 6.8%)
##
## Number of inner nodes: 5
## Number of terminal nodes: 6
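The printed tree can be read as a set of nested rules. The sketch below hand-codes those splits as a plain R function; predict_tree is a hypothetical helper written for illustration and is not part of partykit.

```r
# Hand-written version of the fitted tree printed above (illustrative helper)
predict_tree <- function(checking_balance, months_loan_duration, other_credit) {
  if (checking_balance %in% c("< 0 DM", "1 - 200 DM")) {
    # Nodes 3 and 4: split on loan duration
    if (months_loan_duration <= 11) "no" else "yes"
  } else if (other_credit %in% c("bank", "store")) {
    # Node 6
    "yes"
  } else {
    # Nodes 8, 10 and 11 all predict "no"
    "no"
  }
}

predict_tree("< 0 DM", 6, "none")
predict_tree("1 - 200 DM", 24, "none")
predict_tree("unknown", 12, "bank")
```

Because nodes 8, 10 and 11 all predict no, the splits on employment_duration and job change the estimated error within each leaf but not the predicted class.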
mincriterion = 0.90 is …
To have a better grasp of our model, please try to plot the model and set type = "simple".
# your code here
plot(model_dt, type = "simple")
- checking_balance > 200 DM, with credit_history labelled “perfect”, and savings_balance that is “unknown” is expected to default
- checking_balance 1-200 DM, with months_loan_duration < 21 is expected to default
- checking_balance that is “unknown”, with other_credit consisting of “store” is expected to default
Now that we have the model, please predict the test data with model_dt using the predict() function and set the parameter type = "response" to obtain class predictions.
# your code here
pred_dt <- predict(model_dt, newdata = data_test, type = "response")
pred_dt
## 1 7 22 23 25 34 40 46 49 55 57 77 78 79 81 90 94 98 106 110
## no no no no no no no no no yes yes yes no no no yes no yes yes yes
## 112 124 126 132 133 145 152 153 155 159 162 166 176 186 187 188 192 202 210 224
## no no yes yes yes no no no yes yes no no yes no no yes yes yes no no
## 231 237 239 247 248 252 253 256 261 268 270 297 299 305 306 310 312 315 319 321
## no no no no yes no yes yes yes yes no no no yes no no yes no no yes
## 326 336 340 342 349 354 356 359 360 361 363 368 369 372 377 385 387 388 391 395
## no no no yes no yes yes no yes yes yes yes yes no no no no yes no no
## 396 401 406 411 413 418 421 424 430 432 436 443 447 448 451 461 468 476 489 496
## yes no yes yes yes yes no no yes yes yes yes yes no no yes yes yes no yes
## 505 516 521 522 524 531 535 541 545 549 550 563 572 584 586 589 595 602 610 615
## yes no no yes no yes no yes no yes no yes no yes yes yes yes no yes no
## 622 631 633 643 644 645 651 653 654 658 659 662 679 681 684 694 701 702 711 714
## yes yes yes no no yes yes yes yes yes yes yes yes no no no no yes yes no
## 716 725 728 734 743 745 750 752 755 757 758 765 774 776 783 792 798 803 815 818
## no no yes no no yes no yes no no no no yes yes yes no no yes yes no
## 819 831 832 833 837 840 854 857 860 861 862 870 882 886 901 911 913 920 922 926
## yes no yes yes no no yes no no no no yes no yes yes no yes yes no yes
## 927 938 939 943 945 948 950 952 959 969 970 971 972 973 975 977 979 981 989 999
## yes no yes yes yes no no yes yes no no yes no yes no no no yes yes yes
## Levels: no yes
We can use confusionMatrix() to get our model performance. Make sure that you use defaulted customers as the positive class (positive = "yes").
# your code here
confusionMatrix(pred_dt, reference = data_test$default, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 85 15
## yes 47 53
##
## Accuracy : 0.69
## 95% CI : (0.6209, 0.7533)
## No Information Rate : 0.66
## P-Value [Acc > NIR] : 0.2066
##
## Kappa : 0.38
##
## Mcnemar's Test P-Value : 8.251e-05
##
## Sensitivity : 0.7794
## Specificity : 0.6439
## Pos Pred Value : 0.5300
## Neg Pred Value : 0.8500
## Prevalence : 0.3400
## Detection Rate : 0.2650
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.7117
##
## 'Positive' Class : yes
##
The last model that we want to build is Random Forest. The following are among the advantages of the random forest model:
Now, let’s explore the random forest model we have prepared in model_rf.RDS. The model_rf.RDS was built with the following hyperparameters:
set.seed(100) # the seed number
number = 5 # the number of folds in k-fold cross-validation
repeats = 3 # the number of repeats
In your environment, please load the random forest model (model_rf.RDS) and save it under the model_rf object using the readRDS() function.
# your code here
set.seed(100)
model_rf <- readRDS('model_rf.RDS')
model_rf
## Random Forest
##
## 476 samples
## 16 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 381, 382, 380, 381, 380, 381, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.6812466 0.3623815
## 18 0.6708007 0.3414363
## 35 0.6694115 0.3387421
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
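The resampling line in the output above (5 fold, repeated 3 times) corresponds to number = 5 and repeats = 3, presumably set through caret's trainControl(). The index-level sketch below shows what such a scheme does in base R; make_folds is a hypothetical helper, and caret's own fold creation also balances the classes within folds, which is why its sample sizes differ slightly from the ones produced here.

```r
# Index-level sketch of 5-fold cross-validation repeated 3 times.
# make_folds is a hypothetical helper that assigns each row to one fold.
make_folds <- function(n, k) sample(rep(seq_len(k), length.out = n))

set.seed(100)
n <- 476; k <- 5; repeats <- 3

train_sizes <- unlist(lapply(seq_len(repeats), function(r) {
  fold_id <- make_folds(n, k)
  # Each fold is held out once; the remaining rows are used for fitting
  sapply(seq_len(k), function(f) sum(fold_id != f))
}))

length(train_sizes)  # number of model fits per candidate mtry
range(train_sizes)
```

Each candidate mtry is therefore fitted 15 times, and the Accuracy and Kappa columns in the output are averages over those fits.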
Now check the summary of the final model we built using model_rf$finalModel.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
# your code here
model_rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 33.61%
## Confusion matrix:
## no yes class.error
## no 158 80 0.3361345
## yes 80 158 0.3361345
In practice, a random forest already has an out-of-bag (OOB) estimate that represents its accuracy on out-of-bag data (the data that was not sampled/used when building each tree).
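The OOB error rate reported by model_rf$finalModel can be recomputed from its confusion matrix. A minimal sketch with the four cells copied from the output above:

```r
# Out-of-bag confusion matrix copied from model_rf$finalModel
# (rows: actual class, columns: OOB prediction)
oob <- matrix(c(158, 80,
                80, 158),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("no", "yes"),
                              predicted = c("no", "yes")))

# Misclassified cases sit off the diagonal
oob_error <- 1 - sum(diag(oob)) / sum(oob)
class_error <- 1 - diag(oob) / rowSums(oob)

round(100 * oob_error, 2)  # 33.61
round(class_error, 7)
```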
Based on the model_rf$finalModel summary above, how can we interpret the out-of-bag error rate from our model?
We could also use Variable Importance to get a list of the most important variables used in our random forest. Many would argue that random forest, being a black box model, can offer no true information beyond its accuracy; in fact, paying special attention to attributes such as variable importance often helps us gain valuable information about our data.
Please take your time to check which variables have a high influence on the prediction. You can use the varImp() function and pass it to the plot() function to get the visualization.
# your code here
plot(varImp(model_rf))
After building the model, we can now predict the test data based on model_rf. You can use predict() function and set the parameter type = "raw" to obtain class prediction.
# your code here
pred_rf <- predict(model_rf, newdata = data_test, type = "raw")
pred_rf
## [1] no no no no no no yes no no yes yes yes no no no yes no yes
## [19] yes no no no no yes no no no yes yes yes no no no no yes no
## [37] yes yes no no yes yes no no yes no yes no no no no no no yes
## [55] no no no no no yes no yes no no no yes yes no yes yes no yes
## [73] yes no no no no no no no no no yes no no no no no yes yes
## [91] yes no yes no no yes no yes no yes yes no yes yes no yes no yes
## [109] no yes no no no yes yes yes no no no yes yes yes yes yes no no
## [127] yes yes yes no no yes no no no no no yes no no no yes yes no
## [145] no no no yes yes no yes no no yes yes no no yes yes no no no
## [163] yes yes no no yes no no no no yes no yes yes no yes yes yes yes
## [181] yes no yes no yes no no yes yes no no yes no yes no no no yes
## [199] yes yes
## Levels: no yes
Next, let us evaluate the random forest model we built using confusionMatrix(). How should you evaluate the model performance?
# your code here
confusionMatrix(pred_rf, reference = data_test$default, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 108 9
## yes 24 59
##
## Accuracy : 0.835
## 95% CI : (0.7762, 0.8836)
## No Information Rate : 0.66
## P-Value [Acc > NIR] : 2.436e-08
##
## Kappa : 0.651
##
## Mcnemar's Test P-Value : 0.01481
##
## Sensitivity : 0.8676
## Specificity : 0.8182
## Pos Pred Value : 0.7108
## Neg Pred Value : 0.9231
## Prevalence : 0.3400
## Detection Rate : 0.2950
## Detection Prevalence : 0.4150
## Balanced Accuracy : 0.8429
##
## 'Positive' Class : yes
##
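Which metric to prioritise depends on the business cost of each error type. If missing a defaulter is assumed to be costlier than rejecting a good customer (an assumption; the text leaves this open), recall for the yes class is a natural metric to compare. The sketch below puts the three confusion matrices from this document side by side.

```r
# Confusion-matrix cells copied from the three outputs in this document
cells <- data.frame(
  model = c("naive_bayes", "decision_tree", "random_forest"),
  tp = c(44, 53, 59),
  fn = c(24, 15, 9),
  fp = c(36, 47, 24),
  tn = c(96, 85, 108)
)

comparison <- data.frame(
  model     = cells$model,
  accuracy  = with(cells, (tp + tn) / (tp + tn + fp + fn)),
  recall    = with(cells, tp / (tp + fn)),
  precision = with(cells, tp / (tp + fp))
)
comparison[, -1] <- round(comparison[, -1], 4)
comparison

# Model with the highest recall for the "yes" class
comparison$model[which.max(comparison$recall)]
```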
Another way of evaluating model performance is through its ROC curve and AUC value. To calculate them, we need the probability of the positive class for each observation. Let’s focus on the ROC and AUC value from our random forest prediction. First, predict the test data using model_rf, but now with the parameter type = "prob". The prediction will result in probability values for each class. You can store the prediction in the prob_test object.
# your code here
prob_test <- predict(model_rf, newdata = data_test, type = "prob")
Now, use the prediction() function from the ROCR package to compare the probability of positive class in prob_test[,"yes"] with the actual data data_test$default and store it as pred_roc object.
library(ROCR)
# your code here
pred_roc <- prediction(prob_test[, "yes"], data_test$default)
Next, please use the performance() function from the ROCR package, define the axes, and assign it to a perf object. To use the performance() function, please define the arguments as below:
- prediction.obj = pred_roc
- measure = "tpr"
- x.measure = "fpr"
# your code here
perf <- performance(pred_roc, "tpr", "fpr")
perf
## A performance instance
## 'False positive rate' vs. 'True positive rate' (alpha: 'Cutoff')
## with 150 data points
After you have created the perf object, plot the performance by passing it to the plot() function.
# your code here
plot(perf)
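Each point on a ROC curve is a (false positive rate, true positive rate) pair obtained at one probability cutoff. A small base R sketch on made-up scores and labels (not the actual prob_test values) shows how those pairs arise; roc_point is a hypothetical helper.

```r
# Made-up scores and labels (illustrative only)
prob_yes <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
actual   <- c("yes", "yes", "no", "yes", "no", "no", "yes", "no")

# TPR and FPR when everything scoring at or above `cutoff` is called "yes"
roc_point <- function(cutoff) {
  pred_yes <- prob_yes >= cutoff
  c(cutoff = cutoff,
    fpr = sum(pred_yes & actual == "no")  / sum(actual == "no"),
    tpr = sum(pred_yes & actual == "yes") / sum(actual == "yes"))
}

# One row per cutoff; plotting tpr against fpr traces the ROC curve
roc_points <- t(sapply(c(1.1, 0.75, 0.5, 0.25, 0), roc_point))
roc_points
```

Lowering the cutoff moves the point up and to the right: more defaulters are caught, at the cost of more false alarms.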
Try to evaluate the ROC curve; see if there are any undesirable results from our model. Next, take a look at the AUC value using the performance() function by setting the arguments prediction.obj = pred_roc and measure = "auc", then save it under the auc object.
# your code here
auc <- performance(pred_roc, measure = "auc")
print(auc@y.values)
## [[1]]
## [1] 0.9263592
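One way to read an AUC value is as the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The sketch below computes it in two equivalent ways on made-up scores (not the actual prob_test values), without relying on ROCR.

```r
# Made-up scores and labels (illustrative only)
prob_yes <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
actual   <- c("yes", "yes", "no", "yes", "no", "no", "yes", "no")

pos <- prob_yes[actual == "yes"]
neg <- prob_yes[actual == "no"]

# Compare every defaulter with every non-defaulter; ties count as half
wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc_pairs <- mean(wins)

# Equivalent rank-based (Mann-Whitney) form
r <- rank(prob_yes)
auc_rank <- (sum(r[actual == "yes"]) - length(pos) * (length(pos) + 1) / 2) /
  (length(pos) * length(neg))

c(pairs = auc_pairs, rank = auc_rank)  # both 0.75
```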
Last but not least, the goal of a good machine learning model is to generalize well from the training data to any data from the problem domain. This allows us to make predictions in the future on data the model has never seen. Two terms are used in machine learning to describe how well a model learns and generalizes to new data: overfitting and underfitting.
To validate whether our model fits well enough, we can predict both the train and test data and then evaluate model performance on each. You can check whether the performance is well balanced based on the threshold you have set.
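The train-versus-test comparison can be sketched as a small helper. Everything below is illustrative: fit_gap is a hypothetical function, the label vectors are made up, and the 0.10 tolerance is an arbitrary example threshold rather than a standard value.

```r
# Hypothetical helper: accuracy on train vs. test and the gap between them
fit_gap <- function(train_actual, train_pred, test_actual, test_pred) {
  acc_train <- mean(train_actual == train_pred)
  acc_test  <- mean(test_actual == test_pred)
  c(train = acc_train, test = acc_test, gap = acc_train - acc_test)
}

# Made-up labels and predictions (illustrative only)
train_actual <- c("yes", "no", "yes", "no", "yes", "no", "yes", "no", "no", "yes")
train_pred   <- c("yes", "no", "yes", "no", "yes", "no", "yes", "no", "no", "no")
test_actual  <- c("yes", "no", "no", "yes", "no")
test_pred    <- c("no",  "no", "yes", "yes", "no")

result <- fit_gap(train_actual, train_pred, test_actual, test_pred)
result

# A large positive gap suggests overfitting; 0.10 is an example tolerance only
result[["gap"]] > 0.10
```

With the objects from this document, the inputs would be the actual and predicted classes for data_train_down and data_test, for example from predict(model_rf, newdata = data_train_down, type = "raw").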