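First, we load the packages this analysis relies on (a minimal set, inferred from the functions used below):
library(dplyr)        # data wrangling: %>%, select()
library(inspectdf)    # inspect_cat(), inspect_num(), show_plot()
library(caret)        # downSample(), trainControl(), train(), confusionMatrix(), nearZeroVar(), varImp()
library(e1071)        # naiveBayes()
library(partykit)     # ctree(), ctree_control()
library(randomForest) # the engine behind caret's method = "rf"
library(ROCR)         # prediction(), performance() for ROC/AUC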
The Bank dataset was obtained from the UCI Machine Learning Repository; it contains client data, contact information and other attributes from a Portuguese bank.
bank <- read.csv("bank-full.csv", stringsAsFactors = T, sep = ";")
str(bank)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Column Description:
Bank client data:
- age (numeric)
- job: type of job (categorical: “admin.”, “unknown”, “unemployed”, “management”, “housemaid”, “entrepreneur”, “student”, “blue-collar”, “self-employed”, “retired”, “technician”, “services”)
- marital: marital status (categorical: “married”, “divorced”, “single”; note: “divorced” means divorced or widowed)
- education (categorical: “unknown”, “secondary”, “primary”, “tertiary”)
- default: has credit in default? (binary: “yes”, “no”)
- balance: average yearly balance, in euros (numeric)
- housing: has housing loan? (binary: “yes”, “no”)
- loan: has personal loan? (binary: “yes”, “no”)
Related with the last contact of the current campaign:
- contact: contact communication type (categorical: “unknown”, “telephone”, “cellular”)
- day: last contact day of the month (numeric)
- month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
- duration: last contact duration, in seconds (numeric)
Other attributes:
- campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
- previous: number of contacts performed before this campaign and for this client (numeric)
- poutcome: outcome of the previous marketing campaign (categorical: “unknown”, “other”, “failure”, “success”)
Output variable (desired target):
- y: has the client subscribed a term deposit? (binary: “yes”, “no”)
The objective of this analysis is to build a predictive model that classifies whether a bank client will subscribe to a term deposit or not. To start off, we explore the data we are investigating.
inspect_cat(bank) %>%
  show_plot()
From the plot we can see that our target variable (y) is imbalanced. The table further below shows the proportions of clients who agreed to subscribe and who did not.
inspect_num(bank)
## # A tibble: 7 x 10
## col_name min q1 median mean q3 max sd pcnt_na hist
## <chr> <int> <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl> <named list>
## 1 age 18 33 39 40.9 48 95 10.6 0 <tibble [18~
## 2 balance -8019 72 448 1362. 1428 102127 3045. 0 <tibble [25~
## 3 day 1 8 16 15.8 21 31 8.32 0 <tibble [18~
## 4 duration 0 103 180 258. 319 4918 258. 0 <tibble [27~
## 5 campaign 1 1 2 2.76 3 63 3.10 0 <tibble [15~
## 6 pdays -1 -1 -1 40.2 -1 871 100. 0 <tibble [21~
## 7 previous 0 0 0 0.580 0 275 2.30 0 <tibble [30~
inspect_num(bank[,-c(15)]) %>% show_plot() # drop column 15 (previous) before plotting
GGally::ggcorr(data = bank, hjust = 1, layout.exp = 2, label = T, label_size = 3)
## Registered S3 method overwritten by 'GGally':
##   method from
##   +.gg   ggplot2
## Warning in GGally::ggcorr(data = bank, hjust = 1, layout.exp = 2, label = T, :
## data in column(s) 'job', 'marital', 'education', 'default', 'housing', 'loan',
## 'contact', 'month', 'poutcome', 'y' are not numeric and were ignored
The plot above shows the correlation between each pair of numeric variables. No pair is strongly correlated, which supports the conditional-independence assumption behind the Naive Bayes classifier, so we can reasonably use it as our first classification method.
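Naive Bayes rests on exactly this property: it scores each class by multiplying per-feature likelihoods, P(y | x1, ..., xn) ∝ P(y) * P(x1 | y) * ... * P(xn | y), which is only well-founded when predictors are roughly independent given the class. A minimal two-predictor illustration of the fitted pieces (a sketch with an illustrative object name; the full model is built below):
toy_nb <- naiveBayes(y ~ duration + housing, data = bank)
toy_nb$apriori         # class counts behind the priors P(y)
toy_nb$tables$housing  # conditional table P(housing | y)
toy_nb$tables$duration # per-class mean and sd of duration (Gaussian assumption for numeric predictors)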
We split our data into separate train and test sets: the train data is what we use to fit our models, and the test data is used to validate them.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(299)
index <- sample(nrow(bank), nrow(bank)*.80) # 80% of the rows go to training
bank_train <- bank[index,]
bank_test <- bank[-index,]
prop.table(table(bank$y))
##
##        no       yes
## 0.8830152 0.1169848
To build a good model we downsample the training data to even out the class proportions. We chose downsampling because the data already contains more than 45,000 rows, so discarding some observations has only a minor effect, whereas upsampling would rely on duplicated data (the upsampling alternative is sketched after the output below).
bank_train <- downSample(x = bank_train %>% select(-y), y = bank_train$y, yname = "y")
prop.table(table(bank_train$y))
##
##  no yes
## 0.5 0.5
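For comparison, caret's upSample() balances the classes the other way, by duplicating minority-class rows; a sketch of what we would have run instead on the original, imbalanced training split (bank_up is an illustrative name):
bank_up <- upSample(x = bank[index,] %>% select(-y), y = bank[index, "y"], yname = "y")
prop.table(table(bank_up$y)) # also 0.5 / 0.5, but achieved with duplicated "yes" rows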
nb_bank <- naiveBayes(y ~ ., data = bank_train)
nb_pred.test <- predict(object = nb_bank, newdata = bank_test)
nb_pred.train <- predict(object = nb_bank, newdata = bank_train)
Train data Model Evaluation
confusionMatrix(data = nb_pred.train, reference = bank_train$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 3453 1057
## yes 802 3198
##
## Accuracy : 0.7816
## 95% CI : (0.7726, 0.7903)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5631
##
## Mcnemar's Test P-Value : 3.837e-09
##
## Sensitivity : 0.8115
## Specificity : 0.7516
## Pos Pred Value : 0.7656
## Neg Pred Value : 0.7995
## Prevalence : 0.5000
## Detection Rate : 0.4058
## Detection Prevalence : 0.5300
## Balanced Accuracy : 0.7816
##
## 'Positive' Class : no
##
Test data Model Evaluation
confusionMatrix(data = nb_pred.test, reference = bank_test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6510 288
## yes 1499 746
##
## Accuracy : 0.8024
## 95% CI : (0.794, 0.8106)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3538
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8128
## Specificity : 0.7215
## Pos Pred Value : 0.9576
## Neg Pred Value : 0.3323
## Prevalence : 0.8857
## Detection Rate : 0.7199
## Detection Prevalence : 0.7517
## Balanced Accuracy : 0.7672
##
## 'Positive' Class : no
##
The confusion matrix suggests that Naive Bayes generalizes reasonably well from train (78.2% accuracy) to test (80.2%), but on the test data only about a third of the clients predicted as "yes" actually subscribed (negative predictive value 0.33, with "no" as the positive class). We therefore try a decision tree next.
dt_bank <- ctree(y ~ ., data = bank_train)
plot(dt_bank, type = "simple")
dt_pred.train <- predict(object = dt_bank, newdata = bank_train, type = "response")
confusionMatrix(data = dt_pred.train, reference = bank_train$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 3515 441
## yes 740 3814
##
## Accuracy : 0.8612
## 95% CI : (0.8537, 0.8685)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7224
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8261
## Specificity : 0.8964
## Pos Pred Value : 0.8885
## Neg Pred Value : 0.8375
## Prevalence : 0.5000
## Detection Rate : 0.4130
## Detection Prevalence : 0.4649
## Balanced Accuracy : 0.8612
##
## 'Positive' Class : no
##
dt_pred.test <- predict(object = dt_bank, newdata = bank_test)
confusionMatrix(data = dt_pred.test, reference = bank_test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6412 118
## yes 1597 916
##
## Accuracy : 0.8104
## 95% CI : (0.8021, 0.8184)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.423
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8006
## Specificity : 0.8859
## Pos Pred Value : 0.9819
## Neg Pred Value : 0.3645
## Prevalence : 0.8857
## Detection Rate : 0.7091
## Detection Prevalence : 0.7221
## Balanced Accuracy : 0.8432
##
## 'Positive' Class : no
##
The accuracy on the train data (86.1%) is noticeably higher than on the test data (81.0%), which suggests the model is over-fit.
Business judgement suggests we should minimize false positives: cases where the model predicts a customer is likely to subscribe but the customer does not. Following up on a customer flagged as a likely subscriber demands extra attention and increases cost, so the best strategy is to avoid false "yes" predictions.
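Since "yes" is the class that drives cost, it can help to re-run the evaluation with positive = "yes", so that Pos Pred Value directly reads as the share of predicted subscribers who actually subscribe (a sketch reusing the Naive Bayes test predictions from above):
# same confusion matrix, but metrics reported from the "yes" perspective
confusionMatrix(data = nb_pred.test, reference = bank_test$y, positive = "yes")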
dtprune_bank <- ctree(y ~ ., data = bank_train,
                      control = ctree_control(mincriterion = .01, # test statistic (1 - p-value) required to split; a low value allows more splits
                                              minsplit = 150,     # a node needs at least 150 cases to be considered for splitting
                                              minbucket = 0))     # no minimum size for terminal nodes
plot(dtprune_bank, type = "simple")
dtprune_pred.test <- predict(dtprune_bank, bank_test)
dtprune_pred.train <- predict(dtprune_bank, bank_train)
confusionMatrix(dtprune_pred.train, bank_train$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 3611 525
## yes 644 3730
##
## Accuracy : 0.8626
## 95% CI : (0.8551, 0.8699)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7253
##
## Mcnemar's Test P-Value : 0.000558
##
## Sensitivity : 0.8486
## Specificity : 0.8766
## Pos Pred Value : 0.8731
## Neg Pred Value : 0.8528
## Prevalence : 0.5000
## Detection Rate : 0.4243
## Detection Prevalence : 0.4860
## Balanced Accuracy : 0.8626
##
## 'Positive' Class : no
##
confusionMatrix(dtprune_pred.test, bank_test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6643 139
## yes 1366 895
##
## Accuracy : 0.8336
## 95% CI : (0.8257, 0.8412)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4582
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8294
## Specificity : 0.8656
## Pos Pred Value : 0.9795
## Neg Pred Value : 0.3958
## Prevalence : 0.8857
## Detection Rate : 0.7346
## Detection Prevalence : 0.7500
## Balanced Accuracy : 0.8475
##
## 'Positive' Class : no
##
After pruning, the accuracy on the test data is only slightly lower than on the train data, which suggests the model now fits well without over-fitting.
dt_pred.prob <- predict(dtprune_bank, newdata = bank_test, type = "prob")[,2] # probability of "yes" for each test row
head(dt_pred.prob)
##           1           3           8          12          15          21
## 0.000000000 0.000000000 0.198113208 0.003636364 0.003636364 0.000000000
length(dt_pred.prob)
## [1] 9043
length(bank_test$y)
## [1] 9043
The area under the ROC curve (AUC) measures the model's ability to discriminate between the classes: the larger the area under the curve, the better the model distinguishes positive from negative outcomes.
dt_roc <- prediction(dt_pred.prob, bank_test$y)
dt_roc_vec <- performance(dt_roc, "tpr", "fpr")
plot(dt_roc_vec)
abline(0,1,lty = 2) # diagonal = performance of a random classifier
dt_auc <- performance(dt_roc, measure = "auc")@y.values
dt_auc
## [[1]]
## [1] 0.9070958
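The AUC also has a direct probabilistic reading: up to ties, it is the probability that a randomly drawn "yes" client gets a higher predicted score than a randomly drawn "no" client. A quick Monte-Carlo check of the value above (a sketch; pos, neg, p, n are illustrative names):
set.seed(299)
pos <- dt_pred.prob[bank_test$y == "yes"]
neg <- dt_pred.prob[bank_test$y == "no"]
p <- sample(pos, 1e5, replace = TRUE)
n <- sample(neg, 1e5, replace = TRUE)
mean(p > n) + 0.5 * mean(p == n) # ties count half; should land near 0.907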
Finally, the last model we build is a random forest. To keep the computation time manageable we do not use the full bank data; instead we use bank.csv, which contains a 10% sample of the original bank-full.csv.
library(randomForest)
dim(bank)
## [1] 45211    17
bank_small <- read.csv("bank.csv", stringsAsFactors = T, sep = ";")
n0_var <- nearZeroVar(bank)        # indices of near-zero-variance columns, identified on the full data
bank_clean <- bank_small[,-n0_var] # the 10% sample has the same columns, so the indices carry over
bank_clean %>% inspect_cat() %>% show_plot()
Before fitting the random forest we again need to separate train and test data. This time we use k-fold cross-validation: the data is divided into k equal folds, and in each repetition one fold serves as test data while the remaining (k-1)/k of the rows are used for training. Every row is therefore used as test data exactly once, which makes the evaluation more robust.
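trainControl(method = "cv", number = 5) below takes care of the folding internally; to see the kind of split it creates, caret's createFolds() can be called directly (illustration only; folds is an illustrative name):
folds <- createFolds(bank_clean$y, k = 5) # a list of 5 test-index vectors
sapply(folds, length)                     # roughly equal fold sizes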
The random forest model uses bagging (bootstrap aggregation): rows are sampled with replacement, so each bootstrap sample contains duplicated rows and leaves others out, the so-called out-of-bag (OOB) data, which is crucial for estimating the model's error later. We then fit the model through caret's train() with method = "rf": a decision tree is grown on each bootstrap sample, and the trees vote to produce the final classification.
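To see the out-of-bag mechanism concretely: a bootstrap sample of n rows drawn with replacement leaves out roughly a third of the rows (about 1/e ≈ 37%), and those left-out rows are what the OOB error is later computed on. A quick check (illustration only; randomForest does this internally):
set.seed(299)
boot_idx <- sample(nrow(bank_clean), replace = TRUE) # one bootstrap sample of row indices
mean(!seq_len(nrow(bank_clean)) %in% boot_idx)       # fraction of rows left out-of-bag, ~0.37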
ctrl <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
rf_bank <- train(y ~ ., data = bank_clean, method = "rf", trControl = ctrl)
saveRDS(rf_bank, "rf_bank.RDS")   # cache the fitted model
rf_bank <- readRDS("rf_bank.RDS") # reload it instead of refitting
rf_bank$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 40
##
## OOB estimate of error rate: 10.4%
## Confusion matrix:
## no yes class.error
## no 3849 151 0.0377500
## yes 319 202 0.6122841
Out-of-bag (OOB) data are the rows left out of each bootstrap sample, i.e. not used to grow that particular tree, hence the name. Predicting each row with only the trees that never saw it yields the OOB error estimate, here 10.4%, which is still relatively high. In other words, the OOB accuracy is about 89.6%, approximately equal to the value we get from the test confusion matrix below.
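The same figure can be recovered directly from the OOB confusion matrix above:
(3849 + 202) / (3849 + 151 + 319 + 202) # OOB accuracy = 1 - 0.104, about 0.896
# note also the per-class errors above: "yes" is misclassified far more often (0.61) than "no" (0.04)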
rf_pred_test <- predict(rf_bank, bank_test)
cm.rf_pred_test <- confusionMatrix(data = rf_pred_test, reference = bank_test$y)
cm.rf_pred_test
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7735 575
## yes 274 459
##
## Accuracy : 0.9061
## 95% CI : (0.8999, 0.9121)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1.876e-10
##
## Kappa : 0.4692
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9658
## Specificity : 0.4439
## Pos Pred Value : 0.9308
## Neg Pred Value : 0.6262
## Prevalence : 0.8857
## Detection Rate : 0.8554
## Detection Prevalence : 0.9189
## Balanced Accuracy : 0.7048
##
## 'Positive' Class : no
##
rf_pred.prob <- predict(rf_bank, newdata = bank_test, type = "prob")[,2]
head(rf_pred.prob)
## [1] 0.028 0.000 0.034 0.002 0.000 0.016
rf_roc <- prediction(rf_pred.prob, bank_test$y)
rf_roc_vec <- performance(rf_roc, "tpr", "fpr")
plot(rf_roc_vec)
abline(0,1,lty = 2)
rf_auc <- performance(rf_roc, measure = "auc")@y.values
rf_auc
## [[1]]
## [1] 0.9131718
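Putting the two AUC values side by side (using the objects computed above):
data.frame(model = c("pruned decision tree", "random forest"),
           auc = c(dt_auc[[1]], rf_auc[[1]])) # 0.907 vs 0.913: the random forest edges ahead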
The output below shows the importance of each predictor variable relative to the others. For a random forest, varImp() reports how much each variable decreases node impurity (Gini) across the forest, scaled so that the most important variable sits at 100; duration is the most important variable here.
varImp(rf_bank)
## rf variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## duration 100.000
## balance 35.985
## age 33.717
## day 32.022
## poutcomesuccess 29.297
## campaign 12.484
## previous 8.545
## monthoct 7.137
## educationtertiary 4.987
## contactunknown 4.890
## maritalmarried 4.764
## monthjun 4.414
## housingyes 4.208
## jobtechnician 4.137
## monthaug 4.069
## monthmay 3.814
## monthmar 3.439
## monthfeb 3.324
## educationsecondary 3.079
## monthnov 2.884
plot(varImp(rf_bank))
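To pull the top predictors programmatically rather than reading them off the table (a sketch; imp is an illustrative name):
imp <- varImp(rf_bank)$importance                 # data frame with one "Overall" column
head(imp[order(-imp$Overall), , drop = FALSE], 5) # duration, balance, age, day, poutcomesuccess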