The objective of this model prediction is to predict the likelihood of costumer to purchase the product again given the costumer profile and how and when the costumer contacted regarding the promotions.
# libraries
library(caret)
library(dplyr)
library(partykit)
library(DMwR)
library(rsample)
library(class)
library(e1071)
Import and Bind two datasets into one.
<- read.csv('bank-additional-full.csv', sep = ';')
bank <- read.csv('bank-additional.csv', sep = ';')
bank2 <- distinct(rbind(bank, bank2))
bankc head(bankc)
Change all character type into category.
<- bankc %>% mutate_if(is.character, as.factor)
bankc summary(bankc)
#> age job marital
#> Min. :17.00 admin. :10419 divorced: 4611
#> 1st Qu.:32.00 blue-collar: 9253 married :24921
#> Median :38.00 technician : 6739 single :11564
#> Mean :40.02 services : 3967 unknown : 80
#> 3rd Qu.:47.00 management : 2924
#> Max. :98.00 retired : 1718
#> (Other) : 6156
#> education default housing loan
#> university.degree :12164 no :32577 no :18615 no :33938
#> high.school : 9512 unknown: 8596 unknown: 990 unknown: 990
#> basic.9y : 6045 yes : 3 yes :21571 yes : 6248
#> professional.course: 5240
#> basic.4y : 4176
#> basic.6y : 2291
#> (Other) : 1748
#> contact month day_of_week duration
#> cellular :26135 may :13767 fri:7826 Min. : 0.0
#> telephone:15041 jul : 7169 mon:8512 1st Qu.: 102.0
#> aug : 6176 thu:8618 Median : 180.0
#> jun : 5318 tue:8086 Mean : 258.3
#> nov : 4100 wed:8134 3rd Qu.: 319.0
#> apr : 2631 Max. :4918.0
#> (Other): 2015
#> campaign pdays previous poutcome
#> Min. : 1.000 Min. : 0.0 Min. :0.000 failure : 4252
#> 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.000 nonexistent:35551
#> Median : 2.000 Median :999.0 Median :0.000 success : 1373
#> Mean : 2.568 Mean :962.5 Mean :0.173
#> 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.000
#> Max. :56.000 Max. :999.0 Max. :7.000
#>
#> emp.var.rate cons.price.idx cons.conf.idx euribor3m
#> Min. :-3.40000 Min. :92.20 Min. :-50.8 Min. :0.634
#> 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.344
#> Median : 1.10000 Median :93.75 Median :-41.8 Median :4.857
#> Mean : 0.08192 Mean :93.58 Mean :-40.5 Mean :3.621
#> 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961
#> Max. : 1.40000 Max. :94.77 Max. :-26.9 Max. :5.045
#>
#> nr.employed y
#> Min. :4964 no :36537
#> 1st Qu.:5099 yes: 4639
#> Median :5191
#> Mean :5167
#> 3rd Qu.:5228
#> Max. :5228
#>
Check target proportion.
prop.table(table(bankc$y))
#>
#> no yes
#> 0.8873373 0.1126627
Our datasets have an extreme imbalanced in target variable, we could either remove the majority classes, or generate random data to make the target variable balanced later after we split the dataset.
Check Missing Value
anyNA(bankc)
#> [1] FALSE
set.seed(69420)
<- initial_split(bankc, 0.75, strata = 'y')
index <- training(index)
bank.train <- testing(index) bank.test
<- downSample(bank.train, bank.train$y)
bank.downtrain <- bank.downtrain %>% mutate_if(is.character, as.factor)
bank.downtrain prop.table(table(bank.downtrain$Class))
#>
#> no yes
#> 0.5 0.5
Just like downsampling, the goal remain the same, to balance the target proportion, but rather than removing the majority class, SMOTE will generate new data based on existing data AND also remove the majority class.
<- SMOTE(y~. , data=bank.train)
bank.train prop.table(table(bank.train$y))
#>
#> no yes
#> 0.5714286 0.4285714
Naive Bayes Machine Learning Model are a Machine learning model that leverage Bayes theorem on classification, it is called ‘Naive’ because the model assumed that all features are independent and weigh the same on its prediction.
<- naiveBayes(bank.train %>% select(-y),
model_naive1 $y) bank.train
<- predict(model_naive1, bank.test %>% select(-y)) pred_class1
confusionMatrix(pred_class1, bank.test$y, positive = 'yes')
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 7566 299
#> yes 1568 860
#>
#> Accuracy : 0.8186
#> 95% CI : (0.811, 0.826)
#> No Information Rate : 0.8874
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.3859
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.74202
#> Specificity : 0.82833
#> Pos Pred Value : 0.35420
#> Neg Pred Value : 0.96198
#> Prevalence : 0.11260
#> Detection Rate : 0.08355
#> Detection Prevalence : 0.23589
#> Balanced Accuracy : 0.78518
#>
#> 'Positive' Class : yes
#>
Create an object of predictions probability
<- predict(object = model_naive1,
y_pred_prob newdata = bank.test,
type = "raw")
Create Data Frame of predicted probability vs actual
<- data.frame(pred_prob = y_pred_prob[, 'yes'],
data_roc actual = ifelse(bank.test$y == 'yes', 1, 0))
head(data_roc)
Plotting ROC
library(ROCR)
<- prediction(predictions = data_roc$pred_prob,
bank_roc_naive labels = data_roc$actual)
# ROC curve
plot(performance(bank_roc_naive, "tpr", "fpr"))
<- performance(bank_roc_naive, measure = "auc")
bank_auc @y.values bank_auc
#> [[1]]
#> [1] 0.8707561
model_naive1
#>
#> Naive Bayes Classifier for Discrete Predictors
#>
#> Call:
#> naiveBayes.default(x = bank.train %>% select(-y), y = bank.train$y)
#>
#> A-priori probabilities:
#> bank.train$y
#> no yes
#> 0.5714286 0.4285714
#>
#> Conditional probabilities:
#> age
#> bank.train$y [,1] [,2]
#> no 39.93556 9.881943
#> yes 41.17564 12.105453
#>
#> job
#> bank.train$y admin. blue-collar entrepreneur housemaid management
#> no 0.251077586 0.236997126 0.039295977 0.026077586 0.073132184
#> yes 0.262260536 0.156992337 0.025862069 0.019540230 0.062643678
#> job
#> bank.train$y retired self-employed services student technician
#> no 0.034698276 0.034913793 0.096336207 0.016307471 0.159339080
#> yes 0.090804598 0.032567050 0.083141762 0.071839080 0.145785441
#> job
#> bank.train$y unemployed unknown
#> no 0.023132184 0.008692529
#> yes 0.036590038 0.011973180
#>
#> marital
#> bank.train$y divorced married single unknown
#> no 0.115732759 0.606465517 0.275000000 0.002801724
#> yes 0.137835249 0.492432950 0.367624521 0.002107280
#>
#> education
#> bank.train$y basic.4y basic.6y basic.9y high.school illiterate
#> no 0.1035201149 0.0599856322 0.1496408046 0.2303879310 0.0005028736
#> yes 0.1035440613 0.0416666667 0.1255747126 0.2204980843 0.0008620690
#> education
#> bank.train$y professional.course university.degree unknown
#> no 0.1252873563 0.2909482759 0.0397270115
#> yes 0.1282567050 0.3176245211 0.0619731801
#>
#> default
#> bank.train$y no unknown yes
#> no 0.7802443 0.2197557 0.0000000
#> yes 0.7887931 0.2112069 0.0000000
#>
#> housing
#> bank.train$y no unknown yes
#> no 0.45237069 0.02579023 0.52183908
#> yes 0.43611111 0.07021073 0.49367816
#>
#> loan
#> bank.train$y no unknown yes
#> no 0.82356322 0.02579023 0.15064655
#> yes 0.67519157 0.06599617 0.25881226
#>
#> contact
#> bank.train$y cellular telephone
#> no 0.6063937 0.3936063
#> yes 0.7202107 0.2797893
#>
#> month
#> bank.train$y apr aug dec jul jun
#> no 0.052155172 0.150933908 0.001939655 0.173204023 0.131034483
#> yes 0.112835249 0.117432950 0.025191571 0.140900383 0.141666667
#> month
#> bank.train$y mar may nov oct sep
#> no 0.006393678 0.361709770 0.104669540 0.009267241 0.008692529
#> yes 0.057950192 0.205651341 0.087260536 0.058524904 0.052586207
#>
#> day_of_week
#> bank.train$y fri mon thu tue wed
#> no 0.1898707 0.2099138 0.2038793 0.2038793 0.1924569
#> yes 0.1772989 0.2022989 0.2188697 0.2040230 0.1975096
#>
#> duration
#> bank.train$y [,1] [,2]
#> no 218.9117 202.2535
#> yes 571.6600 367.0978
#>
#> campaign
#> bank.train$y [,1] [,2]
#> no 2.664296 2.890663
#> yes 2.076719 1.540320
#>
#> pdays
#> bank.train$y [,1] [,2]
#> no 983.9510 121.3076
#> yes 815.2704 370.2243
#>
#> previous
#> bank.train$y [,1] [,2]
#> no 0.1289511 0.4046867
#> yes 0.5496135 0.7895448
#>
#> poutcome
#> bank.train$y failure nonexistent success
#> no 0.09683908 0.88994253 0.01321839
#> yes 0.22404215 0.62538314 0.15057471
#>
#> emp.var.rate
#> bank.train$y [,1] [,2]
#> no 0.256602 1.473560
#> yes -1.186749 1.605262
#>
#> cons.price.idx
#> bank.train$y [,1] [,2]
#> no 93.60411 0.5552965
#> yes 93.39178 0.6374455
#>
#> cons.conf.idx
#> bank.train$y [,1] [,2]
#> no -40.57040 4.346403
#> yes -39.86443 5.568549
#>
#> euribor3m
#> bank.train$y [,1] [,2]
#> no 3.822509 1.629807
#> yes 2.154097 1.744165
#>
#> nr.employed
#> bank.train$y [,1] [,2]
#> no 5176.542 63.99564
#> yes 5095.886 86.60270
We can interpret every feature by the result of their own dependent probability with the target, for example :
Multiply every probability on every predictor, then divide it with total(positive and negative class), and you will have the probability.
Decision Tree are Tree Based model that create rules to decide the output. the model are easily interpreted but it is prone to overfitting, and most likely we will find ourself to use Random Forest Model.
Create Object Model
<- ctree(formula = y ~ ., data = bank.train, control = ctree_control(minbucket = 50)) bank_tree
Because our train set did not have one of the class in default columns, i have to remove the class that train did not have in my test set.
summary(bank.train$default)
#> no unknown yes
#> 19096 5264 0
summary(bank.test$default)
#> no unknown yes
#> 8102 2190 1
<- bank.test %>% filter(default != 'yes') bank.test2
<- predict(bank_tree, bank.test2, type = 'response') bank_pred_test
confusionMatrix(bank_pred_test, bank.test2$y, 'yes')
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 8048 194
#> yes 1085 965
#>
#> Accuracy : 0.8757
#> 95% CI : (0.8692, 0.882)
#> No Information Rate : 0.8874
#> P-Value [Acc > NIR] : 0.9999
#>
#> Kappa : 0.5345
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.83261
#> Specificity : 0.88120
#> Pos Pred Value : 0.47073
#> Neg Pred Value : 0.97646
#> Prevalence : 0.11261
#> Detection Rate : 0.09376
#> Detection Prevalence : 0.19918
#> Balanced Accuracy : 0.85691
#>
#> 'Positive' Class : yes
#>
Create an object of predictions probability
<- predict(object = bank_tree,
y_pred_prob_tree newdata = bank.test2,
type = "prob")
Create Data Frame of prediction vs actual
<- data.frame(pred_prob = y_pred_prob_tree[, 'yes'],
data_roc_tree actual = ifelse(bank.test2$y == 'yes', 1, 0))
head(data_roc_tree)
Plotting ROC
<- prediction(predictions = data_roc_tree$pred_prob,
bank_roc_tree labels = data_roc_tree$actual)
plot(performance(bank_roc_tree, "tpr", "fpr"))
<- performance(bank_roc_tree, measure = "auc")
bank_auc_tree @y.values bank_auc_tree
#> [[1]]
#> [1] 0.9304156
Because Decision Tree alone are prone to overfitting, we can use Random Forest model instead, random forest model is a ensemble method that use many different tree model, then aggregate the result.
Removing Columns with variance close to zero
<- nearZeroVar(bank)
n0var <- bank[,-n0var]
bankt <- bankt %>% mutate_if(is.character, as.factor) bankt
Create model and save model for future use
# set.seed(420)
#
# ctrl <- trainControl(method = "repeatedcv",
# number = 5,
# repeats = 3)
#
# bank_forest2 <- train(y ~ .,
# data = bankt,
# method = "rf",
# trControl = ctrl,
# tuneLength = 5) #membuat 5 kombinasi mtry
# bank_forest2
#
# saveRDS(bank_forest2, 'bank_forest2.RDS')
Import model
<- readRDS('bank_forest2.RDS')
bank_forest2 bank_forest2
#> Random Forest
#>
#> 41188 samples
#> 19 predictor
#> 2 classes: 'no', 'yes'
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times)
#> Summary of sample sizes: 32951, 32950, 32950, 32950, 32951, 32950, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.9011039 0.2479975
#> 14 0.9150076 0.5399231
#> 27 0.9141498 0.5423452
#> 39 0.9141174 0.5429429
#> 52 0.9130573 0.5382130
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 14.
<- predict(bank_forest2, bank.test) bank_forest_pred
confusionMatrix(bank_forest_pred, bank.test$y)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 9134 1
#> yes 0 1158
#>
#> Accuracy : 0.9999
#> 95% CI : (0.9995, 1)
#> No Information Rate : 0.8874
#> P-Value [Acc > NIR] : <0.0000000000000002
#>
#> Kappa : 0.9995
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 1.0000
#> Specificity : 0.9991
#> Pos Pred Value : 0.9999
#> Neg Pred Value : 1.0000
#> Prevalence : 0.8874
#> Detection Rate : 0.8874
#> Detection Prevalence : 0.8875
#> Balanced Accuracy : 0.9996
#>
#> 'Positive' Class : no
#>
Identifying the best mtry
$finalModel bank_forest2
#>
#> Call:
#> randomForest(x = x, y = y, mtry = param$mtry)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 14
#>
#> OOB estimate of error rate: 8.58%
#> Confusion matrix:
#> no yes class.error
#> no 35165 1383 0.03784065
#> yes 2149 2491 0.46314655
Create an object of predictions probability
<- predict(object = bank_forest2,
y_pred_prob_rf newdata = bank.train,
type = "prob")
Create Data Frame of prediction vs actual
<- data.frame(pred_prob = y_pred_prob_rf[, 'yes'],
data_roc_rf actual = ifelse(bank.train$y == 'yes', 1, 0))
head(data_roc_tree)
Plotting ROC
<- prediction(predictions = data_roc_rf$pred_prob,
bank_roc_rf labels = data_roc_rf$actual)
# ROC curve
plot(performance(bank_roc_rf, "tpr", "fpr"))
<- performance(bank_roc_rf, measure = "auc")
bank_auc_rf @y.values bank_auc_rf
#> [[1]]
#> [1] 0.9966882
varImp(bank_forest2)
#> rf variable importance
#>
#> only 20 most important variables shown (out of 52)
#>
#> Overall
#> duration 100.000
#> euribor3m 34.250
#> age 25.871
#> nr.employed 23.301
#> campaign 11.467
#> poutcomesuccess 10.283
#> cons.conf.idx 7.561
#> cons.price.idx 6.032
#> emp.var.rate 5.816
#> housingyes 5.005
#> previous 3.920
#> educationuniversity.degree 3.838
#> maritalmarried 3.655
#> loanyes 3.540
#> day_of_weekthu 3.396
#> day_of_weekwed 3.337
#> maritalsingle 3.336
#> educationhigh.school 3.330
#> day_of_weekmon 3.300
#> day_of_weektue 3.219
Random Forest weakness are limitation in interpretation, we could only tell which variables is the most important to the least important. In this model, duration are the most important and so on according the table, this mean duration variables are used in all tree in our random forest model.
Out of all the model we create, Random Forest turn out to be the best model according to all metrics. but if the goal is to understand our data, we can not use random forest model because we can not interpret it really well outside the importance of each predictor, instead, we can use Decision Tree, this model still maintaining a decent performance and can be interpreted in ruleset or plot the model to create an image of said rules (but the intepretation will get harder as the tree get deeper, and the image can not be seen cleary). Lastly we have Naive Bayes model, while it has the lowest, it still manage to have a decent accuracy and recall and the model can be interpreted really well using Naive Bayes formula to count its probability
To summarize it :
Naive Bayes : + Can be interpreted really well using its probability given x events
Decision Tree : + Can be interpreted using rule-set (Not Really when the rules get too much)
Fast to compute and decent performance
Tend to overfit
Random Forest : + the result are aggregate of decision trees, so it tend to not overfit, not bias.
High performance
Took a lot of time to compute
No interpretation other than variable importance