Introduction
In machine learning, experimentation refers to the systematic process of
designing, executing, and analyzing different configurations to identify
the settings that perform best on a given task. Experimentation is
learning by doing. It involves systematically changing parameters,
evaluating results with metrics, and comparing different approaches to
find the best solution; essentially, it is the practice of testing and
refining machine learning models through controlled experiments to
improve their performance.
The key is to modify only one or a few variables at a time to isolate the impact of each change and understand its effect on model performance. In the assignment you will conduct at least 6 experiments. In real life, data scientists run anywhere from a dozen to hundreds of experiments (depending on the dataset and problem domain).
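To make the one-variable-at-a-time idea concrete, below is a minimal R sketch that varies only a decision tree's complexity parameter (cp) while every other setting stays fixed, recording one comparable metric per run. It uses the kyphosis data that ships with rpart purely for illustration, not the bank data analyzed later.
library(rpart)

set.seed(42)
n     <- nrow(kyphosis)
idx   <- sample(n, floor(0.8 * n))
train <- kyphosis[idx, ]
test  <- kyphosis[-idx, ]

for (cp in c(0.001, 0.01, 0.05)) {   # the single variable under study
  fit  <- rpart(Kyphosis ~ ., data = train, method = "class",
                control = rpart.control(minsplit = 20, cp = cp))
  pred <- predict(fit, newdata = test, type = "class")
  cat(sprintf("cp = %.3f -> accuracy = %.3f\n", cp, mean(pred == test$Kyphosis)))
}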
Dataset
A Portuguese bank conducted a marketing campaign (phone calls) to
predict whether a client would subscribe to a term deposit. The records
of their efforts are available in the form of a dataset. The objective
here is to apply machine learning techniques to analyze the dataset and
identify the most effective tactics that will help the bank persuade
more customers to subscribe to a term deposit in its next campaign.
Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing
Assignment
This assignment consists of conducting at least two (2) experiments for
each of three algorithms: Decision Trees, Random Forest, and AdaBoost.
That is, at least six (6) experiments in total (3 algorithms x 2
experiments each). For each experiment you will define what you are
trying to achieve (before each run), conduct the experiment, and at the
end you will review how your experiment went. These experiments will
allow you to compare algorithms and choose the optimal model.
Transformations and Algorithm Selection
You will perform experiments using the following algorithms:
Experiment
For each of the algorithms (above), perform at least two (2)
experiments. In a typical experiment you should:
Variations
There are many things you can vary between experiments, here are some
examples:
Essay
Write a short essay summarizing your findings. Your essay should include:
EDA
EDA was performed in the previous assignment: https://rpubs.com/greggmaloy/1275261
Below the EDA from the previous assignment is summarized:
There was considerable class imbalance in the target variable (y): ~11% of clients subscribed (yes), while ~88% did not (no). There are seven numerical and ten categorical variables in the dataset. Most numerical features were right-skewed and many had outliers (detected via IQR and scatterplots). There were no strong linear relationships between features, with most variables showing very weak or no correlations in the correlation matrix. There was no missing data.
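For reference, below is a minimal sketch of the IQR rule mentioned above (flagging values beyond the 1.5 x IQR fences); x stands for any numeric column, e.g. the balance field once the data are loaded below.
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)  # first and third quartiles
  fence <- 1.5 * (q[2] - q[1])                   # 1.5 x IQR
  x < q[1] - fence | x > q[2] + fence            # TRUE flags an outlier
}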
What the EDA Means for Decision Trees, Random Forest &
ADA Boost Models
Since ADA Boost cannot handle categorical variables directly (it needs a
numerical representation of categorical variables in order to assign
weights to features), and since both decision tree and random forest
models can accommodate one-hot encoding of categorical variables, the
categorical variables are transformed below via one-hot encoding. This
transformation standardizes the dataset for all three models.
Evaluation Metrics
Confusion matrices were used to assess all models in this analysis. More
specifically, accuracy, sensitivity, specificity, and balanced accuracy
were of particular importance.
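As a quick reference, the sketch below recomputes these metrics from raw confusion-matrix counts, using the initial decision tree's test results reported later in this document. Note that caret treats class 0 ("no") as the positive class throughout.
tp <- 7733   # predicted 0, actually 0 (caret's positive class is 0)
fn <-  239   # predicted 1, actually 0
fp <-  706   # predicted 0, actually 1
tn <-  364   # predicted 1, actually 1

accuracy    <- (tp + tn) / (tp + tn + fp + fn)   # 0.8955
sensitivity <- tp / (tp + fn)                    # 0.9700 (recall of class 0)
specificity <- tn / (tn + fp)                    # 0.3402 (recall of class 1)
balanced    <- (sensitivity + specificity) / 2   # 0.6551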
########################################## ONLY CODE BELOW###############################################################
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(fastDummies)
library(rpart)
library(rpart.plot)
library(smotefamily)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(mlbench)
bank_data <- read_csv2("https://raw.githubusercontent.com/greggmaloy/Data622/main/bank-full.csv", show_col_types = FALSE)
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
# Categorical variables
categorical_cols <- names(bank_data)[sapply(bank_data, is.character)]
# One-hot encoding
bank_data_encoded <- fastDummies::dummy_cols(bank_data, select_columns = categorical_cols,
remove_first_dummy = TRUE, remove_selected_columns = TRUE)
# Split data 80 training/ 20 testing
set.seed(42)
trainIndex <- createDataPartition(bank_data_encoded$y_yes, p = 0.8, list = FALSE)
trainData <- bank_data_encoded[trainIndex, ]
testData <- bank_data_encoded[-trainIndex, ]
# factorize target
trainData$y_yes <- factor(trainData$y_yes, levels = c(0, 1))
testData$y_yes <- factor(testData$y_yes, levels = c(0, 1))
set.seed(123)
In the tables below, Y=0 and Y=1 are the training-set class counts after any resampling; TP/FN/FP/TN count class Y=1 as the positive class, while sensitivity and specificity follow caret's convention of treating Y=0 as the positive class.

| Model | Train Y=0 | Train Y=1 | TP | FN | FP | TN | Accuracy | Sensitivity | Specificity | Balanced Acc |
|---|---|---|---|---|---|---|---|---|---|---|
| DECISION TREE | ||||||||||
| Initial Decision Tree (DT) | 39822 | 5289 | 364 | 706 | 239 | 7733 | 0.8955 | 0.9700 | 0.3402 | 0.6551 |
| SMOTE #1(K=5,dup_size=1) & DT | 31950 | 8438 | 497 | 573 | 409 | 7563 | 0.8914 | 0.9487 | 0.4645 | 0.7066 |
| SMOTE #2(K=5,dup_size=2) & DT | 31950 | 12657 | 577 | 493 | 510 | 7462 | 0.8891 | 0.9360 | 0.5393 | 0.7376 |
| Hyperparameter tuning #1 (cp=0.05) | 31950 | 12657 | 628 | 442 | 700 | 7272 | 0.8737 | 0.9122 | 0.5869 | 0.7496 |
| Hyperparameter tuning #2 (weight) | 31950 | 12657 | 901 | 169 | 1851 | 6121 | 0.7766 | 0.7678 | 0.8421 | 0.8049 |
Initial Decision Tree
The initial decision tree resulted in a model with high accuracy (90%),
high sensitivity (97%), moderate balanced accuracy (66%), and low
specificity (34%). Since class imbalance was not adequately addressed by
the initial decision tree, the experiments focus on balancing the
minority class of the dependent variable while also attempting to
improve specificity.
Experiments
The following models/experiments were run:
1. SMOTE dup_size=1 in an attempt to balance the dataset
2. SMOTE dup_size=2 in another attempt to balance the dataset
3. Hyperparameter tuning #1 (cp=0.05) in an attempt to improve
specificity
4. Hyperparameter tuning #2, which weighted the minority class, in an
attempt to improve specificity
EXPERIMENT 1: SMOTE
Since predicting class Y=1 is important, SMOTE was utilized to resample.
SMOTE generates synthetic data using k-nearest neighbors of the minority
class in order to overcome the class imbalance. Below, SMOTE is used to
generate additional rows in which Y=1 (~4,200 synthetic rows for
dup_size=1 and ~8,400 for dup_size=2), and the decision trees are run
again. The expected result would be an increase in specificity and a
decrease in sensitivity. Overall accuracy would also be expected to
decrease since synthetic data is being generated. Balanced accuracy,
however, would be expected to increase.
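Below is a minimal sketch of the dup_size arithmetic in smotefamily::SMOTE: each unit of dup_size synthesizes roughly one additional copy of the minority class via k-nearest neighbours. The counts match the class-distribution tables printed later in this report.
n_minority <- 4219     # Y=1 rows in the 80% training split (see tables below)
n_minority * (1 + 1)   # dup_size = 1 ->  8438 minority rows after SMOTE
n_minority * (1 + 2)   # dup_size = 2 -> 12657 minority rows after SMOTE
# General call pattern (predictors and target passed separately);
# the result is smote_out$data, with the target in a column named "class":
# smote_out <- SMOTE(X = train[ , setdiff(names(train), "y_yes")],
#                    target = train$y_yes, K = 5, dup_size = 2)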
SMOTE Hypothesis
Null Hypothesis: The use of SMOTE will not effectively balance
the dataset and will not result in an improvement in specificity.
Alternate Hypothesis: The use of SMOTE will create a more
balanced dataset, leading to an increase in specificity.
SMOTE Results
Utilization of SMOTE created synthetic instances of the minority class
(Y=1). SMOTE did improve specificity (initial = 34% versus SMOTE #2 =
54%) and the minority class was further balanced (initial model balanced
accuracy = 66% versus SMOTE #2 balanced accuracy = 74%), with minimal
impact on sensitivity and accuracy. Although specificity only increased
from low to relatively moderate (initial = 34% versus SMOTE #2 = 54%),
we are still able to reject the null hypothesis.
EXPERIMENT 2: Hyperparameter Tuning
In an attempt to improve specificity and balance the data,
hyperparameter tuning was utilized. Two experiments were run:
1.) The complexity parameter (cp) was set to 0.05.
2.) The decision tree model adjusted the class priors in favor of the
minority class (Y=0: 40% weight; Y=1: 60% weight), as in the sketch
below.
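For reference, here is a minimal sketch of prior-based class weighting in rpart, shown on rpart's built-in kyphosis data rather than the bank data; parms$prior re-weights the classes (in factor-level order) in the splitting criterion.
library(rpart)

fit <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
             parms   = list(prior = c(0.4, 0.6)),  # priors in factor-level order
             control = rpart.control(minsplit = 20, cp = 0.01))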
Hyperparameter Tuning Hypothesis #1: Complexity Parameter
Increase
Null Hypothesis: Increasing the complexity parameter will not result in
an improvement in specificity.
Alternate Hypothesis: Increasing the complexity parameter will
increase specificity.
Hyperparameter Tuning #1 Results
As expected, increasing the complexity parameter increased specificity
(SMOTE #2 specificity = 54% versus HP Tuning #1 specificity = 59%) and
slightly decreased accuracy and sensitivity. As such, the null
hypothesis can be rejected: the complexity parameter increase led to
increased specificity.
Hyperparameter Tuning Hypothesis #2: Increase the Minority Class
Weight
Null Hypothesis: Increasing the weight of the minority class will not
increase specificity.
Alternate Hypothesis: Increasing the weight of the minority class will
increase specificity.
Hyperparameter Tuning #2 Results
The decision tree model adjusted the class priors in favor of the
minority class (Y=0: 40% weight; Y=1: 60% weight), which resulted in
significantly increased specificity (HP Tuning #1 specificity = 59%
versus adjusted-weight specificity = 84%), decreased sensitivity (77%),
and decreased model accuracy (78%). Of note, balanced accuracy increased
to 80%. As such, we are able to reject the null hypothesis: adjusting
the weight of the minority class dramatically increased the decision
tree model's specificity.
Final Decision Tree Model
Among the models discussed, the final model, with adjusted weights for
the minority class in the target variable, is the preferred model. It is
important to note, however, that this model was trained on a dataset
containing ~8,400 rows of synthetic data generated by SMOTE
(dup_size=2), with class weights adjusted in favor of the minority
class. (The increased complexity parameter, cp=0.05, was explored as a
separate experiment; the weighted model itself used cp=0.01.)
Visualization
Below is a visualization of the final decision tree.
Generally speaking, across the decision tree, SMOTE, and hyperparameter
tuning runs, the 'duration' feature was the most influential feature,
followed by poutcome_success, with significantly less influence from the
remaining features.
########################################## ONLY CODE BELOW###############################################################
set.seed(123)
#INITIAL DT MODEL!!!!!!!!!!!!!!!!!!
dt_model <- rpart(y_yes ~ ., data = trainData, method = "class", control = rpart.control(minsplit = 20, cp = 0.01))
rpart.plot(dt_model)
# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")
# Confusion matrix
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7733 706
## 1 239 364
##
## Accuracy : 0.8955
## 95% CI : (0.889, 0.9017)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 1.886e-05
##
## Kappa : 0.3825
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9700
## Specificity : 0.3402
## Pos Pred Value : 0.9163
## Neg Pred Value : 0.6036
## Prevalence : 0.8817
## Detection Rate : 0.8552
## Detection Prevalence : 0.9333
## Balanced Accuracy : 0.6551
##
## 'Positive' Class : 0
##
# Feature importance
print(dt_model$variable.importance)
## duration poutcome_success contact_unknown pdays
## 1110.9180804 685.8992109 3.9494818 0.8776626
## previous age campaign
## 0.5831593 0.1443280 0.1443280
#SMOTE 1!!!!!!!!!!!!!!!!!!
set.seed(123)
trainData$y_yes <- as.factor(trainData$y_yes)
#SMOTE
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")], trainData$y_yes, K = 5, dup_size = 1)
# Creation of new dataset
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)
# New class distribution
table(trainData_balanced$y_yes)
##
## 0 1
## 31950 8438
set.seed(123)
# Train decision tree
dt_model <- rpart(y_yes ~ ., data = trainData_balanced, method = "class", control = rpart.control(minsplit = 20, cp = 0.01))
# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")
# Confusion Matrix
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7563 573
## 1 409 497
##
## Accuracy : 0.8914
## 95% CI : (0.8848, 0.8977)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 0.001992
##
## Kappa : 0.4425
##
## Mcnemar's Test P-Value : 1.976e-07
##
## Sensitivity : 0.9487
## Specificity : 0.4645
## Pos Pred Value : 0.9296
## Neg Pred Value : 0.5486
## Prevalence : 0.8817
## Detection Rate : 0.8364
## Detection Prevalence : 0.8998
## Balanced Accuracy : 0.7066
##
## 'Positive' Class : 0
##
# Plot
rpart.plot(dt_model,
type = 3,
extra = 104,
under = TRUE,
tweak = 1.2,
box.palette = "RdYlGn",
fallen.leaves = TRUE)
#SMOTE 2!!!!!!!!!!!
set.seed(123)
trainData$y_yes <- as.factor(trainData$y_yes)
#SMOTE
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")], trainData$y_yes, K = 5, dup_size = 2)
# new dataset
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)
# class distribution
table(trainData_balanced$y_yes)
##
## 0 1
## 31950 12657
# Train dt
dt_model <- rpart(y_yes ~ ., data = trainData_balanced, method = "class", control = rpart.control(minsplit = 20, cp = 0.01))
# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")
# conf matrix
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7462 493
## 1 510 577
##
## Accuracy : 0.8891
## 95% CI : (0.8824, 0.8955)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 0.0146
##
## Kappa : 0.472
##
## Mcnemar's Test P-Value : 0.6134
##
## Sensitivity : 0.9360
## Specificity : 0.5393
## Pos Pred Value : 0.9380
## Neg Pred Value : 0.5308
## Prevalence : 0.8817
## Detection Rate : 0.8253
## Detection Prevalence : 0.8798
## Balanced Accuracy : 0.7376
##
## 'Positive' Class : 0
##
# Plot
rpart.plot(dt_model,
type = 3,
extra = 104,
under = TRUE,
tweak = 1.2,
box.palette = "RdYlGn",
fallen.leaves = TRUE)
#HYPERPARAMETER TUNING #1
#MODIFY COMPLEXITY PARAMETER
set.seed(123)
dt_model <- rpart(y_yes ~ ., data = trainData_balanced, method = "class",
control = rpart.control(minsplit = 20, cp = 0.05))
# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")
# confusion
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7272 442
## 1 700 628
##
## Accuracy : 0.8737
## 95% CI : (0.8667, 0.8805)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 0.9904
##
## Kappa : 0.4519
##
## Mcnemar's Test P-Value : 2.849e-14
##
## Sensitivity : 0.9122
## Specificity : 0.5869
## Pos Pred Value : 0.9427
## Neg Pred Value : 0.4729
## Prevalence : 0.8817
## Detection Rate : 0.8042
## Detection Prevalence : 0.8531
## Balanced Accuracy : 0.7496
##
## 'Positive' Class : 0
##
# Plot
rpart.plot(dt_model,
type = 3,
extra = 104,
under = TRUE,
tweak = 1.2,
box.palette = "RdYlGn",
fallen.leaves = TRUE)
#HYPERPARAMETER TUNING #2 CHANGING WEIGHTS!!!!!!!!!!!
set.seed(123)
dt_model <- rpart(y_yes ~ .,
data = trainData_balanced,
method = "class",
parms = list(prior = c(0.4, 0.6)), # class priors: 40% class 0, 60% class 1
control = rpart.control(minsplit = 20, cp = 0.01))
# Predictions
dt_predictions <- predict(dt_model, newdata = testData, type = "class")
# Confusion
conf_matrix <- confusionMatrix(dt_predictions, testData$y_yes)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6121 169
## 1 1851 901
##
## Accuracy : 0.7766
## 95% CI : (0.7679, 0.7851)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3629
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7678
## Specificity : 0.8421
## Pos Pred Value : 0.9731
## Neg Pred Value : 0.3274
## Prevalence : 0.8817
## Detection Rate : 0.6770
## Detection Prevalence : 0.6956
## Balanced Accuracy : 0.8049
##
## 'Positive' Class : 0
##
# Plot DT
rpart.plot(dt_model,
type = 3,
extra = 104,
under = TRUE,
tweak = 1.2,
box.palette = "RdYlGn",
fallen.leaves = TRUE)
| Model | Train Y=0 | Train Y=1 | TP | FN | FP | TN | Accuracy | Sensitivity | Specificity | Balanced Acc |
|---|---|---|---|---|---|---|---|---|---|---|
| RANDOM FOREST | ||||||||||
| Initial Random Forest (RF) | 39822 | 5289 | 410 | 660 | 212 | 7760 | 0.9036 | 0.9734 | 0.3832 | 0.6783 |
| SMOTE #1(K=5,dup_size=1) & RF | 31950 | 8438 | 445 | 625 | 234 | 7738 | 0.9050 | 0.9706 | 0.4159 | 0.6933 |
| SMOTE #2(K=5,dup_size=2) & RF | 31950 | 12657 | 448 | 622 | 249 | 7723 | 0.9037 | 0.9688 | 0.4187 | 0.6937 |
| Hyperparameter tuning #1 (classwt=1,2) | 31950 | 12657 | 530 | 540 | 371 | 7601 | 0.8992 | 0.9535 | 0.4953 | 0.7244 |
| Hyperparameter tuning #2 (classwt=1,3) | 31950 | 12657 | 537 | 533 | 395 | 7577 | 0.8974 | 0.9505 | 0.5019 | 0.7262 |
Initial Random Forest
Comparable to the initial decision tree model, the initial random
forest resulted in a model with high accuracy (90%), high sensitivity
(97%), moderate balanced accuracy (68%), and low specificity (38%).
Again, since class imbalance was not adequately addressed by the initial
random forest, the experiments focus on balancing the minority class of
the dependent variable while also attempting to improve specificity.
Experiments
Below the following models were run:
1.) Initial random forest
2.) Experiment 1 (SMOTE): SMOTE was utilized with dup_size=1 in an
attempt to address class imbalance
3.) Experiment 1 (SMOTE): SMOTE was again utilized, with dup_size=2, in
an attempt to address class imbalance
4.) Experiment 2 (hyperparameter tuning): tuning which increased the
weight of the minority class (classwt = c(1, 2))
5.) Experiment 2 (hyperparameter tuning): tuning which further increased
the weight of the minority class (classwt = c(1, 3))
Again, confusion matrices were used to assess the models. The models' specificity remains of the utmost importance since the dependent variable suffers from considerable class imbalance (Y=0: ~89% versus Y=1: ~11%).
EXPERIMENT 1: SMOTE
Since predicting class Y=1 is important, SMOTE was again utilized to
resample. SMOTE generates synthetic data using k-nearest neighbors of
the minority class in order to overcome the class imbalance. Below,
SMOTE is used to generate additional rows in which Y=1 (as above, ~4,200
for dup_size=1 and ~8,400 for dup_size=2), and the random forest models
are run again. The expected result would be an increase in specificity
and a decrease in sensitivity. Accuracy would also be expected to
decrease since synthetic data is being generated.
SMOTE Hypothesis
Null Hypothesis: The use of SMOTE will not create a more balanced
dataset and will not result in an improvement in specificity.
Alternate Hypothesis: The use of SMOTE will create a more
balanced dataset, leading to an increase in specificity.
SMOTE Results
The initial model was a poor predictor of the minority class due to the
class imbalance (balanced accuracy = 68%), with low specificity (38%),
good sensitivity (97%), and good model accuracy (90%). Unlike the
decision tree model, SMOTE did not improve specificity as much (initial
model = 38% versus SMOTE #2 = 42%) and the minority class was only
slightly more balanced (balanced accuracy: SMOTE #2 = 69% versus initial
= 68%), with minimal impact on sensitivity and accuracy. As such, SMOTE
was not as effective in balancing the dataset for random forest as for
decision trees. However, the null hypothesis can be rejected because
SMOTE did improve specificity somewhat.
EXPERIMENT 2: Hyperparameter Tuning
In an attempt to improve the class balance and specificity,
hyperparameter tuning was utilized.
Hyperparameters were adjusted twice (a minimal sketch of classwt follows
this list):
1.) hyperparameter tuning which increased the weight of the minority
class (classwt = c(1, 2))
2.) hyperparameter tuning which increased the weight of the minority
class further (classwt = c(1, 3))
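Below is a minimal sketch of classwt in randomForest on a small simulated dataset (not the bank data); classwt assigns per-class weights in factor-level order, here giving the minority class "1" three times the weight of class "0".
library(randomForest)

set.seed(1)
toy   <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
toy$y <- factor(ifelse(toy$x1 + toy$x2 > 2, 1, 0))  # imbalanced 0/1 target
rf <- randomForest(y ~ ., data = toy, ntree = 100,
                   classwt = c(1, 3))  # heavier weight on minority class "1"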
Hyperparameter Tuning Hypothesis: Increase the Minority Class
Weight
Null Hypothesis: Increasing the weight of the minority class
will not increase specificity
Alternate Hypothesis: Increasing the weight of the minority
class will increase specificity
Hyperparameter Tuning Results
The weighting improved specificity from 38% (initial model) to 50%, but
it was still weaker than the final decision tree model (84%). Generally
speaking, the 'duration' feature was the most influential feature for
this final random forest model, followed by poutcome_success. Although
weighting increased balanced accuracy, the increase was not very large
(weighted = 73%, SMOTE = 69%, initial model = 68%). We are still,
however, able to reject the null hypothesis.
Final Random Forest Model
Of the above random forest models, the final model with adjusted
weights for the target variable's minority class would be favored. It is
important to note, however, that this model uses a dataset which 1.) had
synthetic data generated by SMOTE and 2.) used that SMOTE-generated
dataset while increasing the minority-class weight to improve class
balance and specificity. In other words, the final model combines SMOTE
data with adjusted class weights.
########################################## ONLY CODE BELOW###############################################################
#initial RF
set.seed(123)
#clean data by eliminating spaces, etc
colnames(trainData) <- trimws(colnames(trainData))
colnames(testData) <- trimws(colnames(testData))
colnames(trainData) <- make.names(colnames(trainData))
colnames(testData) <- make.names(colnames(testData))
# Train RF model
rf_model <- randomForest(y_yes ~ ., data = trainData, ntree = 100, mtry = sqrt(ncol(trainData) - 1), importance = TRUE)
# Predictions
rf_predictions <- predict(rf_model, newdata = testData, type = "class")
# confusion matrix
conf_matrix <- confusionMatrix(factor(rf_predictions), factor(testData$y_yes))
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7760 660
## 1 212 410
##
## Accuracy : 0.9036
## 95% CI : (0.8973, 0.9096)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 1.888e-11
##
## Kappa : 0.4355
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9734
## Specificity : 0.3832
## Pos Pred Value : 0.9216
## Neg Pred Value : 0.6592
## Prevalence : 0.8817
## Detection Rate : 0.8582
## Detection Prevalence : 0.9312
## Balanced Accuracy : 0.6783
##
## 'Positive' Class : 0
##
# Feature importance
print(importance(rf_model))
## 0 1 MeanDecreaseAccuracy
## age 19.256946455 9.8764595 23.2592877
## balance 2.207072030 4.9043589 5.2897120
## day 25.674296666 2.4417613 25.9694724
## duration 55.268989858 99.1389350 87.8663890
## campaign 8.398792635 6.2730665 10.4628118
## pdays 13.484429807 9.9624594 15.2659462
## previous 6.884958037 5.9887055 7.4877267
## job_blue.collar 4.392803753 2.5880214 5.0773431
## job_entrepreneur 0.628089794 0.3553440 0.7493254
## job_housemaid 2.445014392 -2.3434614 0.8673570
## job_management 3.266437014 1.4669602 3.8311581
## job_retired 1.718451225 2.3962232 2.8974627
## job_self.employed 0.004851697 -0.5527589 -0.2594183
## job_services 2.159301219 0.1135077 1.9644682
## job_student 5.439205194 1.1827379 6.0150204
## job_technician 2.312954662 1.1110553 2.5486284
## job_unemployed 0.387435937 -1.4569323 -0.5483327
## job_unknown 1.953482526 -1.4909916 1.2639386
## marital_married 5.274191354 4.0475651 7.3539484
## marital_single 4.987173524 2.4868982 6.3379350
## education_secondary 0.477539883 0.5195440 0.8186026
## education_tertiary 4.851742773 3.3924724 5.7183920
## education_unknown -0.102589961 -0.5151976 -0.3523125
## default_yes 0.641010985 2.6382661 2.1114791
## housing_yes 15.142366680 16.4947867 21.1485187
## loan_yes -2.304014299 10.2540403 4.6931123
## contact_telephone 4.716429159 0.5906281 4.2907130
## contact_unknown 21.552401015 6.4277420 22.5738388
## month_aug 16.754644269 -1.7683791 16.7655979
## month_dec 7.838966414 4.6731467 8.8037796
## month_feb 14.438009607 0.2258086 14.4305664
## month_jan 15.735228042 -3.1363352 15.1970251
## month_jul 17.169187032 1.2876320 17.3416497
## month_jun 15.170957081 -1.3380552 15.3592556
## month_mar 15.402942706 21.1085927 21.2418963
## month_may 11.466287170 7.4654130 12.0426590
## month_nov 13.033046784 -1.2610798 12.9620166
## month_oct 15.183891145 12.0224645 17.6886436
## month_sep 12.746151446 7.8365523 13.5633357
## poutcome_other 2.659058836 2.9152952 3.1951152
## poutcome_success 5.011075672 40.7228659 18.3215721
## poutcome_unknown 7.298410344 3.8488444 7.2858040
## MeanDecreaseGini
## age 584.081660
## balance 572.136195
## day 505.459817
## duration 1847.481764
## campaign 224.465591
## pdays 294.629309
## previous 147.859857
## job_blue.collar 54.948871
## job_entrepreneur 22.433592
## job_housemaid 21.489897
## job_management 68.102311
## job_retired 35.377496
## job_self.employed 28.797968
## job_services 41.295741
## job_student 33.514945
## job_technician 67.481750
## job_unemployed 30.835596
## job_unknown 8.837568
## marital_married 73.618299
## marital_single 58.923226
## education_secondary 80.522398
## education_tertiary 76.049339
## education_unknown 35.477192
## default_yes 11.412613
## housing_yes 145.890260
## loan_yes 62.238459
## contact_telephone 44.827778
## contact_unknown 94.706311
## month_aug 64.774379
## month_dec 27.218596
## month_feb 59.358935
## month_jan 39.782869
## month_jul 64.712422
## month_jun 73.708280
## month_mar 91.195724
## month_may 66.843562
## month_nov 57.203945
## month_oct 79.025797
## month_sep 61.471510
## poutcome_other 26.397949
## poutcome_success 411.395694
## poutcome_unknown 46.885629
varImpPlot(rf_model)
set.seed(123)
#SMOTE #1
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")], trainData$y_yes, K = 5, dup_size = 1)
# Create new dataset
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)
# Train RF
rf_model <- randomForest(y_yes ~ ., data = trainData_balanced, ntree = 100,
mtry = sqrt(ncol(trainData_balanced) - 1), importance = TRUE)
# Predictions
rf_predictions <- predict(rf_model, newdata = testData, type = "class")
# confusion
rf_conf_matrix <- confusionMatrix(rf_predictions, testData$y_yes)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7738 625
## 1 234 445
##
## Accuracy : 0.905
## 95% CI : (0.8988, 0.911)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 8.176e-13
##
## Kappa : 0.4592
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9706
## Specificity : 0.4159
## Pos Pred Value : 0.9253
## Neg Pred Value : 0.6554
## Prevalence : 0.8817
## Detection Rate : 0.8558
## Detection Prevalence : 0.9249
## Balanced Accuracy : 0.6933
##
## 'Positive' Class : 0
##
# Plot
varImpPlot(rf_model)
table(trainData_balanced$y_yes)
##
## 0 1
## 31950 8438
#SMOTE 2
set.seed(123)
# SMOTE
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")], trainData$y_yes, K = 5, dup_size = 2)
# Create new dataset
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)
# Train RF
rf_model <- randomForest(y_yes ~ ., data = trainData_balanced, ntree = 100,
mtry = sqrt(ncol(trainData_balanced) - 1), importance = TRUE)
# Predictions
rf_predictions <- predict(rf_model, newdata = testData, type = "class")
# confusion
rf_conf_matrix <- confusionMatrix(rf_predictions, testData$y_yes)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7723 622
## 1 249 448
##
## Accuracy : 0.9037
## 95% CI : (0.8974, 0.9097)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 1.494e-11
##
## Kappa : 0.4563
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9688
## Specificity : 0.4187
## Pos Pred Value : 0.9255
## Neg Pred Value : 0.6428
## Prevalence : 0.8817
## Detection Rate : 0.8541
## Detection Prevalence : 0.9229
## Balanced Accuracy : 0.6937
##
## 'Positive' Class : 0
##
# Plot
varImpPlot(rf_model)
table(trainData_balanced$y_yes)
##
## 0 1
## 31950 12657
#HYPERPARAMETER TUNING 1
set.seed(123)
rf_model <- randomForest(y_yes ~ .,
data = trainData_balanced,
ntree = 100,
mtry = sqrt(ncol(trainData_balanced)),
classwt = c(1, 2), # heavier weight on the minority class (1)
importance = TRUE)
rf_predictions <- predict(rf_model, newdata = testData, type = "class")
rf_conf_matrix <- confusionMatrix(rf_predictions, testData$y_yes)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7601 540
## 1 371 530
##
## Accuracy : 0.8992
## 95% CI : (0.8929, 0.9054)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 6.665e-08
##
## Kappa : 0.4817
##
## Mcnemar's Test P-Value : 2.605e-08
##
## Sensitivity : 0.9535
## Specificity : 0.4953
## Pos Pred Value : 0.9337
## Neg Pred Value : 0.5882
## Prevalence : 0.8817
## Detection Rate : 0.8406
## Detection Prevalence : 0.9004
## Balanced Accuracy : 0.7244
##
## 'Positive' Class : 0
##
varImpPlot(rf_model)
#HYPERPARAMETER TUNING 2
set.seed(123)
rf_model <- randomForest(y_yes ~ .,
data = trainData_balanced,
ntree = 100,
mtry = sqrt(ncol(trainData_balanced)),
classwt = c(1, 3),
importance = TRUE)
rf_predictions <- predict(rf_model, newdata = testData, type = "class")
rf_conf_matrix <- confusionMatrix(rf_predictions, testData$y_yes)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7577 533
## 1 395 537
##
## Accuracy : 0.8974
## 95% CI : (0.8909, 0.9035)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 1.326e-06
##
## Kappa : 0.4791
##
## Mcnemar's Test P-Value : 6.884e-06
##
## Sensitivity : 0.9505
## Specificity : 0.5019
## Pos Pred Value : 0.9343
## Neg Pred Value : 0.5762
## Prevalence : 0.8817
## Detection Rate : 0.8380
## Detection Prevalence : 0.8969
## Balanced Accuracy : 0.7262
##
## 'Positive' Class : 0
##
varImpPlot(rf_model)
| Model | Train Y=0 | Train Y=1 | TP | FN | FP | TN | Accuracy | Sensitivity | Specificity | Balanced Acc |
|---|---|---|---|---|---|---|---|---|---|---|
| ADA BOOST | ||||||||||
| Initial ADA BOOST (ADA) | 39822 | 5289 | 263 | 807 | 134 | 7838 | 0.8959 | 0.9832 | 0.2458 | 0.6145 |
| SMOTE #1(K=5,dup_size=1) & ADA | 31950 | 8438 | 461 | 609 | 305 | 7667 | 0.8989 | 0.9617 | 0.4308 | 0.6963 |
| SMOTE #2(K=5,dup_size=2) & ADA | 31950 | 12657 | 520 | 550 | 377 | 7595 | 0.8975 | 0.9527 | 0.4860 | 0.7193 |
| Hyperparameter tuning #1 (weights 4,1) | 31950 | 12657 | 507 | 563 | 333 | 7639 | 0.9009 | 0.9582 | 0.4738 | 0.7160 |
| Hyperparameter tuning #2 (weights 9,1) | 31950 | 12657 | 507 | 563 | 333 | 7639 | 0.9009 | 0.9582 | 0.4738 | 0.7160 |
Initial ADA Boost
Comparable to the initial decision tree and the initial random forest
models, the initial ADA Boost resulted in a model with high accuracy
(90%), high sensitivity (98%), moderate balanced accuracy (61%), and low
specificity (25%). Of note, the initial ADA Boost model's specificity
was ~10% lower than that of the initial decision tree and random forest
models. Again, as with the initial decision tree and random forest
models, class imbalance was not adequately addressed by the initial ADA
Boost model. As such, the experiments focus on balancing the minority
class of the dependent variable while also attempting to improve
specificity.
Experiments
Below the following models were run:
1.) Initial ADA Boost
2.) Experiment 1 (SMOTE): SMOTE was utilized with dup_size=1 in an
attempt to address class imbalance
3.) Experiment 1 (SMOTE): SMOTE was again utilized, with dup_size=2, in
an attempt to address class imbalance
4.) Experiment 2 (hyperparameter tuning): tuning which increased the
weight of the minority class (weights 4,1)
5.) Experiment 2 (hyperparameter tuning): tuning which further increased
the weight of the minority class (weights 9,1)
Again, confusion matrices were used to assess the models. The models'
specificity remains of the utmost importance since the dependent
variable suffers from considerable class imbalance (Y=0: ~89% versus
Y=1: ~11%) and the initial ADA Boost model failed to address this
imbalance.
EXPERIMENT 1: SMOTE
Since predicting class Y=1 is important, SMOTE was again utilized to
resample. SMOTE generates synthetic data using k-nearest neighbors of
the minority class in order to overcome the class imbalance. Below,
SMOTE is used to generate additional rows in which Y=1 (as above, ~4,200
for dup_size=1 and ~8,400 for dup_size=2), and the ADA Boost models are
run again. The expected result would be an increase in specificity and a
decrease in sensitivity. Accuracy would also be expected to decrease
since synthetic data is being generated.
SMOTE Hypothesis
Null Hypothesis: The use of SMOTE will not effectively balance
the dataset and will not result in an improvement in specificity.
Alternate Hypothesis: The use of SMOTE will create a more
balanced dataset, leading to an increase in specificity.
SMOTE Results
Utilization of SMOTE created synthetic instances of the minority class
(Y=1). SMOTE improved specificity (initial model = 25% versus SMOTE #2 =
49%) and the minority class was more balanced (initial ADA Boost
balanced accuracy = 61% versus SMOTE #2 = 72%). Again, although
sensitivity decreased with SMOTE, the decrease was negligible (~3%).
Model accuracy remained essentially unchanged. As such, SMOTE was useful
in addressing the class imbalance present in the dependent variable, and
the null hypothesis can be rejected.
EXPERIMENT 2: Hyperparameter Tuning
In an attempt to improve specificity, hyperparameter tuning was
utilized. Two experiments were run:
1.) hyperparameter tuning which increased the weight of the minority
class (weights 4,1)
2.) hyperparameter tuning which increased the weight of the minority
class further (weights 9,1)
Hyperparameter Tuning Hypothesis: Increase the Minority Class
Weight
Null Hypothesis: Increasing the weight of the minority class will not
increase specificity
Alternate Hypothesis: Increasing the weight of the minority class will
increase specificity
Hyperparameter Tuning Results
The weight-adjusted hyperparameter tuning failed to improve specificity
as expected; in fact, specificity slightly decreased (~1%), making it
ineffective for balancing the target variable. As such, for the ADA
Boost model we cannot reject the null hypothesis. Of note, the second
weighted model returned results identical to the first weighted model.
The likely explanation is a bug in the weights vector (flagged in the
code below): the condition compares y_yes to the label "MinorityClass",
which never occurs (the levels are 0 and 1), so every observation
received a weight of 1 in both runs.
Final ADA Boost Model
The SMOTE model was selected as the preferred model for the ADA Boost
subset since it significantly improved specificity (from 25% to 49%)
while maintaining high accuracy and sensitivity. In contrast, weighting
the minority class had no positive effect on specificity and produced
identical results across different weights (see the note above on the
weights vector). Independent of that bug, ADA Boost already adjusts for
class imbalance by re-weighting misclassified observations internally,
which can make external weight adjustments less effective, as sketched
below.
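To illustrate the internal re-weighting (a generic sketch of the AdaBoost.M1 update step, not caret's or adabag's internal code):
w   <- rep(1 / 10, 10)               # uniform starting weights, 10 toy points
mis <- c(TRUE, TRUE, rep(FALSE, 8))  # toy misclassification indicator

err   <- sum(w[mis]) / sum(w)        # weighted error of the weak learner
alpha <- log((1 - err) / err)        # learner coefficient (adabag's "Breiman"
                                     # option uses half this value)
w     <- w * exp(alpha * mis)        # misclassified points are up-weighted
w     <- w / sum(w)                  # renormalize for the next round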
########################################## ONLY CODE BELOW###############################################################
#initial ADA
set.seed(123)
trainData$y_yes <- factor(trainData$y_yes, levels = c(0, 1))
testData$y_yes <- factor(testData$y_yes, levels = c(0, 1))
# train control for boosting
train_control <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
# AdaBoost model
ada_model <- train(
y_yes ~ .,
data = trainData,
method = "AdaBoost.M1", # AdaBoost method in caret
trControl = train_control,
tuneGrid = expand.grid(mfinal = 100, maxdepth = 1, coeflearn = "Breiman") # 100 weak learners (stumps)
)
# Predictions
ada_predictions <- predict(ada_model, newdata = testData)
# confusion
conf_matrix <- confusionMatrix(ada_predictions, testData$y_yes)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7838 807
## 1 134 263
##
## Accuracy : 0.8959
## 95% CI : (0.8895, 0.9022)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 1.04e-05
##
## Kappa : 0.3147
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9832
## Specificity : 0.2458
## Pos Pred Value : 0.9067
## Neg Pred Value : 0.6625
## Prevalence : 0.8817
## Detection Rate : 0.8668
## Detection Prevalence : 0.9561
## Balanced Accuracy : 0.6145
##
## 'Positive' Class : 0
##
# Feature importance
var_importance <- varImp(ada_model)
print(var_importance)
## AdaBoost.M1 variable importance
##
## only 20 most important variables shown (out of 42)
##
## Overall
## duration 100.0000
## poutcome_success 30.4888
## contact_unknown 21.6696
## month_mar 5.6232
## housing_yes 5.4766
## pdays 2.6769
## age 2.6350
## month_oct 2.2611
## month_may 1.5601
## balance 1.1913
## campaign 1.0611
## loan_yes 0.6342
## month_sep 0.5849
## marital_married 0.5319
## job_blue.collar 0.5197
## month_jul 0.4191
## month_nov 0.0000
## contact_telephone 0.0000
## month_feb 0.0000
## month_aug 0.0000
plot(var_importance)
#SMOTE #1
set.seed(123)
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")], trainData$y_yes, K = 5, dup_size = 1)
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)
#boost control
train_control <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
# AdaBoost model
ada_model <- train(
y_yes ~ .,
data = trainData_balanced,
method = "AdaBoost.M1", # AdaBoost method in caret
trControl = train_control,
tuneGrid = expand.grid(mfinal = 100, maxdepth = 1, coeflearn = "Breiman") # 100 weak learners (stumps)
)
ada_predictions <- predict(ada_model, newdata = testData)
# confusion
conf_matrix <- confusionMatrix(ada_predictions, testData$y_yes)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7667 609
## 1 305 461
##
## Accuracy : 0.8989
## 95% CI : (0.8925, 0.9051)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 1.158e-07
##
## Kappa : 0.4476
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9617
## Specificity : 0.4308
## Pos Pred Value : 0.9264
## Neg Pred Value : 0.6018
## Prevalence : 0.8817
## Detection Rate : 0.8479
## Detection Prevalence : 0.9153
## Balanced Accuracy : 0.6963
##
## 'Positive' Class : 0
##
# Feature importance
var_importance <- varImp(ada_model)
print(var_importance)
## AdaBoost.M1 variable importance
##
## only 20 most important variables shown (out of 42)
##
## Overall
## duration 100.0000
## poutcome_success 47.4535
## contact_unknown 28.3437
## housing_yes 17.5030
## month_mar 6.0438
## marital_married 4.8515
## month_jul 3.8791
## month_may 3.8326
## education_secondary 3.6657
## education_tertiary 3.5538
## month_aug 2.8578
## month_oct 2.7972
## campaign 2.4225
## loan_yes 2.1898
## month_jan 1.7651
## month_nov 1.6598
## month_sep 1.4084
## pdays 0.9504
## age 0.6801
## job_blue.collar 0.5957
plot(var_importance)
table(trainData_balanced$y_yes)
##
## 0 1
## 31950 8438
#SMOTE 2
set.seed(123)
smote_data <- SMOTE(trainData[,-which(names(trainData) == "y_yes")], trainData$y_yes, K = 5, dup_size = 2)
trainData_balanced <- smote_data$data
colnames(trainData_balanced)[ncol(trainData_balanced)] <- "y_yes"
trainData_balanced$y_yes <- as.factor(trainData_balanced$y_yes)
# control for boost
train_control <- trainControl(method = "cv", number = 5)
# AdaBoost model
ada_model <- train(
y_yes ~ .,
data = trainData_balanced,
method = "AdaBoost.M1", #
trControl = train_control,
tuneGrid = expand.grid(mfinal = 100, maxdepth = 1, coeflearn = "Breiman")
)
ada_predictions <- predict(ada_model, newdata = testData)
# confusion
conf_matrix <- confusionMatrix(ada_predictions, testData$y_yes)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7595 550
## 1 377 520
##
## Accuracy : 0.8975
## 95% CI : (0.891, 0.9037)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 1.123e-06
##
## Kappa : 0.4717
##
## Mcnemar's Test P-Value : 1.612e-08
##
## Sensitivity : 0.9527
## Specificity : 0.4860
## Pos Pred Value : 0.9325
## Neg Pred Value : 0.5797
## Prevalence : 0.8817
## Detection Rate : 0.8400
## Detection Prevalence : 0.9008
## Balanced Accuracy : 0.7193
##
## 'Positive' Class : 0
##
# Feature importance
var_importance <- varImp(ada_model)
print(var_importance)
## AdaBoost.M1 variable importance
##
## only 20 most important variables shown (out of 42)
##
## Overall
## duration 100.0000
## poutcome_success 48.2869
## contact_unknown 26.4591
## housing_yes 24.7824
## marital_married 9.9207
## education_secondary 9.1166
## month_mar 5.8022
## education_tertiary 5.1447
## month_aug 5.0308
## month_jul 4.9175
## month_may 4.5223
## month_oct 3.6235
## month_jan 3.5693
## loan_yes 2.8208
## campaign 2.8148
## marital_single 1.9306
## month_nov 1.8669
## job_retired 0.9932
## job_blue.collar 0.7114
## month_sep 0.5606
plot(var_importance)
table(trainData_balanced$y_yes)
##
## 0 1
## 31950 12657
#weight 1
#HYPERPARAMETER TUNING 1 WEIGHT
set.seed(123)
train_control <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
ada_model <- train(
y_yes ~ .,
data = trainData_balanced,
method = "AdaBoost.M1",
trControl = train_control,
weights = ifelse(trainData_balanced$y_yes == "MinorityClass", 4, 1), # BUG: y_yes levels are "0"/"1", so this never matches and every weight is 1 ("1" was presumably intended)
tuneGrid = expand.grid(mfinal = 100, maxdepth = 1, coeflearn = "Breiman")
)
ada_predictions <- predict(ada_model, newdata = testData)
conf_matrix <- confusionMatrix(ada_predictions, testData$y_yes)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7639 563
## 1 333 507
##
## Accuracy : 0.9009
## 95% CI : (0.8946, 0.907)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 3.583e-09
##
## Kappa : 0.4764
##
## Mcnemar's Test P-Value : 2.004e-14
##
## Sensitivity : 0.9582
## Specificity : 0.4738
## Pos Pred Value : 0.9314
## Neg Pred Value : 0.6036
## Prevalence : 0.8817
## Detection Rate : 0.8448
## Detection Prevalence : 0.9071
## Balanced Accuracy : 0.7160
##
## 'Positive' Class : 0
##
var_importance <- varImp(ada_model)
print(var_importance)
## AdaBoost.M1 variable importance
##
## only 20 most important variables shown (out of 42)
##
## Overall
## duration 100.0000
## poutcome_success 52.5760
## contact_unknown 37.2597
## housing_yes 27.8386
## marital_married 13.4268
## education_secondary 10.3990
## month_mar 6.7924
## education_tertiary 5.6332
## month_jul 5.4873
## month_may 5.4068
## month_nov 5.4048
## month_aug 5.3463
## month_oct 4.5147
## loan_yes 3.5497
## campaign 3.1920
## month_jan 2.5875
## month_sep 1.4705
## marital_single 0.9854
## job_blue.collar 0.7762
## job_retired 0.7395
plot(var_importance)
#weight 2
#HYPERPARAMETER TUNING 2
set.seed(123)
train_control <- trainControl(method = "cv", number = 5)
ada_model <- train(
y_yes ~ .,
data = trainData_balanced,
method = "AdaBoost.M1",
trControl = train_control,
weights = ifelse(trainData_balanced$y_yes == "MinorityClass", 9, 1), # BUG: as above, this never matches, so all weights are 1 (hence the identical results)
tuneGrid = expand.grid(mfinal = 100, maxdepth = 1, coeflearn = "Breiman")
)
ada_predictions <- predict(ada_model, newdata = testData)
conf_matrix <- confusionMatrix(ada_predictions, testData$y_yes)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7639 563
## 1 333 507
##
## Accuracy : 0.9009
## 95% CI : (0.8946, 0.907)
## No Information Rate : 0.8817
## P-Value [Acc > NIR] : 3.583e-09
##
## Kappa : 0.4764
##
## Mcnemar's Test P-Value : 2.004e-14
##
## Sensitivity : 0.9582
## Specificity : 0.4738
## Pos Pred Value : 0.9314
## Neg Pred Value : 0.6036
## Prevalence : 0.8817
## Detection Rate : 0.8448
## Detection Prevalence : 0.9071
## Balanced Accuracy : 0.7160
##
## 'Positive' Class : 0
##
var_importance <- varImp(ada_model)
print(var_importance)
## AdaBoost.M1 variable importance
##
## only 20 most important variables shown (out of 42)
##
## Overall
## duration 100.0000
## poutcome_success 52.5760
## contact_unknown 37.2597
## housing_yes 27.8386
## marital_married 13.4268
## education_secondary 10.3990
## month_mar 6.7924
## education_tertiary 5.6332
## month_jul 5.4873
## month_may 5.4068
## month_nov 5.4048
## month_aug 5.3463
## month_oct 4.5147
## loan_yes 3.5497
## campaign 3.1920
## month_jan 2.5875
## month_sep 1.4705
## marital_single 0.9854
## job_blue.collar 0.7762
## job_retired 0.7395
plot(var_importance)
| Model | Train Y=0 | Train Y=1 | TP | FN | FP | TN | Accuracy | Sensitivity | Specificity | Balanced Acc |
|---|---|---|---|---|---|---|---|---|---|---|
| Decision Tree | ||||||||||
| Initial Decision Tree (DT) | 39822 | 5289 | 364 | 706 | 239 | 7733 | 0.8955 | 0.9700 | 0.3402 | 0.6551 |
| SMOTE #1(K=5,dup_size=1) & DT | 31950 | 8438 | 497 | 573 | 409 | 7563 | 0.8914 | 0.9487 | 0.4645 | 0.7066 |
| SMOTE #2(K=5,dup_size=2) & DT | 31950 | 12657 | 577 | 493 | 510 | 7462 | 0.8891 | 0.9360 | 0.5393 | 0.7376 |
| Hyperparameter tuning #1 (cp=0.05) | 31950 | 12657 | 628 | 442 | 700 | 7272 | 0.8737 | 0.9122 | 0.5869 | 0.7496 |
| Hyperparameter tuning #2 (weight) | 31950 | 12657 | 901 | 169 | 1851 | 6121 | 0.7766 | 0.7678 | 0.8421 | 0.8049 |
| Random Forest | ||||||||||
| Initial Random Forest (RF) | 39822 | 5289 | 410 | 660 | 212 | 7760 | 0.9036 | 0.9734 | 0.3832 | 0.6783 |
| SMOTE #1(K=5,dup_size=1) & RF | 31950 | 8438 | 445 | 625 | 234 | 7738 | 0.9050 | 0.9706 | 0.4159 | 0.6933 |
| SMOTE #2(K=5,dup_size=2) & RF | 31950 | 12657 | 448 | 622 | 249 | 7723 | 0.9037 | 0.9688 | 0.4187 | 0.6937 |
| Hyperparameter tuning #1 (classwt=1,2) | 31950 | 12657 | 530 | 540 | 371 | 7601 | 0.8992 | 0.9535 | 0.4953 | 0.7244 |
| Hyperparameter tuning #2 (classwt=1,3) | 31950 | 12657 | 537 | 533 | 395 | 7577 | 0.8974 | 0.9505 | 0.5019 | 0.7262 |
| ADA Boost | ||||||||||
| Initial ADA BOOST (ADA) | 39822 | 5289 | 263 | 807 | 134 | 7838 | 0.8959 | 0.9832 | 0.2458 | 0.6145 |
| SMOTE #1(K=5,dup_size=1) & ADA | 31950 | 8438 | 461 | 609 | 305 | 7667 | 0.8989 | 0.9617 | 0.4308 | 0.6963 |
| SMOTE #2(K=5,dup_size=2) & ADA | 31950 | 12657 | 520 | 550 | 377 | 7595 | 0.8975 | 0.9527 | 0.4860 | 0.7193 |
| Hyperparameter tuning #1 (weights 4,1) | 31950 | 12657 | 507 | 563 | 333 | 7639 | 0.9009 | 0.9582 | 0.4738 | 0.7160 |
| Hyperparameter tuning #2 (weights 9,1) | 31950 | 12657 | 507 | 563 | 333 | 7639 | 0.9009 | 0.9582 | 0.4738 | 0.7160 |
The overarching aim of this assignment was to address the class imbalance in the target variable and increase specificity. Initial models struggled with high numbers of false negatives (type II errors) and low specificity, meaning they struggled to correctly identify the clients who did subscribe (the minority class).
The final decision tree model was the preferred model. The final
decision tree model was built using:
1.) SMOTE to generate ~8,400 synthetic rows where Y=1 (dup_size=2)
2.) an adjusted/increased weight for the target variable's minority
class
(The complexity parameter increase to cp=0.05 was a separate experiment;
the weighted model itself used cp=0.01.)
In deciding which model was the most preferred, we assumed the business had unlimited resources and would therefore prefer a model that maximizes true positives while minimizing false negatives. As a result, even though accuracy and sensitivity decreased compared to the initial model, the increase in true positives in the final decision tree expanded the pool of potential clients that could be contacted, increasing the likelihood of successful subscriptions/phone calls.
SMOTE Results
The purpose of using SMOTE was to address the imbalance in the minority
class of the target variable. The effect of SMOTE varied across models.
The varied results were expected, as SMOTE is more effective in models
that rely on decision boundaries, such as decision trees and ADA Boost.
Random forest, in contrast, is an ensemble method that already absorbs
some class imbalance. As such, the impact of SMOTE on the random forest
model was not as large, because the imbalance was already partly
addressed by the initial model.
Use of SMOTE did have its drawbacks, however. Of note, SMOTE increased the variance present in the ADA Boost and decision tree models, as reflected in improved specificity (decision tree: 34% to 54%; ADA Boost: 25% to 49%) alongside decreased or plateauing accuracy and sensitivity (decision tree accuracy: 89% to 88%, ADA Boost: 89% unchanged; decision tree sensitivity: 97% to 93%, ADA Boost sensitivity: 98% to 95%). In general, for the decision tree and ADA Boost models, training on SMOTE data may have introduced some overfitting and increased sensitivity to noise from the synthetic data.
Weight Results
Manipulating the weight of the minority class also had the end goal of
balancing the minority class of the target variable. The use of weights
varied across models:
Decision Trees: Weighting decreased accuracy (78%) and sensitivity (77%), though both remained within acceptable ranges. However, weighting significantly increased specificity (84%) and improved balanced accuracy (80%). The drop in accuracy and sensitivity suggests increased model variance from fitting the training data more aggressively.
Random Forest: Weighting caused a minor decrease in accuracy and sensitivity (~1%), but specificity improved by ~10%. This increase in specificity was smaller than the improvement weighting produced in the decision tree model. The increase in specificity does, however, suggest some increase in model variance.
ADA Boost: Weight adjustments had a minimal effect on accuracy, sensitivity, specificity, and balanced accuracy; in this analysis the external weights never took effect (see the code note above), and ADA Boost in any case re-weights observations internally. As such, adjusting the class weights had negligible effects on variance and bias.
Final Model Interpreted
Below, the final decision tree model is visualized and its features are
interpreted:
Duration: The duration of the call was the strongest predictor of successful subscription. The first split occurred at 204 seconds, meaning calls of 204 seconds or longer were strongly associated with successful subscriptions.
Other significant features
Poutcome Success: If poutcome_success was less than 0.0014 (i.e., the
previous campaign outcome was not a success, since this is a 0/1 dummy
variable), the client was less likely to subscribe (94% probability of
not subscribing, covering 25% of total cases).
Housing Loan Status: Clients without a housing loan and with a call duration of more than 472 seconds had an 88% probability of subscribing (26% of total cases), while those with a housing loan were less likely to subscribe.
Contact Method & Call Duration: If the contact method was unknown and the call duration exceeded 497 seconds, the likelihood of subscription increased to 72% (8% of cases), while shorter calls had a higher probability of non-subscription (94%).
Recommendation for Data Scientists
When dealing with such data, recommendations for data scientists include:
1. Use SMOTE with caution with decision tree and ADA Boost models, since
both model types rely heavily on decision boundaries
2. There is no need to adjust weights for class imbalance of the target
variable for ADA Boost models, since ADA Boost already adjusts weights
internally
3. Use balanced accuracy as a metric for imbalanced datasets
4. The experiments conducted in this analysis were just the tip of the
iceberg; the results are in no way conclusive, as the preferred model
likely suffers from overfitting, among other problems
5. Bias-variance (and sensitivity-specificity) will always be a
trade-off.
Recommendation for Business Problem
To increase the likelihood of securing client subscriptions, the
business should prioritize longer-duration calls, especially with
clients whose previous campaign outcome was successful and who do not
have housing loans. Additionally, since the final model had high
specificity and high balanced accuracy, it should be used to prioritize
daily work lists for client outreach.
rpart.plot(dt_model,
type = 3,
extra = 104,
under = TRUE,
tweak = 1.2,
box.palette = "RdYlGn",
fallen.leaves = TRUE)
## Warning: Cannot retrieve the data used to build the model (model.frame: object 'job_blue-collar' not found).
## To silence this warning:
## Call rpart.plot with roundint=FALSE,
## or rebuild the rpart model with model=TRUE.