A Portuguese bank conducted a marketing campaign (phone calls) to predict whether a client would subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and identify the most effective tactics that will help the bank persuade more customers to subscribe to a term deposit in the next campaign.
Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing
NOTE: The experiments section is in part 6
For this assignment, I chose the full dataset,
bank-additional-full.csv, which has all observations and
features. I would prefer to start with all the data, then narrow it
down after performing my EDA and algorithm selection. I wouldn't want to
disregard features or observations before looking at them and knowing
whether they're important.
Input variables:

Bank client data:

1 - age (numeric)
2 - job: type of job (categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services")
3 - marital: marital status (categorical: "married", "divorced", "single", "unknown"; note: "divorced" means divorced or widowed)
4 - education (categorical: "basic.4y", "basic.6y", "basic.9y", "high.school", "illiterate", "professional.course", "university.degree", "unknown")
5 - default: has credit in default? (categorical: "no", "unknown", "yes")
6 - housing: has housing loan? (categorical: "no", "unknown", "yes")
7 - loan: has personal loan? (categorical: "no", "unknown", "yes")

Related to the last contact of the current campaign:

8 - contact: contact communication type (categorical: "cellular", "telephone")
9 - day_of_week: last contact day of the week (categorical: "mon", "tue", "wed", "thu", "fri")
10 - month: last contact month of year (categorical: "mar", "apr", ..., "nov", "dec"; "jan" and "feb" do not occur in this dataset)
11 - duration: last contact duration, in seconds (numeric); note that duration is not known before a call is performed, so it should be discarded for a realistic predictive model

Other attributes:

12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: "failure", "nonexistent", "success")

Social and economic context attributes:

16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Output variable (desired target):

21 - y: has the client subscribed a term deposit? (binary: "yes", "no")
#Import Libraries
library(readr) # to use the read_csv function
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(e1071) # For skewness function
##
## Attaching package: 'e1071'
##
## The following object is masked from 'package:ggplot2':
##
## element
library(corrplot)
## corrplot 0.95 loaded
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(ROSE)
## Loaded ROSE 0.0-4
library(smotefamily)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(GGally)
library(gmodels)
library(correlationfunnel)
## ══ correlationfunnel Tip #2 ════════════════════════════════════════════════════
## Clean your NA's prior to using `binarize()`.
## Missing values and cleaning data are critical to getting great correlations. :)
library(DataExplorer)
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom 1.0.10 ✔ rsample 1.3.1
## ✔ dials 1.4.2 ✔ tailor 0.1.0
## ✔ infer 1.0.9 ✔ tune 2.0.1
## ✔ modeldata 1.5.1 ✔ workflows 1.3.0
## ✔ parsnip 1.3.3 ✔ workflowsets 1.1.1
## ✔ recipes 1.3.1 ✔ yardstick 1.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ rsample::calibration() masks caret::calibration()
## ✖ scales::discard() masks purrr::discard()
## ✖ e1071::element() masks ggplot2::element()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ caret::lift() masks purrr::lift()
## ✖ rsample::permutations() masks e1071::permutations()
## ✖ yardstick::precision() masks caret::precision()
## ✖ yardstick::recall() masks caret::recall()
## ✖ car::recode() masks dplyr::recode()
## ✖ MASS::select() masks dplyr::select()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ car::some() masks purrr::some()
## ✖ yardstick::spec() masks readr::spec()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step() masks stats::step()
## ✖ tune::tune() masks parsnip::tune(), e1071::tune()
library(themis)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(vcd)
## Loading required package: grid
library(rpart)
##
## Attaching package: 'rpart'
##
## The following object is masked from 'package:dials':
##
## prune
library(rpart.plot)
options(rgl.useNULL=TRUE)
library(adabag)
## Loading required package: foreach
##
## Attaching package: 'foreach'
##
## The following objects are masked from 'package:purrr':
##
## accumulate, when
##
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following object is masked from 'package:gmodels':
##
## ci
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(vip)
##
## Attaching package: 'vip'
##
## The following object is masked from 'package:utils':
##
## vi
#library(fastAdaboost)
# Full dataset
data_raw <- read.csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-608/refs/heads/main/bank-additional-full.csv", sep = ";")
# Rename columns
names(data_raw)[names(data_raw) == "emp.var.rate"] <- "emp_var_rate"
names(data_raw)[names(data_raw) == "cons.price.idx"] <- "cons_price_idx"
names(data_raw)[names(data_raw) == "cons.conf.idx"] <- "cons_conf_idx"
names(data_raw)[names(data_raw) == "nr.employed"] <- "nr_employed"
# Remove duplicates
data_raw <- unique(data_raw)
str(data_raw)
## 'data.frame': 41176 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : chr "housemaid" "services" "services" "admin." ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
## $ default : chr "no" "unknown" "no" "no" ...
## $ housing : chr "no" "no" "yes" "no" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "telephone" "telephone" "telephone" "telephone" ...
## $ month : chr "may" "may" "may" "may" ...
## $ day_of_week : chr "mon" "mon" "mon" "mon" ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp_var_rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons_price_idx: num 94 94 94 94 94 ...
## $ cons_conf_idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr_employed : num 5191 5191 5191 5191 5191 ...
## $ y : chr "no" "no" "no" "no" ...
EDA was done in assignment 1, so I won't include it here.
This section is from assignment 1, altered slightly because tree algorithms are being used for this assignment.
To use the algorithms, let's convert the categorical variables to the
factor data type:
# Convert categorical columns to factors, which the modeling functions accept
data_raw_clean <- data_raw
cat_cols <- c("job", "marital", "education", "default", "housing", "loan",
              "contact", "month", "day_of_week", "poutcome", "y")
data_raw_clean[cat_cols] <- lapply(data_raw_clean[cat_cols], as.factor)
Since over 96% of pdays is inapplicable (the placeholder value
999, meaning the client was not previously contacted), let's treat
this as NA data and remove the column to simplify the model.
A nearly constant column adds little predictive value while hurting efficiency.
Additionally, as the dataset documentation suggests, duration should
be discarded if the intention is to have a realistic predictive model
(its value is not known before a call is made), so let's remove that as well.
# Remove variable pdays (96% inapplicable) and duration
data_raw_clean2 <- data_raw_clean |>
dplyr::select(-c(pdays, duration))
Address Multicollinearity:
Logistic regression cannot handle highly correlated predictors, so
let's remove the ones we found in the EDA with absolute
correlations > 0.4: euribor3m, emp_var_rate, and
nr_employed. This will also help reduce the amount of data in the
model.
# Remove `euribor3m`, `emp_var_rate`, and `nr_employed` (absolute value correlations > 0.4)
data_raw_clean3 <- data_raw_clean2 |>
dplyr::select(-c(euribor3m, emp_var_rate, nr_employed))
Although I combined variables in the dataset
data_raw_combined for EDA, I would probably start out my
model with just data_raw_clean3. After creating the model,
I would then test data_raw_combined (after applying the
same changes I made to data_raw_clean3 above) to see if it
would improve the model’s performance.
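As a sketch (assuming data_raw_combined from assignment 1 were loaded in this session), applying the same changes would look like:
# Sketch: apply the same column drops to the combined dataset from assignment 1
# (assumes `data_raw_combined` exists in the session; not run here)
# data_raw_combined_clean <- data_raw_combined |>
#   dplyr::select(-c(pdays, duration, euribor3m, emp_var_rate, nr_employed))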
To ensure objective model evaluation and prevent overfitting, let’s split each dataset into training (80%) and testing (20%) sets.
# Split data into train and test
set.seed(123)
split <- initial_split(data_raw_clean3, prop = 0.8, strata = y)
train <- split |>
training()
test <- split |>
testing()
# Try an alternative split: 60% training, 40% test
set.seed(123)
split1 <- initial_split(data_raw_clean3, prop = 0.6, strata = y)
train1 <- split1 |>
training()
test1 <- split1 |>
testing()
Standardization is not necessary for decision trees, but since there is a difference in spreads among my predictors (as shown in the EDA), standardizing the data might still be useful. This would be especially important if I ever decided to use regularization.
# Standardize the training and test sets (using training standardization)
set.seed(123)
preproc_params <- preProcess(train, method = c("center", "scale"))
# Apply standardization from the train set
train_standardized <- predict(preproc_params, train)
test_standardized <- predict(preproc_params, test)
As the EDA showed, this dataset is heavily imbalanced which can cause bias. To resolve this, let’s use SMOTE (Synthetic Minority Over-sampling Technique):
set.seed(123)
# Only apply SMOTE to training set
smoted_train_standardized <- smotenc(train_standardized, var = "y", over_ratio = 1)
# Do the same for original training set
smoted_train <- smotenc(train, var = "y", over_ratio = 1)
smoted_train1 <- smotenc(train1, var = "y", over_ratio = 1)
# Distribution of target variable
smoted_train_standardized |>
dplyr::select(y) |>
ggplot() +
aes(x = y) +
geom_bar(fill = "blue", color = "black") +
labs(title = "Distribution of y", y = "Count") +
theme_minimal()
We can now see that the distribution is even for our target variable,
y.
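As a quick numeric check on top of the plot, the class counts can be printed directly:
# Numeric check of the class balance after SMOTE
table(smoted_train_standardized$y)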
The final standardized training set is
smoted_train_standardized and the corresponding test set is
test_standardized.
Since there is no variation in this model, as it uses the default parameters and the original datasets, we won't count this as an actual experiment. We will only use it for comparison purposes and for learning about the default parameters.
Metric: Accuracy
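For reference, accuracy is the share of correct predictions over all predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP, and FN are the true positives, true negatives, false positives, and false negatives from the confusion matrix.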
From the R documentation, here are the parameters for this model type:
minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted.

minbucket: the minimum number of observations in any terminal node.

cp: complexity parameter. Any split that does not decrease the overall lack of fit by a factor of cp is not attempted. For instance, with anova splitting, this means that the overall R-squared must increase by cp at each step. The main role of this parameter is to save computing time by pruning off splits that are obviously not worthwhile. Essentially, the user informs the program that any split which does not improve the fit by cp will likely be pruned off by cross-validation, and that hence the program need not pursue it.

maxcompete: the number of competitor splits retained in the output. It is useful to know not just which split was chosen, but which variable came in second, third, etc.

maxsurrogate: the number of surrogate splits retained in the output. If this is set to zero the compute time will be reduced, since approximately half of the computational time (other than setup) is used in the search for surrogate splits.

usesurrogate: how to use surrogates in the splitting process. 0 means display only; an observation with a missing value for the primary split rule is not sent further down the tree. 1 means use surrogates, in order, to split subjects missing the primary variable; if all surrogates are missing the observation is not split. For value 2, if all surrogates are missing, then send the observation in the majority direction. A value of 0 corresponds to the action of tree, and 2 to the recommendations of Breiman et al. (1984).

xval: number of cross-validations.

surrogatestyle: controls the selection of a best surrogate. If set to 0 (default) the program uses the total number of correct classifications for a potential surrogate variable; if set to 1 it uses the percent correct, calculated over the non-missing values of the surrogate. The first option more severely penalizes covariates with a large number of missing values.

maxdepth: the maximum depth of any node of the final tree, with the root node counted as depth 0. Maximum value 30 (as node numbers have to be stored in 32-bit signed integer types).

The evaluation metric we will focus on is accuracy. These parameters can all be set explicitly through rpart.control, as sketched below.
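As a sketch (not run here; the values are rpart's documented defaults), the upcoming default fit is equivalent to passing these settings explicitly:
# Sketch: the default fit with rpart's documented defaults written out
# decision_tree_default <- rpart(
#   y ~ ., data = smoted_train, method = "class",
#   control = rpart.control(minsplit = 20, minbucket = round(20 / 3),
#                           cp = 0.01, maxcompete = 4, maxsurrogate = 5,
#                           usesurrogate = 2, xval = 10, surrogatestyle = 0,
#                           maxdepth = 30)
# )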
# Default model
# minsplit=20
# minbucket=round(minsplit/3)
# cp=0.01
# maxdepth=30
# xval=10
# maxcompete=4
# maxsurrogate=5
# usesurrogate=2
set.seed(125)
start <- Sys.time()
decision_tree <- rpart(y ~ ., data = smoted_train, method = "class")
print(summary(decision_tree))
## Call:
## rpart(formula = y ~ ., data = smoted_train, method = "class")
## n= 58458
##
## CP nsplit rel error xerror xstd
## 1 0.18040302 0 1.0000000 1.0074926 0.004135860
## 2 0.14690889 1 0.8195970 0.8195970 0.004068116
## 3 0.06455917 2 0.6726881 0.6722433 0.003907514
## 4 0.03907079 3 0.6081289 0.6081289 0.003805181
## 5 0.02035650 4 0.5690581 0.5719662 0.003737941
## 6 0.01800723 6 0.5283451 0.5312874 0.003653518
## 7 0.01293236 9 0.4743234 0.4772999 0.003525989
## 8 0.01000000 11 0.4484587 0.4514694 0.003458216
##
## Variable importance
## month cons_price_idx cons_conf_idx poutcome campaign
## 21 18 18 17 13
## previous contact age job education
## 7 3 2 1 1
##
## Node number 1: 58458 observations, complexity param=0.180403
## predicted class=no expected loss=0.5 P(node) =1
## class counts: 29229 29229
## probabilities: 0.500 0.500
## left son=2 (52419 obs) right son=3 (6039 obs)
## Primary splits:
## poutcome splits as LLR, improve=2567.295, (0 missing)
## contact splits as RL, improve=2391.529, (0 missing)
## month splits as RLRLLRLLRR, improve=2282.421, (0 missing)
## cons_conf_idx < -36.09886 to the left, improve=2165.345, (0 missing)
## previous < 0.0004073881 to the left, improve=1978.245, (0 missing)
## Surrogate splits:
## previous < 1.001562 to the left, agree=0.926, adj=0.287, (0 split)
## cons_price_idx < 94.60799 to the left, agree=0.897, adj=0.003, (0 split)
## cons_conf_idx < -50.7846 to the right, agree=0.897, adj=0.002, (0 split)
## age < 18.7708 to the right, agree=0.897, adj=0.001, (0 split)
##
## Node number 2: 52419 observations, complexity param=0.1469089
## predicted class=no expected loss=0.4497034 P(node) =0.8966951
## class counts: 28846 23573
## probabilities: 0.550 0.450
## left son=4 (43283 obs) right son=5 (9136 obs)
## Primary splits:
## month splits as RLRLLRLLRR, improve=1801.2090, (0 missing)
## cons_price_idx < 93.19988 to the right, improve=1793.0170, (0 missing)
## contact splits as RL, improve=1651.3490, (0 missing)
## cons_conf_idx < -36.09886 to the left, improve=1557.9710, (0 missing)
## default splits as RLL, improve= 894.2134, (0 missing)
## Surrogate splits:
## cons_conf_idx < -46.68129 to the right, agree=0.919, adj=0.537, (0 split)
## age < 60.99529 to the left, agree=0.829, adj=0.017, (0 split)
## cons_price_idx < 94.5206 to the left, agree=0.827, adj=0.009, (0 split)
##
## Node number 3: 6039 observations
## predicted class=yes expected loss=0.0634211 P(node) =0.1033049
## class counts: 383 5656
## probabilities: 0.063 0.937
##
## Node number 4: 43283 observations, complexity param=0.06455917
## predicted class=no expected loss=0.3894832 P(node) =0.7404119
## class counts: 26425 16858
## probabilities: 0.611 0.389
## left son=8 (31312 obs) right son=9 (11971 obs)
## Primary splits:
## cons_price_idx < 93.19988 to the right, improve=1186.3600, (0 missing)
## cons_conf_idx < -36.09674 to the left, improve=1107.2040, (0 missing)
## contact splits as RL, improve=1079.9350, (0 missing)
## default splits as RLL, improve= 442.0817, (0 missing)
## age < 60.00359 to the left, improve= 381.0203, (0 missing)
## Surrogate splits:
## cons_conf_idx < -44.15814 to the right, agree=0.885, adj=0.583, (0 split)
## previous < 0.00118432 to the left, agree=0.761, adj=0.136, (0 split)
## poutcome splits as RL-, agree=0.758, adj=0.126, (0 split)
## job splits as LLLLLLLLRLLL, agree=0.729, adj=0.019, (0 split)
## age < 60.53519 to the left, agree=0.727, adj=0.015, (0 split)
##
## Node number 5: 9136 observations
## predicted class=yes expected loss=0.2649956 P(node) =0.1562831
## class counts: 2421 6715
## probabilities: 0.265 0.735
##
## Node number 8: 31312 observations, complexity param=0.0203565
## predicted class=no expected loss=0.3170989 P(node) =0.5356324
## class counts: 21383 9929
## probabilities: 0.683 0.317
## left son=16 (13637 obs) right son=17 (17675 obs)
## Primary splits:
## contact splits as RL, improve=434.5561, (0 missing)
## campaign < 1.000115 to the left, improve=347.2238, (0 missing)
## cons_conf_idx < -36.40503 to the right, improve=317.5052, (0 missing)
## age < 60.00359 to the left, improve=256.2774, (0 missing)
## previous < 1.000371 to the left, improve=238.8876, (0 missing)
## Surrogate splits:
## month splits as -R-RL-LR--, agree=0.928, adj=0.835, (0 split)
## cons_price_idx < 93.99106 to the right, agree=0.902, adj=0.775, (0 split)
## cons_conf_idx < -41.80137 to the right, agree=0.726, adj=0.371, (0 split)
## job splits as RLRRRRRLRRLL, agree=0.604, adj=0.092, (0 split)
## education splits as LLLRRRRL, agree=0.589, adj=0.057, (0 split)
##
## Node number 9: 11971 observations, complexity param=0.03907079
## predicted class=yes expected loss=0.4211845 P(node) =0.2047795
## class counts: 5042 6929
## probabilities: 0.421 0.579
## left son=18 (7082 obs) right son=19 (4889 obs)
## Primary splits:
## cons_conf_idx < -46.19657 to the left, improve=881.6659, (0 missing)
## month splits as -R-RR-LR--, improve=775.1438, (0 missing)
## cons_price_idx < 92.89325 to the left, improve=416.1322, (0 missing)
## job splits as RLLRRRLLRRRL, improve=204.9930, (0 missing)
## campaign < 4.997728 to the right, improve=201.7275, (0 missing)
## Surrogate splits:
## month splits as -R-RR-LR--, agree=0.974, adj=0.936, (0 split)
## cons_price_idx < 92.88986 to the right, agree=0.811, adj=0.536, (0 split)
## job splits as RLLRLRLLRLRR, agree=0.672, adj=0.196, (0 split)
## education splits as LLLLRLRR, agree=0.646, adj=0.133, (0 split)
## age < 55.01647 to the left, agree=0.638, adj=0.113, (0 split)
##
## Node number 16: 13637 observations
## predicted class=no expected loss=0.222263 P(node) =0.2332786
## class counts: 10606 3031
## probabilities: 0.778 0.222
##
## Node number 17: 17675 observations, complexity param=0.0203565
## predicted class=no expected loss=0.3902687 P(node) =0.3023538
## class counts: 10777 6898
## probabilities: 0.610 0.390
## left son=34 (16113 obs) right son=35 (1562 obs)
## Primary splits:
## cons_price_idx < 93.91801 to the left, improve=824.9796, (0 missing)
## month splits as -L-LR-RL--, improve=396.8921, (0 missing)
## campaign < 1.000115 to the left, improve=256.5630, (0 missing)
## cons_conf_idx < -42.71392 to the right, improve=222.1548, (0 missing)
## age < 60.00524 to the left, improve=184.8284, (0 missing)
## Surrogate splits:
## cons_conf_idx < -46.28575 to the right, agree=0.923, adj=0.125, (0 split)
## previous < 1.041097 to the left, agree=0.922, adj=0.117, (0 split)
## age < 60.99399 to the left, agree=0.922, adj=0.113, (0 split)
## month splits as -L-LR-LL--, agree=0.914, adj=0.031, (0 split)
##
## Node number 18: 7082 observations
## predicted class=no expected loss=0.4193731 P(node) =0.1211468
## class counts: 4112 2970
## probabilities: 0.581 0.419
##
## Node number 19: 4889 observations
## predicted class=yes expected loss=0.1902229 P(node) =0.08363269
## class counts: 930 3959
## probabilities: 0.190 0.810
##
## Node number 34: 16113 observations, complexity param=0.01800723
## predicted class=no expected loss=0.3427046 P(node) =0.2756338
## class counts: 10591 5522
## probabilities: 0.657 0.343
## left son=68 (15595 obs) right son=69 (518 obs)
## Primary splits:
## month splits as -L-LR-RL--, improve=348.29280, (0 missing)
## campaign < 1.000631 to the left, improve=222.63500, (0 missing)
## cons_conf_idx < -42.71681 to the right, improve=150.22440, (0 missing)
## cons_price_idx < 93.44448 to the left, improve= 86.71672, (0 missing)
## age < 43.99796 to the right, improve= 66.11546, (0 missing)
## Surrogate splits:
## previous < 1.017587 to the left, agree=0.971, adj=0.106, (0 split)
## age < 60.91176 to the left, agree=0.969, adj=0.044, (0 split)
##
## Node number 35: 1562 observations
## predicted class=yes expected loss=0.1190781 P(node) =0.02672004
## class counts: 186 1376
## probabilities: 0.119 0.881
##
## Node number 68: 15595 observations, complexity param=0.01800723
## predicted class=no expected loss=0.3237576 P(node) =0.2667727
## class counts: 10546 5049
## probabilities: 0.676 0.324
## left son=136 (5560 obs) right son=137 (10035 obs)
## Primary splits:
## campaign < 1.000631 to the left, improve=214.25650, (0 missing)
## cons_conf_idx < -36.09546 to the left, improve=138.52350, (0 missing)
## cons_price_idx < 93.44448 to the left, improve= 65.02158, (0 missing)
## age < 43.99796 to the right, improve= 61.36487, (0 missing)
## job splits as RRRLRRRRRRLL, improve= 52.25816, (0 missing)
## Surrogate splits:
## marital splits as RRRL, agree=0.644, adj=0.001, (0 split)
## cons_conf_idx < -45.88844 to the left, agree=0.644, adj=0.001, (0 split)
## education splits as RRRRLRRR, agree=0.644, adj=0.000, (0 split)
## default splits as RRL, agree=0.644, adj=0.000, (0 split)
##
## Node number 69: 518 observations
## predicted class=yes expected loss=0.08687259 P(node) =0.008861063
## class counts: 45 473
## probabilities: 0.087 0.913
##
## Node number 136: 5560 observations
## predicted class=no expected loss=0.2124101 P(node) =0.09511102
## class counts: 4379 1181
## probabilities: 0.788 0.212
##
## Node number 137: 10035 observations, complexity param=0.01800723
## predicted class=no expected loss=0.3854509 P(node) =0.1716617
## class counts: 6167 3868
## probabilities: 0.615 0.385
## left son=274 (8884 obs) right son=275 (1151 obs)
## Primary splits:
## campaign < 1.999861 to the right, improve=982.03570, (0 missing)
## cons_conf_idx < -36.09546 to the left, improve= 87.87123, (0 missing)
## job splits as RRLLRRRRRRLL, improve= 57.45433, (0 missing)
## age < 42.99197 to the right, improve= 53.21787, (0 missing)
## cons_price_idx < 93.44448 to the left, improve= 40.59531, (0 missing)
## Surrogate splits:
## age < 73.95204 to the left, agree=0.886, adj=0.009, (0 split)
## cons_conf_idx < -34.4013 to the left, agree=0.886, adj=0.003, (0 split)
## previous < 1.373351 to the left, agree=0.886, adj=0.002, (0 split)
##
## Node number 274: 8884 observations, complexity param=0.01293236
## predicted class=no expected loss=0.3058307 P(node) =0.1519724
## class counts: 6167 2717
## probabilities: 0.694 0.306
## left son=548 (3123 obs) right son=549 (5761 obs)
## Primary splits:
## campaign < 2.000554 to the left, improve=126.64850, (0 missing)
## cons_price_idx < 93.20023 to the left, improve= 69.11038, (0 missing)
## cons_conf_idx < -36.09546 to the left, improve= 66.03994, (0 missing)
## month splits as -R-R---L--, improve= 48.22153, (0 missing)
## job splits as RRLLRRRRRRLL, improve= 47.21639, (0 missing)
## Surrogate splits:
## cons_price_idx < 93.20023 to the left, agree=0.666, adj=0.050, (0 split)
## month splits as -R-R---L--, agree=0.662, adj=0.040, (0 split)
## previous < 0.2940022 to the right, agree=0.654, adj=0.014, (0 split)
## poutcome splits as LR-, agree=0.653, adj=0.013, (0 split)
## default splits as RRL, agree=0.649, adj=0.000, (0 split)
##
## Node number 275: 1151 observations
## predicted class=yes expected loss=0 P(node) =0.01968935
## class counts: 0 1151
## probabilities: 0.000 1.000
##
## Node number 548: 3123 observations
## predicted class=no expected loss=0.1911623 P(node) =0.05342297
## class counts: 2526 597
## probabilities: 0.809 0.191
##
## Node number 549: 5761 observations, complexity param=0.01293236
## predicted class=no expected loss=0.3679917 P(node) =0.09854939
## class counts: 3641 2120
## probabilities: 0.632 0.368
## left son=1098 (5005 obs) right son=1099 (756 obs)
## Primary splits:
## campaign < 2.999257 to the right, improve=695.17030, (0 missing)
## cons_conf_idx < -36.09546 to the left, improve= 40.29316, (0 missing)
## education splits as LLRRLRRL, improve= 39.18488, (0 missing)
## job splits as RRLLRRRRLRLL, improve= 35.95156, (0 missing)
## cons_price_idx < 93.20023 to the left, improve= 35.52620, (0 missing)
## Surrogate splits:
## cons_conf_idx < -35.93195 to the left, agree=0.872, adj=0.022, (0 split)
## previous < 2.218332 to the left, agree=0.869, adj=0.004, (0 split)
## age < 60.0642 to the left, agree=0.869, adj=0.001, (0 split)
##
## Node number 1098: 5005 observations
## predicted class=no expected loss=0.2725275 P(node) =0.08561702
## class counts: 3641 1364
## probabilities: 0.727 0.273
##
## Node number 1099: 756 observations
## predicted class=yes expected loss=0 P(node) =0.01293236
## class counts: 0 756
## probabilities: 0.000 1.000
##
## n= 58458
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 58458 29229 no (0.50000000 0.50000000)
## 2) poutcome=failure,nonexistent 52419 23573 no (0.55029665 0.44970335)
## 4) month=aug,jul,jun,may,nov 43283 16858 no (0.61051683 0.38948317)
## 8) cons_price_idx>=93.19988 31312 9929 no (0.68290112 0.31709888)
## 16) contact=telephone 13637 3031 no (0.77773704 0.22226296) *
## 17) contact=cellular 17675 6898 no (0.60973126 0.39026874)
## 34) cons_price_idx< 93.91801 16113 5522 no (0.65729535 0.34270465)
## 68) month=aug,jul,nov 15595 5049 no (0.67624239 0.32375761)
## 136) campaign< 1.000631 5560 1181 no (0.78758993 0.21241007) *
## 137) campaign>=1.000631 10035 3868 no (0.61454908 0.38545092)
## 274) campaign>=1.999861 8884 2717 no (0.69416929 0.30583071)
## 548) campaign< 2.000554 3123 597 no (0.80883766 0.19116234) *
## 549) campaign>=2.000554 5761 2120 no (0.63200833 0.36799167)
## 1098) campaign>=2.999257 5005 1364 no (0.72747253 0.27252747) *
## 1099) campaign< 2.999257 756 0 yes (0.00000000 1.00000000) *
## 275) campaign< 1.999861 1151 0 yes (0.00000000 1.00000000) *
## 69) month=jun,may 518 45 yes (0.08687259 0.91312741) *
## 35) cons_price_idx>=93.91801 1562 186 yes (0.11907810 0.88092190) *
## 9) cons_price_idx< 93.19988 11971 5042 yes (0.42118453 0.57881547)
## 18) cons_conf_idx< -46.19657 7082 2970 no (0.58062694 0.41937306) *
## 19) cons_conf_idx>=-46.19657 4889 930 yes (0.19022295 0.80977705) *
## 5) month=apr,dec,mar,oct,sep 9136 2421 yes (0.26499562 0.73500438) *
## 3) poutcome=success 6039 383 yes (0.06342110 0.93657890) *
end <- Sys.time()
# rpart does k-fold cross validation automatically with default 10-fold for cp
# Plot cross-validation results
plotcp(decision_tree)
# Print cp values
printcp(decision_tree)
##
## Classification tree:
## rpart(formula = y ~ ., data = smoted_train, method = "class")
##
## Variables actually used in tree construction:
## [1] campaign cons_conf_idx cons_price_idx contact month
## [6] poutcome
##
## Root node error: 29229/58458 = 0.5
##
## n= 58458
##
## CP nsplit rel error xerror xstd
## 1 0.180403 0 1.00000 1.00749 0.0041359
## 2 0.146909 1 0.81960 0.81960 0.0040681
## 3 0.064559 2 0.67269 0.67224 0.0039075
## 4 0.039071 3 0.60813 0.60813 0.0038052
## 5 0.020356 4 0.56906 0.57197 0.0037379
## 6 0.018007 6 0.52835 0.53129 0.0036535
## 7 0.012932 9 0.47432 0.47730 0.0035260
## 8 0.010000 11 0.44846 0.45147 0.0034582
# Default predictions
predictions <- predict(decision_tree, newdata = test, type = "class")
cf_matrix <- confusionMatrix(predictions, test$y)
# print(cf_matrix)
# Bar plot of the target variable
predictions_df <- as.data.frame(table(predictions))
colnames(predictions_df) <- c("Target", "Count")
ggplot(predictions_df, aes(x = Target, y = Count)) +
geom_bar(stat = "identity", fill="blue") +
labs(
title = "Distribution of Target Variable",
x = "Client Subscription Status (0 = Not Subscribed, 1 = Subscribed)",
y = "Count"
) +
theme_minimal()
# Evaluation
# Note: for a classification tree, type = "vector" returns the predicted class
# (as 1/2), so this ROC/AUC is based on hard class predictions rather than
# class probabilities
probs <- predict(decision_tree, newdata = test, type = "vector")
roc <- roc(test$y, probs, quiet = TRUE)
decision_tree_auc <- auc(roc)
decision_tree_accuracy <- cf_matrix$overall['Accuracy']
decision_tree_kappa <- cf_matrix$overall['Kappa']
decision_tree_sensitivity <- cf_matrix$byClass['Sensitivity']
decision_tree_specificity <- cf_matrix$byClass['Specificity']
decision_tree_f1 <- cf_matrix$byClass["F1"]
decision_tree_precision <- cf_matrix$byClass["Precision"]
decision_tree_model_duration <- end - start
output1 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Default Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix)), collapse = "\n"), "\n",
"Accuracy:", round(decision_tree_accuracy, 4), "| Precision:", round(decision_tree_precision, 4),
"| Sensitivity:", round(decision_tree_sensitivity, 4), "| Specificity:", round(decision_tree_specificity, 4), "\n",
"F1 Score:", round(decision_tree_f1, 4), "| Kappa:", round(decision_tree_kappa, 4), "| AUC:", round(decision_tree_auc, 4),
"| Model Duration:", round(decision_tree_model_duration, 4), "\n\n", sep = " ")
cat(output1)
##
## === Model Selection and Evaluation ===
##
## === Default Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6342 383
## yes 966 545
##
## Accuracy : 0.8362
## 95% CI : (0.828, 0.8441)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3572
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8678
## Specificity : 0.5873
## Pos Pred Value : 0.9430
## Neg Pred Value : 0.3607
## Prevalence : 0.8873
## Detection Rate : 0.7700
## Detection Prevalence : 0.8165
## Balanced Accuracy : 0.7276
##
## 'Positive' Class : no
##
## Accuracy: 0.8362 | Precision: 0.943 | Sensitivity: 0.8678 | Specificity: 0.5873
## F1 Score: 0.9039 | Kappa: 0.3572 | AUC: 0.7276 | Model Duration: 1.0857
Results:
The accuracy, our chosen metric, of the most basic model is 0.8362069, which is not a terrible result. We can see in the cross-validation graph that as the complexity increases (cp decreases), the relative error decreases. With higher complexity, the decision tree can be more precise, but at a computational cost. As we can see, cp=0.01 with nsplit=11 gives the smallest relative error.
Since the optimal cp value from this model is 0.01, there is no need
to use the prune() method with the optimal cp to prune the
tree, since it would give the same tree as above.
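For illustration, had a different cp minimized the cross-validated error, pruning would look like this (a sketch using the fitted model's cptable):
# Sketch: prune at the cp with the lowest cross-validated error (xerror)
best_cp <- decision_tree$cptable[which.min(decision_tree$cptable[, "xerror"]), "CP"]
decision_tree_pruned <- prune(decision_tree, cp = best_cp)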
# Plot decision tree
rpart.plot(decision_tree)
# Extract important predictors from model
vip(decision_tree)
# Rules
rpart.rules(decision_tree)
## y
## 0.19 when poutcome is failure or nonexistent & month is aug or jul or nov & cons_price_idx is 93 to 94 & contact is cellular & campaign is 2 to 2
## 0.21 when poutcome is failure or nonexistent & month is aug or jul or nov & cons_price_idx is 93 to 94 & contact is cellular & campaign < 1
## 0.22 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx >= 93 & contact is telephone
## 0.27 when poutcome is failure or nonexistent & month is aug or jul or nov & cons_price_idx is 93 to 94 & contact is cellular & campaign >= 3
## 0.42 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx < 93 & cons_conf_idx < -46
## 0.74 when poutcome is failure or nonexistent & month is apr or dec or mar or oct or sep
## 0.81 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx < 93 & cons_conf_idx >= -46
## 0.88 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx >= 94 & contact is cellular
## 0.91 when poutcome is failure or nonexistent & month is jun or may & cons_price_idx is 93 to 94 & contact is cellular
## 0.94 when poutcome is success
## 1.00 when poutcome is failure or nonexistent & month is aug or jul or nov & cons_price_idx is 93 to 94 & contact is cellular & campaign is 2 to 3
## 1.00 when poutcome is failure or nonexistent & month is aug or jul or nov & cons_price_idx is 93 to 94 & contact is cellular & campaign is 1 to 2
Recommendation:
According to this model, the most important predictors the marketing team should be focusing on are month, cons_price_idx, cons_conf_idx, poutcome, and campaign.
The rules above can also be used to follow the decision tree structure.
Objective: Discover how a smaller maximum depth affects the model accuracy
The point of this model is to experiment with the
maxdepth hyperparameter, which is the maximum depth of any
node of the tree. For this experiment, instead of the default value
of 30, I will use 5. I chose to decrease this number since a smaller max
depth can help prevent overfitting and makes the model simpler, which
helps with interpretability. This should help the tree generalize to
new data.
Metric: Accuracy
# Decision Tree Experiment 1 Model
set.seed(125)
start <- Sys.time()
decision_tree1 <- rpart(y ~ ., data = smoted_train, method = "class", control = rpart.control(maxdepth = 5))
print(summary(decision_tree1))
## Call:
## rpart(formula = y ~ ., data = smoted_train, method = "class",
## control = rpart.control(maxdepth = 5))
## n= 58458
##
## CP nsplit rel error xerror xstd
## 1 0.18040302 0 1.0000000 1.0074926 0.004135860
## 2 0.14690889 1 0.8195970 0.8195970 0.004068116
## 3 0.06455917 2 0.6726881 0.6722433 0.003907514
## 4 0.03907079 3 0.6081289 0.6081289 0.003805181
## 5 0.02035650 4 0.5690581 0.5719662 0.003737941
## 6 0.01000000 6 0.5283451 0.5312874 0.003653518
##
## Variable importance
## month cons_price_idx cons_conf_idx poutcome previous
## 22 21 21 20 7
## contact age job education
## 3 2 2 1
##
## Node number 1: 58458 observations, complexity param=0.180403
## predicted class=no expected loss=0.5 P(node) =1
## class counts: 29229 29229
## probabilities: 0.500 0.500
## left son=2 (52419 obs) right son=3 (6039 obs)
## Primary splits:
## poutcome splits as LLR, improve=2567.295, (0 missing)
## contact splits as RL, improve=2391.529, (0 missing)
## month splits as RLRLLRLLRR, improve=2282.421, (0 missing)
## cons_conf_idx < -36.09886 to the left, improve=2165.345, (0 missing)
## previous < 0.0004073881 to the left, improve=1978.245, (0 missing)
## Surrogate splits:
## previous < 1.001562 to the left, agree=0.926, adj=0.287, (0 split)
## cons_price_idx < 94.60799 to the left, agree=0.897, adj=0.003, (0 split)
## cons_conf_idx < -50.7846 to the right, agree=0.897, adj=0.002, (0 split)
## age < 18.7708 to the right, agree=0.897, adj=0.001, (0 split)
##
## Node number 2: 52419 observations, complexity param=0.1469089
## predicted class=no expected loss=0.4497034 P(node) =0.8966951
## class counts: 28846 23573
## probabilities: 0.550 0.450
## left son=4 (43283 obs) right son=5 (9136 obs)
## Primary splits:
## month splits as RLRLLRLLRR, improve=1801.2090, (0 missing)
## cons_price_idx < 93.19988 to the right, improve=1793.0170, (0 missing)
## contact splits as RL, improve=1651.3490, (0 missing)
## cons_conf_idx < -36.09886 to the left, improve=1557.9710, (0 missing)
## default splits as RLL, improve= 894.2134, (0 missing)
## Surrogate splits:
## cons_conf_idx < -46.68129 to the right, agree=0.919, adj=0.537, (0 split)
## age < 60.99529 to the left, agree=0.829, adj=0.017, (0 split)
## cons_price_idx < 94.5206 to the left, agree=0.827, adj=0.009, (0 split)
##
## Node number 3: 6039 observations
## predicted class=yes expected loss=0.0634211 P(node) =0.1033049
## class counts: 383 5656
## probabilities: 0.063 0.937
##
## Node number 4: 43283 observations, complexity param=0.06455917
## predicted class=no expected loss=0.3894832 P(node) =0.7404119
## class counts: 26425 16858
## probabilities: 0.611 0.389
## left son=8 (31312 obs) right son=9 (11971 obs)
## Primary splits:
## cons_price_idx < 93.19988 to the right, improve=1186.3600, (0 missing)
## cons_conf_idx < -36.09674 to the left, improve=1107.2040, (0 missing)
## contact splits as RL, improve=1079.9350, (0 missing)
## default splits as RLL, improve= 442.0817, (0 missing)
## age < 60.00359 to the left, improve= 381.0203, (0 missing)
## Surrogate splits:
## cons_conf_idx < -44.15814 to the right, agree=0.885, adj=0.583, (0 split)
## previous < 0.00118432 to the left, agree=0.761, adj=0.136, (0 split)
## poutcome splits as RL-, agree=0.758, adj=0.126, (0 split)
## job splits as LLLLLLLLRLLL, agree=0.729, adj=0.019, (0 split)
## age < 60.53519 to the left, agree=0.727, adj=0.015, (0 split)
##
## Node number 5: 9136 observations
## predicted class=yes expected loss=0.2649956 P(node) =0.1562831
## class counts: 2421 6715
## probabilities: 0.265 0.735
##
## Node number 8: 31312 observations, complexity param=0.0203565
## predicted class=no expected loss=0.3170989 P(node) =0.5356324
## class counts: 21383 9929
## probabilities: 0.683 0.317
## left son=16 (13637 obs) right son=17 (17675 obs)
## Primary splits:
## contact splits as RL, improve=434.5561, (0 missing)
## campaign < 1.000115 to the left, improve=347.2238, (0 missing)
## cons_conf_idx < -36.40503 to the right, improve=317.5052, (0 missing)
## age < 60.00359 to the left, improve=256.2774, (0 missing)
## previous < 1.000371 to the left, improve=238.8876, (0 missing)
## Surrogate splits:
## month splits as -R-RL-LR--, agree=0.928, adj=0.835, (0 split)
## cons_price_idx < 93.99106 to the right, agree=0.902, adj=0.775, (0 split)
## cons_conf_idx < -41.80137 to the right, agree=0.726, adj=0.371, (0 split)
## job splits as RLRRRRRLRRLL, agree=0.604, adj=0.092, (0 split)
## education splits as LLLRRRRL, agree=0.589, adj=0.057, (0 split)
##
## Node number 9: 11971 observations, complexity param=0.03907079
## predicted class=yes expected loss=0.4211845 P(node) =0.2047795
## class counts: 5042 6929
## probabilities: 0.421 0.579
## left son=18 (7082 obs) right son=19 (4889 obs)
## Primary splits:
## cons_conf_idx < -46.19657 to the left, improve=881.6659, (0 missing)
## month splits as -R-RR-LR--, improve=775.1438, (0 missing)
## cons_price_idx < 92.89325 to the left, improve=416.1322, (0 missing)
## job splits as RLLRRRLLRRRL, improve=204.9930, (0 missing)
## campaign < 4.997728 to the right, improve=201.7275, (0 missing)
## Surrogate splits:
## month splits as -R-RR-LR--, agree=0.974, adj=0.936, (0 split)
## cons_price_idx < 92.88986 to the right, agree=0.811, adj=0.536, (0 split)
## job splits as RLLRLRLLRLRR, agree=0.672, adj=0.196, (0 split)
## education splits as LLLLRLRR, agree=0.646, adj=0.133, (0 split)
## age < 55.01647 to the left, agree=0.638, adj=0.113, (0 split)
##
## Node number 16: 13637 observations
## predicted class=no expected loss=0.222263 P(node) =0.2332786
## class counts: 10606 3031
## probabilities: 0.778 0.222
##
## Node number 17: 17675 observations, complexity param=0.0203565
## predicted class=no expected loss=0.3902687 P(node) =0.3023538
## class counts: 10777 6898
## probabilities: 0.610 0.390
## left son=34 (16113 obs) right son=35 (1562 obs)
## Primary splits:
## cons_price_idx < 93.91801 to the left, improve=824.9796, (0 missing)
## month splits as -L-LR-RL--, improve=396.8921, (0 missing)
## campaign < 1.000115 to the left, improve=256.5630, (0 missing)
## cons_conf_idx < -42.71392 to the right, improve=222.1548, (0 missing)
## age < 60.00524 to the left, improve=184.8284, (0 missing)
## Surrogate splits:
## cons_conf_idx < -46.28575 to the right, agree=0.923, adj=0.125, (0 split)
## previous < 1.041097 to the left, agree=0.922, adj=0.117, (0 split)
## age < 60.99399 to the left, agree=0.922, adj=0.113, (0 split)
## month splits as -L-LR-LL--, agree=0.914, adj=0.031, (0 split)
##
## Node number 18: 7082 observations
## predicted class=no expected loss=0.4193731 P(node) =0.1211468
## class counts: 4112 2970
## probabilities: 0.581 0.419
##
## Node number 19: 4889 observations
## predicted class=yes expected loss=0.1902229 P(node) =0.08363269
## class counts: 930 3959
## probabilities: 0.190 0.810
##
## Node number 34: 16113 observations
## predicted class=no expected loss=0.3427046 P(node) =0.2756338
## class counts: 10591 5522
## probabilities: 0.657 0.343
##
## Node number 35: 1562 observations
## predicted class=yes expected loss=0.1190781 P(node) =0.02672004
## class counts: 186 1376
## probabilities: 0.119 0.881
##
## n= 58458
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 58458 29229 no (0.5000000 0.5000000)
## 2) poutcome=failure,nonexistent 52419 23573 no (0.5502966 0.4497034)
## 4) month=aug,jul,jun,may,nov 43283 16858 no (0.6105168 0.3894832)
## 8) cons_price_idx>=93.19988 31312 9929 no (0.6829011 0.3170989)
## 16) contact=telephone 13637 3031 no (0.7777370 0.2222630) *
## 17) contact=cellular 17675 6898 no (0.6097313 0.3902687)
## 34) cons_price_idx< 93.91801 16113 5522 no (0.6572954 0.3427046) *
## 35) cons_price_idx>=93.91801 1562 186 yes (0.1190781 0.8809219) *
## 9) cons_price_idx< 93.19988 11971 5042 yes (0.4211845 0.5788155)
## 18) cons_conf_idx< -46.19657 7082 2970 no (0.5806269 0.4193731) *
## 19) cons_conf_idx>=-46.19657 4889 930 yes (0.1902229 0.8097771) *
## 5) month=apr,dec,mar,oct,sep 9136 2421 yes (0.2649956 0.7350044) *
## 3) poutcome=success 6039 383 yes (0.0634211 0.9365789) *
end <- Sys.time()
# rpart does k-fold cross validation automatically with default 10-fold for cp
# Plot cross-validation results
plotcp(decision_tree1)
# Print cp values
printcp(decision_tree1)
##
## Classification tree:
## rpart(formula = y ~ ., data = smoted_train, method = "class",
## control = rpart.control(maxdepth = 5))
##
## Variables actually used in tree construction:
## [1] cons_conf_idx cons_price_idx contact month poutcome
##
## Root node error: 29229/58458 = 0.5
##
## n= 58458
##
## CP nsplit rel error xerror xstd
## 1 0.180403 0 1.00000 1.00749 0.0041359
## 2 0.146909 1 0.81960 0.81960 0.0040681
## 3 0.064559 2 0.67269 0.67224 0.0039075
## 4 0.039071 3 0.60813 0.60813 0.0038052
## 5 0.020356 4 0.56906 0.57197 0.0037379
## 6 0.010000 6 0.52835 0.53129 0.0036535
# Default predictions
predictions_decision_tree1 <- predict(decision_tree1, newdata = test, type = "class")
cf_matrix_decision_tree1 <- confusionMatrix(predictions_decision_tree1, test$y)
print(cf_matrix_decision_tree1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6348 394
## yes 960 534
##
## Accuracy : 0.8356
## 95% CI : (0.8274, 0.8435)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3507
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8686
## Specificity : 0.5754
## Pos Pred Value : 0.9416
## Neg Pred Value : 0.3574
## Prevalence : 0.8873
## Detection Rate : 0.7708
## Detection Prevalence : 0.8186
## Balanced Accuracy : 0.7220
##
## 'Positive' Class : no
##
# Bar plot of the target variable
predictions_df <- as.data.frame(table(predictions_decision_tree1))
colnames(predictions_df) <- c("Target", "Count")
ggplot(predictions_df, aes(x = Target, y = Count)) +
geom_bar(stat = "identity", fill="blue") +
labs(
title = "Distribution of Target Variable",
x = "Client Subscription Status (0 = Not Subscribed, 1 = Subscribed)",
y = "Count"
) +
theme_minimal()
# Evaluation
probs1 <- predict(decision_tree1, newdata = test, type = "vector")
roc1 <- roc(test$y, probs1, quiet = TRUE)
decision_tree1_auc <- auc(roc1)
decision_tree1_accuracy <- cf_matrix_decision_tree1$overall['Accuracy']
decision_tree1_kappa <- cf_matrix_decision_tree1$overall['Kappa']
decision_tree1_sensitivity <- cf_matrix_decision_tree1$byClass['Sensitivity']
decision_tree1_specificity <- cf_matrix_decision_tree1$byClass['Specificity']
decision_tree1_f1 <- cf_matrix_decision_tree1$byClass["F1"]
decision_tree1_precision <- cf_matrix_decision_tree1$byClass["Precision"]
decision_tree1_model_duration <- end - start
output2 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 1 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_decision_tree1)), collapse = "\n"), "\n",
"Accuracy:", round(decision_tree1_accuracy, 4), "| Precision:", round(decision_tree1_precision, 4),
"| Sensitivity:", round(decision_tree1_sensitivity, 4), "| Specificity:", round(decision_tree1_specificity, 4), "\n",
"F1 Score:", round(decision_tree1_f1, 4), "| Kappa:", round(decision_tree1_kappa, 4), "| AUC:", round(decision_tree1_auc, 4),
"| Model Duration:", round(decision_tree1_model_duration, 4), "\n\n", sep = " ")
cat(output2)
##
## === Model Selection and Evaluation ===
##
## === Model 1 Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6348 394
## yes 960 534
##
## Accuracy : 0.8356
## 95% CI : (0.8274, 0.8435)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3507
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8686
## Specificity : 0.5754
## Pos Pred Value : 0.9416
## Neg Pred Value : 0.3574
## Prevalence : 0.8873
## Detection Rate : 0.7708
## Detection Prevalence : 0.8186
## Balanced Accuracy : 0.7220
##
## 'Positive' Class : no
##
## Accuracy: 0.8356 | Precision: 0.9416 | Sensitivity: 0.8686 | Specificity: 0.5754
## F1 Score: 0.9036 | Kappa: 0.3507 | AUC: 0.722 | Model Duration: 0.5946
Results:
With a smaller max depth, the accuracy slightly decreased compared to the default model, coming in at 0.8355998.
# Recommendation
# Plot decision tree
rpart.plot(decision_tree1)
# Extract important predictors from model
vip(decision_tree1)
# Rules
rpart.rules(decision_tree1)
## y
## 0.22 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx >= 93 & contact is telephone
## 0.34 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx is 93 to 94 & contact is cellular
## 0.42 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx < 93 & cons_conf_idx < -46
## 0.74 when poutcome is failure or nonexistent & month is apr or dec or mar or oct or sep
## 0.81 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx < 93 & cons_conf_idx >= -46
## 0.88 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx >= 94 & contact is cellular
## 0.94 when poutcome is success
Recommendation:
According to this model, the most important predictors the marketing team should be focusing on are month, cons_price_idx, cons_conf_idx, poutcome, and previous. The main difference from the default model is that campaign is no longer a top predictor.
The rules above can also be used to follow the decision tree structure.
Objective: Discover how using standardized datasets affects model accuracy.
The point of this model is to experiment with standardized data. The standardized datasets (created earlier) will be used to see if they have any impact on the model's accuracy. Standardization should not have any effect on decision trees, but let's confirm this.
Metric: Accuracy
# Experiment 2 model - standardized datasets
set.seed(125)
start <- Sys.time()
decision_tree2 <- rpart(y ~ ., data = smoted_train_standardized, method = "class")
print(decision_tree2)
## n= 58458
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 58458 29229 no (0.50000000 0.50000000)
## 2) poutcome=failure,nonexistent 52478 23632 no (0.54967796 0.45032204)
## 4) month=aug,jul,jun,may,nov 43295 16870 no (0.61034762 0.38965238)
## 8) cons_price_idx>=-0.6487716 31264 9881 no (0.68394959 0.31605041)
## 16) contact=telephone 13563 2957 no (0.78198039 0.21801961) *
## 17) contact=cellular 17701 6924 no (0.60883566 0.39116434)
## 34) cons_price_idx< 0.5930096 16205 5614 no (0.65356371 0.34643629)
## 68) month=aug,jul,nov 15572 5026 no (0.67724120 0.32275880)
## 136) campaign< -0.5647457 5526 1147 no (0.79243576 0.20756424) *
## 137) campaign>=-0.5647457 10046 3879 no (0.61387617 0.38612383)
## 274) campaign>=-0.201818 8898 2731 no (0.69307710 0.30692290)
## 548) campaign< -0.2014506 3121 595 no (0.80935598 0.19064402) *
## 549) campaign>=-0.2014506 5777 2136 no (0.63025792 0.36974208)
## 1098) campaign>=0.1607861 5057 1416 no (0.71999209 0.28000791) *
## 1099) campaign< 0.1607861 720 0 yes (0.00000000 1.00000000) *
## 275) campaign< -0.201818 1148 0 yes (0.00000000 1.00000000) *
## 69) month=jun,may 633 45 yes (0.07109005 0.92890995) *
## 35) cons_price_idx>=0.5930096 1496 186 yes (0.12433155 0.87566845) *
## 9) cons_price_idx< -0.6487716 12031 5042 yes (0.41908403 0.58091597)
## 18) cons_conf_idx< -1.23077 7122 3010 no (0.57736591 0.42263409) *
## 19) cons_conf_idx>=-1.23077 4909 930 yes (0.18944795 0.81055205) *
## 5) month=apr,dec,mar,oct,sep 9183 2421 yes (0.26363933 0.73636067) *
## 3) poutcome=success 5980 383 yes (0.06404682 0.93595318) *
end <- Sys.time()
# Predictions
predictions_decision_tree2 <- predict(decision_tree2, newdata = test_standardized, type = "class")
cf_matrix_decision_tree2 <- confusionMatrix(predictions_decision_tree2, test_standardized$y)
# print(cf_matrix_decision_tree2)
# Bar plot of the target variable
predictions_df2 <- as.data.frame(table(predictions_decision_tree2))
colnames(predictions_df2) <- c("Target", "Count")
ggplot(predictions_df2, aes(x = Target, y = Count)) +
geom_bar(stat = "identity", fill="blue") +
labs(
title = "Distribution of Target Variable",
x = "Client Subscription Status (0 = Not Subscribed, 1 = Subscribed)",
y = "Count"
) +
theme_minimal()
# Evaluation
# Predict on the standardized test set to match the standardized training data
probs2 <- predict(decision_tree2, newdata = test_standardized, type = "vector")
roc2 <- roc(test_standardized$y, probs2, quiet = TRUE)
decision_tree2_auc <- auc(roc2)
decision_tree2_accuracy <- cf_matrix_decision_tree2$overall['Accuracy']
decision_tree2_kappa <- cf_matrix_decision_tree2$overall['Kappa']
decision_tree2_sensitivity <- cf_matrix_decision_tree2$byClass['Sensitivity']
decision_tree2_specificity <- cf_matrix_decision_tree2$byClass['Specificity']
decision_tree2_f1 <- cf_matrix_decision_tree2$byClass["F1"]
decision_tree2_precision <- cf_matrix_decision_tree2$byClass["Precision"]
decision_tree2_model_duration <- end - start
output3 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 2 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_decision_tree2)), collapse = "\n"), "\n",
"Accuracy:", round(decision_tree2_accuracy, 4), "| Precision:", round(decision_tree2_precision, 4),
"| Sensitivity:", round(decision_tree2_sensitivity, 4), "| Specificity:", round(decision_tree2_specificity, 4), "\n",
"F1 Score:", round(decision_tree2_f1, 4), "| Kappa:", round(decision_tree2_kappa, 4), "| AUC:", round(decision_tree2_auc, 4),
"| Model Duration:", round(decision_tree2_model_duration, 4), "\n\n", sep = " ")
cat(output3)
##
## === Model Selection and Evaluation ===
##
## === Model 2 Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6342 383
## yes 966 545
##
## Accuracy : 0.8362
## 95% CI : (0.828, 0.8441)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3572
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8678
## Specificity : 0.5873
## Pos Pred Value : 0.9430
## Neg Pred Value : 0.3607
## Prevalence : 0.8873
## Detection Rate : 0.7700
## Detection Prevalence : 0.8165
## Balanced Accuracy : 0.7276
##
## 'Positive' Class : no
##
## Accuracy: 0.8362 | Precision: 0.943 | Sensitivity: 0.8678 | Specificity: 0.5873
## F1 Score: 0.9039 | Kappa: 0.3572 | AUC: 0.6165 | Model Duration: 0.9451
Results:
As expected, the model accuracy remained the same, meaning it is not necessary to standardize the dataset ahead of time. That being said, it does not hurt the model at all.
# Recommendation
# Plot decision tree
rpart.plot(decision_tree2)
# Extract important predictors from model
vip(decision_tree2)
# Rules
rpart.rules(decision_tree2)
## y
## 0.19 when poutcome is failure or nonexistent & month is aug or jul or nov & cons_price_idx is -0.65 to 0.59 & contact is cellular & campaign is -0.20 to -0.20
## 0.21 when poutcome is failure or nonexistent & month is aug or jul or nov & cons_price_idx is -0.65 to 0.59 & contact is cellular & campaign < -0.56
## 0.22 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx >= -0.65 & contact is telephone
## 0.28 when poutcome is failure or nonexistent & month is aug or jul or nov & cons_price_idx is -0.65 to 0.59 & contact is cellular & campaign >= 0.16
## 0.42 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx < -0.65 & cons_conf_idx < -1.2
## 0.74 when poutcome is failure or nonexistent & month is apr or dec or mar or oct or sep
## 0.81 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx < -0.65 & cons_conf_idx >= -1.2
## 0.88 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx >= 0.59 & contact is cellular
## 0.93 when poutcome is failure or nonexistent & month is jun or may & cons_price_idx is -0.65 to 0.59 & contact is cellular
## 0.94 when poutcome is success
## 1.00 when poutcome is failure or nonexistent & month is aug or jul or nov & cons_price_idx is -0.65 to 0.59 & contact is cellular & campaign is -0.20 to 0.16
## 1.00 when poutcome is failure or nonexistent & month is aug or jul or nov & cons_price_idx is -0.65 to 0.59 & contact is cellular & campaign is -0.56 to -0.20
Recommendation:
According to this model, the most important predictors the marketing team should be focusing on are month, cons_price_idx, cons_conf_idx, poutcome, and campaign. This matches the default model.
The rules above can also be used to follow the decision tree structure.
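For explaining an individual call to the marketing team, the rpart.plot package can also report which rule fired for each prediction. A minimal sketch, assuming the same test frame used for Model 2's predictions:
# Show the decision rule applied to each of the first 5 test clients
rpart.predict(decision_tree2, newdata = head(test, 5), rules = TRUE)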
# Create experiments results table
decision_tree_experiments <- data.frame(
Model = c("Default Model", "Model 1", "Model 2"),
Accuracy = c(decision_tree_accuracy, decision_tree1_accuracy, decision_tree2_accuracy),
Error_Rate = c(1-decision_tree_accuracy, 1-decision_tree1_accuracy, 1-decision_tree2_accuracy),
Precision = c(decision_tree_precision, decision_tree1_precision, decision_tree2_precision),
Sensitivity = c(decision_tree_sensitivity, decision_tree1_sensitivity, decision_tree2_sensitivity),
Specificity = c(decision_tree_specificity, decision_tree1_specificity, decision_tree2_specificity),
F1_Score = c(decision_tree_f1, decision_tree1_f1, decision_tree2_f1),
Kappa = c(decision_tree_kappa, decision_tree1_kappa, decision_tree2_kappa),
AUC = c(decision_tree_auc, decision_tree1_auc, decision_tree2_auc),
Duration = c(decision_tree_model_duration, decision_tree1_model_duration, decision_tree2_model_duration)
)
print(decision_tree_experiments)
## Model Accuracy Error_Rate Precision Sensitivity Specificity
## 1 Default Model 0.8362069 0.1637931 0.9430483 0.8678161 0.5872845
## 2 Model 1 0.8355998 0.1644002 0.9415604 0.8686371 0.5754310
## 3 Model 2 0.8362069 0.1637931 0.9430483 0.8678161 0.5872845
## F1_Score Kappa AUC Duration
## 1 0.9038695 0.3571577 0.7275503 1.0857201 secs
## 2 0.9036299 0.3507004 0.7220341 0.5946162 secs
## 3 0.9038695 0.3571577 0.6164905 0.9451001 secs
# Plot ROC curves
plot(roc, col = "blue", main = "ROC Curves Comparison")
plot(roc1, col = "red", add = TRUE)
plot(roc2, col = "green", add = TRUE)
legend("bottomright",
legend = paste0(c("Default Model", "Model 1", "Model 2"), " (AUC=",
round(c(decision_tree_auc, decision_tree1_auc, decision_tree2_auc), 3), ")"),
col = c("blue", "red", "green"),
lwd = 2)
Overall, the default model and Model 2 performed the best on accuracy. This confirms that a standardized dataset isn't necessary, which is a pro for decision trees. Model 1 illustrated the trade-off between bias and variance: lowering the maximum depth of the tree can go too far in preventing overfitting and instead cause underfitting. In this case, too much information was lost. The default model is therefore the winner among the decision trees.
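To make the depth trade-off concrete, a quick sweep over maxdepth could be run; this is a hedged sketch assuming the smoted_train and test objects used throughout, with rpart loaded:
# Sweep maxdepth and watch test accuracy rise and then flatten (or fall)
for (depth in c(2, 4, 6, 8)) {
  fit   <- rpart(y ~ ., data = smoted_train, control = rpart.control(maxdepth = depth))
  preds <- predict(fit, newdata = test, type = "class")
  cat("maxdepth =", depth, "-> test accuracy =", round(mean(preds == test$y), 4), "\n")
}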
Since there is no variation in this model, as it uses the default settings and the original datasets, we won't count it as an actual experiment. We will only use it for comparison purposes and for learning about the default params.
Metric: Accuracy
There are a lot of arguments for randomForest, so I’ll reference this R documentation website for all the params: https://www.rdocumentation.org/packages/randomForest/versions/4.7-1.2/topics/randomForest
# Default model
set.seed(125)
start <- Sys.time()
rf <- randomForest(y ~ ., data = smoted_train)
print(rf)
##
## Call:
## randomForest(formula = y ~ ., data = smoted_train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 9.38%
## Confusion matrix:
## no yes class.error
## no 26982 2247 0.07687571
## yes 3235 25994 0.11067775
end <- Sys.time()
# Default predictions
predictions_rf <- predict(rf, newdata = test)
cf_matrix_rf <- confusionMatrix(predictions_rf, test$y)
print(cf_matrix_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6794 514
## yes 514 414
##
## Accuracy : 0.8752
## 95% CI : (0.8679, 0.8822)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.9997
##
## Kappa : 0.3758
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.9297
## Specificity : 0.4461
## Pos Pred Value : 0.9297
## Neg Pred Value : 0.4461
## Prevalence : 0.8873
## Detection Rate : 0.8249
## Detection Prevalence : 0.8873
## Balanced Accuracy : 0.6879
##
## 'Positive' Class : no
##
# Bar plot of the target variable
predictions_rf_df <- as.data.frame(table(predictions_rf))
colnames(predictions_rf_df) <- c("Target", "Count")
ggplot(predictions_rf_df, aes(x = Target, y = Count)) +
geom_bar(stat = "identity", fill="blue") +
labs(
title = "Distribution of Target Variable",
x = "Client Subscription Status (0 = Not Subscribed, 1 = Subscribed)",
y = "Count"
) +
theme_minimal()
# Evaluation
# rf_probs <- predict(rf, newdata = test, type = "prob")
rf_probs <- predict(rf, newdata = test, type = "prob")[, 2]
rf_roc <- roc(test$y, rf_probs, quiet = TRUE)
rf_auc <- auc(rf_roc)
rf_accuracy <- cf_matrix_rf$overall['Accuracy']
rf_kappa <- cf_matrix_rf$overall['Kappa']
rf_sensitivity <- cf_matrix_rf$byClass['Sensitivity']
rf_specificity <- cf_matrix_rf$byClass['Specificity']
rf_f1 <- cf_matrix_rf$byClass["F1"]
rf_precision <- cf_matrix_rf$byClass["Precision"]
rf_model_duration <- end - start
output2 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Default Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_rf)), collapse = "\n"), "\n",
"Accuracy:", round(rf_accuracy, 4), "| Precision:", round(rf_precision, 4),
"| Sensitivity:", round(rf_sensitivity, 4), "| Specificity:", round(rf_specificity, 4), "\n",
"F1 Score:", round(rf_f1, 4), "| Kappa:", round(rf_kappa, 4), "| AUC:", round(rf_auc, 4),
"| Model Duration:", round(rf_model_duration, 4), "\n\n", sep = " ")
cat(output2)
##
## === Model Selection and Evaluation ===
##
## === Default Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6794 514
## yes 514 414
##
## Accuracy : 0.8752
## 95% CI : (0.8679, 0.8822)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.9997
##
## Kappa : 0.3758
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.9297
## Specificity : 0.4461
## Pos Pred Value : 0.9297
## Neg Pred Value : 0.4461
## Prevalence : 0.8873
## Detection Rate : 0.8249
## Detection Prevalence : 0.8873
## Balanced Accuracy : 0.6879
##
## 'Positive' Class : no
##
## Accuracy: 0.8752 | Precision: 0.9297 | Sensitivity: 0.9297 | Specificity: 0.4461
## F1 Score: 0.9297 | Kappa: 0.3758 | AUC: 0.7676 | Model Duration: 33.4794
Results (focusing on accuracy):
The randomForest model increased accuracy relative to the decision trees, reaching 87.52%. This makes sense: a random forest is an ensemble model, effectively a "team" of decision trees whose votes are aggregated.
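The "team" framing is visible directly in the API: predict() can return every individual tree's vote, which the forest aggregates by majority. A minimal sketch:
# Each of the 500 trees votes; the forest's answer is the majority vote
votes <- predict(rf, newdata = head(test, 3), predict.all = TRUE)
votes$aggregate              # the forest's final predictions
table(votes$individual[1, ]) # how the trees' votes split for the first client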
# Extract important predictors from model
vip(rf)
Recommendation:
According to this model, the most important predictors the marketing team should be focusing on are campaign, cons_conf_idx, month, age, and cons_price_idx.
Objective: Discover how a larger number of randomly selected features affects the model accuracy.
The point of this model is to experiment with the mtry hyperparameter, which is the number of features randomly considered at each split. For this experiment, instead of the default (floor(sqrt(p)) for classification, which came out to 3 here per the model printout), I will force it to use 10. I chose to increase this number because letting each split consider more features gives the model more information to work with, which could make it more accurate.
Metric: Accuracy
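Rather than fixing mtry by hand, the randomForest package also ships a tuneRF() helper that searches mtry by OOB error. A hedged sketch, assuming the same smoted_train:
# Let tuneRF search mtry by OOB error; stepFactor scales mtry up/down each step
set.seed(125)
tuned <- tuneRF(
  x = smoted_train[, setdiff(names(smoted_train), "y")],
  y = smoted_train$y,
  ntreeTry   = 100,  # fewer trees per candidate to keep the search fast
  stepFactor = 1.5,
  improve    = 0.01
)
print(tuned)  # matrix of candidate mtry values vs OOB error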
# Build model 1 with mtry = 10
set.seed(125)
# Tried to use cross-validation, but it took too long
# # Use cross-validation
# ctrl <- trainControl(method = "cv", number = 5)
#
# # Set up param tune grid for cross-validation
# tune_grid = expand.grid(mtry=c(10, 15))
start <- Sys.time()
rf1 <- randomForest(y ~ ., data = smoted_train, mtry = 10)
print(rf1)
##
## Call:
## randomForest(formula = y ~ ., data = smoted_train, mtry = 10)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 10
##
## OOB estimate of error rate: 7.37%
## Confusion matrix:
## no yes class.error
## no 27201 2028 0.06938315
## yes 2282 26947 0.07807315
end <- Sys.time()
# Make predictions
predictions_rf1 <- predict(rf1, newdata = test)
cf_matrix_rf1 <- confusionMatrix(predictions_rf1, test$y)
print(cf_matrix_rf1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6806 590
## yes 502 338
##
## Accuracy : 0.8674
## 95% CI : (0.8599, 0.8747)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 1.00000
##
## Kappa : 0.3083
##
## Mcnemar's Test P-Value : 0.00847
##
## Sensitivity : 0.9313
## Specificity : 0.3642
## Pos Pred Value : 0.9202
## Neg Pred Value : 0.4024
## Prevalence : 0.8873
## Detection Rate : 0.8264
## Detection Prevalence : 0.8980
## Balanced Accuracy : 0.6478
##
## 'Positive' Class : no
##
# Bar plot of the target variable
predictions_rf1_df <- as.data.frame(table(predictions_rf1))
colnames(predictions_rf1_df) <- c("Target", "Count")
ggplot(predictions_rf1_df, aes(x = Target, y = Count)) +
geom_bar(stat = "identity", fill="blue") +
labs(
title = "Distribution of Target Variable",
x = "Client Subscription Status (0 = Not Subscribed, 1 = Subscribed)",
y = "Count"
) +
theme_minimal()
# Evaluation
rf1_probs <- predict(rf1, newdata = test, type = "prob")[, 2]
rf1_roc <- roc(test$y, rf1_probs, quiet = TRUE)
rf1_auc <- auc(rf1_roc)
rf1_accuracy <- cf_matrix_rf1$overall['Accuracy']
rf1_kappa <- cf_matrix_rf1$overall['Kappa']
rf1_sensitivity <- cf_matrix_rf1$byClass['Sensitivity']
rf1_specificity <- cf_matrix_rf1$byClass['Specificity']
rf1_f1 <- cf_matrix_rf1$byClass["F1"]
rf1_precision <- cf_matrix_rf1$byClass["Precision"]
rf1_model_duration <- end - start
output3 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 1 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_rf1)), collapse = "\n"), "\n",
"Accuracy:", round(rf1_accuracy, 4), "| Precision:", round(rf1_precision, 4),
"| Sensitivity:", round(rf1_sensitivity, 4), "| Specificity:", round(rf1_specificity, 4), "\n",
"F1 Score:", round(rf1_f1, 4), "| Kappa:", round(rf1_kappa, 4), "| AUC:", round(rf1_auc, 4),
"| Model Duration:", round(rf1_model_duration, 4), "\n\n", sep = " ")
cat(output3)
##
## === Model Selection and Evaluation ===
##
## === Model 1 Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6806 590
## yes 502 338
##
## Accuracy : 0.8674
## 95% CI : (0.8599, 0.8747)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 1.00000
##
## Kappa : 0.3083
##
## Mcnemar's Test P-Value : 0.00847
##
## Sensitivity : 0.9313
## Specificity : 0.3642
## Pos Pred Value : 0.9202
## Neg Pred Value : 0.4024
## Prevalence : 0.8873
## Detection Rate : 0.8264
## Detection Prevalence : 0.8980
## Balanced Accuracy : 0.6478
##
## 'Positive' Class : no
##
## Accuracy: 0.8674 | Precision: 0.9202 | Sensitivity: 0.9313 | Specificity: 0.3642
## F1 Score: 0.9257 | Kappa: 0.3083 | AUC: 0.7397 | Model Duration: 1.105
Results (focusing on accuracy):
The first experimental random forest model showed a lower accuracy than the default, at 86.74%. Considering more features at every split reduces the randomness between trees: the trees become more similar to each other and more attached to the same dominant (and possibly noisy) features, so the ensemble loses some of the variance reduction that makes a random forest accurate.
# Extract important predictors from model
vip(rf1)
Recommendation:
According to this model, the most important predictors the marketing team should be focusing on are campaign, age, cons_price_idx, month, and poutcome.
Objective: Discover how a larger number of trees affects the model accuracy.
The point of this model is to experiment with the ntree hyperparameter, which is the number of trees grown. For this experiment, instead of the default of 500, 700 will be used. I initially tried to do this via cross-validation but got an "mtry" error even though that param was specified; caret's "rf" method only tunes mtry, so ntree cannot go in the tuning grid (see the sketch below).
Metric: Accuracy
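For the record, the "mtry" error happens because caret's "rf" method exposes only mtry as a tuning parameter; ntree is not tunable and must instead be forwarded through train()'s ... argument. A hedged sketch, left commented out like the earlier attempt since cross-validating 700-tree forests is slow:
# ctrl  <- trainControl(method = "cv", number = 5)
# rf_cv <- train(
#   y ~ ., data = smoted_train,
#   method    = "rf",
#   trControl = ctrl,
#   tuneGrid  = expand.grid(mtry = 3), # the only tunable param for method = "rf"
#   ntree     = 700                    # forwarded on to randomForest()
# )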
# Build model 2 with ntree = 700
set.seed(125)
start <- Sys.time()
rf2 <- randomForest(y ~ ., data = smoted_train, ntree = 700)
# train_control <- trainControl(method = "cv", number = 10)
# tune_grid = expand.grid(ntree=c(500, 700))
#
# rf2 <- train(
# y ~ .,
# data = smoted_train,
# method = "rf",
# trControl = train_control,
# tuneGrid = tune_grid,
# mtry = 3
# )
print(rf2)
##
## Call:
## randomForest(formula = y ~ ., data = smoted_train, ntree = 700)
## Type of random forest: classification
## Number of trees: 700
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 9.32%
## Confusion matrix:
## no yes class.error
## no 26984 2245 0.07680728
## yes 3201 26028 0.10951452
end <- Sys.time()
# Make predictions
predictions_rf2 <- predict(rf2, newdata = test)
cf_matrix_rf2 <- confusionMatrix(predictions_rf2, test$y)
# Bar plot of the target variable
predictions_rf2_df <- as.data.frame(table(predictions_rf2))
colnames(predictions_rf2_df) <- c("Target", "Count")
ggplot(predictions_rf2_df, aes(x = Target, y = Count)) +
geom_bar(stat = "identity", fill="blue") +
labs(
title = "Distribution of Target Variable",
x = "Client Subscription Status (0 = Not Subscribed, 1 = Subscribed)",
y = "Count"
) +
theme_minimal()
# Evaluation
rf2_probs <- predict(rf2, newdata = test, type = "prob")[, 2]
rf2_roc <- roc(test$y, rf2_probs, quiet = TRUE)
rf2_auc <- auc(rf2_roc)
rf2_accuracy <- cf_matrix_rf2$overall['Accuracy']
rf2_kappa <- cf_matrix_rf2$overall['Kappa']
rf2_sensitivity <- cf_matrix_rf2$byClass['Sensitivity']
rf2_specificity <- cf_matrix_rf2$byClass['Specificity']
rf2_f1 <- cf_matrix_rf2$byClass["F1"]
rf2_precision <- cf_matrix_rf2$byClass["Precision"]
rf2_model_duration <- end - start
output4 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 1 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_rf2)), collapse = "\n"), "\n",
"Accuracy:", round(rf2_accuracy, 4), "| Precision:", round(rf2_precision, 4),
"| Sensitivity:", round(rf2_sensitivity, 4), "| Specificity:", round(rf2_specificity, 4), "\n",
"F1 Score:", round(rf2_f1, 4), "| Kappa:", round(rf2_kappa, 4), "| AUC:", round(rf2_auc, 4),
"| Model Duration:", round(rf2_model_duration, 4), "\n\n", sep = " ")
cat(output4)
##
## === Model Selection and Evaluation ===
##
## === Model 2 Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6796 508
## yes 512 420
##
## Accuracy : 0.8762
## 95% CI : (0.8688, 0.8832)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.9993
##
## Kappa : 0.3818
##
## Mcnemar's Test P-Value : 0.9252
##
## Sensitivity : 0.9299
## Specificity : 0.4526
## Pos Pred Value : 0.9304
## Neg Pred Value : 0.4506
## Prevalence : 0.8873
## Detection Rate : 0.8252
## Detection Prevalence : 0.8868
## Balanced Accuracy : 0.6913
##
## 'Positive' Class : no
##
## Accuracy: 0.8762 | Precision: 0.9304 | Sensitivity: 0.9299 | Specificity: 0.4526
## F1 Score: 0.9302 | Kappa: 0.3818 | AUC: 0.7678 | Model Duration: 10.9643
Results (focusing on accuracy):
The second experimental random forest model showed a slightly higher accuracy than the default, at 87.62%. This shows that increasing the number of trees can improve accuracy: more "team members" (trees) join the forest, and their aggregated vote becomes more stable.
# Extract important predictors from model
vip(rf2)
Recommendation:
According to this model, the most important predictors the marketing team should be focusing on are campaign, cons_conf_idx, month, age, and cons_price_idx.
# Create experiments results table
rf_experiments <- data.frame(
Model = c("Default Model", "Model 1", "Model 2"),
Accuracy = c(rf_accuracy, rf1_accuracy, rf2_accuracy),
Error_Rate = c(1-rf_accuracy, 1-rf1_accuracy, 1-rf2_accuracy),
Precision = c(rf_precision, rf1_precision, rf2_precision),
Sensitivity = c(rf_sensitivity, rf1_sensitivity, rf2_sensitivity),
Specificity = c(rf_specificity, rf1_specificity, rf2_specificity),
F1_Score = c(rf_f1, rf1_f1, rf2_f1),
Kappa = c(rf_kappa, rf1_kappa, rf2_kappa),
AUC = c(rf_auc, rf1_auc, rf2_auc),
Duration = c(rf_model_duration, rf1_model_duration, rf2_model_duration)
)
print(rf_experiments)
## Model Accuracy Error_Rate Precision Sensitivity Specificity
## 1 Default Model 0.8751821 0.1248179 0.9296661 0.9296661 0.4461207
## 2 Model 1 0.8674114 0.1325886 0.9202271 0.9313082 0.3642241
## 3 Model 2 0.8761535 0.1238465 0.9304491 0.9299398 0.4525862
## F1_Score Kappa AUC Duration
## 1 0.9296661 0.3757868 0.7676071 33.47942 secs
## 2 0.9257345 0.3082935 0.7396810 66.29887 secs
## 3 0.9301944 0.3818076 0.7678411 657.85647 secs
# Plot ROC curves
plot(rf_roc, col = "blue", main = "ROC Curves Comparison")
plot(rf1_roc, col = "red", add = TRUE)
plot(rf2_roc, col = "green", add = TRUE)
legend("bottomright",
legend = paste0(c("Default Model", "Model 1", "Model 2"), " (AUC=",
round(c(rf_auc, rf1_auc, rf2_auc), 3), ")"),
col = c("blue", "red", "green"),
lwd = 2)
Overall, increasing the number of trees (Model 2) improved the model's performance, but not by much, and it increased the runtime considerably, making the model more computationally expensive. Increasing the number of features considered at each split (Model 1) actually hurt performance, showing that letting every tree see too many features can lower the accuracy of a random forest: the trees become correlated, and important splits can land on noisy features that do not actually influence the outcome variable. Model 1 also roughly doubled the runtime over the default. The fastest model was the default, followed by Model 1; Model 2 was by far the slowest. Model 2 had the highest metrics across the board except sensitivity, so Model 2 has to be the winner among the random forest models.
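One way to sanity-check whether 700 trees buys anything: plot the OOB error against the number of trees, which plot() does directly for a randomForest object. The curve typically flattens well before the default of 500 trees:
# OOB and per-class error vs. number of trees; a flat tail means extra trees
# add runtime but little accuracy
plot(rf2, main = "OOB Error vs Number of Trees")
legend("topright", legend = colnames(rf2$err.rate), col = 1:3, lty = 1:3)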
Since there is no variation in this model, as it uses the default settings and the original datasets, we won't count it as an actual experiment. We will only use it for comparison purposes and for learning about the default params.
From the R documentation, here are the parameters for this model type:
formula: a formula, as in the lm function.
data: a data frame in which to interpret the variables named in formula.
boos: if TRUE (by default), a bootstrap sample of the training set is drawn using the weights for each observation on that iteration. If FALSE, every observation is used with its weights.
mfinal: an integer, the number of iterations for which boosting is run or the number of trees to use. Defaults to mfinal=100 iterations.
coeflearn: if 'Breiman' (by default), alpha=1/2*ln((1-err)/err) is used. If 'Freund', alpha=ln((1-err)/err) is used. In both cases the AdaBoost.M1 algorithm is used and alpha is the weight updating coefficient. On the other hand, if coeflearn is 'Zhu', the SAMME algorithm is implemented with alpha=ln((1-err)/err)+ln(nclasses-1).
control: options that control details of the rpart algorithm. See rpart.control for more details.
set.seed(125)
start <- Sys.time()
adaboost <- boosting(y ~ ., data = smoted_train)
print(summary(adaboost))
## Length Class Mode
## formula 3 formula call
## trees 100 -none- list
## weights 100 -none- numeric
## votes 116916 -none- numeric
## prob 116916 -none- numeric
## class 58458 -none- character
## importance 15 -none- numeric
## terms 3 terms call
## call 3 -none- call
end <- Sys.time()
# Make predictions
# When using boosting, use predict.boosting for the predictions
# Note: must convert the output of predict.boosting to a factor to match the test dataset and avoid errors
predictions_adaboost <- predict.boosting(adaboost, newdata = test)
cf_matrix_adaboost <- confusionMatrix(as.factor(predictions_adaboost$class), test$y)
# Bar plot of the target variable
predictions_adaboost_df <- as.data.frame(table(predictions_adaboost$class))
colnames(predictions_adaboost_df) <- c("Target", "Count")
ggplot(predictions_adaboost_df, aes(x = Target, y = Count)) +
geom_bar(stat = "identity", fill="blue") +
labs(
title = "Distribution of Target Variable",
x = "Client Subscription Status (0 = Not Subscribed, 1 = Subscribed)",
y = "Count"
) +
theme_minimal()
# Evaluation
# adaboost_probs <- predict(adaboost, newdata = test, type = "prob")[, 2]
adaboost_roc <- roc(test$y, predictions_adaboost$prob[, 2], quiet = TRUE)
adaboost_auc <- auc(adaboost_roc)
adaboost_accuracy <- cf_matrix_adaboost$overall['Accuracy']
adaboost_kappa <- cf_matrix_adaboost$overall['Kappa']
adaboost_sensitivity <- cf_matrix_adaboost$byClass['Sensitivity']
adaboost_specificity <- cf_matrix_adaboost$byClass['Specificity']
adaboost_f1 <- cf_matrix_adaboost$byClass["F1"]
adaboost_precision <- cf_matrix_adaboost$byClass["Precision"]
adaboost_model_duration <- end - start
output5 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Default Model Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_adaboost)), collapse = "\n"), "\n",
"Accuracy:", round(adaboost_accuracy, 4), "| Precision:", round(adaboost_precision, 4),
"| Sensitivity:", round(adaboost_sensitivity, 4), "| Specificity:", round(adaboost_specificity, 4), "\n",
"F1 Score:", round(adaboost_f1, 4), "| Kappa:", round(adaboost_kappa, 4), "| AUC:", round(adaboost_auc, 4),
"| Model Duration:", round(adaboost_model_duration, 4), "\n\n", sep = " ")
cat(output5)
##
## === Model Selection and Evaluation ===
##
## === Default Model Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6719 448
## yes 589 480
##
## Accuracy : 0.8741
## 95% CI : (0.8667, 0.8812)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.9999
##
## Kappa : 0.4095
##
## Mcnemar's Test P-Value : 1.377e-05
##
## Sensitivity : 0.9194
## Specificity : 0.5172
## Pos Pred Value : 0.9375
## Neg Pred Value : 0.4490
## Prevalence : 0.8873
## Detection Rate : 0.8158
## Detection Prevalence : 0.8702
## Balanced Accuracy : 0.7183
##
## 'Positive' Class : no
##
## Accuracy: 0.8741 | Precision: 0.9375 | Sensitivity: 0.9194 | Specificity: 0.5172
## F1 Score: 0.9284 | Kappa: 0.4095 | AUC: 0.7702 | Model Duration: 4.0928
Results (focusing on accuracy):
For the default adaboost model, the accuracy is 87.41%. This default model did not beat the best random forest model (87.62%), but it comes extremely close.
# Extract important predictors from model
importanceplot(adaboost, las = 2)
Recommendation:
According to this model, the most important predictors the marketing team should be focusing on are campaign, age, poutcome, cons_price_idx, and month.
Objective: Discover how much accuracy improves with an increase in iterations (the number of trees grown).
The point of this model is to experiment with the mfinal hyperparameter, which is the number of boosting iterations (trees). For this experiment, instead of the default of 100, let's try 150. Giving the boosting procedure more rounds to learn from its mistakes should increase the model's accuracy.
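adabag also provides errorevol(), which traces the error as trees are added and can show whether 150 iterations are actually needed. A hedged sketch, commented out since it can only run after adaboost1 is fit below:
# evol <- errorevol(adaboost1, newdata = test)
# plot(evol$error, type = "l", xlab = "Number of trees", ylab = "Test error")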
# Experiment 1 - increase mfinal param
set.seed(125)
start <- Sys.time()
adaboost1 <- boosting(y ~ ., data = smoted_train, mfinal=150)
print(summary(adaboost1))
## Length Class Mode
## formula 3 formula call
## trees 150 -none- list
## weights 150 -none- numeric
## votes 116916 -none- numeric
## prob 116916 -none- numeric
## class 58458 -none- character
## importance 15 -none- numeric
## terms 3 terms call
## call 4 -none- call
end <- Sys.time()
# Make predictions
# When using boosting, use predict.boosting for the predictions
# Note: must convert the output of predict.boosting to a factor to match the test dataset and avoid errors
predictions_adaboost1 <- predict.boosting(adaboost1, newdata = test)
cf_matrix_adaboost1 <- confusionMatrix(as.factor(predictions_adaboost1$class), test$y)
# Bar plot of the target variable
predictions_adaboost1_df <- as.data.frame(table(predictions_adaboost1$class))
colnames(predictions_adaboost1_df) <- c("Target", "Count")
ggplot(predictions_adaboost1_df, aes(x = Target, y = Count)) +
geom_bar(stat = "identity", fill="blue") +
labs(
title = "Distribution of Target Variable",
x = "Client Subscription Status (0 = Not Subscribed, 1 = Subscribed)",
y = "Count"
) +
theme_minimal()
# Evaluation
adaboost1_roc <- roc(test$y, predictions_adaboost1$prob[, 2], quiet = TRUE)
adaboost1_auc <- auc(adaboost1_roc)
adaboost1_accuracy <- cf_matrix_adaboost1$overall['Accuracy']
adaboost1_kappa <- cf_matrix_adaboost1$overall['Kappa']
adaboost1_sensitivity <- cf_matrix_adaboost1$byClass['Sensitivity']
adaboost1_specificity <- cf_matrix_adaboost1$byClass['Specificity']
adaboost1_f1 <- cf_matrix_adaboost1$byClass["F1"]
adaboost1_precision <- cf_matrix_adaboost1$byClass["Precision"]
adaboost1_model_duration <- end - start
output6 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 1 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_adaboost1)), collapse = "\n"), "\n",
"Accuracy:", round(adaboost1_accuracy, 4), "| Precision:", round(adaboost1_precision, 4),
"| Sensitivity:", round(adaboost1_sensitivity, 4), "| Specificity:", round(adaboost1_specificity, 4), "\n",
"F1 Score:", round(adaboost1_f1, 4), "| Kappa:", round(adaboost1_kappa, 4), "| AUC:", round(adaboost1_auc, 4),
"| Model Duration:", round(adaboost1_model_duration, 4), "\n\n", sep = " ")
cat(output6)
##
## === Model Selection and Evaluation ===
##
## === Model 1 Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6738 452
## yes 570 476
##
## Accuracy : 0.8759
## 95% CI : (0.8686, 0.883)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.9994247
##
## Kappa : 0.4121
##
## Mcnemar's Test P-Value : 0.0002524
##
## Sensitivity : 0.9220
## Specificity : 0.5129
## Pos Pred Value : 0.9371
## Neg Pred Value : 0.4551
## Prevalence : 0.8873
## Detection Rate : 0.8181
## Detection Prevalence : 0.8730
## Balanced Accuracy : 0.7175
##
## 'Positive' Class : no
##
## Accuracy: 0.8759 | Precision: 0.9371 | Sensitivity: 0.922 | Specificity: 0.5129
## F1 Score: 0.9295 | Kappa: 0.4121 | AUC: 0.7712 | Model Duration: 6.6524
Results (focusing on accuracy):
For this adaboost model, the accuracy is higher than the default at 87.59%. Although it gives the highest accuracy, it was also the slowest of the adaboost runs at roughly 6.65 mins, so there is a real cost for this level of accuracy.
# Extract important predictors from model
importanceplot(adaboost1, las = 2)
Recommendation:
According to this model, the most important predictors the marketing team should be focusing on are campaign, age, poutcome, cons_price_idx, and month (the same as the default model).
Objective: Discover if the model’s accuracy decreases with less training data and larger testing data.
The point of this model is to experiment with how much training data adaboost actually needs. Adaboost should be the strongest model type here, so for this experiment I want to see whether we can keep a great accuracy while using less training data. After several adaboost variants took too long to even finish running, cutting the amount of data needed would be a welcome speed-up. (I wanted to test params such as coeflearn, which controls each learner's contribution, but that model never finished running.)
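For context, smoted_train1 and test1 refer to the smaller split prepared earlier; the confusion-matrix totals below are consistent with a 60/40 split instead of the original 80/20. A purely illustrative, hedged sketch of how such a split could be built with caret (the object names and the df placeholder are assumptions, not the document's actual code):
# set.seed(125)
# idx1   <- createDataPartition(df$y, p = 0.6, list = FALSE)
# train1 <- df[idx1, ]   # 60% for training
# test1  <- df[-idx1, ]  # 40% for testing
# smoted_train1 would then come from applying the same SMOTE step to train1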
# Experiment 2 - try using smaller training dataset
set.seed(125)
start <- Sys.time()
adaboost2 <- boosting(y ~ ., data = smoted_train1)
print(summary(adaboost2))
## Length Class Mode
## formula 3 formula call
## trees 100 -none- list
## weights 100 -none- numeric
## votes 87688 -none- numeric
## prob 87688 -none- numeric
## class 43844 -none- character
## importance 15 -none- numeric
## terms 3 terms call
## call 3 -none- call
end <- Sys.time()
# Make predictions
# When using boosting, use predict.boosting for the predictions
# Note: must convert the output of predict.boosting to a factor to match the test dataset and avoid errors
predictions_adaboost2 <- predict.boosting(adaboost2, newdata = test1)
cf_matrix_adaboost2 <- confusionMatrix(as.factor(predictions_adaboost2$class), test1$y)
# Bar plot of the target variable
predictions_adaboost2_df <- as.data.frame(table(predictions_adaboost2$class))
colnames(predictions_adaboost2_df) <- c("Target", "Count")
ggplot(predictions_adaboost2_df, aes(x = Target, y = Count)) +
geom_bar(stat = "identity", fill="blue") +
labs(
title = "Distribution of Target Variable",
x = "Client Subscription Status (0 = Not Subscribed, 1 = Subscribed)",
y = "Count"
) +
theme_minimal()
# Evaluation
adaboost2_roc <- roc(test1$y, predictions_adaboost2$prob[, 2], quiet = TRUE)
adaboost2_auc <- auc(adaboost2_roc)
adaboost2_accuracy <- cf_matrix_adaboost2$overall['Accuracy']
adaboost2_kappa <- cf_matrix_adaboost2$overall['Kappa']
adaboost2_sensitivity <- cf_matrix_adaboost2$byClass['Sensitivity']
adaboost2_specificity <- cf_matrix_adaboost2$byClass['Specificity']
adaboost2_f1 <- cf_matrix_adaboost2$byClass["F1"]
adaboost2_precision <- cf_matrix_adaboost2$byClass["Precision"]
adaboost2_model_duration <- end - start
output7 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 2 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_adaboost2)), collapse = "\n"), "\n",
"Accuracy:", round(adaboost2_accuracy, 4), "| Precision:", round(adaboost2_precision, 4),
"| Sensitivity:", round(adaboost2_sensitivity, 4), "| Specificity:", round(adaboost2_specificity, 4), "\n",
"F1 Score:", round(adaboost2_f1, 4), "| Kappa:", round(adaboost2_kappa, 4), "| AUC:", round(adaboost2_auc, 4),
"| Model Duration:", round(adaboost2_model_duration, 4), "\n\n", sep = " ")
cat(output7)
##
## === Model Selection and Evaluation ===
##
## === Model 2 Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 13428 887
## yes 1187 969
##
## Accuracy : 0.8741
## 95% CI : (0.8689, 0.8791)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4118
##
## Mcnemar's Test P-Value : 5.186e-11
##
## Sensitivity : 0.9188
## Specificity : 0.5221
## Pos Pred Value : 0.9380
## Neg Pred Value : 0.4494
## Prevalence : 0.8873
## Detection Rate : 0.8153
## Detection Prevalence : 0.8691
## Balanced Accuracy : 0.7204
##
## 'Positive' Class : no
##
## Accuracy: 0.8741 | Precision: 0.938 | Sensitivity: 0.9188 | Specificity: 0.5221
## F1 Score: 0.9283 | Kappa: 0.4118 | AUC: 0.7839 | Model Duration: 2.5773
Results (focusing on accuracy):
For this adaboost model, the accuracy is essentially identical to the default at 87.41%. This tells us that using 20% less training data did not hurt model performance, and a very good accuracy score could likely still be reached by tuning this smaller, faster model. A big plus, unlike some of the variants I attempted, is that this one actually finished running.
# Extract important predictors from model
importanceplot(adaboost2, las = 2)
Recommendation:
According to this model, the most important predictors the marketing team should be focusing on are campaign, age, poutcome, month, and cons_price_idx (the same set as the default model, but with month ranked above cons_price_idx).
# Create experiments results table
adaboost_experiments <- data.frame(
Model = c("Default Model", "Model 1", "Model 2"),
Accuracy = c(adaboost_accuracy, adaboost1_accuracy, adaboost2_accuracy),
Error_Rate = c(1-adaboost_accuracy, 1-adaboost1_accuracy, 1-adaboost2_accuracy),
Precision = c(adaboost_precision, adaboost1_precision, adaboost2_precision),
Sensitivity = c(adaboost_sensitivity, adaboost1_sensitivity, adaboost2_sensitivity),
Specificity = c(adaboost_specificity, adaboost1_specificity, adaboost2_specificity),
F1_Score = c(adaboost_f1, adaboost1_f1, adaboost2_f1),
Kappa = c(adaboost_kappa, adaboost1_kappa, adaboost2_kappa),
AUC = c(adaboost_auc, adaboost1_auc, adaboost2_auc),
Duration = c(adaboost_model_duration, adaboost1_model_duration, adaboost2_model_duration)
)
print(adaboost_experiments)
## Model Accuracy Error_Rate Precision Sensitivity Specificity
## 1 Default Model 0.8740894 0.1259106 0.9374913 0.9194034 0.5172414
## 2 Model 1 0.8759106 0.1240894 0.9371349 0.9220033 0.5129310
## 3 Model 2 0.8740817 0.1259183 0.9380370 0.9187821 0.5220905
## F1_Score Kappa AUC Duration
## 1 0.9283592 0.4094865 0.7701523 4.092809 mins
## 2 0.9295075 0.4120631 0.7712281 6.652441 mins
## 3 0.9283097 0.4118166 0.7839215 2.577312 mins
# Plot ROC curves
plot(adaboost_roc, col = "blue", main = "ROC Curves Comparison")
plot(adaboost1_roc, col = "red", add = TRUE)
plot(adaboost2_roc, col = "green", add = TRUE)
legend("bottomright",
legend = paste0(c("Default Model", "Model 1", "Model 2"), " (AUC=",
round(c(adaboost_auc, adaboost1_auc, adaboost2_auc), 3), ")"),
col = c("blue", "red", "green"),
lwd = 2)
Overall, Model 1 performed the best on accuracy, but not by much. This suggests that more iterations do improve a model's performance, since it gets more rounds to learn from. It was not the best across the board, though: it had the smallest precision and specificity of the three, and it took the longest to run at 6.65 mins. Model 2 had essentially the same accuracy as the default model and had the highest precision, specificity, and AUC. Perhaps most importantly, its runtime was roughly 61% lower than Model 1's, a major difference, and lower than the default model's as well. Because of this, Model 2 has to be the winner of the adaboost models.
Final results table:
all_experiments <- data.frame(
Model = c("DT Default Model", "DT Model 1", "DT Model 2", "RF Default Model", "RF Model 1", "RF Model 2", "AB Default Model", "AB Model 1", "AB Model 2"),
Accuracy = c(decision_tree_accuracy, decision_tree1_accuracy, decision_tree2_accuracy, rf_accuracy, rf1_accuracy, rf2_accuracy, adaboost_accuracy, adaboost1_accuracy, adaboost2_accuracy),
Error_Rate = c(1-decision_tree_accuracy, 1-decision_tree1_accuracy, 1-decision_tree2_accuracy, 1-rf_accuracy, 1-rf1_accuracy, 1-rf2_accuracy, 1-adaboost_accuracy, 1-adaboost1_accuracy, 1-adaboost2_accuracy),
Precision = c(decision_tree_precision, decision_tree1_precision, decision_tree2_precision, rf_precision, rf1_precision, rf2_precision, adaboost_precision, adaboost1_precision, adaboost2_precision),
Sensitivity = c(decision_tree_sensitivity, decision_tree1_sensitivity, decision_tree2_sensitivity, rf_sensitivity, rf1_sensitivity, rf2_sensitivity, adaboost_sensitivity, adaboost1_sensitivity, adaboost2_sensitivity),
Specificity = c(decision_tree_specificity, decision_tree1_specificity, decision_tree2_specificity, rf_specificity, rf1_specificity, rf2_specificity, adaboost_specificity, adaboost1_specificity, adaboost2_specificity),
F1_Score = c(decision_tree_f1, decision_tree1_f1, decision_tree2_f1, rf_f1, rf1_f1, rf2_f1, adaboost_f1, adaboost1_f1, adaboost2_f1),
Kappa = c(decision_tree_kappa, decision_tree1_kappa, decision_tree2_kappa, rf_kappa, rf1_kappa, rf2_kappa, adaboost_kappa, adaboost1_kappa, adaboost2_kappa),
AUC = c(decision_tree_auc, decision_tree1_auc, decision_tree2_auc, rf_auc, rf1_auc, rf2_auc, adaboost_auc, adaboost1_auc, adaboost2_auc),
Duration = c(decision_tree_model_duration, decision_tree1_model_duration, decision_tree2_model_duration, rf_model_duration, rf1_model_duration, rf2_model_duration, adaboost_model_duration, adaboost1_model_duration, adaboost2_model_duration)
)
print(all_experiments)
## Model Accuracy Error_Rate Precision Sensitivity Specificity
## 1 DT Default Model 0.8362069 0.1637931 0.9430483 0.8678161 0.5872845
## 2 DT Model 1 0.8355998 0.1644002 0.9415604 0.8686371 0.5754310
## 3 DT Model 2 0.8362069 0.1637931 0.9430483 0.8678161 0.5872845
## 4 RF Default Model 0.8751821 0.1248179 0.9296661 0.9296661 0.4461207
## 5 RF Model 1 0.8674114 0.1325886 0.9202271 0.9313082 0.3642241
## 6 RF Model 2 0.8761535 0.1238465 0.9304491 0.9299398 0.4525862
## 7 AB Default Model 0.8740894 0.1259106 0.9374913 0.9194034 0.5172414
## 8 AB Model 1 0.8759106 0.1240894 0.9371349 0.9220033 0.5129310
## 9 AB Model 2 0.8740817 0.1259183 0.9380370 0.9187821 0.5220905
## F1_Score Kappa AUC Duration
## 1 0.9038695 0.3571577 0.7275503 1.0857201 secs
## 2 0.9036299 0.3507004 0.7220341 0.5946162 secs
## 3 0.9038695 0.3571577 0.6164905 0.9451001 secs
## 4 0.9296661 0.3757868 0.7676071 33.4794230 secs
## 5 0.9257345 0.3082935 0.7396810 66.2988660 secs
## 6 0.9301944 0.3818076 0.7678411 657.8564739 secs
## 7 0.9283592 0.4094865 0.7701523 245.5685310 secs
## 8 0.9295075 0.4120631 0.7712281 399.1464460 secs
## 9 0.9283097 0.4118166 0.7839215 154.6387150 secs
Overall, all of the models produced fairly good accuracy scores, but none of them could get above 88% accuracy. The model with the highest accuracy is Random Forest Model 2 (87.62%). Although it did not lead on every other metric, its numbers were consistently high; for metrics such as AUC it is very close to the best value. Its main cost is runtime: at roughly 11 minutes it was the slowest model trained, slower even than the adaboost models. Still, given its top accuracy, the chosen model would have to be Random Forest Model 2.
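The same pick can be read straight off the final results table:
# Select the winning model programmatically from the results table
best <- all_experiments[which.max(all_experiments$Accuracy), ]
print(best)  # RF Model 2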