A Portuguese bank conducted a marketing campaign (phone calls) to predict whether a client would subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and identify the most effective tactics that will help the bank persuade more customers to subscribe to a term deposit in the next campaign.
Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing
NOTE: The experiments section is in part 6
For this assignment, I chose the full dataset,
bank-additional-full.csv, which has all observations and
features. I would prefer to start with all the data, then narrow it
down after performing my EDA and algorithm selection. I wouldn't want to
disregard features or observations before looking at them and knowing
whether they're important.
Input variables:

Bank client data:

1 - age (numeric)
2 - job: type of job (categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services")
3 - marital: marital status (categorical: "married", "divorced", "single", "unknown"; note: "divorced" means divorced or widowed)
4 - education (categorical: "basic.4y", "basic.6y", "basic.9y", "high.school", "illiterate", "professional.course", "university.degree", "unknown")
5 - default: has credit in default? (categorical: "no", "unknown", "yes")
6 - housing: has housing loan? (categorical: "no", "unknown", "yes")
7 - loan: has personal loan? (categorical: "no", "unknown", "yes")

Related to the last contact of the current campaign:

8 - contact: contact communication type (categorical: "cellular", "telephone")
9 - day_of_week: last contact day of the week (categorical: "mon", "tue", "wed", "thu", "fri")
10 - month: last contact month of year (categorical: "mar", "apr", ..., "nov", "dec"; "jan" and "feb" do not occur in this dataset)
11 - duration: last contact duration, in seconds (numeric); note that duration is not known before a call is performed, so it should be discarded for a realistic predictive model

Other attributes:

12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: "failure", "nonexistent", "success")

Social and economic context attributes:

16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Output variable (desired target):

21 - y: has the client subscribed a term deposit? (binary: "yes", "no")
#Import Libraries
library(readr) # to use the read_csv function
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(e1071) # For skewness function
##
## Attaching package: 'e1071'
##
## The following object is masked from 'package:ggplot2':
##
## element
library(corrplot)
## corrplot 0.95 loaded
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(ROSE)
## Loaded ROSE 0.0-4
library(smotefamily)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(GGally)
library(gmodels)
library(correlationfunnel)
## ══ correlationfunnel Tip #2 ════════════════════════════════════════════════════
## Clean your NA's prior to using `binarize()`.
## Missing values and cleaning data are critical to getting great correlations. :)
library(DataExplorer)
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom 1.0.10 ✔ rsample 1.3.1
## ✔ dials 1.4.2 ✔ tailor 0.1.0
## ✔ infer 1.0.9 ✔ tune 2.0.1
## ✔ modeldata 1.5.1 ✔ workflows 1.3.0
## ✔ parsnip 1.3.3 ✔ workflowsets 1.1.1
## ✔ recipes 1.3.1 ✔ yardstick 1.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ rsample::calibration() masks caret::calibration()
## ✖ scales::discard() masks purrr::discard()
## ✖ e1071::element() masks ggplot2::element()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ caret::lift() masks purrr::lift()
## ✖ rsample::permutations() masks e1071::permutations()
## ✖ yardstick::precision() masks caret::precision()
## ✖ yardstick::recall() masks caret::recall()
## ✖ car::recode() masks dplyr::recode()
## ✖ MASS::select() masks dplyr::select()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ car::some() masks purrr::some()
## ✖ yardstick::spec() masks readr::spec()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step() masks stats::step()
## ✖ tune::tune() masks parsnip::tune(), e1071::tune()
library(themis)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(vcd)
## Loading required package: grid
library(rpart)
##
## Attaching package: 'rpart'
##
## The following object is masked from 'package:dials':
##
## prune
library(rpart.plot)
options(rgl.useNULL=TRUE)
library(adabag)
## Loading required package: foreach
##
## Attaching package: 'foreach'
##
## The following objects are masked from 'package:purrr':
##
## accumulate, when
##
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following object is masked from 'package:gmodels':
##
## ci
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(vip)
##
## Attaching package: 'vip'
##
## The following object is masked from 'package:utils':
##
## vi
#library(fastAdaboost)
# Full dataset
data_raw <- read.csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-608/refs/heads/main/bank-additional-full.csv", sep = ";")
# Rename columns
names(data_raw)[names(data_raw) == "emp.var.rate"] <- "emp_var_rate"
names(data_raw)[names(data_raw) == "cons.price.idx"] <- "cons_price_idx"
names(data_raw)[names(data_raw) == "cons.conf.idx"] <- "cons_conf_idx"
names(data_raw)[names(data_raw) == "nr.employed"] <- "nr_employed"
# Remove duplicates
data_raw <- unique(data_raw)
str(data_raw)
## 'data.frame': 41176 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : chr "housemaid" "services" "services" "admin." ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
## $ default : chr "no" "unknown" "no" "no" ...
## $ housing : chr "no" "no" "yes" "no" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "telephone" "telephone" "telephone" "telephone" ...
## $ month : chr "may" "may" "may" "may" ...
## $ day_of_week : chr "mon" "mon" "mon" "mon" ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp_var_rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons_price_idx: num 94 94 94 94 94 ...
## $ cons_conf_idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr_employed : num 5191 5191 5191 5191 5191 ...
## $ y : chr "no" "no" "no" "no" ...
EDA was done in assignment 1, so I won't include it here.
This section is from assignment 1, altered slightly because tree algorithms are being used for this assignment.
To use the algorithms, let's convert the categorical variables to the
factor data type:
# Convert categorical columns to factors, which the modeling functions accept
data_raw_clean <- data_raw
cat_cols <- c("job", "marital", "education", "default", "housing", "loan",
              "contact", "month", "day_of_week", "poutcome", "y")
data_raw_clean[cat_cols] <- lapply(data_raw_clean[cat_cols], as.factor)
Since over 96% of pdays is inapplicable (the placeholder value
999, meaning the client was not previously contacted), let's treat
this as NA data and remove the column to simplify the model.
A nearly constant column adds little predictive value while hurting efficiency.
Additionally, as the dataset documentation suggests, duration should
be discarded if the intention is to have a realistic predictive model
(its value is not known before a call is made), so let's remove that as well.
# Remove variable pdays (96% inapplicable) and duration
data_raw_clean2 <- data_raw_clean |>
dplyr::select(-c(pdays, duration))
Address Multicollinearity:
Logistic regression cannot handle highly correlated predictors, so
let's remove the ones we found in the EDA with absolute
correlations > 0.4: euribor3m, emp_var_rate, and
nr_employed. This will also help reduce the amount of data in the
model.
# Remove `euribor3m`, `emp_var_rate`, and `nr_employed` (absolute value correlations > 0.4)
data_raw_clean3 <- data_raw_clean2 |>
dplyr::select(-c(euribor3m, emp_var_rate, nr_employed))
Although I combined variables in the dataset
data_raw_combined for EDA, I would probably start out my
model with just data_raw_clean3. After creating the model,
I would then test data_raw_combined (after applying the
same changes I made to data_raw_clean3 above) to see if it
would improve the model’s performance.
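As a sketch (assuming data_raw_combined from assignment 1 were loaded in this session), applying the same changes would look like:
# Sketch: apply the same column drops to the combined dataset from assignment 1
# (assumes `data_raw_combined` exists in the session; not run here)
# data_raw_combined_clean <- data_raw_combined |>
#   dplyr::select(-c(pdays, duration, euribor3m, emp_var_rate, nr_employed))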
To ensure objective model evaluation and prevent overfitting, let’s split each dataset into training (80%) and testing (20%) sets.
# Split data into train and test
set.seed(123)
split <- initial_split(data_raw_clean3, prop = 0.8, strata = y)
train <- split |>
training()
test <- split |>
testing()
# Try an alternative split: 60% training, 40% test
set.seed(123)
split1 <- initial_split(data_raw_clean3, prop = 0.6, strata = y)
train1 <- split1 |>
training()
test1 <- split1 |>
testing()
Standardization is not necessary for decision trees, but since there is a difference in spreads among my predictors (as shown in the EDA), standardizing the data might still be useful. This would be especially important if I ever decided to use regularization.
# Standardize the training and test sets (using training standardization)
set.seed(123)
preproc_params <- preProcess(train, method = c("center", "scale"))
# Apply standardization from the train set
train_standardized <- predict(preproc_params, train)
test_standardized <- predict(preproc_params, test)
As the EDA showed, this dataset is heavily imbalanced which can cause bias. To resolve this, let’s use SMOTE (Synthetic Minority Over-sampling Technique):
set.seed(123)
# Only apply SMOTE to training set
smoted_train_standardized <- smotenc(train_standardized, var = "y", over_ratio = 1)
# Do the same for original training set
smoted_train <- smotenc(train, var = "y", over_ratio = 1)
smoted_train1 <- smotenc(train1, var = "y", over_ratio = 1)
# Distribution of target variable
smoted_train_standardized |>
dplyr::select(y) |>
ggplot() +
aes(x = y) +
geom_bar(fill = "blue", color = "black") +
labs(title = "Distribution of y", y = "Count") +
theme_minimal()
We can now see that the distribution is even for our target variable,
y.
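As a quick numeric check on top of the plot, the class counts can be printed directly:
# Numeric check of the class balance after SMOTE
table(smoted_train_standardized$y)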
The final standardized training set is
smoted_train_standardized and the corresponding test set is
test_standardized.
Since there is no variation in this model, as it uses the default parameters and the original datasets, we won't count this as an actual experiment. We will only use it for comparison purposes and for learning about the default parameters.
Metric: Accuracy
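For reference, accuracy is the share of correct predictions over all predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP, and FN are the true positives, true negatives, false positives, and false negatives from the confusion matrix.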
From the R documentation, here are the parameters for this model type:
minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted.

minbucket: the minimum number of observations in any terminal node.

cp: complexity parameter. Any split that does not decrease the overall lack of fit by a factor of cp is not attempted. For instance, with anova splitting, this means that the overall R-squared must increase by cp at each step. The main role of this parameter is to save computing time by pruning off splits that are obviously not worthwhile. Essentially, the user informs the program that any split which does not improve the fit by cp will likely be pruned off by cross-validation, and that hence the program need not pursue it.

maxcompete: the number of competitor splits retained in the output. It is useful to know not just which split was chosen, but which variable came in second, third, etc.

maxsurrogate: the number of surrogate splits retained in the output. If this is set to zero the compute time will be reduced, since approximately half of the computational time (other than setup) is used in the search for surrogate splits.

usesurrogate: how to use surrogates in the splitting process. 0 means display only; an observation with a missing value for the primary split rule is not sent further down the tree. 1 means use surrogates, in order, to split subjects missing the primary variable; if all surrogates are missing the observation is not split. For value 2, if all surrogates are missing, then send the observation in the majority direction. A value of 0 corresponds to the action of tree, and 2 to the recommendations of Breiman et al. (1984).

xval: number of cross-validations.

surrogatestyle: controls the selection of a best surrogate. If set to 0 (default) the program uses the total number of correct classifications for a potential surrogate variable; if set to 1 it uses the percent correct, calculated over the non-missing values of the surrogate. The first option more severely penalizes covariates with a large number of missing values.

maxdepth: the maximum depth of any node of the final tree, with the root node counted as depth 0. Maximum value 30 (as node numbers have to be stored in 32-bit signed integer types).

The evaluation metric we will focus on is accuracy. These parameters can all be set explicitly through rpart.control, as sketched below.
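As a sketch (not run here; the values are rpart's documented defaults), the upcoming default fit is equivalent to passing these settings explicitly:
# Sketch: the default fit with rpart's documented defaults written out
# decision_tree_default <- rpart(
#   y ~ ., data = smoted_train, method = "class",
#   control = rpart.control(minsplit = 20, minbucket = round(20 / 3),
#                           cp = 0.01, maxcompete = 4, maxsurrogate = 5,
#                           usesurrogate = 2, xval = 10, surrogatestyle = 0,
#                           maxdepth = 30)
# )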
# Default model
# minsplit=20
# minbucket=round(minsplit/3)
# cp=0.01
# maxdepth=30
# xval=10
# maxcompete=4
# maxsurrogate=5
# usesurrogate=2
set.seed(125)
start <- Sys.time()
decision_tree <- rpart(y ~ ., data = smoted_train, method = "class")
print(summary(decision_tree))
## Call:
## rpart(formula = y ~ ., data = smoted_train, method = "class")
## n= 58458
##
## CP nsplit rel error xerror xstd
## 1 0.18040302 0 1.0000000 1.0074926 0.004135860
## 2 0.14690889 1 0.8195970 0.8195970 0.004068116
## 3 0.06455917 2 0.6726881 0.6722433 0.003907514
## 4 0.03907079 3 0.6081289 0.6081289 0.003805181
## 5 0.02035650 4 0.5690581 0.5719662 0.003737941
## 6 0.01800723 6 0.5283451 0.5312874 0.003653518
## 7 0.01293236 9 0.4743234 0.4772999 0.003525989
## 8 0.01000000 11 0.4484587 0.4514694 0.003458216
##
## Variable importance
## month cons_price_idx cons_conf_idx poutcome campaign
## 21 18 18 17 13
## previous contact age job education
## 7 3 2 1 1
##
## Node number 1: 58458 observations, complexity param=0.180403
## predicted class=no expected loss=0.5 P(node) =1
## class counts: 29229 29229
## probabilities: 0.500 0.500
## left son=2 (52419 obs) right son=3 (6039 obs)
## Primary splits:
## poutcome splits as LLR, improve=2567.295, (0 missing)
## contact splits as RL, improve=2391.529, (0 missing)
## month splits as RLRLLRLLRR, improve=2282.421, (0 missing)
## cons_conf_idx < -36.09886 to the left, improve=2165.345, (0 missing)
## previous < 0.0004073881 to the left, improve=1978.245, (0 missing)
## Surrogate splits:
## previous < 1.001562 to the left, agree=0.926, adj=0.287, (0 split)
## cons_price_idx < 94.60799 to the left, agree=0.897, adj=0.003, (0 split)
## cons_conf_idx < -50.7846 to the right, agree=0.897, adj=0.002, (0 split)
## age < 18.7708 to the right, agree=0.897, adj=0.001, (0 split)
##
## Node number 2: 52419 observations, complexity param=0.1469089
## predicted class=no expected loss=0.4497034 P(node) =0.8966951
## class counts: 28846 23573
## probabilities: 0.550 0.450
## left son=4 (43283 obs) right son=5 (9136 obs)
## Primary splits:
## month splits as RLRLLRLLRR, improve=1801.2090, (0 missing)
## cons_price_idx < 93.19988 to the right, improve=1793.0170, (0 missing)
## contact splits as RL, improve=1651.3490, (0 missing)
## cons_conf_idx < -36.09886 to the left, improve=1557.9710, (0 missing)
## default splits as RLL, improve= 894.2134, (0 missing)
## Surrogate splits:
## cons_conf_idx < -46.68129 to the right, agree=0.919, adj=0.537, (0 split)
## age < 60.99529 to the left, agree=0.829, adj=0.017, (0 split)
## cons_price_idx < 94.5206 to the left, agree=0.827, adj=0.009, (0 split)
##
## Node number 3: 6039 observations
## predicted class=yes expected loss=0.0634211 P(node) =0.1033049
## class counts: 383 5656
## probabilities: 0.063 0.937
##
## Node number 4: 43283 observations, complexity param=0.06455917
## predicted class=no expected loss=0.3894832 P(node) =0.7404119
## class counts: 26425 16858
## probabilities: 0.611 0.389
## left son=8 (31312 obs) right son=9 (11971 obs)
## Primary splits:
## cons_price_idx < 93.19988 to the right, improve=1186.3600, (0 missing)
## cons_conf_idx < -36.09674 to the left, improve=1107.2040, (0 missing)
## contact splits as RL, improve=1079.9350, (0 missing)
## default splits as RLL, improve= 442.0817, (0 missing)
## age < 60.00359 to the left, improve= 381.0203, (0 missing)
## Surrogate splits:
## cons_conf_idx < -44.15814 to the right, agree=0.885, adj=0.583, (0 split)
## previous < 0.00118432 to the left, agree=0.761, adj=0.136, (0 split)
## poutcome splits as RL-, agree=0.758, adj=0.126, (0 split)
## job splits as LLLLLLLLRLLL, agree=0.729, adj=0.019, (0 split)
## age < 60.53519 to the left, agree=0.727, adj=0.015, (0 split)
##
## Node number 5: 9136 observations
## predicted class=yes expected loss=0.2649956 P(node) =0.1562831
## class counts: 2421 6715
## probabilities: 0.265 0.735
##
## Node number 8: 31312 observations, complexity param=0.0203565
## predicted class=no expected loss=0.3170989 P(node) =0.5356324
## class counts: 21383 9929
## probabilities: 0.683 0.317
## left son=16 (13637 obs) right son=17 (17675 obs)
## Primary splits:
## contact splits as RL, improve=434.5561, (0 missing)
## campaign < 1.000115 to the left, improve=347.2238, (0 missing)
## cons_conf_idx < -36.40503 to the right, improve=317.5052, (0 missing)
## age < 60.00359 to the left, improve=256.2774, (0 missing)
## previous < 1.000371 to the left, improve=238.8876, (0 missing)
## Surrogate splits:
## month splits as -R-RL-LR--, agree=0.928, adj=0.835, (0 split)
## cons_price_idx < 93.99106 to the right, agree=0.902, adj=0.775, (0 split)
## cons_conf_idx < -41.80137 to the right, agree=0.726, adj=0.371, (0 split)
## job splits as RLRRRRRLRRLL, agree=0.604, adj=0.092, (0 split)
## education splits as LLLRRRRL, agree=0.589, adj=0.057, (0 split)
##
## Node number 9: 11971 observations, complexity param=0.03907079
## predicted class=yes expected loss=0.4211845 P(node) =0.2047795
## class counts: 5042 6929
## probabilities: 0.421 0.579
## left son=18 (7082 obs) right son=19 (4889 obs)
## Primary splits:
## cons_conf_idx < -46.19657 to the left, improve=881.6659, (0 missing)
## month splits as -R-RR-LR--, improve=775.1438, (0 missing)
## cons_price_idx < 92.89325 to the left, improve=416.1322, (0 missing)
## job splits as RLLRRRLLRRRL, improve=204.9930, (0 missing)
## campaign < 4.997728 to the right, improve=201.7275, (0 missing)
## Surrogate splits:
## month splits as -R-RR-LR--, agree=0.974, adj=0.936, (0 split)
## cons_price_idx < 92.88986 to the right, agree=0.811, adj=0.536, (0 split)
## job splits as RLLRLRLLRLRR, agree=0.672, adj=0.196, (0 split)
## education splits as LLLLRLRR, agree=0.646, adj=0.133, (0 split)
## age < 55.01647 to the left, agree=0.638, adj=0.113, (0 split)
##
## Node number 16: 13637 observations
## predicted class=no expected loss=0.222263 P(node) =0.2332786
## class counts: 10606 3031
## probabilities: 0.778 0.222
##
## Node number 17: 17675 observations, complexity param=0.0203565
## predicted class=no expected loss=0.3902687 P(node) =0.3023538
## class counts: 10777 6898
## probabilities: 0.610 0.390
## left son=34 (16113 obs) right son=35 (1562 obs)
## Primary splits:
## cons_price_idx < 93.91801 to the left, improve=824.9796, (0 missing)
## month splits as -L-LR-RL--, improve=396.8921, (0 missing)
## campaign < 1.000115 to the left, improve=256.5630, (0 missing)
## cons_conf_idx < -42.71392 to the right, improve=222.1548, (0 missing)
## age < 60.00524 to the left, improve=184.8284, (0 missing)
## Surrogate splits:
## cons_conf_idx < -46.28575 to the right, agree=0.923, adj=0.125, (0 split)
## previous < 1.041097 to the left, agree=0.922, adj=0.117, (0 split)
## age < 60.99399 to the left, agree=0.922, adj=0.113, (0 split)
## month splits as -L-LR-LL--, agree=0.914, adj=0.031, (0 split)
##
## Node number 18: 7082 observations
## predicted class=no expected loss=0.4193731 P(node) =0.1211468
## class counts: 4112 2970
## probabilities: 0.581 0.419
##
## Node number 19: 4889 observations
## predicted class=yes expected loss=0.1902229 P(node) =0.08363269
## class counts: 930 3959
## probabilities: 0.190 0.810
##
## Node number 34: 16113 observations, complexity param=0.01800723
## predicted class=no expected loss=0.3427046 P(node) =0.2756338
## class counts: 10591 5522
## probabilities: 0.657 0.343
## left son=68 (15595 obs) right son=69 (518 obs)
## Primary splits:
## month splits as -L-LR-RL--, improve=348.29280, (0 missing)
## campaign < 1.000631 to the left, improve=222.63500, (0 missing)
## cons_conf_idx < -42.71681 to the right, improve=150.22440, (0 missing)
## cons_price_idx < 93.44448 to the left, improve= 86.71672, (0 missing)
## age < 43.99796 to the right, improve= 66.11546, (0 missing)
## Surrogate splits:
## previous < 1.017587 to the left, agree=0.971, adj=0.106, (0 split)
## age < 60.91176 to the left, agree=0.969, adj=0.044, (0 split)
##
## Node number 35: 1562 observations
## predicted class=yes expected loss=0.1190781 P(node) =0.02672004
## class counts: 186 1376
## probabilities: 0.119 0.881
##
## Node number 68: 15595 observations, complexity param=0.01800723
## predicted class=no expected loss=0.3237576 P(node) =0.2667727
## class counts: 10546 5049
## probabilities: 0.676 0.324
## left son=136 (5560 obs) right son=137 (10035 obs)
## Primary splits:
## campaign < 1.000631 to the left, improve=214.25650, (0 missing)
## cons_conf_idx < -36.09546 to the left, improve=138.52350, (0 missing)
## cons_price_idx < 93.44448 to the left, improve= 65.02158, (0 missing)
## age < 43.99796 to the right, improve= 61.36487, (0 missing)
## job splits as RRRLRRRRRRLL, improve= 52.25816, (0 missing)
## Surrogate splits:
## marital splits as RRRL, agree=0.644, adj=0.001, (0 split)
## cons_conf_idx < -45.88844 to the left, agree=0.644, adj=0.001, (0 split)
## education splits as RRRRLRRR, agree=0.644, adj=0.000, (0 split)
## default splits as RRL, agree=0.644, adj=0.000, (0 split)
##
## Node number 69: 518 observations
## predicted class=yes expected loss=0.08687259 P(node) =0.008861063
## class counts: 45 473
## probabilities: 0.087 0.913
##
## Node number 136: 5560 observations
## predicted class=no expected loss=0.2124101 P(node) =0.09511102
## class counts: 4379 1181
## probabilities: 0.788 0.212
##
## Node number 137: 10035 observations, complexity param=0.01800723
## predicted class=no expected loss=0.3854509 P(node) =0.1716617
## class counts: 6167 3868
## probabilities: 0.615 0.385
## left son=274 (8884 obs) right son=275 (1151 obs)
## Primary splits:
## campaign < 1.999861 to the right, improve=982.03570, (0 missing)
## cons_conf_idx < -36.09546 to the left, improve= 87.87123, (0 missing)
## job splits as RRLLRRRRRRLL, improve= 57.45433, (0 missing)
## age < 42.99197 to the right, improve= 53.21787, (0 missing)
## cons_price_idx < 93.44448 to the left, improve= 40.59531, (0 missing)
## Surrogate splits:
## age < 73.95204 to the left, agree=0.886, adj=0.009, (0 split)
## cons_conf_idx < -34.4013 to the left, agree=0.886, adj=0.003, (0 split)
## previous < 1.373351 to the left, agree=0.886, adj=0.002, (0 split)
##
## Node number 274: 8884 observations, complexity param=0.01293236
## predicted class=no expected loss=0.3058307 P(node) =0.1519724
## class counts: 6167 2717
## probabilities: 0.694 0.306
## left son=548 (3123 obs) right son=549 (5761 obs)
## Primary splits:
## campaign < 2.000554 to the left, improve=126.64850, (0 missing)
## cons_price_idx < 93.20023 to the left, improve= 69.11038, (0 missing)
## cons_conf_idx < -36.09546 to the left, improve= 66.03994, (0 missing)
## month splits as -R-R---L--, improve= 48.22153, (0 missing)
## job splits as RRLLRRRRRRLL, improve= 47.21639, (0 missing)
## Surrogate splits:
## cons_price_idx < 93.20023 to the left, agree=0.666, adj=0.050, (0 split)
## month splits as -R-R---L--, agree=0.662, adj=0.040, (0 split)
## previous < 0.2940022 to the right, agree=0.654, adj=0.014, (0 split)
## poutcome splits as LR-, agree=0.653, adj=0.013, (0 split)
## default splits as RRL, agree=0.649, adj=0.000, (0 split)
##
## Node number 275: 1151 observations
## predicted class=yes expected loss=0 P(node) =0.01968935
## class counts: 0 1151
## probabilities: 0.000 1.000
##
## Node number 548: 3123 observations
## predicted class=no expected loss=0.1911623 P(node) =0.05342297
## class counts: 2526 597
## probabilities: 0.809 0.191
##
## Node number 549: 5761 observations, complexity param=0.01293236
## predicted class=no expected loss=0.3679917 P(node) =0.09854939
## class counts: 3641 2120
## probabilities: 0.632 0.368
## left son=1098 (5005 obs) right son=1099 (756 obs)
## Primary splits:
## campaign < 2.999257 to the right, improve=695.17030, (0 missing)
## cons_conf_idx < -36.09546 to the left, improve= 40.29316, (0 missing)
## education splits as LLRRLRRL, improve= 39.18488, (0 missing)
## job splits as RRLLRRRRLRLL, improve= 35.95156, (0 missing)
## cons_price_idx < 93.20023 to the left, improve= 35.52620, (0 missing)
## Surrogate splits:
## cons_conf_idx < -35.93195 to the left, agree=0.872, adj=0.022, (0 split)
## previous < 2.218332 to the left, agree=0.869, adj=0.004, (0 split)
## age < 60.0642 to the left, agree=0.869, adj=0.001, (0 split)
##
## Node number 1098: 5005 observations
## predicted class=no expected loss=0.2725275 P(node) =0.08561702
## class counts: 3641 1364
## probabilities: 0.727 0.273
##
## Node number 1099: 756 observations
## predicted class=yes expected loss=0 P(node) =0.01293236
## class counts: 0 756
## probabilities: 0.000 1.000
##
## n= 58458
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 58458 29229 no (0.50000000 0.50000000)
## 2) poutcome=failure,nonexistent 52419 23573 no (0.55029665 0.44970335)
## 4) month=aug,jul,jun,may,nov 43283 16858 no (0.61051683 0.38948317)
## 8) cons_price_idx>=93.19988 31312 9929 no (0.68290112 0.31709888)
## 16) contact=telephone 13637 3031 no (0.77773704 0.22226296) *
## 17) contact=cellular 17675 6898 no (0.60973126 0.39026874)
## 34) cons_price_idx< 93.91801 16113 5522 no (0.65729535 0.34270465)
## 68) month=aug,jul,nov 15595 5049 no (0.67624239 0.32375761)
## 136) campaign< 1.000631 5560 1181 no (0.78758993 0.21241007) *
## 137) campaign>=1.000631 10035 3868 no (0.61454908 0.38545092)
## 274) campaign>=1.999861 8884 2717 no (0.69416929 0.30583071)
## 548) campaign< 2.000554 3123 597 no (0.80883766 0.19116234) *
## 549) campaign>=2.000554 5761 2120 no (0.63200833 0.36799167)
## 1098) campaign>=2.999257 5005 1364 no (0.72747253 0.27252747) *
## 1099) campaign< 2.999257 756 0 yes (0.00000000 1.00000000) *
## 275) campaign< 1.999861 1151 0 yes (0.00000000 1.00000000) *
## 69) month=jun,may 518 45 yes (0.08687259 0.91312741) *
## 35) cons_price_idx>=93.91801 1562 186 yes (0.11907810 0.88092190) *
## 9) cons_price_idx< 93.19988 11971 5042 yes (0.42118453 0.57881547)
## 18) cons_conf_idx< -46.19657 7082 2970 no (0.58062694 0.41937306) *
## 19) cons_conf_idx>=-46.19657 4889 930 yes (0.19022295 0.80977705) *
## 5) month=apr,dec,mar,oct,sep 9136 2421 yes (0.26499562 0.73500438) *
## 3) poutcome=success 6039 383 yes (0.06342110 0.93657890) *
end <- Sys.time()
# rpart does k-fold cross validation automatically with default 10-fold for cp
# Plot cross-validation results
plotcp(decision_tree)
# Print cp values
printcp(decision_tree)
##
## Classification tree:
## rpart(formula = y ~ ., data = smoted_train, method = "class")
##
## Variables actually used in tree construction:
## [1] campaign cons_conf_idx cons_price_idx contact month
## [6] poutcome
##
## Root node error: 29229/58458 = 0.5
##
## n= 58458
##
## CP nsplit rel error xerror xstd
## 1 0.180403 0 1.00000 1.00749 0.0041359
## 2 0.146909 1 0.81960 0.81960 0.0040681
## 3 0.064559 2 0.67269 0.67224 0.0039075
## 4 0.039071 3 0.60813 0.60813 0.0038052
## 5 0.020356 4 0.56906 0.57197 0.0037379
## 6 0.018007 6 0.52835 0.53129 0.0036535
## 7 0.012932 9 0.47432 0.47730 0.0035260
## 8 0.010000 11 0.44846 0.45147 0.0034582
# Default predictions
predictions <- predict(decision_tree, newdata = test, type = "class")
cf_matrix <- confusionMatrix(predictions, test$y)
# print(cf_matrix)
# Bar plot of the target variable
predictions_df <- as.data.frame(table(predictions))
colnames(predictions_df) <- c("Target", "Count")
ggplot(predictions_df, aes(x = Target, y = Count)) +
geom_bar(stat = "identity", fill="blue") +
labs(
title = "Distribution of Target Variable",
x = "Client Subscription Status (0 = Not Subscribed, 1 = Subscribed)",
y = "Count"
) +
theme_minimal()
# Evaluation
# Note: for a classification tree, type = "vector" returns the predicted class
# (as 1/2), so this ROC/AUC is based on hard class predictions rather than
# class probabilities
probs <- predict(decision_tree, newdata = test, type = "vector")
roc <- roc(test$y, probs, quiet = TRUE)
decision_tree_auc <- auc(roc)
decision_tree_accuracy <- cf_matrix$overall['Accuracy']
decision_tree_kappa <- cf_matrix$overall['Kappa']
decision_tree_sensitivity <- cf_matrix$byClass['Sensitivity']
decision_tree_specificity <- cf_matrix$byClass['Specificity']
decision_tree_f1 <- cf_matrix$byClass["F1"]
decision_tree_precision <- cf_matrix$byClass["Precision"]
decision_tree_model_duration <- end - start
output1 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Default Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix)), collapse = "\n"), "\n",
"Accuracy:", round(decision_tree_accuracy, 4), "| Precision:", round(decision_tree_precision, 4),
"| Sensitivity:", round(decision_tree_sensitivity, 4), "| Specificity:", round(decision_tree_specificity, 4), "\n",
"F1 Score:", round(decision_tree_f1, 4), "| Kappa:", round(decision_tree_kappa, 4), "| AUC:", round(decision_tree_auc, 4),
"| Model Duration:", round(decision_tree_model_duration, 4), "\n\n", sep = " ")
cat(output1)
##
## === Model Selection and Evaluation ===
##
## === Default Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6342 383
## yes 966 545
##
## Accuracy : 0.8362
## 95% CI : (0.828, 0.8441)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3572
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8678
## Specificity : 0.5873
## Pos Pred Value : 0.9430
## Neg Pred Value : 0.3607
## Prevalence : 0.8873
## Detection Rate : 0.7700
## Detection Prevalence : 0.8165
## Balanced Accuracy : 0.7276
##
## 'Positive' Class : no
##
## Accuracy: 0.8362 | Precision: 0.943 | Sensitivity: 0.8678 | Specificity: 0.5873
## F1 Score: 0.9039 | Kappa: 0.3572 | AUC: 0.7276 | Model Duration: 1.0857
Results:
The accuracy, our chosen metric, of the most basic model is 0.8362069, which is not a terrible result. We can see in the cross-validation graph that as the complexity increases (cp decreases), the relative error decreases. With higher complexity, the decision tree can be more precise, but at a computational cost. As we can see, cp=0.01 with nsplit=11 gives the smallest relative error.
Since the optimal cp value from this model is 0.01, there is no need
to use the prune() method with the optimal cp to prune the
tree, since it would give the same tree as above.
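For illustration, had a different cp minimized the cross-validated error, pruning would look like this (a sketch using the fitted model's cptable):
# Sketch: prune at the cp with the lowest cross-validated error (xerror)
best_cp <- decision_tree$cptable[which.min(decision_tree$cptable[, "xerror"]), "CP"]
decision_tree_pruned <- prune(decision_tree, cp = best_cp)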
# Plot decision tree
rpart.plot(decision_tree)
# Extract important predictors from model
vip(decision_tree)
# Rules
rpart.rules(decision_tree)
## y
## 0.19 when poutcome is failure or nonexistent & month is aug or jul or nov & cons_price_idx is 93 to 94 & contact is cellular & campaign is 2 to 2
## 0.21 when poutcome is failure or nonexistent & month is aug or jul or nov & cons_price_idx is 93 to 94 & contact is cellular & campaign < 1
## 0.22 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx >= 93 & contact is telephone
## 0.27 when poutcome is failure or nonexistent & month is aug or jul or nov & cons_price_idx is 93 to 94 & contact is cellular & campaign >= 3
## 0.42 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx < 93 & cons_conf_idx < -46
## 0.74 when poutcome is failure or nonexistent & month is apr or dec or mar or oct or sep
## 0.81 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx < 93 & cons_conf_idx >= -46
## 0.88 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx >= 94 & contact is cellular
## 0.91 when poutcome is failure or nonexistent & month is jun or may & cons_price_idx is 93 to 94 & contact is cellular
## 0.94 when poutcome is success
## 1.00 when poutcome is failure or nonexistent & month is aug or jul or nov & cons_price_idx is 93 to 94 & contact is cellular & campaign is 2 to 3
## 1.00 when poutcome is failure or nonexistent & month is aug or jul or nov & cons_price_idx is 93 to 94 & contact is cellular & campaign is 1 to 2
Recommendation:
According to this model, the most important predictors the marketing team should be focusing on are month, cons_price_idx, cons_conf_idx, poutcome, and campaign.
The rules above can also be used to follow the decision tree structure.
Objective: Discover how a smaller maximum depth affects the model accuracy
The point of this model is to experiment with the
maxdepth hyperparameter, which is the maximum depth of any
node of the tree. For this experiment, instead of the default value
of 30, I will use 5. I chose to decrease this number since a smaller max
depth can help prevent overfitting and makes the model simpler, which
helps with interpretability. This should help the tree generalize to
new data.
Metric: Accuracy
# Decision Tree Experiment 1 Model
set.seed(125)
start <- Sys.time()
decision_tree1 <- rpart(y ~ ., data = smoted_train, method = "class", control = rpart.control(maxdepth = 5))
print(summary(decision_tree1))
## Call:
## rpart(formula = y ~ ., data = smoted_train, method = "class",
## control = rpart.control(maxdepth = 5))
## n= 58458
##
## CP nsplit rel error xerror xstd
## 1 0.18040302 0 1.0000000 1.0074926 0.004135860
## 2 0.14690889 1 0.8195970 0.8195970 0.004068116
## 3 0.06455917 2 0.6726881 0.6722433 0.003907514
## 4 0.03907079 3 0.6081289 0.6081289 0.003805181
## 5 0.02035650 4 0.5690581 0.5719662 0.003737941
## 6 0.01000000 6 0.5283451 0.5312874 0.003653518
##
## Variable importance
## month cons_price_idx cons_conf_idx poutcome previous
## 22 21 21 20 7
## contact age job education
## 3 2 2 1
##
## Node number 1: 58458 observations, complexity param=0.180403
## predicted class=no expected loss=0.5 P(node) =1
## class counts: 29229 29229
## probabilities: 0.500 0.500
## left son=2 (52419 obs) right son=3 (6039 obs)
## Primary splits:
## poutcome splits as LLR, improve=2567.295, (0 missing)
## contact splits as RL, improve=2391.529, (0 missing)
## month splits as RLRLLRLLRR, improve=2282.421, (0 missing)
## cons_conf_idx < -36.09886 to the left, improve=2165.345, (0 missing)
## previous < 0.0004073881 to the left, improve=1978.245, (0 missing)
## Surrogate splits:
## previous < 1.001562 to the left, agree=0.926, adj=0.287, (0 split)
## cons_price_idx < 94.60799 to the left, agree=0.897, adj=0.003, (0 split)
## cons_conf_idx < -50.7846 to the right, agree=0.897, adj=0.002, (0 split)
## age < 18.7708 to the right, agree=0.897, adj=0.001, (0 split)
##
## Node number 2: 52419 observations, complexity param=0.1469089
## predicted class=no expected loss=0.4497034 P(node) =0.8966951
## class counts: 28846 23573
## probabilities: 0.550 0.450
## left son=4 (43283 obs) right son=5 (9136 obs)
## Primary splits:
## month splits as RLRLLRLLRR, improve=1801.2090, (0 missing)
## cons_price_idx < 93.19988 to the right, improve=1793.0170, (0 missing)
## contact splits as RL, improve=1651.3490, (0 missing)
## cons_conf_idx < -36.09886 to the left, improve=1557.9710, (0 missing)
## default splits as RLL, improve= 894.2134, (0 missing)
## Surrogate splits:
## cons_conf_idx < -46.68129 to the right, agree=0.919, adj=0.537, (0 split)
## age < 60.99529 to the left, agree=0.829, adj=0.017, (0 split)
## cons_price_idx < 94.5206 to the left, agree=0.827, adj=0.009, (0 split)
##
## Node number 3: 6039 observations
## predicted class=yes expected loss=0.0634211 P(node) =0.1033049
## class counts: 383 5656
## probabilities: 0.063 0.937
##
## Node number 4: 43283 observations, complexity param=0.06455917
## predicted class=no expected loss=0.3894832 P(node) =0.7404119
## class counts: 26425 16858
## probabilities: 0.611 0.389
## left son=8 (31312 obs) right son=9 (11971 obs)
## Primary splits:
## cons_price_idx < 93.19988 to the right, improve=1186.3600, (0 missing)
## cons_conf_idx < -36.09674 to the left, improve=1107.2040, (0 missing)
## contact splits as RL, improve=1079.9350, (0 missing)
## default splits as RLL, improve= 442.0817, (0 missing)
## age < 60.00359 to the left, improve= 381.0203, (0 missing)
## Surrogate splits:
## cons_conf_idx < -44.15814 to the right, agree=0.885, adj=0.583, (0 split)
## previous < 0.00118432 to the left, agree=0.761, adj=0.136, (0 split)
## poutcome splits as RL-, agree=0.758, adj=0.126, (0 split)
## job splits as LLLLLLLLRLLL, agree=0.729, adj=0.019, (0 split)
## age < 60.53519 to the left, agree=0.727, adj=0.015, (0 split)
##
## Node number 5: 9136 observations
## predicted class=yes expected loss=0.2649956 P(node) =0.1562831
## class counts: 2421 6715
## probabilities: 0.265 0.735
##
## Node number 8: 31312 observations, complexity param=0.0203565
## predicted class=no expected loss=0.3170989 P(node) =0.5356324
## class counts: 21383 9929
## probabilities: 0.683 0.317
## left son=16 (13637 obs) right son=17 (17675 obs)
## Primary splits:
## contact splits as RL, improve=434.5561, (0 missing)
## campaign < 1.000115 to the left, improve=347.2238, (0 missing)
## cons_conf_idx < -36.40503 to the right, improve=317.5052, (0 missing)
## age < 60.00359 to the left, improve=256.2774, (0 missing)
## previous < 1.000371 to the left, improve=238.8876, (0 missing)
## Surrogate splits:
## month splits as -R-RL-LR--, agree=0.928, adj=0.835, (0 split)
## cons_price_idx < 93.99106 to the right, agree=0.902, adj=0.775, (0 split)
## cons_conf_idx < -41.80137 to the right, agree=0.726, adj=0.371, (0 split)
## job splits as RLRRRRRLRRLL, agree=0.604, adj=0.092, (0 split)
## education splits as LLLRRRRL, agree=0.589, adj=0.057, (0 split)
##
## Node number 9: 11971 observations, complexity param=0.03907079
## predicted class=yes expected loss=0.4211845 P(node) =0.2047795
## class counts: 5042 6929
## probabilities: 0.421 0.579
## left son=18 (7082 obs) right son=19 (4889 obs)
## Primary splits:
## cons_conf_idx < -46.19657 to the left, improve=881.6659, (0 missing)
## month splits as -R-RR-LR--, improve=775.1438, (0 missing)
## cons_price_idx < 92.89325 to the left, improve=416.1322, (0 missing)
## job splits as RLLRRRLLRRRL, improve=204.9930, (0 missing)
## campaign < 4.997728 to the right, improve=201.7275, (0 missing)
## Surrogate splits:
## month splits as -R-RR-LR--, agree=0.974, adj=0.936, (0 split)
## cons_price_idx < 92.88986 to the right, agree=0.811, adj=0.536, (0 split)
## job splits as RLLRLRLLRLRR, agree=0.672, adj=0.196, (0 split)
## education splits as LLLLRLRR, agree=0.646, adj=0.133, (0 split)
## age < 55.01647 to the left, agree=0.638, adj=0.113, (0 split)
##
## Node number 16: 13637 observations
## predicted class=no expected loss=0.222263 P(node) =0.2332786
## class counts: 10606 3031
## probabilities: 0.778 0.222
##
## Node number 17: 17675 observations, complexity param=0.0203565
## predicted class=no expected loss=0.3902687 P(node) =0.3023538
## class counts: 10777 6898
## probabilities: 0.610 0.390
## left son=34 (16113 obs) right son=35 (1562 obs)
## Primary splits:
## cons_price_idx < 93.91801 to the left, improve=824.9796, (0 missing)
## month splits as -L-LR-RL--, improve=396.8921, (0 missing)
## campaign < 1.000115 to the left, improve=256.5630, (0 missing)
## cons_conf_idx < -42.71392 to the right, improve=222.1548, (0 missing)
## age < 60.00524 to the left, improve=184.8284, (0 missing)
## Surrogate splits:
## cons_conf_idx < -46.28575 to the right, agree=0.923, adj=0.125, (0 split)
## previous < 1.041097 to the left, agree=0.922, adj=0.117, (0 split)
## age < 60.99399 to the left, agree=0.922, adj=0.113, (0 split)
## month splits as -L-LR-LL--, agree=0.914, adj=0.031, (0 split)
##
## Node number 18: 7082 observations
## predicted class=no expected loss=0.4193731 P(node) =0.1211468
## class counts: 4112 2970
## probabilities: 0.581 0.419
##
## Node number 19: 4889 observations
## predicted class=yes expected loss=0.1902229 P(node) =0.08363269
## class counts: 930 3959
## probabilities: 0.190 0.810
##
## Node number 34: 16113 observations
## predicted class=no expected loss=0.3427046 P(node) =0.2756338
## class counts: 10591 5522
## probabilities: 0.657 0.343
##
## Node number 35: 1562 observations
## predicted class=yes expected loss=0.1190781 P(node) =0.02672004
## class counts: 186 1376
## probabilities: 0.119 0.881
##
## n= 58458
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 58458 29229 no (0.5000000 0.5000000)
## 2) poutcome=failure,nonexistent 52419 23573 no (0.5502966 0.4497034)
## 4) month=aug,jul,jun,may,nov 43283 16858 no (0.6105168 0.3894832)
## 8) cons_price_idx>=93.19988 31312 9929 no (0.6829011 0.3170989)
## 16) contact=telephone 13637 3031 no (0.7777370 0.2222630) *
## 17) contact=cellular 17675 6898 no (0.6097313 0.3902687)
## 34) cons_price_idx< 93.91801 16113 5522 no (0.6572954 0.3427046) *
## 35) cons_price_idx>=93.91801 1562 186 yes (0.1190781 0.8809219) *
## 9) cons_price_idx< 93.19988 11971 5042 yes (0.4211845 0.5788155)
## 18) cons_conf_idx< -46.19657 7082 2970 no (0.5806269 0.4193731) *
## 19) cons_conf_idx>=-46.19657 4889 930 yes (0.1902229 0.8097771) *
## 5) month=apr,dec,mar,oct,sep 9136 2421 yes (0.2649956 0.7350044) *
## 3) poutcome=success 6039 383 yes (0.0634211 0.9365789) *
end <- Sys.time()
# rpart does k-fold cross validation automatically with default 10-fold for cp
# Plot cross-validation results
plotcp(decision_tree1)
# Print cp values
printcp(decision_tree1)
##
## Classification tree:
## rpart(formula = y ~ ., data = smoted_train, method = "class",
## control = rpart.control(maxdepth = 5))
##
## Variables actually used in tree construction:
## [1] cons_conf_idx cons_price_idx contact month poutcome
##
## Root node error: 29229/58458 = 0.5
##
## n= 58458
##
## CP nsplit rel error xerror xstd
## 1 0.180403 0 1.00000 1.00749 0.0041359
## 2 0.146909 1 0.81960 0.81960 0.0040681
## 3 0.064559 2 0.67269 0.67224 0.0039075
## 4 0.039071 3 0.60813 0.60813 0.0038052
## 5 0.020356 4 0.56906 0.57197 0.0037379
## 6 0.010000 6 0.52835 0.53129 0.0036535
# Default predictions
predictions_decision_tree1 <- predict(decision_tree1, newdata = test, type = "class")
cf_matrix_decision_tree1 <- confusionMatrix(predictions_decision_tree1, test$y)
print(cf_matrix_decision_tree1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6348 394
## yes 960 534
##
## Accuracy : 0.8356
## 95% CI : (0.8274, 0.8435)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3507
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8686
## Specificity : 0.5754
## Pos Pred Value : 0.9416
## Neg Pred Value : 0.3574
## Prevalence : 0.8873
## Detection Rate : 0.7708
## Detection Prevalence : 0.8186
## Balanced Accuracy : 0.7220
##
## 'Positive' Class : no
##
# Bar plot of the target variable
predictions_df <- as.data.frame(table(predictions_decision_tree1))
colnames(predictions_df) <- c("Target", "Count")
ggplot(predictions_df, aes(x = Target, y = Count)) +
geom_bar(stat = "identity", fill="blue") +
labs(
title = "Distribution of Target Variable",
x = "Client Subscription Status (0 = Not Subscribed, 1 = Subscribed)",
y = "Count"
) +
theme_minimal()
# Evaluation
probs1 <- predict(decision_tree1, newdata = test, type = "vector")
roc1 <- roc(test$y, probs1, quiet = TRUE)
decision_tree1_auc <- auc(roc1)
decision_tree1_accuracy <- cf_matrix_decision_tree1$overall['Accuracy']
decision_tree1_kappa <- cf_matrix_decision_tree1$overall['Kappa']
decision_tree1_sensitivity <- cf_matrix_decision_tree1$byClass['Sensitivity']
decision_tree1_specificity <- cf_matrix_decision_tree1$byClass['Specificity']
decision_tree1_f1 <- cf_matrix_decision_tree1$byClass["F1"]
decision_tree1_precision <- cf_matrix_decision_tree1$byClass["Precision"]
decision_tree1_model_duration <- end - start
output2 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 1 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_decision_tree1)), collapse = "\n"), "\n",
"Accuracy:", round(decision_tree1_accuracy, 4), "| Precision:", round(decision_tree1_precision, 4),
"| Sensitivity:", round(decision_tree1_sensitivity, 4), "| Specificity:", round(decision_tree1_specificity, 4), "\n",
"F1 Score:", round(decision_tree1_f1, 4), "| Kappa:", round(decision_tree1_kappa, 4), "| AUC:", round(decision_tree1_auc, 4),
"| Model Duration:", round(decision_tree1_model_duration, 4), "\n\n", sep = " ")
cat(output2)
##
## === Model Selection and Evaluation ===
##
## === Model 1 Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6348 394
## yes 960 534
##
## Accuracy : 0.8356
## 95% CI : (0.8274, 0.8435)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3507
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8686
## Specificity : 0.5754
## Pos Pred Value : 0.9416
## Neg Pred Value : 0.3574
## Prevalence : 0.8873
## Detection Rate : 0.7708
## Detection Prevalence : 0.8186
## Balanced Accuracy : 0.7220
##
## 'Positive' Class : no
##
## Accuracy: 0.8356 | Precision: 0.9416 | Sensitivity: 0.8686 | Specificity: 0.5754
## F1 Score: 0.9036 | Kappa: 0.3507 | AUC: 0.722 | Model Duration: 0.5946
Results:
With a smaller max depth, the accuracy slightly decreased compared to the default model, coming in at 0.8355998.
# Recommendation
# Plot decision tree
rpart.plot(decision_tree1)
# Extract important predictors from model
vip(decision_tree1)
# Rules
rpart.rules(decision_tree1)
## y
## 0.22 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx >= 93 & contact is telephone
## 0.34 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx is 93 to 94 & contact is cellular
## 0.42 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx < 93 & cons_conf_idx < -46
## 0.74 when poutcome is failure or nonexistent & month is apr or dec or mar or oct or sep
## 0.81 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx < 93 & cons_conf_idx >= -46
## 0.88 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx >= 94 & contact is cellular
## 0.94 when poutcome is success
Recommendation:
According to this model, the most important predictors the marketing team should be focusing on are month, cons_price_idx, cons_conf_idx, poutcome, and previous. The main difference from the default model is that campaign is no longer a top predictor.
The rules above can also be used to follow the decision tree structure.
Objective: Discover how using standardized datasets affects model accuracy.
The point of this model is to experiment with standardized data. The standardized datasets (created earlier) will be used to see if they have any impact on the model's accuracy. Standardization should not have any effect on decision trees, but let's confirm this.
Metric: Accuracy
# Experiment 2 model - standardized datasets
set.seed(125)
start <- Sys.time()
decision_tree2 <- rpart(y ~ ., data = smoted_train_standardized, method = "class")
print(decision_tree2)
## n= 58458
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 58458 29229 no (0.50000000 0.50000000)
## 2) poutcome=failure,nonexistent 52478 23632 no (0.54967796 0.45032204)
## 4) month=aug,jul,jun,may,nov 43295 16870 no (0.61034762 0.38965238)
## 8) cons_price_idx>=-0.6487716 31264 9881 no (0.68394959 0.31605041)
## 16) contact=telephone 13563 2957 no (0.78198039 0.21801961) *
## 17) contact=cellular 17701 6924 no (0.60883566 0.39116434)
## 34) cons_price_idx< 0.5930096 16205 5614 no (0.65356371 0.34643629)
## 68) month=aug,jul,nov 15572 5026 no (0.67724120 0.32275880)
## 136) campaign< -0.5647457 5526 1147 no (0.79243576 0.20756424) *
## 137) campaign>=-0.5647457 10046 3879 no (0.61387617 0.38612383)
## 274) campaign>=-0.201818 8898 2731 no (0.69307710 0.30692290)
## 548) campaign< -0.2014506 3121 595 no (0.80935598 0.19064402) *
## 549) campaign>=-0.2014506 5777 2136 no (0.63025792 0.36974208)
## 1098) campaign>=0.1607861 5057 1416 no (0.71999209 0.28000791) *
## 1099) campaign< 0.1607861 720 0 yes (0.00000000 1.00000000) *
## 275) campaign< -0.201818 1148 0 yes (0.00000000 1.00000000) *
## 69) month=jun,may 633 45 yes (0.07109005 0.92890995) *
## 35) cons_price_idx>=0.5930096 1496 186 yes (0.12433155 0.87566845) *
## 9) cons_price_idx< -0.6487716 12031 5042 yes (0.41908403 0.58091597)
## 18) cons_conf_idx< -1.23077 7122 3010 no (0.57736591 0.42263409) *
## 19) cons_conf_idx>=-1.23077 4909 930 yes (0.18944795 0.81055205) *
## 5) month=apr,dec,mar,oct,sep 9183 2421 yes (0.26363933 0.73636067) *
## 3) poutcome=success 5980 383 yes (0.06404682 0.93595318) *
end <- Sys.time()
# Predictions
predictions_decision_tree2 <- predict(decision_tree2, newdata = test_standardized, type = "class")
cf_matrix_decision_tree2 <- confusionMatrix(predictions_decision_tree2, test_standardized$y)
# print(cf_matrix_decision_tree2)
# Bar plot of the target variable
predictions_df2 <- as.data.frame(table(predictions_decision_tree2))
colnames(predictions_df2) <- c("Target", "Count")
ggplot(predictions_df2, aes(x = Target, y = Count)) +
geom_bar(stat = "identity", fill="blue") +
labs(
title = "Distribution of Target Variable",
x = "Client Subscription Status (0 = Not Subscribed, 1 = Subscribed)",
y = "Count"
) +
theme_minimal()
# Evaluation
# Predict on the standardized test set to match the standardized training data
probs2 <- predict(decision_tree2, newdata = test_standardized, type = "vector")
roc2 <- roc(test_standardized$y, probs2, quiet = TRUE)
decision_tree2_auc <- auc(roc2)
decision_tree2_accuracy <- cf_matrix_decision_tree2$overall['Accuracy']
decision_tree2_kappa <- cf_matrix_decision_tree2$overall['Kappa']
decision_tree2_sensitivity <- cf_matrix_decision_tree2$byClass['Sensitivity']
decision_tree2_specificity <- cf_matrix_decision_tree2$byClass['Specificity']
decision_tree2_f1 <- cf_matrix_decision_tree2$byClass["F1"]
decision_tree2_precision <- cf_matrix_decision_tree2$byClass["Precision"]
decision_tree2_model_duration <- end - start
output3 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 2 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_decision_tree2)), collapse = "\n"), "\n",
"Accuracy:", round(decision_tree2_accuracy, 4), "| Precision:", round(decision_tree2_precision, 4),
"| Sensitivity:", round(decision_tree2_sensitivity, 4), "| Specificity:", round(decision_tree2_specificity, 4), "\n",
"F1 Score:", round(decision_tree2_f1, 4), "| Kappa:", round(decision_tree2_kappa, 4), "| AUC:", round(decision_tree2_auc, 4),
"| Model Duration:", round(decision_tree2_model_duration, 4), "\n\n", sep = " ")
cat(output3)
##
## === Model Selection and Evaluation ===
##
## === Model 2 Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6342 383
## yes 966 545
##
## Accuracy : 0.8362
## 95% CI : (0.828, 0.8441)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3572
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8678
## Specificity : 0.5873
## Pos Pred Value : 0.9430
## Neg Pred Value : 0.3607
## Prevalence : 0.8873
## Detection Rate : 0.7700
## Detection Prevalence : 0.8165
## Balanced Accuracy : 0.7276
##
## 'Positive' Class : no
##
## Accuracy: 0.8362 | Precision: 0.943 | Sensitivity: 0.8678 | Specificity: 0.5873
## F1 Score: 0.9039 | Kappa: 0.3572 | AUC: 0.6165 | Model Duration: 0.9451
Results:
As expected, the model accuracy remained the same, meaning it is not necessary to standardize the dataset ahead of time. That being said, it does not hurt the model at all.
# Recommendation
# Plot decision tree
rpart.plot(decision_tree2)
# Extract important predictors from model
vip(decision_tree2)
# Rules
rpart.rules(decision_tree2)
## y
## 0.19 when poutcome is failure or nonexistent & month is aug or jul or nov & cons_price_idx is -0.65 to 0.59 & contact is cellular & campaign is -0.20 to -0.20
## 0.21 when poutcome is failure or nonexistent & month is aug or jul or nov & cons_price_idx is -0.65 to 0.59 & contact is cellular & campaign < -0.56
## 0.22 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx >= -0.65 & contact is telephone
## 0.28 when poutcome is failure or nonexistent & month is aug or jul or nov & cons_price_idx is -0.65 to 0.59 & contact is cellular & campaign >= 0.16
## 0.42 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx < -0.65 & cons_conf_idx < -1.2
## 0.74 when poutcome is failure or nonexistent & month is apr or dec or mar or oct or sep
## 0.81 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx < -0.65 & cons_conf_idx >= -1.2
## 0.88 when poutcome is failure or nonexistent & month is aug or jul or jun or may or nov & cons_price_idx >= 0.59 & contact is cellular
## 0.93 when poutcome is failure or nonexistent & month is jun or may & cons_price_idx is -0.65 to 0.59 & contact is cellular
## 0.94 when poutcome is success
## 1.00 when poutcome is failure or nonexistent & month is aug or jul or nov & cons_price_idx is -0.65 to 0.59 & contact is cellular & campaign is -0.20 to 0.16
## 1.00 when poutcome is failure or nonexistent & month is aug or jul or nov & cons_price_idx is -0.65 to 0.59 & contact is cellular & campaign is -0.56 to -0.20
Recommendation:
According to this model, the most important predictors the marketing team should be focusing on are month, cons_price_idx, cons_conf_idx, poutcome, and campaign. This matches the default model.
The rules above can also be used to follow the decision tree structure.
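For explaining an individual call to the marketing team, the rpart.plot package can also report which rule fired for each prediction. A minimal sketch, assuming the same test frame used for Model 2's predictions:
# Show the decision rule applied to each of the first 5 test clients
rpart.predict(decision_tree2, newdata = head(test, 5), rules = TRUE)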
# Create experiments results table
decision_tree_experiments <- data.frame(
Model = c("Default Model", "Model 1", "Model 2"),
Accuracy = c(decision_tree_accuracy, decision_tree1_accuracy, decision_tree2_accuracy),
Error_Rate = c(1-decision_tree_accuracy, 1-decision_tree1_accuracy, 1-decision_tree2_accuracy),
Precision = c(decision_tree_precision, decision_tree1_precision, decision_tree2_precision),
Sensitivity = c(decision_tree_sensitivity, decision_tree1_sensitivity, decision_tree2_sensitivity),
Specificity = c(decision_tree_specificity, decision_tree1_specificity, decision_tree2_specificity),
F1_Score = c(decision_tree_f1, decision_tree1_f1, decision_tree2_f1),
Kappa = c(decision_tree_kappa, decision_tree1_kappa, decision_tree2_kappa),
AUC = c(decision_tree_auc, decision_tree1_auc, decision_tree2_auc),
Duration = c(decision_tree_model_duration, decision_tree1_model_duration, decision_tree2_model_duration)
)
print(decision_tree_experiments)
## Model Accuracy Error_Rate Precision Sensitivity Specificity
## 1 Default Model 0.8362069 0.1637931 0.9430483 0.8678161 0.5872845
## 2 Model 1 0.8355998 0.1644002 0.9415604 0.8686371 0.5754310
## 3 Model 2 0.8362069 0.1637931 0.9430483 0.8678161 0.5872845
## F1_Score Kappa AUC Duration
## 1 0.9038695 0.3571577 0.7275503 1.0857201 secs
## 2 0.9036299 0.3507004 0.7220341 0.5946162 secs
## 3 0.9038695 0.3571577 0.6164905 0.9451001 secs
# Plot ROC curves
plot(roc, col = "blue", main = "ROC Curves Comparison")
plot(roc1, col = "red", add = TRUE)
plot(roc2, col = "green", add = TRUE)
legend("bottomright",
legend = paste0(c("Default Model", "Model 1", "Model 2"), " (AUC=",
round(c(decision_tree_auc, decision_tree1_auc, decision_tree2_auc), 3), ")"),
col = c("blue", "red", "green"),
lwd = 2)
Overall, the default model and Model 2 performed the best on accuracy. This confirms that a standardized dataset isn't necessary, which is a pro for decision trees. Model 1 illustrated the trade-off between bias and variance: lowering the maximum depth of the tree can go too far in preventing overfitting and instead cause underfitting. In this case, too much information was lost. The default model is therefore the winner among the decision trees.
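To make the depth trade-off concrete, a quick sweep over maxdepth could be run; this is a hedged sketch assuming the smoted_train and test objects used throughout, with rpart loaded:
# Sweep maxdepth and watch test accuracy rise and then flatten (or fall)
for (depth in c(2, 4, 6, 8)) {
  fit   <- rpart(y ~ ., data = smoted_train, control = rpart.control(maxdepth = depth))
  preds <- predict(fit, newdata = test, type = "class")
  cat("maxdepth =", depth, "-> test accuracy =", round(mean(preds == test$y), 4), "\n")
}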
Since there is no variation in this model, as it uses the default settings and the original datasets, we won't count it as an actual experiment. We will only use it for comparison purposes and for learning about the default params.
Metric: Accuracy
There are a lot of arguments for randomForest, so I’ll reference this R documentation website for all the params: https://www.rdocumentation.org/packages/randomForest/versions/4.7-1.2/topics/randomForest
# Default model
set.seed(125)
start <- Sys.time()
rf <- randomForest(y ~ ., data = smoted_train)
print(rf)
##
## Call:
## randomForest(formula = y ~ ., data = smoted_train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 9.38%
## Confusion matrix:
## no yes class.error
## no 26982 2247 0.07687571
## yes 3235 25994 0.11067775
end <- Sys.time()
# Default predictions
predictions_rf <- predict(rf, newdata = test)
cf_matrix_rf <- confusionMatrix(predictions_rf, test$y)
print(cf_matrix_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6794 514
## yes 514 414
##
## Accuracy : 0.8752
## 95% CI : (0.8679, 0.8822)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.9997
##
## Kappa : 0.3758
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.9297
## Specificity : 0.4461
## Pos Pred Value : 0.9297
## Neg Pred Value : 0.4461
## Prevalence : 0.8873
## Detection Rate : 0.8249
## Detection Prevalence : 0.8873
## Balanced Accuracy : 0.6879
##
## 'Positive' Class : no
##
# Bar plot of the target variable
predictions_rf_df <- as.data.frame(table(predictions_rf))
colnames(predictions_rf_df) <- c("Target", "Count")
ggplot(predictions_rf_df, aes(x = Target, y = Count)) +
geom_bar(stat = "identity", fill="blue") +
labs(
title = "Distribution of Target Variable",
x = "Client Subscription Status (0 = Not Subscribed, 1 = Subscribed)",
y = "Count"
) +
theme_minimal()
# Evaluation
# rf_probs <- predict(rf, newdata = test, type = "prob")
rf_probs <- predict(rf, newdata = test, type = "prob")[, 2]
rf_roc <- roc(test$y, rf_probs, quiet = TRUE)
rf_auc <- auc(rf_roc)
rf_accuracy <- cf_matrix_rf$overall['Accuracy']
rf_kappa <- cf_matrix_rf$overall['Kappa']
rf_sensitivity <- cf_matrix_rf$byClass['Sensitivity']
rf_specificity <- cf_matrix_rf$byClass['Specificity']
rf_f1 <- cf_matrix_rf$byClass["F1"]
rf_precision <- cf_matrix_rf$byClass["Precision"]
rf_model_duration <- end - start
output2 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Default Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_rf)), collapse = "\n"), "\n",
"Accuracy:", round(rf_accuracy, 4), "| Precision:", round(rf_precision, 4),
"| Sensitivity:", round(rf_sensitivity, 4), "| Specificity:", round(rf_specificity, 4), "\n",
"F1 Score:", round(rf_f1, 4), "| Kappa:", round(rf_kappa, 4), "| AUC:", round(rf_auc, 4),
"| Model Duration:", round(rf_model_duration, 4), "\n\n", sep = " ")
cat(output2)
##
## === Model Selection and Evaluation ===
##
## === Default Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6794 514
## yes 514 414
##
## Accuracy : 0.8752
## 95% CI : (0.8679, 0.8822)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.9997
##
## Kappa : 0.3758
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.9297
## Specificity : 0.4461
## Pos Pred Value : 0.9297
## Neg Pred Value : 0.4461
## Prevalence : 0.8873
## Detection Rate : 0.8249
## Detection Prevalence : 0.8873
## Balanced Accuracy : 0.6879
##
## 'Positive' Class : no
##
## Accuracy: 0.8752 | Precision: 0.9297 | Sensitivity: 0.9297 | Specificity: 0.4461
## F1 Score: 0.9297 | Kappa: 0.3758 | AUC: 0.7676 | Model Duration: 33.4794
Results (focusing on accuracy):
The randomForest model increased accuracy relative to the decision trees, reaching 87.52%. This makes sense: a random forest is an ensemble model, effectively a "team" of decision trees whose votes are aggregated.
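The "team" framing is visible directly in the API: predict() can return every individual tree's vote, which the forest aggregates by majority. A minimal sketch:
# Each of the 500 trees votes; the forest's answer is the majority vote
votes <- predict(rf, newdata = head(test, 3), predict.all = TRUE)
votes$aggregate              # the forest's final predictions
table(votes$individual[1, ]) # how the trees' votes split for the first client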
# Extract important predictors from model
vip(rf)
Recommendation:
According to this model, the most important predictors the marketing team should be focusing on are campaign, cons_conf_idx, month, age, and cons_price_idx.
Objective: Discover how a larger number of randomly selected features affects the model accuracy.
The point of this model is to experiment with the mtry hyperparameter, which is the number of features randomly considered at each split. For this experiment, instead of the default (floor(sqrt(p)) for classification, which came out to 3 here per the model printout), I will force it to use 10. I chose to increase this number because letting each split consider more features gives the model more information to work with, which could make it more accurate.
Metric: Accuracy
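Rather than fixing mtry by hand, the randomForest package also ships a tuneRF() helper that searches mtry by OOB error. A hedged sketch, assuming the same smoted_train:
# Let tuneRF search mtry by OOB error; stepFactor scales mtry up/down each step
set.seed(125)
tuned <- tuneRF(
  x = smoted_train[, setdiff(names(smoted_train), "y")],
  y = smoted_train$y,
  ntreeTry   = 100,  # fewer trees per candidate to keep the search fast
  stepFactor = 1.5,
  improve    = 0.01
)
print(tuned)  # matrix of candidate mtry values vs OOB error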
# Build model 1 with mtry = 10
set.seed(125)
# Tried to use cross-validation, but it took too long
# # Use cross-validation
# ctrl <- trainControl(method = "cv", number = 5)
#
# # Set up param tune grid for cross-validation
# tune_grid = expand.grid(mtry=c(10, 15))
start <- Sys.time()
rf1 <- randomForest(y ~ ., data = smoted_train, mtry = 10)
print(rf1)
##
## Call:
## randomForest(formula = y ~ ., data = smoted_train, mtry = 10)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 10
##
## OOB estimate of error rate: 7.37%
## Confusion matrix:
## no yes class.error
## no 27201 2028 0.06938315
## yes 2282 26947 0.07807315
end <- Sys.time()
# Make predictions
predictions_rf1 <- predict(rf1, newdata = test)
cf_matrix_rf1 <- confusionMatrix(predictions_rf1, test$y)
print(cf_matrix_rf1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6806 590
## yes 502 338
##
## Accuracy : 0.8674
## 95% CI : (0.8599, 0.8747)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 1.00000
##
## Kappa : 0.3083
##
## Mcnemar's Test P-Value : 0.00847
##
## Sensitivity : 0.9313
## Specificity : 0.3642
## Pos Pred Value : 0.9202
## Neg Pred Value : 0.4024
## Prevalence : 0.8873
## Detection Rate : 0.8264
## Detection Prevalence : 0.8980
## Balanced Accuracy : 0.6478
##
## 'Positive' Class : no
##
# Bar plot of the target variable
predictions_rf1_df <- as.data.frame(table(predictions_rf1))
colnames(predictions_rf1_df) <- c("Target", "Count")
ggplot(predictions_rf1_df, aes(x = Target, y = Count)) +
geom_bar(stat = "identity", fill="blue") +
labs(
title = "Distribution of Target Variable",
x = "Client Subscription Status (0 = Not Subscribed, 1 = Subscribed)",
y = "Count"
) +
theme_minimal()
# Evaluation
rf1_probs <- predict(rf1, newdata = test, type = "prob")[, 2]
rf1_roc <- roc(test$y, rf1_probs, quiet = TRUE)
rf1_auc <- auc(rf1_roc)
rf1_accuracy <- cf_matrix_rf1$overall['Accuracy']
rf1_kappa <- cf_matrix_rf1$overall['Kappa']
rf1_sensitivity <- cf_matrix_rf1$byClass['Sensitivity']
rf1_specificity <- cf_matrix_rf1$byClass['Specificity']
rf1_f1 <- cf_matrix_rf1$byClass["F1"]
rf1_precision <- cf_matrix_rf1$byClass["Precision"]
rf1_model_duration <- end - start
output3 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 1 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_rf1)), collapse = "\n"), "\n",
"Accuracy:", round(rf1_accuracy, 4), "| Precision:", round(rf1_precision, 4),
"| Sensitivity:", round(rf1_sensitivity, 4), "| Specificity:", round(rf1_specificity, 4), "\n",
"F1 Score:", round(rf1_f1, 4), "| Kappa:", round(rf1_kappa, 4), "| AUC:", round(rf1_auc, 4),
"| Model Duration:", round(rf1_model_duration, 4), "\n\n", sep = " ")
cat(output3)
##
## === Model Selection and Evaluation ===
##
## === Model 1 Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6806 590
## yes 502 338
##
## Accuracy : 0.8674
## 95% CI : (0.8599, 0.8747)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 1.00000
##
## Kappa : 0.3083
##
## Mcnemar's Test P-Value : 0.00847
##
## Sensitivity : 0.9313
## Specificity : 0.3642
## Pos Pred Value : 0.9202
## Neg Pred Value : 0.4024
## Prevalence : 0.8873
## Detection Rate : 0.8264
## Detection Prevalence : 0.8980
## Balanced Accuracy : 0.6478
##
## 'Positive' Class : no
##
## Accuracy: 0.8674 | Precision: 0.9202 | Sensitivity: 0.9313 | Specificity: 0.3642
## F1 Score: 0.9257 | Kappa: 0.3083 | AUC: 0.7397 | Model Duration: 1.105
Results (focusing on accuracy):
The first experimental random forest model showed a lower accuracy than the default, at 86.74%. Considering more features at every split reduces the randomness between trees: the trees become more similar to each other and more attached to the same dominant (and possibly noisy) features, so the ensemble loses some of the variance reduction that makes a random forest accurate.
# Extract important predictors from model
vip(rf1)
Recommendation:
According to this model, the most important predictors the marketing team should be focusing on are campaign, age, cons_price_idx, month, and poutcome.
Objective: Discover how a larger number of trees affects the model accuracy.
The point of this model is to experiment with the ntree hyperparameter, which is the number of trees grown. For this experiment, instead of the default of 500, 700 will be used. I initially tried to do this via cross-validation but got an "mtry" error even though that param was specified; caret's "rf" method only tunes mtry, so ntree cannot go in the tuning grid (see the sketch below).
Metric: Accuracy
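For the record, the "mtry" error happens because caret's "rf" method exposes only mtry as a tuning parameter; ntree is not tunable and must instead be forwarded through train()'s ... argument. A hedged sketch, left commented out like the earlier attempt since cross-validating 700-tree forests is slow:
# ctrl  <- trainControl(method = "cv", number = 5)
# rf_cv <- train(
#   y ~ ., data = smoted_train,
#   method    = "rf",
#   trControl = ctrl,
#   tuneGrid  = expand.grid(mtry = 3), # the only tunable param for method = "rf"
#   ntree     = 700                    # forwarded on to randomForest()
# )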
# Build model 2 with ntree = 700
set.seed(125)
start <- Sys.time()
rf2 <- randomForest(y ~ ., data = smoted_train, ntree = 700)
# train_control <- trainControl(method = "cv", number = 10)
# tune_grid = expand.grid(ntree=c(500, 700))
#
# rf2 <- train(
# y ~ .,
# data = smoted_train,
# method = "rf",
# trControl = train_control,
# tuneGrid = tune_grid,
# mtry = 3
# )
print(rf2)
##
## Call:
## randomForest(formula = y ~ ., data = smoted_train, ntree = 700)
## Type of random forest: classification
## Number of trees: 700
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 9.32%
## Confusion matrix:
## no yes class.error
## no 26984 2245 0.07680728
## yes 3201 26028 0.10951452
end <- Sys.time()
# Make predictions
predictions_rf2 <- predict(rf2, newdata = test)
cf_matrix_rf2 <- confusionMatrix(predictions_rf2, test$y)
# Bar plot of the target variable
predictions_rf2_df <- as.data.frame(table(predictions_rf2))
colnames(predictions_rf2_df) <- c("Target", "Count")
ggplot(predictions_rf2_df, aes(x = Target, y = Count)) +
geom_bar(stat = "identity", fill="blue") +
labs(
title = "Distribution of Target Variable",
x = "Client Subscription Status (0 = Not Subscribed, 1 = Subscribed)",
y = "Count"
) +
theme_minimal()
# Evaluation
rf2_probs <- predict(rf2, newdata = test, type = "prob")[, 2]
rf2_roc <- roc(test$y, rf2_probs, quiet = TRUE)
rf2_auc <- auc(rf2_roc)
rf2_accuracy <- cf_matrix_rf2$overall['Accuracy']
rf2_kappa <- cf_matrix_rf2$overall['Kappa']
rf2_sensitivity <- cf_matrix_rf2$byClass['Sensitivity']
rf2_specificity <- cf_matrix_rf2$byClass['Specificity']
rf2_f1 <- cf_matrix_rf2$byClass["F1"]
rf2_precision <- cf_matrix_rf2$byClass["Precision"]
rf2_model_duration <- end - start
output4 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 1 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_rf2)), collapse = "\n"), "\n",
"Accuracy:", round(rf2_accuracy, 4), "| Precision:", round(rf2_precision, 4),
"| Sensitivity:", round(rf2_sensitivity, 4), "| Specificity:", round(rf2_specificity, 4), "\n",
"F1 Score:", round(rf2_f1, 4), "| Kappa:", round(rf2_kappa, 4), "| AUC:", round(rf2_auc, 4),
"| Model Duration:", round(rf2_model_duration, 4), "\n\n", sep = " ")
cat(output4)
##
## === Model Selection and Evaluation ===
##
## === Model 2 Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6796 508
## yes 512 420
##
## Accuracy : 0.8762
## 95% CI : (0.8688, 0.8832)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.9993
##
## Kappa : 0.3818
##
## Mcnemar's Test P-Value : 0.9252
##
## Sensitivity : 0.9299
## Specificity : 0.4526
## Pos Pred Value : 0.9304
## Neg Pred Value : 0.4506
## Prevalence : 0.8873
## Detection Rate : 0.8252
## Detection Prevalence : 0.8868
## Balanced Accuracy : 0.6913
##
## 'Positive' Class : no
##
## Accuracy: 0.8762 | Precision: 0.9304 | Sensitivity: 0.9299 | Specificity: 0.4526
## F1 Score: 0.9302 | Kappa: 0.3818 | AUC: 0.7678 | Model Duration: 10.9643
Results (focusing on accuracy):
The second experimental random forest model showed a slightly higher accuracy than the default, at 87.62%. This shows that increasing the number of trees can improve accuracy: more "team members" (trees) join the forest, and their aggregated vote becomes more stable.
# Extract important predictors from model
vip(rf2)
Recommendation:
According to this model, the most important predictors the marketing team should be focusing on are campaign, cons_conf_idx, month, age, and cons_price_idx.
# Create experiments results table
rf_experiments <- data.frame(
Model = c("Default Model", "Model 1", "Model 2"),
Accuracy = c(rf_accuracy, rf1_accuracy, rf2_accuracy),
Error_Rate = c(1-rf_accuracy, 1-rf1_accuracy, 1-rf2_accuracy),
Precision = c(rf_precision, rf1_precision, rf2_precision),
Sensitivity = c(rf_sensitivity, rf1_sensitivity, rf2_sensitivity),
Specificity = c(rf_specificity, rf1_specificity, rf2_specificity),
F1_Score = c(rf_f1, rf1_f1, rf2_f1),
Kappa = c(rf_kappa, rf1_kappa, rf2_kappa),
AUC = c(rf_auc, rf1_auc, rf2_auc),
Duration = c(rf_model_duration, rf1_model_duration, rf2_model_duration)
)
print(rf_experiments)
## Model Accuracy Error_Rate Precision Sensitivity Specificity
## 1 Default Model 0.8751821 0.1248179 0.9296661 0.9296661 0.4461207
## 2 Model 1 0.8674114 0.1325886 0.9202271 0.9313082 0.3642241
## 3 Model 2 0.8761535 0.1238465 0.9304491 0.9299398 0.4525862
## F1_Score Kappa AUC Duration
## 1 0.9296661 0.3757868 0.7676071 33.47942 secs
## 2 0.9257345 0.3082935 0.7396810 66.29887 secs
## 3 0.9301944 0.3818076 0.7678411 657.85647 secs
# Plot ROC curves
plot(rf_roc, col = "blue", main = "ROC Curves Comparison")
plot(rf1_roc, col = "red", add = TRUE)
plot(rf2_roc, col = "green", add = TRUE)
legend("bottomright",
legend = paste0(c("Default Model", "Model 1", "Model 2"), " (AUC=",
round(c(rf_auc, rf1_auc, rf2_auc), 3), ")"),
col = c("blue", "red", "green"),
lwd = 2)
Overall, increasing the number of trees (Model 2) improved the model's performance, but not by much, and it increased the runtime considerably, making the model more computationally expensive. Increasing the number of features considered at each split (Model 1) actually hurt performance, showing that letting every tree see too many features can lower the accuracy of a random forest: the trees become correlated, and important splits can land on noisy features that do not actually influence the outcome variable. Model 1 also roughly doubled the runtime over the default. The fastest model was the default, followed by Model 1; Model 2 was by far the slowest. Model 2 had the highest metrics across the board except sensitivity, so Model 2 has to be the winner among the random forest models.
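One way to sanity-check whether 700 trees buys anything: plot the OOB error against the number of trees, which plot() does directly for a randomForest object. The curve typically flattens well before the default of 500 trees:
# OOB and per-class error vs. number of trees; a flat tail means extra trees
# add runtime but little accuracy
plot(rf2, main = "OOB Error vs Number of Trees")
legend("topright", legend = colnames(rf2$err.rate), col = 1:3, lty = 1:3)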
Since there is no variation in this model, as it uses the default settings and the original datasets, we won't count it as an actual experiment. We will only use it for comparison purposes and for learning about the default params.
From the R documentation, here are the parameters for this model type:
formula: a formula, as in the lm function.
data: a data frame in which to interpret the variables named in formula.
boos: if TRUE (by default), a bootstrap sample of the training set is drawn using the weights for each observation on that iteration. If FALSE, every observation is used with its weights.
mfinal: an integer, the number of iterations for which boosting is run or the number of trees to use. Defaults to mfinal=100 iterations.
coeflearn: if 'Breiman' (by default), alpha=1/2*ln((1-err)/err) is used. If 'Freund', alpha=ln((1-err)/err) is used. In both cases the AdaBoost.M1 algorithm is used and alpha is the weight updating coefficient. On the other hand, if coeflearn is 'Zhu', the SAMME algorithm is implemented with alpha=ln((1-err)/err)+ln(nclasses-1).
control: options that control details of the rpart algorithm. See rpart.control for more details.
set.seed(125)
start <- Sys.time()
adaboost <- boosting(y ~ ., data = smoted_train)
print(summary(adaboost))
## Length Class Mode
## formula 3 formula call
## trees 100 -none- list
## weights 100 -none- numeric
## votes 116916 -none- numeric
## prob 116916 -none- numeric
## class 58458 -none- character
## importance 15 -none- numeric
## terms 3 terms call
## call 3 -none- call
end <- Sys.time()
# Make predictions
# When using boosting, use predict.boosting for the predictions
# Note: must convert the output of predict.boosting to a factor to match the test dataset and avoid errors
predictions_adaboost <- predict.boosting(adaboost, newdata = test)
cf_matrix_adaboost <- confusionMatrix(as.factor(predictions_adaboost$class), test$y)
# Bar plot of the target variable
predictions_adaboost_df <- as.data.frame(table(predictions_adaboost$class))
colnames(predictions_adaboost_df) <- c("Target", "Count")
ggplot(predictions_adaboost_df, aes(x = Target, y = Count)) +
geom_bar(stat = "identity", fill="blue") +
labs(
title = "Distribution of Target Variable",
x = "Client Subscription Status (0 = Not Subscribed, 1 = Subscribed)",
y = "Count"
) +
theme_minimal()
# Evaluation
# adaboost_probs <- predict(adaboost, newdata = test, type = "prob")[, 2]
adaboost_roc <- roc(test$y, predictions_adaboost$prob[, 2], quiet = TRUE)
adaboost_auc <- auc(adaboost_roc)
adaboost_accuracy <- cf_matrix_adaboost$overall['Accuracy']
adaboost_kappa <- cf_matrix_adaboost$overall['Kappa']
adaboost_sensitivity <- cf_matrix_adaboost$byClass['Sensitivity']
adaboost_specificity <- cf_matrix_adaboost$byClass['Specificity']
adaboost_f1 <- cf_matrix_adaboost$byClass["F1"]
adaboost_precision <- cf_matrix_adaboost$byClass["Precision"]
adaboost_model_duration <- end - start
output5 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Default Model Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_adaboost)), collapse = "\n"), "\n",
"Accuracy:", round(adaboost_accuracy, 4), "| Precision:", round(adaboost_precision, 4),
"| Sensitivity:", round(adaboost_sensitivity, 4), "| Specificity:", round(adaboost_specificity, 4), "\n",
"F1 Score:", round(adaboost_f1, 4), "| Kappa:", round(adaboost_kappa, 4), "| AUC:", round(adaboost_auc, 4),
"| Model Duration:", round(adaboost_model_duration, 4), "\n\n", sep = " ")
cat(output5)
##
## === Model Selection and Evaluation ===
##
## === Default Model Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6719 448
## yes 589 480
##
## Accuracy : 0.8741
## 95% CI : (0.8667, 0.8812)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.9999
##
## Kappa : 0.4095
##
## Mcnemar's Test P-Value : 1.377e-05
##
## Sensitivity : 0.9194
## Specificity : 0.5172
## Pos Pred Value : 0.9375
## Neg Pred Value : 0.4490
## Prevalence : 0.8873
## Detection Rate : 0.8158
## Detection Prevalence : 0.8702
## Balanced Accuracy : 0.7183
##
## 'Positive' Class : no
##
## Accuracy: 0.8741 | Precision: 0.9375 | Sensitivity: 0.9194 | Specificity: 0.5172
## F1 Score: 0.9284 | Kappa: 0.4095 | AUC: 0.7702 | Model Duration: 4.0928
Results (focusing on accuracy):
For the default adaboost model, the accuracy is 87.41%. This default model did not beat the best random forest model (87.62%), but it comes extremely close.
# Extract important predictors from model
importanceplot(adaboost, las = 2)
Recommendation:
According to this model, the most important predictors the marketing team should be focusing on are campaign, age, poutcome, cons_price_idx, and month.
Objective: Discover how much accuracy improves with an increase in iterations (the number of trees grown).
The point of this model is to experiment with the mfinal hyperparameter, which is the number of boosting iterations (trees). For this experiment, instead of the default of 100, let's try 150. Giving the boosting procedure more rounds to learn from its mistakes should increase the model's accuracy.
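adabag also provides errorevol(), which traces the error as trees are added and can show whether 150 iterations are actually needed. A hedged sketch, commented out since it can only run after adaboost1 is fit below:
# evol <- errorevol(adaboost1, newdata = test)
# plot(evol$error, type = "l", xlab = "Number of trees", ylab = "Test error")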
# Experiment 1 - increase mfinal param
set.seed(125)
start <- Sys.time()
adaboost1 <- boosting(y ~ ., data = smoted_train, mfinal=150)
print(summary(adaboost1))
## Length Class Mode
## formula 3 formula call
## trees 150 -none- list
## weights 150 -none- numeric
## votes 116916 -none- numeric
## prob 116916 -none- numeric
## class 58458 -none- character
## importance 15 -none- numeric
## terms 3 terms call
## call 4 -none- call
end <- Sys.time()
# Make predictions
# When using boosting, use predict.boosting for the predictions
# Note: must convert the output of predict.boosting to a factor to match the test dataset and avoid errors
predictions_adaboost1 <- predict.boosting(adaboost1, newdata = test)
cf_matrix_adaboost1 <- confusionMatrix(as.factor(predictions_adaboost1$class), test$y)
# Bar plot of the target variable
predictions_adaboost1_df <- as.data.frame(table(predictions_adaboost1$class))
colnames(predictions_adaboost1_df) <- c("Target", "Count")
ggplot(predictions_adaboost1_df, aes(x = Target, y = Count)) +
geom_bar(stat = "identity", fill="blue") +
labs(
title = "Distribution of Target Variable",
x = "Client Subscription Status (0 = Not Subscribed, 1 = Subscribed)",
y = "Count"
) +
theme_minimal()
# Evaluation
adaboost1_roc <- roc(test$y, predictions_adaboost1$prob[, 2], quiet = TRUE)
adaboost1_auc <- auc(adaboost1_roc)
adaboost1_accuracy <- cf_matrix_adaboost1$overall['Accuracy']
adaboost1_kappa <- cf_matrix_adaboost1$overall['Kappa']
adaboost1_sensitivity <- cf_matrix_adaboost1$byClass['Sensitivity']
adaboost1_specificity <- cf_matrix_adaboost1$byClass['Specificity']
adaboost1_f1 <- cf_matrix_adaboost1$byClass["F1"]
adaboost1_precision <- cf_matrix_adaboost1$byClass["Precision"]
adaboost1_model_duration <- end - start
output6 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 1 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_adaboost1)), collapse = "\n"), "\n",
"Accuracy:", round(adaboost1_accuracy, 4), "| Precision:", round(adaboost1_precision, 4),
"| Sensitivity:", round(adaboost1_sensitivity, 4), "| Specificity:", round(adaboost1_specificity, 4), "\n",
"F1 Score:", round(adaboost1_f1, 4), "| Kappa:", round(adaboost1_kappa, 4), "| AUC:", round(adaboost1_auc, 4),
"| Model Duration:", round(adaboost1_model_duration, 4), "\n\n", sep = " ")
cat(output6)
##
## === Model Selection and Evaluation ===
##
## === Model 1 Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6738 452
## yes 570 476
##
## Accuracy : 0.8759
## 95% CI : (0.8686, 0.883)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.9994247
##
## Kappa : 0.4121
##
## Mcnemar's Test P-Value : 0.0002524
##
## Sensitivity : 0.9220
## Specificity : 0.5129
## Pos Pred Value : 0.9371
## Neg Pred Value : 0.4551
## Prevalence : 0.8873
## Detection Rate : 0.8181
## Detection Prevalence : 0.8730
## Balanced Accuracy : 0.7175
##
## 'Positive' Class : no
##
## Accuracy: 0.8759 | Precision: 0.9371 | Sensitivity: 0.922 | Specificity: 0.5129
## F1 Score: 0.9295 | Kappa: 0.4121 | AUC: 0.7712 | Model Duration: 6.6524
Results (focusing on accuracy):
For this adaboost model, the accuracy is higher than the default at 87.59%. Although it gives the highest accuracy, it was also the slowest of the adaboost runs at roughly 6.65 mins, so there is a real cost for this level of accuracy.
# Extract important predictors from model
importanceplot(adaboost1, las = 2)
Recommendation:
According to this model, the most important predictors the marketing team should be focusing on are campaign, age, poutcome, cons_price_idx, and month (the same as the default model).
Objective: Discover if the model’s accuracy decreases with less training data and larger testing data.
The point of this model is to experiment with how much training data adaboost actually needs. Adaboost should be the strongest model type here, so for this experiment I want to see whether we can keep a great accuracy while using less training data. After several adaboost variants took too long to even finish running, cutting the amount of data needed would be a welcome speed-up. (I wanted to test params such as coeflearn, which controls each learner's contribution, but that model never finished running.)
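For context, smoted_train1 and test1 refer to the smaller split prepared earlier; the confusion-matrix totals below are consistent with a 60/40 split instead of the original 80/20. A purely illustrative, hedged sketch of how such a split could be built with caret (the object names and the df placeholder are assumptions, not the document's actual code):
# set.seed(125)
# idx1   <- createDataPartition(df$y, p = 0.6, list = FALSE)
# train1 <- df[idx1, ]   # 60% for training
# test1  <- df[-idx1, ]  # 40% for testing
# smoted_train1 would then come from applying the same SMOTE step to train1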
# Experiment 2 - try using smaller training dataset
set.seed(125)
start <- Sys.time()
adaboost2 <- boosting(y ~ ., data = smoted_train1)
print(summary(adaboost2))
## Length Class Mode
## formula 3 formula call
## trees 100 -none- list
## weights 100 -none- numeric
## votes 87688 -none- numeric
## prob 87688 -none- numeric
## class 43844 -none- character
## importance 15 -none- numeric
## terms 3 terms call
## call 3 -none- call
end <- Sys.time()
# Make predictions
# When using boosting, use predict.boosting for the predictions
# Note: must convert the output of predict.boosting to a factor to match the test dataset and avoid errors
predictions_adaboost2 <- predict.boosting(adaboost2, newdata = test1)
cf_matrix_adaboost2 <- confusionMatrix(as.factor(predictions_adaboost2$class), test1$y)
# Bar plot of the target variable
predictions_adaboost2_df <- as.data.frame(table(predictions_adaboost2$class))
colnames(predictions_adaboost2_df) <- c("Target", "Count")
ggplot(predictions_adaboost2_df, aes(x = Target, y = Count)) +
geom_bar(stat = "identity", fill="blue") +
labs(
title = "Distribution of Target Variable",
x = "Client Subscription Status (0 = Not Subscribed, 1 = Subscribed)",
y = "Count"
) +
theme_minimal()
# Evaluation
adaboost2_roc <- roc(test1$y, predictions_adaboost2$prob[, 2], quiet = TRUE)
adaboost2_auc <- auc(adaboost2_roc)
adaboost2_accuracy <- cf_matrix_adaboost2$overall['Accuracy']
adaboost2_kappa <- cf_matrix_adaboost2$overall['Kappa']
adaboost2_sensitivity <- cf_matrix_adaboost2$byClass['Sensitivity']
adaboost2_specificity <- cf_matrix_adaboost2$byClass['Specificity']
adaboost2_f1 <- cf_matrix_adaboost2$byClass["F1"]
adaboost2_precision <- cf_matrix_adaboost2$byClass["Precision"]
adaboost2_model_duration <- end - start
output7 <- paste("\n=== Model Selection and Evaluation ===\n\n",
"=== Model 2 Evaluation ===\n",
"Confusion Matrix:\n",
paste(capture.output(print(cf_matrix_adaboost2)), collapse = "\n"), "\n",
"Accuracy:", round(adaboost2_accuracy, 4), "| Precision:", round(adaboost2_precision, 4),
"| Sensitivity:", round(adaboost2_sensitivity, 4), "| Specificity:", round(adaboost2_specificity, 4), "\n",
"F1 Score:", round(adaboost2_f1, 4), "| Kappa:", round(adaboost2_kappa, 4), "| AUC:", round(adaboost2_auc, 4),
"| Model Duration:", round(adaboost2_model_duration, 4), "\n\n", sep = " ")
cat(output7)
##
## === Model Selection and Evaluation ===
##
## === Model 2 Evaluation ===
## Confusion Matrix:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 13428 887
## yes 1187 969
##
## Accuracy : 0.8741
## 95% CI : (0.8689, 0.8791)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4118
##
## Mcnemar's Test P-Value : 5.186e-11
##
## Sensitivity : 0.9188
## Specificity : 0.5221
## Pos Pred Value : 0.9380
## Neg Pred Value : 0.4494
## Prevalence : 0.8873
## Detection Rate : 0.8153
## Detection Prevalence : 0.8691
## Balanced Accuracy : 0.7204
##
## 'Positive' Class : no
##
## Accuracy: 0.8741 | Precision: 0.938 | Sensitivity: 0.9188 | Specificity: 0.5221
## F1 Score: 0.9283 | Kappa: 0.4118 | AUC: 0.7839 | Model Duration: 2.5773
Results (focusing on accuracy):
For this adaboost model, the accuracy is essentially identical to the default at 87.41%. This tells us that using 20% less training data did not hurt model performance, and a very good accuracy score could likely still be reached by tuning this smaller, faster model. A big plus, unlike some of the variants I attempted, is that this one actually finished running.
# Extract important predictors from model
importanceplot(adaboost2, las = 2)
Recommendation:
According to this model, the most important predictors the marketing team should be focusing on are campaign, age, poutcome, month, and cons_price_idx (the same set as the default model, but with month ranked above cons_price_idx).
# Create experiments results table
adaboost_experiments <- data.frame(
Model = c("Default Model", "Model 1", "Model 2"),
Accuracy = c(adaboost_accuracy, adaboost1_accuracy, adaboost2_accuracy),
Error_Rate = c(1-adaboost_accuracy, 1-adaboost1_accuracy, 1-adaboost2_accuracy),
Precision = c(adaboost_precision, adaboost1_precision, adaboost2_precision),
Sensitivity = c(adaboost_sensitivity, adaboost1_sensitivity, adaboost2_sensitivity),
Specificity = c(adaboost_specificity, adaboost1_specificity, adaboost2_specificity),
F1_Score = c(adaboost_f1, adaboost1_f1, adaboost2_f1),
Kappa = c(adaboost_kappa, adaboost1_kappa, adaboost2_kappa),
AUC = c(adaboost_auc, adaboost1_auc, adaboost2_auc),
Duration = c(adaboost_model_duration, adaboost1_model_duration, adaboost2_model_duration)
)
print(adaboost_experiments)
## Model Accuracy Error_Rate Precision Sensitivity Specificity
## 1 Default Model 0.8740894 0.1259106 0.9374913 0.9194034 0.5172414
## 2 Model 1 0.8759106 0.1240894 0.9371349 0.9220033 0.5129310
## 3 Model 2 0.8740817 0.1259183 0.9380370 0.9187821 0.5220905
## F1_Score Kappa AUC Duration
## 1 0.9283592 0.4094865 0.7701523 4.092809 mins
## 2 0.9295075 0.4120631 0.7712281 6.652441 mins
## 3 0.9283097 0.4118166 0.7839215 2.577312 mins
# Plot ROC curves
plot(adaboost_roc, col = "blue", main = "ROC Curves Comparison")
plot(adaboost1_roc, col = "red", add = TRUE)
plot(adaboost2_roc, col = "green", add = TRUE)
legend("bottomright",
legend = paste0(c("Default Model", "Model 1", "Model 2"), " (AUC=",
round(c(adaboost_auc, adaboost1_auc, adaboost2_auc), 3), ")"),
col = c("blue", "red", "green"),
lwd = 2)
Overall, Model 1 performed the best on accuracy, but not by much. This suggests that more iterations do improve a model's performance, since it gets more rounds to learn from. It was not the best across the board, though: it had the smallest precision and specificity of the three, and it took the longest to run at 6.65 mins. Model 2 had essentially the same accuracy as the default model and had the highest precision, specificity, and AUC. Perhaps most importantly, its runtime was roughly 61% lower than Model 1's, a major difference, and lower than the default model's as well. Because of this, Model 2 has to be the winner of the adaboost models.
Final results table:
all_experiments <- data.frame(
Model = c("DT Default Model", "DT Model 1", "DT Model 2", "RF Default Model", "RF Model 1", "RF Model 2", "AB Default Model", "AB Model 1", "AB Model 2"),
Accuracy = c(decision_tree_accuracy, decision_tree1_accuracy, decision_tree2_accuracy, rf_accuracy, rf1_accuracy, rf2_accuracy, adaboost_accuracy, adaboost1_accuracy, adaboost2_accuracy),
Error_Rate = c(1-decision_tree_accuracy, 1-decision_tree1_accuracy, 1-decision_tree2_accuracy, 1-rf_accuracy, 1-rf1_accuracy, 1-rf2_accuracy, 1-adaboost_accuracy, 1-adaboost1_accuracy, 1-adaboost2_accuracy),
Precision = c(decision_tree_precision, decision_tree1_precision, decision_tree2_precision, rf_precision, rf1_precision, rf2_precision, adaboost_precision, adaboost1_precision, adaboost2_precision),
Sensitivity = c(decision_tree_sensitivity, decision_tree1_sensitivity, decision_tree2_sensitivity, rf_sensitivity, rf1_sensitivity, rf2_sensitivity, adaboost_sensitivity, adaboost1_sensitivity, adaboost2_sensitivity),
Specificity = c(decision_tree_specificity, decision_tree1_specificity, decision_tree2_specificity, rf_specificity, rf1_specificity, rf2_specificity, adaboost_specificity, adaboost1_specificity, adaboost2_specificity),
F1_Score = c(decision_tree_f1, decision_tree1_f1, decision_tree2_f1, rf_f1, rf1_f1, rf2_f1, adaboost_f1, adaboost1_f1, adaboost2_f1),
Kappa = c(decision_tree_kappa, decision_tree1_kappa, decision_tree2_kappa, rf_kappa, rf1_kappa, rf2_kappa, adaboost_kappa, adaboost1_kappa, adaboost2_kappa),
AUC = c(decision_tree_auc, decision_tree1_auc, decision_tree2_auc, rf_auc, rf1_auc, rf2_auc, adaboost_auc, adaboost1_auc, adaboost2_auc),
Duration = c(decision_tree_model_duration, decision_tree1_model_duration, decision_tree2_model_duration, rf_model_duration, rf1_model_duration, rf2_model_duration, adaboost_model_duration, adaboost1_model_duration, adaboost2_model_duration)
)
print(all_experiments)
## Model Accuracy Error_Rate Precision Sensitivity Specificity
## 1 DT Default Model 0.8362069 0.1637931 0.9430483 0.8678161 0.5872845
## 2 DT Model 1 0.8355998 0.1644002 0.9415604 0.8686371 0.5754310
## 3 DT Model 2 0.8362069 0.1637931 0.9430483 0.8678161 0.5872845
## 4 RF Default Model 0.8751821 0.1248179 0.9296661 0.9296661 0.4461207
## 5 RF Model 1 0.8674114 0.1325886 0.9202271 0.9313082 0.3642241
## 6 RF Model 2 0.8761535 0.1238465 0.9304491 0.9299398 0.4525862
## 7 AB Default Model 0.8740894 0.1259106 0.9374913 0.9194034 0.5172414
## 8 AB Model 1 0.8759106 0.1240894 0.9371349 0.9220033 0.5129310
## 9 AB Model 2 0.8740817 0.1259183 0.9380370 0.9187821 0.5220905
## F1_Score Kappa AUC Duration
## 1 0.9038695 0.3571577 0.7275503 1.0857201 secs
## 2 0.9036299 0.3507004 0.7220341 0.5946162 secs
## 3 0.9038695 0.3571577 0.6164905 0.9451001 secs
## 4 0.9296661 0.3757868 0.7676071 33.4794230 secs
## 5 0.9257345 0.3082935 0.7396810 66.2988660 secs
## 6 0.9301944 0.3818076 0.7678411 657.8564739 secs
## 7 0.9283592 0.4094865 0.7701523 245.5685310 secs
## 8 0.9295075 0.4120631 0.7712281 399.1464460 secs
## 9 0.9283097 0.4118166 0.7839215 154.6387150 secs
Overall, all of the models produced fairly good accuracy scores, but none of them could get above 88% accuracy. The model with the highest accuracy is Random Forest Model 2 (87.62%). Although it did not lead on every other metric, its numbers were consistently high; for metrics such as AUC it is very close to the best value. Its main cost is runtime: at roughly 11 minutes it was the slowest model trained, slower even than the adaboost models. Still, given its top accuracy, the chosen model would have to be Random Forest Model 2.
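The same pick can be read straight off the final results table:
# Select the winning model programmatically from the results table
best <- all_experiments[which.max(all_experiments$Accuracy), ]
print(best)  # RF Model 2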