This assignment consists of conducting at least two (2) experiments for each of three algorithms: Decision Trees, Random Forest, and Adaboost. That is, at least six (6) experiments in total (3 algorithms x 2 experiments each). For each experiment, I define what I am trying to achieve before the run, conduct the experiment, and then review how it went. These experiments allow me to compare the algorithms and choose the optimal model. I use the data set and EDA from Assignment One. For each of the algorithms above, the experiments entail:
I will also conduct variations on the experiments.
# Load required packages:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Picking up from Assignment One, after cleaning the data set, I wrote a new data set and named it "a2_bankdata" (for Assignment Two Bankdata). I load it here.
url <- "C:/Users/zengo/OneDrive/Magic Briefcase/School of Professional Studies/DATA 622 Machine Learning/Data 622 Asgmt 1/a2_bankdata.csv"
# To ensure that I have the correct data structure, I explicitly specified the column types.
a2_bankdata <- read_csv(url, col_types = "nffffnfffnfnnff")
glimpse(a2_bankdata)
## Rows: 45,211
## Columns: 15
## $ age <dbl> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job <fct> management, technician, entrepreneur, blue-collar, unknown, …
## $ marital <fct> married, single, married, married, single, married, single, …
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, …
## $ default <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,…
## $ balance <dbl> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, y…
## $ loan <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no…
## $ contact <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ day <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month <fct> may, may, may, may, may, may, may, may, may, may, may, may, …
## $ duration <dbl> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
# View the first 6 rows of data.
head(a2_bankdata)
## # A tibble: 6 × 15
## age job marital education default balance housing loan contact day
## <dbl> <fct> <fct> <fct> <fct> <dbl> <fct> <fct> <fct> <dbl>
## 1 58 management married tertiary no 2143 yes no unknown 5
## 2 44 technician single secondary no 29 yes no unknown 5
## 3 33 entrepren… married secondary no 2 yes yes unknown 5
## 4 47 blue-coll… married unknown no 1506 yes no unknown 5
## 5 33 unknown single unknown no 1 no no unknown 5
## 6 35 management married tertiary no 231 yes no unknown 5
## # ℹ 5 more variables: month <fct>, duration <dbl>, campaign <dbl>,
## # poutcome <fct>, y <fct>
The data shows 45,211 rows and 15 columns: one dependent variable (y) and fourteen independent variables (features). Five columns are continuous (numeric) variables and ten columns are categorical (factor) variables.
# To keep the explanations of all the variables at hand, I inserted an interactive table. I copied the Variables Table from the Bank Marketing homepage into an Excel CSV file and then loaded it into this file as follows:
names_url <- "C:/Users/zengo/OneDrive/Magic Briefcase/School of Professional Studies/DATA 622 Machine Learning/Data 622 Asgmt 1/bank-names.csv"
# I used stringsAsFactors = FALSE to make sure that the descriptions remain character data instead of being converted into factor variables.
bank_names <- read.csv(names_url, stringsAsFactors = FALSE)
# Load the DT library to make an interactive data table.
library(DT)
# I ran the following command to display the interactive data table. The options tell RStudio how many rows to display, and the scrollX option produces a horizontal scroll bar. A nice feature of the DT library is the search box it adds for larger data tables.
datatable(bank_names, options = list(pageLength = 20, scrollX = TRUE, width = "90%"))
Upon inspection, some of the variables needed a different atomic vector type: job, marital, education, default, housing, loan, contact, month, poutcome, and y were converted from character to factor to represent their specific levels.
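As a cross-check, the same conversions could be done explicitly after import with dplyr; a minimal sketch, assuming the columns had instead been read in as character (this is equivalent to the col_types specification passed to read_csv() above):
# Coerce the categorical columns to factors (equivalent to col_types = "nffffnfffnfnnff").
a2_bankdata <- a2_bankdata %>%
  mutate(across(c(job, marital, education, default, housing, loan,
                  contact, month, poutcome, y), as.factor))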
# To get a statistical summary of all the variables in a2_bankdata, I ran the summary() function.
summary(a2_bankdata)
## age job marital education
## Min. :18.00 blue-collar:9732 married :27214 tertiary :13301
## 1st Qu.:33.00 management :9458 single :12790 secondary:23202
## Median :39.00 technician :7597 divorced: 5207 unknown : 1857
## Mean :40.94 admin. :5171 primary : 6851
## 3rd Qu.:48.00 services :4154
## Max. :95.00 retired :2264
## (Other) :6835
## default balance housing loan contact
## no :44396 Min. : -8019 yes:25130 no :37967 unknown :13020
## yes: 815 1st Qu.: 72 no :20081 yes: 7244 cellular :29285
## Median : 448 telephone: 2906
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
##
## day month duration campaign
## Min. : 1.00 may :13766 Min. : 0.0 Min. : 1.000
## 1st Qu.: 8.00 jul : 6895 1st Qu.: 103.0 1st Qu.: 1.000
## Median :16.00 aug : 6247 Median : 180.0 Median : 2.000
## Mean :15.81 jun : 5341 Mean : 258.2 Mean : 2.764
## 3rd Qu.:21.00 nov : 3970 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :31.00 apr : 2932 Max. :4918.0 Max. :63.000
## (Other): 6060
## poutcome y
## unknown:36959 no :39922
## failure: 4901 yes: 5289
## other : 1840
## success: 1511
##
##
##
Upon inspection of the variables, the dependent variable, y, is categorical, which points to the use of a classification algorithm. The response is binary, which is well suited for Decision Trees, Random Forest and Adaboost.
See the attached Essay for discussion.
Hypothesis (H1): Models trained on SMOTE-balanced data will exhibit a lower risk of overfitting than models trained on the unbalanced data.
Now that the data has been significantly preprocessed from Assignment One, the next step is to split the data with the goal of seeing how well a Decision Tree model generalizes to new data. Following Nwanganga & Chapple (2020), I used the set.seed() function to make the results reproducible. I split the data 75% into a training set to fit the models, and 25% into a testing set to evaluate model performance.
set.seed(1234)
sample_set <- sample(nrow(a2_bankdata), round(nrow(a2_bankdata)*.75), replace = FALSE)
a2_bankdata_train <- a2_bankdata[sample_set, ]
a2_bankdata_test <- a2_bankdata[-sample_set, ]
I ran a test to check the proportionality of the split data. I needed to make sure that the training and testing data are proportional to the original data set with respect to the dependent variable, “y”.
round(prop.table(table(select(a2_bankdata, y), exclude = NULL)), 4) * 100
## y
## no yes
## 88.3 11.7
round(prop.table(table(select(a2_bankdata_train, y), exclude = NULL)), 4) * 100
## y
## no yes
## 88.23 11.77
round(prop.table(table(select(a2_bankdata_test, y), exclude = NULL)), 4) * 100
## y
## no yes
## 88.53 11.47
The class proportions are consistent across the full, training, and testing data sets. The classes are unbalanced in both the training and testing data. Per the first experiment, this will remain undisturbed.
In this section, I built a Decision Tree model using the rpart() function from the rpart package.
# The rpart package is used for creating recursive partitioning decision trees.
library(rpart)
a2_bankdata_mod1 <-
rpart( # This function creates the decision tree.
y ~ ., # Specifies that y is the target variable. The decision tree will recursively split the data into branches
# based on all the feature values to maximize the separation of different classes in y.
method = "class", # The method "class" specifies that this is a classification tree.
data = a2_bankdata_train
)
# The rpart.plot library allows one to visualize a decision tree created by the rpart package.
library(rpart.plot)
# The rpart.plot() function formats and labels the decision tree stored in the "a2_bankdata_mod1" model.
rpart.plot(a2_bankdata_mod1)
Interpretation:
The labels of the root node are “no”, 0.12 and 100%. This informs us that 100% of the data is represented and there is only an approximately 12% chance that when a call is made, it will result in a “yes” response. Because the “yes” response is less than 50%, the root node is labeled “no”. The chance of getting a “yes” response is consistent with the class distribution proportions calculated with the round() function above. The decision tree algorithm likely found “duration” to be the most predictive feature and 522 seconds to be the best threshold, where the distribution of “yes” vs. “no” changes significantly. A second split on duration occurs at 827 seconds; for example, a call over 827 seconds results in a 58% chance of a “yes”.
Following along the left node, out of the calls less than 522 seconds, 89% were considered a failure, other or unknown for a previous marketing campaign, and only 8% of the respondents said “yes”. However, if there was a positive response to a previous marketing campaign, there was a 61% chance of a “yes”, but only for 3% of the cases, and if the call lasted more than 163 seconds, the chances increased to 74%, but for only 2% of the data.
Interestingly, calls made in April, August, February, January, July, June, May, or November, that last less than 522 seconds and that failed in a previous marketing campaign, have a 95% chance of a “no” response, covering 83% of the data. This may indicate that the timing of the call (in certain months) and its duration are very important to elicit a “yes” response.
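To cross-check this reading of the plot, the tree’s decision paths can also be printed as text rules; a minimal sketch using the rpart.plot package loaded above:
# Print each leaf of the tree as a rule, along with the fraction of cases it covers.
rpart.rules(a2_bankdata_mod1, cover = TRUE)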
# Evaluation of the trained "a2_bankdata_mod1" model against the test "a2_bankdata_test" data. The "a2_bankdata_mod1" model is passed through the predict() function to classify the test data "a2_bankdata_test".
a2_bankdata_pred1 <- predict(a2_bankdata_mod1, a2_bankdata_test, type = 'class')
# Create the Confusion Matrix.
# Load the "caret" package. Note that to run a confusion matrix, tibbles should be converted to data frames. However, the "caret" package handles tibbles without errors and they don't need to be converted to data frames. However, to make sure that no problems arise using the "caret" package, I converted the "a2_bankdata_train" tibble to a data frame.
# Conversion the "a2_bankdata_train" data from tibble to data.frame.
a2_bankdata_train <- as.data.frame(a2_bankdata_train)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# Use the confusionMatrix() function from the caret package.
conf_matrix <- confusionMatrix(data = a2_bankdata_pred1, reference = a2_bankdata_test$y, positive = "yes")
# Print and Analyze the Results.
print(conf_matrix) # Prints the confusion matrix and statistics
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 9658 743
## yes 348 554
##
## Accuracy : 0.9035
## 95% CI : (0.8979, 0.9089)
## No Information Rate : 0.8853
## P-Value [Acc > NIR] : 2.637e-10
##
## Kappa : 0.4523
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.42714
## Specificity : 0.96522
## Pos Pred Value : 0.61419
## Neg Pred Value : 0.92856
## Prevalence : 0.11475
## Detection Rate : 0.04901
## Detection Prevalence : 0.07980
## Balanced Accuracy : 0.69618
##
## 'Positive' Class : yes
##
Key performance metrics: Experiment 1A
The first model is correctly predicting outcomes about 90.35% of the time. Because this is unbalanced data in a Decision Tree on a small-to-medium-sized data set, the model is subject to bias and the high degree of accuracy may be suspect.
The Mcnemar’s Test P-Value tests if the types of errors (FP vs FN) are symmetrical. The low p-value of < 2.2e-16 suggests that errors are not balanced and the model is biased.
The Sensitivity of 0.42714 indicates that it has a high false negative rate (many “yes” cases are missed).
The Specificity of 0.96522 indicates that it has a high true negative rate (many “no” cases are correct).
The Pos Pred Value (Precision) of 0.61419 indicates the proportion of “yes” predictions that were actually correct.
The Neg Pred Value of 0.92856 indicates the proportion of “no” predictions that were actually correct.
The Balanced Accuracy of 0.69618 averages how well the model predicts the “yes” and “no” classes; it would be closer to 1 if the model performed well on both.
Comments:
From the output of the Confusion Matrix (see Table 1), although the model is highly accurate, it is biased, favoring “no” predictions and struggling to detect “yes” cases. It performs well in identifying negatives, but it performs poorly on the positive class and, because of the unbalanced data, should not be relied upon to detect true positives.
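As a sanity check, these metrics can be recomputed directly from the confusion-matrix counts shown above; a minimal sketch:
# Counts from the confusion matrix above, with "yes" as the positive class.
tp <- 554; fn <- 743; fp <- 348; tn <- 9658
tp / (tp + fn)                          # Sensitivity, approx. 0.4271
tn / (tn + fp)                          # Specificity, approx. 0.9652
tp / (tp + fp)                          # Pos Pred Value (Precision), approx. 0.6142
tn / (tn + fn)                          # Neg Pred Value, approx. 0.9286
(tp / (tp + fn) + tn / (tn + fp)) / 2   # Balanced Accuracy, approx. 0.6962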
In the second Decision Tree experiment, I apply the SMOTE function to balance the classes in the training data and compare the Decision Tree output and metrics. As a starting point, I carried over the same train/test split, in which the classes are unbalanced.
# Load the DMwR package to run the SMOTE function.
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
set.seed(1234)
# Apply the SMOTE function to the training data set. I created an object labeled "a2_bankdata_train_smote" for the balanced training data so it can be reused later in other models.
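# Per the DMwR documentation, perc.over = 100 creates one synthetic "yes" case for each
# existing "yes" case (doubling the minority class), and perc.under = 200 then samples two
# "no" cases for every synthetic case created, which yields the 50/50 split shown below.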
a2_bankdata_train_smote <- SMOTE(y ~ ., data = a2_bankdata_train, perc.over = 100, perc.under = 200)
# Create a table showing the percentage balances.
round(prop.table(table(select(a2_bankdata_train_smote, y), exclude = NULL)), 4) * 100
## y
## no yes
## 50 50
In this section, I built a Decision Tree model on the balanced training data using the rpart() function from the rpart package. To distinguish it from the first experiment, I created a new object “a2_bankdata_mod2”.
# Train a Decision Tree model on the balanced training data.
library(rpart)
a2_bankdata_mod2 <- rpart(
y ~ .,
method = "class",
data = a2_bankdata_train_smote
)
library(rpart.plot)
# Plot the balanced training data Decision Tree.
# The rpart.plot() function formats and labels the decision tree stored in the "a2_bankdata_mod2" model.
rpart.plot(a2_bankdata_mod2)
Interpretation: Experiment 1B
The labels of the balanced training data set root node are “no”, 0.50 and 100%. This informs us that 100% of the balanced training data is represented and there is an approximately 50% chance that when a call is made, it will result in a “yes” response. This is significantly higher than with the unbalanced data. The chance of getting a “yes” response is consistent with the class distribution proportions produced by the SMOTE function. The decision tree algorithm likely found that “duration” is still the most predictive feature.
The Decision Tree chose duration as the most important feature and splits it at 250 seconds. Calls shorter than 250 seconds, covering 45% of the training data, have only a 22% chance of a “yes”, while calls of 250 seconds or longer, covering the remaining 55%, have a 73% chance of a “yes”. The second most important feature is the outcome of prior calls, and the third most important is the time of year when a call is made.
Significantly, if a call lasts at least 250 seconds but less than 436 seconds (roughly a 3-minute window), the outcome of a previous call is unknown or failed, and the call occurred in April through August or November to February, there was a 43% chance of getting a “no” response, covering 16% of the training data.
Significantly, if a call lasts at least 250 seconds but less than 436 seconds (roughly a 3-minute window) and the outcome of a previous call was a success, there was an 84% chance of getting a “yes” response, covering 33% of the training data. The time of year did not appear to matter.
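As with the first tree, this reading can be verified by printing the decision paths as rules; a minimal sketch:
# Print each leaf of the SMOTE-balanced tree as a rule, along with the fraction of cases it covers.
rpart.rules(a2_bankdata_mod2, cover = TRUE)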
# Evaluation of the trained "a2_bankdata_mod2" model against the test "a2_bankdata_test" data. The "a2_bankdata_mod2" model is passed through the predict() function to classify the test data "a2_bankdata_test".
a2_bankdata_pred2 <- predict(a2_bankdata_mod2, a2_bankdata_test, type = 'class')
# Create the Confusion Matrix.
# Use the confusionMatrix() function from the caret package.
conf_matrix <- confusionMatrix(data = a2_bankdata_pred2, reference = a2_bankdata_test$y, positive = "yes")
# Print the confusion matrix and Analyze the Results.
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 8583 346
## yes 1423 951
##
## Accuracy : 0.8435
## 95% CI : (0.8367, 0.8501)
## No Information Rate : 0.8853
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4341
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.73323
## Specificity : 0.85779
## Pos Pred Value : 0.40059
## Neg Pred Value : 0.96125
## Prevalence : 0.11475
## Detection Rate : 0.08414
## Detection Prevalence : 0.21003
## Balanced Accuracy : 0.79551
##
## 'Positive' Class : yes
##
Key performance metrics: Experiment 1B
The first model correctly predicted outcomes about 90.35% of the time, but the second model dropped significantly to 84.35%.
The Mcnemar’s Test P-Value tests if the types of errors (FP vs FN) are symmetrical. The low p-values (< 2.2e-16 and < 2e-16) suggest that errors are not balanced in either model and both are still biased.
The Sensitivity of 0.73323 indicates that it is much better at detecting “yes” cases compared to model 1 (0.42714).
The Specificity of 0.85779 indicates that it is not as good at rejecting false positives compared to model 1 (0.96522).
The Pos Pred Value (Precision) of 0.40059 indicates more false alarms when it says “yes” compared to model 1 (0.61419).
The Neg Pred Value of 0.96125 indicates the model is slightly better at predicting “no” when it’s actually “no” compared to model 1 (0.92856).
The Balanced Accuracy of 0.79551 indicates the model is more balanced across both classes compared to model 1 (0.69618).
Comments:
The first model has better overall accuracy with fewer false positives. However, the second model, trained with SMOTE, significantly improves the ability to detect positive cases. This may be important in scenarios where identifying “yes” cases quickly matters and time is money. It produces more false positives and has lower overall accuracy, but a much better balance in recognizing both classes. Which model is better cannot be determined without understanding whether false negatives or false positives carry more risk in the telemarketing context; at this point it is a matter of cost. If making telemarketing calls is expensive (not only in money, but in reputation), then the first model may be more advantageous, making fewer false-positive calls with a higher precision that increases the chance of success on each call. If the calls are inexpensive, the second model is more advantageous: you make more false-positive calls (higher volume), but they are cheap and may yield more “yes” responses.
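To make the cost argument concrete, the two models can be compared under assumed economics; a minimal sketch with purely hypothetical figures (the $5 cost per call and $100 value per “yes” below are illustrative assumptions, not estimates from the data):
# Hypothetical per-call cost and per-conversion value (illustrative only).
call_cost <- 5
conversion_value <- 100
# Net value of calling only the customers each model predicts as "yes":
# true positives earn the conversion value minus the call cost; false positives only incur the cost.
net_value <- function(tp, fp) tp * (conversion_value - call_cost) - fp * call_cost
net_value(tp = 554, fp = 348)    # Model 1 (unbalanced Decision Tree)
net_value(tp = 951, fp = 1423)   # Model 2 (SMOTE-balanced Decision Tree)
Raising the assumed call cost (or lowering the conversion value) shifts the comparison back toward the higher-precision first model, which is the trade-off described above.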
Conclusion:
H1 is not supported. The Decision Tree experiments show that although model 1 has higher Accuracy and Specificity, and model 2 has higher Balanced Accuracy and Sensitivity, it is not conclusive which model is better without understanding the costs of making such telemarketing calls.
Nwanganga & Chapple (2020) discuss the use of Random Forest and provide some R code to analyze outcomes. I augmented the interpretation visualization of the outcome with the help of code obtained from ChatGPT by prompting how to use ggplot2 to visualize the top 10 most important Random Forest variables (features).
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(caret)
# From Nwanganga & Chapple (2020), I used the available code to create a Random Forest.
set.seed(1235)
rf_mod1 <- train(
y ~ .,
data = a2_bankdata_train,
metric = "Accuracy",
method = "rf",
trControl = trainControl(method = "none"),
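# trainControl(method = "none") skips caret's resampling, so a single forest is fit with the
# settings below; .mtry = 4 means each split considers four randomly selected predictors
# (carried over from the Nwanganga & Chapple code referenced above).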
tuneGrid = expand.grid(.mtry = 4)
)
# Create predictions to model a Confusion Matrix.
rf_pred1 <- predict(rf_mod1, a2_bankdata_test)
# The rf_pred1 and a2_bankdata_test$y objects need to be converted to factors, as required by confusionMatrix().
rf_pred1 <- as.factor(rf_pred1)
a2_bankdata_test$y <- as.factor(a2_bankdata_test$y)
# Create Confusion Matrix.
confusionMatrix(rf_pred1, a2_bankdata_test$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 9858 926
## yes 148 371
##
## Accuracy : 0.905
## 95% CI : (0.8994, 0.9103)
## No Information Rate : 0.8853
## P-Value [Acc > NIR] : 7.951e-12
##
## Kappa : 0.3671
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.28604
## Specificity : 0.98521
## Pos Pred Value : 0.71484
## Neg Pred Value : 0.91413
## Prevalence : 0.11475
## Detection Rate : 0.03282
## Detection Prevalence : 0.04592
## Balanced Accuracy : 0.63563
##
## 'Positive' Class : yes
##
Key performance metrics: Experiment 2A
The third model is correctly predicting outcomes about 90.5% of the time.
The Mcnemar’s Test P-Value tests if the types of errors (FP vs FN) are symmetrical. The low p-value of < 2.2e-16 suggests that errors are not balanced and the model is biased.
The Sensitivity of 0.28604 indicates that it is not good at detecting “yes” cases.
The Specificity of 0.98521 indicates that it is very good at getting “no” right.
The Pos Pred Value (Precision) of 0.71484 indicates relatively few false alarms when it says “yes”.
The Neg Pred Value of 0.91413 indicates that the model is good at predicting “no” when it’s actually “no”.
The Balanced Accuracy of 0.63563 indicates that the model is not balanced across both classes and is biased.
Comments:
Consistent with the Decision Tree, the unbalanced data gives a high accuracy rate, which is misleading due to the high proportion of negatives in the majority class. The very low Sensitivity indicates that the model misses the majority of actual positives and has many false negatives. The high Specificity indicates the model detects the negative class well. The high Precision means that when the model predicts “yes,” it is usually correct. The relatively low Balanced Accuracy shows poor performance across the two classes.
# Libraries to visualize the top 10 important variables in the Random Forest.
library(ggplot2)
library(caret)
# Extract importance and convert to data frame.
imp <- varImp(rf_mod1)$importance
imp$Variable <- rownames(imp)
# Sort by importance (ascending), then get top 10 features.
top_imp <- imp[order(imp$Overall, decreasing = FALSE), ]
top10 <- tail(top_imp, 10)
# Plot using ggplot2.
ggplot(top10, aes(x = Overall, y = reorder(Variable, Overall))) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Top 10 Most Important Random Forest Variables", x = "Importance", y = "Variable") +
theme_minimal(base_size = 12) +
theme(axis.text.y = element_text(size = 10))
Comments:
Duration is consistently the dominant important feature, as shown in both the Decision Tree and this first Random Forest experiment with unbalanced training data.
In the second Random Forest experiment, I apply the SMOTE function to balance the classes in the training data and compare the Random Forest output and metrics. As a starting point, I carried over the same train/test split, in which the classes are unbalanced.
library(DMwR)
set.seed(1237)
# Create a table showing the percentage balances using the previously established "a2_bankdata_train_smote" data.
round(prop.table(table(select(a2_bankdata_train_smote, y), exclude = NULL)), 4) * 100
## y
## no yes
## 50 50
The training classes are now balanced.
Per Nwanganga & Chapple (2020) and the visualization code obtained from ChatGPT, I reproduced a new Random Forest model two.
library(randomForest)
library(caret)
# From Nwanganga & Chapple (2020), I used the available code to create a Random Forest.
set.seed(1238)
rf_mod2 <- train(
y ~ .,
data = a2_bankdata_train_smote,
metric = "Accuracy",
method = "rf",
trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 4)
)
# Create predictions to model a Confusion Matrix.
rf_pred2 <- predict(rf_mod2, a2_bankdata_test)
# The rf_pred2 and a2_bankdata_test$y objects need to be converted to factors, as required by confusionMatrix().
rf_pred2 <- as.factor(rf_pred2)
a2_bankdata_test$y <- as.factor(a2_bankdata_test$y)
# Create Confusion Matrix.
confusionMatrix(rf_pred2, a2_bankdata_test$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 8447 185
## yes 1559 1112
##
## Accuracy : 0.8457
## 95% CI : (0.8389, 0.8523)
## No Information Rate : 0.8853
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4802
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.85736
## Specificity : 0.84419
## Pos Pred Value : 0.41632
## Neg Pred Value : 0.97857
## Prevalence : 0.11475
## Detection Rate : 0.09838
## Detection Prevalence : 0.23631
## Balanced Accuracy : 0.85078
##
## 'Positive' Class : yes
##
Key performance metrics: Experiment 2B
The balanced Random Forest model is correctly predicting outcomes about 84.57% of the time, similar to the balanced Decision Tree model at 84.35%.
The Mcnemar’s Test P-Value tests if the types of errors (FP vs FN) are symmetrical. The low p-value of < 2e-16 suggests that errors are not balanced and the model is still biased.
The Sensitivity of 0.85736 indicates that it is much better at detecting “yes” cases.
The Specificity of 0.84419 indicates that it is good at getting “no” right.
The Pos Pred Value (Precision) of 0.41632 indicates significant false alarms when it says “yes”.
The Neg Pred Value of 0.97857 indicates that the model is good at predicting “no” when it’s actually “no”.
The Balanced Accuracy of 0.85078 indicates that the model is fairly balanced across both classes.
Comments:
Consistent with the Decision Tree, the balanced data has a lower accuracy rate, which more accurately reflects potential performance. The higher Sensitivity indicates that the model can detect the majority of actual positives. The lower Specificity indicates the model produces more false positives. The Precision decreased but recall improved, which is expected. The Balanced Accuracy significantly improved, showing the model performs well on both classes.
Overall, the balanced data set is an improvement, supporting H1.
# Libraries to visualize the top 10 important variables in the Random Forest.
library(ggplot2)
library(caret)
# Extract importance and convert to data frame.
imp <- varImp(rf_mod2)$importance
imp$Variable <- rownames(imp)
# Sort by importance (ascending), then get top 10 features.
top_imp <- imp[order(imp$Overall, decreasing = FALSE), ]
top10 <- tail(top_imp, 10)
# Plot using ggplot2.
ggplot(top10, aes(x = Overall, y = reorder(Variable, Overall))) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Top 10 Most Important Random Forest Variables", x = "Importance", y = "Variable") +
theme_minimal(base_size = 12) +
theme(axis.text.y = element_text(size = 10))
Comments:
Duration remains the dominant important feature, as shown in the Decision Tree and the first Random Forest experiment with unbalanced training data. However, the second and third most important features change to the customer’s balance and age. Interestingly, these did not show up in the Decision Tree with balanced data. I would assume these features rose in importance because they may have been overlooked in the unbalanced data, and I would speculate that the difference in the most important features between a Decision Tree and a Random Forest may have to do with how the Random Forest samples the data and features across trees.
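To make this comparison concrete, the varImp() scores of the two forests can be joined side by side; a minimal sketch (because the formula interface was used, the row names are the dummy-coded feature levels):
# Join the unbalanced (rf_mod1) and SMOTE-balanced (rf_mod2) importance scores by variable.
imp_unbal <- varImp(rf_mod1)$importance %>%
  rownames_to_column("Variable") %>%
  rename(Unbalanced = Overall)
imp_bal <- varImp(rf_mod2)$importance %>%
  rownames_to_column("Variable") %>%
  rename(Balanced = Overall)
full_join(imp_unbal, imp_bal, by = "Variable") %>%
  arrange(desc(Balanced)) %>%
  head(10)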
In the first Adaboost experiment, I use the unbalanced training data and examine the Adaboost output and metrics. As a starting point, I carried over the same train/test split; the classes remain unbalanced in the training and testing data sets.
# Load the Ada library.
library(ada)
# Train an AdaBoost model to predict y on unbalanced data, labeled "ada".
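# As reflected in the summary() output below, ada() defaults to discrete AdaBoost with
# exponential loss and 50 boosting iterations, using rpart classification trees as the weak learners.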
ada <- ada(y ~., data=a2_bankdata_train)
summary(ada)
## Call:
## ada(y ~ ., data = a2_bankdata_train)
##
## Loss: exponential Method: discrete Iteration: 50
##
## Training Results
##
## Accuracy: 0.907 Kappa: 0.457
# Take the trained ada model to predict outcomes based on the test data set a2_bankdata_test, and save the predictions into the object ada_mod1.
ada_mod1 <- predict(ada, newdata = a2_bankdata_test)
# Create a Confusion Matrix.
confusionMatrix(ada_mod1, a2_bankdata_test$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 9743 803
## yes 263 494
##
## Accuracy : 0.9057
## 95% CI : (0.9002, 0.911)
## No Information Rate : 0.8853
## P-Value [Acc > NIR] : 1.384e-12
##
## Kappa : 0.4331
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.38088
## Specificity : 0.97372
## Pos Pred Value : 0.65258
## Neg Pred Value : 0.92386
## Prevalence : 0.11475
## Detection Rate : 0.04371
## Detection Prevalence : 0.06697
## Balanced Accuracy : 0.67730
##
## 'Positive' Class : yes
##
Key performance metrics: Experiment 3A
The Adaboost model on unbalanced data is correctly predicting outcomes about 90.57% of the time.
The Mcnemar’s Test P-Value tests if the types of errors (FP vs FN) are symmetrical. The low p-value of < 2.2e-16 suggests that errors are not balanced and the model is biased.
The Sensitivity of 0.38088 indicates that it is not good at detecting “yes” cases.
The Specificity of 0.97372 indicates that it is very good at getting “no” right.
The Pos Pred Value (Precision) of 0.65258 indicates a moderate number of false alarms when it says “yes”.
The Neg Pred Value of 0.92386 indicates that the model is good at predicting “no” when it’s actually “no”.
The Balanced Accuracy of 0.67730 indicates that the model is not well balanced across both classes.
Comments:
Consistent with the Decision Tree, and Random Forest, the unbalanced data has higher accuracy rates, which may be misleading. This model is good at getting the “no” responses right, but poor at getting the “yes” responses correct. The Balanced Accuracy also shows the unbalanced data problem.
In the second Adaboost experiment, I apply the SMOTE function to balance the classes in the training data and compare the Adaboost output and metrics. As a starting point, I carried over the same train/test split, in which the classes are unbalanced.
library(DMwR)
set.seed(1229)
# Create a table showing the percentage balances using the previously established "a2_bankdata_train_smote" data.
round(prop.table(table(select(a2_bankdata_train_smote, y), exclude = NULL)), 4) * 100
## y
## no yes
## 50 50
The training classes are now balanced.
library(ada)
set.seed(1230)
# Train an AdaBoost model to predict y on balanced data, labeled "ada".
ada <- ada(y ~., data=a2_bankdata_train_smote)
summary(ada)
## Call:
## ada(y ~ ., data = a2_bankdata_train_smote)
##
## Loss: exponential Method: discrete Iteration: 50
##
## Training Results
##
## Accuracy: 0.869 Kappa: 0.737
ada_mod2 <- predict(ada, newdata = a2_bankdata_test)
confusionMatrix(ada_mod2, a2_bankdata_test$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 8708 259
## yes 1298 1038
##
## Accuracy : 0.8622
## 95% CI : (0.8558, 0.8686)
## No Information Rate : 0.8853
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4972
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.80031
## Specificity : 0.87028
## Pos Pred Value : 0.44435
## Neg Pred Value : 0.97112
## Prevalence : 0.11475
## Detection Rate : 0.09183
## Detection Prevalence : 0.20667
## Balanced Accuracy : 0.83529
##
## 'Positive' Class : yes
##
Key performance metrics: Experiment 3B
The Adaboost model on balanced data is correctly predicting outcomes about 86.22% of the time, which is lower than the unbalanced data model (90.57%).
The Mcnemar’s Test P-Value tests if the types of errors (FP vs FN) are symmetrical. The low p-value of < 2e-16 suggests that errors are not balanced and the model is still biased.
The Sensitivity of 0.80031 indicates that it is much better at detecting “yes” cases than the unbalanced data model (0.38088).
The Specificity of 0.87028 indicates that it is not as good at getting “no” right as the unbalanced data model (0.97372).
The Pos Pred Value (Precision) of 0.44435 indicates significantly more false alarms when it says “yes” compared to the unbalanced data model (0.65258).
The Neg Pred Value of 0.97112 indicates that the model is slightly better at predicting “no” when it’s actually “no” compared to the unbalanced data model (0.92386).
The Balanced Accuracy of 0.83529 indicates that the model is much more balanced across both classes compared to the unbalanced data model (0.67730).
Comments:
Consistent with the Decision Tree and Random Forest, the Adaboost model with the balanced data has a lower, but still fairly high, accuracy rate. The higher Sensitivity indicates that the model detects the majority of actual positives. The lower Specificity indicates the model produces more false positives. The Precision decreased but recall improved, which is expected and consistent with the Random Forest. The Balanced Accuracy also significantly improved, showing the model performs well on both classes, consistent with the Decision Tree and Random Forest.
Overall, the balanced data set is an improvement, supporting H1.
The following code generates a styled HTML table summarizing key performance metrics for the six models (three algorithms, each trained on unbalanced and SMOTE-balanced data).
library(knitr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
# Create a summary data frame with the algorithms and metrics (with some coding assistance from ChatGPT).
confusion_summary <- data.frame(
Metrics = c("Accuracy", "Mcnemar's P-Value", "Sensitivity", "Specificity",
"Pos Pred Value", "Neg Pred Value", "Balanced Accuracy"),
`DT (unbalanced)` = c(0.9035, "<2.2e-16", 0.42714, 0.96522, 0.61419, 0.92856, 0.69618),
`DT (Balanced)` = c(0.8435, "<2e-16", 0.73323, 0.85779, 0.40059, 0.96125, 0.79551),
`RF (unbalanced)` = c(0.9050, "<2.2e-16", 0.28604, 0.98521, 0.71484, 0.91413, 0.63563),
`RF (Balanced)` = c(0.8457, "<2e-16", 0.85736, 0.84419, 0.41632, 0.97857, 0.85078),
`Ada (unbalanced)` = c(0.9057, "<2.2e-16", 0.38088, 0.97372, 0.65258, 0.92386, 0.67730),
`Ada (Balanced)` = c(0.8622, "<2e-16", 0.80031, 0.87028, 0.44435, 0.97112, 0.83529),
check.names = FALSE
)
# Create an HTML table with kableExtra.
confusion_summary %>%
kbl(caption = "Confusion Matrices Metrics Summary Across Models",
align = "lcccccc",
format = "html",
escape = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "center") %>%
scroll_box(height = "auto", width = "100%")
| Metrics | DT (unbalanced) | DT (Balanced) | RF (unbalanced) | RF (Balanced) | Ada (unbalanced) | Ada (Balanced) |
|---|---|---|---|---|---|---|
| Accuracy | 0.9035 | 0.8435 | 0.905 | 0.8457 | 0.9057 | 0.8622 |
| Mcnemar’s P-Value | <2.2e-16 | <2e-16 | <2.2e-16 | <2e-16 | <2.2e-16 | <2e-16 |
| Sensitivity | 0.42714 | 0.73323 | 0.28604 | 0.85736 | 0.38088 | 0.80031 |
| Specificity | 0.96522 | 0.85779 | 0.98521 | 0.84419 | 0.97372 | 0.87028 |
| Pos Pred Value | 0.61419 | 0.40059 | 0.71484 | 0.41632 | 0.65258 | 0.44435 |
| Neg Pred Value | 0.92856 | 0.96125 | 0.91413 | 0.97857 | 0.92386 | 0.97112 |
| Balanced Accuracy | 0.69618 | 0.79551 | 0.63563 | 0.85078 | 0.67730 | 0.83529 |
The following code creates a graphic that can help a data scientist determine the best model to select when taking the costs of telemarketing into account. I obtained the framework code from ChatGPT and modified it (some of the code to set up model_metrics and convert it to long format for plotting with ggplot2 was new to me).
library(tidyverse)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(ggplot2)
# Step 1: Create your data
model_metrics <- tribble(
~Model, ~Balance, ~Sensitivity, ~Specificity, ~PPV, ~Balanced_Accuracy,
"DT", "Unbalanced", 0.42714, 0.96522, 0.61419, 0.69618,
"DT", "Balanced", 0.73323, 0.85779, 0.40059, 0.79551,
"RF", "Unbalanced", 0.28604, 0.98521, 0.71484, 0.63563,
"RF", "Balanced", 0.85736, 0.84419, 0.41632, 0.85078,
"Ada", "Unbalanced", 0.39167, 0.97342, 0.65633, 0.68254,
"Ada", "Balanced", 0.79183, 0.87068, 0.44248, 0.83125
)
# Step 2: Add label for telemarketing cost scenario
model_metrics <- model_metrics %>%
mutate(Scenario_Label = case_when(
Model == "RF" & Balance == "Balanced" ~ "Best for Cheap Calls",
Model == "Ada" & Balance == "Unbalanced" ~ "Best for Expensive Calls",
TRUE ~ "Other"
),
Model_Label = paste(Model, Balance, sep = "_"))
# Step 3: Melt the data to long format
long_data <- melt(model_metrics,
id.vars = c("Model", "Balance", "Model_Label", "Scenario_Label"),
measure.vars = c("Sensitivity", "Specificity", "PPV", "Balanced_Accuracy"))
# Step 4: Plot with color-coded and annotated scenario labels
ggplot(long_data, aes(x = Model_Label, y = value, fill = Scenario_Label)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ variable, scales = "free_y") +
scale_fill_manual(values = c(
"Best for Cheap Calls" = "darkgreen",
"Best for Expensive Calls" = "darkred",
"Other" = "grey70"
)) +
labs(
title = "Model Evaluation by Scenario",
subtitle = "Highlighting models based on telemarketing cost strategy",
x = "Model (Balance)",
y = "Metric Value",
fill = "Scenario Fit"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Abhishek, K., & Abdelaziz, M. (2023). Machine Learning for unbalanced Data : Tackle unbalanced Datasets Using Machine Learning and Deep Learning Techniques (First edition. ed.). Packt Publishing Ltd.
Freund, Y., & Schapire, R. E. (1997). A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of computer and system sciences, 55(1), 119-139. https://doi.org/10.1006/jcss.1997.1504
Haibo, H., & Garcia, E. A. (2009). Learning from unbalanced Data. IEEE transactions on knowledge and data engineering, 21(9), 1263-1284. https://doi.org/10.1109/TKDE.2008.239
Kanwal, S., Abdullah, M., Kumar, S., Arshad, S., Shahroz, M., Zhang, D., & Kumar, D. (2024). An Optimal Internet of Things-Driven Intelligent Decision-Making System for Real-Time Fishpond Water Quality Monitoring and Species Survival. Sensors, 24. https://doi.org/10.3390/s24237842
Nwanganga, F., & Chapple, M. (2020). Practical Machine Learning in R (1st edition ed.). John Wiley & Sons. https://doi.org/https://doi.org/10.1002/9781119591542