Assignment 2
library(caret)
library(rpart)
library(knitr)
library(randomForest)
library(adabag)
library(ada)
library(dplyr)
library(ggplot2)
library(pROC)
library(reshape2)
library(fmsb)
library(patchwork)
library(ROCR)
library(rpart.plot)
library(gains)
library(formattable)
# Load Data
url <- "https://raw.githubusercontent.com/NikoletaEm/datasps/refs/heads/main/bank-additional-full.csv"
bank <- read.csv(url, sep = ";")
# Drop 'duration' (leaks information)
bank <- bank %>% dplyr::select(-duration)
# Convert categorical variables to factors
categorical_vars <- c("job", "marital", "education", "contact", "day_of_week",
"month", "pdays", "previous", "housing", "loan", "default", "y")
bank[categorical_vars] <- lapply(bank[categorical_vars], factor)
# Remove 'default' column
bank$default <- NULL
# Handle "unknown" values in categorical columns by setting them to NA
bank$housing <- replace(bank$housing, bank$housing == "unknown", NA)
bank$loan <- replace(bank$loan, bank$loan == "unknown", NA)
# Impute missing values using the most frequent value (mode)
housing_mode <- names(which.max(table(bank$housing)))
loan_mode <- names(which.max(table(bank$loan)))
bank$housing[is.na(bank$housing)] <- housing_mode
bank$loan[is.na(bank$loan)] <- loan_mode
summary(bank)
## age job marital
## Min. :17.00 admin. :10422 divorced: 4612
## 1st Qu.:32.00 blue-collar: 9254 married :24928
## Median :38.00 technician : 6743 single :11568
## Mean :40.02 services : 3969 unknown : 80
## 3rd Qu.:47.00 management : 2924
## Max. :98.00 retired : 1720
## (Other) : 6156
## education housing loan contact
## university.degree :12168 no :18622 no :34940 cellular :26144
## high.school : 9515 unknown: 0 unknown: 0 telephone:15044
## basic.9y : 6045 yes :22566 yes : 6248
## professional.course: 5243
## basic.4y : 4176
## basic.6y : 2292
## (Other) : 1749
## month day_of_week campaign pdays previous
## may :13769 fri:7827 Min. : 1.000 999 :39673 0 :35563
## jul : 7174 mon:8514 1st Qu.: 1.000 3 : 439 1 : 4561
## aug : 6178 thu:8623 Median : 2.000 6 : 412 2 : 754
## jun : 5318 tue:8090 Mean : 2.568 4 : 118 3 : 216
## nov : 4101 wed:8134 3rd Qu.: 3.000 9 : 64 4 : 70
## apr : 2632 Max. :56.000 2 : 61 5 : 18
## (Other): 2016 (Other): 421 (Other): 6
## poutcome emp.var.rate cons.price.idx cons.conf.idx
## Length:41188 Min. :-3.40000 Min. :92.20 Min. :-50.8
## Class :character 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7
## Mode :character Median : 1.10000 Median :93.75 Median :-41.8
## Mean : 0.08189 Mean :93.58 Mean :-40.5
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4
## Max. : 1.40000 Max. :94.77 Max. :-26.9
##
## euribor3m nr.employed y
## Min. :0.634 Min. :4964 no :36548
## 1st Qu.:1.344 1st Qu.:5099 yes: 4640
## Median :4.857 Median :5191
## Mean :3.621 Mean :5167
## 3rd Qu.:4.961 3rd Qu.:5228
## Max. :5.045 Max. :5228
##
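One optional cleanup worth noting: because the "unknown" entries in housing and loan were replaced rather than removed as factor levels, the summary above still lists an empty unknown level (count 0) for both columns. A minimal sketch, assuming the imputation above has already run, that drops the now-empty levels:
# Optional sketch: drop the empty "unknown" level left behind after imputation
bank$housing <- droplevels(bank$housing)
bank$loan <- droplevels(bank$loan)
levels(bank$housing) # expected: "no" "yes"
levels(bank$loan)    # expected: "no" "yes"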
# Split the data into training and testing sets (80-20 split)
set.seed(123) # for reproducibility
trainIndex <- createDataPartition(bank$y, p = 0.8, list = FALSE)
train_data <- bank[trainIndex, ]
test_data <- bank[-trainIndex, ]
The objective is to find the best algorithm for predicting whether a client will subscribe to a term deposit, comparing Decision Trees, Random Forest, and AdaBoost. The working hypotheses are that a Decision Tree may perform well but could overfit without tuning, that Random Forest should improve performance by reducing overfitting through ensemble learning, and that AdaBoost may enhance performance by focusing on misclassified instances, especially with deeper base trees.
# Confirm the split
cat("Training Data Size:", nrow(train_data), "\n")
## Training Data Size: 32951
cat("Testing Data Size:", nrow(test_data), "\n")
## Testing Data Size: 8237
# Check class distribution in both sets
table(train_data$y) / nrow(train_data)
##
## no yes
## 0.8873479 0.1126521
table(test_data$y) / nrow(test_data)
##
## no yes
## 0.8873376 0.1126624
As for the choice of evaluation metrics:
Accuracy: Measures the overall correctness of predictions. In the context of predicting term deposit subscriptions, accuracy gives a general idea of how well the model is performing. However, since our data is imbalanced (fewer “yes” cases), accuracy alone isn’t enough to gauge performance as the model could predict “no” for most cases and still achieve high accuracy.
Precision: Indicates the proportion of predicted “yes” cases that were actually correct. High precision means fewer false positives, which is valuable from a cost-efficiency perspective. Contacting non-interested clients is costly, so a model with high precision helps the bank focus its marketing efforts on clients most likely to subscribe, optimizing resource allocation.
Recall: Reflects the model’s ability to correctly identify clients who will subscribe (true positives). A low recall means the model is missing many potential clients, leading to lost opportunities. For the bank, improving recall would help capture more interested clients, increasing conversion rates.
F1 Score: Balances precision and recall, making it a useful metric when both false positives and false negatives carry costs. In this context, the F1 Score helps balance the trade-off between avoiding unnecessary outreach and maximizing the number of actual subscribers identified.
ROC-AUC: Measures the model’s ability to distinguish between classes across different threshold values. A higher AUC indicates better discriminatory power. For the bank, a higher AUC suggests the model can effectively prioritize clients, helping target the right audience with marketing campaigns.
In conclusion, I’ll use the F1 Score to balance precision and recall, and ROC-AUC to measure how well the model distinguishes between clients who will subscribe and those who won’t. This gives a clear view of performance with imbalanced data.
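For reference, a minimal sketch of how these two headline metrics can be pulled together from a caret confusionMatrix object and a vector of predicted "yes" probabilities (as produced for each model later in this report); summarise_metrics is an illustrative helper, not a function from any package:
# Sketch: collect F1 and ROC-AUC for a fitted model
# cm    : a caret::confusionMatrix object with positive = "yes"
# probs : predicted probabilities of the "yes" class
# truth : the factor of true labels (levels "no", "yes")
summarise_metrics <- function(cm, probs, truth) {
  roc_obj <- pROC::roc(response = truth, predictor = probs, levels = c("no", "yes"))
  c(F1 = unname(cm$byClass["F1"]), AUC_ROC = as.numeric(pROC::auc(roc_obj)))
}
# Example use (after a model has been evaluated):
# summarise_metrics(conf_matrix_df, yes_probabilities, test_data$y)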
Decision Tree Experiment 1 (Baseline Decision Tree): Default Hyperparameters. We hypothesize that a Decision Tree with default hyperparameters will serve as a baseline, helping us understand how well the model performs without tuning. The goal is to measure its predictive ability and use it as a reference for later experiments. Variation: None (baseline model).
# Decision Tree Metrics
dt_results <- data.frame(
Experiment = c("Default Decision Tree", "D.T:Max Depth = 5", "D.T:Pruned Tree"),
Accuracy = NA,
Precision = NA,
Recall = NA,
F1_Score = NA,
AUC_ROC = NA
)
tc <- trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = twoClassSummary)
set.seed(123)
dt_model <- train(y ~ .,
data = train_data,
method = "rpart",
trControl = tc,
metric = "ROC",
weights = ifelse(train_data$y == "yes", 1.5, 1))
# Weights help handle class imbalance
# "yes" (subscribed) cases get more weight (1.5)
# "no" (not subscribed) cases get normal weight (1)
# This helps the model focus on predicting "yes" better
# Baseline Decision Tree Plot
rpart.plot(dt_model$finalModel,
type = 2,
extra = 104,
under = TRUE,
box.palette = "Blues",
branch.lty = 3,
shadow.col = "gray",
main = "Baseline Decision Tree")
The Baseline Decision Tree is more complex, capturing more splits and deeper interactions, which likely contributes to its higher variance (i.e., a greater risk of overfitting).
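As a quick check on this, the cross-validation summary that caret stores alongside the fitted tree can be inspected directly; this sketch only reads fields that train() already returns:
# Sketch: cross-validated ROC for each cp value caret tried, and the cp it kept
print(dt_model$results)  # cp, ROC, Sens, Spec per candidate complexity parameter
print(dt_model$bestTune) # cp used in dt_model$finalModel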
tree_pred_df <- predict(dt_model, test_data, type = "raw") # Class predictions
tree_pred_prob <- predict(dt_model, test_data, type = "prob") # Probability predictions
# Confusion Matrix
conf_matrix_df <- confusionMatrix(tree_pred_df, test_data$y, positive = "yes")
print(conf_matrix_df)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7077 613
## yes 232 315
##
## Accuracy : 0.8974
## 95% CI : (0.8907, 0.9039)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.001812
##
## Kappa : 0.3749
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.33944
## Specificity : 0.96826
## Pos Pred Value : 0.57587
## Neg Pred Value : 0.92029
## Prevalence : 0.11266
## Detection Rate : 0.03824
## Detection Prevalence : 0.06641
## Balanced Accuracy : 0.65385
##
## 'Positive' Class : yes
##
yes_probabilities <- tree_pred_prob[, "yes"]
test_y_numeric <- ifelse(test_data$y == "yes", 1, 0)
pred <- prediction(yes_probabilities, test_y_numeric)
auc <- performance(pred, "auc")
auc_value <- auc@y.values[[1]]
dt_results[1, "Accuracy"] <- conf_matrix_df$overall["Accuracy"]
dt_results[1, "Precision"] <- conf_matrix_df$byClass["Precision"]
dt_results[1, "Recall"] <- conf_matrix_df$byClass["Recall"]
dt_results[1, "F1_Score"] <- conf_matrix_df$byClass["F1"]
dt_results[1, "AUC_ROC"] <- auc_value
print(dt_results)
## Experiment Accuracy Precision Recall F1_Score AUC_ROC
## 1 Default Decision Tree 0.8974141 0.5758684 0.3394397 0.4271186 0.7390987
## 2 D.T:Max Depth = 5 NA NA NA NA NA
## 3 D.T:Pruned Tree NA NA NA NA NA
Results:
Conclusion: The baseline Decision Tree provides a solid foundation, but its recall is relatively low, meaning it struggles to capture the “yes” cases accurately.
Recommendation: Introduce constraints such as max depth to control overfitting.
Experiment 2: Decision Tree with Max Depth = 5. The hypothesis is that reducing the depth of the decision tree (by setting maxdepth = 5) will affect the model's performance in predicting whether a client will subscribe to a term deposit (variable y). Specifically, the model may generalize better, potentially improving its ability to classify unseen data, but it could also lose accuracy compared to the baseline model.
Variation: Setting maxdepth = 5, which limits the complexity of the model.
set.seed(123)
tree_model_1 <- rpart(y ~ ., data = train_data, method = "class", control = rpart.control(maxdepth = 5))
# Prediction
tree_pred_1 <- predict(tree_model_1, test_data, type = "class")
tree_pred_prob_1 <- predict(tree_model_1, test_data, type = "prob")
# Evaluation
conf_matrix_1 <- confusionMatrix(tree_pred_1, test_data$y, positive = "yes")
print(conf_matrix_1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7252 761
## yes 57 167
##
## Accuracy : 0.9007
## 95% CI : (0.894, 0.9071)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 5.214e-05
##
## Kappa : 0.2574
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.17996
## Specificity : 0.99220
## Pos Pred Value : 0.74554
## Neg Pred Value : 0.90503
## Prevalence : 0.11266
## Detection Rate : 0.02027
## Detection Prevalence : 0.02719
## Balanced Accuracy : 0.58608
##
## 'Positive' Class : yes
##
yes_probabilities_1 <- tree_pred_prob_1[, "yes"]
pred_2 <- prediction(yes_probabilities_1, test_y_numeric)
auc_2 <- performance(pred_2, "auc")
auc_value_2 <- auc_2@y.values[[1]]
# Store results for Experiment 2
dt_results[2, "Accuracy"] <- conf_matrix_1$overall["Accuracy"]
dt_results[2, "Precision"] <- conf_matrix_1$byClass["Precision"]
dt_results[2, "Recall"] <- conf_matrix_1$byClass["Recall"]
dt_results[2, "F1_Score"] <- conf_matrix_1$byClass["F1"]
dt_results[2, "AUC_ROC"] <- auc_value_2
print(dt_results)
## Experiment Accuracy Precision Recall F1_Score AUC_ROC
## 1 Default Decision Tree 0.8974141 0.5758684 0.3394397 0.4271186 0.7390987
## 2 D.T:Max Depth = 5 0.9006920 0.7455357 0.1799569 0.2899306 0.6989344
## 3 D.T:Pruned Tree NA NA NA NA NA
rpart.plot(tree_model_1,
type = 2,
extra = 104,
under = TRUE,
box.palette = "Greens",
branch.lty = 3,
shadow.col = "gray",
main = "Decision Tree (Max Depth = 5)")
The Decision Tree with Max Depth = 5 above is much simpler than the baseline, limiting the number of branches. This reduces overfitting but might increase bias, as it may miss some nuanced relationships in the data.
Results:
Conclusion: Reducing the depth has increased precision but significantly lowered recall. The model is making fewer false positives but is struggling to capture the positive cases.
Recommendation: Further tuning is needed, possibly by adjusting minsplit or experimenting with pruning strategies.
Experiment 3: Decision Tree with Pruning & Min Split = 50. We hypothesize that combining pruning with minsplit = 50 will result in a more generalized tree that avoids overfitting while ensuring each split has sufficient data. This approach should balance precision and recall while improving AUC-ROC. Variation: Grow a full tree with cp = 0 and minsplit = 50, then prune it back at the cp value that minimizes the cross-validated error.
set.seed(123)
#Train Decision Tree with minsplit = 50
tree_model_3 <- rpart(y ~ ., data = train_data, method = "class", control = rpart.control(cp = 0, minsplit = 50))
printcp(tree_model_3)
##
## Classification tree:
## rpart(formula = y ~ ., data = train_data, method = "class", control = rpart.control(cp = 0,
## minsplit = 50))
##
## Variables actually used in tree construction:
## [1] age campaign cons.conf.idx contact day_of_week
## [6] education emp.var.rate euribor3m housing job
## [11] loan marital month nr.employed pdays
## [16] poutcome previous
##
## Root node error: 3712/32951 = 0.11265
##
## n= 32951
##
## CP nsplit rel error xerror xstd
## 1 0.05266703 0 1.00000 1.00000 0.015461
## 2 0.00395115 2 0.89467 0.90894 0.014825
## 3 0.00350216 6 0.87796 0.90086 0.014767
## 4 0.00282866 7 0.87446 0.89898 0.014753
## 5 0.00242457 9 0.86880 0.89898 0.014753
## 6 0.00197557 10 0.86638 0.89278 0.014708
## 7 0.00188578 16 0.85318 0.89224 0.014704
## 8 0.00161638 18 0.84941 0.89170 0.014700
## 9 0.00148168 23 0.84133 0.89036 0.014690
## 10 0.00134698 25 0.83836 0.89332 0.014712
## 11 0.00094289 27 0.83567 0.89547 0.014728
## 12 0.00089799 29 0.83378 0.90598 0.014804
## 13 0.00080819 43 0.82085 0.91164 0.014845
## 14 0.00071839 45 0.81923 0.91352 0.014858
## 15 0.00062859 54 0.81277 0.91352 0.014858
## 16 0.00053879 57 0.81088 0.92107 0.014913
## 17 0.00044899 70 0.80172 0.92107 0.014913
## 18 0.00035920 73 0.80038 0.92565 0.014945
## 19 0.00026940 84 0.79580 0.92780 0.014961
## 20 0.00020205 94 0.79310 0.93427 0.015007
## 21 0.00017960 98 0.79230 0.93696 0.015026
## 22 0.00013470 101 0.79176 0.93723 0.015028
## 23 0.00010776 103 0.79149 0.93992 0.015047
## 24 0.00000000 108 0.79095 0.94154 0.015058
optimal_cp <- tree_model_3$cptable[which.min(tree_model_3$cptable[, "xerror"]), "CP"]
pruned_tree_3 <- prune(tree_model_3, cp = optimal_cp)
# Prediction
tree_pred_3 <- predict(pruned_tree_3, test_data, type = "class")
tree_pred_prob_3 <- predict(pruned_tree_3, test_data, type = "prob")
yes_probabilities_3 <- tree_pred_prob_3[, "yes"]
pred_3 <- prediction(yes_probabilities_3, test_y_numeric)
auc_3 <- performance(pred_3, "auc")
auc_value_3 <- auc_3@y.values[[1]]
conf_matrix_3 <- confusionMatrix(tree_pred_3, test_data$y, positive = "yes")
print(conf_matrix_3)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7209 706
## yes 100 222
##
## Accuracy : 0.9021
## 95% CI : (0.8955, 0.9085)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 7.978e-06
##
## Kappa : 0.3155
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.23922
## Specificity : 0.98632
## Pos Pred Value : 0.68944
## Neg Pred Value : 0.91080
## Prevalence : 0.11266
## Detection Rate : 0.02695
## Detection Prevalence : 0.03909
## Balanced Accuracy : 0.61277
##
## 'Positive' Class : yes
##
dt_results[3, "Accuracy"] <- conf_matrix_3$overall["Accuracy"]
dt_results[3, "Precision"] <- conf_matrix_3$byClass["Precision"]
dt_results[3, "Recall"] <- conf_matrix_3$byClass["Recall"]
dt_results[3, "F1_Score"] <- conf_matrix_3$byClass["F1"]
dt_results[3, "AUC_ROC"] <- auc_value_3
print(dt_results)
## Experiment Accuracy Precision Recall F1_Score AUC_ROC
## 1 Default Decision Tree 0.8974141 0.5758684 0.3394397 0.4271186 0.7390987
## 2 D.T:Max Depth = 5 0.9006920 0.7455357 0.1799569 0.2899306 0.6989344
## 3 D.T:Pruned Tree 0.9021488 0.6894410 0.2392241 0.3552000 0.7390662
Results:
Conclusion: The pruned tree shows a slight improvement in accuracy compared to the default decision tree and a significant boost in precision. However, recall remains relatively low, indicating the model’s focus on reducing false positives at the expense of capturing more positive cases.
Recommendation: Further tuning of pruning parameters, like adjusting the minimum samples per leaf, could enhance recall without compromising precision.
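As a possible follow-up (not run or evaluated here), the minimum number of samples per leaf can be constrained through rpart.control's minbucket argument on top of the Experiment 3 settings; a minimal sketch, with minbucket = 20 chosen purely for illustration:
# Sketch: add a minimum-leaf-size constraint to the Experiment 3 setup
# minbucket = 20 is an illustrative, untuned value
set.seed(123)
tree_model_minbucket <- rpart(y ~ ., data = train_data, method = "class",
                              control = rpart.control(cp = 0, minsplit = 50, minbucket = 20))
# Prune at the cross-validated optimum, exactly as in Experiment 3
cp_opt <- tree_model_minbucket$cptable[which.min(tree_model_minbucket$cptable[, "xerror"]), "CP"]
pruned_minbucket <- prune(tree_model_minbucket, cp = cp_opt)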
# Precision-Recall Curve for Experiment 1
pr_curve <- performance(pred, "prec", "rec")
plot(pr_curve, main = "Precision-Recall Curve: Decision Trees", col = "blue", lwd = 2)
# Precision-Recall Curve for Experiment 2
pr_curve_2 <- performance(pred_2, "prec", "rec")
plot(pr_curve_2, main = "Precision-Recall Curve (Exp 2)", col = "green", lwd = 2, add = TRUE)
# Precision-Recall Curve for Experiment 3
pr_curve_3 <- performance(pred_3, "prec", "rec")
plot(pr_curve_3, main = "Precision-Recall Curve (Exp 3)", col = "red", lwd = 2, add = TRUE)
legend("bottomleft", legend = c("Exp 1", "Exp 2", "Exp 3"),
col = c("blue", "green", "red"), lty = 1, lwd = 2)
This plot compares the precision-recall trade-off for three different decision tree models. Since our dataset is imbalanced (fewer “yes” responses), precision-recall curves are useful for evaluating model performance in distinguishing potential term deposit subscribers. The red curve (Exp 3) seems to perform slightly better at lower recall levels, indicating better handling of positive cases with pruning.
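Because all three trees output class probabilities, the operating threshold itself can also be tuned rather than fixed at 0.5; a minimal sketch, using the ROCR prediction object from Experiment 1 to pick the cutoff that maximizes F1:
# Sketch: probability cutoff that maximizes F1 for the baseline tree
f1_perf <- performance(pred, "f")       # F-measure across all cutoffs
cutoffs <- f1_perf@x.values[[1]]
f1_vals <- f1_perf@y.values[[1]]
best_cut <- cutoffs[which.max(f1_vals)] # which.max skips the NaN at extreme cutoffs
best_cut
# Reclassify the test set with the tuned threshold
tuned_pred <- factor(ifelse(yes_probabilities >= best_cut, "yes", "no"), levels = c("no", "yes"))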
# Gain chart
gain_chart <- gains(test_y_numeric, yes_probabilities)
plot(gain_chart, main = "Gain Chart", col = "purple", lwd = 2)
This visualization helps assess the effectiveness of the predictive model by showing how well it ranks clients by their likelihood to subscribe. The Mean Response and Mean Predicted Response lines show how well the model is capturing potential subscribers compared to the actual distribution. The steep decline suggests that the model successfully identifies high-probability subscribers early on.
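To attach a number to what the gain chart shows, a short sketch of the same idea: the share of actual subscribers captured in the top 10% of test clients ranked by the baseline tree's predicted probability:
# Sketch: proportion of all "yes" cases captured in the top decile of predicted probability
lift_df <- data.frame(prob = yes_probabilities, actual = test_y_numeric) %>%
  mutate(decile = ntile(desc(prob), 10)) # decile 1 = highest predicted probability
with(lift_df, sum(actual[decile == 1]) / sum(actual))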
Random Forest
# Random Forest Metrics
rf_results <- data.frame(
Experiment = c("R.F:50 Trees", "R.F:200 Trees", "R.F:mtry = 6"),
Accuracy = NA,
Precision = NA,
Recall = NA,
F1_Score = NA,
AUC_ROC = NA
)
Experiment 1 (Baseline Random Forest): The hypothesis is that using a random forest classifier with 50 trees will provide a reasonable balance between model complexity and predictive performance.
Variation: None (baseline model).
### Experiment 1: Random Forest with 50 Trees ###
set.seed(123)
rf_50 <- randomForest(y ~ ., data = train_data, ntree = 50)
# Predictions
rf_50_pred <- predict(rf_50, test_data) # Class predictions
rf_50_prob <- predict(rf_50, test_data, type = "prob") # Probability predictions
yes_prob_rf_50 <- rf_50_prob[, "yes"]
pred_rf_50 <- prediction(yes_prob_rf_50, test_y_numeric)
auc_rf_50 <- performance(pred_rf_50, "auc")@y.values[[1]]
conf_matrix_rf_50 <- confusionMatrix(rf_50_pred, test_data$y, positive = "yes")
rf_results[1, "Accuracy"] <- conf_matrix_rf_50$overall["Accuracy"]
rf_results[1, "Precision"] <- conf_matrix_rf_50$byClass["Precision"]
rf_results[1, "Recall"] <- conf_matrix_rf_50$byClass["Recall"]
rf_results[1, "F1_Score"] <- conf_matrix_rf_50$byClass["F1"]
rf_results[1, "AUC_ROC"] <- auc_rf_50
print(rf_results)
## Experiment Accuracy Precision Recall F1_Score AUC_ROC
## 1 R.F:50 Trees 0.894986 0.5737705 0.2640086 0.3616236 0.7600701
## 2 R.F:200 Trees NA NA NA NA NA
## 3 R.F:mtry = 6 NA NA NA NA NA
Results:
Conclusion: The baseline model performs reasonably well, with an ROC-AUC of 0.76, providing a benchmark.
Recommendation: Proceed with hyperparameter tuning to improve performance.
Experiment 2: 200 Trees. The objective of this experiment is to evaluate the performance of a Random Forest model with 200 trees on predicting whether a client will subscribe to a term deposit (the target variable y). The hypothesis is that increasing the number of trees from 50 to 200 will enhance the model's predictive power by reducing variance, improving the accuracy and generalization of the model.
Variation: Increased the number of trees (ntree) from 50 to 200.
## Experiment 2: 200 Trees
set.seed(123)
rf_200 <- randomForest(y ~ ., data = train_data, ntree = 200)
# Predictions
rf_200_pred <- predict(rf_200, test_data)
rf_200_prob <- predict(rf_200, test_data, type = "prob")
yes_prob_rf_200 <- rf_200_prob[, "yes"]
pred_rf_200 <- prediction(yes_prob_rf_200, test_y_numeric)
auc_rf_200 <- performance(pred_rf_200, "auc")@y.values[[1]]
conf_matrix_rf_200 <- confusionMatrix(rf_200_pred, test_data$y, positive = "yes")
rf_results[2, "Accuracy"] <- conf_matrix_rf_200$overall["Accuracy"]
rf_results[2, "Precision"] <- conf_matrix_rf_200$byClass["Precision"]
rf_results[2, "Recall"] <- conf_matrix_rf_200$byClass["Recall"]
rf_results[2, "F1_Score"] <- conf_matrix_rf_200$byClass["F1"]
rf_results[2, "AUC_ROC"] <- auc_rf_200
print(rf_results)
## Experiment Accuracy Precision Recall F1_Score AUC_ROC
## 1 R.F:50 Trees 0.8949860 0.5737705 0.2640086 0.3616236 0.7600701
## 2 R.F:200 Trees 0.8959573 0.5819861 0.2715517 0.3703159 0.7656355
## 3 R.F:mtry = 6 NA NA NA NA NA
Results:
Conclusion: Increasing the number of trees improved ROC-AUC from 0.760 to 0.765, indicating reduced variance and better predictive performance.
Recommendation: Increasing trees is beneficial. Further tuning could optimize performance.
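One way to make that tuning systematic, rather than trying single values by hand as in the next experiment, is caret's built-in grid search over mtry; a sketch reusing the 5-fold trainControl object tc from the Decision Tree section (the grid values are illustrative, and refitting 200-tree forests several times is slow on ~33,000 rows):
# Sketch: cross-validated grid search over mtry (method = "rf" passes ntree through to randomForest)
set.seed(123)
rf_grid <- train(y ~ ., data = train_data,
                 method = "rf",
                 trControl = tc,
                 metric = "ROC",
                 tuneGrid = expand.grid(mtry = c(2, 4, 6, 8)),
                 ntree = 200)
print(rf_grid$bestTune) # mtry with the highest cross-validated ROC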
Experiment 3 (Tuning mtry, i.e., max features): mtry = 6. The hypothesis is that tuning the mtry parameter (which controls the number of features considered at each split) will change the model's predictive performance. The number of features tried at each split governs how correlated the individual trees are: fewer features decorrelate the trees and guard against overfitting, while more features let each tree rely on the strongest predictors.
Variation: Set mtry to 6 (the default for classification here is floor(sqrt(18)) = 4).
# Experiment 3 (Tuning mtry): mtry = 6
set.seed(123)
rf_mtry6 <- randomForest(y ~ ., data = train_data, ntree = 200, mtry = 6)
# Predictions
rf_mtry6_pred <- predict(rf_mtry6, test_data)
rf_mtry6_prob <- predict(rf_mtry6, test_data, type = "prob")
yes_prob_rf_mtry6 <- rf_mtry6_prob[, "yes"]
pred_rf_mtry6 <- prediction(yes_prob_rf_mtry6, test_y_numeric)
auc_rf_mtry6 <- performance(pred_rf_mtry6, "auc")@y.values[[1]]
conf_matrix_rf_mtry6 <- confusionMatrix(rf_mtry6_pred, test_data$y, positive = "yes")
# Store results in rf_results
rf_results[3, "Accuracy"] <- conf_matrix_rf_mtry6$overall["Accuracy"]
rf_results[3, "Precision"] <- conf_matrix_rf_mtry6$byClass["Precision"]
rf_results[3, "Recall"] <- conf_matrix_rf_mtry6$byClass["Recall"]
rf_results[3, "F1_Score"] <- conf_matrix_rf_mtry6$byClass["F1"]
rf_results[3, "AUC_ROC"] <- auc_rf_mtry6
print(rf_results)
## Experiment Accuracy Precision Recall F1_Score AUC_ROC
## 1 R.F:50 Trees 0.8949860 0.5737705 0.2640086 0.3616236 0.7600701
## 2 R.F:200 Trees 0.8959573 0.5819861 0.2715517 0.3703159 0.7656355
## 3 R.F:mtry = 6 0.8931650 0.5502092 0.2834052 0.3741110 0.7568994
Results:
Conclusion: Setting mtry = 6 slightly reduced performance, with ROC-AUC dropping to 0.7569 compared to 0.7656 for 200 trees at the default mtry, although recall rose slightly.
Recommendation: Using 200 trees with the default mtry provides the best balance of recall (0.2716), F1 score (0.3703), and ROC-AUC (0.7656).
radar_data_rf <- rbind(
rep(1, 5),
rep(0, 5),
rf_results[,-1]
)
rownames(radar_data_rf) <- c("Max", "Min", "50 Trees", "200 Trees", "mtry = 6")
# Radar Plot
radarchart(radar_data_rf,
axistype = 1,
pcol = c("blue", "green", "red"),
plwd = 2,
plty = 1,
cglcol = "grey",
cglty = 1,
axislabcol = "grey",
vlcex = 0.8,
title = "Random Forest Models Performance Comparison")
legend("topright", legend = c("50 Trees", "200 Trees", "mtry = 6"),
col = c("blue", "green", "red"), lty = 1, lwd = 2)
As we can see from the radar plot above Accuracy and AUC-ROC are high across all models, while Precision and Recall show trade-offs. The higher recall for some models suggests better identification of potential subscribers, while the slight drop in precision might indicate more false positives.
Conclusion Increasing the number of trees from 50 to 200 results in only a slight improvement across metrics. Recall increases from 0.2640 to 0.2716, F1 score from 0.3616 to 0.3703, and ROC-AUC from 0.7601 to 0.7656, showing minimal gains. Tuning mtry to 6 further improves recall to 0.2834, but at the cost of lower precision (0.5502) and a slight decrease in ROC-AUC (0.7569). This trade-off may be beneficial if recall is prioritized over precision.
Overall, all Random Forest models perform similarly, with only minor variations. The best choice depends on the emphasis placed on recall versus precision.
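If recall is the priority, one option worth sketching (not run here) is to rebalance what each tree sees during training via randomForest's strata and sampsize arguments; the per-class sample sizes below are illustrative, not tuned:
# Sketch: stratified per-tree sampling so each bootstrap sample is more balanced
# sampsize is given in the order of levels(train_data$y), i.e. "no" then "yes"
set.seed(123)
rf_balanced <- randomForest(y ~ ., data = train_data,
                            ntree = 200,
                            strata = train_data$y,
                            sampsize = c(4000, 3000))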
AdaBoost
# AdaBoost Metrics
adaboost_results <- data.frame(
Experiment = c("Default AdaBoost", "Ada: nu = 0.5, iter = 100", "Ada: Feature Selection & Scaling"),
Accuracy = NA,
Precision = NA,
Recall = NA,
F1_Score = NA,
AUC_ROC = NA
)
Experiment 1 (Baseline AdaBoost): Baseline with 50 iterations. We hypothesize that applying AdaBoost with 50 iterations can achieve a high level of accuracy, precision, and recall, with an AUC-ROC that reflects the model's ability to distinguish between the two classes (subscribed vs. not subscribed).
Variation: None (baseline model).
## Experiment 1 (Baseline AdaBoost) : Baseline with 50 iterations
# Train an Adaboost model
adaboost_model <- ada(y ~ ., data = train_data, iter = 50, nu = 1)
print(adaboost_model)
## Call:
## ada(y ~ ., data = train_data, iter = 50, nu = 1)
##
## Loss: exponential Method: discrete Iteration: 50
##
## Final Confusion Matrix for Data:
## Final Prediction
## True value no yes
## no 28802 437
## yes 2863 849
##
## Train Error: 0.1
##
## Out-Of-Bag Error: 0.099 iteration= 6
##
## Additional Estimates of number of iterations:
##
## train.err1 train.kap1
## 41 44
adaboost_pred <- predict(adaboost_model, test_data)
adaboost_prob <- predict(adaboost_model, test_data, type = "prob") # Probability predictions
yes_prob_ada_1 <- adaboost_prob[, 2]
pred_ada_1 <- prediction(yes_prob_ada_1, test_y_numeric)
auc_ada_1 <- performance(pred_ada_1, "auc")@y.values[[1]]
conf_matrix_ada <- confusionMatrix(adaboost_pred, test_data$y, positive = "yes")
adaboost_results[1, "Accuracy"] <- conf_matrix_ada$overall["Accuracy"]
adaboost_results[1, "Precision"] <- conf_matrix_ada$byClass["Precision"]
adaboost_results[1, "Recall"] <- conf_matrix_ada$byClass["Recall"]
adaboost_results[1, "F1_Score"] <- conf_matrix_ada$byClass["F1"]
adaboost_results[1, "AUC_ROC"] <- auc_ada_1
print(adaboost_results)
## Experiment Accuracy Precision Recall F1_Score
## 1 Default AdaBoost 0.9011776 0.6925676 0.2209052 0.3349673
## 2 Ada: nu = 0.5, iter = 100 NA NA NA NA
## 3 Ada: Feature Selection & Scaling NA NA NA NA
## AUC_ROC
## 1 0.7719639
## 2 NA
## 3 NA
Results:
Conclusion: The baseline performance is solid, with high accuracy but lower precision and recall. There’s room for improvement in model performance.
Recommendation: Tune the number of boosting iterations (iter) and the learning rate (nu) to enhance performance.
Experiment 2: Hyperparameter Tuning (nu and iter). The objective is to assess how changing the learning rate (nu) and the number of boosting iterations (iter) impacts the performance of the AdaBoost model. Specifically, we aim to explore whether tuning these hyperparameters improves the accuracy and F1-score compared to the baseline model (with default nu = 1 and iter = 50).
Variation: The learning rate nu is lowered to 0.5 and the number of iterations iter is raised to 100.
# Experiment 2: Hyperparameter tuning (nu and iter)
# Using nu = 0.5 and iter = 100
adaboost_model_1 <- ada(y ~ ., data = train_data, iter = 100, nu = 0.5)
adaboost_pred_1 <- predict(adaboost_model_1, test_data)
adaboost_prob_2 <- predict(adaboost_model_1, test_data, type = "prob")
yes_prob_ada_2 <- adaboost_prob_2[, 2]
pred_ada_2 <- prediction(yes_prob_ada_2, test_y_numeric)
auc_ada_2 <- performance(pred_ada_2, "auc")@y.values[[1]]
# Confusion Matrix
conf_matrix_ada_2 <- confusionMatrix(adaboost_pred_1, test_data$y, positive = "yes")
adaboost_results[2, "Accuracy"] <- conf_matrix_ada_2$overall["Accuracy"]
adaboost_results[2, "Precision"] <- conf_matrix_ada_2$byClass["Precision"]
adaboost_results[2, "Recall"] <- conf_matrix_ada_2$byClass["Recall"]
adaboost_results[2, "F1_Score"] <- conf_matrix_ada_2$byClass["F1"]
adaboost_results[2, "AUC_ROC"] <- auc_ada_2
print(adaboost_results)
## Experiment Accuracy Precision Recall F1_Score
## 1 Default AdaBoost 0.9011776 0.6925676 0.2209052 0.3349673
## 2 Ada: nu = 0.5, iter = 100 0.8995994 0.6391185 0.2500000 0.3594113
## 3 Ada: Feature Selection & Scaling NA NA NA NA
## AUC_ROC
## 1 0.7719639
## 2 0.7799235
## 3 NA
Results:
Conclusion: Hyperparameter Tuning slightly improved AUC-ROC and F1-Score, but accuracy stayed nearly the same. There’s still a trade-off between precision and recall.
Recommendation: The increase in iterations helps slightly, but further tuning of the learning rate could lead to better overall performance.
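A sketch of that further tuning: a small loop over candidate learning rates at iter = 100, scoring each fit by AUC (the nu values are illustrative; a proper comparison would use a validation split or cross-validation rather than the test set, and each fit is slow):
# Sketch: compare a few learning rates at iter = 100
nu_grid <- c(0.1, 0.3, 0.5)
auc_by_nu <- sapply(nu_grid, function(nu_val) {
  fit <- ada(y ~ ., data = train_data, iter = 100, nu = nu_val)
  probs <- predict(fit, test_data, type = "prob")[, 2]
  performance(prediction(probs, test_y_numeric), "auc")@y.values[[1]]
})
data.frame(nu = nu_grid, AUC = auc_by_nu)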
Experiment 3: Data Preprocessing (Normalization and Feature Selection). The objective is to evaluate whether applying data preprocessing techniques, namely normalization of continuous features and feature selection, improves the performance of the AdaBoost model compared to the previous experiments.
Variation: Continuous variables are centered and scaled, and the model is trained on the 10 features ranked most important by a Random Forest.
# Experiment 3: Data Preprocessing (Normalization and Feature Selection)
# Normalize the continuous variables
pre_process <- preProcess(train_data, method = c("center", "scale"))
train_data_normalized <- predict(pre_process, train_data)
test_data_normalized <- predict(pre_process, test_data)
# Feature Selection: Select the top 10 features based on importance
rf_model <- randomForest(y ~ ., data = train_data_normalized)
importance_scores <- importance(rf_model)
top_features <- names(sort(importance_scores[, 1], decreasing = TRUE))[1:10]
train_data_selected <- train_data_normalized[, c(top_features, "y")]
test_data_selected <- test_data_normalized[, c(top_features, "y")]
# Train the Adaboost model on the selected features
adaboost_model_3 <- ada(y ~ ., data = train_data_selected, iter = 50, nu = 1)
adaboost_pred_3 <- predict(adaboost_model_3, test_data_selected)
adaboost_prob_3 <- predict(adaboost_model_3, test_data_selected, type = "prob")
yes_prob_ada_3 <- adaboost_prob_3[, 2]
pred_ada_3 <- prediction(yes_prob_ada_3, test_y_numeric)
auc_ada_3 <- performance(pred_ada_3, "auc")@y.values[[1]]
conf_matrix_ada_3 <- confusionMatrix(adaboost_pred_3, test_data_selected$y, positive = "yes")
# Store results in adaboost_results (Row 3)
adaboost_results[3, "Accuracy"] <- conf_matrix_ada_3$overall["Accuracy"]
adaboost_results[3, "Precision"] <- conf_matrix_ada_3$byClass["Precision"]
adaboost_results[3, "Recall"] <- conf_matrix_ada_3$byClass["Recall"]
adaboost_results[3, "F1_Score"] <- conf_matrix_ada_3$byClass["F1"]
adaboost_results[3, "AUC_ROC"] <- auc_ada_3
print(adaboost_results)
## Experiment Accuracy Precision Recall F1_Score
## 1 Default AdaBoost 0.9011776 0.6925676 0.2209052 0.3349673
## 2 Ada: nu = 0.5, iter = 100 0.8995994 0.6391185 0.2500000 0.3594113
## 3 Ada: Feature Selection & Scaling 0.8985067 0.6729323 0.1928879 0.2998325
## AUC_ROC
## 1 0.7719639
## 2 0.7799235
## 3 0.7717384
Results:
Conclusion: Feature Selection & Scaling resulted in higher precision (0.6729) than the hyperparameter-tuned model, although still below the baseline (0.6926). However, recall dropped to 0.1929, indicating that the model is missing more true positives. AUC-ROC also decreased slightly compared to the other experiments, suggesting that some removed features may have contained important predictive information.
Recommendation: Further testing could involve different normalization methods or keeping more than 10 features to balance recall and precision. If the goal is high precision, this configuration is still a reasonable choice, as it keeps false positives relatively low.
adaboost_results_long <- melt(adaboost_results, id.vars = "Experiment")
ggplot(adaboost_results_long, aes(x = Experiment, y = value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "AdaBoost Performance Comparison", y = "Metric Value", x = "Experiment") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Hyperparameter tuning (nu = 0.5, iter = 100) provides a slight improvement in recall, F1 score, and AUC-ROC, making it the best-performing configuration based on these metrics. In summary, hyperparameter tuning (nu = 0.5, iter = 100) appears to be the most beneficial change to the AdaBoost model, offering a small but meaningful performance boost over the default model.
# Combine the results into one DataFrame
all_results <- rbind(dt_results, rf_results, adaboost_results)
kable(all_results, caption = "Model Performance Comparison", digits = 3, format = "markdown")
| Experiment | Accuracy | Precision | Recall | F1_Score | AUC_ROC |
|---|---|---|---|---|---|
| Default Decision Tree | 0.897 | 0.576 | 0.339 | 0.427 | 0.739 |
| D.T:Max Depth = 5 | 0.901 | 0.746 | 0.180 | 0.290 | 0.699 |
| D.T:Pruned Tree | 0.902 | 0.689 | 0.239 | 0.355 | 0.739 |
| R.F:50 Trees | 0.895 | 0.574 | 0.264 | 0.362 | 0.760 |
| R.F:200 Trees | 0.896 | 0.582 | 0.272 | 0.370 | 0.766 |
| R.F:mtry = 6 | 0.893 | 0.550 | 0.283 | 0.374 | 0.757 |
| Default AdaBoost | 0.901 | 0.693 | 0.221 | 0.335 | 0.772 |
| Ada: nu = 0.5, iter = 100 | 0.900 | 0.639 | 0.250 | 0.359 | 0.780 |
| Ada: Feature Selection & Scaling | 0.899 | 0.673 | 0.193 | 0.300 | 0.772 |
# Highlighting the best result in each metric
highlighted_results <- formattable(
all_results,
list(
Accuracy = formatter("span",
style = function(x) ifelse(x == max(all_results$Accuracy),
style(color = "green", font.weight = "bold"),
NA)),
Precision = formatter("span",
style = function(x) ifelse(x == max(all_results$Precision),
style(color = "blue", font.weight = "bold"),
NA)),
Recall = formatter("span",
style = function(x) ifelse(x == max(all_results$Recall),
style(color = "red", font.weight = "bold"),
NA)),
F1_Score = formatter("span",
style = function(x) ifelse(x == max(all_results$F1_Score),
style(color = "orange", font.weight = "bold"),
NA)),
AUC_ROC = formatter("span",
style = function(x) ifelse(x == max(all_results$AUC_ROC),
style(color = "purple", font.weight = "bold"),
NA))
)
)
highlighted_results
| Experiment | Accuracy | Precision | Recall | F1_Score | AUC_ROC |
|---|---|---|---|---|---|
| Default Decision Tree | 0.8974141 | 0.5758684 | 0.3394397 | 0.4271186 | 0.7390987 |
| D.T:Max Depth = 5 | 0.9006920 | 0.7455357 | 0.1799569 | 0.2899306 | 0.6989344 |
| D.T:Pruned Tree | 0.9021488 | 0.6894410 | 0.2392241 | 0.3552000 | 0.7390662 |
| R.F:50 Trees | 0.8949860 | 0.5737705 | 0.2640086 | 0.3616236 | 0.7600701 |
| R.F:200 Trees | 0.8959573 | 0.5819861 | 0.2715517 | 0.3703159 | 0.7656355 |
| R.F:mtry = 6 | 0.8931650 | 0.5502092 | 0.2834052 | 0.3741110 | 0.7568994 |
| Default AdaBoost | 0.9011776 | 0.6925676 | 0.2209052 | 0.3349673 | 0.7719639 |
| Ada: nu = 0.5, iter = 100 | 0.8995994 | 0.6391185 | 0.2500000 | 0.3594113 | 0.7799235 |
| Ada: Feature Selection & Scaling | 0.8985067 | 0.6729323 | 0.1928879 | 0.2998325 | 0.7717384 |
In conclusion, we observed that constraining the decision tree improved precision (0.7455 with max depth = 5, 0.6894 with pruning) but reduced recall (0.1800 and 0.2392, respectively), while the unconstrained baseline tree had the best recall and F1 among the trees at a higher risk of overfitting. In the random forest experiments, increasing the number of trees improved recall to 0.2716 and ROC-AUC to 0.7656, showing better generalization. AdaBoost performed best with 100 iterations and a learning rate of 0.5, achieving the highest ROC-AUC of 0.7799 and a reasonable balance between precision and recall. AdaBoost is therefore our optimal model, as it provided the best overall performance. Further tuning and resampling techniques could improve recall without compromising precision. For business decisions, choosing between precision-focused and recall-focused models depends on marketing priorities.
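As a pointer for the resampling idea mentioned above, a minimal sketch using caret's downSample to balance the training classes before refitting the chosen AdaBoost configuration (sketched only; the refit is not evaluated here):
# Sketch: down-sample the majority "no" class, then refit AdaBoost with nu = 0.5, iter = 100
set.seed(123)
train_down <- downSample(x = train_data[, setdiff(names(train_data), "y")],
                         y = train_data$y, yname = "y")
table(train_down$y) # equal counts of "no" and "yes"
ada_down <- ada(y ~ ., data = train_down, iter = 100, nu = 0.5)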