Assignment 2
library(caret)
library(rpart)
library(knitr)
library(randomForest)
library(adabag)
library(ada)
library(dplyr)
library(ggplot2)
library(pROC)
library(reshape2)
library(fmsb)
library(patchwork)
library(ROCR)
library(rpart.plot)
library(gains)
library(formattable)
# Load Data
url <- "https://raw.githubusercontent.com/NikoletaEm/datasps/refs/heads/main/bank-additional-full.csv"
bank <- read.csv(url, sep = ";")
# Drop 'duration' (leaks information)
bank <- bank %>% dplyr::select(-duration)
# Convert categorical variables to factors
categorical_vars <- c("job", "marital", "education", "contact", "day_of_week",
"month", "pdays", "previous", "housing", "loan", "default", "y")
bank[categorical_vars] <- lapply(bank[categorical_vars], factor)
# Remove 'default' column
bank$default <- NULL
# Handle "unknown" values in categorical columns by setting them to NA
bank$housing <- replace(bank$housing, bank$housing == "unknown", NA)
bank$loan <- replace(bank$loan, bank$loan == "unknown", NA)
# Impute missing values using the most frequent value (mode)
housing_mode <- names(which.max(table(bank$housing)))
loan_mode <- names(which.max(table(bank$loan)))
bank$housing[is.na(bank$housing)] <- housing_mode
bank$loan[is.na(bank$loan)] <- loan_mode
summary(bank)
## age job marital
## Min. :17.00 admin. :10422 divorced: 4612
## 1st Qu.:32.00 blue-collar: 9254 married :24928
## Median :38.00 technician : 6743 single :11568
## Mean :40.02 services : 3969 unknown : 80
## 3rd Qu.:47.00 management : 2924
## Max. :98.00 retired : 1720
## (Other) : 6156
## education housing loan contact
## university.degree :12168 no :18622 no :34940 cellular :26144
## high.school : 9515 unknown: 0 unknown: 0 telephone:15044
## basic.9y : 6045 yes :22566 yes : 6248
## professional.course: 5243
## basic.4y : 4176
## basic.6y : 2292
## (Other) : 1749
## month day_of_week campaign pdays previous
## may :13769 fri:7827 Min. : 1.000 999 :39673 0 :35563
## jul : 7174 mon:8514 1st Qu.: 1.000 3 : 439 1 : 4561
## aug : 6178 thu:8623 Median : 2.000 6 : 412 2 : 754
## jun : 5318 tue:8090 Mean : 2.568 4 : 118 3 : 216
## nov : 4101 wed:8134 3rd Qu.: 3.000 9 : 64 4 : 70
## apr : 2632 Max. :56.000 2 : 61 5 : 18
## (Other): 2016 (Other): 421 (Other): 6
## poutcome emp.var.rate cons.price.idx cons.conf.idx
## Length:41188 Min. :-3.40000 Min. :92.20 Min. :-50.8
## Class :character 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7
## Mode :character Median : 1.10000 Median :93.75 Median :-41.8
## Mean : 0.08189 Mean :93.58 Mean :-40.5
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4
## Max. : 1.40000 Max. :94.77 Max. :-26.9
##
## euribor3m nr.employed y
## Min. :0.634 Min. :4964 no :36548
## 1st Qu.:1.344 1st Qu.:5099 yes: 4640
## Median :4.857 Median :5191
## Mean :3.621 Mean :5167
## 3rd Qu.:4.961 3rd Qu.:5228
## Max. :5.045 Max. :5228
##
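One optional cleanup worth noting: because the "unknown" entries in housing and loan were replaced rather than removed as factor levels, the summary above still lists an empty unknown level (count 0) for both columns. A minimal sketch, assuming the imputation above has already run, that drops the now-empty levels:
# Optional sketch: drop the empty "unknown" level left behind after imputation
bank$housing <- droplevels(bank$housing)
bank$loan <- droplevels(bank$loan)
levels(bank$housing) # expected: "no" "yes"
levels(bank$loan)    # expected: "no" "yes"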
# Split the data into training and testing sets (80-20 split)
set.seed(123) # for reproducibility
trainIndex <- createDataPartition(bank$y, p = 0.8, list = FALSE)
train_data <- bank[trainIndex, ]
test_data <- bank[-trainIndex, ]
The objective is to find the best algorithm for predicting whether a client will subscribe to a term deposit, comparing Decision Trees, Random Forest, and AdaBoost. The working hypotheses are that a Decision Tree may perform well but could overfit without tuning, that Random Forest should improve performance by reducing overfitting through ensemble learning, and that AdaBoost may enhance performance by focusing on misclassified instances, especially with deeper base trees.
# Confirm the split
cat("Training Data Size:", nrow(train_data), "\n")
## Training Data Size: 32951
cat("Testing Data Size:", nrow(test_data), "\n")
## Testing Data Size: 8237
# Check class distribution in both sets
table(train_data$y) / nrow(train_data)
##
## no yes
## 0.8873479 0.1126521
table(test_data$y) / nrow(test_data)
##
## no yes
## 0.8873376 0.1126624
As for the choice of evaluation metrics:
Accuracy: Measures the overall correctness of predictions. In the context of predicting term deposit subscriptions, accuracy gives a general idea of how well the model is performing. However, since our data is imbalanced (fewer “yes” cases), accuracy alone isn’t enough to gauge performance as the model could predict “no” for most cases and still achieve high accuracy.
Precision: Indicates the proportion of predicted “yes” cases that were actually correct. High precision means fewer false positives, which is valuable from a cost-efficiency perspective. Contacting non-interested clients is costly, so a model with high precision helps the bank focus its marketing efforts on clients most likely to subscribe, optimizing resource allocation.
Recall: Reflects the model’s ability to correctly identify clients who will subscribe (true positives). A low recall means the model is missing many potential clients, leading to lost opportunities. For the bank, improving recall would help capture more interested clients, increasing conversion rates.
F1 Score: Balances precision and recall, making it a useful metric when both false positives and false negatives carry costs. In this context, the F1 Score helps balance the trade-off between avoiding unnecessary outreach and maximizing the number of actual subscribers identified.
ROC-AUC: Measures the model’s ability to distinguish between classes across different threshold values. A higher AUC indicates better discriminatory power. For the bank, a higher AUC suggests the model can effectively prioritize clients, helping target the right audience with marketing campaigns.
In conclusion, I’ll use the F1 Score to balance precision and recall, and ROC-AUC to measure how well the model distinguishes between clients who will subscribe and those who won’t. This gives a clear view of performance with imbalanced data.
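For reference, a minimal sketch of how these two headline metrics can be pulled together from a caret confusionMatrix object and a vector of predicted "yes" probabilities (as produced for each model later in this report); summarise_metrics is an illustrative helper, not a function from any package:
# Sketch: collect F1 and ROC-AUC for a fitted model
# cm    : a caret::confusionMatrix object with positive = "yes"
# probs : predicted probabilities of the "yes" class
# truth : the factor of true labels (levels "no", "yes")
summarise_metrics <- function(cm, probs, truth) {
  roc_obj <- pROC::roc(response = truth, predictor = probs, levels = c("no", "yes"))
  c(F1 = unname(cm$byClass["F1"]), AUC_ROC = as.numeric(pROC::auc(roc_obj)))
}
# Example use (after a model has been evaluated):
# summarise_metrics(conf_matrix_df, yes_probabilities, test_data$y)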
Decision Tree Experiment 1 (Baseline Decision Tree): Default Hyperparameters. We hypothesize that a Decision Tree with default hyperparameters will serve as a baseline, helping us understand how well the model performs without tuning. The goal is to measure its predictive ability and use it as a reference for later experiments. Variation: None (baseline model).
# Decision Tree Metrics
dt_results <- data.frame(
Experiment = c("Default Decision Tree", "D.T:Max Depth = 5", "D.T:Pruned Tree"),
Accuracy = NA,
Precision = NA,
Recall = NA,
F1_Score = NA,
AUC_ROC = NA
)
tc <- trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = twoClassSummary)
set.seed(123)
dt_model <- train(y ~ .,
data = train_data,
method = "rpart",
trControl = tc,
metric = "ROC",
weights = ifelse(train_data$y == "yes", 1.5, 1))
# Weights help handle class imbalance
# "yes" (subscribed) cases get more weight (1.5)
# "no" (not subscribed) cases get normal weight (1)
# This helps the model focus on predicting "yes" better
# Baseline Decision Tree Plot
rpart.plot(dt_model$finalModel,
type = 2,
extra = 104,
under = TRUE,
box.palette = "Blues",
branch.lty = 3,
shadow.col = "gray",
main = "Baseline Decision Tree")
The Baseline Decision Tree is more complex, capturing more splits and deeper interactions, which likely contributes to its higher variance (i.e., a greater risk of overfitting).
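As a quick check on this, the cross-validation summary that caret stores alongside the fitted tree can be inspected directly; this sketch only reads fields that train() already returns:
# Sketch: cross-validated ROC for each cp value caret tried, and the cp it kept
print(dt_model$results)  # cp, ROC, Sens, Spec per candidate complexity parameter
print(dt_model$bestTune) # cp used in dt_model$finalModel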
tree_pred_df <- predict(dt_model, test_data, type = "raw") # Class predictions
tree_pred_prob <- predict(dt_model, test_data, type = "prob") # Probability predictions
# Confusion Matrix
conf_matrix_df <- confusionMatrix(tree_pred_df, test_data$y, positive = "yes")
print(conf_matrix_df)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7077 613
## yes 232 315
##
## Accuracy : 0.8974
## 95% CI : (0.8907, 0.9039)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.001812
##
## Kappa : 0.3749
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.33944
## Specificity : 0.96826
## Pos Pred Value : 0.57587
## Neg Pred Value : 0.92029
## Prevalence : 0.11266
## Detection Rate : 0.03824
## Detection Prevalence : 0.06641
## Balanced Accuracy : 0.65385
##
## 'Positive' Class : yes
##
yes_probabilities <- tree_pred_prob[, "yes"]
test_y_numeric <- ifelse(test_data$y == "yes", 1, 0)
pred <- prediction(yes_probabilities, test_y_numeric)
auc <- performance(pred, "auc")
auc_value <- auc@y.values[[1]]
dt_results[1, "Accuracy"] <- conf_matrix_df$overall["Accuracy"]
dt_results[1, "Precision"] <- conf_matrix_df$byClass["Precision"]
dt_results[1, "Recall"] <- conf_matrix_df$byClass["Recall"]
dt_results[1, "F1_Score"] <- conf_matrix_df$byClass["F1"]
dt_results[1, "AUC_ROC"] <- auc_value
print(dt_results)
## Experiment Accuracy Precision Recall F1_Score AUC_ROC
## 1 Default Decision Tree 0.8974141 0.5758684 0.3394397 0.4271186 0.7390987
## 2 D.T:Max Depth = 5 NA NA NA NA NA
## 3 D.T:Pruned Tree NA NA NA NA NA
Results:
Conclusion: The baseline Decision Tree provides a solid foundation, but its recall is relatively low, meaning it struggles to capture the “yes” cases accurately.
Recommendation: Introduce constraints such as max depth to control overfitting.
Experiment 2: Decision Tree with Max Depth = 5. The hypothesis is that reducing the depth of the decision tree (by setting maxdepth = 5) will affect the model's performance in predicting whether a client will subscribe to a term deposit (variable y). Specifically, the model may generalize better, potentially improving its ability to classify unseen data, but it could also lose accuracy compared to the baseline model.
Variation: Setting maxdepth = 5, which limits the complexity of the model.
set.seed(123)
tree_model_1 <- rpart(y ~ ., data = train_data, method = "class", control = rpart.control(maxdepth = 5))
# Prediction
tree_pred_1 <- predict(tree_model_1, test_data, type = "class")
tree_pred_prob_1 <- predict(tree_model_1, test_data, type = "prob")
# Evaluation
conf_matrix_1 <- confusionMatrix(tree_pred_1, test_data$y, positive = "yes")
print(conf_matrix_1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7252 761
## yes 57 167
##
## Accuracy : 0.9007
## 95% CI : (0.894, 0.9071)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 5.214e-05
##
## Kappa : 0.2574
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.17996
## Specificity : 0.99220
## Pos Pred Value : 0.74554
## Neg Pred Value : 0.90503
## Prevalence : 0.11266
## Detection Rate : 0.02027
## Detection Prevalence : 0.02719
## Balanced Accuracy : 0.58608
##
## 'Positive' Class : yes
##
yes_probabilities_1 <- tree_pred_prob_1[, "yes"]
pred_2 <- prediction(yes_probabilities_1, test_y_numeric)
auc_2 <- performance(pred_2, "auc")
auc_value_2 <- auc_2@y.values[[1]]
# Store results for Experiment 2
dt_results[2, "Accuracy"] <- conf_matrix_1$overall["Accuracy"]
dt_results[2, "Precision"] <- conf_matrix_1$byClass["Precision"]
dt_results[2, "Recall"] <- conf_matrix_1$byClass["Recall"]
dt_results[2, "F1_Score"] <- conf_matrix_1$byClass["F1"]
dt_results[2, "AUC_ROC"] <- auc_value_2
print(dt_results)
## Experiment Accuracy Precision Recall F1_Score AUC_ROC
## 1 Default Decision Tree 0.8974141 0.5758684 0.3394397 0.4271186 0.7390987
## 2 D.T:Max Depth = 5 0.9006920 0.7455357 0.1799569 0.2899306 0.6989344
## 3 D.T:Pruned Tree NA NA NA NA NA
rpart.plot(tree_model_1,
type = 2,
extra = 104,
under = TRUE,
box.palette = "Greens",
branch.lty = 3,
shadow.col = "gray",
main = "Decision Tree (Max Depth = 5)")
The Decision Tree with Max Depth = 5 above is much simpler than the baseline, limiting the number of branches. This reduces overfitting but might increase bias, as it may miss some nuanced relationships in the data.
Results:
Conclusion: Reducing the depth has increased precision but significantly lowered recall. The model is making fewer false positives but is struggling to capture the positive cases.
Recommendation: Further tuning is needed, possibly by adjusting minsplit or experimenting with pruning strategies.
Experiment 3: Decision Tree with Pruning & Min Split = 50. We hypothesize that combining pruning with minsplit = 50 will result in a more generalized tree that avoids overfitting while ensuring each split has sufficient data. This approach should balance precision and recall while improving AUC-ROC. Variation: Grow a full tree with cp = 0 and minsplit = 50, then prune it back at the cp value that minimizes the cross-validated error.
set.seed(123)
#Train Decision Tree with minsplit = 50
tree_model_3 <- rpart(y ~ ., data = train_data, method = "class", control = rpart.control(cp = 0, minsplit = 50))
printcp(tree_model_3)
##
## Classification tree:
## rpart(formula = y ~ ., data = train_data, method = "class", control = rpart.control(cp = 0,
## minsplit = 50))
##
## Variables actually used in tree construction:
## [1] age campaign cons.conf.idx contact day_of_week
## [6] education emp.var.rate euribor3m housing job
## [11] loan marital month nr.employed pdays
## [16] poutcome previous
##
## Root node error: 3712/32951 = 0.11265
##
## n= 32951
##
## CP nsplit rel error xerror xstd
## 1 0.05266703 0 1.00000 1.00000 0.015461
## 2 0.00395115 2 0.89467 0.90894 0.014825
## 3 0.00350216 6 0.87796 0.90086 0.014767
## 4 0.00282866 7 0.87446 0.89898 0.014753
## 5 0.00242457 9 0.86880 0.89898 0.014753
## 6 0.00197557 10 0.86638 0.89278 0.014708
## 7 0.00188578 16 0.85318 0.89224 0.014704
## 8 0.00161638 18 0.84941 0.89170 0.014700
## 9 0.00148168 23 0.84133 0.89036 0.014690
## 10 0.00134698 25 0.83836 0.89332 0.014712
## 11 0.00094289 27 0.83567 0.89547 0.014728
## 12 0.00089799 29 0.83378 0.90598 0.014804
## 13 0.00080819 43 0.82085 0.91164 0.014845
## 14 0.00071839 45 0.81923 0.91352 0.014858
## 15 0.00062859 54 0.81277 0.91352 0.014858
## 16 0.00053879 57 0.81088 0.92107 0.014913
## 17 0.00044899 70 0.80172 0.92107 0.014913
## 18 0.00035920 73 0.80038 0.92565 0.014945
## 19 0.00026940 84 0.79580 0.92780 0.014961
## 20 0.00020205 94 0.79310 0.93427 0.015007
## 21 0.00017960 98 0.79230 0.93696 0.015026
## 22 0.00013470 101 0.79176 0.93723 0.015028
## 23 0.00010776 103 0.79149 0.93992 0.015047
## 24 0.00000000 108 0.79095 0.94154 0.015058
optimal_cp <- tree_model_3$cptable[which.min(tree_model_3$cptable[, "xerror"]), "CP"]
pruned_tree_3 <- prune(tree_model_3, cp = optimal_cp)
# Prediction
tree_pred_3 <- predict(pruned_tree_3, test_data, type = "class")
tree_pred_prob_3 <- predict(pruned_tree_3, test_data, type = "prob")
yes_probabilities_3 <- tree_pred_prob_3[, "yes"]
pred_3 <- prediction(yes_probabilities_3, test_y_numeric)
auc_3 <- performance(pred_3, "auc")
auc_value_3 <- auc_3@y.values[[1]]
conf_matrix_3 <- confusionMatrix(tree_pred_3, test_data$y, positive = "yes")
print(conf_matrix_3)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7209 706
## yes 100 222
##
## Accuracy : 0.9021
## 95% CI : (0.8955, 0.9085)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 7.978e-06
##
## Kappa : 0.3155
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.23922
## Specificity : 0.98632
## Pos Pred Value : 0.68944
## Neg Pred Value : 0.91080
## Prevalence : 0.11266
## Detection Rate : 0.02695
## Detection Prevalence : 0.03909
## Balanced Accuracy : 0.61277
##
## 'Positive' Class : yes
##
dt_results[3, "Accuracy"] <- conf_matrix_3$overall["Accuracy"]
dt_results[3, "Precision"] <- conf_matrix_3$byClass["Precision"]
dt_results[3, "Recall"] <- conf_matrix_3$byClass["Recall"]
dt_results[3, "F1_Score"] <- conf_matrix_3$byClass["F1"]
dt_results[3, "AUC_ROC"] <- auc_value_3
print(dt_results)
## Experiment Accuracy Precision Recall F1_Score AUC_ROC
## 1 Default Decision Tree 0.8974141 0.5758684 0.3394397 0.4271186 0.7390987
## 2 D.T:Max Depth = 5 0.9006920 0.7455357 0.1799569 0.2899306 0.6989344
## 3 D.T:Pruned Tree 0.9021488 0.6894410 0.2392241 0.3552000 0.7390662
Results:
Conclusion: The pruned tree shows a slight improvement in accuracy compared to the default decision tree and a significant boost in precision. However, recall remains relatively low, indicating the model’s focus on reducing false positives at the expense of capturing more positive cases.
Recommendation: Further tuning of pruning parameters, like adjusting the minimum samples per leaf, could enhance recall without compromising precision.
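As a possible follow-up (not run or evaluated here), the minimum number of samples per leaf can be constrained through rpart.control's minbucket argument on top of the Experiment 3 settings; a minimal sketch, with minbucket = 20 chosen purely for illustration:
# Sketch: add a minimum-leaf-size constraint to the Experiment 3 setup
# minbucket = 20 is an illustrative, untuned value
set.seed(123)
tree_model_minbucket <- rpart(y ~ ., data = train_data, method = "class",
                              control = rpart.control(cp = 0, minsplit = 50, minbucket = 20))
# Prune at the cross-validated optimum, exactly as in Experiment 3
cp_opt <- tree_model_minbucket$cptable[which.min(tree_model_minbucket$cptable[, "xerror"]), "CP"]
pruned_minbucket <- prune(tree_model_minbucket, cp = cp_opt)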
# Precision-Recall Curve for Experiment 1
pr_curve <- performance(pred, "prec", "rec")
plot(pr_curve, main = "Precision-Recall Curve: Decision Trees", col = "blue", lwd = 2)
# Precision-Recall Curve for Experiment 2
pr_curve_2 <- performance(pred_2, "prec", "rec")
plot(pr_curve_2, main = "Precision-Recall Curve (Exp 2)", col = "green", lwd = 2, add = TRUE)
# Precision-Recall Curve for Experiment 3
pr_curve_3 <- performance(pred_3, "prec", "rec")
plot(pr_curve_3, main = "Precision-Recall Curve (Exp 3)", col = "red", lwd = 2, add = TRUE)
legend("bottomleft", legend = c("Exp 1", "Exp 2", "Exp 3"),
col = c("blue", "green", "red"), lty = 1, lwd = 2)
This plot compares the precision-recall trade-off for three different decision tree models. Since our dataset is imbalanced (fewer “yes” responses), precision-recall curves are useful for evaluating model performance in distinguishing potential term deposit subscribers. The red curve (Exp 3) seems to perform slightly better at lower recall levels, indicating better handling of positive cases with pruning.
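Because all three trees output class probabilities, the operating threshold itself can also be tuned rather than fixed at 0.5; a minimal sketch, using the ROCR prediction object from Experiment 1 to pick the cutoff that maximizes F1:
# Sketch: probability cutoff that maximizes F1 for the baseline tree
f1_perf <- performance(pred, "f")       # F-measure across all cutoffs
cutoffs <- f1_perf@x.values[[1]]
f1_vals <- f1_perf@y.values[[1]]
best_cut <- cutoffs[which.max(f1_vals)] # which.max skips the NaN at extreme cutoffs
best_cut
# Reclassify the test set with the tuned threshold
tuned_pred <- factor(ifelse(yes_probabilities >= best_cut, "yes", "no"), levels = c("no", "yes"))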
# Gain chart
gain_chart <- gains(test_y_numeric, yes_probabilities)
plot(gain_chart, main = "Gain Chart", col = "purple", lwd = 2)
This visualization helps assess the effectiveness of the predictive model by showing how well it ranks clients by their likelihood to subscribe. The Mean Response and Mean Predicted Response lines show how well the model is capturing potential subscribers compared to the actual distribution. The steep decline suggests that the model successfully identifies high-probability subscribers early on.
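To attach a number to what the gain chart shows, a short sketch of the same idea: the share of actual subscribers captured in the top 10% of test clients ranked by the baseline tree's predicted probability:
# Sketch: proportion of all "yes" cases captured in the top decile of predicted probability
lift_df <- data.frame(prob = yes_probabilities, actual = test_y_numeric) %>%
  mutate(decile = ntile(desc(prob), 10)) # decile 1 = highest predicted probability
with(lift_df, sum(actual[decile == 1]) / sum(actual))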
Random Forest
# Random Forest Metrics
rf_results <- data.frame(
Experiment = c("R.F:50 Trees", "R.F:200 Trees", "R.F:mtry = 6"),
Accuracy = NA,
Precision = NA,
Recall = NA,
F1_Score = NA,
AUC_ROC = NA
)
Experiment 1 (Baseline Random Forest): The hypothesis is that using a random forest classifier with 50 trees will provide a reasonable balance between model complexity and predictive performance.
Variation: None (baseline model).
### Experiment 1: Random Forest with 50 Trees ###
set.seed(123)
rf_50 <- randomForest(y ~ ., data = train_data, ntree = 50)
# Predictions
rf_50_pred <- predict(rf_50, test_data) # Class predictions
rf_50_prob <- predict(rf_50, test_data, type = "prob") # Probability predictions
yes_prob_rf_50 <- rf_50_prob[, "yes"]
pred_rf_50 <- prediction(yes_prob_rf_50, test_y_numeric)
auc_rf_50 <- performance(pred_rf_50, "auc")@y.values[[1]]
conf_matrix_rf_50 <- confusionMatrix(rf_50_pred, test_data$y, positive = "yes")
rf_results[1, "Accuracy"] <- conf_matrix_rf_50$overall["Accuracy"]
rf_results[1, "Precision"] <- conf_matrix_rf_50$byClass["Precision"]
rf_results[1, "Recall"] <- conf_matrix_rf_50$byClass["Recall"]
rf_results[1, "F1_Score"] <- conf_matrix_rf_50$byClass["F1"]
rf_results[1, "AUC_ROC"] <- auc_rf_50
print(rf_results)
## Experiment Accuracy Precision Recall F1_Score AUC_ROC
## 1 R.F:50 Trees 0.894986 0.5737705 0.2640086 0.3616236 0.7600701
## 2 R.F:200 Trees NA NA NA NA NA
## 3 R.F:mtry = 6 NA NA NA NA NA
Results:
Conclusion: The baseline model performs reasonably well, with an ROC-AUC of 0.76, providing a benchmark.
Recommendation: Proceed with hyperparameter tuning to improve performance.
Experiment 2: 200 Trees. The objective of this experiment is to evaluate the performance of a Random Forest model with 200 trees on predicting whether a client will subscribe to a term deposit (the target variable y). The hypothesis is that increasing the number of trees from 50 to 200 will enhance the model's predictive power by reducing variance, improving the accuracy and generalization of the model.
Variation: Increased the number of trees (ntree) from 50 to 200.
## Experiment 2: 200 Trees
set.seed(123)
rf_200 <- randomForest(y ~ ., data = train_data, ntree = 200)
# Predictions
rf_200_pred <- predict(rf_200, test_data)
rf_200_prob <- predict(rf_200, test_data, type = "prob")
yes_prob_rf_200 <- rf_200_prob[, "yes"]
pred_rf_200 <- prediction(yes_prob_rf_200, test_y_numeric)
auc_rf_200 <- performance(pred_rf_200, "auc")@y.values[[1]]
conf_matrix_rf_200 <- confusionMatrix(rf_200_pred, test_data$y, positive = "yes")
rf_results[2, "Accuracy"] <- conf_matrix_rf_200$overall["Accuracy"]
rf_results[2, "Precision"] <- conf_matrix_rf_200$byClass["Precision"]
rf_results[2, "Recall"] <- conf_matrix_rf_200$byClass["Recall"]
rf_results[2, "F1_Score"] <- conf_matrix_rf_200$byClass["F1"]
rf_results[2, "AUC_ROC"] <- auc_rf_200
print(rf_results)
## Experiment Accuracy Precision Recall F1_Score AUC_ROC
## 1 R.F:50 Trees 0.8949860 0.5737705 0.2640086 0.3616236 0.7600701
## 2 R.F:200 Trees 0.8959573 0.5819861 0.2715517 0.3703159 0.7656355
## 3 R.F:mtry = 6 NA NA NA NA NA
Results:
Conclusion: Increasing the number of trees improved ROC-AUC from 0.760 to 0.765, indicating reduced variance and better predictive performance.
Recommendation: Increasing trees is beneficial. Further tuning could optimize performance.
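One way to make that tuning systematic, rather than trying single values by hand as in the next experiment, is caret's built-in grid search over mtry; a sketch reusing the 5-fold trainControl object tc from the Decision Tree section (the grid values are illustrative, and refitting 200-tree forests several times is slow on ~33,000 rows):
# Sketch: cross-validated grid search over mtry (method = "rf" passes ntree through to randomForest)
set.seed(123)
rf_grid <- train(y ~ ., data = train_data,
                 method = "rf",
                 trControl = tc,
                 metric = "ROC",
                 tuneGrid = expand.grid(mtry = c(2, 4, 6, 8)),
                 ntree = 200)
print(rf_grid$bestTune) # mtry with the highest cross-validated ROC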
Experiment 3 (Tuning mtry, i.e., max features): mtry = 6. The hypothesis is that tuning the mtry parameter (which controls the number of features considered at each split) will change the model's predictive performance. The number of features tried at each split governs how correlated the individual trees are: fewer features decorrelate the trees and guard against overfitting, while more features let each tree rely on the strongest predictors.
Variation: Set mtry to 6 (the default for classification here is floor(sqrt(18)) = 4).
# Experiment 3 (Tuning mtry): mtry = 6
set.seed(123)
rf_mtry6 <- randomForest(y ~ ., data = train_data, ntree = 200, mtry = 6)
# Predictions
rf_mtry6_pred <- predict(rf_mtry6, test_data)
rf_mtry6_prob <- predict(rf_mtry6, test_data, type = "prob")
yes_prob_rf_mtry6 <- rf_mtry6_prob[, "yes"]
pred_rf_mtry6 <- prediction(yes_prob_rf_mtry6, test_y_numeric)
auc_rf_mtry6 <- performance(pred_rf_mtry6, "auc")@y.values[[1]]
conf_matrix_rf_mtry6 <- confusionMatrix(rf_mtry6_pred, test_data$y, positive = "yes")
# Store results in rf_results
rf_results[3, "Accuracy"] <- conf_matrix_rf_mtry6$overall["Accuracy"]
rf_results[3, "Precision"] <- conf_matrix_rf_mtry6$byClass["Precision"]
rf_results[3, "Recall"] <- conf_matrix_rf_mtry6$byClass["Recall"]
rf_results[3, "F1_Score"] <- conf_matrix_rf_mtry6$byClass["F1"]
rf_results[3, "AUC_ROC"] <- auc_rf_mtry6
print(rf_results)
## Experiment Accuracy Precision Recall F1_Score AUC_ROC
## 1 R.F:50 Trees 0.8949860 0.5737705 0.2640086 0.3616236 0.7600701
## 2 R.F:200 Trees 0.8959573 0.5819861 0.2715517 0.3703159 0.7656355
## 3 R.F:mtry = 6 0.8931650 0.5502092 0.2834052 0.3741110 0.7568994
Results:
Conclusion: Setting mtry = 6 slightly reduced performance, with ROC-AUC dropping to 0.7569 compared to 0.7656 for 200 trees at the default mtry, although recall rose slightly.
Recommendation: Using 200 trees with the default mtry provides the best balance of recall (0.2716), F1 score (0.3703), and ROC-AUC (0.7656).
radar_data_rf <- rbind(
rep(1, 5),
rep(0, 5),
rf_results[,-1]
)
rownames(radar_data_rf) <- c("Max", "Min", "50 Trees", "200 Trees", "mtry = 6")
# Radar Plot
radarchart(radar_data_rf,
axistype = 1,
pcol = c("blue", "green", "red"),
plwd = 2,
plty = 1,
cglcol = "grey",
cglty = 1,
axislabcol = "grey",
vlcex = 0.8,
title = "Random Forest Models Performance Comparison")
legend("topright", legend = c("50 Trees", "200 Trees", "mtry = 6"),
col = c("blue", "green", "red"), lty = 1, lwd = 2)
As we can see from the radar plot above Accuracy and AUC-ROC are high across all models, while Precision and Recall show trade-offs. The higher recall for some models suggests better identification of potential subscribers, while the slight drop in precision might indicate more false positives.
Conclusion Increasing the number of trees from 50 to 200 results in only a slight improvement across metrics. Recall increases from 0.2640 to 0.2716, F1 score from 0.3616 to 0.3703, and ROC-AUC from 0.7601 to 0.7656, showing minimal gains. Tuning mtry to 6 further improves recall to 0.2834, but at the cost of lower precision (0.5502) and a slight decrease in ROC-AUC (0.7569). This trade-off may be beneficial if recall is prioritized over precision.
Overall, all Random Forest models perform similarly, with only minor variations. The best choice depends on the emphasis placed on recall versus precision.
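If recall is the priority, one option worth sketching (not run here) is to rebalance what each tree sees during training via randomForest's strata and sampsize arguments; the per-class sample sizes below are illustrative, not tuned:
# Sketch: stratified per-tree sampling so each bootstrap sample is more balanced
# sampsize is given in the order of levels(train_data$y), i.e. "no" then "yes"
set.seed(123)
rf_balanced <- randomForest(y ~ ., data = train_data,
                            ntree = 200,
                            strata = train_data$y,
                            sampsize = c(4000, 3000))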
AdaBoost
# AdaBoost Metrics
adaboost_results <- data.frame(
Experiment = c("Default AdaBoost", "Ada: nu = 0.5, iter = 100", "Ada: Feature Selection & Scaling"),
Accuracy = NA,
Precision = NA,
Recall = NA,
F1_Score = NA,
AUC_ROC = NA
)
Experiment 1 (Baseline AdaBoost): Baseline with 50 iterations. We hypothesize that applying AdaBoost with 50 iterations can achieve a high level of accuracy, precision, and recall, with an AUC-ROC that reflects the model's ability to distinguish between the two classes (subscribed vs. not subscribed).
Variation: None (baseline model).
## Experiment 1 (Baseline AdaBoost) : Baseline with 50 iterations
# Train an Adaboost model
adaboost_model <- ada(y ~ ., data = train_data, iter = 50, nu = 1)
print(adaboost_model)
## Call:
## ada(y ~ ., data = train_data, iter = 50, nu = 1)
##
## Loss: exponential Method: discrete Iteration: 50
##
## Final Confusion Matrix for Data:
## Final Prediction
## True value no yes
## no 28802 437
## yes 2863 849
##
## Train Error: 0.1
##
## Out-Of-Bag Error: 0.099 iteration= 6
##
## Additional Estimates of number of iterations:
##
## train.err1 train.kap1
## 41 44
adaboost_pred <- predict(adaboost_model, test_data)
adaboost_prob <- predict(adaboost_model, test_data, type = "prob") # Probability predictions
yes_prob_ada_1 <- adaboost_prob[, 2]
pred_ada_1 <- prediction(yes_prob_ada_1, test_y_numeric)
auc_ada_1 <- performance(pred_ada_1, "auc")@y.values[[1]]
conf_matrix_ada <- confusionMatrix(adaboost_pred, test_data$y, positive = "yes")
adaboost_results[1, "Accuracy"] <- conf_matrix_ada$overall["Accuracy"]
adaboost_results[1, "Precision"] <- conf_matrix_ada$byClass["Precision"]
adaboost_results[1, "Recall"] <- conf_matrix_ada$byClass["Recall"]
adaboost_results[1, "F1_Score"] <- conf_matrix_ada$byClass["F1"]
adaboost_results[1, "AUC_ROC"] <- auc_ada_1
print(adaboost_results)
## Experiment Accuracy Precision Recall F1_Score
## 1 Default AdaBoost 0.9011776 0.6925676 0.2209052 0.3349673
## 2 Ada: nu = 0.5, iter = 100 NA NA NA NA
## 3 Ada: Feature Selection & Scaling NA NA NA NA
## AUC_ROC
## 1 0.7719639
## 2 NA
## 3 NA
Results:
Conclusion: The baseline performance is solid, with high accuracy but lower precision and recall. There’s room for improvement in model performance.
Recommendation: Tune the number of boosting iterations (iter) and the learning rate (nu) to enhance performance.
Experiment 2: Hyperparameter Tuning (nu and iter). The objective is to assess how changing the learning rate (nu) and the number of boosting iterations (iter) impacts the performance of the AdaBoost model. Specifically, we aim to explore whether tuning these hyperparameters improves the accuracy and F1-score compared to the baseline model (with default nu = 1 and iter = 50).
Variation: The learning rate nu is lowered to 0.5 and the number of iterations iter is raised to 100.
# Experiment 2: Hyperparameter tuning (nu and iter)
# Using nu = 0.5 and iter = 100
adaboost_model_1 <- ada(y ~ ., data = train_data, iter = 100, nu = 0.5)
adaboost_pred_1 <- predict(adaboost_model_1, test_data)
adaboost_prob_2 <- predict(adaboost_model_1, test_data, type = "prob")
yes_prob_ada_2 <- adaboost_prob_2[, 2]
pred_ada_2 <- prediction(yes_prob_ada_2, test_y_numeric)
auc_ada_2 <- performance(pred_ada_2, "auc")@y.values[[1]]
# Confusion Matrix
conf_matrix_ada_2 <- confusionMatrix(adaboost_pred_1, test_data$y, positive = "yes")
adaboost_results[2, "Accuracy"] <- conf_matrix_ada_2$overall["Accuracy"]
adaboost_results[2, "Precision"] <- conf_matrix_ada_2$byClass["Precision"]
adaboost_results[2, "Recall"] <- conf_matrix_ada_2$byClass["Recall"]
adaboost_results[2, "F1_Score"] <- conf_matrix_ada_2$byClass["F1"]
adaboost_results[2, "AUC_ROC"] <- auc_ada_2
print(adaboost_results)
## Experiment Accuracy Precision Recall F1_Score
## 1 Default AdaBoost 0.9011776 0.6925676 0.2209052 0.3349673
## 2 Ada: nu = 0.5, iter = 100 0.8995994 0.6391185 0.2500000 0.3594113
## 3 Ada: Feature Selection & Scaling NA NA NA NA
## AUC_ROC
## 1 0.7719639
## 2 0.7799235
## 3 NA
Results:
Conclusion: Hyperparameter Tuning slightly improved AUC-ROC and F1-Score, but accuracy stayed nearly the same. There’s still a trade-off between precision and recall.
Recommendation: The increase in iterations helps slightly, but further tuning of the learning rate could lead to better overall performance.
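A sketch of that further tuning: a small loop over candidate learning rates at iter = 100, scoring each fit by AUC (the nu values are illustrative; a proper comparison would use a validation split or cross-validation rather than the test set, and each fit is slow):
# Sketch: compare a few learning rates at iter = 100
nu_grid <- c(0.1, 0.3, 0.5)
auc_by_nu <- sapply(nu_grid, function(nu_val) {
  fit <- ada(y ~ ., data = train_data, iter = 100, nu = nu_val)
  probs <- predict(fit, test_data, type = "prob")[, 2]
  performance(prediction(probs, test_y_numeric), "auc")@y.values[[1]]
})
data.frame(nu = nu_grid, AUC = auc_by_nu)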
Experiment 3: Data Preprocessing (Normalization and Feature Selection). The objective is to evaluate whether applying data preprocessing techniques, namely normalization of continuous features and feature selection, improves the performance of the AdaBoost model compared to the previous experiments.
Variation: Continuous variables are centered and scaled, and the model is trained on the 10 features ranked most important by a Random Forest.
# Experiment 3: Data Preprocessing (Normalization and Feature Selection)
# Normalize the continuous variables
pre_process <- preProcess(train_data, method = c("center", "scale"))
train_data_normalized <- predict(pre_process, train_data)
test_data_normalized <- predict(pre_process, test_data)
# Feature Selection: Select the top 10 features based on importance
rf_model <- randomForest(y ~ ., data = train_data_normalized)
importance_scores <- importance(rf_model)
top_features <- names(sort(importance_scores[, 1], decreasing = TRUE))[1:10]
train_data_selected <- train_data_normalized[, c(top_features, "y")]
test_data_selected <- test_data_normalized[, c(top_features, "y")]
# Train the Adaboost model on the selected features
adaboost_model_3 <- ada(y ~ ., data = train_data_selected, iter = 50, nu = 1)
adaboost_pred_3 <- predict(adaboost_model_3, test_data_selected)
adaboost_prob_3 <- predict(adaboost_model_3, test_data_selected, type = "prob")
yes_prob_ada_3 <- adaboost_prob_3[, 2]
pred_ada_3 <- prediction(yes_prob_ada_3, test_y_numeric)
auc_ada_3 <- performance(pred_ada_3, "auc")@y.values[[1]]
conf_matrix_ada_3 <- confusionMatrix(adaboost_pred_3, test_data_selected$y, positive = "yes")
# Store results in adaboost_results (Row 3)
adaboost_results[3, "Accuracy"] <- conf_matrix_ada_3$overall["Accuracy"]
adaboost_results[3, "Precision"] <- conf_matrix_ada_3$byClass["Precision"]
adaboost_results[3, "Recall"] <- conf_matrix_ada_3$byClass["Recall"]
adaboost_results[3, "F1_Score"] <- conf_matrix_ada_3$byClass["F1"]
adaboost_results[3, "AUC_ROC"] <- auc_ada_3
print(adaboost_results)
## Experiment Accuracy Precision Recall F1_Score
## 1 Default AdaBoost 0.9011776 0.6925676 0.2209052 0.3349673
## 2 Ada: nu = 0.5, iter = 100 0.8995994 0.6391185 0.2500000 0.3594113
## 3 Ada: Feature Selection & Scaling 0.8985067 0.6729323 0.1928879 0.2998325
## AUC_ROC
## 1 0.7719639
## 2 0.7799235
## 3 0.7717384
Results:
Conclusion: Feature Selection & Scaling resulted in higher precision (0.6729) than the hyperparameter-tuned model, although still below the baseline (0.6926). However, recall dropped to 0.1929, indicating that the model is missing more true positives. AUC-ROC also decreased slightly compared to the other experiments, suggesting that some removed features may have contained important predictive information.
Recommendation: Further testing could involve different normalization methods or keeping more than 10 features to balance recall and precision. If the goal is high precision, this configuration is still a reasonable choice, as it keeps false positives relatively low.
adaboost_results_long <- melt(adaboost_results, id.vars = "Experiment")
ggplot(adaboost_results_long, aes(x = Experiment, y = value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "AdaBoost Performance Comparison", y = "Metric Value", x = "Experiment") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Hyperparameter tuning (nu = 0.5, iter = 100) provides a slight improvement in recall, F1 score, and AUC-ROC, making it the best-performing configuration based on these metrics. In summary, hyperparameter tuning (nu = 0.5, iter = 100) appears to be the most beneficial change to the AdaBoost model, offering a small but meaningful performance boost over the default model.
# Combine the results into one DataFrame
all_results <- rbind(dt_results, rf_results, adaboost_results)
kable(all_results, caption = "Model Performance Comparison", digits = 3, format = "markdown")
| Experiment | Accuracy | Precision | Recall | F1_Score | AUC_ROC |
|---|---|---|---|---|---|
| Default Decision Tree | 0.897 | 0.576 | 0.339 | 0.427 | 0.739 |
| D.T:Max Depth = 5 | 0.901 | 0.746 | 0.180 | 0.290 | 0.699 |
| D.T:Pruned Tree | 0.902 | 0.689 | 0.239 | 0.355 | 0.739 |
| R.F:50 Trees | 0.895 | 0.574 | 0.264 | 0.362 | 0.760 |
| R.F:200 Trees | 0.896 | 0.582 | 0.272 | 0.370 | 0.766 |
| R.F:mtry = 6 | 0.893 | 0.550 | 0.283 | 0.374 | 0.757 |
| Default AdaBoost | 0.901 | 0.693 | 0.221 | 0.335 | 0.772 |
| Ada: nu = 0.5, iter = 100 | 0.900 | 0.639 | 0.250 | 0.359 | 0.780 |
| Ada: Feature Selection & Scaling | 0.899 | 0.673 | 0.193 | 0.300 | 0.772 |
# Highlighting the best result in each metric
highlighted_results <- formattable(
all_results,
list(
Accuracy = formatter("span",
style = function(x) ifelse(x == max(all_results$Accuracy),
style(color = "green", font.weight = "bold"),
NA)),
Precision = formatter("span",
style = function(x) ifelse(x == max(all_results$Precision),
style(color = "blue", font.weight = "bold"),
NA)),
Recall = formatter("span",
style = function(x) ifelse(x == max(all_results$Recall),
style(color = "red", font.weight = "bold"),
NA)),
F1_Score = formatter("span",
style = function(x) ifelse(x == max(all_results$F1_Score),
style(color = "orange", font.weight = "bold"),
NA)),
AUC_ROC = formatter("span",
style = function(x) ifelse(x == max(all_results$AUC_ROC),
style(color = "purple", font.weight = "bold"),
NA))
)
)
highlighted_results
| Experiment | Accuracy | Precision | Recall | F1_Score | AUC_ROC |
|---|---|---|---|---|---|
| Default Decision Tree | 0.8974141 | 0.5758684 | 0.3394397 | 0.4271186 | 0.7390987 |
| D.T:Max Depth = 5 | 0.9006920 | 0.7455357 | 0.1799569 | 0.2899306 | 0.6989344 |
| D.T:Pruned Tree | 0.9021488 | 0.6894410 | 0.2392241 | 0.3552000 | 0.7390662 |
| R.F:50 Trees | 0.8949860 | 0.5737705 | 0.2640086 | 0.3616236 | 0.7600701 |
| R.F:200 Trees | 0.8959573 | 0.5819861 | 0.2715517 | 0.3703159 | 0.7656355 |
| R.F:mtry = 6 | 0.8931650 | 0.5502092 | 0.2834052 | 0.3741110 | 0.7568994 |
| Default AdaBoost | 0.9011776 | 0.6925676 | 0.2209052 | 0.3349673 | 0.7719639 |
| Ada: nu = 0.5, iter = 100 | 0.8995994 | 0.6391185 | 0.2500000 | 0.3594113 | 0.7799235 |
| Ada: Feature Selection & Scaling | 0.8985067 | 0.6729323 | 0.1928879 | 0.2998325 | 0.7717384 |
In conclusion, we observed that constraining the decision tree improved precision (0.7455 with max depth = 5, 0.6894 with pruning) but reduced recall (0.1800 and 0.2392, respectively), while the unconstrained baseline tree had the best recall and F1 among the trees at a higher risk of overfitting. In the random forest experiments, increasing the number of trees improved recall to 0.2716 and ROC-AUC to 0.7656, showing better generalization. AdaBoost performed best with 100 iterations and a learning rate of 0.5, achieving the highest ROC-AUC of 0.7799 and a reasonable balance between precision and recall. AdaBoost is therefore our optimal model, as it provided the best overall performance. Further tuning and resampling techniques could improve recall without compromising precision. For business decisions, choosing between precision-focused and recall-focused models depends on marketing priorities.
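As a pointer for the resampling idea mentioned above, a minimal sketch using caret's downSample to balance the training classes before refitting the chosen AdaBoost configuration (sketched only; the refit is not evaluated here):
# Sketch: down-sample the majority "no" class, then refit AdaBoost with nu = 0.5, iter = 100
set.seed(123)
train_down <- downSample(x = train_data[, setdiff(names(train_data), "y")],
                         y = train_data$y, yname = "y")
table(train_down$y) # equal counts of "no" and "yes"
ada_down <- ada(y ~ ., data = train_down, iter = 100, nu = 0.5)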