In machine learning, experimentation refers to the systematic process of designing, executing, and analyzing different configurations to identify the settings that perform best on a given task. Experimentation is learning by doing: it involves changing parameters in a controlled way, evaluating results with metrics, and comparing approaches to find the best solution. In short, it is the practice of testing and refining machine learning models through controlled experiments to improve their performance.
The key is to modify only one or a few variables at a time, so that the impact of each change on model performance can be isolated. In this assignment you will conduct at least six experiments. In practice, data scientists run anywhere from a dozen to hundreds of experiments, depending on the dataset and problem domain.
library(tidyverse)
library(openintro)
library(reshape2)
library(infer)
library(dplyr)
library(knitr)
library(corrplot)
library(ggcorrplot)
library(ggthemes)
library(caret)
library(kableExtra)
library(ROSE)
library(rpart)
library(rpart.plot)
library(randomForest)
library(adabag)
library(pROC)
This assignment consists of conducting at least two (2) experiments for each of three algorithms: Decision Trees, Random Forest, and AdaBoost. That is, at least six (6) experiments in total (3 algorithms x 2 experiments each). For each experiment you will define what you are trying to achieve (before each run), conduct the experiment, and at the end review how the experiment went. These experiments will allow you to compare algorithms and choose the best model.
This report uses the dataset and EDA from the previous assignment.
# Load dataset
dataBank <- read.csv("C:/Users/vitug/OneDrive/Desktop/CUNY Masters/DATA_622/bank data.csv", stringsAsFactors = TRUE)
kable(head(dataBank, 10), caption = "Bank Dataset")
| X | age | job | marital | education | default | balance | housing | loan | contact | day | month | campaign | previous | term | age_group | credit_risk | Subscription |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 1 | 0 | no | Senior | Medium Risk | no |
| 2 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 1 | 0 | no | Middle-aged | Medium Risk | no |
| 3 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 1 | 0 | no | Middle-aged | High Risk | no |
| 4 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 1 | 0 | no | Middle-aged | Medium Risk | no |
| 5 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 1 | 0 | no | Middle-aged | Medium Risk | no |
| 6 | 35 | management | married | tertiary | no | 231 | yes | no | unknown | 5 | may | 1 | 0 | no | Middle-aged | Medium Risk | no |
| 7 | 28 | management | single | tertiary | no | 447 | yes | yes | unknown | 5 | may | 1 | 0 | no | Middle-aged | High Risk | no |
| 8 | 42 | entrepreneur | divorced | tertiary | yes | 2 | yes | no | unknown | 5 | may | 1 | 0 | no | Middle-aged | Medium Risk | no |
| 9 | 58 | retired | married | primary | no | 121 | yes | no | unknown | 5 | may | 1 | 0 | no | Senior | Medium Risk | no |
| 10 | 43 | technician | single | secondary | no | 593 | yes | no | unknown | 5 | may | 1 | 0 | no | Middle-aged | Medium Risk | no |
# Check target variable distribution
table(dataBank$term, dataBank$Subscription)
##
## no yes
## no 39922 0
## yes 0 5289
# Check for missing values
sum(is.na(dataBank))
## [1] 0
In the previous assignment we performed some EDA on this dataset. Along the way an extra index column (X) was added, so I will remove it. I will also remove the “Subscription” column, which duplicates the “term” target variable, to prevent data leakage.
#remove unnecessary columns
dataBank$X <- NULL
# Find relationship between 'term' and 'Subscription', remove if necessary.
if(all(dataBank$term == dataBank$Subscription) ||
cor(as.numeric(dataBank$term), as.numeric(dataBank$Subscription)) > 0.9) {
dataBank$Subscription <- NULL
print("Removed Subscription feature due to high correlation with target variable")
}
## [1] "Removed Subscription feature due to high correlation with target variable"
# Create train/test split (70/30)
set.seed(123)
trainIndex <- createDataPartition(dataBank$term, p = 0.7, list = FALSE)
trainData <- dataBank[trainIndex, ]
testData <- dataBank[-trainIndex, ]
# check the balance between classes
prop.table(table(trainData$term))
##
## no yes
## 0.8829979 0.1170021
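With roughly 88% of the rows in the "no" class, accuracy alone can look good for a model that has learned nothing. A minimal sketch of that reference point (the no-information rate), using the class counts from the target distribution table above; the variable names are illustrative only:

```r
# Class counts taken from the table(dataBank$term, ...) output above
class_counts <- c(no = 39922, yes = 5289)

# Accuracy of a model that always predicts the majority class
majority_class <- names(which.max(class_counts))
no_information_rate <- max(class_counts) / sum(class_counts)

print(majority_class)
print(round(no_information_rate, 4))
```

Any model in the experiments has to be judged against this rate, which is why F1 and sensitivity are tracked alongside accuracy.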
The experiments use the following algorithms, each fit first as a baseline:
# Decision Tree Baseline
dt_baseline <- rpart(term ~ ., data = trainData, method = "class")
dt_pred <- predict(dt_baseline, testData, type = "class")
dt_cm <- confusionMatrix(dt_pred, testData$term, positive = "yes")
dt_roc <- roc(testData$term, predict(dt_baseline, testData, type = "prob")[, "yes"])
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
# Random Forest Baseline
rf_baseline <- randomForest(term ~ ., data = trainData)
rf_pred <- predict(rf_baseline, testData)
rf_cm <- confusionMatrix(rf_pred, testData$term, positive = "yes")
rf_roc <- roc(testData$term, predict(rf_baseline, testData, type = "prob")[, "yes"])
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
# AdaBoost Baseline
ada_baseline <- boosting(term ~ ., data = trainData, mfinal = 50)
ada_pred <- predict(ada_baseline, testData)
ada_cm <- confusionMatrix(as.factor(ada_pred$class), testData$term, positive = "yes")
# Store baseline results
baseline_results <- data.frame(
Algorithm = c("Decision Tree", "Random Forest", "AdaBoost"),
Accuracy = c(dt_cm$overall["Accuracy"], rf_cm$overall["Accuracy"], ada_cm$overall["Accuracy"]),
F1_Score = c(dt_cm$byClass["F1"], rf_cm$byClass["F1"], ada_cm$byClass["F1"]),
Sensitivity = c(dt_cm$byClass["Sensitivity"], rf_cm$byClass["Sensitivity"], ada_cm$byClass["Sensitivity"]),
Specificity = c(dt_cm$byClass["Specificity"], rf_cm$byClass["Specificity"], ada_cm$byClass["Specificity"])
)
print(baseline_results)
## Algorithm Accuracy F1_Score Sensitivity Specificity
## 1 Decision Tree 0.8830556 NA 0.0000000 1.0000000
## 2 Random Forest 0.8868161 0.2817033 0.1897856 0.9791249
## 3 AdaBoost 0.8829819 0.2886598 0.2030265 0.9730294
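Based on the table above, the Random Forest model has the highest accuracy (0.887, or 88.7%). AdaBoost has the best F1 score (0.289) and the best sensitivity (0.203). The Decision Tree baseline predicts "no" for every test case (sensitivity 0, specificity 1), so its F1 score is undefined (NA) and its accuracy of 0.883 only reflects the share of the majority class. Among the models that predict any positives, Random Forest has the best specificity (0.979).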
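The NA in the F1 column follows from how F1 is defined: it is the harmonic mean of precision and recall, and precision is 0/0 when a model never predicts the positive class. The sketch below works this through in base R; the function name and the counts in the second call are hypothetical and are not results from this report.

```r
# F1 from raw confusion-matrix counts; returns NA when it is undefined
f1_from_counts <- function(tp, fp, fn) {
  if (tp + fp == 0 || tp + fn == 0) return(NA_real_)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  if (precision + recall == 0) return(NA_real_)
  2 * precision * recall / (precision + recall)
}

# A model that never predicts "yes": no true positives, no false positives
f1_all_no <- f1_from_counts(tp = 0, fp = 0, fn = 100)

# Hypothetical counts: precision = 0.6, recall = 0.3
f1_example <- f1_from_counts(tp = 30, fp = 20, fn = 70)

print(f1_all_no)
print(f1_example)
```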
For each of the algorithms above, perform at least two (2) experiments. In a typical experiment you should:
- Define the objective of the experiment (hypothesis)
- Decide what will change, and what will stay the same
- Select the evaluation metric (what you want to measure)
- Perform the experiment
- Document the experiment so you can compare results (track progress)
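One lightweight way to document experiments is a small log with one row per run. This is only a sketch: `log_experiment` and its column names are hypothetical, and the metric values are left as NA to be filled in after each run.

```r
# Hypothetical helper: append one experiment record to a log data frame
log_experiment <- function(log, id, objective, changed, metric, value) {
  rbind(log, data.frame(
    id = id, objective = objective, changed = changed,
    metric = metric, value = value,
    stringsAsFactors = FALSE
  ))
}

experiment_log <- NULL
experiment_log <- log_experiment(experiment_log, "dt_exp1",
                                 "Tune cp to improve performance", "cp",
                                 "AUC", NA_real_)
experiment_log <- log_experiment(experiment_log, "rf_exp1",
                                 "Tune mtry to improve performance", "mtry",
                                 "AUC", NA_real_)
print(experiment_log)
```

Recording what changed next to the metric makes it easier to check that each run varied only one thing.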
In the first Decision Tree experiment, I will tune the complexity parameter (cp) to improve model performance.
# Experiment 1:
dt_exp1 <- function() {
# Set up cross-validation
ctrl <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
# Define parameter grid
grid <- expand.grid(cp = seq(0.001, 0.05, by = 0.005))
# Train with parameter tuning
dt_tuned <- train(
term ~ .,
data = trainData,
method = "rpart",
trControl = ctrl,
tuneGrid = grid,
metric = "ROC"
)
# Best parameter
best_cp <- dt_tuned$bestTune$cp
print(paste("Best cp value:", best_cp))
# Train final model with best parameter
dt_final <- rpart(term ~ ., data = trainData, method = "class", cp = best_cp)
# Evaluate
dt_pred <- predict(dt_final, testData, type = "class")
dt_cm <- confusionMatrix(dt_pred, testData$term, positive = "yes")
dt_roc <- roc(testData$term, predict(dt_final, testData, type = "prob")[, "yes"])
# Print results
print("Decision Tree - Experiment 1 (CP Tuning) Results:")
print(dt_cm)
print(paste("AUC:", auc(dt_roc)))
# Return model and metrics
return(list(
model = dt_final,
confusion_matrix = dt_cm,
roc = dt_roc,
auc = auc(dt_roc),
best_param = best_cp
))
}
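The effect of cp can also be inspected directly in rpart's cp table, which lists one candidate subtree per row with its cross-validated error. The sketch below uses the `kyphosis` data bundled with rpart as a stand-in, not the bank data, and `minsplit = 5` is an arbitrary choice made so that the toy tree grows several splits.

```r
library(rpart)

set.seed(123)  # the cross-validated errors depend on the random folds

# Grow a deliberately large tree on rpart's bundled kyphosis data
full_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class", cp = 0, minsplit = 5)

# Each row of the cp table is one candidate subtree
cp_table <- full_tree$cptable
best_row <- which.min(cp_table[, "xerror"])
best_cp <- cp_table[best_row, "CP"]

# Prune back to the subtree with the lowest cross-validated error
pruned_tree <- prune(full_tree, cp = best_cp)

print(cp_table)
print(best_cp)
```

A larger cp prunes more aggressively, so a very small selected value such as 0.001 corresponds to keeping a relatively large tree.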
In the second Decision Tree experiment, I will test whether using only the most important features improves model generalization.
# Experiment 2:
dt_exp2 <- function() {
# Train initial model
dt_init <- rpart(term ~ ., data = trainData, method = "class")
# Check if variable importance exists (it is NULL when the tree has no splits)
if (is.null(dt_init$variable.importance) || length(dt_init$variable.importance) == 0) {
print("No variable importance found. Using all features.")
return(dt_exp1()) # Fall back to experiment 1
}
# Get variable importance
importance <- dt_init$variable.importance
# Determine how many features to select (min of 5 or what's available)
n_features <- min(5, length(importance))
# Make sure we have at least one feature
if (n_features < 1) {
print("Not enough important features found. Using all features.")
return(dt_exp1()) # Fall back to experiment 1
}
# Select top important features
top_features <- names(importance)[1:n_features]
print("Top features for Decision Tree:")
print(top_features)
# Create formula with only important features
formula_str <- paste("term ~", paste(top_features, collapse = " + "))
print(paste("Formula:", formula_str))
formula <- as.formula(formula_str)
# Train model with selected features
dt_features <- rpart(formula, data = trainData, method = "class")
# Evaluate
dt_pred <- predict(dt_features, testData, type = "class")
dt_cm <- confusionMatrix(dt_pred, testData$term, positive = "yes")
dt_roc <- roc(testData$term, predict(dt_features, testData, type = "prob")[, "yes"])
# Print results
print("Decision Tree - Experiment 2 (Feature Selection) Results:")
print(dt_cm)
print(paste("AUC:", auc(dt_roc)))
# Return model and metrics
return(list(
model = dt_features,
confusion_matrix = dt_cm,
roc = dt_roc,
auc = auc(dt_roc),
features = top_features
))
}
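The feature selection step in `dt_exp2` can be sketched on its own. The importance scores below are hypothetical placeholders (in the function they would come from `dt_init$variable.importance`), and `select_top_features` is an illustrative helper. It sorts explicitly rather than relying on the order in which scores are stored.

```r
# Hypothetical importance scores keyed by column name
importance <- c(balance = 12.5, age = 30.1, month = 48.9, day = 7.2, contact = 21.0)

# Return the names of the k highest-scoring features
select_top_features <- function(importance, k = 5) {
  if (length(importance) == 0) return(character(0))
  ranked <- sort(importance, decreasing = TRUE)
  names(ranked)[seq_len(min(k, length(ranked)))]
}

top_features <- select_top_features(importance, k = 3)

# Build the model formula from the selected names
model_formula <- reformulate(top_features, response = "term")

print(top_features)
print(model_formula)
```

Returning an empty character vector when no importance is available lets the caller decide explicitly what to do, instead of silently running a different experiment.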
# Run experiments
dt_result1 <- dt_exp1()
## [1] "Best cp value: 0.001"
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
## [1] "Decision Tree - Experiment 1 (CP Tuning) Results:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 11767 1388
## yes 209 198
##
## Accuracy : 0.8822
## 95% CI : (0.8767, 0.8876)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 0.6219
##
## Kappa : 0.1585
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.12484
## Specificity : 0.98255
## Pos Pred Value : 0.48649
## Neg Pred Value : 0.89449
## Prevalence : 0.11694
## Detection Rate : 0.01460
## Detection Prevalence : 0.03001
## Balanced Accuracy : 0.55370
##
## 'Positive' Class : yes
##
## [1] "AUC: 0.651224738253304"
dt_result2 <- dt_exp2()
## [1] "No variable importance found. Using all features."
## [1] "Best cp value: 0.001"
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
## [1] "Decision Tree - Experiment 1 (CP Tuning) Results:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 11767 1388
## yes 209 198
##
## Accuracy : 0.8822
## 95% CI : (0.8767, 0.8876)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 0.6219
##
## Kappa : 0.1585
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.12484
## Specificity : 0.98255
## Pos Pred Value : 0.48649
## Neg Pred Value : 0.89449
## Prevalence : 0.11694
## Detection Rate : 0.01460
## Detection Prevalence : 0.03001
## Balanced Accuracy : 0.55370
##
## 'Positive' Class : yes
##
## [1] "AUC: 0.651224738253304"
Note that the second experiment did not actually perform feature selection: the default tree has no variable importance (it makes no splits and predicts “no” for every case), so the function fell back to Experiment 1 and the two sets of results are identical.
In the first Random Forest experiment, I will tune the “mtry” parameter, expecting that optimizing the number of variables considered at each split will improve performance.
# Experiment 1:
rf_exp1 <- function() {
# Set up cross-validation
ctrl <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
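# Define parameter grid - try a range of mtry values
# Note: the square root and the division are not rounded, so the grid can
# contain non-integer values (e.g. 16 / 3)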
mtry_values <- c(2, sqrt(ncol(trainData) - 1), ncol(trainData)/3)
grid <- expand.grid(mtry = mtry_values)
# Train with parameter tuning
rf_tuned <- train(
term ~ .,
data = trainData,
method = "rf",
trControl = ctrl,
tuneGrid = grid,
metric = "ROC",
ntree = 200
)
# Best parameter
best_mtry <- rf_tuned$bestTune$mtry
print(paste("Best mtry value:", best_mtry))
# Train final model with best parameter
rf_final <- randomForest(term ~ ., data = trainData, mtry = best_mtry, ntree = 200)
# Evaluate
rf_pred <- predict(rf_final, testData)
rf_cm <- confusionMatrix(rf_pred, testData$term, positive = "yes")
rf_roc <- roc(testData$term, predict(rf_final, testData, type = "prob")[, "yes"])
# Print results
print("Random Forest - Experiment 1 (mtry Tuning) Results:")
print(rf_cm)
print(paste("AUC:", auc(rf_roc)))
# Return model and metrics
return(list(
model = rf_final,
confusion_matrix = rf_cm,
roc = rf_roc,
auc = auc(rf_roc),
best_param = best_mtry
))
}
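Since mtry is a count of variables, the candidate grid is easier to read when it holds whole numbers. A minimal sketch, assuming 15 predictors (the 16 columns left in `trainData` minus the target `term`). With caret's formula interface, factor columns are expanded into dummy variables, so the effective predictor count inside `train()` may be larger than this.

```r
# Number of predictors: 16 columns in trainData minus the target `term`
n_predictors <- 15

# Candidate mtry values, floored to whole numbers and de-duplicated
candidates <- c(2, sqrt(n_predictors), n_predictors / 3)
mtry_grid <- sort(unique(pmax(1, floor(candidates))))

print(mtry_grid)
```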
In the second Random Forest experiment, I will increase the number of trees to improve model stability and accuracy.
# Experiment 2:
rf_exp2 <- function() {
# Try different numbers of trees
ntree_values <- c(50, 100, 200, 300, 500)
results <- data.frame(ntree = integer(), accuracy = numeric(), f1 = numeric(), auc = numeric())
for (n in ntree_values) {
# Train model
rf_model <- randomForest(term ~ ., data = trainData, ntree = n)
# Evaluate
rf_pred <- predict(rf_model, testData)
rf_cm <- confusionMatrix(rf_pred, testData$term, positive = "yes")
rf_roc <- roc(testData$term, predict(rf_model, testData, type = "prob")[, "yes"])
# Store results
results <- rbind(results, data.frame(
ntree = n,
accuracy = rf_cm$overall["Accuracy"],
f1 = rf_cm$byClass["F1"],
auc = auc(rf_roc)
))
}
# Find best ntree value based on F1 score
best_ntree <- results$ntree[which.max(results$f1)]
print(paste("Best ntree value:", best_ntree))
# Train final model with best parameter
rf_final <- randomForest(term ~ ., data = trainData, ntree = best_ntree)
# Evaluate
rf_pred <- predict(rf_final, testData)
rf_cm <- confusionMatrix(rf_pred, testData$term, positive = "yes")
rf_roc <- roc(testData$term, predict(rf_final, testData, type = "prob")[, "yes"])
# Print results
print("Random Forest - Experiment 2 (ntree Tuning) Results:")
print(rf_cm)
print(paste("AUC:", auc(rf_roc)))
# Return model and metrics
return(list(
model = rf_final,
confusion_matrix = rf_cm,
roc = rf_roc,
auc = auc(rf_roc),
ntree_results = results,
best_param = best_ntree
))
}
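One caveat: `rf_exp2` picks ntree by F1 on `testData`, so the test set is used both to select and to report. A common alternative is to carve a validation split out of the training data and select on that. The sketch below shows a stratified split in base R; `stratified_split` and the toy labels are hypothetical, and in this report the same job could be done with `createDataPartition` on `trainData$term`.

```r
# Stratified split of row indices: keeps class proportions in both parts
stratified_split <- function(labels, p = 0.8) {
  idx_by_class <- split(seq_along(labels), labels)
  fit_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = floor(p * length(idx)))]
  }), use.names = FALSE)
  list(fit = sort(fit_idx),
       validation = setdiff(seq_along(labels), fit_idx))
}

set.seed(123)
# Toy labels with roughly the class balance seen in trainData
labels <- factor(c(rep("no", 880), rep("yes", 120)))
parts <- stratified_split(labels, p = 0.8)

print(table(labels[parts$fit]))
print(table(labels[parts$validation]))
```

Candidate values would then be compared on the validation rows, with the test set touched only once for the final model.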
# Run experiments
rf_result1 <- rf_exp1()
## [1] "Best mtry value: 5.33333333333333"
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
## [1] "Random Forest - Experiment 1 (mtry Tuning) Results:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 11653 1221
## yes 323 365
##
## Accuracy : 0.8862
## 95% CI : (0.8807, 0.8915)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 0.1336
##
## Kappa : 0.2693
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.23014
## Specificity : 0.97303
## Pos Pred Value : 0.53052
## Neg Pred Value : 0.90516
## Prevalence : 0.11694
## Detection Rate : 0.02691
## Detection Prevalence : 0.05073
## Balanced Accuracy : 0.60158
##
## 'Positive' Class : yes
##
## [1] "AUC: 0.773992789066995"
rf_result2 <- rf_exp2()
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
## [1] "Best ntree value: 300"
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
## [1] "Random Forest - Experiment 2 (ntree Tuning) Results:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 11731 1274
## yes 245 312
##
## Accuracy : 0.888
## 95% CI : (0.8826, 0.8933)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 0.03717
##
## Kappa : 0.2453
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.19672
## Specificity : 0.97954
## Pos Pred Value : 0.56014
## Neg Pred Value : 0.90204
## Prevalence : 0.11694
## Detection Rate : 0.02301
## Detection Prevalence : 0.04107
## Balanced Accuracy : 0.58813
##
## 'Positive' Class : yes
##
## [1] "AUC: 0.776913036876612"
In the first AdaBoost experiment, I will look for the number of boosting iterations (mfinal) that gives the best model performance.
# Experiment 1:
ada_exp1 <- function() {
# Try different numbers of iterations
mfinal_values <- c(10, 30, 50, 100, 150)
results <- data.frame(mfinal = integer(), accuracy = numeric(), f1 = numeric())
for (m in mfinal_values) {
# Train model
ada_model <- boosting(term ~ ., data = trainData, mfinal = m)
# Evaluate
ada_pred <- predict(ada_model, testData)
ada_cm <- confusionMatrix(as.factor(ada_pred$class), testData$term, positive = "yes")
# Store results
results <- rbind(results, data.frame(
mfinal = m,
accuracy = ada_cm$overall["Accuracy"],
f1 = ada_cm$byClass["F1"]
))
}
# Find best mfinal value based on F1 score
best_mfinal <- results$mfinal[which.max(results$f1)]
print(paste("Best mfinal value:", best_mfinal))
# Train final model with best parameter
ada_final <- boosting(term ~ ., data = trainData, mfinal = best_mfinal)
# Evaluate
ada_pred <- predict(ada_final, testData)
ada_cm <- confusionMatrix(as.factor(ada_pred$class), testData$term, positive = "yes")
# Print results
print("AdaBoost - Experiment 1 (mfinal Tuning) Results:")
print(ada_cm)
# Return model and metrics
return(list(
model = ada_final,
confusion_matrix = ada_cm,
mfinal_results = results,
best_param = best_mfinal
))
}
In the second AdaBoost experiment, I will adjust the weights for the minority class to improve its classification.
# Experiment 2:
ada_exp2 <- function() {
# Calculate class weights inversely proportional to class frequencies
class_weights <- 1 / table(trainData$term)
class_weights <- class_weights / sum(class_weights)
# Create weighted version of training data
# We'll create a weight vector for boosting
weights <- ifelse(trainData$term == "yes",
class_weights["yes"],
class_weights["no"])
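# Train with adjusted weights
# Caution: boosting() forwards `control` to rpart as tree-control options
# (see rpart.control), so a `weights` entry here may not be applied as
# observation weights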
ada_weighted <- boosting(term ~ ., data = trainData, mfinal = 50, control = list(weights = weights))
# Evaluate
ada_pred <- predict(ada_weighted, testData)
ada_cm <- confusionMatrix(as.factor(ada_pred$class), testData$term, positive = "yes")
# Print results
print("AdaBoost - Experiment 2 (Class Weighting) Results:")
print(ada_cm)
print(paste("Class weights used - yes:", class_weights["yes"], "no:", class_weights["no"]))
# Return model and metrics
return(list(
model = ada_weighted,
confusion_matrix = ada_cm,
weights = class_weights
))
}
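If class weights cannot be passed to the learner, another way to give the minority class more influence is to resample the training data. The sketch below shows random oversampling in base R on a toy data frame; `oversample_minority` is a hypothetical helper, not part of the experiments above. It should be applied to the training rows only, never to the test set.

```r
# Randomly duplicate minority-class rows until both classes have equal counts
oversample_minority <- function(data, target) {
  counts <- table(data[[target]])
  minority <- names(which.min(counts))
  n_extra <- max(counts) - min(counts)
  minority_rows <- which(data[[target]] == minority)
  extra <- minority_rows[sample.int(length(minority_rows), n_extra, replace = TRUE)]
  data[c(seq_len(nrow(data)), extra), , drop = FALSE]
}

set.seed(123)
# Toy data: 90 "no" rows and 10 "yes" rows
toy <- data.frame(
  balance = seq_len(100),
  term = factor(c(rep("no", 90), rep("yes", 10)))
)
balanced <- oversample_minority(toy, "term")

print(table(balanced$term))
```

Duplicating rows can encourage overfitting to the repeated cases, so the effect is best checked on held-out data.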
# Run experiments
ada_result1 <- ada_exp1()
## [1] "Best mfinal value: 50"
## [1] "AdaBoost - Experiment 1 (mfinal Tuning) Results:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 11683 1280
## yes 293 306
##
## Accuracy : 0.884
## 95% CI : (0.8785, 0.8894)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 0.3703
##
## Kappa : 0.2308
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.19294
## Specificity : 0.97553
## Pos Pred Value : 0.51085
## Neg Pred Value : 0.90126
## Prevalence : 0.11694
## Detection Rate : 0.02256
## Detection Prevalence : 0.04417
## Balanced Accuracy : 0.58424
##
## 'Positive' Class : yes
##
ada_result2 <- ada_exp2()
## [1] "AdaBoost - Experiment 2 (Class Weighting) Results:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 11684 1285
## yes 292 301
##
## Accuracy : 0.8837
## 95% CI : (0.8782, 0.8891)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 0.4114
##
## Kappa : 0.2271
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.18979
## Specificity : 0.97562
## Pos Pred Value : 0.50759
## Neg Pred Value : 0.90092
## Prevalence : 0.11694
## Detection Rate : 0.02219
## Detection Prevalence : 0.04373
## Balanced Accuracy : 0.58270
##
## 'Positive' Class : yes
##
## [1] "Class weights used - yes: 0.882997883029479 no: 0.11700211697052"
Now that all experiments have been run for the three algorithms, I will build tables and visualizations to compare the experiments across models and to find the best model for this project.
# Compile all results
all_results <- data.frame(
Algorithm = c(
"Decision Tree (Baseline)", "Decision Tree (CP Tuning)", "Decision Tree (Feature Selection)",
"Random Forest (Baseline)", "Random Forest (mtry Tuning)", "Random Forest (ntree Tuning)",
"AdaBoost (Baseline)", "AdaBoost (mfinal Tuning)", "AdaBoost (Class Weighting)"
),
Accuracy = c(
dt_cm$overall["Accuracy"], dt_result1$confusion_matrix$overall["Accuracy"], dt_result2$confusion_matrix$overall["Accuracy"],
rf_cm$overall["Accuracy"], rf_result1$confusion_matrix$overall["Accuracy"], rf_result2$confusion_matrix$overall["Accuracy"],
ada_cm$overall["Accuracy"], ada_result1$confusion_matrix$overall["Accuracy"], ada_result2$confusion_matrix$overall["Accuracy"]
),
F1_Score = c(
dt_cm$byClass["F1"], dt_result1$confusion_matrix$byClass["F1"], dt_result2$confusion_matrix$byClass["F1"],
rf_cm$byClass["F1"], rf_result1$confusion_matrix$byClass["F1"], rf_result2$confusion_matrix$byClass["F1"],
ada_cm$byClass["F1"], ada_result1$confusion_matrix$byClass["F1"], ada_result2$confusion_matrix$byClass["F1"]
),
Sensitivity = c(
dt_cm$byClass["Sensitivity"], dt_result1$confusion_matrix$byClass["Sensitivity"], dt_result2$confusion_matrix$byClass["Sensitivity"],
rf_cm$byClass["Sensitivity"], rf_result1$confusion_matrix$byClass["Sensitivity"], rf_result2$confusion_matrix$byClass["Sensitivity"],
ada_cm$byClass["Sensitivity"], ada_result1$confusion_matrix$byClass["Sensitivity"], ada_result2$confusion_matrix$byClass["Sensitivity"]
),
Specificity = c(
dt_cm$byClass["Specificity"], dt_result1$confusion_matrix$byClass["Specificity"], dt_result2$confusion_matrix$byClass["Specificity"],
rf_cm$byClass["Specificity"], rf_result1$confusion_matrix$byClass["Specificity"], rf_result2$confusion_matrix$byClass["Specificity"],
ada_cm$byClass["Specificity"], ada_result1$confusion_matrix$byClass["Specificity"], ada_result2$confusion_matrix$byClass["Specificity"]
)
)
# Add AUC where available
all_results$AUC <- c(
auc(dt_roc), dt_result1$auc, dt_result2$auc,
auc(rf_roc), rf_result1$auc, rf_result2$auc,
NA, NA, NA # AUC was not computed for the AdaBoost models in this report
)
# Display results
print(all_results)
## Algorithm Accuracy F1_Score Sensitivity Specificity
## 1 Decision Tree (Baseline) 0.8830556 NA 0.0000000 1.0000000
## 2 Decision Tree (CP Tuning) 0.8822445 0.1986954 0.1248424 0.9825484
## 3 Decision Tree (Feature Selection) 0.8822445 0.1986954 0.1248424 0.9825484
## 4 Random Forest (Baseline) 0.8868161 0.2817033 0.1897856 0.9791249
## 5 Random Forest (mtry Tuning) 0.8861525 0.3210202 0.2301387 0.9730294
## 6 Random Forest (ntree Tuning) 0.8879959 0.2911806 0.1967213 0.9795424
## 7 AdaBoost (Baseline) 0.8829819 0.2886598 0.2030265 0.9730294
## 8 AdaBoost (mfinal Tuning) 0.8840142 0.2800915 0.1929382 0.9755344
## 9 AdaBoost (Class Weighting) 0.8837192 0.2762735 0.1897856 0.9756179
## AUC
## 1 0.5000000
## 2 0.6512247
## 3 0.6512247
## 4 0.7748837
## 5 0.7739928
## 6 0.7769130
## 7 NA
## 8 NA
## 9 NA
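The AdaBoost AUC cells are NA, but AUC only needs a score per test case. `predict()` on an adabag boosting object returns a `prob` component alongside `class`, so a column of it could serve as the score; which column corresponds to "yes" is an assumption that would need checking. The sketch below computes AUC from scores with the rank (Mann-Whitney) formula in base R, using toy values rather than results from this report.

```r
# AUC via the Mann-Whitney rank statistic; tied scores get average ranks
auc_from_scores <- function(scores, is_positive) {
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  if (n_pos == 0 || n_neg == 0) return(NA_real_)
  r <- rank(scores)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy scores: 3 of the 4 (positive, negative) pairs are ordered correctly
scores <- c(0.10, 0.40, 0.35, 0.80)
truth <- c("no", "no", "yes", "yes")
auc_value <- auc_from_scores(scores, truth == "yes")

print(auc_value)
```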
# Create performance comparison visualization
all_results_long <- tidyr::pivot_longer(
all_results,
cols = c("Accuracy", "F1_Score", "Sensitivity", "Specificity", "AUC"),
names_to = "Metric",
values_to = "Value"
)
# Plot metrics comparison
ggplot(all_results_long, aes(x = reorder(Algorithm, Value), y = Value, fill = Metric)) +
geom_bar(stat = "identity", position = "dodge") +
coord_flip() +
theme_minimal() +
labs(title = "Performance Metrics Comparison Across Models",
x = "Algorithm",
y = "Metric Value") +
theme(legend.position = "bottom")
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_bar()`).
#### Graph of ROC curves
# Plot ROC curves for models where available
plot(dt_roc, col = "blue", main = "ROC Curves Comparison")
plot(dt_result1$roc, add = TRUE, col = "lightblue")
plot(dt_result2$roc, add = TRUE, col = "darkblue")
plot(rf_roc, add = TRUE, col = "red")
plot(rf_result1$roc, add = TRUE, col = "pink")
plot(rf_result2$roc, add = TRUE, col = "darkred")
legend("bottomright",
legend = c("DT Baseline", "DT CP Tuned", "DT Feature Selection",
"RF Baseline", "RF mtry Tuned", "RF ntree Tuned"),
col = c("blue", "lightblue", "darkblue", "red", "pink", "darkred"),
lwd = 2)
# Find the best model based on F1 score (good for imbalanced data)
best_model_index <- which.max(all_results$F1_Score)
best_model_name <- all_results$Algorithm[best_model_index]
cat("The best performing model based on F1 Score is:", best_model_name,
"with F1 Score of", all_results$F1_Score[best_model_index], "\n")
## The best performing model based on F1 Score is: Random Forest (mtry Tuning) with F1 Score of 0.3210202
# If you want to consider multiple metrics, create a weighted average
# For example: 0.4*Accuracy + 0.4*F1_Score + 0.2*AUC
all_results$Combined_Score <- 0.4 * all_results$Accuracy +
0.4 * all_results$F1_Score +
0.2 * all_results$AUC
all_results$Combined_Score[is.na(all_results$Combined_Score)] <-
0.5 * all_results$Accuracy[is.na(all_results$Combined_Score)] +
0.5 * all_results$F1_Score[is.na(all_results$Combined_Score)]
best_overall_index <- which.max(all_results$Combined_Score)
best_overall_name <- all_results$Algorithm[best_overall_index]
cat("The best overall model based on combined metrics is:", best_overall_name,
"with Combined Score of", all_results$Combined_Score[best_overall_index], "\n")
## The best overall model based on combined metrics is: Random Forest (mtry Tuning) with Combined Score of 0.6376676
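The combined score is a weighted mean, and the fallback for rows without AUC (0.5 and 0.5) is the same weighting renormalized over the metrics that are present. The sketch below makes that explicit with one hypothetical helper, `combined_score`, and reproduces two rows using the metric values from the results table above.

```r
# Weighted mean over the metrics that are available (non-NA)
combined_score <- function(metrics, weights) {
  available <- !is.na(metrics)
  if (!any(available)) return(NA_real_)
  sum(metrics[available] * weights[available]) / sum(weights[available])
}

weights <- c(Accuracy = 0.4, F1 = 0.4, AUC = 0.2)

# Metric values copied from the results table
rf_mtry <- c(Accuracy = 0.8861525, F1 = 0.3210202, AUC = 0.7739928)
ada_base <- c(Accuracy = 0.8829819, F1 = 0.2886598, AUC = NA)

rf_score <- combined_score(rf_mtry, weights)
ada_score <- combined_score(ada_base, weights)

print(rf_score)
print(ada_score)
```

Scores computed with and without AUC are not strictly on the same footing, which is worth keeping in mind when comparing the AdaBoost rows with the others.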
# For the best model, analyze feature importance
if(grepl("Random Forest", best_overall_name)) {
# For Random Forest
if(best_overall_name == "Random Forest (Baseline)") {
importance_plot <- varImpPlot(rf_baseline, main = "Variable Importance - Random Forest Baseline")
} else if(best_overall_name == "Random Forest (mtry Tuning)") {
importance_plot <- varImpPlot(rf_result1$model, main = "Variable Importance - RF mtry Tuned")
} else if(best_overall_name == "Random Forest (ntree Tuning)") {
importance_plot <- varImpPlot(rf_result2$model, main = "Variable Importance - RF ntree Tuned")
}
} else if(grepl("Decision Tree", best_overall_name)) {
# For Decision Tree
if(best_overall_name == "Decision Tree (Baseline)") {
importance <- dt_baseline$variable.importance
} else if(best_overall_name == "Decision Tree (CP Tuning)") {
importance <- dt_result1$model$variable.importance
} else if(best_overall_name == "Decision Tree (Feature Selection)") {
importance <- dt_result2$model$variable.importance
}
# Plot importance
barplot(sort(importance, decreasing = TRUE),
main = "Variable Importance - Decision Tree",
col = "skyblue",
las = 2,
cex.names = 0.7)
}
#### Conclusion
Based on the tables and graphs above, the best performing model by F1
score is Random Forest (mtry Tuning), with an F1 score of 0.3210202. The
Decision Tree models have the lowest F1 scores: 0.1986954 for the tuned
tree, and undefined (NA) for the baseline, which predicts no positives.
Random Forest (ntree Tuning) has the highest accuracy, 0.8879959, while
both Decision Tree experiments have the lowest, 0.8822445 (their results
are identical because the feature selection run fell back to CP tuning).
Random Forest (ntree Tuning) also has the highest AUC, 0.7769130, while
the Decision Tree baseline has the lowest, 0.5000000. No AUC was
computed for the AdaBoost models.
In conclusion, the best overall model based on combined metrics is
Random Forest (mtry Tuning), with a combined score of 0.6376676.